Distributed Training

Communication Methods

Intra-Machine Communication

  1. NVLink: NVIDIA's interconnect for direct GPU-to-GPU communication. Each NVLink 3.0 link provides roughly 25 GB/s per direction (50 GB/s bidirectional); an A100 with 12 links reaches about 600 GB/s of aggregate bandwidth.
  2. PCIe: A general-purpose high-speed bus standard used to connect GPUs, CPUs, and other peripherals. Without NVLink, GPU-to-GPU traffic traverses the PCIe root complex (and often the CPU and system memory), so bandwidth and latency are worse than NVLink. PCIe 4.0 x16 offers roughly 31.5 GB/s of unidirectional bandwidth. A PCIe switch is used to expand PCIe lanes, allowing multiple devices to share the upstream bandwidth.
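
To check whether two GPUs in the same machine can reach each other directly (peer-to-peer over NVLink or PCIe) rather than bouncing through host memory, PyTorch exposes a peer-access query. A minimal sketch, assuming a host with at least two CUDA devices:

    import torch

    # Minimal sketch: query direct GPU-to-GPU (peer-to-peer) access on one machine.
    # Peer access may run over NVLink or PCIe, depending on the topology.
    count = torch.cuda.device_count()
    for src in range(count):
        for dst in range(count):
            if src != dst:
                ok = torch.cuda.can_device_access_peer(src, dst)
                print(f"GPU {src} -> GPU {dst}: peer access {'available' if ok else 'not available'}")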

Inter-Machine Communication

TCP/IP: Inter-machine traffic typically runs over Ethernet using TCP/IP. Its rate is bounded by the network hardware (NICs and switches) and protocol overhead, so it is noticeably slower than intra-machine interconnects such as NVLink or PCIe.

Communication Backends

NCCL

NCCL is a high-performance communication library developed by NVIDIA, optimized for communication between multiple GPUs and multiple nodes.
It utilizes high-speed interconnect technologies such as NVLink and PCIe, providing extremely high bandwidth and low latency.
nccl-tests is NVIDIA's benchmark suite for verifying NCCL's performance and correctness across different hardware paths (NVLink, PCIe, TCP/IP networks).
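
A minimal sketch of how NCCL is typically exercised from PyTorch: initialize a torch.distributed process group with the nccl backend and run an all_reduce on GPU tensors. It assumes a launch via torchrun (which sets RANK, WORLD_SIZE, and LOCAL_RANK) and is illustrative, not the nccl-tests tool itself:

    import os
    import torch
    import torch.distributed as dist

    # Minimal sketch: NCCL-backed all_reduce across GPUs.
    # Assumes `torchrun --nproc_per_node=<num_gpus> this_script.py`, which sets
    # RANK / WORLD_SIZE / LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Each rank contributes a tensor filled with its rank id; after all_reduce(SUM),
    # every rank holds the sum 0 + 1 + ... + (world_size - 1).
    x = torch.full((4,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")

    dist.destroy_process_group()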

Gloo

Gloo is a collective communication library used by PyTorch mainly for CPU tensors; it supports both single-node and multi-node training and serves as a fallback when NCCL is unavailable.
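
The same torch.distributed pattern runs on CPU with the Gloo backend; a minimal sketch, again assuming a torchrun launch:

    import torch
    import torch.distributed as dist

    # Minimal sketch: Gloo-backed all_reduce on CPU tensors (no GPUs required).
    # Assumes `torchrun --nproc_per_node=<num_procs> this_script.py`.
    dist.init_process_group(backend="gloo")
    x = torch.ones(4) * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: {x.tolist()}")
    dist.destroy_process_group()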

RLlib

RLlib is a library in the Ray ecosystem focused on distributed reinforcement learning training, scaling rollout collection and policy optimization across many workers.
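
A minimal RLlib sketch, assuming a recent Ray 2.x installation; the exact config method names (for example rollouts vs. env_runners) have changed between releases, so adjust to the installed version:

    import ray
    from ray.rllib.algorithms.ppo import PPOConfig

    # Minimal sketch: distributed PPO training with RLlib on CartPole.
    # Config method names vary across Ray releases; this follows the Ray 2.x style.
    ray.init()

    config = (
        PPOConfig()
        .environment("CartPole-v1")
        .rollouts(num_rollout_workers=2)  # parallel workers collecting experience
    )
    algo = config.build()

    for _ in range(3):
        result = algo.train()
        print(result["training_iteration"], result.get("episode_reward_mean"))

    algo.stop()
    ray.shutdown()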

Fine-Tuning

Speed Metrics

tgs (tokens/gpu/second)

Meaning: Measures how many tokens a single GPU processes per second. A token is a discrete text unit (a word, sub-word, punctuation mark, number, or other language element) that serves as the basic unit for training and generating text.

Role: One of the core performance indicators, reflecting the GPU's speed at processing text data. Token throughput directly affects training and inference efficiency in natural language processing tasks; in large language model training, a higher tgs value means the GPUs consume input text faster, which accelerates the whole training process.
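
A minimal sketch of how tgs can be computed from quantities logged during training; the variable values below are illustrative assumptions, not measurements:

    # Minimal sketch: computing tgs (tokens / GPU / second) from training stats.
    # All values below are illustrative assumptions.
    num_gpus = 8
    global_batch_size = 64        # sequences per optimizer step, across all GPUs
    sequence_length = 4096        # tokens per sequence
    step_time_seconds = 10.0      # measured wall-clock time of one training step

    tokens_per_step = global_batch_size * sequence_length
    tgs = tokens_per_step / (num_gpus * step_time_seconds)
    print(f"tgs = {tgs:.1f} tokens/GPU/s")   # ~3276.8 with these numbers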

TFLOPS

Meaning: Short for teraFLOPS, the number of floating-point operations per second in trillions (10^12). It measures the compute performance of a processor or accelerator, i.e., how many floating-point operations it can execute each second.

Role: Model training is dominated by floating-point work such as matrix multiplications and vector operations. A higher TFLOPS value indicates stronger computational capability, so the device can finish the heavy computations in training faster, which is crucial for accelerating model training and improving efficiency.
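
A minimal sketch of estimating achieved TFLOPS per GPU, using the common approximation that one training step of a decoder-only transformer costs roughly 6 x parameters x tokens floating-point operations (forward plus backward); all values are illustrative assumptions:

    # Minimal sketch: estimating achieved TFLOPS per GPU during training.
    # Uses the common ~6 * params * tokens approximation for a decoder-only
    # transformer (forward + backward); all values are illustrative assumptions.
    params = 7e9                   # model parameters
    tokens_per_step = 64 * 4096    # global batch size * sequence length
    step_time_seconds = 10.0
    num_gpus = 8

    flops_per_step = 6 * params * tokens_per_step
    achieved_tflops = flops_per_step / (step_time_seconds * num_gpus * 1e12)
    print(f"achieved ~{achieved_tflops:.0f} TFLOPS per GPU")   # ~138 with these numbers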

Performance Metrics

Loss Value

Meaning: In SFT (Supervised Fine-Tuning), the loss value is an important indicator for evaluating model performance and guiding optimization. It measures the difference between the model's predictions and the true labels, and thereby reflects how well the model fits the training data and how well it generalizes.

Evaluation Method: Compute the variance of the loss over the first 100 training steps relative to a reference (standard) loss value. Variance reflects the dispersion of the data; here it is used to judge how much the loss fluctuates early in training, and therefore whether training on the hardware under test behaves as expected. A small variance indicates that early-stage losses are stable and training is proceeding well; a large variance suggests instability that calls for further adjustment and optimization.
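
A minimal sketch of the early-stage stability check described above: collect the per-step loss for the first 100 steps, measure its spread around a reference loss, and compare against a tolerance. The reference value and tolerance are illustrative assumptions, not standard constants:

    import statistics

    # Minimal sketch: early-training loss stability check over the first 100 steps.
    # `step_losses` would come from the training loop; reference_loss and tolerance
    # are illustrative assumptions.
    def early_loss_check(step_losses, reference_loss, tolerance=0.05):
        early = step_losses[:100]
        # Mean squared deviation of the early losses from the reference value.
        variance = statistics.mean((loss - reference_loss) ** 2 for loss in early)
        print(f"variance vs. reference over first {len(early)} steps: {variance:.4f}")
        return variance <= tolerance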

Communication Testing

Testing Tools and Operation Types

Testing Tool: DeepSpeed's communication benchmark code. DeepSpeed is a library for optimizing deep learning training, and its communication benchmarks measure the performance of the collective operations below on the target hardware.

Collective Communication Operation Types (a minimal sketch exercising them follows this list):

all_reduce: This is a commonly used collective communication operation that reduces data (e.g., summing, averaging) from all participating processes and then returns the reduced result to all processes. In distributed deep learning training, it is often used to aggregate gradients from different GPUs for unified parameter updates.

all_gather: This operation collects data from all processes and distributes the complete data set to each process. For example, in multi-GPU training, each GPU has a portion of the data, and through the all_gather operation, each GPU can obtain all the data.

all_to_all: Allows each process to send distinct data to every other process while receiving data from all of them. It is used in complex distributed computing scenarios that require many-to-many data exchange, such as redistributing data shards across processes.

broadcast: The broadcast operation, where one process (usually called the root process) sends data to all other processes. In deep learning training, it is often used for initializing parameter distribution and other scenarios.

pt2pt: Point-to-point communication, referring to data transmission between two specific processes. Unlike the previous collective communication operations that involve collaboration among multiple processes, it focuses on direct communication between two processes.
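
The sketch below exercises these operations through torch.distributed with the NCCL backend (the backend DeepSpeed typically uses on NVIDIA GPUs); it assumes a torchrun launch on a multi-GPU machine and is illustrative, not the DeepSpeed benchmark itself:

    import os
    import torch
    import torch.distributed as dist

    # Minimal sketch of the collective operations listed above, via torch.distributed
    # with the NCCL backend. Assumes `torchrun --nproc_per_node=<N> this_script.py`
    # on a machine with N GPUs.
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dev = torch.device("cuda")

    x = torch.full((2,), float(rank), device=dev)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)          # every rank ends up with the sum

    shards = [torch.empty(2, device=dev) for _ in range(world)]
    dist.all_gather(shards, torch.full((2,), float(rank), device=dev))  # each rank collects all shards

    outs = [torch.empty(1, device=dev) for _ in range(world)]
    ins = [torch.tensor([float(rank * world + i)], device=dev) for i in range(world)]
    dist.all_to_all(outs, ins)                        # many-to-many exchange

    b = torch.ones(2, device=dev) if rank == 0 else torch.zeros(2, device=dev)
    dist.broadcast(b, src=0)                          # rank 0's tensor is copied to all ranks

    if world >= 2:                                    # pt2pt: direct send/recv between two ranks
        if rank == 0:
            dist.send(torch.tensor([42.0], device=dev), dst=1)
        elif rank == 1:
            msg = torch.empty(1, device=dev)
            dist.recv(msg, src=0)

    dist.destroy_process_group()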

Main Focused Operations and Metrics

Main Focused Operation: all_reduce is listed as the main focused operation. This is because in distributed training, the all_reduce operation is crucial for key steps such as gradient aggregation, and its performance directly affects training efficiency and speed.

Size (Bytes): Indicates the size of the communication data in bytes. This metric helps understand the amount of data transmitted in one communication operation, and the data size affects communication time and bandwidth utilization.

Description: Describes the communication operation, e.g., its specific parameter settings or data types, characterizing the communication in more detail.

Duration: Refers to the time spent on the communication operation. By measuring this time, the execution efficiency of different communication operations can be evaluated; shorter time indicates faster communication operations.

Throughput (Gbps): Throughput, in Gbps (gigabits per second). It reflects the amount of data successfully transmitted per unit time and is one of the important indicators for measuring communication performance. Higher throughput means more data can be transmitted in the same amount of time.

BusBW (Gbps): Bus bandwidth, also in Gbps. Following the nccl-tests convention, it is the algorithmic throughput scaled by a correction factor that reflects how much data the collective actually moves over the interconnect, which makes results comparable across operations and process counts. Comparing the reported throughput and bus bandwidth against the hardware's peak bandwidth shows whether the communication operation fully utilizes the link or hits a bandwidth bottleneck; the calculation is sketched below.
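
A minimal sketch of how these quantities relate, using the nccl-tests bus-bandwidth convention for all_reduce (busbw = algbw * 2(n-1)/n, where n is the number of ranks); the message size and duration are illustrative assumptions:

    # Minimal sketch: algorithmic throughput vs. bus bandwidth for an all_reduce,
    # following the nccl-tests convention; size and duration are illustrative.
    size_bytes = 1 * 1024**3      # 1 GiB message per rank
    duration_s = 0.25             # measured time of the all_reduce
    n_ranks = 8

    alg_bw_gbps = size_bytes * 8 / duration_s / 1e9           # Throughput (Gbps)
    bus_bw_gbps = alg_bw_gbps * 2 * (n_ranks - 1) / n_ranks   # BusBW (Gbps) for all_reduce
    print(f"throughput ~ {alg_bw_gbps:.1f} Gbps, bus bandwidth ~ {bus_bw_gbps:.1f} Gbps")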