
Commit 782f8e5

support distributed train (#4364)
1 parent 804c106 commit 782f8e5

3 files changed: 53 additions, 0 deletions

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
---
comments: true
---

# Distributed Training

## Introduction

Distributed training splits a training task across multiple computing nodes and then aggregates the gradients and other information produced by those partial computations to update the model. PaddlePaddle's distributed training technology originates from Baidu's business practice and has been validated in ultra-large-scale scenarios in natural language processing, computer vision, search, and recommendation. High-performance distributed training is one of PaddlePaddle's core technical strengths: on tasks such as image classification it achieves nearly linear speedup. Take ImageNet as an example: the ImageNet22k dataset contains 14 million images, and training it on a single GPU would be extremely time-consuming. PaddleX therefore provides distributed training interfaces that support both single-machine and multi-machine training. For more methods and documentation on distributed training, see the [Distributed Training Quick Start Tutorial](https://fleet-x.readthedocs.io/en/latest/paddle_fleet_rst/parameter_server/ps_quick_start.html).

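To make the "split, then aggregate" idea concrete, here is a toy sketch of one data-parallel step in plain NumPy: each node computes a gradient on its own data shard, the gradients are averaged (the role an allreduce plays), and every node applies the same update. The numbers are invented for illustration and nothing below uses PaddlePaddle itself.

```
import numpy as np

# Toy data-parallel step: each node holds a shard of the data and computes
# its own gradient; an allreduce then averages the gradients so that every
# node applies the identical update. Values here are made up.
per_node_gradients = [
    np.array([0.2, -0.1]),
    np.array([0.4, 0.1]),
    np.array([0.3, 0.0]),
]

averaged_gradient = sum(per_node_gradients) / len(per_node_gradients)

weights = np.array([1.0, 1.0])
learning_rate = 0.1
weights -= learning_rate * averaged_gradient
print(weights)  # the same result on every node
```
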
## Usage

* Taking [Image Classification Model Training](../tutorials/cv_modules/image_classification.en.md) as an example: compared with single-machine training, multi-machine training only requires adding the `Train.dist_ips` parameter, which is the comma-separated list of IP addresses of the machines participating in distributed training. A sample command is shown below.

```
python main.py -c paddlex/configs/modules/image_classification/PP-LCNet_x1_0.yaml \
    -o Global.mode=train \
    -o Global.dataset_dir=./dataset/cls_flowers_examples \
    -o Train.dist_ips="xx.xx.xx.xx,xx.xx.xx.xx"
```
**Note**

- The IP addresses of the different machines must be separated by commas and can be looked up with `ifconfig` or `ipconfig`.
- Passwordless SSH must be set up between the machines, and they must be able to ping each other directly; otherwise communication cannot be established (a connectivity-check sketch follows this list).
- The code, data, and run commands or scripts must be identical on all machines, and the training command or script must be run on every machine. The first device of the first machine in the `Train.dist_ips` list becomes trainer0, and so on.

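As referenced above, a quick way to verify that passwordless SSH is in place is to attempt a non-interactive connection to each host before launching. The snippet below is a minimal, illustrative check and not part of PaddleX; the addresses are placeholders for whatever you put in `Train.dist_ips`.

```
import subprocess

# Placeholder hosts, written the same way as in Train.dist_ips.
dist_ips = "xx.xx.xx.xx,xx.xx.xx.xx"

for host in dist_ips.split(","):
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so a non-zero exit code means passwordless SSH is not configured.
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"],
        capture_output=True,
    )
    print(host, "ok" if result.returncode == 0 else "passwordless SSH not set up")
```

Run the same check from each machine, since communication must work between all participating nodes.
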
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
---
comments: true
---

# Distributed Training

## Introduction

Distributed training splits a training task across multiple computing nodes and then aggregates the gradients and other information produced by those partial computations to update the model. PaddlePaddle's distributed training technology originates from Baidu's business practice and has been validated in ultra-large-scale scenarios in natural language processing, computer vision, search, and recommendation. High-performance distributed training is one of PaddlePaddle's core technical strengths: on tasks such as image classification it achieves nearly linear speedup. Take ImageNet as an example: the ImageNet22k dataset contains 14 million images, and training it on a single GPU would be extremely time-consuming. PaddleX therefore provides distributed training interfaces that support both single-machine and multi-machine training. For more methods and documentation on distributed training, see the [Distributed Training Quick Start Tutorial](https://fleet-x.readthedocs.io/en/latest/paddle_fleet_rst/parameter_server/ps_quick_start.html).

## Usage

* Taking [Image Classification Model Training](../tutorials/cv_modules/image_classification.md) as an example: compared with single-machine training, multi-machine training only requires adding the `Train.dist_ips` parameter, which is the comma-separated list of IP addresses of the machines participating in distributed training. A sample command is shown below.

```
python main.py -c paddlex/configs/modules/image_classification/PP-LCNet_x1_0.yaml \
    -o Global.mode=train \
    -o Global.dataset_dir=./dataset/cls_flowers_examples \
    -o Train.dist_ips="xx.xx.xx.xx,xx.xx.xx.xx"
```
**Note**

- The IP addresses of the different machines must be separated by commas and can be looked up with `ifconfig` or `ipconfig`.
- Passwordless SSH must be set up between the machines, and they must be able to ping each other directly; otherwise communication cannot be established.
- The code, data, and run commands or scripts must be identical on all machines, and the training command or script must be run on every machine. The first device of the first machine in the `Train.dist_ips` list becomes trainer0, and so on.

paddlex/modules/base/trainer.py

Lines changed: 1 addition & 0 deletions
@@ -84,6 +84,7 @@ def train(self, *args, **kwargs):
 "uniform_output_enabled", True
 ),
 "export_with_pir": export_with_pir,
+"ips": self.train_config.get("dist_ips", None)
 }
 )

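For context, the sketch below shows the conventional way a comma-separated host list like this is handed to Paddle's collective launcher through its `--ips` option. It is an illustration of the general mechanism rather than PaddleX's internal code; `train.py` and the addresses are placeholders.

```
import shlex

def build_launch_cmd(train_script, dist_ips=None):
    # Assemble a multi-node launch command (illustrative, not PaddleX internals).
    cmd = ["python", "-m", "paddle.distributed.launch"]
    if dist_ips:
        # Comma-separated list of participating hosts, as in Train.dist_ips.
        cmd.append(f"--ips={dist_ips}")
    cmd.append(train_script)
    return cmd

# The same command is expected to be run on every machine in the list.
print(shlex.join(build_launch_cmd("train.py", "xx.xx.xx.xx,xx.xx.xx.xx")))
```

When `dist_ips` is `None`, the flag is simply omitted and the launcher falls back to single-machine behavior, which matches the `None` default in the change above.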
