# Pipeline Parallelism Emulation
This project provides tools for emulating and visualizing pipeline parallelism strategies used in large language model training.
## Overview
Pipeline parallelism is a technique used to train large models by partitioning the model across multiple devices and processing data in a pipelined fashion. This project allows you to:
- Simulate different pipeline parallelism strategies (1F1B, Interleaved, Zero-Bubble, etc.)
- Visualize the execution schedule on multiple devices
- Compare different strategies for efficiency
## Features
- **Supported Pipeline Strategies**:
- 1F1B (One-Forward-One-Backward)
- Interleaved 1F1B
- Zero-Bubble 1F1B (ZB-1P)
- 1F1B with computation-communication overlap
- Interleaved 1F1B with computation-communication overlap
- DualPipe (Bidirectional pipeline parallelism with full forward-backward overlap)
- **Visualization**:
- Interactive visualization dashboard using Plotly/Dash (see the sketch after this list)
- **Configuration**:
- Configurable simulation parameters through Hydra
- Customizable stage latency and communication costs
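The schedule charts are essentially Gantt-style timelines: one row per device, one bar per forward or backward block. Below is a minimal, self-contained Plotly sketch of that idea; the toy two-stage schedule is made up for illustration, and the project's `visualizer.py` may build its figures differently:
```python
import plotly.graph_objects as go

# Hypothetical 2-stage, 1-microbatch schedule: (device, label, start, duration).
blocks = [
    (0, "F0", 0.0, 1.0), (1, "F0", 1.0, 1.0),
    (1, "B0", 2.0, 2.0), (0, "B0", 4.0, 2.0),
]

fig = go.Figure()
for device, label, start, duration in blocks:
    # A horizontal bar with a numeric `base` acts as one Gantt block.
    fig.add_trace(go.Bar(x=[duration], base=[start], y=[f"Device {device}"],
                         orientation="h", name=label, text=label))
fig.update_layout(barmode="overlay", xaxis_title="Time", showlegend=False)
fig.show()
```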
## Installation
This project uses [uv](https://github.com/astral-sh/uv) for dependency management.
Set up `uv` if it is not already installed:
```bash
# On macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
```
## Usage
### Running the 1F1B strategy
```bash
uv run python main.py strategy=1f1b num_devices=4 num_stages=4 num_batches=8
```
![1f1b](assets/1f1b.png)
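For intuition about the schedule above: with `p` pipeline stages and `m` microbatches, the classic 1F1B analysis gives a bubble fraction of `(p - 1) / (m + p - 1)`. The sketch below is a back-of-the-envelope estimate only; it assumes uniform stage times and ignores communication, unlike the project's execution model:
```python
def one_f_one_b_estimate(p: int, m: int, t_f: float, t_b: float):
    """Rough 1F1B timing: p stages, m microbatches,
    t_f/t_b = per-microbatch forward/backward time of one stage."""
    # Pipeline fill/drain adds (p - 1) forward+backward slots of idle
    # time on top of the m slots each device actually computes.
    total = (m + p - 1) * (t_f + t_b)
    busy = m * (t_f + t_b)
    bubble = (total - busy) / total  # == (p - 1) / (m + p - 1)
    return total, bubble

# Same shape as the example run: 4 stages, 8 microbatches.
total, bubble = one_f_one_b_estimate(p=4, m=8, t_f=1.0, t_b=2.0)
print(f"total = {total}, bubble = {bubble:.1%}")  # total = 33.0, bubble = 27.3%
```
Raising `num_batches` shrinks the bubble; reducing it by other means is what the strategies below are designed for.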
### Running the interleaved 1F1B strategy
```bash
uv run python main.py strategy=interleave num_devices=4 num_stages=8 num_batches=8
```
![interleave](assets/interleave_1f1b.png)
### Running the ZB-1P strategy
```bash
uv run python main.py strategy=zb1p num_devices=4 num_stages=4 num_batches=8
```
![zb1p](assets/zb1p.png)
### Running the DualPipe strategy
```bash
uv run python main.py strategy=dualpipe num_devices=8 num_stages=8 num_batches=20
```
![dualpipe](assets/dualpipe.png)
### Running the 1F1B-batch-overlap strategy
```bash
uv run python main.py strategy=1f1b_overlap num_devices=4 num_stages=4 num_batches=8
```
![1f1b_overlap](assets/1f1b_overlap.png)
### Running the 1F1B-interleave-overlap strategy
```bash
uv run python main.py strategy=1f1b_interleave_overlap num_devices=4 num_stages=8 num_batches=8
```
![1f1b_interleave_overlap](assets/1f1b_interleave_overlap.png)
## Configuration
The default configuration is in `conf/config.yaml`. You can override any parameter on the command line or create configuration groups for different scenarios.
### Overriding Specific Parameters
You can override specific parameters at runtime:
```bash
uv run python main.py op_times.forward=0.5 op_times.backward=1.0 num_batches=6
```
Using DualPipe as an example, you can set separate times for `forward`, `backward`, `backward_D`, `backward_W`, and `overlapped_forward_backward` (`backward_D` and `backward_W` are the input-gradient and weight-gradient halves of the backward pass, as in the zero-bubble schedules):
```bash
uv run python main.py strategy=dualpipe num_devices=8 num_stages=8 num_batches=32 op_times.forward=1.0 op_times.backward=2.0 op_times.backward_D=1.0 op_times.backward_W=1.0 op_times.overlapped_forward_backward=2.5
```
### Using Different Configuration Files
You can use different configuration files with Hydra in several ways:
#### Recommended Approach
1. Create multiple configuration files in the `conf` directory for different use cases:
```
conf/
β”œβ”€β”€ config.yaml # Default configuration
└── model_A.yaml # Create your own config with stage-specific latency for performance projection
```
2. Run with your desired configuration using the `--config-name` flag:
```bash
uv run python main.py --config-name=model_A
```
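If you want to drive the emulator from Python (e.g., a notebook) instead of the CLI, Hydra's compose API can load the same configuration files. A minimal sketch, assuming the `conf/model_A.yaml` layout above; the config keys are the ones used in this README:
```python
from hydra import compose, initialize

# Equivalent to: uv run python main.py --config-name=model_A num_batches=6
with initialize(version_base=None, config_path="conf"):
    cfg = compose(config_name="model_A", overrides=["num_batches=6"])

print(cfg.num_batches)  # config values resolve just like CLI overrides
```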
## Project Structure
```
PP-Emulation/
β”œβ”€β”€ conf/ # Hydra configuration files
β”‚ └── config.yaml # Default configuration
β”œβ”€β”€ src/ # Source code
β”‚ β”œβ”€β”€ __init__.py # Package initialization
β”‚ β”œβ”€β”€ execution_model.py # Schedule execution models
β”‚ β”œβ”€β”€ strategies.py # Pipeline parallelism strategies
β”‚ └── visualizer.py # Visualization utilities
β”œβ”€β”€ main.py # Main entry point
β”œβ”€β”€ pyproject.toml # Project metadata and dependencies
└── README.md # This file
```
## References
1. _PipeDream: Fast and Efficient Pipeline Parallel DNN Training_. [arXiv](https://arxiv.org/abs/1806.03377)
2. _Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM_. [arXiv](https://arxiv.org/abs/2104.04473)
3. _Zero Bubble Pipeline Parallelism_. [arXiv](https://arxiv.org/abs/2401.10241)
4. _Communication-Computation Overlap in MoE Training with 1F1B Pipeline Parallelism_. [blog](https://zhuanlan.zhihu.com/p/28463368206)
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.