Troubleshooting
If the following content does not address your issue, feel free to raise a GitHub Issue.
Automatic Recovery
When `recover_mode=auto` is set and the experiment configuration remains unchanged, AReaL will attempt to discover previous checkpoints and recover the experiment from them.
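For example, enabling automatic recovery from the command line might look like the sketch below; the training entry point is a placeholder, and only the `recover_mode`, `experiment_name`, and `trial_name` keys come from this guide.

```bash
# Sketch only: replace the entry point with your actual AReaL launch command.
# experiment_name and trial_name must match the run you want to recover.
python3 -m <your_training_entry_point> \
    experiment_name=my-experiment \
    trial_name=run-1 \
    recover_mode=auto
```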
Recovery Failure Causes
If automatic recovery fails, check the following possibilities:
Configuration Changes (see the sketch after this list):

- The `experiment_name` and `trial_name` in the training script differ from the previous run
- Changes in batch size (the `dataset.train_bs_n_seqs` parameter)
- Changes in group size (the `group_size` parameter)
- Changes in number of nodes (the `n_nodes` parameter)
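Concretely, a recovery re-launch should reuse exactly the same values for these keys as the previous run; the sketch below uses illustrative values and a placeholder entry point.

```bash
# All of the keys below must be identical to the previous run for
# recover_mode=auto to find and reuse its checkpoints (values are illustrative).
python3 -m <your_training_entry_point> \
    experiment_name=my-experiment \
    trial_name=run-1 \
    dataset.train_bs_n_seqs=256 \
    group_size=8 \
    n_nodes=4 \
    recover_mode=auto
```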
Missing Recovery Checkpoints: Recovery checkpoints are generated under two conditions by default:
- After completion of the second step
- When a step completes and more than 600 seconds have passed since the last recovery checkpoint (controlled by `exp_ctrl.ckpt_freq_secs=600`)
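If you need recovery checkpoints more often, the interval can be lowered; a sketch, where the 300-second value is only an example and `...` stands for the rest of your usual launch arguments:

```bash
# Write a recovery checkpoint once a step finishes and at least 300 seconds
# have passed since the last one (the default interval is 600 seconds).
python3 -m <your_training_entry_point> ... exp_ctrl.ckpt_freq_secs=300
```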
Verify Recovery Checkpoint Creation
You can confirm if a recovery checkpoint was generated by searching for the following message in the logs:
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:52:02.760 master worker INFO: Dumped recover info to file.
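A quick way to check is to grep the worker logs for that message; the log directory below is a placeholder for wherever your launcher writes its logs.

```bash
# Path is illustrative; point it at the log directory of your run.
grep -rn "Dumped recover info to file" /path/to/your/experiment/logs/
```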
Memory Issues
torch.cuda.OutOfMemoryError
The key to resolving this issue is identifying the phase where the error occurs:
During Initialization
- Check for idle processes on the GPU
- Distributed scenarios: Restart the Ray cluster
- Single-machine scenarios: Use `pkill` to terminate processes
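A minimal cleanup sketch; the process-name pattern is a placeholder and should match however you launch training.

```bash
# Show which processes are currently holding GPU memory.
nvidia-smi

# Single machine: kill leftover training processes (pattern is illustrative).
pkill -f <your_training_process_pattern>

# Distributed: restart the Ray cluster on the head node
# (worker nodes then need to rejoin it).
ray stop
ray start --head
```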
During SGLang Generation
- Decrease the `actor.sglang.mem_fraction_static` parameter
- Increase the tensor parallelism degree
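For example, the static memory fraction can be lowered at launch time; 0.7 is an illustrative value, and the right setting depends on your model and GPUs. `...` stands for the rest of your usual launch arguments.

```bash
# A lower fraction reserves less GPU memory for SGLang's static pool
# (model weights and KV cache), leaving more headroom during generation.
python3 -m <your_training_entry_point> ... actor.sglang.mem_fraction_static=0.7
```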
During `actor_inf` or `actor_train`
- Adjust the microbatch size: Set parameters like `actor_train.mb_spec.max_tokens_per_mb=20480`. This parameter limits the number of tokens per forward/backward pass and can be set as low as the maximum sequence length (including the prompt).
- Modify the parallelism strategy: Adjust `allocation_mode` by:
  - Reducing data parallelism
  - Increasing tensor or pipeline parallelism
  - Preferring pipeline parallelism over tensor parallelism
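A sketch combining both adjustments; the entry point and the `allocation_mode` value are placeholders, since the concrete layout string depends on your cluster and AReaL version.

```bash
# Cap the number of tokens packed into each forward/backward microbatch,
# and switch to a parallelism layout with less data parallelism and more
# pipeline (or tensor) parallelism. The layout string is a placeholder.
python3 -m <your_training_entry_point> \
    actor_train.mb_spec.max_tokens_per_mb=20480 \
    allocation_mode=<your_new_parallelism_layout>
```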
CUDA Error: Out of Memory
This issue may occur during data transfer. Try increasing `mem_per_xx_worker` in the CLI arguments.