Troubleshooting
If the following content does not address your issue, feel free to raise a GitHub Issue.
Automatic Recovery
When `recover_mode=auto` is set and the experiment configuration remains unchanged, AReaL will attempt to discover previous checkpoints and recover the experiment from them.
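For example, enabling automatic recovery from the command line might look like the sketch below; the training entry point is a placeholder, and only the `recover_mode`, `experiment_name`, and `trial_name` keys come from this guide.

```bash
# Sketch only: replace the entry point with your actual AReaL launch command.
# experiment_name and trial_name must match the run you want to recover.
python3 -m <your_training_entry_point> \
    experiment_name=my-experiment \
    trial_name=run-1 \
    recover_mode=auto
```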
Recovery Failure Causes
If automatic recovery fails, check the following possibilities:
Configuration Changes (see the sketch after this list):

- The `experiment_name` and `trial_name` in the training script differ from the previous run
- Changes in batch size (the `dataset.train_bs_n_seqs` parameter)
- Changes in group size (the `group_size` parameter)
- Changes in number of nodes (the `n_nodes` parameter)
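Concretely, a recovery re-launch should reuse exactly the same values for these keys as the previous run; the sketch below uses illustrative values and a placeholder entry point.

```bash
# All of the keys below must be identical to the previous run for
# recover_mode=auto to find and reuse its checkpoints (values are illustrative).
python3 -m <your_training_entry_point> \
    experiment_name=my-experiment \
    trial_name=run-1 \
    dataset.train_bs_n_seqs=256 \
    group_size=8 \
    n_nodes=4 \
    recover_mode=auto
```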
Missing Recovery Checkpoints: Recovery checkpoints are generated under two conditions by default:
- After completion of the second step
- When a step completes and more than 600 seconds have passed since the last recovery checkpoint (controlled by `exp_ctrl.ckpt_freq_secs=600`)
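If you need recovery checkpoints more often, the interval can be lowered; a sketch, where the 300-second value is only an example and `...` stands for the rest of your usual launch arguments:

```bash
# Write a recovery checkpoint once a step finishes and at least 300 seconds
# have passed since the last one (the default interval is 600 seconds).
python3 -m <your_training_entry_point> ... exp_ctrl.ckpt_freq_secs=300
```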
Verify Recovery Checkpoint Creation
You can confirm if a recovery checkpoint was generated by searching for the following message in the logs:
(master_worker/0 pid=96390, ip=xxx.xxx.xxx.xxx) 20250222-11:52:02.760 master worker INFO: Dumped recover info to file.
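A quick way to check is to grep the worker logs for that message; the log directory below is a placeholder for wherever your launcher writes its logs.

```bash
# Path is illustrative; point it at the log directory of your run.
grep -rn "Dumped recover info to file" /path/to/your/experiment/logs/
```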
Memory Issues
torch.cuda.OutOfMemoryError
The key to resolving this issue is identifying the phase where the error occurs:
During Initialization
- Check for idle processes on the GPU
- Distributed scenarios: Restart the Ray cluster
- Single-machine scenarios: Use `pkill` to terminate processes
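A minimal cleanup sketch; the process-name pattern is a placeholder and should match however you launch training.

```bash
# Show which processes are currently holding GPU memory.
nvidia-smi

# Single machine: kill leftover training processes (pattern is illustrative).
pkill -f <your_training_process_pattern>

# Distributed: restart the Ray cluster on the head node
# (worker nodes then need to rejoin it).
ray stop
ray start --head
```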
During SGLang Generation
- Decrease the `actor.sglang.mem_fraction_static` parameter
- Increase the tensor parallelism degree
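For example, the static memory fraction can be lowered at launch time; 0.7 is an illustrative value, and the right setting depends on your model and GPUs. `...` stands for the rest of your usual launch arguments.

```bash
# A lower fraction reserves less GPU memory for SGLang's static pool
# (model weights and KV cache), leaving more headroom during generation.
python3 -m <your_training_entry_point> ... actor.sglang.mem_fraction_static=0.7
```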
During `actor_inf` or `actor_train`
- Adjust the microbatch size: Set parameters like `actor_train.mb_spec.max_tokens_per_mb=20480`. This parameter limits the number of tokens per forward/backward pass and can be set as low as the maximum sequence length (including the prompt).
- Modify the parallelism strategy: Adjust `allocation_mode` by:
  - Reducing data parallelism
  - Increasing tensor or pipeline parallelism
  - Preferring pipeline parallelism over tensor parallelism
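A sketch combining both adjustments; the entry point and the `allocation_mode` value are placeholders, since the concrete layout string depends on your cluster and AReaL version.

```bash
# Cap the number of tokens packed into each forward/backward microbatch,
# and switch to a parallelism layout with less data parallelism and more
# pipeline (or tensor) parallelism. The layout string is a placeholder.
python3 -m <your_training_entry_point> \
    actor_train.mb_spec.max_tokens_per_mb=20480 \
    allocation_mode=<your_new_parallelism_layout>
```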
CUDA Error: Out of Memory
This issue may occur during data transfer. Try increasing `mem_per_xx_worker` in the CLI arguments.