Abstract Summary
• The research introduces Dion, a communication-efficient orthonormalizing optimizer designed for large-scale distributed LLM training systems.
• Dion leverages low-rank approximation and decoupled momentum buffers to apply orthonormal matrix updates efficiently without full gradient synchronization, enabling compatibility with DDP, FSDP, and TP parallelism, with benefits that grow with model size and batch size.
Abstract
Recent work has shown that orthonormal matrix updates speed up neural network optimization, improve training stability, and offer better hyperparameter transfer across model sizes. Applying these updates efficiently when model weights and optimizer states are sharded across a large-scale distributed LLM training system remains a major challenge. We introduce Dion (DIstributed OrthoNormalization), a scalable and communication-efficient orthonormalizing optimizer. Dion leverages low-rank approximation and decoupled momentum buffers, eliminating the need for full gradient synchronization while producing numerically equivalent results. It is compatible with simultaneous DDP, FSDP, and TP parallelism, and it computes an orthonormalized update without unsharding a full parameter matrix on any single device. We evaluate Dion on language models from 120M to 3B parameters and find that its benefits improve with increasing model size and batch size.
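To make the core idea concrete, the sketch below shows a rank-r orthonormalized update computed from a momentum buffer using one step of power iteration followed by QR-based orthonormalization of the low-rank factors. This is a minimal, single-device illustration of the general technique described in the abstract, not Dion's actual algorithm or its sharded, communication-efficient implementation; the function name `rank_r_orthonormal_step`, the warm-start factor `Q`, and the omission of error feedback and update scaling are assumptions made for illustration only.

```python
import torch

def rank_r_orthonormal_step(M: torch.Tensor, Q: torch.Tensor):
    """One power-iteration step producing an orthonormal rank-r update.

    M: (m, n) momentum buffer for a weight matrix.
    Q: (n, r) persistent right factor carried across steps (warm start).
    Returns the orthonormalized update direction (m, n) and a refreshed Q.
    """
    # Left factor from one pass of power iteration, column-orthonormalized.
    P, _ = torch.linalg.qr(M @ Q)      # (m, r), orthonormal columns
    # Right factor aligned with the new left factor, also orthonormalized.
    R = M.T @ P                         # (n, r)
    Q_new, _ = torch.linalg.qr(R)       # refreshed warm-start factor
    # Rank-r update built from orthonormal factors; error feedback and
    # scaling used by a real optimizer are intentionally omitted here.
    update = P @ Q_new.T                # (m, n)
    return update, Q_new

# Toy usage on a single device (shapes and learning rate are arbitrary).
m, n, r = 256, 128, 16
W = torch.randn(m, n)                   # weight matrix
M = torch.randn(m, n)                   # stand-in momentum buffer
Q = torch.randn(n, r)                   # initial warm-start factor
update, Q = rank_r_orthonormal_step(M, Q)
W = W - 0.01 * update
```

Because the update is assembled from rank-r factors, each step only needs to exchange the small `(m, r)` and `(n, r)` matrices rather than a full gradient, which is the intuition behind avoiding full gradient synchronization in a sharded setting.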