
Train Generative AI Models More Efficiently with New NVIDIA Megatron-Core Functionalities

First introduced in 2019, NVIDIA Megatron-LM sparked a wave of innovation in the AI community, enabling researchers and developers to use the underpinnings of this open-source library to further large language model (LLM) advancements. Today, many of the most popular LLM developer frameworks have been inspired by and built using the Megatron-LM library, spurring a wave of foundation models and AI startups. Some of the most popular LLM frameworks built on top of Megatron-LM include Colossal-AI, Hugging Face Accelerate, and NVIDIA NeMo.

To facilitate migration and give researchers and model developers access to the latest research in distributed training, NVIDIA recently revamped Megatron-LM. The result is NVIDIA Megatron-Core, an open-source PyTorch-based library with a collection of GPU-optimized techniques, cutting-edge system-level innovations, and modular APIs for training models at large scale.

Megatron-Core continues to advance large-scale distributed training. This post highlights some of the recent advancements, including the new Large Language and Vision Assistant (LLaVA) pipeline for multimodal training.

NVIDIA Megatron-Core

Megatron-Core contains GPU-optimized techniques with cutting-edge system-level innovations. It abstracts these techniques into composable and modular APIs, providing full flexibility for framework developers and researchers to train custom transformers at scale on NVIDIA accelerated computing infrastructure. 

The Megatron-Core library offers the core building blocks for transformer models, such as attention mechanisms, transformer blocks and layers, normalization layers, and embedding techniques. Additional functionality, including activation recomputation and distributed checkpointing, is also natively built into this library. 
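As a rough illustration of these modular APIs, the sketch below assembles a small GPT model from Megatron-Core building blocks, loosely following the pattern in the library's quickstart. The module paths and argument names come from recent Megatron-Core releases and may differ in your version, so treat it as a sketch rather than a definitive recipe.

```python
# Minimal sketch: building a small GPT model from Megatron-Core building blocks.
# Module paths and argument names follow recent releases and may differ in your
# version. Assumes a single GPU (TP=1, PP=1).
import os
import torch

from megatron.core import parallel_state
from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec

# Initialize torch.distributed and Megatron-Core's model-parallel state.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
torch.distributed.init_process_group(backend="nccl", world_size=1, rank=0)
parallel_state.initialize_model_parallel(tensor_model_parallel_size=1,
                                         pipeline_model_parallel_size=1)
model_parallel_cuda_manual_seed(123)

# Core building blocks: a transformer config plus a composable layer spec.
config = TransformerConfig(num_layers=2,
                           hidden_size=256,
                           num_attention_heads=8,
                           use_cpu_initialization=True,
                           pipeline_dtype=torch.float32)
model = GPTModel(config=config,
                 transformer_layer_spec=get_gpt_layer_local_spec(),
                 vocab_size=32000,
                 max_sequence_length=1024)
```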

Popular LLM architectures such as GPT, BERT, T5, and RETRO can be efficiently built at large compute scales using Megatron-Core. Furthermore, Megatron-Core is compatible with all NVIDIA Tensor Core GPUs and can take advantage of the FP8 data format supported by the NVIDIA Hopper architecture to further boost compute throughput and reduce memory footprint. Megatron-Core has enabled customers like Reka AI and Codeium to train models at large scale.

“Megatron-Core’s modular, composable design seamlessly integrates into our multimodal LLM architecture,” said Deyu Fu, a technical staff member at Reka AI. “With optimized GPU kernels and parallelism techniques, it helps us handle very large models and extensive contexts with ease, all while enabling dense and sparse training to scale efficiently at cluster levels.”

“By using Megatron-Core, we’re able to stay on the frontier of techniques for training large language models by simply turning on a flag in the library,” said Devin Chotzen-Hartzell, a machine learning engineer at Codeium. “This enables us to focus on our differentiators in data and alignment.”

Multimodal training is now supported in Megatron-Core  

With the introduction of visual instruction tuning, large multimodal models have garnered widespread interest from both researchers and industries. These models leverage various types of data to generate comprehensive and context-aware responses, using multiple sensory inputs to understand and interact with their environment. This advancement brings generative AI models closer to how humans process the world. 

We’re excited to announce that Megatron-Core v0.7 now supports multimodality. For a complete multimodal reference pipeline with LLaVA, visit NVIDIA/Megatron-LM on GitHub. Model developers can blend multimodal datasets with determinism and reproducibility using the open-source multimodal data loader that ships with Megatron, and determinism is preserved across checkpoint saves and loads.

The LLaVA pipeline example walks you through how to: 

  • Prepare pretraining and supervised fine-tuning (SFT) datasets in the webdataset-based format that Megatron expects (see the sketch after this list).
  • Leverage Megatron-Core parallelism and memory-saving techniques to train a LLaVA-architecture model initialized from Mistral and CLIP.
  • Evaluate on tasks such as COCO captioning and VQAv2.
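As a loose illustration of the first step, the sketch below packs image-text SFT samples into webdataset tar shards using the open-source webdataset library. The sample keys ("jpg", "json") and the conversation layout are illustrative assumptions; consult the multimodal example in NVIDIA/Megatron-LM for the exact format its data loader expects.

```python
# Hedged sketch: packing image-text SFT samples into webdataset tar shards.
# The "jpg"/"json" sample keys and conversation layout below are illustrative
# assumptions; see the multimodal example in NVIDIA/Megatron-LM for the exact
# format its data loader expects.
import json
import webdataset as wds

samples = [
    {"id": "000001",
     "image_path": "images/000001.jpg",
     "conversations": [{"from": "human", "value": "<image>\nWhat is shown here?"},
                       {"from": "gpt", "value": "A dog running on a beach."}]},
]

with wds.TarWriter("sft-shard-000000.tar") as sink:
    for sample in samples:
        with open(sample["image_path"], "rb") as f:
            image_bytes = f.read()
        sink.write({
            "__key__": sample["id"],   # unique key shared by all files of a sample
            "jpg": image_bytes,        # raw image bytes
            "json": json.dumps(sample["conversations"]).encode("utf-8"),
        })
```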

The Megatron-Core v0.7 release focuses on functional aspects of the LLaVA pipeline, with a Massive Multidiscipline Multimodal Understanding (MMMU) score of 38, which is in the expected range for a 7B-parameter LLM-based LLaVA architecture.

Additionally, with the Megatron-Core (Mcore) spec system, researchers can easily customize submodules in the PyTorch model definition. Within the next few releases, Megatron-Core will enable the use of heterogeneous parallelism strategies for different models. This approach is particularly beneficial because vision models, which are often smaller, typically require less complex sharding techniques compared to large language models in multimodal training.

All Megatron-Core multimodal training capabilities, including the multimodal data loader, will soon be integrated into NVIDIA NeMo, enhancing the current multimodal features in NeMo for models like NeVa.

Training throughput optimization for mixture of experts 

In the rapidly evolving landscape of generative AI, mixture of experts (MoE) models have become an attractive option, as they can be pretrained to achieve better accuracy without increasing the number of floating-point operations. In an MoE model, the dense feed-forward network (FFN) layer is replaced with an MoE layer in which each token is routed to a few experts chosen by a router.
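To make the routing idea concrete, here is a minimal top-k MoE layer in plain PyTorch. It is a conceptual sketch, not Megatron-Core's optimized GroupedGEMM-based implementation, and the layer sizes are arbitrary.

```python
# Conceptual sketch of an MoE layer with a top-k router (plain PyTorch, not
# Megatron-Core's optimized implementation). Each token is routed to k experts
# and the expert outputs are combined with the router probabilities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, hidden_size, ffn_size, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.GELU(),
                          nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: [num_tokens, hidden_size]
        logits = self.router(x)                 # [num_tokens, num_experts]
        probs, indices = torch.topk(F.softmax(logits, dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                        # no tokens routed to this expert
            out[token_idx] += probs[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out
```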

Megatron-Core v0.7 expands MoE functionality and adds various training speed and memory optimizations, making Megatron-Core the most comprehensive solution for training MoEs at large scale. Specifically, Megatron-Core now supports MoE training with token dropping as used in GShard, and has training speed optimizations such as an enhanced GroupedGEMM with multi-CUDA stream computation and gradient accumulation fusion.

Table 1 shows that Megatron-Core achieves over 400 TFLOP/s per-GPU throughput when training in BF16 precision with an all-to-all dispatcher and a sequence length of 4096. Each token is routed to two experts (--moe-router-topk=2). We continue to optimize our FP8 recipes for MoE and will make them available in upcoming Megatron-Core releases.

Megatron-Core also supports expert parallelism for MoEs, which can be combined with the other parallelism techniques already supported by Megatron-Core, such as tensor, data, sequence, and pipeline parallelism. For more details, see the User Guide.
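As a rough sketch of how these dimensions are declared together, the snippet below initializes Megatron-Core's model-parallel state with tensor, pipeline, and expert parallelism for the Mixtral configuration in Table 1. The keyword names follow recent Megatron-Core releases and may differ in your version, and it assumes the script is launched with torchrun across all 128 ranks.

```python
# Hedged sketch: declaring combined parallelism dimensions in Megatron-Core for
# the Mixtral 8x7B configuration in Table 1 (TP=1, PP=4, EP=8 on 128 GPUs).
# Keyword names follow recent releases and may differ in your version; assumes
# the script is launched with torchrun across all 128 ranks.
import torch
from megatron.core import parallel_state

torch.distributed.init_process_group(backend="nccl")

parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=4,
    expert_model_parallel_size=8,
)

# The remaining data-parallel dimension is inferred from the world size:
# 128 / (TP=1 x PP=4) = 32.
print("data-parallel size:", parallel_state.get_data_parallel_world_size())
```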

| Model | Precision | # of GPUs | MBS | GBS | TP | EP | PP | Gradient accumulation | Per-GPU throughput (TFLOP/s/GPU) |
|---|---|---|---|---|---|---|---|---|---|
| Mistral 7B (dense model baseline) | BF16 | 128 | 4 | 256 | 2 | N/A | 1 | 1 | 492 |
| Mixtral 8x7B | BF16 | 128 | 1 | 256 | 1 | 8 | 4 | 8 | 402 |

Table 1. Per-GPU throughput for Mixtral 8x7B on NVIDIA H100 GPUs with the dropless-token implementation in Megatron-Core v0.7
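As a quick sanity check on how the Table 1 columns fit together, the snippet below verifies that the global batch size equals micro-batch size × data-parallel size × gradient accumulation, with the data-parallel size inferred from the GPU count and the TP and PP sizes. In these runs expert parallelism shards the experts within the data-parallel group, so it does not enter this calculation.

```python
# Back-of-envelope check of the Table 1 configurations: the global batch size
# (GBS) must equal micro-batch size (MBS) x data-parallel size x gradient
# accumulation, with data-parallel size = #GPUs / (TP x PP). Expert parallelism
# shards experts within the data-parallel group in these runs, so it does not
# change the data-parallel size used here.
def check_config(gpus, mbs, gbs, tp, pp, grad_accum):
    dp = gpus // (tp * pp)
    assert gbs == mbs * dp * grad_accum
    return dp

print(check_config(gpus=128, mbs=4, gbs=256, tp=2, pp=1, grad_accum=1))  # Mistral 7B:   DP = 64
print(check_config(gpus=128, mbs=1, gbs=256, tp=1, pp=4, grad_accum=8))  # Mixtral 8x7B: DP = 32
```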

Fast distributed checkpointing for better training resiliency

Distributed checkpointing is crucial for maintaining resiliency in large-scale training. The PyTorch native solution torch.save often lacks efficiency and scalability, leading to the development of more efficient solutions. For example, Azure Nebula and AWS Gemini offer asynchronous checkpointing, and the PyTorch Distributed Checkpoint (DCP) saves checkpoints per rank using threads. While these methods are faster than the vanilla `torch.save` by leveraging parallelism and asynchrony, challenges remain in achieving efficient asynchronous checkpointing. 

Specifically, these solutions perform asynchronous checkpointing either without parallelism within a multi-GPU server or by using Python threads (which can be inefficient due to the Python Global Interpreter Lock), leading to increased checkpointing times and lower training speed. They also force users to load a checkpoint with the same parallelism configuration (TP and PP sizes, for example) used to store it, preventing easy dynamic reconfiguration of parallelism during a long training run.

Megatron-Core v0.7 addresses these issues by introducing fully parallel and asynchronous saving capabilities. With fully parallel saving (FPS), data-parallel replicas perform parallel writes, enabling better utilization of the available file system bandwidth. Asynchronous parallel saving further speeds up distributed checkpointing by copying model parameters to the CPU (or local storage in the future) first before persisting the checkpoint to stable storage in the background, with minimal interruption to the main training process.

Figure 1. Fully parallel saving in Megatron-Core uses the data-parallel replicas for parallel writing across nodes (shown with TP size 2, PP size 2, and DP size 4, for a total of 16 processes)

Most importantly, Megatron-Core enables users to resume training from a checkpoint saved with different tensor and pipeline parallelism degrees, providing the flexibility to change training configurations as needed during training. 

The save and load APIs in Megatron-Core are designed to be highly similar to PyTorch native APIs, making it easy to adopt Megatron-Core distributed checkpointing. When not using a distributed optimizer, this improvement reduces checkpointing overhead by 26x for Nemotron-4 340B compared to the native PyTorch solution, and by 50x for Nemotron-4 15B. With the distributed optimizer, users can achieve a 42x reduction in checkpoint overhead for Nemotron-4 340B (Figure 2). 
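To give a feel for those APIs, here is a hedged sketch of saving and reloading a model with Megatron-Core's distributed checkpointing module, modeled on the pattern in the library's quickstart. Here `model` stands for a Megatron-Core model such as a GPTModel, the checkpoint path is hypothetical, and exact signatures may differ between versions.

```python
# Hedged sketch of Megatron-Core distributed checkpointing, modeled on the
# library's save/load APIs; exact signatures may differ between versions.
# `model` is assumed to be a Megatron-Core model (for example, a GPTModel),
# and the checkpoint path is hypothetical.
from megatron.core import dist_checkpointing

CKPT_DIR = "/shared_fs/ckpt/iter_0001000"

# Save: every rank contributes its shards of the model state.
sharded_state_dict = model.sharded_state_dict(prefix="")
dist_checkpointing.save(sharded_state_dict=sharded_state_dict,
                        checkpoint_dir=CKPT_DIR)

# Load: can be called from a job with different TP/PP sizes; each rank pulls
# only the shards it owns under the new parallel configuration.
sharded_state_dict = model.sharded_state_dict(prefix="")
state_dict = dist_checkpointing.load(sharded_state_dict=sharded_state_dict,
                                     checkpoint_dir=CKPT_DIR)
model.load_state_dict(state_dict)
```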

Figure 2. Checkpoint overhead comparison for a Nemotron-4 340B model across PyTorch native and the recent Megatron-Core releases

Improved scalability

Since the v0.5 release, Megatron-Core has supported fine-grained overlapping of the data-parallel gradient all-reduce with the backward pass. It does this by grouping parameters into buckets and initiating an asynchronous communication collective for a bucket as soon as all gradients for that bucket's parameters are ready. This improves Megatron-Core throughput by reducing the amount of exposed data-parallel communication, and is especially useful when running configurations with a small batch size per GPU (and less gradient accumulation).
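The sketch below illustrates the underlying idea in plain PyTorch. It is a conceptual example, not Megatron-Core's implementation: for simplicity it launches one asynchronous all-reduce per parameter as soon as that gradient is produced, whereas Megatron-Core groups parameters into larger buckets so each collective stays efficient.

```python
# Conceptual sketch of overlapping data-parallel gradient reduction with the
# backward pass in plain PyTorch (not Megatron-Core's implementation). One
# asynchronous all-reduce is launched per parameter as soon as its gradient is
# produced; Megatron-Core instead groups parameters into buckets.
import torch
import torch.distributed as dist

def attach_overlapped_grad_reduce(model):
    handles = []

    def hook(param):
        # Average this gradient across data-parallel ranks without blocking
        # the rest of the backward pass.
        param.grad /= dist.get_world_size()
        handles.append(dist.all_reduce(param.grad, async_op=True))

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)
    return handles

# Usage sketch for one training step:
#   handles.clear(); loss.backward()
#   for h in handles: h.wait()   # drain the overlapped communication
#   optimizer.step()
```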

Figure 3 shows per-GPU throughput for the Nemotron-4 15B model in a weak scaling experiment, where the batch size increases with the data-parallel size (global batch size = 3 × data-parallel size) and the tensor-parallel size is 8. We observe that this optimization improves throughput by 34% with a data-parallel size of 32 and a batch size of 96. The overlapping technique can be enabled with the --overlap-grad-reduce flag when using data parallelism. For more details, see the Megatron-Core documentation.

Figure 3. Effect of the --overlap-grad-reduce optimization for Nemotron-4 15B using NVIDIA H100 GPUs and BF16 precision

In the Megatron-Core v0.6 release, we introduced distributed optimizer support where the optimizer state is split over data-parallel replicas, reducing peak memory footprint. The distributed optimizer also breaks up the gradient all-reduce that was previously required into a gradient reduce-scatter (RS) and parameter all-gather (AG). Megatron-Core overlaps the reduce-scatter with the backward pass computation and the all-gather with the forward pass computation. These optimizations facilitate near-linear scaling for Nemotron-4 15B. Figure 4 shows per-GPU throughput for the 15B model with these optimizations enabled using a similar experiment setup to Figure 3.
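The sketch below shows the collective pattern behind a distributed optimizer step in plain PyTorch for a flattened parameter buffer. It is a conceptual, unoverlapped example rather than Megatron-Core's implementation, and `shard_optimizer` is a hypothetical callable that applies the optimizer update to one shard.

```python
# Conceptual sketch of the collectives behind a distributed optimizer step in
# plain PyTorch (unoverlapped; not Megatron-Core's implementation). Gradients
# are reduce-scattered so each data-parallel rank updates only its parameter
# shard, and the updated shards are all-gathered back into the full buffer.
# `shard_optimizer` is a hypothetical callable applying the update to one shard.
import torch
import torch.distributed as dist

def distributed_optimizer_step(flat_param, flat_grad, shard_optimizer):
    world = dist.get_world_size()
    rank = dist.get_rank()
    shard_size = flat_param.numel() // world   # assume divisibility for brevity

    # Reduce-scatter: each rank receives the averaged gradient for its shard only.
    grad_shard = torch.empty(shard_size, dtype=flat_grad.dtype, device=flat_grad.device)
    dist.reduce_scatter_tensor(grad_shard, flat_grad, op=dist.ReduceOp.AVG)

    # Local optimizer update on this rank's parameter shard.
    param_shard = flat_param.narrow(0, rank * shard_size, shard_size)
    shard_optimizer(param_shard, grad_shard)

    # All-gather: reassemble the full, updated parameter buffer on every rank.
    dist.all_gather_into_tensor(flat_param, param_shard.clone())
```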

Figure 4. Effect of overlapping the reduce-scatter (RS) collective with the backward pass and the all-gather (AG) collective with the forward pass for the Nemotron-4 15B model using NVIDIA H100 GPUs

In addition, the v0.7 release further improves the speed of Megatron-Core at large data-parallel sizes by providing better built-in heuristics for sizing buckets. This keeps communication bandwidth-bound rather than latency-bound even at DP sizes above 300 (more than 3,000 GPUs). Figure 5 shows per-GPU throughput for the 15B model up to a data-parallel size of 384 with the same weak scaling experiment setup and all optimizations enabled (distributed optimizer with both reduce-scatter and all-gather overlapping).
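To see why bucket size matters at these scales, here is a back-of-envelope alpha-beta cost model for a ring all-reduce on a single bucket. The latency and bus-bandwidth numbers are illustrative assumptions, not measurements from the experiments above; the point is that the latency term grows with the DP size while the bandwidth term depends only on the bucket's byte size, so small buckets become latency-bound as DP grows.

```python
# Back-of-envelope alpha-beta model for a ring all-reduce on one bucket,
# illustrating why bucket size matters at large data-parallel (DP) sizes.
# The latency and bus-bandwidth values below are illustrative assumptions,
# not measurements.
def ring_all_reduce_time(bucket_bytes, dp, latency_s=20e-6, bus_bandwidth_Bps=400e9):
    latency_term = 2 * (dp - 1) * latency_s                       # grows with DP
    bandwidth_term = 2 * (dp - 1) / dp * bucket_bytes / bus_bandwidth_Bps
    return latency_term, bandwidth_term

for bucket_mb in (5, 40, 160):
    lat, bw = ring_all_reduce_time(bucket_mb * 2**20, dp=384)
    print(f"{bucket_mb:>4} MB bucket at DP=384: "
          f"latency {lat*1e3:.2f} ms, bandwidth {bw*1e3:.2f} ms")
```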

Figure 5. Comparison of Megatron-Core 0.6 and 0.7 releases on Nemotron-4 15B using NVIDIA H100 GPUs up to a data-parallel size of 384

Megatron-Core optimizations also work out of the box for larger models that require other parallelism dimensions, including pipeline parallelism. The Nemotron-4 340B model uses these optimizations in Megatron-Core to achieve high training throughput at large GPU scales with BF16. Table 2 shows per-GPU throughput of Megatron-Core on the Nemotron-4 340B base model with different batch sizes. TP size is 8, PP size is 12, the number of virtual pipeline stages is 8, and sequence length is 4096. For more details, see the Nemotron-4 340B Technical Report. Note that other parallelism configurations can result in slightly higher throughputs.

| Precision | # of GPUs (H100) | Data-parallel size | Batch size | Per-GPU throughput (TFLOP/s/GPU) |
|---|---|---|---|---|
| BF16 | 1536 | 16 | 768 | 419.3 |
| BF16 | 3072 | 32 | 1536 | 418.3 |
| BF16 | 6144 | 64 | 2304 | 405.0 |

Table 2. Per-GPU throughput of Megatron-Core on the Nemotron-4 340B base model with different batch sizes

Get started

Megatron-Core is available as open source in the NVIDIA/Megatron-LM repository on GitHub and can be used with Megatron-LM or NVIDIA NeMo. Megatron-LM, a lightweight training framework, offers a customizable native PyTorch training loop, ideal for users preferring fewer abstraction layers. It serves as a straightforward entry point for exploring Megatron-Core. For more details, see the Megatron-Core documentation.

Megatron-Core is deeply integrated into NVIDIA NeMo, an enterprise-grade AI software platform with security, stability, manageability, and support. It incorporates Megatron-Core for LLM capabilities and provides a more extensive set of tools for multimodal and speech AI. To learn more, see the NVIDIA NeMo framework documentation.
