
    Train a Model Faster with torch.compile and Gradient Accumulation

By Awais | December 28, 2025 | 6 Mins Read

    Training a language model with a deep transformer architecture is time-consuming. However, there are techniques you can use to accelerate training. In this article, you will learn about:

    • Using torch.compile() to speed up the model
    • Using gradient accumulation to train a model with a larger effective batch size

    Let’s get started!

Photo by François Genon. Some rights reserved.

    Overview

    This article is divided into two parts; they are:

    • Using torch.compile()
    • Gradient Accumulation

    Using torch.compile

When you write your model code and run it with PyTorch, the code is executed in eager mode: each operation runs as soon as the interpreter reaches it, and the result is materialized in memory. This is the natural behavior for Python, an interpreted language. You can see it at work when you make a mistake in your code: the error does not surface until that line actually executes.
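Here is a small sketch of eager execution (an illustration, not from the article): the shape mismatch below is only reported when the second matrix multiplication actually runs.

import torch

a = torch.randn(4, 8)
b = torch.randn(8, 2)
c = a @ b  # runs immediately; c is materialized in memory
d = c @ a  # RuntimeError raised only here: (4, 2) @ (4, 8) shapes do not align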

    Running a model in eager mode is slow. Starting with PyTorch 2.0, you can use torch.compile() to compile a model for improved performance. This generates a new model object that is optimized. It is not the same model object you created using nn.Module, but it shares the same tensors with the original model. You can use this compiled model for forward pass, backward pass, and optimizer updates as usual.

Building a model and then compiling it into a computation graph is how TensorFlow 1.0 was designed to work. This makes debugging harder, since the model you execute no longer matches the code you wrote line by line. Therefore, you should not compile your model until you have run a trial and confirmed that it is error-free.

    Not all models can be compiled. However, if your model supports compilation, you immediately benefit from the speedup. To compile a model, all you need to do is replace the model object right before you are ready to use it:

...
model = LlamaForPretraining(model_config).to(device)
model.load_state_dict(checkpoint)
model = torch.compile(model)
...
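To check how much compilation buys you, a minimal timing sketch like the one below can help. This is an illustration, not part of the original training script: it assumes a CUDA device and a batch of input_ids and attn_mask prepared elsewhere. Note that the first few calls to a compiled model are slow because they trigger graph capture and code generation, so warm up before measuring.

import time
import torch

def average_forward_time(model, input_ids, attn_mask, n_iters=20):
    with torch.no_grad():
        # warm-up: the first compiled calls trigger graph capture and codegen
        for _ in range(3):
            model(input_ids, attn_mask)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_iters):
            model(input_ids, attn_mask)
        torch.cuda.synchronize()
    return (time.time() - start) / n_iters

Run it once on the original model and once on the compiled one to compare the per-call averages.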

    Do not load the model weights after compilation. This is because the compiled model is an object that shares the same weights as the original model. During compilation, the computation graph is built referencing the weight tensors of the original model. If you load the weights after compilation, the model may not work as expected.

    Similarly, to save the compiled model, you should refer to the original model’s state dict, as follows:

torch.save(getattr(model, "_orig_mod", model).state_dict(), "model.pth")

    The original model can be accessed from the compiled model using model._orig_mod. In the code above, we use getattr(model, "_orig_mod", model) to get the original model if it exists, or use model itself if it does not. This line of code works for both compiled and original models.
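Going the other way, a checkpoint saved from the compiled wrapper directly (rather than via _orig_mod as above) has every key prefixed with "_orig_mod.". A small helper like the following hypothetical load_checkpoint (assuming Python 3.9+ for str.removeprefix) can load either flavor into an uncompiled model:

def load_checkpoint(model, path):
    state = torch.load(path, map_location="cpu")
    # strip the prefix torch.compile adds when saving from the wrapper;
    # keys from a plain checkpoint pass through unchanged
    state = {k.removeprefix("_orig_mod."): v for k, v in state.items()}
    model.load_state_dict(state)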

    Gradient Accumulation

    When you train a model, you likely spend two to three times more time on the backward pass than the forward pass. This is because the backward pass is more computationally intensive and uses more memory.

    One easy trick to speed up training is to perform fewer backward passes. This can be achieved by increasing the batch size: with the same number of data samples, a larger batch size means fewer batches to process.

However, a larger batch size requires more memory. In a memory-constrained environment, you can mimic a larger batch size by running multiple forward and backward passes and accumulating the gradients before each optimizer update. This is called gradient accumulation.

    It is easier to explain this idea with code:

...
accumulate_steps = 4
for epoch in range(num_epochs):
    optimizer.zero_grad()
    for i, batch in enumerate(dataloader):
        # get batched data
        input_ids, target_ids = batch
        # create attention mask: causal mask + padding mask
        attn_mask = create_causal_mask(input_ids.shape[1], device) + \
                    create_padding_mask(input_ids, PAD_TOKEN_ID, device)
        # extract output from model
        logits = model(input_ids, attn_mask)
        # compute loss: cross-entropy between logits and target, ignoring padding tokens
        loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        loss = loss / accumulate_steps
        # run backward, but update only once every `accumulate_steps` steps
        loss.backward()
        if (i + 1) % accumulate_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()
            scheduler.step()

    The training loop above is an excerpt from the previous article for training a Llama model on your local GPU.

    Normally, when you run a forward pass, you calculate the loss. Then you call loss.backward() to backpropagate the loss gradient through the model parameters. In PyTorch, the backward() method is cumulative, meaning gradients are added up. Therefore, you need to call optimizer.zero_grad() explicitly to clear the gradients before running the backward pass.
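You can verify this cumulative behavior in a few lines (a toy sketch, separate from the training loop):

import torch

w = torch.ones(1, requires_grad=True)
(w * 2).backward()
print(w.grad)  # tensor([2.])
(w * 3).backward()
print(w.grad)  # tensor([5.]) -- added to the previous gradient, not replaced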

    In the code above, you deliberately do not call optimizer.zero_grad() in every iteration. Instead, you run backpropagation for the loss divided by accumulate_steps. This way, the gradients are scaled down but accumulated over accumulate_steps iterations. Once every accumulate_steps iterations, you run the optimizer to adjust the model parameters.
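To convince yourself that scaled accumulation matches a genuine large batch, here is a small numerical check on a toy linear model (an illustration, not from the article). It relies on the loss averaging over the batch, which is the default for losses like MSELoss and CrossEntropyLoss.

import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

# one big batch of 32
model.zero_grad()
loss_fn(model(x), y).backward()
big_grad = model.weight.grad.clone()

# four accumulated micro-batches of 8, each loss scaled by 1/4
model.zero_grad()
for xb, yb in zip(x.chunk(4), y.chunk(4)):
    (loss_fn(model(xb), yb) / 4).backward()

print(torch.allclose(model.weight.grad, big_grad, atol=1e-6))  # True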

    This approach yields results comparable to using a larger batch size. However, since you run fewer optimizer updates, the learning rate schedule should be adjusted accordingly. This means you need to initialize the scheduler with a different number of steps:

...
num_training_steps = (len(dataloader) // accumulate_steps) * num_epochs
cosine_scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps - num_warmup_steps,
    eta_min=0
)
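The T_max above subtracts num_warmup_steps, which assumes a separate warm-up phase. One way to wire that up (a sketch, not from the article; it reuses num_warmup_steps and cosine_scheduler from the snippet above) is to chain a linear warm-up with the cosine schedule using SequentialLR:

warmup_scheduler = lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=num_warmup_steps
)
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[num_warmup_steps],
)

The single scheduler.step() call in the training loop then drives both phases.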

    Summary

In this article, you learned that torch.compile() can speed up a model by compiling its computation graph. You also learned that gradient accumulation is a technique for training with a larger effective batch size by accumulating gradients over multiple mini-batches. Each mini-batch still needs its own forward and backward pass, but since you run fewer optimizer updates this way, you save time on parameter updates.
