Train a Model Faster with torch.compile and Gradient Accumulation

by Josh
December 27, 2025


Training a language model with a deep transformer architecture is time-consuming. However, there are techniques you can use to accelerate training. In this article, you will learn about:

  • Using torch.compile() to speed up the model
  • Using gradient accumulation to train a model with a larger effective batch size

Let’s get started!

Photo by François Genon. Some rights reserved.

Overview

This article is divided into two parts; they are:

  • Using torch.compile()
  • Gradient Accumulation

Using torch.compile

When you write your model code and run it with PyTorch, the code is executed in eager mode: each operation runs immediately, line by line, and its result is stored in memory. This is natural for Python, which is an interpreted language. You can tell this is the case because a mistake in your code does not raise an error until that line is actually run.

Running a model in eager mode can be slow, because every operation is dispatched from Python one at a time. Starting with PyTorch 2.0, you can use torch.compile() to compile a model for improved performance. This returns a new, optimized model object. It is not the same object as the nn.Module you created, but it shares the same tensors with the original model. Compilation itself is triggered the first time you call the compiled model, so expect the first few iterations to be slower. You can use this compiled model for the forward pass, backward pass, and optimizer updates as usual.

Building a model and compiling it as a computation graph is how TensorFlow 1.x worked. This makes debugging harder, since the code that actually executes no longer matches, line by line, the code you wrote. Therefore, you should not compile your model until you have run a trial in eager mode and confirmed that it is error-free.

Not all models can be compiled. However, if your model supports compilation, you immediately benefit from the speedup. To compile a model, all you need to do is replace the model object right before you are ready to use it:

...
model = LlamaForPretraining(model_config).to(device)
model.load_state_dict(checkpoint)
model = torch.compile(model)
...

Do not load the model weights after compilation. This is because the compiled model is an object that shares the same weights as the original model. During compilation, the computation graph is built referencing the weight tensors of the original model. If you load the weights after compilation, the model may not work as expected.
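The order of operations described above (load the weights, run a trial in eager mode, then compile) can be sketched as follows. This is a minimal illustration, not the article's training script: the small nn.Sequential model and the in-memory checkpoint are hypothetical stand-ins for the Llama model and its saved weights, and backend="eager" is an assumption made here so that the sketch runs without a C++ or Triton toolchain. In real training you would normally leave the backend at its default.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the article's Llama model
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

# Hypothetical checkpoint; in practice this would come from torch.load(...)
checkpoint = {k: v.clone() for k, v in model.state_dict().items()}

# 1. Load the weights into the original module first
model.load_state_dict(checkpoint)

# 2. Trial run in eager mode: one forward and one backward pass
x = torch.randn(2, 8)
eager_out = model(x)
eager_out.sum().backward()
model.zero_grad()

# 3. Compile only after the eager run succeeded.
#    backend="eager" is an assumption for this sketch (no code generation).
compiled_model = torch.compile(model, backend="eager")
compiled_out = compiled_model(x)

# The compiled wrapper refers to the very same parameter tensors
shares_weights = compiled_model._orig_mod[0].weight is model[0].weight
outputs_match = torch.allclose(eager_out.detach(), compiled_out.detach())
print(shares_weights, outputs_match)
```

Because the wrapper holds references to the original parameters, anything that updates those tensors in place, such as an optimizer step, is visible to both objects.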

Similarly, to save the compiled model, you should refer to the original model’s state dict, as follows:

torch.save(getattr(model, "_orig_mod", model).state_dict(), "model.pth")

The original model can be accessed from the compiled model using model._orig_mod. In the code above, we use getattr(model, "_orig_mod", model) to get the original model if it exists, or use model itself if it does not. This line of code works for both compiled and original models.
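A save-and-restore round trip using this idiom might look like the sketch below. The unwrap() helper is a hypothetical convenience name, the io.BytesIO buffer stands in for "model.pth", and backend="eager" is again an assumption to keep the example light. On PyTorch 2.x the compiled wrapper's own state_dict() keys are typically prefixed with "_orig_mod.", which is one practical reason to save from the original module instead.

```python
import io

import torch
import torch.nn as nn


def unwrap(model):
    """Return the original module behind a compiled wrapper, or the model itself."""
    return getattr(model, "_orig_mod", model)


model = nn.Linear(4, 2)
compiled_model = torch.compile(model, backend="eager")  # backend is an assumption

# Save from the unwrapped module so the keys match an uncompiled model
buffer = io.BytesIO()  # in-memory stand-in for "model.pth"
torch.save(unwrap(compiled_model).state_dict(), buffer)
buffer.seek(0)

# Restore into a fresh model first, then compile it
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer, weights_only=True))
restored = torch.compile(restored, backend="eager")

saved_keys = sorted(unwrap(compiled_model).state_dict().keys())
plain_keys = sorted(nn.Linear(4, 2).state_dict().keys())
keys_match = saved_keys == plain_keys
weights_match = torch.equal(unwrap(restored).weight, model.weight)
print(keys_match, weights_match)
```

The same unwrap() call works whether or not the model was compiled, so the saving code does not need to know which case it is in.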

Gradient Accumulation

When you train a model, you likely spend two to three times more time on the backward pass than the forward pass. This is because the backward pass is more computationally intensive and uses more memory.

One easy trick to speed up training is to perform fewer backward passes. This can be achieved by increasing the batch size: with the same number of data samples, a larger batch size means fewer batches to process.

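However, a larger batch size requires more memory. In a memory-constrained environment, you can mimic a larger batch size by running forward and backward passes on several smaller batches and accumulating the gradients before each optimizer update. This is called gradient accumulation. The backward pass still runs for every small batch, so the saving comes from fewer optimizer updates, not from fewer backward passes.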

It is easier to explain this idea with code:

...
accumulate_steps = 4
for epoch in range(num_epochs):
    optimizer.zero_grad()
    for i, batch in enumerate(dataloader):
        # get batched data
        input_ids, target_ids = batch
        # create attention mask: causal mask + padding mask
        attn_mask = create_causal_mask(input_ids.shape[1], device) + \
                    create_padding_mask(input_ids, PAD_TOKEN_ID, device)
        # extract output from model
        logits = model(input_ids, attn_mask)
        # compute loss: cross-entropy between logits and target, ignoring padding tokens
        loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        loss = loss / accumulate_steps
        # Run backward, but update only once every `accumulate_steps` steps
        loss.backward()
        if (i + 1) % accumulate_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()
            scheduler.step()

The training loop above is an excerpt from the previous article for training a Llama model on your local GPU.

Normally, when you run a forward pass, you calculate the loss. Then you call loss.backward() to backpropagate the loss gradient through the model parameters. In PyTorch, the backward() method is cumulative, meaning gradients are added up. Therefore, you need to call optimizer.zero_grad() explicitly to clear the gradients before running the backward pass.
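The cumulative behavior of backward() is easy to see on a single tensor. The sketch below uses a hypothetical two-element parameter w and a toy loss; it is only meant to show that a second backward() call adds to the stored gradient, and that optimizer.zero_grad() clears it.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# First backward pass: d/dw of sum(3 * w) is 3 for every element
(w * 3).sum().backward()
first = w.grad.clone()

# Second backward pass without zero_grad(): the new gradient is added on top
(w * 3).sum().backward()
second = w.grad.clone()

# zero_grad() clears the stored gradient. Depending on the PyTorch version
# and the set_to_none argument, this leaves either None or a zero tensor.
optimizer.zero_grad()
cleared = w.grad is None or bool((w.grad == 0).all())

print(first, second, cleared)
```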

In the code above, you deliberately do not call optimizer.zero_grad() in every iteration. Instead, you run backpropagation for the loss divided by accumulate_steps. This way, the gradients are scaled down but accumulated over accumulate_steps iterations. Once every accumulate_steps iterations, you run the optimizer to adjust the model parameters.
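To see why dividing the loss by accumulate_steps gives comparable results, the sketch below compares the gradient from one full batch with the gradient accumulated over four smaller batches. The linear model, random data, and MSE loss are assumptions chosen to keep the example small; they are not the article's Llama setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
accumulate_steps = 4

# Hypothetical data: 16 samples, split into 4 micro-batches of 4
x = torch.randn(16, 8)
y = torch.randn(16, 1)
model = nn.Linear(8, 1)
loss_fn = nn.MSELoss()  # averages over the samples in a batch

# Reference: a single backward pass on the full batch
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Gradient accumulation: scale each micro-batch loss, never zero in between
model.zero_grad()
for xb, yb in zip(x.chunk(accumulate_steps), y.chunk(accumulate_steps)):
    loss = loss_fn(model(xb), yb) / accumulate_steps
    loss.backward()
accum_grad = model.weight.grad.clone()

grads_match = torch.allclose(full_grad, accum_grad, atol=1e-6)
print(grads_match)
```

The match is exact (up to floating-point error) here because every micro-batch has the same number of samples. With a loss that averages over non-padding tokens, as in the training loop above, micro-batches can contain different token counts, so the accumulated gradient is only approximately equal to the large-batch gradient.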

This approach yields results comparable to using a larger batch size. However, since you run fewer optimizer updates, the learning rate schedule should be adjusted accordingly. This means you need to initialize the scheduler with a different number of steps:

...
num_training_steps = (len(dataloader) // accumulate_steps) * num_epochs
cosine_scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps - num_warmup_steps,
    eta_min=0
)
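The step count can be checked against the training loop with a small simulation. Everything numeric below is hypothetical (1,000 batches per epoch, 3 epochs, 50 warm-up steps), and the LinearLR warm-up combined through SequentialLR is an assumption about how the warm-up is wired in, since that part of the script is not shown here.

```python
import torch
from torch.optim import lr_scheduler

# Hypothetical values; batches_per_epoch stands in for len(dataloader)
batches_per_epoch = 1000
num_epochs = 3
accumulate_steps = 4
num_warmup_steps = 50

num_training_steps = (batches_per_epoch // accumulate_steps) * num_epochs

param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=1e-3)

# Assumed warm-up: linear ramp, followed by the cosine decay from the article
warmup_scheduler = lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, end_factor=1.0, total_iters=num_warmup_steps
)
cosine_scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_training_steps - num_warmup_steps, eta_min=0
)
scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[num_warmup_steps],
)

# Walk through the loop structure and step only on accumulation boundaries
steps_taken = 0
for epoch in range(num_epochs):
    for i in range(batches_per_epoch):
        if (i + 1) % accumulate_steps == 0:
            optimizer.step()
            scheduler.step()
            steps_taken += 1

final_lr = optimizer.param_groups[0]["lr"]
print(steps_taken, num_training_steps, final_lr)
```

If the scheduler were initialized with the number of batches instead of the number of optimizer updates, the cosine curve would be only partly traversed by the end of training and the learning rate would never decay to eta_min.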

Summary

In this article, you learned that torch.compile() can speed up a model by compiling its computation graph. You also learned that gradient accumulation is a technique for training with a larger effective batch size by accumulating gradients from multiple mini-batches. Since you run fewer optimizer updates this way, you save time on parameter updates, although a backward pass still runs for every mini-batch.


