
Train a Model Faster with torch.compile and Gradient Accumulation

by Josh
December 27, 2025
in AI, Analytics and Automation

Training a language model with a deep transformer architecture is time-consuming. However, there are techniques you can use to accelerate training. In this article, you will learn about:

  • Using torch.compile() to speed up the model
  • Using gradient accumulation to train a model with a larger effective batch size

Let’s get started!

Photo by François Genon. Some rights reserved.

Overview

This article is divided into two parts; they are:

  • Using torch.compile()
  • Gradient Accumulation

Using torch.compile

When you write your model code and run it with PyTorch, the code is executed in eager mode: each line runs as it is reached, and the results are stored in memory. This fits naturally with Python, since Python itself is an interpreted language. You can tell this is the case because when you make a mistake in your code, you will not see the error until that line of code actually runs.
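
As a quick illustration (not from the original article), here is a hypothetical module with a deliberate shape bug: constructing it raises no error, and the mistake only surfaces when the offending line runs during the forward pass.

import torch
import torch.nn as nn

class BuggyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(64, 8)   # bug: input features should be 32, not 64

    def forward(self, x):
        x = self.fc1(x)
        return self.fc2(x)            # the mismatch is only detected here

model = BuggyModel()                  # constructing the model raises no error
try:
    model(torch.randn(4, 16))         # the error appears only when this line executes
except RuntimeError as err:
    print("Caught at run time:", err)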

Running a model in eager mode is slower than it could be, because each operation is dispatched and executed one at a time. Starting with PyTorch 2.0, you can use torch.compile() to compile a model for improved performance. This produces a new, optimized model object. It is not the same object you created with nn.Module, but it shares the same tensors as the original model. You can use the compiled model for the forward pass, backward pass, and optimizer updates as usual.

Building a model and then compiling it into a computation graph is how TensorFlow 1.x worked. This makes debugging harder, since the model you execute no longer matches your code line by line. Therefore, you should not compile your model until you have run a trial in eager mode and confirmed that it is error-free.

Not all models can be compiled. However, if your model supports compilation, you immediately benefit from the speedup. To compile a model, all you need to do is replace the model object right before you are ready to use it:

...
model = LlamaForPretraining(model_config).to(device)
model.load_state_dict(checkpoint)
model = torch.compile(model)
...
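
To see whether compilation actually pays off on your hardware, you can time a few training steps in eager mode and again after torch.compile(). The sketch below uses a small stand-in transformer rather than the article's LlamaForPretraining, and the warm-up calls matter because the first invocation of a compiled model triggers compilation; the numbers you get will depend on your GPU and PyTorch version.

import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# small stand-in model; the article's LlamaForPretraining comes from an earlier post
model = nn.Sequential(*[
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
    for _ in range(4)
]).to(device)
x = torch.randn(8, 128, 256, device=device)

def avg_step_time(m, steps=20):
    for _ in range(3):                 # warm-up; the first compiled call builds the graph
        m(x).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        m(x).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps

print("eager:   ", avg_step_time(model))
print("compiled:", avg_step_time(torch.compile(model)))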

Do not load the model weights after compilation. This is because the compiled model is an object that shares the same weights as the original model. During compilation, the computation graph is built referencing the weight tensors of the original model. If you load the weights after compilation, the model may not work as expected.

Similarly, to save the compiled model, you should refer to the original model’s state dict, as follows:

torch.save(getattr(model, "_orig_mod", model).state_dict(), "model.pth")

The original model can be accessed from the compiled model using model._orig_mod. In the code above, we use getattr(model, "_orig_mod", model) to get the original model if it exists, or use model itself if it does not. This line of code works for both compiled and original models.
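
Putting the two rules together, a minimal save-and-reload sketch (assuming the LlamaForPretraining class from the earlier article and a hypothetical model.pth path) looks like this: load the weights into the plain model first, and only then compile.

import torch

# saving: always go through the original module's state dict
torch.save(getattr(model, "_orig_mod", model).state_dict(), "model.pth")

# reloading later: build the model, load the weights, then compile
model = LlamaForPretraining(model_config).to(device)
model.load_state_dict(torch.load("model.pth", map_location=device))
model = torch.compile(model)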

Gradient Accumulation

When you train a model, you likely spend two to three times more time on the backward pass than the forward pass. This is because the backward pass is more computationally intensive and uses more memory.

One easy trick to speed up training is to perform fewer, larger backward passes and fewer optimizer updates. This can be achieved by increasing the batch size: with the same number of data samples, a larger batch size means fewer batches to process per epoch.

However, a larger batch size requires more memory. In a memory-constrained environment, you can mimic a larger batch size by running multiple forward and backward passes and accumulating the gradients before each optimizer update. This is called gradient accumulation.

It is easier to explain this idea with code:

...
accumulate_steps = 4

for epoch in range(num_epochs):
    optimizer.zero_grad()
    for i, batch in enumerate(dataloader):
        # get batched data
        input_ids, target_ids = batch
        # create attention mask: causal mask + padding mask
        attn_mask = create_causal_mask(input_ids.shape[1], device) + \
                    create_padding_mask(input_ids, PAD_TOKEN_ID, device)
        # extract output from model
        logits = model(input_ids, attn_mask)
        # compute loss: cross-entropy between logits and target, ignoring padding tokens
        loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        loss = loss / accumulate_steps
        # run backward, but update only once every `accumulate_steps` steps
        loss.backward()
        if (i + 1) % accumulate_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            optimizer.zero_grad()
            scheduler.step()

The training loop above is an excerpt from the previous article on training a Llama model on your local GPU.

Normally, when you run a forward pass, you calculate the loss. Then you call loss.backward() to backpropagate the loss gradient to the model parameters. In PyTorch, the backward() method is cumulative, meaning new gradients are added to whatever is already stored in each parameter's .grad. Therefore, you need to call optimizer.zero_grad() explicitly to clear the gradients before computing the gradients for the next update.
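
You can verify this accumulating behavior with a single scalar parameter; the snippet below is just an illustration, not part of the article's training code.

import torch

w = torch.tensor(2.0, requires_grad=True)

(w * 3).backward()     # d/dw of 3*w is 3
print(w.grad)          # tensor(3.)

(w * 5).backward()     # gradients are added, not replaced
print(w.grad)          # tensor(8.)

w.grad = None          # roughly what optimizer.zero_grad() does for each parameter
(w * 5).backward()
print(w.grad)          # tensor(5.)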

In the code above, you deliberately do not call optimizer.zero_grad() in every iteration. Instead, you run backpropagation for the loss divided by accumulate_steps. This way, the gradients are scaled down but accumulated over accumulate_steps iterations. Once every accumulate_steps iterations, you run the optimizer to adjust the model parameters.
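
To convince yourself that dividing the loss by accumulate_steps reproduces the large-batch gradient, you can compare the two on a toy linear model (an illustrative sketch, not the article's Llama setup):

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
X, y = torch.randn(8, 10), torch.randn(8, 1)

# one big batch of 8 in a single backward pass
model.zero_grad()
loss_fn(model(X), y).backward()
full_grad = model.weight.grad.clone()

# two micro-batches of 4, each loss divided by accumulate_steps = 2
model.zero_grad()
for Xb, yb in ((X[:4], y[:4]), (X[4:], y[4:])):
    (loss_fn(model(Xb), yb) / 2).backward()

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True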

This approach yields results comparable to using a larger batch size. However, since you run fewer optimizer updates, the learning rate schedule should be adjusted accordingly. This means you need to initialize the scheduler with a different number of steps:

...
num_training_steps = (len(dataloader) // accumulate_steps) * num_epochs
cosine_scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps - num_warmup_steps,
    eta_min=0
)
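
The excerpt above references num_warmup_steps but does not show the warmup scheduler itself. One common pattern (an assumption here, not necessarily how the earlier article wires it up) is to chain a linear warmup and the cosine schedule with lr_scheduler.SequentialLR:

from torch.optim import lr_scheduler

num_warmup_steps = 100   # illustrative value

warmup_scheduler = lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, end_factor=1.0, total_iters=num_warmup_steps
)
cosine_scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_training_steps - num_warmup_steps, eta_min=0
)
# warm up first, then switch to cosine annealing
scheduler = lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup_scheduler, cosine_scheduler],
    milestones=[num_warmup_steps]
)

As in the training loop above, scheduler.step() should then be called only on the iterations where the optimizer itself steps.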

Summary

In this article, you learned that using torch.compile() can speed up your model by compiling it into an optimized computation graph. You also learned that gradient accumulation is a technique for training with a larger effective batch size by accumulating gradients from multiple mini-batches. Since you run fewer optimizer updates this way, you save time on gradient clipping and parameter updates.


