The dominant recipe for building better language models has not changed much since the Chinchilla era: spend more FLOPs, add more parameters, train on more tokens. But as inference deployments consume an ever-growing share of compute and model deployments push toward the edge, researchers are increasingly asking a harder question — can you scale quality without scaling memory footprint?
A team of researchers from UC San Diego and Together AI has introduced Parcae, a stable looped transformer architecture that outperforms prior looped models and beats fixed-depth Transformer baselines at every scale tested, all while using the same parameter count and the same training-data budget.

What is a Looped Language Model?
In a standard Transformer, activations flow through a fixed stack of layers exactly once. A looped architecture instead routes activations through a block of layers T times in a loop, multiplying effective compute without adding parameters. Think of it as running the same group of transformer blocks repeatedly rather than building a taller model.
Parcae specifically uses a middle-looped design, partitioning the architecture into three functional blocks: a prelude (P) that embeds the input sequence into a latent state e; a recurrent block (R) that iteratively updates a hidden state h_t for T loops, with e injected at each iteration to maintain the input’s influence; and a coda (C) that processes the final h_T to produce the output. This structure keeps the model compact in memory, a valuable property for on-device deployment, while enabling significantly more compute per forward pass.
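The three-block loop is easy to sketch. The toy NumPy code below is an illustrative stand-in, not the paper's actual layers: the prelude, recurrent block, and coda are single weight matrices with tanh updates rather than full transformer blocks. It shows how a prelude, a shared recurrent block applied T times with e re-injected, and a coda compose into one forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden width

# Illustrative stand-ins for the three blocks; in Parcae these are
# full transformer layers, not single weight matrices.
W_p = 0.1 * rng.normal(size=(d, d))  # prelude P: embeds the input
W_r = 0.1 * rng.normal(size=(d, d))  # recurrent block R (shared across loops)
W_c = 0.1 * rng.normal(size=(d, d))  # coda C: maps the final state to output

def looped_forward(x, T):
    e = np.tanh(x @ W_p)            # prelude: latent input state e
    h = np.zeros_like(e)            # initial hidden state h_0
    for _ in range(T):              # recurrent block applied T times
        h = np.tanh((h + e) @ W_r)  # e re-injected at every iteration
    return h @ W_c                  # coda: output from h_T

x = rng.normal(size=(1, d))
out_shallow = looped_forward(x, T=4)  # 4 loops through one set of weights
out_deep = looped_forward(x, T=8)     # 8 loops: more compute, zero new parameters
```

Note that `out_shallow` and `out_deep` come from identical parameters; only the loop count, and hence the effective compute, differs.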
Past works on looped transformers, including Recurrent Depth Models (RDMs), showed early promise but were quite difficult to train. They suffered from residual state explosion — where the hidden state vector grows uncontrollably across loop iterations — and frequent loss spikes. Sensitive hyperparameter tuning was required just to achieve convergence.
The Root Cause: An Unconstrained Residual System
The Parcae team’s key insight is to recast the looped model’s forward pass as a nonlinear time-variant dynamical system over the residual stream:
h_{t+1} = Ā h_t + B̄ e + R̄(h_t, e).
Here, Ā controls the balance between prior and current residual states, B̄ injects the input signal, and R̄ is the nonlinear contribution of the transformer blocks (attention and MLPs). Dropping R̄ yields a discrete linear time-invariant (LTI) system, and classical control theory immediately gives the stability condition: the system is stable when the spectral radius ρ(Ā) < 1, marginally stable when ρ(Ā) = 1, and unstable when ρ(Ā) > 1.
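A small NumPy experiment makes the three regimes concrete. The dimensions and matrices below are chosen purely for illustration: iterating the linear part h_{t+1} = Ā h_t + B̄ e with different choices of Ā shows why an unconstrained spectral radius lets the state explode while ρ(Ā) < 1 stays bounded:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16  # toy state dimension

def iterate(A_bar, steps=50):
    """Iterate the linear part h_{t+1} = A_bar @ h_t + B_bar @ e of the system."""
    B_bar = 0.1 * np.eye(d)
    e = rng.normal(size=d)
    h = np.zeros(d)
    for _ in range(steps):
        h = A_bar @ h + B_bar @ e
    return float(np.linalg.norm(h))

# Addition-based injection: A_bar = I, so rho(A_bar) = 1 (marginally stable).
rho_add = max(abs(np.linalg.eigvals(np.eye(d))))

# Unconstrained A_bar (concat-with-projection, RDM-style): rho can exceed 1.
A_free = rng.normal(size=(d, d))
rho_free = max(abs(np.linalg.eigvals(A_free)))
norm_free = iterate(A_free)      # residual state explosion: norm blows up

# Diagonal A_bar with entries in (0, 1): rho < 1, stable by construction.
A_stable = np.diag(rng.uniform(0.1, 0.9, size=d))
norm_stable = iterate(A_stable)  # converges to a bounded fixed point
```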
Examining prior methods under this framework reveals the problem precisely. Addition-based input injection sets Ā = I (the identity matrix), meaning ρ(Ā) = 1 — marginally stable. The concatenation-with-projection approach used by RDMs leaves Ā entirely unconstrained, making ρ(Ā) potentially far greater than 1 — unstable. Empirical training curves confirm this directly: divergent training runs learn ρ(Ā) ≥ 1, while the few convergent runs maintain ρ(Ā) < 1.
How Parcae Enforces Stability by Design
Rather than parameterizing Ā directly, Parcae works in continuous form and discretizes using zero-order hold (ZOH) and Euler schemes, borrowing a standard technique from state space models like Mamba and S4, with a learned step size Δ ∈ ℝ^{d_h}, giving Ā = exp(ΔA) and B̄ = ΔB. To guarantee ρ(Ā) < 1, the continuous matrix A is constrained to be negative diagonal: A := Diag(−exp(logA)), where logA ∈ ℝ^{d_h} is a learnable vector. Since the diagonal entries of A are strictly negative and Δ is positive, every diagonal entry of Ā = exp(ΔA) lies in (0, 1), so ρ(Ā) < 1 holds by construction throughout training.
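The constrained parameterization is only a few lines of code. This NumPy sketch uses random stand-ins for the learned vectors and takes B as 1 for simplicity (both assumptions are ours, not the paper's); it follows the construction described above and checks that the resulting diagonal Ā always has spectral radius below 1:

```python
import numpy as np

rng = np.random.default_rng(2)
d_h = 32  # state width (illustrative)

# Stand-ins for learned parameters; in Parcae these are trained by gradient descent.
log_A = rng.normal(size=d_h)          # the learnable vector logA
delta = np.exp(rng.normal(size=d_h))  # learned step size, kept positive

# Continuous-time matrix constrained to be negative diagonal:
A = -np.exp(log_A)                    # A := Diag(-exp(logA)), entries strictly < 0

# ZOH/Euler discretization as described above (B taken as 1 for simplicity):
A_bar = np.exp(delta * A)             # diagonal of A_bar, every entry in (0, 1)
B_bar = delta * 1.0                   # B_bar = delta * B

rho = float(np.max(np.abs(A_bar)))    # spectral radius of the diagonal A_bar
# rho < 1 holds for ANY values of log_A and delta, so stability never
# depends on what the optimizer happens to learn.
```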
Results: Outperforming Models Twice the Size
Against parameter- and data-matched RDMs trained on the Huginn dataset, Parcae reduces validation perplexity by up to 6.3% — a figure that peaks at 350M scale (improving from 10.76 to 10.09 PPL) versus a 4.5% gain at 100M scale (14.23 to 13.59 PPL). WikiText perplexity improves by up to 9.1% at 350M scale. Average downstream zero-shot benchmark accuracy improves by up to 1.8 points.
Against standard fixed-depth Transformer baselines trained with a nanochat-inspired setup on FineWeb-Edu, Parcae outperforms at every scale. At 1.3B parameters trained on 104B tokens, Parcae beats the parameter-matched Transformer by 2.99 points on Core and 1.18 points on Core-Extended. The 770M Parcae model (25.07 Core) reaches quality comparable to the 1.3B Transformer (25.45 Core) — roughly half the parameters for equivalent capability. The research team quantifies Parcae’s parameter efficiency as achieving up to 87.5% of the quality of a Transformer twice its size, measured against the quality gap to the next larger model.
The First Scaling Laws for Looping
The second major contribution of this research is establishing the first predictable scaling laws for layer looping. Using isoFLOP experiments at 140M and 370M scales, the research team shows that compute-optimal training increases mean recurrence µ_rec and training tokens D in tandem, following power laws with consistent exponents across both scales: optimal µ_rec scales as C^0.40 and optimal tokens scale as C^0.78, where C is the training FLOP budget.
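Plugging numbers into the reported exponents shows what the laws imply in practice. In this sketch only the exponents (0.40 and 0.78) come from the article; the proportionality constants `k_mu` and `k_D` are hypothetical placeholders:

```python
# Only the exponents 0.40 and 0.78 come from the article; the constants
# k_mu and k_D are made-up placeholders for illustration.
def optimal_mu_rec(C, k_mu=1.0):
    """Compute-optimal mean recurrence as a function of FLOP budget C."""
    return k_mu * C ** 0.40

def optimal_tokens(C, k_D=1.0):
    """Compute-optimal training-token count as a function of FLOP budget C."""
    return k_D * C ** 0.78

# Doubling the FLOP budget raises the optimal recurrence by a factor of
# 2^0.40 (~1.32x) and the optimal token count by 2^0.78 (~1.72x), in tandem.
ratio_mu = optimal_mu_rec(2.0) / optimal_mu_rec(1.0)
ratio_tokens = optimal_tokens(2.0) / optimal_tokens(1.0)
```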
When looped Parcae models trained at their optimal µrec are compared against fixed-depth Parcae models (µrec = 1) under identical FLOP and parameter budgets, looping achieves a strictly lower validation loss — translating into 1.2 to 2.0 points higher Core scores depending on the FLOP budget. Looping is a genuinely orthogonal axis for scaling compute, not a free lunch from weight sharing.
At test time, increasing loop count T beyond training depth follows a saturating exponential decay: L(T) = L_∞ + Z·e^(−z·T), where L_∞ is an irreducible floor determined by training depth. Gains plateau near µ_rec, the mean recurrence used during training, meaning training depth sets a hard ceiling on test-time scaling. These dynamics unify into a single parametric law that predicts held-out model loss within 0.85–1.31% average error.
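The saturating law is simple to evaluate. With illustrative constants for L_∞, Z, and z (the actual values are fit per model; the numbers below are ours), the diminishing returns of extra test-time loops are easy to see:

```python
import math

def loss_at_depth(T, L_inf=2.0, Z=1.0, z=0.5):
    """Saturating law L(T) = L_inf + Z * exp(-z * T); constants are illustrative."""
    return L_inf + Z * math.exp(-z * T)

# Each extra loop helps less than the last; past the plateau the loss is
# pinned at the irreducible floor L_inf set by training depth.
losses = [loss_at_depth(T) for T in (1, 4, 16, 64)]
```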
Key Takeaways
- Looped transformers can now be trained reliably at scale: Parcae is a looped architecture designed to solve the residual state explosion and loss spike problems that have plagued prior looped models, achieving stable training across a wide range of learning rates where previous approaches diverged.
- A 770M Parcae model matches the quality of a 1.3B standard Transformer: By reusing the same layers across multiple loop iterations instead of adding more parameters, Parcae delivers equivalent downstream capability at roughly half the memory footprint.
- Looping is a third orthogonal axis for scaling compute, alongside parameters and data: Under a fixed FLOP and parameter budget, compute-optimal training requires increasing mean recurrence and training tokens in tandem following predictable power laws — giving AI professionals a new lever to improve quality without buying more hardware.
- Test-time looping has a hard ceiling set by training depth: Parcae can use additional loop iterations at inference to scale compute, but gains plateau near the mean recurrence used during training. You cannot infinitely loop your way to better performance without training the model at deeper recurrences first.
Check out the Paper, Model Weights and Technical details.