Can a 7-million parameter model outperform billion-parameter giants? A new paper suggests yes — if we trade depth for time.
In the current era of Generative AI, the prevailing dogma is “Scaling Laws”: to get smarter models, you need more parameters, more data, and more GPUs. We are accustomed to models like Llama 3 or GPT-4 with hundreds of billions of parameters.
But a recent paper, “Less is More: Recursive Reasoning with Tiny Networks”, challenges this view. It proposes Tiny Recursive Models (TRM), an architecture that achieves state-of-the-art results on complex reasoning tasks (like ARC-AGI and difficult Sudoku) using a single, tiny network that “thinks” recursively.
I was fascinated by this concept, so I decided to adapt it. While the paper focuses on puzzles, I wanted to see if this architecture could write stories. I implemented a TRM in PyTorch and trained it on the TinyStories dataset. The result is a highly efficient model that generates coherent text by iterating on its own internal state before outputting a single token.
Here is how TRM works, why it is efficient, and how I implemented it for text generation.
The Core Concept: Depth in Time, Not Space
Standard Transformers are “feed-forward” in nature. Input goes in layer 1, passes through layer 12 (or 96), and an output comes out. To make the model smarter, we usually add more layers (depth in space).
TRM flips this. Instead of adding more layers, it reuses the same tiny network multiple times (depth in time).
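The distinction is easy to see in code. The sketch below is not TRM itself, just the weight-sharing idea, using a standard PyTorch encoder layer as a stand-in: a 12-layer stack holds 12 sets of weights, while the recursive version applies one set of weights 12 times.

```python
import torch
import torch.nn as nn

dim = 256

# Depth in space: 12 distinct layers -> 12x the parameters
deep = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=12,
)

# Depth in time: ONE layer, applied 12 times -> 1x the parameters
block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

def recursive_forward(x, n_steps=12):
    for _ in range(n_steps):
        x = block(x)  # same weights reused at every step
    return x

out = recursive_forward(torch.randn(1, 5, dim))
```

Both produce an effective depth of 12, but the recursive version stores only one layer's worth of parameters.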
The Loop: x, y, z
The model maintains three states:
- x (Input): The embedded question or context (e.g., “Once upon a time”).
- y (Answer): The current best guess for the next token.
- z (Latent Reasoning): A “scratchpad” or “thought vector” that stores the reasoning process.
In a standard Large Language Model (LLM), the “reasoning” is often spread across the deep layers or outputted explicitly via Chain-of-Thought (CoT). In TRM, the reasoning happens in the loop.
The process works like this:
- Latent Recursion: The model updates its thought vector z multiple times (n times) based on the input x and current answer y.
- Prediction Update: It updates the answer y based on the refined thought z.
- Repeat: This whole cycle repeats T times.
By the time the model actually predicts the next token, it has effectively run through a “virtual” depth of dozens of layers, despite physically having only two.
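The nested loop above can be sketched in a few lines. Here `net` stands in for the tiny shared network, and the `combine_*` projections are my own simplification (concatenate, then project back to model width), not the paper's exact layers:

```python
import torch
import torch.nn as nn

dim = 64
# Stand-in for the tiny shared network (in TRM this would be a 2-layer transformer)
net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
combine_xyz = nn.Linear(3 * dim, dim)  # projects [x, y, z] back to model width
combine_yz = nn.Linear(2 * dim, dim)   # projects [y, z] back to model width

def trm_cycle(x, y, z, n=4, T=3):
    """T cycles; each cycle refines the thought z n times, then the answer y once."""
    for _ in range(T):
        for _ in range(n):
            z = net(combine_xyz(torch.cat([x, y, z], dim=-1)))  # latent recursion
        y = net(combine_yz(torch.cat([y, z], dim=-1)))          # prediction update
    return y, z

x = torch.randn(1, dim)
y = torch.zeros(1, dim)  # initial answer state
z = torch.zeros(1, dim)  # initial thought state
y, z = trm_cycle(x, y, z)
```

Note that x is never overwritten; only y and z evolve across the loop.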
The TRM Architecture
Why “Tiny”?
The paper argues that on difficult tasks with limited data, large models suffer from an “overfitting penalty”. A massive model memorizes the training data rather than learning the underlying logic.
By forcing a small model (e.g. 2 layers) to solve the problem recursively, we force it to learn generalizable rules (parameter sharing) rather than memorizing patterns. The authors found that a 2-layer TRM significantly outperformed 4-layer variants and massive LLMs on tasks like Sudoku-Extreme.
From Theory to Code: Adapting TRM for Text
For text generation, I adapted the architecture to use Causal Self-Attention (standard GPT style) but kept the recursive backbone.
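For readers unfamiliar with the causal part: the mask below is the standard GPT-style trick (not TRM-specific), ensuring position i can only attend to positions ≤ i so the model cannot peek at future tokens.

```python
import torch
import torch.nn.functional as F

# Minimal causal self-attention mask: position i attends only to positions <= i.
T = 5
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # lower-triangular

scores = torch.randn(T, T)                     # raw attention scores
scores = scores.masked_fill(~mask, float("-inf"))  # block future positions
attn = F.softmax(scores, dim=-1)               # rows sum to 1 over allowed positions
```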
Here is the implementation strategy I used in the accompanying notebook.
1. The Architecture
I used a TinyRecursiveModel class. Key to the paper’s insight is that we don’t need separate networks for “high” and “low” level reasoning (as previous models like HRM did). A single network is sufficient.
class TinyRecursiveModel(nn.Module):
    def __init__(self, vocab_size, dim=256, n_layers=2, ...):
        # ...
        # Single tiny 2-layer network
        self.net = TinyRecursiveNetwork(dim, n_heads, n_layers, ...)
        # Learnable initial states for y (answer) and z (reasoning)
        self.y_init = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.z_init = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
2. The Recursive Step
The magic happens in the latent_recursion method. We combine the input, current guess, and thought vector, pass them through the network to update the thought (z), and finally update the guess (y).
def latent_recursion(self, x, y, z):
    # Latent reasoning: update z 'n' times
    for _ in range(self.n_latent_recursions):
        combined = self.combine_xyz(torch.cat([x, y, z], dim=-1))
        z = self.net(combined)
    # Refine prediction: update y given (y, z)
    combined_yz = self.combine_yz(torch.cat([y, z], dim=-1))
    y = self.net(combined_yz)
    return y, z
3. Deep Supervision & Training
Training recursive models can be unstable. If the gradient vanishes through too many loops, the model learns nothing. The paper solves this with Deep Supervision.
Instead of only calculating loss at the very end, we calculate loss at every improvement step. This forces the model to try and get the answer right as early as possible, while allowing it to refine the answer if it needs more time.
I implemented this in the forward pass:
for step in range(n_supervision_steps):
    # Run a recursion cycle (some steps with gradients, some without)
    y, z, logits, halt_logit = self.deep_recursion(x, y, z, use_grad=True)
    # Calculate loss on the prediction (y) at THIS step
    ce_loss = F.cross_entropy(...)
    # Accumulate loss
    total_loss += ce_loss
I also included Exponential Moving Average (EMA) for weights, as the paper notes that recursive models on small data can diverge quickly without it.
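EMA itself is simple: keep a lagging copy of the weights and blend it toward the live weights after every optimizer step. A minimal sketch (the helper name is mine):

```python
import copy
import torch

def update_ema(model, ema_model, decay=0.999):
    """Blend the EMA copy toward the live weights after each optimizer step."""
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)

model = torch.nn.Linear(8, 8)
ema_model = copy.deepcopy(model)  # starts as an exact copy, then lags behind
# ... after every optimizer.step():
update_ema(model, ema_model)
```

At evaluation time you run the EMA copy, which smooths out the noisy updates that can destabilize recursive training.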
4. Adaptive Computation Time (ACT)
One cool feature of TRM is that it knows when to stop “thinking.” The paper implements a simplified ACT where the model outputs a halting probability. If the model is confident in its answer early, it can stop recursing, saving compute.
In my implementation, I added a halt_head that is trained alongside the token prediction to predict if the current answer is correct.
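A minimal sketch of that head, assuming the answer state y has the model width and that the correctness target is a 0/1 label per example (names are mine, not the paper's):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
halt_head = nn.Linear(dim, 1)  # maps the answer state y to a halting logit

def act_loss(y, target_correct):
    """Train the halt head to predict whether the current answer is correct.
    target_correct: 1.0 where this step's prediction matched the label, else 0.0."""
    halt_logit = halt_head(y).squeeze(-1)
    return F.binary_cross_entropy_with_logits(halt_logit, target_correct)

def should_halt(y, threshold=0.5):
    """At inference: stop recursing once the halting probability is high enough."""
    return torch.sigmoid(halt_head(y)).item() > threshold
```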
The Results: TinyStories
I trained this model on the TinyStories dataset—a collection of simple stories generated by GPT-4, designed to test if small models can learn coherent English.
Model Specs:
- Parameters: ~28 Million
- Layers: 2
- Recursion Depth: 3 cycles (T) of 4 latent steps (n)
- Effective Depth: ~30 layers (2 physical layers × (4 latent updates + 1 answer update) × 3 cycles)
After just 3 epochs, the validation loss dropped to 1.69. The model began generating coherent, albeit simple, narratives.
Here’s an example prompt: “Once upon a time”
TRM Output:
“Once upon a time, there was a big lion. He was very lazy, but he always went for a walk. One day, he saw a little girl walking by… The lion was happy to help her.”
Considering the model has only 2 physical transformer layers, this coherence is remarkable. It is effectively simulating a much deeper network by “chewing” on the latent state z before committing to a word.
Why This Matters
The TRM architecture offers a glimpse into a different future for AI efficiency:
- Memory Efficiency: You only store weights for 2 layers. This fits easily on consumer hardware or edge devices.
- Inference-Time Compute: You can trade speed for accuracy dynamically. For a hard token, let the model recurse 16 times; for an easy word like “the,” let it exit early.
- Data Efficiency: The recursive constraints force the model to learn robust features from less data, avoiding the memorization trap of large LLMs.
While we aren’t replacing GPT-4 with TRMs yet, this approach proves that how a model thinks is just as important as how big it is.
The code for this project is available on GitHub.
References
- Jolicoeur-Martineau, A. (2025). Less is More: Recursive Reasoning with Tiny Networks. arXiv:2510.04871.
- Eldan, R., & Li, Y. (2023). TinyStories: How Small Can Language Models Be and Still Speak Coherent English? arXiv:2305.07759.
Next Step
It would be interesting to test this approach with multimodal inputs, e.g. for image captioning. Stay tuned.
