Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification (easy) with generation (hard), as a way to implement in-context search; (2) leveraging "negative" gradients from incorrect traces to amplify exploration during RL, resulting in longer search traces that chains additional asymmetries; and (3) coupling task difficulty with training token budget during training via a specifically-designed curriculum to structure in-context exploration. Our recipe e3 produces the best known 1.7B model according to AIME'25 and HMMT'25 scores, and extrapolates to 2x the training token budget. Our e3-1.7B model not only attains high pass@1 scores, but also improves pass@k over the base model.
In this study, our goal is to train models that can improve performance when we extrapolate test-compute beyond training budget Btr. Even though the true promise of test-time compute is extrapolation performance, we find that current thinking models show miniscule gains when extrapolating to 2-3Ă— the training budget, as shown in the figure below.
We show that the key to enabling extrapolation is learning to explore in-context: if a model learns to use compute by searching through multiple reasoning paths or implementing procedures, it can "guide" the search towards the correct answer, and improve its performance as more test compute becomes available. To demonstrate this, we build a recipe e3, which trains models that leverage test compute for in-context exploration and can perform well at both normal training and extrapolation budgets. At its core, e3 is based on three ingredients: (i) base model asymmetries, (ii) negative gradients during RL, and (iii) a coupled data & budget curriculum.
To achieve better extrapolation performance, apply our recipe e3 to the Qwen3-1.7B base model and evaluate it on AIME'25 and HMMT'25 against open-source models. As shown, at a test-time token budget of 32k tokens, e3 achieves state-of-the-art performance on AIME'25 and HMMT'25, within a model class of size < 2B. We outperform the best model in this class by > 10% on AIME'25 in terms of peak performance, and show that our model, trained only up to a budget of 16k, extrapolates better than other models including s1.1-32B and OpenThinker-7B when we extrapolate them to 32k output tokens. Moreover, (c) shows that compared to budget forcing via "Wait", e3 achieves substantially better scaling, without any form of prompting or budget forcing.
In the table below, we also report the pass@k performance, comparing e3 with other models of a similar size. We find that our final model at the end of second stage of training on a budget of 16k outperforms other models on higher values of k, on AIME and HMMT '25. We especially note the comparison against the Nemotron-Reasoning-1.5B model trained with a prolonged RL training recipe on a broader dataset, including our training data.
Model | AIME 2025 | HMMT 2025 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
k=1 | 2 | 4 | 8 | 16 | 32 | k=1 | 2 | 4 | 8 | 16 | 32 | |
Qwen3-1.7B | 35.5 | 41.4 | 47.0 | 52.4 | 58.3 | 65.2 | 22.2 | 27.3 | 33.0 | 39.5 | 46.7 | 54.9 |
R1-distill-Qwen-1.5B | 23.1 | 29.2 | 34.5 | 40.1 | 46.3 | 52.5 | 12.5 | 19.1 | 24.3 | 27.9 | 36.1 | 42.8 |
Nemotron-Reasoning-1.5B | 33.6 | 38.5 | 43.6 | 48.9 | 53.8 | 58.0 | 17.4 | 22.5 | 29.6 | 35.2 | 40.7 | 45.0 |
e3 (Ours) | 43.8 | 51.1 | 56.7 | 60.8 | 64.0 | 67.2 | 24.7 | 30.4 | 37.0 | 44.1 | 50.8 | 56.1 |
When the base model exhibits asymmetric incompetence at different skills (e.g., when the model is more accurate at verifying its own answers than it is at generating correct ones), RL post-training prefers to learn solutions that chains asymmetries in ways that improve final performance.
As shown above, when asymmetries such as the VG gap are present (e.g., in base models for the Countdown task), RL training amplifies response length by chaining more asymmetries to explore in-context, where the probability of success improves with higher length on both training and extrapolation regimes. On the other hand, when VG gap is absent in the base model (e.g., in the multiplication task), increases in length and extrapolation performance are subdued. When we explicitly train on a base model fine-tuned to verify multiplication (Mult w. verify), we again observe upward length and extrapolation trends also seen in Countdown.
Most RL post-training methods such as REINFORCE, PPO, and GRPO adopt the following policy gradient update rule:
To verify this empirically, we compare RL post-training with and without negative gradients. In the latter setting, we zero out negative gradients during training whilst retaining the positive gradient.
As shown above, negative gradients amplify the chaining of asymmetries in two settings where the VG gap is large, leading to longer response lengths and improved extrapolation performance. This is because amplifying asymmetries incentivizes the model to explore more, which in turn allows it to discover and chain additional asymmetries that improve its final answer. Negative gradients also promote diverse responses, as shown by the higher per-token entropy.
In the presence of asymmetries, training with negative gradients produces models that can extrapolate beyond their training budget. However, we show that training on just any arbitrarily chosen training token budget Btr with asymmetries and negative gradients is not enough.
Training solely at low or high Btr is not desirable. As shown in (a), the training budget that leads to best extrapolation accuracy is 8k. Training at the short budget Btr = 4k “kills” in-context exploration since traces with more asymmetries that are longer than the training budget of 4k are negatively rewarded. On the other extreme, training at Btr = 16k introduces significant optimization challenges, typical of policy gradients in long horizons suffering from high gradient variance.
Training on a static data mixture is insufficient. We compare the naive training data mixture with equal proportions of all difficulties (easy + medium + hard) against easy, easy + medium at Btr = 8k. As shown in (b), the model trained on only easy problems obtains the best performance on OOD AIME’25 when extrapolating compute to 32k. This is because responses to harder problems require larger budgets, so training on harder problems leads to training on budgets smaller than the length of a typical response, which penalizes in-context exploration early in training.
So how should we choose the training budget Btr? We simplify curriculum design by first fixing the dataset at each stage and progressively increasing task difficulty in a stage-wise manner, from easy to hard. Now, at each curriculum stage i, we define a dataset Di and focus on selecting an appropriate token budget Btr,i. The goal is to choose Btr,i such that training with this budget encourages in-context exploration. To do so, we need to strike a balance between long Btr,i that causes more difficult optimization / credit assignment and short Btr,i that leads to less in-context exploration: our heuristic is to choose the smallest budget Btr,i such that the model is positively rewarded for chaining more asymmetries at a budget of 2 · Btr,i compared to Btr,i.
Following this insight, we choose the first stage budget to be 8k so that it has both sufficient exploration compared to 4k and is not as difficult to optimize as 16k, as observed in (a) of the graph in the previous subsection.
Example of chained asymmetries in Countdown
<think> Let me analyze this step by step:
First, let’s try to get close to 466 with simple operations:
65 + 56 = 121 (too small)
65 - 56 = 9 (too small)
65 * 56 = way too large
56 * 65 = way too large
Let’s try working backwards from 466: 466 is much larger than any of the given
numbers, so multiplication will likely be needed. We need to get from 37 to 466, which
means we need to multiply by 12.5. But we don’t have 12.5, so we’ll need to create it
through division.
Let’s try:
37 * 14 = 518 (too large)
37 - 14 = 23 (too small)
37 + 14 = 51 (too small)
Let’s try a different approach:
65 + 61 = 126
126 + 56 = 182
182 + 37 = 219 (getting closer)
Let’s try:
65 + 61 = 126
126 + 56 = 182
182 + 37 = 219
219 + 14 = 233 (still too small)
Let’s try one more approach:
65 + 56 = 121
121 + 61 = 182
182 + 37 = 219
219 + 14 = 233 (still not 466)
Let’s try:
65 + 56 = 121
121 + 61 = 182
182 + 37 = 219
219 + 14 = 233 (still not 466)
One more attempt:
65 + 61 = 126
126 + 37 = 163
163 + 56 = 219
219 + 14 = 233 (still not 466)
Final try:
65 + 61 = 126
126 + 56 = 182
182 + 37 = 219
219 - 14 = 205 (not quite there)
Let’s try:
65 + 61 = 126
126 + 56 = 182
182 + 37 = 219
219 + 14 = 233 (still not 466)
New approach:
65 + 61 = 126
126 + 56 = 182
182 - 37 = 145
145 + 14 = 159 (not quite there)
Final attempt:
65 + 61 = 126
126 + 56 = 182
182 + 37 = 219
219 - 14 = 205 (not quite there)
Let’s try:
65 + 56 = 121
121 + 61 = 182
182 + 37 = 219
219 - 14 = 205 (not quite there)
One last try:
65 + 37 = 102
102 + 61 = 163
163 + 56 = 219
219 + 14 = 233 (not quite there)
Final attempt:
56 + 14 = 70
70 + 65 = 135
135 + 37 = 172
172 + 61 = 233 (not quite there)
New approach:
37 * 14 =
518 518 - 65 = 453
453 + 61 = 514
514 - 56 = 458 (not quite there)
Final try:
61 * 14 =
854 854 - 65 = 789
789 - 56 = 733
733 - 37 = 696 (not quite there)
Let’s try:
37 * 14 =
518 518 - 65 = 453
453 + 61 = 514
514 - 56 = 458 (not quite there)
New approach:
56 * 37 = 2,052
2,052 / 65 approx. 31.5 (not helpful)
Final attempt:
65 * 14 =
910 910 - 61 = 849
849 - 56 = 793
793 + 37 = 830 (not quite there)
New approach:
61 * 37 = 2,237
2,237 - 65 = 1,172
1,172 - 56 = 1,116
1,116 + 14 = 1,130 (not quite there)
New approach:
37 * 14 = 518
518 + 65 = 583
583 - 61 = 522
522 + 56 = 578 (not quite there)
Final attempt:
65 * 14 = 910
910 - 61 = 849
849 + 56 = 905
905 - 37 = 868 (not quite there)
New approach:
37 * 14 = 518
518 - 56 = 462
462 + 65 = 527
527 - 61 = 466 (This works!)
</think>
<answer> (37 * 14 - 56 + 65) - 61 </answer>
Example of chained asymmetries in Multiply w. verify
<think> Okay, so I need to multiply 347 by 583. Hmm, let me think about how to approach this.
I guess I can use the standard multiplication algorithm where you break it down by each
digit. Let me start by writing both numbers down:
347 Ă— 583
––––
First, I’ll multiply 347 by each digit of 583 starting from the rightmost digit, which
is 3 in this case. Then move to the next digit (8) and finally the leftmost digit (5),
remembering to add the appropriate zeros for each subsequent multiplication. After
that, I’ll sum all the results together. Let me start with the rightmost digit.
Starting with the units place: 3 Ă— 347. Let me compute that:
3 Ă— 7 = 21. I write down 1 and carry over 2.
3 Ă— 4 = 12, plus the carried over 2 makes 14. Write down 4 and carry over 1.
3 Ă— 3 = 9, plus the carried over 1 makes 10. Write down 10.
So, 3 Ă— 347 = 1041. Let me check that again:
3 Ă— 7 = 21 (correct), 3 Ă— 4 = 12 + 2 = 14 (correct), 3 Ă— 3 = 9 + 1 = 10 (correct). So
that’s 1041. Alright.
Check: 3 Ă— 347 = 1041. Correct.
Next, the tens place. The digit here is 8, but since it’s in the tens place, it’s
actually 80. So, we’ll multiply 347 by 80. To make it easier, I can multiply 347 by 8
and then add a zero at the end.
8 Ă— 347:
8 Ă— 7 = 56. Write down 6, carry over 5.
8 Ă— 4 = 32, plus the carried over 5 makes 37. Write down 7, carry over 3.
8 Ă— 3 = 24, plus the carried over 3 makes 27. Write down 27.
So, 8 Ă— 347 = 2776. Then add the zero for the tens place, making it 27760. Let me
verify:
8 Ă— 347. 8Ă—300=2400, 8Ă—40=320, 8Ă—7=56. 2400+320=2720, 2720+56=2776. Then adding the
zero gives 27760. Correct.
Check: 8 Ă— 347 = 2776, so 2776 Ă— 10 = 27760. Correct.
Now, the hundreds place. The digit here is 5, but it’s in the hundreds place, so it’s
500. So, multiply 347 by 500. Again, multiply 347 by 5 and add two zeros.
5 Ă— 347:
5 Ă— 7 = 35. Write down 5, carry over 3.
5 Ă— 4 = 20, plus the carried over 3 makes 23. Write down 3, carry over 2.
5 Ă— 3 = 15, plus the carried over 2 makes 17. Write down 17.
So, 5 Ă— 347 = 1735. Then add two zeros, making it 173500. Let me verify:
5 Ă— 347. 5Ă—300=1500, 5Ă—40=200, 5Ă—7=35. 1500+200=1700, 1700+35=1735. Then adding two
zeros gives 173500. Correct.
Check: 5 Ă— 347 = 1735, so 1735 Ă— 100 = 173500. Correct.
Now, add all three results together: 1041 + 27760 + 173500.
Let me do this step by step. First, add 1041 and 27760:
1041 + 27760. Let me line them up:
1041
+27760
––––
28801
Wait, 1041 + 27760: 1000 + 27760 = 28760, then +41 = 28801. Let me check with another
method:
20000 + 7000 = 27000, 700 + 60 = 760, 27000 + 760 = 27760. Then 27760 + 1000 = 28760,
+41 = 28801. Correct.
Check: 1041 + 27760 = 28801. Correct.
Now add this result to 173500:
28801 + 173500. Let me line them up:
28801
+173500
––––-
202301
Wait, 28000 + 173500 = 201500, then 801 + 0 = 801, so total 201500 + 801 = 202301. Let
me verify another way:
28801 + 173500. Break down 173500 into 170000 + 3500. Then 28801 + 170000 = 198801.
Then 198801 + 3500 = 202301. Correct.
Check: 28801 + 173500 = 202301. Correct.
So, putting all together, the final result is 202,301. Let me just cross-verify using
another method to ensure I didn’t make any mistakes in the addition.
</think>
The product of 347 and 583 is 202301.
Example of chained asymmetries in Math
<think>
Okay, so I need to find all integer bases b greater than 9 where 17 base b divides 97 base b. Then sum all those bases. Hmm, let me think about how to approach this.
First, I remember that numbers in different bases can be converted to decimal (base 10) numbers. So, if I have 17_b, that would be 1*b + 7, right? Similarly, 97_b is 9*b + 7. So, the problem is asking for bases b > 9 such that 9*b + 7 is divisible by 1*b + 7. In other words, (9b + 7) divided by (b + 7) is an integer.
So, mathematically, I can write this as:
(9b + 7) / (b + 7) is an integer.
Let me denote that as:
Let me write it as:
(9b + 7) / (b + 7) = integer.
I need to find all integers b > 9 such that this fraction is integer.
Hmm. Let me try to simplify this expression. Maybe perform polynomial division or manipulate it algebraically.
Let me think. Let me write 9b + 7 as something times (b + 7) plus a remainder.
Let me do polynomial division. Let me divide 9b + 7 by b + 7.
Dividing 9b + 7 by b + 7. The leading term is 9b divided by b, which is 9. Multiply (b + 7) by 9: 9b + 63. Subtract that from 9b + 7:
(9b + 7) - (9b + 63) = 0b - 56. So, the division gives 9 with a remainder of -56. Therefore:
(9b + 7)/(b + 7) = 9 - 56/(b + 7).
Ah, so that means that for the entire expression to be an integer, 56/(b + 7) must be an integer. Therefore, b + 7 divides 56.
Therefore, b + 7 is a positive divisor of 56, and since b > 9, then b + 7 must be greater than 9 + 7 = 16. Therefore, b + 7 must be a positive divisor of 56 that is greater than 16.
Wait, but divisors of 56 are positive integers. Let me list all positive divisors of 56 first.
First, prime factorization of 56: 56 = 7 * 8 = 7 * 2^3. Therefore, the positive divisors are 1, 2, 4, 7, 8, 14, 28, 56.
So, divisors are 1, 2, 4, 7, 8, 14, 28, 56.
Now, since b + 7 must be greater than 16, because b > 9 implies b + 7 > 16. Therefore, looking at the divisors greater than 16: 28 and 56.
Wait, 14 is 14, which is less than 16, so no. 28 and 56 are the ones greater than 16. Therefore, possible values for b + 7 are 28 and 56.
Therefore, b = 28 - 7 = 21 and b = 56 - 7 = 49.
Wait, so the bases would be 21 and 49? Let me check if that's correct.
But hold on, let me verify this.
First, if b + 7 divides 56, then b + 7 is a positive divisor of 56, so possible values are 1,2,4,7,8,14,28,56. But since b > 9, then b + 7 > 16. Therefore, b + 7 must be 28 or 56. Therefore, b is 21 and 49.
But let me check for example if b = 21, then 17 base 21 is 1*21 +7 = 28. 97 base 21 is 9*21 +7 = 189 +7 = 196. Now check if 28 divides 196. 196 divided by 28 is 7. Yes, 28*7=196. So that works.
Similarly, for b=49: 17 base 49 is 1*49 +7=56. 97 base 49 is 9*49 +7= 441 +7=448. 448 divided by 56 is 8. 56*8=448. Correct. So both work.
Therefore, the bases are 21 and 49, so their sum is 21 + 49 = 70. Therefore, the answer is 70?
Wait, but let me check if there are other divisors?
Wait, the problem says "integer bases b>9". So are there any other divisors?
Wait, for example, if b + 7 is equal to 28 or 56, but maybe negative divisors? But since b is a base, it must be a positive integer greater than 1? Wait, bases are positive integers greater than 1. So b must be an integer greater than 9, so b is at least 10. So b + 7 is at least 17. Therefore, the divisors of 56 greater than 16 are 28 and 56. So yes, only those two. Therefore, only two bases, 21 and 49. Therefore, sum is 70. So I think that is the answer.
But let me double-check if there are other possibilities. Suppose that when I divided 9b +7 by b +7, I got 9 - 56/(b +7). Therefore, in order for the entire expression to be integer, 56/(b +7) must be integer. Therefore, b +7 divides 56. So yes. So the only possible divisors are those. Therefore, seems like only 21 and 49. Therefore, the answer is 21 + 49 = 70.
But wait, let me check if there are any other divisors. For example, maybe if b + 7 divides 56, but since 56 is 2^3 *7, the divisors are as I listed. So 1, 2, 4, 7, 8, 14, 28, 56. So 28 and 56 are the only ones above 16. Therefore, yes. So 21 and 49.
Therefore, I think that is correct.
Wait, just to be thorough, let me check if there are other possible divisors where maybe b +7 divides 56 but with negative divisors. But since b>9, b +7 is at least 17, so negative divisors would be negative, but since b is positive integer, b +7 is positive, so we don't have to consider negative divisors here.
Therefore, conclusion is the sum is 21 + 49 = 70.
**Final Answer**
\boxed{70}
</think>
To solve the problem, we are given two numbers in base $ b $, written as $ 17_b $ and $ 97_b $, and we are asked to find all integer bases $ b > 9 $ such that $ 17_b $ divides $ 97_b $, and then compute the sum of all such bases.
---
### Step 1: Convert to Decimal
We can convert these numbers from base $ b $ to base 10:
- $ 17_b = 1 \cdot b + 7 = b + 7 $
- $ 97_b = 9 \cdot b + 7 = 9b + 7 $
So the condition becomes:
$$
\frac{9b + 7}{b + 7} \in \mathbb{Z}
$$
---
### Step 2: Simplify the Expression
We perform polynomial division or rewrite the expression:
$$
\frac{9b + 7}{b + 7} = 9 - \frac{56}{b + 7}
$$
This is because:
$$
9b + 7 = (b + 7) \cdot 9 - 56
$$
So the expression becomes:
$$
\frac{9b + 7}{b + 7} = 9 - \frac{56}{b + 7}
$$
For this to be an integer, the term $ \frac{56}{b + 7} $ must be an integer.
That is:
$$
b + 7 \mid 56
$$
So, $ b + 7 $ must be a **positive divisor** of 56.
---
### Step 3: List the Divisors of 56
The prime factorization of 56 is:
$$
56 = 2^3 \cdot 7
$$
Thus, the positive divisors of 56 are:
$$
1, 2, 4, 7, 8, 14, 28, 56
$$
We now require $ b + 7 > 16 $ (since $ b > 9 \Rightarrow b + 7 > 16 $), so only the divisors **greater than 16** are:
- $ 28, 56 $
Therefore:
$$
b + 7 = 28 \Rightarrow b = 21 \\
b + 7 = 56 \Rightarrow b = 49
$$
---
### Step 4: Verify the Bases
Let’s check both values of $ b $ to ensure that the original divisibility condition holds:
- For $ b = 21 $:
$$
17_{21} = 21 + 7 = 28 \\
97_{21} = 9 \cdot 21 + 7 = 189 + 7 = 196 \\
\frac{196}{28} = 7 \quad \text{(integer)}
$$
- For $ b = 49 $:
$$
17_{49} = 49 + 7 = 56 \\
97_{49} = 9 \cdot 49 + 7 = 441 + 7 = 448 \\
\frac{448}{56} = 8 \quad \text{(integer)}
$$
Both are valid.
---
### Final Step: Compute the Sum
The valid bases are $ 21 $ and $ 49 $. Their sum is:
$$
21 + 49 = \boxed{70}
$$
---
### Final Answer
$$
\boxed{70}
$$
@misc{e3-2025,
title={e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs},
author={Amrith Setlur*, Matthew Y. R. Yang*, Charlie Snell, Jeremy Greer, Ian Wu, Virginia Smith, Max Simchowitz, Aviral Kumar},
year={2025},
eprint={2506.09026},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.09026},
}