Derivatives from seed problems
[Figure: Model Accuracy vs Pass@3. Accuracy and Pass@3 (0% to 100%) for Deepseek-V3.2, Grok-4.1-fast, GPT-5.1 (high), Claude-Opus-4.5, and Gemini-3-Pro-Preview.]
[Figure: Avg Token Usage (Per Problem). Average tokens per problem (axis from 0 to 21.1K) for Grok-4.1-fast, Gemini-3-Pro-Preview, Deepseek-V3.2, Claude-Opus-4.5, and GPT-5.1 (high).]
EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. The EntropyMath_Open_10 benchmark contains derivative problems generated from the seed set. It tests a model's robustness and generalization by presenting variations of known problem distributions, so that strong performance cannot be explained by memorization alone.
Each model attempts every problem three times. Results are reported as Pass@3 to account for generation variance, alongside detailed execution traces for transparency.
Performance Legend
Mastery (100%): 3/3
Strong (66%): 2/3
Weak (33%): 1/3
Fail (0%): 0/3
| Model | Acc (%) | Pass@3 (%) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| API / Others | | | | | | | | | | | | |
| Grok-4.1-fast | 93.3 | 100.0 | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 |
| GPT-5.1 (high) | 93.3 | 100.0 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 2/3 | 3/3 |
| Claude-Opus-4.5 | 90.0 | 100.0 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 | 1/3 | 3/3 |
| Gemini-3-Pro-Preview | 90.0 | 100.0 | 3/3 | 2/3 | 2/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 |
| Deepseek-V3.2 | 100.0 | 100.0 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 |
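The two summary columns can be reproduced from the per-problem cells. The sketch below is a minimal illustration, not the benchmark's own scoring code. It assumes that Acc is the share of correct attempts across all 30 attempts and that Pass@3 is the share of problems solved in at least one of the three attempts. Both assumptions are consistent with the table but are not stated by the source. The names `parse_cell`, `summarize`, and `LEGEND` are hypothetical.

```python
# Per-problem results copied from the Claude-Opus-4.5 row (problems 0 to 9).
cells = ["3/3", "3/3", "3/3", "3/3", "3/3", "3/3", "2/3", "3/3", "1/3", "3/3"]

# Labels taken from the Performance Legend, keyed by correct attempts out of 3.
LEGEND = {3: "Mastery", 2: "Strong", 1: "Weak", 0: "Fail"}


def parse_cell(cell):
    """Split a cell such as '2/3' into (correct, attempts)."""
    correct, attempts = cell.split("/")
    return int(correct), int(attempts)


def summarize(cells):
    """Return (accuracy %, pass@3 %, legend labels) for one model's row.

    Assumption: accuracy averages over every attempt, and Pass@3 counts a
    problem as solved if at least one attempt is correct.
    """
    parsed = [parse_cell(c) for c in cells]
    total_correct = sum(correct for correct, _ in parsed)
    total_attempts = sum(attempts for _, attempts in parsed)
    solved = sum(1 for correct, _ in parsed if correct > 0)
    accuracy = 100 * total_correct / total_attempts
    pass_at_3 = 100 * solved / len(parsed)
    labels = [LEGEND[correct] for correct, _ in parsed]
    return round(accuracy, 1), round(pass_at_3, 1), labels


acc, pass_at_3, labels = summarize(cells)
print(acc, pass_at_3)  # 90.0 100.0
print(labels[8])       # Weak
```

Under these assumptions a model can reach 100.0 on Pass@3 while scoring below 100 on Acc, as every row except Deepseek-V3.2 does: Pass@3 forgives individual failed attempts, while Acc counts them.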


