
EntropyMath Leaderboard

A high-entropy mathematical reasoning benchmark for LLMs

Seed Problems (EntropyMath Standard)

[Chart: Model Accuracy vs Pass@3. Y-axis: 0% to 100%. Series: Accuracy, Pass@3. One entry per model.]

[Chart: Avg Token Usage (Per Problem). Y-axis: Avg Tokens / Problem, 0 to 48,050.09. Models in chart order: Grok-4.1-fast, GPT-oss-20B (high), Solar-Pro-2 (31B)(high), Gemini-3-Pro-Preview, Deepseek-V3.2, Deepseek-R1-distill-Qwen-32B (high), Qwen3-30B-A3B-2507, EXAONE-4.0.1-32B (high), Gemma-3-27B, GPT-5.1 (high), HCX-007(high), Llama-VARCO-8B-Instruct, Claude-Opus-4.5, A.X-4.0 (72B).]

EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. This benchmark set consists of 10 curated seed problems from the MATH dataset and serves as the baseline for evaluating mathematical reasoning. These seed problems are also the source from which harder derivative tasks are generated.

Results are reported using Pass@3 metrics to account for generation variance, alongside detailed execution traces for transparency.
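The page does not spell out how the two headline numbers are computed. The sketch below assumes that accuracy is total correct attempts divided by total attempts, and that Pass@3 is the share of problems with at least one correct attempt. Both definitions are inferred from the reported per-problem cells, not confirmed by the benchmark authors, and the helper names (`parse_cell`, `accuracy`, `pass_at_3`) are hypothetical.

```python
def parse_cell(cell: str) -> tuple[int, int]:
    """Split a 'correct/attempts' cell such as '2/3' into two integers."""
    correct, attempts = cell.split("/")
    return int(correct), int(attempts)


def accuracy(cells: list[str]) -> float:
    """Assumed definition: total correct attempts over total attempts, in percent."""
    pairs = [parse_cell(c) for c in cells]
    total_correct = sum(correct for correct, _ in pairs)
    total_attempts = sum(attempts for _, attempts in pairs)
    return 100.0 * total_correct / total_attempts


def pass_at_3(cells: list[str]) -> float:
    """Assumed definition: share of problems with at least one correct attempt, in percent."""
    pairs = [parse_cell(c) for c in cells]
    solved = sum(1 for correct, _ in pairs if correct > 0)
    return 100.0 * solved / len(pairs)


# Per-problem cells for Grok-4.1-fast, copied from the leaderboard row.
grok = ["3/3", "3/3", "3/3", "3/3", "3/3", "2/3", "1/3", "3/3", "3/3", "3/3"]
print(round(accuracy(grok), 1), round(pass_at_3(grok), 1))  # 90.0 100.0
```

Under these assumptions a cell with fewer than three attempts, such as 1/2, simply contributes fewer attempts to the accuracy denominator, which is consistent with the 86.2 reported for Claude-Opus-4.5.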

Performance Legend

Mastery (100%): 3/3
Strong (66%): 2/3
Weak (33%): 1/3
Fail (0%): 0/3
Model                                Acc (%)  Pass@3 (%)  0    1    2    3    4    5    6    7    8    9

API / Others
Grok-4.1-fast                        90.0     100.0       3/3  3/3  3/3  3/3  3/3  2/3  1/3  3/3  3/3  3/3
GPT-5.1 (high)                       86.7     90.0        2/3  3/3  3/3  3/3  3/3  3/3  0/3  3/3  3/3  3/3
Claude-Opus-4.5                      86.2     90.0        3/3  3/3  3/3  3/3  3/3  1/2  0/3  3/3  3/3  3/3
Gemini-3-Pro-Preview                 83.3     90.0        3/3  3/3  3/3  3/3  3/3  1/3  0/3  3/3  3/3  3/3
Deepseek-V3.2                        82.8     90.0        2/3  3/3  3/3  3/3  3/3  1/2  0/3  3/3  3/3  3/3

Sovereign 5 - KR
Solar-Pro-2 (31B)(high)              53.3     70.0        1/3  3/3  2/3  3/3  3/3  0/3  0/3  0/3  1/3  3/3
EXAONE-4.0.1-32B (high)              46.7     60.0        2/3  3/3  3/3  3/3  1/3  0/3  0/3  0/3  0/3  2/3
HCX-007(high)                        26.7     40.0        0/3  3/3  1/3  2/3  2/3  0/3  0/3  0/3  0/3  0/3
A.X-4.0 (72B)                        23.3     30.0        0/3  3/3  0/3  2/3  0/3  0/3  0/3  0/3  0/3  2/3
Llama-VARCO-8B-Instruct              7.1      20.0        0/3  1/3  0/3  0/3  0/3  0/3  0/1  0/3  0/3  1/3

Local - US
GPT-oss-20B (high)                   82.1     80.0        2/3  3/3  3/3  3/3  3/3  0/1  0/3  3/3  3/3  3/3
Gemma-3-27B                          26.7     40.0        0/3  2/3  0/3  2/3  3/3  0/3  0/3  0/3  0/3  1/3

Local - CN
Deepseek-R1-distill-Qwen-32B (high)  56.7     70.0        1/3  3/3  3/3  3/3  3/3  0/3  0/3  0/3  1/3  3/3
Qwen3-30B-A3B-2507                   57.7     50.0        0/3  3/3  0/3  3/3  3/3  0/1  0/3  0/1  3/3  3/3