
EntropyMath Leaderboard

A high-entropy mathematical reasoning benchmark for LLMs

Seed Problems (EntropyMath Standard v2)

Model Accuracy vs Pass@3

[Chart: Accuracy and Pass@3 per model, y-axis from 0% to 100%. Models shown: GPT-5.2 (high), Gemini-3-Pro-Preview, Solar-Pro 2, Kanana-2-30B-Thinking, K-EXAONE-236B-A23B, Kanana-2-30B-Thinking-2601, GLM-4.5-Air, Solar-Open-100B, model_d_r1, naver-hyperclovax/HCX-007, EXAONE-4.0-32B, axk1.]

Avg Token Usage (Per Problem)

[Chart: average tokens per problem for each model, y-axis from 0 to 25.2K. Models shown: K-EXAONE-236B-A23B, Solar-Open-100B, Gemini-3-Pro-Preview, Kanana-2-30B-Thinking-2601, Kanana-2-30B-Thinking, Solar-Pro 2, GPT-5.2 (high), GLM-4.5-Air, naver-hyperclovax/HCX-007, EXAONE-4.0-32B, axk1, model_d_r1.]

EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs.

Results are reported using Pass@3 metrics to account for generation variance, alongside detailed execution traces for transparency.
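The page does not spell out how the two headline numbers are derived, but the leaderboard rows are consistent with a simple reading: Accuracy is the share of correct attempts over all attempts, and Pass@3 is the share of problems solved at least once in three attempts. The sketch below assumes that reading; the function name `summarize` and its parameters are hypothetical, and the sample counts are the GPT-5.2 (high) row from the leaderboard.

```python
from typing import Sequence


def summarize(correct_per_problem: Sequence[int], attempts: int = 3) -> tuple[float, float]:
    """Return (accuracy %, pass@k %) from per-problem correct counts.

    Assumed definitions (not stated on the page, but consistent with its rows):
      accuracy = correct attempts / total attempts
      pass@k   = problems with at least one correct attempt / total problems
    """
    n_problems = len(correct_per_problem)
    accuracy = 100.0 * sum(correct_per_problem) / (attempts * n_problems)
    pass_at_k = 100.0 * sum(1 for c in correct_per_problem if c > 0) / n_problems
    return round(accuracy, 1), round(pass_at_k, 1)


# Per-problem correct counts (out of 3) for GPT-5.2 (high), problems 0 to 9.
gpt_5_2_high = [3, 3, 3, 2, 3, 3, 3, 3, 0, 3]
print(summarize(gpt_5_2_high))  # (86.7, 90.0)
```

Under this reading, a model can have a high Pass@3 and a much lower Accuracy when it solves many problems only once out of three attempts.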

Performance Legend

Mastery (100%): 3/3
Strong (66%): 2/3
Weak (33%): 1/3
Fail (0%): 0/3
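As a small illustration of the legend, the following sketch maps a per-problem cell such as `2/3` to its label. The helper name `legend_label` and the lookup table are hypothetical; only the four labels and their counts come from the legend itself.

```python
# Legend labels keyed by the number of correct attempts out of 3.
LEGEND = {3: "Mastery", 2: "Strong", 1: "Weak", 0: "Fail"}


def legend_label(cell: str) -> str:
    """Map a leaderboard cell like '2/3' to its legend label."""
    correct_text, attempts_text = cell.split("/")
    correct, attempts = int(correct_text), int(attempts_text)
    if attempts != 3 or correct not in LEGEND:
        raise ValueError(f"expected a count out of 3, got {cell!r}")
    return LEGEND[correct]


print(legend_label("2/3"))  # Strong
```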
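| Group | Model | Acc (%) | Pass@3 (%) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| API / Others | GPT-5.2 (high) | 86.7 | 90.0 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 0/3 | 3/3 |
| API / Others | Gemini-3-Pro-Preview | 86.7 | 90.0 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 | 0/3 | 3/3 |
| API / Others | GLM-4.5-Air | 40.0 | 60.0 | 1/3 | 3/3 | 3/3 | 0/3 | 0/3 | 2/3 | 0/3 | 2/3 | 0/3 | 1/3 |
| K-LLM Project Round 2 | K-EXAONE-236B-A23B | 50.0 | 60.0 | 0/3 | 3/3 | 3/3 | 0/3 | 0/3 | 3/3 | 1/3 | 3/3 | 0/3 | 2/3 |
| K-LLM Project Round 2 | Solar-Open-100B | 36.7 | 50.0 | 0/3 | 2/3 | 3/3 | 0/3 | 0/3 | 2/3 | 0/3 | 3/3 | 0/3 | 1/3 |
| K-LLM Project Round 1 | Solar-Pro 2 | 60.0 | 70.0 | 1/3 | 3/3 | 3/3 | 2/3 | 0/3 | 3/3 | 0/3 | 3/3 | 0/3 | 3/3 |
| K-LLM Project Round 1 | EXAONE-4.0-32B | 26.7 | 40.0 | 0/3 | 2/3 | 3/3 | 0/3 | 0/3 | 2/3 | 0/3 | 1/3 | 0/3 | 0/3 |
| Local - KR | Kanana-2-30B-Thinking | 53.3 | 60.0 | 0/3 | 3/3 | 3/3 | 0/3 | 0/3 | 3/3 | 1/3 | 3/3 | 0/3 | 3/3 |
| Local - KR | Kanana-2-30B-Thinking-2601 | 50.0 | 60.0 | 0/3 | 3/3 | 3/3 | 1/3 | 0/3 | 3/3 | 0/3 | 2/3 | 0/3 | 3/3 |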