EntropyMath Leaderboard

A high-entropy mathematical reasoning benchmark for LLMs

Derivatives from seed problems

[Figure: Model Accuracy vs Pass@3. Percentage axis; series: Accuracy and Pass@3. Models shown: Deepseek-V3.2, Grok-4.1-fast, GPT-5.1 (high), Claude-Opus-4.5, Gemini-3-Pro-Preview, K-EXAONE-236B-A23B, Solar-Open-100B.]

[Figure: Avg Token Usage (Per Problem). Axis: Avg Tokens / Problem, with ticks at 5.9K, 11.7K, 17.6K, and 23.4K. Models in chart order: Solar-Open-100B, Grok-4.1-fast, Gemini-3-Pro-Preview, K-EXAONE-236B-A23B, Deepseek-V3.2, Claude-Opus-4.5, GPT-5.1 (high).]

EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. The EntropyMath_Open_10 benchmark contains derivative problems generated from the seed set. By presenting variations of known problem distributions, it tests a model's robustness and generalization, so that strong performance cannot be attributed to memorization alone.

Each problem is attempted three times. Results are reported as accuracy and Pass@3 to account for generation variance, alongside detailed execution traces for transparency.

Performance Legend

Mastery (100%): 3/3
Strong (66%): 2/3
Weak (33%): 1/3
Fail (0%): 0/3

Model	Acc (%)	Pass@3 (%)	0	1	2	3	4	5	6	7	8	9
API / Others
Deepseek-V3.2	100.0	100.0	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3	3/3
Grok-4.1-fast	93.3	100.0	3/3	3/3	2/3	3/3	3/3	3/3	3/3	3/3	2/3	3/3
GPT-5.1 (high)	93.3	100.0	3/3	3/3	3/3	3/3	3/3	2/3	3/3	3/3	2/3	3/3
Claude-Opus-4.5	90.0	100.0	3/3	3/3	3/3	3/3	3/3	3/3	2/3	3/3	1/3	3/3
Gemini-3-Pro-Preview	90.0	100.0	3/3	2/3	2/3	3/3	3/3	3/3	2/3	3/3	3/3	3/3
K-LLM Project Round 2
K-EXAONE-236B-A23B	86.7	100.0	2/3	3/3	2/3	3/3	3/3	2/3	2/3	3/3	3/3	3/3
Solar-Open-100B	66.7	80.0	2/3	2/3	0/3	3/3	3/3	3/3	2/3	3/3	0/3	2/3
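
The Acc and Pass@3 columns are consistent with a simple reading of the per-problem cells: Acc as the share of all attempts that were correct, and Pass@3 as the share of problems solved at least once in three attempts. The sketch below checks that reading against two rows copied from the table. It is an inferred interpretation, not the benchmark's published scoring code, and the names `accuracy`, `pass_at_3`, `ROWS`, and `LEGEND` are hypothetical.

```python
from typing import Dict, List

ATTEMPTS = 3  # each table cell is k/3, i.e. three attempts per problem

# Correct-attempt counts for problems 0-9, copied from two rows of the table.
ROWS: Dict[str, List[int]] = {
    "Grok-4.1-fast": [3, 3, 2, 3, 3, 3, 3, 3, 2, 3],
    "Solar-Open-100B": [2, 2, 0, 3, 3, 3, 2, 3, 0, 2],
}

# Performance legend labels keyed by correct attempts out of three.
LEGEND: Dict[int, str] = {3: "Mastery", 2: "Strong", 1: "Weak", 0: "Fail"}


def accuracy(counts: List[int], attempts: int = ATTEMPTS) -> float:
    """Percentage of all attempts that were correct (assumed meaning of Acc)."""
    return 100.0 * sum(counts) / (attempts * len(counts))


def pass_at_3(counts: List[int]) -> float:
    """Percentage of problems with at least one correct attempt (assumed Pass@3)."""
    return 100.0 * sum(1 for c in counts if c > 0) / len(counts)


for model, counts in ROWS.items():
    labels = [LEGEND[c] for c in counts]
    print(f"{model}: Acc={accuracy(counts):.1f} Pass@3={pass_at_3(counts):.1f}")
    print("  per-problem:", ", ".join(labels))
```

Under these assumptions the two rows reproduce the reported values (93.3 and 100.0 for Grok-4.1-fast, 66.7 and 80.0 for Solar-Open-100B), which is why accuracy can sit below 100 while Pass@3 reaches it: a 2/3 cell lowers Acc but still counts as a pass.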