
EntropyMath Leaderboard

A high-entropy mathematical reasoning benchmark for LLMs

Seed Problems (EntropyMath Standard v1)

[Chart] Model Accuracy vs Pass@3: per-model bars for Accuracy and Pass@3, y-axis 0-100%.

[Chart] Avg Token Usage (Per Problem): per-model bars for Avg Tokens / Problem, y-axis 0-48.1K.

EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. The benchmark consists of 10 curated seed problems drawn from the MATH dataset, which serve as the foundational baseline for evaluating mathematical reasoning capabilities and as the source material for generating harder derivative tasks.

Results are reported with the Pass@3 metric to account for generation variance, alongside detailed execution traces for transparency.
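The page does not spell out how the headline numbers are aggregated, but the per-problem cells in the table below are consistent with a simple scheme: Acc as total correct attempts divided by total scored attempts, and Pass@3 as the share of problems solved at least once within 3 attempts. A minimal Python sketch under that assumption (the function names are illustrative, not part of the benchmark code):

```python
# Sketch of how the headline numbers can be reproduced from the per-problem
# cells in the table. Assumptions (not stated on the page):
#   - "Acc" is micro-averaged: total correct attempts / total scored attempts.
#   - "Pass@3" is the fraction of problems with at least one correct attempt.

def accuracy(results: list[tuple[int, int]]) -> float:
    """results: (correct, attempts) per problem, e.g. (2, 3) for a 2/3 cell."""
    correct = sum(c for c, _ in results)
    attempts = sum(a for _, a in results)
    return 100.0 * correct / attempts

def pass_at_3(results: list[tuple[int, int]]) -> float:
    """Percentage of problems solved at least once across the sampled attempts."""
    solved = sum(1 for c, _ in results if c > 0)
    return 100.0 * solved / len(results)

# Grok-4.1-fast row from the table: 3/3 3/3 3/3 3/3 3/3 2/3 1/3 3/3 3/3 3/3
grok = [(3, 3)] * 5 + [(2, 3), (1, 3)] + [(3, 3)] * 3
print(accuracy(grok))   # 90.0
print(pass_at_3(grok))  # 100.0
```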

Performance Legend

Mastery (100%): 3/3
Strong (66%): 2/3
Weak (33%): 1/3
Fail (0%): 0/3
Model | Acc | Pass@3 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

API / Others
Grok-4.1-fast | 90.0 | 100.0 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 1/3 | 3/3 | 3/3 | 3/3
GPT-5.1 (high) | 86.7 | 90.0 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 0/3 | 3/3 | 3/3 | 3/3
Claude-Opus-4.5 | 86.2 | 90.0 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 1/2 | 0/3 | 3/3 | 3/3 | 3/3
Gemini-3-Pro-Preview | 83.3 | 90.0 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 1/3 | 0/3 | 3/3 | 3/3 | 3/3
Deepseek-V3.2 | 82.8 | 90.0 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 1/2 | 0/3 | 3/3 | 3/3 | 3/3

K-LLM Project Round 2
K-EXAONE-236B-A23B | 76.7 | 90.0 | 3/3 | 3/3 | 2/3 | 3/3 | 2/3 | 1/3 | 0/3 | 3/3 | 3/3 | 3/3
Solar-Open-100B | 56.7 | 70.0 | 2/3 | 3/3 | 2/3 | 3/3 | 2/3 | 0/3 | 0/3 | 0/3 | 2/3 | 3/3

K-LLM Project Round 1
Solar-Pro-2 (31B)(high) | 53.3 | 70.0 | 1/3 | 3/3 | 2/3 | 3/3 | 3/3 | 0/3 | 0/3 | 0/3 | 1/3 | 3/3
EXAONE-4.0.1-32B (high) | 46.7 | 60.0 | 2/3 | 3/3 | 3/3 | 3/3 | 1/3 | 0/3 | 0/3 | 0/3 | 0/3 | 2/3
HCX-007(high) | 26.7 | 40.0 | 0/3 | 3/3 | 1/3 | 2/3 | 2/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3
A.X-4.0 (72B) | 23.3 | 30.0 | 0/3 | 3/3 | 0/3 | 2/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 2/3
Llama-VARCO-8B-Instruct | 7.1 | 20.0 | 0/3 | 1/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/1 | 0/3 | 0/3 | 1/3

Local - KR
Kanana-2-30B-Thinking-2601 | 63.3 | 80.0 | 3/3 | 3/3 | 1/3 | 3/3 | 3/3 | 0/3 | 0/3 | 1/3 | 2/3 | 3/3
Kanana-2-30B-Thinking | 60.0 | 80.0 | 1/3 | 3/3 | 2/3 | 3/3 | 2/3 | 0/3 | 0/3 | 2/3 | 2/3 | 3/3

Local - US
GPT-oss-20B (high) | 82.1 | 80.0 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 0/1 | 0/3 | 3/3 | 3/3 | 3/3
Gemma-3-27B | 26.7 | 40.0 | 0/3 | 2/3 | 0/3 | 2/3 | 3/3 | 0/3 | 0/3 | 0/3 | 0/3 | 1/3

Local - CN
Deepseek-R1-distill-Qwen-32B (high) | 56.7 | 70.0 | 1/3 | 3/3 | 3/3 | 3/3 | 3/3 | 0/3 | 0/3 | 0/3 | 1/3 | 3/3
Qwen3-30B-A3B-2507 | 57.7 | 50.0 | 0/3 | 3/3 | 0/3 | 3/3 | 3/3 | 0/1 | 0/3 | 0/1 | 3/3 | 3/3
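A few cells have denominators other than 3 (e.g. 1/2 or 0/1); the page does not explain these, but they presumably correspond to attempts that could not be scored, with accuracy computed over the scored attempts only. Worked check for GPT-oss-20B (high): correct attempts 2+3+3+3+3+0+0+3+3+3 = 23 out of 9*3 + 1 = 28 scored attempts, so Acc = 23/28 ≈ 82.1%, while Pass@3 = 8 solved problems out of 10 = 80.0%, matching the table.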