
EntropyMath Leaderboard

A high-entropy mathematical reasoning benchmark for LLMs

Korean CSAT 2025 (KOR)

[Chart: Model Accuracy vs Pass@3 — bar chart (0–100%) comparing Accuracy and Pass@3 for GPT-5.2 (high), Gemini-3-Pro-Preview, K-EXAONE-236B-A23B, EXAONE-4.0.1-32B (high), and Kanana-2-30B-Thinking-2601]

[Chart: Avg Token Usage (Per Problem) — bar chart of average tokens per problem (0–20.1K), ordered K-EXAONE-236B-A23B, Gemini-3-Pro-Preview, Kanana-2-30B-Thinking-2601, GPT-5.2 (high), EXAONE-4.0.1-32B (high)]

EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. The KOR_CSAT_25_KOR dataset comprises the mathematics problems from the 2025 Korean College Scholastic Ability Test (CSAT), making it a highly challenging benchmark for verifying mathematical reasoning in Korean.

Results are reported using the Pass@3 metric to account for generation variance, alongside detailed execution traces for transparency.

Performance Legend

- Mastery (100%): 3/3 attempts correct
- Strong (66%): 2/3 attempts correct
- Weak (33%): 1/3 attempts correct
- Fail (0%): 0/3 attempts correct
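As an illustration, the legend's mapping from correct attempts (out of 3) to a performance label can be sketched as follows; the function and dictionary names are hypothetical, not part of the leaderboard's code:

```python
# Hypothetical sketch: map a problem's correct-attempt count (out of 3)
# to the performance labels defined in the legend above.
LEGEND = {
    3: "Mastery (100%)",
    2: "Strong (66%)",
    1: "Weak (33%)",
    0: "Fail (0%)",
}

def legend_label(n_correct: int) -> str:
    """Return the legend label for n_correct correct attempts out of 3."""
    return LEGEND[n_correct]
```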
Columns 0–12 give per-problem results as (correct attempts)/(total attempts).

| Category | Model | Acc | Pass@3 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| API / Others | GPT-5.2 (high) | 100.0 | 100.0 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 |
| API / Others | Gemini-3-Pro-Preview | 100.0 | 100.0 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 | 1/1 |
| K-LLM Project Round 2 | K-EXAONE-236B-A23B | 66.7 | 84.6 | 1/3 | 0/3 | 3/3 | 1/3 | 2/3 | 3/3 | 3/3 | 2/3 | 3/3 | 0/3 | 2/3 | 3/3 | 3/3 |
| K-LLM Project Round 1 | EXAONE-4.0.1-32B (high) | 53.8 | 76.9 | 1/3 | 0/3 | 3/3 | 2/3 | 1/3 | 1/3 | 3/3 | 3/3 | 1/3 | 0/3 | 0/3 | 3/3 | 3/3 |
| Local - KR | Kanana-2-30B-Thinking-2601 | 53.8 | 69.2 | 0/3 | 0/3 | 3/3 | 0/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 0/3 | 1/3 | 2/3 | 1/3 |
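The summary columns are consistent with the per-problem cells. Here is a minimal sketch of that relationship — an assumption about how Acc (micro-averaged over all attempts) and Pass@3 (fraction of problems solved at least once) are derived, not the leaderboard's official evaluation code:

```python
def accuracy(correct_counts, attempts_per_problem=3):
    """Micro-averaged accuracy: total correct attempts / total attempts."""
    return sum(correct_counts) / (len(correct_counts) * attempts_per_problem)

def pass_at_3(correct_counts):
    """Fraction of problems solved in at least one of the attempts."""
    return sum(1 for c in correct_counts if c > 0) / len(correct_counts)

# Per-problem correct counts for K-EXAONE-236B-A23B, read off the table above.
k_exaone = [1, 0, 3, 1, 2, 3, 3, 2, 3, 0, 2, 3, 3]

print(round(100 * accuracy(k_exaone), 1))   # → 66.7
print(round(100 * pass_at_3(k_exaone), 1))  # → 84.6
```

The same computation reproduces the EXAONE-4.0.1-32B row (21/39 correct → 53.8; 10/13 problems solved → 76.9) and the Kanana row (21/39 → 53.8; 9/13 → 69.2).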