Korean CSAT (College Scholastic Ability Test) Math Problems
[Chart: Model Accuracy vs Pass@3 — per-model bars for Accuracy and Pass@3 on a 0–100% axis. Models: Gemini-3-Pro-Preview, GPT-5.2 (high), Claude-Opus-4.5, Grok-4.1-fast, GPT-5.1 (high), Deepseek-V3.2, Solar-Open-100B, K-EXAONE-236B-A23B, Kanana-2-30B-Thinking-2601, Solar-Pro-2 (31B) (high), Kanana-2-30B-Thinking, HCX-007 (high), EXAONE-4.0.1-32B (high), A.X-4.0 (72B), axk1, Llama-VARCO-8B-Instruct]
[Chart: Avg Token Usage (Per Problem) — average tokens per problem for the same models, on a 0–112K axis]
EntropyMath is an evolutionary multi-agent system and benchmark that generates high-entropy math problems designed to systematically break current LLMs. The EntropyMath_SAT_50 dataset challenges models with SAT-style problems derived from national college entrance exams in Korea (the College Scholastic Ability Test, CSAT), India, and Japan. These problems demand not only high-precision calculation but also deep conceptual understanding and logical inference, making them a significant challenge even for advanced LLMs.
Results are reported with the Pass@3 metric to account for generation variance, alongside detailed execution traces for transparency.
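As a minimal sketch of the two reported quantities, assuming Accuracy is the mean per-attempt correctness over all runs and Pass@3 counts a problem as solved if any of its three attempts succeeds (the function names and input layout here are illustrative, not part of the benchmark's code):

```python
def accuracy(results):
    """results: list of per-problem lists of 3 booleans (one per attempt)."""
    attempts = [ok for problem in results for ok in problem]
    return sum(attempts) / len(attempts)

def pass_at_3(results):
    """Fraction of problems solved in at least one of the 3 attempts."""
    return sum(any(problem) for problem in results) / len(results)
```

Because a single lucky attempt is enough to pass, Pass@3 is always at least as high as Accuracy, which is why the two bars can diverge noticeably for high-variance models.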
Performance Legend
Mastery (100%): 3/3
Strong (66%): 2/3
Weak (33%): 1/3
Fail (0%): 0/3
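The legend above maps the number of correct attempts (out of 3) to a category. A hypothetical helper making that mapping explicit (the dictionary and function are illustrative, not part of the benchmark's code):

```python
# Legend categories keyed by how many of the 3 attempts were correct.
LEGEND = {3: "Mastery (100%)", 2: "Strong (66%)", 1: "Weak (33%)", 0: "Fail (0%)"}

def legend_label(attempts):
    """attempts: list of 3 booleans, True if that attempt solved the problem."""
    return LEGEND[sum(attempts)]
```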


