Aggregated benchmark scores for 44 open-weight code models across HumanEval, HumanEval+, MBPP, LiveCodeBench, and SWE-bench Verified, compiled from original papers and third-party evaluations.
The frontier has bifurcated. A small cluster of MoE-architecture models (Kimi K2.5, Qwen3-Coder-480B, and Devstral 2) now dominates real-world agent tasks on SWE-bench Verified, while dense models max out the saturated HumanEval benchmark, where scores above 90% no longer differentiate models meaningfully.
LiveCodeBench is the new standard. Only 12 of 44 models report LiveCodeBench scores, but it is now the primary discriminator among frontier models. The roughly 14-point drop from the top pair, Kimi K2.5 (85.0) and GLM-4.7 (84.9), to the next open-weight model reveals how quickly the ceiling is rising.
SWE-bench Verified separates agents from assistants. With only 4 of 44 models reporting scores, SWE-bench data is sparse but decisive: it measures whether a model can autonomously resolve real GitHub issues, not just complete synthetic problems. Scores above 65% mark a qualitative capability threshold, and all four reporting models clear it.
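To make the distinction concrete, each SWE-bench Verified instance pairs a snapshot of a real repository with the text of an actual GitHub issue, and a model scores a point only if its generated patch makes the repository's tests pass. A minimal sketch for inspecting the benchmark, assuming the Hugging Face `datasets` library and the `princeton-nlp/SWE-bench_Verified` dataset ID:

```python
from datasets import load_dataset

# SWE-bench Verified: 500 human-validated instances built from real GitHub
# issues. Dataset ID and field names are assumed to follow the public
# SWE-bench release on the Hugging Face Hub.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

example = ds[0]
print(example["repo"])                     # source repository, e.g. "astropy/astropy"
print(example["problem_statement"][:300])  # the raw issue text the model must resolve
```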
Efficient models punch above their weight. GLM-4.7 at 9B parameters achieves 94.2 on HumanEval and 84.9 on LiveCodeBench, rivaling models more than 50× its size and demonstrating that parameter count is an increasingly poor proxy for coding capability.
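A note on scoring: the HumanEval, HumanEval+, and MBPP figures below are conventionally reported as pass@1 percentages, the fraction of problems solved by a sampled completion. When n samples per problem are available, the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) is used; a minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples generated for a problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

For k = 1 this reduces to c/n, so pass@1 is simply the observed solve rate.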
| # | Model | Org | Params | LiveCodeBench ↑ | SWE-bench Verified ↑ | HumanEval ↑ | HumanEval+ ↑ | MBPP ↑ | Tier |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Kimi K2.5 | Moonshot AI | unknown | 85.0 | 76.8 | 99.0 | — | — | S |
| 2 | GLM-4.7 | Zhipu AI / Tsinghua KEG | 9B | 84.9 | — | 94.2 | — | — | S |
| 3 | Qwen3-Coder-480B-A35B-Instruct | Alibaba / Qwen Team | 480B (35B active) | 70.7 | 69.6 | 89.3 | — | 78.2 | S |
| 4 | Kimi K2 | Moonshot AI | 1T+ | 53.7 | — | — | — | — | A |
| 5 | DeepSeek-Coder-V2-Instruct | DeepSeek AI | 236B | 43.4 | — | 90.2 | 84.8 | — | A |
| 6 | CodeQwen1.5-7B-Chat | Alibaba / Qwen Team | 7B | 43.3 | — | 83.5 | — | 78.7 | A |
| 7 | Qwen2.5-Coder-32B-Instruct | Alibaba / Qwen Team | 32B | 37.2 | — | 92.7 | 87.0 | 90.2 | B |
| 8 | DeepSeek-Coder-V2-Lite-Instruct | DeepSeek AI | 16B | 24.3 | — | 81.1 | — | — | B |
| 9 | Qwen2.5-Coder-14B-Instruct | Alibaba / Qwen Team | 14B | 23.4 | — | 89.6 | — | 86.2 | B |
| 10 | Yi-Coder-9B-Chat | 01.AI | 9B | 23.4 | — | 85.4 | — | 73.8 | B |
| 11 | Qwen2.5-Coder-7B-Instruct | Alibaba / Qwen Team | 7B | 18.2 | — | 88.4 | 84.1 | 83.5 | B |
| 12 | Qwen2.5-Coder-1.5B-Instruct | Alibaba / Qwen Team | 1.5B | 6.1 | — | 70.7 | — | 69.2 | C |
| 13 | Devstral 2 | Mistral AI | 123B | — | 72.2 | — | — | — | S |
| 14 | Devstral Small 2 | Mistral AI | 24B | — | 68.0 | — | — | — | A |
| 15 | Codestral 25.01 | Mistral AI | 22B | — | — | 86.6 | — | 91.2 | B |
| 16 | OpenCoder-8B-Instruct | INFLY Tech / OpenCoder Team | 8B | — | — | 83.5 | 78.7 | 79.1 | B |
| 17 | Phi-4 | Microsoft | 14B | — | — | 82.6 | 82.8 | — | B |
| 18 | WizardCoder-33B-V1.1 | WizardLM Team / Microsoft | 33B | — | — | 79.9 | 73.2 | 78.9 | C |
| 19 | DeepSeek-Coder-33B-Instruct | DeepSeek AI | 33B | — | — | 79.3 | — | 70.8 | C |
| 20 | DeepSeek-Coder-6.7B-Instruct | DeepSeek AI | 6.7B | — | — | 78.6 | — | 74.9 | C |
| 21 | Magicoder-S-DS-6.7B | University of Illinois (UIUC ISE) | 6.7B | — | — | 76.8 | — | — | C |
| 22 | Codestral Mamba 7B | Mistral AI | 7B | — | — | 75.0 | — | 68.5 | C |
| 23 | Phind-CodeLlama-34B-v2 | Phind | 34B | — | — | 73.8 | — | — | C |
| 24 | WizardCoder-Python-34B-V1.0 | WizardLM Team / Microsoft | 34B | — | — | 73.2 | — | — | C |
| 25 | OpenCoder-1.5B-Instruct | INFLY Tech / OpenCoder Team | 1.5B | — | — | 72.5 | 67.7 | 72.7 | C |
| 26 | Code Llama 70B Instruct | Meta AI | 70B | — | — | 67.8 | — | 62.2 | C |
| 27 | Granite-34B-Code-Instruct-8K | IBM Research | 34B | — | — | 62.2 | — | 47.2 | C |
| 28 | Qwen2.5-Coder-7B-Base | Alibaba / Qwen Team | 7B | — | — | 61.6 | 53.0 | 76.9 | C |
| 29 | CodeGemma 7B-IT 1.1 | Google DeepMind | 7B | — | — | 60.4 | — | 55.2 | C |
| 30 | CodeGemma 7B-IT | Google DeepMind | 7B | — | — | 56.1 | — | 54.2 | C |
| 31 | Code Llama - Python 34B | Meta AI | 34B | — | — | 53.7 | — | 56.2 | C |
| 32 | StarCoder2-15B | BigCode (ServiceNow, Hugging Face, NVIDIA) | 15B | — | — | 46.3 | 37.8 | 66.2 | C |
| 33 | OctoCoder | BigCode (Hugging Face, ServiceNow) | 15.5B | — | — | 46.2 | — | — | C |
| 34 | InstructCodeT5+ 16B | Salesforce Research | 16B | — | — | 36.1 | — | — | C |
| 35 | StarCoder2-7B | BigCode (ServiceNow, Hugging Face, NVIDIA) | 7B | — | — | 35.4 | 29.9 | 54.4 | C |
| 36 | Code Llama 7B Instruct | Meta AI | 7B | — | — | 34.8 | — | 44.4 | C |
| 37 | StarCoder | BigCode (Hugging Face, ServiceNow) | 15.5B | — | — | 33.6 | — | 52.7 | C |
| 38 | StarCoder2-3B | BigCode (ServiceNow, Hugging Face, NVIDIA) | 3B | — | — | 31.7 | 27.4 | 60.2 | C |
| 39 | CodeGen-Mono 16B | Salesforce Research | 16.1B | — | — | 29.3 | — | — | C |
| 40 | SantaCoder | BigCode (Hugging Face) | 1.1B | — | — | 18.0 | — | — | C |
| 41 | InCoder-6.7B | Facebook AI Research / UC Berkeley | 6.7B | — | — | 15.2 | — | — | C |
| 42 | PolyCoder-2.7B | Carnegie Mellon University | 2.7B | — | — | 5.6 | — | — | C |
| 43 | Granite-8B-Code-Instruct-4K | IBM Research | 8B | — | — | — | — | 42.2 | C |
| 44 | Granite-3B-Code-Base | IBM Research | 3B | — | — | — | — | 36.0 | C |
The complete structured dataset powering this leaderboard is available for download.
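For working with it programmatically, a minimal sketch assuming a CSV export with one row per model and column names matching the table above (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical file name; columns are assumed to mirror the table above,
# with "—" marking benchmarks a model does not report.
df = pd.read_csv("code_model_leaderboard.csv", na_values="—")

# Example query: the LiveCodeBench frontier, highest score first.
frontier = (
    df.dropna(subset=["LiveCodeBench"])
      .sort_values("LiveCodeBench", ascending=False)
      .head(5)
)
print(frontier[["Model", "Params", "LiveCodeBench", "Tier"]])
```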