April 2026 Snapshot

Open-Source LLM Code Performance Rankings

Aggregated benchmark scores for 44 open-weight code models across HumanEval, LiveCodeBench, and SWE-bench Verified, compiled from original papers and third-party evaluations.

  • 44 models tracked
  • 12 with LiveCodeBench scores
  • 4 with SWE-bench Verified scores
  • Best LiveCodeBench: 85.0 (Kimi K2.5)
  • Best SWE-bench: 76.8 (Kimi K2.5)
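
For readers re-deriving the summary cards from the underlying data, here is a minimal sketch of how the headline numbers fall out of per-model records; the `ModelScore` type and its field names are illustrative stand-ins, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelScore:
    name: str
    lcb: Optional[float] = None  # LiveCodeBench
    swe: Optional[float] = None  # SWE-bench Verified

# Three of the 44 tracked models, scored as in the leaderboard below.
models = [
    ModelScore("Kimi K2.5", lcb=85.0, swe=76.8),
    ModelScore("GLM-4.7", lcb=84.9),
    ModelScore("Devstral 2", swe=72.2),
    # ... remaining 41 entries
]

lcb_scores = [m.lcb for m in models if m.lcb is not None]
swe_scores = [m.swe for m in models if m.swe is not None]

print(f"{len(models)} models tracked")
print(f"{len(lcb_scores)} with LiveCodeBench, best {max(lcb_scores)}")
print(f"{len(swe_scores)} with SWE-bench, best {max(swe_scores)}")
```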

Key Findings — April 2026

The frontier has bifurcated. A small cluster of MoE-architecture models — Kimi K2.5, Qwen3-Coder-480B, and Devstral 2 — now dominate real-world agent tasks (SWE-bench Verified), while dense models max out the saturated HumanEval benchmark. HumanEval scores above 90% no longer differentiate models meaningfully.

LiveCodeBench is the new standard. Only 12 of 44 models report LiveCodeBench scores, but it is now the primary discriminator for frontier models. Kimi K2.5 (85.0) and GLM-4.7 (84.9) are effectively tied at the top, and the 14-point drop to third-place Qwen3-Coder-480B (70.7) shows how quickly the ceiling is rising.

SWE-bench Verified separates agents from assistants. With only 4 models reporting scores, SWE-bench data is sparse but decisive: it measures whether a model can autonomously resolve real GitHub issues, not just complete synthetic problems. Scores above 65% represent a qualitative capability threshold.

Efficient models punch above their weight. GLM-4.7 at 9B parameters achieves 94.2 HumanEval and 84.9 LiveCodeBench — rivaling models 50× larger — demonstrating that parameter count is increasingly a poor proxy for coding capability.

Full Leaderboard

Tier cutoffs: S = LCB ≥ 70 or SWE ≥ 70 · A = LCB ≥ 40 or SWE ≥ 60 · B = LCB ≥ 20 or HumanEval ≥ 80 · C = all remaining models. (A short sketch of this assignment rule follows the table.)
| # | Model | Organization | Params | LiveCodeBench ↑ | SWE-bench ↑ | HumanEval ↑ | HumanEval+ ↑ | MBPP ↑ | Tier |
|---|-------|--------------|--------|-----------------|-------------|-------------|--------------|--------|------|
| #1 | Kimi K2.5 | Moonshot AI | unknown | 85.0 | 76.8 | 99.0 | – | – | S |
| #2 | GLM-4.7 (9B) | Zhipu AI / Tsinghua KEG | 9B | 84.9 | – | 94.2 | – | – | S |
| #3 | Qwen3-Coder-480B-A35B-Instruct | Alibaba / Qwen Team | 480B | 70.7 | 69.6 | 89.3 | 78.2 | – | S |
| #4 | Kimi K2 | Moonshot AI | 1T+ | 53.7 | – | – | – | – | A |
| #5 | DeepSeek-Coder-V2-Instruct | DeepSeek AI | 236B | 43.4 | – | 90.2 | 84.8 | – | A |
| #6 | CodeQwen1.5-7B-Chat | Alibaba / Qwen Team | 7B | 43.3 | – | 83.5 | 78.7 | – | A |
| #7 | Qwen2.5-Coder-32B-Instruct | Alibaba / Qwen Team | 32B | 37.2 | – | 92.7 | 87.0 | 90.2 | B |
| #8 | DeepSeek-Coder-V2-Lite-Instruct | DeepSeek AI | 16B | 24.3 | – | 81.1 | – | – | B |
| #9 | Qwen2.5-Coder-14B-Instruct | Alibaba / Qwen Team | 14B | 23.4 | – | 89.6 | 86.2 | – | B |
| #10 | Yi-Coder-9B-Chat | 01.AI | 9B | 23.4 | – | 85.4 | 73.8 | – | B |
| #11 | Qwen2.5-Coder-7B-Instruct | Alibaba / Qwen Team | 7B | 18.2 | – | 88.4 | 84.1 | 83.5 | B |
| #12 | Qwen2.5-Coder-1.5B-Instruct | Alibaba / Qwen Team | 1.5B | 6.1 | – | 70.7 | – | 69.2 | C |
| #13 | Devstral 2 (Devstral-2-123B) | Mistral AI | 123B | – | 72.2 | – | – | – | S |
| #14 | Devstral Small 2 (24B) | Mistral AI | 24B | – | 68.0 | – | – | – | A |
| #15 | Codestral 25.01 | Mistral AI | 22B | – | – | 86.6 | – | 91.2 | B |
| #16 | OpenCoder-8B-Instruct | INFLY Tech / OpenCoder Team | 8B | – | – | 83.5 | 78.7 | 79.1 | B |
| #17 | Phi-4 | Microsoft | 14B | – | – | 82.6 | – | 82.8 | B |
| #18 | WizardCoder-33B-V1.1 | WizardLM Team / Microsoft | 33B | – | – | 79.9 | 73.2 | 78.9 | C |
| #19 | DeepSeek-Coder-33B-Instruct | DeepSeek AI | 33B | – | – | 79.3 | 70.8 | – | C |
| #20 | DeepSeek-Coder-6.7B-Instruct | DeepSeek AI | 6.7B | – | – | 78.6 | 74.9 | – | C |
| #21 | Magicoder-S-DS-6.7B | University of Illinois (UIUC ISE) | 6.7B | – | – | 76.8 | – | – | C |
| #22 | Codestral Mamba 7B | Mistral AI | 7B | – | – | 75.0 | – | 68.5 | C |
| #23 | Phind-CodeLlama-34B-v2 | Phind | 34B | – | – | 73.8 | – | – | C |
| #24 | WizardCoder-Python-34B-V1.0 | WizardLM Team / Microsoft | 34B | – | – | 73.2 | – | – | C |
| #25 | OpenCoder-1.5B-Instruct | INFLY Tech / OpenCoder Team | 1.5B | – | – | 72.5 | 67.7 | 72.7 | C |
| #26 | Code Llama 70B Instruct | Meta AI | 70B | – | – | 67.8 | – | 62.2 | C |
| #27 | Granite-34B-Code-Instruct-8K | IBM Research | 34B | – | – | 62.2 | – | 47.2 | C |
| #28 | Qwen2.5-Coder-7B-Base | Alibaba / Qwen Team | 7B | – | – | 61.6 | 53.0 | 76.9 | C |
| #29 | CodeGemma 7B-IT 1.1 | Google DeepMind | 7B | – | – | 60.4 | – | 55.2 | C |
| #30 | CodeGemma 7B-IT (Instruction Tuned) | Google DeepMind | 7B | – | – | 56.1 | – | 54.2 | C |
| #31 | Code Llama - Python 34B | Meta AI | 34B | – | – | 53.7 | – | 56.2 | C |
| #32 | StarCoder2-15B | BigCode (ServiceNow, Hugging Face, NVIDIA) | 15B | – | – | 46.3 | 37.8 | 66.2 | C |
| #33 | OctoCoder (15.5B) | BigCode (Hugging Face, ServiceNow) | 15.5B | – | – | 46.2 | – | – | C |
| #34 | InstructCodeT5+ 16B | Salesforce Research | 16B | – | – | 36.1 | – | – | C |
| #35 | StarCoder2-7B | BigCode (ServiceNow, Hugging Face, NVIDIA) | 7B | – | – | 35.4 | 29.9 | 54.4 | C |
| #36 | Code Llama 7B Instruct | Meta AI | 7B | – | – | 34.8 | – | 44.4 | C |
| #37 | StarCoder (15.5B) | BigCode (Hugging Face, ServiceNow) | 15.5B | – | – | 33.6 | – | 52.7 | C |
| #38 | StarCoder2-3B | BigCode (ServiceNow, Hugging Face, NVIDIA) | 3B | – | – | 31.7 | 27.4 | 60.2 | C |
| #39 | CodeGen-Mono 16B | Salesforce Research | 16.1B | – | – | 29.3 | – | – | C |
| #40 | SantaCoder (1.1B) | BigCode (Hugging Face) | 1.1B | – | – | 18.0 | – | – | C |
| #41 | InCoder-6.7B | Facebook AI Research / UC Berkeley | 6.7B | – | – | 15.2 | – | – | C |
| #42 | PolyCoder-2.7B | Carnegie Mellon University | 2.7B | – | – | 5.6 | – | – | C |
| #43 | Granite-8B-Code-Instruct-4K | IBM Research | 8B | – | – | – | – | 42.2 | C |
| #44 | Granite-3B-Code-Base | IBM Research | 3B | – | – | – | – | 36.0 | C |
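
The tier labels above are pure threshold rules over three scores. Here is a minimal sketch of the assignment logic in Python, assuming missing scores are passed as `None` (the legend implies, but does not state, that C is the fallback for models clearing no cutoff):

```python
from typing import Optional

def tier(lcb: Optional[float] = None,
         swe: Optional[float] = None,
         he: Optional[float] = None) -> str:
    """Assign a leaderboard tier from the legend's cutoffs.

    A missing score (None) simply fails every threshold check.
    """
    def ge(score: Optional[float], cutoff: float) -> bool:
        return score is not None and score >= cutoff

    if ge(lcb, 70) or ge(swe, 70):
        return "S"
    if ge(lcb, 40) or ge(swe, 60):
        return "A"
    if ge(lcb, 20) or ge(he, 80):
        return "B"
    return "C"

# Spot-checks against the table above:
assert tier(lcb=85.0, swe=76.8, he=99.0) == "S"  # Kimi K2.5
assert tier(swe=68.0) == "A"                     # Devstral Small 2
assert tier(lcb=6.1, he=70.7) == "C"             # Qwen2.5-Coder-1.5B
```

The rules are evaluated top-down, so a model lands in the highest tier whose cutoff it clears.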

Full Dataset Export

Download the complete structured dataset powering this leaderboard:

  • All 44 models · 7 benchmark dimensions · raw scores + metadata
  • JSON + CSV formats · source references for every data point
  • Architecture details, license, context window, release dates
  • One-time purchase — no subscription
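
As a minimal sketch of consuming the JSON export, the snippet below pulls out the agent-capable cluster discussed in the key findings (SWE-bench Verified above 65). The file name and every field name (`model`, `benchmarks`, `swe_bench`) are assumptions for illustration; the actual export schema is not documented on this page.

```python
import json

# Load the purchased JSON export (file name is an assumption).
with open("leaderboard.json") as f:
    models = json.load(f)

# Keep models past the qualitative SWE-bench Verified threshold (65).
# "benchmarks" / "swe_bench" are hypothetical field names.
agents = [
    m for m in models
    if m.get("benchmarks", {}).get("swe_bench") is not None
    and m["benchmarks"]["swe_bench"] >= 65
]

for m in agents:
    print(m["model"], m["benchmarks"]["swe_bench"])
```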
Download Dataset: €5 · one-time · instant delivery