Aggregated benchmark scores for 44 open-weight code models across HumanEval, HumanEval+, MBPP, LiveCodeBench, and SWE-bench Verified, compiled from original papers and third-party evaluations.
The frontier has bifurcated. A small cluster of MoE-architecture models (Kimi K2.5, Qwen3-Coder-480B, and Devstral 2) now dominates real-world agent tasks on SWE-bench Verified, while dense models max out the saturated HumanEval benchmark, where scores above 90% no longer differentiate models meaningfully.
LiveCodeBench is the new standard. Only 12 of 44 models report LiveCodeBench scores, but it is now the primary discriminator among frontier models. The roughly 14-point drop from the top pair, Kimi K2.5 (85.0) and GLM-4.7 (84.9), to the next open-weight model reveals how quickly the ceiling is rising.
SWE-bench Verified separates agents from assistants. With only 4 of 44 models reporting scores, SWE-bench data is sparse but decisive: it measures whether a model can autonomously resolve real GitHub issues, not just complete synthetic problems. Scores above 65% mark a qualitative capability threshold, and all four reporting models clear it.
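To make the distinction concrete, each SWE-bench Verified instance pairs a snapshot of a real repository with the text of an actual GitHub issue, and a model scores a point only if its generated patch makes the repository's tests pass. A minimal sketch for inspecting the benchmark, assuming the Hugging Face `datasets` library and the `princeton-nlp/SWE-bench_Verified` dataset ID:

```python
from datasets import load_dataset

# SWE-bench Verified: 500 human-validated instances built from real GitHub
# issues. Dataset ID and field names are assumed to follow the public
# SWE-bench release on the Hugging Face Hub.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

example = ds[0]
print(example["repo"])                     # source repository, e.g. "astropy/astropy"
print(example["problem_statement"][:300])  # the raw issue text the model must resolve
```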
Efficient models punch above their weight. GLM-4.7 at 9B parameters achieves 94.2 on HumanEval and 84.9 on LiveCodeBench, rivaling models more than 50× its size and demonstrating that parameter count is an increasingly poor proxy for coding capability.
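A note on scoring: the HumanEval, HumanEval+, and MBPP figures below are conventionally reported as pass@1 percentages, the fraction of problems solved by a sampled completion. When n samples per problem are available, the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) is used; a minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples generated for a problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

For k = 1 this reduces to c/n, so pass@1 is simply the observed solve rate.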
| # | Model | Org | Params | LiveCodeBench ↑ | SWE-bench Verified ↑ | HumanEval ↑ | HumanEval+ ↑ | MBPP ↑ | Tier |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Kimi K2.5 | Moonshot AI | unknown | 85.0 | 76.8 | 99.0 | — | — | S |
| 2 | GLM-4.7 | Zhipu AI / Tsinghua KEG | 9B | 84.9 | — | 94.2 | — | — | S |
| 3 | Qwen3-Coder-480B-A35B-Instruct | Alibaba / Qwen Team | 480B (35B active) | 70.7 | 69.6 | 89.3 | — | 78.2 | S |
| 4 | Kimi K2 | Moonshot AI | 1T+ | 53.7 | — | — | — | — | A |
| 5 | DeepSeek-Coder-V2-Instruct | DeepSeek AI | 236B | 43.4 | — | 90.2 | 84.8 | — | A |
| 6 | CodeQwen1.5-7B-Chat | Alibaba / Qwen Team | 7B | 43.3 | — | 83.5 | — | 78.7 | A |
| 7 | Qwen2.5-Coder-32B-Instruct | Alibaba / Qwen Team | 32B | 37.2 | — | 92.7 | 87.0 | 90.2 | B |
| 8 | DeepSeek-Coder-V2-Lite-Instruct | DeepSeek AI | 16B | 24.3 | — | 81.1 | — | — | B |
| 9 | Qwen2.5-Coder-14B-Instruct | Alibaba / Qwen Team | 14B | 23.4 | — | 89.6 | — | 86.2 | B |
| 10 | Yi-Coder-9B-Chat | 01.AI | 9B | 23.4 | — | 85.4 | — | 73.8 | B |
| 11 | Qwen2.5-Coder-7B-Instruct | Alibaba / Qwen Team | 7B | 18.2 | — | 88.4 | 84.1 | 83.5 | B |
| 12 | Qwen2.5-Coder-1.5B-Instruct | Alibaba / Qwen Team | 1.5B | 6.1 | — | 70.7 | — | 69.2 | C |
| 13 | Devstral 2 | Mistral AI | 123B | — | 72.2 | — | — | — | S |
| 14 | Devstral Small 2 | Mistral AI | 24B | — | 68.0 | — | — | — | A |
| 15 | Codestral 25.01 | Mistral AI | 22B | — | — | 86.6 | — | 91.2 | B |
| 16 | OpenCoder-8B-Instruct | INFLY Tech / OpenCoder Team | 8B | — | — | 83.5 | 78.7 | 79.1 | B |
| 17 | Phi-4 | Microsoft | 14B | — | — | 82.6 | 82.8 | — | B |
| 18 | WizardCoder-33B-V1.1 | WizardLM Team / Microsoft | 33B | — | — | 79.9 | 73.2 | 78.9 | C |
| 19 | DeepSeek-Coder-33B-Instruct | DeepSeek AI | 33B | — | — | 79.3 | — | 70.8 | C |
| 20 | DeepSeek-Coder-6.7B-Instruct | DeepSeek AI | 6.7B | — | — | 78.6 | — | 74.9 | C |
| 21 | Magicoder-S-DS-6.7B | University of Illinois (UIUC ISE) | 6.7B | — | — | 76.8 | — | — | C |
| 22 | Codestral Mamba 7B | Mistral AI | 7B | — | — | 75.0 | — | 68.5 | C |
| 23 | Phind-CodeLlama-34B-v2 | Phind | 34B | — | — | 73.8 | — | — | C |
| 24 | WizardCoder-Python-34B-V1.0 | WizardLM Team / Microsoft | 34B | — | — | 73.2 | — | — | C |
| 25 | OpenCoder-1.5B-Instruct | INFLY Tech / OpenCoder Team | 1.5B | — | — | 72.5 | 67.7 | 72.7 | C |
| 26 | Code Llama 70B Instruct | Meta AI | 70B | — | — | 67.8 | — | 62.2 | C |
| 27 | Granite-34B-Code-Instruct-8K | IBM Research | 34B | — | — | 62.2 | — | 47.2 | C |
| 28 | Qwen2.5-Coder-7B-Base | Alibaba / Qwen Team | 7B | — | — | 61.6 | 53.0 | 76.9 | C |
| 29 | CodeGemma 7B-IT 1.1 | Google DeepMind | 7B | — | — | 60.4 | — | 55.2 | C |
| 30 | CodeGemma 7B-IT | Google DeepMind | 7B | — | — | 56.1 | — | 54.2 | C |
| 31 | Code Llama - Python 34B | Meta AI | 34B | — | — | 53.7 | — | 56.2 | C |
| 32 | StarCoder2-15B | BigCode (ServiceNow, Hugging Face, NVIDIA) | 15B | — | — | 46.3 | 37.8 | 66.2 | C |
| 33 | OctoCoder | BigCode (Hugging Face, ServiceNow) | 15.5B | — | — | 46.2 | — | — | C |
| 34 | InstructCodeT5+ 16B | Salesforce Research | 16B | — | — | 36.1 | — | — | C |
| 35 | StarCoder2-7B | BigCode (ServiceNow, Hugging Face, NVIDIA) | 7B | — | — | 35.4 | 29.9 | 54.4 | C |
| 36 | Code Llama 7B Instruct | Meta AI | 7B | — | — | 34.8 | — | 44.4 | C |
| 37 | StarCoder | BigCode (Hugging Face, ServiceNow) | 15.5B | — | — | 33.6 | — | 52.7 | C |
| 38 | StarCoder2-3B | BigCode (ServiceNow, Hugging Face, NVIDIA) | 3B | — | — | 31.7 | 27.4 | 60.2 | C |
| 39 | CodeGen-Mono 16B | Salesforce Research | 16.1B | — | — | 29.3 | — | — | C |
| 40 | SantaCoder | BigCode (Hugging Face) | 1.1B | — | — | 18.0 | — | — | C |
| 41 | InCoder-6.7B | Facebook AI Research / UC Berkeley | 6.7B | — | — | 15.2 | — | — | C |
| 42 | PolyCoder-2.7B | Carnegie Mellon University | 2.7B | — | — | 5.6 | — | — | C |
| 43 | Granite-8B-Code-Instruct-4K | IBM Research | 8B | — | — | — | — | 42.2 | C |
| 44 | Granite-3B-Code-Base | IBM Research | 3B | — | — | — | — | 36.0 | C |
The complete structured dataset powering this leaderboard is available for download.
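For working with it programmatically, a minimal sketch assuming a CSV export with one row per model and column names matching the table above (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical file name; columns are assumed to mirror the table above,
# with "—" marking benchmarks a model does not report.
df = pd.read_csv("code_model_leaderboard.csv", na_values="—")

# Example query: the LiveCodeBench frontier, highest score first.
frontier = (
    df.dropna(subset=["LiveCodeBench"])
      .sort_values("LiveCodeBench", ascending=False)
      .head(5)
)
print(frontier[["Model", "Params", "LiveCodeBench", "Tier"]])
```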