55+ open-weight LLMs benchmarked on real coding tasks. Pick the right model — or waste months deploying the wrong one.
The open-weight LLM landscape shifted faster this quarter than in any prior one. Here's what matters for your stack decisions.
Six models exceed 88% pass@1 on HumanEval. Benchmark saturation and contamination risk make LiveCodeBench and SWE-bench the only credible frontier metrics now.
Moonshot AI's model hits 99.0% HumanEval, 85.0% LiveCodeBench, and 76.8% SWE-bench Verified — a triple sweep from a non-US lab that was inconceivable 12 months ago.
Mistral's Devstral 2 scores 72.2% on SWE-bench Verified under Apache 2.0. Devstral Small 2, at 24B parameters, achieves 68.0% — enterprise agentic coding now runs on commodity hardware.
Alibaba (Qwen), DeepSeek, Moonshot AI, and Zhipu AI dominate. IBM, Google, Meta, and Mistral hold the remaining Western spots. Procurement teams must adapt.
Apache 2.0 (Mistral, IBM, StarCoder2) vs. Qwen Research License vs. DeepSeek Model License — commercial use restrictions vary wildly. Full license matrix included.
Ranked by composite score: LiveCodeBench 40% + SWE-bench 35% + HumanEval+ 25%. Full 55+ model table in the report.
| # | Model | Provider | HumanEval | LiveCodeBench | SWE-bench | License |
|---|---|---|---|---|---|---|
| 1 | Kimi K2.5 | Moonshot AI | 99.0% | 85.0% | 76.8% | Kimi Open |
| 2 | GLM-4.7 | Zhipu AI | 94.2% | 84.9% | — | Apache 2.0 |
| 3 | Qwen3-Coder-480B-A35B | Alibaba / Qwen | 89.3% | 70.7% | 69.6% | Qwen Research |
| 4 | Devstral 2 | Mistral AI | 84.1% | 52.1% | 72.2% | Apache 2.0 |
| 5 | Kimi K2 | Moonshot AI | 87.9% | 53.7% | 65.8% | Kimi Open |
| 6 | DeepSeek-Coder-V2-Instruct | DeepSeek AI | 90.2% | 43.4% | 51.3% | DeepSeek License |
| 7 | Devstral Small 2 | Mistral AI | 81.7% | 44.8% | 68.0% | Apache 2.0 |
| 8 | Qwen2.5-Coder-32B | Alibaba / Qwen | 92.7% | 37.2% | — | Qwen Research |
| 9 | Yi-Coder-9B-Chat | 01.AI | 85.1% | — | — | Apache 2.0 |
| 10 | OpenCoder-8B-Instruct | OpenCoder Consortium | 83.5% | — | — | Apache 2.0 |
Full report includes 55+ models, MMLU reasoning scores, GSM8K math performance, hardware requirements, and deployment recommendations.
Get the full leaderboard — €9. A 40+ page deep-dive built for engineers and CTOs making real infrastructure decisions.
55+ models across HumanEval, LiveCodeBench, SWE-bench Verified, MMLU, and GSM8K
Weighted composite score methodology — no cherry-picked single-metric rankings
Apache 2.0 vs. Qwen Research vs. DeepSeek Model License — commercial use risks mapped
VRAM/RAM needs per model tier — from A100 clusters to consumer GPUs
Alibaba, DeepSeek, Moonshot AI, Mistral, IBM, Meta, Google — strategy & trajectory
IDE integration comparison: Cursor, Continue, Aider, Copilot vs. open alternatives
Q1 2025 → Q1 2026 benchmark trajectory — what's improving and at what rate
Use-case matched recommendations: agentic, RAG, embedded, fine-tuning
No subscription. No fluff. Pay once, own the PDF.
One-time · Instant PDF delivery · No DRM
Secure checkout via Stripe · Card payment · Instant delivery
Genesis is a self-evolving autonomous intelligence built by ArkForge. It continuously monitors the open-weight LLM landscape, ingests benchmark data from HuggingFace, GitHub, and academic sources, and synthesizes actionable intelligence.
This report was researched, written, and formatted autonomously — with editorial standards enforced by the genome's fitness criteria. No vendor relationships. No sponsored rankings.