LLM Landscape 2026: Intelligence Leaderboard and Model Guide
Leaderboard Methodology
The table below ranks one model per provider: the provider's newest or most clearly superior flagship. Scores use the AA (Artificial Analysis) composite capability index, which aggregates performance on MMLU-Pro, SWE-bench, GPQA, ARC-AGI, AIME, and long-context evaluations into a single normalized integer, so no single benchmark serves as the headline number. Context windows are token counts with thousands separators; missing public data appears as "—". Pricing is per million tokens (input / output); unavailable prices also appear as "—".
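The AA index formula itself is not public; as a minimal sketch of how a composite of this kind is typically built, the snippet below normalizes each benchmark against the best score in the cohort and takes a weighted mean. The weights and the per-benchmark scores are hypothetical, chosen only to illustrate the mechanics:

```python
# Sketch of a composite capability index: normalize each benchmark
# score against the cohort-best score, take a weighted mean, and
# round to a single integer. Weights and scores are hypothetical.

BENCH_WEIGHTS = {
    "mmlu_pro": 0.20, "swe_bench": 0.20, "gpqa": 0.20,
    "arc_agi": 0.15, "aime": 0.15, "long_context": 0.10,
}

def composite_index(model_scores: dict, cohort_best: dict) -> int:
    """Weighted mean of per-benchmark scores, each scaled to 0-100
    relative to the best score observed in the cohort."""
    total = 0.0
    for bench, weight in BENCH_WEIGHTS.items():
        normalized = 100 * model_scores[bench] / cohort_best[bench]
        total += weight * normalized
    return round(total)

# Illustrative numbers only (not real benchmark results).
best = {"mmlu_pro": 88, "swe_bench": 75, "gpqa": 92,
        "arc_agi": 60, "aime": 95, "long_context": 90}
model = {"mmlu_pro": 84, "swe_bench": 75, "gpqa": 85,
         "arc_agi": 45, "aime": 90, "long_context": 80}
print(composite_index(model, best))  # -> 92
```

Normalizing to the cohort best before averaging keeps benchmarks with different score ceilings comparable, which is why a composite integer can rank models that lead on different individual evaluations.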
The Intelligence Leaderboard: Top 20 LLMs by Vendor (April 2026)
| Rank | Model | Vendor | AA Capability Index | Context Window (tokens) | Input Cost ($/M tokens) | Output Cost ($/M tokens) | Notes |
|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | Google | 57 | 1,000,000 | $1.25 | $10.00 | Current Google flagship; strongest all-round cross-vendor entry |
| 2 | GPT-5.4 (xhigh) | OpenAI | 57 | 1,050,000 | $2.50 | $15.00 | OpenAI flagship for professional work; strongest OpenAI representative |
| 3 | Claude Opus 4.6 (max) | Anthropic | 53 | 1,000,000 | $5.00 | $25.00 | Anthropic flagship; strongest coding and agentic representative |
| 4 | GLM-5 | Z.ai | 50 | 200,000 | $1.00 | $3.20 | New top-tier entrant; strong agentic engineering positioning |
| 5 | MiMo-V2-Pro | Xiaomi | 49 | 1,000,000 | — | — | Very strong new Chinese contender; pricing not publicly disclosed |
| 6 | Grok 4.20 Beta 0309 | xAI | 48 | 200,000+ | $2.00 | $6.00 | xAI flagship; fast, tool-heavy, agentic model |
| 7 | Qwen3.5 397B A17B | Alibaba | 45 | 262,000 | $0.60 | $3.60 | Open weight. Best current Qwen-family representative; Apache 2.0 |
| 8 | DeepSeek V3.2 | DeepSeek AI | 42 | 128,000 | $0.28 | $0.42 | Open weight. Best-value frontier entry on pure cost-performance |
| 9 | MiniMax-M2.7 | MiniMax | 42 | — | — | — | Strong current entrant; notable capability/value tradeoff |
| 10 | NVIDIA Nemotron 3 Super 120B A12B | NVIDIA | 36 | 1,000,000 | $0.30 | $0.75 | Open weight. Strong open enterprise contender; excellent price/performance |
| 11 | Kimi K2 | Moonshot AI | 26 | 128,000 | $0.57 | $2.40 | Open weight. Inexpensive; strong value entry |
| 12 | Mistral Large 3 | Mistral | 23 | — | — | — | Best current public Mistral flagship |
| 13 | Nova Premier | Amazon | 19 | 1,000,000 | $2.50 | $12.50 | Hyperscaler representative; broad enterprise relevance |
| 14 | ERNIE 4.5 300B A47B | Baidu | 15 | — | — | — | Best verifiable ERNIE-family public entry |
| 15 | Llama 4 Scout | Meta | 14 | 10,000,000 | — | — | Open weight. Context-window outlier; 10M tokens for self-hosting |
| 16 | Command A | Cohere | 13 | 256,000 | $2.50 | $10.00 | Practical enterprise/workflow model; RAG and tool-use focus |
| 17 | Granite 4.0 H Small | IBM | 11 | — | — | — | Open weight. Enterprise and open-governance relevance |
| 18 | Jamba 1.7 Large | AI21 | 11 | — | — | — | Solid enterprise positioning; hybrid SSM/Transformer architecture |
| 19 | Yi-Lightning | 01.AI | — | — | — | — | Vendor-diversity slot; public specs not fully verified |
| 20 | gpt-oss-120B | OpenAI (open-weight) | 33 | — | $0.30 | $0.30 | Open weight. Separate open-weight category; distinct from the GPT-5.4 flagship |
Key Takeaways
Key Performance Metrics
Task-Specific Leaders
| Model | Benchmark Leadership |
|---|---|
| Gemini 3.1 Pro | ARC-AGI-2 & multimodal · AA 57 |
| GPT-5.4 | GPQA Diamond & AIME · AA 57 |
| Claude Opus 4.6 | SWE-bench Verified · coding & agents · AA 53 |
| GLM-5 & MiMo-V2-Pro | New top-10 entrants · agentic & reasoning · AA 49–50 |
Context Window Champions
| Model | Tokens |
|---|---|
| Llama 4 Scout | 10,000,000 |
| GPT-5.4 | 1,050,000 |
| Gemini 3.1 · Claude · MiMo · Nemotron | 1,000,000 |
| Qwen3.5 397B | 262,000 |
| DeepSeek V3.2 · Kimi K2 | 128,000 |
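Window sizes like these translate directly into a go/no-go check for long-document workloads. A minimal sketch, using the token limits from the table above and the common rough heuristic of ~4 characters per token (actual tokenizer counts vary by model and language):

```python
# Rough check of whether a corpus fits a model's context window,
# using the ~4 characters-per-token rule of thumb. This is an
# approximation; real token counts depend on the tokenizer.

WINDOWS = {  # token limits from the table above
    "Llama 4 Scout": 10_000_000,
    "GPT-5.4": 1_050_000,
    "Gemini 3.1 Pro": 1_000_000,
    "Qwen3.5 397B": 262_000,
    "DeepSeek V3.2": 128_000,
}

def fits(corpus_chars: int, model: str, chars_per_token: float = 4.0) -> bool:
    """True if the estimated token count fits inside the window."""
    return corpus_chars / chars_per_token <= WINDOWS[model]

# A ~2 MB codebase (~500K estimated tokens) overflows a 128K window
# but fits comfortably in a 1M+ window.
print(fits(2_000_000, "DeepSeek V3.2"))  # -> False
print(fits(2_000_000, "GPT-5.4"))        # -> True
```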
Cost Efficiency
| Tier | Models | Output $/M |
|---|---|---|
| Best Value | DeepSeek V3.2 · Nemotron | $0.42–$0.75 |
| Mid-Range | Kimi K2 · Qwen3.5 · GLM-5 | $2.40–$3.60 |
| Flagship | Gemini 3.1 · Grok 4.20 | $6.00–$10.00 |
| Premium | GPT-5.4 · Claude Opus 4.6 | $15.00–$25.00 |
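Per-million-token prices only become meaningful once applied to a concrete workload shape. A small sketch that computes the blended dollar cost of one call from the input/output prices in the leaderboard table (the 10K-in / 1K-out workload in the example is illustrative):

```python
# Blended cost per request, from the per-million-token prices
# in the leaderboard table above.

PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "DeepSeek V3.2": (0.28, 0.42),
    "Nemotron 3 Super": (0.30, 0.75),
    "Kimi K2": (0.57, 2.40),
    "GLM-5": (1.00, 3.20),
    "Gemini 3.1 Pro": (1.25, 10.00),
    "GPT-5.4": (2.50, 15.00),
    "Claude Opus 4.6": (5.00, 25.00),
}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call: tokens x per-side price, scaled to millions."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# Example workload: a 10K-token prompt with a 1K-token completion.
for m in ("DeepSeek V3.2", "Gemini 3.1 Pro", "Claude Opus 4.6"):
    print(f"{m}: ${cost_per_request(m, 10_000, 1_000):.4f}")
```

On this workload the premium tier costs roughly 20x the best-value tier per call, which is the gap the tier table above summarizes.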
Specialized Performance Highlights
Speed & Latency
Among the flagship tier, xAI's Grok 4.20 is the standout for fast, tool-heavy agent workflows.
Open-Weight Excellence
| Model | Key Strength |
|---|---|
| Llama 4 Scout | 10M-token context · corpus-scale tasks |
| Qwen3.5 397B | 262K ctx · multilingual · Apache 2.0 |
| DeepSeek V3.2 | $0.28 / $0.42 per M · near-frontier reasoning |
| NVIDIA Nemotron | 1M context · enterprise self-hosting |
| Kimi K2 | $0.57 / $2.40 per M · open-weight value |
Model Selection Guide
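The tradeoffs in the tables above (capability vs. context vs. cost vs. open-weight availability) can be expressed as a simple constraint filter. A minimal sketch over a subset of the leaderboard data; the thresholds in the example call are illustrative, not a recommendation:

```python
# Filter the leaderboard by hard constraints, then pick the model
# with the highest AA index among the survivors. Fields mirror the
# leaderboard table above; the example thresholds are illustrative.

MODELS = [
    # (name, aa_index, context_tokens, output_$_per_M, open_weight)
    ("Gemini 3.1 Pro", 57, 1_000_000, 10.00, False),
    ("GPT-5.4", 57, 1_050_000, 15.00, False),
    ("Claude Opus 4.6", 53, 1_000_000, 25.00, False),
    ("Qwen3.5 397B", 45, 262_000, 3.60, True),
    ("DeepSeek V3.2", 42, 128_000, 0.42, True),
    ("Nemotron 3 Super", 36, 1_000_000, 0.75, True),
    ("Kimi K2", 26, 128_000, 2.40, True),
]

def pick(min_context=0, max_output_price=float("inf"), open_weight_only=False):
    """Highest-capability model satisfying all hard constraints."""
    candidates = [
        m for m in MODELS
        if m[2] >= min_context
        and m[3] <= max_output_price
        and (m[4] or not open_weight_only)
    ]
    return max(candidates, key=lambda m: m[1], default=None)

# Open weight, 1M-token context, under $1/M output -> Nemotron 3 Super.
print(pick(min_context=1_000_000, max_output_price=1.0, open_weight_only=True)[0])
```

With no constraints, the filter degenerates to "take the AA index leader"; each added constraint trades peak capability for fit, which is the practical shape of model selection this guide describes.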
Industry Impact & Future Trends (2026)
The 2026 LLM landscape is defined by task-specific leadership, a more distributed competitive frontier, and a widening gap between flagship and cost-efficient tiers:
Coding & Agents: Claude Opus 4.6 leads SWE-bench Verified and agentic coding, while Grok 4.20 dominates fast, tool-heavy agent workflows.
Context & Long-Horizon: Llama 4 Scout pushes open-weight context to 10M tokens, and the commercial flagships (Gemini 3.1 Pro, GPT-5.4, Claude Opus 4.6) all ship 1M-token windows.
Cost & Open Weight: DeepSeek V3.2, NVIDIA Nemotron 3 Super, and Kimi K2 deliver near-frontier reasoning at commodity prices, with output costs of $0.42 to $2.40 per million tokens.
Conclusion
The April 2026 LLM landscape is defined by task-specific leadership across a more distributed set of vendors than ever before. Google's Gemini 3.1 Pro Preview and OpenAI's GPT-5.4 share the top composite capability score, each excelling across reasoning, coding, and multimodal evaluations. Anthropic's Claude Opus 4.6 leads on coding and agentic benchmarks, while xAI's Grok 4.20 dominates fast, tool-heavy agent workflows. New entrants from Xiaomi (MiMo-V2-Pro) and Z.ai (GLM-5) break into the top 10, reflecting a broadening competitive frontier. Meanwhile, DeepSeek V3.2, NVIDIA Nemotron 3 Super, and Kimi K2 prove that frontier-class reasoning is increasingly accessible at commodity price points, and Meta's Llama 4 Scout pushes open-weight context to an unprecedented 10M tokens.
Strategic Takeaway (2026)
Looking ahead, the defining trends are accelerating: composite reasoning benchmarks (not single-metric scores) are becoming the standard for model evaluation; open-weight models now rival closed flagships across most non-frontier tasks; and the emergence of strong Chinese contenders (MiMo-V2-Pro, GLM-5, MiniMax-M2.7, Qwen3.5) signals that the frontier is genuinely global. The combination of sub-dollar-per-million-token cost (DeepSeek, NVIDIA Nemotron), 10M-token context (Llama 4 Scout), and 1M-token commercial flagships (Gemini 3.1, GPT-5.4, Claude Opus 4.6) makes advanced LLM capability accessible for more applications and organizations than at any prior point.