Skip to content
No models found

Surpassing Frontier Performance with Model Fusion

Brian Thomas · 5/29/2026

Surpassing Frontier Performance with Model Fusion

We've found that synthesizing the results of multiple models can significantly outperform what individual models are capable of. Introducing Model Fusion: a tool for getting these combined results just as easily as calling a single model. It allows you to choose a panel of participant models alongside a judge model responsible for fusing the individual results together.

To understand the benefits of Fusion, we used a deep research benchmark that tests the combination of reasoning, tool usage, and knowledge. We found that:

  1. Panels consistently outperform individual models
  2. Beyond-frontier performance can be achieved with frontier panels
  3. Panels of budget models can surpass frontier models and get close to frontier panel performance

Try Fusion now(opens in new tab) in a chatroom, or check out the API docs(opens in new tab) to build it into your application.

Panels of Models Consistently Outperform on Deep Research

We tested Fusion on 100 deep research tasks from the DRACO benchmark(opens in new tab). Some highlights of what we found:

  • Opus 4.7 + GPT-5.5 fused together materially surpassed any individual frontier model.
  • Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro fused together outscored every solo model, including GPT-5.5.

DRACO benchmark scores for Fusion and solo configurations

TypeModel(s)Score
Fusion · 2x FrontierOpus 4.7 + GPT-5.5, synthesized by Opus 4.766.0%
Fusion · 3x BudgetGemini 3 Flash + Kimi K2.6 + DeepSeek V4 Pro, synthesized by Gemini 3 Flash63.8%
SoloGPT-5.560.4%
SoloClaude Opus 4.750.3%
SoloGemini 3.1 Pro43.5%
SoloGemini 3 Flash42.0%
SoloDeepSeek V4 Pro40.6%
SoloKimi K2.627.3%

We believe this demonstrates the benefits of model neurodiversity, similar to the benefits seen on human team performance. Bringing multiple different perspectives to complex problems yields superior results.

One API call that fuses the best output of multiple models

When you send a prompt to Model Fusion, we dispatch it to a panel of models in parallel, each with web search and web fetch enabled. A judge model reads every panel response and produces structured analysis: consensus points, contradictions, partial coverage, unique insights, blind spots. The calling model then writes the final answer grounded in that analysis.

The whole pipeline runs server-side so it can be called just like you would an individual model.

You can add Fusion as a server tool so the model decides when Fusion is worth the extra cost:


Or use the openrouter/fusion model slug and customize the panel:


We chose DRACO to rigorously test reasoning, tool calling, and succinctness

We needed a benchmark that could tell the difference between a model that sounds thorough and one that actually is. Standard benchmarks test factual recall or reasoning puzzles. They don't test the thing Fusion is built for: researching a complex question, synthesizing multiple sources, and producing a comprehensive, well-cited analysis.

DRACO(opens in new tab) (by Perplexity AI) does exactly this. It contains 100 deep research tasks spanning 10 domains: academic research, finance, law, medicine, technology, UX design, general knowledge, needle-in-a-haystack retrieval, personalized assistance, and product comparison.

Each task comes with a rubric of roughly 39 weighted criteria across four categories:

  • Factual Accuracy (~20 criteria): verifiable claims the response must get right
  • Breadth & Depth (~9 criteria): synthesis quality, trade-off analysis, actionable guidance
  • Presentation Quality (~6 criteria): terminology, formatting, readability
  • Citation Quality (~5 criteria): primary source citations with working references

Criteria can carry negative weights. Meeting a negative criterion means the response contains an error. For example, dangerous medical advice carries a big penalty. These negative criteria also make it hard to game the score by being verbose: a model that confidently states wrong things gets punished.

Each response is graded per-criterion by a judge model, three independent times. We reported the mean normalized score (0-100) across all tasks.

Caveat: Performance of Gemini

A surprising result of the benchmark was the performance of Gemini 3.1 Pro. It scored notably lower than other frontier models. When added to a 3x Frontier panel, it yielded improved results in some domains, but actually decreased performance in other domains. The overall score of the 3x Frontier panel was 65.9%, near identical to the 2x panel of Opus 4.7 and GPT-5.5.

When looking at the breakdown of categories, we observed the 3x panel outperforming in academic, medicine, and product comparison, yet that was countered by underperformance in the other domains.

Domain2x3xΔmean
Academic76.979.3+2.4
Finance63.663.5−0.1
General Knowledge66.663.9−2.7
Law88.086.9−1.1
Medicine76.177.1+0.9
Needle in a Haystack66.965.3−1.6
Personalized Assistance61.159.9−1.3
Product Comparison58.760.3+1.6
Technology58.056.8−1.3
UX Design58.858.9+0.1
Overall66.065.9−0.1

We are working to understand this effect better, as it shows there are situations where one model on a panel can actually hold back the others. We'll be looking for improvements to Fusion to control for these per-domain downsides while capturing the upsides.

Give Fusion a try

API: Send "model": "openrouter/fusion" to directly call Fusion, or add {"type": "openrouter:fusion"} to your tools array to let the model decide when to use it. Fusion docs(opens in new tab)

Chatroom: Open openrouter.ai/fusion(opens in new tab) and pick a preset or build a custom panel.

Notes on our DRACO implementation

We carefully replicated the methodology described in the DRACO paper with a couple exceptions that mean our scores are not directly comparable to the original paper's published results. Our goal was to show relative differences between Fusion and individual models.

Judge model. The DRACO paper used Gemini 3 Pro as the judge. We wanted to get the highest level of discernment possible, so we used Gemini 3.1 Pro Preview with the same low reasoning-effort setting. Different judges produce different absolute scores (the paper reports 10-25 point shifts depending on judge), but preserve the ordering between systems.

No code execution. The paper's baseline gave Opus access to web_search and code_execution. OpenRouter doesn't yet provide code execution as a server tool (coming soon), so it was omitted from our benchmark. This is held constant across every configuration, so it affects absolute scores but not the solo-vs-Fusion comparison.

OpenRouter
© 2026 OpenRouter, Inc

Product

  • Chat
  • Rankings
  • Apps
  • Models
  • Providers
  • Pricing
  • Enterprise
  • Labs

Company

  • About
  • Announcements
  • CareersHiring
  • Privacy
  • Terms of Service
  • Support
  • State of AI
  • Works With OR
  • Data

Developer

  • Documentation
  • API Reference
  • SDK
  • Status

Connect

  • Discord
  • GitHub
  • LinkedIn
  • X
  • YouTube