Pier Hybrid: Routing Between Two Models, Turn by Turn

A couple of weeks ago I wrote about pier-code, a terminal coding agent built around Sarvam AI as its India-native default. The lesson that post kept circling back to was that a coding agent is mostly harness: the model is the engine, but the tools, the prompts, and the loop that decides what to do next are the car. Get the harness right and a cheap model punches well above its price.

Two weeks of living inside that harness surfaced the next obvious lever. pier-code launched on a single Sarvam model at a time: you picked the fast 30B or the heavy 105B and lived with the trade-off for the whole session. But a coding session isn't one kind of work. It's a long sequence of turns, and those turns are wildly uneven in difficulty. "Summarize this file," "what does this error mean," and "rename this variable across three files" do not deserve the same engine as "refactor this module without breaking the public API." Paying 105B prices for the easy 90% is waste. Paying 30B quality for the hard 10% is a face-plant.

So the question stopped being which Sarvam model and became why pick one at all. That's pier-hybrid, shipping in v0.2.2: one agent that routes between Sarvam 30B and 105B turn by turn, according to how hard the next turn actually is.

This is the technical post. I'll walk through four things: how the two single-model harnesses were built, the prior art this design stands on (it's a well-studied idea, and I want to be honest about that), the mechanism itself (the router, the backbone, and the caching), and an explicit note on what I'm not yet claiming, because the hybrid's own numbers aren't benchmarked yet.

First, two harnesses

Before there could be a hybrid, there had to be two good single-model harnesses, because the two Sarvam models want to be prompted differently. They're not the same model at different sizes; they have different sweet spots, and the harness has to respect that.

The 30B is the fast workhorse. It's a mixture-of-experts model with only ~2.4B active parameters per token, a 64K context window, and Sarvam positions it for standard conversations, Q&A, and high-throughput workloads. In a coding agent, that maps cleanly to the inner loop: classifying what to do next, summarizing files and logs, interpreting test output, answering questions about the repo, and making small localized edits. Its harness defaults are tuned for responsiveness: reasoning mode off or low for the fast paths, low temperature for code, tight max_tokens. You don't hand it a long philosophical system prompt; you give it a crisp role, hard constraints, and an output contract.

The 105B is the deep worker. It's also MoE, ~10.3B active parameters, a 128K context window, and Sarvam explicitly positions it for complex multi-step reasoning, code generation, long-context analysis, and agentic tool use, with quality over latency. Its harness leans the other way: reasoning_effort turned up for planning, editing, and debugging; generous token budgets so reasoning doesn't eat the answer; structured outputs for tool control. This is the model you want for multi-file refactors, hard debugging, and architecture decisions.
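The contrast between the two harnesses can be sketched as a pair of per-model profiles. This is an illustration only, not pier-code's actual configuration: the class name, the model ID strings, the exact context token counts, the temperatures, and the token budgets are all assumptions. Only the 64K/128K windows and the general direction (low reasoning and tight budgets versus high reasoning and generous budgets) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessProfile:
    """Hypothetical per-model harness defaults; values are illustrative."""
    model: str              # placeholder model ID, not a confirmed API name
    context_window: int     # tokens; 64K / 128K as stated above (rounded)
    reasoning_effort: str   # "low" for fast paths, "high" for deep work
    temperature: float      # kept low for code on both models (assumed values)
    max_tokens: int         # output budget per turn (assumed values)


FAST_30B = HarnessProfile("sarvam-30b", 64_000, "low", 0.1, 1_024)
DEEP_105B = HarnessProfile("sarvam-105b", 128_000, "high", 0.2, 8_192)


def profile_for(escalated: bool) -> HarnessProfile:
    """Pick the deep profile for escalated turns, the fast one otherwise."""
    return DEEP_105B if escalated else FAST_30B
```

Keeping the defaults in one immutable record per model means a routing decision only has to select a profile, rather than adjust individual knobs turn by turn.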

Here's the part that made the hybrid feel inevitable rather than clever: Sarvam's own docs already describe this division of labor. The 30B guidance literally proposes a "30B controller + 105B deep worker" architecture, and the recommended 30B tool-router schema includes an escalate action and a handoff_prompt field: a compact instruction the small model writes for the big one when it decides a task is over its head. pier-hybrid is a concrete realization of exactly that suggested pattern. I didn't invent the shape; I built the machine the docs were gesturing at.

Property	Sarvam 30B	Sarvam 105B
Active params (MoE)	~2.4B	~10.3B
Context window	64K	128K
Sarvam's positioning	fast chat, Q&A, throughput	complex reasoning, agentic coding
Role in pier-hybrid	router + inner loop	deep worker on escalation
Default reasoning	off / low	medium / high

Has this been done before?

Yes, and pretending otherwise would be dishonest. Pairing a weaker, cheaper model with a stronger, more expensive one to get the cost of the former and (most of) the quality of the latter is one of the better-studied ideas in applied LLM systems. pier-hybrid borrows directly from this lineage. What's worth saying clearly is which idea it borrows and how the agentic, turn-by-turn setting changes it.

LLM cascades. The foundational version is FrugalGPT (Chen, Zaharia, Zou, Stanford, 2023). It queries a sequence of models from cheapest to most expensive and stops early when a learned scorer judges the cheap answer reliable enough, so expensive models only fire on hard queries. The paper reports matching the performance of the best individual model (GPT-4, in their study) with up to 98% cost reduction, or improving accuracy by ~4% at the same cost. This is the single closest ancestor of pier-hybrid: the cheap model gates, the expensive model is the fallback.
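The score-and-stop shape can be sketched in a few lines. This is a generic cascade under simplifying assumptions, not FrugalGPT's implementation: the `Model` and `Scorer` types are hypothetical callables, and the scorer here is any function returning a reliability score, where the paper trains a dedicated one.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]          # query -> answer
Scorer = Callable[[str, str], float]  # (query, answer) -> reliability score


def cascade(
    query: str,
    models: Sequence[Tuple[str, Model]],  # ordered cheapest to most expensive
    scorer: Scorer,
    threshold: float,
) -> Tuple[str, str]:
    """Return (model_name, answer) from the first model whose answer scores
    at or above the threshold; the last model's answer is always accepted."""
    if not models:
        raise ValueError("need at least one model")
    for name, model in models[:-1]:
        answer = model(query)
        if scorer(query, answer) >= threshold:
            return name, answer  # cheap answer judged reliable: stop early
    last_name, last_model = models[-1]
    return last_name, last_model(query)
```

Note the cost structure this implies: a hard query pays for every model it passes through, so the savings depend on most queries stopping early.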

Learned routers. Instead of running the cheap model first and scoring its output, you can predict up front which model a query needs. RouteLLM (Ong et al., LMSYS, 2024) trains routers on human preference data to estimate the probability that the strong model beats the weak one, then sends any query whose estimate falls below a tunable threshold to the weak model. They report maintaining 95% of GPT-4's performance with over 2× cost reduction, and over 85% fewer strong-model calls on MT-Bench in some configurations. Hybrid LLM (Ding et al., Microsoft, ICLR 2024) routes by predicted query difficulty against a tunable quality target, reporting up to 40% fewer large-model calls with no drop in quality.
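The threshold mechanic is simple enough to show directly. In this sketch the win probabilities are assumed to come from some already-trained router; the function names are hypothetical and nothing here reproduces RouteLLM's or Hybrid LLM's actual code.

```python
from typing import Iterable


def route(p_strong_wins: float, threshold: float) -> str:
    """Send the query to the strong model only when the predicted
    probability that it beats the weak model reaches the threshold."""
    return "strong" if p_strong_wins >= threshold else "weak"


def strong_call_fraction(scores: Iterable[float], threshold: float) -> float:
    """Fraction of queries that would go to the strong model at a threshold."""
    scores = list(scores)
    if not scores:
        return 0.0
    strong = sum(1 for p in scores if route(p, threshold) == "strong")
    return strong / len(scores)
```

Sweeping the threshold over a held-out set of scores is how the cost/quality trade-off gets tuned: raising it sends fewer queries to the strong model, lowering it sends more.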

Speculative decoding. A different but related weak+strong pattern lives one level down, at token generation. A small draft model proposes several tokens; the large target model verifies them in a single parallel pass, with a sampling scheme that provably preserves the target's output distribution. Leviathan et al., 2023 (Google) report 2–3× speedups; Chen et al., 2023 (DeepMind) report 2–2.5× on a 70B model. pier-hybrid doesn't do this (it routes whole turns, not tokens), but the draft-then-verify intuition (let the cheap model do the bulk, call the expensive one only where it matters) is the same family.

Ensembling and self-verification. Mixture-of-Agents (Wang et al., Together AI, 2024) layers several models so each proposes and an aggregator synthesizes; using only open models it reports 65.1% on AlpacaEval 2.0, above GPT-4 Omni's 57.5%. AutoMix (Madaan, Aggarwal et al., 2023) has the small model self-verify its own answer and escalate to a larger one through a meta-verifier and a router, reporting over 50% cost reduction at comparable performance.

Prior work	Mechanism	Published result
FrugalGPT	cheap→expensive cascade, score-and-stop	up to 98% cost cut matching GPT-4
RouteLLM	learned router on preference data	95% of GPT-4 quality at >2× lower cost
Hybrid LLM	route by predicted difficulty	up to 40% fewer large-model calls, no quality drop
Speculative decoding	draft model proposes, target verifies	2–3× / 2–2.5× decode speedup
Mixture-of-Agents	layered ensemble + aggregator	65.1% AlpacaEval 2.0 (open models)
AutoMix	small-model self-verify + escalate	>50% cost reduction

Those numbers are theirs, on their benchmarks; I'm citing them as the design's intellectual backing, not as pier-hybrid's results. What none of these target is a stateful, agentic coding loop: most route a single self-contained query and are done. A coding agent is a long sequence of dependent turns sharing one evolving task. Routing in that setting is harder, because the moment you switch models mid-task you risk losing the thread. Solving that is what pier-hybrid actually had to build.

How pier-hybrid works

Three pieces: a shared per-task backbone, a 30B router that runs every turn, and per-model prefix caching that makes switching cheap.

The backbone

The reason naive model-switching breaks an agent is state. If the 30B has been working a task for six turns and you suddenly hand turn seven to the 105B, the 105B has no idea what just happened: what the goal was, which files were already inspected, what was tried, what failed.

So every task in pier-hybrid is anchored to a backbone: a single persistent, structured state object that both models read from and write to. It holds the task goal, the relevant files and symbols, decisions already made, commands run and their results, errors seen, and open questions. It is the canonical record of the task, independent of which model is driving on any given turn.
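As a rough picture of what such a state object might look like, here is a minimal sketch. The field list mirrors the one above, but the class names, field names, and JSON serialization are assumptions made for illustration, not pier-hybrid's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class CommandRecord:
    """One command that was run, with a short summary of its result."""
    command: str
    exit_code: int
    summary: str


@dataclass
class Backbone:
    """Hypothetical per-task state shared by both models."""
    goal: str
    files: List[str] = field(default_factory=list)
    symbols: List[str] = field(default_factory=list)
    decisions: List[str] = field(default_factory=list)
    commands: List[CommandRecord] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Serialize deterministically so both models see the same record."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

The point of the explicit structure is that a model's turn ends by appending to these fields, so the next turn can be served by either model from `to_prompt()` alone.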

This is what makes routing safe. A turn isn't "continue this chat history with whichever model"; it's "given this backbone, produce the next action." Either model can pick up the task at any turn because the backbone, not the model's private conversation, is the source of truth. When the 105B finishes a hard turn, what it really did was update the backbone; the 30B reads that update on the next turn and carries on.

The 30B router

Every turn starts on the 30B. Following the router pattern from Sarvam's docs, it classifies the next action against the backbone (inspect, answer, summarize, make a small edit, run a test) and emits a structured decision. If the action is within its competence, it just does it. That's the common case, and it's where the economics come from: the cheap, fast model handles the cheap, frequent turns.

When the 30B judges the next step genuinely hard (a multi-file refactor, a subtle bug, an architecture call), it doesn't attempt it. It emits an escalation: it writes a compact handoff_prompt (exactly the field the docs suggest) describing what the 105B needs to do, and routing hands that turn to the 105B. The 105B does the deep work against the same backbone, writes its result back, and control returns to the 30B for the next turn.
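One turn of that loop might look like the following sketch. The `escalate` action and the `handoff_prompt` field are the ones named above; everything else (the spelling of the local action names, the JSON wire format, the function signatures, and the model callables) is a hypothetical stand-in rather than pier-hybrid's real interface.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Assumed spellings for the actions the 30B handles itself.
LOCAL_ACTIONS = {"inspect", "answer", "summarize", "small_edit", "run_test"}


@dataclass
class RouterDecision:
    action: str                           # a local action or "escalate"
    handoff_prompt: Optional[str] = None  # required only when escalating


def parse_decision(raw: str) -> RouterDecision:
    """Validate the router's structured output before acting on it."""
    data = json.loads(raw)
    action = data.get("action")
    if action == "escalate":
        handoff = data.get("handoff_prompt")
        if not handoff:
            raise ValueError("escalate requires a non-empty handoff_prompt")
        return RouterDecision("escalate", handoff)
    if action not in LOCAL_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return RouterDecision(action)


def run_turn(
    backbone: str,
    router: Callable[[str], str],      # 30B: backbone -> JSON decision
    small: Callable[[str, str], str],  # 30B: (backbone, action) -> result
    deep: Callable[[str, str], str],   # 105B: (backbone, handoff) -> result
) -> Tuple[str, str]:
    """Run one turn and report which model did the work."""
    decision = parse_decision(router(backbone))
    if decision.action == "escalate":
        return "105B", deep(backbone, decision.handoff_prompt)
    return "30B", small(backbone, decision.action)
```

Rejecting an escalation that lacks a handoff prompt is a deliberate choice in this sketch: the deep model should never be invoked without an explicit statement of what it is being asked to do.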

The cheap model is the gate. That's the FrugalGPT cascade shape, adapted to an agent: rather than scoring a finished answer, the 30B scores the upcoming action and decides whether it can be trusted to do it itself.

Per-model prefix caching

Switching models would be expensive if every escalation re-fed the entire task context from scratch. Reasoning models pay a real prefill cost on long prefixes, and a coding task's stable context (the system prompt, the backbone, the relevant repo slice) is large and mostly unchanging across turns.

So pier-hybrid keeps a separate prefix/KV cache of that stable prefix for each model. The 30B has its cached view; the 105B has its own. When a turn escalates to the 105B and then returns to the 30B, neither model re-pays prefill on the shared prefix; each resumes against its warm cache, and only the genuinely new tokens for that turn are processed. This maps directly onto Sarvam's cached-input pricing tier, where cached input tokens are billed roughly 40% below fresh input (the 105B card lists ₹4 fresh vs ₹2.5 cached; the 30B, ₹2.5 vs ₹1.5, all per 1M tokens). Structuring the backbone as a stable, cacheable prefix is what makes turn-by-turn routing economical rather than a hidden tax.
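To make the pricing concrete, here is the input-side arithmetic as a small sketch. The per-1M-token prices are the ones quoted above; the token counts in the usage note are invented for illustration, and the model assumes the whole stable prefix is billed at the cached rate on a cache hit, which is a simplification.

```python
# Input prices in rupees per 1M tokens, as quoted above.
PRICES = {
    "30B": {"fresh": 2.5, "cached": 1.5},
    "105B": {"fresh": 4.0, "cached": 2.5},
}


def input_cost(model: str, prefix_tokens: int, new_tokens: int,
               cache_hit: bool) -> float:
    """Rupee cost of one turn's input tokens.

    The stable prefix is billed at the cached rate on a cache hit;
    the turn's new tokens are always billed as fresh input.
    """
    rates = PRICES[model]
    prefix_rate = rates["cached"] if cache_hit else rates["fresh"]
    total = prefix_tokens * prefix_rate + new_tokens * rates["fresh"]
    return total / 1_000_000
```

With a hypothetical 40,000-token prefix and 2,000 new tokens, an escalated 105B turn costs about ₹0.168 of input on a cold cache and about ₹0.108 on a warm one, roughly 36% less. Output and reasoning tokens are not modeled here.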

Why this should work, and what I'm not claiming yet

The case for pier-hybrid is the same case the literature has made repeatedly: in a workload where most items are easy and a few are hard, routing the easy majority to a cheap model and reserving the expensive model for the hard minority recovers most of the quality at a fraction of the cost. The cited work reports savings ranging from up to 40% fewer large-model calls (Hybrid LLM) to up to 98% lower cost (FrugalGPT), depending on the setup. A coding session is an almost ideal instance of that workload shape: lots of inspect/summarize/small-edit turns, comparatively few hard ones.
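A back-of-the-envelope model shows how the escalation rate drives that fraction. This is not a measurement: it assumes every turn runs the 30B once, escalated turns additionally run the 105B on the same number of input tokens, and it ignores output tokens, reasoning tokens, and caching. The default prices are the fresh-input rates quoted earlier; the function names are hypothetical.

```python
def relative_input_cost(escalation_rate: float,
                        small_price: float = 2.5,
                        large_price: float = 4.0) -> float:
    """Hybrid input cost as a fraction of running the large model on every turn.

    Every turn pays the small model; a fraction of turns also pays the large one.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return (small_price + escalation_rate * large_price) / large_price


def break_even_rate(small_price: float = 2.5,
                    large_price: float = 4.0) -> float:
    """Escalation rate at which the hybrid costs the same as large-only."""
    return (large_price - small_price) / large_price
```

Under these assumptions, escalating 10% of turns costs about 72.5% of 105B-only on input, and the hybrid stops saving anything once more than 37.5% of turns escalate. Real sessions will differ, which is exactly why the measured escalation rate matters.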

But I want to be as honest here as I was in the last post, so let me state the limit plainly: I have not yet formally benchmarked pier-hybrid's own cost or quality. I'm not going to borrow FrugalGPT's 98% or Hybrid LLM's 40% and quietly imply they're mine; they aren't. Those numbers are on different models, different tasks, and different benchmarks. What I have is a sound design with strong prior backing and a build that runs; what I don't have yet is a clean measurement of how often pier-hybrid escalates on real coding sessions, how much that saves against 105B-only, and whether routing ever costs quality on the turns the 30B chose to keep. Getting those numbers right, on agentic, repo-level work, not toy queries, is the next piece of work, and I'll publish them when they're real.

There are open risks I already know about. The 30B can misjudge a turn and try something it should have escalated. The handoff_prompt is a compression of context, and compression can drop something the 105B needed. The escalation threshold is a knob, and the right setting almost certainly varies by repo and task. These are exactly the things real usage will expose, which is why v0.2.2 ships with the hybrid available rather than waiting for perfect tuning.
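The first and third risks are two sides of one trade-off, and they can be tracked with a simple tally once turns have been labeled after the fact. The sketch below assumes such labels exist (someone or something judged whether each turn really needed the 105B); the function and key names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def routing_error_rates(turns: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Summarize misrouted turns.

    Each turn is (escalated, needed_escalation), where the second value
    is a judgment made after the fact.
    """
    turns = list(turns)
    hard = sum(1 for _, needed in turns if needed)
    easy = len(turns) - hard
    missed = sum(1 for esc, needed in turns if needed and not esc)
    needless = sum(1 for esc, needed in turns if esc and not needed)
    return {
        # hard turns the 30B kept for itself
        "missed_escalation_rate": missed / hard if hard else 0.0,
        # easy turns that were sent to the 105B anyway
        "needless_escalation_rate": needless / easy if easy else 0.0,
    }
```

Missed escalations cost quality and needless ones cost money, so moving the threshold trades one rate against the other.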

Try it

pier-hybrid is live in v0.2.2. If you're already on pier-code, you can flip to the hybrid and let one agent decide, turn by turn, which Sarvam model earns the work. If you're new, you can request access at piercode.com, create an account, and install the CLI.

What I most want back is the routing data I don't have yet: which defaults felt right, where the 30B should have escalated and didn't, where it escalated and didn't need to. That feedback is how the threshold gets tuned and how the honest benchmark eventually gets written.

This is still the same bet as before: that a sovereign, home-grown Indian model belongs in developers' terminals now, not someday. It just has a sharper harness around it. Using both Sarvam models well, instead of choosing one and compromising, is a small step toward making that bet pay off sooner.