I spent all of last Sunday building a thing, and by the end of the day it had stopped being a toy.
The thing is sarvam-code: a command-line coding agent you install once and live inside, the same way developers now live inside Claude Code or Codex. You point it at your repo, you tell it what you want, and it reads files, writes code, runs commands, and iterates until the task is done. The twist is what's behind the curtain. It's built as a provider-agnostic shell, and the default engine is Sarvam AI, an Indian model provider, with DeepSeek and Claude pluggable for the heavy lifting.
You can request access at sarvam-code.vatsalpandya.com: create an account and you'll be able to install the CLI and start using it in your workflow.
This post is the honest version, and I want to be clear up front about where the optimism comes from. I'll show you the cost numbers, which are genuinely lopsided in Sarvam's favor, and the benchmark numbers, which are not in its favor yet. But I'm building this because I believe in Bharat's trajectory: that a sovereign, home-grown heavyweight model will one day go toe-to-toe with its generation's SOTA, and the work of getting it into developers' hands shouldn't wait until that day arrives. It should help bring it closer.
Why build this at all
The motivation was layered, and it got more interesting the longer I worked on it.
1. Distribution is Sarvam's real problem, not raw capability. Sarvam isn't used by many people yet. For an Indian model provider to compete with global players, the first bottleneck isn't the model. It's getting into people's hands and daily habits. Chat apps are how most people first touched LLMs. But for developers, the new front door is the coding agent: a tool you run dozens of times a day, that sits in your terminal, that becomes muscle memory. If you want Sarvam in front of every engineer in the country, the fastest path is a coding tool they actually want to use. That's the wedge.
2. The economics are absurd, in Sarvam's favor. Sarvam's API pricing right now is extremely competitive. The catch has always been that cheap tokens are worthless without a harness that knows how to use them: the tool-calling loop, the editing primitives, the planning, the recovery from mistakes. Get the harness right and a lot of everyday coding-agent work fits into a developer's budget at a small fraction of frontier-tool cost.
3. It's a great way to actually learn how these things work. There's no better way to understand a coding agent than to build one. Writing the harness yourself, designing the system prompts, figuring out the tool schemas, reverse-engineering why the popular tools behave the way they do: that's the kind of work that teaches you more in a Sunday than a month of reading threads. I wanted to know why Claude Code feels reliable and Codex feels different, and the only way to know is to build something that has to make the same decisions.
The hacky start, and the turn
This began as a hack. The plan for Sunday morning was modest: wire a model API into a loop, give it a couple of tools, see if it could edit a file without face-planting. I expected to spend the day fighting the model.
That's not quite what happened. Once I got the harness right (the editing tools, the way the agent narrates its own plan, the guardrails around shell commands, the planning loop), the agent started behaving like a real coding agent. It followed multi-step tasks, recovered from its own mistakes, and respected the structure I gave it. The cleverness wasn't in any single model. It was in the engineering around it.
The lesson that kept repeating itself: a coding agent is mostly harness. The model is the engine, but the car is everything else: the tools, the prompts, the loop that decides what to do next and how to feed the last step's result into the next one. A good harness makes a cheap model punch well above its price. It does not, however, magically close a benchmark gap. Both of those turned out to be true, and the rest of this post is about holding them at the same time.
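To make "the loop that decides what to do next" concrete, here is a minimal sketch of that kind of loop. It is not sarvam-code's actual implementation: `Harness`, `Step`, and `scripted_model` are hypothetical names, and the model is a scripted stand-in where a real harness would call a provider API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    tool: str            # name of the tool to call, or "done" to stop
    argument: str = ""   # one string argument, to keep the sketch small


@dataclass
class Harness:
    model: Callable[[list[str]], Step]       # stand-in for a provider call
    tools: dict[str, Callable[[str], str]]   # tool name -> implementation
    max_steps: int = 10                      # hard cap so the loop always ends
    transcript: list[str] = field(default_factory=list)

    def run(self, task: str) -> list[str]:
        self.transcript.append(f"task: {task}")
        for _ in range(self.max_steps):
            step = self.model(self.transcript)
            if step.tool == "done":
                break
            tool = self.tools.get(step.tool)
            # An unknown tool becomes an error string the model can react to.
            result = tool(step.argument) if tool else f"error: no tool {step.tool!r}"
            # Feed the result back so the next decision can see it.
            self.transcript.append(f"{step.tool}({step.argument!r}) -> {result}")
        return self.transcript


def scripted_model(transcript: list[str]) -> Step:
    """Hypothetical model: read one file, then stop."""
    if len(transcript) == 1:
        return Step("read_file", "README.md")
    return Step("done")


if __name__ == "__main__":
    files = {"README.md": "# demo"}
    harness = Harness(
        model=scripted_model,
        tools={"read_file": lambda path: files.get(path, "error: missing file")},
    )
    for line in harness.run("summarize the README"):
        print(line)
```

Everything interesting in a real harness hangs off this skeleton: which tools exist, how their results are worded in the transcript, and when the loop gives up.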
The cost frontier
Let's fix a budget ($10, which is about ₹951) and ask a simple question: how many tokens does that buy you, per model? I'm using current listed prices. Sarvam's docs list the 105B at ₹4 input / ₹2.5 cached / ₹16 output and the 30B at ₹2.5 / ₹1.5 / ₹10 per 1M tokens. DeepSeek lists V4 Flash at $0.14 / $0.0028 cached / $0.28 output and V4 Pro at $0.435 / $0.003625 / $0.87. Anthropic lists Opus 4.8 at $5 input / $0.50 cached / $25 output and Sonnet 4.6 at $3 / $0.30 cached / $15 per 1M tokens.
Tokens you can buy with $10 / ₹951
| Model | Input-only | Output-only | 80/20 blend | Cached-input-only |
|---|---|---|---|---|
| Sarvam 30B | 380.4M | 95.1M | 237.8M | 634.0M |
| Sarvam 105B | 237.8M | 59.4M | 148.6M | 380.4M |
| DeepSeek V4 Flash | 71.4M | 35.7M | 59.5M | 3.57B |
| DeepSeek V4 Pro | 23.0M | 11.5M | 19.2M | 2.76B |
| Claude Sonnet 4.6 | 3.33M | 666.7K | 1.85M | 33.3M |
| Claude Opus 4.8 | 2.0M | 400K | 1.11M | 20.0M |
The 80/20 blend column (80% input, 20% output, priced at 0.8 × input price + 0.2 × output price) is the one that matters most, because it roughly mirrors how a coding agent actually spends tokens: lots of repo context going in, a smaller amount of code coming out. On that blend, Sarvam 30B buys roughly 214x the tokens of Claude Opus 4.8 and about 128x those of Sonnet 4.6. That's not a rounding-error difference. It's a different category of economics.
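The arithmetic behind the table is short enough to check yourself. This is a sketch using only the prices quoted above; `PRICES_USD` and `tokens_for_budget` are names I made up for illustration, and the ₹95.1-per-dollar rate is simply what "$10 is about ₹951" implies.

```python
# Prices per 1M tokens as quoted in this post: (input, cached input, output).
INR_PER_USD = 95.1  # assumption: implied by "$10, which is about ₹951"

PRICES_USD = {
    "Sarvam 30B": tuple(p / INR_PER_USD for p in (2.5, 1.5, 10.0)),
    "Sarvam 105B": tuple(p / INR_PER_USD for p in (4.0, 2.5, 16.0)),
    "DeepSeek V4 Flash": (0.14, 0.0028, 0.28),
    "DeepSeek V4 Pro": (0.435, 0.003625, 0.87),
    "Claude Sonnet 4.6": (3.0, 0.30, 15.0),
    "Claude Opus 4.8": (5.0, 0.50, 25.0),
}


def tokens_for_budget(model, budget_usd=10.0, input_share=0.8):
    """Millions of tokens a budget buys at a given input/output mix."""
    price_in, _cached, price_out = PRICES_USD[model]
    # Weighted average price per 1M tokens for the chosen mix.
    blended = input_share * price_in + (1 - input_share) * price_out
    return budget_usd / blended


if __name__ == "__main__":
    opus = tokens_for_budget("Claude Opus 4.8")
    for model in PRICES_USD:
        millions = tokens_for_budget(model)
        print(f"{model:<20} {millions:8.2f}M  ({millions / opus:6.1f}x Opus)")
```

Setting `input_share` to 1.0 or 0.0 gives the input-only and output-only columns, so the same function covers three of the four columns.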
One honest caveat on cost: if your architecture achieves very high prompt-cache reuse (the Claude-Code-style move of resending large, stable repo context on every turn), DeepSeek's cached-input pricing is so low that it becomes unusually strong, crossing billions of cached tokens for $10. That's a real edge for cache-heavy designs, and it's worth keeping in mind when choosing your routing defaults.
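To see how much caching moves the picture, you can fold a cache hit rate into the same blend. This is a rough sketch: the hit rates are hypothetical, `blended_price` and `tokens_for_10_usd` are illustrative names, and it assumes cached and fresh input tokens are billed purely at the listed per-token rates.

```python
# (input, cached input, output) in USD per 1M tokens, as quoted in this post.
# Sarvam's rupee prices are converted at the post's implied ₹95.1 per dollar.
PRICES = {
    "Sarvam 30B": (2.5 / 95.1, 1.5 / 95.1, 10.0 / 95.1),
    "DeepSeek V4 Flash": (0.14, 0.0028, 0.28),
    "Claude Opus 4.8": (5.0, 0.50, 25.0),
}


def blended_price(model, cache_hit_rate, input_share=0.8):
    """USD per 1M tokens when a fraction of input tokens is served from cache."""
    fresh, cached, output = PRICES[model]
    input_price = cache_hit_rate * cached + (1 - cache_hit_rate) * fresh
    return input_share * input_price + (1 - input_share) * output


def tokens_for_10_usd(model, cache_hit_rate):
    """Millions of tokens that $10 buys at the given cache hit rate."""
    return 10.0 / blended_price(model, cache_hit_rate)


if __name__ == "__main__":
    for rate in (0.0, 0.5, 0.9):  # hypothetical hit rates, not measurements
        row = ", ".join(f"{m}: {tokens_for_10_usd(m, rate):.1f}M" for m in PRICES)
        print(f"cache hit {rate:.0%} -> {row}")
```

Under these assumptions, a higher hit rate narrows the gap between DeepSeek and Sarvam on the blend, but the output price increasingly dominates, so the cached-input-only column is an upper bound rather than what a real session buys.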
So on price, the frontier is clear: Sarvam has the cheapest uncached tokens, DeepSeek is the best cost/performance challenger, and Claude is the expensive quality ceiling. Now the uncomfortable half.
The benchmark frontier
Cost is only half the story, and I'm not going to pretend otherwise. For a Claude-Code-style agent, the benchmarks that actually predict real-world usefulness are the agentic, repo-level ones: SWE-bench Verified, SWE-bench Pro, and Terminal-Bench. These test resolving real issues across maintained repositories, not standalone programming puzzles. Here's where the three families currently land.
Coding benchmarks
| Benchmark | Claude Opus 4.8 | DeepSeek-V4 Pro-Max | DeepSeek-V4 Flash-Max | Sarvam 105B |
|---|---|---|---|---|
| SWE-bench Verified | 88.6% | 80.6% | 79.0% | 45.0% |
| SWE-bench Pro | 69.2% | 55.4% | 52.6% | not found |
| Terminal-Bench | 74.6% (v2.1) | 67.9% (v2.0) | 56.9% (v2.0) | not found |
| LiveCodeBench | not found | 93.5% | 91.6% | 71.7% (v6) |
| Codeforces (rating) | not found | 3206 | 3052 | not found |
| HumanEval | n/a | 76.8% (base) | 69.5% (base) | 92.1% (30B) |
Sources: Opus 4.8 summaries report 88.6% SWE-bench Verified, 69.2% SWE-bench Pro, and 74.6% Terminal-Bench 2.1. DeepSeek's V4-Pro-Max card reports 80.6% SWE-bench Verified, 55.4% SWE-bench Pro, 67.9% Terminal-Bench 2.0, 93.5% LiveCodeBench, and a 3206 Codeforces rating. Sarvam's own model card reports 71.7 LiveCodeBench v6 and 45.0 SWE-bench Verified for the 105B.
There's no spin that makes a 45.0 sit next to an 88.6 comfortably. On agentic repo-level coding, the rough ranking is unambiguous:
- Claude Opus 4.8: clear SOTA among these for agentic repo-level coding.
- DeepSeek-V4 Pro-Max: very strong; materially behind Opus on SWE-bench Pro, but far cheaper.
- DeepSeek-V4 Flash-Max: surprisingly close to Pro on some coding benchmarks, weaker on complex agentic workflows.
- Sarvam 105B: strong on standalone coding (LiveCodeBench v6 71.7, comparable to GPT-OSS-120B's 72.3), but not yet competitive with Claude or DeepSeek for repo-level agentic work.
Sarvam 105B has two genuinely different stories. As a standalone coding model, it's respectable. As an autonomous repo-editing agent, the SWE-bench Verified gap to DeepSeek and Claude is large, and Sarvam's own published comparisons show the 105B trailing several frontier models on that benchmark too. I'd rather say that plainly than have a reader discover it after installing.
And here's the thing: that gap doesn't dampen my optimism, it's exactly what fuels it. Every frontier model you see today was once a "respectable standalone, weak on agentic work" line in a table like this one. Bharat is one or two model generations into a journey the incumbents have been on for a decade, and it's already standing in the same arena. The trajectory matters more than today's snapshot, and the trajectory is steep.
What this actually means for sarvam-code
Holding both tables at once leads somewhere more interesting than "Sarvam beats everyone" or "Sarvam can't compete." It leads to a hybrid.
Sarvam 105B is not yet a drop-in replacement for the model behind your hardest autonomous repo edits, multi-file refactors, and long-horizon debugging loops. But it doesn't need to be the only engine. The right design for v1 is a coding-agent shell that's provider-agnostic, with Sarvam as the cheap, India-native default and DeepSeek or Claude pluggable for serious repo-level tasks. Route the work to the model that fits it:
| Task | Best fit |
|---|---|
| Hard implementation, gnarly bug fixing | Claude Opus 4.8 |
| Cost-sensitive coding loops | DeepSeek-V4 Pro |
| Cheap planning, summarization, codebase Q&A | DeepSeek-V4 Flash or Sarvam 105B |
| India / local-language developer workflows | Sarvam 105B |
| Repo indexing, commit summaries, docs, issue triage | Sarvam 105B |
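In code, the routing table above is little more than a lookup with a default and an override. This is a sketch only: the task keys and model identifiers are hypothetical stand-ins, not sarvam-code's real config names.

```python
from __future__ import annotations

# Task categories and models mirror the routing table above. The identifiers
# are hypothetical, not sarvam-code's real config keys or provider model IDs.
ROUTES = {
    "hard_implementation": "claude-opus-4.8",
    "cost_sensitive_loop": "deepseek-v4-pro",
    "planning_and_qa": "sarvam-105b",     # the table also allows DeepSeek-V4 Flash
    "local_language": "sarvam-105b",
    "repo_housekeeping": "sarvam-105b",   # indexing, commit summaries, docs, triage
}
DEFAULT_MODEL = "sarvam-105b"  # the cheap, India-native default


def pick_model(task_kind: str, override: str | None = None) -> str:
    """Return the model for a task kind; an explicit override always wins."""
    if override:
        return override
    return ROUTES.get(task_kind, DEFAULT_MODEL)


if __name__ == "__main__":
    print(pick_model("hard_implementation"))
    print(pick_model("something_unlisted"))
    print(pick_model("local_language", override="deepseek-v4-flash"))
```

Unknown task kinds fall through to the default, which is the point of the design: the cheap model handles everything nobody bothered to route elsewhere.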
The cost-quality frontier, stated honestly: Claude Opus 4.8 is the quality ceiling. DeepSeek-V4 Pro is the best cost/performance challenger. Sarvam 105B is the interesting India-native model that wins decisively on price and is respectable on standalone coding, while the harness and the model both keep improving on agentic work.
That's the right framing for a first version. Not "Claude Code but powered only by Sarvam." Rather: an Anthropic/OpenAI-compatible coding-agent shell where Sarvam is the default low-cost, India-native provider, and the heavyweight models are one config flag away when a task demands them. The distribution wedge for Sarvam doesn't require it to win every benchmark today. It requires it to be the affordable default in a tool people keep open all day, and to keep closing the gap from there.
But make no mistake about the destination. The hybrid is the bridge, not the endgame. The endgame is the day the routing table collapses into a single row, when a sovereign Indian model is the right answer for the hardest repo-level work too, not just the cheap-and-local default. I genuinely believe that day is coming, and that a country that can launch missions to the Moon and Mars on a shoestring can build a heavyweight model that stands level with the world's best. When it does, sarvam-code should already be in a few thousand terminals, ready for it.
Try it
It's early. It's a Sunday project that outgrew its weekend, and I'm opening it up so other developers can put it in their own workflows and tell me where it breaks, where Sarvam is genuinely good enough, and where you'd want to flip to a heavier model.
Request access at sarvam-code.vatsalpandya.com, create an account, and you'll be able to install the CLI.
If you do try it, I'd love to hear what you build, which routing defaults felt right, and where the harness still needs work. That feedback loop is the whole point. Every install is a small bet on Bharat's AI story, and I think it's a bet worth making.