Benchmark engine for AI agents

Shadow-run your production traffic against every model on the market.

Find cheaper, faster, more reliable paths for your AI agents — without sacrificing quality.

3-line SDK · No code changes · Any provider
Shadow run · extract-invoice
production traffic · 4 challengers
Running
00:01 · Production · gpt-4o · $12.40/1K · accuracy 94.2%
00:03 · Shadow · claude-sonnet-4 · $4.80/1K · accuracy 96.1%
00:05 · Shadow · gemini-2.0-flash · $1.20/1K · accuracy 91.4%
00:07 · Shadow · deepseek-v3 · $0.90/1K · accuracy 87.3%
Found a better path
Best accuracy: sonnet-4 — 96.1% (+2%)
Cheapest match: gemini-2.0 — 91.4% at $1.20/1K
Cost savings: up to 61% vs. current
⚠ gpt-4o accuracy dropped 2.1% vs. last run (Jan 15) — possible provider drift
The problem
Model decisions shouldn't be based on vibes.
Most teams pick models from blog posts and gut feelings. Subtext replaces hunches with production evidence.
Without Subtext
  • "We use GPT-4o because that's what we started with"
  • You read a blog post and test 5 cherry-picked examples
  • You ship the switch and pray nothing breaks
  • A silent model update degrades your accuracy
  • You're overpaying but can't prove it
With Subtext
  • Every task is shadow-tested on your chosen challengers
  • Cost, quality, latency, reliability — scored on your data
  • Hundreds of real traces, not 5 notebook examples
  • Switch with evidence when the data says it's safe
  • Instant alerts the moment a model regresses
How it works
Three lines of code. Zero risk.
Subtext plugs into your existing stack in minutes. No proxy. No rewrite. Your production path stays untouched.
Step 01
Drop in the SDK
Wrap your LLM client with Subtext. Pick your baseline, choose your challengers, and set what you want to optimize for — cost, quality, latency, reliability, or all of them.
// 3 lines. That's it.
import { Subtext } from '@subtext/sdk'

const subtext = new Subtext({
  apiKey: 'sk_live_...',
  baseline: 'gpt-4o',
  challengers: [
    'claude-sonnet-4',
    'gemini-2.0-flash',
    'deepseek-v3'
  ],
  optimize: ['cost', 'quality', 'latency']
})

// Shadow runs start automatically.
Step 02
Subtext shadow-tests in production
Every real task your agents handle gets replayed against your challengers in the background. Not synthetic benchmarks — your actual production traffic, scored across every dimension you care about.
Production: gpt-4o
Shadow 1: sonnet-4 · Shadow 2: gemini · Shadow 3: deepseek
↳ Production path untouched. Shadows run async. Users never know.
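The fan-out above can be sketched in a few lines. This is a minimal illustration of the shadow-run pattern, not Subtext's actual internals: the production call returns to the user immediately, while challenger calls are fired asynchronously and recorded in the background. `callModel`, `handleRequest`, and the `record` callback are hypothetical names.

```typescript
type ModelResult = { model: string; output: string; latencyMs: number };

// Stand-in for a real provider call (mocked so the sketch is self-contained).
async function callModel(model: string, prompt: string): Promise<ModelResult> {
  const start = Date.now();
  const output = `${model}:${prompt.length}`; // mock response
  return { model, output, latencyMs: Date.now() - start };
}

async function handleRequest(
  prompt: string,
  baseline: string,
  challengers: string[],
  record: (r: ModelResult) => void
): Promise<ModelResult> {
  // 1. Serve production traffic from the baseline, exactly as before.
  const production = await callModel(baseline, prompt);
  record(production);

  // 2. Replay the same task against every challenger in the background.
  //    Fire-and-forget: shadow failures are recorded, never surfaced to users.
  void Promise.allSettled(
    challengers.map((m) => callModel(m, prompt).then(record))
  );

  return production;
}
```

The key property is that nothing in step 2 sits between the user and the baseline response: shadow latency and shadow errors cannot affect the production path.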
Step 03
Review evidence. Ship with proof.
When Subtext finds a better option — cheaper, faster, more accurate, or all three — it opens a change request with full evidence. You review the data and approve, or dismiss with one click.
CR-014: Switch tool-call routing
Quality: 96 (+2 vs baseline)
Cost: $4.80/1K (-61%)
Latency: 0.9s (-25%)
Reliability: 99.8% (+0.2%)
Traces: 500 tested · Pass rate: 99.2%
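The decision behind a change request like CR-014 can be sketched as a simple comparison: accept only challengers whose aggregate accuracy matches the baseline (within a tolerance) at lower cost. The `Scorecard` type, field names, and the 0.5-point tolerance are illustrative assumptions, not Subtext's actual API.

```typescript
type Scorecard = {
  model: string;
  accuracy: number;   // pass rate on replayed traces, 0-100
  costPer1k: number;  // USD per 1K tasks
  latencyS: number;   // p50 latency in seconds
};

function findBetterPaths(
  baseline: Scorecard,
  challengers: Scorecard[],
  minAccuracyDelta = -0.5 // tolerate at most a 0.5-point accuracy drop
): Scorecard[] {
  return challengers
    .filter(
      (c) =>
        c.accuracy - baseline.accuracy >= minAccuracyDelta &&
        c.costPer1k < baseline.costPer1k
    )
    .sort((a, b) => a.costPer1k - b.costPer1k); // cheapest qualifying model first
}
```

Run on the numbers from the shadow run above, only sonnet-4 clears the quality bar while cutting cost; gemini and deepseek are cheaper but fall outside the accuracy tolerance, so no change request would be opened for them.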
61% — average cost reduction found
99.2% — pass rate on recommended switches
5 min — from npm install to first shadow run
“We were spending $40K/month on GPT-4o because nobody wanted to be the person who switched and broke something. Subtext proved the switch was safe with 500 traces of evidence. We saved $24K in month one.”
Sarah Chen
Head of AI · Acme Corp
There's a better model for your workload. Find it.
3-line SDK. Shadow runs start immediately. No credit card.