Benchmark engine for AI agents

Shadow-run your production traffic against every model on the market.

Find cheaper, faster, more reliable paths for your AI agents — without sacrificing quality.

3-line SDK · No code changes · Any provider
Shadow run · extract-invoice
production traffic · 4 challengers
Running
00:01 · Production · gpt-4o · $12.40/1K · accuracy 94.2%
00:03 · Shadow · claude-sonnet-4 · $4.80/1K · accuracy 96.1%
00:05 · Shadow · gemini-2.0-flash · $1.20/1K · accuracy 91.4%
00:07 · Shadow · deepseek-v3 · $0.90/1K · accuracy 87.3%
Found a better path
Best accuracy: sonnet-4 — 96.1% (+2%)
Cheapest match: gemini-2.0 — 91.4% at $1.20/1K
Cost savings: up to 61% vs. current
⚠ gpt-4o accuracy dropped 2.1% vs. last run (Jan 15) — possible provider drift
The problem
Model decisions shouldn't be based on vibes.
Most teams pick models from blog posts and gut feelings. Subtext replaces hunches with production evidence.
Without Subtext
  • "We use GPT-4o because that's what we started with"
  • You read a blog post and test 5 cherry-picked examples
  • You ship the switch and pray nothing breaks
  • A silent model update degrades your accuracy
  • You're overpaying but can't prove it
With Subtext
  • Every task is shadow-tested on your chosen challengers
  • Cost, quality, latency, reliability — scored on your data
  • Hundreds of real traces, not 5 notebook examples
  • Switch with evidence when the data says it's safe
  • Instant alerts the moment a model regresses
How it works
Three lines of code. Zero risk.
Subtext plugs into your existing stack in minutes. No proxy. No rewrite. Your production path stays untouched.
Step 01
Drop in the SDK
Wrap your LLM client with Subtext. Pick your baseline, choose your challengers, and set what you want to optimize for — cost, quality, latency, reliability, or all of them.
// 3 lines. That's it.
import { Subtext } from '@subtext/sdk'

const subtext = new Subtext({
  apiKey: 'sk_live_...',
  baseline: 'gpt-4o',
  challengers: [
    'claude-sonnet-4',
    'gemini-2.0-flash',
    'deepseek-v3'
  ],
  optimize: ['cost', 'quality', 'latency']
})

// Shadow runs start automatically.
Step 02
Subtext shadow-tests in production
Every real task your agents handle gets replayed against your challengers in the background. Not synthetic benchmarks — your actual production traffic, scored across every dimension you care about.
Production: gpt-4o
Shadow 1: sonnet-4 · Shadow 2: gemini · Shadow 3: deepseek
↳ Production path untouched. Shadows run async. Users never know.
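The fan-out above can be sketched in a few lines. This is a minimal illustration of the shadow-run pattern, not Subtext's actual internals: the production call returns to the user immediately, while challenger calls are fired asynchronously and recorded in the background. `callModel`, `handleRequest`, and the `record` callback are hypothetical names.

```typescript
type ModelResult = { model: string; output: string; latencyMs: number };

// Stand-in for a real provider call (mocked so the sketch is self-contained).
async function callModel(model: string, prompt: string): Promise<ModelResult> {
  const start = Date.now();
  const output = `${model}:${prompt.length}`; // mock response
  return { model, output, latencyMs: Date.now() - start };
}

async function handleRequest(
  prompt: string,
  baseline: string,
  challengers: string[],
  record: (r: ModelResult) => void
): Promise<ModelResult> {
  // 1. Serve production traffic from the baseline, exactly as before.
  const production = await callModel(baseline, prompt);
  record(production);

  // 2. Replay the same task against every challenger in the background.
  //    Fire-and-forget: shadow failures are recorded, never surfaced to users.
  void Promise.allSettled(
    challengers.map((m) => callModel(m, prompt).then(record))
  );

  return production;
}
```

The key property is that nothing in step 2 sits between the user and the baseline response: shadow latency and shadow errors cannot affect the production path.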
Step 03
Review evidence. Ship with proof.
When Subtext finds a better option — cheaper, faster, more accurate, or all three — it opens a change request with full evidence. You review the data and approve, or dismiss with one click.
CR-014: Switch tool-call routing
Quality: 96 (+2 vs baseline)
Cost: $4.80/1K (-61%)
Latency: 0.9s (-25%)
Reliability: 99.8% (+0.2%)
Traces: 500 tested · Pass rate: 99.2%
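The decision behind a change request like CR-014 can be sketched as a simple comparison: accept only challengers whose aggregate accuracy matches the baseline (within a tolerance) at lower cost. The `Scorecard` type, field names, and the 0.5-point tolerance are illustrative assumptions, not Subtext's actual API.

```typescript
type Scorecard = {
  model: string;
  accuracy: number;   // pass rate on replayed traces, 0-100
  costPer1k: number;  // USD per 1K tasks
  latencyS: number;   // p50 latency in seconds
};

function findBetterPaths(
  baseline: Scorecard,
  challengers: Scorecard[],
  minAccuracyDelta = -0.5 // tolerate at most a 0.5-point accuracy drop
): Scorecard[] {
  return challengers
    .filter(
      (c) =>
        c.accuracy - baseline.accuracy >= minAccuracyDelta &&
        c.costPer1k < baseline.costPer1k
    )
    .sort((a, b) => a.costPer1k - b.costPer1k); // cheapest qualifying model first
}
```

Run on the numbers from the shadow run above, only sonnet-4 clears the quality bar while cutting cost; gemini and deepseek are cheaper but fall outside the accuracy tolerance, so no change request would be opened for them.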
61% — average cost reduction found
99.2% — pass rate on recommended switches
5 min — from npm install to first shadow run
“We were spending $40K/month on GPT-4o because nobody wanted to be the person who switched and broke something. Subtext proved the switch was safe with 500 traces of evidence. We saved $24K in month one.”
Sarah Chen
Head of AI · Acme Corp
There's a better model for your workload. Find it.
3-line SDK. Shadow runs start immediately. No credit card.