Benchmarking Your Agent Logic Without the Network Noise

Orchid Team · 5 min read

Your agent takes forty seconds to answer a question, and users are noticing. So you profile it. The flame graph shows what you already suspected. Almost all the time is spent waiting on LLM API calls.

Case closed? Not quite. That forty seconds is actually two numbers tangled together: time spent waiting on providers, which you mostly can't control, and time spent in your own code, which you absolutely can. Your own code covers parsing, retrieval, tool execution, prompt assembly, retry logic, and orchestration overhead between calls. As long as both are mixed in every measurement, you can't tell whether your optimization work is moving the needle or whether the API just had a good day.

This is the dirty secret of benchmarking AI applications. Provider latency varies wildly between runs, between times of day, and between model versions. Any benchmark that includes live API calls is measuring the weather as much as your code.

Replay Mode as a Performance Harness

Orchid is a recording proxy for LLM traffic. In capture mode it records every request and response your application makes. In replay mode it serves those recorded responses back, matched by a semantic hash of the request, with no outbound network calls at all. We built this primarily for zero-cost deterministic testing, but it has a second life as a profiling tool.

Here's the key property. Replayed responses arrive with near-zero latency. When your application runs against replay mode, the provider essentially answers instantly and identically every time. What remains in your measurements is your own code.

The workflow takes minutes.

1. Run your agent once in capture mode to record a representative session.
2. Switch to replay mode with a single environment variable.
3. Run your benchmark as many times as you like. Every run uses identical responses with no network time.

# Record once
export ORCHID_SESSION_ID="perf-baseline"
export ORCHID_MODE="capture"
python run_agent.py

# Benchmark forever
export ORCHID_MODE="replay"
python run_agent.py
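
To turn "benchmark forever" into numbers, you need something that repeats the run and summarizes the timings. Here is a minimal sketch of such a harness. It assumes the `run_agent.py` entry point and the two environment variables shown above; the function names `time_command`, `summarize`, and `benchmark_replay` are our own illustrations and not part of Orchid.

```python
import os
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=10, env_overrides=None):
    """Run `cmd` repeatedly and return each run's wall-clock time in seconds."""
    env = {**os.environ, **(env_overrides or {})}
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, env=env, check=True)  # raises if the agent exits non-zero
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to a few summary statistics."""
    return {
        "runs": len(durations),
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
    }


def benchmark_replay(runs=20):
    """Time the agent against the recorded session (assumed script name)."""
    samples = time_command(
        [sys.executable, "run_agent.py"],
        runs=runs,
        env_overrides={
            "ORCHID_MODE": "replay",
            "ORCHID_SESSION_ID": "perf-baseline",
        },
    )
    return summarize(samples)
```

Calling `benchmark_replay()` returns a dictionary you can print or log. Each sample includes interpreter startup, so for a very fast agent you may prefer to time the agent function in-process instead of launching a subprocess per run.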

What You Can Suddenly See

With provider time and variance removed, questions that were unanswerable become straightforward.

How much latency is mine? Run the same session live and in replay. The difference is provider and network time. The replay number is your overhead, and for many agent frameworks it's larger than teams expect. Serialization, validation layers, and framework abstractions add up across a dozen calls.
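
The subtraction is simple enough to script. The sketch below assumes you have collected wall-clock timings of the same session run live and in replay; `split_latency` is a hypothetical helper name, and the numbers in the usage line are made up for illustration.

```python
import statistics


def split_latency(live_seconds, replay_seconds):
    """Estimate how much of a run is your code versus provider and network.

    Medians are used so that one unusually slow live call does not skew
    the estimate.
    """
    live = statistics.median(live_seconds)
    own = statistics.median(replay_seconds)
    provider = max(live - own, 0.0)  # clamp in case of measurement noise
    return {
        "live": live,
        "own_code": own,
        "provider_and_network": provider,
        "own_share": own / live if live else 0.0,
    }


# Illustrative numbers only, not measurements from a real agent.
breakdown = split_latency([40.0, 42.0, 38.0], [3.0, 3.2, 2.8])
```

Treat the result as an estimate. Live timings still carry provider variance, so collect several live runs before trusting the split.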

Did my refactor actually help? Benchmark in replay mode before and after a change. Because every run replays identical responses, provider variance drops out, and what remains is your code change plus ordinary machine noise that repeated runs average away. You get clean A/B comparisons with sample sizes as large as you have patience for, at zero API cost.
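
With two sets of replay timings in hand, you still need to decide whether the gap is real or just local noise. One option, which is our suggestion and not an Orchid feature, is a bootstrap confidence interval on the difference of medians. The helper name `compare_runs` and its defaults are assumptions for this sketch.

```python
import random
import statistics


def compare_runs(before, after, resamples=2000, seed=0):
    """Bootstrap a 95% interval for median(after) - median(before).

    A negative difference means the change made the run faster.
    """
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    observed = statistics.median(after) - statistics.median(before)
    diffs = []
    for _ in range(resamples):
        # Resample each group with replacement, keeping its original size.
        b = rng.choices(before, k=len(before))
        a = rng.choices(after, k=len(after))
        diffs.append(statistics.median(a) - statistics.median(b))
    diffs.sort()
    low = diffs[int(0.025 * resamples)]
    high = diffs[int(0.975 * resamples) - 1]
    return {
        "median_diff": observed,
        "ci_low": low,
        "ci_high": high,
        # The interval excluding zero suggests the change is not just noise.
        "significant": low > 0 or high < 0,
    }
```

If the interval straddles zero, collect more runs before claiming a win. Replay runs are free, so a larger sample costs only time.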

Where does time go between calls? A typical agent does real work between LLM exchanges: retrieving documents, running tools, assembling the next prompt. In replay mode those gaps dominate the timeline instead of being dwarfed by API waits, which makes them easy to spot and attack.
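
A profiler will show you these gaps, but a few lines of instrumentation are often enough. The sketch below accumulates time per named stage with a context manager. `StageTimer` and the stage names in the usage comment are hypothetical; wrap whatever phases your own agent has.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulate wall-clock time per named stage across a whole run."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised an exception.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return (stage, total_seconds) pairs, slowest stage first."""
        return sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)


# Usage inside an agent loop (function names are placeholders):
#
#     timer = StageTimer()
#     with timer.stage("retrieval"):
#         docs = retrieve(query)
#     with timer.stage("prompt_assembly"):
#         prompt = build_prompt(query, docs)
#     print(timer.report())
```

Run this under replay mode and the stage at the top of the report is your first optimization target.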

Does my code scale with concurrency? Load-testing an agent against live APIs is expensive and rate-limited. Against replay mode, you can hammer your orchestration layer with realistic traffic shapes for free.
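
A load test against replay mode can be as small as the sketch below. It assumes your agent exposes an async entry point, here called `run_session`, which is a placeholder for your own function. The semaphore caps how many sessions are in flight at once.

```python
import asyncio
import time


async def load_test(run_session, total_sessions=100, concurrency=10):
    """Run `total_sessions` agent sessions with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    latencies = []

    async def one_session(index):
        async with semaphore:
            start = time.perf_counter()
            await run_session(index)
            latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one_session(i) for i in range(total_sessions)))
    wall = time.perf_counter() - wall_start
    return {
        "sessions": total_sessions,
        "wall_seconds": wall,
        "throughput_per_s": total_sessions / wall,
        "max_latency": max(latencies),
    }
```

Start it with `asyncio.run(load_test(run_session))`. If your agent is synchronous, wrapping the call in `asyncio.to_thread` (Python 3.9+) is one way to adapt it. Sweep `concurrency` upward and watch where throughput stops improving.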

Measuring the Recorded Side Too

The recording itself carries timing data. Every captured exchange stores its real-world latency, and the proxy aggregates latency, call counts, token usage, and cost grouped by step and provider. So the same session gives you both halves of the picture. The recorded run tells you what the provider side cost you in time and dollars, and replayed runs tell you what your own code costs. The cost half of that story is covered in Know What Every Agent Run Costs.
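
To make "grouped by step and provider" concrete, here is a sketch of that kind of aggregation over a list of exchange records. The record shape and field names (`step`, `provider`, `latency_ms`, `tokens`, `cost_usd`) are assumptions for illustration and not Orchid's actual storage format. The proxy computes these aggregates for you, so this only shows the idea.

```python
from collections import defaultdict


def aggregate(exchanges):
    """Group hypothetical exchange records by (step, provider)."""
    groups = defaultdict(
        lambda: {"calls": 0, "latency_ms": 0.0, "tokens": 0, "cost_usd": 0.0}
    )
    for ex in exchanges:
        group = groups[(ex["step"], ex["provider"])]
        group["calls"] += 1
        group["latency_ms"] += ex["latency_ms"]
        group["tokens"] += ex["tokens"]
        group["cost_usd"] += ex["cost_usd"]
    # Add the per-group average latency alongside the totals.
    return {
        key: {**group, "avg_latency_ms": group["latency_ms"] / group["calls"]}
        for key, group in groups.items()
    }


# Made-up records, only to show the expected input shape.
sample = [
    {"step": "plan", "provider": "a", "latency_ms": 1000.0, "tokens": 500, "cost_usd": 0.25},
    {"step": "plan", "provider": "a", "latency_ms": 3000.0, "tokens": 700, "cost_usd": 0.5},
    {"step": "answer", "provider": "b", "latency_ms": 2000.0, "tokens": 900, "cost_usd": 0.75},
]
by_step = aggregate(sample)
```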

If you work with an AI coding assistant, these aggregates are available to it as well through Orchid's MCP server. "Which step in this job had the highest average latency?" is a one-tool-call question. More on that in Let Your AI Debug Your AI.

A Few Honest Caveats

Replay benchmarking measures your code, not the end-to-end user experience. Real users still wait on real providers, so keep measuring live latency in production. Replay can only serve responses for requests it has already recorded, so if your code change alters prompts significantly, re-record the baseline session first. And because replayed responses return faster than live ones, concurrency patterns that depend on slow responses overlapping may behave differently. Treat replay numbers as a controlled experiment, not a production simulation.

None of this diminishes the core value. Controlled experiments are exactly what performance work needs and exactly what live APIs can't give you.

Find Out What Your Code Costs

If your agent feels slow and you've been assuming it's all API time, replay mode will tell you in an afternoon whether that's true. Record a session, replay it, and look at what's left. The setup is one Docker command, described in Record, Inspect, Replay.

You might find your code is lean and the provider really is the bottleneck. You might find a second of avoidable overhead per call hiding in your orchestration layer. Either way, you'll know instead of guessing. Get started at orchidtrace.xyz.