Let Your AI Debug Your AI: Agent-Driven Triage with MCP

Orchid Team · 5 min read

Here's a familiar scene. Your agent pipeline failed overnight. You open your coding assistant and type "why did my pipeline fail?" The assistant, having no access to what actually happened, responds with a list of generic possibilities. Maybe it was rate limiting. Maybe a malformed tool output. Maybe a prompt issue.

It's guessing, because it can't see the evidence.

Now imagine the same conversation, except your assistant can query every LLM exchange your pipeline made. It lists last night's sessions, finds the failed one, pulls the step outline, searches the payloads for the error, and comes back with the exact exchange that failed, the provider's raw error response, and a suggested fix. That's not a hypothetical. That's what Orchid's MCP server enables today.

What Is the Orchid MCP Server?

The Model Context Protocol (MCP) is an open standard that lets AI assistants call tools exposed by external services. Orchid's proxy ships with a built-in MCP server that exposes its entire recording database as a set of tools your assistant can call.

The server runs on the same port as the visualizer and supports three transports.

Transport	How to connect	When to use it
Streamable HTTP	`POST /v1/mcp` on port 4321	Cursor, VS Code, Gemini CLI, and other remote clients. The current standard.
stdio	`orchid-proxy --mcp`	Claude Desktop and clients that spawn a local binary.
HTTP+SSE	`GET /v1/mcp/sse`	Legacy clients only. Deprecated in the MCP spec.

For most setups, you point your client at the Streamable HTTP endpoint and you're done.

http://localhost:4321/v1/mcp

If your proxy has an API key configured, add an Authorization: Bearer <key> header in your client's MCP settings.
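
Most clients handle the handshake for you, but it can help to see what travels over the wire. The sketch below builds (without sending) the JSON-RPC `initialize` request that an MCP client posts to the Streamable HTTP endpoint, including the optional bearer header. The protocol version string and the client name are assumptions; use whatever your client and server actually negotiate.

```python
import json
import urllib.request

MCP_URL = "http://localhost:4321/v1/mcp"  # endpoint from the article


def build_initialize_request(api_key=None, request_id=1):
    """Build, but do not send, an MCP `initialize` request."""
    headers = {
        "Content-Type": "application/json",
        # A Streamable HTTP server may reply with plain JSON or an SSE stream.
        "Accept": "application/json, text/event-stream",
    }
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    body = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            # Assumption: pick the spec revision your server supports.
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            # Hypothetical client identity, for illustration only.
            "clientInfo": {"name": "triage-sketch", "version": "0.0.1"},
        },
    }
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_initialize_request(api_key="your-secure-api-key")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would send it to a running proxy. In practice an MCP client library does this step, plus the follow-up `notifications/initialized` message, on your behalf.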

The Triage Toolbox

Once connected, your assistant has access to a focused set of tools designed for efficient investigation. A few highlights.

list_sessions lists recent capture sessions with usage and cost summaries. The starting point for "what ran last night?"
list_job_steps returns a lightweight, metadata-only step outline for a job. Your assistant can scan the shape of a run without downloading megabytes of payloads.
get_event_details pulls the full request and response for a single exchange, with body truncation controls so context windows stay manageable.
search_job_payloads searches the actual prompt and completion text within a job for a substring, returning compact snippets around each match.
search_exchanges does the same globally across all sessions, with filters for provider, model, and status.
get_perf_profile aggregates latency, call counts, cost, and token metrics grouped by step and provider.

Notice the design pattern here. The tools are tiered from cheap summaries to full payloads, so an agent can investigate the way a good engineer would. Survey first, then zoom in.
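
Under the hood, each of those tools is invoked with an MCP `tools/call` request. The following is a minimal sketch of what the "survey, then zoom in" pattern might look like as raw JSON-RPC bodies. The tool names come from the list above; the argument names (`limit`, `event_id`, `max_body_chars`) are hypothetical, so check the schemas the server returns from `tools/list` for the real ones.

```python
import json
from itertools import count

_ids = count(1)  # each JSON-RPC request in a session needs a unique id


def tool_call(name, **arguments):
    """Return the JSON-RPC body for an MCP `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Survey first: a cheap summary of recent sessions.
# `limit` is a hypothetical argument name.
survey = tool_call("list_sessions", limit=20)

# Then zoom in: one full exchange, with the body capped so the
# assistant's context window stays manageable.
# `event_id` and `max_body_chars` are hypothetical argument names.
zoom = tool_call("get_event_details", event_id="evt_123", max_body_chars=4000)

print(json.dumps(survey, indent=2))
print(json.dumps(zoom, indent=2))
```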

A Real Triage Session

Let's walk through what this looks like in practice. Suppose a multi-step research pipeline failed and you ask your assistant to investigate.

Step 1. Find the failed run. The assistant calls list_sessions and sees twelve sessions from the last day. One has a failure count. It grabs the session ID.

Step 2. Get the shape of the run. It calls list_job_steps with that ID and receives the step outline. The pipeline ran summarize, extract, and rank steps successfully, then failed on the fourth step, synthesis. Three retries, all failed.

Step 3. Inspect the failure. It calls get_event_details on the first failed exchange. The response body contains the provider's actual error: a context-length-exceeded message, with the token count right there in the payload.

Step 4. Find the cause. Why was the request so large? The assistant calls search_job_payloads for the synthesis prompt and discovers that the extract step's output, which gets injected into the synthesis prompt, was twenty times larger than usual. One source document was a 200-page PDF.

Step 5. Report and fix. The assistant explains the chain of events and proposes a fix in your code: truncate or chunk the extract output before it reaches the synthesis prompt. You review the diff, run the test, and you're done.
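
The fix from Step 5 could take roughly the following shape. This is a sketch, not the pipeline's actual code: both helpers are hypothetical, and they budget by characters as a rough proxy, whereas a real implementation would more likely count tokens with the provider's tokenizer.

```python
def chunk_text(text, max_chars, overlap=0):
    """Split text into pieces of at most max_chars, optionally overlapping."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in [0, max_chars)")
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]


def truncate_middle(text, max_chars, marker="\n[... truncated ...]\n"):
    """Keep the head and tail of text and drop the middle when it is too long."""
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(marker)
    if keep <= 0:
        # No room for the marker: fall back to a plain prefix.
        return text[:max_chars]
    head = keep - keep // 2  # head gets the extra character when keep is odd
    tail = keep // 2
    return text[:head] + marker + (text[-tail:] if tail else "")


extract_output = "x" * 10_000  # stand-in for an oversized extract result
print(len(truncate_middle(extract_output, 2_000)))
print(len(chunk_text(extract_output, 2_000)))
```

Truncation is the simpler option when the synthesis step only needs the gist. Chunking preserves everything, at the cost of one synthesis call per chunk and a final merge.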

Total time, about a minute. No log spelunking, no re-running the pipeline to reproduce the failure, no burning API credits to guess at the problem. The evidence was already recorded, and the assistant knew how to read it.
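
The first three steps of that walkthrough can be sketched as a small function. Everything about the data shapes here is an assumption: the field names (`failure_count`, `status`, `event_id`, `response_body`) and the argument names are hypothetical, and `call_tool` stands in for a real MCP client that has already unwrapped the tool result into plain Python objects. The canned data in `fake_call_tool` is made up purely so the sketch runs offline.

```python
def triage(call_tool):
    """Survey -> outline -> detail, using a caller-supplied call_tool(name, **args)."""
    # Step 1: find a session that recorded failures.
    sessions = call_tool("list_sessions")
    failed = next((s for s in sessions if s.get("failure_count", 0) > 0), None)
    if failed is None:
        return None

    # Step 2: scan the metadata-only outline for the first failed step.
    steps = call_tool("list_job_steps", session_id=failed["id"])
    first_failure = next((s for s in steps if s.get("status") == "failed"), None)
    if first_failure is None:
        return None

    # Step 3: pull the full exchange, with the body capped.
    detail = call_tool(
        "get_event_details",
        event_id=first_failure["event_id"],
        max_body_chars=2000,
    )
    return {
        "session_id": failed["id"],
        "step": first_failure["name"],
        "error": detail["response_body"],
    }


def fake_call_tool(name, **args):
    """Offline stand-in for an MCP client; returns canned, made-up data."""
    canned = {
        "list_sessions": [
            {"id": "s-ok", "failure_count": 0},
            {"id": "s-bad", "failure_count": 3},
        ],
        "list_job_steps": [
            {"name": "summarize", "status": "ok", "event_id": "e1"},
            {"name": "extract", "status": "ok", "event_id": "e2"},
            {"name": "rank", "status": "ok", "event_id": "e3"},
            {"name": "synthesis", "status": "failed", "event_id": "e4"},
        ],
        "get_event_details": {"response_body": "context length exceeded"},
    }
    return canned[name]


report = triage(fake_call_tool)
print(report)
```

In a real session the assistant makes these decisions itself. The point of the sketch is only that each call narrows the search before any large payload is fetched.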

Beyond Debugging

Triage is the headline use case, but a connected assistant can do more with the same tools.

Cost questions. "Which step in my pipeline costs the most?" maps directly to get_perf_profile.
Prompt research. "Find me an example where the model returned valid JSON for this prompt" is a search_exchanges call.
Test fixture management. The export_session and import_session tools let your assistant create and restore replay fixtures, which pairs well with the workflow in Zero-Cost AI Testing.
Pipeline control. Tools like set_active_session let an agent route traffic into a named session during integration tests, which is handy when headers can't be injected directly.
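
As an illustration of the cost question above, here is a sketch of how a result from get_perf_profile might be reduced to "the most expensive step". The row shape (`step`, `provider`, `cost_usd`) is an assumption, not the tool's documented output, and the sample numbers are invented.

```python
def most_expensive_step(profile_rows):
    """Sum cost per step across providers and return (step, total_cost)."""
    totals = {}
    for row in profile_rows:
        totals[row["step"]] = totals.get(row["step"], 0.0) + row["cost_usd"]
    if not totals:
        return None
    return max(totals.items(), key=lambda item: item[1])


# Invented sample rows, grouped by step and provider as the article describes.
sample_profile = [
    {"step": "extract", "provider": "openai", "cost_usd": 0.125},
    {"step": "synthesis", "provider": "openai", "cost_usd": 0.5},
    {"step": "synthesis", "provider": "anthropic", "cost_usd": 0.25},
]

print(most_expensive_step(sample_profile))
```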

Setting It Up

If you don't have the proxy running yet, start here. The full picture of what Orchid does is in Record, Inspect, Replay.

docker run -d \
  --name orchid-proxy \
  -p 4320:4320 \
  -p 4321:4321 \
  -v orchid-data:/data \
  -e ORCHID_API_KEY=your-secure-api-key \
  -e ORCHID_DB_PATH=/data/orchid.db \
  ghcr.io/mario-guerra/orchid-proxy:latest

Then add the MCP endpoint to your client. For Claude Desktop and other stdio-based clients, run the container in interactive mode instead.

docker run -i --rm \
  -v orchid-data:/data \
  -e ORCHID_DB_PATH=/data/orchid.db \
  ghcr.io/mario-guerra/orchid-proxy:latest --mcp
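
For Claude Desktop, that command goes into the `mcpServers` section of `claude_desktop_config.json`. The sketch below generates such an entry from the same docker arguments shown above. The server name `orchid` is an arbitrary label chosen for this example, and the output should be merged into your existing config rather than pasted over it.

```python
import json


def claude_desktop_config(
    image="ghcr.io/mario-guerra/orchid-proxy:latest",
    volume="orchid-data",
):
    """Return an `mcpServers` entry that spawns the proxy in stdio mode."""
    return {
        "mcpServers": {
            # "orchid" is an arbitrary label, not a required name.
            "orchid": {
                "command": "docker",
                "args": [
                    "run", "-i", "--rm",
                    "-v", f"{volume}:/data",
                    "-e", "ORCHID_DB_PATH=/data/orchid.db",
                    image,
                    "--mcp",
                ],
            }
        }
    }


print(json.dumps(claude_desktop_config(), indent=2))
```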

Route some traffic through the proxy in capture mode, then ask your assistant a question about the run. The first time it answers with the actual failing payload instead of a guess, you'll feel the difference.

Your coding assistant is only as good as the evidence it can reach. Give it the recording. Get started at orchidtrace.xyz.