
Cut MCP Round-Trip Overhead by Looping Inside the Tool
May 30, 2026
MCP round-trip overhead comes from every tool call forcing a full model re-invocation over the whole growing context. The fix is to move the loop INTO the tool. One MCP server tool iterates internally and returns a single result, collapsing N round-trips into 1.
Why MCP round-trip overhead is so expensive
Here is the part people miss. A tool call is not expensive because the tool is slow. It is expensive because of the loop around it.
The agent loop runs like this: the model emits a tool call, the harness executes it, the result re-enters the context, and the model re-runs over the entire history. Every turn re-reads everything. That re-reading is the agent tool call cost, not the work the tool did.
So the MCP round-trip overhead scales with two things: how many calls you make, and how big the context already is. Loop 50 times and you pay for the full transcript 50 times.
The numbers are brutal. In a 4-server Claude Code setup, you eat about 7,000 tokens of overhead per message, and heavy setups cross 50,000 before you type a word. Tool definitions alone can hit around 55,000 tokens in a 5-server config.
[Figure: Where the tokens go]
I have watched a naive MCP run balloon to 150,000 tokens on a task that, done right, costs 2,000. That gap is not the model thinking harder. It is the loop.
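A back-of-the-envelope model makes the scaling concrete. The sketch below assumes every turn re-reads a fixed base context plus everything earlier turns appended. The 7,000-token base echoes the per-message overhead quoted above; the 300 tokens added per turn and the function name are illustrative assumptions, not measurements.

```typescript
// Rough cost model (assumption): each turn re-reads the base context
// plus whatever earlier turns appended to the transcript.
function cumulativeInputTokens(
  baseContext: number,
  tokensAddedPerTurn: number,
  turns: number,
): number {
  let total = 0;
  for (let i = 0; i < turns; i++) {
    // The turn at index i pays for the base plus i earlier tool results.
    total += baseContext + i * tokensAddedPerTurn;
  }
  return total;
}

// 50 model-driven round-trips versus one collapsed call.
const looped = cumulativeInputTokens(7000, 300, 50);
const collapsed = cumulativeInputTokens(7000, 300, 1);

console.log(looped); // 717500
console.log(collapsed); // 7000
```

Under these assumptions the looped run pays roughly 100 times the input tokens of the collapsed one, and the term driven by transcript growth rises quadratically with the number of turns.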
The hero pattern: loop inside one MCP tool
Most people register one MCP tool per atomic action, then let the model call it in a loop. Don't. Put the loop inside the tool.
Here is a real handler using the MCP TypeScript SDK. The inputSchema takes an array of items.
The handler iterates server-side and returns ONE consolidated result. N round-trips become 1.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "fetch-many", version: "1.0.0" });

server.registerTool(
  "fetch_all_statuses",
  {
    title: "Fetch all statuses",
    description: "Fetch status for every URL in one call. Loops server-side.",
    inputSchema: {
      // One array in, capped so a single call cannot run unbounded.
      urls: z.array(z.string().url()).min(1).max(200),
    },
  },
  async ({ urls }) => {
    const results: { url: string; status: number | string }[] = [];
    // The loop the model would have driven now runs in-process.
    // The model never sees the 199 intermediate responses.
    for (const url of urls) {
      try {
        const res = await fetch(url, { method: "HEAD" });
        results.push({ url, status: res.status });
      } catch (err) {
        // Record the failure and keep going so one bad URL does not abort the batch.
        results.push({ url, status: `error: ${(err as Error).message}` });
      }
    }
    const downCount = results.filter((r) => r.status !== 200).length;
    return {
      // Only this summary re-enters the model context.
      content: [
        {
          type: "text",
          text: `Checked ${urls.length} URLs. ${downCount} not 200.`,
        },
      ],
      structuredContent: { results, downCount },
    };
  },
);

// Connect over stdio so an MCP client can reach the tool.
await server.connect(new StdioServerTransport());
The model sends one array. The handler runs the in-process loop. The 199 intermediate HTTP responses never touch the context.
That is the whole trick. Only the summary and structuredContent come back, so the model reasons over a few lines instead of a few hundred.
Tip: Return a short content summary for the model to reason over, and stash the full data in structuredContent. The model reads the summary, your downstream code reads the structured payload.
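One way to keep that split honest is to build both halves of the result in a single helper. The sketch below is a hypothetical, dependency-free function (buildToolResult and its maxListed parameter are assumptions for illustration, not part of the MCP SDK) that names only the first few failures in the text summary and leaves the full list in the structured payload.

```typescript
type StatusResult = { url: string; status: number | string };

// Hypothetical helper: short text for the model, full data for downstream code.
function buildToolResult(results: StatusResult[], maxListed = 3) {
  const failures = results.filter((r) => r.status !== 200);
  // Name only the first few failures so the summary stays small.
  const listed = failures
    .slice(0, maxListed)
    .map((r) => `${r.url} (${r.status})`)
    .join(", ");
  const more =
    failures.length > maxListed ? `, and ${failures.length - maxListed} more` : "";
  const detail = failures.length > 0 ? ` Not 200: ${listed}${more}.` : "";
  return {
    content: [
      {
        type: "text" as const,
        text: `Checked ${results.length} URLs. ${failures.length} not 200.${detail}`,
      },
    ],
    // The complete list lives here, outside the text the model reasons over.
    structuredContent: { results, downCount: failures.length },
  };
}

const sample = buildToolResult([
  { url: "https://a.example", status: 200 },
  { url: "https://b.example", status: 503 },
  { url: "https://c.example", status: "error: timeout" },
]);
console.log(sample.content[0].text);
```

Capping the listed failures keeps the summary size roughly constant whether the tool checked 3 URLs or 200.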
If you have never built one of these, start with Building an MCP Server From Scratch. The loop-inside pattern is a one-line change to the handler once the server exists.
Batch many operations into one call
The in-process loop assumes every item runs the same operation. Sometimes you need different operations in one shot. That is the batch MCP tool pattern.
Tools like mcp-batchit expose a single batch_execute that takes an array of operations plus run options, fans them out server-side, and returns one consolidated result. The repo reports a 70-90% cut in operational token overhead.
{
  "operations": [
    { "tool": "create_file", "args": { "path": "a.ts", "body": "..." } },
    { "tool": "create_file", "args": { "path": "b.ts", "body": "..." } },
    { "tool": "create_file", "args": { "path": "c.ts", "body": "..." } }
  ],
  "options": { "maxConcurrent": 4, "stopOnError": false }
}
One call, three writes, server-side concurrency. The model pays for one round-trip instead of three.
A few constraints define where the pattern fits:
- No inter-op data flow. Operation B cannot read operation A's output. They run independently.
- Single downstream server. A batch tool typically fans out to one target, not a mix.
- Best for fan-out. Identical or independent ops where ordering and shared state do not matter.
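To show what "fans them out server-side" can mean in practice, here is a minimal sketch of a batch runner. It is not mcp-batchit's implementation. The batchExecute function, the handler registry, and the in-memory create_file handler are all hypothetical; only the operations and options shapes follow the JSON payload above.

```typescript
type Operation = { tool: string; args: Record<string, unknown> };
type RunOptions = { maxConcurrent: number; stopOnError: boolean };
type OpResult = { tool: string; ok: boolean; value?: unknown; error?: string };
type Handler = (args: Record<string, unknown>) => Promise<unknown>;

// Hypothetical runner: independent ops, bounded concurrency, one result.
async function batchExecute(
  operations: Operation[],
  handlers: Record<string, Handler>,
  options: RunOptions,
): Promise<{ results: OpResult[]; failed: number }> {
  // Every slot starts as "skipped" and is overwritten when its op runs.
  const results: OpResult[] = operations.map((op) => ({
    tool: op.tool,
    ok: false,
    error: "skipped",
  }));
  let next = 0;
  let stopped = false;

  // Each worker pulls the next unclaimed operation until none remain.
  async function worker(): Promise<void> {
    while (!stopped && next < operations.length) {
      const index = next++;
      const op = operations[index];
      try {
        const handler = handlers[op.tool];
        if (!handler) throw new Error(`unknown tool: ${op.tool}`);
        results[index] = { tool: op.tool, ok: true, value: await handler(op.args) };
      } catch (err) {
        results[index] = { tool: op.tool, ok: false, error: (err as Error).message };
        if (options.stopOnError) stopped = true;
      }
    }
  }

  const workerCount = Math.max(1, Math.min(options.maxConcurrent, operations.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  // "failed" counts both errored and skipped operations.
  return { results, failed: results.filter((r) => !r.ok).length };
}

// Demo with an in-memory stand-in for a file-writing tool (assumption).
const files = new Map<string, string>();
const handlers: Record<string, Handler> = {
  create_file: async (args) => {
    files.set(String(args.path), String(args.body));
    return { written: args.path };
  },
};

batchExecute(
  [
    { tool: "create_file", args: { path: "a.ts", body: "..." } },
    { tool: "create_file", args: { path: "b.ts", body: "..." } },
    { tool: "create_file", args: { path: "c.ts", body: "..." } },
  ],
  handlers,
  { maxConcurrent: 4, stopOnError: false },
).then((out) => {
  console.log(`Ran ${out.results.length} operations, ${out.failed} not ok.`);
});
```

Because no operation receives another's output, the runner is free to execute them in any order, which is the same constraint the first bullet above describes.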
Let the model write the loop with code execution
There is a third option, and it is the strongest. Let the model write the loop itself, in code, inside one sandboxed turn.
Anthropic's code execution with MCP exposes tools as a typed code API. The model calls them inside a single execution, runs native loops and filters, and only the values it returns or logs re-enter the context.
// The model writes and runs this inside ONE execution turn.
import { listTickets, closeTicket } from "./servers/support";

const tickets = await listTickets({ status: "resolved" });

// Loop and filter run natively. None of these rows hit the context.
let closed = 0;
for (const t of tickets) {
  if (t.ageDays > 30) {
    await closeTicket({ id: t.id });
    closed++;
  }
}

// Only this line's value returns to the model.
console.log(`Closed ${closed} stale tickets of ${tickets.length}.`);
This is where the MCP code execution numbers come from: 150,000 tokens down to 2,000, a 98.7% reduction. Direct tool calls would have paged every ticket through the context. Code mode keeps the data in the sandbox.
Worth keeping distinct: that 98.7% is naive-MCP versus code-execution-MCP. A separate study found MCP servers burned 35x more tokens than raw CLI tools per task, with reliability dropping from 100% to 72%. Different comparison, same lesson: round-trips are the tax.
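For readers wondering what a module like ./servers/support in the ticket example might contain, here is one possible shape. It is a sketch under assumptions, not Anthropic's implementation: the CallTool type, the support__list_tickets and support__close_ticket tool names, and the in-memory fake server are all hypothetical, chosen so the block runs on its own.

```typescript
// Hypothetical transport: a real setup would forward to an MCP client.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<unknown>;
type Ticket = { id: string; status: string; ageDays: number };

// Thin typed wrappers: this is the "typed code API" the model imports.
function makeSupportApi(callTool: CallTool) {
  return {
    listTickets: (input: { status: string }) =>
      callTool("support__list_tickets", input) as Promise<Ticket[]>,
    closeTicket: (input: { id: string }) =>
      callTool("support__close_ticket", input) as Promise<{ closed: boolean }>,
  };
}

// In-memory stand-in for the support server, with made-up tickets.
function makeFakeCallTool(): CallTool {
  const db: Ticket[] = [
    { id: "T1", status: "resolved", ageDays: 45 },
    { id: "T2", status: "resolved", ageDays: 10 },
    { id: "T3", status: "resolved", ageDays: 90 },
    { id: "T4", status: "open", ageDays: 60 },
  ];
  return async (name, args) => {
    if (name === "support__list_tickets") {
      return db.filter((t) => t.status === args.status);
    }
    if (name === "support__close_ticket") {
      const ticket = db.find((t) => t.id === args.id);
      if (ticket) ticket.status = "closed";
      return { closed: ticket !== undefined };
    }
    throw new Error(`unknown tool: ${name}`);
  };
}

// The same loop as the ticket example, returning only the summary string.
async function closeStaleTickets(
  api: ReturnType<typeof makeSupportApi>,
): Promise<string> {
  const tickets = await api.listTickets({ status: "resolved" });
  let closed = 0;
  for (const t of tickets) {
    if (t.ageDays > 30) {
      await api.closeTicket({ id: t.id });
      closed++;
    }
  }
  return `Closed ${closed} stale tickets of ${tickets.length}.`;
}

closeStaleTickets(makeSupportApi(makeFakeCallTool())).then((summary) => {
  console.log(summary);
});
```

The ticket rows only ever exist inside closeStaleTickets. The caller, standing in for the model's context, receives one short string.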
Batch vs in-process loop vs code mode
Three ways to collapse the loop and kill the MCP round-trip overhead. They are not interchangeable. Pick by how much logic and data flow the task needs.
- In-process loop runs fixed logic over a list. No inter-step data flow. The tool author orchestrates. Data stays out of context.
- Batch tool runs many independent ops in one call. No inter-step data flow. The batch runner orchestrates. Data stays out of context.
- Code mode runs arbitrary logic AND inter-step data flow. The model orchestrates. Data stays in the sandbox, out of context.
Note: Code mode is the only one of the three that supports real inter-step data flow, where step two reads step one's output. The other two trade that flexibility for a dead-simple tool surface.
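The comparison above can be compressed into a small decision function. This is a deliberate simplification with hypothetical field names, and it only covers the choice among the three patterns, not whether to collapse the loop at all.

```typescript
type Task = {
  // Does a later step need to read an earlier step's output?
  needsInterStepDataFlow: boolean;
  // Is it one fixed operation applied to every item in a list?
  sameOperationPerItem: boolean;
};

type Pattern = "in-process loop" | "batch tool" | "code mode";

// Simplified rule of thumb derived from the three-way comparison.
function choosePattern(task: Task): Pattern {
  // Only code mode lets step two consume step one's result.
  if (task.needsInterStepDataFlow) return "code mode";
  // Same op over a list fits a purpose-built tool; mixed independent ops fit a batch.
  return task.sameOperationPerItem ? "in-process loop" : "batch tool";
}

console.log(choosePattern({ needsInterStepDataFlow: false, sameOperationPerItem: true }));
console.log(choosePattern({ needsInterStepDataFlow: false, sameOperationPerItem: false }));
console.log(choosePattern({ needsInterStepDataFlow: true, sameOperationPerItem: false }));
```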
If you are still picking servers, Best MCP Servers Worth Installing covers which ones already ship batch-style tools. And the wider token math lives in How to Reduce AI Coding Tool Token Usage by 50%.
When NOT to collapse the loop
Collapsing the loop is not free. You trade the model's per-step judgment for token savings. Sometimes that judgment is the whole point.
Keep the model in the loop when each step's outcome should change the next decision, as in a migration that branches on what it finds.
The same goes for live problem solving, such as a debugging session that course-corrects after each failure. Flatten those and you get a tool that charges ahead blind.
There is a cost ceiling worth naming too. Once a loop runs hundreds of fast, identical iterations, the MCP round-trip overhead dwarfs everything else, and that is exactly when collapsing pays off most. The closer each step gets to a real decision, the weaker the case for flattening it.
So the rule is simple. If the loop body is deterministic, push it into the tool or into code mode and stop paying MCP round-trip overhead per iteration.
If the body needs the model to think between steps, leave it alone. For more on where these patterns hold up under real load, see AI Agents in Production: What Actually Works.