TubeReads

What even is an "agent harness"?

The term «harness» has quietly become one of the most important — and least understood — concepts in AI coding. While developers debate the merits of Cursor versus Claude Code, they're really debating harnesses: the invisible infrastructure that determines whether a model achieves 77% accuracy or 93%. Most assume these tools are complex proprietary systems. The reality is far more surprising — and empowering.

Video length: 39:23 · Published Apr 13, 2026 · Video language: en-US
6–7 min read · 8,653 spoken words summarized in 1,372 words (6x)

1

Key points

1

Harnesses provide tools (like bash, read file, edit file) to AI models, handle the execution of those tools, and manage the chat history — essentially giving text-generating models the ability to interact with your computer.

2

The same model can perform dramatically differently depending on its harness: Opus went from 77% accuracy in Claude Code to 93% in Cursor, purely due to harness optimization.

3

Large context windows actually make models dumber — accuracy drops to ~50% when context exceeds 50,000–100,000 tokens, which is why modern harnesses focus on giving models search tools instead of stuffing entire codebases into context.

4

You can build a functional coding agent harness in roughly 60 lines of Python with just three tools: read file, list files, and edit file — or even simpler, with just a single bash tool.

5

Cursor's superior performance comes not from secret technology but from extensive manual testing and tuning of tool descriptions and system prompts for each model, while companies like Anthropic may not invest as heavily in optimizing their own harnesses.

In brief

A harness is simply the set of tools and environment in which an AI agent operates, and the difference between a good harness and a bad one is not complexity but careful tuning of tool descriptions and system prompts — something you can build yourself in under 100 lines of code.


2

The Harness Performance Gap

Same model, different harness: 77% to 93% accuracy shift.

Opus accuracy in Claude Code
77%
Performance of Claude Opus running in Anthropic's native Claude Code harness
Opus accuracy in Cursor
93%
Same Opus model running in Cursor's optimized harness — a 16 percentage point improvement
Lines of code for basic harness
~60–200
Amount of Python code needed to build a functional AI coding assistant harness from scratch
Context window accuracy threshold
50,000–100,000 tokens
Beyond this range, Sonnet's accuracy for finding repeating words drops to roughly 50% of baseline

3

How Tool Calling Actually Works

Models pause, tools execute, results append, models resume — constantly.

AI models can only generate text, yet they appear to run commands, edit files, and navigate codebases. This illusion is created through «tool calling»: the model outputs special syntax (like wrapping a bash command in tags), stops responding entirely, and waits. Your harness — the local code managing the interaction — parses that syntax, executes the command using traditional code, captures the output, and appends it to the chat history. Then it makes a new API request to the same model with the updated history, and the model continues from where it left off.

This cycle repeats for every tool call. The model's «brain» gets paused and restarted constantly — imagine your memory resetting every 30 seconds while you debug code. The model only knows what's in the chat history at that moment. If a file's contents aren't in the history, the model must use a tool to read it. If the tool call returns nothing useful, the model might call another tool to gather more context. Each pause-execute-resume cycle adds latency but also prevents the model from hallucinating file contents or system state.

Most modern harnesses define tools in a standardized format and pass them to the model via dedicated API parameters (not just system prompts). Providers like OpenAI, Anthropic, and OpenRouter now accept a «tools» array in requests, which they format internally in model-specific syntax. This standardization means you can define tools once and let the provider handle the formatting — though understanding the underlying mechanics remains crucial for debugging and optimization.
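The pause-execute-resume cycle can be sketched in a few lines. This is a minimal illustration, not any provider's actual API: the «tool: name {json args}» syntax and the `echo` tool are made up here for demonstration, and a real harness would substitute a genuine API request where the stubbed model output appears.

```python
import json

# Illustrative tool: uppercases its input. A real harness would
# register bash, read_file, etc. here instead.
def run_echo(text: str) -> str:
    return text.upper()

TOOLS = {"echo": run_echo}

def handle_model_turn(model_output: str, history: list) -> list:
    """One cycle: record the model's reply, find any tool-call lines,
    execute them, and append the results to the chat history. The next
    API request would include this updated history, so the model
    resumes exactly where it stopped."""
    history.append({"role": "assistant", "content": model_output})
    for line in model_output.splitlines():
        if line.startswith("tool: "):
            name, _, raw_args = line[len("tool: "):].partition(" ")
            result = TOOLS[name](**json.loads(raw_args))
            history.append({"role": "tool", "content": result})
    return history

# Stubbed model output standing in for a real API response.
history = handle_model_turn('tool: echo {"text": "hi"}', [])
```

Note that the model never executes anything itself: it only emits the `tool:` line, and all real work happens in `handle_model_turn`.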


4

The Three Core Tools

📖
Read File
Takes a file path, returns the file's contents as a string. The model uses this to examine code before making changes.
📂
List Files
Returns all files in a directory with their types. Lets the model navigate the project structure and locate relevant code.
✏️
Edit File
Accepts old string, new string, and file path. Replaces the first occurrence of old with new, or creates the file if old is empty.
💻
Bash (optional but powerful)
Execute arbitrary shell commands. With just this one tool, a model can read, write, search, and modify anything — the other three become unnecessary.
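The three core tools above are small enough to write out in full. A possible sketch in Python — the exact signatures and return shapes are assumptions, since different harnesses vary on details like whether results come back as strings or dictionaries:

```python
import os

def read_file(path: str) -> str:
    """Return the file's contents as a string."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def list_files(directory: str = ".") -> list:
    """Return each entry in a directory along with its type."""
    return [
        {"name": name,
         "type": "dir" if os.path.isdir(os.path.join(directory, name)) else "file"}
        for name in sorted(os.listdir(directory))
    ]

def edit_file(path: str, old: str, new: str) -> str:
    """Replace the first occurrence of `old` with `new` in the file,
    or create the file with contents `new` when `old` is empty."""
    if old == "":
        content = new
    else:
        content = read_file(path).replace(old, new, 1)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return "ok"
```

The empty-`old` convention doubles as file creation, which is why the summary lists only three tools rather than a separate "create file" tool.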

5

Why Large Context Makes Models Dumber

Stuffing codebases into context creates impossible needle-in-haystack problems.

⚠️


Tools like RepoPack tried to compress entire codebases into a single XML file for the model, believing huge context windows were the future. This failed spectacularly. When you ask a model to fix a bug and give it 2,000 files instead of 2, you create the worst possible needle-in-haystack scenario — especially when the model's memory resets every 30 seconds (every tool call). Accuracy plummets beyond 50,000–100,000 tokens. Modern harnesses succeed by giving models search tools to build their own minimal context, not by stuffing everything in upfront.


6

Building Your Own Harness in 60 Lines

System prompt, tool registry, execution loop — that's the recipe.

1

Define your tool functions
Write simple Python (or JavaScript) functions for read_file, list_files, edit_file, and optionally bash. Each returns a dictionary with results.

2

Create a tool registry
Map tool names to functions and extract their signatures and docstrings. The model will use these descriptions to decide when to call each tool.

3

Build the system prompt
Tell the model it has access to tools, how to format tool calls (e.g., «tool: tool_name {args}»), and that it should use compact single-line JSON.

4

Parse tool calls from responses
After the model responds, scan for lines starting with «tool:». Extract the tool name and arguments, then look up and execute the function from your registry.

5

Append results and loop
Add tool outputs to the chat history as new messages, then make another API request to the model. Repeat until the model responds without calling tools.
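The five steps above can be combined into one loop. This is a minimal sketch under the conventions the steps describe («tool: name {json}» call syntax, docstrings as tool descriptions); `call_model` is a hypothetical stand-in for a real provider API request, and only a single tool is registered to keep the example short:

```python
import json

def read_file(path: str) -> str:
    """read_file {"path": ...}: return the file's contents."""
    with open(path, encoding="utf-8") as f:
        return f.read()

# Step 2: the registry. Docstrings double as the tool descriptions
# the model reads when deciding what to call.
TOOLS = {"read_file": read_file}

# Step 3: the system prompt announces the tools and the call syntax.
SYSTEM_PROMPT = (
    "Call tools with lines like: tool: tool_name {compact JSON args}\n"
    + "\n".join(f.__doc__ for f in TOOLS.values())
)

def agent_loop(call_model, user_message: str) -> str:
    """Steps 4-5: parse tool calls, execute, append, repeat until the
    model replies without calling any tool. `call_model(history) -> str`
    stands in for a real API request."""
    history = [{"role": "system", "content": SYSTEM_PROMPT},
               {"role": "user", "content": user_message}]
    while True:
        reply = call_model(history)
        history.append({"role": "assistant", "content": reply})
        calls = [ln for ln in reply.splitlines() if ln.startswith("tool: ")]
        if not calls:
            return reply  # no tool calls left: the agent is done
        for line in calls:
            name, _, raw = line[len("tool: "):].partition(" ")
            result = TOOLS[name](**json.loads(raw))
            history.append({"role": "tool", "content": str(result)})
```

Swapping `call_model` for an actual chat-completions request (and adding list_files, edit_file, and bash to `TOOLS`) is essentially all that separates this sketch from the ~60-line harness the summary describes.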


7

Why Cursor's Harness Outperforms

Manual tuning of prompts and tool descriptions, model by model.

ANTHROPIC / GOOGLE
Generic, Model-Written Harnesses
Companies like Anthropic likely use AI-generated system prompts and tool descriptions that haven't been manually refined. They work across models but aren't optimized for any specific one. Tool descriptions may not steer the model effectively, leading to unnecessary tool calls or ignored capabilities.
CURSOR
Obsessively Tuned Per Model
Cursor employs people whose job is to test every new model with hundreds of micro-adjustments to system prompts and tool descriptions. They rewrite descriptions for each model to account for different behaviors — Gemini might prefer bash, Claude might favor specialized tools. The result: dramatically better accuracy with the same underlying model.

8

You Can Lie to the Model (And Should)

Models only know what you tell them — exploit that.

You can tell it it's a bash tool, but you do something else. You can tell it it's a read file tool, but you do something else. You can tell it it's grep or ripgrep or something different and then go do whatever the fuck you want. I do this all the time. When I want to just fake bash, for example, when I want a model to think it has bash when it doesn't, I'll just tell it it does and I'll tell another model to make a fake response for it.

Presenter
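The trick the presenter describes — tool descriptions are just text, so the implementation behind them can be anything — can be sketched as a fake bash tool. The canned outputs here are invented for illustration; the presenter's actual approach has a second model generate the fake response:

```python
# What the model is told (the docstring) promises a real shell.
# What actually runs is a lookup table. The model cannot tell the
# difference, because it only ever sees the chat history.
FAKE_OUTPUTS = {
    "ls": "main.py\nREADME.md",
    "pwd": "/home/user/project",
}

def fake_bash(command: str) -> str:
    """bash: execute an arbitrary shell command."""
    return FAKE_OUTPUTS.get(command, "command not found")
```

The same decoupling also works in the benign direction: a tool described as `grep` can be backed by a faster or safer search implementation without the model ever knowing.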


9

T3 Code: A UI, Not a Harness

T3 Code wraps existing harnesses rather than replacing them.

T3 Code is not a harness — it's a UI layer on top of existing harnesses like Claude Code, Codex CLI, and OpenCode. When you select a model in T3 Code, you're not just picking the model; you're choosing which harness runs locally. If you don't have Claude Code installed and authenticated, the Claude option won't work. T3 Code doesn't provide bash tools, file readers, or execution environments. It delegates all of that to the harnesses already on your machine.

This architecture makes T3 Code much simpler to build and maintain, but it also means the quality of your results depends entirely on the underlying harness. The presenter notes that building T3 Code would have been significantly easier if he could have just built the harness himself — but the hard part isn't the harness (60 lines of code), it's building a great UI and experience around it. T3 Code focuses on that layer: model selection, chat history management, observability, and user experience, while leaving the heavy lifting of tool execution to proven harnesses.


10

People

Matt Mayer
Independent benchmark creator
mentioned
Mah
Author, «The Emperor Has No Clothes» article
mentioned
Edward
Twitter user (prompted video creation)
mentioned

Glossary
Tool calling: The mechanism by which an AI model requests execution of a function (like reading a file or running a command) by outputting structured syntax, pausing, and waiting for the result.
Context window: The total amount of text (measured in tokens) that a model can hold in its «memory» during a single conversation.
System prompt: The initial instructions given to a model before any user messages, often including tool definitions and behavioral guidelines.
RepoPack / Repomix: A now-deprecated approach that compressed entire codebases into a single XML file to fit in the model's context, which proved ineffective.

Disclaimer: This is an AI-generated summary of a YouTube video, intended for educational and reference purposes. It does not constitute investment, financial, or legal advice. Always verify information against the original sources before making decisions. TubeReads is not affiliated with the content creator.