What even is an "agent harness"?
The term "harness" has quietly become one of the most important — and least understood — concepts in AI coding. While developers debate the merits of Cursor versus Claude Code, they're really debating harnesses: the invisible infrastructure that determines whether a model achieves 77% accuracy or 93%. Most assume these tools are complex proprietary systems. The reality is far more surprising — and empowering.
Key takeaways
Harnesses provide tools (like bash, read file, edit file) to AI models, handle the execution of those tools, and manage the chat history — essentially giving text-generating models the ability to interact with your computer.
The same model can perform dramatically differently depending on its harness: Opus went from 77% accuracy in Claude Code to 93% in Cursor, purely due to harness optimization.
Large context windows actually make models dumber — accuracy drops to ~50% when context exceeds 50,000–100,000 tokens, which is why modern harnesses focus on giving models search tools instead of stuffing entire codebases into context.
You can build a functional coding agent harness in roughly 60 lines of Python with just three tools: read file, list files, and edit file — or even simpler, with just a single bash tool.
Cursor's superior performance comes not from secret technology but from extensive manual testing and tuning of tool descriptions and system prompts for each model, while companies like Anthropic may not invest as heavily in optimizing their own harnesses.
In brief
A harness is simply the set of tools and environment in which an AI agent operates, and the difference between a good harness and a bad one is not complexity but careful tuning of tool descriptions and system prompts — something you can build yourself in under 100 lines of code.
The Harness Performance Gap
Same model, different harness: 77% to 93% accuracy shift.
How Tool Calling Actually Works
Models pause, tools execute, results append, models resume — constantly.
AI models can only generate text, yet they appear to run commands, edit files, and navigate codebases. This illusion is created through "tool calling": the model outputs special syntax (like wrapping a bash command in tags), stops responding entirely, and waits. Your harness — the local code managing the interaction — parses that syntax, executes the command using traditional code, captures the output, and appends it to the chat history. Then it makes a new API request to the same model with the updated history, and the model continues from where it left off.
This cycle repeats for every tool call. The model's «brain» gets paused and restarted constantly — imagine your memory resetting every 30 seconds while you debug code. The model only knows what's in the chat history at that moment. If a file's contents aren't in the history, the model must use a tool to read it. If the tool call returns nothing useful, the model might call another tool to gather more context. Each pause-execute-resume cycle adds latency but also prevents the model from hallucinating file contents or system state.
Most modern harnesses define tools in a standardized format and pass them to the model via dedicated API parameters (not just system prompts). Providers like OpenAI, Anthropic, and OpenRouter now accept a "tools" array in requests, which they format internally in model-specific syntax. This standardization means you can define tools once and let the provider handle the formatting — though understanding the underlying mechanics remains crucial for debugging and optimization.
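As a concrete illustration, here is what one entry in that "tools" array looks like. This sketch follows the OpenAI Chat Completions schema (the source names the provider but not the exact format; Anthropic's version is similar but uses `input_schema` in place of `parameters`):

```python
# One tool definition in the OpenAI-style "tools" array.
# The model never sees your Python function — only this JSON schema,
# so the "description" fields carry all the weight.
read_file_tool = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from disk and return its contents as text.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "Path to the file to read.",
                },
            },
            "required": ["path"],
        },
    },
}

# Passed on every request, e.g. client.chat.completions.create(..., tools=[read_file_tool])
tools = [read_file_tool]
```

Because the provider handles the model-specific formatting, this same definition works across models — which is exactly why tuning the `description` strings per model (as Cursor does) is where the real leverage lives.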
The Three Core Tools
Why Large Context Makes Models Dumber
Stuffing codebases into context creates impossible needle-in-haystack problems.
Tools like RepoPack tried to compress entire codebases into a single XML file for the model, believing huge context windows were the future. This failed spectacularly. When you ask a model to fix a bug and give it 2,000 files instead of 2, you create the worst possible needle-in-haystack scenario — especially when the model's memory resets every 30 seconds (every tool call). Accuracy plummets beyond 50,000–100,000 tokens. Modern harnesses succeed by giving models search tools to build their own minimal context, not by stuffing everything in upfront.
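What a "search tool" means in practice can be sketched in a few lines. The function below is a hypothetical grep-style tool (name and signature are illustrative, not from the source): instead of receiving 2,000 files up front, the model calls it with a pattern and then reads only the handful of files that match.

```python
import os
import re

def search_files(root: str, pattern: str, max_results: int = 20) -> list[str]:
    """Return 'path:line_no:line' matches so the model can build its own
    minimal context instead of having the whole codebase stuffed in."""
    regex = re.compile(pattern)
    hits: list[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    for lineno, line in enumerate(f, 1):
                        if regex.search(line):
                            hits.append(f"{path}:{lineno}:{line.rstrip()}")
                            if len(hits) >= max_results:
                                return hits
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
    return hits
```

Capping the results (`max_results`) is part of the point: the tool's output lands in the chat history, so an unbounded search would recreate the very context bloat it exists to avoid.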
Building Your Own Harness in 60 Lines
System prompt, tool registry, execution loop — that's the recipe.
1. Define your tool functions: Write simple Python (or JavaScript) functions for read_file, list_files, edit_file, and optionally bash. Each returns a dictionary with results.
2. Create a tool registry: Map tool names to functions and extract their signatures and docstrings. The model will use these descriptions to decide when to call each tool.
3. Build the system prompt: Tell the model it has access to tools, how to format tool calls (e.g., "tool: tool_name {args}"), and that it should use compact single-line JSON.
4. Parse tool calls from responses: After the model responds, scan for lines starting with "tool:". Extract the tool name and arguments, then look up and execute the function from your registry.
5. Append results and loop: Add tool outputs to the chat history as new messages, then make another API request to the model. Repeat until the model responds without calling tools.
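The steps above can be sketched as a minimal loop. Everything here is illustrative rather than the presenter's exact code: `call_model` stands in for whatever function sends the message list to your provider, and the `tool: name {json}` line format matches the system-prompt convention described in step 3.

```python
import json
import os

def read_file(path: str) -> dict:
    """Read a file and return its contents."""
    with open(path, encoding="utf-8") as f:
        return {"content": f.read()}

def list_files(path: str = ".") -> dict:
    """List files in a directory."""
    return {"files": os.listdir(path)}

# Step 2: the tool registry — names the model uses, mapped to functions.
TOOLS = {"read_file": read_file, "list_files": list_files}

def run_tool_calls(response_text: str) -> list[dict]:
    """Step 4: scan the response for lines like `tool: name {"arg": ...}`
    and execute each matching function from the registry."""
    results = []
    for line in response_text.splitlines():
        if line.startswith("tool:"):
            name, _, raw_args = line[len("tool:"):].strip().partition(" ")
            args = json.loads(raw_args) if raw_args else {}
            results.append({"tool": name, "result": TOOLS[name](**args)})
    return results

def agent_loop(messages: list[dict], call_model) -> str:
    """Step 5: ask the model, run any tool calls, append the outputs to
    the chat history, and repeat until no tools are requested."""
    while True:
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply})
        results = run_tool_calls(reply)
        if not results:
            return reply  # model answered without calling tools: done
        messages.append({"role": "user", "content": json.dumps(results)})
```

Note how the loop makes the "memory resets every tool call" point concrete: the model's only continuity between iterations is the growing `messages` list.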
Why Cursor's Harness Outperforms
Manual tuning of prompts and tool descriptions, model by model.
You Can Lie to the Model (And Should)
Models only know what you tell them — exploit that.
“You can tell it it's a bash tool, but you do something else. You can tell it it's a read file tool, but you do something else. You can tell it it's grep or ripgrep or something different and then go do whatever the fuck you want. I do this all the time. When I want to just fake bash, for example, when I want a model to think it has bash when it doesn't, I'll just tell it it does and I'll tell another model to make a fake response for it.”
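A minimal sketch of what "faking bash" might look like in a harness. This is a hypothetical construction based on the quote above, not the presenter's code: the model is told it has a bash tool, but the function behind it returns canned output (the transcript describes going further and having a second model invent plausible responses).

```python
# Canned outputs for commands we expect the model to try.
# The model believes it ran these in a real shell.
FAKE_OUTPUTS = {
    "ls": "README.md\nmain.py\ntests/",
    "pwd": "/home/user/project",
}

def fake_bash(command: str) -> dict:
    """Registered and described to the model as a real bash tool,
    but no shell is ever invoked — we return whatever we want
    the model to believe."""
    output = FAKE_OUTPUTS.get(command.strip(), "")
    return {"stdout": output, "exit_code": 0}
```

Because the model only sees the tool's description and its returned output, it has no way to distinguish this from a real shell — which is the whole point of the quote.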
T3 Code: A UI, Not a Harness
T3 Code wraps existing harnesses rather than replacing them.
T3 Code is not a harness — it's a UI layer on top of existing harnesses like Claude Code, Codex CLI, and OpenCode. When you select a model in T3 Code, you're not just picking the model; you're choosing which harness runs locally. If you don't have Claude Code installed and authenticated, the Claude option won't work. T3 Code doesn't provide bash tools, file readers, or execution environments. It delegates all of that to the harnesses already on your machine.
This architecture makes T3 Code much simpler to build and maintain, but it also means the quality of your results depends entirely on the underlying harness. The presenter notes that building T3 Code would have been significantly easier if he could have just built the harness himself — but the hard part isn't the harness (60 lines of code), it's building a great UI and experience around it. T3 Code focuses on that layer: model selection, chat history management, observability, and user experience, while leaving the heavy lifting of tool execution to proven harnesses.
People
Glossary
Disclaimer: This is an AI-generated summary of a YouTube video for educational and reference purposes. It does not constitute investment, financial, or legal advice. Always verify information against the original sources before making decisions. TubeReads is not affiliated with the content creator.