Theo declares war on closed-source AI

April 4, 2026

4 topics • 4 YouTube videos • 0 blog posts. A light Saturday: one 40-minute polemic against closed-source slopification, two riffs on Karpathy's auto-research paradigm, and a quick hygiene checklist for vibe coders.

Hot Take • Developer Tools
Theo - t3.gg

Theo: Closed-Source Devs Can't Be Trusted With AI

Theo delivers a 40-minute polemic arguing that AI has broken the implicit contract of closed-source software: when you make every dev 100x faster, they can ship slop 100x faster too. Cursor Glass "collapsed under the weight of even the most basic changes," the Codex app ships updates that are a "random chance of making the app slightly better or significantly worse," macOS 26 is "a shitshow," and Claude Code is "closed source because they're ashamed" — Theo's actual claim, not a paraphrase.[1] His answer: open-source T3 Code, force accountability via forks, and stop tolerating closed-source tools whose performance is actively regressing.

The thesis

Theo's core argument (~18:00) is blunt: closed-source developers took a surprisingly performant open-source codebase (VS Code) and "sloppied the [expletive] out of it to the point where it's barely usable." The Sonnet 3.5-era code that still lives inside Cursor is, in his framing, a liability buried under layer upon layer of AI-generated additions. Every closed-source product carried the risk of being ruined before; AI multiplies that risk.

Every closed source project had this risk before. If you relied on a closed source program, at any time they could take it down. They could add a new feature that sucks. They could [ruin] the performance. Now, they can and will do that 10 to 100 times faster.

The Yash story: why he changed his mind

The inciting event (~06:20) is Theo's high-school intern Yash, who "doesn't perceive the boundaries between different places where code lives." Yash reverse-engineered T3 Chat's webpack bundle with a userscript to hot-patch in local models, then joined as an intern and "quadrupled the number of patch packages we had in his first two weeks." When the AI SDK didn't support progressive image generation, Yash patch-packaged the dependency itself. Theo's realization: AI + patch-package collapses the cost of forking and fixing, and the historical reflex of "work around it instead of going to the source" is obsolete.

The Cursor Glass roast

Cursor's rebuild (~15:00) is Theo's bête noire this week. He says Glass was "even slower than the core Cursor experience despite being comically simpler" and that Julius (T3's lead dev) couldn't keep it running with two codebases open. At a Cursor office event, Theo asked what was going on with performance; the answer — "we're prioritizing making it work and useful first" — broke his faith. His prescription: Cursor needs to hire a head of performance immediately. He still concedes Cursor's harness is "genuinely incredible," able to make Gemini 3.1 Pro usable where other harnesses can't — "there is genuine gold deep within Cursor and it is wrapped with the worst, smelliest, grossest pile of slop."

Closed source developers cannot be trusted with AI. They are taking things that are for the most part usable that have their quirks and problems and they are sloppifying them to the point where they don't work.

The Claude Code attack

Theo saves his sharpest hit for Claude Code (~31:00). Anthropic has kept Claude Code and the TypeScript Agents SDK closed source; Boris (the Claude Code lead) reportedly believed the tool itself might be "the secret sauce" Anthropic shouldn't release. Theo calls this pathetic and pins the real reason on shame: "the only reason this code is closed source at this point in time is because they're ashamed." He also recounts the bizarre accidental-open-source moment: Anthropic's bundle shipped with source maps, got pulled from npm (which "almost never" removes packages), and Anthropic sent DMCA takedowns to anyone who re-released the leaked code. For contrast: Gemini CLI, OpenCode, and Codex CLI are all open source, and Codex's app-server primitive lets T3 Code integrate Codex with full fidelity.

I don't think there has ever in history been a single piece of code that is harder to justify closing the source of. The point of Claude Code being a terminal app is that it is easier to interface with and integrate with. Yet, they don't let me do it.

Why T3 Code is open source

T3 Code has ~30,000 users and 1,100 forks — roughly 1 in 27 users has forked. Theo frames this as accountability: if his team takes T3 Code in a bad direction, forks will immediately blow up. He cites Maria's T1 Code (a TUI fork), plus creator forks of his video-review tool Lawn by XLT Jake and Quinn from Snazzy Labs, as proof that open-sourcing shifts power to users. Ben (his channel manager) rage-forked Lawn to rewrite the player in Svelte after a particularly aggressive prompt.

On Tauri vs Electron

A side note worth flagging (~22:00): Theo tried building T3 Code natively in SwiftUI/AppKit and got worse performance than Electron, because scrolling text containers with live token-by-token updates are "really hard to do performantly" in native. He then switched to Electron and notes that OpenCode's Adam is migrating off Tauri to Electron for the same reason: "WebKit is a [expletive]." For anyone considering Tauri in 2026, this is a real data point from a shipped production codebase.

The "Malice" sidebar

Theo references Malice, a service that rewrites open-source projects to be "legally distinct enough to let you violate the license without actually violating the license." He initially assumed it was parody; chat confirms it's real. A signal that AI-accelerated license laundering is now a product category.

Tools: T3 Code, T3 Chat, Cursor, Cursor Glass, Claude Code, Codex CLI, Codex app, Gemini CLI, OpenCode, VS Code, Electron, Tauri, WebKit, patch-package, AI SDK, Notion, macOS 26, Lawn, T1 Code, Ghostty, Semox, Malice
AI Future • Developer Tools
Caleb Writes Code

AutoResearch: Karpathy's Loop Takes a Chess Engine to 2600 ELO

Caleb walks through a working application of Andrej Karpathy's auto-research pattern — an AI experiment loop that modifies code, evaluates against a fixed metric, keeps winners, and discards losers — and reports two real runs: a restaurant inventory simulation that went from >50% order failure to near-zero, and a chess engine that climbed from 750 to 2600 ELO purely through iterated self-improvement.[2] The honest caveat: it only works in narrow domains where you can hand-craft the eval. "Make this restaurant better" doesn't scale.

How the loop works

Auto-research's internal structure is deliberately minimal (~01:35): program.md holds the goal statement, prepare.py holds the evaluation, and the agent is forbidden to modify anything except the target algorithm file. The inner loop runs experiments, scores against the eval, keeps only those that improve the score, and discards the rest. The chess engine's ELO curve flatlines until a breakthrough experiment lands, then ratchets upward — no gradient, just accept/reject on measurable improvement.
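
As a minimal sketch of that inner loop, assuming hypothetical propose_edit and run_eval hooks (the video only shows program.md and prepare.py, not the orchestration code), the whole paradigm fits in a few lines:

```python
# Minimal sketch of the auto-research accept/reject loop.
# propose_edit and run_eval are illustrative assumptions; the video
# only shows program.md (the goal) and prepare.py (the eval).
import shutil

TARGET = "algorithm.py"              # the only file the agent may modify
SNAPSHOT = "algorithm.best.py"       # last accepted version

def auto_research(agent, run_eval, iterations: int = 100) -> float:
    shutil.copy(TARGET, SNAPSHOT)
    best = run_eval()                # score against prepare.py's fixed metric
    for _ in range(iterations):
        agent.propose_edit(TARGET)   # one experiment: edit the target file
        score = run_eval()
        if score > best:             # keep winners
            best = score
            shutil.copy(TARGET, SNAPSHOT)
        else:                        # discard losers: revert to last best
            shutil.copy(SNAPSHOT, TARGET)
    return best
```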

Restaurant sim: what the algorithm actually learned

The baseline algorithm reordered one item per ingredient on depletion, which failed because the 3–5 day lead time meant orders arrived too late (~02:30). After running auto-research, the optimized algorithm learned two things that no one told it: (1) place orders on day one to pre-fill inventory, and (2) batch orders by quantity rather than individually. But the first eval ("maximize inventory") produced a working-capital crisis — the business stayed solvent on paper while cash got trapped in stock. Caleb changed the eval to weight working capital, and auto-research re-converged on a more balanced policy.
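
A hypothetical sketch of the eval change; the video shows the outcome, not the scoring code, so the shape and the weight below are assumptions:

```python
# Hypothetical reweighted eval; the actual metric and weight are not
# shown in the video. The first eval rewarded fill rate alone, which
# trapped cash in stock; this version penalizes tied-up working capital.
def score(sim, capital_weight: float = 0.5) -> float:
    fill_rate = sim.orders_filled / sim.orders_total
    cash_tied_up = sim.inventory_value / sim.starting_cash
    return fill_rate - capital_weight * cash_tied_up
```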

Developing software in this scenario is more about understanding the problem well enough, defining [it] in words, and putting the structure around [it] for the agent to build successfully.

The honest limitation

Caleb doesn't oversell it (~04:20): auto-research required his guidance to pick the right metric and structure, and the simulation's fast feedback loop is what made it feasible. Outside narrow domains with clean evals, this paradigm doesn't generalize. That's a deliberate contrast to the singularity-adjacent framing some people put on self-improving agents.

Tools: AutoResearch, program.md, prepare.py, chess engine eval, inventory simulation
AI Future • Developer Tools
Developers Digest

AutoAgent Extends Auto-Research to the Agent Harness Itself

Kevin Goo's AutoAgent takes Karpathy's auto-research loop and retargets it at the agent harness — prompts, tools, orchestration — rather than at training code. A meta-agent spins up thousands of parallel sandboxes, runs a task agent on benchmarks, reads the reasoning traces, and decides what to keep.[3] The video's sharper framing: if harness engineering is a domain-specific expertise tax, it's about to become the next layer of abstraction the model writes for us.

Same loop, different target

The structural parallel with Karpathy's setup is exact (~01:30): auto-research edits train.py against a training-loss eval; AutoAgent edits agent.py (the task-agent harness) against a benchmark eval. The task agent starts with essentially nothing — just a bash tool and the program.md research direction. The meta-agent orchestrates the iterations and decides what to revert. Example benchmarks shown: SpreadsheetBench and TerminalBench.
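
Structurally it is the same accept/reject loop with a different edit target. A rough sketch, with edit/observe/revert as assumed meta-agent hooks (the real AutoAgent runs thousands of sandboxes in parallel):

```python
# Rough sketch: the auto-research loop retargeted at the harness.
# The meta_agent methods and run_benchmark are illustrative assumptions.
def improve_harness(meta_agent, run_benchmark, rounds: int = 50) -> float:
    best, _ = run_benchmark("agent.py")            # e.g. TerminalBench pass rate
    for _ in range(rounds):
        meta_agent.edit("agent.py")                # rewrite prompts, tools, orchestration
        score, traces = run_benchmark("agent.py")
        meta_agent.observe(traces)                 # reasoning traces guide the next edit
        if score > best:
            best = score                           # keep the winning harness
        else:
            meta_agent.revert("agent.py")          # discard the loser
    return best
```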

Why this matters if it works

Harness engineering currently requires someone who understands both the model and the domain. Most companies don't have one workflow — they have many (~04:00). Today the realistic options are a monolithic harness that does everything poorly, or per-workflow custom harnesses that never get built because nobody has the time. A meta-agent that can auto-tune per-domain harnesses changes the economics of running cheaper, task-specialized models instead of always reaching for the frontier.

We used to write the actual syntax of all of the different code, and now more and more of the code is just written by AI models. A similar thing potentially could happen with harnesses.

Context: Theo's harness argument, one week later

This video usefully pairs with Theo's April 13 walkthrough of why the same Claude Opus scores 77% in Claude Code vs. 93% in Cursor — the harness is the variable. If AutoAgent's approach pans out, Cursor's "full-time staff rewriting tool descriptions per model" moat becomes automatable. That's a real threat model to watch.

Tools: AutoAgent, AutoResearch, meta-agent, task agent, agent.py, program.md, SpreadsheetBench, TerminalBench
Developer Tools
Arjay McCandless

Three Mistakes Vibe Coders Keep Making

A 1-minute hygiene checklist from Arjay McCandless on what trips up vibe-coded apps in production: hard-coded API keys in the frontend, missing RLS on the database, and public S3 buckets; zero observability on both endpoint health and product analytics; and no thought given to cost or scale.[4]

The full checklist, in order of damage potential:

  • Security basics: don't hardcode API keys in the frontend (see the sketch after this list), enable Row-Level Security (RLS) on your database, and don't leave S3 buckets public. Arjay's framing: these are 2-second fixes that protect your customers.
  • Observability, two layers: (1) functionality — are endpoints working, is the site accessible — and (2) product analytics — scroll depth, click paths, action funnels.
  • Cost and scale: fine to defer early, but not indefinitely. A single unoptimized LLM call in a hot path is how vibe-coded apps go from $20/month to $5K/month overnight.
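
A minimal sketch of the first fix: keep the key server-side and proxy the call, rather than shipping the secret in the frontend bundle. The Flask app and route below are illustrative assumptions, not from the video.

```python
# Keep the API key out of the frontend: the browser calls this endpoint,
# and only the server ever sees the secret. Illustrative sketch.
import os

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
API_KEY = os.environ["OPENAI_API_KEY"]   # from the environment, never hard-coded

@app.post("/api/complete")
def complete():
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": "gpt-4o-mini",
              "messages": request.get_json()["messages"]},
        timeout=30,
    )
    return jsonify(resp.json()), resp.status_code
```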

Short-form content, no transcript timestamps worth citing. Useful as a tweet-sized checklist to run against any vibe-coded project before it goes public.

Sources

  1. YouTube I’m serious. — Theo - t3.gg, Apr 4
  2. YouTube AutoResearch explained — Caleb Writes Code, Apr 4
  3. YouTube Self Improving Agents in 5 Minutes — Developers Digest, Apr 4
  4. YouTube 3 mistakes vibe coders make — Arjay McCandless, Apr 4