April 7, 2026
Ryan Lopopolo, who leads frontier product exploration inside OpenAI Frontier, spent five months running a three-person team that shipped a 1M-line Electron app entirely via Codex — with a self-imposed rule that no human was allowed to write code.[1]Latent Space — Extreme Harness Engineering with Ryan Lopopolo By January they hit 5–10 PRs per engineer per day. He personally burns ~1B tokens/day (Swyx does the math to ~$2.5M/year equivalent retail) and treats PR review as entirely post-merge. The post also introduces Symphony, OpenAI's internal Elixir-based agent orchestrator released as a "ghost library" spec.
Lopopolo's opening framing (~03:00): "starting with this constraint of I can't write the code meant that the only way I could do my job was to get the agent to do my job." The first month and a half was 10x slower than he'd have been by hand, but paying that up-front cost bought an "assembly station" where a 3-person team moves at the throughput of a much larger org.
The models are there enough, the harnesses are there enough where they're isomorphic to me in capability, in the ability to do the job.
When Codex 5.3 added background shells, the agent became "less patient, less willing to block," so they retooled the build system to complete in under a minute — bespoke Makefile → Bazel → Turbo → Nx over the course of a week (~06:00). Why one minute? "Because we were able to hit it." This is the philosophical core of the article: cheap tokens plus massive parallelism means you can permanently garden invariants that human-led teams let rot.
Tokens are so cheap, and we're so insanely parallel with the model, we can just constantly be gardening this thing to maintain these invariants — which means there's way less dispersion in the code and the SDLC.
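Gardening an invariant like the one-minute build is easy to mechanize. Here is a minimal sketch of a CI guard in that spirit; the build command and budget are assumptions, not the team's actual tooling:

```ts
// Hypothetical CI guard for a one-minute build invariant.
// The nx invocation is illustrative; swap in whatever the real build entrypoint is.
import { execSync } from "node:child_process";

const BUDGET_MS = 60_000; // the invariant: a clean build finishes in under a minute

const start = Date.now();
execSync("npx nx run-many --target=build --all", { stdio: "inherit" });
const elapsed = Date.now() - start;

if (elapsed > BUDGET_MS) {
  console.error(`Build took ${elapsed}ms, blowing the ${BUDGET_MS}ms budget.`);
  process.exit(1); // fail the pipeline so an agent gets dispatched to re-garden
}
console.log(`Build OK in ${elapsed}ms.`);
```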
Review is almost entirely post-merge (~09:00). A Codex review agent comments on PRs; the authoring Codex has to "at least acknowledge and respond." Early versions of this loop had the two agents bullying each other into scope creep, so they gave the reviewer an explicit instruction to bias toward merging and not surface anything above P2, and gave the author explicit permission to defer or push back (~15:00).
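A minimal sketch of what those counterweight instructions could look like, written as prompt fragments; the actual internal wording is not public, so everything below is an assumption:

```ts
// Hypothetical policy prompts for the two-agent post-merge review loop.
// Neither string is OpenAI's real instruction text.
export const reviewerPolicy = `
You review already-merged PRs. Bias toward accepting the change as shipped.
Do not surface anything above P2, and never request scope expansion.
`;

export const authorPolicy = `
For every reviewer comment, either fix it, defer it with a tracked follow-up,
or push back with a one-line justification. You must acknowledge and respond;
you are not required to agree. Keep the PR's scope fixed.
`;
```

The point of the pair is symmetry: the reviewer is damped toward merging while the author is explicitly licensed to say no, which is what stopped the scope-creep spiral.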
The only fundamentally scarce thing is the synchronous human attention of my team. There's only so many hours in the day. We have to eat lunch.
On-call page? Tell Codex to update the reliability doc to require network-call timeouts. PR comment? That's a signal the agent was missing context — codify it as a lint or a skill. The team slurps Codex session logs to blob storage and runs a nightly agent over them to find team-wide gaps (~44:00). "Everybody benefits from everybody else's behavior for free."
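A toy version of that nightly pass might look like the following, assuming sessions land as local .jsonl files and an `ask` function wraps whatever model endpoint does the analysis (both are assumptions; the team's real pipeline runs against blob storage):

```ts
// Sketch of a nightly log-mining job: pool everyone's Codex sessions and ask
// one agent for team-wide gaps. File layout and prompt are invented.
import { readdir, readFile } from "node:fs/promises";

async function collectSessions(dir: string): Promise<string[]> {
  const files = await readdir(dir);
  return Promise.all(
    files
      .filter((f) => f.endsWith(".jsonl"))
      .map((f) => readFile(`${dir}/${f}`, "utf8"))
  );
}

export async function nightly(dir: string, ask: (prompt: string) => Promise<string>) {
  const sessions = await collectSessions(dir);
  // One pass over everyone's sessions means each fix propagates to the whole team.
  return ask(
    "Find recurring gaps across these Codex sessions (missing lints, docs, skills):\n" +
      sessions.join("\n---\n")
  );
}
```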
End of December they were at 3.5 PRs/engineer/day; after 5.2 dropped in January they jumped to 5–10. That pace made context-switching between tmux panes too taxing for humans, so they built Symphony (~36:00). The model chose Elixir because BEAM's process supervision maps onto the daemon-per-task pattern. When a PR is kicked back to "rework," Symphony trashes the whole worktree and restarts from scratch — the agent is cheap enough that redoing the work is fine.
Symphony is distributed as a spec, not a binary: Twitter is calling these "ghost libraries." The authoring flow is itself Ralph-style recursion: spawn a disconnected Codex to implement the spec, spawn another Codex to diff the implementation against upstream, update the spec to reduce the divergence, loop. Several users just fed the spec to Codex and had it rebuild the system from scratch successfully.
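That convergence loop reads naturally as code. A sketch, with the agent interface reduced to a single hypothetical `run` function since Symphony's real harness isn't public:

```ts
// Ralph-style spec convergence: implement from spec, diff against upstream,
// fold the divergence back into the spec, repeat. All names here are invented.
type Agent = (task: string) => Promise<string>;

export async function convergeSpec(
  spec: string,
  upstream: string, // a summary of the real implementation's behavior
  run: Agent,
  rounds = 5
): Promise<string> {
  for (let i = 0; i < rounds; i++) {
    // 1. A disconnected agent rebuilds the system from the spec alone.
    const impl = await run(`Implement this spec from scratch:\n${spec}`);
    // 2. A second agent diffs the rebuild against upstream behavior.
    const divergence = await run(
      `List behavioral differences between:\n${impl}\nand:\n${upstream}`
    );
    if (/^none$/i.test(divergence.trim())) break;
    // 3. The spec, not the code, absorbs the fix.
    spec = await run(
      `Revise this spec to eliminate these divergences:\n${divergence}\n\nSpec:\n${spec}`
    );
  }
  return spec;
}
```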
Backing up Bret Taylor's "software dependencies are going away" take (~27:00): a few-thousand-line dependency can be in-housed in an afternoon, stripped of the generic parts you don't need. They still pay for Datadog and Temporal, but NPM-style liberal-accept plugins are ripe for inlining, especially because Codex Security can then directly audit the vendored copy rather than chasing transitive CVEs.
MCPs are dismissed: "I'm pretty bearish on MCPs because the harness forcibly injects all those tokens into the context and I don't really get a say over it" (~38:00). A teammate vibe-coded a local Playwright daemon with a shim CLI exposing only the 3 commands actually needed, and Lopopolo didn't know until days later. CLI design lesson: patch --silent onto prettier, suppress walls of passing-test output, bias everything toward compact structured signal.
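A shim like that teammate's is a few dozen lines. A sketch, assuming Playwright and inventing the verbs, since the episode doesn't name the actual three commands:

```ts
#!/usr/bin/env node
// Hypothetical agent-facing shim over Playwright: three verbs, compact output,
// no MCP tokens injected into context. Run as an ES module.
import { chromium } from "playwright";

const [, , cmd, arg] = process.argv;
const browser = await chromium.launch();
const page = await browser.newPage();

switch (cmd) {
  case "goto":
    await page.goto(arg ?? "about:blank");
    console.log(await page.title()); // one line of signal, not a DOM dump
    break;
  case "click":
    await page.click(arg ?? "body"); // arg is a selector
    console.log("clicked");
    break;
  case "shot":
    await page.screenshot({ path: arg ?? "out.png" });
    console.log("saved");
    break;
  default:
    console.error("usage: shim goto <url> | click <selector> | shot [path]");
}
await browser.close();
```

Every response is one short line, which is exactly the compact-structured-signal bias the CLI lesson calls for.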
5.4 merges top-tier coding with general reasoning in one model and adds computer use (~18:00). Lopopolo now lets Codex author his blog posts; with 5.3 he was manually juggling chat and Codex. Spark (the fast model) he hasn't figured out how to deploy: it blows through three compactions before writing a line of code. It's still useful for spiking prototypes and doc updates.
Two persistent weaknesses (~59:00): zero-to-one product from a mock (too much tacit intent) and gnarly multi-system refactorings. Lopopolo spends most of his synchronous time on the latter.
OpenAI Frontier is positioned as the enterprise deployment platform for agents: observable, identity-bound, with safety specs enforced via GPT-OSS safeguard. Two buyer personas: the employees actually using the agents, and the IT/GRC/security/AI-office stakeholders who need the dashboards, attestations, and revocation hooks (~65:00). Lopopolo's team is building an internal data agent against their warehouse, with an explicit nod that defining "what is revenue" inside an enterprise is still unsolved even for humans.
Gergely Orosz brought Martin Fowler and Kent Beck on stage at what looks like an AI-engineering conference. Both argue the agile-era disciplines — TDD, refactoring, small verifiable experiments — are exactly the muscles you need for a world where a "big powerful genie" writes most of the code.[2]Pragmatic Engineer — Fowler & Beck: Frameworks for reinventing software Beck calls this "the golden age of the junior programmer"; Fowler is more cautious, flagging an industry drowning in AI-flavored snake oil and large enterprises racing to give LLMs full email access.
Beck on the feedback he now gets (~03:00): "thank goodness for all of your pushing of TDD for the last 20 years because it's really important now we've got AI agents." His own gloss: when you've got a big powerful genie you really have to learn how to verify it's doing the right thing. Fowler is characteristically skeptical of his own pleasure at this validation ("I'm always suspicious of it cuz I want it to be true").
Fowler's framing (~11:00): 25 years of snake oil means his skepticism has to be "absolute and total" — which is why he has to be skeptical about his own skepticism. He killed blockchain outright in his head; AI he can't. Beck adds the operational version: what's the smallest experiment I can run to verify this claim is true, for my own satisfaction? That's the 1000x-more-valuable skill of the last year.
When the models come out and they're faster, I'm like, "Oh, there's less time to talk." You give a prompt and it's like, "Oh, blah blah blah," and then it's gone for 3 minutes and we can talk about our philosophy of naming.
Beck (~19:00): parents keep telling him their CS-student sons want to drop out because of AI. His analogy is carpenters and the circular saw: the young who learn fast will learn faster, and the senior who work effectively will work more effectively, but the middle — programmers who entered for the paycheck — is what worries him. With the zero-interest-rate retrenchment and the AI boom converging, he doesn't know where that middle goes.
"Large scale confusion and panic is pretty much the order of the day" (~23:00). Fowler is most worried about security: he's seen multiple large companies earnestly discuss giving LLMs full read/reply access to enterprise email. He predicts "really bad security incidents" this year.
I've now run into several different groups, including some at surprisingly large companies, that are talking about, let's have the LLM have complete control over my email. It can read all my emails, and it can reply to most of the emails. And I'm going, NO.
Beck's contrarian concern (~25:00): XP explicitly built a "safe social environment for basically antisocial people." Now the framing is "I'm a programmer with 6 agents, so I'm managing a team." He pushes back hard: you're not managing a team, you're using six tools. Two humans and n genies is a very different (and in his experience, healthier) pattern than one-human-many-genies. Slow models were actually nice — they gave you conversation time.
Fowler quotes a colleague from the previous week's Utah conference (~29:30): well-modularized code and good tests help the agents as much as they help the humans. Another colleague (Unashi) is getting traction by developing a precise language for communicating about domains with the agent — which is just DDD by another name. Craft practices carry over; the meta-practice is building that domain language together.
"I used to take a kind of OCD enjoyment in the craft, and I need to let go of that" (~31:00). That feeling of getting a messy file, making tiny safe steps, feeling it snap into focus — that doesn't have leverage anymore. He's shifting to taking satisfaction in understanding the domain rather than the program itself.
Theo argues bash was a brilliant stepping stone for agent harnesses — one tool instead of dozens — but it's missing standards for what counts as destructive, permission scoping, shared sign-in state, and typed inputs/outputs.[3]Theo — The language holding our agents back He walks through Cloudflare's Code Mode (~40% token reduction and a 25.6 → 28.5 benchmark bump from turning MCP calls into TypeScript), Vercel's Just Bash (virtualized bash in TypeScript), and Malte's Just JS (real TypeScript execution in isolated runtimes) as signposts toward a portable, shareable "environment file" for agents.
Theo's opening polemic against Repomix (~07:00): it cost T3 Chat "at least six figures" back when they priced per message. Users would dump 100K+ token codebases into a single message and get worse answers because of it. Needle-in-haystack performance falls off past ~50–100K tokens; beyond that, more tokens effectively means more randomness.
I would put a little warning at the top of the page saying, "Hey, we've learned this is the worst possible way to ever code with AI, and we recommend you do literally anything else."
A 7-token grep fetching 30 tokens of code is vastly better than dumping 100K tokens and hoping (~13:00). Theo thinks this is part of why Google's models still lag — they were optimized for long-context retrieval; Anthropic, OpenAI, and the Chinese labs optimized for tool-calling. "Every tool you add is a tool the model will try to use," so giving it just one (bash) and letting it do everything is the hack.
No standard for "is this destructive?", no wildcard approvals, no type system, no signed-in-state sharing between Cursor/OpenCode/OpenClaw, no team-scoped permissions like "sales can hit Salesforce, engineering can't." Approval prompts numb users into skipping permissions entirely — Theo admits to running --dangerously-skip-permissions (~20:00).
Cloudflare's Code Mode replaces MCP tool descriptions with TypeScript SDKs the model writes code against (~23:00). Anthropic's own data shows MCP servers consuming ~40% of context (72K tokens) just for descriptions. Code Mode results: average response dropped from 43,500 to 27,000 tokens (~40% reduction), accuracy up from 25.6 to 28.5 on their benchmarks, and much lower latency. The generated code filters the users array with a .filter() rather than handing 100K rows back to the model.
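As an illustration of that last point (not Cloudflare's actual generated code), the pattern has the model write against a typed binding so only the filtered result ever crosses back into context; `crm.listUsers` here is a hypothetical binding:

```ts
// The tool result stays inside the sandbox; only the filtered slice returns.
interface User { id: string; plan: "free" | "pro"; lastSeen: string }
declare const crm: { listUsers(): Promise<User[]> }; // hypothetical SDK binding

export async function activeProUsers(): Promise<User[]> {
  const users = await crm.listUsers(); // could be 100K rows, never tokenized
  return users.filter((u) => u.plan === "pro" && u.lastSeen >= "2026-01-01");
}
```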
Vercel's Just Bash (~26:00) is a fake bash written in TypeScript that runs in a Node isolate — the model thinks it has a real computer. Malte's Just JS adds TypeScript/JavaScript execution in the same isolation primitive, so a model can write an FS command that "never leaves RAM." Dax is experimenting with removing the bash tool from OpenCode entirely.
By creating a TypeScript environment for the LLMs to call tools through, you can create portable environments that can be shared with teams. They're super lightweight to run. They have a strong ecosystem around them, and they're strongly typed so you can get really creative with approval rules.
Theo's closing pitch: the environment file — a TypeScript file that configures the entire sandbox your agent runs in, shareable across a team. Executor (Reese), Rivet, Daytona, and adjacent sandbox companies are all converging on this shape.
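No such standard exists yet, so the following is pure speculation about the shape, with every field name invented, but it shows why TypeScript is a natural host for typed approval rules:

```ts
// Speculative "environment file": one shareable TS module that scopes tools,
// approvals, sign-in state, and sandboxing for a team's agents.
type Approval = "auto" | "ask" | "deny";

interface ToolRule {
  approval: Approval;
  destructive?: boolean; // a standard answer to "is this destructive?"
  teams?: string[];      // team-scoped permissions
}

const env: { tools: Record<string, ToolRule>; auth: string[]; fsRoot: string } = {
  tools: {
    "fs.read": { approval: "auto" },
    "fs.write": { approval: "ask", destructive: true },
    "salesforce.query": { approval: "auto", teams: ["sales"] }, // engineering can't hit it
  },
  auth: ["github", "salesforce"], // signed-in state travels with the file
  fsRoot: "./workspace",
};

export default env;
```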
Lenny's Podcast clipped a minute-long Simon Willison segment where he predicts a headline-grabbing prompt-injection incident — his Challenger-disaster analogy for the AI-agent era.[4]Lenny's Podcast — The AI Challenger disaster prediction The O-rings keep holding, so every safe launch reinforces the idea that the risk isn't real — which is exactly the "normalization of deviance" Willison worries about. He also admits he's been making this same prediction every six months for three years.
The problem we've been having with prompt injection is that we've been using these systems in increasingly unsafe ways, and so far there hasn't been a headline-grabbing story of a prompt injection where an attacker has stolen a million dollars, which means that we keep on taking risks. We have this normalization of deviance in the field of AI around how we're using these tools.
His self-aware caveat: "I've made a version of this prediction every 6 months for the past 3 years, and it hasn't happened." Pairs directly with Martin Fowler's enterprise-email alarm in today's Pragmatic Engineer segment[2]Pragmatic Engineer — Fowler & Beck: Frameworks for reinventing software — two veterans, same week, same warning.
NVIDIA released Nemotron 3 Super: a fully open 120B-parameter model trained on 25T tokens, accompanied by a 51-page paper that — unusually — publishes the training data and methodology.[5]Two Minute Papers — NVIDIA's new AI just changed everything The NVFP4 quantized version runs ~3.5x faster than the BF16 baseline and up to 7x faster than comparably-smart open models with no meaningful accuracy loss. Roughly matches the best closed frontier models from ~18 months ago.
The release packages weights + dataset + methodology in a way open releases typically don't — usually at least one of the three is missing. Károly (Two Minute Papers) flags NVIDIA's reported "tens of billions of dollars" commitment to fully open systems as a strategic pivot: the closed labs don't get to be the only game in town anymore.
The story is not just the similarly smart part. The story is that it is seven times faster while it is similarly smart.
Caveat from the host himself: on his torture-test "robotic cows with lots of math" prompt it still thinks for nearly an hour, so heavy-compute prompts still want a bigger GPU instance.
Tech Brew reports OpenAI, Anthropic, and Google are coordinating through the Frontier Model Forum to share intelligence on distillation attacks — specifically, Chinese labs scraping frontier-model outputs to train cheaper replicas.[6]Tech Brew — AI's biggest rivals unite against China Anthropic claims three Chinese AI companies used 24,000+ fake accounts to generate 16M Claude exchanges; Microsoft previously accused DeepSeek of extracting "large amounts" of OpenAI API data. US officials put the cost to Silicon Valley in the billions annually.
When DeepSeek shipped its reasoning model in January 2025, US and European tech stocks lost nearly $1T of market cap in a day. Distillation is the specific mechanism that makes the frontier labs' training-cost moats leaky: if the outputs are freely queryable and can't be copyrighted under US law, the only defenses left are chasing terms-of-service violations and political solutions.
Three frontier labs explicitly coordinating raises the usual legal concern: where does "threat intelligence sharing" end and "collusion" begin? The Frontier Model Forum (founded with Microsoft in 2023) is the stated venue, but the article flags that the Trump administration's AI Action Plan will likely be the real lever — political rather than legal.
AI outputs cannot be copyrighted under US law, forcing companies to rely on terms-of-service violations and political solutions.