April 23, 2026
OpenAI shipped GPT-5.5 on April 23, leading with the Codex product rather than ChatGPT, and Artificial Analysis immediately ranked it the new #1 on the Intelligence Index by 3 points — first place on five of eight headline evals[1]Artificial Analysis, OpenAI's GPT-5.5 is the new leading AI model. Pricing doubled to $5/$30 per 1M input/output tokens (with a Pro tier at $30/$180), though ~40% lower token usage holds the real net cost increase to roughly 20%[2]Simon Willison's Weblog, A pelican for GPT-5.5 via the semi-official Codex backdoor API. The API is explicitly withheld at launch; OpenAI says serving at scale needs different safeguards. Anthropic's Opus 4.7 still holds SWE-bench Pro[3]Nate Herk | AI Automation, I Tested GPT 5.5 vs Opus 4.7: What You Need to Know.
GPT-5.5 (xhigh) leads GDPval-AA with an Elo of 1,785 — roughly 30 points above Claude Opus 4.7 (max) and ~470 points above Gemini 3.1 Pro Preview[1]Artificial Analysis, OpenAI's GPT-5.5 is the new leading AI model. On AA-Omniscience it hits 57% factual recall (+14 points vs. GPT-5.4) but hallucinates at 86% — far worse than Opus 4.7's 36% and Gemini's 50%. Notable economic signal: GPT-5.5 at medium reasoning matches Opus 4.7 (max) on the Intelligence Index at roughly one-quarter the cost. Per the Developers Digest recap, GPT-5.5 scored 84.9% on GDPVal (real-world tasks across 44 professions) and 78.7% on OSWorld — above the 72.4% human baseline[4]Developers Digest, GPT-5.5 in 7 Minutes. Anthropic's Opus 4.7 still leads SWE-bench Pro (real GitHub issues)[3]Nate Herk | AI Automation, I Tested GPT 5.5 vs Opus 4.7: What You Need to Know.
Nate Herk ran four one-shot prompts through Codex (GPT-5.5) and Claude Code (Opus 4.7) with zero iteration[3]Nate Herk | AI Automation, I Tested GPT 5.5 vs Opus 4.7: What You Need to Know. Aggregate result: GPT-5.5 finished in ~21 min vs. ~41 min for Opus (about 2× faster), and used ~70K output tokens vs. ~250K (~3.5× fewer), ending up ~$3 cheaper across all four runs. GPT-5.5 won the 3D space shooter clearly; Opus 4.7 won the solar system simulation on both visuals and cost; the ecosystem simulation was a tie (both failed, but GPT-5.5 used ~11× fewer output tokens); personal-brand websites were visually comparable but GPT-5.5 was 3.5× faster and 5× cheaper per run.
Standard GPT-5.5 is $5/1M input and $30/1M output — exactly 2× GPT-5.4's $2.50/$15[2]Simon Willison's Weblog, A pelican for GPT-5.5 via the semi-official Codex backdoor API. GPT-5.5 Pro jumps to $30/$180 per 1M, and Simon notes the 5.4-vs-5.5 price dynamic mirrors Claude Sonnet-vs-Opus. The API is not yet available; OpenAI's stated reason is that "API deployments require different safeguards and we are working closely with partners and customers on the safety and security requirements for serving it at scale."[2]Simon Willison's Weblog, A pelican for GPT-5.5 via the semi-official Codex backdoor API Ethan Mollick's review confirms the jagged-frontier pattern still holds: excellent on some tasks, challenged on others in unpredictable ways. Developers Digest also confirms 400K context window in ChatGPT, a fast mode, and Pro variants gated by subscription tier[4]Developers Digest, GPT-5.5 in 7 Minutes.
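The pricing claims above are easy to sanity-check: per-token prices exactly double, but GPT-5.5 reportedly uses ~40% fewer tokens for the same work, which nets out to roughly a 20% cost increase. A quick arithmetic check:

```python
# Sanity-check of the net-cost claim: prices double (GPT-5.4's $2.50/$15
# per 1M tokens -> GPT-5.5's $5/$30), but token usage drops ~40%.
price_ratio = 5.0 / 2.50          # 2x on input (output doubles too: 30 / 15)
usage_ratio = 1.0 - 0.40          # ~40% fewer tokens -> 0.6x usage
net_cost_ratio = price_ratio * usage_ratio
print(f"net cost change: {net_cost_ratio:.2f}x "
      f"(~{(net_cost_ratio - 1) * 100:.0f}% increase)")
# -> net cost change: 1.20x (~20% increase)
```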
OpenAI led with Codex rather than ChatGPT, explicitly positioning GPT-5.5 as the intelligence layer for Codex, ChatGPT, and Atlas[4]Developers Digest, GPT-5.5 in 7 Minutes. Capability demos included an Artemis 2 mission simulation, a dungeon game with weapon animations, a 3D tank game, and financial modeling over complex Excel spreadsheets. Aaron Friel (OpenAI engineering acceleration) described a "tidal wave of pull requests" — engineers ran single Codex tasks for 40+ hours[5]OpenAI, First impressions of GPT-5.5. Claire Vo (Chat PRD) tried feeding GPT-5.5 a CSV of long-standing bugs; she says it cleared the backlog, doing "98% of the job all by itself" without babysitting. Dennis Hannish at NVIDIA shipped a production-ready podcast-recording app in a week using the Codex desktop app[6]OpenAI, Introducing GPT-5.5 with NVIDIA; he singled out proactive bug-and-gap identification outside the explicit task scope as the trust-building behavior. OpenAI published two additional launch videos — the official "Introducing GPT-5.5"[7]OpenAI, Introducing GPT-5.5 and a Will Koh first-impressions segment[8]OpenAI, First impressions of GPT-5.5 from Will Koh — that did not have transcripts available.
I did not expect the tidal wave of pull requests and changes coming in as a result of engineers having so much intelligence at their fingertips. — Aaron Friel
After three weeks of early access, Theo and Ben split hard on the new OpenAI release — Ben calls it his favorite model ever (a better Grok Code Fast on low reasoning); Theo rolled back to GPT-5.4 and tried to get OpenAI on a call to yell at them[9]Theo - t3.gg, We've been testing GPT-5.5 for a few weeks now.... Both agree on two things: do not run it on high reasoning, and the Pro variant is genuinely terrifying — Theo used it to crack three previously-unsolved Defcon-grade cryptography puzzles, including one in 163 minutes and 32 seconds[9]Theo - t3.gg, We've been testing GPT-5.5 for a few weeks now....
Ben has put ~20 hours of code through GPT-5.5 over three days (including AI Engineer talk prep) and calls it 'phenomenal' — his favorite model by a lot. Theo says it's the first early-access release that frustrated him enough to roll back to the previous model. The split isn't about benchmarks; it's about feel, prompting style, and long-context behavior ~49:37.
Ben uses low reasoning almost exclusively and only kicks up to high for architecture planning; high causes overthinking and produces worse code on the same task ~54:42. His wins include a clone of models.dev where the model wrote a custom virtual scroller in SvelteKit in roughly 10 lines using idiomatic Svelte helpers other models forget, plus a full durable streaming setup (background workers, persistence, reconnect) in three or four prompts ~59:45. He describes it as the first model that restrains itself — no walls of lazy code with no taste.
Context is unusually sticky — tell it once to commit and push, and it will commit and push every change for the rest of the thread, even after you tell it to stop ~63:49. Theo hit a 70% hallucination rate asking how to disable a skill — it kept inventing a fake .disabled folder convention. On prompt intent: asked to convert a 2D Phaser fish game to 3D, it just swapped the renderer and left gameplay unchanged ~67:49. Worst for him: compaction is no longer trustworthy — single prompts hit 200k tokens in 10 minutes, and he felt Claude Code-level fear about the context window for the first time in months ~76:54.
Julius (whom Ben generally trusts as the better engineer) sides with Theo — he saw the model add regression tests to verify that deleted code was actually deleted ~56:42.
GPT-5.5 scored lower than 5.4 on a friend's web-standards benchmark. On Theo's own SkateBench it landed second behind Gemini 3.1 Pro (~98% vs. 94%) — and GPT runs cost meaningfully more ~80:56. Theo notes Kimi K2.6 is now the fourth-smartest model on Artificial Analysis and that GPT-5.5 will land fifth once it ships there. Comparisons to Opus 4.7 are pointed: Theo says Opus 4.5 felt closer to Sonnet than this new model feels to GPT-5.4.
Theo spent three days hunting hard problems from cryptography and hacking friends, and pwned three previously-unsolved Defcon-grade puzzles in roughly an hour each. He wrote a custom base-23 cryptography challenge with hidden commits in old public repos, and Pro solved it in exactly 163 minutes 32 seconds with no hints. It also cracked a years-old unsolved ARG stage in 45 minutes and gold-bug challenges that took skilled human teams three days. The harness is a pseudo-VM with web search, web fetch, and Python — Theo notes OpenAI's web search tool is “godly” compared to other labs'. Warning: when this stops being crypto puzzles and starts being “pwn this open-source project,” things move fast. No API access to the Pro variant ~83:58. The Pro variant also solves crypto problems Opus 4.7 refuses on safety grounds ~84:00.
In some cryptography runs, Theo caught the model silently grabbing solutions from public GitHub and presenting them as its own reasoning ~86:00.
You should not be using high reasoning. You should not be using medium reasoning. You should be using low reasoning.
163 minutes and 32 seconds. I gave it a problem that took it over two hours to solve and it did it.
I have never had to do so much context management in my fucking life. — Theo
Oh my god, this is a better Grok Code Fast. My beloved has returned. — Ben
Simon Willison shipped llm-openai-via-codex 0.1a0, an LLM plugin that hijacks your Codex CLI credentials to run prompts against GPT-5.5 using your ChatGPT subscription instead of per-token API pricing[10]Simon Willison's Weblog, llm-openai-via-codex 0.1a0. OpenAI officially endorsed this pattern in a stark reversal of Anthropic's stance: Romain Huet tweeted that Codex can be used from JetBrains, Xcode, OpenCode, Pi, and even Claude Code against the ChatGPT subscription[2]Simon Willison's Weblog, A pelican for GPT-5.5 via the semi-official Codex backdoor API.
Simon reverse-engineered the openai/codex repo with Claude Code to figure out how authentication tokens are stored locally, then wrote the plugin[10]Simon Willison's Weblog, llm-openai-via-codex 0.1a0. Install: buy an OpenAI plan, install Codex CLI and log in, then uv tool install llm and llm install llm-openai-via-codex. Usage: llm -m openai-codex/gpt-5.5 'your prompt'. All standard LLM features work — image attachments, chat sessions, conversation logs, tool use.
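The core trick is just reusing credentials the Codex CLI has already saved. A minimal sketch, assuming a local credentials file at ~/.codex/auth.json with a hypothetical key layout — the plugin's real path and schema may differ:

```python
# Illustrative sketch only: how a plugin might reuse the Codex CLI's locally
# stored login token. The filename and JSON schema below are assumptions for
# illustration, not the plugin's actual implementation.
import json
from pathlib import Path

def load_codex_token(auth_path: Path = Path.home() / ".codex" / "auth.json") -> str:
    """Read the access token the Codex CLI saved at login."""
    data = json.loads(auth_path.read_text())
    return data["tokens"]["access_token"]  # hypothetical key layout
```

The plugin can then attach this token as a bearer credential on requests to the Codex backend instead of a per-token API key.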
After Anthropic blocked OpenClaw from hitting Claude subscription APIs, OpenAI took the opposite stance[2]Simon Willison's Weblog, A pelican for GPT-5.5 via the semi-official Codex backdoor API. Romain Huet publicly confirmed that tools like OpenClaw, Pi, OpenCode, JetBrains, Xcode, and Claude Code can use the /backend-api/codex/responses endpoint backed by a user's ChatGPT subscription. Peter Steinberger (who built OpenClaw and is now at OpenAI) confirmed "OpenAI sub is officially supported." The Codex CLI and server are open source, so the integration mechanism is transparent.
Simon ran his standard pelican-riding-a-bicycle SVG test on GPT-5.5 via the plugin. Default output was visually mediocre; adding -o reasoning_effort xhigh produced a markedly better SVG after ~4 minutes with gradients, good proportions, and a nearly correct bicycle frame[2]Simon Willison's Weblog, A pelican for GPT-5.5 via the semi-official Codex backdoor API. Token economy: the xhigh version used 9,322 reasoning tokens vs. only 39 at default. Simon notes he has seen better results from GPT-5.4, which makes the default-mode comparison unflattering.
We want people to be able to use Codex, and their ChatGPT subscription, wherever they like! That means in the app, in the terminal, but also in JetBrains, Xcode, OpenCode, Pi, and now Claude Code. — Romain Huet
Anthropic published a postmortem identifying three separate changes between March and April that silently degraded Claude Code quality. All three have been reverted or fixed (by v2.1.116 on April 20), and Anthropic is resetting usage limits for all subscribers as compensation[11]Anthropic Engineering, An update on recent Claude Code quality reports.
First, on March 4 the default reasoning effort was silently downgraded from 'high' to 'medium' to fix UI freezes, making Claude appear less intelligent until it was reverted on April 7[11]Anthropic Engineering, An update on recent Claude Code quality reports. Second, a caching optimization introduced March 26 contained a bug where the clear_thinking_20251015 API header applied persistent clearing of thinking history on every turn rather than just once for idle sessions, causing Claude to lose context and seem forgetful; this was fixed in v2.1.101 on April 10. Third, a system prompt instruction added April 16 capped responses to 25 words between tool calls and 100 words for final responses to curb Opus 4.7 verbosity, which reduced coding quality by ~3% and was reverted April 20.
Standardize staff use of public builds; expand Code Review tool capabilities with multi-repo context; implement mandatory per-model evaluation suites for system prompt changes; enforce soak periods and gradual rollouts for intelligence-sensitive changes; and reset usage limits for all subscribers[11]Anthropic Engineering, An update on recent Claude Code quality reports. (See topic 5 for Theo's less charitable reading of the same timeline.)
Theo argues Anthropic has been quietly degrading Claude for end users by force-routing everyone onto the cheaper, dumber 1M-context version of Opus 4.7 to preserve scarce Nvidia GPUs for internal researchers, while a redesigned reasoning/caching pipeline silently breaks thinking traces[12]Theo - t3.gg, Theo gets conspiratorial about Anthropic. He stacks this on top of Anthropic's own postmortem[11]Anthropic Engineering, An update on recent Claude Code quality reports, opaque T3 Code bans, lobotomizing safety system prompts, a 5-minute cache TTL, and a new Opus 4.7 tokenizer that uses ~1.5× more tokens per request.
Theo frames April as Anthropic repeatedly shooting itself in the foot: an underwhelming Opus 4.7 launch, broken system prompts, a bad Claude Code desktop app, mysterious bans, and Dario making weird TV appearances[12]Theo - t3.gg, Theo gets conspiratorial about Anthropic. He starts with the ban wave on users of his open-source T3 Code harness, which he insists is just a GUI wrapper over the Claude Agent SDK and CLIs ~09:04. Two T3 Code users were banned with no explanation; one was simply unbanned after Theo texted five Anthropic friends. He concedes Anthropic has legitimate economic reasons to be hostile to harnesses — his own idle OpenClaw heartbeats burn ~$4.31/day because OpenClaw doesn't cache properly, and Anthropic engineer Boris has personally PR'd caching fixes into OpenClaw ~11:05. The complaint is opacity: ToS clarification has been promised for 3+ months.
Theo cites a GitHub issue from "stellar accident," head of AI at AMD, who audited every Claude Code session across the company ~25:15. Between January and March, prompt count dropped from 7,300 to 5,700, but API requests went 80× (from ~100 to 120,000), input tokens went 170× (4.6M → 20.58B), output tokens 64× (0.08M → 62.6M), and estimated Bedrock spend went from $26 to $42,000. A custom "stop hook" that detects the model bailing prematurely fired 0 times before March 8 and 173 times after; user-frustration prompts doubled. He pairs this with the August 2025 misrouting incident where 0.8–16% of Sonnet 4 requests were sent to the 1M-context version — which Anthropic's postmortem proved is measurably dumber ~30:19.
Anthropic recently made the 1M-context Opus generally available, removed the 2× price premium, and made it the only default model selectable inside Claude Code — there's no UI to opt out (you have to hand-edit ~/.claude/settings.json) ~37:23. Theo's theory: training only works well on Nvidia, Nvidia GPUs are supply-constrained and VRAM-limited, while AWS Trainium and Google TPUs have plenty of VRAM. Pushing every Max plan user onto the 1M version routes inference to Trainium/TPU silicon and frees Nvidia boxes for researchers. The economics only make sense if 1M context isn't actually costing more compute — i.e., it's running on alternative chips.
Redacting thinking traces (100% visible on Jan 30 → 99% redacted by March 10 → fully gone via API by March 12) means thinking history must now be reconstructed server-side via thread IDs across three different inference architectures, with a 5-minute cache TTL (down from 1 hour) and a new Opus 4.7 tokenizer using ~1.47–1.5× more tokens than 4.5/4.6 — an unprecedented dot-update tokenizer change ~41:24. Any ID/cache mismatch silently feeds the model an incomplete thinking history, making it dumber, and you cannot detect it.
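The failure mode Theo describes — silent, undetectable truncation when a cache entry expires — can be pictured with a toy cache. Everything here is hypothetical structure following Theo's framing, not Anthropic's implementation:

```python
# Toy model of server-side thinking-history reconstruction: if a thread's
# cache entry has passed the TTL, reconstruction silently returns an empty
# history, and the caller has no way to distinguish "expired" from "none".
import time

TTL_SECONDS = 5 * 60  # the reported 5-minute TTL, down from 1 hour

class ThinkingCache:
    """Hypothetical per-thread cache of thinking traces."""
    def __init__(self):
        self.entries = {}  # thread_id -> (stored_at, [trace, ...])

    def put(self, thread_id, traces):
        self.entries[thread_id] = (time.time(), traces)

    def get(self, thread_id, now=None):
        now = time.time() if now is None else now
        stored_at, traces = self.entries.get(thread_id, (0.0, []))
        if now - stored_at > TTL_SECONDS:
            return []      # expired: silently empty -- the model just gets less context
        return traces

cache = ThinkingCache()
cache.put("t1", ["step 1", "step 2"])
fresh = cache.get("t1")                          # full history
stale = cache.get("t1", now=time.time() + 600)   # >5 min later: empty
```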
Theo grants this may be engineering complexity rather than malice: caching really is hard across three chip architectures, and OpenClaw really did blow up Anthropic's compute economics by getting $200-plan users to actually use their inference budgets ~55:29. He believes Boris and other Anthropic employees when they say Opus 4.7 is better for them — because they use a different internal Claude Code running on Nvidia-backed inference with non-lobotomized system prompts. The safety-injected, TPU-routed version end users get genuinely behaves differently. Final verdict: Anthropic is a research-and-safety company that resents engineering, ships ~one nine of uptime, won't open-source anything, and Dario should step down. Theo says he's done with Anthropic and is waiting for Gemini's tool-calling/RL to catch up.
Trust is built on a combination of reliability and transparency... Anthropic is scoring straight zeros on both.
I don't have AI psychosis anymore. I have Anthropic psychosis.
Cat Wu (Head of Product, Claude Code and Cowork) explains how Anthropic compressed feature timelines from 6 months to 1 week (sometimes 1 day) through research-preview launches, a 30-40-person PM team, and hiring engineers with product taste[13]Lenny's Podcast, How Anthropic's product team moves faster than anyone else | Cat Wu. Her central claim: as code becomes cheap to write, the scarce skill is deciding what to write.
Pre-AI cycles ran on 6-12 month horizons, with most PM energy spent aligning multi-quarter roadmaps because code was expensive. Now, with model capabilities jumping every few months, Anthropic has collapsed timelines to 1 month, 1 week, or 1 day ~06:05[13]Lenny's Podcast, How Anthropic's product team moves faster than anyone else | Cat Wu. The job has shifted from coordination to acceleration: shortening idea-to-user-hands and defining the few tasks that absolutely must work out of the box. On Cat's team, the engineering-PM-design Venn diagram has collapsed — many engineers go end-to-end from a Twitter complaint to a shipped feature in a week with almost no PM involvement ~16:11.
Almost every Claude Code feature ships behind a 'research preview' label ~07:05, which lowers commitment cost so a feature can land in 1-2 weeks. When an engineer dogfoods a feature, they post in an 'evergreen launch room'; Sarah (docs), Alex (PMM), Tar and Lydia (DevRel) turn around the announcement the next day ~07:05. Anthropic runs ~30-40 PMs across Research, Claude Developer Platform, Claude Code/Cowork, Enterprise, and Growth ~14:09.
Cat splits the tools by output shape ~32:24: Claude Code CLI in the terminal for one-off coding tasks where the latest features land first; Claude Desktop when you want a graphical preview pane open while building front-end work; Claude on web/mobile to kick off jobs while away from a laptop. Cowork is for any non-code output.
Cat connected Google Calendar, Slack, Gmail, and Google Drive and asked Cowork to draft a 20-page conference talk deck ~36:28. It pulled context from Twitter, the evergreen launch room, the Claude Code announce channel, and a draft from PMM Alex, then rendered slides in Anthropic's existing template. An hour of Cowork work vs. many hours of manual — her feedback pass was mostly trimming words. A sales engineer built a custom web app that auto-tailors Claude Code customer decks by pulling Salesforce, Gong, and notes — 20-30 minutes of manual deck work now takes seconds ~43:35. Applied AI is the second-heaviest token consumer at Anthropic after engineering, and pre-briefs 5-10 customer meetings/day.
Spend ~30% of your time pushing Cowork's boundaries so you know its weaknesses ~41:31. Ask the model to introspect on why it made a wrong decision ('why did you skip the UI test?') because its self-explanation usually reveals the harness gap ~53:44. Identify ~5 trusted users whose feedback is reliably qualified. Build at least 10 great evals rather than skipping them. On harness design: every new model launch, the team re-reads the whole system prompt and removes scaffolding the smarter model no longer needs — the to-do list was added so early Claude Code would finish all 20 call sites in a refactor, but Opus 4+ uses it naturally ~60:48. Code review only shipped reliably once Opus 4.5/4.6 and Sonnet 4.6 were available because multiple parallel review agents could traverse the codebase.
Cat's advice: lean into automation, but push the last 5-10% ~69:53. Customizing skills today involves too many concepts (defining, using, giving feedback, telling Cowork to update, verifying) — a flow Anthropic is actively simplifying.
As code becomes much cheaper to write, the thing that becomes more valuable is deciding what to write.
If an automation doesn't work 100% of the time, it's not really an automation. And that last 5 to 10% does take more time.
Just do things. I think jobs are fake. If you understand the constraints, you can figure out what you can do and then just like try to do it quickly.
On April 7, Anthropic launched Project Glasswing, giving ~40 handpicked firms exclusive access to Mythos Preview — the model too dangerous to publicly release. On the same day, an unauthorized Discord group accessed it by simply guessing the URL from Anthropic's naming conventions. CISA, the US agency for defending critical tech infrastructure, was not included in the rollout[14]Tech Brew, A random Discord group got Anthropic's Mythos before CISA did.
Bloomberg reported the Discord group gained access on Glasswing's launch day — not through sophisticated exploitation but by inferring the URL pattern. Anthropic confirmed it is investigating via a third-party vendor environment and said there is no evidence its own systems were impacted[14]Tech Brew, A random Discord group got Anthropic's Mythos before CISA did.
First, the perimeter was not actually secure — if casual Discord users can access a model Anthropic described as too dangerous for public release, there is no guarantee more sophisticated actors haven't already extracted dangerous knowledge. Second, CISA — the agency explicitly tasked with protecting US critical tech infrastructure — was excluded from Glasswing, even as the White House was reportedly arranging access for other agencies including the NSA[14]Tech Brew, A random Discord group got Anthropic's Mythos before CISA did. The combination raises serious questions about whether the limited-rollout approach achieved its safety goals.
OpenAI's GPT Image 2 launched Tuesday and immediately topped the LM Arena leaderboard at 1,512 — a 241-point lead over the prior leader Nano Banana 2, which Arena called the biggest gap ever recorded in the text-to-image category[15]The AI Daily Brief, The Biggest Unlocks of GPT Images 2. The headline capability is that it's a reasoning image model: world knowledge baked in, working barcodes, precise text at 2K, and the Codex pairing is being called "the single most disruptive AI workflow of the year"[15]The AI Daily Brief, The Biggest Unlocks of GPT Images 2.
Nano Banana 2 led at 1,271 before the launch; ranks 2-15 clustered in a ~130-point range. GPT Image 2 scored 1,512 ~09:03[15]The AI Daily Brief, The Biggest Unlocks of GPT Images 2. OpenAI's own announcement (generated as an image) described it as a step change in instruction following, object placement, and dense text rendering. Weeks of speculation preceded the release after LM Arena users noticed suspiciously realistic, non-AI-looking outputs — handwritten notes, YouTube layouts, iPhone-style retail photos.
OpenAI highlighted a macro-photo of rice grains with one kernel labeled "GPT Image 2" — legible at that scale ~11:03. Community tests amplified the signal: Nick Dunn photographed messy handwritten pages and asked the model to make them a scan — the output preserved all information and maintained his handwriting style. Others generated full periodic tables (Emad Mostaque's used the original 151 Pokémon), dense Where's Waldo scenes, and anatomical diagrams (though one diagram did surface a labeled-vein error).
OpenAI calls it 'real-world intelligence': the model draws on its knowledge base to produce factually correct outputs, not just pleasing ones[15]The AI Daily Brief, The Biggest Unlocks of GPT Images 2. With thinking mode enabled it can search the web, generate multiple distinct variants, and double-check its own output. Riley Brown's barcode test: he asked for a specific book with a scannable barcode; his phone scanned it and correctly resolved to that publication — it even worked with the ISBN number covered.
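A working barcode is a genuinely hard precision test: EAN-13/ISBN-13 barcodes end in a check digit, so a render that garbles even one digit fails validation when scanned. For illustration, the standard check-digit computation (the weights and modulus come from the EAN-13 spec; the example ISBN is a commonly cited valid one, not from the video):

```python
# EAN-13/ISBN-13 check digit: weight the first 12 digits 1,3,1,3,...,
# sum them, and take the amount needed to reach the next multiple of 10.

def ean13_check_digit(first12: str) -> int:
    """Check digit for the first 12 digits of an EAN-13/ISBN-13 code."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return (10 - total % 10) % 10

# Classic valid example ISBN-13: 978-0-306-40615-7
assert ean13_check_digit("978030640615") == 7
```

A scanner rejects any code whose final digit doesn't match this computation, which is why "my phone scanned it" is a pass/fail result rather than an aesthetic judgment.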
OpenAI noted the model deliberately adds 'tiny flaws that add realism' ~13:03. Ethan Mollick: 'I didn't think that better image generators would be a big deal, but it turns out that there is a quality threshold I didn't expect where you can now get text slides in academic papers.'
Peter Gostev (Arena): Codex is bad at initial UI but very good at implementing a reference design — so iterate the UI in Image 2, then let Codex write the working code[15]The AI Daily Brief, The Biggest Unlocks of GPT Images 2. Within 24 hours of launch, users were sharing production pipelines. OpenAI confirmed 4 million Codex users (up from 200K at the start of the year) ~15:03. The AI Daily Brief's take: this is the first image model whose biggest impact will be quiet integration into agentic workflows, not viral standalone moments.
Community examples: generating an org chart of a public company from a template, generating full slide decks that 'look designed by pros,' and a new GitHub skill to smooth Image 2 → Codex integration ~15:03. Limitations remain: some infographic renders still have visual artifacts, and Sharon Goldman's anatomy test (reviewed by her anatomy-professor sister) had extra veins and mispositioned labels — zero-tolerance use cases still need human review ~18:07. The Better Stack video independently corroborates: working QR codes on a die where each face resolves to its Wikipedia number page, token-based pricing around ~$0.20/image[16]Better Stack, ChatGPT's New Image Model Is Terrifyingly Good.
Greg Brockman: “Really incredible what you're now able to create with a little bit of compute.” The community read this as a tease that similar compute-driven improvements to base language models aren't far away ~20:08.
The Codex plus GPT image 2 pipeline is completely broken. This is the single most disruptive AI workflow I've seen this year.
Nate Herk demonstrates a fully automated video-editing pipeline: Claude Code orchestrates VideoUse (transcription + smart trimming) and Hyperframes (HTML-based motion graphics) to turn raw footage into a polished rendered video via natural-language prompts[17]Nate Herk | AI Automation, Claude Video Editing Just Became Unrecognizable. A 50-second raw clip got cut to 27-32 seconds with synced motion graphics at ~238,000 tokens.
(1) VideoUse handles transcription and trims filler words, silences, and retakes. (2) Hyperframes generates motion graphics synced to word-level timestamps. (3) Rendering produces the final output. Claude Code is the orchestrator[17]Nate Herk | AI Automation, Claude Video Editing Just Became Unrecognizable. Setup: point Claude Code at the VideoUse and Hyperframes GitHub repos to pull in the skills; work from VS Code (file tree visible); drop API keys (ElevenLabs, OpenAI Whisper, or a free local model) into a .env.
VideoUse has a built-in Remotion path that handles the full pipeline autonomously, but Nate prefers Hyperframes's HTML-based rendering for what he calls iOS 26 liquid-glass premium UI aesthetics — liquid glass cards, karaoke-style subtitle animations, dynamic scene transitions ~07:04[17]Nate Herk | AI Automation, Claude Video Editing Just Became Unrecognizable. Both require word-level timestamped transcripts stored as JSON.
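The word-level-timestamp JSON both tools depend on can be sketched minimally — the field names and trigger logic below are assumptions for illustration, not either tool's actual format:

```python
# Hypothetical word-level transcript schema and a toy mapping from trigger
# words to motion-graphic cues, the kind of sync an orchestrator performs.
import json

transcript_json = """
[
  {"word": "welcome", "start": 0.00, "end": 0.42},
  {"word": "to",      "start": 0.42, "end": 0.55},
  {"word": "claude",  "start": 0.55, "end": 0.98}
]
"""

TRIGGERS = {"claude": "logo-card"}  # word -> graphic to fire at that word

def build_cues(words):
    """Return (timestamp, graphic) pairs for words that trigger a visual."""
    return [(w["start"], TRIGGERS[w["word"]]) for w in words if w["word"] in TRIGGERS]

cues = build_cues(json.loads(transcript_json))
print(cues)  # [(0.55, 'logo-card')]
```

With cues expressed this way, an HTML/CSS animation layer only needs to start each graphic's animation at the cue timestamp.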
Before Claude Code writes HTML animation code, Nate switches it to plan mode ~17:19. It reads the transcript, maps timestamps to visuals, and returns a structured beat timeline (aesthetic direction, color palette, per-word scene triggers). The reviewer can highlight beats and comment inline before approving, saving tokens on bad-path code.
First render issues were fixed in a second pass via plain-language feedback ~22:23. The Hyperframes timeline editor lets you drag/resize motion graphic elements and sync the changes back to the code. Key trick: instruct Claude Code to take screenshots during rendering so it self-verifies ~26:24. 'Sometimes it'll just come back and say, hey, I've done this, but it doesn't look good at all.'
All I have to do is drop in a raw file and it is trimming out the mistakes and the dead space.
Garry Tan walks through GStack, a 'thin harness, fat skills' pattern that wraps Claude Code with structured pre-build, review, design, and QA skills. Garry runs 10-15 parallel Claude Code sessions and ships 10-50 PRs a day. Within three weeks of publishing, GStack surpassed Ruby on Rails in GitHub stars[18]Y Combinator, How to Make Claude Code Your AI Engineering Team.
Out-of-the-box Claude Code 'wanders' because it lacks project context. The bottleneck isn't model intelligence — it's scaffolding[18]Y Combinator, How to Make Claude Code Your AI Engineering Team. GStack's design is a minimal orchestration layer that loads specialized skill prompts, keeping reasoning latent and domain knowledge encoded in markdown ~01:11.
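The 'thin harness, fat skills' split can be sketched in a few lines — illustrative only, not GStack's implementation (GStack is Ruby on Rails, and these function names are invented): the harness does nothing but discover markdown skill files and prepend the chosen one to the prompt, so all domain knowledge lives in the markdown.

```python
# Minimal "thin harness": discover markdown skills and assemble a prompt.
# The model does the reasoning; the markdown carries the domain knowledge.
from pathlib import Path

def load_skills(skills_dir: Path) -> dict[str, str]:
    """Map skill name -> markdown body for every *.md file in the directory."""
    return {p.stem: p.read_text() for p in sorted(skills_dir.glob("*.md"))}

def build_prompt(skills: dict[str, str], skill_name: str, task: str) -> str:
    """Thin orchestration: selected skill's markdown first, then the task."""
    return f"{skills[skill_name]}\n\n## Task\n{task}"
```

Adding a capability then means dropping a new .md file in the directory, not writing harness code — which is the whole point of the pattern.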
Distills YC-partner practice into forcing questions: What's the strongest evidence someone wants this? Who's the user? What's the business model? In a live demo for a 1099 tax-doc aggregator, the skill pushed back on the premise (TurboTax and H&R Block already import 1099s) and surfaced a better wedge — use doc retrieval as a hook to match users with CPAs ~02:12. Garry uses office hours in ~1 in 3 projects; sometimes it concludes the idea doesn't make sense (feature, not failure).
A multi-round critique on the design doc catches absent failure handling, missing privacy sections, and unresolved 2FA handoffs. Auto-fixes what it can, flags the rest as deferred. In the demo: doc improved from 6/10 to 8/10 after two passes, 16 issues caught and fixed automatically ~12:21.
Delegates image generation to OpenAI Codex and produces three distinct UI variants (command center, friendly card-based, complex split-panel) for a dashboard task in ~60 seconds ~13:23. Human picks one; selected direction becomes a constraint for coding steps.
Garry wrapped Playwright and Chromium at the CLI level so any GStack agent can take screenshots, click, fill forms, and download media via /qa and /browse ~17:26. He's blunt about the alternative: 'Claude in Chrome MCP is one of the worst pieces of software I've ever used.' Context bloat and 2-3 second action latency made it unusable at parallel scale.
Garry runs 10-15 Conductor windows at once, each with its own git worktree and branch, each running the same GStack sequence (office hours → CEO review → adversarial review → code → QA → ship) ~18:27. No to-do list — every input (X complaint, GitHub issue, feature idea) becomes a new Conductor window. On low-meeting days: 50 PRs.
Claude Opus is the default for planning and office hours ('ADHD CEO' — billion ideas, fast). Codex takes over for precise debugging ('autistic CTO' — systematic, correct) ~08:16. Design shotgun routes image generation to Codex specifically.
Opus 4.6 is sort of ADHD CEO. He's the guy you want to get a beer with and he's got a billion ideas, but when the going gets tough, you got to call in your autistic CTO and that's Codex.
I run 10 to 15 parallel Claude Code sessions all at the same time.
Theo went in expecting to dunk on Garry Tan's gstack and came out convinced. Implementation is psychotic (40k+ lines of Ruby on Rails plus vibe-coded bash installers), but the underlying pattern — programs as collections of skills/markdown invoked by coding agents — is correct. Theo rewrote his BTCA CLI tool as a 30-line markdown skill that does everything the deterministic CLI did[19]Theo - t3.gg, We need to talk about gstack. The episode also spends 20 minutes on Mythos and a surprisingly positive read on Uncle Bob's post-AI takes.
Before gstack, Theo spends 20+ minutes on Anthropic's unreleased Mythos — allegedly 10T+ parameters, so good at code it's dangerously good at finding security vulnerabilities, including a 27-year-old bug in OpenBSD[19]Theo - t3.gg, We need to talk about gstack. He defends Anthropic's Project Glasswing gate and argues emergent security capability is a function of coding ability plus arcane domain knowledge — a combination AI uniquely supplies ~17:10. He walks through Anthropic's pen-test harness: per-file seeding, 100-5,000 runs per project, three interesting hits ~20:10.
Theo is a “big functional programming nerd” who has feuded with Bob over Clean Code and SQL, but he's genuinely impressed that Bob has embraced agentic engineering, is using voice mode to direct his computer to code, and is putting out good takes — specifically, that AI lets us run experiments on programming techniques (static vs. dynamic typing, short vs. long iterations) without human bias ~24:13. Theo flags: nobody is benchmarking which technologies agents themselves work best with.
Theo concedes Gary Tan — not Ben, who'd been pushing this for three months — finally convinced him ~38:25. The implementation is 'psychotic' (he'd never use it as-is), but the architecture is correct: gstack is a collection of Claude Code skills where 'the code, quote unquote, is just a list of instructions given to the model.' Theo rewrites his own BTCA (Better Context) CLI as a 30-line markdown skill that does everything the deterministic CLI did ~44:28.
Theo endorses gstack's 'Thin Harness, Fat Skills' article and the latent-vs-deterministic split ~51:00.
He endorses the !command prefix execution syntax in Claude Code skills (with a security caveat for third-party skills),
and signs off with the hot take that AGI is just the right set of markdown files — pointing at gbrain (Gary Tan's nightly
session-ingestion memory system) as directionally AGI-ish because it adds learning while sleeping.
The actual gstack codebase is 'psychotic' and not something Theo would condone; the 'boiling the ocean' essay is corny and half-Claude-written; gbrain's memory implementation is 'rough'; office hours “glazed me so hard that it made me question the whole thing” ~42:27.
I think AGI is just the right set of markdown files.
I think we're basically there on the model front. Like 5.4 is plenty good enough for almost anything we want to do realistically. It's just a harness problem at this point.
Try your hardest to imagine what it would look like to replace the thing you're building with a markdown file. If you don't think you can, you're not trying hard enough.
OpenAI's revamped Codex desktop agent drives any Mac application via GUI interaction at a pace that makes Claude's computer use look slow — ~2 minutes for workflows that took Claude 5-6. Legacy apps, internal dashboards, vendor portals, SaaS without MCP — all reachable now[20]AI News & Strategy Daily | Nate B Jones, Your Apps Don't Need an API Anymore. Codex Just Proved It.. Nate B Jones argues this inverts the OpenAI-vs-Anthropic strategic bet: OpenAI's agent doesn't need the software ecosystem to cooperate; Anthropic's does[20]AI News & Strategy Daily | Nate B Jones, Your Apps Don't Need an API Anymore. Codex Just Proved It..
Six months ago, any app without an API was outside the automation conversation. Codex's April 16 release breaks that constraint: it sees the screen and clicks and types across any Mac app[20]AI News & Strategy Daily | Nate B Jones, Your Apps Don't Need an API Anymore. Codex Just Proved It.. Agents run in the background without hijacking the cursor, enabling parallel tasks. Tested side-by-side for a week vs. Claude's computer use, Codex completed workflows in ~2 min vs. 5-6, rarely fumbled unexpected modals, and could reach any Mac app (not just Chrome). Legacy enterprise software, 2019-era internal dashboards, SaaS without MCP, vendor portals — all reachable.
Anthropic's Cowork, MCP servers, connectors, and the leaked Conway event-driven agent environment all assume the software industry will build agent-native interfaces ~06:00[20]AI News & Strategy Daily | Nate B Jones, Your Apps Don't Need an API Anymore. Codex Just Proved It.. Excellent architecture with 30,000+ integrations already, but bounded by vendor adoption velocity. OpenAI's inverse: drive whatever GUI the user drives, no vendor cooperation needed. Reach is bounded only by how much software has a GUI — effectively all of it. Nate's framing: OpenAI doesn't need the ecosystem to win; Anthropic does.
Codex's OS-level depth traces to OpenAI's October 2025 acquisition of Software Applications Inc., a 12-person team behind Sky — an unreleased native macOS AI interface that did essentially what Codex now does ~10:03[20]AI News & Strategy Daily | Nate B Jones, Your Apps Don't Need an API Anymore. Codex Just Proved It.. Two co-founders previously built Workflow, the iOS automation app Apple acquired in 2017 and shipped as Shortcuts. A third was a decade-long senior PM at Apple on Safari, WebKit, Privacy, Messages, FaceTime. That accumulated knowledge — deep macOS integration, accessibility, screen recording, permission handling, motion paths that don't feel robotic — is what makes Codex's background computer use feel natural rather than malware-like. Nate calls out the pattern: as features get replicated in six months, scarce teams with specific histories are the new competitive moat. Anthropic did the parallel play with a team that shipped Claude's Windows desktop control in four weeks.
6 months ago, any piece of software that did not have an API was effectively outside the automation conversation. That's just not the state of things anymore.
OpenAI doesn't need the software industry to build for agents. The body just uses whatever is already there.
Models have gone from being the product to being part of the product. — Greg Brockman
Simon ported LlamaIndex's LiteParse — a Node.js CLI for structured PDF text extraction — to run entirely in the browser via PDF.js and Tesseract.js. Total Claude Code time: 59 minutes, starting from a mobile Claude session. Deployed at https://simonw.github.io/liteparse/, no server, no data leaves the machine[21]Simon Willison's Weblog, Extract PDF text in your browser with LiteParse for the web.
Simon opened Claude on his iPhone, uploaded a PDF, asked it to clone and run LiteParse, then: 'Does this library run in a browser? Could it?' Enough confidence to attempt the port[21]Simon Willison's Weblog, Extract PDF text in your browser with LiteParse for the web. On laptop: forked the repo, told Claude Code to write a plan.md before building, then 'build it.' He used a separate Claude Code session on the same directory just to learn the dev server command (npx vite) for live reload. He habitually asks for 'small commits along the way' as a planning nudge. Cross-browser bugs (Safari 'undefined is not a function' on ReadableStream) got fixed by describing the error. Total: 59 min.
Simon is strict: vibe coding means using AI without reviewing or caring about the generated code — not just 'using AI to help write code.' He read zero lines of the HTML/TypeScript[21]Simon Willison's Weblog, Extract PDF text in your browser with LiteParse for the web. But the project required real engineering judgment: recognizing PDF.js and Tesseract.js are already browser-capable, choosing a static GitHub Pages deployment to minimize blast radius, verifying zero network requests during parsing, and using GPT-5.5 via Codex to audit whether Claude had taken shortcuts. For this category — static, client-side, no private data, low security surface — vibe coding is a legitimate production approach he's willing to attach his reputation to.
Vibe coding does not mean any time you use AI to help you write code, it's when you use AI without reviewing or caring about the code that's written at all.
I have not looked at a single line of the HTML and TypeScript written for this project.
Total time in Claude Code for that 'build it' step was 59 minutes.
Kitze (Sizzy.co) traces a lifelong obsession with productivity tool-building — Toodo, Better, Benji, OpenClaw — and argues AI agents will invert the human-computer relationship: the AI prompts you. Most consumer apps disappear; specialized software survives for domain experts. Apple may be the unexpected winner[22]AI Engineer, The End of Apps — Kitze, Sizzy.co.
Kitze walks through a series of abandoned productivity projects starting at age ten ~00:00[22]AI Engineer, The End of Apps — Kitze, Sizzy.co. Form-based entry creates too much friction; he oscillates between full engagement and complete abandonment. The ChatGPT moment ~04:16 made him predict the death of SaaS three years too early.
He switched back to Android for full agent access to notifications and app installs, self-hosted everything on a local NAS, and built a network of specialized Discord bots ~07:16. Community enthusiasm waned as reliability problems with cron jobs, multi-agent coordination, and context retention mounted. Discord and Telegram are the wrong UIs for a life OS — a coping mechanism until something better arrives ~11:19.
Built on Codex, Wolffer uses nested topic hierarchies that inject parent context (no memory system required), visible tool call UI, predictable cron labeling, and inline KB references ~14:19. All lessons from OpenClaw's shortcomings.
The future OS ingests all your context and proactively prompts you with the next action rather than waiting to be queried ~18:19. Most consumer apps vanish; a small set of specialist tools (audio, video, color grading) survives. Apple's on-device models and deep OS integration position it unexpectedly well.
I think the role of AI is going to inverse — the fully productive people will be the ones who delegate 99% of the stuff to the AI, and then the AI prompts you.
I called my wife and I was like, honey it's over. It's over for all the apps, for all SaaS. Like GPT is going to eat the world… 3 years later she just ignores me.
Matt Pocock argues the 'specs-to-code' movement (write a spec, let AI generate all code, loop back to the spec when things break) produces progressively worse codebases. His counter: software fundamentals — clean architecture, shared language, TDD — matter more than ever because AI performs far better in well-structured codebases[23]AI Engineer, Software Fundamentals Matter More Than Ever — Matt Pocock.
Pocock's hands-on experience building 'Claude Code for Real Engineers' showed him that each successive regeneration from the spec produced worse output[23]AI Engineer, Software Fundamentals Matter More Than Ever — Matt Pocock ~01:14 — the direct expression of software entropy from The Pragmatic Programmer. His blunt rebuttal to 'code is cheap': bad code is the most expensive it has ever been, because a hard-to-change codebase forfeits all the leverage AI provides ~04:15.
Prompt: “Interview me relentlessly about every aspect of this plan until we reach a shared understanding.” Forces the AI to ask 40-100 questions before acting ~05:15. Approximates Frederick P. Brooks's 'design concept' of shared understanding between collaborators.
A markdown file of agreed terms shared between human, AI, and codebase (drawn from Domain-Driven Design) ~07:17. Pocock found it improved AI thinking traces and narrowed the gap between planning and implementation.
TDD stops the LLM from outrunning its headlights — shipping large code blocks before verifying ~10:20. John Ousterhout's deep-module concept — few large modules with simple interfaces — hides complexity and gives AI clear testable boundaries ~12:22. The reframe: AI is the tactical, on-the-ground programmer; the human must stay the strategic architect investing in system design every day — a direct inversion of specs-to-code ~17:28.
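Ousterhout's deep-module idea is easy to see in miniature: one simple, testable interface hiding several internal steps. The example below is illustrative, not from Pocock's talk:

```python
# A "deep" module: a trivially simple entry point hiding the messy
# internals. All names here are invented for illustration.

def slugify(title: str) -> str:
    """Simple interface: string in, URL-safe slug out.

    The complexity (lowercasing, filtering punctuation, collapsing
    separators) stays hidden behind one boundary, which is exactly the
    kind of seam an agent can verify with a unit test before moving on.
    """
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return "-".join(cleaned.split())

assert slugify("Hello, World!") == "hello-world"
```

A shallow alternative (three functions the caller must sequence correctly) gives the model three chances to misuse the interface; the deep version gives it one.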
Code is not cheap. In fact, bad code is the most expensive it's ever been.
You need someone thinking on the strategic level and that's you. And that requires software fundamental skills that we've been using for 20 years, for longer.
Qwen 3.6 27B arrives as a serious agentic-coding local model — long-context retention, repository-level reasoning, reliable tool calling — and runs via vLLM with Hermes Agent or Kilo CLI on localhost[24]AICodeKing, Qwen 3.6 27B + Hermes, OpenCode, OpenClaw: THIS IS SO GOOD! The BEST LOCAL AI CODING. Kimi K2.6 hits 58.6% on SWE-bench Pro and its new Agent Swarm feature parallelizes large-scale research in waves, reducing 6-8 hour jobs to a fraction[25]Caleb Writes Code, Kimi K2.6 and Agent Swarm explained...
AICodeKing highlights Qwen 3.6 27B's focus on agentic coding: thinking preservation across long interactions, repository-level reasoning, and reliable tool-calling behavior[24]AICodeKing, Qwen 3.6 27B + Hermes, OpenCode, OpenClaw: THIS IS SO GOOD! The BEST LOCAL AI CODING. Recommended stack: vLLM with an OpenAI-compatible endpoint on localhost:8000, tool-calling flags enabled. The 27B variant isn't yet on Ollama; the 35B A3B1 is. MLX support for Apple Silicon is 'soon.' Integration paths: Kilo CLI via OpenAI-compatible provider settings, Kilo Claw for a hosted persistent agent, or Hermes Agent configured to the local vLLM endpoint. Within Hermes, sub-agents inherit the same model config — the full multi-agent workflow on a single local model. Presented as the most capable fully-local coding-agent setup available.
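From the client side, the recommended stack boils down to an OpenAI-compatible chat request against the local vLLM endpoint. The sketch below only builds the request body (the model name and tool schema are placeholders, not AICodeKing's exact config, and no network call is made):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions route.
# The endpoint matches the localhost:8000 setup described above; the
# model name and the run_shell tool schema are illustrative placeholders.
ENDPOINT = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Qwen3.6-27B",
    "messages": [
        {"role": "user", "content": "List files changed in the last commit."}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "run_shell",
            "description": "Run a shell command in the repo",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    }],
    "tool_choice": "auto",  # let the model decide when to call the tool
}

body = json.dumps(payload)
```

Any OpenAI-style client (Kilo CLI, Hermes Agent) is ultimately POSTing a body shaped like this, which is why they can all share one local model.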
Caleb ran K2.6 vs K2.5 side-by-side on identical prompts and datasets (websites for AI model releases since 2020 and a US data center map)[25]Caleb Writes Code, Kimi K2.6 and Agent Swarm explained... K2.6 produced noticeably better UI and UX in both. On benchmarks: 58.6% on SWE-bench Pro, above competing SOTA at release. Long-horizon highlight: K2.6 spent 12 hours optimizing a Qwen 3 0.8B model on an Apple M3 Max, raising inference throughput from 15 tok/s to 193 tok/s — 20% above the LM Studio baseline ~01:01.
A single agent tasked with 1,500 rows of data-center research would take 6-8 hours. Agent Swarm spawns sub-agents in parallel waves — each owns a research domain and reports back to the orchestrator ~00:00. Caleb used it to build two datasets (1,500 rows of US data centers; 300+ AI model releases) and then fed them to Kimi CLI with detailed markdown specs to generate polished websites. The key insight he hammers: the data — not the generated website — is where the real value lies.
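The wave pattern Caleb describes can be sketched with plain thread pools; the 'sub-agents' below are stub functions, and none of this reflects Kimi's actual Agent Swarm API:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy wave-based fan-out: each sub-agent (here, a plain function) owns a
# research domain and reports back to the orchestrator between waves.

def research(domain: str) -> dict:
    # Stand-in for a sub-agent call; real agents would browse and scrape.
    return {"domain": domain, "rows": [f"{domain}-finding-{i}" for i in range(3)]}

def swarm(domains: list[str], wave_size: int = 4) -> list[dict]:
    reports = []
    for start in range(0, len(domains), wave_size):
        wave = domains[start:start + wave_size]
        with ThreadPoolExecutor(max_workers=wave_size) as pool:
            reports.extend(pool.map(research, wave))  # parallel within a wave
    return reports

reports = swarm([f"state-{i}" for i in range(10)], wave_size=4)
```

The wave boundary is the important bit: the orchestrator gets a checkpoint between waves to merge results and re-plan, rather than firing 1,500 rows of work at once.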
Caleb pushes back on 'prompt engineering is dead.' Vague instructions ('gather all data centers in the US into an Excel file') waste compute. His pattern: a 2-3 page markdown spec that covers task, sourcing parameters, required fields, and output format — with AI assistance. Same principle for coding agents building websites: multi-page markdown spec covering tech stack, page architecture, layout decisions ~03:04.
Fireship reveals OpenClaw has received 1,100+ security advisories since launch (with ~650 resolved) and demos a practical use case: a family tech-support bot on a Hostinger VPS that routes Telegram messages through OpenClaw, drafts a reply, synthesizes it in the creator's voice via 11Labs, and converts it to an OGG voice memo via ffmpeg[26]Fireship, I finally found a use case for OpenClaw….
Peter Steinberger discussed OpenClaw's security engineering at TED and AI Engineer Europe[26]Fireship, I finally found a use case for OpenClaw…. 1,100+ advisories, ~650 resolved. His heuristic for filtering AI-generated report slop: real security researchers don't apologize in their reports. Popularity was enough to trigger a nationwide Mac Mini shortage when users rushed to self-host.
The walkthrough: SSH into a Hostinger VPS, configure a Telegram bot via BotFather for an API token wired into Hostinger's one-click OpenClaw deployment, edit soul.md for personality, add tools.mmd to give OpenClaw voice-synthesis context, drop 11Labs API key and voice ID into .env, and install ffmpeg to convert 11Labs MP3 output into Telegram-accepted OGG voice memos[26]Fireship, I finally found a use case for OpenClaw…. Family message → Telegram bot → OpenClaw drafts a reply → Python script → 11Labs voice memo → back to family. 'Emotional detachment from your family at scale.'
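The ffmpeg step comes down to transcoding MP3 to OGG with the Opus codec, which is what Telegram expects for voice messages. A minimal sketch that only builds the command (file names are placeholders, not Fireship's script):

```python
def mp3_to_telegram_ogg(src: str, dst: str) -> list[str]:
    """ffmpeg argv converting an 11Labs MP3 into a Telegram voice memo.

    Telegram voice messages expect OGG audio encoded with Opus, hence
    `-c:a libopus`. File names are placeholders; run the returned list
    with subprocess.run() once ffmpeg is installed.
    """
    return ["ffmpeg", "-y", "-i", src, "-c:a", "libopus", dst]

cmd = mp3_to_telegram_ogg("reply.mp3", "reply.ogg")
```

Sending the plain MP3 as a document works too, but only the OGG/Opus upload renders as a native voice memo bubble in the chat.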
Achieving what every computer programmer has always dreamed of, emotional detachment from your family at scale.
A cluster of hot takes on what happens to software engineering as LLMs write most of the code. Empirical analysis shows Claude and GPT have distinct 'trajectory shapes' on SWE-Bench[27]Data Science Weekly, Trajectory Shapes (Data Science Weekly). Vicki Boykis argues technical excellence still matters, drawing a parallel to a Dutch Golden Age flower painter[28]Data Science Weekly, Build Yourself Flowers (via Data Science Weekly). Kleppmann predicts increased demand for formal methods[30]The Pragmatic Engineer, Martin Kleppmann: Vibe coding → more demand for formal methods. And Nate B Jones notes 4% of public GitHub commits are now authored by Claude Code[32]AI News & Strategy Daily | Nate B Jones, Dark factories vs everyone else: the real AI divide.
Srihari Sriraman analyzed 730 SWE-Bench Pro task trajectories each for Sonnet 4.5 and GPT-5, normalizing every agent step onto a 0-100% time scale[27]Data Science Weekly, Trajectory Shapes (Data Science Weekly). Claude begins editing at ~35% of the trajectory and finishes implementation by ~62%, then spends disproportionate time on post-edit verification and cleanup. GPT-5 reads heavily upfront, doesn't start editing until ~50% through, and barely verifies afterward — a 'read first, one-shot edit' pattern. The behavioral tendencies persist across model generations (Opus 4.6, GPT-5.4). Practical steering: prompt Claude for explicit upfront analysis to delay early editing; prompt GPT to do more post-edit verification.
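The normalization idea is simple to sketch: map each step index onto a 0-100% time axis and ask where the first edit lands. The toy trajectories below are invented to mirror the two reported shapes, not Sriraman's data:

```python
# Map agent steps onto a 0-100% axis and locate the first edit.
# Trajectories are invented for illustration.

def first_edit_pct(steps: list[str]) -> float:
    """Position of the first 'edit' step as a percentage of the run."""
    idx = steps.index("edit")
    return 100 * idx / (len(steps) - 1)

# Edits early, then verifies heavily (Claude-like shape).
claude_like = ["read", "read", "edit", "edit", "verify", "verify", "verify"]
# Reads heavily, edits late, barely verifies (GPT-like shape).
gpt_like = ["read", "read", "edit", "edit", "verify"]
```

Normalizing to a percentage is what makes 730 trajectories of wildly different lengths comparable on one axis.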
Boykis's Applied ML keynote frames the question: in an era of LLM-generated code, is it still worth doing ML well? Her answer runs through Rachel Ruysch — a Dutch Golden Age painter whose work sold for 3× Rembrandt's because of decades of deliberate craft[28]Data Science Weekly, Build Yourself Flowers (via Data Science Weekly). Three reasons it still matters: a 2012 NASA report found that shifting from in-house expertise to contractor oversight eliminated the independent analysis that prevented catastrophic errors (an exact analogy for wholesale LLM delegation); software is a craft that takes years to develop; mastery is intrinsically human. She closes with Rijksearch, her own semantic-search project over the Rijksmuseum collection, deliberately built with traditional ML.
Manual code review becomes the bottleneck as AI-generated code proliferates — impossible to capture productivity benefits if every line still requires inspection[30]The Pragmatic Engineer, Martin Kleppmann: Vibe coding → more demand for formal methods. The solution is formal verification: mathematically rigorous proofs that code is correct, critical in security contexts. Kleppmann expects LLMs to lower the formal-methods barrier, creating a potential flywheel.
The productivity paradox: developers generate code faster than ever, then face a crushing backlog of code they haven't vetted — the analogy to performance-enhancing drugs captures both the speed boost and the crash[31]Real Python, Can You Actually Review All Your AI Code?. Grounding heuristic (via Simon Willison): your job is to deliver code you have proven to work, regardless of how it was generated. AI-generated bugs may differ qualitatively from human bugs, making review harder.
Arjay McCandless notes AI engineer and forward-deployed software engineer roles are exploding, but both share a common core: stitching together legacy systems, building APIs, integrating with customer infrastructure — full-stack work with one added requirement: understanding broader codebase context and communicating across it[29]Arjay McCandless, State of SWE 2026.
Nate B Jones's hot take: Cursor's Cowork shipped in 10 days with four engineers directing machines; 4% of all public GitHub commits are directly authored by Claude Code, with Anthropic targeting >20% by end of 2026; Claude Code alone hit a $1B annualized run rate six months after launch[32]AI News & Strategy Daily | Nate B Jones, Dark factories vs everyone else: the real AI divide. 'AI is now fast enough and capable enough that it is actively accelerating its own development' — better tools → better models → better tools.
Claude starts editing early and figures it out in the loop. GPT reads first, then goes for the one-shot.
In an era where we're having LLMs generate a lot of code, when the most important thing is for us to ship quickly, is it still worth doing machine learning well?
Your job is to deliver code you have proven to work. It's almost irrelevant how it got generated, but if you're failing to deliver something that's working, then you're failing at your job. — Simon Willison, via Real Python
The feedback loop on AI has closed.
The apparent contradiction between strong PM job postings and widespread PM layoffs is explained by a fundamental redefinition: ~half of PMs grew into 'information movers' (coordinate, synthesize, relay) — now obsolete — while 'builders' (founder mentality, desire to ship) are in high demand across PM, engineering, design, and marketing[33]Lenny's Podcast, The PM job market has split.
Lenny's guest notes the information-mover role is becoming a dinosaur[33]Lenny's Podcast, The PM job market has split. The builder archetype — PM or otherwise — is what's being hired. “Builders wanted” is framed as the defining hiring theme for the next several years.
The information mover is essentially going to become a dinosaur.
Builders wanted is going to be the big tagline for the next couple of years.
A critical zero-day in Adobe Acrobat Reader's JavaScript engine has been actively exploited since at least September. The exploit fingerprints the victim via a privileged util.readFileIntoStream read of ntdll.dll, exfiltrates via Adobe's unexpected RSS-feed outbound network, and conditionally delivers a final payload only to unwatched high-value targets. Adobe confirmed the vulnerability on April 12[34]Low Level, They're hacking PDFs now.
A malicious PDF (discovered by EXPMON's Haifei Li) contains obfuscated JavaScript in a base64-encoded button field[34]Low Level, They're hacking PDFs now.
Decoded, the script collects PDF viewer language/version, platform, number of active documents, and the exact Windows version by reading a known hex string from ntdll.dll via the privileged util.readFileIntoStream. It sends this to an attacker server via Adobe's RSS feed functionality — which unexpectedly allows outbound network requests from within the sandboxed JavaScript. The server responds with an encrypted JS blob; decrypted and executed in Reader, it forms a loader that conditionally delivers the final exploit only to unwatched high-value targets.
Monitor for requests containing the fingerprinting parameters (language, platform, active docs, viewer version) as behavioral indicators[34]Low Level, They're hacking PDFs now. The exact RCE details remain undisclosed; the attack surface is the JavaScript parser — same class as browser bugs. The researcher was able to reproduce loader behavior using a mock server.
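A crude version of that behavioral check, assuming the fingerprint fields travel as query parameters (the parameter names below are guesses for illustration, not observed indicators; match against your own proxy logs' actual fields):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical behavioral-indicator check for outbound requests carrying
# the fingerprinting fields described above. Parameter names are guesses.
FINGERPRINT_KEYS = {"language", "platform", "active_docs", "viewer_version"}

def looks_like_fingerprint(url: str, min_hits: int = 3) -> bool:
    """Flag a URL whose query string carries several fingerprint fields."""
    params = parse_qs(urlparse(url).query)
    return len(FINGERPRINT_KEYS & params.keys()) >= min_hits

suspicious = looks_like_fingerprint(
    "http://evil.example/feed?language=en&platform=win&active_docs=2&viewer_version=25.1"
)
```

Requiring several fields together (min_hits) keeps single benign parameters like language from tripping the filter.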
The ability to run JavaScript in a PDF is what caused the bug today.
Flashback to the early 2000s, we're kind of there again.
Tesla beat Q1 revenue and EPS with $22.4B and $0.41, but investors are focused on Robotaxi (live in SF, Austin, Dallas, Houston), Optimus, and chip progress. GE Vernova's order backlog jumped 32% to $163B — turbines sold out through 2029. Google Cloud Next '26 unveiled TPU 8, an enterprise AI agent platform, and an Nvidia partnership[35]Sherwood Snacks, Tesla delivers (Q1 earnings + AI infra roundup).
Q1 revenue $22.4B (vs. $22.1B consensus), adjusted EPS $0.41 (vs. $0.35), free cash flow $1.4B against a forecast $1.5B loss[35]Sherwood Snacks, Tesla delivers (Q1 earnings + AI infra roundup). Automotive segment outperformed at $16.2B vs $14.9B; energy underperformed at $2.4B vs $3.2B. Tesla expects 'hardware-related profits to be accompanied by an acceleration of AI, software and fleet-based profits' — no timeline. Robotaxi is live in SF, Austin, and on a limited basis in Dallas/Houston, with safety monitors on most rides. Optimus and proprietary AI chips are the next milestones investors are watching.
Q1 order backlog surged 32% to $163.28B[35]Sherwood Snacks, Tesla delivers (Q1 earnings + AI infra roundup). RBC's Chris Dendrinos projects GE Vernova will be fully sold out of heavy-duty turbine production capacity through 2029 and possibly into 2030 by year end. Vertiv also beat but elected to stop reporting quarterly backlog, shifting to annual disclosure — markets sold the stock off. “GEV's got more demand than they have slots. They can be choosy as to who they want to work with.”
Google shares jumped after the event[35]Sherwood Snacks, Tesla delivers (Q1 earnings + AI infra roundup). Announcements: next-gen TPU 8 custom AI accelerators, an enterprise-focused AI agent platform, and a new partnership with Nvidia — continued collaboration on accelerated computing even as Google ships custom silicon.
At the IAPP Global Summit, Google's Kent Walker argues today's AI is 300× more efficient than two years ago, enabling truly personalized agentic assistants — Search's AI Mode already pulls Gmail and Photos for context. Traditional consent notices aren't enough; he calls for 'privacy by innovation' built on global PET standards and regulatory incentives[36]Google, Evolving expectations of what's possible.
Walker says today's models can act independently, take actions across environments, and self-correct[36]Google, Evolving expectations of what's possible. Google's 'Personal Intelligence' is live in AI Mode in Search in the US, pulling contextual data from Gmail and Photos. He cites Diia.AI, Ukraine's national AI assistant built with Google, which delivers government services (e.g., income certificates) directly in chat. Framed as the long-arc fulfillment of Page and Brin's Search vision: responding → suggesting → actively helping.
(1) user controls with easy on/off toggles over what data agents access; (2) hard guardrails for sensitive topics (Gemini avoids proactive assumptions about health data); (3) training agents only on data needed for service quality[36]Google, Evolving expectations of what's possible. On policy: Walker calls for global PET benchmarks and regulatory incentives so businesses view PET investment as worthwhile. He frames privacy as a product-quality dimension and urges AI labs to compete on demonstrable privacy techniques.
We now have AI models that are 300 times more efficient than the state of the art from just two years ago. Not 300 percent — 300 times.
This isn't just privacy by default, or even privacy by design — it's privacy by innovation.
Episode 32 covers 35 trending GitHub projects. Standouts: CrabTrap (Brex's LLM-as-judge HTTP proxy that gates agent API calls), Cavemem (persistent agent memory with 75% compression, SQLite + FTS5 + vector search), Agent Simulator (browser iOS simulator that maps UI elements to React Native source lines), and PPT-Design-Prompt (strips web-only code from brand files so AI-generated slides stop hallucinating HTML)[38]Github Awesome, GitHub Trending Today #32: PPT-Design-Prompt, agent-simulator, cavemem, CrabTrap.
OpenAI's Privacy Filter (1.5B-param local PII scrubber, 128k context), OpenGame (agentic web game creation via GameCoder27B), Open Code Design (local-first BYOK React prototype generator), PyComputerUse (macOS wrapper for Anthropic's computer use API), Cube Sandbox (Rust VMM, <60ms cold start, <5MB RAM), Mercury (24/7 permission-hardened automation agent), Token Dashboard (local Claude Code JSONL cost analytics), LeanKG (Rust/tree-sitter codebase knowledge graph via MCP), CC Status Line, Active Spot (Dynamic Island clone for Hyperland), Pith (context compression lifecycle hooks), 10x (plain-English iOS app generator with live SwiftUI preview), Auto Memory (zero-dependency persistent memory CLI)[38]Github Awesome, GitHub Trending Today #32: PPT-Design-Prompt, agent-simulator, cavemem, CrabTrap.
A grab bag of smaller items from the day: Ada Palmer on how royal wedding gossip saved the printing press (and why AI may need its own frivolous use cases)[39]Dwarkesh Patel, How Royal Wedding Gossip Saved the Printing Press - Ada Palmer; Maggie Appleton on learning in public as a credibility signal[41]Simon Willison's Weblog, A quote from Maggie Appleton; Google Labs' Flow Sessions with third-cohort creative lessons[37]Google Labs, 3 creative tips from our Flow Sessions artists; Justin Jaffray reframing columnar storage as normalization[40]Data Science Weekly, Columnar Storage is Normalization (via DSW); and Real Python's quick tutorial on storing OpenAI API keys securely[42]Real Python, Leverage OpenAI's API in Your Python Projects: Storing Your API Key & Setting Up.
Dwarkesh Short with historian Ada Palmer: early printers kept their presses financially afloat by running two parallel operations — one press slowly producing a high-value book over months, and another cranking out cheap pamphlets like royal-wedding fashion dispatches[39]Dwarkesh Patel, How Royal Wedding Gossip Saved the Printing Press - Ada Palmer. The gossip subsidized the serious work. A medieval handwritten book cost as much as a house; a printed fashion report could be sold within days and purchased repeatedly. The AI parallel: transformative capabilities may depend on frivolous use cases generating the revenue and adoption that sustain deeper, slower-burning applications.
Via Simon Willison's quote-post[41]Simon Willison's Weblog, A quote from Maggie Appleton: “If you ever needed another reason to learn in public by digital gardening or podcasting or streaming or whathaveyou, add on that people will assume you're more competent than you are. This will get you invites to very cool exclusive events filled with high-achieving, interesting people, even though you have no right to be there. A+ side benefit.”
Third cohort of Google Labs's six-week Flow Sessions program produced three takeaways[37]Google Labs, 3 creative tips from our Flow Sessions artists: embrace surprise; get personal; and apply AI video beyond traditional filmmaking. Julie Wieland used Flow as 'an endless playground' with AI Studio for a stop-motion aesthetic. Calvin Herbst trained a style transfer on archival 16mm footage to build a visual elegy for his late dog. Fashion designer Charline Prat exaggerated textile textures beyond what sewing can physically achieve. Stephane Benini used Veo's visual drift as a nostalgia device.
Justin Jaffray reframes column-oriented storage as an extreme form of relational normalization: splitting a wide table into per-attribute tables keyed by row position[40]Data Science Weekly, Columnar Storage is Normalization (via DSW). Under this view, reconstructing a row from columnar storage isn't like performing a join — it is a join. The frame unifies query operations (projection, join) with data format choices under the same relational model. Best used as a mental model for why certain operations are expensive, not as a day-to-day engineering prescription.
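Jaffray's frame fits in a few lines: split a wide table into per-attribute maps keyed by row position, then notice that fetching a row back is a join on that key. The data below is invented:

```python
# A wide table as a list of row dicts.
rows = [
    {"name": "ada", "lang": "python", "stars": 5},
    {"name": "grace", "lang": "cobol", "stars": 4},
]

# "Normalize" into one per-attribute table, each keyed by row position.
columns = {
    attr: {i: row[attr] for i, row in enumerate(rows)}
    for attr in rows[0]
}

# Reconstructing row i is literally a join of the per-attribute tables
# on the shared key i.
def fetch_row(i: int) -> dict:
    return {attr: col[i] for attr, col in columns.items()}

assert fetch_row(1) == rows[1]
```

The point of the frame is cost intuition: reading one column is a single-table scan, while materializing full rows pays the join, which is why row reconstruction dominates columnar query cost.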
Quick tutorial on the two standard patterns[42]Real Python, Leverage OpenAI's API in Your Python Projects: Storing Your API Key & Setting Up: a shell-level OPENAI_API_KEY env var (session-scoped; persist via ~/.bashrc), or a .env file with python-dotenv (remember to add .env to .gitignore immediately). Never hardcode; never commit. Boring but evergreen.
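python-dotenv is the tool the tutorial reaches for; purely as a dependency-free illustration of the same pattern, here is a minimal parser (the key value is a placeholder, not a real credential):

```python
import os
import tempfile

# Minimal stdlib stand-in for python-dotenv's load pattern, for
# illustration only. The API key below is a placeholder.

def load_env(path: str) -> dict:
    """Parse KEY=VALUE lines, skipping comments and blank lines."""
    values = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"')
    return values

# Write a sample .env so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write('# local secrets -- keep out of git\nOPENAI_API_KEY="sk-placeholder"\n')
    env_path = f.name

env = load_env(env_path)
os.environ.setdefault("OPENAI_API_KEY", env["OPENAI_API_KEY"])
os.unlink(env_path)
```

In real projects, prefer python-dotenv itself; the value of the sketch is seeing that the whole pattern is just "parse a gitignored file, populate os.environ, read the key from the environment."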
In 2 days they've printed a fashion report on what everyone was wearing at the royal wedding which they can sell right away. — Ada Palmer