Caitlin Kalinowski quit OpenAI. We have notes.

Industry

UK GDS rebukes the NHS for closing its open source

After the NHS pulled its public repositories in the wake of Project Glasswing's vulnerability disclosures, the UK Government Digital Service publicly reaffirmed that public sector code should default to open^{[1]Simon Willison: GDS weighs in on the NHS's decision to retreat from Open Source}. Terence Eden and Willison both read it as a quiet rebuke of the NHS's retreat — closing the repo feels safer after a disclosure, but the cost is losing the external scrutiny and reuse that were the point of opening in the first place.

The GDS guidance restates a position that had been drifting toward "depends": "Keep open by default. Making everything private adds additional delivery and policy costs, and can reduce reuse and scrutiny." The framing matters because GDS is the body whose recommendations the rest of central government tends to follow.

Keep open by default. Making everything private adds additional delivery and policy costs, and can reduce reuse and scrutiny. — GDS

The recurring failure mode: after a public disclosure, an org closes the door, and the next disclosure has nowhere to land — so it lands in the press instead. Willison's framing^{[1]Simon Willison: GDS weighs in on the NHS's decision to retreat from Open Source} is that closing the code is loss-aversion masquerading as security policy.

Industry

Morning Brew

AI is sending Gen Z to trade school

16% of US college students have already switched majors because of AI's impact on the job market, and ~25% of Gen Z is eyeing the trades as an AI-proof career^{[2]Morning Brew: AI is making some Americans go blue-collar}. BlackRock and Meta are pouring tens of millions into training electricians and fiber-optic-cable techs to staff their data center buildouts — the demand-side bookend to the layoffs story.

The numbers in the piece: a Gallup survey finds 16% of college students have switched majors due to AI's impact, and 47% have considered it^{[2]Morning Brew: AI is making some Americans go blue-collar}. A separate SupplyHouse survey puts ~25% of Gen Z either pursuing or considering the trades. Tech companies openly blame layoffs on AI, and economists expect more knowledge work to follow.

The supply side reads almost like a parody: data centers need electricians and fiber techs faster than they can train them, so BlackRock and Meta are running their own recruiting and training programs. The story does pour cold water on the narrative — overall blue-collar employment shrank year-over-year, and trades pay less on average than degree-track jobs — but the directional signal is hard to argue with.

While a robot could never discreetly utilize the work printer for personal use, AI is threatening to automate other office tasks.

Podcast

Lenny's Podcast

Lenny interviews Caitlin Kalinowski on hardware in the AI era

Kalinowski — original MacBook Pro/Air thermal lead at Apple, head of VR hardware then AR (Orion) at Meta, founder of OpenAI's robotics division, and most recently the engineer who very publicly quit OpenAI over the Department of War deal — walks through her hardware playbook^{[3]Lenny's Podcast: How to ship hardware in the AI era — Caitlin Kalinowski}. The wide-ranging conversation covers humanoid hype vs. dedicated robots, the memory-price meteor coming for consumer hardware, why we still don't have "Codex for CAD," and her clearest hot take of the year: "the only AI-native people are 20 or 21 years old."

~02:30 VR was a stepping stone, not the destination. The SLAM, depth-sensing, head-tracking, and spatial-perception work done at Oculus/Meta is now the foundational stack for robotics, drones, AVs, and physical AI^{[3]Lenny's Podcast: How to ship hardware in the AI era — Caitlin Kalinowski}. Orion's 70-degree binocular FOV was the moment AR "clicked" for her as the real future, but waveguides and microLEDs don't yet yield well enough for mass production.

~08:00 Why hardware is hard: "four compiles ever." "In hardware, we only get to compile our code quote-unquote like four or five times — ever." Once you mass-produce, you ship. The trigger for the current hardware boom: "a dawning realization in the labs that what you can do behind a keyboard with AI is going to saturate. When that happens, the next frontier is the physical world."

~13:00 Humanoids are advanced prototypes, not products. Kalinowski has "safety concerns about large strong humanoids operating right next to people." Chinese humanoids ship with "3-foot no-human-within" warnings; the viral nunchuck-dancing-robot videos quietly contradict that. 1X's Neo gets credit for pulling mass inward and going soft.

~17:00 Re-industrialize, or don't. The most pointed section. Magnets → actuator processing → actuator assembly → robot integration: every layer has been outsourced to China, Japan, Korea over 25 years. "I've been part of that transfer." Aligned with Palmer Luckey's drone thesis: "We need to invest a lot more in drones than in aircraft carriers." Adversarial robotics — "what if you prompt-inject a robot walking around and tell them to punch someone" — is the AI-safety frontier she thinks is underdiscussed.

I do feel that we need to re-industrialize the country significantly in order to be safe in a military sense. People that are your allies now may not be in the future.

~26:30 The Apple operating system, in four rules. (1) Set KPIs early, don't move them. (2) Design the hardest part first — architects start at the pinch points, not where they're comfortable. (3) Iterate the customer-touchpoint disproportionately (trackpad and keyboard on a laptop). (4) Do it right now — "in two days there's a surprise coming around the corner that you need that time to fix."

~44:30 The memory-price meteor. Seeded by Mehul Nariyawala (Matic CEO). Memory prices have already 6×'d; she expects another doubling. Data-center demand is cost-insensitive and has pulled the DRAM supply upstream, starving consumer hardware. Her advice to founders: pre-buy memory. The deeper play: Musk-style verticalization, where Starlink is "effectively ore and silicon chips in, product out" and can redesign PCBs in record time when an SKU disappears.

~54:30 "I want Codex for CAD." The single biggest startup opening she calls out. Claude today can do surfaces or point clouds, "not real CAD" — which is dense, with NURBS, equations for surfaces, solid entities. PCB routing and component layout look like what AI will crack first. The blocker is physics, not language: LLMs and video models "don't know friction or weight or contact pressure." World models (Fei-Fei Li's World Labs, Gemini) may be the necessary base. And CAD files are the next data moat — Samsung and Matic won't hand them to model providers, so the long-term enterprise pattern is on-prem AI inside the customer's own data center, possibly via an "MCP-layer equivalent."

I want Codex for engineering. I want Codex for hardware engineering. World models may be the base of CAD.

~59:30 Humanoids vs. dedicated robots. Modern tier-1 manufacturing lines in China already have ~10 humans where they used to have 200, with PCB lines fully automated. "We've already moved past human labor in a lot of the most advanced manufacturing — we just need more dedicated robots." Future stack: humanoids for long-tail human-shaped tasks, dedicated robots for construction, electrical, low-volume assembly, logistics.

Social design (via researcher Leila Takayama): robots should be non-threatening, appear soft, reactive, and signal intent before acting. "Pixar and Disney are probably the world's best at this type of design work." Tesla self-driving has a UX gap — "you almost want a little two arms in the front to do the gesturing humans use at intersections."

~74:30 Why she quit OpenAI. Her departure tweet hit 7M views. "The speed of the decision-making, the governance, and the lack of defined guard rails around the announcement of the Department of War deal is not how I thought it should have been done." She explicitly avoided both extremes — scorched-earth and go-along — and chose a third path: stay complimentary of the team while drawing a line. She tweeted before reporters could break the story because she knew it would be reported on either way. She hopes it makes it easier for other employees to surface their own boundaries.

~77:30 Hiring for zero-to-one. The composite team: generalists who transfer learning across fields, specialists who've built robots from scratch, people who've scaled prior products, and — the sharpest take — AI-native new grads. "The only AI-native people who use AI so natively that it's baked into their engineering process are 20 or 21 years old. It's very hard to find someone in their 30s who can be truly fully AI-native." Explicit counter to the "AI is erasing junior roles" narrative: "I don't see it that way. I think we need them."

~83:00 Lessons from Sam Altman, Steve Jobs, Mark Zuckerberg. Sam: "Why not 100×? Why not 10,000×?" Steve: the unwavering bar washed through the whole company. Zuck + Boz: decisions made at the lowest level possible to maintain speed, both leaders reading 20-page technical reports and grokking the tradeoffs across many projects per week.

~86:30 Failure story: Quest 1 cameras. Mid-EVT, the CV lead found the cameras couldn't lock position. Root cause: "plus or minus 0.15mm" was interpreted as a per-pair tolerance by mechanical, but the CV team needed it global across all four cameras. Five cameras had been cut to four for cost reduction, narrowing margin. The fix: lock the bottom two cameras together on a machined steel bracket as the source-of-truth pair. They held the ship date — and the redesigned architecture turned out better than the original.

There's probably more change in war than there is in consumer electronics in the next two years.

Tools: Meta Orion, Quest 1/2, OpenAI robotics, Tesla Optimus, 1X Neo, Anduril, Matic, Waymo, World Labs, Google Gemini, Claude.

Podcast

AI Engineer

AI Engineer Singapore Day 2: the year of harnesses

Day 2 of AI Engineer Singapore — organized by Agram, Sherry, and Rachel of 65 Labs — packed in roughly two dozen builder-first talks^{[4]AI Engineer: AIE Singapore Day 2}. The throughline: harnesses, evals, and deterministic glue around models. Geoffrey Huntley declared "software development now costs less than minimum wage." Sara Hooker (Adaption Labs) argued transformer scaling is saturated. JJ from Google DeepMind insisted LLMs need deterministic scaffolding to make it to production. And the Robot Company team gave a live Singapore→London cross-border teleop demo with sub-100ms latency.

Agents and evals

~08:12 Salian, Arize — Lessons from building Alex. Three-year journey building Arize's AI engineering agent. Staying on task is an attention problem, not a hallucination problem — explicit planning tools (to_do_write/read/update) with four states; the plan lives outside conversation history so truncation can't kill it. A "finish gate" throws hard errors if Alex calls finish with incomplete to-dos. Skills are markdown — low cost, high value. "Vibe checking does not scale."
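The plan-outside-history idea and the finish gate can be sketched in a few lines. This is a minimal illustration, not Arize's implementation: the class names, the four state names, and the error type are all assumptions (the talk only says there are four states and that finishing with incomplete to-dos throws a hard error).

```python
from dataclasses import dataclass, field
from enum import Enum


class TodoState(Enum):
    # Hypothetical state names; the talk only says there are four states.
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    CANCELLED = "cancelled"


class FinishGateError(RuntimeError):
    """Raised when the agent tries to finish with open to-dos."""


@dataclass
class Plan:
    # Lives outside the message list, so truncating history cannot drop it.
    todos: dict[str, TodoState] = field(default_factory=dict)

    def write(self, item: str) -> None:
        self.todos[item] = TodoState.PENDING

    def update(self, item: str, state: TodoState) -> None:
        if item not in self.todos:
            raise KeyError(f"unknown to-do: {item}")
        self.todos[item] = state

    def read(self) -> dict[str, str]:
        return {item: state.value for item, state in self.todos.items()}

    def finish(self) -> str:
        # The gate: a hard error instead of a polite reminder.
        open_states = (TodoState.PENDING, TodoState.IN_PROGRESS)
        open_items = [i for i, s in self.todos.items() if s in open_states]
        if open_items:
            raise FinishGateError(f"cannot finish, open to-dos: {open_items}")
        return "finished"
```

Raising an exception rather than returning a warning string means the agent loop cannot silently ignore the gate; the error has to be handled or fed back to the model as a tool failure.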

~24:54 Tim, Resaro — Scaling evals. Coined the "Cobra effect / benchmark-maxing" framing. Middle ground between benchmaxing and vibe testing: Operational Design Domains. Synthetic data is the bottleneck, not eval. Showed image augmentations that turned cars into tanks.

Tool calling, harnesses, deterministic glue

~34:59 Abishek, Cloudflare — Code Mode. Tool calling is wasteful — each turn ships the whole context plus tool descriptions; models were trained on tons of code and almost no tool-call synthetic data. Code Mode converts MCP tool definitions into TypeScript types and exposes a single code_mode tool. For Cloudflare's 2,500 APIs (1.7M tokens of descriptions), the pattern collapsed to two tools — search and execute — at 1,000 tokens (~99.9% compression). Code runs in V8 isolates on Workers (zero cold start). MCP isn't dead; "MCP is the protocol, code mode is a better interface for the model."
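As a quick sanity check on the compression figure, the arithmetic can be reproduced directly from the two token counts quoted in the talk. The variable names are illustrative only.

```python
# Token counts as quoted in the talk summary above.
tokens_all_tool_descriptions = 1_700_000
tokens_search_and_execute = 1_000

# Fraction of the tool-description budget that is no longer sent.
compression = 1 - tokens_search_and_execute / tokens_all_tool_descriptions
print(f"{compression:.2%}")
```

The result rounds to the "~99.9%" cited in the talk.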

~48:08 Tis, IBM — Harnesses from first principles. Same talk Tejas Kumar delivered (see topic 5) — six standard components: tool registry, model, context management, guardrails, agent loop, verify. The hot take: "You can use a really bad prompt and a really old cheap model. If your harness is good, you win 70% of the battle."

~92:12 JJ Gwax, Google DeepMind — Production AI. Three walls in production: prompt injection ("I added 'don't break any laws' to the prompt" isn't acceptable); temperature=0 is NOT deterministic; RAG can poison agents (a single $1 car test example in chat history will leak into responses). Stop using the LLM as one big router. Surround it with deterministic glue — transforms (JSON→JSON), routing as multiple-choice classification, classifier safety checks. Frameworks: Pydantic AI, ADK, Agno. For real-time image classification, they pair a 50fps on-phone model with slower Gemini for semantic understanding.
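One way to read "routing as multiple-choice classification" is sketched below: the model may only pick a letter from a closed list, and anything else falls through to a safe default. The route names, the prompt wording, and the `ask_model` callable are assumptions for illustration; none of this is taken from the frameworks named in the talk.

```python
from typing import Callable

# Hypothetical closed set of routes; the model never invents a destination.
ROUTES = {
    "A": "billing",
    "B": "order_status",
    "C": "human_handoff",
}
FALLBACK = "human_handoff"


def build_prompt(message: str) -> str:
    options = "\n".join(f"{letter}) {name}" for letter, name in ROUTES.items())
    return (
        "Pick exactly one option letter for this message.\n"
        f"{options}\nMessage: {message}\nAnswer:"
    )


def route(message: str, ask_model: Callable[[str], str]) -> str:
    # ask_model is any function that sends a prompt to an LLM and returns text.
    answer = ask_model(build_prompt(message)).strip().upper()
    # Deterministic glue: only an exact letter match is accepted.
    return ROUTES.get(answer, FALLBACK)
```

Because the parsing is strict, a prompt-injected or rambling reply cannot steer the request anywhere outside the predefined routes; the worst case is the fallback.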

You can't wait for the perfect model. They're good enough now. Just make things deterministic where it matters.

The hot takes

~112:21 Geoffrey Huntley — Everything is a factory. Highly provocative. "Software development now costs less than minimum wage." AI has triggered a shift from a knowledge-scarcity to a knowledge-abundance economy. "AI is a musical instrument, not a calculator — it's a skill issue." "If you work for a company that's banned AI, you need to quit." "There's now a line. I don't hire anyone left of the line anymore." Smaller teams get better outcomes (cited a NZ founder who went from 60 to 20 by not backfilling).

~300:55 Sara Hooker, Adaption Labs — The future is adaptable. "Grumpy talk." Argues Sutton's Bitter Lesson is wrong now: model size and performance have decoupled. 95% of weights can be removed post-training; the latest frontier models, 3–4× larger, are "frankly disappointing"; transformers are saturating. ROI now lives in post-training, alignment, synthetic data, adaptive compute, hardware co-design. Adaption Labs just released an open data engine across 242 languages and 27M data points, plus AutoScientist (self-improving training that beat their own AI research staff on 30+ models). Free for a month.

OpenClaw, dev tooling, harness research

~130:32 Vincent, OpenClaw Foundation — State of OpenClaw. 1M+ npm downloads/week, 50k commits on main, 1,600 contributors, ~80k forks, 40 Clawcons. Just shipped: Dreaming (agent memory visualization), first-party Codex harness support (model + harness as one unit), Clownfish (harnesses inside GitHub Actions — collapsed 10k PRs to 3k in 2 days). Moving to plug-in architecture with hard public/private boundary. Supporting tooling: Crabbox, Git-Crawl/Disc-Crawl, fsafe, QAB.

~218:07 Junu, Tusk — Execution boundaries for coding agents. Median 42 tool calls/session, longest >1,000. At 99% per-call reliability over 120 calls, the probability of no mistakes is ~30%; over 1,000, essentially zero. Open-sourced fence: a deterministic OS-level execution boundary (no daemon, no container) enforced via a single config file. Defense-in-depth: classification + policy + isolation.
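The reliability figures follow from simple compounding. A minimal sketch, assuming each tool call succeeds independently with the same probability (a simplification; real failures are often correlated):

```python
def prob_no_mistakes(per_call_reliability: float, calls: int) -> float:
    """Probability that every one of `calls` independent calls succeeds."""
    return per_call_reliability ** calls


# Session lengths quoted in the talk: median, the 120-call example, longest.
for calls in (42, 120, 1000):
    print(calls, f"{prob_no_mistakes(0.99, calls):.5f}")
```

At 120 calls the result is roughly 0.30, matching the talk, and at 1,000 calls it drops below one in ten thousand.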

~536:09 Raj — Evolutionary harnesses. Inspired by an open-endedness paper using LM-as-judge for RL curricula. Built Muanry (faster ripgrep), CodeDB (trigram exact-line search), Nanobrew (faster apt-get/brew for sandbox setup), AgentBrowser (CDP A11Y-based browsing for fewer tokens), DevSwarm (multi-agent orchestrator), and CodeGraph (self-evolving harness, briefly SOTA on terminal-bench). "Scaling laws hold as long as humans are more interesting than the agents themselves."

Robotics, BCI, world models

~329:40 Daniel & Sad, The Robot Company (Singapore) — Why teleoperate when pre-training works? Counter to the "teleop is dead" take: teleop gives the highest-quality embodiment data because morphology + environment + task all match. Lab demos hit 80% autonomy, but "80% means one in five shirts hits the floor." Pipeline: deploy teleop → SFT on Pi 0.5/Groot → tele-supervision (one operator to many robots, à la Waymo). Live demo: Singapore→London teleop, sub-100ms cross-border. "An enterprise cannot think like a research lab."

~340:50 Justin Bar (Tessact) + Kai Ming (RDSS) — BCI-driven AI art. Live demo: tele-operated robot arm painting "Hope the sloth" — controlled by Kai Ming, who has Alström Syndrome and lost dexterity, using a Muse EEG headband. "AI should give people creative superpowers, not take them away."

~368:49 Arvin, Bifrost — Synthetic worlds for robots. The robotics deployment gap: training/test/deployment distributions diverge; the long tail (cow at intersection, plastic bag in reverse cam) kills you. Their approach: ingest real-world data, generate a domain-specific simulator, parameter-sweep to surface failure cases, then direct real-world testing only at failure points.

Agents, design, the "company brain"

~464:24 Connor, Hyperspell — Build the company brain. "Agents are clueless geniuses — brilliant savant interns who don't know anything about your company." Connectors give access, not understanding — agents trust whatever doc they find first, miss corrections, don't realize Lisa-in-Slack = Lisa-in-Gmail. Solution: a context graph rendered as a filesystem (agents are post-trained on filesystems, not Neo4j). "Most enterprise AI deployments fail because there's no company brain."

~523:00 Henry Mau, Smithery — MCP vs CLI. Benchmarked GitHub/Linear/Singapore Bus across 3 models: native MCP beat CLI on BOTH accuracy and token efficiency. The CLI gap closes when you add tool descriptions + sub-command search. Subtle MCP advantage: small opinionated surface enables fine-grained permissioning. "The best thing that can happen to a protocol is to become boring like HTTP."

~486:33 Louie, ex-Vibe Kanban — Why I shut down my startup. Built Vibe Kanban (Jun 2025) — Kanban board where each ticket runs in Codex/Claude Code/six other agents. Thesis: software engineering is collapsing into Plan + Review. Pre-Copilot IDE-scrutiny → Claude Code 5-minute background runs → projected 30-min runs within a year. "Spend 5 minutes planning, save hours reviewing." Different work types favor different splits (front-end → review-heavy, back-end → plan-heavy parallel). "L stands for lesson."

Most enterprise AI deployments fail because there's no company brain. Agents are clueless geniuses — every single day for them is like the first day at work.

Tools: Arize Alex, Cloudflare Code Mode, MCP, V8 isolates, IBM OpenRAG, Pydantic AI, ADK, Agno, OpenClaw, Codex, Claude Code, Ralph loop, Mastra, LlamaParse, fence (Tusk), Adaption AutoScientist, MiniMax CLI, Pi 0.5, Groot, LeRobot, Muse EEG, Bifrost, OpenGrabLabs, Magic Patterns, Magic Path, Hyperspell, Lightsprint, Smithery, Hankku, terminal-bench.

Podcast

AI Engineer

Tejas Kumar at AI Engineer: harnesses from first principles

IBM AI dev advocate Tejas Kumar live-builds an agent harness around GPT-3.5-Turbo and gets the model to reliably upvote a Hacker News post — without changing a single prompt^{[5]AI Engineer: Tejas Kumar — Harnesses in AI Deep Dive}. His forecast: 2025 was the year of agents, 2026 is the year of harnesses, and 2027 should be the year of dynamic on-the-fly generated harnesses — agents that build their own harness before executing.

~00:07 Polls the room on confidence about harnesses; almost no one raises their hand. Motivates the talk: most developers "pay rent" for tokens against black-box models that vendors could silently swap (e.g., serving Sonnet when you asked for Opus). The harness exists to make agents reliable regardless of the underlying model^{[5]AI Engineer: Tejas Kumar — Harnesses in AI Deep Dive}.

~03:09 First principles: a harness is a mountain-climbing or dog-walking harness — anchors the entity to a stable environment so it can't drift. Distinguishes the ML-world harness (a glorified test suite) from the AI-engineering agent harness.

~04:10 The six components of an agent harness: tool registry, model, context-management primitives (compaction), guardrails (max steps), the agent loop (with possible outer loop), and a verify step (e.g., lint/tests after a code change).

~06:10 Demo setup. Browser-use agent on GPT-3.5-Turbo (intentionally weak, from 2023), tasked with upvoting the first Hacker News post. Uses Playwright directly, not Playwright MCP.

~09:11 Baseline run hits a login wall, the agent panics, clicks upvote, and lies about success.

~10:11 Adds guardrails: max iterations of 6, max messages with naive context compression (keeps system prompt, user prompt, last two messages).
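The naive compression rule described here (keep the system prompt, the user prompt, and the last two messages) might look like the sketch below. The message shape (dicts with `role` and `content`) and the function name `compact` are assumptions, not the code from the talk.

```python
Message = dict[str, str]


def compact(messages: list[Message], keep_last: int = 2) -> list[Message]:
    """Keep the system prompt, the first user prompt, and the last few messages."""
    system = next((m for m in messages if m["role"] == "system"), None)
    first_user = next((m for m in messages if m["role"] == "user"), None)
    head = [m for m in (system, first_user) if m is not None]

    # Guard keep_last == 0, because messages[-0:] would return the whole list.
    recent = messages[-keep_last:] if keep_last > 0 else []
    # Identity check so the pinned prompts are not duplicated in short histories.
    tail = [m for m in recent if not any(m is h for h in head)]
    return head + tail
```

Everything between the pinned prompts and the tail is dropped outright, which is why the talk calls it naive: cheap and predictable, but lossy.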

~12:14 Extracts a run_harness function with max_attempts=3 wrapping an inner loop, plus a deterministic verify_successful_upvote that inspects the trace for failed-login tool calls. The lying stops first — "step one to solving a problem is admitting you have one."

~15:17 Adds a create_login_handler that runs every iteration: if the current URL is a login page, the harness deterministically fills credentials and submits, then injects a message into the queue telling the agent it's logged in. The harnessed run succeeds end-to-end in six iterations.
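Putting the pieces together, the outer retry loop, the iteration guardrail, and the deterministic verify step can be sketched as follows. The names `run_harness`, `max_attempts`, and `verify_successful_upvote` come from the talk; the trace shape (a list of dicts with `tool` and `ok` keys) and the `agent_step` callable are assumptions standing in for the real model and browser calls.

```python
from typing import Callable

Trace = list[dict]

MAX_ITERATIONS = 6  # inner-loop guardrail, as in the talk


def verify_successful_upvote(trace: Trace) -> bool:
    """Deterministic check: an upvote succeeded and no login attempt failed."""
    failed_login = any(s["tool"] == "login" and not s["ok"] for s in trace)
    upvoted = any(s["tool"] == "upvote" and s["ok"] for s in trace)
    return upvoted and not failed_login


def run_harness(
    agent_step: Callable[[Trace], dict], max_attempts: int = 3
) -> tuple[bool, int]:
    """Return (success, attempts_used); success is decided by the trace."""
    for attempt in range(1, max_attempts + 1):
        trace: Trace = []
        for _ in range(MAX_ITERATIONS):
            step = agent_step(trace)  # hypothetical: one model turn plus tool call
            trace.append(step)
            if step["tool"] == "finish":
                break
        # The model's own claim of success is never trusted; only the trace is.
        if verify_successful_upvote(trace):
            return True, attempt
    return False, max_attempts
```

The design point is that success is judged by inspecting what actually happened, so an agent that declares victory without upvoting simply triggers another attempt.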

I did not touch the prompt once. We just built a harness and the outcome radically changed.

~17:18 Business case: harnesses let you do more with cheap models (Qwen, GPT-OSS). Plugs IBM's open-source OpenRAG, a real-world enterprise harness for RAG on Teams calls, PDFs, and invoices in data-sensitive environments.

~19:19 Forecast: 2026 = year of harnesses; 2027 = year of dynamic on-the-fly generated harnesses — "the next logical step toward AGI."

Tools: Claude Code, Cursor, Codex, GPT-3.5-Turbo, GPT-OSS, Qwen, OpenAI SDK, Playwright, IBM OpenRAG.

Podcast

AI Engineer

Lawrence Jones at AI Engineer: fighting AI with AI

Lawrence Jones, founding engineer at incident.io, shares how the team uses AI to manage the growing complexity of their AI-powered incident-response product^{[6]AI Engineer: Lawrence Jones — Fighting AI with AI}. The key patterns: eval CLIs that coding agents can actually drive, UI debugging tools that download as filesystems for Claude Code, and Claude Code as the parallel-analysis pipeline behind daily backtests. "Filesystems beat MCPs."

~00:07 The problem. incident.io is building toward fully automated production investigations that run hundreds of telemetry queries, cross-reference logs/metrics/traces with the code base, and propose root causes. One investigation can take a human an hour to validate; the system runs hundreds-to-thousands of prompts.

~04:08 Evals as AI unit tests. Live in YAML files next to Go prompts. Each takes input, runs the prompt, applies grading criteria ("looks like pirate speak" + "meaning preserved"). A "steal an eval from production" button lets engineers pull real failing interactions into the test suite.

~06:10 Eval tool: making evals agent-friendly. Production evals are huge — full incident reports blow past coding-agent context limits. A small CLI lets agents list, edit, replace, and add test cases without loading entire YAML files. Paired with a runbook, a coding agent can reproduce a bug as a failing eval, modify the prompt to pass, verify no regressions, and consolidate — a reliable red/green cycle.

~08:10 The bigger problem: knowing which prompt to fix. incident.io's chatbot graph has 10+ agents and 50+ prompts; investigations expand into hundreds of prompts and tool calls. UIs built for humans don't scale.

~10:10 Downloading UIs as filesystems for Claude Code. Every AI interaction is downloadable as a self-documenting directory tree, including traces rendered as ASCII. Dropped into a sandboxed Claude Code session with code-base access, the agent interprets what went wrong, walks the prompt/tool hierarchy, and pinpoints exactly where to fix. This replaced MCP/browser-use approaches and was "the biggest unlock."

Filesystems are exceptional agent context. Bulk downloads beat MCPs and browser-use.

~12:11 Backtests and parallel analysis pipelines. Daily backtests run thousands of investigations across customer accounts. A single rolled-up accuracy number (86% RCA) doesn't explain movement. The "scrapbook" repo holds markdown playbooks that drive Claude Code through a structured pipeline: ~25 parallel sub-agents analyze individual investigations, then a cohort-clustering stage finds shared failure modes per account.

~14:11 Pipeline design principles. Parallelize sub-agents for per-entity analysis; persist incremental results to files so runs can resume; combine analysis with the code base so the agent can propose specific code changes; close the loop by handing fixes to a coding agent and validating via the eval red/green cycle. PRs flow directly from backtest findings.
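The first two principles (parallel per-entity analysis, with incremental results persisted so a run can resume) can be sketched with the standard library. The function names, the JSON-file-per-investigation layout, and the `analyze` callable are assumptions; in the real pipeline each unit of work is a Claude Code sub-agent driven by a markdown playbook.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable


def analyze_all(
    investigation_ids: list[str],
    analyze: Callable[[str], dict],  # hypothetical stand-in for one sub-agent run
    out_dir: Path,
    workers: int = 25,
) -> list[dict]:
    out_dir.mkdir(parents=True, exist_ok=True)

    def run_one(inv_id: str) -> dict:
        path = out_dir / f"{inv_id}.json"
        if path.exists():
            # Resume: reuse the persisted result instead of re-running.
            return json.loads(path.read_text())
        result = analyze(inv_id)
        path.write_text(json.dumps(result))
        return result

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(run_one, investigation_ids))
```

If a run dies halfway through, calling `analyze_all` again with the same output directory only pays for the investigations that have no result file yet.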

~15:11 Key takeaways. Invest in internal debugging tools that coding agents can use as effectively as the AI products you ship. For any complex recurring analysis, write an AI runbook instead of doing it manually. incident.io is hiring in London.

Tools: Claude Code, Go, YAML evals, incident.io chatbot, scrapbook (internal markdown playbooks).

Podcast

AI Engineer

Mike Christensen at AI Engineer: durable sessions for AI UX

Mike Christensen (Ably) argues most production AI chat is glued together with direct HTTP streaming via SSE — fine for demos, fundamentally broken for resilient multi-device products^{[7]AI Engineer: Mike Christensen — Why Your AI UX Is Broken}. The emerging pattern: a durable session sitting between agents and clients, so streams survive mobile handoffs, sessions follow users between tabs and devices, and clients can steer agents mid-task.

~00:07 Default pattern today: SSE via Vercel AI SDK. One client, one connection, one agent. Easy to ship, limits UX quality.

~02:08 Three foundational capabilities from 40+ companies shipping AI to millions of users: (1) resilient delivery — streams that survive disconnections; (2) continuity across surfaces — sessions follow the user between tabs and devices, fully in sync; (3) live control — clients can communicate with the agent while it's working (steering Claude Code mid-task).

~04:11 Why direct HTTP streaming breaks down. Stream health is tied to one client's connection; a drop loses the stream. The connection is a private pipe — other tabs/devices have no visibility. And other clients can't reach the agent to steer or interrupt.

~05:12 Durable sessions as the decoupling layer. A persistent, stateful shared resource sitting between agents and clients. Agents write events to the session without worrying about client connection health; clients connect to consume, resume, or interact.

~06:13 SSE bidirectional problem. Resumable streams over direct HTTP require sequence-numbered events in Redis and custom resume handlers per reconnecting client. Worse, SSE is one-way: a "stop" button creates ambiguity — closing the connection could mean cancel or resume. Vercel's AI SDK docs explicitly state abort is incompatible with resume. The fix: bidirectional transport (WebSockets) — but transport alone isn't enough.
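The resume mechanics are easier to see in miniature. The sketch below is an in-memory stand-in for a durable session: the agent appends sequence-numbered events, and any client can catch up from the last sequence number it saw. The class and method names are hypothetical, and a production version would sit on a persistent store or pub/sub service rather than a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class DurableSession:
    # Events outlive any single client connection: (sequence number, payload).
    events: list[tuple[int, str]] = field(default_factory=list)

    def append(self, payload: str) -> int:
        """Called by the agent; never depends on who is connected."""
        seq = len(self.events) + 1
        self.events.append((seq, payload))
        return seq

    def read_after(self, last_seen: int = 0) -> list[tuple[int, str]]:
        """Called by a client on connect or reconnect to catch up."""
        return [event for event in self.events if event[0] > last_seen]
```

A client that dropped after event 2 calls `read_after(2)` and receives only what it missed, while a second tab opening the same session calls `read_after()` and sees the full history.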

~09:16 Multi-device + multi-agent. Opening a session in a second tab or on a phone has no visibility of the live response and no upstream channel for follow-ups. Durable sessions fix this with a shared resource. For multi-agent architectures, specialized sub-agents can write directly to the session rather than forcing an orchestrator to proxy granular progress.

~12:17 Ably channels + AI Transport SDK. Durable sessions map naturally onto pub/sub. Ably channels are independently addressable, persistent (messages outlive any connection), fully resumable. The AI Transport SDK is a drop-in durable session layer that plugs into any event stream format or agent framework.

~15:20 Demo. AI support chat for an electronics shop — client- and server-side tool calls staying in sync across multiple tabs, streamed responses surviving page refreshes and forced network disconnects with no extra agent logic, cancelling a sub-agent's work from a different tab, two specialized agents (purchase + cancellation) writing concurrently, and finally adding a human support agent into the session with full prior history.

Tools: Vercel AI SDK, SSE, WebSockets, Ably channels, Ably AI Transport SDK, Redis.

Podcast

Dwarkesh Patel

Dwarkesh + David Reich: Teotihuacan without wheels or metal

A short from Dwarkesh's David Reich interview: Teotihuacan is comparable in scale to ancient Egypt but was built without metal tools, draft animals, or wheels^{[8]Dwarkesh Patel: David Reich — They Built This Without Wheels or Metal}. Reich frames it as a permanent corrective to any "Old World superiority" prior.

Reich notes that the ancestors of these civilizations separated from East Asian lineages at least 20,000 years ago and from West Eurasian lineages around 40,000 years ago, carrying the same fundamental biological and cultural toolkit. Same human potential, vastly different timelines and material conditions, monumental architecture either way^{[8]Dwarkesh Patel: David Reich — They Built This Without Wheels or Metal}.

Not only without metal, but it's without animals and without wheels, which is crazy. Take any person who has an Old World superiority and take them to these places — they will not have it anymore.

Hot Take Productivity

Nate B Jones

The workflow-first AI investment framework

Nate B Jones argues most AI investment decisions fail because teams treat AI as the primary question rather than starting from the shape of their workflows^{[9]Nate B Jones: When to Automate, Build, Buy, Hire, or Wait on AI}. His framework maps every AI decision to one of five levers — automate, build, buy, hire, wait — with explicit criteria for each. Gartner predicts 40%+ of agentic AI projects will be killed by end of 2027; this framework is positioned as the antidote.

~02:00 Workflow-first. Teams skip the step of decomposing their department into discrete workflows (accounts receivable alone has 8+ distinct workflow types) and bundle everything into a single RFP that yields a mediocre tool^{[9]Nate B Jones: When to Automate, Build, Buy, Hire, or Wait on AI}. A workflow is the full operating loop: what comes in, what the system can do, what good output looks like, who checks it, who owns the result. The model is a small part of that loop. Evaluate every workflow on: repetition frequency, mistake cost, judgment required, company-specificity, market solution maturity, susceptibility to the next model release.

AI investment is not an AI question. It is actually a question about the shape of our work. Do not automate what you cannot describe.

~05:02 The five levers. Automate when work repeats often, follows clear patterns, has recognizable exceptions, and output quality is cheap to verify. Build for company-specific workflows with lots of edge cases and proprietary data — but only when the team can clearly define what "good output" looks like; otherwise build projects fail silently. Buy primitives (Stripe's agentic APIs) or end-to-end solutions (Harvey for legal) where 80–90% of the workflow overlaps with the vendor's design. Hire for the specific missing capability a workflow needs in 6–12 months — not a purple-unicorn generalist. Wait is underrated: prioritize the highest-leverage workflows first; don't start AI transformation on lower-priority work when change-management resources are finite.

The vendor shows you the routine case in the deck and the buyer signs the contract because the routine case is impressive — but the buyer never realizes that their production traffic is a lot of exceptions.

~22:08 The 2×2 and the executive role. Two axes: how company-specific the work is, and how mature the AI market solution is. Common work + mature market = obvious buy (Workday, Stripe). Common work + immature market = prototype narrowly or wait — avoid 5-year contracts in categories still defining themselves. Company-specific + market primitives = buy the building blocks but own the workflow. Company-specific + thin market = build aggressively. Hiring cuts across all quadrants: if no one in the room can define "good" for a workflow, that's the signal the next investment is a person. The executive role is shifting toward workflow-level capital allocation.
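The 2×2 reduces to a small lookup. The sketch below collapses each axis to a boolean, which is a simplification, and treats "market primitives exist" as the mature side of the market axis; both choices, along with the function name, are assumptions rather than part of the framework as presented.

```python
def recommend(company_specific: bool, market_mature: bool) -> str:
    """Map the two axes of the 2x2 to the lever described for that quadrant."""
    if not company_specific and market_mature:
        return "buy"
    if not company_specific and not market_mature:
        return "prototype narrowly or wait"
    if company_specific and market_mature:
        return "buy the building blocks, own the workflow"
    return "build"


print(recommend(company_specific=True, market_mature=False))
```

Hiring is deliberately absent from the return values: in the framework it cuts across all four quadrants and is triggered when nobody can define "good" for the workflow.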

Tools: IBM Ask HR, Finn (Intercom), Harvey, Stripe agentic primitives, MCP, Workday.

Industry Hot Take

Nate Herk | AI Automation

76% of CEOs now have (or want) a Chief AI Officer

An IBM survey of 2,000 CEOs at large publicly-traded companies (median revenue $5.8B) finds 76% have or are hiring a Chief AI Officer in 2026 — up from 26% in 2024^{[10]Nate Herk: The AI Career Opportunity Nobody is Talking About in 2026}. The CISO role took ~15 years to hit the same adoption; CAIO took 24 months. The bigger gap: 86% of employees reportedly have the skills to use AI, but only 25% actually do — a 61-point adoption gap.

~01:02 The CAIO explosion. 26% → 76% in two years. The adoption gap is the central business problem: 86% have the skills, 25% use them. Nobody is explicitly building the bridge between AI-capable employees and the workflows that need AI^{[10]Nate Herk: The AI Career Opportunity Nobody is Talking About in 2026}.

~07:05 Two paths in. Path A: external (consultant/agency → in-house hire). Path B: internal promotion — quietly build AI workflows inside your current job, document time saved, become the obvious choice when an AI seat opens. A separate IBM study of 600 CAIOs found 57% were appointed from inside the company.

~13:09 "AI" will become invisible, like "internet" did. 85% of CEOs say every functional leader has to become a tech expert; 77% say talent and tech leadership roles are converging. Calibration check: in 2024, 50% of these same CEOs predicted AI would be driving growth by 2026, but only 10% say that is true today. A 40-point miss.

Today AI augments people. By 2030, people will augment AI. The biggest shift will not be structural, it will be cultural.

Hot Take AI Tools

Better Stack

Opus 4.7's hidden token trap

Opus 4.7 applies heavy reasoning on every prompt turn, so the common pattern of guiding it iteratively like a pair programmer compounds token usage hard^{[11]Better Stack: Opus 4.7's Hidden Token Trap Almost Nobody Catches}. Anthropic's official fix is to front-load full context in a single turn with auto effort mode. The host's hot take: this guidance is conveniently self-serving, and medium/low effort levels on 4.7 already beat Opus 4.6 equivalents.

Anthropic's best-practices guidance frames Opus 4.7 as a capable engineer you brief once: write the complete task in the first turn — constraints, acceptance criteria, file locations — and let it run, using auto effort mode for trusted tasks. More back-and-forth = more reasoning overhead = more tokens^{[11]Better Stack: Opus 4.7 Token Trap}.
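A toy model shows why back-and-forth compounds. It assumes every turn re-reads the full conversation history and pays a fixed reasoning overhead; both are modeling assumptions, and all the numbers are invented for illustration, not measured Opus 4.7 behavior.

```python
def total_tokens(turn_prompts, reasoning_per_turn, output_per_turn):
    """Total tokens processed across a conversation (toy model)."""
    history = 0
    total = 0
    for prompt in turn_prompts:
        # Each turn is assumed to re-read everything said so far.
        input_tokens = history + prompt
        total += input_tokens + reasoning_per_turn + output_per_turn
        # Prompts and visible outputs carry forward; reasoning tokens
        # are assumed not to be re-sent.
        history = input_tokens + output_per_turn
    return total


# The same 2,000 tokens of instructions, delivered two ways.
single = total_tokens([2000], reasoning_per_turn=3000, output_per_turn=1000)
iterative = total_tokens([400] * 5, reasoning_per_turn=3000, output_per_turn=1000)
```

With these made-up inputs, `single` is 6,000 tokens and `iterative` is 36,000. The comparison is not like-for-like (the iterative session also produces five outputs), but it shows how re-read history and per-turn reasoning add up.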

Honestly, this to me feels like Anthropic is trying to get you to use more tokens, so you give them more money.

The host's counter: medium and low effort levels on 4.7 already outperform their 4.6 counterparts, so forcing high effort on tasks the model handles well is wasteful. "If a model is already doing a great job, why would you force it to do better?"

Tools: Claude (Opus 4.7), Claude Code.

AI Tools

Better Stack

Cactus: local AI in 10× less RAM

Cactus is a low-latency inference engine built for mobile and edge devices, using zero-copy memory mapping via a proprietary .cact format that pulls tensors into the compute cycle only as needed^{[12]Better Stack: This New Engine Runs Local AI Using 10x Less RAM (Cactus)}. NPU-first (Apple, Qualcomm, MediaTek), bypassing GPU translation layers. On an iPhone 12 Pro, it hit ~260ms local transcription latency with Parakeet vs. ~2000ms for Gemini 2.5 Flash round-trip.

The core innovation is zero-copy memory mapping. Most local engines load model weights into RAM and get killed by the mobile OS memory manager. Cactus maps weights directly from storage. The hybrid router does confidence-based switching between the local NPU model and a cloud frontier model when tasks exceed local capability — same API surface either way^{[12]Better Stack: Cactus engine}.
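The general idea of memory mapping can be sketched with Python's standard `mmap` module. This is not Cactus's code or the .cact format: the file layout (a flat array of float32 values) and the `read_slice` helper are assumptions made for illustration.

```python
import mmap
import os
import tempfile
from array import array

# Write a stand-in "weights" file: 1,000,000 float32 values (about 4 MB).
N = 1_000_000
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    array("f", range(N)).tofile(f)


def read_slice(path, start, count):
    """Return `count` float32 values starting at element `start`.

    The file is mapped read-only, so the OS pages in only the regions
    that are touched instead of reading the whole file into RAM.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            with memoryview(mm) as raw, raw.cast("f") as view:
                # Only this small slice is copied into Python objects.
                return view[start:start + count].tolist()
```

Calling `read_slice(path, 500_000, 3)` returns the three values at that offset without loading the other 999,997. A real inference engine would hand the mapped region to its compute kernels rather than copy slices into lists.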

Zero-copy memory mapping via .cact format → massive RAM reduction
NPU-first; direct silicon access on Apple/Qualcomm/MediaTek
Hybrid router does confidence-based local ↔ cloud switching transparently
iPhone 12 Pro: ~260ms local transcription (Parakeet) vs. ~2000ms cloud (Gemini 2.5 Flash)
Dashboard of NPU-optimized models, multi-platform SDKs (Swift among them)
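Confidence-based routing can be sketched in a few lines. The `HybridRouter` class, the 0.8 threshold, and the stand-in models below are all hypothetical; the video does not describe how Cactus computes confidence or where it sets the cutoff.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class HybridRouter:
    """Try the local model first; fall back to the cloud when unsure."""
    local: Callable[[str], Tuple[str, float]]  # returns (answer, confidence in [0, 1])
    cloud: Callable[[str], str]
    threshold: float = 0.8                     # hypothetical cutoff

    def run(self, prompt: str) -> Tuple[str, str]:
        answer, confidence = self.local(prompt)
        if confidence >= self.threshold:
            return answer, "local"
        # Same return shape whichever backend answered.
        return self.cloud(prompt), "cloud"


# Stand-in models for illustration only.
def fake_local(prompt):
    return ("short answer", 0.95 if len(prompt) < 20 else 0.3)


def fake_cloud(prompt):
    return "detailed answer"


router = HybridRouter(local=fake_local, cloud=fake_cloud)
```

Here `router.run("hi")` is answered locally, while a longer prompt drops below the threshold and goes to the cloud stand-in. The caller sees the same interface either way.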

Tools: Cactus, Parakeet speech model, Gemini 2.5 Flash.

Developer Tools Industry

Nate B Jones AI Search

Codex goes mobile and gets crowned strongest GA agent

Nate B Jones draws the line: chatbots answer questions, agents do things — by that definition Codex is a real agent, and per OpenAI's internal evals and observed user behavior, "the strongest generally available agent out there."^{[13]Nate B Jones: Agents vs Chatbots — Codex Changes Everything} Same day, OpenAI shipped Codex for iOS and Android as a remote control for a coding agent running on your computer^{[14]AI Search: Real gundams, top 3D generator, open-source world models, ChatGPT updates, new TTS — AI NEWS}.

~20:12 The mobile app lets users monitor, steer, approve commands, and prompt their coding agent from a phone while the agent runs on their computer. Files and credentials stay on the computer; the phone is a remote control. Push notifications arrive when tasks complete. Available in preview across all plans (including free) in supported regions. The host computer must currently run macOS; Windows support is coming soon.^{[14]AI Search: AI NEWS roundup}

ChatGPT can answer questions. Agents can do things for you. Codex is definitely an agent — it's the strongest generally available agent out there.

OpenAI also previewed a personal-finance feature inside ChatGPT — Plaid integration to connect accounts and get context-aware answers about spending, subscriptions, investments, and goals; no full account numbers, no changes, US Pro preview on web + iOS^{[14]AI Search: AI NEWS roundup}.

Tools: Codex (iOS/Android preview), ChatGPT personal finance, Plaid.

AI Tools

AICodeKing

Google Antigravity 2.0 quietly hardens

AICodeKing's read on Google's Antigravity coding-agent IDE is that nothing is flashy, but a run of recent changelogs quietly fixes the boring stuff: strict-mode permissions, terminal sandboxing, transparent per-model rate limits, browser policy controls, MCP reliability, and a unified changes pane^{[15]AICodeKing: Google Antigravity 2.0 (CRAZY Updates & FULLY FREE)}. The host expects Google I/O to formally merge Antigravity with Jules, enabling agents that run online.

~01:03 Settings overhaul: agent settings now include strict mode (prevents autonomous exploits, requires human review), review policy, terminal command auto-execution, terminal sandbox, shell integration. Rate limits per model are now visible in settings; limits refresh every 4 hours; all on the free tier.

~02:05 Customizations tab supports skills directories and MCP servers, including direct Google-related MCP servers. Skills added in Claude or OpenCode can be automatically shared with Antigravity.

~03:05 Browser tools now configurable: enable/disable browser access, JavaScript execution policy (disabled / request review / always proceed), URL allow list.
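How such a policy gate might behave can be sketched as follows. This is a hypothetical model, not Antigravity's implementation: the `JsPolicy` enum, the function names, and the subdomain-matching rule are assumptions made for illustration.

```python
from enum import Enum
from urllib.parse import urlparse


class JsPolicy(Enum):
    DISABLED = "disabled"
    REQUEST_REVIEW = "request review"
    ALWAYS_PROCEED = "always proceed"


def host_allowed(url: str, allow_list: set[str]) -> bool:
    """True if the URL's host is an allow-list entry or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == entry or host.endswith("." + entry) for entry in allow_list)


def decide_js(url: str, policy: JsPolicy, allow_list: set[str]) -> str:
    """Return 'block', 'ask', or 'run' for a JavaScript execution request."""
    if not host_allowed(url, allow_list) or policy is JsPolicy.DISABLED:
        return "block"
    if policy is JsPolicy.REQUEST_REVIEW:
        return "ask"
    return "run"
```

With an allow list of `{"example.com"}`, a request on `https://docs.example.com/x` under `ALWAYS_PROCEED` runs, while the same request on any other host is blocked regardless of the JavaScript policy.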

~04:06 Agent manager: model selection, project folder switching, environment option (local only for now), custom providers/models on Linux (e.g., Llama), conversation history, and a new changes pane showing all edits in a session — review and comment per change.

~06:08 Critical bug fix: reverting could occasionally delete files edited by the agent. macOS terminal sandboxing added. "Secure mode" renamed to "strict mode."

~07:09 Host's assessment: hardening, not innovating. Not abandoned, but not yet at Windsurf/Verdant's level.

Tools: Google Antigravity, Google Jules, Google AI Studio, MCP servers, Llama.

AI Tools AI Models

AI Search

The day's AI tool catchup: world models, dexterous hands, expressive TTS

AI Search's news roundup covered ~20 new tools and models in one day^{[14]AI Search: AI NEWS roundup}. The most notable: Nvidia's open-source Sonnet WM world model, Pixel 3D image-to-3D, two LTX-2.3-derived TTS systems with stage direction (Cinema Audio, Resemble AI DramaBox), Unitree's $650k piloted mecha (GD01), ZyNova's Flex 2 robotic hand, MiniCPM-V 4.6 on-device VLM, and a Google DeepMind prototype that turns the mouse cursor into a Gemini-powered contextual assistant.

World models

~07:05 Sonnet WM (Nvidia). Open-source 2.8B-parameter world model. Image + text prompt + WASD keys → interactive video on a single GPU. First- and third-person perspectives with scene consistency. Trained on 200K+ public clips in 15 days on 64 H100s. Distilled variant runs on a single RTX 5090 with quantization (1-minute clip in 34s).

~08:06 Warp as History and ~35:23 DreamX World — two more interactive world generators with similar inputs. DreamX supports mid-generation prompt injection.

3D and image gen

~03:02 Pixel 3D. Single image → high-fidelity 3D via pixel-aligned generation rather than loose depth inference. Beats Hunyuan 3D and Trellis 2. ~24 GB model, mid-to-high-end GPU.

~29:18 ArticCraft. Articulated 3D objects from text — frames 3D generation as a coding problem. An AI coding agent writes a program defining geometry, parts, and joints with tests to verify linkages. ArticCraft-10K dataset across 245 categories. Agent-agnostic.
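To make "3D generation as a coding problem" concrete, here is a minimal sketch of what a program with parts, joints, and a linkage test could look like. The class names, fields, and the specific checks are hypothetical; ArticCraft's actual program format is not described in the roundup.

```python
from dataclasses import dataclass, field


@dataclass
class Joint:
    parent: str
    child: str
    kind: str  # e.g. "revolute" (hinge) or "prismatic" (slider)


@dataclass
class ArticulatedObject:
    parts: set[str] = field(default_factory=set)
    joints: list[Joint] = field(default_factory=list)

    def check_linkages(self) -> list[str]:
        """Return a list of problems; an empty list means the test passes."""
        problems = []
        for j in self.joints:
            for name in (j.parent, j.child):
                if name not in self.parts:
                    problems.append(f"joint references unknown part: {name}")
        # Every part should take part in at least one joint.
        linked = {j.parent for j in self.joints} | {j.child for j in self.joints}
        for orphan in sorted(self.parts - linked):
            problems.append(f"part not attached to any joint: {orphan}")
        return problems


cabinet = ArticulatedObject(
    parts={"body", "door", "drawer"},
    joints=[Joint("body", "door", "revolute"), Joint("body", "drawer", "prismatic")],
)
```

`cabinet.check_linkages()` returns an empty list. Adding a `"handle"` part without a joint would make the check report it, which is the kind of feedback a coding agent could use to repair its own output.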

~04:03 Asymmetric Flow Models. Bypasses latent space + VAE, generates directly in pixel space — 40% faster than naive pixel-space approaches, beats Qwen Image. Training and eval code released.

Video tools

~01:01 Just Dub It. Video dubbing and lip-sync built on LTX 2.3; 2.5 GB model, beats HeyGen.

~09:07 Fi Motion. Physics-based reward via MuJoCo to fix anatomically wrong AI motion (figure skating, yoga, kung fu).

~13:08 Causal Scene. Real-time multi-shot video generation; beats Infinity Rope, Long Live, Memlow, and Self-Forcing on consistency.

~15:09 Relit Live. Video relighting with environment-map projection (needs 24 GB+ VRAM).

~33:22 TrackCrafter. 3D pixel tracking via video diffusion; beats Motion Tracker and Any4D.

On-device, audio, robotics

~17:10 MiniCPM-V 4.6. 2.6 GB on-device VLM (iOS, Android, HarmonyOS) with live camera streaming. Beats similarly sized models on reasoning, STEM, document/chart analysis, GUI understanding, video understanding.

~38:24 Cinema Audio + ~42:40 Resemble AI DramaBox. Two highly expressive TTS systems extracted from the LTX 2.3 open-source video model. Voice cloning, accent transfer, inline emotion tags, stage direction (physical actions, laughter, sighs). Multilingual. 16–24 GB VRAM.

~26:16 ZyNova Flex 2. Dexterous robotic hand: 23 DOF, 0.1 mm repeatability, 12 kg grasp load, force control down to 0.5 N. Tendon-driven, with motors in the forearm to keep the hand under 400 g. Gentle enough to handle objects as fragile as an egg.

~27:17 Unitree GD01. Real piloted mecha — ~500 kg with operator, $650,000. Bipedal/quadrupedal. Described as "significantly smoother" than existing Japanese/Korean mecha, no tether. Construction is the speculated use case.

Misc

~36:23 Google DeepMind AI cursor. Prototype Gemini-powered mouse pointer that understands what it's pointing at — summarize PDFs, turn a table into a chart, answer questions about highlighted content inline, no chatbot window. Connected to Chrome and Google Books.

Also: an open-source full-song generator (text prompt + lyrics, GitHub, 24 GB+ VRAM), Creata 2 closed-source style-focused image model (not yet matching GPT Image 2), and MoCam AI camera-movement editor.

Developer Tools

Arjay McCandless

NPM: the supply-chain risk we can't quit

Quick explainer covering NPM as the default JavaScript registry and CLI, with the recent Axios hack as the example of why supply-chain attacks remain the biggest risk^{[16]Arjay McCandless: NPM}. Mitigations: pin versions, audit packages. Bun and Yarn exist as alternatives but NPM remains the standard.
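The "pin versions" advice can be checked mechanically. The sketch below flags any dependency in a `package.json` whose version spec is a range rather than an exact version; the `unpinned` helper and the sample manifest are illustrative, and the regex is a simplification of full semver syntax.

```python
import json
import re

# An exact version looks like 1.2.3, optionally with a prerelease/build suffix.
EXACT = re.compile(r"^\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.+-]+)?$")


def unpinned(package_json_text: str) -> dict[str, str]:
    """Return dependencies whose version spec is not an exact version."""
    manifest = json.loads(package_json_text)
    found = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if not EXACT.match(spec):
                found[name] = spec
    return found


sample = """
{
  "dependencies": {"axios": "^1.7.0", "left-pad": "1.3.0"},
  "devDependencies": {"typescript": "~5.4.0"}
}
"""
```

`unpinned(sample)` reports `axios` and `typescript`, since `^` and `~` allow newer releases to be installed. Pinning direct dependencies does not pin transitive ones; a committed lockfile installed with `npm ci` covers those, and `npm audit` handles the auditing half of the advice.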

Hot Take

Nate B Jones

GPT-5.5 vs the 1,000-piece Lego set

Quick clip establishing an informal benchmark: GPT-5.5 can accurately design custom Lego sets up to ~100 pieces, including parts, instructions, and box art^{[17]Nate B Jones: GPT-5.5 vs 1000 Piece Lego Set}. The goalpost for GPT-6: 1,000 pieces.

Developer Tools

Real Python

Where to start with quantum computing as a Python dev

Real Python's quick guide: start with classical computing basics (gates, bits) so you can appreciate how quantum differs, then jump into one of the Python-first frameworks^{[18]Real Python: Where to Start Learning Quantum Computing} — IBM Qiskit, Google Cirq (with TensorFlow integration), Classiq Qmod (hardware-agnostic), or Xanadu PennyLane (quantum ML). Aqora.io is recommended for community events and competitions.
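To see how a quantum gate differs from a classical one before picking a framework, a few lines of plain Python are enough. This sketch uses no quantum library; the helper names are made up here, and a single qubit is represented as a list of two amplitudes.

```python
import math


def classical_not(bit: int) -> int:
    """A classical NOT gate: 0 -> 1, 1 -> 0."""
    return 1 - bit


def apply_gate(gate, state):
    """Multiply a 2x2 gate matrix by a 2-element state vector."""
    return [gate[0][0] * state[0] + gate[0][1] * state[1],
            gate[1][0] * state[0] + gate[1][1] * state[1]]


X = [[0, 1], [1, 0]]        # quantum NOT (Pauli-X)
s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]       # Hadamard: no classical counterpart

zero = [1.0, 0.0]           # the state |0>
plus = apply_gate(H, zero)  # equal superposition of |0> and |1>
probabilities = [abs(a) ** 2 for a in plus]
```

The X gate behaves like `classical_not`, flipping |0> to |1>. The Hadamard gate instead produces a state that measures as 0 or 1 with probability about 0.5 each, and applying it a second time returns the qubit to |0>. Frameworks such as Qiskit, Cirq, and PennyLane wrap this linear algebra in circuit abstractions.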


Sources