Karpathy goes Claude; Google I/O fires everything

Industry Hot Take

Karpathy joins Anthropic

Andrej Karpathy announced on May 19 that he's joining Anthropic — landing the same week Anthropic surpassed OpenAI in business adoption (34.4% vs 32.3%) on Ramp's AI Index^{[1]Nate Herk: What Karpathy Joining Anthropic Actually Means For Claude}. Nate Herk reads it as a philosophy fit more than a marquee hire: Karpathy's recent "LLM wiki" (April 2026) and "auto research" (March 2026) projects map directly onto where Anthropic is pointing Claude Code — context-as-product and self-running agent loops^{[1]Nate Herk on Karpathys LLM wiki and auto research}. Herk predicts three follow-ons: a Claude context/workflow app store, more specialized /goal-style autonomous loops, and an Anthropic education layer.

Why this hire, why now

Herk frames Karpathy joining Anthropic as the convergence of two aligned philosophies rather than a transfer-window splash ~00:00. Karpathy has spent the past year writing publicly about context engineering, the limits of pure scale, and the importance of agent harness design — all themes Anthropic has been pushing hard.

Anthropic's momentum line

Ramp's AI Index is the underlying signal ~02:02: Anthropic crossed OpenAI in business adoption (34.4% vs 32.3%) for the first time, driven largely by Claude Code. In the same window Anthropic announced an enterprise services JV with Blackstone, Hellman & Friedman, and Goldman Sachs, plus the KPMG alliance announced today^{[1]Nate Herk: Anthropic momentum vs. OpenAI}.

"The wrapper is the product"

Herk's central hot take ~04:02: the foundation model is increasingly not the differentiator — the wrapper (context, memory, workflows, tools) is. He credits Karpathy for coining "context engineering" over "prompt engineering," and reads Claude Code's recent harness work as Anthropic operationalizing that view.

Karpathy's LLM wiki, transplanted onto Claude

Karpathy's April 2026 "LLM wiki" project — a raw/ plus wiki/ folder where an agent synthesizes raw notes into a living markdown knowledge base — is the pattern Herk thinks Anthropic will productize as a context layer for Claude Code^{[1]Nate Herk on Karpathys LLM wiki} ~06:04.
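The raw/ plus wiki/ layout is easy to picture in code. What follows is a minimal sketch, not Karpathy's implementation: `synthesize` is a hypothetical stand-in for the agent step (a real setup would call an LLM there), and the `.txt` to `.md` file naming is an assumption made for illustration.

```python
from pathlib import Path


def synthesize(note_text: str) -> str:
    """Hypothetical stand-in for the agent step; a real setup would call an LLM."""
    lines = [line.strip() for line in note_text.splitlines() if line.strip()]
    title = lines[0] if lines else "Untitled"
    bullets = "\n".join(f"- {line}" for line in lines[1:])
    return f"# {title}\n\n{bullets}\n"


def rebuild_wiki(root: Path) -> list[Path]:
    """Turn every note in root/raw into a markdown page in root/wiki."""
    raw_dir, wiki_dir = root / "raw", root / "wiki"
    wiki_dir.mkdir(parents=True, exist_ok=True)
    pages = []
    for note in sorted(raw_dir.glob("*.txt")):
        page = wiki_dir / f"{note.stem}.md"
        page.write_text(synthesize(note.read_text(encoding="utf-8")), encoding="utf-8")
        pages.append(page)
    return pages
```

The point of the split is that raw/ stays append-only while wiki/ can be regenerated at any time, so the knowledge base is always derivable from the notes.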

Auto research and the `/goal` loop

Karpathy's March 2026 "auto research" project set up autonomous experimentation loops: propose, run, score, repeat. Herk explicitly maps this onto Claude Code's /goal feature and similar harness primitives ~09:06.
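A propose, run, score, repeat loop can be sketched generically. This is an illustrative hill-climbing toy under stated assumptions, not the auto research project itself or the /goal implementation: `auto_research`, `toy_score`, and `toy_propose` are hypothetical names, and the "experiment" is a one-dimensional function rather than a real training run.

```python
import random


def auto_research(score, propose, start, rounds=200, seed=0):
    """Keep the best candidate seen while repeatedly proposing and scoring new ones."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(rounds):
        candidate = propose(best, rng)      # propose
        candidate_score = score(candidate)  # run + score
        if candidate_score > best_score:    # keep only improvements
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy experiment: find x that maximises -(x - 3)^2, whose optimum is x = 3.
def toy_score(x):
    return -(x - 3.0) ** 2


def toy_propose(x, rng):
    return x + rng.uniform(-0.5, 0.5)


best_x, best_val = auto_research(toy_score, toy_propose, start=0.0)
```

In an agentic setting the `propose` step would be a model call and `score` an evaluation harness; the loop structure stays the same.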

Three predictions

Closing the video ~12:06: (1) Anthropic builds a context/workflow app store; (2) Claude Code ships more specialized /goal-style autonomous loop commands; (3) Anthropic launches an education layer paralleling Karpathy's prior teaching work.

Tools: Claude Code, LLM wiki, /goal, Ramp AI Index

AI Models Developer Tools

Google Artificial Analysis Simon Willison Simon Willison Google for Developers AICodeKing

Gemini 3.5 Flash takes the speed crown at I/O

Google launched Gemini 3.5 Flash at I/O 2026 — Artificial Analysis clocks it at Intelligence Index 55 with output speed above 280 tokens/sec, calling it the new leader on the intelligence-vs-speed Pareto frontier^{[2]Artificial Analysis: Gemini 3.5 Flash — leader on intelligence vs speed}. Pricing jumped sharply: $1.50/M input, $9/M output — 3× Gemini 3 Flash Preview and 6× Gemini 3.1 Flash-Lite — yet Google is rolling it across free consumer products, which Simon Willison reads as a test of price tolerance^{[3]Simon Willison: Gemini 3.5 Flash pricing}. Gemini 3.5 Pro arrives next month^{[4]Sundar Pichai I/O 2026 keynote}; Simon shipped llm-gemini 0.32 on launch day with the new model ID^{[5]Simon Willison: llm-gemini 0.32}.

The benchmark picture

Artificial Analysis puts Gemini 3.5 Flash at Intelligence Index 55, ahead of Grok 4.3 high (53) and Claude Sonnet 4.6 max (52); output speed exceeds 280 tokens/sec, ~70% faster than its predecessor^{[2]Artificial Analysis Intelligence Index}. Agentic Elo is 1656 — behind GPT-5.4 xhigh (1674) but well ahead of Gemini 3.1 Pro (1314). Google for Developers claims 3.5 Flash beats Gemini 3.1 Pro on nearly every benchmark while running 4× faster than other frontier models^{[6]Google: Gemini 3.5 Flash benchmarks}.

Price up, distribution wider

Willison's read on the pricing change^{[3]Simon Willison: Gemini 3.5 Flash pricing commentary}: $1.50 / $9 per million in/out is a meaningful step change from the prior $0.50 / $3 Flash tier, but Google is simultaneously deploying 3.5 Flash as the default in free Gemini consumer products — a Hot-Take-worthy bet that quality justifies the spend on Google's side and that competitors can't match the "intelligence per second per dollar" trade.
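To make the step change concrete, here is a small cost sketch using only the per-million prices quoted above. The workload (20k tokens in, 2k out) is a hypothetical example, `prior-flash-tier` is just a label for the earlier $0.50 / $3 tier and not a real model ID, and caching or volume discounts are ignored.

```python
# Per-million-token prices in USD, taken from the figures quoted above.
PRICES = {
    "gemini-3.5-flash": {"input": 1.50, "output": 9.00},
    "prior-flash-tier": {"input": 0.50, "output": 3.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at list price, ignoring caching and discounts."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Hypothetical workload: 20k tokens in, 2k tokens out per request.
new = request_cost("gemini-3.5-flash", 20_000, 2_000)
old = request_cost("prior-flash-tier", 20_000, 2_000)
print(f"new ${new:.4f}  old ${old:.4f}  ratio {new / old:.1f}x")
```

Because input and output prices both rose by the same factor, the ratio is 3× regardless of how the workload splits between input and output tokens.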

Sundar's keynote setup

Sundar Pichai framed the launch as Gemini becoming "agentic frontier" — 3.5 Flash today, 3.5 Pro next month^{[4]Sundar Pichai keynote: Gemini 3.5 Flash and Pro}. Google's accompanying claim: shifting 80% of workloads from competitor frontier models to 3.5 Flash could save customers $1B+/year; processing hit 3.2+ quadrillion tokens/month in May 2026, ~7× YoY.

SDK day-one

Simon Willison shipped llm-gemini 0.32 the same day, adding the gemini-3.5-flash model ID^{[5]llm-gemini 0.32}, along with llm-gemini 0.32a0 alpha with streaming for reasoning tokens^{[7]Simon Willison: llm-gemini 0.32a0 with streaming reasoning}.

An informal AICodeKing test

AICodeKing flags what he believes is a silent upgrade of Gemini Flash inside Antigravity: the model produced polished first-attempt outputs on a movie-tracker app and a Rubik's Cube simulator^{[8]AICodeKing: Gemini 3.5 Flash tests inside Anti-gravity} ~00:05. Note: no official benchmarks were shared; this is vibes-driven testing.

Tools: Gemini 3.5 Flash, Gemini 3.5 Pro, llm-gemini 0.32, Antigravity, AI Studio

Developer Tools AI Future

Google for Developers

Managed Agents in the Gemini API — Google echoes Anthropic

Google launched Managed Agents in the Gemini API — a single Interactions endpoint that spins up an isolated Linux sandbox per session, persists files and state across turns, and lets agents call built-in code execution, web browsing, and custom tools^{[9]Google: Managed Agents in the Gemini API}. Agents are defined as markdown files (AGENTS.md + SKILL.md) and registered as named managed agents — a structural mirror of Anthropic's Managed Agents launched April 8^{[9]Google Managed Agents: API surface}. Preview is live now; enterprise tier (Gemini Enterprise Agent Platform) in private preview. Early adopters: Ramp, ResembleAI, Klipy, Stitch.

The pitch

Managed Agents is Google's response to the "managed agent infrastructure" race opened by Anthropic in April. The problem statement: developers shouldn't have to wire up sandboxes, state, and tool dispatch — a single API call should do it^{[9]Google: Managed Agents problem statement}.

The six-part API surface

Interactions API — the single top-level endpoint that triggers agent execution.
Isolated Linux sandbox — ephemeral remote Linux environment per session.
Persistent file and state — files and state persist across multi-turn calls within a session.
Built-in tool invocation — native code execution, web browsing, plus custom tools.
Markdown agent definitions — AGENTS.md + SKILL.md files registered as named agents.
AI Studio Playground — no-code UI to compose and test custom agents before API deployment^{[9]Google: Managed Agents API surface}.

Competitive context: Anthropic parallel

Google's offering is a structural mirror of Anthropic's Managed Agents: hosted sandbox, multi-turn state, tool use, single API call, markdown-based agent definitions. Google's blog does not name Anthropic. Differentiation: Gemini model co-optimization, native Workspace tool access, the Antigravity desktop/CLI/SDK ecosystem^{[9]Google: Managed Agents competitive context}. From the developer highlights piece: a single Interactions call gives you persistent Linux + multi-turn state + tool use + markdown instructions^{[6]Google for Developers: Managed Agents — developer perspective}.

Availability

Preview as of May 19, 2026; accessible via Gemini API and AI Studio Playground. Enterprise tier — Gemini Enterprise Agent Platform — in private preview with no pricing disclosed yet^{[9]Google: Managed Agents availability}.

Tools: Gemini API, AI Studio Playground, AGENTS.md, SKILL.md, Antigravity

AI Tools AI Future

Google Gemini Google Google for Developers

Gemini Spark: the always-on personal AI agent

Gemini Spark is Google's pitch for a 24/7 personal AI agent: it runs on dedicated Google Cloud VMs, integrates deeply with Workspace, and handles multi-step tasks autonomously while you sleep^{[10]Google: Gemini app becomes agentic}. The same release adds Daily Brief (a morning agent that aggregates Gmail, Calendar, and follow-ups with priorities)^{[10]Google: Daily Brief agent}, a full app redesign with new motion language, a macOS app with Spark integration, and MCP connections to Canva, OpenTable, and Instacart^{[10]Google: Gemini MCP partners}. The app now claims 900M MAU across 230 countries, more than double the 400M reported at I/O 2025^{[10]Google: Gemini app 900M MAU}.

What Spark actually does

Spark enters Beta next week for AI Ultra subscribers in the U.S.^{[4]Sundar keynote: Gemini Spark} It's persistent — sessions don't end when the chat tab closes — and proactive: it surfaces things in your Workspace it thinks you need.

Daily Brief

A morning-digest agent that pulls Gmail and Calendar context, prioritizes, and suggests next steps. Effectively Inbox Zero meets executive assistant^{[10]Daily Brief details}.

MCP partners go live

The Gemini app picks up MCP support with launch partners Canva, OpenTable, and Instacart^{[10]Gemini MCP partners}. Worth noting alongside Anthropic's KPMG deal as evidence that MCP is rapidly becoming a real interop surface, not just an Anthropic-authored spec.

UI redesign + macOS

Google calls the new design system "Neural Expressive Design Language" — fluid animations, haptic feedback, refreshed type. The macOS app gains Spark integration and advanced voice-to-formatted-text^{[10]Gemini macOS Spark integration}.

Scale claim

900M MAU across 230 countries and 70+ languages — more than 2× the 400M reported at I/O 2025. The total token throughput claim of 3.2+ quadrillion/month is the load-bearing scale stat behind every other I/O announcement^{[10]Gemini app scale}.

Tools: Gemini Spark, Daily Brief, MCP, Canva, OpenTable, Instacart

AI Models

Google Google Gemini Google for Developers

Gemini Omni — any-to-any multimodality, video first

Gemini Omni is Google's new any-modality-in, any-modality-out model — text, image, and video input producing video, image, or text output, with cinematic video generation as the launch headline^{[4]Sundar keynote: Gemini Omni}. It powers new Gemini app capabilities including background swapping, custom AI avatars, and AI-generated effects^{[10]Gemini Omni in the Gemini app}, plus a faster Gemini Omni Flash variant for the Google Flow video creation suite^{[11]Google Flow: Gemini Omni Flash}.

Output modalities

Video output is the launch focus; image and speech outputs follow. Compared to GPT-5.5 and Claude Opus 4.7 (image input only), Omni adds video and speech input as well — distinguishing it on the multimodal axis even where text-only intelligence isn't class-leading^{[2]AA: Gemini 3.5 Flash multimodal comparison}.

Avatars and effects

The Gemini app uses Omni for cinematic video, background swap, and a "custom AI avatar" feature you train on yourself for social/work content^{[10]Gemini app: Omni capabilities}.

Omni Flash in Flow

The cheaper, faster Omni Flash variant ships first inside Google Flow alongside a creative AI agent and Bespoke Tools — no-code custom workflows for video creators. Section-by-section editing and full-track style transformations also land in Flow Music^{[11]Google Flow agent and Omni Flash}.

Tools: Gemini Omni, Gemini Omni Flash, Google Flow, Google Flow Music

Developer Tools

Google for Developers

Google AI Studio + Antigravity 2.0: the agent-first IDE stack

Google's developer story now centers on a tight loop: Antigravity 2.0 desktop app (multi-agent orchestration) + Antigravity CLI + Antigravity SDK + enterprise Google Cloud integration^{[6]Google: Antigravity ecosystem}. AI Studio picks up native Google Workspace API calls, one-click export to Antigravity (carrying conversation history, files, and secrets), a mobile app, native Android development with Play Store publishing, and two free Google Cloud deployments per user^{[12]Google AI Studio expansions}. AI Ultra at $100/month gives 5× higher Antigravity usage limits, with a $100 bonus credit through May 25^{[6]Google AI Ultra subscription}.

Antigravity becomes an ecosystem

Antigravity 2.0 desktop runs Gemini 3.5 at "12× the speed of other frontier models" per Google's claim and adds multi-agent orchestration^{[6]Google: Antigravity 2.0}. A terminal CLI and programmatic SDK round out the local agent dev story; Google Cloud integration covers the enterprise side.

AI Studio: mobile, Workspace, Android

Six concrete expansions^{[12]Google AI Studio details}:

Workspace integration — apps can read/write Sheets, Drive, team docs directly.
Export to Antigravity — carries conversation, files, secrets.
Custom asset generation + in-preview edits — Build agent generates images via Nano Banana; annotation tools let you draw on a running app.
Mobile app — build and preview from your phone with a gallery and live deploy sharing.
Native Android dev — prompt-built Android apps with an in-browser emulator and direct Play Store publish.
Free Cloud deploy — two app deployments to Google Cloud, no credit card.

AI Ultra positioning

$100/month with 5× Antigravity limits; the $100 promo credit through May 25 is a clear push to seed Antigravity 2.0 usage^{[6]AI Ultra details}.

Build with Gemini XPRIZE

$2M prize pool — Google's framing it as the largest hackathon prize pool announced. Finalists pitch at Moonshot Gathering, Los Angeles, September 2026^{[6]Build with Gemini XPRIZE}.

Tools: Antigravity 2.0, Antigravity CLI, Antigravity SDK, AI Studio, Nano Banana, Google Cloud

AI Future AI Tools

Google Google DeepMind Google DeepMind Google DeepMind Google DeepMind

Gemini for Science: Co-Scientist, AlphaFold, WeatherNext go live

Gemini for Science is a suite of experimental AI tools at labs.google/science covering hypothesis generation, computational discovery, and literature research^{[13]Google: Gemini for Science platform}. Launch partners include BASF, Klarna, Daiichi Sankyo, Bayer Crop Science, U.S. National Labs and 100+ academic institutions^{[13]Gemini for Science partners}. DeepMind published four launch demos: Co-Scientist running thousands of hypotheses across tens of thousands of papers, AlphaFold cutting protein structure work from years to 6 minutes for antimicrobial resistance, cancer target identification reducing 15,000 candidates to 15 in Uganda, and WeatherNext predicting Hurricane Melissa's landfall 3 days earlier than prior models.

Co-Scientist

Multi-agent system with specialized agents for literature search, hypothesis generation/evolution, ranking, and cross-field synthesis^{[14]DeepMind Co-Scientist video}. Demo: a liver fibrosis / epigenomics prompt ran for days, testing thousands of hypotheses against tens of thousands of papers — compressing months of work to 1–2 days, with at least one already-published finding ~00:00.

Antimicrobial resistance

AI cut protein structure elucidation from years to 6 minutes, with Gemini acting as an ideation partner that connects findings across prior context unprompted^{[15]DeepMind: drug-resistant bacteria} ~00:00.

Cancer target identification in Uganda

AlphaFold and "AlphaFold Gravity" reduced breast-cancer vaccine candidate sites from 15,000 to 15^{[16]DeepMind: cancer at genetic level}. Democratization angle: research previously required travel abroad due to compute costs; now viable locally with a laptop and server access ~00:00.

WeatherNext and Hurricane Melissa

Case study: Hurricane Melissa (2025) had two divergent forecast scenarios (weak storm over Haiti vs. Cat 5 Jamaica landfall). WeatherNext predicted intensification and landfall 3 days earlier than prior models. The U.S. National Hurricane Center confirmed plans to add WeatherNext to routine forecast operations^{[17]DeepMind: WeatherNext + Hurricane Melissa} ~00:00.

Scope

Gemini for Science includes Co-Scientist, AlphaEvolve, ERA, NotebookLM, and a new "Science Skills" layer that maps onto 30+ biological databases. Two papers in Nature accompany the launch^{[13]Gemini for Science scope}.

Tools: Co-Scientist, AlphaFold, AlphaEvolve, ERA, NotebookLM, WeatherNext, Science Skills

AI Tools

Google DeepMind

Project Genie + Street View = AI-generated real places

Project Genie integrates with Google Street View so users can generate playable virtual worlds grounded in real U.S. locations, with global expansion planned^{[18]Google: Project Genie + Street View}. Available to AI Ultra ($200/mo tier) subscribers; Waymo is the named partner using Genie environments for autonomous-vehicle simulation.

AI Tools

Google Labs

Stitch, Pomelli, Flow — Google Labs creative tools level up

Stitch picks up real-time canvas streaming, multi-input support, and production-ready export integrations including Netlify and Antigravity^{[19]Google: Stitch updates}. Pomelli adds a brand-identity agent that helps SMBs build a brand book and full website from a conversation or document upload^{[20]Google: Pomelli agent}. Google Flow ships a creative AI agent, Gemini Omni Flash for video, and Bespoke Tools (no-code custom workflows); Flow Music gets section-by-section editing, full track style transformation via "covers," and music-video creation powered by Gemini Omni^{[11]Google Flow + Flow Music updates}.

Stitch

Real-time canvas streaming means design iterations happen live during prompting; multi-input support takes images and notes alongside text; production-ready exports drop straight into Netlify or into AI Studio / Antigravity for further dev^{[19]Stitch details}.

Pomelli

Two-part rollout: (1) a brand-identity agent that asks SMBs questions or ingests existing docs, and (2) full brand-book + website generation from those "Business DNA" inputs^{[20]Pomelli details}.

Flow + Flow Music

Flow gains an agent, Omni Flash for video, Bespoke Tools (no-code custom workflows), and is available now globally^{[11]Flow updates}. Flow Music adds section-by-section editing, full-track style transformation ("covers"), and music-video creation. Mobile apps for both ship as well.

Tools: Stitch, Pomelli, Google Flow, Flow Music, Gemini Omni, Gemini Omni Flash

Industry

Anthropic

KPMG goes all-in on Claude: 276,000 employees across 138 countries

KPMG announced a strategic alliance with Anthropic to deploy Claude across its 276,000+ global workforce in 138 countries^{[21]Anthropic: KPMG strategic alliance}. Claude is being embedded directly into KPMG's Digital Gateway platform (tax and legal first), with Managed Agents + Claude Cowork at the core; KPMG becomes Anthropic's preferred consultant for PE portfolio deployments; and Claude Code is used inside KPMG Blaze for legacy modernization. Azure is the infrastructure layer. The deal includes an academic partnership with McCombs / UT Austin.

Scope

Three workstreams: (1) Claude embedded in Digital Gateway for tax and legal workflows, (2) KPMG Blaze using Claude Code for legacy code modernization, (3) KPMG as Anthropic's preferred consultant for private-equity portfolio Claude rollouts^{[21]KPMG alliance details}.

Stack

Managed Agents + Claude Cowork as the agent/collab primitives; Azure for compute. The McCombs / UT Austin academic partnership adds research collaboration.

Why this matters for Anthropic's enterprise narrative

Pairs with the Ramp AI Index data point (Anthropic at 34.4% business adoption vs OpenAI's 32.3%) from the Karpathy coverage. Big-4 wholesale deployments are how Anthropic locks in revenue from Q4 2026 onward.

Tools: Claude, Claude Code, Claude Cowork, Managed Agents, KPMG Digital Gateway, KPMG Blaze, Azure

AI Future Industry

Anthropic Tech Brew

Anthropic widens the AI conversation — and opens Mythos

Anthropic announced structured dialogues with scholars, clergy, philosophers, and ethicists from 15+ religious and cross-cultural traditions to inform Claude's values and moral formation^{[22]Anthropic: widening the conversation on frontier AI} — including an internal experiment where giving Claude an "ethical-reminder tool" produced markedly lower misaligned behavior on alignment evals. Separately, Tech Brew reports Anthropic loosened policies on its Glasswing/Mythos cybersecurity program — ~50 participating companies plus the Pentagon can now publicly share security findings under a standard 90-day responsible-disclosure window^{[23]Tech Brew: Anthropic loosens lid on Mythos}.

Who Anthropic is talking to

15+ traditions: religious leaders, secular ethicists, philosophers. The post is short on names but explicit about goal — shape Claude's underlying values, not just safety filters^{[22]Anthropic: 15+ traditions in conversation}.

The ethical-reminder result

Internal alignment-eval experiment: Claude with access to an explicit "ethical-reminder" tool showed lower rates of misaligned behavior. Small but pointed result — argues for tool-based moral scaffolding rather than fine-tuning alone^{[22]Anthropic ethical-reminder experiment}.

Mythos opens up

Glasswing / Mythos, Anthropic's cybersecurity research program, has historically been NDA-locked. The new policy allows ~50 corporate participants plus the Pentagon to publicly publish findings under a standard 90-day disclosure timeline^{[23]Tech Brew: Mythos disclosure policy}. Tech Brew reads this as Anthropic positioning itself as a security-research peer to other major labs.

Tools: Claude, Glasswing/Mythos, ethical-reminder tool

AI Tools AI Future

OpenAI Google

OpenAI ships content provenance — C2PA + SynthID, jointly with Google

OpenAI rolled out a three-layer content-provenance stack: C2PA conformance (metadata + cryptographic signatures), SynthID invisible watermarking (partnership with Google DeepMind, applied to ChatGPT, Codex, and API image outputs), and a new public verification tool at openai.com/verify^{[24]OpenAI: advancing content provenance}. The release timing pairs with Google's I/O announcement that SynthID has now watermarked 100B+ images and videos, with new partners including OpenAI and Eleven Labs^{[4]Sundar: SynthID partners}.

Why two layers

C2PA metadata can be stripped by screenshots and resizing; SynthID watermarks survive those transforms but degrade differently. Layering them compensates for each other's weaknesses^{[24]OpenAI: C2PA + SynthID two-layer model}.

The verification tool

Public-facing at openai.com/verify. Deliberately cautious: a missing signal is reported as inconclusive (signals can be stripped), never as "this is real."
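The layered logic described above, where either signal counts and the absence of both proves nothing, can be written as a tiny decision function. This is a hedged sketch of the reasoning only: the two boolean inputs are assumed to come from hypothetical detectors, and nothing here reflects the actual interface of OpenAI's verification tool, the C2PA tooling, or SynthID.

```python
from enum import Enum


class Verdict(Enum):
    SIGNAL_FOUND = "provenance signal found"
    INCONCLUSIVE = "no signal found; origin unknown"


def check_provenance(has_c2pa_manifest: bool, has_synthid_watermark: bool) -> Verdict:
    """Combine the two layers: either signal is enough, absence proves nothing."""
    if has_c2pa_manifest or has_synthid_watermark:
        return Verdict.SIGNAL_FOUND
    # Metadata can be stripped and watermarks can degrade, so a miss is not "real".
    return Verdict.INCONCLUSIVE
```

For example, a screenshot that lost its metadata but kept its watermark would still return `Verdict.SIGNAL_FOUND`, which is the case the second layer exists to cover.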

SynthID scale

Google says 100B+ images and videos now carry SynthID — and the spec is now expanding into Search and Chrome with new partners: OpenAI, Eleven Labs, others^{[4]SynthID scale claim}.

Tools: C2PA, SynthID, OpenAI Verify

Industry

Morning Brew Snacks

Musk loses his OpenAI case — statute of limitations

A jury unanimously dismissed Elon Musk's lawsuit against OpenAI and Sam Altman on statute-of-limitations grounds: the court ruled Musk missed the three-year window to file, since OpenAI became for-profit in 2019^{[25]Morning Brew: Musk loses OpenAI case}. The Snacks newsletter notes jurors rejected all of Musk's allegations against Altman, Greg Brockman, and OpenAI^{[26]Snacks: Musk loses OpenAI case} — closing one of the longest-running legal threads hanging over OpenAI's restructuring.

Industry Hot Take

Snacks

Nvidia: too big to excite?

Snacks notes a pattern: Nvidia has beaten expectations in its last three quarterly reports, and the stock has dropped after each release — as traders increasingly look past Nvidia for the next AI beneficiary rather than buying its earnings^{[26]Snacks: Nvidia — too big to excite}. A useful reality-check the morning after Google flexed in-house TPU 8t / 8i hardware at I/O.

Hot Take Industry

Snacks

EY retracts consulting report riddled with AI hallucinations

EY pulled a consulting report after AI-detection firm GPTZero flagged apparent hallucinations throughout it, per a Sherwood News exclusive^{[26]Snacks: EY retracts AI-hallucinated report}. A small, sharp data point in the "AI consulting" backlash arc — and a reminder that detection tooling is now a market unto itself.

Hot Take AI Models

Simon Willison

Simon Willison's 6-month LLM recap (PyCon US 2026)

Simon Willison published annotated slides from his PyCon US 2026 lightning talk surveying the biggest LLM developments from Nov 2025 to May 2026^{[27]Simon Willison: 6-month LLM recap}. Headline takes: coding agents crossed the daily-driver reliability threshold; the "best model" title changed hands five times in six months; open-weight models (GLM-5.1 754B, Qwen3.6-35B) are punching far above their weight.

Developer Tools

Simon Willison

Simon Willison ships: llm-gemini 0.32, datasette-llm 0.1a8, accountant 0.1a4

Four releases land alongside Gemini 3.5 Flash: llm-gemini 0.32 stable with the new gemini-3.5-flash model ID^{[5]llm-gemini 0.32 stable}, llm-gemini 0.32a0 alpha with reasoning-token streaming (requires LLM CLI 0.32a0+)^{[7]llm-gemini 0.32a0 alpha}, datasette-llm 0.1a8 fixing the llm_prompt_context() hook not collecting full response chains (issue #7)^{[28]datasette-llm 0.1a8}, and datasette-llm-accountant 0.1a4 with a response-chain tracking bug fix^{[29]datasette-llm-accountant 0.1a4}.

Hot Take Developer Tools

Theo - t3․gg

Theo turns a $40 Copilot plan into $40K of inference

Theo demonstrates Copilot's per-message billing exploit: on the $40/month Plus plan he can spin up 50 staggered SSH sessions running an unsolvable cryptography puzzle through agentic models that burn tokens by the millions per message — extracting ~$93,600 of theoretical inference value from a $40 plan^{[30]Theo: $40-to-$40K Copilot exploit} ~02:01. He also lays out the four ways to bill for inference (rate-limited, message-limited, spend-limited, per-token API) and the upcoming June 1 migration to token-based credits — Opus jumping from 15× to 27×, GPT-5.4 from 1× to 6×^{[30]Theo: Copilot June 1 token migration} ~33:21.

The loophole

1,500 messages on the $40 plan × ~$62/message of agentic inference = ~$93,600 theoretical value^{[30]Theo: loophole math} ~02:01. Theo runs 50 staggered sessions on his Claw Mini SSH server, each chewing on a custom cryptography puzzle (ROT47 + fake Git commit; inverted base-64 alphabet) — single-message runs as long as 16 hours 10 minutes with 111.3M input tokens.
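The loophole arithmetic is worth checking by hand. The sketch below uses only the figures quoted above; note that the rounded $62 per message gives $93,000, so the quoted ~$93,600 implies an unrounded per-message value closer to $62.40.

```python
PLAN_PRICE_USD = 40          # monthly plan price quoted above
MESSAGES_PER_MONTH = 1_500   # message allowance quoted above
VALUE_PER_MESSAGE_USD = 62   # rounded per-message inference value quoted above

theoretical_value = MESSAGES_PER_MONTH * VALUE_PER_MESSAGE_USD
leverage = theoretical_value / PLAN_PRICE_USD

# The quoted ~$93,600 total implies a slightly higher unrounded per-message figure.
implied_per_message = 93_600 / MESSAGES_PER_MONTH

print(theoretical_value, leverage, implied_per_message)
```

Either way the plan returns on the order of 2,300× its price in theoretical inference, which is the core of the exploit.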

Four billing models

Theo's taxonomy ~04:30: rate-limited (Claude Code, Codex), message-limited (Copilot, old T3 Chat), spend-limited (Cursor, Open Code Zen), and per-token API. He argues message limits are the easiest to abuse on agentic workloads.

The T3 Chat near-bankruptcy detour

~11:07 Theo recounts T3 Chat losing $2K in 5 days and ~$500K total to Repo Mix abuse — context for why Copilot's bill-by-message scheme exists at all.

Caching economics

~19:11 Cached input tokens cost 10× less than uncached. A 16-hour single message costs $163 uncached vs $62 cached — and caching is what makes long agentic loops economically viable at all.
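The two dollar figures can be tied together with a simple blended-cost model. This is a rough sketch under loud assumptions: it treats the whole bill as input tokens, applies a single 10× cached discount, and ignores output tokens and cache-write fees, none of which are confirmed by the video. The function names are hypothetical.

```python
CACHED_DISCOUNT = 0.10  # cached input billed at one tenth of the uncached rate (from the text)


def blended_cost(uncached_cost: float, cache_hit_fraction: float) -> float:
    """Cost when a fraction of input tokens is billed at the cached rate."""
    return uncached_cost * (1 - (1 - CACHED_DISCOUNT) * cache_hit_fraction)


def implied_hit_fraction(uncached_cost: float, observed_cost: float) -> float:
    """Invert blended_cost to find the cache hit fraction that explains a bill."""
    return (1 - observed_cost / uncached_cost) / (1 - CACHED_DISCOUNT)


hit = implied_hit_fraction(163.0, 62.0)
```

Under this simplified model, the $163 versus $62 gap corresponds to roughly 69% of input tokens being served from cache; a fully cached run would cost about $16.30.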

The June 1 token-credit migration

~33:21 Copilot switches from message quotas to token credits: Opus multiplier 15× → 27×, GPT-5.4 1× → 6×. Translation: the exploit window closes in under two weeks.

Azure benchmark revenge arc

~01:00 Theo frames the whole Copilot stunt as a sequel to his Azure inference benchmark: he ran $1M+ in Azure credits hourly to prove Azure's P90 was 21× slower than OpenAI's, the result went viral, and the problem got fixed.

This is the most evil thing I've done in a minute.

Tools: GitHub Copilot, Claude Code, Codex, Cursor, Open Code Zen, T3 Chat, Claw Mini, Repo Mix

Developer Tools Hot Take

AI News & Strategy Daily | Nate B Jones

Six agent protocols, three that matter: MCP / A2A / AGUI

Nate B Jones argues that six agent protocols have launched in the last year — MCP, A2A, AGUI, A2UI, AP2, X42 — but only the first three form the core agentic stack^{[31]Nate B Jones: Six Agent Protocols} ~00:00. MCP is the tool/data layer (14,000+ servers, but built for high-trust environments)^{[31]Nate B Jones: MCP layer} ~03:04, A2A handles agent-to-agent delegation across product/company boundaries^{[31]Nate B Jones: A2A layer} ~06:06, AGUI is the open candidate for human control of long-running agents^{[31]Nate B Jones: AGUI layer} ~08:09. Hot take: teams are overspecified on models and underspecified on the agent operating surface — they know which LLM to pick, but can't answer which tools the agent should see or what the approval model is^{[31]Nate B Jones: Hot Take} ~18:15.

MCP — the tool and data layer

14,000+ MCP servers live. Won adoption fast, but was designed for high-trust environments — boundary issues at scale are non-trivial^{[31]MCP: trust and security boundaries} ~03:04.

A2A — agent-to-agent coordination

Google launched A2A with 50+ partners. "Agent card" serves as the operating contract — but coordination adds latency and operational complexity^{[31]A2A: agent card contract} ~06:06.

AGUI — the human control layer

Streaming, state sharing, approvals, interruptions. Most teams ignore this layer until agents start causing real problems in production^{[31]AGUI: human control layer} ~08:09.

The three domain-specific protocols

~10:10 A2UI (structured agent-generated UI rendering), AP2 (agentic payment authorization with a cryptographic mandate), X42 (machine-to-machine HTTP-native payments). All important if you're building in those verticals; none are core today.

The real hot take

~18:15 "Most teams know which LLM they want but can't answer which tools the agent should see, what the approval model is, or how to validate multi-agent coordination. The real work lives in those questions."

Tools: MCP, A2A, AGUI, A2UI, AP2, X42

Hot Take AI Future

Better Stack

LLMorphism: are we becoming more like LLMs?

Better Stack flags a new paper coining the term "LLMorphism" — the cognitive bias of starting to believe human cognition works the way an LLM does, reversing the traditional direction of anthropomorphism^{[32]Better Stack: LLMorphism explained} ~00:00. A useful frame to file alongside Lenny's-podcast-style hardware/embodiment teasers and the broader "AI dulls thinking" discourse.

Developer Tools Hot Take

Better Stack

Mozilla ships Zero, an AI-native systems language

Better Stack covers Zero — Mozilla's new systems programming language designed so humans and AI agents can ship small native programs together, with a fully JSON-structured toolchain^{[33]Better Stack: Mozilla Zero language} ~00:00. The host's hot take: Zero's core selling point — structured JSON error output — isn't new. Rust already supports it, and LLMs debug normal text errors just fine anyway^{[33]Better Stack: Hot take on Zero} ~06:02.

What Zero is

Mozilla's positioning: small-binary systems language, JSON-shaped tool output everywhere (compiler errors, linter findings, tests), built so coding agents don't have to parse human prose to drive a compile loop^{[33]Zero design pitch} ~00:00.

The pushback

Better Stack argues: Rust already emits structured diagnostics; modern LLMs decode plain-text errors fine. So what's Zero actually solving that rustc --error-format=json doesn't?^{[33]Better Stack: Zero pushback} ~06:02
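For context on the pushback, this is roughly what consuming Rust's structured diagnostics looks like from an agent's side. The sample line is hand-written and trimmed to a few fields in the shape rustc emits with `--error-format=json` (real diagnostics carry more fields); `summarize` is a hypothetical helper, not part of any toolchain.

```python
import json

# Hand-written, trimmed sample in the shape rustc emits with --error-format=json.
SAMPLE = (
    '{"message": "mismatched types", "code": {"code": "E0308"}, "level": "error", '
    '"spans": [{"file_name": "src/main.rs", "line_start": 4, "column_start": 18, '
    '"is_primary": true}]}'
)


def summarize(diagnostic_line: str) -> str:
    """Reduce one JSON diagnostic to 'file:line:col level[code]: message'."""
    diag = json.loads(diagnostic_line)
    primary = next((s for s in diag.get("spans", []) if s.get("is_primary")), None)
    where = (
        f'{primary["file_name"]}:{primary["line_start"]}:{primary["column_start"]}'
        if primary
        else "<no span>"
    )
    code = (diag.get("code") or {}).get("code", "")
    return f'{where} {diag["level"]}[{code}]: {diag["message"]}'
```

rustc writes these objects to stderr, one per line, so a compile loop can feed each line through a function like this without parsing human-oriented prose.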

Tools: Zero, Rust, rustc

Hot Take Industry

Nate B Jones

Nate B Jones: non-coding AI is exploding; AI jobs have no fixed destination

Two shorts from Nate B Jones worth saving. (1) An AI platform tracking toward 5M users is seeing 40% week-over-week growth in non-coding general use, outpacing overall adoption — Nate's read: Codex-style coding agents are about to be eclipsed in raw user count by general-purpose use^{[34]Nate B Jones: 40% growth nobody saw coming}. (2) The "AI jobs" framing is misleading because the human judgment frontier expands as AI capability grows — unlike literacy or coding, there's no fixed destination to learn; you can only practice moving with the boundary^{[35]Nate B Jones: the biggest lie about AI jobs}.

Developer Tools

Arjay McCandless

Scaling your system from 0 to 1M users — the 7-step playbook

Arjay McCandless walks through the seven scaling stages from a $5 single-VM app to a distributed production system^{[36]Arjay: scaling 0 to 1M users}: (1) single VM with co-located DB ~00:00, (2) split DB to a separate server ~01:00, (3) observability — logging, metrics, product analytics with Amplitude/PostHog ~02:01, (4) horizontal scaling with a load balancer ~04:03, (5) CDN + multi-layer caching ~05:03, (6) async processing via SQS workers to decouple Stripe / Resend calls ~07:04, (7) Postgres read replicas for analytical workloads ~10:05.
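
Stage 6 is the step that most changes application code, so a minimal sketch may help. It uses Python's standard-library queue and a thread as a stand-in for SQS and a worker process; send_receipt_email and handle_checkout are hypothetical names, and the real third-party call (Stripe, Resend) is replaced by a stub.

```python
import queue
import threading

jobs = queue.Queue()   # stand-in for an SQS queue
sent = []              # records what the worker processed


def send_receipt_email(order_id):
    """Hypothetical slow third-party call (e.g. an email or payment API)."""
    sent.append(order_id)


def worker():
    # Pull jobs until a None sentinel arrives, like a long-running queue consumer.
    while True:
        order_id = jobs.get()
        if order_id is None:
            jobs.task_done()
            break
        try:
            send_receipt_email(order_id)
        finally:
            jobs.task_done()


def handle_checkout(order_id):
    """Request handler: enqueue the slow work and return immediately."""
    jobs.put(order_id)
    return {"status": "accepted", "order_id": order_id}


t = threading.Thread(target=worker, daemon=True)
t.start()
responses = [handle_checkout(i) for i in (101, 102, 103)]
jobs.put(None)   # tell the worker to stop
jobs.join()      # wait until every queued job has been processed
t.join()
print(responses[0], sent)
```

The request path never waits on the external API; a failure or slowdown there delays the queue, not the user's response.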

Developer Tools

The Pragmatic Engineer

How TypeScript was created — Anders Hejlsberg on the origin

A Pragmatic Engineer teaser: Anders Hejlsberg recounts that TypeScript was born from the realization that fixing JavaScript's tooling was better than compiling other languages into it — pre-TypeScript work on ScriptSharp (a C#-to-JS cross-compiler) led him to conclude that meeting developers in JS itself was the right move^{[37]Pragmatic Engineer: TypeScript origins} ~00:00.

Podcast Industry

AI Engineer

Bilge Yücel at AI Engineer: AI Under Sovereignty Constraints

Bilge Yücel, senior developer relations engineer at deepset (makers of the Haystack orchestration framework), defines sovereign AI through four pillars (data, model, infrastructure, operational) and walks through what breaks when an existing AI system is retrofitted to meet sovereignty constraints. She then shows how Haystack's typed pipelines, YAML serialization, and component swappability help mitigate the work, and closes with a sovereign agent reference architecture and a sovereignty checklist.^{[38]AI Engineer: Bilge Yücel on sovereignty constraints}

Yücel opens at [00:14] by introducing deepset as the company behind the open-source orchestration framework Haystack, with enterprise customers including Airbus, Bosch, Siemens, the European Commission, and German federal ministries, which establishes sovereignty as a first-class concern for their user base. She offers a policy definition ("the ability of an organization to design, deploy, and operate AI systems on its own terms") and then translates it for engineers as explicit control over data flow, model choice, infrastructure, observability, and operations [01:16].

The core framework is four pillars, introduced from [01:16] onward. Data sovereignty covers where data is stored and processed, plus access permissions; sending European citizen data to an embedding API hosted in Virginia already means losing control of that data, which is a GDPR exposure. Infrastructure sovereignty covers where compute happens, on a spectrum from air-gapped through private VPC and sovereign cloud to SaaS, where the US CLOUD Act becomes a risk even for EU-hosted apps [03:17]. Model sovereignty is the freedom to choose and switch models: swappability without architectural changes, and knowledge of training-data origin [04:19]. Operational sovereignty covers observability, human-in-the-loop for HR and finance use cases, and auditable versioning [05:19]. She emphasizes that sovereignty is a spectrum and not every team needs a full air gap; the goal is to know your level of vendor lock-in [06:19].

From [07:19] she walks through what concretely breaks when a working system is asked to become sovereign on Monday morning: swapping a frontier API for a self-hosted model forces prompt and evaluation rework; moving private data into the required jurisdiction creates multi-database and search-routing problems; replacing managed infra with on-prem exposes Kubernetes, CPU/GPU connectivity, and network management you previously offloaded; and bolting on observability reveals a black-box application layer with no version control [08:19].

Haystack is presented at [09:20] as a mitigation: a consistent interface for swapping cloud vs. self-hosted backends with a few lines of code, explicit typed and declared inputs/outputs in every pipeline, YAML serialization so applications are versionable in git, and a truly open codebase you can extend.

She then sketches a sovereign agent reference architecture [11:22]: an input guardrail, an agent with tools (APIs, knowledge base, other agents, MCP servers), and an output compliance guardrail. She walks through Haystack code defining a guardrail with an Nvidia chat generator and an LLM message router, connecting an MCP tool set, building a searchable tool set with BM25 to avoid filling context with tool definitions, defining an agent with a system prompt and confirmation strategies for human-in-the-loop, and wiring it all into a pipeline with a tracer for LLM observability [14:23][15:24][16:25][17:27].

She closes with a three-question sovereignty checklist at [17:27]: Can you swap models without changing application logic? Do you have reproducible run logs stored in a compliant way? Can your team respond to an incident without calling a hyperscaler vendor?
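
The searchable tool set idea can be illustrated without any framework. The sketch below is not Haystack's API; it is a plain-Python Okapi BM25 ranker over a hypothetical tool catalog (TOOLS, bm25_rank, and select_tools are all made-up names), showing how only the best-matching tool definitions would be handed to the agent instead of the full list.

```python
import math
from collections import Counter

# Hypothetical tool catalog; names and descriptions are illustrative only.
TOOLS = {
    "search_contracts": "search legal contracts and clauses in the knowledge base",
    "create_ticket": "create a support ticket in the issue tracker",
    "query_payroll": "query payroll and salary records for an employee",
    "send_email": "send an email message to a recipient",
}


def tokenize(text):
    return text.lower().split()


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return (doc_id, score) pairs with score > 0, best first (Okapi BM25)."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / n_docs
    # Document frequency: in how many docs each term appears.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, s) for name, s in ranked if s > 0]


def select_tools(query, top_k=2):
    """Only these tool definitions would be placed in the agent's context."""
    return [name for name, _ in bm25_rank(query, TOOLS)[:top_k]]


print(select_tools("look up salary records for this employee"))
```

Because BM25 is purely lexical, it runs locally with no embedding API call, which fits the data-sovereignty constraint discussed in the talk.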

Sections

~00:14 Intro: deepset, Haystack, and why sovereignty matters for EU enterprise customers

~01:16 Defining sovereign AI and the four pillars

~02:17 Data sovereignty: jurisdiction, GDPR, and access permissions

~03:17 Infrastructure sovereignty: air-gapped to SaaS and Cloud Act risk

~04:19 Model and operational sovereignty: swappability and observability

~06:19 What breaks when retrofitting an existing system for sovereignty

~09:20 How Haystack helps: consistent interface, explicit data flow, YAML versioning

~11:22 Sovereign agent reference architecture and closing checklist

Sovereign AI is the ability of an organization to design, deploy, and operate AI systems on its own terms.

If you send that data to an embedding API hosted in Virginia in the US, then you are already losing the control of your data.

Sovereignty is a spectrum. Not everyone needs to be sovereign in all of these pillars.

Tools: Haystack, deepset, MCP, Nvidia chat generator, Mistral, Google, Jina, OpenTelemetry, Kubernetes, BM25, YAML

Podcast Developer Tools

AI Engineer

Ara Khan (Cline) at AI Engineer: Don't Build Slop — 4 Levels of AI Agent Maturity

Ara Khan from Cline pushes back on the "mass psychosis" of agent-building hype and lays out four maturity levels — framework, hand-built state machine, Kanban UX, and cloud — for building agents that actually ship. He shares five rules for hand-built agents and argues that Kanban boards are the right UX form factor when you're inference-bound and running multiple agents in parallel.^{[39]AI Engineer: Ara Khan — 4 Levels of AI Agent Maturity}

Khan opens at [00:14] diagnosing what he calls "mass psychosis" among agent developers: a feeling that everyone else is breezing through with fifteen agents in parallel while you're stuck reading every line of code. He proposes slowing down and breaking agent-building into four maturity levels you can slide up and down as your needs change [01:14]. He underscores how indistinguishable frontier-lab UIs have become, showing screenshots of Factory, Codex, and Cursor that no one in the audience can tell apart, to motivate that the differentiation has to come from how you build, not which tool you copy [02:14].

Level 1 is using a framework (LangChain, LangGraph, or similar), which he recommends only for finding product-market fit when you don't care about the best model and just want something that works in half an hour. He's skeptical for production use: customizability, modularity, and the ability to stay on the frontier are hard to find inside a framework [03:14].

Level 2 is building the agent yourself, governed by five rules he walks through from [04:15]: (1) every agent is a state machine, a recursive while-loop with a few conditions and end states, and you should always have a mental model of which state you're in; (2) every addition risks making it worse: frontier models perform better with fewer instructions, citing that Codex's GPT-5 prompt is one-third the size of the GPT-5.3 prompt, and noting Cline has been rewritten from scratch at least seven times to prune accumulated junk [05:16][06:17]; (3) make the agent an easy part of a pseudo-RL pipeline by exposing it as a CLI with AGENTS.md and CI/CD so other coding agents can build and test it end-to-end [07:18]; (4) don't build slop: spend time on architecture and at least read the code even if you don't write every line, since architecture must be done thoughtfully by a human [08:19]; (5) frontier labs try to lock you down via API asymmetries: the new models (he cites Opus 4.6, Gemini 1.5.1 Pro, GPT-5.3) use reasoning traces that must be sent back in an exact format or performance degrades silently, and OpenRouter alone is not enough to insulate you [09:20][10:20].

Level 3 is the UX form factor: Kanban boards. From [11:22], he argues that when you're inference-bound with multiple agents running 8–10 minutes at a time, you want isolation of state and an engineering-manager view across all of them. He notes he made this claim on March 26th and Claude Code shipped the same form factor ten hours before the talk, validating the direction.

Level 4 is shipping agents to the cloud [13:24]: running many agents in parallel on separate machines, no local dependencies, agents that can run 15–60 minutes doing UX QA on their own (e.g., clicking through a VS Code extension end-to-end), and a shared cloud setup multiple teammates can mutate. He closes by framing his four levels as heuristics: start with the bare minimum and slide up the levels as your investment in agent infrastructure warrants. In a Q&A from [17:26] he clarifies that planning inside Kanban happens through the CLI conversation that initiates the task, with the task transitioning to a review state when the agent needs human input.
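
Rule 1 is easy to state and easy to lose track of, so here is a minimal sketch of an agent written as an explicit state machine. This is not Cline's implementation; the states, the fake_model stub standing in for an LLM call, and the max_steps guard are all assumptions made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    PLAN = auto()
    ACT = auto()
    REVIEW = auto()   # end state: waiting on a human
    DONE = auto()     # end state: task finished
    FAILED = auto()   # end state: step budget exhausted


def fake_model(task, history):
    """Hypothetical stand-in for an LLM call: returns the next action."""
    if len(history) < 2:
        return {"type": "tool", "name": "run_tests"}
    return {"type": "finish", "summary": f"completed: {task}"}


def run_agent(task, max_steps=10, model=fake_model):
    state, history, action = State.PLAN, [], None
    steps = 0
    # The whole agent is one loop; every iteration is in exactly one state.
    while state not in (State.DONE, State.FAILED, State.REVIEW):
        steps += 1
        if steps > max_steps:
            state = State.FAILED
        elif state is State.PLAN:
            action = model(task, history)
            if action["type"] == "finish":
                state = State.DONE
            elif action["type"] == "ask_human":
                state = State.REVIEW
            else:
                state = State.ACT
        elif state is State.ACT:
            history.append(f"ran {action['name']}")  # tool execution would go here
            state = State.PLAN
    return state, history


print(run_agent("fix failing test"))
```

Keeping the end states explicit (done, failed, waiting for review) is what lets a Kanban-style board show where each parallel agent currently sits.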

Sections

~00:14 Intro: agent mass psychosis and the four-level framework

~02:14 Frontier Lab UIs are indistinguishable — differentiate via how you build

~03:14 Level 1: use a framework to find PMF, but expect to outgrow it

~04:15 Level 2 rules 1-3: state machine, prune additions, expose a CLI

~08:19 Level 2 rules 4-5: don't build slop; Frontier Lab lock-in via reasoning traces

~11:22 Level 3: Kanban as the right UX form factor for parallel inference-bound agents

~13:24 Level 4: shipping to cloud for long-running parallel agents and shared setups

~17:26 Q&A: planning inside Kanban via CLI and review-state transitions

Every single thing you add to an agent risks making it worse.

Frontier models are so good at their job that the less instructions you give them, they actually perform better.

Don't be a slob. Guys, don't please for the love of God don't build slop.

Tools: Cline, LangChain, LangGraph, Factory, Codex, Cursor, Claude Code, GPT-5, GPT-5.3, Opus 4.6, Gemini 1.5.1 Pro, OpenRouter, Linear, Kanban, AGENTS.md, VS Code

Podcast AI Tools

AI Engineer

Shivam Verma at AI Engineer: Personalization in the Era of LLMs at Spotify

Shivam Verma, tech lead of Spotify's User Representations team in the AI Foundation org, walks through how Spotify is moving from siloed multi-stage recommender pipelines to a single steerable LLM-backed generative recommender. He covers three building blocks: user embeddings as foundational user modeling, semantic IDs for tokenizing catalog content into LLM-trainable tokens, and projecting user vectors as soft tokens to personalize a fine-tuned open-weight LLM.^{[40]AI Engineer: Shivam Verma — Spotify personalization}

Verma introduces himself at [00:14] as tech lead of the User Representations team in Spotify's AI Foundation org, the group that builds frontier foundational models and user and content representations, and does continued pre-training (CPT) and supervised fine-tuning (SFT) on open-weight LLMs. He frames the talk less as classical agentic context engineering and more as context engineering on the modeling side, in the service of better recommendations across Spotify's 750M+ users, 100M+ tracks, 400K+ audiobooks, millions of podcasts, and growing video catalog in 184 markets [02:16][03:17]. The three pillars are foundational user modeling, catalog understanding, and assembling both into a steerable personalized system.

He recaps the legacy "tradex" multi-stage pipeline (candidate generation followed by one or more rankers) and notes that every product (home shelf, personalized playlists, search, podcasts, ads) historically had its own siloed team and model, which Spotify is now consolidating into a single unified LLM-backed model [05:19][06:21]. The AI DJ, prompted playlists (which now support podcast episodes as of that week), and the new Taste Profile (which exposes what Spotify knows about you and lets you edit it) are the user-facing surfaces driving toward steerability [04:18][05:19].

On user modeling [06:21], the team generates user embeddings daily for over a billion users, a massive and expensive pipeline. They've moved past autoencoder-style generalized representations (referenced in a prior team paper) to a single sequential transformer model where user interactions, request context (query, surface), and item candidates are all part of the prompt. A visualization at [09:22] shows users, tracks, and podcast episodes embedded in the same hypersphere; Verma's own embedding sits near the Big Big Tech podcast, illustrating cross-content modeling across users and content in one shared space.

Catalog understanding [11:22] combines Spotify's proprietary item vectors with world knowledge from fine-tuned open-weight LLMs like Llama and Qwen. The key technique is semantic IDs, a concept introduced by a Google YouTube paper, which tokenize content vectors hierarchically into roughly four to six tokens [12:23][13:24]. Ariana Grande and Bruno Mars share the first two tokens because both are pop artists, with later tokens encoding niches. This lets the LLM autoregressively generate the next track or episode the way it would generate the next word, with training data composed of tokenized listening histories like an Italian user listening to an Italian podcast [14:25][15:25].

The final assembly [16:26] is the steerable personalized generative recommender powering the Taste Profile feature: users edit text describing themselves ("start listening to Justin Bieber more," "don't recommend this podcast") and that flows back into the model. To personalize without training on every one of 750M users, the team projects a user representation vector into the LLM's token space as a "soft token" inserted into the prompt, giving the model contextual user understanding at inference time [17:26][18:27]. Verma reports positive internal metrics and notes the next-episode podcast recommendations on Spotify are now productionized on this stack [18:27].
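
To make the shared-prefix behavior of semantic IDs concrete, here is a toy sketch using residual quantization, one common way to build hierarchical codes. The talk as summarized does not say which quantizer Spotify uses, so the method, the hand-picked 2-D codebooks, and the three example vectors are all assumptions for illustration.

```python
# Toy 2-D codebooks, one per level; values are hand-picked for illustration.
CODEBOOKS = [
    [(1.0, 0.0), (0.0, 1.0)],                # level 0: coarse direction
    [(0.2, 0.0), (0.0, 0.2), (-0.2, 0.0)],   # level 1: finer offset
    [(0.05, 0.05), (-0.05, -0.05)],          # level 2: finest offset
]


def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, entry)) for entry in codebook]
    return dists.index(min(dists))


def semantic_id(vec):
    """Quantize vec level by level; each level encodes the remaining residual."""
    residual, tokens = tuple(vec), []
    for level, codebook in enumerate(CODEBOOKS):
        idx = nearest(residual, codebook)
        tokens.append(f"<L{level}_{idx}>")
        residual = tuple(r - c for r, c in zip(residual, codebook[idx]))
    return tokens


artist_a = (1.25, 0.05)    # two items that are close in embedding space
artist_b = (1.15, -0.05)
track_c = (0.05, 0.95)     # and one that is far from both

print(semantic_id(artist_a))
print(semantic_id(artist_b))
print(semantic_id(track_c))
```

The two nearby vectors get the same first two tokens and differ only in the last, which mirrors the Ariana Grande and Bruno Mars example; the distant vector diverges at the first token.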

Sections

~00:14 Intro: Spotify AI Foundation and modeling-side context engineering

~02:16 Spotify scale and the three pillars of personalization

~04:18 AI DJ, prompted playlists, and the Taste Profile as steerable surfaces

~05:19 From siloed tradex pipelines to a unified LLM-backbone recommender

~06:21 Foundational user modeling and daily billion-user embeddings

~09:22 Cross-content embedding visualization on a shared hypersphere

~11:22 Catalog understanding with open-weight LLMs and semantic IDs

~16:26 Soft-token user projection and Taste Profile in production

We're moving away from siloed models towards this single unified model which supports an LLM backbone and allows you to steer it towards the sort of recommendations that you want.

We generate embeddings for like a billion plus users. We do that every day.

The first two tokens for both of them are shared because they're both pop artists.

Tools: Spotify, Llama, Qwen, transformers, autoencoders, semantic IDs, Spotify URI, Spotify AI DJ, Taste Profile

Podcast Developer Tools

DeepLearningAI

Emma McGrattan (Actian) at AI Dev 26: Engineering the Context Layer

Actian CTO Emma McGrattan argues that the context layer (the data substrate feeding RAG and agents) is load-bearing for enterprise AI and must be architected as a hybrid of cloud, on-premises, and edge tiers. She walks through the regulatory, latency, and data-gravity forces pushing enterprises toward hybrid topologies and previews where vector databases are headed next.^{[41]AI Dev 26: Emma McGrattan on the context layer}

McGrattan opens by framing the central enterprise AI problem: LLMs know nothing about your business by default, so the engineering challenge is building a context layer that lets AI deliver grounded, reliable answers at scale [00:07]. She identifies three pressures wrecking clean architectural diagrams: regulatory pressure (US Patriot Act concerns driving sovereign clouds in Europe), industry regulation (financial services and healthcare data that can't leave the data center, plus sub-millisecond latency needs for autonomous vehicles and fraud detection), and data gravity, citing Gartner's finding that the average enterprise has 400 distinct data sources [02:09].

She explains RAG using a concrete example: a car insurance customer asks why their premium rose 20%, and the vector database retrieves the contributing context (neighborhood theft rates, commute changes, pothole damage) so the LLM can cite documents and ground the answer in business reality [06:10]. The key insight is that where that retrieval data lives directly shapes query performance, making topology choice critical.

The core of the talk is a three-way topology tradeoff [07:11]. Cloud offers elastic scale and global reach but suffers 20–200ms latency, egress costs, and connectivity dependence. On-premises is mandatory for sovereignty and regulated industries (HIPAA, PHI, financial trade surveillance) but requires infrastructure investment and patching overhead. Edge is required when milliseconds count, when devices can lose connectivity (her Florida fast-pass thunderstorm example [09:13]), or when data legally cannot leave the device, but compute, memory, and index freshness are constrained. She notes Actian software is built into nuclear submarine warheads, where air-gapped deployment precludes license-server checks [12:15].

Her prescription is hybrid by design with intelligent query routing: classified data routes to on-prem, sub-5ms latency to edge, sub-minute freshness to cloud, and cross-domain queries fan out and merge [13:16]. Looking ahead 12–18 months, she predicts multimodal retrieval (audio, image, time series alongside text), AI-driven index management, and governance-aware retrieval that bakes access controls into the index rather than bolting them on.

She closes with takeaways: distributed AI requires distributed retrieval, silent failures are dangerous, and the context layer is load-bearing, so treat it like your business depends on it [14:17]. She announces Actian's new vector AI database for on-prem and edge, plus her newly published O'Reilly book on vector databases.
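
The routing rules she lists can be sketched as a small function. The thresholds (5 ms, 60 s) come from the summary above; the Query fields, the rule precedence (classified first), the fan-out to all three tiers, the cloud default, and the merge strategy are assumptions for illustration and not Actian's implementation.

```python
from dataclasses import dataclass

DEFAULT_TIER = "cloud"  # assumed fallback; the talk does not specify one


@dataclass
class Query:
    text: str
    classified: bool = False         # must stay inside the data center
    max_latency_ms: float = 1000.0   # latency budget for retrieval
    max_staleness_s: float = 3600.0  # how fresh the index must be
    cross_domain: bool = False       # needs data held in more than one tier


def route(q):
    """Return the tiers to query. Classified data is checked first so it never fans out."""
    if q.classified:
        return ["on_prem"]
    if q.cross_domain:
        return ["on_prem", "edge", "cloud"]
    if q.max_latency_ms < 5:
        return ["edge"]
    if q.max_staleness_s < 60:
        return ["cloud"]
    return [DEFAULT_TIER]


def merge(results_by_tier):
    """Merge per-tier (doc_id, score) hits, keeping the best score per document."""
    best = {}
    for hits in results_by_tier.values():
        for doc_id, score in hits:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


print(route(Query("trade surveillance alerts", classified=True)))
print(route(Query("is this card swipe fraud?", max_latency_ms=2)))
print(merge({"edge": [("a", 0.9), ("b", 0.4)], "cloud": [("b", 0.7), ("c", 0.1)]}))
```

Making the precedence explicit matters: a query that is both classified and cross-domain should not silently reach the cloud tier, which is the kind of silent failure the talk warns about.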

Sections

~00:07 Intro: 30 years in data, the context layer challenge

~01:08 Three pressures: regulation, latency, data gravity

~02:09 Sovereign clouds and the Patriot Act

~05:09 RAG explained via the car insurance example

~07:11 Topology tradeoffs: cloud, on-prem, edge

~10:13 Edge deployment: disconnected operation and air-gapped systems

~13:16 Hybrid design and intelligent query routing

~14:17 Future: multimodal retrieval and governance-aware indexes

The context layer is load-bearing, so treat it like your business depends upon it.

Distributed AI requires distributed retrieval.

According to Gartner, the average enterprise has 400 different sources of data.

Tools: Actian Vector AI Database, Amazon Sovereign Cloud (Europe), Salesforce, Azure, AWS, Gartner (research), O'Reilly

Podcast Developer Tools

DeepLearningAI

Marc Brooker (AWS) at AI Dev 26: It's Time to Be Right

AWS VP and Distinguished Engineer Marc Brooker argues that the opportunity size for knowledge-work agents is gated by their defect rate, not by frontier capability. He surveys AWS investments in correct-by-construction frameworks, automated reasoning, and auto-formalization (Hydro, Cedar, Strata, Lean, Kiro, AgentCore Policy) aimed at moving agents into the quadrant where defects are both infrequent and low-consequence.^{[42]AI Dev 26: Marc Brooker — It's Time to Be Right}

Brooker opens by framing this as the most exciting time in his 30-year software career (agents give him a length of lever over problems he's never had before) but says the technology isn't perfect yet [00:07]. His hypothesis: the opportunity for knowledge-work agents is limited by the defect rate, and improvements in defect rate will contribute more to the overall opportunity than pushing the capability frontier forward [01:08].

He lays out a 2x2 of defect frequency vs. defect importance, evaluating outcomes from the full agent loop (not the raw model), since feedback loops let you build great things on faulty foundations [02:08]. The bad-bad corner (high frequency, high importance) is unbuyable beyond short hype windows. High-frequency, low-importance defects are tolerable 'slop': fine for school newsletter summaries but a small opportunity. Low-frequency, high-consequence defects show up in software systems development, where agents ship code that seems correct, passes tests, then fails in production and requires expert humans to fix; useful, but a sharp tool only experts can wield [04:09]. The real opportunity is the low-frequency, low-consequence quadrant, where everyone can play [05:09]. He observes that the last 18 months drove down defect frequency more than they raised agents' ability to do complex tasks reliably, and that the right tail of bad outcomes deserves as much investment as the flashy left tail. He recounts a frontier model lying to him about a Cauchy distribution for 15 minutes as a fitting illustration [07:10].

The core technical content is AWS's symbolic and correctness-oriented investments [08:10]. Hydro is a Rust framework for writing correct distributed systems and protocols; current models struggle with concurrency and failure reasoning, so Hydro is correct-by-construction scaffolding agents can lean on. Cedar is a policy language with deep automated-reasoning roots for writing authorizers. Kiro, AWS's coding agent, leans into spec-driven development to anchor initial correctness and prevent drift. Strata is an intermediate representation that compiles languages down for reasoning across multiple automated-reasoning backends, all powered by Lean [09:10].

Auto-formalization [10:11] turns natural-language SOPs into mathematically precise specifications through a clarifying conversation with the customer in Bedrock AI Guardrails and AgentCore Policy, encoding the result in Cedar or Lean. Deterministic agent/tool policy applies these formalizations to runtime behavior: AgentCore Gateway tool-call policies, Strands steering (pre/post-conditions on tool calls), and the newly open-sourced Trusted Remote Execution that constrains agent-generated operational scripts with formal Cedar policies [11:11].

He closes with a call to higher standards [12:13]: benchmarks should measure failure severity, not just pass-at-10 density; success metrics should include operational properties (performance, cost, durability, availability); the industry needs a research program on the shape of failures; and teams must take their worst days as seriously as their best to build a culture of reliability [14:16].
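
The idea of deterministic pre- and post-conditions on tool calls can be shown in a few lines. This is a generic sketch, not the Strands or AgentCore API: guarded, PolicyViolation, and the refund tool are hypothetical names, and the 100-unit cap is an invented policy value.

```python
class PolicyViolation(Exception):
    """Raised when a tool call breaks a deterministic policy check."""


def guarded(tool, pre=None, post=None):
    """Wrap a tool so checks run before and after every call, outside the model."""
    def wrapper(**kwargs):
        if pre is not None and not pre(kwargs):
            raise PolicyViolation(f"precondition failed for {tool.__name__}: {kwargs}")
        result = tool(**kwargs)
        if post is not None and not post(kwargs, result):
            raise PolicyViolation(f"postcondition failed for {tool.__name__}")
        return result
    return wrapper


def refund(order_id, amount):
    """Hypothetical tool an agent might call."""
    return {"order_id": order_id, "refunded": amount}


safe_refund = guarded(
    refund,
    pre=lambda args: 0 < args["amount"] <= 100,  # cap set by policy, not by the model
    post=lambda args, res: res["refunded"] == args["amount"],
)

print(safe_refund(order_id="A1", amount=25))
try:
    safe_refund(order_id="A2", amount=5000)
except PolicyViolation as exc:
    print("blocked:", exc)
```

The check is ordinary code, so a high-consequence action is refused every time the rule is violated regardless of what the model generated, which is how a rare but severe defect is moved toward the low-consequence quadrant.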

Sections

~00:07 Intro: most exciting time in a 30-year career

~01:08 Hypothesis: agent opportunity is gated by defect rate

~02:08 Defect quadrants framework

~05:09 Why low-frequency, low-consequence is the real opportunity

~08:10 Correct-by-construction: Hydro and Cedar

~09:10 Kiro spec-driven development, Strata, and Lean

~10:11 Auto-formalization in Bedrock Guardrails and AgentCore Policy

~11:11 Deterministic policy: Strands steering and Trusted Remote Execution

~12:13 Call to action: severity-aware benchmarks and reliability culture

The opportunity size for knowledge work agents specifically is limited by the defect rate.

We need more benchmarks that capture failure severity rather than just failure density.

We need to take our worst days as seriously as our best ones.

Tools: Hydro (Rust framework), Cedar (policy language), Kiro (AWS coding agent), Strata (intermediate representation), Lean (proof assistant), Bedrock AI Guardrails, AgentCore Policy, AgentCore Gateway, Strands (agent framework), Strands Steering, Trusted Remote Execution

Podcast Developer Tools

DeepLearningAI

Anush Elangovan (AMD) at AI Dev 26: The K-Shaped Future of Software

AMD's Anush Elangovan frames AI as the fastest software platform shift he has lived through and describes a 'K-shaped' future in which systems thinking and problem framing gain value while rote coding skill loses it. He illustrates with four AMD agent-driven projects (Project Geek, a Rosetta-style ISA translator, Llama.cpp runtime work, and a record-fast tokenizer) and argues that 'intent velocity' is the new mode.^{[43]AI Dev 26: Anush Elangovan — K-Shaped Future}

Elangovan opens by placing AI alongside mainframes, client-server, web, mobile, and cloud, but notes this transition is happening in months and weeks instead of years or decades [00:07]. He thanks Andrew Ng for awarding his gesture-recognition startup the best AI startup award in 2014; that team now runs much of AMD's AI software organization [01:09].

His central frame is the 'K-shaped future of software engineering,' which he credits to a viral slide backed by a Harvard analysis [02:09]. The top arm of the K (systems-level thinking, judgment, taste, problem framing, harness design) accelerates dramatically; teams suddenly become 100x more productive and their wingspan moves up and down the stack. The bottom arm (language-specific syntax knowledge, formatting) falls in value because those are just intermediate languages for AI agents to consume. He stresses the metric shifts from lines of code to business outcomes [03:10].

His prescribed mode in this divergence is 'intent velocity' [04:11]: how fast you can turn an idea into production, with everything else as a means to the end. Winners operate in parallel; he runs four to six agents at night and during keynotes, crunching autonomously once given intent. The flywheel is that faster movement compounds, and he calls out the leadership responsibility to close the divergence without slowing the upper arm [06:14].

He then walks through AMD ROCm portfolio examples [06:14]. AMD has a pervasive hardware strategy (laptops to data center) and is layering AI on its open-source software stack so frontier models understand the stack end-to-end. Project Geek is an agent loop that observes customer workloads and continuously auto-optimizes software, yielding faster token serving [08:17]. The Rosetta-style project translates machine ISA from one GPU to another on the fly. Previously a 4–5 year, 200–300 person project, it was prototyped in roughly 48 hours using a few billion tokens of Claude Code / Opus 4.6 and is now shipping in production, even letting hardware coming out in 2–3 years run on current GPUs at native speed [09:18]. He uses this to make the point that 'too hard' no longer exists: you only need intent, framing, and awareness of pitfalls. A new zero-cost-overhead runtime moves tensors seamlessly between CPU, GPU, and NPU, integrating first into Llama.cpp so laptops can use all available silicon; he argues AI consumption will follow wherever compute lives [10:18]. A new tokenizer (one person, 200,000 lines of generated code) is now the world's fastest, and being open source feeds the flywheel because future model pre-training data will include it [11:19].

He closes with personal practice [12:19]: in December he thought agents were 'prompts in a cron job'; now agents continuously monitor his projects, reproduce bugs, file PRs, validate with tests, and commit on green CI without his involvement. His parting advice is the oxygen-mask analogy: put your own on first, then help the person next to you, but you do need to put it on, because the future is coming very fast [13:19].

Sections

~00:07 AI vs. prior software transitions

~01:09 Thanks to Andrew Ng and the 2014 AI startup award

~02:09 The K-shaped future of software engineering

~04:11 Intent velocity as the new mode

~05:13 Winners operate in parallel: agents running 24/7

~06:14 AMD ROCm: hardware portfolio plus open software

~08:17 Project Geek: agent-driven software auto-optimization

~09:18 Rosetta-style ISA translator built in 48 hours

~10:18 Llama.cpp runtime and fastest open-source tokenizer

~12:19 Agents in daily practice and the oxygen-mask close

Speed is the moat and your intent velocity is what you want to measure.

Now there is no such thing as too hard. You just have to have an intent saying I'm going to attempt that too hard thing.

In December I thought agents were prompts in a cron job.

Tools: AMD ROCm, AMD GPUs, Project Geek (AMD), Llama.cpp, Claude Code, Claude Opus 4.6, CPU/GPU/NPU runtime (AMD)

Podcast Industry

Y Combinator

Y Combinator Interviews Aadit Palicha (Zepto)

Y Combinator's Jared Friedman interviews Zepto co-founder Aadit Palicha about how he and Kaivalya Vohra started a 10-minute grocery delivery company at 17, pivoted from a WhatsApp delivery group to dark stores, and turned down Stanford to scale into billions of dollars in GMV. Palicha walks through Zepto's founding, customer-obsessed iteration, supply chain depth, AI deployment, and his lessons for young founders.^{[44]YC: Aadit Palicha (Zepto) interview}

Aadit Palicha opens by describing how Zepto was never intended to be a company [00:00]. He and co-founder Kaivalya Vohra had been friends since their early teens and admired Silicon Valley product builders. At 17, both got into a strong California school but COVID stranded them back in Mumbai [01:00]. Rather than do remote college, they took a gap year and started a WhatsApp group to deliver groceries to neighbors during the pandemic's first wave. That group became an app called KiranaKart, which eventually led them to Y Combinator [02:00]. On the decision to drop Stanford just 2-3 months before matriculation, Palicha rejects the heroic narrative [03:00]. He and Kaivalya took a tactical approach: they spent a year building, hit early signs of product-market fit around 8-9 months in, and only committed fully once they were doing about 10,000 orders per day and roughly 60-70 crore GMV run-rate [04:02]. Investor interest and a term sheet sealed the leap. His advice to other students considering the same path: get real proof of concept first. Grocery delivery in India was already a multi-billion-dollar space with players chasing it for 20 years [05:03], but Palicha says they never framed it as picking a differentiated model. Instead, the founders personally did deliveries in Mumbai's Andheri neighborhood and talked to customers at the door, learning that none of the existing models truly satisfied customers on speed, quality, selection, and price [06:04]. That bottom-up insight led to the pivot from KiranaKart's storefront-pickup model to mini dark stores - the very first one being Kaivalya's apartment [07:04][08:04]. The single dark-store neighborhood quickly did 3-4x the volume of the rest of the city, validating the model [09:04]. Palicha attributes their first-principles 10-minute delivery vision to a Brian Chesky-style "what's the most extreme positive customer experience if you remove the laws of physics?" framing [10:05][11:06]. 
He argues that customer delight unlocks volume, throughput, and lower costs that no spreadsheet would have forecast - "it's impossible to build a financially viable, profitable business without customer delight." Their YC-era advantage was naivety and isolation in COVID lockdown, with only 30-40 neighborhood customers to obsess over and no startup-Twitter noise [13:06][14:07].
Under the consumer app, Zepto is fundamentally a logistics, supply chain, and retail company [15:07]. The company runs dark-store design, replenishment algorithms, trucking, industrial-grade backend automation, and one of India's largest fruits-and-vegetables supply chains, sourcing millions of units per week directly from farmers [16:09][17:09][18:10]. The business now employs over 200,000 people, has millions of daily transacting customers, and has scaled to billions of dollars in topline, with an ads business at hundreds of millions in ARR where brands like Coca-Cola, Pepsi, and Nestle bid on search keywords [19:10][20:10]. The long-term vision is to build India's urban grocery infrastructure across 40-50 cities - a homegrown Amazon-like platform - and to incubate consumer brands on top of Zepto [21:10][22:11].
On AI [23:11], Palicha says Zepto has built ML-driven supply-chain forecasting that replaces days of manual work and now handles millions of units per day without humans in the loop, raising throughput and forecast accuracy. On the consumer/ads side, GenAI tools help brands generate keywords and predict ROAS, driving a big ads revenue bump [24:11]. Internally, AI has let them cut SaaS spend and managed-services costs to near zero while shipping more with the same engineering headcount (about 500 engineers plus 150 in data/product/design) [25:12][26:12].
Closing on how he upgraded himself as a founder, Palicha credits surrounding himself with a senior management team - CFO, COO, CTO, head of growth, CBO - with decades of experience who believed in the vision early, and shamelessly asking basic questions: "surround yourself with people that are smarter than you and learn from them shamelessly" [27:13][28:14].

Sections

~00:00 Origins: two 17-year-olds and a WhatsApp grocery group in Mumbai

~02:00 KiranaKart, finding YC, and the Stanford decision

~04:02 Tactical risk-taking: PMF and 10,000 orders/day before dropping out

~06:04 Why a crowded grocery market still had room - customer interviews on the doorstep

~08:04 The pivot: co-founder's apartment as the first dark store

~10:05 First-principles 10-minute delivery and customer-delight economics

~15:07 Zepto as a logistics and supply chain company, not a consumer app

~19:10 Scale, ads business, and the urban-grocery long-term vision

~23:11 AI in supply chain, ads, and internal ops - cutting SaaS spend to zero

~26:12 Hiring, upgrading yourself, and learning shamelessly from senior teammates

If you remove all constraints, and you just remove all the laws of physics, and you just think from first principles: what's the most extreme positive customer experience you can give? You start from there, and then you work backwards from "how can I make that possible?"

It's impossible to build a financially viable, profitable business without customer delight. Customer delight is where financial value starts.

The naivety of knowing how to build a company was a big advantage - we were extremely young and naive, and the advantage of being young and naive is that you don't know how difficult it actually is.

If you remove all the software and the tech and the dark stores, fundamentally we're an atta-dal-tarkari company.

Tools: WhatsApp, Machine learning forecasting models (internal), Generative AI ad-tools for keyword generation and ROAS prediction (internal), Industrial-grade warehouse automation and robotics (internal)

Podcast Industry

Sequoia Capital

Sequoia Interviews Jake Stauch (Serval) — rebuilding IT for the AI age

Jake Stauch, founder/CEO of Serval, lays out his thesis for an AI-native enterprise service management platform that replaces ServiceNow-style manual workflow building with natural-language codegen. He argues the moat in the AI era is customer insight plus the boundaries (permissions, approvals, controls) around agents — not the models themselves — and that AI-native organizations must reinvent themselves every few months.^{[45]Sequoia: Jake Stauch (Serval) interview} Companion shorts highlight the autonomy-vs-control tension^{[46]Sequoia short: employees want autonomy} and Twitter's old "bias to yes" cross-team approval ban^{[47]Sequoia short: Twitter cross-team approvals}.

Stauch opens by framing Serval as the AI-native ServiceNow: a platform for employee support (enterprise service management) where instead of waiting on a ticket, you ask for something and get it instantly through automation [01:00]. He concedes that ServiceNow's founding insight (workflows on top of a database) was correct, but argues those workflows take weeks to months of dedicated developer effort to build and maintain, which is fatal in an era of rapidly changing business processes [02:03][03:04].
Serval's wedge is a codegen engine: describe a workflow in natural language and the system generates the TypeScript instantly; the same approach keeps the underlying databases fresh. A core product principle is that building the automation must be as easy as, or easier than, doing the task manually; otherwise IT staff will just reset the password by hand instead of opening a workflow builder [04:04][05:04]. This creates a 'slop automation' risk (the 20th duplicate password-reset workflow), which Serval mitigates with an admin-side agent that has full context of existing workflows and proactively suggests consolidation, deletion, and approval steps [05:04][06:04].
On moats and the application-layer debate, Stauch argues you should be happy when new models ship: the product is the boundaries (permissions, approvals, API scopes, audits, logs), not the raw capability [08:05][09:06]. Serval splits its system into two agents: an admin agent that builds tools/skills under tight control, and a help-desk agent that can reason freely but only invokes admin-sanctioned tools [09:06][10:06]. They run OpenAI models for end-user interaction and Anthropic (Sonnet/Opus) for codegen, and have learned that new model releases are not plug-and-play: sometimes they downgrade because predictable behavior beats raw intelligence [11:08][12:09][13:09].
Unit economics are healthy because Serval isn't reselling tokens at runtime: once a workflow is codegenned it just executes, so the expensive token spend amortizes [14:10]. On the existential threat of OpenAI or Anthropic moving into ITSM, Stauch points out that Anthropic has added more ARR in recent months than ServiceNow has in 20 years, so even a successful ITSM build would be a poor use of their focus [16:10][17:12].
Customer needs across AI-native startups and Fortune 500s are more similar than he expected; the real difference is committee size and decision velocity [18:12][19:14]. AI-native customers love that Serval lets IT folks finally use cool AI tools instead of fielding ServiceNow tickets to provision Cursor seats [19:14]; large enterprises value the employee-experience transformation (no more tickets vanishing into the abyss) [20:16].
Internally, Serval applies AI 'right of first refusal' to every function: no SDRs, no solutions engineers, leaner RevOps and product marketing, with reps using Serval itself to generate decks and battle cards live on calls [25:18][26:18].
Stauch's contrarian take is the emerging autonomy-vs-control gap: individuals want their Claude agents to do everything, but security and IT orgs need control, and that tension is where Serval positions itself [31:21][32:21]. He compares this to the iPhone/BlackBerry shadow IT moment: companies that default to 'yes' will lead, accept some security incidents, and ultimately win [32:21][33:21]. The biggest constraint is still hiring (every AI company is hiring more than ever despite the automation narrative), and Serval's mantra is 'fewer, better' [33:21][34:21][35:22]. Stauch wants Serval to be remembered not for automating jobs away but for closing the gap between what people thought their job would be and what it actually is, getting employees back to meaningful work [00:00][36:23][37:23].
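The two-agent split described above (an admin side that sanctions tools, a help-desk side that may only invoke what has been sanctioned) can be illustrated with a small sketch. This is not Serval's implementation, which the interview does not show; the names `ToolRegistry`, `HelpDeskAgent`, `sanction`, and `invoke` are hypothetical, and the "agent" here is a plain function call standing in for model reasoning. The point it shows is only the boundary: every call goes through an allow-list and leaves an audit entry.

```python
from typing import Callable, Dict, List


class ToolRegistry:
    """Hypothetical allow-list of tools approved on the admin side, with an audit log."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}
        self.audit_log: List[str] = []

    def sanction(self, name: str, fn: Callable[..., str]) -> None:
        # Admin-side step: register a tool so the help-desk side may call it.
        self._tools[name] = fn
        self.audit_log.append(f"sanctioned:{name}")

    def invoke(self, name: str, **kwargs: str) -> str:
        # Every attempt is logged, whether or not it is allowed.
        if name not in self._tools:
            self.audit_log.append(f"denied:{name}")
            raise PermissionError(f"tool {name!r} is not admin-sanctioned")
        self.audit_log.append(f"invoked:{name}")
        return self._tools[name](**kwargs)


class HelpDeskAgent:
    """Hypothetical help-desk side: free to pick a tool, but only via the registry."""

    def __init__(self, registry: ToolRegistry) -> None:
        self.registry = registry

    def handle(self, tool_name: str, **kwargs: str) -> str:
        try:
            return self.registry.invoke(tool_name, **kwargs)
        except PermissionError:
            # Unsanctioned requests fall back to a human instead of executing.
            return "escalated to IT admin"


def reset_password(user: str) -> str:
    # Stand-in for a generated workflow; a real one would call an identity provider.
    return f"password reset for {user}"


registry = ToolRegistry()
registry.sanction("reset_password", reset_password)
agent = HelpDeskAgent(registry)

print(agent.handle("reset_password", user="ana"))
print(agent.handle("delete_account", user="ana"))
print(registry.audit_log)
```

In this sketch the capability lives in the tool functions while the product-like part is the registry: swapping in a stronger model for the help-desk side would not widen what it is allowed to do.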

Sections

~00:00 Mission: close the gap between idealized and actual jobs

~01:00 Why an AI-native ServiceNow — automation as the path to instant employee support

~02:03 ServiceNow's primitives were right; manual workflow building is the problem

~04:04 Building automation must be easier than doing it manually; slop automation risks

~07:05 Customer obsession as moat: Slack channels with 100+ customers, every day

~08:05 Application-layer moat: the product is the boundaries (permissions, approvals, audits)

~10:06 Two-agent architecture (admin agent + help-desk agent) and model mix (OpenAI + Anthropic)

~15:10 Will OpenAI/Anthropic build ITSM? Focus and ARR math say no

~24:18 Operating AI-native: 'right of first refusal' for AI, no SDRs/SEs, fewer-better

~31:21 Contrarian take: autonomy vs control — individuals want autonomy, orgs want control

~33:21 Biggest issue is still hiring; talent density and agility as the only moat

~36:23 Legacy: unlock meaningful work, not just automate jobs away

We want to be the tool that actually closes the gap between what you think your job's going to be and what your job actually is.

If it were actually easier to build the automation, you would build the automation — because why wouldn't you?

The product is the boundaries. The product is the controls. The product is actually what limits the capabilities of the model.

In the past couple months, Anthropic has added more ARR than ServiceNow has in the past 20 years.

Companion shorts

Employees want autonomy. Organizations don't. — Stauch in 60 seconds on the central enterprise-AI tension^{[46]Sequoia short: autonomy vs control}.
Twitter banned cross-team approvals. — Companion short: a "bias to yes" policy lets experiments ship as long as a direct manager approves^{[47]Sequoia short: bias to yes}.

Tools: Serval, ServiceNow, Peregrine, Remedy, Google Workspace, Cursor, OpenAI (GPT-5.5), Anthropic Claude (Sonnet, Opus), TypeScript, Slack, LinkedIn, ChatGPT, iPhone / BlackBerry (analogy)

Karpathy joins Anthropic

Why this hire, why now

Anthropic's momentum line

"The wrapper is the product"

Karpathy's LLM wiki, transplanted onto Claude

Auto research and the /goal loop

Three predictions

Gemini 3.5 Flash takes the speed crown at I/O

The benchmark picture

Price up, distribution wider

Sundar's keynote setup

SDK day-one

An informal AICodeKing test

Managed Agents in the Gemini API — Google echoes Anthropic

The pitch

The six-part API surface

Competitive context: Anthropic parallel

Availability

Gemini Spark: the always-on personal AI agent

What Spark actually does

Daily Brief

MCP partners go live

UI redesign + macOS

Scale claim

Gemini Omni — any-to-any multimodality, video first

Output modalities

Avatars and effects

Omni Flash in Flow

Google AI Studio + Antigravity 2.0: the agent-first IDE stack

Antigravity becomes an ecosystem

AI Studio: mobile, Workspace, Android

AI Ultra positioning

Build with Gemini XPRIZE

Gemini for Science: Co-Scientist, AlphaFold, WeatherNext go live

Co-Scientist

Antimicrobial resistance

Cancer target identification in Uganda

WeatherNext and Hurricane Melissa

Scope

Project Genie + Street View = AI-generated real places

Stitch, Pomelli, Flow — Google Labs creative tools level up

Stitch

Pomelli

Flow + Flow Music

KPMG goes all-in on Claude: 276,000 employees across 138 countries

Scope

Stack

Why this matters for Anthropic's enterprise narrative

Anthropic widens the AI conversation — and opens Mythos

Who Anthropic is talking to

The ethical-reminder result

Mythos opens up

OpenAI ships content provenance — C2PA + SynthID, jointly with Google

Why two layers

The verification tool

SynthID scale

Musk loses his OpenAI case — statute of limitations

Nvidia: too big to excite?

EY retracts consulting report riddled with AI hallucinations

Simon Willison's 6-month LLM recap (PyCon US 2026)

Simon Willison ships: llm-gemini 0.32, datasette-llm 0.1a8, accountant 0.1a4

Theo turns a $40 Copilot plan into $40K of inference

The loophole

Four billing models

The T3 Chat near-bankruptcy detour

Caching economics

The June 1 token-credit migration

Azure benchmark revenge arc

Six agent protocols, three that matter: MCP / A2A / AGUI

MCP — the tool and data layer

A2A — agent-to-agent coordination

AGUI — the human control layer

The three domain-specific protocols

The real hot take

LLMorphism: are we becoming more like LLMs?

Mozilla ships Zero, an AI-native systems language

What Zero is

The pushback

Nate B Jones: non-coding AI is exploding; AI jobs have no fixed destination

Scaling your system from 0 to 1M users — the 7-step playbook

Auto research and the `/goal` loop