Gemini 3.5 Flash lands; Antigravity faceplants

AI Models Hot Take

Google DeepMind, Simon Willison's Weblog, Better Stack, AICodeKing, Theo - t3.gg, The AI Daily Brief: Artificial Intelligence News, Tech Brew, Simon Willison's Weblog

Google I/O 2026 day two: Gemini 3.5 Flash lands to mixed-to-hostile reviews

Google I/O day two dropped Gemini 3.5 Flash and a sprawling agent stack, but third-party benchmarks and hands-on tests landed somewhere between unimpressed and hostile. Artificial Analysis pegs the new Flash at roughly 5.5x the per-task cost of Gemini 3 Flash and trailing several cheaper coders ^{[1]Better Stack — Gemini 3.5 Flash is just... fine}, AICodeKing calls Antigravity 2.0 and 3.5 Flash “dead on arrival” ^{[2]AICodeKing — Antigravity 2.0 & Gemini 3.5 Flash (Fully Tested): SO BAD! FINAL NAIL IN THE COFFIN FOR GOOGLE.}, Theo records a 22-minute crash-out flagging hidden pricing and a Fish Slap agent eval where 3.5 Flash fails outright ^{[3]Theo - t3.gg — I'm scared to make this video}, and Simon Willison, a Gemini fan in normal times, explicitly tags most of I/O as vaporware ^{[4]Simon Willison's Weblog — Google I/O, Gemini Spark, Antigravity}. Google's own DeepMind video still pitches the 4x faster output tokens and GDP-bench wins ^{[5]Google DeepMind — Gemini 3.5 Flash has landed.}, and Nathaniel Whittemore frames the whole event as confused and unfocused ^{[6]The AI Daily Brief: Artificial Intelligence News — The Most Important AI News from Google I/O}.

Gemini 3.5 Flash: speed up, price obfuscated

Google's headline pitch was “better than 3.1 Pro on nearly every benchmark + 4x faster output tokens” — Better Stack confirms the throughput claim at 278 tokens/sec ^{[1]Better Stack — Gemini 3.5 Flash is just... fine} but pulls Artificial Analysis numbers showing 3.5 Flash actually costs ~5.5x Gemini 3 Flash per task because it spends more tokens to get there, and trails Haiku 4.5 and GPT-5.1 Mini on coding. Theo's ~03:03 segment digs into the token-bloat math and points out that Google buried the real per-token price in fine print ^{[3]Theo - t3.gg — I'm scared to make this video}; his ~08:05 Fish Slap agentic test has 3.5 Flash failing while GPT-5.5 ships in the same harness.
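The token-bloat argument is simple arithmetic: per-task cost is tokens consumed times the per-token price, so a model that needs more tokens per task can cost several times more even if its list price looks similar. The sketch below is illustrative only. The token counts and prices are made-up placeholders, not published figures; they were picked so that the two multipliers combine to the ~5.5x ratio quoted above.

```python
def cost_per_task(tokens_used: int, price_per_million_usd: float) -> float:
    """Dollar cost of one task: tokens consumed times the per-token price."""
    return tokens_used / 1_000_000 * price_per_million_usd


# Illustrative placeholders only, NOT published prices or measured token counts.
OLD_TOKENS, OLD_PRICE = 100_000, 1.00  # hypothetical older model
NEW_TOKENS, NEW_PRICE = 275_000, 2.00  # hypothetical newer model: 2.75x tokens, 2x price

old_cost = cost_per_task(OLD_TOKENS, OLD_PRICE)
new_cost = cost_per_task(NEW_TOKENS, NEW_PRICE)
ratio = new_cost / old_cost

print(f"old ${old_cost:.2f}/task, new ${new_cost:.2f}/task, ratio {ratio:.2f}x")
```

Any split between the token multiplier and the price multiplier that multiplies out to the same product gives the same per-task ratio, which is why a per-token price alone says little about what a task costs.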

AICodeKing's hands-on at ~05:10 benchmarks the model against Sonnet, GPT-5.5, and Haiku 4.5 on practical coding and concludes Flash gets beaten across the board ^{[2]AICodeKing — Antigravity 2.0 & Gemini 3.5 Flash (Fully Tested): SO BAD! FINAL NAIL IN THE COFFIN FOR GOOGLE.}, and at ~06:10 recommends staying on those alternatives.

It's mid at best on every real coding test I ran — and yet they want $200/month for the Antigravity tier. Just no.

Nathaniel Whittemore reads I/O as a slate of features that simply did not have a focal point — Gemini Omni, Spark, Antigravity, and Flash all launched the same day with no clear flagship ^{[6]The AI Daily Brief: Artificial Intelligence News — The Most Important AI News from Google I/O}, while Google DeepMind's own launch video leans hard on the GDP-bench wins and the latency story ^{[5]Google DeepMind — Gemini 3.5 Flash has landed.}.

Willison: most of I/O is vaporware

Simon Willison — whose policy is to only write about features he can personally test — admits I/O 2026 fell largely outside that scope because so much was preview or gated. The one clearly-released piece (3.5 Flash) he covers in a separate post; everything else is “coming soon” ^{[4]Simon Willison's Weblog — Google I/O, Gemini Spark, Antigravity}.

The naming problem is now a strategy problem

Tech Brew and Whittemore independently make the same point: nobody at Google appears to own a coherent naming taxonomy. Gemini sits inside AI Studio sits inside AI Mode sits next to AI Overviews; Spark sits inside Gemini, runs on Flash, runs on Antigravity. Even Google's own keynote presenters seemed unsure whether Spark was a feature or a product ^{[7]Tech Brew — Google it (while you still can)}.

Even Google's own keynote presenters seemed unsure whether Spark was a feature of Gemini or a separate product. The naming problem is now a strategy problem.

Sidebar: making “4x faster” tangible

Simon Willison also highlighted Mike Veerman's interactive tool that simulates LLM token output from 5 to 800 tokens per second — a useful companion piece when product and design audiences need to actually feel what Google's “4x faster output tokens” claim translates to on screen ^{[8]Simon Willison's Weblog — How fast is 10 tokens per second really?}.
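For readers without the interactive tool at hand, the same intuition can be roughed out as wall-clock time to stream a response. This is a minimal sketch, not a model of Veerman's tool. The 1,000-token response length is an assumption, and so is reading “4x faster” as a baseline of one quarter of the 278 tokens/sec throughput Better Stack reports.

```python
def seconds_to_stream(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to emit num_tokens at a constant output rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 1_000                # assumed response length, for illustration
MEASURED_RATE = 278.0                  # tokens/sec, the Better Stack figure
ASSUMED_BASELINE = MEASURED_RATE / 4   # assumption: "4x faster" is relative to this

# 5 and 800 tokens/sec are the two ends of the range the tool simulates.
for rate in (5.0, ASSUMED_BASELINE, MEASURED_RATE, 800.0):
    print(f"{rate:7.1f} tok/s -> {seconds_to_stream(RESPONSE_TOKENS, rate):6.1f} s")
```

Under these assumptions the wait for a full response drops from roughly 14 seconds to under 4, which is the difference the speed claim is pointing at.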

Tools: Gemini 3.5 Flash, Gemini 3 Flash, Haiku 4.5, GPT-5.1 Mini, GPT-5.5, Artificial Analysis, Fish Slap

AI Tools AI Future

Simon Willison's Weblog, Tech Brew, The AI Daily Brief: Artificial Intelligence News

Gemini Spark: Google's personal agent, with a prompt-injection-shaped target on its back

Google launched Gemini Spark as its answer to a personal AI agent, with native hooks into Gmail, Calendar, Drive, Docs, Sheets, Slides, YouTube, and Maps, running on Gemini 3.5 Flash and the new Antigravity platform ^{[4]Simon Willison's Weblog — Google I/O, Gemini Spark, Antigravity}. Tech Brew frames it as a “24/7 personal agent” positioned squarely against OpenAI and Anthropic's consumer agent pushes ^{[7]Tech Brew — Google it (while you still can)}, and Whittemore reads the product as confusingly scoped — it does a lot, but isn't clearly Pro-tier or free-tier ^{[6]The AI Daily Brief: Artificial Intelligence News — The Most Important AI News from Google I/O}. Willison flags the prompt-injection surface area as a potential “Challenger disaster” for agent security.

What Spark is

From the FAQ Willison surfaces: Spark is pitched as “your personal AI agent” that “connect[s] natively with your favorite Google apps like Gmail, Calendar, Drive, Docs, Sheets, Slides, YouTube, and Google Maps” and explicitly runs “on Gemini 3.5 Flash and Antigravity” ^{[4]Simon Willison's Weblog — Google I/O, Gemini Spark, Antigravity}. Every task executes in a “fresh, strictly isolated, ephemeral VM,” with traffic gated through a secure Agent Gateway enforcing DLP policies.

Prompt injection: the elephant in the calendar

Willison's read: an agent that can read your inbox and act on your behalf is precisely the system adversarial content in those inboxes is designed to hijack ^{[4]Simon Willison's Weblog — Google I/O, Gemini Spark, Antigravity}. He doesn't soften the framing.

An agent that can read your email and calendar and act on your behalf is exactly the kind of system that adversarial content in those sources could hijack.

Tech Brew's coverage at the company-strategy level positions Spark as Google trying to occupy the consumer-agent slot before OpenAI ships ChatGPT Agent 2 and Anthropic ships Co-work as default ^{[7]Tech Brew — Google it (while you still can)}. Whittemore at ~12:06 reads the launch as “a confusingly positioned 24/7 personal agent” ^{[6]The AI Daily Brief: Artificial Intelligence News — The Most Important AI News from Google I/O}.

Tools: Gemini Spark, Gemini 3.5 Flash, Antigravity, Gmail, Google Drive, ChatGPT Agent 2, Claude Co-work

AI Models

Google DeepMind, The AI Daily Brief: Artificial Intelligence News, AI Engineer

Gemini Omni: any-to-any multimodal, with conversational video editing as the headline demo

Gemini Omni is the day's most genuinely novel piece: an any-to-any multimodal model that combines Gemini's reasoning with VEO, Nano Banana, and Genie to do conversational video editing rather than competing head-on with Sora 3 ^{[9]Google DeepMind — Build your next story with Gemini Omni.}. Whittemore reads it as “a Nano-Banana moment for video editing,” not a V4 contender ^{[6]The AI Daily Brief: Artificial Intelligence News — The Most Important AI News from Google I/O}. Patrick Löber's AI Engineer talk on the same architecture explains why native multimodal generation matters — the model can ground audio in the same world-knowledge the language model uses, instead of routing through text ^{[10]AI Engineer — Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind}.

What Omni does

The DeepMind launch video pitches Omni as a single model that takes any input and produces any output, internally orchestrating VEO (video), Nano Banana (image), and Genie (worlds) under Gemini's planner. The headline demo is conversational video editing — “make the dog turn around,” “swap the time of day” — without re-rendering from scratch ^{[9]Google DeepMind — Build your next story with Gemini Omni.}.

Why “native” matters

Löber's talk argues the unlock of any-to-any models is that media generation gets access to the LM's world knowledge instead of being a downstream call (~11:25). The Live API at ~13:28 demonstrates the same logic for audio: a single audio-to-audio architecture for real-time interaction rather than ASR+LM+TTS. He also surfaces multimodal embeddings and Gemma 4 local agents at ~15:00 ^{[10]AI Engineer — Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind}.

Whittemore's read

Whittemore at ~09:05 sees Omni as the right framing — Google is winning at editing creative media, not at generating it from a blank canvas, and Omni leans into that ^{[6]The AI Daily Brief: Artificial Intelligence News — The Most Important AI News from Google I/O}.

Tools: Gemini Omni, VEO, Nano Banana, Genie 3, Live API, Gemma 4, multimodal embeddings

Developer Tools Hot Take

Simon Willison's Weblog, Theo - t3.gg, AICodeKing, Better Stack

Antigravity 2.0 lands, Gemini CLI gets a June 18 obituary

Antigravity 2.0 is Google's new agentic runtime — desktop app, Go-based CLI, open-source Python SDK wrapper, and a VS Code fork IDE — and it ships with the announcement that the open-source TypeScript Gemini CLI gets shut down June 18, 2026 in favor of the closed-source Antigravity CLI ^{[4]Simon Willison's Weblog — Google I/O, Gemini Spark, Antigravity}. Theo's hands-on at the CLI is brutal: bugs in auth, sub-agents that look like Codex clones, and a launch that smells dysfunctional even by Google standards ^{[3]Theo - t3.gg — I'm scared to make this video}. AICodeKing concurs on the desktop app — Codex clone with UX regressions — and recommends sticking with existing tools ^{[2]AICodeKing — Antigravity 2.0 & Gemini 3.5 Flash (Fully Tested): SO BAD! FINAL NAIL IN THE COFFIN FOR GOOGLE.}.

Four surfaces, one runtime

Willison's recap: Antigravity is a desktop app, a CLI written in Go, an open-source Python SDK wrapper, and a VS Code fork ^{[4]Simon Willison's Weblog — Google I/O, Gemini Spark, Antigravity}. The platform runs on Google Cloud with ephemeral VMs, an Agent Gateway, DLP, and encrypted credentials.

Gemini CLI gets a June 18 funeral

The existing Gemini CLI — Apache 2.0 TypeScript — stops working on June 18, 2026. Its replacement is the closed-source Antigravity CLI. Willison notes this open-to-closed reversal without enthusiasm; developers get roughly a month to migrate ^{[4]Simon Willison's Weblog — Google I/O, Gemini Spark, Antigravity}.

The CLI itself is bad

Theo at ~10:06 walks through bugs in the Antigravity CLI's authentication and basic UX, and at ~20:09 argues that Antigravity is straight-up copying Codex — sub-agents, planning panels, the works ^{[3]Theo - t3.gg — I'm scared to make this video}.

AICodeKing's CLI walkthrough at ~02:08 hits the same auth issues, and the desktop-app review at ~03:09 calls it a Codex clone with regressions ^{[2]AICodeKing — Antigravity 2.0 & Gemini 3.5 Flash (Fully Tested): SO BAD! FINAL NAIL IN THE COFFIN FOR GOOGLE.}. Better Stack at the model level adds that Antigravity 2 is positioned as a Codex competitor but ships with all the rough edges Codex shed eight months ago ^{[1]Better Stack — Gemini 3.5 Flash is just... fine}.

Tools: Antigravity CLI, Gemini CLI, Antigravity SDK, VS Code (Antigravity fork), OpenAI Codex

AI Tools Industry

Tech Brew, Google Research

Search redesign, YouTube Search, smart-glasses, and life-size Beam meetings

Tech Brew bundles the not-Gemini parts of I/O: the first major Google Search redesign in 25 years (stacked-card AI-first layout), a YouTube Search overhaul with conversational interface, and a smart-glasses push co-branded with Warby Parker and Gentle Monster ^{[7]Tech Brew — Google it (while you still can)}. Google Research separately announced Beam group-meetings with life-size remote attendees on HP Dimension — an experiment showing measurable presence improvements ^{[11]Google Research — A new experiment brings better group meetings to Google Beam}.

Search, finally redesigned

Tech Brew describes the new layout as a stacked-card interface where AI Overviews sit above results, with the traditional 10 blue links pushed down. The rollout is gradual and US-first ^{[7]Tech Brew — Google it (while you still can)}.

YouTube Search becomes conversational

The YouTube redesign treats search as a conversation — query expansion, follow-up questions, and timestamped clip retrieval inside the search bar ^{[7]Tech Brew — Google it (while you still can)}.

Smart glasses, via fashion houses

Co-branded hardware with Warby Parker and Gentle Monster is the bet that prescription-friendly, design-led frames are the only way smart glasses become daily-driver ^{[7]Tech Brew — Google it (while you still can)}.

Beam: remote attendees at life-size

Google Research separately announced that Google Beam — the rebranded Project Starline — can now match remote participants to true-to-life size on HP Dimension's immersive display, eliminating the “tiny head on a screen” cue that makes remote attendees feel optional. Early enterprise pilots show measurable improvements in perceived presence and participation ^{[11]Google Research — A new experiment brings better group meetings to Google Beam}.

Tools: Google Search, YouTube Search, Warby Parker, Gentle Monster, AI Overviews, Google Beam, HP Dimension

AI Future AI Models

OpenAI, OpenAI, Last Week in AI

OpenAI's reasoning model disproves an 80-year-old Erdős conjecture

An internal OpenAI reasoning model autonomously disproved the Erdős unit distance conjecture — a central open problem in combinatorial geometry that had stood roughly 80 years — using techniques from algebraic geometry. The result has been independently verified ^{[12]OpenAI — An OpenAI model has disproved a central conjecture in discrete geometry}. OpenAI's promo video features the researchers walking through how the model produced the proof end-to-end ^{[13]OpenAI — The Erdős Breakthrough}. Last Week in AI's coverage frames this as the most significant “model does real math” moment since AlphaProof ^{[14]Last Week in AI — Last Week in AI #245 - TML-Interaction, Claude For Legal, Sam Altman on Stand}.

What was disproved

The unit-distance conjecture bounds how many pairs among n points in the plane can sit at exactly unit distance from each other. OpenAI's model produced a construction that exceeds the conjectured bound (a disproof rather than a proof) and used tools from algebraic geometry to build it ^{[12]OpenAI — An OpenAI model has disproved a central conjecture in discrete geometry}.
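To make the quantity concrete, the sketch below counts unit-distance pairs in a small point set. It only illustrates what is being counted. The square grid is a simple baseline configuration chosen for this example; it is not the extremal arrangement and has nothing to do with the construction OpenAI's model found.

```python
from itertools import combinations


def count_unit_distances(points):
    """Count unordered pairs of points whose Euclidean distance is exactly 1.

    Points are integer (x, y) tuples, so squared distances are compared
    exactly and no floating-point tolerance is needed.
    """
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == 1
    )


def square_grid(k):
    """Return the k x k grid of points with unit spacing."""
    return [(x, y) for x in range(k) for y in range(k)]


for k in (2, 3, 4):
    pts = square_grid(k)
    print(len(pts), "points ->", count_unit_distances(pts), "unit distances")
```

On a k x k grid the count is 2k(k-1), which grows only linearly in the number of points; the open question concerns how much better a cleverly chosen point set can do.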

Why this is different

The Erdős problem is not in the AlphaProof / IMO style of well-defined competition mathematics: it's a famous open problem from a famously prolific problem-poser. The proof itself is novel enough that humans had to study it to verify. OpenAI's video at ~00:00 opens with the researchers explaining their own surprise ^{[13]OpenAI — The Erdős Breakthrough}.

Industry

Simon Willison's Weblog, Last Week in AI

Anthropic is paying SpaceX $1.25B/month for compute, per an accidental S-1 disclosure

SpaceX's SEC S-1 filing inadvertently disclosed a $1.25-billion-per-month cloud services agreement with Anthropic, running through May 2029 and using the COLOSSUS and COLOSSUS II clusters ^{[15]Simon Willison's Weblog — Quoting SpaceX S-1}. The deal is large enough that Last Week in AI dedicates a 14-minute segment to its implications, including how it slots alongside Google's existing $40B investment in Anthropic and the broader neocloud restructuring ^{[16]Last Week in AI — Last Week in AI #244 - GPT-5.5 Instant, Grok 4.3, OpenAI vs Musk}.

The leak

Buried in SpaceX's S-1: a multi-year compute contract with Anthropic at $1.25B/month, through May 2029, running on COLOSSUS and COLOSSUS II. Willison points out the obvious: disclosure was almost certainly unintentional, and the run rate ($1.25B x 12 months) puts Anthropic compute at $15B/year just from this one provider ^{[15]Simon Willison's Weblog — Quoting SpaceX S-1}.

How it fits the rest of the financing puzzle

LWIA #244 at ~51:50 connects the dots: Anthropic's compute mix now spans Google ($40B investment), AWS Bedrock, and SpaceX (COLOSSUS); the orbital-data-center angle is no longer pure science fiction ^{[16]Last Week in AI — Last Week in AI #244 - GPT-5.5 Instant, Grok 4.3, OpenAI vs Musk}.

If Anthropic is paying $1.25 billion a month for compute on COLOSSUS, the next question isn't whether scaling laws still hold — it's who can actually finance the next decade of training.

Tools: COLOSSUS, COLOSSUS II, AWS Bedrock, neocloud

Industry Podcast

Every, Last Week in AI

Anthropic acquires Stainless for $300M; founder Alex Rattray on MCP and API “dendrites”

Anthropic acquired Stainless for $300M — the API/SDK company behind OpenAI, Anthropic, and other major LLM providers' developer experiences. Every's Dan Shipper interviews founder Alex Rattray on what the deal means for MCP, agent UX, and the “dendrites of the internet” framing ^{[17]Every — Anthropic Just Bought a Dev Tools Startup for $300M. Here's What Its Founder Told Me.}. LWIA #245's segment also notes Stainless's role in fixing MCP's current scaling problems ^{[14]Last Week in AI — Last Week in AI #245 - TML-Interaction, Claude For Legal, Sam Altman on Stand}.

The interview

Rattray opens at ~04:15 by reframing APIs as “dendrites of the internet” — the connective tissue that lets agents reach beyond their model context ^{[17]Every — Anthropic Just Bought a Dev Tools Startup for $300M. Here's What Its Founder Told Me.}. At ~09:18 he explains why MCP isn't working well at scale: the protocol assumes per-tool descriptions, but the descriptions themselves don't compose. At ~11:20 Shipper teases the “refund my stripey socks across 5 apps” demo as the agentic North Star.

Why Anthropic paid $300M

The Every read: Stainless is one of the few companies sitting on the actual taxonomy of how thousands of APIs differ, plus the tooling to normalize them. For Anthropic, that's a foundational lever for Claude Co-work and the agent SDK ecosystem ^{[17]Every — Anthropic Just Bought a Dev Tools Startup for $300M. Here's What Its Founder Told Me.}.

Tools: Stainless, MCP, Claude Co-work, OpenAI SDK, Anthropic SDK

Industry

Morning Brew

Meta lays off 10% (~7,800), pivots org chart to AI

Meta is laying off roughly 7,800 employees — about 10% of its ~78,000-person workforce — while simultaneously redirecting 7,000 remaining employees into AI-related initiatives and closing 6,000 open non-AI reqs. The framing is explicit: capex and headcount alike are being recomposed toward AI ^{[18]Morning Brew — Meta lays off 10% of workforce}.

Morning Brew's read: this is not a cost-cut, it's a portfolio rebalance. The reduction in non-AI hiring (6,000 reqs closed) is almost as large as the layoff itself, and the 7,000 internal AI redeployments make the net effect a near-zero headcount change with a different mix ^{[18]Morning Brew — Meta lays off 10% of workforce}. The deeper signal is that Meta is treating its own non-AI org chart as the largest available source of AI engineers.

AI Future AI Tools

AI News & Strategy Daily | Nate B Jones, Data Science Weekly

The seven infrastructure control points that decide whether your agent ships

Nate B Jones argues the companies that actually decide whether AI agents ship are not OpenAI or Anthropic — they're the seven infrastructure providers controlling runtime, identity, data, payments, observability, the kill switch, and the cross-cutting governance framework ^{[19]AI News & Strategy Daily | Nate B Jones — These 5 Infrastructure Giants Secretly Rule AI}. The piece doubles as a pre-launch worksheet: any of those seven still TBD is a production blocker. Data Science Weekly's headline this week makes the adjacent argument from the other direction — tools, not bigger models, are the missing ingredient for production-grade agents ^{[20]Data Science Weekly — Tools: Why AI Agents Need Them}.

The seven control points

Nate maps each layer to specific incumbents at ~02:02 through ~13:07: runtime → Cloudflare Agents SDK, AWS Bedrock Agent Core, Vercel AI Gateway; identity → Auth0, Okta for AI agents, Microsoft Entra Agent ID, WorkOS; data governance → Snowflake Cortex, Databricks Mosaic AI; payments → Stripe, Visa, Mastercard, Amex; observability → Datadog LLM, LangSmith, Braintrust, Langfuse; kill switch → layered across runtime/identity/gateway/payment/framework (e.g., LangGraph interrupts) ^{[19]AI News & Strategy Daily | Nate B Jones — These 5 Infrastructure Giants Secretly Rule AI}.
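The pre-launch worksheet idea can be sketched as a simple check: list the seven control points, record which one has a named owner, and treat anything still empty as a blocker. This is a hypothetical illustration, not a tool from the video. The key names and the example project are assumptions; the provider strings are borrowed from the mapping above purely as sample values.

```python
CONTROL_POINTS = (
    "runtime", "identity", "data", "payments",
    "observability", "kill_switch", "governance",
)


def launch_blockers(choices: dict) -> list:
    """Return the control points that still have no named owner or provider."""
    return [cp for cp in CONTROL_POINTS if not choices.get(cp)]


# Hypothetical project: None marks a decision that is still TBD.
choices = {
    "runtime": "Cloudflare Agents SDK",
    "identity": "WorkOS",
    "data": "Snowflake Cortex",
    "payments": "Stripe",
    "observability": "Langfuse",
    "kill_switch": None,
    "governance": None,
}

blockers = launch_blockers(choices)
print("ready to ship" if not blockers else f"blocked on: {', '.join(blockers)}")
```

A missing key and an explicit None are treated the same way, so a worksheet that forgets a layer entirely is flagged just like one that lists it as undecided.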

The argument in one line

The companies that decide whether your agent is successful are the ones that are building the layer that determines whether agents can act.

At ~13:07 he closes with the kill-switch insight that telling the model to stop is never the kill switch.

If the only way to tell your agent to stop is to just tell the model to stop, you don't have a kill switch.
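One way to read that line: the stop signal has to live outside the model, in code the model cannot talk its way past. The sketch below is a minimal illustration under that assumption, using a flag checked by the runtime loop before every action. The class, function, and action names are all hypothetical, and a production kill switch would also revoke credentials and payment authority rather than rely on a single in-process flag.

```python
import threading


class KillSwitch:
    """Out-of-band stop flag owned by the runtime, not by the model."""

    def __init__(self):
        self._tripped = threading.Event()

    def trip(self):
        self._tripped.set()

    def is_tripped(self):
        return self._tripped.is_set()


def run_agent(plan, kill_switch, execute):
    """Execute planned actions in order, checking the switch before each one."""
    completed = []
    for action in plan:
        if kill_switch.is_tripped():
            break  # enforced by the loop, regardless of what the model wants
        execute(action)
        completed.append(action)
    return completed


switch = KillSwitch()
executed = []


def execute(action):
    executed.append(action)
    if action == "send_email":
        # Stand-in for an operator or policy engine pulling the plug mid-run.
        switch.trip()


completed = run_agent(["read_inbox", "send_email", "issue_refund"], switch, execute)
print("completed:", completed)
```

The third action never runs because the loop refuses to dispatch it, not because anything asked the model to stop.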

Tools: Cloudflare Agents SDK, AWS Bedrock Agent Core, Vercel AI Gateway, Auth0, Okta, Microsoft Entra, WorkOS, Snowflake Cortex, Databricks Mosaic AI, Stripe, Datadog LLM, Langfuse, LangGraph

Podcast

Last Week in AI

Last Week in AI #245: TML-Interaction, Claude for Legal, Sam Altman on the stand

Andrey Kurenkov and Jeremy Harris cover OpenAI's GPT Realtime 2 stack, Thinking Machines' surprise TML-Interaction Small launch, Anthropic's vertical push with Claude for Legal, Musk v. OpenAI testimony, Anthropic's “Teaching Claude Why” alignment research, Jack Clark's prediction of fully automated AI R&D by 2028, and METR's horizon evaluation putting Claude Mythos at a ~16-hour task horizon ^{[14]Last Week in AI — Last Week in AI #245 - TML-Interaction, Claude For Legal, Sam Altman on Stand}.

The episode opens on real-time voice as a newly contested frontier (~03:12): GPT Realtime 2 ships powered by GPT-5 with 128k context, tunable reasoning effort, and a real-time Whisper variant. Days later, Thinking Machines (Mira Murati's lab, quiet since February) drops TML-Interaction Small at ~12:15: a 276B-parameter MoE targeting ~400ms end-to-end latency with a system-1/system-2 architecture, bitwise-aligned training/inference, custom MoE kernels, and persistent GPU sessions.

The middle section (~24:29) is dominated by Anthropic's Claude for Legal: a GitHub-distributed bundle of agent skills, MCP connectors (Docusign, Ironclad, iManage, LexisNexis, Everlaw, Box), and partnerships with Harvey, Lora, the Free Law Project, and the Justice Technology Association. Anthropic disclosed that legal is now the #1 power-user job in Claude Co-work with 3x the usage of any other function. Jeremy spends a chunk of the segment on TSMC-vs-Amazon-Basics: is Anthropic a neutral platform or will it Amazon-Basics Harvey's $11B valuation?

It's the knockoff shoe... applied to AI.

At ~44:48 Musk v. OpenAI testimony: Altman, Ilya Sutskever, and others on the stand.

Ilya's responses were praised for their depth of reasoning while Elon received praise for his low latency and high batch size.

At ~57:59 a wild segment on a Chinese gray market reselling Claude API access at 90% off — “the knockoff shoe applied to AI.” Then at ~71:07 Anthropic's “Teaching Claude Why” research: training on ethical reasoning generalizes far better than training on specific behaviors.

Show me the why and not the what to get that generalization.

The final stretch covers Jack Clark's 2028-automated-AI-R&D prediction (~78:11), METR's audit of Anthropic's automated-R&D safety case (~95:26), and the headline METR horizon-eval result (~101:29): Claude Mythos at ~16 hours of task horizon.

Alignment is a compounding error problem. If you're doing recursive self-improvement, your alignment may be 99.9%, but generation on generation, once you do 500 generations, now you're down to 60%.
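The arithmetic in that quote checks out if each generation is assumed to preserve the same fraction independently, so the losses multiply. A minimal sketch under that simplifying assumption (the function name is made up for this example):

```python
def retained_alignment(per_generation: float, generations: int) -> float:
    """Fraction retained after repeated multiplicative loss.

    Assumes every generation independently preserves the same fraction,
    a simplification made only for this illustration.
    """
    return per_generation ** generations


for g in (1, 100, 500):
    print(f"{g:4d} generations -> {retained_alignment(0.999, g):.1%}")
```

At 99.9% per generation, 500 generations leave roughly 61%, in line with the “down to 60%” figure in the quote.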

Tools: GPT Realtime 2, GPT Realtime Translate, TML-Interaction Small, Claude for Legal, Harvey, Petri 3.0, METR, Claude Mythos

Podcast

Last Week in AI

Last Week in AI #244: GPT-5.5 Instant, Grok 4.3, OpenAI vs Musk

Andrey and Jeremy cover GPT-5.5 Instant's benchmarks and cyber-risk classification, the “goblin” RL-training-leakage quirk that surfaced across GPT-5 generations, Grok 4.3 (with a narcolepsy issue), Anthropic adding Dreaming and Outcomes plus multi-agent orchestration to Claude, the Musk v. OpenAI Brockman diary, the Anthropic-SpaceX Colossus deal in detail, banks offloading AI data-center debt via SRTs (Big Short echoes), and Claude Opus 4.7 autonomously building an AlphaZero pipeline for Connect 4 ^{[16]Last Week in AI — Last Week in AI #244 - GPT-5.5 Instant, Grok 4.3, OpenAI vs Musk}.

Episode opens at ~01:11 on the Mythos PR-stunt pushback and broader AI-skepticism reckoning, then at ~10:18 dives into GPT-5.5 Instant's benchmarks and where it landed in OpenAI's cyber-risk preparedness framework. The “goblin” segment at ~15:24 is a fun detour on what happens when RL training leaks across model generations.

Grok 4.3 lands at ~23:31, with the “narcolepsy” issue documented in passing. Anthropic's Dreaming + Outcomes + multi-agent orchestration at ~35:38. The Musk v. OpenAI segment at ~41:41 covers the Brockman diary, Shivon Zilis as conduit, and Murati testimony. The Anthropic-SpaceX Colossus deal gets a 14-minute treatment at ~51:50; this is the same story Willison flagged from the S-1.

At ~65:02 the systemic-risk segment: banks offloading AI data-center debt via Significant Risk Transfers (SRTs), with explicit Big Short comparisons. Anthropic's Natural Language Autoencoders and unverbalized eval awareness at ~73:08. Then research-corner at ~97:26 on recursive multi-agent systems passing activations instead of text, and the closer at ~108:28: Claude Opus 4.7 autonomously building an AlphaZero pipeline for Connect 4.

Tools: GPT-5.5 Instant, Grok 4.3, Claude Opus 4.7, Natural Language Autoencoders, Anthropic Dreaming, SRTs

Podcast

Last Week in AI

Last Week in AI #243: GPT-5.5, DeepSeek V4, AI safety sabotage

GPT-5.5 release and pricing alongside the Anthropic compute gap, the GPT-5.5 regression on an internal AI-research benchmark and a related “goblin” quirk + a digression into AI consciousness, xAI Grok Voice Think Fast 1.0 and Claude creative-tool connectors, DeepSeek V4's hybrid compressed attention + 1M-token context + Huawei co-optimization, Tencent Hunyuan 3 and the ClawMark benchmark, Google's $40B Anthropic investment plus Meta-AWS Graviton, OpenAI/Microsoft restructuring, the Musk v. OpenAI trial, AISI sabotage evals, Delegate-52 document corruption, temporal sparse autoencoders, and sign-bit attacks ^{[21]Last Week in AI — Last Week in AI #243 - GPT 5.5, DeepSeek V4, AI safety sabotage}.

Pricing-and-compute opener at ~02:11, GPT-5.5 regression on internal AI-research benchmark at ~13:18 with the goblin quirk and an AI-consciousness digression. xAI Grok Voice + Claude creative connectors at ~20:23. DeepSeek V4's architecture and context window at ~26:31; Tencent Hunyuan 3 and ClawMark at ~43:40. The business-deals block at ~49:45: Google's $40B in Anthropic, Meta-AWS Graviton, China blocks Manus, OpenAI/Microsoft. Musk v. OpenAI trial + DOJ-Anthropic case + Gemini on-prem + Ineffable Intelligence funding at ~63:55. Safety research wraps the episode at ~79:07: AISI sabotage evals, Delegate-52 document corruption, temporal sparse autoencoders, and sign-bit attacks ^{[21]Last Week in AI — Last Week in AI #243 - GPT 5.5, DeepSeek V4, AI safety sabotage}.

Tools: GPT-5.5, DeepSeek V4, Tencent Hunyuan 3, AISI sabotage evals, temporal SAEs

Podcast

Latent Space

Latent Space — swyx interviews Jake Cooper (Railway): the agent-native cloud

swyx interviews Railway founder Jake Cooper on building the agent-native cloud: 3M users, 100k signups/week, 3-month payback on bare-metal data centers, and a deliberate refusal to use Kubernetes ^{[22]Latent Space — The Agent-Native Cloud: 3M Users, 100K Signups/Wk, Data Centers, & Death PRs — Jake Cooper, Railway}. The deeper argument is that what agents want from infrastructure — versioning, observability, feature flags at 1000x scale — reshapes what cloud platforms have to ship.

Cooper opens at ~02:00 framing Railway as “the easiest way to ship anything,” then walks the growth chart at ~07:02: free-tier era, the compaction crisis, and getting back to 100K signups/week. The infrastructure primitives discussion at ~14:04 is the bluntest part of the interview — Railway pointedly does not use Kubernetes.

At ~16:06 the economics: 3-month payback on bare-metal data centers, debt-financed cloud-burst capacity, and why owning the floor matters. The agent-native part starts at ~26:11 — versioning, observability, and feature flags must work at 1000x current scale because each user now has an army of agents pushing to prod. At ~32:14: the death of push-pull deploys, with canvas as the new output and CLI as the agent surface.

The closing third gets philosophical: at ~46:20 Cooper is openly skeptical of full AI SRE, defending the spec/code/tests trinity and the dream of self-replicating infra. At ~55:29 he closes on Heroku's slow death, a Temporal critique, and his own founder/focus philosophy ^{[22]Latent Space — The Agent-Native Cloud: 3M Users, 100K Signups/Wk, Data Centers, & Death PRs — Jake Cooper, Railway}.

Tools: Railway, Kubernetes, Heroku, Temporal, bare-metal data centers

Podcast

The Pragmatic Engineer

Pragmatic Engineer — Alice Ryhl on why Rust is different

Gergely Orosz interviews Alice Ryhl — Tokio maintainer, Google Android Rust engineer, Rust kernel contributor — on why Rust is structurally different from C++ and TypeScript, how the language is governed without a BDFL, and what AI-assisted coding does (and doesn't) change about kernel work ^{[23]The Pragmatic Engineer — Why Rust is different, with Alice Ryhl}.

Alice's path at ~02:00: Minecraft mods to Tokio to Android. The TypeScript pitch at ~07:03 — no null, the ? operator, doc tests, exhaustive matches — sets up the C++ pitch at ~13:07: memory safety eliminates a category of CVEs, period.

Ownership and the borrow checker at ~19:10: the right framing for newcomers is “rethink your data structures,” not “fight the compiler.” unsafe as escape hatch and Vec being built on top of it at ~26:11. Cargo + crates ecosystem at ~31:12, with Linus's package-manager gripe acknowledged.

Governance at ~35:15: teams, RFCs, ACPs, MCPs, and Final Comment Periods. Editions vs versions and the 6-week release cycle at ~46:24. At ~52:32: Rust in the Linux kernel — no longer experimental as of December 2025. And at ~55:35 the AI question — using LLM-assisted coding inside the Tokio repo and for kernel code review ^{[23]The Pragmatic Engineer — Why Rust is different, with Alice Ryhl}.

Tools: Rust, Tokio, Linux kernel, Cargo, Rust editions

Podcast

Nerd Snipe

Nerd Snipe: the OpenClaw creator's $1.3M Codex spend, explained

Nerd Snipe interviews Pete (the OpenClaw creator) on the screenshot that went viral: $1.3M of Codex spend across 603 billion tokens in “Codex Bar.” The deeper interview unpacks what that money actually bought (an agent fleet doing per-commit security, meeting-listening PR bots, and continuous claw-sweeping), plus how Theo deliberately limits his own automations to avoid prompt-injection and supply-chain risk ^{[24]Nerd Snipe — How the OpenClaw creator uses $1.3 million of tokens}.

Pete's screenshot at ~02:00: $1.3M, 603B tokens. Fast Mode + 2.5x pricing explanation at ~05:30 — what Codex Bar actually measures. The agent fleet at ~07:05: Claw Sweeper, per-commit security agents, meeting-listening PR bots.
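Dividing the two screenshot figures gives a rough blended rate. The sketch assumes the $1.3M covers exactly those 603B tokens and ignores any split between input, output, and cached tokens or the Fast Mode multiplier, none of which the summary breaks out.

```python
def usd_per_million_tokens(total_usd: float, total_tokens: float) -> float:
    """Blended dollar cost per one million tokens."""
    return total_usd / total_tokens * 1_000_000


SPEND_USD = 1_300_000        # $1.3M, from the screenshot
TOKENS = 603_000_000_000     # 603B tokens, from the screenshot

blended = usd_per_million_tokens(SPEND_USD, TOKENS)
print(f"blended rate: ${blended:.2f} per million tokens")
```

That works out to a little over $2 per million tokens on average, a useful sanity check on how the headline number relates to per-token pricing.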

At ~11:09 Theo explains why he caps his own automations: prompt injection + supply-chain risk make “agent does X automatically” a real attack surface. Garry Tan's $10K/month thesis and GBrain at ~14:09. Anthropic's interactive-vs-programmatic billing split + the -p flag ban at ~22:19. Mark Cuban's token-tax framing at ~40:29. Closes at ~49:35 with Hashimoto's take on AI psychosis and the Bun Zig-to-Rust rewrite as evidence ^{[24]Nerd Snipe — How the OpenClaw creator uses $1.3 million of tokens}.

Tools: Codex, Codex Bar, Claw Sweeper, GBrain, Bun

Podcast

Matt Williams

Matt Williams & Ryan: a chat on May 19

Matt Williams and Ryan's May 19 chat ranges across whether “AI engineer” is the new DevOps, conference vector-DB tooling, a 1M-node Datadog integration graph in Obsidian, NousResearch Hermes heartbeat agents, Matt Pocock's skill library, Ollama 0.30 reverting to llama.cpp, and a hardware corner (broken A7R5, M5 Max MacBook Pro) ^{[25]Matt Williams — Matt and Ryan have a chat on May 19, 2026}.

Opens at ~00:00 with “is AI engineer the new DevOps” and the AI Engineer conference vibe-check. At ~11:08 product theater and transcripts-as-conference-vector-DBs. DevOps Days Boston war story at ~15:14: $100k hotel + $60k microwave uplink.

Obsidian as second brain at ~19:20: a 1M-node Datadog integration graph + LLM-driven pruning. NousResearch Hermes heartbeats at ~29:28, burning 30M tokens on RSS-able background tasks. Matt Pocock's skill library at ~38:34: grill-me, grill-with-docs, ubiquitous language, PowerShell vs cmd. Hardware corner at ~44:40: broken A7R5, Sony A7R6 pre-order, M5 Max MacBook Pro. Ollama 0.30 reverting to llama.cpp + MLX 2x speedup + Toto v2 at ~50:46. Closes at ~55:51 on new-machine hygiene: dotfiles, pnpm vs npm, decision logs for brew ^{[25]Matt Williams — Matt and Ryan have a chat on May 19, 2026}.

Tools: NousResearch Hermes, Ollama 0.30, MLX, Toto v2, Obsidian, Matt Pocock skills

Podcast

Dwarkesh Patel

Dwarkesh — David Reich on the Middle Paleolithic revolution

Dwarkesh Patel and geneticist David Reich on the Middle Paleolithic revolution: the standard narrative focuses on the cognitive shift 50–100k years ago, but Reich argues the transition 300–400k years ago to mined, transported flint cores may be the bigger evolutionary inflection ^{[26]Dwarkesh Patel — The Stone Age Breakthrough Hiding in Plain Sight - David Reich}.

At ~00:00 Reich lays out the orthodox view of an event around 50k years ago; at ~02:00 he reframes the shift 300–400k years ago as evidence of long-distance planning and proto-trade: humans were already moving worked flint across continents long before the symbolic explosion ^{[26]Dwarkesh Patel — The Stone Age Breakthrough Hiding in Plain Sight - David Reich}.

Industry Hot Take

Lenny's Podcast

Lenny's Podcast: re-industrialization is now a national-security story

A short Lenny's Podcast clip arguing that “there's more change in war than there is in consumer electronics in the next 2 years.” The guest cites the daily iteration cycle on Ukrainian drone designs as proof that hardware velocity has shifted to defense, and argues for re-industrialization as a national-security imperative ^{[27]Lenny's Podcast — The impact of war on the hardware industry}.

Productivity

EO

EO: ballet trained a $22B founder for delayed gratification

EO short with the founder of a $22B company on how childhood ballet conditioned her for delayed-reward work: rehearsing a year for one hour on stage taught her to accept slow compounding, and a teenage 3-year goal-setting habit became the through-line ^{[28]EO — What Ballet Taught the Founder of a $22B Company}.

Podcast

DeepLearningAI

Andrew Ng at AI Dev 26 x SF: the future of software engineering

Andrew Ng's AI Dev 26 keynote argues software is now LEGO bricks: more building blocks, more AI assemblers, and small generalist teams shipping at 10–100x speed. The downstream effect is that PM, design, legal, marketing, and sales become the new bottlenecks — and he closes by announcing Code Dream, his conversational learning environment on Codoji ^{[29]DeepLearningAI — AI Dev 26 x SF: Andrew Ng: The Future of Software Engineering}.

Opens at ~00:07 on LEGO bricks: LLMs, RAG, agentic workflows, UI components, databases, auth — all combinable at unprecedented speed. At ~02:10 he reframes the “% AI vs human coding” debate: his work is 100% AI-written, and once humans must hand-review every line, review itself becomes the bottleneck.

The PM-bottleneck argument at ~04:11: classic 1:8 PM-to-engineer ratios collapse toward 1:1 and into a single generalist who shapes products and codes. Hiring at ~06:13: AI-native engineers must use coding agents (Claude Code, Gemini, Codex, OpenCode), know the building blocks deeply, and have generalist PM skills.

Downstream bottlenecks at ~08:15: design, legal, marketing, sales — and he pushes back on the “job apocalypse” framing. Context Hub announcement at ~11:17 for up-to-date API docs. Code Dream announcement at ~14:19: conversational learning on Codoji ^{[29]DeepLearningAI — AI Dev 26 x SF: Andrew Ng: The Future of Software Engineering}.

Tools: Claude Code, Gemini, OpenAI Codex, OpenCode, Context Hub, Code Dream, Codoji

Podcast

DeepLearningAI

Paige Bailey at AI Dev 26 x SF: what's new and what's next in AI

Paige Bailey's AI Dev 26 talk rolls through recent Google model releases and demos: AI Studio's YouTube video understanding + Compare Mode, Gemini 3.1 Flash Live (screen share, multilingual voice, camera), Gemma 4 open models (Apache 2.0, 2B–31B), AI Studio Build for voice-driven app creation and Lyria 3 music, Genie 3 world models with the “cat-on-jetpack” demo, and Veo 3.1 video generation including a Chick-fil-A ad recreation ^{[30]DeepLearningAI — AI Dev 26 x SF | Paige Bailey: What's New and What's Next in AI}.

Roll call at ~00:07. AI Studio YouTube understanding + Compare Mode at ~03:09. Gemini 3.1 Flash Live at ~11:13: screen share + multilingual voice + camera in one session. Gemma 4 open models at ~17:24. AI Studio Build at ~20:28 with Lyria 3 music. Genie 3 at ~27:31: the cat-on-jetpack world-model demo. Veo 3.1 at ~35:45 with the Chick-fil-A ad recreation ^{[30]DeepLearningAI — AI Dev 26 x SF | Paige Bailey: What's New and What's Next in AI}.

Tools: AI Studio, Gemini 3.1 Flash Live, Gemma 4, AI Studio Build, Lyria 3, Genie 3, Veo 3.1

Podcast

DeepLearningAI

Marc Manara (OpenAI) at AI Dev 26 x SF: a fireside chat

Marc Manara from OpenAI's startup partnerships team on what OpenAI optimizes for in coding (preambles, token efficiency, long-horizon tool calling), what's still brittle (ambiguous intent, tool selection at scale), where enterprise adoption is moving fastest (legal, healthcare, vertical scale-up software), and what the next unlock is: trusting Codex on multi-hour trajectories until the human becomes the bottleneck ^{[31]DeepLearningAI — AI Dev 26 x SF | A Fireside Chat with OpenAI's Marc Manara}.

Pre-launch model tuning at ~00:07. What OpenAI optimizes for at ~03:09: preambles, tokens, long-horizon tool calling. Still brittle at ~07:11: ambiguous intent and tool selection. Next unlock at ~10:13: Codex on multi-hour trajectories, humans as bottleneck. Startup-taste argument at ~11:14: 5–10 person teams hitting tens of millions ARR. Enterprise verticals at ~17:17: scale-up software, Harvey + Legora, Abridge + Ambience. Closes at ~22:21: the abstraction layer moves up, “engineer” expands ^{[31]DeepLearningAI — AI Dev 26 x SF | A Fireside Chat with OpenAI's Marc Manara}.

Tools: OpenAI Codex, Harvey, Legora, Abridge, Ambience

Podcast

DeepLearningAI

Jeff Huber (Chroma) at AI Dev 26 x SF: agentic search and Context 1

Jeff Huber (Chroma) on agentic search and the new “Context 1” model: context is the underrated half of AI capability, long context windows aren't the answer (context rot), and agentic search needs both read and write paths with continuous learning at the context layer ^{[32]DeepLearningAI — AI Dev 26 x SF | Jeff Huber: Everything You Need to Know About Agentic Search}.

Chroma background at ~00:07. The thesis “AI = context + reasoning” at ~03:08. Context rot at ~07:09: long context windows don't solve the problem. Agentic search read/write paths at ~12:12. Context 1 model at ~15:14: small, fast, cheap, agentic-search-shaped. Three predictions at ~20:16: continuous context, extreme speed, continual learning at the context layer ^{[32]DeepLearningAI — AI Dev 26 x SF | Jeff Huber: Everything You Need to Know About Agentic Search}.

Tools: Chroma, Context 1, agentic search

Podcast

DeepLearningAI

Adit Abraham (Reducto) at AI Dev 26 x SF: better agents with better data

Adit Abraham (Reducto) on why PDF processing is still hard, the industry shift from chatbots to action-based agents, and how Reducto's agentic OCR pipeline (VLMs + speculative decoding, dynamic Markdown vs HTML output, Deep Extract) outperforms traditional CV on enterprise documents ^{[33]DeepLearningAI — AI Dev 26 x SF | Adit Abraham: Better Agents with Better Data}.

Reducto overview + data-bottleneck framing at ~00:07. Industry shift at ~04:08: chatbots → action-based agents. PDF complexity at ~07:11: silent failures + enterprise edge cases. CV vs VLM with speculative decoding at ~10:12. Formatting for the consumer at ~13:14: dynamic Markdown vs HTML tables for RAG retrieval. Agent harnesses + Deep Extract at ~19:16 ^{[33]DeepLearningAI — AI Dev 26 x SF | Adit Abraham: Better Agents with Better Data}.

Tools: Reducto, Deep Extract, VLMs, speculative decoding

Podcast

DeepLearningAI

Aditi Gupta (Redis) at AI Dev 26 x SF: SRE agents with the Redis Context Engine

Aditi Gupta (Redis) on building SRE agents with the Redis Context Engine: enterprise Redis is complex enough that vanilla LLMs + web search are unsafe, so the team built a multi-agent system (Knowledge, Chat, Deep Triage) with semantic caching, hybrid search, and proactive scheduling ^{[34]DeepLearningAI — AI Dev 26 x SF | Aditi Gupta: Building SRE Agents with the Redis Context Engine}.

Problem framing at ~00:07. Why vanilla LLMs are unsafe at ~02:09. Knowledge base foundation at ~05:13: chunking + metadata filtering + vector storage. Multi-agent architecture at ~09:19: Knowledge + Chat + Deep Triage with MapReduce. Model tiering + semantic caching + context-window mgmt at ~13:23. Agent memory server + hybrid search + citations + proactive scheduling at ~22:26 ^{[34]DeepLearningAI — AI Dev 26 x SF | Aditi Gupta: Building SRE Agents with the Redis Context Engine}.
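To make the semantic-caching idea concrete, here is a minimal sketch: answers are cached under an embedding of the question, and a new question reuses a cached answer when it is similar enough. Everything in it is an assumption for illustration. The bag-of-words `embed` stands in for a real embedding model, the in-memory list stands in for a vector store, and the 0.8 threshold and all names are hypothetical rather than taken from the talk or from any Redis API.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # hypothetical similarity cutoff
        self.entries = []  # list of (vector, answer) pairs

    def get(self, query: str) -> Optional[str]:
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))


calls = 0


def expensive_llm(query: str) -> str:
    """Placeholder for a model call; counts invocations."""
    global calls
    calls += 1
    return f"answer to: {query}"


def ask(cache: SemanticCache, query: str) -> str:
    cached = cache.get(query)
    if cached is not None:
        return cached
    answer = expensive_llm(query)
    cache.put(query, answer)
    return answer


cache = SemanticCache()
first = ask(cache, "how do I resize a redis cluster")
second = ask(cache, "how do I resize a redis cluster safely")  # near-duplicate
third = ask(cache, "what is the default eviction policy")  # unrelated
```

In this run the near-duplicate question is served from the cache, so only two model calls are made for three questions. The threshold is the main tuning knob: too low and unrelated questions share answers, too high and the cache rarely hits.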

Tools: Redis Context Engine, semantic caching, hybrid search, MapReduce

Podcast

DeepLearningAI

Eli Schilling at AI Dev 26 x SF: agent context & memory on Oracle AI Database

Eli Schilling at AI Dev 26 builds a research-paper assistant in a live notebook on Oracle's converged database, walking through short-term + long-term memory tables, the agent loop, context engineering (token budgets, summarization, offloading), and benchmark results comparing memory-equipped vs naive agents ^{[35]DeepLearningAI — AI Dev 26 x SF | Eli Schilling: Hands On Agent Context & Memory Engineering with Oracle AI Database}.

Why agents need memory at ~00:00. Data types + Oracle's converged DB at ~10:11. Memory architecture (short, long, agent loop) at ~17:14. Live notebook (7 memory tables) at ~22:20. Context engineering at ~37:30: token budgets, summarization, offloading. Benchmark vs naive agent at ~48:38 ^{[35]DeepLearningAI — AI Dev 26 x SF | Eli Schilling: Hands On Agent Context & Memory Engineering with Oracle AI Database}.
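The token-budget and offloading ideas can be sketched in a few lines. This is an illustrative outline, not the notebook from the talk: it assumes a crude whitespace token count, keeps the newest messages that fit the budget, moves older ones into an archive list (standing in for a long-term memory table), and leaves a placeholder where a real system would put a generated summary. All names are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def build_context(messages: list, budget: int, archive: list) -> list:
    """Keep the newest messages that fit in `budget`; offload the rest.

    `budget` covers the verbatim messages only; the placeholder line
    added for offloaded history is not counted against it.
    """
    kept = []
    used = 0
    split = len(messages)  # index of the oldest message kept verbatim
    for i in range(len(messages) - 1, -1, -1):
        cost = count_tokens(messages[i])
        if used + cost > budget:
            break
        kept.append(messages[i])
        used += cost
        split = i
    older = messages[:split]
    archive.extend(older)  # offload: stands in for a long-term memory table
    kept.reverse()  # restore chronological order
    if older:
        # A real pipeline would insert a model-written summary here.
        kept.insert(0, f"[summary placeholder: {len(older)} earlier messages offloaded]")
    return kept


archive = []
history = ["a b c", "d e", "f g h i", "j"]
context = build_context(history, budget=5, archive=archive)
```

With a budget of 5, the two newest messages stay verbatim and the two oldest move to the archive, where they remain retrievable instead of being silently dropped.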

Tools: Oracle AI Database, converged database, agent memory

Podcast

DeepLearningAI

Nyah Macklin at AI Dev 26 x SF: auditable AI agents using context graphs

Nyah Macklin frames the “AI said so” auditability problem: 95% of agent projects fail because of fractured context, and knowledge graphs — with relationships as first-class citizens — outperform tables and vectors on auditability. Cites Jiang et al. (IEEE 2026) for 37% → 91% accuracy via Graph RAG, then demos a live auditable credit decision with causal traces ^{[36]DeepLearningAI — AI Dev 26 x SF | Nyah Macklin: The AI Said So? How to Build Auditable AI Agents Using Context Graphs}.

The “AI said so” problem at ~00:07. 95% failure rate framing at ~03:10. KG vs tables vs vectors at ~06:12. Graph RAG research (Jiang et al. 2026, 37% → 91%) at ~09:14. Context graphs defined at ~12:17. Auditable credit decision demo at ~17:25 ^{[36]DeepLearningAI — AI Dev 26 x SF | Nyah Macklin: The AI Said So? How to Build Auditable AI Agents Using Context Graphs}.

Tools: Graph RAG, context graphs, knowledge graphs

Podcast

DeepLearningAI

Pratik Verma at AI Dev 26 x SF: observability agent that finds & fixes agent issues

Pratik Verma's Okahu observability talk: agents fail on edge cases more than logic errors, so the right primitive is agentic tracing (Project Monocle, open-source) feeding a knowledge graph + LLM-as-judge evals, with CI/CD integration that auto-creates issues and runs Claude fix loops ^{[37]DeepLearningAI — AI Dev 26 x SF | Pratik Verma: Observability Agent to Find & Fix Issues in AI Agents}.

Why agents fail at ~00:07: edge cases, not logic. Project Monocle at ~02:07: open-source agentic tracing. Okahu observability agent + KG at ~04:07. Silent failures + LLM-as-judge at ~06:07. CI/CD + Claude fix loop at ~08:08. Full loop at ~11:12 ^{[37]DeepLearningAI — AI Dev 26 x SF | Pratik Verma: Observability Agent to Find & Fix Issues in AI Agents}.

Tools: Okahu, Project Monocle, LLM-as-judge

Podcast

DeepLearningAI

Eda Zhou & Mahdi Ghodsi at AI Dev 26 x SF: personal AI agents with open-source models

Eda Zhou and Mahdi Ghodsi walk through building personal AI agents with open-source models — the LLM-vs-agent gap, the three components (model + runtime + tools), deploying Qwen 3.5 20B on AMD GPUs with vLLM, Open Claude onboarding with personality files, a live code-debug demo with reusable skills, and a multi-agent morning-briefing workflow ^{[38]DeepLearningAI — AI Dev 26 x SF | Eda Zhou & Mahdi Ghodsi: Building Personal AI Agents with Open Source Models}.

LLM vs agent at ~02:08. Three components at ~04:10. Qwen 3.5 20B on AMD + vLLM at ~07:12. Open Claude onboarding + personality at ~11:17. Live demo at ~18:24. Multi-agent morning briefing at ~27:36 ^{[38]DeepLearningAI — AI Dev 26 x SF | Eda Zhou & Mahdi Ghodsi: Building Personal AI Agents with Open Source Models}.

Tools: Qwen 3.5 20B, vLLM, AMD GPUs, Open Claude

Podcast

DeepLearningAI

William Imoh & Charlie Wood at AI Dev 26 x SF: closing the care gap

William Imoh and Charlie Wood on closing the “care gap”: the chart-prep burden and readmission risk for high-risk patients, why rules-based EMR systems fall short, and why a local vector DB + four-agent pipeline (Context, Risk, Protocols, Brief) running on-prem can ship a pre-visit brief for a high-risk patient ^{[39]DeepLearningAI — AI Dev 26 x SF | William Imoh & Charlie Wood: Closing the Care Gap}.

The care gap at ~00:07. Rules-based EMR limits at ~02:08. Architecture (hybrid queries + HNSW + on-prem embedding) at ~05:11. Four-agent pipeline at ~11:14. Live demo at ~21:27. Edge RAG takeaways at ~29:35 ^{[39]DeepLearningAI — AI Dev 26 x SF | William Imoh & Charlie Wood: Closing the Care Gap}.

Tools: Edge RAG, HNSW, hybrid search

Podcast

DeepLearningAI

Jean-Marie John-Mathews at AI Dev 26 x SF: systematic red-teaming for LLM apps

Jean-Marie John-Mathews on systematic red-teaming for LLM apps: the Chipotle chatbot case, a taxonomy of intentional attacks vs legitimate-use failures, why LLM-as-judge falls short for agentic systems, multi-turn failure examples (frustrated customers, hidden tool-call errors), and a demo of Just Catch — an open-source skill that converts natural-language requirements into a test suite ^{[40]DeepLearningAI — AI Dev 26 x SF: Jean-Marie John-Mathews: Red Teaming LLM Applications Systematically}.

Chipotle chatbot case at ~00:07. LLM risk taxonomy at ~01:07. Why LLM-as-judge falls short at ~03:09. Multi-turn failures at ~05:11. Just Catch open-source skill at ~08:13. Live demo at ~10:15 ^{[40]DeepLearningAI — AI Dev 26 x SF: Jean-Marie John-Mathews: Red Teaming LLM Applications Systematically}.

Tools: Just Catch, red-teaming, LLM-as-judge

Podcast

AI Engineer

Patrick Löber (Google DeepMind) at AI Engineer: any-to-any native multimodal agents

Patrick Löber (Google DeepMind) walks the “any-to-any” multimodal architecture: phase-1 multimodal understanding (PDFs, video, audio), phase-2 agentic loop with function calling for multimodal generation, why native generation gives access to LM world-knowledge, the Live API's single audio-to-audio architecture, and new multimodal embeddings + Gemma 4 local agents ^{[10]AI Engineer — Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind}.

What any-to-any means at ~00:17. Phase 1 multimodal understanding at ~03:20. Phase 2 agentic loop at ~07:24. Why native generation matters at ~11:25. Live API single architecture at ~13:28. Multimodal embeddings + Gemma 4 at ~15:00 ^{[10]AI Engineer — Any-to-Any: Building Native Multimodal Agents - Patrick Löber, Google DeepMind}.

Tools: Gemini Omni, Live API, multimodal embeddings, Gemma 4

Podcast

AI Engineer

Marc Klingen (Clickhouse) at AI Engineer: skilling up coding agents for Langfuse

Marc Klingen (Clickhouse/Langfuse) on skilling-up coding agents: skills are Rubik's-cube manuals (reliable shortcuts), the problem is stale pre-training context + 478 pages of docs, six concrete learnings (traces, CLI help flags, agent sitemaps, RAG search), basic eval setup with LLM-as-judge over filesystem state, auto-research where agents improve their own skills, and the open problems of skill versioning and distribution ^{[41]AI Engineer — Skill issue: Lessons from skilling up coding agents to use Langfuse - Marc Klingen, Clickhouse}.

Rubik's-cube framing at ~00:15. The problem at ~03:18: stale pre-training + 478 pages of docs. Six learnings at ~05:19: traces, CLI help, sitemaps, RAG search. Basic eval setup at ~12:24. Auto-research at ~14:24. Open problems at ~17:24: versioning, distribution, target-definition ^{[41]AI Engineer — Skill issue: Lessons from skilling up coding agents to use Langfuse - Marc Klingen, Clickhouse}.

Tools: Langfuse, agent skills, LLM-as-judge

Podcast

AI Engineer

Cormac Brick (Google) at AI Engineer: fine-tuning tiny LLMs for on-device agents (46 → 90%)

Cormac Brick (Google AI Edge) on fine-tuning tiny LLMs for on-device agents: the two on-device GenAI patterns (system-level Gemini Nano vs app-level LiteRT/LiteRT-LM), FunctionGemma's 270M-parameter robust function calling, and how a synthetic-data fine-tuning workflow took function-calling accuracy from 46% to 90%+ ^{[42]AI Engineer — From 46% to 90%: Fine-Tuning Tiny LLMs for On-Device Agents — Cormac Brick, Google}.

AI Edge stack at ~00:15: system vs app GenAI. LiteRT runtime at ~01:15: 2.7B devices, CPU/GPU/NPU. Agent skills demo at ~06:19: AI Edge Gallery app on Gemma 4. LiteRT-LM runtime + export at ~10:20. FunctionGemma at ~13:24: 270M-parameter robust function calling. Fine-tuning workflow (46% → 90%+) at ~14:24 ^{[42]AI Engineer — From 46% to 90%: Fine-Tuning Tiny LLMs for On-Device Agents — Cormac Brick, Google}.

Tools: LiteRT, LiteRT-LM, FunctionGemma, Gemini Nano, AI Core

Developer Tools

The AI Daily Brief: Artificial Intelligence News

AI Daily Brief: 9 Codex tips from the Codex team

AI Daily Brief distills Jason Lou's “Codex maxing” post into nine practical tips: long-running mono-threads, voice + “the art of the ramble,” mid-run steering, structured memory in an Obsidian vault, computer/browser tools, remote control + mobile Codex, heartbeat check-ins, /goal for verifiable success criteria, and the side panel as workspace ^{[43]The AI Daily Brief: Artificial Intelligence News — 9 Codex Tips from the Codex Team}.

Tip 1 mono-threads at ~12:08. Tip 2 voice ramble at ~14:10. Tip 3 mid-run steering at ~15:11. Tip 4 Obsidian memory at ~16:12. Tip 5 tools (computer/browser/connectors) at ~19:12. Tip 6 mobile Codex at ~20:14. Tip 7 heartbeats at ~21:14. Tip 8 /goal at ~22:15. Tip 9 side panel at ~23:16 ^{[43]The AI Daily Brief: Artificial Intelligence News — 9 Codex Tips from the Codex Team}.

Tools: Codex, /goal, Obsidian, heartbeats, mobile Codex

AI Models

marimo

Kimi K2.6 via marimo: the end of slow LLMs

marimo highlights Kimi K2.6 as a speed milestone: a 100-line rhyming poem in ~9.2 seconds at ~120 tokens/sec on the Weights & Biases inference engine running on CoreWeave. The takeaway is that the “slow LLM” era for medium-context generation is ending ^{[44]marimo — Kimi K2.6 - The End of Slow LLMs}.
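The quoted figures are easy to cross-check: tokens generated is simply rate times time. The sketch below takes the ~120 tokens/sec and ~9.2 seconds at face value, which puts the poem at roughly 1,100 tokens, or about 11 tokens per line. The function name is illustrative.

```python
def tokens_generated(tokens_per_sec: float, seconds: float) -> float:
    """Total tokens produced at a steady decode rate."""
    return tokens_per_sec * seconds


# Approximate figures from the demo, treated as exact for the arithmetic.
total = tokens_generated(120, 9.2)
per_line = total / 100  # the poem has 100 lines
print(f"~{total:.0f} tokens total, ~{per_line:.0f} tokens per line")
```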

Tools: Kimi K2.6, Weights & Biases inference, CoreWeave

AI Tools Developer Tools

Better Stack, marimo

Tool demos: Understand-Anything for codebase maps, marimo for reactive graphs

Two quick tool demos from today's feed. Better Stack profiles Understand-Anything, an open-source tool that turns any codebase into a queryable knowledge graph using static analysis + multi-agent LLM processing — 14k+ GitHub stars, positioned as the “codebase MRI” before refactoring something you didn't author ^{[45]Better Stack — This AI Tool Maps Any Codebase Before You Touch It (Understand-Anything)}. marimo separately demos a reactive graph widget where selections in the visual graph are reactive to Python — hover state, multi-select, and programmatic node additions all flow bidirectionally between Python state and the rendered graph ^{[46]marimo — Wayyy Better Graphs}.

Tools: Understand-Anything, marimo, knowledge graphs, reactive widgets

Productivity

Artem Zhutov

Stop writing Markdown in Obsidian — use AI-generated HTML artifacts

Artem Zhutov argues Markdown becomes a limiting format as notes grow in complexity, and proposes using AI-generated HTML artifacts inside Obsidian instead — unlocking interactive dashboards, dynamic tables, and richer documents while preserving Obsidian's local-first workflow ^{[47]Artem Zhutov — Stop Writing Markdown in Obsidian. Do This Instead}.

Tools: Obsidian, HTML artifacts, AI dashboards

Productivity

Real Python, Sequoia Capital

Custom LLM skills as libraries — and the automation-must-be-easier law

Two short productivity clips that rhyme. Real Python: build personal LLM skills the way you build libraries, so you stop re-teaching the model the same workflow on every new project — a framing that dovetails with Marc Klingen's AI Engineer talk on skills-as-Rubik's-cube manuals ^{[48]Real Python — Build Custom LLM Skills to Save Hours of Work}. Sequoia with Jake Stauch (Serval): the most common failure mode for enterprise automation is that building the automation isn't easier than just doing the manual task — the “skill-as-library” framing is one of the few moves that flips that inequality ^{[49]Sequoia Capital — The simple test most automation platforms fail | Jake Stauch, Serval}.

Tools: LLM skills, Serval, enterprise automation

Developer Tools

Arjay McCandless

System design walkthrough: pre-signed URLs for file upload

Arjay McCandless's system-design walkthrough on building a Google Drive-style upload service. Key insight: skip your API/server entirely on the file-upload path and have clients upload directly to S3 via pre-signed URLs, then notify your service to update metadata ^{[50]Arjay McCandless — System Design: Google Drive}.
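The flow can be sketched end to end with the standard library. Note the loud assumption: this is a toy HMAC-signed URL, not AWS's actual Signature Version 4 scheme, and `storage.example.com`, the secret, and every function name are hypothetical. It only shows the shape of the technique: the API service signs a short-lived URL, the client uploads straight to storage, storage verifies the signature and expiry, and the client then notifies the service so it can update metadata.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-signing-key"  # hypothetical key known to the signer and to storage


def _sign(method: str, key: str, expires: int) -> str:
    msg = f"{method}\n{key}\n{expires}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def presign_upload(key: str, ttl_seconds: int = 300, now: Optional[float] = None) -> str:
    """API service: issue a time-limited upload URL without touching file bytes."""
    current = time.time() if now is None else now
    expires = int(current) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _sign("PUT", key, expires)})
    return f"https://storage.example.com/{key}?{query}"


def storage_put(url: str, body: bytes, bucket: dict, now: Optional[float] = None) -> bool:
    """Storage layer: accept the upload only if the URL is unexpired and untampered."""
    parsed = urlparse(url)
    key = parsed.path.lstrip("/")
    params = parse_qs(parsed.query)
    expires = int(params.get("expires", ["0"])[0])
    signature = params.get("signature", [""])[0]
    current = time.time() if now is None else now
    if current > expires:
        return False
    if not hmac.compare_digest(signature, _sign("PUT", key, expires)):
        return False
    bucket[key] = body
    return True


def record_upload(metadata: dict, key: str, size: int) -> None:
    """API service: the client calls this after a successful direct upload."""
    metadata[key] = {"size": size, "status": "uploaded"}


bucket, metadata = {}, {}
url = presign_upload("user-42/report.pdf", now=1000)
payload = b"file bytes"
ok = storage_put(url, payload, bucket, now=1100)  # direct client-to-storage upload
if ok:
    record_upload(metadata, "user-42/report.pdf", len(payload))
```

The point of the design is that the large file body never passes through the API servers. They handle only the small signing and metadata requests, while an expired or edited URL is rejected by the storage layer.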

Tools: S3 pre-signed URLs, system design

Industry

OpenAI

Abridge on GPT-5.5 for clinical documentation

OpenAI customer story with Abridge engineering manager Matt Sanders. GPT-5.5 noticeably improves fact extraction from doctor-patient conversations, particularly when the same topic resurfaces multiple times at varying depths during a visit — a long-standing failure mode for clinical documentation models ^{[51]OpenAI — Built with GPT-5.5: Abridge Clinical AI Notes}.

Tools: GPT-5.5, Abridge

Hot Take

AI News & Strategy Daily | Nate B Jones

Nate B Jones: teens lean on AI companions while schools haven't reckoned with AGI

Two Nate B Jones shorts on the same axis. First: three-quarters of teenagers now use AI companions for emotional support, in some cases as their primary source of connection — his argument being that a chatbot “can't model empathy because it doesn't have anything to lose” ^{[52]AI News & Strategy Daily | Nate B Jones — How ChatGPT Became Teenagers' Best Friend}. Second: 2 billion kids attend schools designed for a 20th-century industrial economy, while Nature has published a peer-reviewed argument that AGI has arrived and 86% of students globally now use AI in coursework — the curriculum gap is a now-problem, not a 5-year problem ^{[53]AI News & Strategy Daily | Nate B Jones — The calculator moment nobody's talking about in education}.

Hot Take

Last Week in AI

LWIA clip: bracing for the bio-weapon version of Mythos

Short Last Week in AI clip: society shouldn't be “shocked” by Mythos-style AI-PR-stunt cyber events anymore, and the next inflection, a bio-weapon-shaped incident, is foreseeable. The segment makes concrete predictions rather than hand-waving ^{[54]Last Week in AI — The "bio-weapon version" of Mythos}. Reads as a companion piece to Nate B Jones's argument above that society is consistently a step behind the actual deployment curve.


Sources
