Sam fleeces Microsoft; Boris says coding is solved

May 4, 2026

24 topics · 39 sources

Industry Hot Take
Theo - t3.gg Sherwood Snacks

Microsoft & OpenAI Officially Break Up — and AWS Lands the $50B Pivot

OpenAI's amended Microsoft agreement kills cloud exclusivity, retires the AGI clause, and (according to Theo's reading of the fine print) flipped reverse revenue-share into a profit share — meaning unprofitable OpenAI now likely owes Microsoft "jack f***ing s***."[1]Theo: Microsoft and OpenAI break up The same day, OpenAI announced a $50B AWS investment, 2GW of Trainium capacity, and a stateful runtime co-built on Bedrock[1]Theo: Microsoft and OpenAI break up — the deal that couldn't happen until exclusivity was gone. Theo's verdict: "Sam is one of the greatest negotiators of all time because he fleeced Microsoft."

How we got here

~02:22 The original 2019 deal made Azure OpenAI's exclusive cloud and licensed all of OpenAI's pre-AGI IP to Microsoft, but "AGI" was never defined, which meant in practice the agreement had no off-ramp. Microsoft eventually put more than $13B into OpenAI for a stake now worth ~$135B, or about 27% on a fully diluted basis.

The o1 yelling incident

~07:30 Theo identifies the September 2024 o1 launch as the actual fracture: OpenAI refused to hand chain-of-thought details to Microsoft, despite IP-sharing terms requiring it. The leaked yelling incident — Sam Altman pressuring OpenAI staff (including then-CTO Mira Murati) to ship faster to Redmond — "was the start of the end."

OpenAI built this with 250 people, Nadella said. Why do we have Microsoft Research at all? He was pissed.

The 2026 restructuring

~15:48 Microsoft becomes "primary" not "exclusive" cloud, OpenAI ships products on any cloud, the IP license extends to 2032 but loses exclusivity, and Microsoft loses right of first refusal on compute. Microsoft no longer pays revenue share to OpenAI for serving its models — but the reverse direction was reportedly converted from revenue share to profit share, which on OpenAI's economics means roughly nothing.

Sam is one of the greatest negotiators of all time because he fleeced Microsoft.

The AWS pivot

~19:08 AWS becomes the exclusive third-party cloud distributor for OpenAI's Frontier agent platform, OpenAI commits to 2GW of Trainium, and AWS gets to co-build a stateful runtime on Bedrock, something Theo says would be "hellish to try and implement" on Azure. The existing $38B AWS commitment is extended by $100B over eight years across Trainium 3 and Trainium 4 (FP4 compute, more HBM, shipping 2027).

Why Anthropic was already winning enterprise

~21:14 Theo's enterprise thesis: Anthropic's reported $30B run rate isn't because Claude is better — it's because Claude is on Bedrock, Vertex, and Azure, while OpenAI was Azure-locked. He also flags that AWS, GCP, and Azure startup credits cannot be used on Anthropic models because revenue-share to the clouds is so aggressive (he estimates ~50%) that the clouds would lose real money giving away free Anthropic compute.

OpenAI is petrified of the enterprise growth at Anthropic outpacing their own enterprise growth.

Azure's performance crash-out

~27:00 Theo had $1M in Azure credits he refused to spend because Azure-hosted GPT-5.4 was 2.2× slower on average and up to 15× slower at the tail (sometimes 0.3–2 tok/s vs 70+ on OpenAI direct). After he shipped azure.t3.gg as a public benchmark, Microsoft asked him to delete it. He did, but warned he would repost it in 15 days. By noon the next day the bugs were fixed.

I guess bullying works.

The coming silicon war

~33:25 OpenAI on Trainium will be its first non-NVIDIA deployment — a real risk given Anthropic's reported quality regressions on the same hardware. Theo predicts the next year's AI war is NVIDIA vs AMD vs Trainium vs TPU, with only NVIDIA and Google positioned across both layers.

Market context

The breakup lands inside a broader capex frenzy: Sherwood notes Alphabet, Amazon, Meta, and Microsoft together are planning $700B+ in 2026 capex, while Apple's capex actually fell 36% last quarter and Apple chose Google's Gemini to power Siri rather than build a frontier model.[36]Sherwood: Market's '90s flashback

Tools: AWS Bedrock, AWS Trainium 3/4, OpenAI Frontier, Azure OpenAI Service, GPT-5.4, o1, NVIDIA GPUs, Google TPU, AMD
Industry
Anthropic

Anthropic Lines Up Blackstone, Goldman & Hellman to Service the Mid-Market

Anthropic is co-founding a new enterprise AI services firm with Blackstone, Hellman & Friedman, Goldman Sachs, General Atlantic, Leonard Green, Apollo, GIC, and Sequoia.[2]Anthropic: New enterprise AI services company The new entity embeds Anthropic applied AI engineers directly with mid-market client teams — community banks, mid-sized manufacturers, regional health systems — segments underserved by Anthropic's existing partners (Accenture, Deloitte, PwC). No funding amount was disclosed; CFO Krishna Rao is the public face.

The investor consortium spans private equity (Blackstone, H&F, Leonard Green, Apollo), investment banking (Goldman Sachs), growth equity (General Atlantic, Sequoia), and sovereign wealth (Singapore's GIC). The structure is unusual — a separately-capitalized services firm rather than another consulting partnership — and the implicit message is that Anthropic believes the consulting bottleneck on Claude adoption is real and won't be solved by simply onboarding more Big Four partners.

Enterprise demand for Claude is significantly outpacing any single delivery model. This new firm brings additional operating capability to the ecosystem.

Pair this with Theo's read on AWS+Bedrock[1]Theo on Anthropic enterprise dominance and Anthropic's enterprise lead becomes a deliberate two-front strategy: ubiquitous cloud distribution + a captive services arm to land the mid-market that won't pay Deloitte rates.

Tools: Claude
Developer Tools Industry
OpenAI

OpenAI's WebRTC Stack Behind 900M Weekly Voice Users

Yi Zhang and William McDonald walk through how ChatGPT voice and the Realtime API serve 900M+ WAU on a custom relay-plus-transceiver WebRTC stack, written in Go, deployed on Kubernetes, with Cloudflare geo-steering for signaling and a Global Relay fleet for media ingress.[3]OpenAI: Low-latency voice AI at scale The post is unusually concrete on the Linux-level optimizations they used to avoid kernel-bypass.

Why WebRTC, why a transceiver (not an SFU)

WebRTC handles ICE/DTLS/SRTP, codec negotiation, jitter buffering, and echo cancellation — solved problems OpenAI didn't want to re-implement per platform. Because nearly every session is 1:1 (one user, one model), they chose a transceiver model over a Selective Forwarding Unit: the transceiver terminates the WebRTC session at the edge and converts media to simpler internal protocols, so backend inference services scale like ordinary HTTP services.

The Kubernetes problem and the relay split

Conventional WebRTC reserves one public UDP port per session, which doesn't survive Kubernetes autoscaling. OpenAI split routing from termination: a stateless relay exposes a tiny fixed UDP surface (e.g. 203.0.113.10:3478) and forwards packets without decrypting media, while transceivers own ICE/DTLS/SRTP state. Routing metadata is encoded into the ICE username fragment (ufrag), avoiding any hot-path external lookup; Redis caches the IP:Port→transceiver mapping for fast recovery.
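
To make the ufrag trick concrete, here is a minimal Python sketch of how routing metadata could ride inside an ICE username fragment. The byte layout (a 2-byte cluster ID, a 4-byte transceiver ID, then random bytes) and the function names are assumptions for illustration; the post does not describe OpenAI's actual encoding, and their relay is written in Go.

```python
import base64
import secrets
import struct

# Hypothetical layout: 2-byte cluster ID + 4-byte transceiver ID (6 bytes total).
_ROUTE = struct.Struct(">HI")


def make_ufrag(cluster_id: int, transceiver_id: int) -> str:
    """Build an ICE ufrag whose first 8 characters carry routing metadata."""
    payload = _ROUTE.pack(cluster_id, transceiver_id) + secrets.token_bytes(6)
    # 12 bytes encode to 16 base64 characters with no '=' padding, so the
    # result stays inside the ICE character set (letters, digits, '+', '/').
    return base64.b64encode(payload).decode("ascii")


def route_from_stun_username(username: str) -> tuple[int, int]:
    """Relay side: recover (cluster_id, transceiver_id) without any lookup."""
    # In ICE connectivity checks the STUN USERNAME is
    # "<receiver ufrag>:<sender ufrag>", so the server's ufrag comes first.
    server_ufrag = username.split(":", 1)[0]
    raw = base64.b64decode(server_ufrag[:8])
    return _ROUTE.unpack(raw)
```

Because the route is recoverable from the first STUN packet alone, a relay built this way needs no per-session state; a cache such as the Redis mapping mentioned above would only matter for packets that do not carry the username.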

Go relay tricks (no kernel bypass)

Three techniques carry the load: SO_REUSEPORT lets the kernel spread UDP packets across several sockets, one per worker goroutine; runtime.LockOSThread pins each reader goroutine to one OS thread for cache locality; pre-allocated buffers minimize GC pressure. They explicitly evaluated kernel-bypass (DPDK-style) and decided their workload didn't justify the operational complexity.
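
The socket-level part of this can be sketched in Python rather than Go. The example below only illustrates SO_REUSEPORT and buffer reuse on a platform that supports the option (such as Linux); thread pinning via runtime.LockOSThread is Go-specific and has no counterpart here. The function names are made up for the sketch.

```python
import socket


def open_reuseport_sockets(count: int, host: str = "127.0.0.1", port: int = 0):
    """Bind `count` UDP sockets to one address so the kernel spreads packets across them."""
    sockets = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # The option must be set on every socket before bind().
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind((host, port))
        port = s.getsockname()[1]  # later sockets reuse the port the first bind picked
        sockets.append(s)
    return sockets


def read_one(sock: socket.socket, buf: bytearray) -> memoryview:
    """Receive one datagram into a caller-owned buffer instead of allocating a new one."""
    n, _addr = sock.recvfrom_into(buf)
    return memoryview(buf)[:n]
```

In a real relay each socket would be drained by its own worker, each holding one long-lived buffer.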

Optimize for the common case before reaching for kernel bypass. A narrow Go implementation with careful use of SO_REUSEPORT, thread pinning, and low-allocation parsing was enough for our workload.

Global Relay + geo-steered signaling

Cloudflare's geo and proximity steering routes the initial HTTP/WebSocket SDP exchange to the nearest transceiver cluster; the SDP answer points the client at the nearest Global Relay; ICE ufrag carries the rest. Both signaling and media take a near-user path into OpenAI's backbone.

The broader lesson is that the best place to add complexity is in a thin routing layer, not in every backend service, and not in custom client behavior.

Notable: the post credits Justin Uberti (one of WebRTC's original architects) and Sean DuBois (creator of Pion) as colleagues now at OpenAI — a quiet flex.

Tools: WebRTC, Pion, Go, Kubernetes, Cloudflare, Redis, ICE/DTLS/SRTP/STUN, SO_REUSEPORT
Podcast
Sequoia Capital

Sequoia Interviews Boris Cherny: "Coding Is Solved" — 150 PRs/Day, From His Phone

Anthropic Labs' Boris Cherny — the engineer who built Claude Code — tells Lauren Reeder that for him, today, the model writes 100% of his code, he ships dozens of PRs per day (record: 150), and most of his work now happens from his phone via the Claude app's code tab.[4]Sequoia: Boris Cherny on Claude Code His one-line thesis on what comes next: "loops are the future."

~02:30 Claude Code began inside Anthropic Labs in late 2024 as a deliberate bet on a "product overhang" — Sonnet 3.5 could go far beyond tab-completion if wrapped in a real agent harness.

~03:50 The first six months were a flop; Opus 4 (May) was the inflection. Growth has compounded with every model release since (4.5, 4.6, 4.7).

~05:30 The "coding is solved" claim, precisely: for Boris, today, Claude writes 100% of his code. The Claude Code codebase (which leaked, so he can say this) is TypeScript and React — chosen because they were "on distribution" for the model in late 2024.

~06:30 The phone-first setup: 5–10 sessions concurrently, "a few hundred agents going" daytime, "a few thousand" overnight. Personal record — 150 PRs in a single day last week.

Loops as the new primitive

~07:30 The mechanism Boris uses most: /loop — Claude schedules a recurring cron job (every minute, 5 min, daily). He runs dozens: one babysitting his PRs and auto-rebasing, one fixing flaky CI, one clustering Twitter feedback every 30 minutes. Anthropic just shipped "routines" — the server-side equivalent that runs with your laptop closed.

I sort of feel like loops are the future at this point.

Cross-disciplinary generalists

~08:55 Boris predicts product engineers spanning iOS/web/server, plus engineers who are also strong designers, data scientists, or PMs. On the Claude Code team itself: "every single person codes" — EM, PM, designers, data scientist, finance, user researcher.

No SaaS apocalypse — but 10× more disruption

~10:30 Pushback on the "SaaS is dead" meme via Hamilton Helmer's 7 Powers: switching costs and process power erode (4.7 hill-climbs to a target until done), but network effects, scale economies, and cornered resources still matter. Prediction: 10× more disruptive startups in the next decade because incumbents can't retrain orgs fast enough.

Claude is getting really good at figuring out process. With 4.7, it can just hill climb anything. If you give it a target and tell it to iterate until it's done, it will just do it. I think this is the first model like that.

The model/harness split

~13:15 Roughly 50/50 historically, with the harness mattering less as models improve. Current focus: loops as a first-class primitive, easier multi-agent orchestration, and reducing safety scaffolding (prompt-injection guards, static command verification, permission modes) because future models will "just do the right thing."

The printing-press analogy

~15:30 1400s Europe had ~10% literacy. Fifty years after the printing press, more output than the prior thousand years and a 100× cost drop in books. Software authoring follows the same arc, "much faster than 50 years."

The best person to write accounting software, I think maybe even today, is not an engineer, it's a really good accountant — because they know the domain really well and coding is the easy part.

Anthropic's actual edge

~17:30 No model gap (everyone gets the same models). Real gap: org process. Inside Anthropic, Claudes talk to each other over Slack to resolve unknowns; no manually written code anywhere; all SQL is model-written.

We have no more manually written code anywhere at the company. All of the SQL is written by models. Everything is just built by the models.

Multi-agent ergonomics

~19:18 The parallelization decision should be the model's, not the user's. 4.7 already volunteers loops ("I noticed the data is changing — I'll start a loop and report every 30 minutes via Slack MCP").

Knowledge work beyond code

~21:50 MCP is the answer for everything with an integration (Salesforce, Google Docs, Calendar). For everything else, computer use via Co-work — slow but reliable with 4.7.

What to build now

~23:30 Claude Design (already good, will be much better), more Claude Code launches in the coming weeks, massively parallel agent products (loop, batch), and computer use.

Tools: Claude Code, /loop, Routines, /batch, Sub-agents, MCP, Slack MCP, Co-work, Claude Design, Computer use, Opus 4.x, Mythos
Hot Take AI Future
Sequoia Capital

Karpathy's "Car Wash Problem": AI's Jaggedness Is a Data-Distribution Story

Karpathy gives the new canonical example of LLM jaggedness: state-of-the-art Opus 4.7 will refactor a 100,000-line codebase or find zero-day vulnerabilities, and then tell you to walk 50 meters to a car wash.[5]Karpathy on the car wash problem His explanation: capabilities track the data distribution and RL environments labs deliberately added — code is overweighted because it's economically valuable and easy to verify.

Karpathy's GPT-3.5 → GPT-4 chess example is the cleanest data point: chess didn't improve from general capability scaling, it improved because someone at OpenAI added a large chess dataset to pre-training.

How is it possible that state-of-the-art Opus 4.7 will simultaneously refactor a 100,000 line code base or find zero-day vulnerabilities and yet tells me to walk to this car wash? This is insane.
If you're in the circuits that were part of the RL, you fly and if you're in the circuits that are out of the data distribution, you're going to struggle.

Practical takeaway: you don't get a manual with the model, so you have to empirically explore which circuits your application sits in. If you're outside the distribution, fine-tuning is the answer — not waiting for the next general-capability bump.

Podcast
AI Engineer

Chris Parsons at AI Engineer: Ralph Loops — Build Dumb AI Loops That Ship

Cherrypick's Chris Parsons argues that complex agent orchestration (n8n flows, parallel multi-agent dependency graphs) is the wrong shape for current coding agents — instead, run a "dumb" loop that repeatedly tells the AI to "do the next most important thing" and let the model handle dependencies.[6]AI Engineer: Ralph Loops by Chris Parsons The talk is named after Ralph Wiggum: keep trying the same thing until it works. Pairs cleanly with Boris Cherny's "loops are the future" thesis.[4]Boris Cherny on loops as primitive

~00:14 Audience poll: most attendees use Claude Code or Codex, many no longer write code themselves, and a growing minority use them for non-coding work.

~03:17 Why he abandoned n8n: a weekly newsletter workflow kept failing every Monday at 2pm. He replaced about a week's worth of fragile JSON with a single Claude Code skill that produces better newsletters and self-improves by ending each run with "update the skill with anything you should have done differently."

~07:18 The earliest Ralph (Geoffrey Huntley) was just: after the AI says it's done, give it the same prompt again. With GPT-5.12+ and Opus/Sonnet 4.6+ this rarely catches anything because models actually finish — but the pattern unlocked something bigger.

~21:25 The real unlock (credited to Matt Pocock): point the loop at a list of tickets and say "pick the next most important one." His earlier failed attempt was hand-building a giant dependency graph and firing 6–7 parallel agents, which collided on shared tickets and reimplemented the same work.

I think we over complicate things by assuming that they need to be in parallel. I quite like the idea of just starting with a loop.

~25:25 AI is good at picking the next ticket, bad at parallel coordination. Hand-built dependency graphs become AI-driven waterfall.

~40:35 Claude Code's new built-in /loop command wraps this with a cron trigger ("loop every minute, build the next ticket"). Parsons runs a 6am morning briefing loop, a 15-min heartbeat, a worker loop driving a vibe-coded Kanban PM, and an experimental "startup" skill that once spontaneously generated an investor update deck.

The reversibility rule

~66:48 One hard rule for autonomous loops: "is this reversible without embarrassment to me?" If no, hand back to a human. The AI drafts ~15–16 emails every morning but never sends them; never closes a project itself.

Is this reversible without embarrassment to me? If the answer is no, don't do it.
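
A rough Python sketch of the pattern described in this talk: one sequential loop that takes the next most important ticket and applies the reversibility rule before acting. Everything here is a stand-in. In Parsons' setup the model itself picks the next ticket; the sketch uses a plain priority sort instead, and `do_work` is a hypothetical callback representing one agent run.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    title: str
    priority: int            # lower number = more important
    reversible: bool = True  # False for actions like sending email or closing a project
    status: str = "open"     # "open", "done", or "needs_human"


def ralph_loop(tickets, do_work, max_iterations=50):
    """Work tickets one at a time, most important first, with no parallelism."""
    for _ in range(max_iterations):
        open_tickets = [t for t in tickets if t.status == "open"]
        if not open_tickets:
            break
        ticket = min(open_tickets, key=lambda t: t.priority)
        if not ticket.reversible:
            # Reversibility rule: hand anything irreversible back to a human.
            ticket.status = "needs_human"
            continue
        do_work(ticket)  # stand-in for one agent run on this ticket
        ticket.status = "done"
    return [t for t in tickets if t.status == "needs_human"]
```

The iteration cap is there so a loop that never drains its queue still stops; a cron-triggered version would run one iteration per tick instead.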

Sandboxing and the lethal trifecta

~52:45 Loops run on a VPS with separate API keys and narrowly scoped permissions. He's building "Lockbox" to harden this and cites Simon Willison's "lethal trifecta" (access to private data + exposure to untrusted content + a way to communicate externally = data exfiltration).

Sub-agents beat same-context self-review

~54:45 Moving validation into a sub-agent (with fresh context) catches bugs that same-context validation misses because of confirmation bias.

Theory of Constraints

~95:13 If your release process is the bottleneck, faster coding via Ralph just queues up more PRs and makes things worse. Goldratt's The Goal applies.

The bottleneck is usually not the number of agents. It's usually you just keeping up with the AI just doing things over and over again.
AI can do all of the rubbish work, but it can't and it shouldn't do the work that I'm uniquely good at.
Tools: Claude Code, /loop, Codex, n8n (deprecated by him), Lockbox (his), Beads, Linear, Playwright, Obsidian, Leanne, MCP agent mail, "air skills" (his in-progress product)
Podcast
EO

EO Interviews Yutori's Abhishek Das: Real Agents Are Still 5 Years Away

Yutori co-founder Abhishek Das frames the central technical wall: agents make sequences of decisions, and over a 10–50-step workflow even 90% per-step accuracy collapses to a low overall success rate.[7]EO: Yutori's Abhishek Das interview His other complaint is cultural: the industry has wrongly normalized non-determinism and slop. Yutori's stance — "if it's not good enough to work on the first try, it's not good enough."

~00:00 Roughly 100 agent products claim "do anything on the web" yet rarely succeed first-try. Compounding error is the wall.
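
The compounding-error arithmetic is easy to check. The sketch below assumes each step succeeds independently and that the agent never recovers from a mistake, which is a simplification of real workflows.

```python
def workflow_success(per_step: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps and no recovery."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach `target` end-to-end success."""
    return target ** (1.0 / steps)


for steps in (10, 50):
    print(steps, round(workflow_success(0.90, steps), 4))
# 10 steps -> 0.3487, 50 steps -> 0.0052

print(round(required_per_step(0.90, 50), 4))  # 0.9979
```

Under these assumptions a 50-step workflow needs roughly 99.8% per-step accuracy to succeed nine times out of ten, which is why backtracking matters more than raw step accuracy.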

~03:04 Yutori is reimagining the browser for an era where users talk to AI assistants that take actions and complete tasks proactively in the background. Digital agents arrive before physical ones.

~06:06 Every production query goes through a comprehensive eval suite. The differentiator is whether an agent can recognize a mistake, backtrack, and try a different branch — because the universe of websites is infinite and models will always be trained on a finite subset.

~07:06 Pushback on normalized "slop": Yutori enforces standards via weekly dogfooding sessions of an hour or more and runs tens of internal experiments at any time, only one of which might ship.

It feels like we have started normalizing and developed a tolerance for non-determinism and low reliability in shipping products.
If it's not good enough to work on the first try, it's not good enough.

~10:07 Connects to his earlier Grad-CAM interpretability research: AI shouldn't only deliver final answers, it should expose proof of work. Yutori's Scouts feature includes a UI button to inspect which sites the agent visited and what it looked at.

In a world where it's very easy to come up with first prototypes using these coding LLMs, the true differentiator is in taste and craft.
Tools: Yutori, Yutori Scouts, Grad-CAM
AI Future Hot Take
Import AI

Import AI 455: AI Systems Are About to Start Building Themselves

Jack Clark's central case: SWE-Bench has gone from ~2% (Claude 2, late 2023) to 93.9% (Claude Mythos Preview, 2026), METR independent-task-completion time has grown from ~30 seconds in 2022 to ~12 hours in 2026 (projected ~100 hours by year-end), and OpenAI has publicly committed to "an automated AI research intern by September 2026."[8]Import AI 455: Automating AI research His warning: small alignment errors compound — 99.9% accuracy degrades to ~60% after 500 generations.

The coding singularity

SWE-Bench: 2% → 93.9%. METR time horizons: GPT-3.5 ~30s → GPT-4 4 min → o1 40 min → GPT-5.2 High ~6 hrs → Opus 4.6 ~12 hrs → projected ~100 hrs by end-2026.

The vast majority of people I meet at frontier labs and around Silicon Valley now code entirely through AI systems.

Core AI R&D skills, near-saturated

CORE-Bench (computational reproducibility): GPT-4o at 21.5% in Sep 2024 → Opus 4.5 at 95.5% in Dec 2025. MLE-Bench (75 Kaggle-style problems): o1 16.9% → Gemini3+search 64.4%. GPU kernel design progressing fast (DeepSeek, Meta PyTorch→CUDA, Huawei AscendCraft, ByteDance CUDA Agent).

AI training AI

PostTrainBench: human baseline 51% uplift; Opus 4.6 and GPT-5.4 at 25–28%. Anthropic's CPU-only training optimization speedup: Opus 4 (May 2025) 2.9× → 4.5 16.5× → 4.6 30× → Mythos Preview (Apr 2026) 52×. Human expert: 4–8×.

Frontier math

A Gemini model produced 13 solutions across ~700 Erdős problems, with one (Erdős-1051) tentatively novel. UBC/UNSW/Stanford/DeepMind paper: "The proofs of the main results were discovered with very substantial input from Google Gemini and related tools."

Industry commitments

OpenAI: "automated AI research intern by September 2026." Anthropic: published research on automated alignment researchers. DeepMind: "automation of alignment research should be done when feasible." Recursive Superintelligence: $500M raised explicitly for this. Mirendil: "building systems that excel at AI R&D."

Why this matters: alignment

Unless your alignment approach is '100% accurate'…things can go wrong quite quickly. For example, your technique is 99.9% accurate, then that becomes 95.12% accurate after 50 generations, and 60.5% accurate after 500 generations.
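
The figures in that quote follow from repeated multiplication. A small sketch, assuming each generation independently preserves alignment with the same probability; the function names are illustrative only.

```python
import math


def retained_accuracy(per_generation: float, generations: int) -> float:
    """Accuracy left after compounding the same per-generation rate."""
    return per_generation ** generations


def generations_until(per_generation: float, threshold: float) -> int:
    """Smallest number of generations at which accuracy drops below `threshold`."""
    return math.ceil(math.log(threshold) / math.log(per_generation))


print(round(retained_accuracy(0.999, 50), 4))   # 0.9512
print(round(retained_accuracy(0.999, 500), 4))  # 0.6064
print(generations_until(0.999, 0.5))            # 693
```

The 50-generation value matches the quoted 95.12%, and the 500-generation value comes out at about 60.6%, in line with the roughly 60% cited.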

Why this matters: economics

Clark predicts a capital-heavy/human-light economy with autonomous AI corporations, Amdahl's Law bottlenecks where digital acceleration hits physical constraints (drug trials), and governance gaps around redistribution.

AI systems are about to start building themselves. What does that mean?
Tools mentioned: SWE-Bench, METR, CORE-Bench, MLE-Bench, PostTrainBench, Claude Mythos Preview, Opus 4.6, GPT-5.4, Gemini3, Claude Code, OpenCode, AIME 2025, Arena Hard, BFCL, GPQA, GSM8K
AI Models Industry
OpenRouter

GPT-5.5's Real-World Cost: 49–92% Above GPT-5.4

GPT-5.5 doubled the sticker price (input $2.50→$5.00/M, output $15→$30/M), but OpenRouter's analysis of users who switched found real-world costs increased 49–92%, not 100%, because the model became less verbose on long inputs.[9]OpenRouter: GPT-5.5 cost analysis

For prompts 10K–128K tokens, GPT-5.5 generates 19–34% fewer output tokens than GPT-5.4. Worst real-world hit: short prompts under 2K tokens (+92%, because it actually outputs slightly more there). Best case: 50K–128K prompts (+49%). Above 128K it spikes back to +85% — likely the model becoming verbose again at extreme context lengths.
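
A back-of-the-envelope model shows why a 2× sticker price lands below 2× in practice when outputs get shorter. The prices are the ones quoted above; the token counts and the 30% output reduction are hypothetical, and the sketch ignores cached-input discounts and any other pricing detail not mentioned in the post.

```python
PRICES = {  # USD per million tokens, from the sticker prices quoted above
    "gpt-5.4": {"input": 2.50, "output": 15.00},
    "gpt-5.5": {"input": 5.00, "output": 30.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at list price."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def cost_increase(input_tokens: int, old_output: int, output_reduction: float) -> float:
    """Fractional cost change after switching, given how much shorter the new output is."""
    old = request_cost("gpt-5.4", input_tokens, old_output)
    new_output = round(old_output * (1 - output_reduction))
    new = request_cost("gpt-5.5", input_tokens, new_output)
    return new / old - 1


# Hypothetical request: 20K input tokens, 8K output tokens, 30% shorter output on GPT-5.5.
print(f"{cost_increase(20_000, 8_000, 0.30):+.1%}")  # +57.6%
```

With no change in output length the increase is exactly 100%, and the more of the bill that comes from output tokens, the more a shorter answer offsets the price doubling.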

We observed cost increases between 49–92%.

The methodology used OpenRouter's unified tokenization and excluded cancelled requests, media requests, and zero-token requests. The post makes no direct Claude or Gemini comparisons.

Tools: OpenRouter, GPT-5.5, GPT-5.4
Podcast
Dwarkesh Patel

Dwarkesh Patel Interviews Dario Amodei: AI Won't Be a Monopoly

In a short clip from a longer interview, Dario opens with: "I don't think this field's going to be a monopoly. All my lawyers never want me to say the word monopoly."[10]Dwarkesh interviews Dario Amodei His structural argument: Facebook-style network effects don't exist for AI labs. The right analog is cloud computing — three or four players sustained by capital and expertise barriers — but with more product differentiation than cloud has.

~00:00 The thesis, with the cloud analog: "You have three, maybe four players within cloud. I think that's the same for AI. Three, maybe four." Capital intensity and depth of frontier-lab expertise are the real barriers, not user lock-in.

Profits are not astronomical, margins are not astronomical, but they're not zero.

Notable concession from a frontier-lab CEO: AI will be a real business, but not winner-take-all. Stakes out a middle position against both monopoly framing and commodity collapse.

Differentiation vs cloud

~01:01 Dario rejects the simplified "Claude codes, GPT reasons" story as too coarse — "models are good at different types of coding, models have different styles." Implication: AI oligopoly will be stickier and more product-defined than the cloud oligopoly. Buyers care which model they pick in a way they don't really care which hyperscaler hosts their VMs.

Cloud is very undifferentiated. Models are more differentiated than cloud.

Note: this transcript is a short clip (~1 min). Anchors are limited; the wider interview likely contains more detail.

Tools: Claude, GPT, Gemini
Podcast
AI Engineer Arjay McCandless

Pedro Rodrigues at AI Engineer: How Skills Made Agents Actually Good at Supabase

Supabase AI tooling engineer Pedro Rodrigues demos how Anthropic-style "skills" (skill.md folders) fix subtle agent failures — like Claude creating Postgres views that silently bypass row-level security — and walks through an eval-driven workflow for testing skills before shipping.[11]AI Engineer: Pedro Rodrigues on Supabase The same RLS landmine bit indie creator Arjay this week — bypassing his per-user AI usage limits and almost producing a $10K bill.[34]Arjay: Database Hacked

~00:14 Pedro reframes "Skill Issue" → "Level up your skills" (the original title is reserved for his keynote). His mandate at Supabase: improving "DAX" — developer-agent experience.

What skills are

~03:16 Folders containing skill.md (frontmatter with required name and description + markdown body), optional reference markdown, optional bash/python scripts. The key innovation over MCP is progressive disclosure — only the frontmatter description loads into context until the agent decides the skill is relevant.
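
To illustrate progressive disclosure, here is a small Python sketch that splits a skill file into frontmatter and body and returns only what would sit in context up front. The sample file contents are invented, and the parser handles simple `key: value` lines only; real tooling would typically use a YAML parser.

```python
SKILL_MD = """\
---
name: supabase-security
description: Use when creating Postgres views, policies, or functions on Supabase.
---
# Body

Detailed guidance lives here and is only read once the skill is selected.
"""


def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into (frontmatter fields, markdown body)."""
    _, header, body = text.split("---\n", 2)
    fields = {}
    for line in header.splitlines():
        key, _, value = line.partition(":")
        if key.strip():
            fields[key.strip()] = value.strip()
    missing = {"name", "description"} - fields.keys()
    if missing:
        raise ValueError(f"missing required frontmatter: {sorted(missing)}")
    return fields, body


def context_stub(text: str) -> str:
    """What sits in context before the agent opts in: name and description only."""
    fields, _body = split_skill(text)
    return f"{fields['name']}: {fields['description']}"
```

The body, reference files, and scripts stay out of the context window until the agent decides the description is relevant, which is what keeps many installed skills cheap.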

Skills vs MCP — use both

~06:18 MCP for integrations and remote/authenticated tool execution; skills for workflow context and prompt-template-style guidance that won't fit in a tool description. Anthropic's new tool_search tool is MCP's answer to progressive disclosure.

The RLS bypass demo

~14:26 A vibe-coded Next.js perf-review app on Supabase with four users (employee, two managers, HR). Pedro asks Claude to add a department_stats SQL view via the Supabase MCP server. Claude creates it and reports success.

~31:55 Switching to Bob (engineering manager) reveals everyone can see all departments' salaries. Bug: Postgres views default to the creator's permissions, bypassing RLS. Fix: WITH (security_invoker = true), available since Postgres 15 — but underrepresented in model training data.

~38:57 Installs the pre-built supabase-security skill via npx skills (Vercel's package). With the skill loaded, Claude generates the view with security_invoker. Tip: starting skill descriptions with the verb "use" measurably increases load rate on Claude.

Eval pipeline

~62:34 Following agent-skills.org: eval.json with prompts/expected outputs/assertions; Python harness resets DB, runs Claude Code in headless mode twice (with/without skill), writes grading.json.

~70:50 Live run produces a counter-intuitive result — "with skill" fails. Pedro uses it as a meta-point: the eval was checking the wrong metadata (view definition instead of pg_class reloptions). Evals are just code, and bad assertions produce bad signals.
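
An assertion on the right metadata might look like the sketch below. It assumes the eval harness fetches `pg_class.reloptions` for the view and hands it over as a Python list of `key=value` strings (or `None` when no options are set); the query text uses a `%s` placeholder as in psycopg, and both names are illustrative rather than taken from Pedro's harness.

```python
# Query an eval could run through any Postgres client to fetch the view's options.
RELOPTIONS_SQL = "SELECT reloptions FROM pg_class WHERE relname = %s AND relkind = 'v'"

_TRUE_VALUES = {"true", "on", "yes", "1"}


def uses_security_invoker(reloptions) -> bool:
    """Return True if a pg_class.reloptions value enables security_invoker.

    `reloptions` is the text[] column as a Python list such as
    ["security_invoker=true"], or None when the view has no options set.
    """
    for option in reloptions or []:
        key, _, value = option.partition("=")
        if key == "security_invoker":
            return value.lower() in _TRUE_VALUES
    return False
```

Checking the stored option rather than the view's SELECT text is the point: the definition alone does not show whether the view runs with the caller's permissions.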

The companion lesson — Arjay's $10K near-miss

~00:00 Arjay's app was hacked when a user removed per-user AI usage limits — same root cause: missing row-level security. Without RLS, a client-side SELECT * FROM todos returns everyone's rows.[34]Arjay: Database Hacked

Someone found a way to remove those limits. And in theory, they could have racked up a $10,000 bill if they wanted to.
Tools: Anthropic Claude Code, Supabase, Supabase MCP server, Postgres (security_invoker, RLS, pg_class reloptions), Next.js, Vercel npx skills, agent-skills open standard, MCP tool_search, Braintrust, Langfuse, Cursor
Podcast
AI Engineer

Angelos Perivolaropoulos at AI Engineer: Train an LLM From Scratch in 15 Minutes on a Free Colab GPU

ElevenLabs' speech-to-text lead Angelos Perivolaropoulos walks through training a tiny ~1.8M-parameter GPT-2-style decoder on a laptop or a free Colab T4, with concrete loss-to-quality milestones (overfit threshold ≈1.0 on this dataset).[12]AI Engineer: Train an LLM from scratch The Q&A covers reasoning models, multimodal injection, and how ElevenLabs' audio side actually works.

~00:14 Intro: ElevenLabs Scribe v2 is currently top-ranked on public transcription benchmarks. Workshop framing: pure PyTorch, no pre-trained weights, gets you ~80% of the way to how labs actually design models.

~04:17 Four building blocks: tokenizer, model architecture, training loop, inference.

~09:22 Character-level tokenizer on tiny Shakespeare: 65 tokens, ~4,225 bigrams, tractable for the dataset size. Production labs use BPE; a 50K vocab like GPT-2's would balloon embedding params to 19M and overwhelm a small model.
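
A character-level tokenizer fits in a few lines of plain Python. The sketch below is not the workshop's code; the sample sentence is a placeholder for the tiny Shakespeare corpus, and the size arithmetic uses GPT-2's 50,257-token vocabulary with the 384-dim embedding width quoted for this model.

```python
def build_char_tokenizer(corpus: str):
    """Map each distinct character in `corpus` to an integer ID and back."""
    vocab = sorted(set(corpus))
    stoi = {ch: i for i, ch in enumerate(vocab)}

    def encode(text: str) -> list[int]:
        return [stoi[ch] for ch in text]

    def decode(ids: list[int]) -> str:
        return "".join(vocab[i] for i in ids)

    return vocab, encode, decode


vocab, encode, decode = build_char_tokenizer("To be, or not to be")
assert decode(encode("not to be")) == "not to be"

# Size arithmetic behind the numbers in the talk.
char_vocab, gpt2_vocab, d_model = 65, 50_257, 384
print(char_vocab ** 2)       # 4225 possible bigrams
print(char_vocab * d_model)  # 24960 embedding parameters
print(gpt2_vocab * d_model)  # 19298688, roughly 19M
```

The last two lines show the trade-off: with 65 characters the embedding table is negligible, while a GPT-2-sized vocabulary would put about 19M parameters in the embedding table alone.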

~15:29 Transformer building blocks: multi-head causal self-attention, MLP, residual connections (so each layer makes small adjustments), layer norm (keeps activations from exploding).

~23:36 Model config: 6 layers, 6 heads, 384-dim, 256-token context, ~1.8M params total.

~41:51 Training loop: batch size 64, AdamW + cosine LR with 100-step warmup over 5,000 total steps.
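
The schedule can be written as a pure function of the step number. This is a sketch under assumptions: the warmup length and total steps come from the talk, but the peak and floor learning rates (`max_lr`, `min_lr`) are placeholder values that the summary does not give.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=5_000):
    """Learning rate at `step`: linear warmup, then cosine decay down to `min_lr`."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine
```

In a PyTorch loop the value would be written into each optimizer parameter group before the step; the warmup keeps early AdamW updates small while its moment estimates are still noisy.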

Concrete loss-to-quality landmarks

~49:56 ln(65) ≈ 4.17 (random) → 3.3 (character frequencies, "th") → 2.5 (the word "the") → 1.5–2.0 (real words appearing) → 1.0–1.2 (recognizable Shakespearean phrases) → below 1.0 = overfit. Optimal in his test: ~2,400 steps, ~15 minutes on a free Colab T4.
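
The starting value is not arbitrary: a model that guesses uniformly over 65 characters has a cross-entropy of ln(65). A short check, plus a helper that converts a loss in nats into perplexity, which reads as the effective number of equally likely next characters.

```python
import math

VOCAB_SIZE = 65

# A uniform guess assigns probability 1/65 to every character,
# so its cross-entropy is -ln(1/65) = ln(65).
print(round(math.log(VOCAB_SIZE), 2))  # 4.17


def perplexity(loss: float) -> float:
    """Effective number of equally likely next characters implied by a loss in nats."""
    return math.exp(loss)


for loss in (4.17, 3.3, 2.5, 1.5, 1.0):
    print(loss, round(perplexity(loss), 1))
```

Read this way, training takes the model from choosing among about 65 characters to choosing among fewer than 3 by the time the loss reaches 1.0.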

When the loss starts going below 1.0 for this specific dataset, that's where we're going to start seeing overfitting. The model will still be producing reasonable things, but it will no longer start getting better at it.

~53:00 Inference: greedy good for transcription, boring for LLMs. Use temperature ~0.7 + top-k. Fixed seed for the workshop's competition (best Shakespearean verse wins ElevenLabs swag).
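
Temperature and top-k sampling can be sketched with the standard library alone. This is an illustration of the general technique, not the workshop's code; the default `top_k=10` is an assumed value, and `logits` is taken to be a plain list of floats, one per token ID.

```python
import math
import random


def sample_next(logits, temperature=0.7, top_k=10, rng=random):
    """Pick a token ID from raw logits using temperature scaling and top-k filtering."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding.
        return max(range(len(logits)), key=logits.__getitem__)
    # Keep only the k highest-scoring token IDs.
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in top]
    # Softmax over the survivors, shifted by the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes runs repeatable, which is the role the fixed seed plays in the workshop's competition.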

Q&A highlights

~63:12 Reasoning models share the same base architecture and are post-trained on very high-quality chain-of-thought data — labs like Scale AI hire physicists and PhDs because bad data breaks models. Someone has converted Llama 1B (non-reasoning) into a reasoning model purely via post-training.

The model just cares about these embeddings. It doesn't care if it's text or if it's audio or if it's video.

Multimodal: video/audio encoders produce hidden vectors injected into the text transformer's embedding layer at prefix positions. ElevenLabs trains tokenizers on mel spectrograms; TTS often uses L2 over spectrograms or KL-divergence for distillation rather than cross-entropy. Music generation can be autoregressive or diffusion-based; diffusion is generally easier to get working for abstract modalities.

Tools: PyTorch, NumPy, tqdm, tiktoken, uv, Google Colab T4, AdamW, nano-GPT (Karpathy), GPT-2 architecture, BPE, ElevenLabs Scribe v2, Scale AI, Qwen 3, Llama 1B, Mel spectrograms
AI Tools Productivity
Nate Herk | AI Automation

Nate Herk Builds a Sales Voice Agent in 15 Minutes (ElevenLabs + Claude Code)

Nate's proof-of-concept: a voice agent trained on all 400 of his YouTube transcripts, scaffolded by Claude Code in ~15 minutes.[13]Nate Herk: Voice Agents The video then moves to a longer live build of a B2B sales agent that books Cal.com discovery calls, with concrete fixes for the Adam voice sounding too AI, a UTC-vs-Central time bug, and security/cost lockdown for public widgets.

~03:00 Anatomy of a voice agent: persona (system prompt) + voice (ElevenLabs library or 4-hr custom clone) + knowledge (docs, Supabase/Pinecone vector stores) + tools (API calls, MCP servers, n8n, Zapier, Python).

It is a loop. It's not magic.

~07:02 Live build with Claude Code in VS Code: dictate the goal in plain language → enable plan mode so Claude interviews him about Cal.com event type, voice persona, required data fields → Claude autonomously creates .env, wires Cal.com + ElevenLabs APIs, writes the system prompt, picks a voice, injects the widget snippet.

~10:03 Dictation: he switched from Whisper to GLO (faster, private) and joined the GLO team.

~18:05 Iterative debugging: Adam voice sounds too AI, agent doesn't deliver first message, check-availability tool queries UTC instead of Central. Each fix described in plain language; Claude Code reads ElevenLabs docs and inspects the conversation transcript dashboard to pinpoint the UTC bug.

~29:14 Security & cost: ElevenLabs widgets are HTML you can copy and the owner pays per minute, so lock to specific hostnames, cap call duration, rate-limit, ground knowledge base. Deploy: GitHub → Vercel → live widget; Twilio for phone.
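
The three controls named here (hostname allowlist, call-duration cap, rate limit) can be illustrated with a generic server-side gate. This is a sketch only: it is not the ElevenLabs API or dashboard, and every name, hostname, and limit below is hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Optional
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "www.example.com"}  # hypothetical site hostnames
MAX_CALL_SECONDS = 300      # hypothetical cap on a single call
MAX_CALLS_PER_HOUR = 5      # hypothetical per-client limit

_recent_calls = defaultdict(deque)  # client id -> start times within the last hour

def may_start_call(origin: str, client_id: str, now: Optional[float] = None) -> bool:
    """Allow a call only from an allowlisted hostname and under the hourly limit."""
    now = time.time() if now is None else now
    if urlparse(origin).hostname not in ALLOWED_HOSTS:
        return False
    window = _recent_calls[client_id]
    while window and now - window[0] >= 3600:
        window.popleft()            # forget calls older than one hour
    if len(window) >= MAX_CALLS_PER_HOUR:
        return False
    window.append(now)
    return True

def call_should_end(started_at: float, now: float) -> bool:
    """True once a call has run past the duration cap."""
    return now - started_at >= MAX_CALL_SECONDS
```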

Code beats clicks — it's so much better to just build a voice agent by speaking into your computer rather than going onto the dashboard and clicking and clicking.
Tools: Claude Code, ElevenLabs, Cal.com, GLO, Whisper, Supabase, Pinecone, n8n, Zapier, MCP, VS Code, Vercel, Twilio
AI Tools Developer Tools
Ramp Builders

Ramp's Agent Identity Model: OBOU + OAuth2-PKCE for Safe Agent Spending

Ramp's stance: handing agents raw session tokens or API keys breaks attribution, scoping, and lifecycle controls. Their first answer is "On Behalf Of User" (OBOU) — Agent Keys tied to both a human sponsor and a business entity, with permissions strictly scoped to a subset of the sponsor's role-based access.[14]Ramp Builders: Agent Identity

OBOU model

Agent Key tied to a human sponsor + business entity. Permissions are always a strict subset of the sponsor's RBAC — agents can be scoped down but never elevated. All spend and actions attribute to the sponsor: audit logs read "Approved by Sarah (via Codex)." Lifecycle ties to employment status or explicit expiration; sponsor, manager, or admin can revoke.
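
One way to picture the "scoped down but never elevated" rule is as a set intersection evaluated at request time. The sketch below is illustrative only; the class, field, and scope names are hypothetical and are not Ramp's schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentKey:
    agent_name: str              # e.g. the tool acting for the user
    sponsor: str                 # the accountable human
    business_id: str             # the business entity the key belongs to
    requested_scopes: frozenset  # what the agent was configured to do

def effective_scopes(key: AgentKey, sponsor_permissions) -> frozenset:
    """The agent gets only scopes it asked for AND the sponsor currently holds."""
    return key.requested_scopes & frozenset(sponsor_permissions)

def audit_line(key: AgentKey, action: str) -> str:
    """Attribute the action to the sponsor, noting which agent performed it."""
    return f"{action} by {key.sponsor} (via {key.agent_name})"

key = AgentKey("Codex", "Sarah", "acme", frozenset({"bills:approve", "admin:users"}))
print(sorted(effective_scopes(key, {"bills:approve", "reports:read"})))
print(audit_line(key, "Approved"))
```

Because the intersection is recomputed against the sponsor's current permissions, a sponsor losing a role shrinks the agent's access without touching the key.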

An agent can never have more access than its sponsor. Every OBOU action has a human accountable.

Authentication

Agent Keys deliberately don't authenticate — they're identifiers in an OAuth2-PKCE flow that produces short-lived JWTs (≤1 hr) plus a refresh token. JWT refresh re-validates that the Agent Key isn't revoked and the sponsor still has the required permissions. A leaked Agent Key alone is useless; a leaked access token expires within an hour and can be revoked instantly.
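
For readers unfamiliar with PKCE, the client-side half of the flow is small. The sketch below follows the S256 method from RFC 7636 using only the standard library; how Ramp wires the Agent Key into the authorization request is not described in the post, so nothing here is Ramp-specific.

```python
import base64
import hashlib
import secrets

def _s256(verifier: str) -> str:
    """BASE64URL(SHA256(verifier)) without padding, per RFC 7636."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

def make_pkce_pair():
    """Return (code_verifier, code_challenge) for an authorization request."""
    verifier = secrets.token_urlsafe(64)  # 86 URL-safe chars, inside the 43-128 range
    return verifier, _s256(verifier)

def verify_pkce(verifier: str, challenge: str) -> bool:
    """Server-side check when the code is exchanged for tokens."""
    return secrets.compare_digest(_s256(verifier), challenge)

verifier, challenge = make_pkce_pair()
print(len(verifier), verify_pkce(verifier, challenge))
```

The client sends only the challenge up front and reveals the verifier at token exchange, so an intercepted authorization code cannot be redeemed on its own.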

A leaked access token is bad for an hour (and can be revoked instantly). A leaked API key would have been bad until someone notices.

Audit logging without rewrites

Ramp's existing DenormalizedActor type was extended with an optional AgentContext field, so every agent action automatically participates in audit logging from the moment the Agent Key is created. Existing human-initiated flows required no changes.
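
The shape of that change can be sketched with dataclasses. Only the type names `DenormalizedActor` and `AgentContext` come from the post; the fields and the `describe` helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class AgentContext:
    agent_key_id: str   # hypothetical field
    agent_name: str     # hypothetical field

@dataclass(frozen=True)
class DenormalizedActor:
    user_id: str
    display_name: str
    # Optional with a None default, so existing human-only call sites keep working.
    agent_context: Optional[AgentContext] = None

def describe(actor: DenormalizedActor) -> str:
    """Render the actor for an audit log entry."""
    if actor.agent_context is None:
        return actor.display_name
    return f"{actor.display_name} (via {actor.agent_context.agent_name})"

print(describe(DenormalizedActor("u1", "Sarah")))
print(describe(DenormalizedActor("u1", "Sarah", AgentContext("ak_1", "Codex"))))
```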

Lifecycle

Agent Keys have expiration dates with renewal reminders. Revocation produces a data-exhaust log of actions taken before revocation. Renewal-over-rotation was a deliberate trade-off: less management burden, slight theoretical security cost.

Pairs naturally with the broader agent-security theme this week — Pedro's RLS demo[11]Pedro on Supabase RLS, Arjay's RLS near-miss[34]Arjay: Database Hacked, and Linkly (in GitHub Trending below) all hammer the same point: agents need first-class identity primitives, not stolen credentials.

Tools: Ramp for Agents, OAuth2-PKCE, JWT, Claude, Codex, Cursor
Industry AI Models AI Tools
Google

Google's April: Cloud Next, Gemma 4, 8th-Gen TPUs, Deep Research Max

Cloud Next '26 drew 32,000+ attendees and 260+ announcements centered on the "agentic AI era," with 330 organizations now processing 1T+ tokens annually on Google Cloud and ~75% of Cloud customers using Google Cloud AI.[15]Google AI updates: April 2026 Headlines include Gemma 4 ("byte for byte the most capable open model" — Gemma family is past 500M downloads), 8th-gen TPUs, Deep Research Max, the Gemini Enterprise Agent Platform, and Google Vids free for all account holders (10 videos/month).

Models & infra

  • Gemma 4 — newest open-weights model, "byte for byte the most capable open model," tuned for agentic workflows. Family past 500M downloads.
  • 8th-generation TPUs — co-designed for agentic workloads, emphasis on energy efficiency.
  • Gemini Enterprise Agent Platform — orchestration layer for autonomous, multi-step business processes.

Products

  • Deep Research Max — autonomous research agent for end-to-end multi-source synthesis.
  • Google Vids free tier — AI video generation/editing, 10/month for all Google account holders.
  • Google Colab Learn Mode — Gemini-powered "personal coding tutor" with Custom Instructions per notebook.
  • Google AI Studio — increased usage for Pro/Ultra; Vibe Coding Course with Kaggle launching June 2026.
  • Gemini Test Prep — TOEIC reading-comprehension quizzes, initially Korea.
  • Google Translate — 20th anniversary; 1B users, ~1T words/month; new pronunciation tool on Android.
  • Fitbit + Gemini — deeper biometric integration in personal health coaching.

Other

Google.org + J&J Foundation: $10M for AI literacy/training in rural U.S. healthcare workforce.

Tools: Gemma 4, TPU v8, Gemini, Gemini Enterprise Agent Platform, Deep Research Max, Google Vids, Google Colab, Google AI Studio, Kaggle, Google Translate, Fitbit
Developer Tools
Google

Gemini API Adds Webhooks to Replace Polling

Long-running Gemini jobs (Deep Research, long video generation, batch processing) no longer need polling. Google added webhooks that follow the Standard Webhooks spec: HMAC-signed when configured at the project level, JWKS-verified when configured per request, with signed headers (webhook-signature, webhook-id, webhook-timestamp) for replay protection and at-least-once delivery with a 24-hour retry window.[16]Google: Webhooks in Gemini API

Configure either globally at the project level (HMAC-secured) or dynamically per-request (JWKS-secured). Push payload arrives the instant a task finishes, eliminating polling overhead for jobs that span minutes or hours. Python SDK example for batch task configuration available, plus full docs at ai.google.dev/gemini-api/docs/webhooks and a Cookbook notebook on GitHub.
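
On the receiving side, the HMAC variant can be verified with the standard library. This sketch follows the Standard Webhooks signing scheme (HMAC-SHA256 over `id.timestamp.body`, base64 signatures prefixed with `v1,`). Whether Gemini's project-level secret uses the `whsec_` prefix, and the five-minute tolerance, are assumptions to confirm against the Gemini docs.

```python
import base64
import hashlib
import hmac
import time

TOLERANCE_SECONDS = 5 * 60  # assumed replay window

def verify_webhook(secret: str, headers: dict, body: bytes, now=None) -> bool:
    """Return True if the payload is fresh and carries a valid v1 signature."""
    msg_id = headers["webhook-id"]
    timestamp = headers["webhook-timestamp"]
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    now = time.time() if now is None else now
    if abs(now - sent_at) > TOLERANCE_SECONDS:
        return False  # too old or too far in the future: possible replay

    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed = f"{msg_id}.{timestamp}.".encode() + body
    expected = base64.b64encode(hmac.new(key, signed, hashlib.sha256).digest()).decode()

    # The header may carry several space-separated signatures (e.g. during rotation).
    for candidate in headers["webhook-signature"].split():
        version, _, value = candidate.partition(",")
        if version == "v1" and hmac.compare_digest(value, expected):
            return True
    return False
```

Because delivery is at-least-once, a handler would also want to deduplicate on `webhook-id` before acting on the payload.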

A push-based notification system that eliminates the need for inefficient polling — push a real-time HTTP POST payload to your server the instant a task finishes.
Tools: Gemini API, Python SDK, Standard Webhooks, HMAC, JWKS, Batch API
Hot Take AI Future
AI News & Strategy Daily | Nate B Jones

Nate B Jones: The Thin Ice Job Audit (TCLD Framework)

Nate's central claim: the most dangerous moment in a knowledge-work career isn't when work disappears, but when the work still exists yet less of it actually requires you.[17]Nate B Jones: AI's Thin Ice Moment His TCLD audit (Theater / Commodity / on the Line / Durable) is a practical exercise: tag every meeting, doc, email, and Slack item from the last 10 business days, then read off the T+C number — that's the fraction of your week on thin ice.

The thesis

~00:00 AI doesn't replace whole jobs — it picks away at pieces inside the job until the next economic shock exposes the hollowing. Travel agents are the canonical analog: Expedia changed booking economics first, the visible break came later when downturns forced the admission.

The first sign that your job is on thin ice is often a full calendar and no clue what's happening.
The useful question is not, will AI replace me? The useful question is, how much of my last two weeks still needed me?

The data

~03:30 OpenAI/UPenn: ~80% of US workers could see ≥10% of tasks affected, ~20% could see half their tasks affected. Anthropic Economic Index: ~49% of jobs have already had ≥25% of their tasks performed using Claude. Microsoft (200K Bing Copilot conversations): people most often bring information-gathering and writing to AI; AI most often performs writing, teaching, providing information, and advising.

Why the lag is dangerous

~06:00 Performance reviews still measure visible output (docs written, updates sent, meetings attended) — they don't ask whether the output actually required you. Tools without throughput limits collapsed the timeline.

The TCLD framework

~09:00

  • T (Theater) — work the org performs rather than examines for value. First layer AI absorbs because it was already below the threshold of real human attention.
  • C (Commodity) — real work that doesn't need you specifically. Test: could you write a spec and have someone else produce a near-equivalent?
  • L (on the Line) — uncomfortable middle. Pattern recognition where patterns are structured. A strong junior could do 70% and the last 30% feels like yours.
  • D (Durable) — you changed the question more than answered it. Read the room. Saw the stated problem wasn't the real problem.
Your job is not one thing. Your job is 50, 60, 300 small things packed into one title in a trench coat.

~14:30 What the audit reveals: most people undercount theater, find commodity bigger than expected, and find durable smaller than self-image. T+C is your thin-ice number.
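
The thin-ice number is simple arithmetic once items are tagged. A minimal sketch, assuming each item is weighted by the hours it took (the video does not say whether to weight by time or by item count):

```python
def thin_ice_fraction(items):
    """items: iterable of (tag, hours) pairs with tag in T, C, L, D.

    Returns the share of total hours tagged Theater or Commodity.
    """
    hours = {"T": 0.0, "C": 0.0, "L": 0.0, "D": 0.0}
    for tag, spent in items:
        hours[tag.upper()] += spent
    total = sum(hours.values())
    if total == 0:
        return 0.0
    return (hours["T"] + hours["C"]) / total

# Hypothetical two-week sample.
audit = [("T", 6.0), ("C", 30.0), ("L", 16.0), ("D", 8.0)]
print(f"{thin_ice_fraction(audit):.0%}")  # 60%
```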

Durable work is question-holding

~18:00 Question-answering is commodifiable because the frame is set. Durable work starts before the question and is often invisible: the bad hire that didn't get made, the 6-month detour that didn't happen, the customer escalation that never became a crisis.

Avoided damage is often where senior judgment lives.

How work compounds

~20:30 Theater compounds to nothing. Commodity compounds to the org (and gets captured by tools). Durable compounds to you.

If the thing you're improving can be captured by the system, the system will capture it.

The legibility paradox

~21:30 Durable work has to be legible enough that the system values it, but not so legible that the system can run it without you. Show outcomes, separate analysis from judgment in language ("the analysis says X, my judgment is Y"), don't expose the mechanism where there isn't one to articulate.

The real obstacle is identity

~24:30 The audit mechanics take an afternoon. The hard part is what the tags do to your professional self-image.

The advantage goes to the person who can update their self-image before the organization forces the update on them.

Six moves after the audit

~25:30 (1) Stop performing inertial theater. (2) Don't pour recovered time into more commodity work — that just makes you twice as productive at work whose value is collapsing. (3) Build a private weekly track record of judgment calls. (4) Use the record to refuse commodity work via project selection. (5) Make durable work partially legible. (6) If the audit shows no path to durable work in the current role, move.

Note: he suggests using Codex with computer use to help run the audit across email/calendar/Slack, but warns it'll require chunking across separate agents.

Tools: Codex, Claude, computer use, Bing Copilot
Hot Take Industry
Better Stack The Pragmatic Engineer

Kent C Dodds Pivots to Product Engineering — Because PMs Are Sending PRs Now

Kent C Dodds, who built a career teaching how to write clean code, is shifting everything he teaches: AI agents now one-shot production-level code, and the scarce skill is "knowing which target is worth hitting" — what he calls product engineering.[18]Better Stack: Why Kent C Dodds Stopped Teaching Code The Pragmatic Engineer's clip with Mario and Armin Ronacher confirms the inverse pressure: PMs now send PRs, marketing ships site changes, sales builds demo features.[19]Pragmatic Engineer: PMs are sending PRs

Kent's reframe: "product engineering" = bridging implementation details with product outcomes — what user problem is actually being solved, what constraints can't be violated, who is negatively affected by a change. He's started a podcast around this idea, signaling he believes it's a durable discipline rather than a temporary hedge.

The skill that actually matters now is knowing which target is worth hitting.

Armin's pushback in the Pragmatic Engineer clip is the necessary other half: democratized contribution doesn't eliminate the need for guardrails — it probably increases it.

The problem is that people are now so focused on everybody can do everything now that they forget that you still need a process to guardrail all of that. — Armin
Industry Hot Take
Fireship

732 Bytes of Python Borked Every Linux Machine — Found by an AI in 1 Hour

CVE-2026-31431 ("copy fail") is a logic flaw in the Linux kernel's AF_ALG interface, sitting unnoticed since 2017, that gives any unprivileged local user root access on essentially every Linux distribution updated since then — exploitable in 732 bytes of Python.[20]Fireship: 732 bytes Linux exploit An AI agent from Theori found it in roughly one hour of scan time.

The bug: authencesn writes 4 bytes of scratch data into what it thinks is a crypto output buffer, but a bug in the AF_ALG splice function lets that buffer point into the page cache of a read-only file. The exploit targets the read-only su binary present on every distribution.

Affected: Debian, Arch, Red Hat, Ubuntu, SUSE, Amazon Linux. CrowdStrike confirmed active exploitation. CISA added it to the KEV list. Patch is available.

The going rate for a universal Linux privilege escalation on the gray market is somewhere between $10,000 and $7 million… but a few days ago, an AI agent found one in about an hour of scan time.

Theori released a proof-of-concept and a dedicated website publicly, for free.

Tools: Theori AI scanner, Metasploit, CodeRabbit
AI Tools
AICodeKing

Hermes V0.12: Autonomous Curator Agent + SQLite Kanban for Multi-Agent Coordination

Two big releases. V0.11 was an Ink/React CLI rewrite with a pluggable transport layer that opened native AWS Bedrock support and the /steer command for nudging running agents. V0.12 adds an autonomous curator background agent that grades, prunes, and consolidates the skill library on its own, plus a 57% cold-start improvement and Hermes Kanban — a SQLite-backed durable task board for multi-agent workflows.[21]AICodeKing: Hermes V2

V0.11 — Interface Release

~01:00 Ink/React rewrite (sticky composer, live streaming, status bar, light theme). Backend: pluggable transports → native AWS Bedrock via Converse API, plus NVIDIA NIM, RCAI, Step Plan, Google Gemini CLI, Vercel AI Gateway, GPT-5.5 via Codex. Adds /steer, shell hooks, webhook direct delivery, smarter orchestrator-style delegation.

V0.12 — Curator Release

~02:06 Autonomous curator runs on its own schedule to grade, prune, and consolidate skills. Self-improvement loop is more rubric-based, prefers updating the most recently used skill, and properly inherits parent runtime. New providers: GMI Cloud, Azure AI Foundry, MiniMax, Tencent TokenHub, first-class LM Studio. Microsoft Teams as first pluggable gateway platform. Native Spotify, Google Meet, ComfyUI, Touch Designer MCP bundles. ~57% cold-start improvement.

Hermes Kanban

~03:06 Tasks live in hermes/con.db with status, assignee, parent/child dependencies, comments, run history, structured handoff data. Six lanes: triage / todo / ready / in-progress / blocked / done. Single-host by design.
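
To make the idea concrete, here is a small SQLite sketch of a task board with the six lanes and parent/child dependencies, plus a "dependency promotion" query. The table and column names are hypothetical and are not Hermes's actual schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE tasks (
    id       INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    status   TEXT NOT NULL DEFAULT 'todo'
             CHECK (status IN ('triage','todo','ready','in-progress','blocked','done')),
    assignee TEXT,
    handoff  TEXT               -- structured handoff data, e.g. a JSON string
);
CREATE TABLE deps (
    child_id  INTEGER NOT NULL REFERENCES tasks(id),
    parent_id INTEGER NOT NULL REFERENCES tasks(id),
    PRIMARY KEY (child_id, parent_id)
);
"""

def promote_ready(db: sqlite3.Connection) -> int:
    """Move 'todo' tasks to 'ready' once every parent is 'done'. Returns rows changed."""
    cur = db.execute("""
        UPDATE tasks SET status = 'ready'
        WHERE status = 'todo'
          AND NOT EXISTS (
              SELECT 1 FROM deps d JOIN tasks p ON p.id = d.parent_id
              WHERE d.child_id = tasks.id AND p.status != 'done')
    """)
    db.commit()
    return cur.rowcount

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
db.executemany("INSERT INTO tasks (id, title) VALUES (?, ?)",
               [(1, "schema"), (2, "api"), (3, "tests")])
db.executemany("INSERT INTO deps (child_id, parent_id) VALUES (?, ?)", [(2, 1), (3, 2)])
promote_ready(db)  # only "schema" has no unfinished parents
print(db.execute("SELECT title, status FROM tasks ORDER BY id").fetchall())
```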

Kanban is not only moving a card from one column to another. It is carrying structured context from one stage of the workflow to the next.

Use cases

  • ~05:08 Solo dev pipeline with dependency promotion: schema → API → tests, with explicit parent/child dependencies. Downstream workers read structured handoff (changed files, decisions) instead of re-reading conversation logs.
  • ~07:09 Fleet farming: translator/transcriber/copywriter profiles drain independent queues in parallel; "lanes by profile" view shows live state.
  • ~07:09 PM→engineer→reviewer pipeline: rejected tasks surface specific feedback for retry; each attempt stored as structured run history.
  • ~09:10 Circuit breaker + crash recovery: "gave up" state after failure limit; dead-process detection releases claims and resets tasks to ready, with crashed runs preserved in history.
If you're going to run long agent workflows, you need failure history. You need retries, you need blocked states, you need human intervention, and you need the system to not silently lose what happened.
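
The circuit-breaker and crash-recovery behaviour from the last use case can be sketched in a few lines. The failure limit, field names, and the `pid_alive` callback are hypothetical; a real implementation would need its own process-liveness check.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

MAX_FAILURES = 3  # assumed limit; the video does not state one

@dataclass
class Task:
    title: str
    status: str = "ready"
    claimed_by_pid: Optional[int] = None
    runs: list = field(default_factory=list)  # run history, never discarded

def record_failure(task: Task, error: str) -> None:
    """Log a failed run; trip the breaker once the failure limit is reached."""
    task.runs.append({"outcome": "failed", "error": error})
    task.claimed_by_pid = None
    failures = sum(1 for run in task.runs if run["outcome"] == "failed")
    task.status = "gave up" if failures >= MAX_FAILURES else "ready"

def recover_crashed(task: Task, pid_alive: Callable[[int], bool]) -> bool:
    """If the claiming worker died, keep the crashed run and release the task."""
    if task.status != "in-progress" or task.claimed_by_pid is None:
        return False
    if pid_alive(task.claimed_by_pid):
        return False
    task.runs.append({"outcome": "crashed", "pid": task.claimed_by_pid})
    task.claimed_by_pid = None
    task.status = "ready"
    return True
```

In this sketch a crash does not count toward the failure limit, which is a design assumption rather than something the video specifies.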
Tools: Hermes V0.11/V0.12, AWS Bedrock, NVIDIA NIM, Vercel AI Gateway, Azure AI Foundry, LM Studio, Microsoft Teams, ComfyUI, Touch Designer, Spotify, Google Meet, SQLite
AI Future Industry
Y Combinator

YC's RFS Trio: Inference Chips for Agents, AI-Native Discovery, Company OS

YC dropped three Request-for-Startups videos with overlapping logic. Inference chips: current GPUs hit only 30–40% peak utilization on agent workloads because work is bursty across memory-bound model calls, IO-bound tool use, and CPU-bound orchestration.[22]YC: Inference Chips for Agents AI-native discovery: PhD-level scientific reasoning lets you go from research co-pilot to closed design-make-test-analyze loops.[23]YC: AI-Native Discovery Company OS: top AI-native companies have made every meeting/ticket/customer interaction queryable to a persistent AI layer.[24]YC: AI Operating System for Companies

Inference chips for agent workflows

Most AI chips are designed for a world where inference means prompt in response out. Agents don't work that way.

What's needed: fast context switching between models, native speculative decoding, persistent KB-scale caches across an entire execution graph. NVIDIA bought Groq for $20B "because it saw this coming" — Groq's real insight wasn't the chip but the compiler. Google built TPU v7 for inference but "nobody's designing for the agent loop itself." Pairs closely with Theo's silicon-war prediction in topic 1.

AI-native discovery engines

Drug discovery, material science, protein engineering: models propose candidates, automated labs synthesize and test, results feed back. The framing is anti-tooling — don't sell research co-pilots, build engines that own the closed discovery loop.

The companies that make meaningful contributions to scientific progress won't just sell research co-pilots. There'll be AI native discovery engines that work alongside researchers to propose and validate hypotheses.

The AI operating system for companies

The pattern: top AI-native companies have a persistent AI layer learning from every meeting, ticket, customer interaction. Decision-making moves from open-loop to closed-loop. YC reports teams that adopt this "cut sprint time in half and ship 10× as much." Bottleneck today is integration (Slack/Linear/GitHub/Notion/call recordings) — opportunity for a connective layer that makes a company "legible to AI by default."

I've seen teams that do this, cut sprint time in half, and ship 10x as much.
Tools: NVIDIA GPU, Groq, Google TPU v7, Slack, Linear, GitHub, Notion
AI Models Developer Tools Hot Take
Simon Willison

Simon Willison's Day: Granite SVG Pelicans, TRE Regex, Redis Arrays, Andy Masley on Land Use

Five posts in one day. Headline finding: across all 21 GGUF quantizations of IBM's new Apache-2.0 Granite 4.1 3B, none produce a passable pelican-on-bicycle SVG, and there's "no distinguishable pattern relating quality to size."[25]Simon Willison: Granite 4.1 SVG Gallery Plus: a Python binding for the TRE regex library that scales linearly on ReDoS patterns, an interactive Redis Array playground, and a quote from Andy Masley pushing back on data-center land-use criticism.

Granite 4.1 SVG pelican gallery

IBM released Granite 4.1 (Apache 2.0) in 3B/8B/30B sizes. Unsloth produced 21 GGUF-quantized variants of the 3B (1.2GB to 6.34GB, 51.3GB combined). Willison ran his standard "pelican riding a bicycle" prompt across all 21.[25]Granite 4.1 SVG Gallery

There's no distinguishable pattern relating quality to size — they're all pretty terrible!

TRE Python binding (ReDoS robustness)

Willison built a ctypes binding for Ville Laurikari's TRE regex engine.[26]Simon Willison: TRE Python binding TRE handles 10M-character "evil" patterns faster than Python's re handles tiny ones — because TRE has no backtracking, performance scales linearly. Built experimentally with Claude Code.
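
The backtracking side of that comparison is easy to reproduce with Python's built-in `re` alone. The sketch below does not use the TRE binding, whose API is not shown here; it times a classic "evil" pattern on inputs that almost match, where the runtime of a backtracking engine typically grows sharply with each added character.

```python
import re
import time

EVIL = re.compile(r"(a+)+$")  # nested quantifiers: a classic ReDoS pattern

def time_match(n: int):
    """Match against n 'a's followed by 'b', which forces the match to fail."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = EVIL.match(text)
    return result, time.perf_counter() - start

for n in (14, 16, 18, 20):
    _, seconds = time_match(n)
    print(n, f"{seconds:.4f}s")
```

Keeping n small is deliberate: with this pattern, adding a few more characters can turn fractions of a second into minutes.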

TRE processes even notorious 'evil' patterns on gigantic inputs (10 million characters) much faster than `re` on tiny ones — scales linearly with input size instead of exponentially.

Redis Array playground

Salvatore Sanfilippo's PR adds a native array data type to Redis with 18 new commands (ARCOUNT, ARDEL, ARGREP, etc).[27]Simon Willison: Redis Array The most interesting is ARGREP — server-side regex grep against array values using vendored TRE (same library as above). Willison had Claude Code for web build a WASM-compiled Redis playground in the browser.

I had Claude Code for web build this interactive playground for trying out the new commands in a WASM-compiled build of a subset of Redis running in the browser.

Andy Masley quote on data-center land use

Willison curates a contrarian take on data-center criticism.[28]Simon Willison: Andy Masley quote

Between 2000 and 2024, farmers sold in total a Colorado-sized chunk of land all on their own, 77 times all land on data center [property acquisition], and grew more food than ever on what was left.

April newsletter

The monthly recap covers Opus 4.7, GPT-5.5 (with price increases — see topic 9), Claude Mythos and LLM security, ChatGPT Images 2.0, his LLM 0.32a0 refactor, the OpenAI-Microsoft AGI clause history (relevant to topic 1!), and DeepSeek V4 pricing.[29]Simon Willison: April 2026 newsletter Newsletter is paywalled at $10/month via GitHub Sponsors.

Pay me to send you less!
Tools: Granite 4.1, Unsloth, GGUF, llm CLI, TRE regex library, Python ctypes, Claude Code, Redis (ARGREP), WebAssembly, LLM 0.32a0
Industry AI Future AI Tools
The Rundown AI

AI in the ER, Pentagon Snubs Anthropic, Maryland Bans AI Grocery Pricing

OpenAI's o1-preview hit 67.1% diagnostic accuracy across 76 real ER cases vs 55.3% / 50.0% for two attending physicians using only raw EHR text — and flagged a rare flesh-eating infection in a transplant patient 12–24 hours before the treating doctor.[30]Rundown: AI in the ER Plus: the Pentagon added 8 vendors to classified AI networks (SpaceX, OpenAI, Google, Nvidia, Reflection, Microsoft, AWS, Oracle) and excluded Anthropic on "supply-chain risk." Maryland became the first U.S. state to ban AI-driven grocery pricing.

ER diagnostic accuracy

76 real ER cases, three decision stages, raw EHR text only. o1-preview (a 2024-era model!) outperformed attending physicians.

Flagged a rare flesh-eating infection in a transplant patient roughly 12 to 24 hours before the treating doctor caught it.

Pentagon classified AI networks

SpaceX, OpenAI, Google, Nvidia, Reflection, Microsoft, AWS, and Oracle all added. Anthropic excluded. The DoD CTO cited "supply-chain risk" and "national security moment" — striking framing given Anthropic's safety positioning. Reads as another data point in the Anthropic-vs-incumbents dynamic running through this briefing.

Maryland's AI grocery pricing ban

First U.S. state to ban AI-driven dynamic grocery pricing. Fines up to $25,000 per violation.

Labor protections

SAG-AFTRA secured AI guardrails in a new four-year studio contract. A Chinese court ruled that AI cannot justify worker termination, ordering wrongful-termination damages.

New tool drops

xAI Grok Custom Voices (voice cloning), OpenAI Codex Pets (animated progress trackers), ElevenLabs ElevenMusic (AI song generation), Xiaomi MiMo-V2.5-Pro (open-source).

Tools: OpenAI o1-preview, Custom Voices (Grok), Codex Pets (OpenAI), ElevenMusic (ElevenLabs), MiMo-V2.5-Pro (Xiaomi)
Industry Developer Tools Hot Take
Sherwood Snacks Tech Brew Github Awesome marimo AI News & Strategy Daily | Nate B Jones Acquired Real Python

Markets' '90s Flashback & the Rest of the Day's Signal

Quick hits across markets, security, dev tooling, and a couple of philosophy clips that don't merit their own topic but shouldn't be ignored.

Markets' '90s flashback (Sherwood)

Dot-com era stocks are roaring back on AI infra demand: Micron +50% in April (best month since Feb 2000), Western Digital +60% (best month since Jan 2001), Dell +27% in April (+60% YTD), Sandisk up nearly 300% in 2026.[36]Sherwood: Market's '90s flashback Tech investment as % of GDP now exceeds the 4.5% peak from 2000. Apple's contrarian play (capex −36%, partner with Gemini for Siri) gets called out, alongside $600M in inter-company transactions across Musk's empire (xAI bought $430M of Tesla Megapacks, SpaceX bought $143M of Tesla vehicles). Tim Cook flagged Mac mini/Studio supply shortages of "several months" from AI workload demand.

State health exchanges leaked customer data to Big Tech (Tech Brew)

Bloomberg investigation: nearly all 20 state-run health insurance exchanges had ad trackers transmitting sensitive personal data — race, citizenship status, sex/gender, ZIP codes, info about incarcerated family members — to Meta, Google, TikTok, Snap, and LinkedIn.[35]Tech Brew: State health exchanges leaked data Healthcare.gov (~30 states) doesn't embed these trackers. California removed them before the investigation; several others removed them only after Bloomberg called. Hospital sector tracker deployment dropped from 98% (2021) to 30% (2025), largely from litigation.

GitHub Trending #33: 7 tools for agents and devs

Quick rundown[33]Github Awesome: GitHub Trending #33:

  • ~00:08 chromex — Codex-powered Chrome side-panel assistant, runs locally so API creds stay out of extension storage.
  • ~00:33 whatcable — macOS menu bar app that reads USB-C cable hardware data and reports wattage / data speed / display support.
  • ~01:04 link-cli (Linkly) — agents request one-time virtual payment cards from your Stripe Link wallet, you approve via push notification (cf. Ramp Identity above).
  • ~01:31 open-slide — React-based agent-first slide framework on a 1920×1080 canvas.
  • ~01:57 serve-sim — Swift helper exposes booted iOS simulator as MJPEG, lets Claude Code/Cursor see + interact with the simulator.
  • ~05:03 baguette — headless iOS simulator manager, single Swift CLI, 60 FPS streaming, multi-finger gesture injection.
  • ~02:49 TagTinker — Flipper Zero app for pushing custom pixel art / text to wireless e-ink price tags via infrared.

marimo notebook design — apple sliders, salmon GIFs, storytelling

Two clips reviewing notebook competition entries.[31]marimo: How notebooks stand out[32]marimo: The Apple Slider What stood out: an embedded Minesweeper-style game inside a paper on neural thickets; a recurring dead-salmon GIF as a "red line" through a statistical-fragility piece (referencing the famous MRI-on-dead-fish experiment); and an apple-shaped slider that lets readers literally peel a D-dimensional sphere to make high-dimensional volume distribution visceral.

~01:00 Also notable: top entries used AI as a tool inside a human-directed creative process, often with explicit disclosure ("Please engage with a critical eye. Vibe coding abounds.").

Distillation = lossy MP3, not a copy (Nate B Jones short)

The cost of generating intelligence is astronomically higher than the cost of copying that intelligence. Distillation does not produce a copy of the original model. It produces a compression. And that compression, like a lossy MP3, has characteristics that matter enormously for anyone building real systems on top of these models.[37]Nate B Jones short: distillation

Acquired clip on Enzo Ferrari

Why Enzo wasn't a champion driver: he had the talent but couldn't suppress the fear of death after watching two close mentors die on the Alfa Romeo team.[38]Acquired: Motorsport champions

Real Python on Agile vs Waterfall

Short take from Real Python on why iteration beats planning when requirements aren't fixed — PDCA loops as the lean/agile alternative to upfront Gantt charts.[39]Real Python: Agile vs Waterfall Worth pairing with the Ralph Loops topic above — both arguments converge on "small loops with feedback beat big upfront plans."

Tools mentioned across this section: chromex, whatcable, link-cli, open-slide, serve-sim, baguette, TagTinker, marimo, Codex, Claude Code, Cursor, Flipper Zero, Stripe Link

Sources

  1. YouTube Microsoft and OpenAI break up (Amazon is pumped) — Theo - t3.gg, May 4
  2. Blog Building a new enterprise AI services company with Blackstone, Hellman & Friedman, and Goldman Sachs — Anthropic, May 4
  3. Blog How OpenAI delivers low-latency voice AI at scale — OpenAI, May 4
  4. YouTube Anthropic's Boris Cherny: Why Coding Is Solved, and What Comes Next — Sequoia Capital, May 4
  5. YouTube Andrej Karpathy on one of AI's weirdest flaws: the car wash problem — Sequoia Capital, May 4
  6. YouTube Ralph Loops: Build Dumb AI Loops That Ship — Chris Parsons, Cherrypick — AI Engineer, May 4
  7. YouTube Why the Real AI Agent Era Is Still 5 Years Away | Yutori, Abhishek Das — EO, May 4
  8. Newsletter Import AI 455: AI systems are about to start building themselves — Import AI (Jack Clark), May 4
  9. Blog GPT-5.5 Price Increase: What It Actually Costs — OpenRouter, May 4
  10. YouTube Why AI Won't Be a Monopoly - Dario Amodei — Dwarkesh Patel, May 4
  11. YouTube Skill Issue: How We Used AI to Make Agents Actually Good at Supabase — Pedro Rodrigues, Supabase — AI Engineer, May 4
  12. YouTube Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs — AI Engineer, May 4
  13. YouTube Building Realistic Voice Agents Has Never Been Easier — Nate Herk | AI Automation, May 4
  14. Blog Agentic identity: modeling agents to keep users in control — Ramp Builders, May 4
  15. Blog The latest AI news we announced in April 2026 — Google, May 4
  16. Blog Reduce friction and latency for long-running jobs with Webhooks in Gemini API — Google, May 4
  17. YouTube AI's 'Thin Ice' Moment: Is Your Job Already Gone? — AI News & Strategy Daily | Nate B Jones, May 4
  18. YouTube Why Kent C Dodds Stopped Teaching Code — Better Stack, May 4
  19. YouTube Mario & Armin: Product managers are now sending pull requests — The Pragmatic Engineer, May 4
  20. YouTube 732 bytes of Python just borked every Linux machine on earth… — Fireship, May 4
  21. YouTube Hermes Agent V2.0 (Refreshed!): This NEW UPDATE to HERMES IS CRAZY! — AICodeKing, May 4
  22. YouTube Inference Chips for Agent Workflows — Y Combinator, May 4
  23. YouTube AI-Native Discovery Engines — Y Combinator, May 4
  24. YouTube The AI Operating System for Companies — Y Combinator, May 4
  25. Blog Granite 4.1 3B SVG Pelican Gallery — Simon Willison, May 4
  26. Blog TRE Python binding — ReDoS robustness demo — Simon Willison, May 4
  27. Blog Redis Array Playground — Simon Willison, May 4
  28. Blog Quoting Andy Masley — Simon Willison, May 4
  29. Blog April 2026 newsletter — Simon Willison, May 4
  30. Newsletter AI shows its skills in the emergency room — The Rundown AI, May 4
  31. YouTube How do you make a notebook actually stand out? — marimo, May 4
  32. YouTube The Weirdest/Coolest Slider Sofar — marimo, May 4
  33. YouTube GitHub Trending Today #33: chromex, whatcable, link-cli, open-slide, serve-sim, baguette, TagTinker — Github Awesome, May 4
  34. YouTube Database Hacked — Arjay McCandless, May 4
  35. Newsletter US state health exchanges leaked customer data to Big Tech — Tech Brew, May 4
  36. Newsletter Market's '90s flashback — Sherwood Snacks, May 4
  37. YouTube AI Is Cheaper to Copy Than Create #Shorts — AI News & Strategy Daily | Nate B Jones, May 4
  38. YouTube What separates motorsport champions from the rest? — Acquired, May 4
  39. YouTube Agile vs. Waterfall: Why Iteration Beats Planning — Real Python, May 4