May 4, 2026
OpenAI's amended Microsoft agreement kills cloud exclusivity, retires the AGI clause, and (according to Theo's reading of the fine print) flipped reverse revenue-share into a profit share — meaning unprofitable OpenAI now likely owes Microsoft "jack f***ing s***."[1]Theo: Microsoft and OpenAI break up The same day, OpenAI announced a $50B AWS investment, 2GW of Trainium capacity, and a stateful runtime co-built on Bedrock[1]Theo: Microsoft and OpenAI break up — the deal that couldn't happen until exclusivity was gone. Theo's verdict: "Sam is one of the greatest negotiators of all time because he fleeced Microsoft."
~02:22 The original 2019 deal made Azure OpenAI's exclusive cloud and licensed all of OpenAI's pre-AGI IP to Microsoft, but "AGI" was never defined, which in practice left the agreement with no off-ramp. Microsoft eventually put $13B-plus into OpenAI for a stake now worth ~$135B (about 27% on a fully diluted basis).
~07:30 Theo identifies the September 2024 o1 launch as the actual fracture: OpenAI refused to hand chain-of-thought details to Microsoft, despite IP-sharing terms requiring it. The leaked yelling incident — Sam Altman pressuring OpenAI staff (including then-CTO Mira Murati) to ship faster to Redmond — "was the start of the end."
OpenAI built this with 250 people, Nadella said. Why do we have Microsoft Research at all? He was pissed.
~15:48 Microsoft becomes "primary" not "exclusive" cloud, OpenAI ships products on any cloud, the IP license extends to 2032 but loses exclusivity, and Microsoft loses right of first refusal on compute. Microsoft no longer pays revenue share to OpenAI for serving its models — but the reverse direction was reportedly converted from revenue share to profit share, which on OpenAI's economics means roughly nothing.
Sam is one of the greatest negotiators of all time because he fleeced Microsoft.
~19:08 AWS becomes the exclusive third-party cloud distribution for OpenAI's Frontier agent platform, OpenAI commits to 2GW of Trainium, and AWS gets to co-build a stateful runtime on Bedrock — a feature Theo says "Azure would be hellish to try and implement." The existing $38B AWS commitment is extended by $100B over eight years across Trainium 3 and Trainium 4 (FP4 compute, more HBM, shipping 2027).
~21:14 Theo's enterprise thesis: Anthropic's reported $30B run rate isn't because Claude is better; it's because Claude is on Bedrock, Vertex, and Azure, while OpenAI was Azure-locked. He also flags that AWS, GCP, and Azure startup credits cannot be used on Anthropic models: by his estimate the revenue share on those models is so aggressive (~50%) that the clouds would lose real money giving away free Anthropic compute.
OpenAI is petrified of the enterprise growth at Anthropic outpacing their own enterprise growth.
~27:00 Theo had $1M in Azure credits he refused to spend because Azure-hosted GPT-5.4 was 2.2× slower on average and up to 15× slower at the tail (sometimes 0.3–2 tok/s vs 70+ on OpenAI direct). After he shipped azure.t3.gg as a public benchmark, Microsoft asked him to delete it. He took it down, but warned he would repost it in 15 days. By noon the next day the bugs were fixed.
I guess bullying works.
~33:25 OpenAI on Trainium will be its first non-NVIDIA deployment — a real risk given Anthropic's reported quality regressions on the same hardware. Theo predicts the next year's AI war is NVIDIA vs AMD vs Trainium vs TPU, with only NVIDIA and Google positioned across both layers.
The breakup lands inside a broader capex frenzy: Sherwood notes Alphabet, Amazon, Meta, and Microsoft together are planning $700B+ in 2026 capex, while Apple's capex actually fell 36% last quarter and Apple chose Google's Gemini to power Siri rather than build a frontier model.[36]Sherwood: Market's '90s flashback
Anthropic is co-founding a new enterprise AI services firm with Blackstone, Hellman & Friedman, Goldman Sachs, General Atlantic, Leonard Green, Apollo, GIC, and Sequoia.[2]Anthropic: New enterprise AI services company The new entity embeds Anthropic applied AI engineers directly with mid-market client teams — community banks, mid-sized manufacturers, regional health systems — segments underserved by Anthropic's existing partners (Accenture, Deloitte, PwC). No funding amount was disclosed; CFO Krishna Rao is the public face.
The investor consortium spans private equity (Blackstone, H&F, Leonard Green, Apollo), investment banking (Goldman Sachs), growth equity (General Atlantic, Sequoia), and sovereign wealth (Singapore's GIC). The structure is unusual — a separately-capitalized services firm rather than another consulting partnership — and the implicit message is that Anthropic believes the consulting bottleneck on Claude adoption is real and won't be solved by simply onboarding more Big Four partners.
Enterprise demand for Claude is significantly outpacing any single delivery model. This new firm brings additional operating capability to the ecosystem.
Pair this with Theo's read on AWS+Bedrock[1]Theo on Anthropic enterprise dominance and Anthropic's enterprise lead becomes a deliberate two-front strategy: ubiquitous cloud distribution + a captive services arm to land the mid-market that won't pay Deloitte rates.
Yi Zhang and William McDonald walk through how ChatGPT voice and the Realtime API serve 900M+ WAU on a custom relay-plus-transceiver WebRTC stack, written in Go, deployed on Kubernetes, with Cloudflare geo-steering for signaling and a Global Relay fleet for media ingress.[3]OpenAI: Low-latency voice AI at scale The post is unusually concrete on the Linux-level optimizations they used to avoid kernel-bypass.
WebRTC handles ICE/DTLS/SRTP, codec negotiation, jitter buffering, and echo cancellation — solved problems OpenAI didn't want to re-implement per platform. Because nearly every session is 1:1 (one user, one model), they chose a transceiver model over a Selective Forwarding Unit: the transceiver terminates the WebRTC session at the edge and converts media to simpler internal protocols, so backend inference services scale like ordinary HTTP services.
Conventional WebRTC reserves one public UDP port per session, which doesn't survive Kubernetes autoscaling. OpenAI split routing from termination: a stateless relay exposes a tiny fixed UDP surface (e.g. 203.0.113.10:3478) and forwards packets without decrypting media, while transceivers own ICE/DTLS/SRTP state. Routing metadata is encoded into the ICE username fragment (ufrag), avoiding any hot-path external lookup; Redis caches the IP:Port→transceiver mapping for fast recovery.
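A minimal sketch of how routing metadata could ride inside the ICE ufrag so a stateless relay can pick a transceiver without an external lookup. The byte layout (two 16-bit IDs plus random padding), the function names, and the IDs are all hypothetical; the post does not describe OpenAI's actual encoding.

```python
import base64
import os
import struct


def make_ufrag(cluster_id: int, transceiver_id: int) -> str:
    """Pack routing IDs plus 5 random bytes into a 12-character ufrag."""
    raw = struct.pack(">HH", cluster_id, transceiver_id) + os.urandom(5)
    # 9 bytes encode to 12 base64 characters with no '=' padding, and the
    # standard base64 alphabet (A-Z, a-z, 0-9, +, /) stays within the
    # character set ICE permits for a ufrag.
    return base64.b64encode(raw).decode("ascii")


def route_from_stun_username(username: str) -> tuple[int, int]:
    """Recover (cluster_id, transceiver_id) from a STUN USERNAME value."""
    # In an inbound connectivity check the receiving side's ufrag comes
    # first, followed by ':' and the sender's ufrag.
    server_ufrag = username.split(":", 1)[0]
    raw = base64.b64decode(server_ufrag)
    return struct.unpack(">HH", raw[:4])


ufrag = make_ufrag(cluster_id=7, transceiver_id=42)
print(route_from_stun_username(f"{ufrag}:clientfrag"))
```

Because the relay only parses a header field, it never needs the DTLS or SRTP keys, which matches the "forwards packets without decrypting media" split described above.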
Three techniques carry the load: the Linux SO_REUSEPORT socket option lets several UDP sockets share one port so the kernel spreads packets across worker goroutines; Go's runtime.LockOSThread pins each reader to one OS thread for cache locality; and pre-allocated buffers minimize GC pressure. They explicitly evaluated kernel-bypass (DPDK-style) and decided their workload didn't justify the operational complexity.
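For readers who have not used SO_REUSEPORT, here is a small illustration in Python rather than the Go the post describes. It only shows the socket option itself: two UDP sockets bound to the same loopback port, which on Linux lets the kernel distribute incoming datagrams between them. The helper name is made up for the example, and it assumes a platform that exposes `socket.SO_REUSEPORT` (Linux does).

```python
import socket


def reuseport_udp_socket(host: str, port: int) -> socket.socket:
    """Create a UDP socket that may share its port with sibling sockets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Must be set before bind() on every socket that shares the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    return sock


first = reuseport_udp_socket("127.0.0.1", 0)      # port 0: let the OS choose
shared_port = first.getsockname()[1]
second = reuseport_udp_socket("127.0.0.1", shared_port)
print("both sockets bound to port", shared_port)
```

In a real relay each socket would be owned by its own reader loop; thread pinning and buffer reuse are separate concerns not shown here.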
Optimize for the common case before reaching for kernel bypass. A narrow Go implementation with careful use of SO_REUSEPORT, thread pinning, and low-allocation parsing was enough for our workload.
Cloudflare's geo and proximity steering routes the initial HTTP/WebSocket SDP exchange to the nearest transceiver cluster; the SDP answer points the client at the nearest Global Relay; ICE ufrag carries the rest. Both signaling and media take a near-user path into OpenAI's backbone.
The broader lesson is that the best place to add complexity is in a thin routing layer, not in every backend service, and not in custom client behavior.
Notable: the post credits Justin Uberti (one of WebRTC's original architects) and Sean DuBois (creator of Pion) as colleagues now at OpenAI — a quiet flex.
Anthropic Labs' Boris Cherny — the engineer who built Claude Code — tells Lauren Reeder that for him, today, the model writes 100% of his code, he ships dozens of PRs per day (record: 150), and most of his work now happens from his phone via the Claude app's code tab.[4]Sequoia: Boris Cherny on Claude Code His one-line thesis on what comes next: "loops are the future."
~02:30 Claude Code began inside Anthropic Labs in late 2024 as a deliberate bet on a "product overhang" — Sonnet 3.5 could go far beyond tab-completion if wrapped in a real agent harness.
~03:50 The first six months were a flop; Opus 4 (May) was the inflection. Growth has compounded with every model release since (4.5, 4.6, 4.7).
~05:30 The "coding is solved" claim, precisely: for Boris, today, Claude writes 100% of his code. The Claude Code codebase (which leaked, so he can say this) is TypeScript and React — chosen because they were "on distribution" for the model in late 2024.
~06:30 The phone-first setup: 5–10 sessions concurrently, "a few hundred agents going" daytime, "a few thousand" overnight. Personal record — 150 PRs in a single day last week.
~07:30 The mechanism Boris uses most: /loop — Claude schedules a recurring cron job (every minute, 5 min, daily). He runs dozens: one babysitting his PRs and auto-rebasing, one fixing flaky CI, one clustering Twitter feedback every 30 minutes. Anthropic just shipped "routines" — the server-side equivalent that runs with your laptop closed.
I sort of feel like loops are the future at this point.
~08:55 Boris predicts product engineers spanning iOS/web/server, plus engineers who are also strong designers, data scientists, or PMs. On the Claude Code team itself: "every single person codes" — EM, PM, designers, data scientist, finance, user researcher.
~10:30 Pushback on the "SaaS is dead" meme via Hamilton Helmer's 7 Powers: switching costs and process power erode (4.7 hill-climbs to a target until done), but network effects, scale economies, and cornered resources still matter. Prediction: 10× more disruptive startups in the next decade because incumbents can't retrain orgs fast enough.
Claude is getting really good at figuring out process. With 4.7, it can just hill climb anything. If you give it a target and tell it to iterate until it's done, it will just do it. I think this is the first model like that.
~13:15 Model vs. harness: roughly 50/50 historically, with the harness mattering less as models improve. Current focus: loops as a first-class primitive, easier multi-agent orchestration, and reducing safety scaffolding (prompt-injection guards, static command verification, permission modes) because future models will "just do the right thing."
~15:30 1400s Europe had ~10% literacy. Fifty years after the printing press, more output than the prior thousand years and a 100× cost drop in books. Software authoring follows the same arc, "much faster than 50 years."
The best person to write accounting software, I think maybe even today, is not an engineer, it's a really good accountant — because they know the domain really well and coding is the easy part.
~17:30 No model gap (everyone gets the same models). Real gap: org process. Inside Anthropic, Claudes talk to each other over Slack to resolve unknowns; no manually written code anywhere; all SQL is model-written.
We have no more manually written code anywhere at the company. All of the SQL is written by models. Everything is just built by the models.
~19:18 The parallelization decision should be the model's, not the user's. 4.7 already volunteers loops ("I noticed the data is changing — I'll start a loop and report every 30 minutes via Slack MCP").
~21:50 MCP is the answer for everything with an integration (Salesforce, Google Docs, Calendar). For everything else, computer use via Co-work — slow but reliable with 4.7.
~23:30 Claude Design (already good, will be much better), more Claude Code launches in the coming weeks, massively parallel agent products (loop, batch), and computer use.
Karpathy gives the new canonical example of LLM jaggedness: state-of-the-art Opus 4.7 will refactor a 100,000-line codebase or find zero-day vulnerabilities, and then tell you to walk 50 meters to a car wash.[5]Karpathy on the car wash problem His explanation: capabilities track the data distribution and RL environments labs deliberately added — code is overweighted because it's economically valuable and easy to verify.
Karpathy's GPT-3.5 → GPT-4 chess example is the cleanest data point: chess didn't improve from general capability scaling, it improved because someone at OpenAI added a large chess dataset to pre-training.
How is it possible that state-of-the-art Opus 4.7 will simultaneously refactor a 100,000 line code base or find zero-day vulnerabilities and yet tells me to walk to this car wash? This is insane.
If you're in the circuits that were part of the RL, you fly and if you're in the circuits that are out of the data distribution, you're going to struggle.
Practical takeaway: you don't get a manual with the model, so you have to empirically explore which circuits your application sits in. If you're outside the distribution, fine-tuning is the answer — not waiting for the next general-capability bump.
Cherrypick's Chris Parsons argues that complex agent orchestration (n8n flows, parallel multi-agent dependency graphs) is the wrong shape for current coding agents — instead, run a "dumb" loop that repeatedly tells the AI to "do the next most important thing" and let the model handle dependencies.[6]AI Engineer: Ralph Loops by Chris Parsons The talk is named after Ralph Wiggum: keep trying the same thing until it works. Pairs cleanly with Boris Cherny's "loops are the future" thesis.[4]Boris Cherny on loops as primitive
~00:14 Audience poll: most attendees use Claude Code or Codex, many no longer write code themselves, and a growing minority use them for non-coding work.
~03:17 Why he abandoned n8n: a weekly newsletter workflow kept failing every Monday at 2pm. He replaced ~a week of fragile JSON with a single Claude Code skill that produces better newsletters and self-improves with "update the skill with anything you should have done differently" at the end of each run.
~07:18 The earliest Ralph (Geoffrey Huntley) was just: after the AI says it's done, give it the same prompt again. With GPT-5.12+ and Opus/Sonnet 4.6+ this rarely catches anything because models actually finish — but the pattern unlocked something bigger.
~21:25 The real unlock (credited to Matt Pocock): point the loop at a list of tickets and say "pick the next most important one." His earlier failed attempt was hand-building a giant dependency graph and firing 6–7 parallel agents; they collided on shared tickets and reimplemented the same work.
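The "dumb loop over a ticket list" idea can be sketched in a few lines. This is an illustration under stated assumptions, not Parsons' setup: `run_agent` is a hypothetical callable standing in for a coding-agent invocation, and `stub_agent` is a toy replacement so the example runs on its own.

```python
from typing import Callable, Optional

PROMPT = "Pick the next most important open ticket and complete it."


def ralph_loop(
    run_agent: Callable[[str, list], Optional[str]],
    tickets: list,
    max_iterations: int = 20,
) -> list:
    """Re-issue the same prompt until no tickets remain or the budget ends."""
    remaining = list(tickets)
    done = []
    for _ in range(max_iterations):
        if not remaining:
            break
        # The agent chooses what to do next; there is no dependency graph.
        finished = run_agent(PROMPT, remaining)
        if finished is None:
            break  # agent declined: hand back to a human
        remaining.remove(finished)
        done.append(finished)
    return done


def stub_agent(prompt: str, open_tickets: list) -> Optional[str]:
    # Toy stand-in for a real agent: "prioritizes" alphabetically.
    return min(open_tickets) if open_tickets else None


print(ralph_loop(stub_agent, ["fix login", "add export", "bump deps"]))
```

The iteration cap is the only coordination logic; everything else is delegated to whatever the agent decides is most important.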
I think we over complicate things by assuming that they need to be in parallel. I quite like the idea of just starting with a loop.
~25:25 AI is good at picking the next ticket, bad at parallel coordination. Hand-built dependency graphs become AI-driven waterfall.
~40:35 Claude Code's new built-in /loop command wraps this with a cron trigger ("loop every minute, build the next ticket"). Parsons runs a 6am morning briefing loop, a 15-min heartbeat, a worker loop driving a vibe-coded Kanban PM, and an experimental "startup" skill that once spontaneously generated an investor update deck.
~66:48 One hard rule for autonomous loops: "is this reversible without embarrassment to me?" If no, hand back to a human. The AI drafts ~15–16 emails every morning but never sends them; never closes a project itself.
Is this reversible without embarrassment to me? If the answer is no, don't do it.
~52:45 Loops run on a VPS with separate API keys and narrowly-scoped permissions. He's building "Lockbox" to harden this and cites Simon Willison's "lethal trifecta": an agent that combines exposure to untrusted content, the ability to communicate externally, and access to private data can be tricked into leaking that data.
~54:45 Moving validation into a sub-agent (with fresh context) catches bugs that same-context validation misses because of confirmation bias.
~95:13 If your release process is the bottleneck, faster coding via Ralph just queues up more PRs and makes things worse. Goldratt's The Goal applies.
The bottleneck is usually not the number of agents. It's usually you just keeping up with the AI just doing things over and over again.
AI can do all of the rubbish work, but it can't and it shouldn't do the work that I'm uniquely good at.
Yutori co-founder Abhishek Das frames the central technical wall: agents make sequences of decisions, and over a 10–50-step workflow even 90% per-step accuracy collapses to a low overall success rate.[7]EO: Yutori's Abhishek Das interview His other complaint is cultural: the industry has wrongly normalized non-determinism and slop. Yutori's stance — "if it's not good enough to work on the first try, it's not good enough."
~00:00 Roughly 100 agent products claim "do anything on the web" yet rarely succeed first-try. Compounding error is the wall.
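The compounding-error arithmetic is easy to check. The sketch below assumes each step succeeds independently and that the agent never recovers from a mistake, which is the pessimistic case; the function name is just for illustration.

```python
def workflow_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step_accuracy ** steps


for steps in (10, 25, 50):
    print(f"{steps} steps at 90% per step: {workflow_success(0.90, steps):.1%}")
```

At 90% per step, a 10-step task finishes cleanly about a third of the time and a 50-step task well under 1% of the time, which is why backtracking and recovery matter more than marginal per-step gains.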
~03:04 Yutori is reimagining the browser for an era where users talk to AI assistants that take actions and complete tasks proactively in the background. Digital agents arrive before physical ones.
~06:06 Every production query goes through a comprehensive eval suite. The differentiator is whether an agent can recognize a mistake, backtrack, and try a different branch — because the universe of websites is infinite and models will always be trained on a finite subset.
~07:06 Pushback on normalized "slop" — Yutori enforces standards via weekly hour-plus dog-fooding and runs tens of internal experiments at any time, only one of which might ship.
It feels like we have started normalizing and developed a tolerance for non-determinism and low reliability in shipping products.
If it's not good enough to work on the first try, it's not good enough.
~10:07 Connects to his earlier Grad-CAM interpretability research: AI shouldn't only deliver final answers, it should expose proof of work. Yutori's Scouts feature includes a UI button to inspect which sites the agent visited and what it looked at.
In a world where it's very easy to come up with first prototypes using these coding LLMs, the true differentiator is in taste and craft.
Jack Clark's central case: SWE-Bench has gone from ~2% (Claude 2, late 2023) to 93.9% (Claude Mythos Preview, 2026), METR independent-task-completion time has grown from ~30 seconds in 2022 to ~12 hours in 2026 (projected ~100 hours by year-end), and OpenAI has publicly committed to "an automated AI research intern by September 2026."[8]Import AI 455: Automating AI research His warning: small alignment errors compound — 99.9% accuracy degrades to ~60% after 500 generations.
SWE-Bench: 2% → 93.9%. METR time horizons: GPT-3.5 ~30s → GPT-4 4 min → o1 40 min → GPT-5.2 High ~6 hrs → Opus 4.6 ~12 hrs → projected ~100 hrs by end-2026.
The vast majority of people I meet at frontier labs and around Silicon Valley now code entirely through AI systems.
CORE-Bench (computational reproducibility): GPT-4o at 21.5% in Sep 2024 → Opus 4.5 at 95.5% in Dec 2025. MLE-Bench (75 Kaggle-style problems): o1 16.9% → Gemini3+search 64.4%. GPU kernel design progressing fast (DeepSeek, Meta PyTorch→CUDA, Huawei AscendCraft, ByteDance CUDA Agent).
PostTrainBench: human baseline 51% uplift; Opus 4.6 and GPT-5.4 at 25–28%. Anthropic's CPU-only training optimization speedup: Opus 4 (May 2025) 2.9× → 4.5 16.5× → 4.6 30× → Mythos Preview (Apr 2026) 52×. Human expert: 4–8×.
A Gemini model produced 13 solutions across ~700 Erdős problems, with one (Erdős-1051) tentatively novel. UBC/UNSW/Stanford/DeepMind paper: "The proofs of the main results were discovered with very substantial input from Google Gemini and related tools."
OpenAI: "automated AI research intern by September 2026." Anthropic: published research on automated alignment researchers. DeepMind: "automation of alignment research should be done when feasible." Recursive Superintelligence: $500M raised explicitly for this. Mirendil: "building systems that excel at AI R&D."
Unless your alignment approach is '100% accurate'…things can go wrong quite quickly. For example, your technique is 99.9% accurate, then that becomes 95.12% accurate after 50 generations, and 60.5% accurate after 500 generations.
Clark predicts a capital-heavy/human-light economy with autonomous AI corporations, Amdahl's Law bottlenecks where digital acceleration hits physical constraints (drug trials), and governance gaps around redistribution.
AI systems are about to start building themselves. What does that mean?
GPT-5.5 doubled the sticker price (input $2.50→$5.00/M, output $15→$30/M), but OpenRouter's analysis of users who switched found real-world costs increased 49–92%, not 100%, because the model became less verbose on long inputs.[9]OpenRouter: GPT-5.5 cost analysis
For prompts 10K–128K tokens, GPT-5.5 generates 19–34% fewer output tokens than GPT-5.4. Worst real-world hit: short prompts under 2K tokens (+92%, because it actually outputs slightly more there). Best case: 50K–128K prompts (+49%). Above 128K it spikes back to +85% — likely the model becoming verbose again at extreme context lengths.
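A rough way to see why a doubled sticker price does not have to mean a doubled bill. The prices are the ones quoted above; the token counts and the 30% output reduction in the example are hypothetical inputs chosen for illustration, not figures from OpenRouter's data.

```python
PRICES = {  # USD per million tokens, as quoted above
    "gpt-5.4": {"input": 2.50, "output": 15.00},
    "gpt-5.5": {"input": 5.00, "output": 30.00},
}


def request_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def cost_increase(input_tokens: float, old_output: float, output_reduction: float) -> float:
    """Fractional cost change when GPT-5.5 emits fewer output tokens than GPT-5.4."""
    new_output = old_output * (1 - output_reduction)
    old = request_cost("gpt-5.4", input_tokens, old_output)
    new = request_cost("gpt-5.5", input_tokens, new_output)
    return new / old - 1


# Hypothetical request: 20K input tokens, 8K output tokens on GPT-5.4.
print(f"no change in verbosity: {cost_increase(20_000, 8_000, 0.0):+.0%}")
print(f"30% fewer output tokens: {cost_increase(20_000, 8_000, 0.30):+.0%}")
```

The size of the saving depends on how output-heavy a request is: the larger the share of the bill that comes from output tokens, the more a drop in verbosity offsets the price increase.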
We observed cost increases between 49–92%.
Methodology used OpenRouter's unified tokenization, excluded cancelled requests, media, and zero-token requests. No direct Claude/Gemini comparisons in the post.
In a short clip from a longer interview, Dario opens with: "I don't think this field's going to be a monopoly. All my lawyers never want me to say the word monopoly."[10]Dwarkesh interviews Dario Amodei His structural argument: Facebook-style network effects don't exist for AI labs. The right analog is cloud computing — three or four players sustained by capital and expertise barriers — but with more product differentiation than cloud has.
~00:00 The thesis, with the cloud analog: "You have three, maybe four players within cloud. I think that's the same for AI. Three, maybe four." Capital intensity and depth of frontier-lab expertise are the real barriers, not user lock-in.
Profits are not astronomical, margins are not astronomical, but they're not zero.
Notable concession from a frontier-lab CEO: AI will be a real business, but not winner-take-all. Stakes out a middle position against both monopoly framing and commodity collapse.
~01:01 Dario rejects the simplified "Claude codes, GPT reasons" story as too coarse — "models are good at different types of coding, models have different styles." Implication: AI oligopoly will be stickier and more product-defined than the cloud oligopoly. Buyers care which model they pick in a way they don't really care which hyperscaler hosts their VMs.
Cloud is very undifferentiated. Models are more differentiated than cloud.
Note: this transcript is a short clip (~1 min). Anchors are limited; the wider interview likely contains more detail.
Supabase AI tooling engineer Pedro Rodrigues demos how Anthropic-style "skills" (skill.md folders) fix subtle agent failures — like Claude creating Postgres views that silently bypass row-level security — and walks through an eval-driven workflow for testing skills before shipping.[11]AI Engineer: Pedro Rodrigues on Supabase The same RLS landmine bit indie creator Arjay this week — bypassing his per-user AI usage limits and almost producing a $10K bill.[34]Arjay: Database Hacked
~00:14 Pedro reframes "Skill Issue" → "Level up your skills" (the original title is reserved for his keynote). His mandate at Supabase: improving "DAX" — developer-agent experience.
~03:16 Folders containing skill.md (frontmatter with required name and description + markdown body), optional reference markdown, optional bash/python scripts. The key innovation over MCP is progressive disclosure — only the frontmatter description loads into context until the agent decides the skill is relevant.
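Progressive disclosure can be illustrated with a tiny frontmatter reader: only `name` and `description` are extracted up front, and the markdown body stays unread until the skill is chosen. This is a simplified sketch that handles flat `key: value` lines only (real frontmatter is YAML), and the sample skill text is invented for the example.

```python
def read_frontmatter(text: str) -> dict:
    """Return the flat key/value pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing frontmatter")
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    else:
        # The loop ended without finding the closing fence.
        raise ValueError("unterminated frontmatter")
    missing = {"name", "description"} - meta.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return meta


SAMPLE_SKILL = """---
name: supabase-security
description: Use when creating Postgres views or policies on Supabase.
---
# Guidance
Create views WITH (security_invoker = true).
"""

meta = read_frontmatter(SAMPLE_SKILL)
print(meta["name"], "->", meta["description"])
```

Only the short description would sit in the agent's context by default; the body below the second fence is loaded later, if at all.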
~06:18 MCP for integrations and remote/authenticated tool execution; skills for workflow context and prompt-template-style guidance that won't fit in a tool description. Anthropic's new tool_search tool is MCP's answer to progressive disclosure.
~14:26 A vibe-coded Next.js perf-review app on Supabase with four users (employee, two managers, HR). Pedro asks Claude to add a department_stats SQL view via the Supabase MCP server. Claude creates it and reports success.
~31:55 Switching to Bob (engineering manager) reveals everyone can see all departments' salaries. Bug: Postgres views default to the creator's permissions, bypassing RLS. Fix: WITH (security_invoker = true), available since Postgres 15 — but underrepresented in model training data.
~38:57 Installs the pre-built supabase-security skill via npx skills (Vercel's package). With the skill loaded, Claude generates the view with security_invoker. Tip: starting skill descriptions with the verb "use" measurably increases load rate on Claude.
~62:34 Following agent-skills.org: eval.json with prompts/expected outputs/assertions; Python harness resets DB, runs Claude Code in headless mode twice (with/without skill), writes grading.json.
~70:50 Live run produces a counter-intuitive result — "with skill" fails. Pedro uses it as a meta-point: the eval was checking the wrong metadata (view definition instead of pg_class reloptions). Evals are just code, and bad assertions produce bad signals.
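One way an eval assertion could check the right metadata: read the view's `reloptions` from `pg_class` and look for the `security_invoker` option, instead of pattern-matching the view definition. The sketch below covers only the parsing step and assumes the options arrive as a list of `key=value` strings; the query text and helper name are illustrative, not Pedro's harness.

```python
# Illustrative query an eval harness might run (not executed here):
RELOPTIONS_QUERY = "SELECT reloptions FROM pg_class WHERE relname = 'department_stats'"

TRUTHY = {"true", "on", "1", "yes"}


def has_security_invoker(reloptions) -> bool:
    """True if a view's reloptions enable security_invoker."""
    for option in reloptions or []:
        key, _, value = option.partition("=")
        if key.strip().lower() == "security_invoker":
            return value.strip().lower() in TRUTHY
    return False


print(has_security_invoker(["security_invoker=true"]))  # view created with the option
print(has_security_invoker(None))                       # view created without options
```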
~00:00 Arjay's app was hacked when a user removed per-user AI usage limits — same root cause: missing row-level security. Without RLS, a client-side SELECT * FROM todos returns everyone's rows.[34]Arjay: Database Hacked
Someone found a way to remove those limits. And in theory, they could have racked up a $10,000 bill if they wanted to.
ElevenLabs' speech-to-text lead Angelos Perivolaropoulos walks through training a tiny ~1.8M-parameter GPT-2-style decoder on a laptop or a free Colab T4, with concrete loss-to-quality milestones (overfit threshold ≈1.0 on this dataset).[12]AI Engineer: Train an LLM from scratch The Q&A covers reasoning models, multimodal injection, and how ElevenLabs' audio side actually works.
~00:14 Intro: ElevenLabs Scribe v2 is currently top-ranked on public transcription benchmarks. Workshop framing: pure PyTorch, no pre-trained weights, gets you ~80% of the way to how labs actually design models.
~04:17 Four building blocks: tokenizer, model architecture, training loop, inference.
~09:22 Character-level tokenizer on tiny Shakespeare: 65 tokens, ~4,225 bigrams, tractable for the dataset size. Production labs use BPE; a 50K vocab like GPT-2's would balloon embedding params to 19M and overwhelm a small model.
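A character-level tokenizer is small enough to show in full. The sketch below builds one from a sample string (the workshop uses the tiny Shakespeare corpus, which is not bundled here) and reproduces the vocabulary arithmetic from the note, using GPT-2's 50,257-token vocabulary and the 384-dimensional embedding from the model config.

```python
text = "To be, or not to be, that is the question."

chars = sorted(set(text))                      # vocabulary = unique characters
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> char


def encode(s: str) -> list:
    return [stoi[c] for c in s]


def decode(ids: list) -> str:
    return "".join(itos[i] for i in ids)


assert decode(encode(text)) == text            # round-trips losslessly

EMBED_DIM = 384
print("bigrams with 65 chars:", 65 ** 2)                       # 4,225
print("embedding params, 65-char vocab:", 65 * EMBED_DIM)      # 24,960
print("embedding params, GPT-2 vocab:", 50_257 * EMBED_DIM)    # 19,298,688
```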
~15:29 Transformer building blocks: multi-head causal self-attention, MLP, residual connections (so each layer makes small adjustments), layer norm (keeps activations from exploding).
~23:36 Model config: 6 layers, 6 heads, 384-dim, 256-token context, ~1.8M params total.
~41:51 Training loop: batch size 64, AdamW + cosine LR with 100-step warmup over 5,000 total steps.
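The warmup-plus-cosine schedule can be written as a pure function of the step number. The 100-step warmup and 5,000 total steps come from the note above; the peak and floor learning rates are placeholder values, since the summary does not give them.

```python
import math


def lr_at(step: int, max_lr: float = 1e-3, min_lr: float = 1e-4,
          warmup: int = 100, total: int = 5000) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 100, 2550, 5000):
    print(step, f"{lr_at(step):.6f}")
```

In a PyTorch loop the returned value would typically be written into each optimizer parameter group before the step.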
~49:56 Loss milestones: ln(65) ≈ 4.17 (random guessing) → 3.3 (character frequencies, "th") → 2.5 (the word "the") → 1.5–2.0 (real words appearing) → 1.0–1.2 (recognizable Shakespearean phrases) → below 1.0 = overfit. Optimal in his test: ~2,400 steps, ~15 minutes on a free Colab T4.
When the loss starts going below 1.0 for this specific dataset, that's where we're going to start seeing overfitting. The model will still be producing reasonable things, but it will no longer start getting better at it.
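The starting point of that loss curve follows from the vocabulary size: a model that guesses uniformly over 65 characters has cross-entropy ln(65). Exponentiating a loss gives its perplexity, a rough "how many characters is the model choosing between" figure; the milestone losses below are the ones listed in the notes.

```python
import math

VOCAB_SIZE = 65

# Cross-entropy of a uniform guess over the vocabulary.
print(f"random baseline: {math.log(VOCAB_SIZE):.2f}")

for loss in (3.3, 2.5, 1.5, 1.0):
    print(f"loss {loss:.1f} -> perplexity {math.exp(loss):.1f}")
```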
~53:00 Inference: greedy good for transcription, boring for LLMs. Use temperature ~0.7 + top-k. Fixed seed for the workshop's competition (best Shakespearean verse wins ElevenLabs swag).
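Temperature and top-k sampling are simple enough to sketch without PyTorch. This version works on a plain list of logits; `k=5` and the example logits are illustrative choices, and the workshop's own code may differ.

```python
import math
import random


def sample_top_k(logits, k=5, temperature=0.7, rng=random):
    """Sample a token id from the k highest logits after temperature scaling."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0, 3.0]
rng = random.Random(1337)  # fixed seed so runs are repeatable
print([sample_top_k(logits, k=2, rng=rng) for _ in range(10)])
```

Setting `k=1` reduces this to greedy decoding; lowering the temperature concentrates probability on the top candidates.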
~63:12 Reasoning models share the same base architecture and are post-trained on very high-quality chain-of-thought data — labs like Scale AI hire physicists and PhDs because bad data breaks models. Someone has converted Llama 1B (non-reasoning) into a reasoning model purely via post-training.
The model just cares about these embeddings. It doesn't care if it's text or if it's audio or if it's video.
Multimodal: video/audio encoders produce hidden vectors injected into the text transformer's embedding layer at prefix positions. ElevenLabs trains tokenizers on mel spectrograms; TTS often uses L2 over spectrograms or KL-divergence for distillation rather than cross-entropy. Music generation can be autoregressive or diffusion-based; diffusion is generally easier to get working for abstract modalities.
Nate's proof-of-concept: a voice agent trained on all 400 of his YouTube transcripts, scaffolded by Claude Code in ~15 minutes.[13]Nate Herk: Voice Agents Then a longer live build of a B2B sales agent that books Cal.com discovery calls, with concrete fixes for the Adam-voice-too-AI bug, UTC vs Central time, and security/cost lockdown for public widgets.
~03:00 Anatomy of a voice agent: persona (system prompt) + voice (ElevenLabs library or 4-hr custom clone) + knowledge (docs, Supabase/Pinecone vector stores) + tools (API calls, MCP servers, n8n, Zapier, Python).
It is a loop. It's not magic.
~07:02 Live build with Claude Code in VS Code: dictate the goal in plain language → enable plan mode so Claude interviews him about Cal.com event type, voice persona, required data fields → Claude autonomously creates .env, wires Cal.com + ElevenLabs APIs, writes the system prompt, picks a voice, injects the widget snippet.
~10:03 Dictation: he switched from Whisper to GLO (faster, private) and joined the GLO team.
~18:05 Iterative debugging: Adam voice sounds too AI, agent doesn't deliver first message, check-availability tool queries UTC instead of Central. Each fix described in plain language; Claude Code reads ElevenLabs docs and inspects the conversation transcript dashboard to pinpoint the UTC bug.
~29:14 Security & cost: ElevenLabs widgets are HTML you can copy and the owner pays per minute, so lock to specific hostnames, cap call duration, rate-limit, ground knowledge base. Deploy: GitHub → Vercel → live widget; Twilio for phone.
Code beats clicks — it's so much better to just build a voice agent by speaking into your computer rather than going onto the dashboard and clicking and clicking.
Ramp's stance: handing agents raw session tokens or API keys breaks attribution, scoping, and lifecycle controls. Their first answer is "On Behalf Of User" (OBOU) — Agent Keys tied to both a human sponsor and a business entity, with permissions strictly scoped to a subset of the sponsor's role-based access.[14]Ramp Builders: Agent Identity
Agent Key tied to a human sponsor + business entity. Permissions are always a strict subset of the sponsor's RBAC — agents can be scoped down but never elevated. All spend and actions attribute to the sponsor: audit logs read "Approved by Sarah (via Codex)." Lifecycle ties to employment status or explicit expiration; sponsor, manager, or admin can revoke.
An agent can never have more access than its sponsor. Every OBOU action has a human accountable.
Agent Keys deliberately don't authenticate — they're identifiers in an OAuth2-PKCE flow that produces short-lived JWTs (≤1 hr) plus a refresh token. JWT refresh re-validates that the Agent Key isn't revoked and the sponsor still has the required permissions. A leaked Agent Key alone is useless; a leaked access token expires within an hour and can be revoked instantly.
A leaked access token is bad for an hour (and can be revoked instantly). A leaked API key would have been bad until someone notices.
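The two invariants described here, scopes that never exceed the sponsor's and re-validation on every refresh, can be sketched with a couple of dataclasses. All class, field, and permission names are hypothetical, and the token is a plain dict rather than a signed JWT; this is not Ramp's implementation.

```python
import time
from dataclasses import dataclass


@dataclass
class Sponsor:
    name: str
    permissions: set
    active: bool = True


@dataclass
class AgentKey:
    sponsor: Sponsor
    requested_scopes: set
    revoked: bool = False


def mint_access_token(key: AgentKey, now=None, ttl_seconds: int = 3600) -> dict:
    """Called on first issue and on every refresh, so checks are re-applied."""
    if key.revoked or not key.sponsor.active:
        raise PermissionError("agent key revoked or sponsor inactive")
    now = time.time() if now is None else now
    # Intersection: the agent can be scoped down, never above its sponsor.
    scopes = key.requested_scopes & key.sponsor.permissions
    return {"sub": key.sponsor.name, "scopes": sorted(scopes), "exp": now + ttl_seconds}


sarah = Sponsor("sarah", {"cards:read", "bills:approve"})
key = AgentKey(sarah, {"bills:approve", "admin:all"})
print(mint_access_token(key)["scopes"])   # admin:all is dropped

sarah.permissions.discard("bills:approve")
print(mint_access_token(key)["scopes"])   # next refresh reflects the change
```

Because the check runs at refresh time, a permission change or revocation takes effect no later than the current token's expiry.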
Ramp's existing DenormalizedActor type was extended with an optional AgentContext field, so every agent action automatically participates in audit logging from the moment the Agent Key is created. Existing human-initiated flows required no changes.
Agent Keys have expiration dates with renewal reminders. Revocation produces a data-exhaust log of actions taken before revocation. Renewal-over-rotation was a deliberate trade-off: less management burden, slight theoretical security cost.
Pairs naturally with the broader agent-security theme this week — Pedro's RLS demo[11]Pedro on Supabase RLS, Arjay's RLS near-miss[34]Arjay: Database Hacked, and Linkly (in GitHub Trending below) all hammer the same point: agents need first-class identity primitives, not stolen credentials.
Cloud Next '26 drew 32,000+ attendees and 260+ announcements centered on the "agentic AI era," with 330 organizations now processing 1T+ tokens annually on Google Cloud and ~75% of Cloud customers using Google Cloud AI.[15]Google AI updates: April 2026 Headlines include Gemma 4 ("byte for byte the most capable open model" — Gemma family is past 500M downloads), 8th-gen TPUs, Deep Research Max, the Gemini Enterprise Agent Platform, and Google Vids free for all account holders (10 videos/month).
Google.org + J&J Foundation: $10M for AI literacy/training in rural U.S. healthcare workforce.
Long-running Gemini jobs (Deep Research, long video generation, batch processing) no longer need polling. Google added webhooks that follow the Standard Webhooks spec: HMAC signing for project-level configuration, JWKS for per-request configuration, signed headers (webhook-signature, webhook-id, webhook-timestamp) for replay protection, and at-least-once delivery with 24-hour retry.[16]Google: Webhooks in Gemini API
Configure either globally at the project level (HMAC-secured) or dynamically per-request (JWKS-secured). Push payload arrives the instant a task finishes, eliminating polling overhead for jobs that span minutes or hours. Python SDK example for batch task configuration available, plus full docs at ai.google.dev/gemini-api/docs/webhooks and a Cookbook notebook on GitHub.
A push-based notification system that eliminates the need for inefficient polling — push a real-time HTTP POST payload to your server the instant a task finishes.
Nate's central claim: the most dangerous moment in a knowledge-work career isn't when work disappears, but when the work still exists yet less of it actually requires you.[17]Nate B Jones: AI's Thin Ice Moment His TCLD audit (Theater / Commodity / on the Line / Durable) is a practical exercise: tag every meeting, doc, email, and Slack item from the last 10 business days, then read off the T+C number — that's the fraction of your week on thin ice.
~00:00 AI doesn't replace whole jobs — it picks away at pieces inside the job until the next economic shock exposes the hollowing. Travel agents are the canonical analog: Expedia changed booking economics first, the visible break came later when downturns forced the admission.
The first sign that your job is on thin ice is often a full calendar and no clue what's happening.
The useful question is not, will AI replace me? The useful question is, how much of my last two weeks still needed me?
~03:30 OpenAI/UPenn: ~80% of US workers could see ≥10% of tasks affected, ~20% could see half their tasks affected. Anthropic Economic Index: ~49% of jobs have already had ≥25% of their tasks performed using Claude. Microsoft (200K Bing Copilot conversations): people most often bring information-gathering and writing to AI; AI most often performs writing, teaching, providing information, and advising.
~06:00 Performance reviews still measure visible output (docs written, updates sent, meetings attended) — they don't ask whether the output actually required you. Tools without throughput limits collapsed the timeline.
Your job is not one thing. Your job is 50, 60, 300 small things packed into one title in a trench coat.
~14:30 What the audit reveals: most people undercount theater, find commodity bigger than expected, and find durable smaller than self-image. T+C is your thin-ice number.
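The thin-ice number is simple arithmetic once the items are tagged. A small sketch, assuming one tag letter per item and an unweighted count (weighting by hours would be an equally reasonable reading); the function name is made up for illustration.

```python
from collections import Counter

TAGS = {"T": "theater", "C": "commodity", "L": "on the line", "D": "durable"}


def thin_ice_fraction(tags):
    """Return (T + C) / total for a sequence of one-letter TCLD tags."""
    counts = Counter(tags)
    unknown = set(counts) - set(TAGS)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tagged items")
    return (counts["T"] + counts["C"]) / total


# Ten tagged items: 2 theater, 3 commodity, 1 on the line, 4 durable.
example = thin_ice_fraction(["T", "T", "C", "C", "C", "L", "D", "D", "D", "D"])
```

In this example five of the ten items are theater or commodity, so half the sampled work would count as thin ice.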
~18:00 Question-answering is commodifiable because the frame is set. Durable work starts before the question and is often invisible: the bad hire that didn't get made, the 6-month detour that didn't happen, the customer escalation that never became a crisis.
Avoided damage is often where senior judgment lives.
~20:30 Theater compounds to nothing. Commodity compounds to the org (and gets captured by tools). Durable compounds to you.
If the thing you're improving can be captured by the system, the system will capture it.
~21:30 Durable work has to be legible enough that the system values it, but not so legible that the system can run it without you. Show outcomes, separate analysis from judgment in language ("the analysis says X, my judgment is Y"), don't expose the mechanism where there isn't one to articulate.
~24:30 The audit mechanics take an afternoon. The hard part is what the tags do to your professional self-image.
The advantage goes to the person who can update their self-image before the organization forces the update on them.
~25:30 (1) Stop performing inertial theater. (2) Don't pour recovered time into more commodity work — that just makes you twice as productive at work whose value is collapsing. (3) Build a private weekly track record of judgment calls. (4) Use the record to refuse commodity work via project selection. (5) Make durable work partially legible. (6) If the audit shows no path to durable work in the current role, move.
Note: he suggests using Codex with computer use to help run the audit across email/calendar/Slack, but warns it'll require chunking across separate agents.
Kent C Dodds, who built a career teaching how to write clean code, is shifting everything he teaches: AI agents now one-shot production-level code, and the scarce skill is "knowing which target is worth hitting" — what he calls product engineering.[18]Better Stack: Why Kent C Dodds Stopped Teaching Code The Pragmatic Engineer's clip with Mario and Armin Ronacher confirms the inverse pressure: PMs now send PRs, marketing ships site changes, sales builds demo features.[19]Pragmatic Engineer: PMs are sending PRs
Kent's reframe: "product engineering" = bridging implementation details with product outcomes — what user problem is actually being solved, what constraints can't be violated, who is negatively affected by a change. He's started a podcast around this idea, signaling he believes it's a durable discipline rather than a temporary hedge.
The skill that actually matters now is knowing which target is worth hitting.
Armin's pushback in the Pragmatic Engineer clip is the necessary other half: democratized contribution doesn't eliminate the need for guardrails — it probably increases it.
The problem is that people are now so focused on everybody can do everything now that they forget that you still need a process to guardrail all of that. — Armin
CVE-2026-31431 ("copy fail") is a logic flaw in the Linux kernel's AF_ALG interface, sitting unnoticed since 2017, that gives any unprivileged local user root access on essentially every Linux distribution updated since then — exploitable in 732 bytes of Python.[20]Fireship: 732 bytes Linux exploit An AI agent from Theori found it in roughly one hour of scan time.
The bug: authencesn writes 4 bytes of scratch data into what it thinks is a crypto output buffer, but a flaw in the AF_ALG splice path lets that buffer point into the page cache of a read-only file. The exploit targets the read-only su binary present on every distribution.
Affected: Debian, Arch, Red Hat, Ubuntu, SUSE, Amazon Linux. CrowdStrike confirmed active exploitation. CISA added it to the KEV list. Patch is available.
The going rate for a universal Linux privilege escalation on the gray market is somewhere between $10,000 and $7 million… but a few days ago, an AI agent found one in about an hour of scan time.
Theori released a proof-of-concept and a dedicated website publicly, for free.
Two big releases. V0.11 was an Ink/React CLI rewrite with a pluggable transport layer that opened native AWS Bedrock support and the /steer command for nudging running agents. V0.12 adds an autonomous curator background agent that grades, prunes, and consolidates the skill library on its own, plus a 57% cold-start improvement and Hermes Kanban — a SQLite-backed durable task board for multi-agent workflows.[21]AICodeKing: Hermes V2
~01:00 Ink/React rewrite (sticky composer, live streaming, status bar, light theme). Backend: pluggable transports → native AWS Bedrock via Converse API, plus NVIDIA NIM, RCAI, Step Plan, Google Gemini CLI, Vercel AI Gateway, GPT-5.5 via Codex. Adds /steer, shell hooks, webhook direct delivery, smarter orchestrator-style delegation.
~02:06 Autonomous curator runs on its own schedule to grade, prune, and consolidate skills. Self-improvement loop is more rubric-based, prefers updating the most recently used skill, and properly inherits parent runtime. New providers: GMI Cloud, Azure AI Foundry, MiniMax, Tencent TokenHub, first-class LM Studio. Microsoft Teams as first pluggable gateway platform. Native Spotify, Google Meet, ComfyUI, TouchDesigner MCP bundles. ~57% cold-start improvement.
~03:06 Tasks live in hermes/con.db with status, assignee, parent/child dependencies, comments, run history, structured handoff data. Six lanes: triage / todo / ready / in-progress / blocked / done. Single-host by design.
Kanban is not only moving a card from one column to another. It is carrying structured context from one stage of the workflow to the next.
If you're going to run long agent workflows, you need failure history. You need retries, you need blocked states, you need human intervention, and you need the system to not silently lose what happened.
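To make the shape of a SQLite-backed task board concrete, here is a minimal sketch. Only the six lane names and the listed attributes (status, assignee, parent/child links, comments, run history, handoff data) come from the notes above; the table layout, column names, and helper functions are assumptions, not Hermes's actual schema.

```python
import sqlite3

LANES = ("triage", "todo", "ready", "in-progress", "blocked", "done")

# Hypothetical schema: one row per task, plus an append-only event log.
SCHEMA = f"""
CREATE TABLE IF NOT EXISTS tasks (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'triage'
        CHECK (status IN ({", ".join(repr(lane) for lane in LANES)})),
    assignee TEXT,
    parent_id INTEGER REFERENCES tasks(id),
    handoff TEXT  -- JSON blob of structured context for the next stage
);
CREATE TABLE IF NOT EXISTS task_events (
    id INTEGER PRIMARY KEY,
    task_id INTEGER NOT NULL REFERENCES tasks(id),
    kind TEXT NOT NULL,   -- e.g. 'comment', 'run', 'status'
    detail TEXT NOT NULL
);
"""


def open_board(path=":memory:"):
    """Open (or create) a board database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def move(conn, task_id, status):
    """Change a task's lane and log the change in one transaction."""
    with conn:  # commits on success, rolls back if the CHECK constraint fails
        conn.execute("UPDATE tasks SET status = ? WHERE id = ?", (status, task_id))
        conn.execute(
            "INSERT INTO task_events (task_id, kind, detail) VALUES (?, 'status', ?)",
            (task_id, status),
        )
```

Keeping status changes and their history in the same transaction is one way to get the "don't silently lose what happened" property: a move either lands with its log entry or not at all.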
YC dropped three Request-for-Startups videos with overlapping logic. Inference chips: current GPUs hit only 30–40% peak utilization on agent workloads because work is bursty across memory-bound model calls, IO-bound tool use, and CPU-bound orchestration.[22]YC: Inference Chips for Agents AI-native discovery: PhD-level scientific reasoning lets you go from research co-pilot to closed design-make-test-analyze loops.[23]YC: AI-Native Discovery Company OS: top AI-native companies have made every meeting/ticket/customer interaction queryable to a persistent AI layer.[24]YC: AI Operating System for Companies
Most AI chips are designed for a world where inference means prompt in response out. Agents don't work that way.
What's needed: fast context switching between models, native speculative decoding, and persistent KV caches across an entire execution graph. NVIDIA bought Groq for $20B "because it saw this coming"; Groq's real insight wasn't the chip but the compiler. Google built TPU v7 for inference but "nobody's designing for the agent loop itself." Pairs closely with Theo's silicon-war prediction in topic 1.
Drug discovery, materials science, protein engineering: models propose candidates, automated labs synthesize and test, results feed back. The framing is anti-tooling: don't sell research co-pilots, build engines that own the closed discovery loop.
The companies that make meaningful contributions to scientific progress won't just sell research co-pilots. There'll be AI native discovery engines that work alongside researchers to propose and validate hypotheses.
The pattern: top AI-native companies have a persistent AI layer learning from every meeting, ticket, customer interaction. Decision-making moves from open-loop to closed-loop. YC reports teams that adopt this "cut sprint time in half and ship 10× as much." Bottleneck today is integration (Slack/Linear/GitHub/Notion/call recordings) — opportunity for a connective layer that makes a company "legible to AI by default."
I've seen teams that do this, cut sprint time in half, and ship 10x as much.
Five posts in one day. Headline finding: across all 21 GGUF quantizations of IBM's new Apache-2.0 Granite 4.1 3B, none produce a passable pelican-on-bicycle SVG, and there's "no distinguishable pattern relating quality to size."[25]Simon Willison: Granite 4.1 SVG Gallery Plus: a Python binding for the TRE regex library that scales linearly on ReDoS patterns, an interactive Redis Array playground, and a quote from Andy Masley pushing back on data-center land-use criticism.
IBM released Granite 4.1 (Apache 2.0) in 3B/8B/30B sizes. Unsloth produced 21 GGUF-quantized variants of the 3B (1.2GB to 6.34GB, 51.3GB combined). Willison ran his standard "pelican riding a bicycle" prompt across all 21.[25]Granite 4.1 SVG Gallery
There's no distinguishable pattern relating quality to size — they're all pretty terrible!
Willison built a ctypes binding for Ville Laurikari's TRE regex engine.[26]Simon Willison: TRE Python binding TRE handles 10M-character "evil" patterns faster than Python's re handles tiny ones — because TRE has no backtracking, performance scales linearly. Built experimentally with Claude Code.
TRE processes even notorious 'evil' patterns on gigantic inputs (10 million characters) much faster than `re` on tiny ones — scales linearly with input size instead of exponentially.
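The backtracking side of that comparison is easy to reproduce with the standard library alone. The sketch below does not use TRE or Willison's binding; it only times Python's `re` on a classic nested-quantifier pattern, chosen here as an example of an "evil" pattern, to show how failure time grows with input length.

```python
import re
import time

EVIL = re.compile(r"(a+)+$")  # nested quantifiers: a classic backtracking trap


def time_failed_match(n):
    """Time a match attempt that must fail on n 'a' characters followed by 'b'."""
    text = "a" * n + "b"  # the trailing 'b' forces every split of the a's to be tried
    start = time.perf_counter()
    matched = EVIL.match(text) is not None
    return matched, time.perf_counter() - start


for n in (10, 14, 18):
    matched, seconds = time_failed_match(n)
    print(n, matched, f"{seconds:.4f}s")
```

Each extra character roughly doubles the work for a backtracking engine, which is why inputs only a few dozen characters long can stall `re`, while an engine without backtracking keeps its cost proportional to input length.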
Salvatore Sanfilippo's PR adds a native array data type to Redis with 18 new commands (ARCOUNT, ARDEL, ARGREP, etc).[27]Simon Willison: Redis Array The most interesting is ARGREP — server-side regex grep against array values using vendored TRE (same library as above). Willison had Claude Code for web build a WASM-compiled Redis playground in the browser.
I had Claude Code for web build this interactive playground for trying out the new commands in a WASM-compiled build of a subset of Redis running in the browser.
Willison curates a contrarian take on data-center criticism.[28]Simon Willison: Andy Masley quote
Between 2000 and 2024, farmers sold in total a Colorado-sized chunk of land all on their own, 77 times all land on data center [property acquisition], and grew more food than ever on what was left.
The monthly recap covers Opus 4.7, GPT-5.5 (with price increases — see topic 9), Claude Mythos and LLM security, ChatGPT Images 2.0, his LLM 0.32a0 refactor, the OpenAI-Microsoft AGI clause history (relevant to topic 1!), and DeepSeek V4 pricing.[29]Simon Willison: April 2026 newsletter Newsletter is paywalled at $10/month via GitHub Sponsors.
Pay me to send you less!
OpenAI's o1-preview hit 67.1% diagnostic accuracy across 76 real ER cases vs 55.3% / 50.0% for two attending physicians using only raw EHR text — and flagged a rare flesh-eating infection in a transplant patient 12–24 hours before the treating doctor.[30]Rundown: AI in the ER Plus: the Pentagon added 8 vendors to classified AI networks (SpaceX, OpenAI, Google, Nvidia, Reflection, Microsoft, AWS, Oracle) and excluded Anthropic on "supply-chain risk." Maryland became the first U.S. state to ban AI-driven grocery pricing.
76 real ER cases, three decision stages, raw EHR text only. o1-preview (a 2024-era model!) outperformed attending physicians.
Flagged a rare flesh-eating infection in a transplant patient roughly 12 to 24 hours before the treating doctor caught it.
SpaceX, OpenAI, Google, Nvidia, Reflection, Microsoft, AWS, and Oracle all added. Anthropic excluded. The DoD CTO cited "supply-chain risk" and "national security moment" — striking framing given Anthropic's safety positioning. Reads as another data point in the Anthropic-vs-incumbents dynamic running through this briefing.
First U.S. state to ban AI-driven dynamic grocery pricing. Fines up to $25,000 per violation.
SAG-AFTRA secured AI guardrails in a new four-year studio contract. A Chinese court ruled that AI cannot justify worker termination, ordering wrongful-termination damages.
xAI Grok Custom Voices (voice cloning), OpenAI Codex Pets (animated progress trackers), ElevenLabs ElevenMusic (AI song generation), Xiaomi MiMo-V2.5-Pro (open-source).
Quick hits across markets, security, dev tooling, and a couple of philosophy clips that don't merit their own topic but shouldn't be ignored.
Dot-com era stocks roaring back on AI infra demand: Micron +50% in April (best month since Feb 2000), Western Digital +60% (since Jan 2001), Dell +27% in April (+60% YTD), Sandisk up nearly 300% in 2026.[36]Sherwood: Market's '90s flashback Tech investment as % of GDP now exceeds the 4.5% peak from 2000. Apple's contrarian play (capex −36%, partner with Gemini for Siri) gets called out, alongside $600M in inter-company transactions across Musk's empire (xAI bought $430M of Tesla Megapacks, SpaceX bought $143M of Tesla vehicles). Tim Cook flagged Mac mini/Studio supply shortages of "several months" from AI workload demand.
Bloomberg investigation: nearly all 20 state-run health insurance exchanges had ad trackers transmitting sensitive personal data — race, citizenship status, sex/gender, ZIP codes, info about incarcerated family members — to Meta, Google, TikTok, Snap, and LinkedIn.[35]Tech Brew: State health exchanges leaked data Healthcare.gov (~30 states) doesn't embed these trackers. California removed them before the investigation; several others removed them only after Bloomberg called. Hospital sector tracker deployment dropped from 98% (2021) to 30% (2025), largely from litigation.
Quick rundown[33]Github Awesome: GitHub Trending #33:
Two clips reviewing notebook competition entries.[31]marimo: How notebooks stand out[32]marimo: The Apple Slider What stood out: an embedded Minesweeper-style game inside a paper on neural thickets; a recurring dead-salmon GIF as a "red line" through a statistical-fragility piece (referencing the famous MRI-on-dead-fish experiment); and an apple-shaped slider that lets readers literally peel a D-dimensional sphere to make high-dimensional volume distribution visceral.
~01:00 Also notable: top entries used AI as a tool inside a human-directed creative process, often with explicit disclosure ("Please engage with a critical eye. Vibe coding abounds.").
The cost of generating intelligence is astronomically higher than the cost of copying that intelligence. Distillation does not produce a copy of the original model. It produces a compression. And that compression, like a lossy MP3, has characteristics that matter enormously for anyone building real systems on top of these models.[37]Nate B Jones short: distillation
Why Enzo Ferrari wasn't a champion driver: he had the talent but couldn't suppress the fear of death after watching two close mentors die on the Alfa Romeo team.[38]Acquired: Motorsport champions
Short take from Real Python on why iteration beats planning when requirements aren't fixed — PDCA loops as the lean/agile alternative to upfront Gantt charts.[39]Real Python: Agile vs Waterfall Worth pairing with the Ralph Loops topic above — both arguments converge on "small loops with feedback beat big upfront plans."