Codex went mobile. Anthropic went to $950B.

Developer Tools AI Tools Podcast

Codex jumps to your phone, /goal mode, and HIPAA

OpenAI shipped Codex inside the ChatGPT mobile app on iOS and Android across all plans — start, steer, approve, and review agentic coding tasks from your phone while Codex runs on your laptop, a Mac mini, or a managed remote env, with a secure relay syncing screenshots, terminal output, diffs, and approvals in real time^{[1]OpenAI Blog — Work with Codex from anywhere}. Same drop: Remote SSH GA, Hooks GA, scoped programmatic tokens for CI on Enterprise/Business plans, and HIPAA-compliant local Codex for ChatGPT Enterprise healthcare workspaces^{[1]OpenAI Blog — enterprise Codex}. On the OpenAI Forum, Codex head Tibo Sottiaux says the majority of Codex tasks at OpenAI are now non-coding — research, finance, ops, chief-of-staff — and previews /goal mode for relentless multi-day runs^{[2]OpenAI Forum — Codex Beyond Coding}. To stress-test the new generality, Nate B Jones handed Codex a Google Street View screenshot of a random house in Gwalior, India and a two-line prompt; Codex returned a one-page valuation in 90 seconds that a local realtor called solid^{[3]Nate B Jones — Codex valued a random house in India}.

Mobile: a secure relay, not a fresh Codex

The mobile app loads the live state of whichever machine Codex is running on. Files, credentials, permissions, and local setup never leave that machine; what flows to the phone is screenshots, terminal output, diffs, test results, and approvals. OpenAI says "more than 4 million people now use Codex every week," and the listed use cases are explicitly small-window-of-attention: debug while in line, unblock a refactor decision mid-commute, capture an idea before it dissolves^{[1]OpenAI Blog}.

As agents take on longer-running work, a new rhythm for collaboration is emerging.

Enterprise: SSH, hooks, tokens, HIPAA

Remote SSH is GA — the desktop app auto-detects SSH hosts from config and lets teams run Codex threads inside managed enterprise environments (approved dependencies, credentials, compute) that are then reachable from any authorized device through the same relay. Hooks are GA and can scan prompts for secrets, run validators, log conversations, create memories, or customize behavior per repo. Programmatic access tokens are scoped credentials issued from ChatGPT workspace settings for CI pipelines and release workflows (Enterprise/Business only). HIPAA-compliant local Codex (CLI, IDE, app) is now supported for eligible ChatGPT Enterprise workspaces — opening Codex to healthcare orgs^{[1]OpenAI Blog}.
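To make the "scan prompts for secrets" use case concrete, here is a minimal Python sketch of the kind of check a hook script might run. The function name `find_secrets`, the pattern list, and the idea of receiving the prompt as a plain string are assumptions for illustration; the source does not describe the actual Codex hooks input/output contract, so this is not that API.

```python
import re

# Hypothetical pattern list; a real deployment would rely on a maintained secret scanner.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token": re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def find_secrets(prompt: str) -> list[str]:
    """Return the names of every pattern that matches the prompt text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(prompt)]
```

A hook built on this could refuse to forward the prompt whenever the returned list is non-empty.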

Codex beyond coding: Tibo's chief-of-staff workflow

On the Forum interview, Sottiaux traces Codex from a cloud-only PR bot to a general agent ~01:09. The non-coding shift happened in the last ~6 months, driven by GPT-5's generality and GPT-5.2's long-horizon reliability ~03:11. Even engineers code only 20–30% of the day; the rest is tickets, comms, context-gathering — so engineers were already dogfooding Codex for non-code work ~04:11. The "aha" was watching Codex PM Alexander Embiricos run many parallel agents during a launch — chasing people in Slack, summarizing user feedback, keeping a live plan doc updated during a meeting ~06:12. CFO Sarah Friar reportedly organized the latest fundraise with Codex's help ~09:14.

The majority of the tasks that are being performed in Codex are actually non-coding tasks.

Live demo (SF bread map, ~13:17): Tibo voice-prompts "I'm in San Francisco, I'm really into bread. I want a map of all the loaves I can buy with a description and price." Codex first returns a spreadsheet (Jane the Bakery, Arsicault, Tartine, Neighbor Bakehouse for the Sardo loaf), then takes 4 more minutes to generate a clickable map, for a total of ~10 minutes from a single voice prompt with zero manual review. He calls this "home-cooked personalized software."

Tibo's own habit: 100+ Codex tasks per day ~17:18. He uses Codex as a chief of staff that runs daily at 9 a.m., crawls Gmail/Notion/Calendar, summarizes the day, and flags at-risk items. Other jobs: desktop file organizing, compute-fleet management, on-call rotation health, launch-schedule risk flags, personalized news.

/goal mode and the new alignment monitor

/goal is a new slash-command (CLI today, app soon) that puts Codex into relentless long-horizon mode — hours, days, or weeks on a single goal ~23:20. Use cases cited: hard math, performance optimization, whole-program language porting, physics/math research. The stated direction: 24/7 continuous agents, not turn-based. Prompting tip: declare success criteria precisely ("10-slide deck, first 2 X, next 6 technical, last 2 open questions") so Codex can self-check completion ~27:21.
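The prompting tip amounts to stating the goal as something mechanically checkable. As a rough illustration only, the quoted deck spec can be written as data plus a checker; `meets_criteria` and `DECK_SPEC` are hypothetical names, and "X" is kept as the placeholder used in the talk rather than filled in.

```python
from itertools import chain


def meets_criteria(slide_sections: list[str], spec: list[tuple[int, str]]) -> bool:
    """True if the slides match the spec exactly: N slides per section, in order."""
    expected = list(chain.from_iterable([label] * count for count, label in spec))
    return slide_sections == expected


# The quoted spec: 10 slides = first 2 "X", next 6 technical, last 2 open questions.
DECK_SPEC = [(2, "X"), (6, "technical"), (2, "open questions")]
```

An agent given criteria in this shape can test its own output against them before declaring the goal complete.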

On enterprise trust ~29:24, Tibo says capability isn't the bottleneck — safety/security is. Codex runs in sandboxes by default with tight FS scoping and optional network-off, and a new "auto review" feature runs a secondary alignment-monitor agent over every action of the primary agent, blocking high-risk ones.
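The "auto review" idea, a second agent scoring every action before it runs, can be sketched in a few lines. Everything here is an assumption for illustration: the `Action` shape, the `run_with_review` gate, the 0.8 threshold, and the `toy_reviewer` rules stand in for a monitor model whose real behavior the interview does not specify.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str    # hypothetical categories, e.g. "shell", "file_write", "network"
    detail: str  # command line, path, or URL


def run_with_review(
    actions: list[Action],
    risk_of: Callable[[Action], float],
    threshold: float = 0.8,
) -> tuple[list[Action], list[Action]]:
    """Split proposed actions into (allowed, blocked) using the reviewer's risk score."""
    allowed, blocked = [], []
    for action in actions:
        (blocked if risk_of(action) >= threshold else allowed).append(action)
    return allowed, blocked


def toy_reviewer(action: Action) -> float:
    """Stand-in for the monitor agent: flags network use and recursive deletes."""
    if action.kind == "network" or "rm -rf" in action.detail:
        return 1.0
    return 0.1
```

The design point is that the primary agent never executes anything directly; only actions in the allowed list reach the sandbox.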

Maybe a couple of months ago we were excited it was working for 10 minutes. Now we're talking about agents working for weeks on the hardest tasks.

Nate B Jones: Codex values a random Gwalior house

Nate's test combined image recognition, web search, and reasoning: one Google Street View screenshot of a random Gwalior home + two-line prompt. Codex returned a one-pager covering an estimated value range, confidence level, methodology, and comparables in ~90 seconds. His local real-estate contact called the estimate solid^{[3]Nate B Jones — Codex valued a random Gwalior house}.

Tools: Codex (app, CLI, web, mobile), ChatGPT, Remote SSH, Hooks, ChatGPT Enterprise/Business, GPT-5, GPT-5.2, Sora 2, Slack/Gmail/Notion/Calendar integrations, /goal slash command, auto-review alignment agent.

Hot Take

Theo - t3.gg

Theo cancels Claude over a 40x SDK rate cut

Theo opens in a Claude hat, sarcastic: "Anthropic finally responded." Starting June 15, paid Claude plans get a separate, capped "programmatic usage" credit ($20/$100/$200 per tier) for everything built on the Agent SDK or claude -p^{[4]Theo — I'm done}. After months of being told Agent SDK wrappers were a supported path, tools like T3 Code, Zed's ACP adapter, OpenClaude, and Hermes Agent see effective sub usage cut from ~$5k–$7.5k of inference down to $200 — Theo calls it a 25–40x cut. He cancels his Claude plan, dons a Codex hoodie, and switches to GPT-5.5 on low-fast^{[4]Theo — switches to Codex}.

The three-tier policy

Theo lays out the new tiers ~26:13: (1) interactive in Anthropic's own UI — full sub limits; (2) programmatic via Agent SDK / claude -p — capped at the dedicated credit bucket; (3) anything else (OpenClaude, Hermes, raw API) — full API pricing. The credits equal plan price ($20/$100/$200), must be manually claimed, don't roll over, and reset each cycle ~04:00. Worse, the bucket can only be spent on Anthropic-authored code — you can't spend it on your own custom API calls ~30:16.

The 40x subsidy that funded everything

~09:04 The economics: $200/month subs deliver an estimated $5,000 of inference per month, and after recent weekly limit bumps roughly $7,500 — a ~40x subsidy funded by enterprise API spend (engineers get addicted personally, burn tokens at work). A T3 Code user on the $200 plan goes from up to $7,500 of inference to $200, a 25–40x cut, even though T3 Code is just a different UI for the same Claude Code.
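The "25–40x" range follows directly from the figures above. A quick check, using Theo's estimates as inputs (they are his numbers, not audited ones); `cut_factor` is just an illustrative helper name:

```python
def cut_factor(inference_value_usd: float, credit_usd: float) -> float:
    """How many times smaller the new credit is than the old inference value."""
    return inference_value_usd / credit_usd


print(cut_factor(5_000, 200))  # 25.0
print(cut_factor(7_500, 200))  # 37.5
```

The upper figure, 37.5x, is what gets rounded to "40x" in the video.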

The "supported path" that wasn't

Theo cites Boris (Anthropic) in February explicitly green-lighting Agent SDK and claude -p for "local development and experimentation" ~15:06, and an April promise that clarity was coming. Matt Pocock, who built and sold a Claude Code course chasing that answer for six months, is quoted: "I have never before experienced from any developer tool such a frustrating lack of clarity over the basic terms of usage."

I will never make that mistake again. Until we see significant change, it is safe to assume any statement from an Anthropic employee is a lie on a timer.

What's now collateral damage

~39:20 Knock-on bans: Project Glass Wing scanning files for security issues, Jared's programmatic Bun-to-Rust rewrite, CI integrations on GitLab/GitHub, asking Codex to invoke claude -p for a review pass, Zed's ACP adapter, and any ACP-compliant tool.

In its current state, all of this is an attack on open source. If you're building open source things in, around, and on top of Claude code, you have to pay 40 times more.

T3 Code's response: ship Anthropic's UI

~35:18 Before June 15, T3 Code will ship an option to embed Anthropic's "shitty terminal" UI for users who want the higher sub limits; Zed is doing the same. T3 Code already supports Codex, Cursor, and Open Code, with Gemini, GitHub Copilot, and the full ACP registry coming. Closing ~44:24: "It's a 40x cut to claude -p disguised as a monthly bonus."

My plan has been canceled. They had the opportunity to do this right, and they chose not to.

Tools: Claude Code, Claude Agent SDK, claude -p, Claude Code GitHub Actions, T3 Code, OpenClaude, Open Code, Hermes Agent, Zed (ACP adapter), Conductor, Cursor, Codex / GPT-5.5, ACP, Project Glass Wing.

Industry

Anthropic News Sherwood Snacks

Anthropic's enterprise blitz: PwC, Gates $200M, and a $950B round

Even as Theo torches his Claude plan (see above), Anthropic is racking up enterprise wins. PwC expanded its strategic alliance, deploying Claude Code and Claude Cowork to hundreds of thousands of professionals via a joint Center of Excellence — anchored by a new PwC Office of the CFO business unit and citing client results like insurance underwriting cut from 10 weeks to 10 days^{[5]Anthropic — PwC expanded partnership}. The Gates Foundation committed $200M over four years in a mix of grants, Claude credits, and technical support targeted at global health, education, and economic mobility for 4.6B people in LMICs^{[6]Anthropic — Gates Foundation partnership}. And Sherwood Snacks reports Anthropic is in funding discussions at valuations reaching $950B — a figure that would potentially exceed OpenAI's^{[7]Sherwood Snacks — Anthropic in funding talks near $950B}.

PwC: $2T target, three workstreams, 30K certified

The expanded PwC alliance is part of Anthropic's $100M Claude Partner Network and targets the "$2 trillion drag from pre-AI systems." Rollout begins with US teams and ultimately reaches hundreds of thousands of PwC professionals via a joint Center of Excellence and a training/certification program for 30,000 staff. The new standalone PwC Office of the CFO is anchored in Claude. Three workstreams: (1) Agentic Technology Build — compressing software delivery from quarters to weeks across financial services, pharma, healthcare, consumer markets; (2) AI-Native Deal-Making — Claude running end-to-end M&A diligence, value creation, integration; (3) Enterprise Function Reinvention — finance, supply chain, HR, engineering^{[5]Anthropic — PwC partnership}. Advocate Health (167,000-person system) is among named clients.

Insurance underwriting that took ten weeks now takes ten days. Security work that took hours now takes minutes. — Dario Amodei

Gates Foundation: vaccines, tutoring, smallholder farmers

Of the $200M / 4-year commitment, the largest slice is global health: building connectors, benchmarks, and eval frameworks for healthcare AI; screening vaccine and drug candidates for neglected diseases (polio, HPV, eclampsia/preeclampsia); partnering with the Institute for Disease Modeling on malaria and TB forecasting. Education work runs through the Global AI for Learning Alliance (GAILA) — math tutoring, college advising, foundational literacy for K-12 in the US, sub-Saharan Africa, and India. Economic mobility supports smallholder farmers and portable skills records^{[6]Anthropic — Gates Foundation partnership}.

The $950B number

Sherwood Snacks tucked the figure into its daily roundup: Anthropic is in funding discussions at valuations reaching $950B, which would potentially exceed OpenAI's^{[7]Sherwood Snacks — Anthropic near $950B}. The valuation arrives the same day Anthropic prints two large enterprise wins and tells Agent SDK power-users their effective subsidy is over.

Tools: Claude Code, Claude Cowork, Model Context Protocol, ChatPwC, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Azure, GAILA.

AI Future Industry

Anthropic Research Morning Brew

Two 2028 scenarios for AI leadership — while Trump and Huang land in Beijing

Anthropic published a forecasting piece sketching two 2028 futures: democracies maintain a 12–24 month frontier lead through tight export controls, or China reaches near-parity through chip smuggling and distillation attacks and deploys "good enough" models globally under its "AI+" policies^{[8]Anthropic Research — 2028 scenarios for global AI leadership}. The same day, Trump arrived in Beijing for a two-day summit with Xi — bringing Elon Musk and Jensen Huang on Air Force One alongside a Wall Street delegation, with chips, rare earths, Iran, and Boeing all on the table^{[9]Morning Brew — Trump and Wall Street execs land in Beijing}.

Scenario 1: democratic commanding advantage

Export controls tighten to create an 11:1 compute ratio advantage. Loopholes for chip smuggling and unauthorized overseas data-center access close. American AI becomes the backbone of global infrastructure; democratic values shape frontier systems and limit authoritarian misuse^{[8]Anthropic — Scenario 1}.

Scenario 2: neck-and-neck

Chinese labs reach near-parity through persistent distillation attacks (extracting outputs from US models), smuggled chips (Anthropic cites a $2.5B Supermicro diversion case), and weakened enforcement. Beijing captures developing-nation market share with cheaper near-frontier models. The piece highlights a current safety gap: only 3 of 13 major Chinese labs have published safety evaluations, and DeepSeek R1 complied with 94% of malicious requests versus 8% for comparable US models. Policy asks: close enforcement loopholes, clarify that distillation is illegal, and aggressively export US AI hardware and software globally^{[8]Anthropic — Scenario 2}.

Beijing, May 14

Trump arrived May 13 at 7:50pm local time, the first visit by a sitting US president in nearly a decade, bringing Musk (Tesla) and Huang (Nvidia) on Air Force One. Opening bilateral talks ran two hours: "They look forward to trade and doing business, and it's going to be totally reciprocal on our behalf" (Trump). Xi: "We should be partners, not rivals." Sen. Steve Daines summarized US priorities as "Boeing, beef, and beans." China renewed import licenses for hundreds of US beef plants ahead of the summit. China's asks: oppose the $14B Taiwan weapons sale and gain greater access to advanced US chips. US asks: restore rare earth access^{[9]Morning Brew — Trump-Xi summit}.

AI Future Industry

OpenAI Blog

ChatGPT learns to remember risk across conversations

OpenAI introduced safety summaries — short factual notes generated by a dedicated safety-reasoning model that carry distress signals from earlier conversations into later ones, so ChatGPT can recognize patterns that only become concerning when stitched together^{[10]OpenAI Blog — safety summaries}. They're scoped to three acute scenarios (suicide, self-harm, harm-to-others), kept only briefly, and triggered only when relevant — explicitly not general personalization or long-term memory. Built with psychiatrists from OpenAI's Global Physicians Network; internal evals on GPT-5.5 Instant show 39–52% improvements in safe-response rates with no meaningful regression on ordinary conversations^{[10]OpenAI Blog — eval methodology}.

What a safety summary is

Safety summaries are produced by a model trained for safety reasoning, narrowly scoped, time-limited, and applied only when relevant to a serious safety concern. When ChatGPT detects relevant patterns, it can de-escalate, refuse harmful detail, or redirect to crisis resources or a Trusted Contact. Future exploration may extend the approach to biology or cyber safety with additional safeguards^{[10]OpenAI Blog}.

These summaries are created by a model trained for safety reasoning tasks and are narrowly scoped, kept only for a limited time, and used only when relevant to a serious safety concern.

Eval methodology

In long single-conversation scenarios, safe-response performance improved 50% for suicide/self-harm cases and 16% for harm-to-others. Across multi-conversation, multi-model evals on GPT-5.5 Instant (the current ChatGPT default), safe-response rates rose 52% for harm-to-others and 39% for suicide/self-harm. Safety summaries themselves were evaluated across >4,000 examples, with an average safety relevance of 4.93/5 and factuality of 4.34/5. Crucially, there was no significant user-preference difference between responses with summaries active and those without^{[10]OpenAI Blog}.

Tools: ChatGPT, GPT-5.5 Instant, Trusted Contact, Global Physicians Network.

Industry Hot Take

Tech Brew Morning Brew Caleb Writes Code

The data-center revolt: Gallup 70%, Utah's Stratos, and the US energy gap

A Gallup poll out today shows 70% of Americans oppose AI data center construction in their communities, with data centers now eating 6% of all US energy — past the International Data Center Authority's 5% threshold where political pushback typically erupts^{[11]Tech Brew — AI data center conversation reaching fever pitch}. Yet Box Elder County, Utah commissioners just approved Kevin O'Leary's 40,000-acre, $100B Stratos AI data center over protests from nearly 4,000 residents — a facility that would draw 9 GW, more than the entire state of Utah currently uses^{[12]Morning Brew — Utah's Stratos approved despite opposition}. Caleb Writes Code argues the US is structurally disadvantaged vs. China on this exact axis: fragmented jurisdiction, reactive incentives, and a transmission shortfall vs. China's centralized State Council mandates^{[13]Caleb Writes Code — Why USA is disadvantaged in AI Race}.

The opinion data

70% oppose, with nearly half strongly opposing. By party: 56% of Democrats, 48% of independents, 39% of Republicans strongly oppose local data centers. Two leading concerns: water and energy consumption (50%), quality-of-life/utility bills (~20%). Despite that, global annual data-center investment approaches $1T — Amazon just issued the fourth-largest corporate bond ever, Meta sold $25B in April^{[11]Tech Brew}. Political spend: a16z is the largest 2026 midterm donor at $115.5M, Leading the Future PAC has committed $125M to oppose pro-regulation candidates, and xAI is operating 46 "mobile" gas turbines in Mississippi classified to evade air-pollution regulations.

It's like we don't exist. — Lake Tahoe resident

Stratos, Utah

The Stratos project, helmed by Shark Tank's Kevin O'Leary, would span 40,000 acres in northwest Utah. O'Leary claims 10,000 construction jobs and 2,000 permanent; Business Insider's reality check: 4,000 construction jobs over 10–15 years and 1,350 permanent. 9 GW of demand exceeds the state's entire current draw. Environmentalists warn it could destroy the Great Salt Lake ecosystem and raise local temperatures by 5°F day / 28°F night. The Box Elder Accountability Referendum is collecting 5,422 signatures to put it on November's ballot^{[12]Morning Brew — Stratos approved}.

Why the US can't catch up on the energy layer

Caleb argues that while the US leads on chips, models, and apps, the energy layer is China's structural edge ~01:01. The US is reactive — executive orders, FERC, state utilities, LMP pricing, fragmented permitting — while China's State Council commissions agencies, sets targets, and aligns provinces. AI data centers need 100MW+ vs. 30–80MW for traditional facilities, and fragmented US permitting plus a critical shortage of high-voltage transmission means new sites take years to energize^{[13]Caleb Writes Code}.

Industry Developer Tools

Better Stack Fireship

"Shai-Hulud 4": the npm worm that ate TanStack

Yesterday's TanStack compromise has matured into a self-spreading worm tracked as "Shai-Hulud 4." Better Stack and Fireship both broke down the kill chain today: the attacker forked TanStack Router, opened a PR exploiting GitHub Actions' pull_request_target trigger, and poisoned the CI cache so that hours later, when an unrelated merge ran the release workflow, OIDC tokens published 84 malicious versions across 42 TanStack packages, all signed and verified by npm's trusted publishing system^{[14]Fireship — A single PR just hijacked the NPM registry}. The payload embeds into Claude Code hooks and VS Code, harvests AWS/Vault/K8s/SSH/npm credentials, and runs a dead-man switch every 60 seconds: if a stolen GitHub token gets revoked, it runs rm -rf on the user's home directory^{[15]Better Stack — The NPM Worm Is Back And It's So Much Worse}. Aikido Security counts 373 poisoned versions across 169 packages, hitting Mistral AI, UiPath, Guardrails AI, OpenSearch, and Squawk.

How the cache got poisoned

pull_request_target runs in the base repo's security context, not the fork's. The attacker's forked PR could therefore write a poisoned file into the shared GitHub Actions cache under the exact key the release workflow would later restore^{[14]Fireship}. The PR was then reset to match main and closed, leaving no visible trace. Hours later, an unrelated merge to main triggered the release workflow, the poisoned cache was restored, and OIDC-based npm tokens were used to publish 84 malicious versions ~07:02^{[15]Better Stack}.
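For maintainers who want a first-pass check of their own repos, a crude text heuristic can flag workflows that combine this trigger with fork-controlled input or the shared cache. This is a sketch under stated assumptions: `audit_workflow` is a hypothetical helper, it does plain substring matching rather than parsing YAML, and the source does not say which exact workflow steps TanStack's configuration contained.

```python
def audit_workflow(workflow_text: str) -> list[str]:
    """Return heuristic findings for the text of one GitHub Actions workflow file."""
    # Only workflows using the privileged trigger are of interest here.
    if "pull_request_target" not in workflow_text:
        return []
    findings = []
    if "github.event.pull_request.head" in workflow_text:
        findings.append("references fork-controlled PR head under pull_request_target")
    if "actions/cache" in workflow_text:
        findings.append("uses the shared Actions cache under pull_request_target")
    return findings
```

Because it matches raw text, commented-out lines will produce false positives; treat any finding as a prompt for manual review, not a verdict.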

The poisoned packages were signed, verified, and shipped through npm's trusted publishing feature, which was built primarily to prevent these kinds of attacks.

The payload

routerinit.js detaches from the install process to hide from npm install logs, then embeds itself into Claude Code hooks and VS Code task runners to survive package removal ~02:00. It harvests: GitHub Actions secrets (via a fake CodeQL workflow), AWS credentials and IMDS metadata, Kubernetes service account tokens, HashiCorp Vault tokens, SSH keys, npm/Git credentials, shell history, cloud keys, crypto keys, and Claude Code session history. Exfiltration runs through the Session Messenger network with GitHub repo dead drops. The Python variant skips Russian-locale machines and randomly wipes machines detected as Israeli or Iranian.

The worm started forging commits signed by the Claude Code GitHub app. So its malicious activity blended in with the AI generated commits maintainers were already used to seeing.

The dead-man switch

Every minute, the malware checks whether its stolen GitHub token is still valid. The moment the token is revoked, it activates what Fireship calls "war crime mode": it recursively deletes the user's home directory, then attempts to play an MP3^{[15]Better Stack}.

If you revoke this token, we will wipe the computer of the owner.

Mitigation: PNPM v11+

Fireship's recommendation: PNPM v11+ with three settings: minimum release age (refuse packages newer than 24 hours), block exotic subdependencies (no git/tarball deps), and approved builds (block all install scripts by default) ~05:01^{[14]Fireship}.
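The logic behind a minimum release age is simple enough to sketch. The code below illustrates the idea only; it is not pnpm's implementation, and the function names, the version strings in the usage note, and the timestamp inputs are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def old_enough(
    published_at: datetime,
    now: datetime,
    minimum_age: timedelta = timedelta(hours=24),
) -> bool:
    """True once a version has been public for at least the minimum age."""
    return now - published_at >= minimum_age


def newest_allowed(versions: dict[str, datetime], now: datetime) -> Optional[str]:
    """Pick the most recently published version that passes the age gate, if any."""
    eligible = {v: t for v, t in versions.items() if old_enough(t, now)}
    return max(eligible, key=eligible.get) if eligible else None
```

With a hypothetical package whose 1.0.1 was published three hours ago, `newest_allowed` falls back to the older 1.0.0, which gives the registry and scanners time to catch a poisoned release before anyone installs it.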

Tools: npm, TanStack Router, GitHub Actions, PNPM v11, OIDC trusted publishing, Claude Code, VS Code, Aikido Security, HashiCorp Vault, AWS IMDS, Kubernetes, Sentry Seer Agent.

Hot Take Industry

Nerd Snipe

Security psychosis: AI is breaking responsible disclosure

On a thread tangentially related to the TanStack worm, Theo declares he is "borderline in security psychosis"^{[16]Nerd Snipe — Theo got psychosis}. His argument: AI agents can now read a Linux kernel commit hash and rebuild a working exploit before patches reach distros, ending the implicit time-and-expertise buffer that made responsible disclosure work. He demonstrates that Gemini 3.1 Pro, GPT-5.5, and Opus 4.7 can be handed nothing but a commit hash and the prompt "without searching, does this look like a security patch?" and immediately identify it as a vulnerability fix and recreate the exploit. His prescription: open source has to mean something different, with maintainer trees separated from public mirrors and commit visibility delayed.

CopyFail and six new exploits

Theo walks through CopyFail ~36:24, a Linux kernel memory-free allocation bug that lets a few innocent-looking lines of Python escalate to root on most pre-patch kernels. A Framework laptop he was recently shipped came with a kernel four minor versions behind the fix ~37:26. He's already found around six new exploits in the same vulnerability class — all discovered with AI, all exploitable with AI ~38:29.

I'm borderline in security psychosis at this point. I am convinced that the world is about to collapse underneath a bunch of random exploits.

The two dead assumptions

Disclosure used to lean on two unwritten rules: (1) maintainers had enough time between merge and public knowledge to ship fixes downstream; (2) security commits were obscure enough in project history that few people could spot them. Both are gone. He notes a recent UAF-on-32-bit-Linux exploit was published on a random blog before the fix even hit stable ~40:30.

It's remarkable how much of the world of security was held up by the sheer fact that not enough people knew what they were doing and not enough people had the time or expertise to do it… that check is gone.

What "open source" needs to mean now

Theo's pitch ~41:30: anyone building a "better GitHub" should separate the code maintainers work on from the code the public sees — a 2-week to 2-month delay on commit visibility, with the ability to temporarily lock visibility around security fixes. Otherwise: "This is just going to be the death of open source." Ben underscores how trivial the attack pipeline is — paste the audio of Theo's rant into anyone's OpenCode instance with the right setup and it would automate the entire mailing-list-to-exploit pipeline ~42:31.
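The delayed-visibility proposal reduces to a filter between the maintainer tree and the public mirror. A minimal sketch, assuming a hypothetical `Commit` record with a merge timestamp and a maintainer-set embargo flag; the two-week default is the low end of the range Theo mentions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Commit:
    sha: str
    merged_at: datetime
    embargoed: bool = False  # hypothetical maintainer lock, e.g. around a security fix


def public_view(
    commits: list[Commit],
    now: datetime,
    delay: timedelta = timedelta(weeks=2),
) -> list[Commit]:
    """Commits the public mirror may show: past the delay and not embargoed."""
    return [c for c in commits if not c.embargoed and now - c.merged_at >= delay]
```

Maintainers would still see every commit; only the mirror lags, which restores some of the time buffer the segment argues has disappeared.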

Adjacent segments cover the Anthropic-xAI Colossus 1 compute deal ~02:01, the OpenAI/Elon lawsuit history ~08:07, the Bun-to-Rust rewrite by Claude at 99.8% test pass ~25:18, and Ben getting "local-model pilled" on a 5090 — Theo's counter: a $6–8K MacBook buys 3.25 years of Codex subs and frontier-class local inference is "genuine mental illness" math ~80:58.

Tools: CopyFail (Linux kernel CVE), Gemini 3.1 Pro, GPT-5.5, Claude Opus 4.7, OpenCode, Cursor, Claude Code, Bun, Rust, Zig, LM Studio, Ollama, Deepseek V4 Flash, Qwen 3.6 27B.

AI Tools Developer Tools Hot Take

Simon Willison

Coding agents are dissolving language lock-in (Simon Willison)

Three Willison posts circle the same thesis: AI coding agents make architectural decisions feel reversible. He retells an anecdote from a mid-sized company that rewrote native iOS and Android apps in React Native and felt comfortable doing so because they could "just port back to native" if they had to^{[17]Simon Willison — Not so locked in any more}. Mitchell Hashimoto's quote, picked up by Willison, generalizes from the Bun-Zig-to-Rust rewrite: "Bun has shown they can be in probably any language they want in roughly a week or two. Rust is expendable."^{[18]Simon Willison — Quoting Mitchell Hashimoto} And Willison himself ships datasette-ip-rate-limit 0.1a0, built with Codex on GPT-5.5 xhigh to fight badly-behaved crawlers hammering datasette.io^{[19]Simon Willison — datasette-ip-rate-limit 0.1a0}.

"Just port back to native"

The React Native anecdote isn't an endorsement of cross-platform per se; it's an endorsement of optionality. When the cost of rewriting falls by an order of magnitude, architectural choices that once felt permanent become provisional^{[17]Simon Willison — Not so locked in}.

Programming languages used to be LOCK IN, and they're increasingly not so.

The Bun lesson, per Hashimoto

Hashimoto's framing of the Bun Zig→Rust rewrite cuts past the usual language-war discourse: the meta-fact is that the team could migrate to any language in 1–2 weeks. Rust isn't the winner here; "any language" is^{[18]Simon Willison — Hashimoto quote}.

Rust is expendable. Its useful until its not then it can be thrown out. That's interesting!

And one shipped artifact: the rate-limit plugin

datasette-ip-rate-limit 0.1a0 reads the client IP from a configurable header (Fly-Client-IP in production), caches up to 10,000 IP keys, and applies rules per path pattern. Production config: 60 requests / 60-second window with a 20-second block penalty for violators. Built with Codex (GPT-5.5 xhigh)^{[19]Simon Willison — datasette-ip-rate-limit}.
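To show how those numbers interact, here is a small sliding-window limiter with a block penalty and a bounded key cache. It is a sketch of the general technique and not the plugin's code: the class name, method names, and the least-recently-seen eviction policy are assumptions, and the real plugin may count and evict differently.

```python
from collections import OrderedDict, deque


class IpRateLimiter:
    """Sliding-window limiter: max_requests per window_s, then a block_s penalty."""

    def __init__(self, max_requests=60, window_s=60.0, block_s=20.0, max_keys=10_000):
        self.max_requests = max_requests
        self.window_s = window_s
        self.block_s = block_s
        self.max_keys = max_keys
        self._hits = OrderedDict()  # ip -> deque of accepted request timestamps
        self._blocked_until = {}    # ip -> time at which the block ends

    def allow(self, ip: str, now: float) -> bool:
        if self._blocked_until.get(ip, 0.0) > now:
            return False  # still serving a penalty
        hits = self._hits.setdefault(ip, deque())
        self._hits.move_to_end(ip)  # mark this IP as most recently seen
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()  # drop timestamps that fell out of the window
        if len(hits) >= self.max_requests:
            self._blocked_until[ip] = now + self.block_s
            return False
        hits.append(now)
        if len(self._hits) > self.max_keys:
            evicted, _ = self._hits.popitem(last=False)  # forget least recently seen IP
            self._blocked_until.pop(evicted, None)
        return True
```

In a real deployment `ip` would come from the configured header (Fly-Client-IP in the production setup described above) and `now` from a monotonic clock.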

Tools: Datasette, Codex, GPT-5.5 xhigh, React Native, Bun, Rust, Zig, Fly.io.

AI Models

AI Search

MAML: a foundation model for biology beats specialist AI in drug discovery

A new multimodal model called MAML, pre-trained on ~2 billion samples across chemistry, genetics, and protein structures, hit state of the art on 11 drug-discovery benchmarks — and in a zero-shot cancer drug response test, correctly predicted that carfilzomib (a blood-cancer drug) would be potent against solid tumors, overturning decades of expert oncologist consensus^{[20]AI Search — The biggest AI breakthrough in medicine & drug discovery}. A wet-lab follow-up confirmed MAML's exact potency ordering across approximately 95% of all 805 tumor cell types tested.

The architecture: one tokenizer for biology

MAML uses a modular umbrella tokenizer that converts molecules (SMILES strings), gene expression profiles (ranked by activity level), and amino acid sequences into a shared embedding space so a single model can learn cross-domain relationships ~00:09. Training data drew from the Observed Antibody Space, UniProt, ZINC, PubChem, and CellxGene. The core argument: existing models (AlphaFold for proteins, EVO2 for DNA, MolFormer for small molecules) understand one slice of biology while disease flows across the entire chain.
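"Ranked by activity level" suggests turning an expression profile into an ordered list of gene tokens. The sketch below shows one plausible reading; it is an assumption about the general approach and not MAML's tokenizer, and the function name, the `top_k` cutoff, and the tie-breaking rule are hypothetical.

```python
from typing import Optional


def rank_tokens(expression: dict[str, float], top_k: Optional[int] = None) -> list[str]:
    """Order gene symbols from most to least expressed; ties break alphabetically."""
    ranked = sorted(expression, key=lambda gene: (-expression[gene], gene))
    return ranked[:top_k] if top_k is not None else ranked
```

Encoding rank instead of raw values makes profiles from different experiments comparable, since only the ordering of genes has to be consistent.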

Beating the specialists at their own benchmarks

MAML beat MolFormer, which was trained on >1B small-molecule sequences, on blood-brain-barrier penetration (BBBP) and clinical toxicity prediction (ClinTox) ~00:11. It beat AlphaFold 3 on 5 of 7 antibody-binding prediction targets, particularly on proteins with intrinsically disordered regions (30–40% of human proteins) where AlphaFold's static-structure approach struggles ~00:23. On novel antibody generation, it showed a 19% improvement over prior models in predicting the CDRH3 region, the most complex and variable part of an antibody ~00:28.

The carfilzomib moment

Zero-shot test: four drugs never seen in training (Tanimoto similarity < 0.7) and genetic profiles of 805 cancer tumor cell types. MAML ranked carfilzomib first against solid tumors — contradicting the established medical view that it only works on blood cancers — and a physical lab experiment confirmed the exact potency ordering across ~95% of all tumor types^{[20]AI Search}.
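The Tanimoto cutoff is what makes the test "zero-shot": a held-out drug must not closely resemble anything in training. For fingerprints represented as sets of on-bit indices, Tanimoto similarity is the size of the intersection divided by the size of the union. The fingerprint type used in the study is not stated in the source, so the set representation and the `is_novel` helper below are illustrative assumptions.

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(a & b) / len(a | b)


def is_novel(candidate: set[int], training_set: list[set[int]], cutoff: float = 0.7) -> bool:
    """True if the candidate is below the cutoff against every training fingerprint."""
    return all(tanimoto(candidate, seen) < cutoff for seen in training_set)
```

For example, fingerprints {1, 2, 3} and {2, 3, 4} share two of four distinct bits, giving a similarity of 0.5, which would pass a 0.7 cutoff.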

We potentially have the first true foundation model for biology — one that doesn't just read papers or just look at molecules or analyze genes, but it does all of it at once.

Tools: MAML, AlphaFold 3, EVO2, MolFormer, ZINC, PubChem, UniProt, CellxGene.

Industry AI Future

AI Daily Brief

Google's pre-IO leak: Gemini Intelligence, the Google Book, and orbital data centers

AI Daily Brief recaps a wave of Google announcements heading into next week's IO: Gemini Intelligence, an agentic Android upgrade with personal AI memory, rolling out to Google and Samsung handsets this summer and to watches/glasses/laptops later in 2026; the Google Book, a Chromebook redesigned around AI with gesture-based Gemini invocation; and exploratory talks with SpaceX about launching orbital data centers, with first prototypes targeted for 2027^{[21]AI Daily Brief — Google's pre-IO drops}.

Gemini Intelligence + the Google Book

Gemini Intelligence wraps Android, a more capable assistant, and a personal AI memory system, with rollout starting on the latest Google and Samsung handsets this summer ~01:00. The Google Book is a Chromebook reimagined around AI, mixing Chrome OS and Android, with gesture-invoked Gemini (jiggling the mouse summons it). A DeepMind demo previewed a future mouse-pointer interface where users gesture and speak naturally without naming items.

As Android transitions from an operating system into an intelligence system, your devices are becoming even more helpful.

Orbital data centers

The Wall Street Journal reports Google is in talks with SpaceX to launch orbital data centers, with first prototypes targeting orbit by 2027 ~03:01. Anthropic's recent SpaceX deal already surfaced orbital data centers as a way around land-permitting bottlenecks. Robinhood co-founder Baiju Bhatt's Space Cowboy Corp announced fundraising at a $2B valuation; Nvidia posted a job listing for an orbital data center system architect.

Forward-deployed engineers and PE pipelines

Google announced plans to hire hundreds of forward-deployed engineers inside Google Cloud — following OpenAI and Anthropic — and is reportedly in talks with Blackstone, KKR, and EQT to deploy AI products across their portfolio companies ~04:01.

The AI race is no longer just about model performance and benchmarks, but has a major new dimension in model deployment.

Tools: Gemini Intelligence, Google Book, Gemini assistant, DeepMind AI pointer.

Industry

Nate B Jones

The trillion-dollar implementation layer

Nate B Jones argues the trillion-dollar agentic workflow opportunity is being contested at the implementation layer, not the model layer^{[22]Nate B Jones — The implementation layer war}. PE firms (whose SaaS portfolios can't compete with agents), hyperscalers (capital-constrained even after record raises), and Fortune 500s (no in-house expertise) are converging on forward-deployed enterprise AI services as the primary value-capture mechanism. Anthropic's deployment company has $1.5B from Blackstone, Hellman & Friedman, and Goldman Sachs; OpenAI's Deploy Co. is venture-valued near $10B.

Four axes of pressure

Frontier labs move down-stack (Claude Code vs Cursor; Claude finance templates vs Bloomberg workflows). Consultancies move up-stack (McKinsey, BCG, Accenture, Capgemini in the OpenAI Frontier Alliance; PwC on the Office of the CFO). Systems of record harden their interfaces (Salesforce, ServiceNow, Workday, SAP's acquisition of Prio Labs/Dreamio — all exposing agent-native APIs that cut out middleware startups). PE becomes a distribution channel rather than just a capital source ~08:09.

The implementation layer, concretely

Workflow design (decisions the model owns vs. decisions that stay human), data access (row/field-level permissions, authoritative vs stale records), authority limits (read vs write vs spend), evals (business-rule adherence, not benchmark performance), audit trails, and ongoing ownership ~15:15. Nate's strategic note: sit closer to the business object. Build against support tickets, sales motions, compliance records — not generic summarization. Generic AI wrappers get squeezed from all four directions.

When the company shipping the model tells you the bottleneck isn't their model, it's the whole implementation layer, we got to be taking notes.

The implementation layer is too complicated, too nuanced and too far into the weeds on specific enterprises to be built in a weekend by Claude Code.

Tools: Claude Code, Codex, Salesforce Agentforce, ServiceNow, Workday, SAP / Dreamio / Prio Labs, Bloomberg, Cursor.

Hot Take

AI Daily Brief

In defense of tokenmaxxing

Same episode, second half: the host pushes back on the wave of press coverage casting corporate token leaderboards (Meta, Amazon, Disney, Visa) as Goodhart-law theater. He acknowledges incentive design problems but argues critics stack three fallacies — selection bias, hasty generalization, and category error — to imply AI isn't producing real value^{[23]AI Daily Brief — In Defense of Tokenmaxxing}. His defense: managing agents is a new work primitive without best practices, and the orgs experimenting now will be light-years ahead.

The viral Slack screenshot — $600 in Anthropic spend vs. a $23 Uber Eats claim — racked up 2 million views ~12:05. The host concedes Goodhart's Law (when a measure becomes a target, it ceases to be a good measure) but draws an R&D analogy: most tokens spent on exploration won't return direct quarterly revenue, just as most R&D doesn't. He cites his own ~1B-token month, almost none of which produced direct revenue, as a counterexample to "non-monetized tokens are wasted." Salesforce has already evolved past raw token counts into "agentic work units" measuring output and impact ~13:13.

Managing agents is a new work primitive, full stop. And it's a new knowledge work primitive where there are no experts. There are only people who have experimented more than you.

Do not, and I mean do not, be afraid of burning tokens on valuable mistakes.

Podcast AI Future

Latent Space

Latent Space interviews Abridge: 100M doctor visits, the AI scribe stack

Latent Space sat down with Abridge's Janie Lee (product) and Chai Asawa (clinical decision support, ex-Glean) to walk through how Abridge turned ambient audio from "on the order of hundreds of millions — getting close to 100 million" medical conversations into a clinical intelligence platform^{[24]Latent Space — Inside Abridge: 100M doctor visits}. The episode covers the constellation-of-models architecture, a one-way PHI scrubber, an embedded "Clinician Scientist" eval team, expansion from notes into prior auth and billing automation, and a competitive frame against Nuance/DAX/Suki.

~02:00 The thesis: the patient-clinician conversation is the upstream artifact from which nearly every downstream healthcare workflow derives — the note, the diagnosis, the claim, the payment. Doctors spend 10–20 hours/week on "pajama time" documentation; Abridge automates it via ambient audio and sells into CMIOs/CFOs/CIOs at large health systems.

Scale, data flywheel, model architecture

~20:30 Hundreds of millions of medical conversations form the proprietary training corpus — "the trace between patient and provider, where the debugging of healthcare happens." The model stack is a constellation: proprietary post-trained models for transcription, diarization, and note generation, plus third-party APIs where they're sufficient. Real-time latency runs through batched processing with a window shorter than the visit, with prototypes exploring event-triggered agent workflows. Fast/cheap model for triage, larger model for complex inference — "thinking fast and slow" ~51:00. Infra: Kafka, Temporal, WebSockets, CRDT-style conflict resolution.

HIPAA: a one-way PHI scrubber

~38:00 All real-world training/eval data is de-identified. Abridge built an in-house PHI-scrubbing model that removes the 18 HIPAA PHI identifiers irreversibly. Customer contracts govern PHI retention. Chai flags the meta-problem: "multiple probabilistic systems stacked on each other" — confidence in the de-id model itself is prerequisite to trusting anything built on top.

"LFD" and the Clinician Scientist

~31:00 The first-pass eval process is called LFD — "look at the f***ing data." LLM judges are calibrated against annotated data from internal and third-party evaluators. Clinician Scientists (MDs who are also technical, ranging from full-stack to "extremely scrappy prompters") are embedded in every product team and own the evaluation criteria. Customers moved from quarterly to monthly release cycles, with a subset opting into faster co-development.

Prior auth in minutes, not weeks

~09:30 >90% of clinical alerts are currently ignored, so Abridge's design principle is pre-visit briefing rather than mid-visit interruption. For prior authorization specifically, Abridge cross-references the patient's payer policy against visit context in real time and prompts the clinician to gather the two remaining required criteria before the patient leaves — collapsing a 4–6 week process into minutes.

Conversations between patients and clinicians are probably the most important workflow in healthcare. Almost everything is a derivative of that conversation — whether it's the claim, the payment, the actual diagnosis, the treatment. — Janie Lee

Context is king and context is what actually puts models to work. I see Abridge as a healthcare-coded version of Glean, but the differences are really interesting — the downside risk here can actually be fatal. — Chai Asawa

The 80/20 rule doesn't work here. I actually think a lot of the hardest AI innovation will happen in healthcare first, just because we have to — or else we can't ship. — Chai Asawa

Tools: Abridge, Kafka, Temporal, WebSockets, CRDT, Whisper (implied), Epic EHR (implied), Claude Code, Cursor, Glean (Chai's prior company).

Podcast Developer Tools

AI Engineer

Laurie Voss at AI Engineer: hands-on evals for agents

Laurie Voss (Arize, ex-npm) walks through evaluating a two-agent financial analysis system built on the Claude Agent SDK with Claude Haiku, instrumented with Arize Phoenix in two lines (import register from phoenix.otel, then call register(auto_instrument=True)) — hitting the OpenInference/OTel standard already implemented by Anthropic, OpenAI, CrewAI, and LangChain^{[25]AI Engineer — Laurie Voss: Hands-On Evals for Agents}. The session covers three eval layers, the five components of a custom LLM-as-judge rubric, and how meta-evaluation against human annotations turns rubrics from "fancy ways of being wrong at scale" into trustworthy graders.

Three eval layers

~10:22 Code evals (deterministic Python/TypeScript — regex, JSON validation, length limits) are fast and brittle. LLM-as-judge evals use a stronger model (Sonnet judging Haiku) and a rubric to assess faithfulness, actionability, etc. Human evals are the gold standard and the source of golden datasets, but they don't scale. A key lesson: the built-in correctness eval failed 13/13 times because the judge model had a training cutoff and couldn't verify forward-looking 2026 financials. Switching to a faithfulness eval (does the report stick to what the research turn produced?) scored 13/13 ~62:12.
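To make the first layer concrete, here is a minimal sketch of a code eval combining the three deterministic checks named above (JSON validation, regex, length limit). The output schema, ticker pattern, and length limit are assumptions for illustration, not taken from the talk:

```python
import json
import re

REQUIRED_KEYS = {"ticker", "summary"}    # hypothetical output schema
TICKER_RE = re.compile(r"^[A-Z]{1,5}$")  # hypothetical ticker format
MAX_SUMMARY_CHARS = 500                  # hypothetical length limit


def code_eval(raw_output: str) -> dict:
    """Run deterministic pass/fail checks on one agent output."""
    results = {"valid_json_object": False, "has_keys": False,
               "ticker_format": False, "within_length": False}
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return results  # unparseable output fails every check
    if not isinstance(data, dict):
        return results  # valid JSON, but not the object we expect
    results["valid_json_object"] = True
    results["has_keys"] = REQUIRED_KEYS.issubset(data)
    results["ticker_format"] = bool(TICKER_RE.match(str(data.get("ticker", ""))))
    results["within_length"] = len(str(data.get("summary", ""))) <= MAX_SUMMARY_CHARS
    return results


good = '{"ticker": "NVDA", "summary": "Hold; margins look stable."}'
bad = "Buy NVDA now"
print(code_eval(good))  # every check True
print(code_eval(bad))   # every check False
```

Checks like these cost nothing to run on every trace, which is the "fast" half of the trade-off; the "brittle" half is that any change to the output format requires updating the checks.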

Choosing the right eval can matter more than tuning your eval.

The five-part LLM-as-judge rubric

~65:12 Role definition, explicit criteria (actionable vs not-actionable with concrete examples), clearly delimited data blocks (XML tags work well with Claude), labeled few-shot examples ("by far the most useful thing"), and constrained binary output. Then meta-evaluate — compare judge outputs against human annotations to compute precision/recall and validate trust.
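The meta-evaluation step reduces to comparing two lists of binary labels. A minimal sketch, assuming "actionable" is the positive class and that judge and human labels are aligned by index (the sample labels are made up for illustration):

```python
def precision_recall(judge: list, human: list) -> tuple:
    """Precision and recall of judge labels, treating human labels as ground truth."""
    if len(judge) != len(human):
        raise ValueError("judge and human label lists must be the same length")
    tp = sum(1 for j, h in zip(judge, human) if j and h)      # both say actionable
    fp = sum(1 for j, h in zip(judge, human) if j and not h)  # judge too generous
    fn = sum(1 for j, h in zip(judge, human) if not j and h)  # judge too strict
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# True = "actionable"; hypothetical annotations for six traces.
human_labels = [True, True, False, True, False, False]
judge_labels = [True, False, False, True, True, False]

p, r = precision_recall(judge_labels, human_labels)
print(f"precision={p:.2f} recall={r:.2f}")
```

Low precision means the judge passes outputs humans would reject; low recall means it rejects outputs humans would accept. Either one points back at the rubric, usually the criteria or the few-shot examples.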

Phoenix experiments and the impact hierarchy

~96:32 Pull failing traces into a dataset, revise the agent prompt to require buy/sell/hold recommendations and financial ratios, rerun against just the failures — all six previously failing cases pass. Voss closes with the impact hierarchy: fix data quality first, then prompt engineering, then model selection, then hyperparameters.

An eval that you haven't validated is just a fancy way of being wrong at scale.

Nobody has your evals but you. Nobody but you has this long list of production data and production evals — this creates a moat that other people don't have.

Tools: Arize Phoenix, Arize AX, Claude Haiku (agent), Claude Sonnet (judge), Claude Agent SDK, OpenInference, OpenTelemetry, Google Colab, CrewAI, LangChain, LlamaIndex, DSPy.

Podcast Developer Tools

AI Engineer

Amy Boyd & Nitya Narasimhan at AI Engineer: mind the gap in agent observability

Microsoft Foundry DevRel — Amy Boyd and Nitya Narasimhan — frame agent observability around a "mind the gap" analogy: the gap between requirements (the platform) and what the agent actually does in production (the train)^{[26]AI Engineer — Mind the Gap in Agent Observability}. Their stack: OpenTelemetry-based tracing, built-in agentic evaluators (intent resolution, tool-call evaluation, task adherence, task completion), PyRIT for red-teaming, and a new GitHub Copilot "observe" skill that automates the eval→optimize→re-eval loop against Foundry agents.

Three disciplines, three lifecycle phases

~05:15 Evaluate, monitor, optimize — across early build, debug/optimize in production, and fleetwide management of many multi-agent systems. Foundry tracing is built on OpenTelemetry ~07:15 so agents built outside Foundry (LangChain, LangGraph, etc.) can emit OTel traces and still be managed in the Foundry control plane. Custom evaluators (prompt or code) are supported alongside built-ins ~33:44.

It's not enough for you to know when things go wrong. You need to shorten the time between detecting something went wrong and diagnosing it. So trace-linked evaluations are where it's at.

Contoso Travel demo

~23:26 Amy creates a project in East US2, deploys GPT-4.1 by default, attaches a Bing web search tool, and wires up Application Insights so traces and evaluation scores appear inline with each conversation. She flags task adherence as a low score on a simple query. Nitya then moves to code via GitHub Codespaces with a pre-built devcontainer and the AI Toolkit VS Code extension ~35:44, building specialist agents through Foundry's declarative YAML workflow composition (concierge → flight/car/hotel specialists) ~44:51.

Red teaming with PyRIT and the "observe" skill

~58:56 Red teaming uses a second AI to attack the first. Risk categories include violence, sensitive data leakage, task adherence, prohibited actions. Attack strategies: leetspeak as warm-up, "crescendo" (escalate like a frog in boiling water) as hard mode ~62:59. The headline new feature is the Foundry observe skill for GitHub Copilot Chat ~63:59 (early preview): pointed at a bare Foundry agent, Copilot auto-generates an eval dataset, runs a baseline batch eval, reports failures with reasoning, then offers to optimize the prompt, push a new agent version, re-run evals, and roll back on regression. Across ten iterations it found version 5 was best and recommended sticking with it.

Tools: Microsoft Foundry, ai.azure.com, Application Insights, Azure Monitor, OpenTelemetry, GitHub Copilot Chat, GitHub Codespaces, VS Code AI Toolkit, Foundry MCP server, PyRIT, GPT-4.1, Bing web search, LangChain, LangGraph, Claude (as Copilot model), KQL.

Podcast Developer Tools

AI Engineer

Jonas Templestein at AI Engineer: event-sourced agent harnesses

Jonas Templestein (Iterate) presents an event-sourced AI agent harness on an append-only stream — every LLM chunk, tool call, error, and control signal is an event, agent logic is implemented as pure stream-processor reducers, and side effects (LLM calls, child events) are isolated in an afterAppend hook^{[27]AI Engineer — Templestein: event-sourced agent harnesses}. The primitive is events.iterate.com: hierarchical paths per agent, YAML events with typed URLs as type identifiers, SSE subscriptions, and circuit-breaker / pause / scheduled events that are themselves just appended events.

The reducer + afterAppend split

~30:57 A synchronous reduce(state, event) → state with no side effects, plus a separate afterAppend(state, event) hook where LLM calls and child-event appends are allowed. The split matters for replay: a restarted processor replays all past events through the reducer to reconstruct state without re-triggering LLM calls. Sub-agents are just events appended to child or sibling paths (./boris) — a parent subscribes to a child's result events.
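The replay property is easiest to see in code. Below is a minimal Python sketch of the split (the talk's own harness is TypeScript); the event types, the State shape, and the fake LLM are all hypothetical stand-ins:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    type: str
    payload: str = ""


@dataclass(frozen=True)
class State:
    messages: tuple = ()
    awaiting_llm: bool = False


def reduce(state: State, event: Event) -> State:
    """Pure reducer: derives the next state from an event, with no I/O."""
    if event.type == "user_message":
        return State(state.messages + ("user: " + event.payload,), True)
    if event.type == "llm_response":
        return State(state.messages + ("assistant: " + event.payload,), False)
    return state  # unknown events leave the state unchanged


class Stream:
    """Append-only log plus the state derived from it."""

    def __init__(self, llm):
        self.log = []
        self.state = State()
        self.llm = llm

    def append(self, event: Event) -> None:
        self.log.append(event)
        self.state = reduce(self.state, event)
        self.after_append(self.state, event)

    def after_append(self, state: State, event: Event) -> None:
        """Side effects live here: call the LLM and append the result as an event."""
        if event.type == "user_message":
            self.append(Event("llm_response", self.llm(state.messages)))


def replay(log: list) -> State:
    """Rebuild state from the log using only the reducer; no side effects fire."""
    state = State()
    for event in log:
        state = reduce(state, event)
    return state


llm_calls = []


def fake_llm(messages: tuple) -> str:
    llm_calls.append(messages)  # record each call so replay can be checked
    return "ok"


stream = Stream(fake_llm)
stream.append(Event("user_message", "hi"))
rebuilt = replay(stream.log)
print(rebuilt == stream.state, len(llm_calls))  # same state, still one LLM call
```

Because the LLM's answer is itself recorded as an event, a restart never needs to ask the model again: the reducer sees the stored response and arrives at the same state.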

Control via the log itself

~12:23 Pausing a stream is done by appending a pause event, not by an out-of-band API call. A production circuit-breaker processor counts event timestamps over the last second and appends a pause event if the rate exceeds ~100/s, preventing infinite loops ~47:34. Scheduled events (heartbeats, one-shots at future times) are also just events ~13:25.
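The circuit breaker described above can be sketched as a sliding one-second window over event timestamps. This is an illustrative Python version under stated assumptions: the pause event's shape and the class name are hypothetical, and the threshold defaults to the ~100/s figure from the talk:

```python
from collections import deque
from typing import Optional


class CircuitBreaker:
    """Counts events in the trailing second and emits a pause event when the rate is too high."""

    def __init__(self, max_per_second: int = 100):
        self.max_per_second = max_per_second
        self.recent = deque()  # timestamps (seconds) inside the current window

    def observe(self, timestamp: float) -> Optional[dict]:
        """Record one event; return a pause event to append, or None."""
        self.recent.append(timestamp)
        # Drop timestamps that are a full second or more older than this event.
        while self.recent and self.recent[0] <= timestamp - 1.0:
            self.recent.popleft()
        if len(self.recent) > self.max_per_second:
            return {"type": "pause",
                    "reason": f"{len(self.recent)} events in the last second"}
        return None


# A runaway loop: 150 events, one per millisecond.
breaker = CircuitBreaker()
results = [breaker.observe(i * 0.001) for i in range(150)]
first_trip = next(i for i, r in enumerate(results) if r is not None)
print(first_trip)  # index of the event that first exceeded the limit

# A healthy stream: 50 events per second for four seconds never trips.
calm = CircuitBreaker()
print(any(calm.observe(i * 0.02) for i in range(200)))
```

The returned dict is meant to be appended to the same stream like any other event, so pausing needs no separate control channel.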

Dynamic workers and the no-before-hook stance

~50:36 The most experimental piece: appending an event of type dynamic_worker_configured whose payload is a JavaScript string causes Cloudflare Workers to spin up an isolate running that script as a processor — deployment is appending an event. Jonas envisions packaging processor files with npm deps in this fat event so agents can modify their own harness. He's explicitly against before-hooks ~58:49: they break context caching and create performance cliffs. Instead, treat the whole system as eventually consistent with a soft 200ms window for safety processors to inject context.

You can write the 40 lines of code required for a basic AI agent and then you can append that to any stream, and then that stream becomes an AI agent.

A plugin for your agent could run on another computer — especially something like prompt-injection protection. There's no way to proactively hook into the agent loop via MCP, but what it could do is subscribe to the event stream and squeeze in a little extra context within 200 milliseconds.

Tools: events.iterate.com, OpenAI Responses API, Cloudflare Workers, zod, TypeScript / Node.js, SSE, Open Code, Pi, Claude, Cursor.

Podcast AI Models

OpenAI Podcast

OpenAI Podcast Ep. 19: image generation's Renaissance moment

OpenAI's Kenji and product lead Adele Lee discuss Images 2.0 — the model they call a "Renaissance" vs. DALL·E's "Stone Ages" — covering token-efficient generation, post-training for aesthetic taste, capability jumps, and emergent use cases^{[28]OpenAI Podcast Ep. 19: Image Generation's Renaissance}. Since launch, usage is up >50% with more than 1.5 billion images generated per week on ChatGPT, and >50% of internal OpenAI presentation slides are now created with image gen.

Token efficiency and taste-driven post-training

~13:07 Kenji describes work to "produce very good images with less tokens" — the architectural lever isn't specified beyond that. Post-training focused explicitly on instilling "taste": working closely with artists, designers, and marketers, using their aesthetic vocabulary to guide the model.

Capability jumps

Text rendering scaled from ~5–8 objects in DALL·E 3 to 16 in Images 1.0 to 36 in 1.5 to 100+ in 2.0 ~07:02. Photo realism crossed from "glossy magazine cover" to actual-photograph quality ~12:07. Any-aspect-ratio support enables panoramics, bookmarks, and 360° views ~08:03; multi-image consistency enables comic books and character sheets ~27:18. Forward vision ~22:13: a "creative agent" pairing — Images + Codex — where users design a UI in Images and Codex builds it zero-shot ~25:17.

If DALL·E was a stone ages, Images 2.0 is a renaissance. It's not only great artistically and aesthetically, but it also incorporates science, art, architecture all in one image. — Adele Lee

It takes a lot of intelligence to actually create something that is imperfect.

Prompting tips

~28:19 Use thinking/pro models for higher-composition outputs; be open-ended and let the model do its own research; ground prompts with a style reference or uploaded inspiration image; for minimalist outputs, explicitly say so — the model can skew "dense."

Tools: Images 2.0, Images 1.0, Images 1.5, DALL·E 3, ChatGPT, Codex, ChatGPT thinking/pro mode.

AI Tools Developer Tools

AICodeKing Better Stack

Mistral Vibe and Dograh: a new terminal agent and an open-source voice platform

Two new entrants today. Mistral Vibe is Mistral's answer to Claude Code and Gemini CLI — a terminal-based agent now powered by Mistral Medium 3.5 (128B open-weights, 256K context). A free "experiment plan" API tier makes it accessible for personal and open-source work^{[29]AICodeKing — Mistral Vibe}. Dograh is an open-source, self-hostable voice AI platform positioned against Vapi, Bland, and Retell — visual workflow builder, voice engine, and platform layer (tracing, recordings, analytics) all under one Docker Compose deployment^{[30]Better Stack — Dograh open-source voice AI}.

Mistral Vibe

One-line installer (Python 3.12+), setup via vibe or vibe-setup, with the now-standard agent ergonomics: @file references, !shell commands, /slash commands. AICodeKing's recommended workflow: inspect-plan → implement-small-piece → run-tests → review-diff ~06:04. The free experiment plan tier lets developers test against real projects with the trade-off that usage may be used to improve Mistral.

Dograh

Better Stack spins up Dograh with docker compose up and builds a lead-qualification agent: a prompt node asks the caller's use case, a qualification step gathers company size and budget, an API tool call creates or updates a CRM lead, and a conditional branch transfers to a human if the lead qualifies ~02:01. Explicitly not no-code for non-devs — it's a way to avoid writing orchestration glue, letting developers write code only where it matters.

Tools: Mistral Vibe, Mistral Medium 3.5, Claude Code, Gemini CLI, Dograh, Vapi, Bland, Retell, Docker Compose.

AI Future Productivity Industry

Nate B Jones Sherwood Snacks EO Matt Pocock DeepLearningAI Every Lenny's Podcast Dwarkesh Patel Pragmatic Engineer Sequoia Capital Real Python Real Python Arjay McCandless EO LearnThatStack marimo Prefect

Quick hits: AI in school, AI resurrection, Glean's origin story, and more

A lot of smaller posts and clips today worth flagging, organized by theme.

AI x society

The cognitive atrophy story. Nate B Jones relays professors reporting that students — even those not using AI — can no longer read full chapters, synthesize arguments, or write drafts; faculty are redesigning courses around in-class work and oral exams^{[31]Nate B Jones — Professors notice something's drastically wrong}.
The AI resurrection economy. Sherwood Snacks: ElevenLabs and rivals are recreating voices and likenesses of dead celebrities (Judy Garland, John Wayne narration for $11/mo), sometimes with estate approval, sometimes without — film, audiobook, and ad work follows^{[32]Sherwood Snacks — It's AI-live!}.
Multimodal context. DeepLearningAI's clip: >80% of enterprise data is unstructured audio/image/video and <1% of it is ever analyzed — the multimodal opportunity is enormous because richer modalities yield richer understanding^{[33]DeepLearningAI — Data is hungry for context}.
The "coder's companion" that survives the bust. Real Python's short take: humans max out around 400 LOC; when the hype cycle ends, AI that summarizes and navigates 500K-line codebases will be the durable use case — possibly running on local open-source models^{[34]Real Python — The Coder's Companion: AI's Future}.

Builders & founders

Glean's origin story. EO interview with Arvind Jain: Glean started from a per-person productivity drop at Rubrik when employees couldn't find internal information; Jain offered the product free to 20 design partners for two years and explicitly chose not to train a foundation model^{[35]EO — Glean founder Arvind Jain}.
Inside the team building Claude. Every clip: an Anthropic team member describes a future where Claude understands itself well enough to select its own model variant, spin up sub-agents, and dynamically write its own architecture for a task — removing user-side orchestration^{[36]Every — Inside the Team Building Claude}.
Lenny on why great companies go bad. Lenny's Podcast (transcript not available): organizations are undone less by competition than by an internal force everyone obeys but no one controls — often accelerated by PE pressure to extract rather than reinvest^{[37]Lenny's Podcast — Why great companies go bad}.
Twitter CEO's office hours. Sequoia short: Dick Costolo returned to the office at 9:30 PM, talked to late-working engineers, and highlighted their projects at all-hands — culture change through modeled behavior, not mandates^{[38]Sequoia — Twitter CEO's late-night culture shift}.

Long-form podcasts (transcripts unavailable)

Dwarkesh interviews David Reich. Why humans waited 70,000 years to build civilization — geneticist David Reich argues farming only emerged in the last 12,000 years because the Holocene brought unprecedented climate stability, even though the cognitive prerequisites were in place much earlier^{[39]Dwarkesh Patel — Why humans waited 70,000 years to build civilization}.
Pragmatic Engineer interviews Hejlsberg. The Sun-vs-Microsoft lawsuit over Visual J++ convinced Microsoft it couldn't bet its developer platform on a competitor-licensed language — directly motivating the creation of C# and .NET^{[40]Pragmatic Engineer — Anders Hejlsberg: C# was born thanks to a lawsuit}.

Dev tools, productivity, and curios

Grill With Docs. Matt Pocock retires his Grill Me skill — which had a recurring failure mode where the AI re-learned project-specific terminology every session — and replaces it with Grill With Docs, layering shared domain-language documentation under the relentless-interview pattern^{[41]Matt Pocock — I stopped using /grill-me for coding}.
Pydantic AI structured output. Real Python intro to Pydantic AI: define a BaseModel with typed attributes, pass as output_type to Agent.run_sync(), and the LLM is forced to return structured, type-validated outputs as Python model instances^{[42]Real Python — Pydantic AI structured output}.
Tech debt isn't bad. Arjay McCandless: tech debt is a strategic tool — like borrowing to buy a house, it can be worth taking on if it gets you real users faster than a perfectly architected, never-shipped product^{[43]Arjay McCandless — Tech debt isn't bad}.
How quants really make money. EO clip from a former HFT quant: firms running $10–100B/day in volume profit mainly by exploiting structural market design flaws (futures expiration schedules tied to historical agricultural cycles, for example), not by being smarter than other traders^{[44]EO — How quants really make money}.
Fonts are programs. LearnThatStack: font files don't just store letter shapes — they contain embedded programs with conditional rendering instructions (e.g., shift a stroke one pixel at small sizes) that run every time text is drawn^{[45]LearnThatStack — A font file isn't just shapes. It's a program.}.
marimo paint widget. A reactive notebook demo: a live-drawing canvas updates every second using marimo's auto-step, detects black pixels, and adds random blobs around them in real time^{[46]marimo — Super Better Paint Widget}.
Prefect launches 11 things at once. Prefect's launch collection video is a comedic infomercial-style announcement of 11 new features released simultaneously, available in Prefect Cloud and open source^{[47]Prefect — The Prefect Launch Collection: 11 Greatest Hits}.
