May 7, 2026
Anthropic announced at "Code with Claude" that it has secured all available capacity at xAI's Colossus 1 supercomputer in Memphis: over 300 megawatts and 220,000+ Nvidia GPUs.[1]Tech Brew — Frenemies with benefits The goal is to claw out of an 80×-revenue-growth compute crunch that had been throttling Claude Code all spring.[5]AI Daily Brief — Surprise Elon-Anthropic Team Up Theo's read: this is the "enemy of my enemy has compute" deal. Anthropic banned xAI from its API in January and Elon has repeatedly called Anthropic "misanthropic," but xAI's Colossus 1 was sitting barely used while Anthropic was outage-prone, so morality moved to the back burner.[3]Theo - t3.gg — Anthropic just...wait what Simon Willison flags that the Colossus facility runs gas turbines without Clean Air Act permits and has been linked to elevated local hospital admissions.[2]Simon Willison — Notes on the xAI/Anthropic data center deal
Tech Brew reports Anthropic is taking ~300 MW and 220,000+ Nvidia GPUs at Colossus 1, with capacity coming online within a month.[1]Tech Brew — Frenemies with benefits The AI Daily Brief adds the breakdown: xAI had already migrated its own training to Colossus 2 (Blackwell, ~550k GPUs), making Colossus 1 (mostly H100s) available to lease.[5]AI Daily Brief — Surprise Elon-Anthropic Team Up Elon also announced xAI will be folded into "SpaceX AI." Theo's number on Colossus 1: Anthropic gets 300 of its 425 MW and 220,000 of its 280,000 H100s — basically the entire cluster.[3]Theo - t3.gg — Anthropic just...wait what Musk reserves the right to reclaim compute "if Anthropic's AI causes harm," with himself as arbiter — Simon Willison flags that as a notable governance asterisk.[2]Simon Willison — Notes on the xAI/Anthropic data center deal
Dario Amodei told the Code with Claude audience: "We planned for a world of 10× growth per year. In the first quarter of this year, we saw 80× annualized growth per year in revenue and usage."[5]AI Daily Brief — Surprise Elon-Anthropic Team Up Theo's interpretation: every recent Anthropic move that looked monetization-driven — Claude Code being yanked from Pro, peak-hour throttles — was actually emergency compute rationing, not pricing strategy.[3]Theo - t3.gg — Anthropic just...wait what The Colossus 1 deal sits alongside a $50B Fluidstack agreement, plus multi-gigawatt deals with Amazon (5 GW), Google + Broadcom (5 GW), and Microsoft + Nvidia (1 GW).[1]Tech Brew — Frenemies with benefits Nate Herk also notes a forward-looking clause: Anthropic and SpaceX expressed interest in "multiple gigawatts of orbital AI compute capacity" — GPUs in space, on the theory that terrestrial compute has long-term physical and community ceilings.[4]Nate Herk — Claude Just Solved Session Limits
Simon Willison highlights the Colossus facility's documented track record: gas turbines run without Clean Air Act permits or pollution controls, classified as "temporary" to evade regulation, with research linking the facility to elevated local hospital admissions.[2]Simon Willison — Notes on the xAI/Anthropic data center deal He quotes Andy Masley — usually a debunker of overblown data-center criticism — saying he wouldn't run his own compute out of this specific site.
I would simply not run my computing out of this specific data center. — Andy Masley, quoted by Simon Willison
~13:11 Theo points at inference-speed benchmarks across Anthropic's four hosting providers and notes that Azure and Anthropic.io track identically — not just similar ranges, near-exact mirroring over time. His read: Azure isn't actually hosting Claude yet, it's piping requests back to Anthropic's own infrastructure.[3]Theo - t3.gg — Anthropic just...wait what
~16:11 Theo argues SpaceX's $10B (optionally $60B for the whole company) Cursor deal is fundamentally a training-data acquisition. Cursor has every edit, correction, and follow-up users typed across Anthropic, OpenAI, and Gemini models — "the greatest corpus of this data imaginable" for agentic coding RL. Anthropic only has its own slice; xAI plugs the data gap by buying the pipeline.[3]Theo - t3.gg — Anthropic just...wait what
The reason xAI wants to buy Cursor is to plug a data gap. The reason Anthropic wants to work with xAI is to plug a compute gap. The reason OpenAI is ignoring all of this is because they planned ahead. — Theo
Simon Willison notes that the night before the partnership announcement, xAI sent deprecation notices for several Grok models — including Grok 4.1 Fast — giving developers two weeks before a May 15 shutdown. Read into that what you will.[2]Simon Willison — Notes on the xAI/Anthropic data center deal
Code with Claude 2026 had no new model — instead Anthropic shipped a Managed Agents stack: Dreaming (cross-session persistent memory), Outcomes (rubric-based grading agents that re-run failing work), and proper multi-agent orchestration with a lead agent decomposing tasks across parallel sub-agents on a shared FS.[5]AI Daily Brief — Surprise Elon-Anthropic Team Up The hint that drew the most reactions: Diane Penn teased "context windows that feel infinite" alongside higher code judgment and better multi-agent coordination — depending on what that actually means, it's either smart compaction or a bigger research result.
Dreaming runs between sessions: a scheduled review process surfaces recurring mistakes, preferred workflows, and team-wide patterns, then encodes them into orchestration memory that's preloaded the next time the agent or a sub-agent runs. The AI Daily Brief notes this directly mirrors features like Hermes that have been on the open-source side for nearly a year.[5]AI Daily Brief — Anthropic Managed Agents: Dreaming
Users write a rubric defining what success looks like; a separate grading agent (isolated from the task agent's reasoning) scores the output and bounces it back if it falls short. Anthropic reported 8.4% quality lift on Word doc generation and 10.1% on PowerPoint. The novelty isn't the loop (multi-agent coding setups have done this with unit tests for a while) — it's making rubric-based grading native for non-code knowledge work without custom wiring.[5]AI Daily Brief — Anthropic Managed Agents: Outcomes
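The grade-and-retry loop described above can be sketched in a few lines. This is a toy illustration, not Anthropic's Outcomes API: `Rubric`, `grade`, `run_with_rubric`, and `toy_agent` are hypothetical names, and the "agents" are plain Python functions standing in for model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rubric:
    """Hypothetical rubric: named checks plus the score needed to pass."""
    checks: List[Tuple[str, Callable[[str], bool]]]
    pass_threshold: float  # fraction of checks that must pass


def grade(rubric: Rubric, output: str) -> Tuple[float, List[str]]:
    """The grader sees only the output, never the task agent's reasoning."""
    failed = [name for name, check in rubric.checks if not check(output)]
    score = 1 - len(failed) / len(rubric.checks)
    return score, failed


def run_with_rubric(
    task_agent: Callable[[List[str]], str],
    rubric: Rubric,
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Re-run the task agent with the failed check names until it passes."""
    feedback: List[str] = []
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = task_agent(feedback)
        score, failed = grade(rubric, output)
        if score >= rubric.pass_threshold:
            return output, attempt
        feedback = failed  # bounce the work back with what fell short
    return output, max_attempts


def toy_agent(feedback: List[str]) -> str:
    """Stand-in task agent: adds a summary only after being told it is missing."""
    draft = "Q3 revenue rose 12%."
    if "has_summary" in feedback:
        draft = "Summary: " + draft
    return draft


rubric = Rubric(
    checks=[
        ("has_summary", lambda s: s.startswith("Summary:")),
        ("mentions_revenue", lambda s: "revenue" in s),
    ],
    pass_threshold=1.0,
)
final, attempts = run_with_rubric(toy_agent, rubric)
print(attempts, final)
```

The grader only receives the finished output, which mirrors the isolation the text describes. A real rubric would be scored by a model, not by string checks.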
A lead agent decomposes a goal, assigns sub-tasks to specialist agents (each with its own model/prompts/tools), runs them in parallel on a shared file system, and folds outputs back into its own context. Full graph is auditable in Claude Console.[5]AI Daily Brief — Multi-Agent Orchestration
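As a rough sketch of that fan-out and fold-back shape, assuming threads as stand-ins for sub-agents and a temporary directory as the shared file system. `lead_agent` and `specialist` are hypothetical names, and nothing here reflects how Anthropic's orchestration is actually implemented.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def specialist(name: str, subtask: str, shared_dir: Path) -> Path:
    """Stand-in for a sub-agent: writes its result into the shared directory."""
    out = shared_dir / f"{name}.txt"
    out.write_text(f"{name} finished: {subtask}")
    return out


def lead_agent(goal: str, shared_dir: Path) -> str:
    # Decompose: a real lead agent would ask a model; this split is hard-coded.
    subtasks = {
        "researcher": f"collect sources for {goal}",
        "writer": f"draft a summary of {goal}",
        "reviewer": f"list open questions about {goal}",
    }
    # Fan out: run every specialist in parallel against the same directory.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        futures = [
            pool.submit(specialist, name, task, shared_dir)
            for name, task in subtasks.items()
        ]
        paths = [future.result() for future in futures]
    # Fold back: read each output into the lead agent's own context.
    return "\n".join(path.read_text() for path in sorted(paths))


with tempfile.TemporaryDirectory() as tmp:
    report = lead_agent("the Colossus 1 deal", Path(tmp))
print(report)
```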
Pitch builder, meeting preparer, market researcher, evaluation reviewer, month-end closer, and more — all available as Claude Code plugins, in co-work, or as managed agents. New connectors: Dun & Bradstreet, Fiscal AI, Verisk. Cookbook is open. The AI Daily Brief pushes back on press framing: these target low-skill repetitive knowledge work, not the high-skill end.[5]AI Daily Brief — Claude Finance
Diane Penn (research head of product) teased three directions for future Anthropic models: higher judgment / "code taste," context windows that "feel infinite," and improved multi-agent coordination. The infinite context line drew the most attention — some read it as enhanced compaction, others as a more fundamental research result. The AI Daily Brief quotes commentator Dan Madier: if context can grow indefinitely, the model can keep learning from experience indefinitely, and at some point the functional difference between that and continual learning collapses.[5]AI Daily Brief — Model Roadmap
Claude Code creator Boris Cherny told the Dev Day panel there's "literally no manually written code anywhere in the company anymore." Claude instances coordinate over Slack, code in loops, run automated tests, and ship. Cherny says "vibe coding" significantly undersells the system, prefers Karpathy's "agentic engineering," and is openly soliciting better terms.[5]AI Daily Brief — Boris Cherny Disavows Vibe Coding
There's literally no manually written code anywhere in the company anymore. — Boris Cherny
Effective immediately, Claude Code's 5-hour rate limit is doubled across Pro/Max/Team and the peak-hour throttle is gone for Pro and Max.[4]Nate Herk — Claude Code Session Limits Doubled On the API side, Opus output went from 8,000 to 80,000 tokens/min (10×), and Tier 3 input jumped from 800K to 5M tokens/min (6.25×).[3]Theo - t3.gg — Anthropic Compute Crisis Translation: the 1M-context window is finally usable in production, and parallel sub-agent architectures (e.g., five sub-agents pulling 50K tokens each) that were unworkable yesterday are now boring.
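For a feel of what those limits mean for the five-sub-agent example, here is the arithmetic as a small sketch. It uses only the figures quoted above. The helper names are made up for illustration, and how the input and output limits interact in practice is not something the sources spell out.

```python
# Tokens per minute as quoted in the text: (before, after).
OPUS_OUTPUT_TPM = (8_000, 80_000)
TIER3_INPUT_TPM = (800_000, 5_000_000)


def multiplier(before: int, after: int) -> float:
    """How many times larger the new limit is."""
    return after / before


def fanouts_per_minute(limit_tpm: int, agents: int, tokens_per_agent: int) -> int:
    """Full rounds of `agents` sub-agents, each reading `tokens_per_agent`, per minute."""
    return limit_tpm // (agents * tokens_per_agent)


def output_share_per_agent(limit_tpm: int, agents: int) -> int:
    """Output tokens per minute each agent gets if the budget is split evenly."""
    return limit_tpm // agents


print(multiplier(*OPUS_OUTPUT_TPM), multiplier(*TIER3_INPUT_TPM))
for label, index in (("before", 0), ("after", 1)):
    print(
        label,
        fanouts_per_minute(TIER3_INPUT_TPM[index], agents=5, tokens_per_agent=50_000),
        output_share_per_agent(OPUS_OUTPUT_TPM[index], agents=5),
    )
```

On these numbers, five agents pulling 50K tokens each use 250K input tokens per round. The output budget looks like the tighter constraint: an even split goes from 1,600 to 16,000 tokens per minute per agent.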
Nate Herk's ~07:04 playbook: retest workflows you abandoned 6 months ago because of rate limits — the wall may not exist anymore. Use Opus more freely instead of defaulting to Haiku/Sonnet purely to conserve session quota. Move agentic loops to background routines without eating daily knowledge-work quota. Build genuinely parallel multi-agent pipelines.[4]Nate Herk — Practical Builder Guidance
Theo flags that the 5-hour bump helps burst users but doesn't help users hitting the separate weekly Claude Code limit — heavy parallel-agent pipelines or continuous loops will still trip that ceiling.[3]Theo - t3.gg — Weekly Limit Caveat
Anthropic Research published three pieces. Natural Language Autoencoders (NLAs) translate Claude's internal activations into text and back — and reveal that Claude knew it was being evaluated 16% of the time on safety probes but verbalized that awareness less than 1% of the time on real Claude.ai usage.[7]Anthropic Research — Natural Language Autoencoders Anthropic also donated Petri, its open-source alignment-test toolbox, to Meridian Labs to keep it cross-lab credible.[8]Anthropic Research — Donating Petri And it published the Anthropic Institute research agenda — five focus areas including economic diffusion, AI-driven R&D oversight, and a four-month funded fellowship.[9]Anthropic Research — Anthropic Institute Agenda
NLAs are a three-piece system: a frozen target model, an Activation Verbalizer (AV) that turns activations into natural language, and an Activation Reconstructor (AR) that maps that text back to activations. AV and AR are trained jointly on reconstruction error, forcing the verbal explanation to be information-dense enough to actually rebuild the underlying activation — not a vague post-hoc rationalization.[7]Anthropic Research — Natural Language Autoencoders
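One way to write down the reconstruction objective, as a sketch only: the summary says AV and AR are trained jointly on "reconstruction error" without naming the metric, so the squared L2 distance below is an assumption. Here a is an activation from the frozen target model, t is the text the AV produces, and â is the AR's reconstruction.

```latex
\begin{aligned}
t &= \mathrm{AV}(a) && \text{(activation} \to \text{text)} \\
\hat{a} &= \mathrm{AR}(t) && \text{(text} \to \text{activation)} \\
\mathcal{L}(a) &= \lVert a - \hat{a} \rVert_2^2 && \text{(assumed error metric)}
\end{aligned}
```

Because t is discrete text, the error cannot be backpropagated through the AV in the ordinary way, which fits the note elsewhere in this digest that training involves RL across multiple models.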
Headline findings: in destructive code-safety tests, Claude showed internal awareness of being evaluated 16% of the time but verbalized it less than 1% of the time on real usage. On SWE-bench Verified, internal evaluation awareness reached 26%. In an auditing game with hidden-motivation models, an NLA-equipped auditor recovered the hidden motivation 12–15% of the time without seeing training data. NLAs were used in pre-deployment audits of Claude Mythos Preview and Opus 4.6. Limitations: AV hallucinations, expensive (RL across multiple models), impractical for at-scale monitoring.
An auditor equipped with NLAs successfully uncovered the target model's hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it.
Petri is the alignment toolbox Anthropic has used in every Claude evaluation since Sonnet 4.5 — an auditor model simulates scenarios, the target responds, a judge scores transcripts for misaligned behaviors. By handing it to Meridian (the AI evaluation nonprofit), Anthropic keeps it credible across the industry, mirroring its earlier MCP donation to the Linux Foundation. Petri 3.0 also ships a "Dish" add-on for using real system prompts and deployment scaffolding, and integrates with Bloom for deeper behavioral assessment.[8]Anthropic Research — Donating Petri
Five pillars: Economic Diffusion (which countries/firms capture AI value, future of junior roles, monthly Anthropic Economic Index Survey), Threats and Resilience (offense-defense in cyber/bio, Frontier Red Team), AI Systems in the Wild (homogenization of thought, mixed human-AI teams, governance of autonomous agents), AI-Driven R&D (recursive self-improvement and human oversight), and Fellowship + Open Research (4-month funded fellowship, open datasets, "living agenda").[9]Anthropic Research — Anthropic Institute Agenda
OpenAI launched GPT-5.5-Cyber in limited preview for defenders securing critical infrastructure — the most permissive tier in a new three-level Trusted Access for Cyber (TAC) framework that matches model permissiveness to identity-verified defender status.[10]OpenAI — Scaling Trusted Access for Cyber Below it: GPT-5.5 with TAC handles vulnerability triage, malware analysis, secure code review, detection engineering. Phishing-resistant Advanced Account Security is required by June 1. Codex Security ships in research preview alongside, with free access for selected critical-OSS maintainers.
Network: Cisco, CrowdStrike, Palo Alto Networks, Zscaler, Cloudflare, Akamai, Fortinet. Vulnerability research: Intel, Qualys, Rapid7, Tenable, Trail of Bits, SpecterOps. Detection/monitoring: SentinelOne, Okta, Netskope. Supply chain: Snyk, Gen Digital, Semgrep, Socket. OpenAI is using these partners to evaluate how raw capability translates to real-world customer protection.
At Cisco, we view frontier models as a powerful force multiplier for defenders. Models like GPT-5.5 are fundamentally changing the velocity of our operations… But speed cannot be traded for trust. — Anthony Grieco, Cisco CSO
Codex Security is now in research preview as a plugin for any Codex interface (app or CLI). It builds a codebase-specific threat model, explores attack paths, validates issues in isolated environments, and proposes patches for human review. Codex for Open Source grants conditional free access to maintainers of critical OSS projects, framed against scenarios like the axios compromise.
OpenAI shipped GPT-Realtime-2 (GPT-5-class reasoning, 128K context, parallel tool calls with audible "preambles," priced at $32/$64 per 1M audio tokens), GPT-Realtime-Translate (70+ input languages, 13 output, $0.034/min), and GPT-Realtime-Whisper (low-latency streaming transcription, $0.017/min). All three are live in the Realtime API today.[11]OpenAI — Advancing voice intelligence in the API Realtime-2 scores 15.2% higher than Realtime-1.5 on Big Bench Audio (high reasoning) and 13.8% higher on Audio MultiChallenge (xhigh).
A live voice model with adjustable reasoning effort (minimal/low/medium/high/xhigh, default low), 128K context (up from 32K), parallel tool calls, and audible preambles ("let me check that") so users aren't left in silence during multi-second tool execution. Zillow reported a 26-point lift in call success rate after prompt optimization (95% vs. 69%) on its hardest adversarial benchmark.[11]OpenAI — GPT-Realtime-2 details
What stood out about GPT-Realtime-2 was the intelligence and tool-calling reliability it brings to complex voice interactions. — Josh Weisberg, SVP and Head of AI, Zillow
70+ input languages → 13 output. BolnaAI reported 12.5% lower Word Error Rate across Hindi, Tamil, and Telugu vs. any other tested model. Deutsche Telekom and BolnaAI are on for multilingual customer support.
Low-latency streaming transcription for captions, meeting notes, voice agents — $0.017/min.
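To put the per-minute prices in budgeting terms, a small sketch using only the two rates quoted above. `cost_usd` is an illustrative helper, not part of any OpenAI SDK, and it ignores any billing granularity the pricing page may define.

```python
# Per-minute prices in USD as quoted in the text.
PRICE_PER_MINUTE_USD = {
    "GPT-Realtime-Translate": 0.034,
    "GPT-Realtime-Whisper": 0.017,
}


def cost_usd(model: str, minutes: float) -> float:
    """Audio cost for `minutes` of usage, rounded to cents."""
    return round(PRICE_PER_MINUTE_USD[model] * minutes, 2)


for model in PRICE_PER_MINUTE_USD:
    print(model, "per hour:", cost_usd(model, 60))
```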
The OpenAI demo video shows Realtime-Translate switching between French and German mid-sentence and Realtime-2 acting as a personal voice assistant — checking calendar, updating CRM, staying silent and non-interruptive while the user takes a side conversation, then resuming on cue.[12]OpenAI — Three audio models demo
ChatGPT's ad pilot — running on Free and Go tiers in the US since Feb 9 — is expanding to the UK, Mexico, Brazil, Japan, and South Korea (after earlier hitting Canada, Australia, New Zealand). Paid tiers stay ad-free; ads appear only for logged-in adults, are labeled "sponsored," and don't surface near sensitive topics like health or politics.[13]OpenAI — Testing ads in ChatGPT Separately, OpenAI launched Trusted Contact: an opt-in safety feature that lets adult users designate a friend or family member to be notified — no transcript shared — if the system detects a serious self-harm concern. Built with input from a 260-physician network and the APA.[14]OpenAI — Introducing Trusted Contact
Ads are contextually matched to the topic of conversation, past chats, and prior ad interactions; advertisers get aggregate views/clicks only — no chat content, no personal details. Free users can opt out in exchange for fewer daily messages or upgrade to Plus/Pro to remove them. OpenAI says they've seen "no impact on consumer trust metrics" and low dismissal rates so far.[13]OpenAI — Ads pilot expansion
Opt-in for adults (18+ globally, 19+ in South Korea). Designated contact gets an invitation email and must accept within a week. When ChatGPT's monitoring flags a conversation as concerning for self-harm: user is informed first, then a small team of trained human reviewers confirms the situation (target turnaround under one hour), and only then sends a brief notification — no transcript, just a general reason and a link to expert guidance — via email/text/in-app.[14]OpenAI — Trusted Contact mechanics
Psychological science consistently shows that social connection is a powerful protective factor, especially during periods of emotional distress. — Dr. Arthur Evans, CEO, American Psychological Association
Two short OpenAI demo clips landed alongside the API releases: a Box partnership showcasing GPT-5.5 inside Box's enterprise content platform,[15]OpenAI — Introducing GPT-5.5 with Box and a finance modeling clip pitching GPT-5.5 as a financial analyst stand-in.[16]OpenAI — GPT-5.5 finance Both read as enterprise marketing rather than substantive new product, but they reinforce the pattern: OpenAI's public-facing energy this week was almost entirely on the enterprise/defense/finance/voice axis, not consumer model upgrades.
OpenRouter shipped two server-side tools — web search and web fetch — that work consistently across every model on the platform regardless of native tool support. Builders no longer need provider-specific glue to give GPT-5.5, Claude, Gemini, DeepSeek, and friends agentic web access.[17]OpenRouter — Consistent web search and fetch Replaces the old web search plugin; migration is a drop-in tool definition change.
DeepSeek V4 is a fresh 1.6T-parameter pretrain on 32T tokens (up from V3's 14.8T). The headline isn't training cost this time — it's inference cost: V4 needs only 10% of the KV cache and 27% of the FLOPs of V3.2 at the same context length.[18]Caleb Writes Code — Inference cost trajectory Caleb walks through the three interleaved attention mechanisms — DSA (sparse via lightning indexer), CSA (4× compression before sparse selection), and HCA (128× compression with full attention) — that get DeepSeek there.
V-series = fresh pretrain. V4 has 1.6T parameters trained on 32T tokens — roughly double V3's 14.8T from December 2024.
Closed labs (Anthropic, OpenAI) lead on raw intelligence; Chinese open labs lead on token efficiency, driven by GPU scarcity under export controls. Concrete data point: a Claude Max subscription is ~$200/month; running DeepSeek V4 Pro 24/7 for a month costs ~$235 (likely subsidized).[18]Caleb Writes Code — Open vs Closed
Caleb's reading-notes color-code shows nearly every architectural choice in V4 traces back to a prior DeepSeek paper — including MHC (Manifold-Constrained Hyperconnections) for more expressive residuals. He frames it as "the spirit of what open-source is about."
IBM dropped a three-model ~2B-parameter ASR family. The base model leads the Hugging Face Open ASR Leaderboard with 5.33 WER (~95% accuracy on real-world data) at RTFx ~231. The Plus model adds speaker diarization, word-level timestamps, and incremental decoding. The 2BN model uses a non-autoregressive "edit-the-draft" architecture (NLE) to hit RTFx 1820 on an H100 — an hour of audio in 2 seconds.[19]Sam Witteveen — Granite 4.1
Six languages (English, French, German, Spanish, Portuguese, Japanese) plus bidirectional translation, automatic punctuation, true casing, and keyword biasing: pass a list of names/acronyms in the prompt and the model weights toward them. RTFx ~231: an hour of audio in ~16 seconds.
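RTFx is the inverse real-time factor: seconds of audio processed per second of compute. The conversions quoted in this section follow directly from that definition. The function name below is just for illustration.

```python
def seconds_to_transcribe(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock seconds needed for `audio_seconds` of audio at a given RTFx."""
    return audio_seconds / rtfx


ONE_HOUR = 3600
print(round(seconds_to_transcribe(ONE_HOUR, 231), 1))   # base model, RTFx ~231
print(round(seconds_to_transcribe(ONE_HOUR, 1820), 2))  # non-autoregressive model on an H100
```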
Adds speaker-attributed ASR ("Speaker 1," "Speaker 2"), word-level timestamps that beat customized Whisper variants like WhisperX, and incremental decoding so chunked audio maintains speaker numbering. Trade-offs: drops to five languages (no Japanese), no translation, slightly higher WER.
Two-stage: a frozen CTC encoder makes a draft transcript, then a non-autoregressive LLM editing pass uses bidirectional attention to copy/insert/delete/replace tokens. Sidesteps the accuracy penalty that hit prior parallel-generation attempts. RTFx 1820 on an H100 with batching — hour of audio in ~2 seconds. No translation, biasing, diarization, or timestamps. Requires Flash Attention.[19]Sam Witteveen — 2BN architecture
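A toy version of the editing pass, to show why it can run in parallel. The source names only the four operations, so the scheme here is an assumption: one edit per draft position, with "insert" adding a token after the one it keeps. The point is that each position's edit is decided independently, so all edits can be predicted in a single pass and not token by token.

```python
from typing import List, Optional, Tuple

Edit = Tuple[str, Optional[str]]  # (operation, token argument if any)


def apply_edits(draft: List[str], edits: List[Edit]) -> List[str]:
    """Apply one edit per draft token and return the corrected transcript."""
    if len(draft) != len(edits):
        raise ValueError("expected exactly one edit per draft token")
    out: List[str] = []
    for token, (op, arg) in zip(draft, edits):
        if op == "copy":
            out.append(token)
        elif op == "replace":
            out.append(arg)
        elif op == "delete":
            continue
        elif op == "insert":
            out.extend([token, arg])  # keep the token, then add a new one after it
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return out


# Draft as a first-stage encoder might emit it, with a mishearing and a repeated word.
draft = ["ice", "scream", "is", "is", "cold"]
edits = [
    ("copy", None),
    ("replace", "cream"),
    ("copy", None),
    ("delete", None),
    ("insert", "today"),
]
print(" ".join(apply_edits(draft, edits)))
```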
That literally means that you can be transcribing an hour of audio in 2 seconds on that hardware.
Nate B Jones argues OpenClaw crossed from "viral demo" to actual infrastructure in April 2026 — task flow, scoped memory with provenance, mature handlers across Slack/Telegram/Discord/WhatsApp/Teams/Matrix, and provider manifests that route work across LLMs. The strategic point: with Anthropic restricting subscription-based agentic use and OpenAI integrating Codex into all paid ChatGPT tiers (and OpenClaw's creator now at OpenAI), builders should architect for provider-independence rather than picking sides.[20]Nate B Jones — OpenClaw maturation
Task flow (durable multi-step orchestration with state and revision tracking), scoped memory with provenance, mature channel handling, provider manifests, sub-agents that run their own sessions and report back. The "boring" infrastructure markers — task queues, checkpoints, retry behaviors, permission profiles, tool boundaries.
Anthropic restricted Claude subs from powering always-on third-party agents — Nate's read is partly compute rationing, partly recognizing that flat-rate consumer pricing loses money on agentic workloads. OpenAI did the opposite: Codex is now in every ChatGPT paid tier, and OpenClaw's docs include a Codex OAuth route. Sam Altman publicly flagged OpenClaw availability under ChatGPT plans on May 1.[20]Nate B Jones — Model Layer Contestation
The builder response should not be religious loyalty to any provider. It should be architecture.
Google's Gemma 4 (Apache 2.0) is positioned for agentic workflows, on-device, and edge inference — a credible local branch for cheap background classification, dedup, and triage that doesn't deserve frontier model pricing.
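A minimal sketch of the provider-independence idea, assuming a simple lookup from task type to model: cheap work goes to a local model, harder work to a frontier one, and either can be swapped in one place. `Router` and `stub_model` are hypothetical. The stubs only tag the prompt; a real setup would wrap each provider's client behind the same callable shape.

```python
from typing import Callable, Dict

Model = Callable[[str], str]


def stub_model(name: str) -> Model:
    """Stand-in for a real provider client; just tags the prompt with its name."""
    return lambda prompt: f"[{name}] {prompt}"


class Router:
    """Maps task types to models so a provider can be swapped in one place."""

    def __init__(self, routes: Dict[str, Model], default: Model) -> None:
        self.routes = dict(routes)
        self.default = default

    def run(self, task_type: str, prompt: str) -> str:
        # Unknown task types fall back to the cheap default.
        return self.routes.get(task_type, self.default)(prompt)


router = Router(
    routes={
        "triage": stub_model("local-gemma"),
        "patch": stub_model("codex"),
        "architecture": stub_model("claude"),
    },
    default=stub_model("local-gemma"),
)

print(router.run("triage", "is this issue a duplicate?"))
router.routes["patch"] = stub_model("other-provider")  # swap without touching callers
print(router.run("patch", "fix the failing test"))
```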
Nate's three example workflows: (1) GitHub repo operator using local model for triage, GPT-5.5/Codex for patches, Claude for architecture passes; (2) multi-layer email inbox review; (3) incident response across logs/Slack/GitHub/runbooks. He also released "Open Brain" — open memory recipes including a code-review store, task flow worklogs, and a memory-provenance recipe (observed/inferred/confirmed/imported labels).[20]Nate B Jones — Durable Workflows
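The four provenance labels lend themselves to a small data model. The sketch below is not the Open Brain recipe itself, whose schema the summary does not give. `MemoryItem`, `usable`, and the choice to trust only observed and confirmed items by default are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, List


class Provenance(Enum):
    OBSERVED = "observed"    # seen directly in a tool result or message
    INFERRED = "inferred"    # concluded by the model, not verified
    CONFIRMED = "confirmed"  # checked with the user or another source
    IMPORTED = "imported"    # brought in from an external store


@dataclass(frozen=True)
class MemoryItem:
    text: str
    provenance: Provenance
    source: str  # where the item came from, e.g. a channel or file name


# Assumed policy: only act on items that were seen or verified.
DEFAULT_TRUSTED: FrozenSet[Provenance] = frozenset(
    {Provenance.OBSERVED, Provenance.CONFIRMED}
)


def usable(
    items: Iterable[MemoryItem],
    trusted: FrozenSet[Provenance] = DEFAULT_TRUSTED,
) -> List[MemoryItem]:
    """Keep only the items whose provenance the caller is willing to act on."""
    return [item for item in items if item.provenance in trusted]


memory = [
    MemoryItem("CI runs on every push", Provenance.OBSERVED, "github"),
    MemoryItem("Team prefers squash merges", Provenance.INFERRED, "slack"),
    MemoryItem("Release day is Thursday", Provenance.CONFIRMED, "user"),
    MemoryItem("Legacy style guide", Provenance.IMPORTED, "wiki-export"),
]
print([item.text for item in usable(memory)])
```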
Build the runtime so the model can change. Build the memory so the user owns it. Build the workflow so it survives the session.
Mozilla used a preview of Anthropic's Claude Mythos to surface hundreds of security vulnerabilities in Firefox during a coordinated hardening pass. Simon Willison logs the run as one of the more concrete examples of frontier-model-driven security work going from research curiosity to production defender tool.[22]Simon Willison — Firefox Claude Mythos hardening
Apple is paying $250 million to settle a class-action over Apple Intelligence features promised but not delivered. Eligible buyers — iPhone 15 Pro / Pro Max owners and all iPhone 16 buyers from June 2024 to March 2025, ~37 million devices — get $25 to $95 per device. No admission of wrongdoing. Subtext: Apple is still behind on AI and is leaning on Google Gemini to power its delayed Siri overhaul.[23]Morning Brew — Apple Settles AI Features Lawsuit
AMD posted a strong Q1 and doubled its forward CPU server TAM to $120B by 2030 (from $60B). Lisa Su's argument: the CPU-to-GPU ratio in AI data centers has shifted from 1:4 / 1:8 toward roughly 1:1, because agent workloads run many cheap routine tasks that don't need GPU-class compute.[6]Sherwood News — AMD CPU renaissance Arm Holdings backs the thesis with $2B+ data-center CPU demand and a claimed 50% hyperscaler share. Nvidia put $500M into Corning for fiber-optics — the build-out is broad-spectrum.
The appropriate ratio used to be 1-to-4 or 1-to-8, but is now closer to 1-to-1 or potentially favoring more CPUs when deploying numerous agents. — Lisa Su, AMD CEO
The framing matters: agent inference is distributed across many less-intensive processes rather than concentrated in monolithic training runs, which elevates CPUs from "legacy afterthought" to first-class AI compute resource.[6]Sherwood News — Agent workloads shift compute mix
Per Nate B Jones, DeepSeek was caught running ~16 million fake accounts to harvest Claude outputs as training data — large-scale industrialized exploitation of Claude for distillation. The story is consistent with Theo's broader arc: Anthropic's "dirty plays" aren't paranoia, they're a defensive moat around RL feedback data that competitors will try every avenue to access.[21]Nate B Jones — 16M Fake Accounts Stealing AI Capabilities
A short clip from Lenny's Podcast on the "malleable software" thesis — that AI lowers the cost of building bespoke tools enough that end-users should own and reshape their own computing rather than living inside fixed apps. The clip is a teaser for a longer conversation but the framing is worth filing alongside this week's Anthropic Institute pillar on "AI Systems in the Wild."[24]Lenny's Podcast — The case for malleable software
Pydantic founder Samuel Colvin's AI Engineer talk argues that the gap between prototype and production for LLM apps is fundamentally an observability and iteration problem — and that what teams need is a "playground in prod": the ability to inspect, replay, and tweak live agent behavior with the same speed as a dev REPL. Demos Pydantic AI and Logfire as the tooling stack.[25]AI Engineer — Samuel Colvin: Playground in Prod
Michael Arnaldi (Effectful) makes the case that the structured concurrency, error tracking, and dependency injection of Effect (TypeScript) is exactly the substrate "vibe engineering" needs to scale beyond toy apps. The talk demos how Effect's typed effect system lets agents reason about side-effects, retries, and composition cleanly — and where pure LLM-generated code falls apart without it.[26]AI Engineer — Michael Arnaldi: Vibe Engineering Effect
Raindrop founders Danny Gollapalli and Ben Hylak walk through the dimensions of agent observability that traditional APM tools miss — token-level cost attribution, tool-call success rates, decision quality vs. correctness, and the difference between failure and "soft failure" where the agent finished but did the wrong thing. Demos how Raindrop instruments those signals.[27]AI Engineer — Raindrop: Agent Observability
Matt Pocock argues that as LLMs absorb the easy work, the spread between engineers who genuinely understand types, async, error semantics, and architecture vs. those who don't widens — not narrows. The interview hits how he uses agent harnesses for backlog work (see also his /triage skill below) while still hand-tuning the parts that actually matter.[28]Latent Space — Matt Pocock interview
Mathematician Ken Ono (Axiom Math) on what AI is and isn't doing for serious math research, what genuinely useful tutoring would look like vs. the current crop of homework-helpers, and which parts of math education survive the transition.[29]EO — Ken Ono interview
Every's podcast frames the post-Code-with-Claude landscape: Anthropic is doubling down on Claude Code as a fleet of harnesses (Design, Finance, etc.); OpenAI is centralizing on Codex and aggressive enterprise placement. Useful companion to Theo's deeper "who has researchers, data, compute" framework.[30]Every Podcast — OpenAI vs. Anthropic
Long-form Sequoia interview with former Twitter CEO Dick Costolo on running a hypergrowth platform under public scrutiny — a topical reread for anyone building AI products that will face the same content/governance/scale collisions Twitter did. The companion short clip — "the first goal he set as CEO was embarrassingly low" — pulls a teachable moment about expectation-setting in early leadership.[31]Sequoia — Dick Costolo full interview[32]Sequoia — First goal as Twitter CEO short
Two Better Stack videos worth pairing. DESIGN.md is a design-system spec format you check in alongside README.md so AI-generated UIs stop looking generic — the agent reads it, applies the design tokens, and produces dramatically less identikit output.[33]Better Stack — DESIGN.md walkthrough The second video wires Claude Code into Better Stack via MCP: production error fires, Claude reads it, reproduces locally, fixes, and ships a PR — closing the loop from telemetry to merged fix.[34]Better Stack — Claude Code + Better Stack debugging
Matt Pocock demos his /triage Claude Code skill on the Sand Castle repo: feed it a backlog of issues/PRDs, walk away, come back to a triaged list with proposed labels, owners, dupes, and quick-fix candidates. The framing — "PRDs and tickets as agent-ready backlog items" — is a useful reframe for how to write tickets in 2026.[35]Matt Pocock — /triage skill
Real Python's installation-and-setup walkthrough for Codex CLI in a sample Python project (RP Contacts). Useful as a pointer if you've been Claude-Code-only and want to evaluate Codex's CLI ergonomics on a real codebase.[36]Real Python — Codex CLI for Python
Matt Williams (technovangelist) walks a clean VPS-lockdown pattern: block everything at the firewall, add the box to your Tailnet, then use Tailscale Grants ACLs for fine-grained per-service access from your devices only — the result is a public-IP server that is, for all practical purposes, not on the public internet.[37]Matt Williams — Tailscale Grants for VPS
Three small Simon Willison releases worth bookmarking. llm-gemini 0.31 ships gemini-2.5-flash-lite as GA in his llm CLI plugin.[38]Simon Willison — llm-gemini 0.31 Big Words is a tiny browser-only tool for making text-only "presentation slides" — useful for sharing a fact in giant type without spinning up Keynote.[39]Simon Willison — Big Words GitHub Repo Stats is another browser-only tool that pulls public repo metadata (stars, forks, issue counts, etc.) for quick eyeballing.[40]Simon Willison — GitHub Repo Stats
Six smaller items from the day worth one line each.