Opus 4.8 takes #1; Anthropic's $965B day

AI Models

Claude Opus 4.8 lands as the new #1 model — "a modest but tangible improvement"

Anthropic shipped Claude Opus 4.8, and Artificial Analysis immediately crowned it the new #1 model with a 61.4 Intelligence Index, 1.2 points clear of GPT-5.5.^{[3]Artificial Analysis — Claude Opus 4.8: the new #1 model} Pricing stays flat at $5/$25 per million tokens, the model is 4× less likely than 4.7 to let code flaws slip by, and fast mode dropped to $10/$50.^{[1]Anthropic — Introducing Claude Opus 4.8} Simon Willison praised the candor of Anthropic's "modest but tangible improvement" framing, but flagged that the honesty gains come largely from abstaining on uncertain questions rather than knowing more.^{[2]Simon Willison — a modest but tangible improvement} Power users at Every, who had drifted to Codex during 4.7's reign, say 4.8 pulled them back and earned a rare "paradigm shift" grade.^{[6]Every — Why Opus 4.8 Pulled Me Back to Claude}

Benchmarks and pricing

Anthropic's launch post frames four improvement areas over Opus 4.7: agentic judgment/reliability, coding quality, honesty, and multimodal consistency. Headline agentic results include 84% on Online-Mind2Web and a first-ever clearing of the 10% all-pass threshold on the Legal Agent Benchmark, alongside a research-preview "Dynamic Workflows" feature that runs hundreds of parallel subagents across codebase-scale migrations.^{[1]Anthropic — Introducing Claude Opus 4.8} Standard pricing is unchanged at $5 input / $25 output per million tokens with a 1M-token context window; fast mode fell from $30/$150 to $10/$50.

Artificial Analysis pegs Opus 4.8 at 61.4 on its Intelligence Index, a +4.1 jump over 4.7 and 1.2 points ahead of GPT-5.5. On the agentic GDPval-AA benchmark it scores 1,890 Elo (an implied ~67% win rate vs GPT-5.5) while using 15% fewer turns and 35% fewer output tokens than Opus 4.7, though still ~30% more turns than GPT-5.5. It leads on Humanity's Last Exam by a point but trails GPT-5.4/5.5 on the CritPt physics benchmark, and ranks #2 on AA-Omniscience hallucination accuracy behind Gemini 3.1 Pro.^{[3]Artificial Analysis — Claude Opus 4.8 analysis and benchmarks}
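The "implied win rate" wording follows from the standard Elo expected-score formula. A minimal sketch, assuming the conventional 400-point logistic scale; the source does not state GPT-5.5's Elo, so the gap below is back-solved from the ~67% figure rather than quoted, and the function names are illustrative.

```python
import math

def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability, ignoring draws) of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_gap_for_win_rate(win_rate: float) -> float:
    """Rating gap that produces the given expected score."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))

# Gap implied by a ~67% win rate (an inference, not a published number)
gap = elo_gap_for_win_rate(0.67)
print(round(gap))  # about 123 points

# Round trip: a model rated `gap` points below 1,890 loses ~67% of the time
print(round(elo_expected_score(1890, 1890 - gap), 2))
```

Read this way, a 67% head-to-head win rate corresponds to a lead of a bit over 120 Elo points.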

The honesty caveat, and useful API changes

Willison zeroes in on a line from Anthropic's own system card: Opus 4.8 had the "lowest incorrect-rate of the six models on every benchmark" — but achieved it "by abstaining on questions about which it was uncertain rather than by answering more questions correctly." In other words, the improvement is risk-aversion, not raw knowledge. He highlights two practically useful API changes: mid-conversation system messages (update instructions mid-session without resetting the prompt cache) and a lower prompt-cache minimum of 1,024 tokens, down from 4,096. His max-effort pelican-SVG test cost 43 cents for a single generation — a reminder that frontier quality stays expensive.^{[2]Simon Willison — Claude Opus 4.8} Willison's llm-anthropic 0.25.1 shipped same-day with claude-opus-4-8 support, a -o fast 1 flag, and a fix for the long-standing 8,192-token output cap (each model now defaults to its own maximum).^{[4]Simon Willison — llm-anthropic 0.25.1}
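The abstention point is easy to see with a toy tally. A minimal sketch with made-up numbers (nothing here comes from the system card): two hypothetical models get the same number of questions right, but one declines to answer when unsure, which lowers its incorrect rate without raising its accuracy.

```python
from dataclasses import dataclass

@dataclass
class Tally:
    correct: int
    incorrect: int
    abstained: int

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained

    @property
    def incorrect_rate(self) -> float:
        # Share of all questions answered wrongly
        return self.incorrect / self.total

    @property
    def accuracy(self) -> float:
        # Share of all questions answered correctly
        return self.correct / self.total

# Hypothetical figures, for illustration only
answers_everything = Tally(correct=60, incorrect=40, abstained=0)
abstains_when_unsure = Tally(correct=60, incorrect=15, abstained=25)

for name, t in [("answers everything", answers_everything),
                ("abstains when unsure", abstains_when_unsure)]:
    print(f"{name}: accuracy={t.accuracy:.2f} incorrect_rate={t.incorrect_rate:.2f}")
```

Both rows show the same accuracy; only the incorrect rate moves, which is the distinction Willison is drawing.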

What changed for power users

Nate Herk notes the model is built on 4.7 with sharper judgment and longer autonomy, and that effort levels now span low / medium / high / X-high / max / "Ultra Code" (X-high + workflows), accessible via /workflows in Claude Code.^{[5]Nate Herk — Opus 4.8} Anthropic devoted a blog section to honesty, citing cases where 4.7 overclaimed (saying a task took 4 hours when it took 20 minutes, or reporting 50 items pushed when only 15 were); 4.8's score on misaligned-behavior evals is roughly half that of 4.7 and Sonnet 4.6 ~03:02. Herk also notes Anthropic teased "Mythos," a model class above Opus currently limited to cybersecurity research.

Every's team, self-described Claude die-hards who had defected to Codex and GPT-5.5, rated 4.8 a "gold/paradigm shift," a grade they call very rare ~05:02. On their senior-engineer benchmark (humans score 80–90), 4.8 hit 63 vs GPT-5.5's 62 and landed far above 4.7; on writing it scored 79.6 vs 73, making it the best writing model they have tested. But they argue "the harness matters as much as the model": Codex's clean single-pane app keeps it the daily driver despite the Claude app's fragmented chat/code/co-work tabs.^{[6]Every — Why Opus 4.8 Pulled Me Back to Claude}

"Anthropic, I know you're trying to underpromise, but you are overdelivering." — Every

Tools: Claude Opus 4.8, Claude Code, Dynamic Workflows, Effort Control, llm-anthropic, Codex, GPT-5.5

Industry

Anthropic OpenRouter

AI's money day: Anthropic's $65B Series H at $965B, OpenRouter's $113M

Anthropic closed a $65B Series H at a $965B post-money valuation — one of the largest private rounds ever — with run-rate revenue crossing $47B by May.^{[7]Anthropic — $65B Series H at $965B} On the same day, model-routing layer OpenRouter raised a $113M Series B led by Google's CapitalG at a ~$1.3B valuation, as weekly token volume exploded 5× to 25 trillion tokens in six months.^{[8]OpenRouter — $113M Series B}

Anthropic Series H

The $65B round was led by Altimeter, Dragoneer, Greenoaks, and Sequoia, with co-leads including Capital Group, Coatue, D1, GIC, ICONIQ, and XN, and a long tail of institutions (Blackstone, Fidelity, General Catalyst, Lightspeed, T. Rowe Price, Temasek). Separately, hyperscalers committed $15B — including $5B from Amazon — and memory partners Micron, Samsung, and SK hynix joined. Anthropic says the capital funds safety/interpretability research, compute, and product scaling, and that annualized run-rate revenue crossed $47B by May 2026.^{[7]Anthropic — Series H}

"Startups and Global 5000 companies alike are deploying Claude to handle complex workflows." — Anthropic

OpenRouter Series B

OpenRouter's $113M Series B was led by CapitalG (Alphabet's growth fund) at roughly $1.3B, with NVIDIA's NVentures, ServiceNow, MongoDB, Snowflake, and Databricks ventures co-investing alongside a16z and Menlo. The platform offers a single API across 400+ models for 8M+ developers; weekly token volume jumped from 5T to 25T over six months and is on pace to exceed one quadrillion tokens in 2026. Funds go toward infrastructure, enterprise controls (workspaces, spend management, zero-data-retention), and intelligent routing, plus new multimodal inference for image, audio, speech, and video.^{[8]OpenRouter — Series B}
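The "one quadrillion" projection can be sanity-checked with back-of-envelope arithmetic. A rough sketch, assuming weekly volume simply holds flat at the reported 25T for a full year (a simplification; the source gives no growth forecast):

```python
TRILLION = 10**12
QUADRILLION = 10**15
WEEKS_PER_YEAR = 52

weekly_before = 5 * TRILLION   # reported volume six months earlier
weekly_now = 25 * TRILLION     # reported current weekly volume

growth_multiple = weekly_now // weekly_before
annualized = weekly_now * WEEKS_PER_YEAR  # flat-volume assumption

print(f"growth: {growth_multiple}x")
print(f"annualized: {annualized / QUADRILLION:.1f} quadrillion tokens")
```

Even with no further growth, the current weekly rate annualizes to about 1.3 quadrillion tokens, so the claim does not depend on the 5× trend continuing.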

Tools: Claude, OpenRouter

Hot Take

Theo - t3.gg

Theo: is Anthropic actually profitable now?

Reacting to news that Anthropic told investors Q2 revenue will more than double to ~$10.9B with its first-ever operating profit, Theo argues the profitability is real but partly accidental — driven by zero-compute revenue-share deals across AWS/GCP/Azure, stealth price and tokenizer hikes, Claude Code inference demand, and conservative compute bets rather than purely organic growth.^{[9]Theo — Anthropic is profitable now}

~00:00 Theo opens skeptical of the AI bubble, then reacts to the headline: Anthropic will more than double revenue to ~$10.9B in Q2 and post its first operating profit. ~02:01 The raise history is dizzying — a $61B valuation 14 months ago to ~$900B now (15×), with the new round diluting under 10%. Revenue went $87M (2024) → ~$9B (end 2025) → ~$30B run-rate now (~$11B/quarter).

~04:03 The distribution argument: Anthropic models are the only strong frontier option on AWS (which powers most of the Fortune 500), and they're on all three clouds. ~07:06 Crucially, Anthropic strikes deals where it puts up zero compute — just ships weights to AWS/Google and takes ~50% of every token's revenue, a high-margin profit farm that frees its own GPUs for research.

~10:08 The pricing thesis: the shift from 3 tiers to 4 (adding Mythos) was a stealth price hike, with Opus 4.5 effectively taking Sonnet's role ($15→$25/M out); the 4.7 tokenizer generates 30–50% more tokens; and wrong answers loop into million-token bills. ~14:11 On Deep-SWE benchmarks GPT-5.4 used 67K tokens to Opus 4.7's 100K; per-run costs were ~$16 (4.7) vs $3.30 (5.4) vs $5.80 (5.5). ~18:14 Costs have a ceiling because Nvidia is booked to ~2028; conservative compute bets slowed Anthropic's cost ramp but caused shortages, pushing it to pay xAI ~$1.25B/month. ~25:20 He debunks the Microsoft "cancellation" (just rerouted via Copilot CLI) and Uber overrun (failed to re-budget for Opus 4.5), concluding Opus 4.5 was the single biggest catalyst.

"They put zero dollars on the line. They give them zero compute… but they take a 50% fee or so of every single token sent."

"Wrong answers cost more than right ones."

Tools: Claude Code, Codex, Opus 4.5/4.6/4.7, GPT-5.4/5.5, AWS Bedrock, Google TPU, Azure, GitHub Copilot CLI

Podcast

Latent Space

Latent Space interviews Walden Yan & Cole Murray: Devin's 80% moment

Cognition co-founder Walden Yan and Open Inspect creator Cole Murray date the "end of hand-held coding" to a December 2025 inflection (Opus 4.5, GPT-5.2), and share that Devin's merged PRs grew 7× in 2–3 months while headcount rose ~10%, with Devin's commit share across all Devin repos jumping from 16% in January to 80% in March.^{[10]Latent Space — Devin's 80% Moment} The bulk of the conversation is architectural: harness-in-vs-out-of-the-box, full VMs over Docker, why testing is harder than computer use, and why memory and multi-agent remain unsolved.

~02:02 The December 2025 inflection. Models reached the point where teams stopped hand-holding in the IDE and could drive spec-to-PR autonomously, making background/cloud agents practical. ~04:04 The numbers Cognition was "afraid to release": 7× merged-PR growth, 16%→80% commit share.

~05:04 Open Inspect. Cole built an open-source cloud background-agent system and deliberately won't monetize it — the middle layer is hard to defend; money is made at the sandbox (Daytona, E2B) and model layers. ~11:08 The core "harness in the box vs out of the box" decision: in-box forces secrets into the sandbox (exfiltration risk); Cognition keeps the "brain" in a control plane and treats the sandbox as the "hands."
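The "brain in the control plane, hands in the sandbox" split can be sketched in a few lines. This is an illustration only: a child process with a scrubbed environment stands in for a real sandbox VM, and the variable names (`MODEL_API_KEY`, `GITHUB_TOKEN`) are hypothetical, not anything Cognition or Open Inspect documents.

```python
import os
import subprocess
import sys

# Hypothetical credential names held only by the control plane
SECRET_ENV_VARS = {"GITHUB_TOKEN", "MODEL_API_KEY"}

def run_in_sandbox(argv: list[str]) -> str:
    """Stand-in for a sandbox: run a command with secrets stripped from its env."""
    clean_env = {k: v for k, v in os.environ.items() if k not in SECRET_ENV_VARS}
    result = subprocess.run(argv, env=clean_env, capture_output=True,
                            text=True, check=True)
    return result.stdout

def control_plane_step() -> str:
    """The 'brain': holds credentials and decides what the 'hands' should run."""
    os.environ.setdefault("MODEL_API_KEY", "sk-demo")  # never forwarded below
    probe = "import os; print('MODEL_API_KEY' in os.environ)"
    return run_in_sandbox([sys.executable, "-c", probe]).strip()

print(control_plane_step())  # the sandboxed process cannot see the key
```

The design choice is that anything the agent executes inside the box, including injected instructions, has no credential to exfiltrate; the cost is that every privileged action has to round-trip through the control plane.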

~15:09 Infra. Repo setup is a "perennial problem"; Docker-in-Docker is a poor boundary vs full Firecracker VMs; a custom "block diff file storage format" enables VM snapshot/restore at a cost proportional to the filesystem diff; and slow grep was traced to S3-backed network filesystems, not a missing index.
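The episode does not describe Cognition's storage format, but the general idea of a snapshot whose size tracks the diff rather than the disk can be sketched. The following assumes fixed 4 KiB blocks and SHA-256 block hashes, both of which are illustrative choices and not details from the interview.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size

def block_hashes(data: bytes) -> list[str]:
    """Hash each fixed-size block of the image."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

def diff_blocks(base: bytes, current: bytes) -> dict[int, bytes]:
    """Return only the blocks of `current` that differ from `base`, by index."""
    base_h = block_hashes(base)
    changed = {}
    for idx, h in enumerate(block_hashes(current)):
        if idx >= len(base_h) or base_h[idx] != h:
            changed[idx] = current[idx * BLOCK_SIZE:(idx + 1) * BLOCK_SIZE]
    return changed

def restore(base: bytes, changed: dict[int, bytes], length: int) -> bytes:
    """Rebuild the current image from the base plus the changed blocks."""
    blocks = [base[i:i + BLOCK_SIZE] for i in range(0, len(base), BLOCK_SIZE)]
    for idx, blk in changed.items():
        while len(blocks) <= idx:
            blocks.append(b"")
        blocks[idx] = blk
    return b"".join(blocks)[:length]

base_image = bytes(BLOCK_SIZE * 8)   # eight zeroed blocks
edited = bytearray(base_image)
edited[5000] = 1                     # touch a single byte in block 1
current_image = bytes(edited)

snapshot = diff_blocks(base_image, current_image)
print(len(snapshot), "of 8 blocks stored")
```

Only the one modified block is stored, and `restore(base_image, snapshot, len(current_image))` reproduces the edited image, which is why snapshot and restore time scale with how much changed rather than with total disk size.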

~20:13 Testing > computer use. People overindex on click coordinates; the hard part is orchestrating front/back-end services at the right version and triggering feature-flagged behavior — sometimes orchestrating multiple frontier models. ~30:24 Memory is largely unsolved (~95% of Devin's memories are auto-generated). ~38:27 Yan stays skeptical of multi-agent; the practical regime is manager-subagent with isolated boxes.

~46:35 AI slop guardrails. "Your codebase regresses to your worst engineer" — enforce module boundaries and lint rules against AI code smells (getattr/hasattr reward-hacking, untyped tuples, GPT-style backwards-compat shims). ~59:43 Use cases: SRE auto-triage on Slack/Datadog/Sentry alerts, PMs filing fixes, continual security scanning; reasonable spend cited at $1,000–$5,000 per engineer.
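A lint rule for the getattr/hasattr smell is small enough to write with Python's standard `ast` module. This is a sketch of the idea and not the rule set discussed on the podcast; it only flags bare calls by name, and a real linter would also want an allow-list for legitimate uses.

```python
import ast

FLAGGED_CALLS = {"getattr", "hasattr"}

def find_reflection_calls(source: str) -> list[tuple[int, str]]:
    """Return (line, name) for each bare getattr/hasattr call in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in FLAGGED_CALLS):
            hits.append((node.lineno, node.func.id))
    return sorted(hits)

sample = """\
def price(item):
    if hasattr(item, "price"):
        return getattr(item, "price", 0)
    return item.cost
"""

print(find_reflection_calls(sample))  # [(2, 'hasattr'), (3, 'getattr')]
```

Wired into CI, a check like this turns "don't paper over missing attributes" from a review comment into a failing build.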

"Devin commit percentages on all Devin repos was 16% in January and now 80% in March. We were afraid to release this."

Tools: Devin / Cognition, Open Inspect, Windsurf 2.0, Claude Code, Codex, Firecracker VMs, Modal, Daytona, E2B, Cloudflare, MCP, Semgrep

Podcast

Y Combinator

YC Paper Club: inference, diffusion & world models

The inaugural YC Paper Club at the Pioneer building features five paper presentations, threaded by the idea that inference and data efficiency are capability levers, not just cost concerns. Highlights: Speculative Speculative Decoding hitting ~300 tok/s on Llama-3 70B, a LeJEPA world model 50× faster than rivals, and a data-constrained pre-training recipe yielding 5–17× data-efficiency wins.^{[11]Y Combinator — YC Paper Club}

~00:07 Intro. The first-ever YC Paper Club, ~100 researchers/founders, riffing on the W16 OpenAI origin story in the same building.

~03:13 Paper 1: Speculative Speculative Decoding (SSD). Tanishq (Stanford, with Tri Dao) reframes inference as a capability lever — tokens/sec equals peak deliverable intelligence. SSD runs draft and verify on separate hardware in parallel, predicting verification outcomes ~80–90% of the time, beating SGLang on both latency and throughput at ~300 tok/s for Llama-3 70B on 4× H100s.
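For readers new to the draft-and-verify idea, here is a sketch of plain greedy speculative decoding with toy stand-in models. It is the sequential baseline and not the SSD scheme from the talk, which additionally runs drafting and verification in parallel on separate hardware; `target_model` and `draft_model` are invented arithmetic functions used only to make the example runnable.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding; output matches target-only greedy decoding."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each position (one batched pass in a real system).
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                out.extend(proposal[:i])
                out.append(expected)  # keep the target's token at the mismatch
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out

def target_model(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50

def draft_model(ctx: List[int]) -> int:
    guess = target_model(ctx)
    return guess if len(ctx) % 5 else (guess + 1) % 50  # wrong every 5th position

def greedy(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out

fast = speculative_decode(target_model, draft_model, [1, 2, 3], max_new=20)
print(fast == greedy(target_model, [1, 2, 3], max_new=20))  # True
```

The speedup in practice comes from step 2 being a single parallel forward pass over all drafted positions; the higher the draft's acceptance rate, the more tokens each target pass yields.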

~17:19 Paper 2: Diffusion Model Predictive Control (DMPC). Stannis (Google DeepMind) uses diffusion for both action proposals and dynamics, with runtime adaptation to novel rewards (train locomotion → induce jumping) and novel dynamics (a "broken ankle" walker recovers).

~29:28 Paper 3: LeJEPA world model. Isaac Ward (LeCun group, tied to LeCun's reported $1.03B raise) replaces anti-collapse tricks with a single SIGReg loss term — ~50× faster, 15M params on a single <24GB card, with native model-error/uncertainty quantification.

~43:35 Paper 4: "Deep Learning is Not So Mysterious." PAC-Bayes plus compressibility and flat minima explain overparameterization and benign overfitting, with non-vacuous bounds at billion-parameter scale.

~50:38 Paper 5: Pre-training under data constraints. Konwoo (Chris Ré's lab, with Percy Liang and Tatsu): aggressive regularization (~30× weight decay) + ensembling + distillation yields ~5× data-efficiency, and continued pre-training shows 4B math tokens matching a full 73B-token corpus — a ~17× win.

"Inference today is seen as a cost or convenience lever. But in one, two, or three years inference is going to be seen as a capability… tokens per second is exactly the peak intelligence that you can deliver."

Tools: vLLM, SGLang, Llama-3 70B, JEPA/LeJEPA, SIGReg, DINO-WM, Dreamer, TD-MPC, DCLM, PAC-Bayes

Podcast

Sequoia Capital

Sequoia interviews Neuralink's DJ Seo: connecting brains and AI

Neuralink president and co-founder DJ Seo walks through the company's brain-computer interfaces — Telepathy (cursor control by thought), Convoy (robotic limbs), and Blindsight (writing vision into the visual cortex) — across 20+ human patients, and frames AI as a future "exocortex" connected via a high-bandwidth neural interface that solves the I/O bottleneck between human output and AI.^{[12]Sequoia — Neuralink's DJ Seo}

~00:01 Patient demo. Telepathy lets locked-in/quadriplegic ALS patients control a cursor by thought; previews of brain-controlled robotic limbs and Blindsight for vision restoration. ~03:09 Origin. Co-founded with Musk in 2016 around the human-output/AI-capability I/O bottleneck, which felt "insane" then and "more real" every week.

~06:10 Built for scale. The underappreciated "Elon magic" is vertical integration — device, surgical robots, factories — to make implantation as routine as LASIK, eventually for millions. ~11:13 Blindsight. An external camera writes into the visual cortex via electrical stimulation, creating "phosphenes" — more electrodes, more pixels; next-gen is in preclinical testing with a hoped-for human trial "by the end of this year."

~13:17 AI as exocortex. Near-term: thinking straight to Grok prompts. Long-term: "direct uncompressed high-fidelity multimodal transfer of concepts," the breakthrough being to "compute on the raw intent itself." Even at ~20 participants, Neuralink is building a "neural foundational model" by fine-tuning LLMs on neural data. ~16:18 Elon's "all green light schedule" lesson: 80–90% of perceived constraints aren't physical. ~20:19 Strategy is "beachhead then expand"; biggest bottleneck is biology, then regulatory/payment.

"Things just seem impossible without scale, but things just become inevitable with scale."

Tools: Neuralink Telepathy, Blindsight, Convoy, Utah array, Grok, neural foundational model

Hot Take

Nerd Snipe

"Google is Not a Serious Company"

A sustained takedown argues that despite checking every box on the Artificial Analysis charts, Google's bureaucracy ("15 layers") makes it incapable of shipping competitive AI. The case is illustrated with a "nuclear bomb" analogy, Gemini 3.5 Flash's self-berating reasoning loops, and a previewed long-running SWE benchmark where Gemini 3.1 Pro scores ~10% vs 60s–70s at the top. The episode expands into China closing its open-weight ecosystem and the host's new "slop cloud" project, Lakebed.^{[13]Nerd Snipe — Google is Not a Serious Company}

~02:01 The thesis. The hosts are tired of the "sleeping giant" narrative: Google looks great on charts where it "has all the boxes checked," but having the parts doesn't mean you can assemble the bomb. ~04:02 Gemini 3.5 Flash is called a "disaster" stuck in self-berating reasoning loops; the host says Google "never figured out reasoning."

~10:06 Cloud reliability. The marquee example is Railway's GCP account "accidentally" deleted by a new account-culling algorithm — with the running joke that it might be Gemini-powered. ~16:06 A previewed long-running SWE benchmark: GPT-5.5/5.4 lead, Opus 4.7 ties 5.4, cascading down to Gemini 3.1 Pro at ~10%.

~25:13 Geopolitics. The Manus/Meta breakup — Beijing retroactively undoing the acquisition — is framed as China closing the open-weight ecosystem, with the "shot heard round the world" being Cursor's Composer 2 built on Kimi K2.5. ~35:19 Bureaucracy as the real problem, contrasted with OpenAI's Codex desktop app and Vercel's "unblock-me" P0 channel. ~44:21 The episode closes pitching Lakebed, a deliberately "good enough" slop cloud built in ~4 days with GPT-5.5.

"It's a very pretty chart. That doesn't mean Google is competent."

Tools: Gemini 3.5 Flash / 3.1 Pro, GPT-5.5/5.4, Opus 4.7, Kimi K2.5/K2.6, Composer 2.5, Cursor, Railway, Vercel, Lakebed

Developer Tools

OpenAI

OpenAI Build Hour: the Agents SDK splits the harness from the compute

OpenAI engineers walk through major Agents SDK updates: a Codex-style harness, sandbox-using agents with pluggable compute backends (Modal, Cloudflare, E2B, Vercel, Daytona), a versioned Skills API, and lossless pause/resume via filesystem snapshots. The headline architectural move is splitting the harness from the compute — keeping secrets off ephemeral sandboxes and easing snapshot/rehydration.^{[14]OpenAI — Build Hour: Agents SDK}

~01:01 Why now. Codex has internally run for days; the pain is balancing in-distribution performance against cross-provider flexibility, plus dying sandboxes and secret management. ~06:03 Split harness from compute. The harness runs in your own infra while the sandbox is ephemeral, avoiding load-bearing sandboxes and reducing prompt-injection/exfiltration risk.

~11:07 New releases. A hosted shell tool in the Responses API, a containers endpoint (auto/manual modes), network access controls (domain allow-lists, lockdown), a Skills API (versioned SKILL.md bundles), and the SDK now in TypeScript (previously Python-only since April). ~14:08 Live demo: a SandboxAgent task tracker on a Docker sandbox, with skills loaded from a GitHub repo and a capability object bundling filesystem/shell/compaction.

~20:17 Pause/resume. On stop, the SDK snapshots the filesystem (local tarball by default) and rehydrates a fresh container on resume, a swap that is opaque to the model. The presenter then moves the agent to Modal with snapshots in Cloudflare R2. ~25:23 The manifest declares a desired file tree (from uploads, R2/S3/Azure, or GitHub); custom @function_tools, boolean/predicate tool-call approvals, and agent handoffs round it out.
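The snapshot-and-rehydrate pattern is simple to picture with the standard library. This sketch does not use the Agents SDK at all; `snapshot` and `rehydrate` are hypothetical helper names, and two temporary directories stand in for the old and fresh containers.

```python
import tarfile
import tempfile
from pathlib import Path

def snapshot(workdir: Path, archive: Path) -> None:
    """Pack the sandbox working directory into a gzip tarball."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(workdir, arcname=".")

def rehydrate(archive: Path, fresh_dir: Path) -> None:
    """Unpack a snapshot into a fresh, empty directory."""
    fresh_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(fresh_dir)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_box, new_box = root / "old_box", root / "new_box"
    old_box.mkdir()
    (old_box / "notes.txt").write_text("step 3 of 7 done")

    snapshot(old_box, root / "snap.tar.gz")    # on stop
    rehydrate(root / "snap.tar.gz", new_box)   # on resume, in a new "container"
    restored = (new_box / "notes.txt").read_text()

print(restored)
```

Because the agent only ever sees the file tree, it has no way to tell that the directory it resumed in is not the one it stopped in, which is what makes the sandbox safely disposable.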

"The thing I'm most excited about… we've split the harness from the compute. You can treat the sandbox as this totally ephemeral thing."

Tools: OpenAI Agents SDK (Python + TypeScript), Responses API, hosted shell tool, Skills API, Codex harness, Modal, Cloudflare R2, E2B, Vercel, Daytona, Docker

Industry

AI Engineer

Accenture at AI Engineer: why most enterprise agentic projects are doomed

Jess Grogan-Avignon and Jack Wang argue enterprise agentic projects fail not from data, APIs, or model limits but because the enterprise's human-paced operating system collides with machine-speed AI — only 12% of companies reach "AI achiever" status. They lay out five tensions (speed, value, delivery, trust, moat) with concrete prescriptions for each.^{[15]Accenture — Why Enterprise Agentic Projects Are Doomed}

~03:10 Tension 1 — Speed. A real engagement built an agentic app in ~2 weeks but took 12 more months to ship due to alignment across infra, security, and data governance teams. Fix: turn every human process into executable code, not sign-off chains.

~06:13 Tension 2 — Value. Business cases assume scope/value/cost are knowable up front; with AI you learn by doing. AI achievers see ~50% higher revenue growth — from new things, not cost-cutting. Fix: the CFO should think like a VC and back a portfolio of bets.

~10:15 Tension 3 — Delivery. Non-deterministic agents can't be milestoned like fixed programs; adopt hypothesis-driven delivery with build-evaluate-iterate loops. ~13:19 Tension 4 — Trust. Treat each delivery as a deposit in a "trust account"; use a progressive-autonomy "exposure ladder" (shadow → advisory → controlled → wider autonomy). ~16:24 Tension 5 — Moat. CRM/ERP/SOPs are "transactional memory," a floor everyone has; the real moat is "living memory" — feedback compounded at your scale. "Feedback is the only moat."

"Every single human process needs to become adaptable, executable code. Not another meeting, not a sign-off chain, code."

Tools: Cursor, Claude Code, GitHub, Jira

Developer Tools

AI Engineer

Phil Hetzel at AI Engineer: agent observability is a different problem

Braintrust's Phil Hetzel argues agent observability diverges from traditional o11y on three axes: non-deterministic vs deterministic systems, massive semi-structured traces (1GB+) vs constrained metrics, and a mixed technical/non-technical persona vs a pure engineering audience — which is why Braintrust built a custom trace database.^{[16]Braintrust — Agent observability vs traditional o11y}

~04:17 Problem 1: non-determinism. You care why a path was chosen, so agent o11y measures qualitative properties — grounding, expected tool use, brand alignment — on top of token/latency metrics. ~07:21 Problem 2: nasty traces. Highly semi-structured, voluminous (a single trace can exceed 1GB; a span can be 20MB), and needed in true real time.

~09:22 The custom DB. They moved off ClickHouse (couldn't do the text indexing) and built one with a write-ahead log for instant trace visibility, indexing for fast filtering, and a forked Tantivy full-text index. ~11:23 Problem 3: persona. The best teams mix engineers with clinicians, lawyers, and advisors who improve agents via natural-language prompts. ~13:24 What's next: a lightweight LLM over incoming traces doing embedding/clustering for topic, intent, and sentiment modeling. Observability and evals are treated as one system.

"Agent traces are really nasty… an agent trace could be over a gigabyte in size. An individual span can be 20 megabytes."

Tools: Braintrust, Datadog, Tantivy, ClickHouse, Rust, SQL

Developer Tools

AI Engineer

Neo4j at AI Engineer: context graphs for decision-aware agents

Neo4j's Zaid Zaim and Andreas Kollegger argue knowledge graphs should evolve into "context graphs" that capture not just what agents know but the rules, policies, and reasoning behind decisions — then present a transferable decision-making workflow that lets agents act explainably and defer to humans when authority or certainty is lacking.^{[17]Neo4j — Context Graphs for Decision-Aware Agents}

~01:07 Agent memory on a graph. Short-term (conversation/state), long-term (orgs/people/things), and reasoning memory (policies/rules). ~02:09 Context graphs add the "missing why." ~05:15 Architecture: query → knowledge source → graph DB fallback via text-to-Cypher → traversal.

~06:17 The decision workflow. The running example: an agent with a credit card to keep the fridge stocked might order Red Bull when rent is due. ~09:20 (1) frame with local context, (2) feed global context plus hard/soft business rules, ~11:21 (3) risk-value analysis with "reference class validation" (the drug right 99% of the time but fatal for the 1%), (4) propose alternatives and hand off to an agent that checks authority or escalates to a human, ~14:23 (5) record the full reasoning back into the graph as precedent.

"A lot of our practice as AI engineers is being explicit about the implicit knowledge that we carry with us."

Tools: Neo4j, text-to-Cypher, LangGraph, Google ADK, GraphAcademy

Developer Tools Hot Take

Nate B Jones

The 9-second database wipe and the case for agent analytics

A Cursor agent reportedly erased Pocket OS's production database and backups in 9 seconds via a single Railway API call — but Nate B Jones reframes it as a product-analytics failure, not an "AI went rogue" story. He argues the "agent run" is the new unit of product behavior, and the completion-vs-acceptance gap is the blind spot most dashboards miss.^{[18]Nate B Jones — Agent analytics}

~01:01 The incident. Standard dashboards would have shown an active user, a long session, an AI feature in use, and many messages — none of which reveal the instruction given, the credential found, or the permission boundary that failed. ~03:01 Chat logs and engineering traces are each insufficient: a trace tells you a run cost 30 cents, but not whether it was worth it.

~05:05 The agent run. The right unit of measurement, analogous to the click — surfacing intent, tools used, failed calls, approvals, and whether the user accepted or redid the output. He points to Salesforce's "Agent Work Units" (2.4B delivered, +57% QoQ) as directional. ~08:07 Completion vs acceptance. A 2×2: high completion / low acceptance means the agent finishes but isn't trusted; user corrections are "the new clicks of the agent era." ~10:11 His call to action: product teams shouldn't delegate agent observability entirely to engineering traces.
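The completion-vs-acceptance 2×2 can be computed from two booleans per run. A minimal sketch, assuming a 50% cut-off for "high" and a hypothetical `AgentRun` record; neither the threshold nor the field names come from the video.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    completed: bool  # the agent finished without erroring out
    accepted: bool   # the user kept the output rather than redoing it

def quadrant(runs: list[AgentRun], threshold: float = 0.5) -> str:
    """Place a batch of runs in the completion/acceptance 2x2."""
    completion = sum(r.completed for r in runs) / len(runs)
    acceptance = sum(r.accepted for r in runs) / len(runs)
    c = "high" if completion >= threshold else "low"
    a = "high" if acceptance >= threshold else "low"
    return f"{c} completion / {a} acceptance"

# Hypothetical batch: 9 of 10 runs finish, but users keep only 3 results
runs = ([AgentRun(True, True)] * 3
        + [AgentRun(True, False)] * 6
        + [AgentRun(False, False)])

print(quadrant(runs))  # high completion / low acceptance
```

A session-based dashboard would report this batch as ten healthy interactions; the run-level view shows an agent that finishes its work and is then overridden most of the time.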

"A session tells you that a user showed up. An agent run tells you what work was attempted."

Tools: Cursor, Railway, Salesforce Agentforce, Slack

AI Tools

AICodeKing

Codex 4.0 ships App Shots, Goal Mode, and remote computer use

A Codex update headlined by "App Shots" — sending the frontmost macOS window (screenshot plus text) to the agent by pressing both Command keys — also makes Goal Mode official across app/IDE/CLI, adds remote computer use that survives a locked Mac, and ships plugin sharing for teams.^{[19]AICodeKing — Codex 4.0 upgrades}

~01:02 App Shots. Press both Command keys to send the frontmost window's screenshot and accessible text into the agent — useful for debugging native apps and UI states, with the obvious privacy caveat. ~02:03 Goal Mode official. Sustained multi-hour objectives across app/IDE/CLI; CLI 0.133 enabled goals by default with dedicated storage and progress tracking, and 0.132 stops on usage limits rather than looping.

~04:05 Remote computer use. Codex continues desktop tasks after a Mac locks and can be driven from Codex mobile, with short-lived auth, covered displays, and auto-relock on local input. ~05:05 Plugin sharing. Teams distribute bundles of skills, app integrations, MCP servers, and hooks via marketplace sources. ~07:08 Browser annotations let users mark up rendered UIs for precise front-end fixes.

"Whatever is on your screen right now can become context for the agent."

Tools: Codex, Codex CLI, Codex mobile, Chrome extension, MCP

Developer Tools

Matt Pocock Sequoia Capital

Cursor: a "thermonuclear" review skill, and why it skipped pre-training

Matt Pocock tests Cursor's aggressive "thermonuclear code quality review" skill against his own commits. He rated ~5 of its 7 findings valid, including a blocker-class 1,000-line file, but criticizes its verbosity and total silence on testing.^{[20]Matt Pocock — Cursor's review skill} Separately, a Sequoia clip explains why Cursor deliberately skipped pre-training its own model, working top-down to ship value faster.^{[21]Sequoia — Why Cursor Skipped Pre-Training}

The thermonuclear review skill

~00:00 The single skill.md instructs the agent to audit the whole codebase from the current branch, not just the diff. Non-negotiables: don't push a file past 1,000 lines, treat nested ifs as design problems, question optionality and any/unknown types, and look for "code judo moves" that delete whole categories of complexity. ~08:08 Run against the last five PRs to his open-source Sandcastle, it flagged a 1,000-line init file, a generic registry helper, swallowed errors, and an incomplete decomposition — ~5 of 7 findings rated valid. His critiques: heavy repetition, vague "improve or worsen the local architecture" language, and zero mention of tests or seams.

"The behavior is correct in all three substantive PRs, but the code base is meaningfully messier than it was a week ago."

Why Cursor skipped pre-training

A Cursor team member explains the rationale: bottom-up (pre-train → scale → post-train → RL) takes a long time before any user value. By starting top-down — fine-tuning capable base models and applying RL directly — Cursor shipped a useful model far faster, optimizing for time-to-value over technical completeness.

"How do we get a model that's useful to users in the least time possible?"

Tools: Cursor, Claude Code, Sandcastle

AI Future

AI Search

AI co-scientists make real lab discoveries

Two Nature papers show multi-agent AI systems making lab-validated biomedical discoveries. Google's AI Co-Scientist autonomously generated treatments for AML leukemia, liver fibrosis, and antimicrobial resistance, with its ideas rated more novel and impactful than those of human PhD experts in blind tests. Robin closed the loop, writing and running its own data analysis to find macular-degeneration drug candidates; it synthesized 551 papers in ~30 minutes, with a total compute cost of $10.76.^{[22]AI Search — AI co-scientists make real discoveries}

Google AI Co-Scientist

~01:01 Not one chatbot but an ecosystem of agents: a supervisor, a generation agent, a brutal reflection/reviewer agent, a proximity agent that clusters ideas, an evolution agent, and a ranking agent running an Elo tournament of head-to-head debates. Given 15 unsolved biomedical goals, blind judges rated its ideas higher in novelty, plausibility, and impact than the best human experts. Results: for AML it found binimetinib (IC50 2nM) and KIRA6 (18× more effective at killing leukemia stem cells), plus a JQ1 + Olaparib + MSA2 combo; for liver fibrosis, Vorinostat; and for AMR, it explained cf-PICI phage-tail hijacking in 2 days, matching an unpublished lab team's months of work.

Robin: closing the loop

~21:17 Robin goes further by interpreting raw experimental data and iterating: Crow does literature review, Falcon writes deep drug reports, and Finch autonomously writes/debugs/executes analysis code, launching 8 parallel instances with a 50%-consensus mechanism to avoid hallucination. Applied to dry age-related macular degeneration, it confirmed compound Y27632, then via RNA-seq surfaced the ABCA1/APOE link, and proposed Ripasudil and the circadian modulator KL001. A human would need ~400 hours; Robin finished the full loop in under 2 hours for $10.76.
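The consensus step can be pictured as a vote across parallel analysis runs. A minimal sketch, assuming "50% consensus" means at least half of the runs must report the same finding; the video does not spell out the exact rule, and the finding labels below are invented placeholders.

```python
from collections import Counter
from typing import Optional, Sequence

def consensus(findings: Sequence[str], min_share: float = 0.5) -> Optional[str]:
    """Return the most common finding if enough runs agree, else None."""
    if not findings:
        return None
    answer, votes = Counter(findings).most_common(1)[0]
    return answer if votes / len(findings) >= min_share else None

# Eight hypothetical parallel runs over the same dataset
agreeing_runs = ["gene_A up"] * 5 + ["gene_B up"] * 2 + ["no change"]
split_runs = ["gene_A up"] * 3 + ["gene_B up"] * 3 + ["no change"] * 2

print(consensus(agreeing_runs))  # gene_A up
print(consensus(split_runs))     # None
```

Returning nothing on a split vote is the point: a conclusion that only one or two of eight independent runs reach is treated as a likely hallucination and not passed downstream.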

"It synthesized those 551 papers in just around 30 minutes… and the compute cost for everything is just $10.76."

Tools: Google AI Co-Scientist, Robin (multi-agent system)

Developer Tools

Github Awesome

GitHub Trending #34: agent tooling and 101ms VM forks

This week's 35 trending repos lean heavily into agent infrastructure: Workshop (local agent debugger), CodeGraph (a repo knowledge graph that cuts Claude Code's token waste), and Forkd, which forks warm Firecracker micro-VMs in ~101ms using userfaultfd and copy-on-write memory. ML highlights include HRM-Text, a 1B model trainable for ~$1,000.^{[23]Github Awesome — GitHub Trending #34}

~00:00 Agent dev tools. Workshop instruments agents in Claude Code/Codex/Cursor, streaming tokens/tool-calls/spans into a local UI for self-healing eval loops; CodeGraph pre-indexes a repo into a knowledge graph so agents make fewer wasted tool calls; Agent HTML replaces giant markdown blobs with stable, updatable semantic HTML artifacts.

~01:00 Infra & ML. Forkd forks warm Firecracker VMs in ~101ms. HRM-Text trains a 1B hierarchical-recurrent model on 16 H100s for ~46 hours (~$1,000); MIT CSAIL's ELF is a diffusion LM denoising in embedding space; FlashLib rebuilds ML GPU primitives for Hopper (K-means 26×, SVD 208× faster). Pya Engram offers a local-first, MCP-compatible shared memory store across AI tools; Mailflare is a self-hosted email platform on Cloudflare Workers + D1.
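
A quick back-of-the-envelope check of the HRM-Text figures, using only the numbers quoted above (both the hours and the cost are approximate), shows the hourly GPU rate they imply:

```python
gpus = 16
hours = 46
total_cost_usd = 1_000  # approximate figure quoted above

gpu_hours = gpus * hours
implied_rate = total_cost_usd / gpu_hours

print(f"{gpu_hours} GPU-hours, about ${implied_rate:.2f} per GPU-hour")
```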

~02:01 Plus fun/utility repos: ASCII Aquarium (ESP32), Nvidia's PiDi (latents → 2048px in <1s on a 5090), WorkOS's auth.md agent-registration protocol, ShadowChat (file transfer via light), and a Raycast-style tmux palette.

Tools: Workshop, CodeGraph, Agent HTML, Forkd, Firecracker, HRM-Text, ELF, FlashLib, Pya Engram, Mailflare, auth.md

Developer Tools

Better Stack Better Stack Better Stack Real Python marimo

Dev tooling: wterm, Deno 2.8, Devbox, MCP servers, marimo

A roundup of the day's developer tooling: Vercel's wterm renders a web terminal to the DOM (not canvas) for free browser accessibility; Deno 2.8 makes fresh npm installs 3.6× faster and adds deno audit fix and deno pack; Devbox puts your dev environment in Git; Real Python walks through building MCP servers in Python; and marimo demos a reactive 2D animation widget.^{[24]Better Stack — wterm}

wterm (Vercel)

~00:00 A Zig-based web terminal that compiles to a 12KB WASM binary and renders to the DOM, so text selection, find, and screen readers work for free, unlike canvas-based xterm.js. An optional libghostty renderer (400KB) improves legibility and color. wterm still needs a WebSocket back end that spawns a PTY.^{[24]Better Stack — wterm}

Deno 2.8

Fresh npm installs drop from ~3.3s to ~96ms (3.6× faster) via parallelism and off-critical-path decompression; deno audit fix auto-upgrades vulnerable deps to the nearest safe version; and deno pack publishes a Deno/JSR library to npm — transpiling, generating .d.ts, rewriting imports, and shimming Deno APIs — without a separate build pipeline.^{[25]Better Stack — Deno 2.8}

Devbox

A friendly Nix wrapper: devbox init/add/shell gives every developer the same project-scoped tool versions, committed as devbox.json + devbox.lock rather than a rotting README. Honest trade-offs: the first Nix download is slow, complex logic belongs in a .sh file, and it's not a cloud-IDE replacement.^{[26]Better Stack — Devbox}

Building MCP servers in Python

~03:00 Real Python's course frames MCP as a set of rules (QR-code analogy: server generates, client consumes), scaffolds a project with uv, and implements a get_sales tool — a plain Python function decorated with @mcp.tool(), where type hints and a docstring are the metadata the LLM uses to decide whether to call it.^{[27]Real Python — Building MCP servers}
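
To see why type hints and a docstring are enough to describe a tool, here is a self-contained sketch that does not use the MCP SDK. The `tool` decorator and `TOOLS` registry are hypothetical stand-ins for what a real server does on registration, and the `get_sales` parameters and data are invented for the example.

```python
import inspect
from typing import get_type_hints

TOOLS: dict[str, dict] = {}  # hypothetical registry, standing in for a real MCP server


def tool(func):
    """Record the function's name, docstring, and type hints as tool metadata."""
    hints = get_type_hints(func)
    params = {
        name: hints.get(name, object).__name__
        for name in inspect.signature(func).parameters
    }
    TOOLS[func.__name__] = {
        "description": inspect.getdoc(func) or "",
        "parameters": params,
        "returns": hints.get("return", object).__name__,
        "call": func,
    }
    return func


@tool
def get_sales(region: str, year: int) -> float:
    """Return total sales for a region in a given year."""
    fake_data = {("emea", 2024): 1250.0}  # made-up numbers for the example
    return fake_data.get((region, year), 0.0)


meta = TOOLS["get_sales"]
print(meta["description"])
print(meta["parameters"])
```

The function body stays plain Python. Everything a client needs in order to decide whether and how to call it comes from the signature and the docstring, which is why vague docstrings tend to produce poorly chosen tool calls.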

marimo animation widget

A 2D slider widget that animates a puck along a programmable path with easing, looping, and step functions — every downstream chart updates reactively, so any variable driven by its output can be animated.^{[28]marimo — 2D animation widget}
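
The idea of moving a puck along a path with easing can be sketched in plain Python. This is not the marimo widget's API: the function names, the straight-line path, and the choice of smoothstep easing are assumptions for illustration.

```python
def ease_in_out(t: float) -> float:
    """Smoothstep easing: slow start, slow finish, for t in [0, 1]."""
    t = max(0.0, min(1.0, t))
    return t * t * (3.0 - 2.0 * t)


def puck_position(
    t: float, start: tuple[float, float], end: tuple[float, float]
) -> tuple[float, float]:
    """Interpolate along a straight path from start to end with easing applied."""
    s = ease_in_out(t)
    return (start[0] + s * (end[0] - start[0]), start[1] + s * (end[1] - start[1]))


# Sample five animation frames along a hypothetical path.
frames = [puck_position(i / 4, (0.0, 0.0), (10.0, 5.0)) for i in range(5)]
for x, y in frames:
    print(f"x={x:.2f}, y={y:.2f}")
```

In a reactive notebook, any cell that reads the puck's coordinates would re-run on each frame, which is what lets downstream charts animate without extra wiring.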

Tools: wterm, xterm.js, Ghostty/libghostty, Zig, Deno, npm, JSR, Devbox, Nix, FastMCP, uv, marimo

AI Tools

Tech Brew

Apple's chatbot-style Siri redesign leaks ahead of WWDC

Bloomberg-leaked screenshots show Apple's long-promised Siri overhaul, expected at WWDC on June 8: a dark-mode standalone app with voice/image/file input, Dynamic Island integration — and, most notably, a dropdown to route queries to Claude, Gemini, or ChatGPT, positioning Siri as an orchestration layer rather than a closed assistant.^{[29]Tech Brew — Apple's Siri redesign}

The recreated screenshots show a chatbot-style interface similar to Claude and ChatGPT, defaulting to dark mode with an "Ask Siri" field anchored at the bottom, support for voice prompts, image attachments, and file uploads, plus a "Search or Ask" surface reachable by swiping down. Conversation history appears as lists or bubbles, and a new AI search layer unifies apps, web results, weather, and shortcuts. The standout addition is the dropdown to route queries to Google Gemini, Anthropic Claude, or OpenAI ChatGPT. The design adopts Apple's "Liquid Glass" aesthetic, though early reaction is mixed, and Apple is also reportedly working on an auto-lock theft-detection feature.^{[29]Tech Brew — Siri redesign}

Tools: Siri, Claude, ChatGPT, Google Gemini

Industry

Sherwood

Meta One subscriptions and the European EV race

Meta launched "Meta One," a set of paid tiers across Instagram, Facebook, and WhatsApp ($2.99–$3.99/mo) plus two AI plans ($7.99 and $19.99/mo). It is the company's first major consumer monetization beyond ads, and it sent the stock up 3.7%. Sherwood also flagged that BYD registered more than twice Tesla's EV volume in Europe in April, while Ferrari's electric debut, the Luce, sent the stock down as much as 8.4%.^{[30]Sherwood — Meta One & EV race}

Meta One. Instagram Plus and Facebook Plus at $3.99/mo, WhatsApp Plus at $2.99/mo, and two AI tiers at $7.99 and $19.99/mo (the free chatbot now rate-limited). The move is read as a way to justify Meta's huge AI capex to investors; Sherwood also noted Anthropic's annualized revenue reportedly approaching $45B.

EV market. BYD posted 115% YoY growth in European registrations vs Tesla's 47%, widening the gap, while Ferrari's Luce disappointed investors — down 5.3% in US trading and 8.4% in Milan — a sign that luxury positioning alone isn't a credible electrification roadmap.^{[30]Sherwood — Meta One & EV race}

Tools: Meta AI

AI Future AI Models

OpenAI OpenAI OpenAI

OpenAI's governance framework and GPT-5.5 in the wild

OpenAI published a Frontier Governance Framework mapping its safety practices to the EU AI Act and California's Transparency in Frontier AI Act, covering four risk areas reviewed by a cross-functional Safety Advisory Group.^{[31]OpenAI — Frontier Governance Framework} Two customer stories show GPT-5.5 at work: Abridge saw evaluation scores rise as it added tools for clinical decision support, and Chip Ganassi Racing used OpenAI tools to win at Long Beach.^{[32]OpenAI — Abridge clinical decision support}

Frontier Governance Framework

OpenAI's public document codifies how its safety work aligns with the EU AI Act Code of Practice and California's Transparency in Frontier AI Act, covering cyber offense, CBRN, manipulation, and loss of control. A cross-functional Safety Advisory Group reviews whether safeguards sufficiently minimize severe risk before routing recommendations to leadership — framed as going beyond minimum compliance, echoing Anthropic's recent safety updates.^{[31]OpenAI — Frontier Governance Framework}

GPT-5.5 at Abridge and Chip Ganassi

~00:00 Abridge's clinical decision support tool saw a counterintuitive upward trend on its eval set as it added more tools with GPT-5.5, with better reasoning and information density while keeping clinicians the final authority.^{[32]OpenAI — Abridge} ~00:10 Chip Ganassi Racing used OpenAI tools to synthesize years of race data, optimize pit strategy, and even generate strength-and-conditioning plans, winning at Long Beach.^{[33]OpenAI — Chip Ganassi Racing}

Tools: GPT-5.5, ChatGPT, Abridge

Productivity Hot Take

Arjay McCandless Real Python Acquired Lenny's Podcast Nate B Jones Nate B Jones Simon Willison Data Science Weekly

Quick hits: durable agents, crisis playbooks, index funds & more

Shorter items: designing agents that survive mid-run failures, an engineering crisis-management playbook, the 1974 paper that birthed index funds, a cautionary "automation is a lie" vibe-coding story, two Claude-vs-ChatGPT prompting notes, a small Simon Willison SVG tool, and this week's Data Science Weekly.

Designing agents for production failures. The hard part isn't building an agent, it's what happens when it breaks mid-run; AgentSpan (open-source SDK by Orcus) adds execution-history inspection, configurable retries, and resume-from-failure.^{[34]Arjay McCandless — Build Durable Agents}
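
The retry and resume-from-failure pattern can be sketched in a few lines. This is not AgentSpan's API: `run_durable`, the step functions, and the in-memory `history` dict (standing in for durable storage) are all hypothetical.

```python
from typing import Callable


def run_durable(
    steps: list[tuple[str, Callable[[], str]]],
    history: dict[str, str],
    max_retries: int = 2,
) -> dict[str, str]:
    """Run steps in order, skipping completed ones and retrying failures."""
    for name, step in steps:
        if name in history:  # finished in an earlier run: resume past it
            continue
        for attempt in range(max_retries + 1):
            try:
                history[name] = step()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries; history keeps earlier results
    return history


calls = {"fetch": 0, "summarize": 0}


def fetch() -> str:
    calls["fetch"] += 1
    return "raw data"


def summarize() -> str:
    calls["summarize"] += 1
    if calls["summarize"] < 2:  # fail once to simulate a transient error
        raise RuntimeError("transient failure")
    return "summary"


history: dict[str, str] = {}
run_durable([("fetch", fetch), ("summarize", summarize)], history)
print(history, calls)
```

Because completed steps are recorded in `history`, calling `run_durable` again with the same history skips them instead of repeating work, and the recorded results double as an execution log to inspect after a failure.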

Crisis management playbook. Clear your schedule in crisis mode, use the OODA loop to reorient often, delegate project management past ~10 people, and name a DRI; directionality matters more than magnitude.^{[35]Real Python — Crisis Management Playbook}

The birth of index funds. Paul Samuelson's 1974 paper found no evidence managers beat the market and called for a fund that "apes the whole market" — the intellectual blueprint for index funds, now ironically criticized for being too big.^{[36]Acquired — How index funds were born}

"Automation is a lie." A vibe-coded side app crashed every 10 minutes at launch, Codex created a whack-a-mole loop, and it took two senior engineers (and a case of bursitis) to fix — a reminder that AI-built systems still need human oversight.^{[37]Lenny's Podcast — AI still needs humans}

Claude prompting notes. Give Claude existing work to edit rather than a blank canvas — it's stronger as an editor (85% structural coherence vs ChatGPT's 78% in one test).^{[38]Nate B Jones — Give Claude work to edit} And Claude's principles-based training yields higher instruction compliance (94% vs ChatGPT's 87%), which matters most across vague, multi-turn tasks.^{[39]Nate B Jones — Claude instruction compliance}

markdown-svg-renderer. A small in-browser markdown viewer from Simon Willison that renders fenced SVG code blocks as live images (with a source toggle), taking input via paste, raw URL, or GitHub Gist — handy for sharing LLM output with embedded diagrams.^{[40]Simon Willison — markdown-svg-renderer}
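
The detection step, finding fenced SVG blocks in a markdown string, can be sketched as follows. The real tool runs in the browser, so this Python version only illustrates the idea. The regular expression and function name are assumptions, not the tool's code.

```python
import re

FENCE = "`" * 3  # build the fence marker so this example nests cleanly in markdown

# Match a fence tagged svg, capture its body lazily, stop at the closing fence.
SVG_BLOCK = re.compile(FENCE + r"svg[ \t]*\n(.*?)\n" + FENCE, re.DOTALL)


def extract_svg_blocks(markdown: str) -> list[str]:
    """Return the body of every fenced code block tagged svg."""
    return [body.strip() for body in SVG_BLOCK.findall(markdown)]


doc = (
    "Here is a diagram:\n\n"
    + FENCE + "svg\n"
    + '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10"></svg>\n'
    + FENCE + "\n\nAnd some text.\n"
)

blocks = extract_svg_blocks(doc)
print(len(blocks), blocks[0][:4])
```

A renderer would then swap each captured body for an inline image while keeping the original source available behind a toggle.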

Data Science Weekly #653. Highlights: 6 LLM prompting techniques for data scientists, a Reddit thread noting ML/DE skills are rebounding while pure "AI" postings decline, and Frank Harrell's argument that AI should be a "specification writer and comprehensive tester" in statistics.^{[41]Data Science Weekly — Issue 653}

Tools: AgentSpan, Kanban boards, Codex, Claude, ChatGPT, markdown-svg-renderer

Sources