May 28, 2026
Anthropic shipped Claude Opus 4.8, and Artificial Analysis immediately crowned it the new #1 model with a 61.4 Intelligence Index, 1.2 points clear of GPT-5.5.[3]Artificial Analysis — Claude Opus 4.8: the new #1 model Pricing stays flat at $5/$25 per million tokens, the model is 4× less likely than 4.7 to let code flaws slip by, and fast mode dropped to $10/$50.[1]Anthropic — Introducing Claude Opus 4.8 Simon Willison praised the candor of Anthropic's "modest but tangible improvement" framing, but flagged that the honesty gains come largely from abstaining on uncertain questions rather than knowing more.[2]Simon Willison — a modest but tangible improvement Power users at Every, who had drifted to Codex during 4.7's reign, say 4.8 pulled them back and earned a rare "paradigm shift" grade.[6]Every — Why Opus 4.8 Pulled Me Back to Claude
Anthropic's launch post frames four improvement areas over Opus 4.7: agentic judgment/reliability, coding quality, honesty, and multimodal consistency. Headline agentic numbers include 84% on Online-Mind2Web and the first model ever to clear the 10% all-pass threshold on the Legal Agent Benchmark, plus a research-preview "Dynamic Workflows" feature that runs hundreds of parallel subagents across codebase-scale migrations.[1]Anthropic — Introducing Claude Opus 4.8 Standard pricing is unchanged at $5 input / $25 output per million tokens with a 1M-token context window; fast mode fell from $30/$150 to $10/$50.
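To make the per-token pricing concrete, here is a minimal sketch of a cost estimator built from the rates quoted above ($5/$25 standard and $10/$50 fast mode, per million tokens). The function name and the flat-rate assumption (no cache discounts or other adjustments) are illustrative and are not Anthropic's billing logic.

```python
from decimal import Decimal

# USD per million tokens (input, output), as quoted in the launch coverage above.
RATES = {
    "standard": (Decimal("5"), Decimal("25")),
    "fast": (Decimal("10"), Decimal("50")),
}

def estimate_cost(input_tokens: int, output_tokens: int, mode: str = "standard") -> Decimal:
    """Flat-rate estimate in USD; ignores caching discounts and other adjustments."""
    in_rate, out_rate = RATES[mode]
    return (in_rate * input_tokens + out_rate * output_tokens) / Decimal(1_000_000)

print(estimate_cost(200_000, 20_000))          # 1.5
print(estimate_cost(200_000, 20_000, "fast"))  # 3
```

Under these assumptions, the same 200K-in / 20K-out request costs twice as much in fast mode as in standard mode.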
Artificial Analysis pegs Opus 4.8 at 61.4 on its Intelligence Index — a +4.1 jump over 4.7 and 1.2 points ahead of GPT-5.5. On the agentic GDPval-AA benchmark it scores 1,890 Elo (an implied ~67% win rate vs GPT-5.5) while using 15% fewer turns and 35% fewer output tokens than Opus 4.7 — though still ~30% more turns than GPT-5.5. It edges Humanity's Last Exam by a point but trails GPT-5.4/5.5 on the CritPt physics benchmark, and ranks #2 on AA-Omniscience hallucination accuracy behind Gemini 3.1 Pro.[3]Artificial Analysis — Claude Opus 4.8 analysis and benchmarks
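The "implied ~67% win rate" can be related to a rating gap through the standard Elo expected-score formula. The derivation below assumes GDPval-AA uses the conventional 400-point logistic scale, which the source does not state.

```latex
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}
\quad\Longrightarrow\quad
R_A - R_B = 400 \log_{10}\frac{E_A}{1 - E_A}
          = 400 \log_{10}\frac{0.67}{0.33} \approx 123
```

Under that assumption, a 67% win rate corresponds to a gap of roughly 123 Elo points, which would place GPT-5.5 near 1,767. That figure is derived here and is not one reported by Artificial Analysis.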
Willison zeroes in on a line from Anthropic's own system card: Opus 4.8 had the "lowest incorrect-rate of the six models on every benchmark" — but achieved it "by abstaining on questions about which it was uncertain rather than by answering more questions correctly." In other words, the improvement is risk-aversion, not raw knowledge. He highlights two practically useful API changes: mid-conversation system messages (update instructions mid-session without resetting the prompt cache) and a lower prompt-cache minimum of 1,024 tokens, down from 4,096. His max-effort pelican-SVG test cost 43 cents for a single generation — a reminder that frontier quality stays expensive.[2]Simon Willison — Claude Opus 4.8
Willison's llm-anthropic 0.25.1 shipped same-day with claude-opus-4-8 support, a -o fast 1 flag, and a fix for the long-standing 8,192-token output cap (each model now defaults to its own maximum).[4]Simon Willison — llm-anthropic 0.25.1
Nate Herk notes the model is built on 4.7 with sharper judgment and longer autonomy, and that effort levels now span low / medium / high / X-high / max / "Ultra Code" (X-high + workflows), accessible via /workflows in Claude Code.[5]Nate Herk — Opus 4.8
~03:02 Anthropic devoted a blog section to honesty, citing cases where 4.7 overclaimed (saying a task took 4 hours when it took 20 minutes, or reporting 50 items pushed when only 15 were); 4.8's score on misaligned-behavior evals is roughly half that of 4.7 and Sonnet 4.6. Herk also notes Anthropic teased "Mythos," a model class above Opus currently limited to cybersecurity research.
Every's team — self-described Claude die-hards who had defected to Codex and GPT-5.5 — rated 4.8 a "gold/paradigm shift," a grade they call very rare ~05:02. On their senior-engineer benchmark (humans score 80–90), 4.8 hit 63 vs GPT-5.5's 62 and far above 4.7; on writing it scored 79.6 vs 73, their best writing model tested. But they argue "the harness matters as much as the model" — Codex's clean single-pane app keeps it the daily driver despite the Claude app's fragmented chat/code/co-work tabs.[6]Every — Why Opus 4.8 Pulled Me Back to Claude
"Anthropic, I know you're trying to underpromise, but you are overdelivering." — Every
Anthropic closed a $65B Series H at a $965B post-money valuation — one of the largest private rounds ever — with run-rate revenue crossing $47B by May.[7]Anthropic — $65B Series H at $965B On the same day, model-routing layer OpenRouter raised a $113M Series B led by Google's CapitalG at a ~$1.3B valuation, as weekly token volume exploded 5× to 25 trillion tokens in six months.[8]OpenRouter — $113M Series B
The $65B round was led by Altimeter, Dragoneer, Greenoaks, and Sequoia, with co-leads including Capital Group, Coatue, D1, GIC, ICONIQ, and XN, and a long tail of institutions (Blackstone, Fidelity, General Catalyst, Lightspeed, T. Rowe Price, Temasek). Separately, hyperscalers committed $15B — including $5B from Amazon — and memory partners Micron, Samsung, and SK hynix joined. Anthropic says the capital funds safety/interpretability research, compute, and product scaling, and that annualized run-rate revenue crossed $47B by May 2026.[7]Anthropic — Series H
"Startups and Global 5000 companies alike are deploying Claude to handle complex workflows." — Anthropic
OpenRouter's $113M Series B was led by CapitalG (Alphabet's growth fund) at roughly $1.3B, with NVIDIA's NVentures, ServiceNow, MongoDB, Snowflake, and Databricks ventures co-investing alongside a16z and Menlo. The platform offers a single API across 400+ models for 8M+ developers; weekly token volume jumped from 5T to 25T over six months and is on pace to exceed one quadrillion tokens in 2026. Funds go toward infrastructure, enterprise controls (workspaces, spend management, zero-data-retention), and intelligent routing, plus new multimodal inference for image, audio, speech, and video.[8]OpenRouter — Series B
Reacting to news that Anthropic told investors Q2 revenue will more than double to ~$10.9B with its first-ever operating profit, Theo argues the profitability is real but partly accidental — driven by zero-compute revenue-share deals across AWS/GCP/Azure, stealth price and tokenizer hikes, Claude Code inference demand, and conservative compute bets rather than purely organic growth.[9]Theo — Anthropic is profitable now
~00:00 Theo opens skeptical of the AI bubble, then reacts to the headline: Anthropic will more than double revenue to ~$10.9B in Q2 and post its first operating profit. ~02:01 The raise history is dizzying — a $61B valuation 14 months ago to ~$900B now (15×), with the new round diluting under 10%. Revenue went $87M (2024) → ~$9B (end 2025) → ~$30B run-rate now (~$11B/quarter).
~04:03 The distribution argument: Anthropic models are the only strong frontier option on AWS (which powers most of the Fortune 500), and they're on all three clouds. ~07:06 Crucially, Anthropic strikes deals where it puts up zero compute — just ships weights to AWS/Google and takes ~50% of every token's revenue, a high-margin profit farm that frees its own GPUs for research.
~10:08 The pricing thesis: the shift from 3 tiers to 4 (adding Mythos) was a stealth price hike, with Opus 4.5 effectively taking Sonnet's role ($15→$25/M out); the 4.7 tokenizer generates 30–50% more tokens; and wrong answers loop into million-token bills. ~14:11 On Deep-SWE benchmarks GPT-5.4 used 67K tokens to Opus 4.7's 100K; per-run costs were ~$16 (4.7) vs $3.30 (5.4) vs $5.80 (5.5). ~18:14 Costs have a ceiling because Nvidia is booked to ~2028; conservative compute bets slowed Anthropic's cost ramp but caused shortages, pushing it to pay xAI ~$1.25B/month. ~25:20 He debunks the Microsoft "cancellation" (just rerouted via Copilot CLI) and Uber overrun (failed to re-budget for Opus 4.5), concluding Opus 4.5 was the single biggest catalyst.
"They put zero dollars on the line. They give them zero compute… but they take a 50% fee or so of every single token sent."
"Wrong answers cost more than right ones."
Cognition co-founder Walden Yan and Open Inspect creator Cole Murray date a December 2025 inflection (Opus 4.5, GPT-5.2) as the "end of hand-held coding," and share that Devin's merged PRs grew 7× in 2–3 months while headcount rose ~10%, with Devin's commit share across all Devin repos jumping from 16% in January to 80% in March.[10]Latent Space — Devin's 80% Moment The bulk of the conversation is architectural: harness-in-vs-out-of-the-box, full VMs over Docker, why testing is harder than computer use, and why memory and multi-agent remain unsolved.
~02:02 The December 2025 inflection. Models reached the point where teams stopped hand-holding in the IDE and could drive spec-to-PR autonomously, making background/cloud agents practical. ~04:04 The numbers Cognition was "afraid to release": 7× merged-PR growth, 16%→80% commit share.
~05:04 Open Inspect. Cole built an open-source cloud background-agent system and deliberately won't monetize it — the middle layer is hard to defend; money is made at the sandbox (Daytona, E2B) and model layers. ~11:08 The core "harness in the box vs out of the box" decision: in-box forces secrets into the sandbox (exfiltration risk); Cognition keeps the "brain" in a control plane and treats the sandbox as the "hands."
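The "brain in a control plane, hands in the sandbox" split can be sketched in a few lines. This is a toy illustration, not Cognition's or Open Inspect's code: the class names, the `plan_next_command` placeholder, and the `API_TOKEN` variable are all hypothetical. The point it shows is that the credential stays with the control plane while the sandbox runs commands in a scrubbed environment.

```python
import os
import subprocess
import sys

class Sandbox:
    """The 'hands': runs commands with secret variables stripped from the environment."""
    def __init__(self, secret_names: set[str]):
        self.secret_names = secret_names

    def run(self, argv: list[str]) -> str:
        clean_env = {k: v for k, v in os.environ.items() if k not in self.secret_names}
        result = subprocess.run(argv, env=clean_env, capture_output=True, text=True, check=True)
        return result.stdout.strip()

class ControlPlane:
    """The 'brain': holds credentials and decides what the sandbox should execute."""
    def __init__(self, api_token: str, sandbox: Sandbox):
        self._api_token = api_token  # used only here, never forwarded to the sandbox
        self.sandbox = sandbox

    def plan_next_command(self, prompt: str) -> list[str]:
        # Placeholder for an authenticated model call made from the control plane.
        # It returns a fixed command that tries to read the secret from inside the sandbox.
        probe = "import os; print(os.environ.get('API_TOKEN', 'no-secret-here'))"
        return [sys.executable, "-c", probe]

    def step(self, prompt: str) -> str:
        return self.sandbox.run(self.plan_next_command(prompt))
```

Even if a prompt-injected command tries to read `API_TOKEN` inside the sandbox, there is nothing there to exfiltrate, which is the trade-off the harness-out-of-the-box design buys.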
~15:09 Infra. Repo setup is a "perennial problem"; Docker-in-Docker is a poor boundary vs full Firecracker VMs; a custom "block diff file storage format" enables VM snapshot/restore at a cost proportional to the filesystem diff; and slow grep was traced to S3-backed network filesystems, not a missing index.
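The idea of snapshots that scale with the diff rather than the full disk can be illustrated with a generic fixed-size block diff. Cognition's actual storage format is not described in the source, so the block size, function names, and in-memory representation below are assumptions made for the sketch.

```python
BLOCK_SIZE = 4096

def block_diff(base: bytes, current: bytes, block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Return only the blocks of `current` that differ from `base`, keyed by block index."""
    changed = {}
    n_blocks = (len(current) + block_size - 1) // block_size
    for i in range(n_blocks):
        start = i * block_size
        block = current[start:start + block_size]
        if base[start:start + block_size] != block:
            changed[i] = block
    return changed

def restore(base: bytes, diff: dict[int, bytes], length: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the snapshot image from the base image plus the changed blocks."""
    image = bytearray(base[:length].ljust(length, b"\0"))
    for i, block in diff.items():
        start = i * block_size
        image[start:start + len(block)] = block
    return bytes(image)
```

If one byte changes in a 16 KB image, the diff holds a single 4 KB block, so the stored snapshot grows with what changed rather than with the size of the disk.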
~20:13 Testing > computer use. People overindex on click coordinates; the hard part is orchestrating front/back-end services at the right version and triggering feature-flagged behavior — sometimes orchestrating multiple frontier models. ~30:24 Memory is largely unsolved (~95% of Devin's memories are auto-generated). ~38:27 Yan stays skeptical of multi-agent; the practical regime is manager-subagent with isolated boxes.
~46:35 AI slop guardrails. "Your codebase regresses to your worst engineer" — enforce module boundaries and lint rules against AI code smells (getattr/hasattr reward-hacking, untyped tuples, GPT-style backwards-compat shims). ~59:43 Use cases: SRE auto-triage on Slack/Datadog/Sentry alerts, PMs filing fixes, continual security scanning; reasonable spend cited at $1,000–$5,000 per engineer.
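A lint rule against the getattr/hasattr smell can be prototyped with Python's standard `ast` module. This is a minimal sketch of one such check, not the rule set discussed in the episode; the function name and sample snippet are illustrative.

```python
import ast

FLAGGED_CALLS = {"getattr", "hasattr"}

def find_dynamic_attr_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, function name) for each direct getattr/hasattr call."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in FLAGGED_CALLS):
            findings.append((node.lineno, node.func.id))
    return sorted(findings)

sample = '''
def price(item):
    if hasattr(item, "price"):
        return getattr(item, "price", 0)
    return item.cost
'''
print(find_dynamic_attr_calls(sample))  # [(3, 'hasattr'), (4, 'getattr')]
```

A check like this could run in CI and fail the build on new findings, which turns a review preference into an enforced boundary.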
"Devin commit percentages on all Devin repos was 16% in January and now 80% in March. We were afraid to release this."
The inaugural YC Paper Club at the Pioneer building features five paper presentations, threaded by the idea that inference and data efficiency are capability levers, not just cost concerns. Highlights: Speculative Speculative Decoding hitting ~300 tok/s on Llama-3 70B, a LeJEPA world model 50× faster than rivals, and a data-constrained pre-training recipe yielding 5–17× data-efficiency wins.[11]Y Combinator — YC Paper Club
~00:07 Intro. The first-ever YC Paper Club, ~100 researchers/founders, riffing on the W16 OpenAI origin story in the same building.
~03:13 Paper 1: Speculative Speculative Decoding (SSD). Tanishq (Stanford, with Tri Dao) reframes inference as a capability lever — tokens/sec equals peak deliverable intelligence. SSD runs draft and verify on separate hardware in parallel, predicting verification outcomes ~80–90% of the time, beating SGLang on both latency and throughput at ~300 tok/s for Llama-3 70B on 4× H100s.
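For readers new to the draft-and-verify idea that SSD builds on, here is a toy sketch of ordinary greedy speculative decoding. It is not the SSD algorithm from the talk: it runs sequentially on one machine, and the two "models" are hypothetical stand-in functions. What it shows is that accepted tokens always match what the target model alone would have produced.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token (greedy)

def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target model checks each guess (a real system batches these checks).
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if guess != expected:
                tokens.extend(proposal[:i])
                tokens.append(expected)  # the target's token replaces the bad guess
                break
        else:
            tokens.extend(proposal)      # every drafted token was accepted
    return tokens

# Toy stand-ins: the draft agrees with the target except at some prefix lengths.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 7

def toy_draft(prefix: List[int]) -> int:
    return toy_target(prefix) if len(prefix) % 3 else 0
```

The speedup in practice comes from how often the draft is right; the output itself is unchanged, which is why a higher acceptance rate translates directly into more tokens per second.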
~17:19 Paper 2: Diffusion Model Predictive Control (DMPC). Stannis (Google DeepMind) uses diffusion for both action proposals and dynamics, with runtime adaptation to novel rewards (train locomotion → induce jumping) and novel dynamics (a "broken ankle" walker recovers).
~29:28 Paper 3: LeJEPA world model. Isaac Ward (LeCun group, tied to LeCun's reported $1.03B raise) replaces anti-collapse tricks with a single SIGReg loss term — ~50× faster, 15M params on a single <24GB card, with native model-error/uncertainty quantification.
~43:35 Paper 4: "Deep Learning is Not So Mysterious." PAC-Bayes plus compressibility and flat minima explain overparameterization and benign overfitting, with non-vacuous bounds at billion-parameter scale.
~50:38 Paper 5: Pre-training under data constraints. Konwoo (Chris Ré's lab, with Percy Liang and Tatsu): aggressive regularization (~30× weight decay) + ensembling + distillation yields ~5× data-efficiency, and continued pre-training shows 4B math tokens matching a full 73B-token corpus — a ~17× win.
"Inference today is seen as a cost or convenience lever. But in one, two, or three years inference is going to be seen as a capability… tokens per second is exactly the peak intelligence that you can deliver."
Neuralink president and co-founder DJ Seo walks through the company's brain-computer interfaces — Telepathy (cursor control by thought), Convoy (robotic limbs), and Blindsight (writing vision into the visual cortex) — across 20+ human patients, and frames AI as a future "exocortex" connected via a high-bandwidth neural interface that solves the I/O bottleneck between human output and AI.[12]Sequoia — Neuralink's DJ Seo
~00:01 Patient demo. Telepathy lets locked-in/quadriplegic ALS patients control a cursor by thought; previews of brain-controlled robotic limbs and Blindsight for vision restoration. ~03:09 Origin. Co-founded with Musk in 2016 around the human-output/AI-capability I/O bottleneck, which felt "insane" then and "more real" every week.
~06:10 Built for scale. The underappreciated "Elon magic" is vertical integration — device, surgical robots, factories — to make implantation as routine as LASIK, eventually for millions. ~11:13 Blindsight. An external camera writes into the visual cortex via electrical stimulation, creating "phosphenes" — more electrodes, more pixels; next-gen is in preclinical testing with a hoped-for human trial "by the end of this year."
~13:17 AI as exocortex. Near-term: thinking straight to Grok prompts. Long-term: "direct uncompressed high-fidelity multimodal transfer of concepts," the breakthrough being to "compute on the raw intent itself." Even at ~20 participants, Neuralink is building a "neural foundational model" by fine-tuning LLMs on neural data. ~16:18 Elon's "all green light schedule" lesson: 80–90% of perceived constraints aren't physical. ~20:19 Strategy is "beachhead then expand"; biggest bottleneck is biology, then regulatory/payment.
"Things just seem impossible without scale, but things just become inevitable with scale."
A sustained takedown argues that despite checking every box on the Artificial Analysis charts, Google's bureaucracy ("15 layers") makes it incapable of shipping competitive AI. The case is illustrated with a "nuclear bomb" analogy, Gemini 3.5 Flash's self-berating reasoning loops, and a previewed long-running SWE benchmark where Gemini 3.1 Pro scores ~10% vs 60s–70s at the top. The episode expands into China closing its open-weight ecosystem and the host's new "slop cloud" project, Lakebed.[13]Nerd Snipe — Google is Not a Serious Company
~02:01 The thesis. The hosts are tired of the "sleeping giant" narrative: Google looks great on charts where it "has all the boxes checked," but having the parts doesn't mean you can assemble the bomb. ~04:02 Gemini 3.5 Flash is called a "disaster" stuck in self-berating reasoning loops; the host says Google "never figured out reasoning."
~10:06 Cloud reliability. The marquee example is Railway's GCP account "accidentally" deleted by a new account-culling algorithm — with the running joke that it might be Gemini-powered. ~16:06 A previewed long-running SWE benchmark: GPT-5.5/5.4 lead, Opus 4.7 ties 5.4, cascading down to Gemini 3.1 Pro at ~10%.
~25:13 Geopolitics. The Manus/Meta breakup — Beijing retroactively undoing the acquisition — is framed as China closing the open-weight ecosystem, with the "shot heard round the world" being Cursor's Composer 2 built on Kimi K2.5. ~35:19 Bureaucracy as the real problem, contrasted with OpenAI's Codex desktop app and Vercel's "unblock-me" P0 channel. ~44:21 The episode closes pitching Lakebed, a deliberately "good enough" slop cloud built in ~4 days with GPT-5.5.
"It's a very pretty chart. That doesn't mean Google is competent."
OpenAI engineers walk through major Agents SDK updates: a Codex-style harness, sandbox-using agents with pluggable compute backends (Modal, Cloudflare, E2B, Vercel, Daytona), a versioned Skills API, and lossless pause/resume via filesystem snapshots. The headline architectural move is splitting the harness from the compute — keeping secrets off ephemeral sandboxes and easing snapshot/rehydration.[14]OpenAI — Build Hour: Agents SDK
~01:01 Why now. Codex has internally run for days; the pain is balancing in-distribution performance against cross-provider flexibility, plus dying sandboxes and secret management. ~06:03 Split harness from compute. The harness runs in your own infra while the sandbox is ephemeral, avoiding load-bearing sandboxes and reducing prompt-injection/exfiltration risk.
~11:07 New releases. A hosted shell tool in the Responses API, a containers endpoint (auto/manual modes), network access controls (domain allow-lists, lockdown), a Skills API (versioned SKILL.md bundles), and the SDK now in TypeScript (previously Python-only since April). ~14:08 Live demo: a SandboxAgent task tracker on a Docker sandbox, with skills loaded from a GitHub repo and a capability object bundling filesystem/shell/compaction.
~20:17 Pause/resume. On stop, the SDK snapshots the filesystem (local tarball by default) and rehydrates a fresh container on resume, all of it opaque to the model. The presenter then moves the agent to Modal with snapshots in Cloudflare R2. ~25:23 The manifest declares a desired file tree (from uploads, R2/S3/Azure, or GitHub); custom @function_tools, boolean/predicate tool-call approvals, and agent handoffs round it out.
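The snapshot-and-rehydrate pattern can be sketched with the standard library alone. This is not the Agents SDK's implementation or API; the function names are hypothetical, and plain directories stand in for containers. It only illustrates the mechanism: pack the working directory into a tarball on stop, then unpack it into a fresh location on resume.

```python
import tarfile
import tempfile
from pathlib import Path

def snapshot(workdir: Path, archive: Path) -> Path:
    """Pack the sandbox's working directory into a gzip tarball."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(workdir, arcname=".")
    return archive

def rehydrate(archive: Path, fresh_dir: Path) -> Path:
    """Unpack a snapshot into a new directory standing in for a fresh container."""
    fresh_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(fresh_dir)
    return fresh_dir

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    work = root / "work"
    work.mkdir()
    (work / "notes.txt").write_text("step 3 done")
    restored = rehydrate(snapshot(work, root / "snap.tar.gz"), root / "fresh")
    print((restored / "notes.txt").read_text())  # step 3 done
```

Because the archive is just a file, swapping local storage for an object store changes where the tarball lives, not how the agent resumes.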
"The thing I'm most excited about… we've split the harness from the compute. You can treat the sandbox as this totally ephemeral thing."
Jess Grogan-Avignon and Jack Wang argue enterprise agentic projects fail not from data, APIs, or model limits but because the enterprise's human-paced operating system collides with machine-speed AI — only 12% of companies reach "AI achiever" status. They lay out five tensions (speed, value, delivery, trust, moat) with concrete prescriptions for each.[15]Accenture — Why Enterprise Agentic Projects Are Doomed
~03:10 Tension 1 — Speed. A real engagement built an agentic app in ~2 weeks but took 12 more months to ship due to alignment across infra, security, and data governance teams. Fix: turn every human process into executable code, not sign-off chains.
~06:13 Tension 2 — Value. Business cases assume scope/value/cost are knowable up front; with AI you learn by doing. AI achievers see ~50% higher revenue growth — from new things, not cost-cutting. Fix: the CFO should think like a VC and back a portfolio of bets.
~10:15 Tension 3 — Delivery. Non-deterministic agents can't be milestoned like fixed programs; adopt hypothesis-driven delivery with build-evaluate-iterate loops. ~13:19 Tension 4 — Trust. Treat each delivery as a deposit in a "trust account"; use a progressive-autonomy "exposure ladder" (shadow → advisory → controlled → wider autonomy). ~16:24 Tension 5 — Moat. CRM/ERP/SOPs are "transactional memory," a floor everyone has; the real moat is "living memory" — feedback compounded at your scale. "Feedback is the only moat."
"Every single human process needs to become adaptable, executable code. Not another meeting, not a sign-off chain, code."
Braintrust's Phil Hetzel argues agent observability diverges from traditional o11y on three axes: non-deterministic vs deterministic systems, massive semi-structured traces (1GB+) vs constrained metrics, and a mixed technical/non-technical persona vs a pure engineering audience — which is why Braintrust built a custom trace database.[16]Braintrust — Agent observability vs traditional o11y
~04:17 Problem 1: non-determinism. You care why a path was chosen, so agent o11y measures qualitative properties — grounding, expected tool use, brand alignment — on top of token/latency metrics. ~07:21 Problem 2: nasty traces. Highly semi-structured, voluminous (a single trace can exceed 1GB; a span can be 20MB), and needed in true real time.
~09:22 The custom DB. They moved off ClickHouse (couldn't do the text indexing) and built one with a write-ahead log for instant trace visibility, indexing for fast filtering, and a forked Tantivy full-text index. ~11:23 Problem 3: persona. The best teams mix engineers with clinicians, lawyers, and advisors who improve agents via natural-language prompts. ~13:24 What's next: a lightweight LLM over incoming traces doing embedding/clustering for topic, intent, and sentiment modeling. Observability and evals are treated as one system.
"Agent traces are really nasty… an agent trace could be over a gigabyte in size. An individual span can be 20 megabytes."
Neo4j's Zaid Zaim and Andreas Kollegger argue knowledge graphs should evolve into "context graphs" that capture not just what agents know but the rules, policies, and reasoning behind decisions — then present a transferable decision-making workflow that lets agents act explainably and defer to humans when authority or certainty is lacking.[17]Neo4j — Context Graphs for Decision-Aware Agents
~01:07 Agent memory on a graph. Short-term (conversation/state), long-term (orgs/people/things), and reasoning memory (policies/rules). ~02:09 Context graphs add the "missing why." ~05:15 Architecture: query → knowledge source → graph DB fallback via text-to-Cypher → traversal.
~06:17 The decision workflow. The running example: an agent with a credit card to keep the fridge stocked might order Red Bull when rent is due. ~09:20 (1) frame with local context, (2) feed global context plus hard/soft business rules, ~11:21 (3) risk-value analysis with "reference class validation" (the drug right 99% of the time but fatal for the 1%), (4) propose alternatives and hand off to an agent that checks authority or escalates to a human, ~14:23 (5) record the full reasoning back into the graph as precedent.
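The workflow above can be compressed into a small sketch using the fridge-stocking example. Everything here is a simplification under stated assumptions: a dataclass stands in for the graph, there is one hard rule (do not leave bills unpaid), the authority check is a single spend limit, and the risk-value analysis of step 3 is omitted. None of these names come from Neo4j's talk.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    action: str
    cost: float

@dataclass
class ContextGraph:
    """Stand-in for the graph: global facts, a rule input, and recorded precedents."""
    balance: float
    upcoming_bills: float
    spend_limit: float  # authority granted to the agent
    precedents: list = field(default_factory=list)

def decide(graph: ContextGraph, proposal: Proposal) -> str:
    reasons = []
    # Steps 1-2: frame the request, then apply global context and a hard rule.
    remaining = graph.balance - graph.upcoming_bills - proposal.cost
    if remaining < 0:
        outcome = "rejected"
        reasons.append("hard rule: purchase would leave upcoming bills unpaid")
    # Step 4 (simplified): check authority and escalate to a human when it is lacking.
    elif proposal.cost > graph.spend_limit:
        outcome = "escalated_to_human"
        reasons.append("cost exceeds the agent's spend authority")
    else:
        outcome = "approved"
        reasons.append("within budget and authority")
    # Step 5: record the reasoning back into the graph as precedent.
    graph.precedents.append({"action": proposal.action, "outcome": outcome, "reasons": reasons})
    return outcome
```

With a balance of 1,200 and rent of 1,150 due, a 60-dollar Red Bull order is rejected while a 5-dollar milk order is approved, and both decisions are stored with their reasons for later lookup.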
"A lot of our practice as AI engineers is being explicit about the implicit knowledge that we carry with us."
A Cursor agent reportedly erased Pocket OS's production database and backups in 9 seconds via a single Railway API call — but Nate B Jones reframes it as a product-analytics failure, not an "AI went rogue" story. He argues the "agent run" is the new unit of product behavior, and the completion-vs-acceptance gap is the blind spot most dashboards miss.[18]Nate B Jones — Agent analytics
~01:01 The incident. Standard dashboards would have shown an active user, a long session, an AI feature in use, and many messages — none of which reveal the instruction given, the credential found, or the permission boundary that failed. ~03:01 Chat logs and engineering traces are each insufficient: a trace tells you a run cost 30 cents, but not whether it was worth it.
~05:05 The agent run. The right unit of measurement, analogous to the click — surfacing intent, tools used, failed calls, approvals, and whether the user accepted or redid the output. He points to Salesforce's "Agent Work Units" (2.4B delivered, +57% QoQ) as directional. ~08:07 Completion vs acceptance. A 2×2: high completion / low acceptance means the agent finishes but isn't trusted; user corrections are "the new clicks of the agent era." ~10:11 His call to action: product teams shouldn't delegate agent observability entirely to engineering traces.
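The completion-vs-acceptance 2×2 is easy to compute once each agent run is logged with two booleans. The sketch below assumes a hypothetical `AgentRun` record and a 0.5 cutoff between "low" and "high"; neither comes from the video.

```python
from dataclasses import dataclass

@dataclass
class AgentRun:
    completed: bool  # the agent finished the task
    accepted: bool   # the user kept the output rather than redoing it

def quadrant(runs: list[AgentRun], threshold: float = 0.5) -> tuple[float, float, str]:
    """Return (completion rate, acceptance rate among completed runs, quadrant label)."""
    completion = sum(r.completed for r in runs) / len(runs)
    finished = [r for r in runs if r.completed]
    acceptance = sum(r.accepted for r in finished) / len(finished) if finished else 0.0
    completion_level = "high" if completion >= threshold else "low"
    acceptance_level = "high" if acceptance >= threshold else "low"
    return completion, acceptance, f"{completion_level} completion / {acceptance_level} acceptance"

runs = [AgentRun(True, False)] * 3 + [AgentRun(True, True), AgentRun(False, False)]
print(quadrant(runs))  # (0.8, 0.25, 'high completion / low acceptance')
```

Acceptance is measured only over completed runs here, so a product that finishes most tasks but has most results redone lands in the "finishes but isn't trusted" quadrant.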
"A session tells you that a user showed up. An agent run tells you what work was attempted."
A Codex update headlined by "App Shots" — sending the frontmost macOS window (screenshot plus text) to the agent by pressing both Command keys — also makes Goal Mode official across app/IDE/CLI, adds remote computer use that survives a locked Mac, and ships plugin sharing for teams.[19]AICodeKing — Codex 4.0 upgrades
~01:02 App Shots. Press both Command keys to send the frontmost window's screenshot and accessible text into the agent — useful for debugging native apps and UI states, with the obvious privacy caveat. ~02:03 Goal Mode official. Sustained multi-hour objectives across app/IDE/CLI; CLI 0.133 enabled goals by default with dedicated storage and progress tracking, and 0.132 stops on usage limits rather than looping.
~04:05 Remote computer use. Codex continues desktop tasks after a Mac locks and can be driven from Codex mobile, with short-lived auth, covered displays, and auto-relock on local input. ~05:05 Plugin sharing. Teams distribute bundles of skills, app integrations, MCP servers, and hooks via marketplace sources. ~07:08 Browser annotations let users mark up rendered UIs for precise front-end fixes.
"Whatever is on your screen right now can become context for the agent."
Matt Pocock tests Cursor's aggressive "thermonuclear code quality review" skill against his own commits. Roughly 5 of its 7 findings held up, including a blocker-class 1,000-line file, but he criticizes its verbosity and total silence on testing.[20]Matt Pocock — Cursor's review skill Separately, a Sequoia clip explains why Cursor deliberately skipped pre-training its own model, working top-down to ship value faster.[21]Sequoia — Why Cursor Skipped Pre-Training
~00:00 The single skill.md instructs the agent to audit the whole codebase from the current branch, not just the diff. Non-negotiables: don't push a file past 1,000 lines, treat nested ifs as design problems, question optionality and any/unknown types, and look for "code judo moves" that delete whole categories of complexity. ~08:08 Run against the last five PRs to his open-source Sandcastle, it flagged a 1,000-line init file, a generic registry helper, swallowed errors, and an incomplete decomposition — ~5 of 7 findings rated valid. His critiques: heavy repetition, vague "improve or worsen the local architecture" language, and zero mention of tests or seams.
"The behavior is correct in all three substantive PRs, but the code base is meaningfully messier than it was a week ago."
A Cursor team member explains the rationale: bottom-up (pre-train → scale → post-train → RL) takes a long time before any user value. By starting top-down — fine-tuning capable base models and applying RL directly — Cursor shipped a useful model far faster, optimizing for time-to-value over technical completeness.
"How do we get a model that's useful to users in the least time possible?"
Two Nature papers show multi-agent AI systems making lab-validated biomedical discoveries. Google's AI Co-Scientist autonomously generated treatments for AML leukemia, liver fibrosis, and antimicrobial resistance, with ideas rated more novel and impactful than those of human PhD experts in blind tests. Robin closed the loop, writing and running its own data analysis to find macular-degeneration drug candidates; it synthesized 551 papers in ~30 minutes, and compute for the whole run cost $10.76.[22]AI Search — AI co-scientists make real discoveries
~01:01 Not one chatbot but an ecosystem of agents: a supervisor, a generation agent, a brutal reflection/reviewer agent, a proximity agent that clusters ideas, an evolution agent, and a ranking agent running an Elo tournament of head-to-head debates. Given 15 unsolved biomedical goals, blind judges rated its ideas higher in novelty, plausibility, and impact than those of the best human experts. Results: for AML it found binimetinib (IC50 2nM) and KIRA6 (18× more effective at killing leukemia stem cells), plus a JQ1 + Olaparib + MSA2 combo; for liver fibrosis, Vorinostat; and for AMR, it explained cf-PICI phage-tail hijacking in 2 days, matching an unpublished lab team's months of work.
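A ranking agent that runs a tournament of head-to-head debates needs some way to turn pairwise wins into an ordering. The sketch below uses the standard Elo update for that purpose. The starting rating of 1200, the K-factor of 32, and the hypothesis names are assumptions for illustration; the source does not give the parameters used in the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def record_match(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Update both ratings in place after one head-to-head comparison."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta

ratings = {"hypothesis_a": 1200.0, "hypothesis_b": 1200.0, "hypothesis_c": 1200.0}
debates = [("hypothesis_a", "hypothesis_b"),
           ("hypothesis_a", "hypothesis_c"),
           ("hypothesis_b", "hypothesis_c")]
for winner, loser in debates:
    record_match(ratings, winner, loser)

ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)  # ['hypothesis_a', 'hypothesis_b', 'hypothesis_c']
```

Beating a higher-rated hypothesis moves the ratings more than beating a lower-rated one, so the ordering reflects who was beaten and not only how many debates were won.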
~21:17 Robin goes further by interpreting raw experimental data and iterating: Crow does literature review, Falcon writes deep drug reports, and Finch autonomously writes/debugs/executes analysis code, launching 8 parallel instances with a 50%-consensus mechanism to avoid hallucination. Applied to dry age-related macular degeneration, it confirmed compound Y27632, then via RNA-seq surfaced the ABCA1/APOE link, and proposed Ripasudil and the circadian modulator KL001. A human would need ~400 hours; Robin finished the full loop in under 2 hours for $10.76.
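One plausible reading of the "50%-consensus mechanism" is a majority vote across the parallel analysis runs. The sketch below implements that reading under loud assumptions: a conclusion is kept only if strictly more than half of the runs agree, and each run's result is reduced to a comparable string. The source does not describe Robin's actual rule.

```python
from collections import Counter
from typing import Optional

def consensus(answers: list[str], min_share: float = 0.5) -> Optional[str]:
    """Return the most common answer if more than `min_share` of runs agree, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) > min_share else None

# Eight hypothetical parallel runs, five of which reach the same conclusion.
runs = ["upregulated"] * 5 + ["downregulated"] * 2 + ["no change"]
print(consensus(runs))  # upregulated
```

A 4-4 split returns `None` under this rule, which treats disagreement as "no finding" rather than picking a side.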
"It synthesized those 551 papers in just around 30 minutes… and the compute cost for everything is just $10.76."
This week's 35 trending repos lean heavily into agent infrastructure: Workshop (local agent debugger), CodeGraph (a repo knowledge graph that cuts Claude Code's token waste), and Forkd, which forks warm Firecracker micro-VMs in ~101ms using userfaultfd and copy-on-write memory. ML highlights include HRM-Text, a 1B model trainable for ~$1,000.[23]Github Awesome — GitHub Trending #34
~00:00 Agent dev tools. Workshop instruments agents in Claude Code/Codex/Cursor, streaming tokens/tool-calls/spans into a local UI for self-healing eval loops; CodeGraph pre-indexes a repo into a knowledge graph so agents make fewer wasted tool calls; Agent HTML replaces giant markdown blobs with stable, updatable semantic HTML artifacts.
~01:00 Infra & ML. Forkd forks warm Firecracker VMs in ~101ms. HRM-Text trains a 1B hierarchical-recurrent model on 16 H100s for ~46 hours (~$1,000); MIT CSAIL's ELF is a diffusion LM denoising in embedding space; FlashLib rebuilds ML GPU primitives for Hopper (K-means 26×, SVD 208× faster). Pya Engram offers a local-first, MCP-compatible shared memory store across AI tools; Mailflare is a self-hosted email platform on Cloudflare Workers + D1.
~02:01 Plus fun/utility repos: ASCII Aquarium (ESP32), Nvidia's PiDi (latents → 2048px in <1s on a 5090), WorkOS's auth.md agent-registration protocol, ShadowChat (file transfer via light), and a Raycast-style tmux palette.
A roundup of the day's developer tooling: Vercel's wterm renders a web terminal to the DOM (not canvas) for free browser accessibility; Deno 2.8 makes fresh npm installs 3.6× faster and adds deno audit fix and deno pack; Devbox puts your dev environment in Git; Real Python walks through building MCP servers in Python; and marimo demos a reactive 2D animation widget.[24]Better Stack — wterm
~00:00 A Zig-based web terminal that compiles to a 12KB WASM binary and renders to the DOM, so text selection, find, and screen readers work for free, unlike canvas-based xterm.js. An optional libghostty renderer (400KB) improves legibility and color, and it needs a WebSocket back-end spawning a PTY.[24]Better Stack — wterm
Fresh npm installs drop from ~3.3s to ~96ms (3.6× faster) via parallelism and off-critical-path decompression; deno audit fix auto-upgrades vulnerable deps to the nearest safe version; and deno pack publishes a Deno/JSR library to npm — transpiling, generating .d.ts, rewriting imports, and shimming Deno APIs — without a separate build pipeline.[25]Better Stack — Deno 2.8
A friendly Nix wrapper: devbox init/add/shell gives every developer the same project-scoped tool versions, committed as devbox.json + devbox.lock rather than a rotting README. Honest trade-offs: the first Nix download is slow, complex logic belongs in a .sh file, and it's not a cloud-IDE replacement.[26]Better Stack — Devbox
~03:00 Real Python's course frames MCP as a set of rules (QR-code analogy: server generates, client consumes), scaffolds a project with uv, and implements a get_sales tool — a plain Python function decorated with @mcp.tool(), where type hints and a docstring are the metadata the LLM uses to decide whether to call it.[27]Real Python — Building MCP servers
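To show why type hints and a docstring are enough metadata for a tool, here is a self-contained sketch that uses a hypothetical stand-in decorator named `tool` rather than the MCP SDK's `@mcp.tool()`. The parameters of `get_sales` and the sample data are invented for illustration, since the summary does not give the course's actual signature.

```python
import inspect
from typing import Callable, get_type_hints

TOOLS: dict[str, dict] = {}

def tool(func: Callable) -> Callable:
    """Hypothetical stand-in for a tool decorator: records what a client would see."""
    hints = get_type_hints(func)
    TOOLS[func.__name__] = {
        "description": inspect.getdoc(func),
        "parameters": {name: hints[name].__name__
                       for name in inspect.signature(func).parameters},
        "returns": hints["return"].__name__,
    }
    return func

@tool
def get_sales(region: str, year: int) -> float:
    """Return total sales for a region in a given year."""
    fake_sales = {("emea", 2025): 1250.0}  # illustrative data, not from the course
    return fake_sales.get((region, year), 0.0)

print(TOOLS["get_sales"])
```

The function body stays ordinary Python; the decorator only reads the signature and docstring, which is the information a model relies on when deciding whether to call the tool.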
A 2D slider widget that animates a puck along a programmable path with easing, looping, and step functions — every downstream chart updates reactively, so any variable driven by its output can be animated.[28]marimo — 2D animation widget
Bloomberg-leaked screenshots show Apple's long-promised Siri overhaul, expected at WWDC on June 8: a dark-mode standalone app with voice/image/file input, Dynamic Island integration — and, most notably, a dropdown to route queries to Claude, Gemini, or ChatGPT, positioning Siri as an orchestration layer rather than a closed assistant.[29]Tech Brew — Apple's Siri redesign
The recreated screenshots show a chatbot-style interface similar to Claude and ChatGPT, defaulting to dark mode with an "Ask Siri" field anchored at the bottom, support for voice prompts, image attachments, and file uploads, plus a "Search or Ask" surface reachable by swiping down. Conversation history appears as lists or bubbles, and a new AI search layer unifies apps, web results, weather, and shortcuts. The standout addition is the dropdown to route queries to Google Gemini, Anthropic Claude, or OpenAI ChatGPT. The design adopts Apple's "Liquid Glass" aesthetic, though early reaction is mixed, and Apple is also reportedly working on an auto-lock theft-detection feature.[29]Tech Brew — Siri redesign
Meta launched "Meta One," a set of paid tiers across Instagram, Facebook, and WhatsApp ($2.99–$3.99/mo) plus two AI plans ($7.99 and $19.99/mo). It is the company's first major consumer monetization beyond ads, and it sent the stock up 3.7%. Sherwood also flagged that BYD registered more than twice Tesla's EV volume in Europe in April, while Ferrari's electric debut, the Luce, sent Ferrari's shares down as much as 8.4%.[30]Sherwood — Meta One & EV race
Meta One. Instagram Plus and Facebook Plus at $3.99/mo, WhatsApp Plus at $2.99/mo, and two AI tiers at $7.99 and $19.99/mo (the free chatbot now rate-limited). The move is read as a way to justify Meta's huge AI capex to investors; Sherwood also noted Anthropic's annualized revenue reportedly approaching $45B.
EV market. BYD posted 115% YoY growth in European registrations vs Tesla's 47%, widening the gap, while Ferrari's Luce disappointed investors — down 5.3% in US trading and 8.4% in Milan — a sign that luxury positioning alone isn't a credible electrification roadmap.[30]Sherwood — Meta One & EV race
OpenAI published a Frontier Governance Framework mapping its safety practices to the EU AI Act and California's Transparency in Frontier AI Act, covering four risk areas reviewed by a cross-functional Safety Advisory Group.[31]OpenAI — Frontier Governance Framework Two customer stories show GPT-5.5 at work: Abridge saw evaluation scores rise as it added tools for clinical decision support, and Chip Ganassi Racing used OpenAI tools to win at Long Beach.[32]OpenAI — Abridge clinical decision support
OpenAI's public document codifies how its safety work aligns with the EU AI Act Code of Practice and California's Transparency in Frontier AI Act, covering cyber offense, CBRN, manipulation, and loss of control. A cross-functional Safety Advisory Group reviews whether safeguards sufficiently minimize severe risk before routing recommendations to leadership — framed as going beyond minimum compliance, echoing Anthropic's recent safety updates.[31]OpenAI — Frontier Governance Framework
~00:00 Abridge's clinical decision support tool saw a counterintuitive upward trend on its eval set as it added more tools with GPT-5.5, with better reasoning and information density while keeping clinicians the final authority.[32]OpenAI — Abridge
~00:10 Chip Ganassi Racing used OpenAI tools to synthesize years of race data, optimize pit strategy, and even generate strength-and-conditioning plans, winning at Long Beach.[33]OpenAI — Chip Ganassi Racing
Shorter items: designing agents that survive mid-run failures, an engineering crisis-management playbook, the 1974 paper that birthed index funds, a cautionary "automation is a lie" vibe-coding story, two Claude-vs-ChatGPT prompting notes, a small Simon Willison SVG tool, and this week's Data Science Weekly.
Designing agents for production failures. The hard part isn't building an agent but handling what happens when it breaks mid-run; AgentSpan (open-source SDK by Orkes) adds execution-history inspection, configurable retries, and resume-from-failure.[34]Arjay McCandless — Build Durable Agents
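The three ideas named in the durable-agents item (an inspectable execution history, configurable retries, and resume-from-failure) can be sketched in plain Python. This is not AgentSpan's API: `run_durable`, the history record layout, and the `fetch` and `summarize` steps are hypothetical, and a real system would persist the history to durable storage rather than an in-memory list.

```python
from typing import Callable

Step = Callable[[dict], dict]


def run_durable(steps: list[tuple[str, Step]], initial_state: dict,
                history: list[dict], max_retries: int = 2) -> dict:
    """Run steps in order, retrying failures and skipping steps already done."""
    done: set[str] = set()
    state = dict(initial_state)
    # Replay the history: completed steps are skipped and the most recent
    # checkpointed state is restored.
    for record in history:
        if record["status"] == "ok":
            done.add(record["step"])
            state = dict(record["state"])

    for name, step in steps:
        if name in done:
            continue  # resume-from-failure: never re-run a completed step
        for attempt in range(1, max_retries + 2):
            try:
                state = step(dict(state))  # pass a copy so a failed step cannot corrupt state
                history.append({"step": name, "attempt": attempt,
                                "status": "ok", "state": dict(state)})
                break
            except Exception as exc:
                history.append({"step": name, "attempt": attempt,
                                "status": "error", "error": str(exc)})
        else:
            raise RuntimeError(f"step {name!r} failed after {max_retries + 1} attempts")
    return state


# Hypothetical steps: the second one fails on its first two calls.
calls = {"fetch": 0, "summarize": 0}


def fetch(state: dict) -> dict:
    calls["fetch"] += 1
    return {**state, "docs": ["a", "b"]}


def summarize(state: dict) -> dict:
    calls["summarize"] += 1
    if calls["summarize"] <= 2:
        raise TimeoutError("model call timed out")
    return {**state, "summary": f"{len(state['docs'])} docs"}
```

With `max_retries=1`, a first run completes `fetch`, exhausts its attempts on `summarize`, and raises. Calling `run_durable` again with the same `history` skips `fetch` and picks up at `summarize` from the checkpointed state, and the history list doubles as the record to inspect afterward.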
Crisis management playbook. Clear your schedule in crisis mode, use the OODA loop to reorient often, delegate project management past ~10 people, and name a DRI; directionality matters more than magnitude.[35]Real Python — Crisis Management Playbook
The birth of index funds. Paul Samuelson's 1974 paper found no evidence managers beat the market and called for a fund that "apes the whole market" — the intellectual blueprint for index funds, now ironically criticized for being too big.[36]Acquired — How index funds were born
"Automation is a lie." A vibe-coded side app crashed every 10 minutes at launch, Codex created a whack-a-mole loop, and it took two senior engineers (and a case of bursitis) to fix — a reminder that AI-built systems still need human oversight.[37]Lenny's Podcast — AI still needs humans
Claude prompting notes. Give Claude existing work to edit rather than a blank canvas — it's stronger as an editor (85% structural coherence vs ChatGPT's 78% in one test).[38]Nate B Jones — Give Claude work to edit And Claude's principles-based training yields higher instruction compliance (94% vs ChatGPT's 87%), which matters most across vague, multi-turn tasks.[39]Nate B Jones — Claude instruction compliance
markdown-svg-renderer. A small in-browser markdown viewer from Simon Willison that renders fenced SVG code blocks as live images (with a source toggle), taking input via paste, raw URL, or GitHub Gist — handy for sharing LLM output with embedded diagrams.[40]Simon Willison — markdown-svg-renderer
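The core step behind a viewer like this, finding fenced SVG blocks in markdown so they can be rendered instead of shown as source, can be sketched in a few lines. This is an assumption-laden illustration in Python, not the tool's own in-browser code: `extract_svg_blocks` and the regular expression are hypothetical, and the pattern only handles simple three-backtick fences whose info string is exactly `svg`.

```python
import re

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example

# A line starting with the fence plus "svg", then the body, then a closing fence line.
SVG_BLOCK = re.compile(
    rf"^{FENCE}svg[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_svg_blocks(markdown: str) -> list[str]:
    """Return the body of every svg-tagged fenced block, in document order."""
    return [match.group(1).strip() for match in SVG_BLOCK.finditer(markdown)]
```

A renderer could then swap each extracted body in as a live image while keeping the original source available behind a toggle; fenced blocks in other languages are left untouched.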
Data Science Weekly #653. Highlights: 6 LLM prompting techniques for data scientists, a Reddit thread noting ML/DE skills are rebounding while pure "AI" postings decline, and Frank Harrell's argument that AI should be a "specification writer and comprehensive tester" in statistics.[41]Data Science Weekly — Issue 653