April 17, 2026
Claude Opus 4.7 ties GPT-5.4 and Gemini 3.1 Pro at the top of Artificial Analysis's Intelligence Index (57.0), runs away with agentic work on GDPval-AA (1,753 Elo, 79 points clear of the field), cuts hallucinations from 61% to 36%, and costs 11% less to eval than 4.6[1]Artificial Analysis — Opus 4.7: Everything you need to know. Theo spent a full day with it and called it "kind of a disaster" — not because the weights are bad, but because Claude Code itself is rotting around them[2]Theo — t3.gg, "This model is kind of a disaster". Meanwhile one independent analyst argues Anthropic's per-token pricing shift is the only honest signal in a market that's been torching compute on flat-rate plans and leaderboard-gamed token budgets[3]AI Demand Is Inflated And Only Anthropic Is Being Realistic.
Artificial Analysis pegs Opus 4.7 at $5/M input and $25/M output — unchanged from 4.6 — but the full Intelligence Index eval now costs ~$4,406, 11% less than 4.6 despite higher scores. The reason: 102M output tokens this round vs 157M for 4.6. Opus 4.7 earns the savings by abstaining on uncertain questions rather than guessing, which also pushed hallucination rate from 61% → 36%[1]Artificial Analysis — Opus 4.7 deep dive. Context window is 1M tokens with 128K max output, and reasoning effort now has five levels (low/medium/high/xhigh/max).
Theo lists the real wins: better instruction-following, vision up to ~2576px (3× prior Claudes), file-system memory, an "extra high" reasoning tier, a new /ultra-review slash command, and auto-mode permission classification. But at ~00:00 he hands the model a redesign of t3.gg and it surfaces three system-reminder blocks telling it to refuse the work as if it were malware. Anthropic's cyber safeguards then hard-locked a DEF CON Gold Bug cryptography puzzle (the "Cshanty" 12-bottle cipher), forcing a fallback to Sonnet 4. Asked to modernize his Ping codebase "to latest versions," Opus 4.7 silently installed Next.js 15 (two years old) and React 19 and never web-searched — GPT-5.4 in the same test acknowledged its cutoff and fetched docs to confirm Next 16. Claude Code's bypass-permissions mode broke silently mid-session, auto-mode didn't trigger, and a clone-repo task hallucinated that .env files weren't in .gitignore[2]Theo — t3.gg, "This model is kind of a disaster".
I can't believe I watched the model regress in real time is what I'm going to say. I don't actually believe Anthropic's models get dumber over time… I just genuinely think Claude Code is this shitty and poorly maintained.
Theo's diagnosis: Anthropic employees run a different internal stack from what ships to customers (unlike OpenAI/Google, where "employees may be a week or two ahead on Codex builds, but it's the same code"), so the public harness degrades invisibly. Contrast Tibo's public Codex "ghosts in the machine" post-mortem — OpenAI investigates and publishes; Anthropic ships and stays quiet. Theo also cites Jake's line: "4.7 is the weirdest model either labs released in a while. It just plowed through a bug that required touching 30 files and then got a boolean backwards in the fix." His take: Anthropic "RL'd consistency out" to get a higher ceiling and better bold-number counts, and the variance landed on users.
The ENG culture at Anthropic is rotting at the core. You guys suck at coding as a company, as a business… You make Google look like a well QA'd company with the slop you guys are throwing at us.
A third lens: Dario Amodei has said most AI companies "have not written down the spreadsheet — they don't really understand the risk they're taking"[3]AI Demand Is Inflated And Only Anthropic Is Being Realistic. Uber's CTO said AI coding tools maxed out Uber's annual AI budget by April. Goldman Sachs reports companies overrunning initial inference budgets by "orders of magnitudes." A single $200/month Anthropic Max plan was estimated to consume $2K–$5K of compute. Meta and Shopify track AI usage volume rather than output; Jensen Huang explicitly told NVIDIA managers that he'd be "deeply alarmed" if a $500K engineer did not consume at least $250K in tokens. Against that backdrop, Anthropic cutting third-party tools (OpenClaw) from unlimited subscriptions and moving enterprises to per-token billing is the only thing producing clean demand data — potentially decisive going into its expected IPO.
Anthropic Labs launched Claude Design, a visual creation tool powered by Opus 4.7 that turns prompts into prototypes, slide decks, and landing pages, with direct export to Canva and a one-click handoff into Claude Code[4]Anthropic — Introducing Claude Design by Anthropic Labs. It's bundled with existing Pro/Max/Team/Enterprise subscriptions at claude.ai/design. Brilliant reports cutting complex page prototyping from 20+ prompts to 2; Datadog compressed week-long design cycles into single conversations.
Built on Claude Opus 4.7 — Anthropic's most capable vision model — Claude Design accepts text prompts, DOCX/PPTX/XLSX uploads, codebase integration, and web captures. It auto-applies existing brand colors, typography, and components from a repo or design file, and exports to internal URLs, PDFs, PPTX, standalone HTML, or Canva[4]Anthropic News — Claude Design. Collaboration is scoped at the org level with granular permissions; on Enterprise it's off by default and must be admin-enabled.
We're excited to build on our collaboration with Claude, making it seamless for people to bring ideas and drafts from Claude Design into Canva. — Melanie Perkins, Co-Founder & CEO, Canva
Nate Herk walked through the full loop at ~00:00[5]Nate Herk — Claude Design Just Became Unstoppable. The design system builder ingests a company name, brand docs, logos, and a GitHub repo, then spends ~15 minutes producing a skill.md manifest plus colors/type/components. At ~06:03 he drops a PDF and watches Claude invoke a "read PDF" skill, plan 19 slides, and render a fully branded deck ("I like this more than Gamma because Gamma has its own things and it's really nice, but it's a bit more inflexible"). At ~09:04 he asks it to build a workshop landing page; Claude asks a structured sequence of clarifying questions (name, dates, seat cap, early-bird pricing) before drawing a pixel, then offers a Tweaks panel for one-click deadline swaps, orange-vs-blue accent, and countdown toggles. At ~11:04 he pastes a generated "hand off to Claude Code" fetch command into his website project; Claude Code unzips the design, spins up a local server, and replaces placeholder images with real assets. "The flow is you come to Claude Design and you have your design elements… and then when you need to actually take it one step further to deploy it, that's when you send it over to Claude Code with the click of a button."
Canva publicly endorsed the launch ("we've loved collaborating with Anthropic over the past couple of years"). Nate's argument is that the tight integration — shared design systems, direct handoff commands, context already in the Claude environment — gives teams already on Claude Code a compelling reason to cancel Gamma and Canva subscriptions even as the official line is "partnership." He also frames Design as a token-efficiency play: visual iteration upfront saves the costly course-corrections that Claude Code would otherwise eat in implementation.
Anthropic rolled out Claude Routines: schedule-, GitHub-event-, and API-triggered Claude Code runs in isolated cloud containers, no laptop required. The headline trade-off: usage pulls from the subscription pool, but Pro is capped at 5 runs/day and Max 20x at 15[6]Better Stack — Claude Routines: The Hidden Costs Nobody Talks About. Nate Herk used it to wire up a five-routine Opus 4.7 trading agent on Alpaca/Perplexity/ClickUp[7]Nate Herk — I Turned Claude Opus 4.7 Into a 24/7 Trader. The common review: powerful, remote-first, but thin at the low-end price point[8]AI That Works While You Sleep & More AI News You Can Use.
Routines are the remote counterpart to Claude Code's local Loops and Schedules. Three triggers: cron, GitHub events (e.g. PR opened), and inbound API POST. Each run spins up a fresh cloud container with no access to local skills/hooks/settings unless the repo is cloned in. You can attach Slack/GitHub connectors and lock down allowed outbound domains. Better Stack demoed a 9 AM Slack digest reading JS Weekly/React Status/Node Weekly, plus a PR-review routine[6]Better Stack — Claude Routines: The Hidden Costs.
Because Claude Code in the cloud creates a new instance of Claude Code, it doesn't have access to your local skills or settings or hooks.
Pro gets 5 routine runs / 24 hours, Max 20x gets 15. Manual test runs don't count. "A single automated PR review depletes one of only five daily runs." Better Stack's verdict: "For what it is, it's very expensive," and they'd rather self-host with Hermes agent + Multica on GLM 5.1 or a GPT coding model behind webhooks.
Nate set up five weekday routines — pre-market (6 AM), open (8:30 AM), midday (12 PM), close (3 PM), and Friday-only weekly review (4 PM) — around an Opus 4.7 fundamentals strategy[7]Nate Herk — I Turned Claude Opus 4.7 Into a 24/7 Trader. Since routines are stateless, persistence lives entirely in files — CLAUDE.md, strategy docs, trade log, research log — read at wake and written at shutdown. "Every routine basically wakes up, reads files, does the job, and then writes back any important lessons or anything like that the next agent when it wakes up needs to know. Files aren't just memory, but they're essentially the agent's full personality and discipline." Tool stack: Alpaca Markets API (endpoints preferred over their MCP), Perplexity API for web research, ClickUp for daily/weekly summary notifications. API keys live in a named Claude Cloud Environment ("trading"), not in .env. He migrated the strategy from an OpenClaw agent by exporting its memory files as a zip.
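The five schedules map directly onto standard cron expressions; a minimal sketch (the routine names are mine, the times are Nate's):

```python
# Nate's five weekday triggers as cron expressions
# (minute hour day-of-month month day-of-week).
ROUTINE_SCHEDULE = {
    "pre_market":    "0 6 * * 1-5",   # 6:00 AM, Mon-Fri
    "market_open":   "30 8 * * 1-5",  # 8:30 AM, Mon-Fri
    "midday_check":  "0 12 * * 1-5",  # 12:00 PM, Mon-Fri
    "market_close":  "0 15 * * 1-5",  # 3:00 PM, Mon-Fri
    "weekly_review": "0 16 * * 5",    # 4:00 PM, Friday only
}
```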
On model fit: Opus 4.7 gained ~4 points on Agentic Financial Analysis over 4.6 (reaching 64.4%), but the benchmark measures SEC-filing digestion, not chart reading — Nate routes it to long-term/swing trades, explicitly not day trading. Agentic Search regressed relative to 4.6. Guardrails: 5% max position size, daily loss caps, no options. Paper trade on Alpaca's 100K account first. Grade yourself weekly ("the test run gave itself a C").
A separate news round-up clip notes the Claude desktop app now supports multi-window Claude Code sessions with drag-out split-view and an embedded terminal[8]AI That Works While You Sleep & More AI News You Can Use. Routines show up here too, framed as the one clear differentiator vs ChatGPT scheduled tasks: "you could be on a $20 plan and it could run in the cloud every day."
Nate B Jones argues the 2026 AI moat isn't model quality — it's accumulated user memory, and there is no "bring your own context" (BYOC) solution[9]Nate B Jones — Anthropic And OpenAI Are Fighting Over Your Memory. You're Going To Lose.. Every conversation with ChatGPT, Claude, or Perplexity accrues a career-scale asset you don't own; platforms have zero incentive to let you leave. His pitch: treat your AI context like a professional asset and plug into MCP — "the USB-C connector for AI."
At ~02:03 Nate breaks context into four layers: domain encoding (industry vocabulary, products, regulatory environment), workflow calibration (how you like research structured, drafts shaped — saves 5-8 turns per conversation), behavioral relationship (unstated preferences about when to challenge, how technical, how much preamble — "compound interest on a relationship"), and artifact / demonstrated capability (the reasoning behind the artifacts you produce, which should be portable). A cited survey claim: more than 60% of workers use personal AI at work regardless of IT policy.
You don't see the way the LLM is shaping itself to your patterns of response, but it is… It's like the compound interest on a relationship except in this case it is the compound behavioral difference the LLM learns from interacting with you over time.
At ~11:10: "This is a problem that I would bet you a lunch affects 90% of us in the next two years if we're in the professional workforce." Triggers: job changes, role changes, corporate AI vendor switches. The credential gap is being filled "by vibes" — Meta reportedly flies candidates in and locks them in a room with a company laptop and tools, a bad proxy because the candidate has no context. At ~13:12 he distinguishes "candy products" (nice to have, diffuse pain) from "opium products" (acute pain). Memory portability is candy — "like your engine giving out and costing you 20,000 miles, but you don't really think of it that way" — which is why well-funded memory-layer startups keep failing.
At ~16:13 he outlines three tiers: (1) a structured markdown profile — 30 minutes of work, "720p not 4K" — captures domain, communication preferences, workflow, style; (2) a personal context server, an MCP-native store backed by Postgres/Supabase/VPS you control, flipping push → pull so the AI queries only what it needs; (3) MCP exposure for bidirectional read/write. "MCP is effectively the USB-C connector for AI. Everything plugs into it." He's building this into a product called OpenBrain. Thousands of mostly non-technical users have already built with it.
Your canonical working identity is important enough to put in a database now because guess what? This is the way the web is going and it should live with you, not inside a platform a company owns.
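Tier 2 is concrete enough to sketch with the official MCP Python SDK. A minimal personal-context server, where the tool names and profile fields are illustrative assumptions, not OpenBrain's actual design:

```python
# Personal context server sketch: the AI pulls only what it needs and
# can write calibration back. The backing store is an in-memory dict
# here; Nate's tier 2 would put it on Postgres/Supabase you control.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("personal-context")

PROFILE = {
    "domain": "fintech; regulated payments; products A and B",
    "workflow": "research as sourced bullet summaries; drafts in my voice",
    "behavior": "challenge weak assumptions; no preamble",
}

@mcp.tool()
def get_context(section: str) -> str:
    """Return one section of the working profile (pull, not push)."""
    return PROFILE.get(section, f"unknown section: {section}")

@mcp.tool()
def update_context(section: str, text: str) -> str:
    """Bidirectional write-back for newly learned preferences."""
    PROFILE[section] = text
    return "ok"

if __name__ == "__main__":
    mcp.run()  # serves MCP over stdio by default
```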
At ~26:19: for decades professional value was four things — what you know, what you can do, who you know, what you can prove you've done — all of which live in your head or relationships. AI creates a fifth: your working intelligence, the accumulated calibration that makes you effective with AI tools. Uniquely, it lives on third-party servers, fragmented across accounts, under unnegotiated terms of service. His closing thesis at ~29:20:
Memory has replaced models as the moat of 2026. The platforms that built their retention around your accumulated context are winning right now.
Peter Steinberger, creator of OpenClaw and now at OpenAI, gave a State of the Claw update at AI Engineer: five months old, arguably the fastest-growing project in GitHub history, and drowning in ~16.6 security advisories per day[10]AI Engineer — State of the Claw, Peter Steinberger. He addressed "Closed Claw" fears about OpenAI, laid out the Open Claw Foundation plan, and talked taste, soul.md, prompt injection, dreaming, and glasses that annotate the world.
OpenClaw hit ~30K commits, ~2K contributors, ~30K PRs, and the largest star count of any non-educational GitHub project — a friend called the curve "stripper pole growth." Steinberger is juggling his OpenAI role and the newly forming Open Claw Foundation (structure inspired by Ghostty, bottlenecked on American banks' handling of non-American founders). He's deliberately not staffing OpenClaw primarily with OpenAI employees — routing contributors through Nvidia, Microsoft (for Teams/Windows), Red Hat, Tencent, ByteDance, Telegram, Salesforce, Slack, and Alibaba/MiniMax/Kimi — and notes Chinese companies are OpenClaw's largest userbase by continent[10]AI Engineer — State of the Claw.
OpenAI didn't buy OpenClaw — they might have bought my soul.md.
1,142 advisories, 99 critical, 469 published, 60% closed. That's roughly 2× the Linux kernel's daily inflow (8–9/day) and more than curl's lifetime total (~600). The rule of thumb: "the higher they're screaming how critical they are, the more likely it's slop." Tells for AI-generated reports: overly nice tone, apologetic phrasing ("people in security don't apologize"). Real incidents worth naming: a CVSS-10 permissive-read escalation ("Gshjp") that's technically maximum severity but barely exposes real users; a "ghost claw" rootkit npm campaign attributed to North Korea; a transitive Axios supply-chain hit via the Teams and Slack plugins (OpenClaw itself doesn't use Axios); and a Belgian CERT RCE on a gateway-token forwarding bug that only fires if you actively fight the default setup. He singles out a paper titled "Agents of Chaos" that spent four pages dissecting OpenClaw architecture while skipping the security docs — and when pressed, the authors admitted they'd run OpenClaw in sudo mode, something that requires code changes, and omitted that "because that wouldn't give them clout."
A concrete demo: Nvidia invited him the Sunday before their Monday NemoClaw keynote; he hooked up Codex security to their hardened sandbox and found five breakout paths in half an hour — partly because OpenAI's internal cyber-trained models are meaningfully stronger than public ones.
His viral "10 parallel agents" photo is now more like 5–6 windows (Codex got faster). He rejects the dark-factory: "the way to the mountain is usually never a straight line." Taste is "if it doesn't stink like AI" — no purple gradients, no leftbar color borders, but yes to the delightful details (OpenClaw's occasional roast messages) that survive human involvement. soul.md came from noticing his WhatsApp relay didn't match how his friends actually text. He previewed dreaming, a memory-reconciliation feature akin to sleep-based garbage collection — and noted the recent Anthropic source-code leak shows they're working on similar ideas. On small local models (~20B): no defenses against prompt injection, dangerous against email/web; OpenClaw now warns when a small model is selected. Andrej Karpathy and Mahran Dre both automate their houses with OpenClaw, only possible because IoT device security is weak enough that a web-clicking agent can actually drive them.
OpenClaw would have never come out of an American company — it would have been killed in legal long before it would have been released.
OpenAI MTS Ryan Lopopolo argues that since GPT-5.2, code is free — an engineer's job is now building the harness: lints, skills, reviewer sub-agents, and prompts-everywhere that let agents execute the full SWE role while humans steer[11]AI Engineer — Harness Engineering, Ryan Lopopolo, OpenAI. He banned his team from touching editors for nine months, runs as a "token billionaire" (1B+ output tokens/day, $1K+/day), and treats LLMs as "fuzzy compilers" whose optimization pass is the harness.
Implementation is no longer the scarce resource. Scarce resources: human time, human/model attention, model context window. Every engineer is now effectively a staff engineer with 5 / 50 / 5,000 parallel teammates. The job shifts to systems thinking, delegation, and writing down the non-functional requirements that define "a good job."
For the last nine months I have had the privilege of building software exclusively with agents. I am a token billionaire and I believe that in order for us to get into our AGI future, we want everybody to be token billionaires.
Code is free. It's free to produce, free to refactor, and it is not a thing to get hung up on anymore… the important thing is not the code but the prompt and the guardrails that got you there.
The stack: Codex (CEX) is the outer harness, not a shell the app spawns into. Five to ten skills teach it to launch the app, spin up local observability, attach Chrome DevTools via a daemon. PNPM workspace with ~750 packages isolated by domain/layer, enforced package privacy, dependency-edge lints (agents optimize for local coherence over shared utilities). Source-structure tests assert files <350 lines for context efficiency. Lint error messages are written as prompts with remediation steps ("we parse don't validate at the edge"). Reviewer sub-agents per persona — security, reliability, front-end architect, product — run on every push. QA-plan requirements gate PRs.
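His stack is TypeScript/PNPM, but the source-structure idea translates anywhere; a pytest-flavored sketch in Python (the 350-line budget is from the talk, paths and globs are assumptions):

```python
# Source-structure test sketch: fail CI when a file outgrows the
# context-efficiency budget, with an error message written as a prompt.
from pathlib import Path

MAX_LINES = 350  # keep any file loadable whole into an agent's context

def test_files_fit_in_context():
    offenders = [
        (path, n)
        for path in Path("src").rglob("*.ts")
        if (n := len(path.read_text(errors="ignore").splitlines())) > MAX_LINES
    ]
    assert not offenders, (
        "Split these files along domain boundaries so agents can read "
        "them in one pass: "
        + ", ".join(f"{p} ({n} lines)" for p, n in offenders)
    )
```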
Garbage collection day every Friday: every recurring reviewer complaint becomes a durable guardrail. He tethers his laptop to his phone, straps it in the back seat, and lets agents cook during his commute — GPT-5.4 in CEX autocompacts well enough that he never runs /new. Token spend is roughly ⅓ planning/ticket curation, ⅓ implementation, ⅓ CI-time review. He avoids plan mode (approving unread plans encodes bad instructions); if you do use plans, ship them as standalone human-reviewed PRs first. He points Codex at OpenAI's prompting cookbooks to synthesize a skill that writes the prompts.
Every time I have to type continue to the agent is a failure of the harness to provide enough context around what it means to continue to completion.
The harness is the optimization/constraint pass; swapping models is like swapping LLVM for Cranelift — tokens differ, acceptance criteria hold. Collaboration happens in markdown + PRs as a hub-and-spoke broadcast domain; agents can acknowledge, defer, or reject reviewer feedback so they're not "bullied by reviewers." Future vision: hand machines a quarter's token budget, a ranked backlog, and success metrics; humans do triage, runbooks, vibes, and meta-programming of acceptance criteria.
Jess from Braintrust: too many teams ship AI features on engineer gut-checks or a PM trying "a couple of prompts" — i.e. vibes. Her fix is a four-part eval framework (Dataset / Task / Scoring / Experiments) and a concrete case study comparing agentic vs vector search on Claude Code over TypeScript-Go and SWE-Bench Verified Django tasks[12]AI Engineer — Stop Shipping on Vibes, Jess at Braintrust. Verdict: agentic search wins on signal; vector search returns proximity without "connective tissue."
(1) Dataset — goldens, edge cases, failure modes; (2) Task — the prompt + model; (3) Scoring — deterministic, LLM-as-judge, or human review; (4) Experiments — one (dataset, task, score) triple = one run, compared across runs to detect regressions. Braintrust's "Loop" feature lets you query long traces in natural language. Feedback loop: production logs → sample 10–20% → dataset → eval → ship. "Evals are a team sport" — AI engineers, PMs, SMEs (especially in medicine/law/insurance), data analysts. She cites OpenAI's April 2025 sycophancy-revert as the canonical "this is why evals matter" moment.
You're essentially making ship decisions based off of vibes, which is not good. Ideally you would like to be able to quantify when you're shipping by saying things like we ran 200 different test cases and 94% of them passed.
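That quantification is just the four parts composed; a generic sketch of one experiment run, not Braintrust's SDK:

```python
# One (dataset, task, scorer) triple = one run; compare pass rates
# across runs to catch regressions before shipping.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    input: str      # the "golden" input
    expected: str   # expected output or reference answer

def run_experiment(
    dataset: list[Case],
    task: Callable[[str], str],          # the prompt + model under test
    score: Callable[[str, str], float],  # deterministic or LLM-as-judge
) -> float:
    results = [score(task(c.input), c.expected) for c in dataset]
    return sum(results) / len(results)   # e.g. 0.94 -> "94% passed"

# Simplest deterministic scorer; swap in an LLM-as-judge where needed.
def exact_match(output: str, expected: str) -> float:
    return float(output.strip() == expected.strip())
```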
Dataset 1: Microsoft TypeScript-Go PRs with "fix" in the title, checked out at the parent commit to recreate the buggy state, with Claude-synthesized bug descriptions acting as "linear tickets" and the Go test suite as the scorer. Dataset 2: 25 Django rows from SWE-Bench Verified, scored against fail_to_pass. Two languages, two codebase ages.
Implementation gotchas: agentic search was the Claude Code default; forcing pure vector search was hard — Claude kept falling back to agentic. Fix: pass --disallow-tools and write explicit prompt instructions. Second gotcha: Claude Code runs as a subprocess, so traces were orphaned; attach via parent span IDs in env vars.
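The orphaned-trace fix is the standard pattern of smuggling trace context through the environment; a hedged sketch (the variable name and CLI invocation are assumptions):

```python
# Attach a subprocess's spans to the parent run by passing the parent
# span ID through an env var the child's tracing hook reads back.
import os
import subprocess
import uuid

parent_span_id = uuid.uuid4().hex  # in practice, supplied by the tracing SDK

subprocess.run(
    ["claude", "-p", "fix the failing test"],               # headless run
    env={**os.environ, "PARENT_SPAN_ID": parent_span_id},   # hypothetical name
    check=True,
)
```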
Results: SWE-Bench 60% vector / 68% agentic. TypeScript-Go tied at 70%, but vector search burned far more tokens and dollars (she cautions the $4/run number is likely inflated). One SWE-Bench run made 26 vector search calls "guessing and checking." Takeaways: vector chunks return code without imports, type defs, or callers; agentic search enables real chain-of-thought exploration via import chains; more searches = more tokens = more cost.
Vector search gave the agent a lot of proximity to relevant code, but didn't give the connective tissue between the code for it to actually implement a fix.
I don't consider this eval close to being done at all… LLMs are very non-deterministic and so I would not be surprised if I ran this eval multiple times with all the exact same criteria that it might have a difference of 10 to 15%.
Simba (Redis) argues reasoning is saturating — agent independence is doubling every ~7 months and tool-calling benchmarks are near 100% — so the differentiator is now the context engine[13]Simba K — Context Engineering 2.0: MCP, Agentic RAG & Memory. He lays out an architecture built on agentic RAG, a semantic layer over OLTP data, memory/extraction patterns, and semantic caching, and previews a Redis product exposing a "context surface" via both MCP and CLI.
Every 7 months, the amount of time that you can trust an agent to work independently is doubling… if we do this conference next year, we'll be at 4 hours of uninterrupted time of an agent working.
A human locked in a room with just a prompt for 4 hours would hallucinate — agents need a way to actively find, retrieve, and refresh context. "Context is all that matters. If you give an agent and LLM proper context today — all that it needs — it will be able to solve it an insane amount of the time." Prediction: "context engineer" will emerge as a job title the way "data scientist" did.
Classic RAG is linear and one-shot — chunk docs, vector search, stuff into prompt — so it can't benefit from improving reasoning. The demo: a "Reddish" (DoorDash clone) app where plain RAG answers "why is my order late?" with generic reasons, while agentic RAG iteratively calls order lookup → policy search → answer. Vector DB is one tool the agent chooses, not the whole pipeline. Redis is previewing a context surface accessible via MCP and CLI, backed by arbitrary sources (APIs, other databases, Redis Search, Redis Agent Memory Server). Underneath sits RDI — a "context view" pattern analogous to feature-store online stores, materialized from operational data. Spicy take: full graph RAG doesn't work — LLMs can't maintain consistency across a full knowledge graph over time; a semantic graph a few hops out is fine.
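The agentic-RAG loop in the demo reduces to a bounded tool-choice cycle; a sketch with stubbed tools (the stubs and the llm callable are stand-ins, not Redis code):

```python
# Agentic RAG sketch: the model chooses tools iteratively instead of a
# single chunk -> vector-search -> stuff-the-prompt pass.
def lookup_order(order_id: str) -> dict:
    return {"status": "delayed", "cause": "driver reassigned"}  # stub

def search_policy(query: str) -> str:
    return "Orders delayed over 30 minutes qualify for a credit."  # stub

TOOLS = {"lookup_order": lookup_order, "search_policy": search_policy}

def agentic_answer(question: str, llm) -> str:
    """llm(question, context) returns {"type": "call", "tool": ..., "args": {...}}
    or {"type": "answer", "text": ...}; any chat model wrapper fits."""
    context = []
    for _ in range(5):  # bound the loop so a confused agent can't spin
        step = llm(question, context)
        if step["type"] == "answer":
            return step["text"]
        result = TOOLS[step["tool"]](**step["args"])
        context.append((step["tool"], result))
    return "Escalating to a human."
```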
Memory is a subset of broader context extraction — running LLMs over conversations, docs, and decision traces to pull out structured state. Redis open-sources an Agent Memory Server. A semantic layer over OLTP (Redis, powered by Pydantic) is critical because agent access patterns are transactional, not analytical — unlike traditional Snowflake/Databricks semantic layers. APIs built for humans expose too many knobs; agents need human-level abstractions (a transaction, an order).
Good morning cost me like a million dollars a year… every time someone picks up the phone and calls our agent, they say good morning and then it goes to Opus.
Fix: semantic caching with a fine-tuned ModernBERT to match intent and shortcut expensive reasoning calls.
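A minimal version of that cache, assuming a generic sentence-embedding model where Redis would use its fine-tuned ModernBERT:

```python
# Semantic cache sketch: embed the utterance; if it's close enough to a
# known intent, answer from the cache instead of waking the big model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for ModernBERT
CACHE = {
    "good morning": "Good morning! How can I help?",
    "what are your hours": "We're open 9-5, Monday through Friday.",
}
intents = list(CACHE)
intent_vecs = model.encode(intents, convert_to_tensor=True)

def cached_answer(utterance: str, threshold: float = 0.85):
    vec = model.encode(utterance, convert_to_tensor=True)
    scores = util.cos_sim(vec, intent_vecs)[0]
    best = int(scores.argmax())
    if float(scores[best]) >= threshold:
        return CACHE[intents[best]]  # cheap path: no reasoning call
    return None                      # miss: fall through to Opus
```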
At Stanford GSB, NVIDIA's Jensen Huang and Congressman Ro Khanna, moderated by H.R. McMaster, debate how the U.S. keeps its AI lead across the full five-layer stack (energy → chips → factories → models → applications), calibrated export controls, reindustrialization, and a workforce strategy that avoids both decoupling and abandoning American workers[14]Stanford GSB — U.S. Leadership in AI with Jensen Huang and Ro Khanna.
Jensen reframes data centers as "AI factories" that "turn electricity into tokens" — a shift from retrieval-based to generative to agentic computing. The U.S. must win all five layers (energy, chips, infrastructure, models, applications). The most important layer is application/diffusion: if fear or over-regulation blocks adoption, the flywheel dies and leadership is squandered. Khanna inventories U.S. comparative advantages: 60% of AI startups are founded by immigrants, 72% of AI researchers did their undergrad outside the U.S. (38% Chinese nationals), 14 of the world's top 20 research universities, academic freedom, and the government-university-industry transfer model (NSF-funded ARPANET at Stanford, 1969).
Jensen pushes back on decoupling: NVIDIA's share of the Chinese market was once 95%; U.S. energy and infrastructure buildouts for AI need Chinese supply chains; ASML is Dutch. "The idea that I'm going to shut you off and expect no repercussions is a bit naive."
We're going to compete with China, but we're not anti-China… there's a slippery slope between anti-China and being anti-Chinese.
Khanna agrees decoupling is wrong but argues the U.S. made a "colossal mistake" hollowing out industrial towns. He calls for a "21st-century Marshall Plan" — strategic tariffs paired with an industrial development bank (citing Rafael Reif's Foreign Affairs piece) to scale rare earths, critical minerals, active pharma ingredients, advanced steel, and robotics. Jensen notes NVIDIA and partners are investing roughly half a trillion dollars in U.S. chip and computer manufacturing, creating construction, electrician, and fine-tool jobs.
The classical data centers used to be a file server. Now you have basically token generators. You turn electricity into tokens.
Jensen uses radiology as the running counter-example: a decade ago a leading AI pioneer said radiology was a dead-end career; instead scan volumes exploded and there are now more radiologists than ever.
It is unlikely most people will lose a job to AI. It is most likely that most people will lose their job to somebody who uses AI.
Khanna accepts the abundance framing but, invoking Keynes's miscalculation about leisure, argues AI is capital-biased and demands worker bargaining power, ownership, and an "affirmative jobs agenda" — hiring young people to rebuild communities, work in the care economy, improve local government, or join moonshots.
I'm not an AI doomer. I'm not an AI accelerationist. I'm an AI democratist. I want this technology to work for everyone.
Khanna wants a combination of protected frontier models and open-source American models that can compete with Qwen, with export controls tied to compute thresholds and an "American AI = excellent, safe, privacy-respecting AI" brand. Jensen argues regulate applications, not the technology. Closing: America's "final project" is a cohesive multiracial democracy; Jensen's closing line:
The world is reset. An entire industry, the largest industry in the world, the computer industry is reset… Nobody has a head start on you.
Dwarkesh Patel released two clips from a longer Jensen Huang interview. Clip one: AI doomers were wrong about radiology; conflating a discrete task with a full job leads to bad policy and bad career advice[15]Dwarkesh Patel — AI Doomers Were Wrong About Radiology. Clip two: the case for selling NVIDIA chips to China, because ceding 40% of the global tech market to Huawei is the real strategic loss[16]Dwarkesh Patel — Jensen Huang Makes the Case for Selling Chips to China.
Whatever you do, don't be a radiologist. Radiology is going to be the first career to go. The world's not going to need any more radiologists. Guess what? We're short of radiologists.
Jensen: the job of a radiologist is patient care; the task is reading a scan. Confuse those and predictions go wildly wrong: hence "we're not going to have enough radiologists and good enough healthcare." Apply the same error to software engineering and the headline shortage reruns in a new industry[15]Dwarkesh Patel — AI Doomers Were Wrong About Radiology.
Jensen's two pillars: (1) China is ~40% of the world's tech industry; conceding it to Huawei is a strategic loss. (2) If global AI models come to run best on non-American hardware, the U.S. loses its platform advantage. His nightmare scenario is the day DeepSeek launches on Huawei first[16]Dwarkesh Patel — Jensen Makes the Case for Selling Chips to China.
The day that Deepseek comes out on Huawei first, that is a horrible outcome for our nation.
He rejects the "any marginal compute is a weapon" framing (compared to restricting microprocessors, DRAM, or electricity), invokes the collapse of the U.S. telecom industry as an export-controls cautionary tale, and, when Dwarkesh presses him on why China would be stuck on Huawei if NVIDIA wins on merit, replies:
In the absence of a better choice, you'll take the only choice you have. How is that illogical? It's so logical.
Host Christopher Bailey and Jodie Burchell (JetBrains data scientist / Python advocacy lead) recap six months of LLM progress: pre-training scaling laws plateauing, RLVR and reasoning models pushing performance post-training, orchestration eclipsing raw model gains, and Karpathy's "summoning ghosts" metaphor explaining jagged intelligence[17]Real Python Podcast #291: Reassessing the LLM Landscape & Summoning Ghosts.
By late 2024 pure pre-training scaling had hit diminishing returns. 2025's two big post-training themes: (1) reasoning models trained to decompose problems, built initially on expensive handcrafted datasets (grade-school word problems for Qwen 3); (2) reinforcement learning from verifiable rewards (RLVR) — training on math and code where answers are concretely checkable, enabling far cheaper, more scalable datasets. Combined with test-time compute (internal candidate exploration), this explains the token-consumption explosion. Jodie's verdict: the models are improving only modestly — 2025's real gains came from orchestration. She's skeptical of benchmarks (accidental leakage, "assessment overfitting," LM-arena gaming), and RLVR makes gaming worse.
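The RLVR point is easiest to see in the reward function itself: no human rater, no learned reward model, just a checker. A schematic sketch:

```python
# RLVR sketch: rewards come from verification, which is why math and
# code scale so much more cheaply than handcrafted preference data.
import subprocess

def math_reward(model_answer: str, ground_truth: str) -> float:
    # Exact-match checking; real pipelines normalize expressions first.
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(solution_dir: str) -> float:
    # Reward 1 iff the hidden test suite passes (assumes a pytest layout).
    result = subprocess.run(["pytest", solution_dir, "-q"], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```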
AI is ML. Let's not forget this. You cannot escape the rules — overfitting, data leakage, they're coming for you.
We're not evolving or growing animals, we are summoning ghosts.
Karpathy's line is the through-line. LLMs are trained on dead text; the "ghosts" are spikes of brilliance from very smart human-authored text plus domains we've figured out how to teach (via RLVR etc.). That explains jagged intelligence — superficial brilliance plus pits of absolute stupidity (falling for basic prompt injection). On AGI:
The word that everyone forgets in AGI is general… LLMs don't learn the way we do. We take our innate general intelligence and go top down; LLMs go bottom up.
Covered: Claude Code, Codex, JetBrains Junie, the new ACP (Agent Client Protocol) — JetBrains+Zed's LSP-style protocol for plugging external coding agents into IDEs — plus LM Studio, Qwen Coder, Devstral, Pydantic's "The human in the loop is tired" post, and Laura Summers' critique of code-review exhaustion. Comparison: current vibe-coding feels like 1970s cocaine-fueled filmmaking — huge output, human collapse.
You made a whole bunch of babies here and you need to still care and feeding of all of them… that feeling of responsibility is why you're tired.
On hype: no real evidence coding agents are replacing devs (the 30,000-layoff wave is cognitive dissonance — companies laying off are generally not doing well). Closing forecast:
Winters don't kill off progress. It's actually when the exciting work happens. I think we are in a bubble, I think this is unsustainable economically, but the technology will stick around and the really exciting work is going to start soon — once we stop focusing on AGI and give up on that.
OpenAI released GPT-5.4-Cyber — a deliberately "cyber permissive" fine-tune gated via the Trusted Access for Cyber (TAC) program[18]GPT-5.4-Cyber: What you need to know. Anthropic's parallel Mythos model is reportedly finding thousands of critical vulnerabilities autonomously, and is now a live concern for centralized crypto exchanges (not blockchains themselves)[19]What Anthropic's Mythos Means For Crypto Security. The policy debate rhymes with the 1995 SATAN scanner fight.
Described as "a variant of GPT-5.4 trained to be cyber permissive" — not the base model with access restrictions, but a fine-tune with lowered refusal boundaries for malware analysis, vulnerability discovery, and defense work. Access gated through TAC, an automated program available to both companies and vetted individuals. The debate: Anthropic runs a closed consortium; OpenAI's TAC is relatively open. Panelists argue strict access controls resemble security-through-obscurity — adversaries already have WormGPT[18]GPT-5.4-Cyber panel.
The bad guys already have things like WormGPT and things like that. They will have their cyber permissive, their guardrailless AIs that are out there whether we give it to them or not.
Anthropic's Mythos is reportedly capable of finding thousands of high/critical severity vulnerabilities, bypassing auth / 2FA, and turning flaws into working exploits — compressing the defender-reaction window between disclosure and weaponization. Anthropic recommends shortening patch cycles and enabling auto-updates[19]What Anthropic's Mythos Means For Crypto Security.
This algorithm is able to find exploits in software that humans have looked at for decades and have not found any problems in. So it is creating a new class of attacks.
Crypto read: base-layer blockchains (Bitcoin) are largely insulated — simple code, 15+ years of scrutiny, decentralized consensus. Real threat surface is centralized exchanges, trading apps, and retail-facing platforms holding customer funds. And the biggest current vector may not even be code: AI-driven voice/messaging makes mass social engineering cheap enough to harvest seed phrases at scale. Flip side: defenders can run the same agentic AI internally for proactive audits.
Meta released Muse Spark, its first model from the new Superintelligence Labs — and a hard pivot from Llama's open-weights strategy to proprietary AI. It's natively multimodal (262K-token context, three reasoning modes), ranks #4 on the Artificial Analysis Intelligence Index, and runs on far fewer tokens than Opus 4.6 or GPT-5.4[20]DeepLearning.AI The Batch — Issue 349.
Inputs: text, image, speech. Output: text. Three reasoning modes: instant, thinking, and contemplating (parallel agents propose solutions simultaneously). A thought-compression technique penalizes excessive reasoning tokens. Muse Spark uses ~59M tokens to run the Intelligence Index evals — far fewer than Claude Opus 4.6 (158M) and GPT-5.4 (116M).
Benchmark highlights: 86.4% on CharXiv Reasoning (beats GPT-5.4 at 82.8% and Gemini 3.1 Pro at 80.2%); 81% on MMMU Pro (second to Gemini 3.1 Pro's 82%); 42.8% on HealthBench Hard (leading all competitors). Health reasoning improved via data curation with 1,000+ physicians. Free via meta.ai and the Meta AI app, with rollout planned across WhatsApp, Instagram, Facebook, Messenger, and Ray-Ban Meta AI glasses, plus an API preview for select partners.
Muse Spark matches Llama 4 Maverick's capabilities with over an order of magnitude less processing devoted to training.
Context: the launch follows Meta's lab reorg after Llama 4 training-data contamination allegations, and its $14.3B investment for a 49% stake in Scale AI with Alexandr Wang named chief AI officer.
Eli Lilly — the world's most valuable pharma — agreed to provide up to $2.75B to Hong Kong-based Insilico Medicine for rights to AI-discovered drug candidates, starting with a $115M payment for exclusive rights to undisclosed pre-clinical drugs. It's the largest pharma-AI deal to date[20]The Batch #349 — Eli Lilly Commits $2.75B to Insilico Medicine.
Insilico, founded 2014, has developed 28 candidate drugs with about half in clinical trials. Its most advanced candidate, rentosertib (idiopathic pulmonary fibrosis), posted a Phase 2a readout where the highest-dose group gained an average 98.4 mL in forced vital capacity vs a 20.3 mL decline for placebo. A second candidate, garutadustat (inflammatory bowel disease), entered Phase 2a in January 2026.
The pipeline uses PandaOmics to analyze biological datasets, research, patents, and clinical trials for novel targets (it identified TNIK as an IPF target), and Chemistry42, which runs ~30 generative models in parallel to optimize molecules for binding, toxicity, and solubility. The lead IPF molecule was synthesized and tested using fewer than 80 compounds — vs the conventional 200K–1M compound screens — and target to preclinical readiness took ~18 months vs the typical 5–6 years. Caveat: no AI-discovered drug has yet received regulatory approval, and 70% of Phase 2 candidates historically fail the next phase.
Forty states have now enacted 100+ AI laws and are considering 1,500+ more bills — a fragmented compliance landscape that's accelerating despite a December 2025 Trump executive order threatening to withhold federal funds from states with "onerous" AI rules, and a March 2026 follow-up releasing national legislative guidelines[20]The Batch #349 — US States Enact AI Laws Despite Federal Pushback.
The issue's rundown of key state actions nets out the same way everywhere: rising compliance cost, rising legal risk, and real friction on national product launches.
Four Google items land the same day: DeepMind's TurboQuant cuts KV-cache memory 6×, speeds up H100 inference 8×, and briefly dropped DDR5 prices 30%[21]Better Stack — Google's TurboQuant Just Made the RAM Crisis Worse. Gemini 3.1 Flash TTS lands on top of leaderboards with bracket-based emotion tags[8]AI News You Can Use — Google roundup. AlphaEvolve-powered Persona Generators cover 82% of possible human response variation[20]The Batch #349 — Google's Persona Generators. And a Search spam-policy update targets back-button hijacking, effective June 15, 2026[22]Better Stack — Google's New "Anti-Spam" Policy Explained.
Two-stage quantization: polar-quant rotation transforms vectors into polar coordinates for compact representation; QJL (Quantized Johnson-Lindenstrauss) preserves inner-product precision during attention. Net: 16-bit → 3-bit KV cache, 6× memory cut, 8× H100 speedup, near-zero claimed accuracy loss. For a 128K context Llama 3, KV cache alone had been consuming 16 GB VRAM per session. Retail 32 GB DDR5 prices dropped up to 30% on the announcement. Expected rebound:
When you make something six times cheaper, people don't just spend less, they use it 10 times more.
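The 16 GB figure above checks out from first principles; a back-of-envelope sketch, assuming Llama 3 8B geometry (32 layers, 8 KV heads, head dim 128):

```python
# KV-cache arithmetic for a 128K-token session, fp16 vs 3-bit.
layers, kv_heads, head_dim = 32, 8, 128   # assumed Llama 3 8B geometry
tokens = 128_000
fp16_bytes = 2

per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # K and V: 128 KiB
fp16_total = tokens * per_token / 2**30
print(f"{fp16_total:.1f} GiB at 16-bit")           # ~15.6 GiB -> the "16 GB"
print(f"{fp16_total * 3 / 16:.1f} GiB at 3-bit")   # ~2.9 GiB before metadata
```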
Gemini 3.1 Flash TTS topped the leaderboards, ahead of ElevenLabs. Bracket-based emotion tags shift tone dramatically — whisper, panic, etc. Free at studio.google.com/generate-speech and available in Google Vids. Google also shipped a Gemini Mac desktop app (Option+Space, Notebook LM dropdown — "really, they're just playing catch-up") and a free offline dictation app on the Mac App Store powered by on-device Gemma 4, competing directly with Whisper Flow at no cost[8]AI News You Can Use — Gemini 3.1 + Mac apps.
This is what local AI is eventually going to look like. It's going to be packaged in apps from the big tech companies and just going to be really fluent to use.
Davide Paglieri, Logan Cross, and colleagues used the AlphaEvolve evolutionary algorithm to auto-generate code producing diverse LLM personas, coded in Python and run through Gemma 3-27B-IT via the Concordia library. 30 questionnaires across healthcare, financial literacy, and conspiracy theories; 25 persona prompts each; AlphaEvolve iterated 10 code versions × 500 iterations. Result: 82% response coverage vs 76% for Nemotron Personas and 46% for the Concordia baseline. Use case: simulate public sentiment for market research without recruiting human panels[20]The Batch #349 — Google Persona Generators.
Starting June 15, 2026, sites that abuse history.pushState/history.replaceState to trap users — injecting dummy entries so the back button takes them to ad interstitials instead of search results — get manual spam actions or automated ranking demotions. Legitimate SPA routing that respects the user's back intent is explicitly exempt[22]Better Stack — Google's Anti-Spam Policy.
If your site is using a library that messes with the history stack to keep users engaged, you are the one that's going to get penalized, not the library vendor.
Four headline papers: Tencent's open-source HY-World 2.0 3D world model, RAD-2 autonomous driving RL planner (56% collision-rate cut), DR³-Eval Deep Research Agent benchmark (Claude Sonnet 4 leads at 65.6), and TESSY — a teacher-student cooperative SFT method that flips Qwen3-8B from a 10% regression to an 11% gain on coding[23]HuggingFace Papers — 2026-04-17.
Tencent's four-stage pipeline unifies 3D world generation and reconstruction from text, single-view images, multi-view images, and video. Stages: HY-Pano 2.0 (panorama), WorldNav (five-mode trajectory planner: regular, surrounding, reconstruct-aware, wandering, aerial), WorldStereo 2.0 (distilled to 4-step DiT for fast inference), and WorldMirror 2.0 (feed-forward model predicting dense point clouds, depth maps, surface normals, camera params, and 3DGS attributes). Achieves state-of-the-art among open-source; matches closed-source Marble. All weights and source released.
From Huazhong UST and Horizon Robotics. Unified generator-discriminator framework for autonomous driving: diffusion-based trajectory generator produces candidates; RL-trained discriminator reranks by long-term quality. Two new optimization techniques: TC-GRPO (Temporally Consistent Group Relative Policy Optimization) and OGO (On-policy Generator Optimization). BEV-Warp simulation environment skips image-level rendering for high-throughput closed-loop RL training. Collision rates down 56% on large-scale benchmarks.
From Nanjing, M-A-P, and NUS. 100 tasks (50 EN, 50 ZH) across tech, economy, humanities; pairs multimodal user files with a static sandbox corpus up to 512K tokens. Five scoring metrics: Information Recall, Citation Coverage, Factual Accuracy, Instruction Following, Depth Quality. Leaderboard: Claude Sonnet 4 at 65.6, GLM-4.7 at 64.1, Gemini 2.5 Pro at 57.0. All models drop 10–15 points when context grows from 64K to 512K. Hallucination is the primary failure mode.
Shanghai AI Lab. Identifies a new failure mode: stylistic divergence between teacher-generated SFT data and the student's distribution causes catastrophic forgetting. Fix: decompose tokens into capability tokens (code, numerical steps, solution logic) and style tokens (discourse markers like "wait" or "but"). Generate alternately — teacher writes capability spans, student writes style — with a generate-then-rollback strategy for precise span boundaries. On Qwen3-8B with GPT-OSS-120B as teacher: vanilla teacher-SFT drops 3.25% on LiveCodeBench-Pro and 10.02% on OJBench; TESSY lifts them by +11.25% and +6.68%. Generalizes across DeepSeek-R1 and Qwen3-235B-A22B-Thinking.
Nate B Jones argues Shopify CEO Tobi Lütke's April 2025 AI-native memo has hardened into a structural reshaping of the tech talent market. By January 2026 it's producing real signals: Josh Miller at The Browser Company is paying premiums for engineers "native to the Claude Code way of building"[24]Nate B Jones — Tech talent is about to get ugly thanks to this memo.
The memo applied Shopify's existing "Red Queen" principle — you must grow as fast as the company just to keep your role — to AI adoption. In a company growing 20–40% year-on-year, "you have to improve by at least that much every single year just to re-qualify for your own role." The industry read-across is speeding up: compensation premiums for AI-native workflows, new role definitions, hiring criteria rewritten around agentic tools.
Stagnation is not just failure, it's slow-motion termination.
A cluster of short tutorials: Better Stack says most devs run Claude Code as one sequential session when worktrees + batch + hooks + dispatch can cut a 45-minute task by up to 70%[25]Better Stack — You're Using Claude Code Wrong. AICodeKing walks through Bite Rover, now an officially integrated structured-memory plugin for OpenClaw hitting 92.2% on LoCoMo[26]AICodeKing — OpenClaw Super Mode / Bite Rover. And Arjay McCandless shows Elastic Agent Builder wiring an agent directly to Elasticsearch in 30 seconds[27]Arjay McCandless — Building an AI Agent in 30s.
One prompt turns into a fully coordinated team of these AI agents.
OpenClaw's main weakness was memory — agents losing context or retrieving wrong things from older notes. Bite Rover replaces flat note dumps with a hierarchical tree organized by project area, feature, architecture decisions, and relationships. Tiered retrieval (fuzzy text → LLM-driven). 92.2% on the LoCoMo memory benchmark. Storage is local markdown by default, with optional cloud sync. Multiple OpenClaw agents or sessions can share the same memory layer.
If your memory is better, you can often get more out of cheaper models because they have better context to work with.
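The tiered-retrieval idea sketches cleanly: cheap fuzzy matching prunes the tree, and only the survivors go to an LLM rerank. Node paths and fields here are assumptions, not Bite Rover's schema:

```python
# Hierarchical memory sketch: notes keyed by project-area paths,
# fuzzy-filtered before any LLM-driven retrieval tier runs.
from difflib import SequenceMatcher

MEMORY = {
    "auth/oauth/decision": "Chose PKCE; refresh tokens rotate weekly.",
    "auth/sessions/bug": "Session cache misses invalidation on logout.",
    "billing/stripe/architecture": "Webhooks are the source of truth.",
}

def fuzzy_tier(query: str, k: int = 3) -> list[str]:
    """Tier 1: rank paths+notes by cheap string similarity; tier 2 would
    hand only these candidates to an LLM for reranking."""
    def score(path: str) -> float:
        text = f"{path} {MEMORY[path]}".lower()
        return SequenceMatcher(None, query.lower(), text).ratio()
    return sorted(MEMORY, key=score, reverse=True)[:k]
```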
Connect directly to Elasticsearch so data stays in place (no migration to a separate AI platform). Built-in tools: index listing, search, natural-language query; plus custom tools for recurring queries.
My favorite part is everything stays close to where the data already lives.
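With the official Elasticsearch Python client, a "custom tool for recurring queries" is just a thin, named wrapper the agent can call; index and field names below are assumptions:

```python
# Custom-tool sketch: the agent invokes this instead of composing raw
# Query DSL every time, and the data never leaves Elasticsearch.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def recent_errors(service: str, hours: int = 24) -> list[dict]:
    """Recurring query: last N hours of error logs for one service."""
    resp = es.search(
        index="logs-*",
        query={"bool": {"filter": [
            {"term": {"service.name": service}},
            {"term": {"log.level": "error"}},
            {"range": {"@timestamp": {"gte": f"now-{hours}h"}}},
        ]}},
        size=20,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```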
Matt Williams walks through a zero-friction voice-note pipeline: Just Press Record on Apple Watch or CarPlay captures audio; Apple Shortcuts transcribes and POSTs to an N8N webhook; N8N calls Ollama Cloud to classify and summarize; the cleaned note lands in Obsidian[28]Matt Williams — Trigger Ollama from Anywhere.
Design principle: treat capture like a fire alarm, not a filing cabinet. The phone is either absent (gym) or a bad fit (driving), so Apple Watch and CarPlay handle entry. iCloud sync hands the audio to Shortcuts. The N8N workflow receives the webhook, logs the text, then uses an Ollama text-classifier node to distinguish grocery list from video idea, extracts the core concept, and writes a cleaned-up markdown file to Obsidian. Ollama runs on Ollama Cloud, not local — reliability over home-network uptime.
I tend to think about capturing ideas more like a fire alarm than a filing cabinet.
If your system requires menus or good intentions to work, you've already missed the point.
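For the N8N-averse, the middle of the pipeline fits in a small webhook service; a sketch with FastAPI and the Ollama Python client (model name, endpoint path, and vault location are assumptions):

```python
# Voice-note pipeline sketch: Shortcuts POSTs the transcript here; the
# model classifies and cleans it; the result lands in an Obsidian vault.
from datetime import datetime
from pathlib import Path

import ollama
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
VAULT = Path("~/Obsidian/Inbox").expanduser()

class Note(BaseModel):
    transcript: str

@app.post("/voice-note")
def handle(note: Note) -> dict:
    reply = ollama.chat(
        model="llama3.2",  # Matt uses Ollama Cloud; any capable model works
        messages=[{
            "role": "user",
            "content": "Classify this as 'grocery list' or 'video idea', "
                       "then rewrite it as a clean markdown note:\n"
                       + note.transcript,
        }],
    )
    out = VAULT / f"{datetime.now():%Y-%m-%d-%H%M}-note.md"
    out.write_text(reply["message"]["content"])
    return {"saved": str(out)}
```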
Netflix posted a Q1 beat — $12.25B revenue vs $12.18B est, $1.23 EPS vs $0.76 — but the beat was heavily inflated by a $2.8B breakup fee from the failed Warner Bros. Discovery merger, and shares sold off on soft Q2 guidance[29]Sherwood Snacks — Netflix earnings underwhelm investors as Hastings sets stage for exit. Reed Hastings leaves the board in June after 29 years. TSMC CEO CC Wei publicly pushed back on Musk's "light speed" Terafab timeline.
Netflix: revenue $12.25B vs $12.18B consensus; EPS $1.23 vs $0.76 est, heavily inflated by the $2.8B WBD breakup fee. Shares sold off in after-hours as investors focused on Q2 guidance, reversing gains from the Warner merger cancellation news.
Reed Hastings will step down from the Netflix board in June 2026, planning to "focus on his philanthropy and other pursuits." Completes his formal separation from the company he co-founded 29 years ago.
Tesla Terafab: TSMC's CC Wei publicly replied to Musk's aggressive Terafab timeline, saying modern foundries realistically take ~5 years to build and ramp — "there are no shortcuts." Intel has joined the Terafab effort, but skeptics remain. The newsletter flags an unrelated but striking data point: Tesla and SpaceX together bought 1 in 5 Cybertrucks registered in Q4 2025 — 1,414 of 7,071.
Following US and Israeli strikes, the Iranian Revolutionary Guard Corps blockaded the Strait of Hormuz — through which 20% of the world's oil previously flowed. Actual global supply contraction: only ~8%, cushioned by UAE, Saudi, Iraq, and Iran's own bypass pipelines. Prices spiked 60% anyway[30]How the Iran War Spiked Oil Prices.
Oil demand is highly inelastic (~-0.1 elasticity): roughly 75% of personal vehicle trips are non-discretionary, commercial transport won't stop regardless of price. The 8%/60% ratio implies an observed elasticity of ~-0.13, slightly above theoretical. Commodity traders initially under-reacted, reading Trump's "limited operation" framing as precedent from Venezuela, Syria, and the June 2025 Iran nuclear strike. Then war-risk insurance cancellations — with 72-hour notice — made Gulf oil exports functionally impossible for weeks. When policies were rewritten, premiums surged from ~0.1–0.25% to ~3% of vessel value per voyage — more than a 10× jump — and prices stabilized near $100/barrel.
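The implied-elasticity claim is a one-line computation:

```latex
\varepsilon \;=\; \frac{\%\Delta Q}{\%\Delta P} \;=\; \frac{-8\%}{+60\%} \;\approx\; -0.13
```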
War risk insurance had gone up more than tenfold — often selling for around 3% of the ship's value per journey rather than the 0.1 to 0.25% in the weeks leading up to the war.
UAE pipeline to Fujairah (majority of its 4.1M bbl/day); Saudi Arabia's 1,200 km East-West pipeline to Yanbu (5–7M bbl/day capacity, built in response to the Iran-Iraq war); Iran's own pipeline to Jask (~15% of exports); Iraq's Kurdistan-to-Turkey pipeline (ramping toward 600K bbl/day after a new Baghdad-Kurdistan-Turkey agreement). Even with the blockade, a trickle of Iran-linked tankers continued transiting.
The illusion of Gulf stability has been shattered.
Two quick Simon Willison posts: Datasette 1.0a28 fixes a 1.0a27 regression in execute_write_fn() callbacks, adds a proper datasette.close() shutdown, and ships a pytest plugin to auto-close test fixtures and stop leaking file descriptors[31]Simon Willison — Datasette 1.0a28. Separately, PyCon US 2026 runs May 13–19 in Long Beach — the first West Coast edition since 2017 — and debuts dedicated AI and security tracks[32]Simon Willison — PyCon US 2026 announcement.
Changes were primarily driven by regressions uncovered during a Datasette Cloud upgrade to 1.0a27. Highlights:
- Fixed the regression affecting execute_write_fn() callbacks with non-standard parameter names.
- database.close() now properly shuts down write connections.
- New datasette.close() method for clean shutdown of all databases, auto-invoked on server shutdown.

PyCon US 2026 runs May 13–19 in Long Beach, CA. First-ever dedicated AI track on Friday May 15 (eight sessions on LLMs, voice agents, async patterns, local models) and security track on Saturday May 16. Track chairs: Silona Bonewald (CitableAI) and Zac Hatfield-Dodds (Anthropic); Simon Willison is in-room chair for the AI track. He describes PyCon as "the least corporate feeling large event" he attends — 2,000+ attendees.
It's the add-ons around the talks that really make it work for me.
Destin Sandlin sits with Rev. Dalon Woodall to walk through the 1955 Johari Window (arena / facade / blind spot / unknown) as a framework for self-awareness and relationship health — then asks Dalon, on camera, to name his actual blind spots[33]Smarter Every Day 314 — The Johari Window.
The model: four quadrants based on what's known/unknown to self and to others. Arena = public. Facade = hidden/withheld. Blind spot = others see, you don't. Unknown = mystery to everyone. The central claim: our self-knowledge is imperfect and others often perceive truths we can't see ourselves.
The central implication of the Johari window is that there are parts and pieces of our personality and identity that are mysteries to us. That we don't have perfect self-knowledge… We are coming to know ourselves as others are coming to know us and that we have blind spots.
Dalon names three for Destin: (1) he operates with pre-internet assumptions about relationships in a culture that's become transactional and parasocial; (2) his lived experience as a white guy constrains his ability to perceive what others from different contexts go through; (3) he's blind to how large his influence is — strangers crossing a restaurant to greet him are traversing an enormous social chasm in a loneliness epidemic.
It's never been harder in human history than it is right now for a person to walk up to a stranger… And it happens to you all the time. And you don't understand how big of a deal that is for people.
Destin closes with the growth prescription: disclose to expand from facade into arena; ask for feedback to shrink the blind spot; use prayer, novelty, and curiosity to probe the unknown. "No man is an island. Our self-awareness, ironically, isn't just about ourselves."