Cursor just crushed Claude Code

AI Models

Cursor's Composer 2.5: near-frontier code at a fraction of the cost

Theo (an early Cursor investor, disclosing the bias) argues Composer 2.5 — not Gemini 3.5 Flash — was the underrated drop of the week: a distilled model built on Moonshot's Kimi K2.5 that scores ~63% on Cursor Bench, sitting between GPT-5.5 high (64) and Opus 4.7 (65) at roughly 20x lower cost, priced at $0.50/$2.50 per million in/out tokens.^{[1]Theo - t3.gg — Cursor just crushed Claude Code} The catch: there's no API — it's usable only inside Cursor — but a freshly announced SpaceX deal to train a model from scratch on 10x more compute means Cursor could leapfrog rather than just catch up.

The subsidization war Cursor can't win on price

~03:00 — Theo walks through token economics: Sonnet historically $3/$15 per million, Opus 4.5 at $5/$25, GPT-5 generously $1.25/$10 climbing to 5.5's jump to $5/$30. Three pricing factors matter — per-token cost, number of tokens generated (Sonnet burned 200M tokens on a bench where GPT-5.5 used only 75M for a higher score), and the ability to negotiate deals.^{[1]Theo - t3.gg — pricing primer} ~07:00 — The strategic argument: a $200 Claude Code sub yields ~$4,000 of usage, but Cursor reselling Anthropic API pays near full price (~$3,000 for that same $4,000), so it's "competing in a subsidization war it doesn't have a player in." Cursor's real moat is data — the chat histories of how devs collaborate with agents.

Training: targeted RL, synthetic tasks, reward hacking

~10:00 — Composer 1 was overpriced and token-hungry, 1.5 worse-than-Sonnet value, so Theo had written it off: "I was betting against Jacob and you should never bet against Jacob." Composer 2 was 7x cheaper than 1.5; 2.5 holds the cheap price. Both 2 and 2.5 are still based on Kimi K2.5, effectively doubling the base model's score.^{[1]Theo - t3.gg — Cursor Bench} ~14:00 — Cursor published its method: "targeted RL with textual feedback" (a teacher model steers the student via on-policy distillation KL loss to fix localized behaviors like bad tool calls), 25x more synthetic tasks including a "feature deletion" trick (delete a feature, re-implement, use existing tests as reward), and reward-hacking examples where the model decompiled Java bytecode to recover deleted signatures.

All of the work for making models good at code is post-training in RL largely.

Live demo and the SpaceX kicker

~20:00 — In Cursor's "Glass" harness (which Theo trashes as "slow, clunky, obnoxious"), Composer 2.5 re-implements his game "Fish Slop" fast via parallel agents; the first render fails but core mechanics work. ~26:00 — The core complaint: no API means no external benchmarking. The finale: Cursor announced with SpaceX AI it's training a model from scratch with 10x more total compute (100x more than Kimi's original) on Colossus 2's ~1 million H100-equivalents, plus a pending deal (a $10B collaboration or a $60B acquisition, likely at the in-process IPO).

I legitimately think it's more likely that we're using xAI Cursor Composer 7 than anything by Gemini for our day-to-day dev work.

Tools: Cursor, Composer 2.5, Kimi K2.5, GPT-5.5, Opus 4.7, Gemini 3.5 Flash, Cursor SDK, Cursor Glass, Supermaven, Tabnine

Industry AI Future

The AI Daily Brief

AI's new acceleration phase: Anthropic's profitable quarter, the Erdős problem, Karpathy to Anthropic

NLW's weekly recap argues the individual stories this week "add up to a whole much more than the sum of its parts" — felt acceleration across business models, pricing, consumer scale, and capability. Headliners: Anthropic projecting the first-ever profitable quarter for any AI lab, OpenAI solving an 80-year-old Erdős math problem, Andrej Karpathy joining Anthropic to work on recursive self-improvement, and a federal AI executive order scuttled hours before signing.^{[2]The AI Daily Brief — AI's New Acceleration Phase}

Profitability and the end of the subsidy era

~01:01 — Anthropic expects its first-ever profitable quarter (with caveats: it's a projection, revenue-recognition questions linger, and it's getting discounted SpaceX compute). OpenAI had a "banger" Q1 generating ~$1B more revenue than Anthropic, boosted by token-hungry Codex; Nvidia beat every expectation.^{[2]The AI Daily Brief — earnings} ~03:01 — The "end of the subsidy era": token-hungry agents kill flat-rate plans. Google I/O's Ultra plan cut from $250 to $200/month came with a shift to usage-based billing; Microsoft canceled its Claude Code licenses on cost; Anthropic shipped a /usage command to show which skills, agents, MCPs, or plugins are the biggest token hogs.

This is the first-ever profitable quarter for any AI lab.

Filling the gap; consumer scale

~04:01 — Cursor's Composer 2.5 performs comparably to Opus 4.7 and GPT-5.5 at 10–60x lower cost; Elon Musk settled into an "AI compute czar" role with SpaceX offering compute "as a service at significant scale," and Anthropic's Tom Brown announced an expanded SpaceX partnership scaling on Colossus 1 and 2. ~06:02 — The Gemini app hit 900M monthly active users, effectively closing the gap with ChatGPT; monthly tokens jumped 700% to 3.2 quadrillion. Google added agentic "information agents" to Search and shipped Docs Live for voice-driven editing.

The Gemini app is now up to 900 million monthly active users, having effectively closed entirely the gap with ChatGPT.

The Erdős breakthrough, Karpathy, and policy whiplash

~10:03 — OpenAI used an internal model to break an 80-year-old Erdős problem (how many pairs of n points can be exactly one unit apart), disproving the assumed square-grid optimum. Fields medalist Tim Gowers called it "the first really clear example of AI solving a really well-known unsolved math problem"; Noam Brown said it was a general-purpose LLM with a simple prompt. Andrej Karpathy announced he's joining Anthropic to work on recursive self-improvement, using Claude to accelerate pre-training research itself. ~14:04 — California's Newsom signed an exploratory AI-labor order; the federal AI executive order was scuttled hours before its signing ceremony after former AI czar David Sacks personally called Trump to block it.

He called POTUS this morning, unbeknownst to anyone, his own staff included, and derailed it.

Tools: Claude, Claude Code, ChatGPT, Gemini, GitHub Copilot, Codex, Cursor Composer 2.5, Opus 4.7, GPT-5.5, Antigravity, Docs Live, Artificial Analysis

AI Future Industry

AI News & Strategy Daily | Nate B Jones

Nate B Jones: the AI boom is about to hit a wall — below the GPU

Nate argues the real AI constraint isn't GPUs but the layer below — high-bandwidth memory, chip packaging, power, and cooling — which turns every AI vendor contract into a de facto supply contract. The killer stat from Epoch AI: in 2025 the four largest AI chip designers consumed ~90% of global chip-packaging and HBM capacity but only ~12% of advanced logic-die production, so the bottleneck was never GPUs but integration into served tokens.^{[3]Nate B Jones — Why the AI boom is about to hit a wall}

'Capacity constrained' decoded

~00:00 — Anchored on Microsoft's Q3 call: Satya Nadella said the company will spend $190B on capex this year and still expects to be capacity constrained — which does NOT mean running out of GPUs, but whether you can manufacture enough chips packaged with the memory they need.^{[3]Nate B Jones — Nadella capex} ~01:01 — Six months ago an AI vendor contract looked like a software contract; now it's effectively a supply contract needing allocation, capacity terms, and fallback.

A user will see a paragraph generated on a screen, but every word in that paragraph came out of a factory.

The factory and its bottlenecks

~04:05 — Stop thinking of AI as software with a fancy backend; every answer is the output of a "production chip system" — chips, HBM, packaging, networking, power, cooling, land, construction, ops talent. ~05:06 — Capex roll call: Meta $125–145B, Amazon landed 2.1M+ AI chips in 12 months, Google $185B last year; the physical unit is Nvidia's GB200 NVL72 module (72 Blackwell GPUs, 13.5 TB HBM3, 576 TB/s memory bandwidth). ~08:08 — Bottleneck tour: HBM as the single most constrained input, TSMC CoWoS packaging, optics, firm power at the right location, and liquid cooling. ~12:10 — The Epoch AI stat: 90% packaging/HBM vs 12% logic dies.

Falling costs, Jevons, and the bubble test

~14:11 — GPU depreciation runs 3–5 years while data-center shells last far longer, so CFOs must ask whether they can earn enough before the next hardware generation. ~17:11 — Serving costs are falling fast (smaller models, distillation, caching, quantization), and Microsoft said Copilot inference throughput rose 40% from software/hardware optimization alone — but Jevons' paradox means cheaper tokens create more demand. ~19:11 — Three questions for any AI investment review: reserved vs best-efforts capacity, a concrete routing plan to cheaper models, and where hidden human supervision is masking product failure. He repeatedly says this is why he does NOT think we're in a bubble.

It's part of why we're token constrained in May is because we got better agents in January.

Tools: Microsoft Copilot, Nvidia GB200 NVL72, Nvidia Spectrum-X Photonics, TSMC CoWoS, AWS Bedrock, Amazon Trainium, Opus 4.7, ChatGPT 5.5

Developer Tools

Simon Willison

Datasette ships a Jump-to menu, an LLM agent, and test fixtures together

Simon Willison shipped three coordinated Datasette releases on May 24: Datasette 1.0a30 adds a keyboard-driven "Jump to..." menu, datasette-agent 0.1a4 wires an LLM chat agent into that menu, and datasette-fixtures 0.1a0 provides a standardized test fixture database for plugin developers.^{[4]Simon Willison — Datasette Agent ecosystem releases}

Datasette 1.0a30 introduces a customizable "Jump to..." menu triggered by pressing /, letting users search and jump to databases, tables, and debug options. A new jump_items_sql() plugin hook lets third-party plugins inject their own items.^{[4]Simon Willison — Datasette 1.0a30}

datasette-agent 0.1a4 hooks into the jump menu via the makeJumpSections() JavaScript hook, surfacing a "Start a new agent chat" entry; users can type natural-language queries (e.g. "count entries") and the agent returns SQL-backed results (the demo shows 3,300 entries). datasette-fixtures 0.1a0 ships a ready-made fixture database (sample roadside attractions) backed by a new datasette.fixtures.populate_fixture_database(conn) API, pullable via uvx without a full Datasette install for standardized plugin test suites.

Tools: Datasette, datasette-agent, datasette-fixtures, uvx

Hot Take

Simon Willison

Armin Ronacher: AI-rewritten 'slop issues' are killing open-source bug reports

Armin Ronacher argues that AI-polished issue reports strip out the submitter's authentic observations and replace them with inaccurate conclusions, fabricated reproductions, and overconfident root-cause speculation — making them harder to debug than raw notes would be.^{[5]Simon Willison — Quoting Armin Ronacher}

Ronacher's post, quoted by Simon Willison, targets "slop issues": bug reports run through an LLM before submission. His core complaint is that the reports no longer sound like the person who filed them — they lose the human voice that signals what the submitter actually observed versus what they inferred.^{[5]Simon Willison — slop issues} He advocates a minimal, first-person format: the exact command run, what was expected, what actually happened, and the raw error or log — nothing more. AI rewrites introduce fabricated minimal reproductions and speculative root-cause analyses presented with unwarranted confidence.

The most frustrating failure mode right now is that people submit issues that are not in their own voice.

Industry

Morning Brew

The K-shaped economy splits travel: premium surges, budget collapses

Economic inequality is splitting the travel industry in two: Delta's premium ticket revenue surpassed economy sales for the first time in 2025 even as economy cabin sales fell 7%, while luxury hotel RevPAR rose 2.9% and economy hotel RevPAR dropped 4.1%.^{[6]Morning Brew — The K-shaped economy is trippin}

On the airline side, Delta saw economy cabin sales fall 7% year-over-year in 2025 while premium ticket sales rose 9% — and for the first time ever, premium revenue exceeded economy revenue. Rising fuel costs (partly attributed to the Iran war) are accelerating the split, with budget carrier Spirit Airlines ceasing operations in May 2026.^{[6]Morning Brew — airlines and hotels}

Hotels show the same divergence: luxury properties gained 2.9% RevPAR (Nov 2024–Nov 2025), midscale fell 2.6%, and economy dropped 4.1%. Despite the squeeze, two-thirds of Americans still plan domestic summer trips, adapting via frequent-flyer miles, deal sites, and "destination dupes" — cheaper substitutes like Brussels for Paris or Naples for Rome.

Developer Tools Hot Take

Mario Zechner

Mario Zechner: Building the Pi coding agent & the 'rich man's game'

Mario Zechner (badlogic), creator of the Pi coding agent, explains why he abandoned Claude Code for his own harness, why he's skeptical of "model degradation" claims, and why affording tokens is a genuine economic moat — "a rich men's game" where a $200/month plan already prices out ~99% of the world.^{[7]Mario Zechner — Tokens can make you rich, just do this}

Why he built Pi

~00:30 — He credits Anthropic with the "genius idea" of giving agents a terminal/bash so they could explore codebases themselves — "agentic search" — which broke Cursor's indexing limits. He was a "diehard Claude Code fan" until July–August 2025, when bloat and system-prompt changes "broke my workflows every day," so he reverse-engineered Claude Code and switched to Pi in October.^{[7]Mario Zechner — building Pi} ~06:06 — On degradation, he's skeptical: he attributes it to psychology (the honeymoon wearing off) and harness changes, not quantization, calling the broader phenomenon "collective psychosis." He concedes a real Anthropic change on March 26 cleared older thinking from idle sessions — "lobotomizing the model for that session."

Anthropic had the genius idea of making the agents basically use your computer.

The rich man's game and open weights

~11:08 — He agrees with the "rich get richer" framing: those who can afford tokens have "a massive edge," and a $200/month plan already prices out ~99% of the population. The real value is internal tooling — his linguist wife 5x'd her scientific output after two nights of Claude Code instruction. "The code can be total slop as long as it generates time saving." ~17:11 — He credits Chinese open-weight labs with "fucking with the token economics of the US labs" (citing ~70% Anthropic inference margins) and finds Kimi K2.6 good enough to self-host at comparable cost. The host shares a ~$6K/month Anthropic run rate, now pivoting toward GPT-5.5, DeepSeek v4, and Codex.

I definitely think it's going to be a rich men's game. The people who have the means of production in the sense that they can afford the tokens have a massive edge.

Europe, the future of apps, and work

~15:15 — On Europe's AI gap, he blames capital and legal/tax structure, not regulation: "In the US, you set up a Delaware corporation and that just works." ~27:16 — He predicts single-purpose apps (fitness/diet trackers) are dead, absorbed by personal agents building "malleable, self-modifying software" on the fly, while infrastructure-heavy apps like Spotify survive. ~31:17 — On work, he's a Jevons-paradox believer (agents make people more productive rather than replacing them) but predicts a painful "chop-pocalypse" squeezing 40-50+ non-adopters and juniors, favoring "universal tokens" over UBI.

Coding workflow and the limits of LLMs

~38:21 — He runs at most four terminals driven by prompt templates keyed to GitHub issues, delegates simple features but does refactorings by hand because agents "fix things locally and then things explode globally." His thesis: syntax no longer matters, but architecture does — a warning to juniors. On creativity, an LLM only interpolates within its training-data "cloud"; full automation won't arrive via LLMs because design process and the top 0.01% of human taste are rarely encoded as tokens, and good code is statistically overwhelmed by garbage during training.

90% of the code that is in the training data is garbage. The 10% which are pristine have so little statistical power during training that they don't get reinforced into the model weights.

Tools: Pi (pi.dev), Claude Code, Cursor, OpenCode, Kimi K2.6, DeepSeek v4, GPT-5.5, Codex, Gemma 4, GitHub

Podcast

Lenny's Podcast

Lenny's Podcast × Dan Shipper: the AI paradox — more automation, more work

Dan Shipper, CEO of Every (~30 people, doubled in a year), lays out contrarian predictions: work will split between a company "super agent" in Slack and agents-with-browsers like Codex/Claude Co-work as the new OS; SaaS will boom rather than die; and AI creates more human work, not less.^{[8]Lenny's Podcast — Dan Shipper on the AI Paradox}

~03:04 — A year after correctly calling that people were "sleeping on Claude Code" for non-engineering work, Shipper returns: Every is now ~30 people, runs ~6 internal products, and everyone works by talking to their computer in English via Codex/Claude Code.^{[8]Lenny's Podcast — how Every operates} ~10:09 — He cites METR's benchmark (the big Anthropic model can do ~17-hour tasks at 50% accuracy) but frames work as a paradox: "we have so much automation, so much AI, and I also work way more."

Automation is a lie. Every agent needs a human.

~11:09 — Prediction 1: work bifurcates into (a) a company-wide "super agent" you delegate to in Slack (citing Shopify's "River" and Ramp's agent), having flipped from his earlier belief in per-person agents because OpenClaw-style harnesses are too much upkeep; and ~18:11 (b) Codex/Claude Co-work as the OS for all work, with an in-app browser. ~23:12 — Second-order effect: SaaS runs inside the agent using the user's own tokens, so "I would buy SaaS stocks right now."

I would buy SaaS stocks right now. The SaaS apocalypse is dumb.

~40:25 — On his "senior engineer benchmark" (rewriting his vibe-coded Proof editor from first principles), models scored ~30/100 until GPT-5.5 jumped to 62, versus human seniors at high-80s/low-90s. ~49:31 — Shape of work: PRs skyrocket (Pete on OpenClaw spins up ~50,000 Codex instances, merges ~1,000/day). ~67:40 — Who wins: PMs and full-stack designers; his advice is to "ride the models." He notes Cursor was effectively acquired by SpaceX and that "we speed ran the CLI era."

What models do in general is they make yesterday's human competence cheap.

Tools: Claude Code, Claude Co-work, Codex, OpenClaw, Cursor, Slack, Proof, Spiral, GPT-5.5, Opus 4.7, METR benchmark, Shopify River, Ramp

Podcast

Better Stack

Better Stack Podcast #15: why Claude recommends Resend 70% of the time

Chris Pennington, DX engineer at Resend, explains how the email API company became the default Claude recommendation (~70% of the time) via SEO basics, being early to MCP/skills, and recency bias — plus how Resend runs scoped agents, his productivity system, email deliverability, and an Astro hot take.^{[9]Better Stack Podcast #15 — Why Claude Recommends Resend}

~01:01 — Pennington's React Email and Resend videos in 2023 put him on the team's radar; he was the 10th hire at what is now a ~45-person company.^{[9]Better Stack — Resend journey} ~19:11 — The headline: why Claude recommends Resend ~70% of the time. The playbook: nail SEO basics (single H1, JSON-LD), ship llms.txt (their pricing page returns markdown on curl), use Q&A accordions, do "best email API" listicle trades, and be first to MCP and agent skills.

Right now I think it's about 70% of the time that Claude will suggest us.

~28:14 — Resend runs several scoped agents (the marketing one is named "Hermes"), each on its own instance with its own 1Password vault, Twitter, and GitHub accounts, scoped tightly via API key management and a Slack-only interface. ~34:19 — His productivity stack (Raycast, OmniFocus, Obsidian synced to GitHub) and blog post "predictability is a superpower" advocate planning only until ~1pm and leaving margin. ~40:22 — On deliverability: Gmail weights domain reputation over IP; he recommends sending from a subdomain, DMARC/SPF/DKIM, and no cold emails.

~56:28 — Hot take: "every site should default to Astro." React Server Components have repeated security vulnerabilities and a confusing mental model (Dan Abramov explained RSC using Astro); he praises TanStack Start's explicit server components.

For me, every site should default to Astro. That's my hot take.

Tools: Resend, React Email, Claude, Claude Code, MCP, agent skills, llms.txt, Supabase, Railway, Fly.io, Vercel, Raycast, Obsidian, OmniFocus, Linear, Astro, TanStack Start, Google Postmaster Tools

Podcast

Latent Space

Sunil Pai at AI Engineer: why you should build science fiction

Cloudflare's Sunil Pai (creator of the Agents SDK) covers Cloudflare's agentic primitives — Durable Objects and Dynamic Workers — argues no one has built the "React moment" for agentic frameworks yet, and makes the case for radical originality over incremental improvement.^{[10]Latent Space — Sunil Pai at AI Engineer}

~00:03 — They open on breaking news: Anthropic's just-launched Managed Agents platform. Sunil likes the team but wants to compete with Workers and Durable Objects.^{[10]Latent Space — competitive angle} ~01:03 — Cloudflare's two primitives: Durable Objects (the actor model at the infrastructure layer) and Dynamic Workers (safe LLM-generated code execution with zero startup, default-deny outbound). ~03:03 — A demo: Cloudflare's MCP server for their 2,600-endpoint API reduced to two tool calls — search and execute — where each submits JavaScript running in an isolate.

~04:03 — On the harness of the future: no one has built the React equivalent for agentic software yet; he speculates skills (English-language Markdown) might be the universal translation layer. ~06:04 — "Slop forks": he recounts the Versel Just Bash incident, where he used Claude Opus to port Just Bash to Cloudflare on vacation and woke up to management DMs after the Versel CTO publicly criticized the fork. ~11:07 — Open-source repos have become "adversarial grounds" with fake security reports a top attack vector. ~13:09 — His closing: in a world where any idea can be vibed into existence, original thinking is the scarcest resource.

Build sci-fi stuff. Build stuff like for your family. You own so much agency in changing the world and I want people to just be original.

Tools: Cloudflare Workers, Durable Objects, Dynamic Workers, Cloudflare Agents SDK, Vercel AI SDK, Claude Opus, Just Bash, MCP

Podcast

AI Engineer

Sawhney & Ballantyne at AI Engineer: how Google DeepMind runs agents at scale

Google DeepMind engineers KP Sawhney and Ian Ballantyne discuss how DeepMind runs agents internally at scale using the Antigravity harness, covering quota management, skills libraries, observability, and auto code review — a live demo plus audience Q&A.^{[11]AI Engineer — How Google DeepMind Runs Agents at Scale}

~00:14 — A live demo of Antigravity — DeepMind's internal IDE/agent harness — shows multi-agent spawning, browser automation with DOM inspection, and a human-in-the-loop plan review, using Gemini models.^{[11]AI Engineer — Antigravity demo} ~04:15 — KP is generalizing the harness beyond coding, replacing huge context blobs in the Deep Research pipeline with a shared file system where components act as collaborators.

~08:20 — Token quota is the primary constraint: DeepMind enforces hard quotas, and the near-term fix is seamless fallback to Gemma 4 (free on internal hardware). ~13:36 — Observability: a custom hierarchical UI drillable down to raw predict requests, plus an agent trajectory store to diagnose looping. ~17:45 — On MCP vs skills, KP calls MCP possibly "a flash in the pan" and prefers skills with guardrail CLI interactions, curated Darwinian-style. ~22:52 — Auto code review uses per-language fine-tuned models on internal style guides; engineers now submit trillions of agent-generated lines through GitHub.

We have worse limits than you do because obviously we prioritize customers and not ourselves.

Tools: Antigravity, Gemini (Flash, Pro, Ultra), Gemma 4, Deep Research (Interactions API), Jewels, MCP

Podcast

AI Engineer

Adrian Bertagnoli at AI Engineer: scaling heterogeneous intelligence

Callosum founding engineer Adrian Bertagnoli argues AI scaling is shifting from homogeneous (single model, identical chips) to heterogeneous (mixed models, mixed hardware, mixed workflows), presenting benchmarks showing 7-18x cost reductions and 3-5x speed improvements by routing subtasks to purpose-fit models and chips.^{[12]AI Engineer — Scaling the Next Paradigm of Heterogeneous Intelligence}

~00:14 — The neural scaling laws era optimized homogeneous systems, but as AI moves to inference that's inefficient. Three heterogeneity signals are already visible: mixture-of-experts (architecture), multi-agent systems (workflow), and prefill-decode disaggregation (hardware).^{[12]AI Engineer — heterogeneity signals} ~02:16 — A three-stage trajectory from mild to full heterogeneity, formalized via a "principle of maximum heterogeneity."

~06:22 — Demo 1, heterogeneous recursion on the ULong benchmark: vs GPT-5.2 (~2,000s, $3.75/task), their Cerebras-routed system is 7x cheaper and 5x faster; the SambaNova variant 12x cheaper and 3x faster. ~10:24 — Demo 2, visual web navigation on Video WebArena: a mixture of Qwen3 VL8B + Kimi 2.5 + GPT-5.2 beats GPT-5.2 by 18% and Gemini 2.5 by 25%; routing simple subtasks to smaller models yields 11x speed and 43x cost gains. ~12:27 — Three eras of compute (CPU, GPU, heterogeneous); Callosum has a £3M ARIA grant for the first heterogeneous collocated cluster in the UK.

What comes next is heterogeneous intelligence where models, workflows, and silicon co-evolve and every new source of diversity makes the whole system smarter, faster, and cheaper.

Tools: Cerebras, SambaNova, Qwen3 VL8B-Instruct, Kimi 2.5, GPT-5.2, Gemini 2.5

Podcast

LangChain

Harrison Chase at Interrupt 26: the future of agents & LangSmith Fleet

LangChain's keynote imagines Interrupt 2027, predicting a split into long-horizon vs latency-sensitive agents, the rise of voice and open models, agent identity, and continual learning — then announces LangChain Labs and demos LangSmith Fleet, a no-code managed agent builder.^{[13]LangChain — The Future of AI Agents | Interrupt 26}

~00:06 — Harrison Chase predicts a divergence into long-horizon agents (running minutes to days) and latency-sensitive customer-experience agents where brand and voice matter; all agents will need a sandbox — "giving the marketing team a software engineer."^{[13]LangChain — two agent types} ~02:08 — On voice, the current STT/TTS "sandwich" vs emerging native speech-to-speech. ~04:08 — A predicted rise in open models, with deep-agents benchmarking showing open base models approaching frontier performance.

~05:09 — Agent identity splits into agents acting on user credentials vs fixed service accounts. ~07:09 — The core thesis is continual learning across model, harness, and context layers (citing a Meta-Harness paper where an agent optimized a coding harness on Terminal Bench 2 and beat human-written ones). He announces LangChain Labs. ~12:14 — LangSmith Fleet, a no-code agent builder with 200+ built-in tools, an Arcade partnership adding 7,500 tools, MCP support, and native Slack/Gmail/Outlook; a live go-to-market demo cites 84% weekly usage, lead-to-qualified conversion up 240%, and ~40 hours saved per rep per month.

We moved from top 30 on Terminal Bench 2 to top 5 just by changing the harness itself. No changes to the model.

Tools: LangSmith Fleet, LangChain Labs, LangSmith, Deep Agents, Arcade, MCP, Salesforce, BigQuery, Slack, Gmail, Outlook, Fireworks, Terminal Bench 2

Podcast

Dwarkesh Patel

Dwarkesh × David Reich: following the Yamnaya trail into India

Geneticist David Reich explains that Yamnaya ancestry is diluted to just 5–20% in India today, but it still reliably traces the spread of Indo-European languages and culture across the subcontinent.^{[14]Dwarkesh Patel — Following the Yamnaya Trail into India}

Reich describes the genetic trail of the Yamnaya — the steppe pastoralists linked to Proto-Indo-European — as they moved into South Asia. By the time Yamnaya-derived populations crossed Central Asia and entered northern South Asia, successive admixture had diluted the ancestry dramatically: even the highest Yamnaya ancestry in India sits around 20%, with most groups under 10% or even 5%.^{[14]Dwarkesh Patel — Yamnaya dilution} Despite the dilution, the signal remains meaningful: that small percentage tracks which populations speak Indo-European languages and share cultural elements with distant peoples on the other side of Eurasia. A low ancestry fraction, he emphasizes, shouldn't be dismissed.

This 5% you shouldn't sneeze at it, right? Like that's tracing something important.

Hot Take

AI News & Strategy Daily | Nate B Jones

Nate B Jones: AI memory is becoming a lock-in mechanism

Nate B Jones argues AI companies are deliberately using memory features to trap users: the longer you use one platform, the more context you accumulate that can't be transferred to a competitor, making switching practically impossible.^{[15]Nate B Jones — Why switching AI models is now impossible}

Jones argues memory features on platforms like ChatGPT are not primarily about user benefit — they are lock-in strategies. As users build up conversation history, that context becomes trapped; switching to Gemini or Claude means starting from zero, not because the new model is inferior but because the old platform holds your history hostage.^{[15]Nate B Jones — memory lock-in} He notes this memory is not agent-readable, which compounds the problem as autonomous agents become more prevalent.

The big corporations are betting that if they can trap you with memory, you will only use their agents, and they will get to keep you and your attention and your dollars forever.

AI Tools

AICodeKing

Antigravity 2.0: bug fixes, a 3x rate-limit boost, and doubled Flash context

Antigravity (Google's AI coding agent) shipped bug fixes and UX improvements including OAuth persistence in the CLI, a 3x+ rate-limit increase, and doubled context length for Gemini 3.5 Flash. The reviewer is cautiously optimistic but flags missing cross-sync between the UI and IDE.^{[16]AICodeKing — Antigravity & AGY CLI New Upgrades}

Fixes addressed: projects failing to migrate from Antigravity 1 when the thread title contained C, J, or K (a string-escaping bug), duplicate projects on import, and Google One credits not applying. On the CLI side, OAuth credentials weren't persisting between sessions — now resolved.^{[16]AICodeKing — bug fixes} Varun Mohan and the CLI lead noted that, unlike Claude Code which auto-detects terminal configs, AGY CLI requires manual configuration under "color scheme."

New: a one-click "Install/Open IDE" button, a "proceed in sandbox" permission mode that auto-approves terminal commands, consumer onboarding in the CLI, and an env var to hide email/plan tier during demos. Limits were raised 3x, then boosted a further 3x for a week ("pretty much unlimited for a week"), and Gemini 3.5 Flash context was doubled. The reviewer notes model quality is harder to fix and cross-sync between web UI and IDE still doesn't work.

It's pretty much unlimited for a week.

Tools: Antigravity 2.0, AGY CLI, Antigravity IDE, Gemini 3.5 Flash, Claude Code

Developer Tools

Better Stack

Skybridge 1.0: an open-source framework for interactive MCP apps

Alpic released Skybridge 1.0, an open-source TypeScript framework for building MCP apps — interactive React widgets that run inside ChatGPT or Claude, with shared state between the human user and the AI. The demo builds and deploys an e-commerce camera-lens store MCP app end-to-end in minutes.^{[17]Better Stack — MCP Apps Are Changing the Internet (Skybridge)}

Skybridge handles the protocol bridging, state sync, and security rules needed for MCP apps — a paradigm where the same UI is shared by both the human and the AI assistant, so a human click is instantly visible to the LLM and vice versa.^{[17]Better Stack — Skybridge} Version 1.0's headline feature is a browser-based emulator dashboard with three tools: an Alpic playground for testing widgets with live HMR (no LLM required), a single-click integrated tunnel exposing the local server via a public URL, and a Beacon audit tool that scans metadata and security policies to catch app-store rejection triggers. The demo builds a lens-search app (search by price, compare, checkout) entirely by prompting Claude with the Skybridge skill installed, then connects it to Claude via the Connectors panel.

With an MCP app, you're actually building for two users at once, the human and the AI assistant, because they both share the exact same interface.

Tools: Skybridge, Alpic, Claude, ChatGPT, React, Alpic Playground, Beacon Audit Tool

AI Models AI Future

AI Search

AI News Roundup: co-scientist, DNA models, open robots & more

AI Search's roundup covers ~19 model and tool releases. The most notable: Google DeepMind's multi-agent AI co-scientist published in Nature, Carbon (an open DNA foundation model claimed 275x faster than EVO 2 Medium), HuggingFace's $2,500 open-source 3D-printed humanoid, Alibaba's agentic Qwen 3.7 Max, and Tencent's HYMT2 translation MoE beating the larger DeepSeek V4.^{[18]AI Search — AI co-scientist, AI for DNA, open-source robots, new Qwen}

Google DeepMind AI co-scientist — a multi-agent research partner that generates hypotheses, reviews literature, identifies gaps, and proposes experiments via internal agent debate; published in Nature with drug-discovery examples for liver fibrosis. ~23:18
Carbon (DNA foundation model) — an open DNA language model processing up to ~400,000 base pairs, claimed 275x faster than EVO 2 Medium and able to run the full human genome on a single GPU in under 2 days (500M–8B params, GGUF available; EVO 2 still leads on accuracy). ~10:04
HuggingFace open-source humanoid robot — full 3D-printable designs, parts list, assembly docs, simulator, and training software for ~$2,500 in parts; experimental sim-to-real hardware, not a consumer product. ~33:29
Qwen 3.7 Max (Alibaba) — an agentic coding/reasoning model on par with DeepSeek V4, GLM 5.1, and Kimi K2.6, with vision usable for robot-dog navigation; integrates with Claude Code and OpenClaw, available via Cloud Model Studio. ~27:18
HYMT2 (Tencent) — a 1.8B/7B/30B-MoE (3B active) multilingual translation family across 33 languages, beating the larger DeepSeek V4 on instruction-following and domain benchmarks. ~18:13
Lance (ByteDance) — a 3B unified multimodal model for text-to-video, video/image editing, and visual Q&A (needs 40GB VRAM). ~01:00
Apple LTO — single-image 3D reconstruction with view-dependent appearance, outperforming Trellis on average accuracy. ~03:00
Flash GRPO — a faster human-preference alignment technique for video diffusion, sampling one timestep per update. ~04:01
Reactive GWM — an AI game world model with promptable NPC behavior via cross-attention (built on CogVideoX 1.2.2). ~06:03
L2P — an open pixel-space image diffusion model (no VAE/latent) with 8K extrapolation, SOTA among pixel-based models. ~08:03
Meituan LongCat Video Avatar 1.5 — a more stable, expressive talking-avatar generator with multi-speaker support (16GB int8). ~12:07
Mega ASR — noisy-audio speech recognition with ~30% lower error rates, trained on 2.6M samples across 7 distortion types (under 5GB). ~15:12
Qwen 3.5 Live Translate — real-time speech translation using visual context across 60 languages (free demo). ~28:19
Marlin 2B — a compact video-language model for timestamped event extraction, matching Gemini 2.5 Flash on captioning (under 6GB). ~25:18
CogOmniControl — ControlNet-style multi-reference video generation (paper only so far). ~35:29
Meta WaveFlow — raw-waveform video-to-audio generation competing with MM Audio (production weights withheld). ~37:30
PanoWorld — floor-plan-to-3D panorama home-tour generator with cross-room spatial consistency (coming soon). ~40:33
Stability AI Stable Audio 3 — open text-to-music models (small: 2-min, medium 1.4B: 6m20s) with LoRA and inpainting. ~42:34
Alibaba Fashion Chameleon — real-time video virtual try-on at ~24 FPS, claimed 30–180x faster than baselines. ~44:43

Tools: AI co-scientist, Carbon, EVO 2, HuggingFace Humanoid Robot, Qwen 3.7 Max, HYMT2, DeepSeek V4, Lance, Apple LTO, Flash GRPO, Reactive GWM, L2P, LongCat Video Avatar 1.5, Mega ASR, Qwen 3.5 Live Translate, Marlin, CogOmniControl, Meta WaveFlow, Stable Audio 3, Fashion Chameleon

Developer Tools

The Pragmatic Engineer

Anders Hejlsberg: C# was designed by a small, adversarial team of six

Anders Hejlsberg describes how C# was designed by a team of six or seven people who met three times a week for two hours, specifically tasked with trying to shoot down each other's ideas.^{[19]The Pragmatic Engineer — Anders Hejlsberg: C# was designed by 6 people}

C# was not the product of a large committee. The design team was around six or seven people — all with prior experience building languages — who met three times a week in two-hour sessions. The culture was adversarial by design: if someone proposed an idea, the team's job was to find what was wrong with it, and only ideas that survived that stress-testing were accepted.^{[19]The Pragmatic Engineer — adversarial design}

If someone comes up with a new idea, now it's our job to try to shoot it down. What's wrong with this idea? And if it could stand the test of that, then it was probably a decent idea.

Tools: C#

Developer Tools

Arjay McCandless

System design: a hybrid push/pull Instagram feed

A mock system design interview walks through designing Instagram's feed: naive pull-on-read collapses at scale, a pure push-to-Redis fan-out breaks on celebrity accounts, so the solution is a hybrid — push for normal users, pull-from-DB for high-follower accounts.^{[20]Arjay McCandless — System Design: Instagram Feed}

The naive approach — query the database for all posts from followed accounts on every open — overwhelms the database at millions of DAU. The first improvement is fan-out on write: when a user posts, push it into a pre-built Redis feed queue for every follower, so reads are instant memory lookups.^{[20]Arjay McCandless — fan-out} That breaks for celebrity accounts (Taylor Swift) — pushing to hundreds of millions of queues per post is impractical. The final design is hybrid: regular users get fan-out on write, while high-follower accounts use fan-out on read (posts stay in the DB), and the backend merges the pre-computed Redis feed with a real-time pull of any celebrity posts at load time.

Tools: Redis

Developer Tools

Real Python

The DRY principle in Python, in two minutes

A quick intro to the DRY (Don't Repeat Yourself) principle from The Pragmatic Programmer: if you're copying and pasting code, you're probably violating it, and the fix is to extract a single abstraction.^{[21]Real Python — DRY: The Python Principle That Cleans Up Your Code}

The core test is simple: if you find yourself copying and pasting code, or if multiple blocks look very similar, you likely need to consolidate them into a single abstraction — a function, class, or module.^{[21]Real Python — DRY} The goal is one authoritative source of truth for any given piece of logic, so a change only needs to be made in one place.

Cursor's Composer 2.5: near-frontier code at a fraction of the cost

The subsidization war Cursor can't win on price

Training: targeted RL, synthetic tasks, reward hacking

Live demo and the SpaceX kicker

AI's new acceleration phase: Anthropic's profitable quarter, the Erdős problem, Karpathy to Anthropic

Profitability and the end of the subsidy era

Filling the gap; consumer scale

The Erdős breakthrough, Karpathy, and policy whiplash

Nate B Jones: the AI boom is about to hit a wall — below the GPU

'Capacity constrained' decoded

The factory and its bottlenecks

Falling costs, Jevons, and the bubble test

Datasette ships a Jump-to menu, an LLM agent, and test fixtures together

Armin Ronacher: AI-rewritten 'slop issues' are killing open-source bug reports

The K-shaped economy splits travel: premium surges, budget collapses

Mario Zechner: Building the Pi coding agent & the 'rich man's game'

Why he built Pi

The rich man's game and open weights

Europe, the future of apps, and work

Coding workflow and the limits of LLMs

Lenny's Podcast × Dan Shipper: the AI paradox — more automation, more work

Better Stack Podcast #15: why Claude recommends Resend 70% of the time

Sunil Pai at AI Engineer: why you should build science fiction

Sawhney & Ballantyne at AI Engineer: how Google DeepMind runs agents at scale

Adrian Bertagnoli at AI Engineer: scaling heterogeneous intelligence

Harrison Chase at Interrupt 26: the future of agents & LangSmith Fleet

Dwarkesh × David Reich: following the Yamnaya trail into India

Nate B Jones: AI memory is becoming a lock-in mechanism

Antigravity 2.0: bug fixes, a 3x rate-limit boost, and doubled Flash context

Skybridge 1.0: an open-source framework for interactive MCP apps

AI News Roundup: co-scientist, DNA models, open robots & more

Anders Hejlsberg: C# was designed by a small, adversarial team of six

System design: a hybrid push/pull Instagram feed

The DRY principle in Python, in two minutes

Sources