GPT-5.5 takes the throne; Claude Code can't catch a break

April 25, 2026

22 topics · 31 sources

AI Models AI Tools Hot Take
Simon Willison AI Search

GPT-5.5 lands and is "a total freak"

Two days after release, GPT-5.5 sits at #1 on Terminal-Bench (~12 percentage points ahead of Claude Opus 4.7), Artificial Analysis, LiveBench, and ARC-AGI2 (85%) — at roughly 2x the price of GPT-5.4 Extra High, with a 922K-token context.[1]AI Search — GPT-5.5 is a total freak. OpenAI's Romain Huet confirmed there will be no separate GPT-5.5-Codex; Codex was folded into the main model line back at GPT-5.4.[2]Simon Willison — Quoting Romain Huet. OpenAI also published a fresh prompting guide whose central recommendation is counterintuitive: don't port your old prompts — start over from a fresh baseline.[3]Simon Willison — GPT-5.5 prompting guide. The big asterisk: an 86% hallucination rate on SimpleQA versus Opus 4.7's 36%.[1]AI Search — GPT-5.5 is a total freak

Read more

Codex is no longer a separate product

Romain Huet (OpenAI) clarified for anyone wondering whether OpenAI would ship a coding-specialized GPT-5.5: "Since GPT-5.4, we've unified Codex and the main model into a single system, so there's no separate coding line anymore."[2]Simon Willison — Quoting Romain Huet. He pitches GPT-5.5 as carrying that further "with strong gains in agentic coding, computer use, and any task on a computer."

OpenAI's prompting guide: rebuild, don't port

Simon Willison flags the most surprising recommendation in OpenAI's official prompting guide: "Begin migration with a fresh baseline instead of carrying over every instruction from an older prompt stack."[3]Simon Willison — GPT-5.5 prompting guide. Start minimal, then iteratively tune reasoning effort, verbosity, tool descriptions, and output formats. A second concrete UX recommendation: "Before any tool calls for a multi-step task, send a short user-visible update that acknowledges the request and states the first step." OpenAI also ships a Codex-app skill that runs "$openai-docs migrate this project to gpt-5.5" against your codebase.

What it actually does (per AI Search's stress test)

~01:00 In the Codex app, GPT-5.5 built a fully interactive 3D Earth digital twin with zoom from space to street view in roughly three prompts, and a browser-based ray-tracing simulation with adjustable material sliders in another three.[1]AI Search — GPT-5.5 is a total freak. ~10:07 Two prompts produced a fully functional 3D third-person mecha-vs-aliens shooter on three.js with multiple waves and levels. ~18:24 An agentic prompt scraped three California roofing companies that had email but no website, then built a custom landing page for each with linked CTA buttons — all in about three minutes.

Benchmarks and the hallucination caveat

~22:27 GPT-5.5 takes #1 on Terminal-Bench (beating Opus 4.7 by ~12pp), Artificial Analysis (both Extra High and High variants over Opus 4.7 Max), LiveBench, and ARC-AGI2 (85%, the highest score on the leaderboard). It uses fewer tokens than GPT-5.4 while scoring higher. Context window: 922K tokens (~700K words). Pricing is roughly 2x GPT-5.4 Extra High and slightly above Opus 4.7 Max.

~25:27 The headline caveat: GPT-5.5 Extra High hallucinates 86% of the time on SimpleQA, against Opus 4.7's 36% and even-lower scores from open-weight models like GLM 5.1. Medical imaging tests were mixed — 3/4 chest CT lesions identified correctly, but 0/6 brain tumors classified correctly.

"At least to me, this is noticeably better than Opus 4.7. It just handles things more autonomously. It makes fewer mistakes, and it just runs a bit smoother. At least that's the vibe that I got." — AI Search
"If factual accuracy is super important for you, like for example if you're working in medical research or law, then GPT 5.5 might not be the best option for you." — AI Search
Tools: GPT-5.5, Codex app, ChatGPT, three.js, Artificial Analysis leaderboard, LiveBench, ARC-AGI2 leaderboard
AI Tools Hot Take AI Future
Nate B Jones Simon Willison

ChatGPT Images 2.0 collapses three creative roles into one prompt

GPT Image 2 won 93% of blind pairwise comparisons in Image Arena — a 26-point gap over Google's Nano Banana 2 — by wrapping a thinking pass, live web search, and a self-verification step around the raw image model.[4]Nate B Jones — ChatGPT Images Just Replaced Three People on Your Team. Nate B Jones argues that this collapses the first-draft researcher, copywriter, and layout designer into a single prompt — and that the same machinery now produces convincing forgeries of receipts, Slack screenshots, boarding passes, and pharmacy labels from a free ChatGPT account. Simon Willison's contribution: ChatGPT Images 2.0 spontaneously generated a "WHY ARE YOU LIKE THIS" road sign that was never in the prompt — proof, in his telling, that the model now exercises something like editorial judgment.[5]Simon Willison — WHY ARE YOU LIKE THIS

Read more

The benchmark, and why it's a bigger gap than usual

~00:00 93% of blind pairwise comparisons in Image Arena vs Nano Banana 2 at 67% — image-generation leaders normally trade places by 3–4 points, so 26 points is unprecedented. Inkdrop's Takuya Matsuyama fed the model his app summary, V6 release notes, and blog posts about Japanese aesthetics in a single prompt and got back a complete Hokusai-inspired landing page mock-up — typography, hero illustration, and feature grid — in his actual written voice.[4]Nate B Jones — ChatGPT Images Just Replaced Three People on Your Team

Three mechanisms — think, search, verify

~02:01 Thinking mode: a Pro/reasoning model spends 10–20 seconds planning composition, typography, object placement, and constraint satisfaction before committing pixels. ~03:01 Web search inside the generation loop: a demo rendered a geologically accurate depth chart of the Strait of Hormuz as a Richard Scarry children's book illustration; the knowledge cutoff is December 2025, but the model self-retrieves anything it's uncertain about. ~04:02 Eight coherent frames from a single prompt — Sam Altman demoed an eight-panel manga with consistent characters — replacing the old generate/screenshot/feed-back/stitch workflow. A self-verification pass corrects typos between first and second generation.

Four newly viable workflows

~05:02 A single session produced a French fashion magazine cover, a Japanese restaurant menu with hiragana and kanji (vertical flow respected), and a high-density Russian annotation — zero spelling errors. UI specs become Codex rendering targets: PMs describe a settings page, the model renders the mock-up, the coding agent implements against it inside the same environment. Microsoft Foundry demoed a fictional flower-brand subway-car ad campaign from a photo of an empty car in three prompts.

The adversarial twin

~09:04 With one prompt and a free ChatGPT account, you can now forge a restaurant receipt with a specific date and time, a Slack screenshot with a real user's avatar, a boarding pass for a real flight, a pharmacy label with a real drug and dose, a government notice on real letterhead, or a competitor menu with undercut prices. Text renders at 99% accuracy; over 70% of blind testers during the Arena rollout believed the outputs were real photos. Content credentials and watermarking don't survive a screenshot and recrop.

GPT Image 2 vs Claude Design — same shift, different output primitive

~11:05 Anthropic shipped Claude Design four days earlier; the underlying insight is identical (reasoning joined the visual stack). OpenAI keeps pixels as the primitive; Anthropic skipped pixels entirely and renders editable HTML directly implementable by Claude Code. Decision rule: pick GPT Image 2 for end-state assets (posters, menus, packaging, social posts), pick Claude Design for working prototypes (landing pages, dashboards, interactive mocks).

Three structural shifts

~13:08 (1) First-draft researcher + copywriter + layout designer now collapse into one prompt — the word-processor moment for design. (2) Image generation is now an agent-callable subroutine: the next consumer is Claude's tool loop or Codex, not a human. (3) The image itself is a compressed reasoning trace; auditing AI visuals is a different discipline because failure modes shifted (an image can be wrong because the source was wrong, not just hallucination).

~17:10 Role-by-role: design leadership reweights toward briefing/brand/QA; founders/solo operators get "what a five-person agency did for you a month ago" for $20/month; trust/risk teams need to red-team their own evidence baselines now.

Simon Willison's stack

Willison fed ChatGPT Images 2.0 a prompt for "a horse riding an astronaut riding a pelican on a bicycle" and the model independently added a road sign reading "WHY ARE YOU LIKE THIS" — never in the prompt.[5]Simon Willison — WHY ARE YOU LIKE THIS

"Generation became a reasoning workload, and 93% of blind human judges could feel the difference without anybody explaining to them why." — Nate B Jones
"The work of a five-person agency a month ago is now a $20 subscription and a good brief." — Nate B Jones
"The new ceiling is specification. Your leverage now depends on how precisely you can describe the layout, the typography hierarchy, the text content, the constraints, the reference material, the audience, the format." — Nate B Jones
Tools: GPT Image 2, ChatGPT, Claude Design, Claude Code, Codex, Inkdrop, Microsoft Foundry, Figma, Canva
AI Models Industry Hot Take Developer Tools
Better Stack

Claude Code's rough week: regression, silent price hike, and the Caveman fix

Anthropic's postmortem (April 23) confirmed Claude Code regressed for about a month due to three harness-level changes — default reasoning was silently downgraded from high to medium, a bug dropped reasoning context after idle sessions, and a verbosity-cutting system prompt change had to be reverted because it hurt code quality.[6]Better Stack — Claude ACTUALLY got dumber. Separately, Anthropic was caught testing the removal of Claude Code from the $20/mo Pro plan entirely (Max-only) before its head of growth said it was a 2% test and reverted the page.[7]Better Stack — The Claude Price Hike They Didn't Announce. And on the verbosity question, a community-built "Caveman" prompt skill is making the rounds as the actual fix.[8]Better Stack — Kevin was right about Claude

Read more

The harness regression

Anthropic's three-cause postmortem: (1) default reasoning effort was downgraded from high to medium for latency, cutting capability on harder coding tasks; (2) a bug dropped accumulated reasoning context after every message in an idle session, producing forgetful and repetitive behavior; (3) a system-prompt change designed to reduce verbosity was found to hurt code quality and was reverted.[6]Better Stack — Claude ACTUALLY got dumber. The model itself didn't get dumber — the harness did.

"It wasn't actually Claude the model that got dumber, it was the harness, Claude code." — Better Stack
"It's kind of insane to me that they don't test these things before pushing out these changes." — Better Stack

The silent Pro-plan test

~00:00 Pro subscribers ($20/mo) discovered via screenshots on X and Reddit that Claude Code had been quietly removed from their plan. Anthropic's head of growth later said this was a limited test on 2% of new sign-ups; the pricing page was reverted.[7]Better Stack — The Claude Price Hike They Didn't Announce. Stated rationale: the $20 plan wasn't designed for multi-hour agent workflows or always-on coding sessions. Implied rationale (per Better Stack): compute constraints — Pro users are reportedly hitting caps after just a few prompts at peak hours — and a signal that Claude Code becomes Max-only or that Pro pricing rises.

The Caveman skill

~00:00 A prompt skill called "Caveman" tells Claude or Codex to drop filler phrases like "you're absolutely right" and reply tersely while preserving technical detail.[8]Better Stack — Kevin was right about Claude. Configurable conciseness levels include an extreme "Wenglall mode" advertised as the most token-efficient language. The pitch is fewer output tokens, faster scanning, and more usable headroom inside a usage cap.

"Why waste time say lot word when few word do trick?" — Caveman skill
Tools: Claude Code, Claude Pro, Claude Max, Codex, Caveman skill
AI Future Hot Take
Nate B Jones

AI is now writing AI: Codex 5.3 and Claude Code near full self-coding

Per Nate B Jones, OpenAI describes Codex 5.3 as "the first frontier AI model that was instrumental in creating itself" — earlier Codex builds analyzed training logs, flagged failing tests, and suggested fixes, with the resulting model showing 25% speed gains and 93% fewer wasted tokens.[9]Nate B Jones — AI models writing their own successors. Anthropic has been even more direct: 90% of Claude Code is itself written by Claude Code, with that figure converging toward 100%. Boris Trenne (Anthropic) says he hasn't written code in months — his role is now specification, direction, and judgment.

Read more

~00:00 Specific claims: Codex 5.3 is described as the first frontier model "instrumental in creating itself" — earlier Codex iterations analyzed training logs, flagged failing tests, and suggested fixes to training scripts, contributing to a 25% speed improvement and 93% reduction in wasted tokens. Anthropic estimates the entire company will move to entirely AI-generated code around April 2026.

"Codex 5.3 is the first frontier AI model that was instrumental in creating itself, and that's not a metaphor." — Nate B Jones
"90% of the code in Claude Code, including the tool itself, was built by Claude Code, and that number is rapidly converging toward 100%." — Nate B Jones
"Boris Trenne isn't joking when he talks about not writing code in the last few months. He's simply saying his role has shifted to specification, to direction, to judgment." — Nate B Jones
Tools: Codex 5.3, Claude Code
Podcast AI Future Hot Take
Y Combinator

Replit's Amjad Masad on the only two jobs left in the company of the future

Amjad Masad walks YC through Replit's pivot from dev environments to "vibe coding" and lands the headline thesis: in the company of the future, only two roles remain — builders and salespeople. Sales survives because customers still want to talk to humans they trust; builders persist because every employee becomes a generalist entrepreneur deputizing agents to solve their own problems.[10]Y Combinator — Replit's CEO On The Only Two Jobs Left. He sketches a "post-prompting world" where you tell Replit "every day build me a SaaS company and try to market it and make me some revenue" and flags computer-use models and continual learning as the two capabilities still gating full autonomy.

Read more

~01:04 Replit's pivot to vibe coding

September 2024: Replit became the first product to abstract code entirely behind a natural-language agent and reframe the audience away from traditional engineers.

~03:07 A billion new developers — abandoning the dev audience

Masad explicitly walks away from React/Webpack stacks. "VB6 was better than setting up React and Webpack." The audience is product managers, designers, and entrepreneurs — "AI-native developers."

~07:10 Domain experts as builders

Concrete examples: a physical therapist building a 3D body-scan app after burning hundreds of thousands on offshore devs; pool-maintenance SaaS; sports clubs running on MS-DOS migrating to Replit-built tools.

~09:10 Enterprise: Whoop tries 10x more ideas; revops kills SaaS silos

Whoop testing 10x more product experiments; revops teams replacing six-figure SaaS spend with internal Replit builds.

~28:19 The post-prompting world

"You should be able to tell Replit every day build me a SaaS company and try to market it and see what works and make me some revenue." — Amjad Masad

~31:21 What's still missing: computer-use models and continual learning

"Coding turned out to be a bit of a hack or workaround for computer use agents." — Amjad Masad

~34:21 The two jobs left: builders and salespeople

"I think the company of the future is made of builders and sales people broadly." — Amjad Masad

Sales survives as evangelism and trust-based transformation work — "a lot of other companies will want to talk to someone... they trust other humans." Builders persist because abstraction layers keep climbing — humans were once literal "computers," then operators, then software engineers, now agent-deputizers.

~37:24 Vibe coding residents — everyone becomes a founder inside the company

Replit's internal "vibe coding resident team" roams the company hunting problems (support queue prioritization, HR onboarding portals) and spawning agents to solve them. Masad's vision: "almost everyone is a founder. They wake up in the morning and they think how can I make the company more successful?"

"True product market fit is entirely different. It's like an explosive thing." — Amjad Masad
Tools: Replit, Replit Agent 4, Claude Sonnet, Claude Opus 4.6, MCP, Stripe, HubSpot, Salesforce, Gong, Zendesk, TestFlight
Podcast Productivity AI Tools
AI Daily Brief

AI Daily Brief: building a personal agentic operating system

Nufar Gaspar's thesis on the AI Daily Brief: every agentic harness — Cursor, Claude Code, Codex, OpenClaude, Windsurf, Antigravity, Hermes — is converging on the same primitives, all reading text files defining who you are, what you know, what you can do, what you remember, and what you can reach. So the tool you pick matters less; the system you build underneath is the real moat. She lays out a seven-layer "Agent OS" framework — identity, context, skills, memory, connections, verification, automations — using a running "Chloe the Chief of Staff" example.[11]AI Daily Brief — How To Build a Personal Agentic Operating System

Read more

~02:01 Thesis: tools converge, the system underneath is the moat

"Every agentic tool is becoming every agentic tool... the tool you pick matters less and less and what matters much more is the system that you build underneath it." — Nufar Gaspar

~08:06 Layer 1 — Identity: let AI interview you

The file the tool reads first (soul, AGENTS.md, CLAUDE.md, copilot-instructions). Don't write from scratch — brain-dump to an AI and let it interview you with ~15 questions.

~11:09 Layer 2 — Context: 3–5 single-page files, ongoing curation

"Every time you catch yourself re-explaining something about your situation to AI, that thing should have been in a context file." — Nufar Gaspar

Stakeholders, strategy/priorities, operating principles. Not a 40-page novel.

~13:10 Layer 3 — Skills: reusable "when X, do Y" patterns

Every knowledge worker has 20–30 of them — pre-reads, daily brief, voice match, commitment tracker. Ship MVP and patch.

~15:10 Layer 4 — Memory: lean on built-in, add specialized layers for high leverage

Explicitly ask the tool "explain how your memory system works." Add specialized memory (decision logs, relationship context) only where the leverage justifies it.

~18:12 Layer 5 — Connections: MCPs/CLIs/APIs, start read-only

Email, calendar, Slack, Jira. Start read-only and only grant write access after weeks of trust — "an agent gossiping in company Slack" is already a real incident pattern.

~21:14 Layer 6 — Verification: per-skill checks plus periodic OS retro

"Without it, your OS has a shelf life of maybe 8 weeks before everything goes stale. With it, your OS compounds even further and forever." — Nufar Gaspar

~23:14 Layer 7 — Automations: drafts only, logs always, after manual trust

~24:15 The compounding payoff

"Your first agent is hard... your chief of staff maybe took you a weekend. But the second agent... takes you an afternoon because it inherits everything." — Nufar Gaspar
Tools: Claude Code, Cursor, Codex, GitHub Copilot, Windsurf, Antigravity, Hermes (Nous), Lovable, Replit, MCP servers, OpenClaude (soul, heartbeat)
Podcast Hot Take
Dwarkesh Patel

Dwarkesh × Ada Palmer: pamphlets, newspapers, and the birth of fact-checking

Historian Ada Palmer hands Dwarkesh actual early-modern artifacts — a hand-stitched pamphlet, a copy of The Gentleman's Magazine, papyrus, parchment — and walks through how cheap, fast, contradictory print created the original information-overload crisis. Her punchline: when newspapers proliferated and contradicted each other, somebody invented the magazine as a weekly fact-checking roundup. The format was born to adjudicate.[12]Dwarkesh Patel — Pamphlets, Newspapers, and the Birth of the Magazine

Read more

~00:00 What a pamphlet actually is

Naked pages, hand-stitched, printable in two to four days, sold cheap around town and to traveling news writers.

"It's cheap. It's ephemeral. You print a thousand of them." — Ada Palmer

~00:20 Pamphlet content: real news, lurid nonsense

"My favorite ever title of a pamphlet was the scandalous tale of a doctor from Padua and how he seduced his maid, murdered his wife, murdered the maid, cut out her heart and ate it, and how he was justly punished by God." — Ada Palmer

~01:00 Why early paper is blue-gray

Made from rag pulp — laundry-lint color. "Fundamentally laundry lint is what paper is."

~01:15 The Gentleman's Magazine — birth of the magazine format

~01:30 Newspapers contradicted each other — and the invention of fact-checking

"Every week they would publish a roundup of that week's news saying what each newspaper said about it and where they contradicted each other and analyzing who's right and wrong. It was the fact-checking. This is the first magazine." — Ada Palmer

~01:50 Papyrus: cheap, brittle, scroll-friendly

~02:01 A real 17th-century parchment letter in indecipherable hand

~02:20 Cheap vs. good parchment, and writing around a hole in the skin

"They wrote around the hole because too valuable to not use that sheet." — Ada Palmer
Podcast Developer Tools Hot Take
AI Engineer

Matt Carey at AI Engineer: MCP = Mega Context Problem

Cloudflare's Matt Carey argues that naively dumping every API endpoint into MCP tools blows up context windows — Cloudflare's 2.3M-token OpenAPI spec converts to ~1.1M tokens of tool defs, "never going to fly even with the biggest foundational models."[13]AI Engineer — MCP = Mega Context Problem (Matt Carey). His proposed fix: "Code Mode" — generate a typed TypeScript SDK from OpenAPI, let the model write code against the types, execute inside lightweight V8 isolates (Cloudflare's WorkerD) with programmable guardrails. One tool called code replaces many tool calls. He predicts MCP becomes middleware (an MCP=true flag in Next.js by year-end).

Read more

~01:08 From bundled tools to remote MCP — and the context explosion

~03:10 Cloudflare's 2.3M-token OpenAPI problem and the 16-server workaround

Splitting an API into many product-specific MCP servers (Cloudflare ended up with 16 covering ~2,600 endpoints) forces users to pick the right server and still leaves coverage gaps.

~05:11 Progressive discovery: CLIs, tool search, Code Mode

Tool search (e.g., Claude Code's keyword-matched K-tool loader) burns ~2,100 tokens to surface ~500 actually used.

~07:12 Code Mode: typed SDKs let the model write code against the API

"Instead of doing tool calls, you can have one tool called code where the model generates the code of your choice and then you run it." — Matt Carey
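The "one tool called code" pattern can be sketched minimally. Everything below is hypothetical: the CloudflareSdk interface, the toy implementation, and runCodeTool are invented stand-ins, and the bare Function constructor is exactly the unsafe shortcut that Carey's isolate-based execution is meant to replace.

```typescript
// A typed SDK surface the model writes code against (stand-in for one
// generated from an OpenAPI spec; names are illustrative, not Cloudflare's API).
interface CloudflareSdk {
  listZones(): Promise<string[]>;
  purgeCache(zone: string): Promise<boolean>;
}

// Toy in-memory implementation so the sketch runs without any real API.
const sdk: CloudflareSdk = {
  async listZones() { return ["example.com", "example.org"]; },
  async purgeCache(zone: string) { return zone.endsWith(".com"); },
};

// The single "code" tool: it receives model-generated source and runs it with
// the SDK in scope. A real implementation executes inside a V8 isolate
// (e.g. WorkerD) with guardrails, never a bare Function constructor.
async function runCodeTool(source: string): Promise<unknown> {
  const fn = new Function("sdk", `return (async () => { ${source} })();`);
  return fn(sdk);
}

// What the model might emit instead of a chain of individual tool calls:
const modelCode = `
  const zones = await sdk.listZones();
  const results = [];
  for (const z of zones) results.push({ zone: z, purged: await sdk.purgeCache(z) });
  return results;
`;

runCodeTool(modelCode).then((out) => console.log(JSON.stringify(out)));
```

The point survives the toy scale: loops, conditionals, and intermediate state live in the generated code, so the model pays context for one tool definition instead of thousands.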

~09:14 Why clients didn't adopt it — running untrusted code is scary

"Running untrusted code is mega mega scary." — Matt Carey

File/secret exfiltration, infinite loops, crypto miners.

~11:15 WorkerD isolates: programmable sandboxes with programmable guardrails

WorkerD spawns dynamic V8 isolates from a string. Toggling node compat hides process.env; flipping a boolean blocks or allows internet access.
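The "flip a boolean, change what untrusted code can reach" idea can be mimicked outside WorkerD too. This sketch is not the WorkerD API; SandboxPolicy and runGuarded are invented names, and a real isolate enforces policy at the runtime level rather than by shadowing globals.

```typescript
// Hypothetical guardrail sketch, not the actual WorkerD API. We control what
// gets injected into the untrusted code's scope based on a policy object.
interface SandboxPolicy {
  allowNetwork: boolean; // false => fetch throws instead of reaching out
  exposeEnv: boolean;    // false => empty env (like node compat off hiding process.env)
}

const hostEnv = { API_KEY: "hunter2" }; // stand-in for the host's real env

function runGuarded(source: string, policy: SandboxPolicy): unknown {
  const fetchImpl = policy.allowNetwork
    ? (url: string) => Promise.resolve(`fetched ${url}`) // stubbed network for the sketch
    : () => { throw new Error("network disabled by policy"); };
  const env = policy.exposeEnv ? hostEnv : {};
  // Shadow the globals the policy governs inside the untrusted function.
  const fn = new Function("fetch", "process", `"use strict"; return (() => { ${source} })();`);
  return fn(fetchImpl, { env });
}

// Untrusted code probing for secrets:
const probe = `return Object.keys(process.env).length;`;
console.log(runGuarded(probe, { exposeEnv: false, allowNetwork: false })); // 0
console.log(runGuarded(probe, { exposeEnv: true, allowNetwork: false }));  // 1
```

Scope shadowing is not a security boundary (the untrusted code still shares the host's event loop and heap); it only illustrates why programmable per-run policy is the interesting part of the isolate design.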

~13:17 Live demo: full Cloudflare API via one MCP client

~15:18 Where this goes: 'code' as the one tool, sandbox primitives, MCP as middleware

"Your APIs have to be ready to take a beating because they have to have good rate limiting. Cuz I can run this in a for loop on multiple sandboxes at once and just hammer your API." — Matt Carey
"By the end of this year, we'll be like natively in every single at least TypeScript big full stack framework... they'll just have a native integration." — Matt Carey
Tools: MCP, Cloudflare Workers, WorkerD, Wrangler CLI, Cloudflare D1, Cloudflare Access, Claude Code, Code Mode, Deno, Pydantic Monty, OpenAPI, TypeScript, Next.js
Podcast Developer Tools AI Tools
AI Engineer

Ido Salomon at AI Engineer: AgentCraft puts the orc in orchestration

Ido Salomon's thesis: scaling from one agent to dozens does not 100x productivity because the engineer is the bottleneck — managing reckless "employees" is not a skill most engineers have practiced. He demos AgentCraft, an RTS-game-inspired orchestrator that visualizes agents as units on a map of your file system, with hotkey cycling, agent-proposed quests, container-isolated campaigns, cron-driven idea channels, and shared workspaces where teammates' agents appear alongside yours.[14]AI Engineer — AgentCraft (Ido Salomon)

Read more

~00:07 Thesis: humans are the orchestration bottleneck

"Spinning them up isn't the problem. It's us. We are the bottleneck in orchestrating all of these agents." — Ido Salomon
"The role of the engineer to actually go and manage dozens of reckless employees is not typically what we do in most companies." — Ido Salomon

~01:08 Borrowing orchestration skills from RTS gaming

~02:10 Basics: agents as units, buildings as functionality

Each agent is a physical unit on screen, backed by a real coding agent session (Cursor, Claude Code, Codex, OpenCloud). Buildings represent functionality (skills/plugins, integrated terminal, git).

~03:11 Visibility: file system as map, lineage and collision heat maps

"The map is actually a projection of my file system. Each directory is on the map, each file is a room — so I can track visually what the agent is working on." — Ido Salomon

~04:12 Muscle-memory cycling between agents

~05:13 Quests: agents propose missions so you don't have to

~06:13 Campaigns and Channels: container-isolated orchestrators and cron-driven idea sources

"Once it's decomposed, I'm not the one doing the babysitting. Now I have the campaign orchestrator and that's his problem." — Ido Salomon

~08:15 Workspaces: human-to-human and human-to-agent collaboration with soft signals

"How much time do I need to spend on the plan if I can just do it 10 times and pick the one that fits?" — Ido Salomon
Tools: AgentCraft, MCI, MC apps, Cursor, Claude Code, Codex, OpenCloud
AI Models AI Tools
Better Stack

Kimi K2.6 makes a serious play vs Claude Code

Kimi K2.6 scales its agent swarm from ~100 sub-agents in K2.5 to 300, with up to 4,000 coordinated steps and a new "preserve thinking mode" to stop memory drift on long tasks; Moonshot reports a 13-hour task with a 185% throughput gain.[15]Better Stack — Kimi K2.6 vs Claude Code. The model also adds MoonVIT, an open-source native vision encoder, and the whole thing is on Hugging Face. The reviewer's "$39 plan" pitch: a 40-minute web-agency demo (find 20 Toronto notaries without sites, generate landing pages and outreach emails) would have torched Claude Code usage caps but ran fine on Kimi's Allegretto plan.

Read more

~01:01 Agent swarm + preserve thinking mode

~100 sub-agents in K2.5 → 300 specialized agents in K2.6, up to 4,000 coordinated steps. Preserve thinking mode keeps reasoning consistent across multi-turn tasks.

"In K2.5, we were looking at about 100 sub-agents, but K2.6 scales this horizontally to 300 specialized agents that can execute up to 4,000 coordinated steps." — Better Stack

~02:01 MoonVIT vision encoder, fully open source

Native vision encoder for UI/UX reasoning. Generates fully functional interactive prototypes (GSAP animations, scroll-triggered effects) from a single visual reference. The model and encoder are both on Hugging Face.

~03:02 Web-agency demo: 20 Toronto notaries, 40 minutes

Five sub-agents found notaries without websites (Google Maps + Canadian Yellow Pages), generated landing pages, and produced outreach emails and a market-size report. A 17-minute follow-up added unique CSS animations and AI-generated headers per page. The pages still shared boilerplate structure under the visual differences.

~06:04 Cost vs Claude Code

"I have a feeling that I would certainly have burned through all of my usage limits by now if I used Claude to do the same thing." — Better Stack

~07:04 Coding demo: RAM price comparison app in 12 minutes

Full-stack scraper across Amazon, Newegg, Best Buy via Axios + Cheerio. Bare-bones Node + Express + vanilla JS, no React. New token counter in K2.6's CLI.

Tools: Kimi K2.6, MoonVIT, Hugging Face, Claude Code, Node.js, Express, Axios, Cheerio
AI Models Developer Tools
Developers Digest

DeepSeek v4 Pro and Flash with hybrid attention and 1M context

DeepSeek v4 ships in two open-weights variants: V4 Pro (1.6T params, 49B active) and V4 Flash (284B / 13B), both with native 1M-token context. The architecture uses a hybrid attention scheme — Compressed Sparse Attention plus Heavy Compressed Attention — that runs on 27% of the FLOPs and 10% of the KV cache of V3.2 at 1M tokens. V4 Pro Max benchmarks against Opus 4.6 and GPT-5.4. Pricing: V4 Pro at $1.74/$3.48 per million in/out tokens; V4 Flash at $0.14/$0.28.[16]Developers Digest — DeepSeek v4 in 4 Minutes

Read more

~00:00 Model release overview

V4 Pro (1.6T / 49B active) and V4 Flash (284B / 13B) — both open weights on Hugging Face with 1M token native context. V4 Pro Max benchmarks against Opus 4.6 and GPT-5.4.

~01:00 Hybrid attention architecture

Two interleaved mechanisms: Compressed Sparse Attention (CSA) — 4-token collapse plus sparse top-K — and Heavy Compressed Attention (HCA) — 128-token collapse, no sparsity. The combination is the source of the memory savings.
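The cache arithmetic behind those savings is easy to sketch. This is back-of-envelope only: real per-layer interleaving, head counts, and entry sizes aren't given in the video, and the block sizes (4 and 128) are just the stated collapse factors.

```typescript
// Back-of-envelope KV-cache arithmetic for block-compressed attention:
// one cached entry per collapsed block of tokens.
function compressedKvEntries(tokens: number, blockSize: number): number {
  return Math.ceil(tokens / blockSize);
}

const tokens = 1_000_000;
const dense = compressedKvEntries(tokens, 1);  // 1,000,000 (uncompressed baseline)
const csa = compressedKvEntries(tokens, 4);    // 250,000 (CSA's 4-token collapse)
const hca = compressedKvEntries(tokens, 128);  // 7,813 (HCA's 128-token collapse)
console.log({ dense, csa, hca, csaRatio: csa / dense, hcaRatio: hca / dense });
```

CSA alone keeps 25% of a dense cache and HCA well under 1%, so a mix of the two landing near the reported ~10% of V3.2's cache is plausible arithmetic; note CSA additionally applies sparse top-K selection on top of the collapse, which this entry count ignores.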

~02:02 Pricing and availability

V4 Pro: $1.74/$3.48 per million in/out tokens. V4 Flash: $0.14/$0.28. Context caching included. Open weights on Hugging Face.

~03:02 Agentic use case economics

1M context enables long-horizon agent loops that were previously cost-prohibitive; Flash pricing makes the math viable.

Tools: DeepSeek V4 Pro, DeepSeek V4 Flash, Hugging Face
Developer Tools Industry Hot Take
AICodeKing

RIP RooCode; Cursor/SpaceX rumor; model-agnostic tools win

Two Kilo blog posts dropped the same day: RooCode is shutting down on May 15 (VS Code extension, cloud, router) — credited with pioneering agentic coding modes (architect/code/debug) but pivoting away from IDEs toward remote cloud agents. Kilo positions itself as the natural migration target. Separately, SpaceX reportedly holds an option to acquire Cursor for $60B (or pay $10B for partnership work), and the precedent (Anthropic cutting Claude access to Windsurf when OpenAI acquisition rumors surfaced) puts model flexibility at risk.[17]AICodeKing — RIP Roo & Cursor

Read more

~01:02 RooCode sunset

Shutting down May 15 — VS Code extension, cloud, router. Credited with pioneering architect/code/debug agentic modes; pivoting away from IDEs toward remote cloud agents.

~02:03 Kilo positions itself as the migration target

Rebuilt VS Code extension on open code server (same core as its CLI/cloud). Features: parallel execution, sub-agent delegation, agent manager, inline diff review with line-level comments.

~03:03 Cursor/SpaceX acquisition option

$60B option, or $10B for partnership work. Concern: coding tools are now "distribution layers for models," and consolidation with xAI threatens Cursor's model flexibility. Windsurf precedent — Anthropic cut Claude access when OpenAI acquisition rumors surfaced.

~06:03 Model freedom as the core value

Hot take: best model shifts constantly (Claude vs GPT-5 Codex vs Gemini vs Grok vs Qwen depending on task and cost). Model-agnostic tools (Kilo, Cline, OpenCode, Aider) are the safest bet. Kilo's positioning is strong but still needs to prove it won't lock in later.

Tools: RooCode, Kilo, Cursor, Cline, OpenCode, Aider, Windsurf, Claude, GPT-5 Codex, Gemini, Grok, Qwen
Developer Tools AI Tools
Better Stack

Stop using grep: Claude Context MCP for agent codebase search

Better Stack tested Zilliz's Claude Context MCP plugin as a replacement for grep/glob in coding agents. It uses Tree-sitter (AST parsing), a Merkle DAG for incremental re-indexing, and hybrid vector + BM25 search across 9 languages via MCP. Claims 40% context reduction. Best fit: 20–30K-line codebases (sub-minute indexing for cents); large codebases like VS Code (1.5M lines) take ~50 minutes and $1.06 to index.[18]Better Stack — I Stopped Using Grep and My Agent Got 10x Faster

Read more

Plugin overview

Tree-sitter AST parsing + Merkle DAG for incremental re-index + hybrid vector/BM25 search across 9 languages via MCP. Works with any agent harness. 40% context reduction claimed.
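The incremental re-index idea reduces to content hashing plus snapshot diffing. A minimal sketch (the real plugin builds a full Merkle DAG over directories so unchanged subtrees can be pruned in one comparison; this is an illustration, not its actual code):

```python
import hashlib

def file_hash(content: str) -> str:
    """Content-addressed hash of one file (a Merkle leaf)."""
    return hashlib.sha256(content.encode()).hexdigest()

def changed_files(prev_snapshot: dict, files: dict) -> list:
    """Return paths whose content hash differs from the last index run,
    so only those files need re-chunking and re-embedding."""
    return [path for path, content in files.items()
            if prev_snapshot.get(path) != file_hash(content)]

# Second run: a.py is unchanged, b.py is new; only b.py gets re-embedded.
snapshot = {"a.py": file_hash("def f(): pass")}
now = {"a.py": "def f(): pass", "b.py": "print('new')"}
print(changed_files(snapshot, now))  # ['b.py']
```

This is what makes re-indexing after an edit cost seconds rather than repeating the initial ~50-minute pass.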

Setup and cost

Requires Zilliz Cloud (paid serverless recommended; free tier timed out), OpenAI key for embeddings, Node v20–23. Indexing 1.5M-line VS Code repo: ~50 min, $1.06 in embeddings. 23K-line repo: <1 min, $0.01.

Benchmarks (Open Code + GLM-5 Turbo against VS Code)

  • Simple queries: grep faster (14s vs 19s).
  • Complex/deep queries: Claude Context 2–5x faster with richer output (e.g., 1m47s vs 5min for an Electron architecture deep-dive), with file refs and line numbers.

Verdict

Best fit is 20–30K-line codebases where indexing is fast and quality gains are clear. Very large codebases impractical due to indexing time.

Tools: Claude Context MCP, Zilliz Cloud, Tree-sitter, Open Code, GLM-5 Turbo
Hot Take Developer Tools
Nate B Jones

The 19% productivity dip — AI's J-curve in production

Nate B Jones contrasts the "55% lab speedup for GitHub Copilot" study with a recurring production reality: experienced developers take ~19% longer with AI tooling. His framing: this is a J-curve — the productivity dip is workflow-adaptation lag, not evidence that AI is hype. Specific production problems he cites: larger pull requests, higher review costs, more security vulnerabilities.[19]Nate B Jones — Experienced developers took 19% longer with AI

Read more

The sharpest line, from a senior engineer Nate quotes: "Copilot makes writing code cheaper, but owning it more expensive." It's a recurring sentiment across the industry, per his telling.

Tools: GitHub Copilot
Industry Hot Take
Better Stack

Apple App Store cracks down on vibe-coded submissions

Apple is expanding enforcement of App Store guideline 4.2.6 against vibe-coded apps from Bolt, Lovable, and Replit Agent. Even though each AI-generated app produces a technically unique codebase, they share "hallucinated DNA" — identical logic errors, unoptimized assets, identical UI patterns — that triggers Apple's spam filters. The existing rule against "template-based functional clones" is being applied to this new category.[20]Better Stack — Why Apple is Cracking Down On Vibe-Coding Apps Practical advice: vibe-coded apps need human-led engineering or unique architectural value to survive review.

Read more
Tools: Bolt, Lovable, Replit Agent, App Store guideline 4.2.6
Industry
Better Stack

GitHub data integrity incident reverts 2,804 PRs

GitHub had a data integrity incident where commits were generated from the wrong base state, causing previously merged changes to be randomly reverted. 2,804 pull requests affected. Remediation instructions sent to impacted customers. The optics are bad because the incident coincided with a Verge article reporting GitHub employee concerns about reliability and leadership.[21]Better Stack — GitHub just BROKE

Industry Hot Take
Theo - t3.gg

Theo on the fake-GitHub-stars shadow economy

Awesome Agents — building on a peer-reviewed Star Scout study from CMU, NC State, and Socket — identifies ~6M fake stars across 18,600 repos run by ~301,000 accounts, with 16%+ of all repos with 50+ stars involved in fake-star campaigns by July 2024. Stars cost as little as 6¢ at the low end; aged premium accounts run 80–90¢. Pre-built GitHub profiles with 5-year commit histories sell for ~$5,000 on Telegram. Social Plug claims 3.1M stars delivered to 53,000 clients. WeChat groups making $3.4–4.4M/year.[22]Theo - t3.gg — Making millions of dollars on fake GitHub stars Theo's strong pushback on the "stars-to-VC" examples (Lovable, Browser Use, Pinkalan) and his hot take that GitHub is "a place that holds your source code, kind of" round it out.

Read more

~01:01 The shadow economy

Star Scout analyzed 20TB of GitHub metadata, 6.7B events, 326M stars from 2019–2024. ~6M fake stars across 18,600 repos by ~301,000 accounts. By July 2024, 16%+ of all repos with 50+ stars were involved. GitHub itself validated detection by deleting 90% of flagged repos and 57% of flagged accounts as of Jan 2025. AI/LLM repos became the largest non-malicious category (177,000 fake stars), ahead of blockchain. 78 fake-star repos made GitHub trending.

Pricing tiers: 3–10¢ disposable, 20–50¢ mid-range (1–2 weeks), 80–90¢ premium aged. Dagster's 2023 research paid €85 per 100 stars to GitHub24, a registered German company (all 100 stars persisted). Pre-built profiles with Arctic Code Vault badges go for ~$5,000 on Telegram. Social Plug claims 3.1M stars to 53,000 clients with a formal API.

~12:08 Manipulation fingerprints

Flask baseline (71K stars): median account age 4,481 days, 5.3% zero-repo, 10% zero-follower, fork-to-star ratio ~0.20, watcher-to-star ~0.03. Manipulated examples: Union Labs 47% suspected fake, FreedomDAO 81% zero-followers (watcher-to-star of 0.001), OpenAFM 66% suspicious accounts and 36% ghost accounts. Heuristic: a fork-to-star ratio below 0.05 with 10K+ stars warrants scrutiny; organic watcher-to-star is 0.005–0.03.
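The ratio heuristics above can be sketched as a simple screen. The thresholds come from the video; the example repo numbers are illustrative, derived from the Flask ratios rather than live data:

```python
def looks_suspicious(stars: int, forks: int, watchers: int) -> list:
    """Flag repos whose engagement ratios fall below the organic range:
    fork-to-star under 0.05 (at 10K+ stars) and watcher-to-star
    under 0.005 both warrant scrutiny."""
    flags = []
    if stars >= 10_000 and forks / stars < 0.05:
        flags.append("low fork-to-star")
    if watchers / stars < 0.005:
        flags.append("low watcher-to-star")
    return flags

# Flask-like organic profile (ratios ~0.20 and ~0.03): no flags.
print(looks_suspicious(stars=71_000, forks=14_200, watchers=2_100))  # []
# Manipulated-looking profile: both ratios far below the organic range.
print(looks_suspicious(stars=12_000, forks=300, watchers=20))
```

Ratios are harder to buy in bulk than raw star counts, which is why the CMU researchers lean on network structure rather than totals.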

"You can fake a star count, but you can't fake a bug fix that saves someone's weekend." — Theo

~18:12 How stars become VC dollars (and Theo's pushback)

Redpoint's Jordan Segal: median GitHub stars at seed = 2,850; at Series A = 4,980. Buying $85–285 in budget stars hits the seed median; $1K–4.5K hits Series A. Returns: 3,500x–117,000x on a $1M–10M round. Runa Capital's ROSS index, GitHub Fund + M12 ($10M/yr), and an Organization Science paper (15pp more likely to raise if active on GitHub) all reinforce the loop.
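The claimed multiples check out arithmetically against the listed star prices and round sizes. A rough sanity check, not the video's exact math:

```python
# Hitting the seed-stage star median at the quoted disposable-star prices.
seed_median = 2_850                 # median GitHub stars at seed (Redpoint)
budget = seed_median * 0.03         # ~$85.50 at 3 cents per star
premium = seed_median * 0.10        # ~$285 at 10 cents per star

# Worst case: dearest stars, smallest round; best case: cheapest, largest.
print(round(1_000_000 / premium))   # ≈ 3,509x on a $1M round
print(round(10_000_000 / budget))   # ≈ 116,959x on a $10M round
```

Those endpoints reproduce the quoted 3,500x to 117,000x range, which is the whole argument: the spend is noise next to the payoff.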

Theo's pushback: Lovable raised on $400M/yr revenue. Browser Use (50K stars in 3 months, YC W25, $17M seed) — Theo invested, confirms it raised on demand and the agent-browser thesis, not stars. Pinkalan got into YC, $4.7M seed; Theo passed.

"Lovable did not raise based on their stars on GitHub. They raised based on their fucking unbelievable revenue." — Theo

~23:16 NPM, VS Code, Twitter astroturfing

Svelte's NPM downloads jumped from ~370K to 28M (clear manipulation). Andy Richardson demoed pushing his package to ~1M downloads/week using a single Lambda on the free tier. Aqua Security found 1,283 VS Code extensions with malicious deps totaling 229M installs. NBC News + Clemson identified a network of 686 X accounts posting 130,000+ LLM-generated replies (with the uncensored "dolphin" model leaking through artifacts) — promoting Blackbox AI / Claudex. Theo turned down a seven-digit Higgsfield sponsorship after they purged paying users' accounts and got banned from Twitter for ToS violations.

~27:17 Legal exposure

FTC Consumer Review Rule (effective Oct 21, 2024): up to $53,000 per violation for selling or buying fake social-influence indicators. SEC precedent: HeadSpin's CEO was charged with wire fraud (max 20 years) and securities fraud for inflating metrics to scam $80M from investors. Theo offers to be an FTC expert witness against fake-viewership YouTubers (e.g., 4M views / 36 comments examples).

~32:19 GitHub enforcement asymmetry, and what to track instead

"GitHub doesn't know how to run a platform. They know how to run a place that holds your source code kind of." — Theo

CMU researchers recommend a network-centrality-weighted popularity metric. Jono Bacon (StateShift) recommends package downloads, issue quality, contributor retention, community discussion depth, usage telemetry. Healthy fork-to-star ratio: 100–200 forks per 1,000 stars.

"Star economy is a $50 problem with a $50 million consequence." — Theo
Tools: GitHub, Star Scout, Socket, Awesome Agents, Runa Capital ROSS index, FTC, SEC, Lovable, Browser Use, Pinkalan, Higgsfield, Blackbox AI
AI Models AI Future
Two Minute Papers

NVIDIA Sonic: humanoid teleoperation from video, voice, music

Two Minute Papers covers NVIDIA "Sonic," a multimodal teleoperation controller for humanoid robots. It takes video of human motion, voice commands, or music as input and translates them into joint/motor commands via a universal-token architecture (motion generator → human encoder → quantizer → decoder, with a "root trajectory spring model" damping rapid motions). Trained on 100M frames of human motion with no manual action labels, on 128 GPUs over 3 days. ~42M parameters — runs on a smartphone. Open and free.[23]Two Minute Papers — NVIDIA's New AI Broke My Brain Demos: walking, crawling, kung fu, expressive gaits (happy / stealthy / injured), lawn mowing via voice, dancing to music. Led by Prof. Zu and Jim Fan (NVIDIA humanoid robots lab).

AI Tools Productivity
Nate Herk

Nate Herk: Claude Code + Playwright as a universal automation harness

Nate Herk's pattern for end-to-end browser automation: skip Chrome DevTools MCP (it floods context with tool definitions) and have Claude Code drive the Playwright CLI directly. He demos six concrete workflows, including automated QA loops that find and self-patch bugs, web scraping that auto-switches search engines after detecting bot-blocking, persistent-profile authenticated sessions, and a daily-scheduled community bot. The recommended end state: iterate a Playwright script to reliability, then wrap it as a named Claude Code skill.[24]Nate Herk — Claude Code + Playwright Automates Literally Anything

Read more

~01:00 Playwright CLI vs Chrome DevTools MCP

Token efficiency is the deciding factor; MCP floods context with tool descriptions.

~03:00 Automated QA loop

Claude Code built a 12-question form, ran Playwright in headed mode, found 3 bugs (enter-key navigation, missing review page, stale overlay), self-patched, and re-ran to green.

~08:10 Web scraping with engine fallback

Auto-switched from Google to DuckDuckGo after detecting bot-blocking; collected 5 phone numbers across dental office sites.
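The fallback pattern reduces to detecting block markers in the response and retrying on the next engine. A minimal sketch with the Playwright fetch stubbed out; the marker strings, engine order, and `fetch` callback are illustrative assumptions, not Nate's actual script:

```python
BLOCK_MARKERS = ("unusual traffic", "captcha", "are you a robot")  # illustrative
ENGINES = ["google", "duckduckgo", "bing"]                         # illustrative order

def is_blocked(html: str) -> bool:
    """Heuristic bot-block detection on the returned page HTML."""
    return any(marker in html.lower() for marker in BLOCK_MARKERS)

def search_with_fallback(query: str, fetch) -> tuple:
    """Try each engine in turn. In the real script, fetch(engine, query)
    would be a Playwright page.goto plus page.content() call."""
    for engine in ENGINES:
        html = fetch(engine, query)
        if not is_blocked(html):
            return engine, html
    raise RuntimeError("all engines blocked")

# Simulated run: Google serves a CAPTCHA page, DuckDuckGo responds normally.
fake = {"google": "Please solve this CAPTCHA", "duckduckgo": "<ol>results</ol>"}
engine, _ = search_with_fallback("dentists near me", lambda e, q: fake.get(e, ""))
print(engine)  # duckduckgo
```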

~10:10 Authenticated sessions via persistent profile

Tested on Skool; 4–5 iterations to reliably like posts (gray vs. yellow icon distinction, newest-sort filter).

~15:13 Autonomous community bot ("AIS agent")

Runs daily on a schedule via the Claude Code desktop app: AI news roundups, engagement on wins posts, notification replies, an unprompted birthday post; self-extended by writing a new poll-voting script when it hit a capability gap.

~02:01 The skills pattern

Iterate a Playwright script to reliability, then wrap it as a named Claude Code skill for repeatable invocation.

Tools: Claude Code, Playwright, Chrome DevTools MCP, Skool, DuckDuckGo
Developer Tools
LearnThatStack

WebSockets are HTTP that stops being HTTP

LearnThatStack walks through the four promises HTTP makes (stateless, short-lived, client-initiated, infrastructure-friendly) and how WebSockets break each one to enable persistent bidirectional channels — including the SHA-1 challenge-response (not for security; to prove the server is a real WebSocket endpoint), thread-pool collapse at 10K connections, sticky sessions + Redis pub/sub once stateless dies, NAT eviction as short as 30s on cellular, and the universal exponential-backoff + jitter + reset-on-success pattern for thundering herds on deploy.[25]LearnThatStack — A WebSocket Is an HTTP Request That Stops Being HTTP Closing thesis: protocol ossification — WebSockets, HTTP/2, and QUIC all had to smuggle through existing infrastructure rather than negotiate on their own terms.
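The handshake trick is small enough to show whole: the server concatenates the client's Sec-WebSocket-Key with a fixed GUID, SHA-1s it, and echoes the base64 digest in Sec-WebSocket-Accept, proving it actually parsed the upgrade rather than blindly proxying it. The backoff helper below is a generic sketch of the reconnect pattern the video names, not its code:

```python
import base64
import hashlib
import random

# RFC 6455 handshake: fixed GUID appended to the client's key.
GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def accept_key(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header value for a client key."""
    digest = hashlib.sha1((sec_websocket_key + GUID).encode()).digest()
    return base64.b64encode(digest).decode()

# The worked example from RFC 6455 itself:
print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

def next_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform over [0, min(cap,
    base * 2^attempt)]. Callers reset attempt to 0 on a successful
    connect, which is what defuses thundering herds on deploy."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

The SHA-1 here is not doing cryptography, which is why its known weaknesses don't matter: it only has to prove the server saw this specific upgrade request.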

Developer Tools Hot Take
Better Stack

MDN ditches React for Lit + custom server components

MDN rebuilt its front-end. Out: Yari (React SPA, ejected CRA, heavy Webpack, dangerouslySetInnerHTML). In: Lit-based web components with custom elements embedded directly in content, plus custom server components for per-page CSS/JS delivery so unused JS never ships. The main nav dropdown runs on CSS alone, progressively enhanced with JS.[26]Better Stack — MDN's New Stack Is Nuts

Developer Tools
Github Awesome

Honker: Postgres-style NOTIFY/LISTEN inside SQLite

Honker is a Rust-based loadable SQLite extension that ports Postgres's NOTIFY/LISTEN pub/sub mechanism to SQLite. The point: durable pub/sub and task queues live inside the existing DB file, transactions atomically span business logic and queue tasks, no polling, single-digit-millisecond latency.[27]Github Awesome — Honker: a Rust SQLite extension

Sources

  1. YouTube GPT-5.5 is a total freak — AI Search, Apr 25
  2. Blog Quoting Romain Huet — Simon Willison, Apr 25
  3. Blog GPT-5.5 prompting guide — Simon Willison, Apr 25
  4. YouTube ChatGPT Images Just Replaced Three People on Your Team. — AI News & Strategy Daily | Nate B Jones, Apr 25
  5. Blog WHY ARE YOU LIKE THIS — Simon Willison, Apr 25
  6. YouTube Claude ACTUALLY got dumber... — Better Stack, Apr 25
  7. YouTube The Claude Price Hike They Didn't Announce — Better Stack, Apr 25
  8. YouTube Kevin was right about Claude... — Better Stack, Apr 25
  9. YouTube StrongDM's three person team ships with zero human code review — AI News & Strategy Daily | Nate B Jones, Apr 25
  10. YouTube Replit's CEO On The Only Two Jobs Left In The Company Of The Future — Y Combinator, Apr 25
  11. YouTube How To Build a Personal Agentic Operating System — The AI Daily Brief, Apr 25
  12. YouTube Pamphlets, Newspapers, and the Birth of the Magazine — Ada Palmer — Dwarkesh Patel, Apr 25
  13. YouTube MCP = Mega Context Problem - Matt Carey — AI Engineer, Apr 25
  14. YouTube AgentCraft: Putting the Orc in Orchestration — Ido Salomon — AI Engineer, Apr 25
  15. YouTube Kimi K2.6 vs Claude Code: Why I'm Switching to the $39 Plan — Better Stack, Apr 25
  16. YouTube DeepSeek v4 in 4 Minutes — Developers Digest, Apr 25
  17. YouTube RIP Roo & Cursor: END OF AN ERA! — AICodeKing, Apr 25
  18. YouTube I Stopped Using Grep and My Agent Got 10x Faster — Better Stack, Apr 25
  19. YouTube Experienced developers took 19% longer with AI — AI News & Strategy Daily | Nate B Jones, Apr 25
  20. YouTube Why Apple is Cracking Down On "Vibe-Coding" Apps — Better Stack, Apr 25
  21. YouTube GitHub just BROKE... — Better Stack, Apr 25
  22. YouTube Making millions of dollars on fake GitHub stars — Theo - t3.gg, Apr 25
  23. YouTube NVIDIA's New AI Broke My Brain — Two Minute Papers, Apr 25
  24. YouTube Claude Code + Playwright Automates Literally Anything — Nate Herk | AI Automation, Apr 25
  25. YouTube A WebSocket Is an HTTP Request That Stops Being HTTP — LearnThatStack, Apr 25
  26. YouTube MDN's New Stack Is Nuts — Better Stack, Apr 25
  27. YouTube Honker: a Rust SQLite extension that adds Postgres-style NOTIFY and LISTEN — Github Awesome, Apr 25