Claude Design comes for Figma's throat

April 21, 2026

23 topics across 8 blogs and newsletters, 4 conference talks, and 18 YouTube videos.

AI Models
Artificial Analysis

Kimi K2.6 takes the open-weights lead

Moonshot's Kimi K2.6 is now the #1 open-weights model — a 1T-total / 32B-active MoE with a 256k window that ranks #4 overall on the AA Intelligence Index at 54, just behind the three closed frontier labs at 57[1]Artificial Analysis, Kimi K2.6: The new leading open weights model. Agentic Elo jumped from K2.5's 1309 to 1520, τ²-Bench Telecom hit 96%, and the hallucination rate collapsed from 65% to 39%.

Read more

Numbers that matter

K2.6 keeps K2.5's architecture (1T parameters, 32B active) but pushes capability sharply higher. GDPval-AA agentic Elo of 1520 (vs. 1309), τ²-Bench Telecom 96% on tool use, and an AA-Omniscience hallucination rate of 39% — still behind MiniMax-M2.7's 34% but close to Claude Opus 4.7's 36%[1]Artificial Analysis, Kimi K2.6. Reasoning-token usage lands at ~160M, between GPT-5.4 (~110M) and Claude Sonnet 4.6 (~190M). Supports image and video input with text output. Available via Moonshot's API and third-party hosts Novita, Baseten, Fireworks, and Parasail.

Tools: Kimi K2.6, Moonshot, AA Intelligence Index, GDPval-AA, τ²-Bench Telecom, AA-Omniscience, Novita, Baseten, Fireworks, Parasail
AI Tools
Google Developers

Gemini Deep Research Max: the overnight due-diligence agent

Google shipped Deep Research Max, a Gemini 3.1 Pro-powered autonomous research agent built for asynchronous, long-horizon work — with MCP server support, multimodal inputs, and native chart/infographic generation[2]Google Developers, Deep Research Max. Announced MCP partners include FactSet, S&P Global, and PitchBook — aimed squarely at finance's overnight due-diligence workflow.

Read more

Two tiers, one thesis

The product ships in two tiers: Deep Research (low-latency, interactive) and Deep Research Max (exhaustive, slower, professional-grade). Max runs extended reasoning loops, iteratively refines a report, and natively generates HTML charts and Nano Banana infographics[2]Google Developers, Deep Research Max. New features: Model Context Protocol support (for proprietary financial data feeds), multimodal input (PDF, CSV, image, audio, video), a collaborative planning step that asks for user approval before execution, and streaming of intermediate reasoning. Web access is optional — the agent can work entirely on user-provided data. Public preview on paid Gemini API tiers via the Interactions API; Google Cloud coming.

maximum comprehensiveness and highest-quality synthesis
Tools: Deep Research Max, Deep Research, Gemini 3.1 Pro, MCP, Nano Banana, FactSet, S&P Global, PitchBook, Gemini API, Interactions API
Developer Tools AI Tools
Google Labs Google Labs

Google Labs double: Stitch DESIGN.md goes open, Pomelli lands in Europe

Google Labs open-sourced DESIGN.md — a portable markdown spec that lets designers encode tokens, WCAG rules, and brand voice so AI agents can validate their own output instead of guessing[3]Google Labs, Stitch's DESIGN.md format is now open-source. Separately, Pomelli — Google Labs' AI marketing-asset generator — rolled out in English across the EU, UK, Iceland, Liechtenstein, Norway, and Switzerland[4]Google Labs, Pomelli in Europe.

Read more

DESIGN.md: AGENTS.md for visual systems

DESIGN.md is a markdown format that encodes why each design token exists — what each color is for, which combinations pass WCAG contrast, what the brand voice is. Agents generating UIs no longer have to infer intent from a Figma dump — they can read the spec, reference it during generation, and self-check[3]Google Labs, Stitch DESIGN.md. Published on GitHub as platform-agnostic; Stitch (stitch.withgoogle.com) is the reference implementation. Context: Anthropic's Claude Design launched the same week with its own "skills.md" output for design systems — see topic 14.

AI agents can know exactly what a color is for, and can validate their choices against WCAG accessibility rules
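The post doesn't publish the schema itself, but a file in this spirit might look like the following — every token name, value, and rule below is a hypothetical illustration, not Stitch's actual format:

```markdown
# DESIGN.md — hypothetical sketch (not the published Stitch schema)

## Color tokens
- `--color-primary: #0B57D0` — interactive elements only (links, primary buttons)
- `--color-surface: #FFFFFF` — page and card backgrounds
- `--color-accent: #F9AB00` — decorative highlights; never for body text

## Accessibility rules
- `--color-primary` on `--color-surface` must pass WCAG AA at 4.5:1 for body text
- `--color-accent` fails AA on white — badge backgrounds only

## Brand voice
- Short, declarative sentences; no exclamation marks in UI copy
```

An agent generating a UI can parse sections like these and check its own contrast and token-usage choices against the rules before shipping.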

Pomelli: three-step marketing asset flow for SMBs

Pomelli's flow — Analyze (scrape the user's site for brand voice, type, color), Ideate (suggest campaigns or accept a prompt), Create (output downloadable social, web, and ad assets) — is aimed at small and mid-sized businesses that can't afford an agency[4]Google Labs, Pomelli in Europe. English-only in Europe for now.

Tools: Stitch, DESIGN.md, GitHub, WCAG, Pomelli, Google DeepMind, Google Labs
AI Models
Simon Willison's Weblog OpenAI

ChatGPT Images 2.0: OpenAI reclaims the image crown

OpenAI launched ChatGPT Images 2.0 (gpt-image-2) — Sam Altman framed the jump from gpt-image-1 as "equivalent to going from GPT-3 to GPT-5 all at once"[5]OpenAI, This is ChatGPT Images 2.0 (keynote). Simon Willison ran his "raccoon with a ham radio" Where's-Waldo benchmark and says it "takes the crown from Gemini, at least for the moment"[6]Simon Willison, Where's the raccoon with the ham radio?. New: a Thinking mode that can search the web, a 2K API output ceiling, and flexible aspect ratios from 3:1 to 1:3.

Read more

Simon's benchmark

gpt-image-2 produced the most coherent, detail-rich output of any model on Simon's stress test — a complex illustrated scene with a raccoon and ham radio hidden in it. High-quality 3840x2160 renders cost ~40¢ and consumed 13,342 output tokens[6]Simon Willison, Where's the raccoon with the ham radio?. Nano Banana 2 placed the raccoon obviously in the center booth; Nano Banana Pro produced the worst result. In a follow-up, none of the models (including gpt-image-2) could reliably solve their own Where's Waldo puzzles.

Sam Altman said that the leap from gpt-image-1 to gpt-image-2 was equivalent to jumping from GPT-3 to GPT-5.

Instant mode vs Thinking mode

Instant mode is available to all users and produces images that "just look normal" — typos are "very rare," enabling full pages of text and full magazine layouts[5]OpenAI keynote, This is ChatGPT Images 2.0. Thinking mode (paid only) deliberates, can search the web, synthesize references, and produce multiple coherent images in a single pass — demonstrated by generating a three-page manga from one selfie, and a social-media reaction image with real quotes from Threads, LinkedIn, and Reddit plus a working QR code.
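The coverage gives the model name, the instant/thinking split, and the 3:1–1:3 aspect range, but not the request schema. A hypothetical payload builder under just those constraints — the field names and the discrete aspect list here are assumptions, not the real API:

```python
# Hypothetical request builder for gpt-image-2. Only the model name, the
# instant/thinking split, and the 3:1-1:3 aspect range come from the launch
# coverage; every field name and the discrete aspect set are illustrative.
ALLOWED_ASPECTS = {"3:1", "2:1", "16:9", "1:1", "9:16", "1:2", "1:3"}

def build_image_request(prompt: str, aspect: str = "1:1", thinking: bool = False) -> dict:
    if aspect not in ALLOWED_ASPECTS:
        raise ValueError(f"aspect {aspect} is outside the supported 3:1-1:3 range")
    return {
        "model": "gpt-image-2",
        "prompt": prompt,
        "aspect_ratio": aspect,
        # Thinking mode (paid tiers) deliberates and can search the web;
        # Instant mode is the default for all users.
        "mode": "thinking" if thinking else "instant",
    }
```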

Six capability deep-dives

  • Thinking & Intelligence — researcher Ian: multi-page Newton infographics at textbook quality, social-media aesthetic trends from 2006–2016–2026 synthesized into a comparison page.
  • Instruction Following — Jian Feng: precise spatial placement (apple center, mug right, books above); clocks at arbitrary times (2:25, 2:30, 7:45) instead of the default training-data 10:10; word-art on specific hands in a portrait.
  • Slides & Infographics — Yu Guan: a 1,000-word prompt for an educational infographic; a 70-page PDF converted into seven consistent slide images; a one-page academic poster from the same PDF.
  • Multilingual & Text Rendering — dense text in Chinese, Korean, Japanese, and Bengali rendered correctly, including a 100-page Chinese-translated GPT paper as a zoomable image.
  • Aspect Ratios & Resolution — 3:1 through 1:3, 2K in the API (up from 1K), and auto-selecting aspect ratio from context (a 2:1 panorama for a 360-degree prompt). Resolution enables reading "GPT image 2" on a single grain of rice.
  • Chameleon — companion clip released with the keynote.
  • This is ChatGPT Images 2.0 — hero short reprising the launch positioning: DALL-E as cave drawings, Images 1 as ancient art, Images 2.0 as the Renaissance.
Image gen 2 is more of a partner now as opposed to just a tool.
Tools: ChatGPT Images 2.0, gpt-image-2, OpenAI API, Thinking mode, Instant mode, Nano Banana 2, Nano Banana Pro, Claude Opus 4.7, Google AI Studio
Hot Take
Simon Willison's Weblog Simon Willison's Weblog

Simon's hot takes: too-human agents and pelican data poisoning

Two Simon Willison posts share a theme: the friction between humans and the AI systems being built on top of them. Andreas Påhlsson-Notini argues current agents fail by being too human — drifting toward familiar solutions and "negotiating with reality" when the problem gets hard[7]Simon Willison, Quoting Andreas Påhlsson-Notini. Steve Cosman's scosman/pelicans_riding_bicycles weaponizes the other direction — deliberately mislabeled images published as folk-art protest against AI scraping[8]Simon Willison, scosman/pelicans_riding_bicycles.

Read more

"Less human AI agents, please"

Påhlsson-Notini's core argument: frontier agents don't fail by being inhuman — they fail by being too human. Faced with tough constraints they round corners, pick the path they've seen in training, and bargain with the problem statement rather than attack it[7]Simon Willison, Quoting Andreas Påhlsson-Notini. Simon surfaced the quote without commentary, implicitly endorsing it as a useful frame.

AI agents are already too human... they drift towards the familiar. Faced with hard constraints, they start negotiating with reality.

Pelicans as protest

Cosman's repo seeds the web with deliberately mislabeled examples — a "pelican riding a bicycle" that's actually a bear on a snowboard — hoping future scrapes will teach models the wrong thing[8]Simon Willison, scosman/pelicans_riding_bicycles. Simon admits he's already complicit, having published similar red-herring content. The post treats adversarial data poisoning as folk art rather than a serious defense — but it crystallizes a live question: what do publishers owe the scrapers training on them?

I firmly approve
Tools: GitHub
Industry
Tech Brew

Apple's hardware bet: Ternus in, Siri runs on Gemini

Tim Cook steps down on September 1, 2026. Hardware chief John Ternus becomes CEO; Cook stays as executive chairman[9]Tech Brew, Apple bets on hardware in an AI world. Apple's pitch: hardware is still 80% of revenue, and the upcoming Siri revision will be powered by Google's Gemini — a stark admission that Apple is not building the frontier model itself.

Read more

Inherited strengths, inherited misses

Cook grew Apple from ~$350B to nearly $4T market cap since 2011 and built services to ~$100B/year. Ternus inherits a company whose AI strategy is explicitly "let Google cook the model, we'll ship the glass"[9]Tech Brew, Apple bets on hardware in an AI world. Apple Car (canceled 2024) and Vision Pro (well below expectations) are the cautionary cases; the iPhone 17 lineup and budget MacBook Neo are the recent wins. Ternus's pending tests: a Gemini-powered Siri relaunch, and display-free smart glasses that have to be "era-defining" rather than another Vision Pro.

We'll blow you away with what we show next year
Tools: Apple, John Ternus, Tim Cook, Siri, Google Gemini, iPhone 17, MacBook Neo, Apple Vision Pro
Industry
Sherwood Snacks

Sherwood: robotaxi price war, DeepMind's Anthropic strike team

Two items from Sherwood. First: Tesla Robotaxi is charging ~$4.35/mile in the Bay Area vs. Waymo's $9.58/mile, with per-mile operating costs of $0.74 and $1.36 respectively — Sherwood calls this a "golden era" echo of early Uber/Lyft subsidies[10]Sherwood Snacks, Autonomous wheels are deals. Second: Google DeepMind has assembled a dedicated "strike team" to close the gap with Anthropic — an implicit acknowledgement that Claude is the one to beat.

Read more

Robotaxi pricing as subsidy war

With Waymo, Tesla Robotaxi, and Amazon Zoox all live in SF, pricing has collapsed into a subsidy battle[10]Sherwood Snacks, Autonomous wheels are deals. Tesla's $4.35/mile is profitable on paper against a $0.74 floor; Waymo's $9.58 is less aggressive but still below fully-loaded unit economics. Sherwood's read: riders are getting the same VC-funded discount that defined 2014 ride-sharing, and it won't last.
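Plugging the article's per-mile numbers into the margin arithmetic makes the comparison concrete:

```python
# Gross margin per mile implied by the article's prices and per-mile
# operating costs (ignoring fully-loaded costs, which the article flags).
def margin_per_mile(price: float, cost: float) -> float:
    return round(price - cost, 2)

tesla = margin_per_mile(4.35, 0.74)  # Tesla Robotaxi
waymo = margin_per_mile(9.58, 1.36)  # Waymo
```

On these operating costs alone, both services clear roughly 80%+ gross margin per mile — the subsidy story is about where prices go next, not today's floor.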

DeepMind reorganizes around one competitor

The Sherwood item is framed as competitive intel inside a broader market-confusion piece (Magnificent 7 forward P/Es are at AI-boom lows, software multiples compressing). DeepMind's response is a structural reorg of talent around closing the Anthropic gap — the move itself is the story, since it signals who Google thinks the frontier leader actually is.

Tools: Waymo, Tesla Robotaxi, Amazon Zoox, Google DeepMind, Anthropic
Podcast
Prefect Prefect

McPeople: the MCP Dev Summit + AgentCraft recap

Adam and Jeremiah (Prefect) debrief after attending the MCP Dev Summit in NYC and AI Engineering Europe in London. Big themes: FastMCP is being renamed to an "MCP server object," stateless sessions in the June spec will force orchestration, clients are finally stepping up, and "tokens per successful outcome" is THE metric to obsess over[11]Prefect, McPeople | MCP Dev Summit | AI Engineering Europe. A separate Prefect short teases AgentCraft — a WoW-style agent orchestrator demoed at the conference[12]Prefect, MCP Summer has AgentCraft.

Read more

~00:04 The hosts open by noting how many community members introduced themselves by GitHub handle — a running joke that they recognized people by the issues they'd filed, not their names.

Max Marcelo's Python SDK talk: FastMCP becomes MCP server object

~01:04 FastMCP in the MCP SDK will be renamed to an "MCP server object" for parity with the TypeScript SDK. A much larger set of under-the-hood changes lands in the June spec update around sessions, transports, and bidirectional communication. The sharp pain point: the protocol is shifting toward stateless sessions, which breaks today's developer experience of writing a tool that can sample/elicit mid-function and then resume.

It kind of turns out the problem of making code run is a fairly universal one. And here we are again where a lot of MCP features are going to be literally impossible without an orchestration engine behind them.

AgentCraft: WoW-style orchestration

~04:04 Leor and Ido's new project AgentCraft — spawn an orc, type "go refactor this front end," watch it walk off to "prompt town" and delegate to sub-agents[12]Prefect, MCP Summer has AgentCraft. Prefect frames this window as "MCP Summer."

VS Code: the content vs structured_content debate

~08:08 VS Code shows the tool-result content field (intended for the LLM) to both the user AND the LLM; Adam's position is that content is LLM-only and structured_content is for the app. At stake is the ability to "return 10,000 browsable rows" without sending them to the LLM.
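A sketch of the split Adam argues for — a compact summary the LLM reads, and a full payload the client renders. The field naming follows the episode's phrasing, so check the MCP spec for the exact casing:

```python
# `content` carries what the LLM should read; `structured_content` carries
# the full payload for the client app to render and let the user browse.
# Field naming follows the episode's phrasing, not necessarily the spec's.
def tool_result(rows: list[dict]) -> dict:
    return {
        "content": [
            {"type": "text",
             "text": f"Query returned {len(rows)} rows; first id={rows[0]['id']}."}
        ],
        "structured_content": {"rows": rows},  # browsable in the app, never sent to the LLM
    }
```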

Matt Kerry (Cloudflare) on code mode and dynamic sandboxes

~11:10 Matt Kerry delivered back-to-back "code mode" talks — the thesis being "instead of giving folks thousands of tools, give LLMs the ability to search and then write arbitrary programs over them" — plus dynamic sandboxes via Cloudflare isolates.

"Clients need to do better"

~17:13 Adam's thesis: "I'm done complaining about bad clients." Instead, ship apps and iframes to deliver elicitation and async tasks clients won't natively support. Claude adding tool search and code-mode by default is cited as the right direction.

Chrome DevTools MCP PM: tokens per successful outcome

~23:14 The sharpest talk of the summit: "tokens per successful outcome" as THE metric, optimized via prompts, tool bundling, and descriptions. Key finding: no free lunch with tool search — tools it knows to search for get collapsed efficiently, but tools it never thinks to search for become invisible and send the agent on "long Odyssean-style journeys."

Tokens per successful outcome — that is the metric he obsesses over internally.
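The metric itself is simple division — all tokens spent, over only the runs that succeeded — which is what makes failed runs so expensive:

```python
# "Tokens per successful outcome": total tokens across all attempts divided
# by the number of attempts that succeeded. Failed runs still count toward
# the numerator, which is exactly what the metric is designed to punish.
def tokens_per_successful_outcome(runs: list[tuple[int, bool]]) -> float:
    total_tokens = sum(tokens for tokens, _ in runs)
    successes = sum(1 for _, ok in runs if ok)
    if successes == 0:
        return float("inf")  # all cost, no outcome
    return total_tokens / successes
```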
Tools: MCP Python SDK, MCP TypeScript SDK, FastMCP, Prefect, prefab, MCP apps, AgentCraft, VS Code, Claude Desktop, Cloudflare Workers, Pydantic, Hugging Face, GitHub MCP server, Postman, MCPUI, Chrome DevTools MCP
Podcast
AI Engineer

AIE Miami Day 2: Cerebras, Cursor 3.0, Arize's MCP-vs-Skills eval

Day 2 of AI Engineer Miami delivers the day's densest content: fast-inference hardware (Cerebras + Codex Spark at 1,200 tok/s), specialized sub-agent models (Morph), context graphs vs naive RAG (Neo4j), behavior-runtime voice robots (Akamai), agent-first product design (Mux), SOTA agentic memory (OutRival), a 500-run MCP-vs-Skills eval (Arize), and a from-scratch IDE (Cursor 3.0)[13]AI Engineer, AIE Miami Day 2.

Read more

David House (G2I): agentic coding adoption

~04:03 Framework: beginners need frameworks that constrain input; experts need frameworks that amplify it. Good agent frameworks reveal hidden practice, make output reviewable, and train engineers in delegation.

A beginner agentic framework should constrain their input. For an expert, an agentic framework should amplify their input.

Sarah Chiang (Cerebras): Codex Spark at 1,200 tok/s

~25:23 The day's inference blockbuster: Codex Spark, from OpenAI x Cerebras, generates code at 1,200 tokens/second — roughly 20x faster than GPT/Claude/Gemini at 50–150 tok/s. The full-stack breakdown: wafer-scale on-chip SRAM, disaggregated prefill/decode (why Nvidia bought Groq for $20B), MoE + REAP, and KV-cache reuse. The regime change: the human is the bottleneck.

We are entering a regime where the coding model can code faster than we humans can keep up with.

Le Kalinowski (Callstack): latent diffusion on mobile NPUs

~44:07 Deploys a latent diffusion model directly on a phone NPU using ONNX Runtime, driven by ambient light sensor values instead of text prompts — ~600ms latency, fully offline.

Tis (Morph): specialized sub-agent models

~110:41 Frontier models are wasted on code search, compaction, and diff-apply. Morph's code-search model runs 12 parallel grep/read calls over 6 turns; its compaction model runs at 33k tokens/sec. Frontier labs won't train a parallel tool-calling harness because RL causes chronic forgetting on sequential calls — that's the specialist gap.

When you double speed without hurting accuracy, you roughly double conversion rates.

Rick Blalock (Agentuity): coding agents as universal primitive

~123:50 Coding agents can build every other kind of agent. He contrasts 2023's AutoGPT/LangChain "orchestration theater" with today's reality where OpenClaw sells out Mac Minis to non-technical users. Serverless is broken for agents (30s timeouts vs 5min agent runs); agents need purpose-built cloud.

Software ate the world. Now coding agents are eating software.

Nia Mlin (Neo4j): context graphs

~146:10 Vector search alone gets 70–80% accuracy; the missing dimension is structural similarity. Cited ablation: 37% base to 54% fine-tuned to 91% with knowledge graph + RAG. A live demo catches a fraudulent credit-line-increase request that a vector-only pipeline would approve.

Better models don't fix fractured context.
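As a toy illustration of why structure helps: score candidates by vector similarity plus a bonus when the candidate is actually linked to the query's anchor entity in a relationship graph. The weighting scheme here is purely illustrative, not Neo4j's method:

```python
from math import sqrt

# Toy hybrid retrieval: cosine similarity plus a structural bonus when the
# candidate document is linked to the query's anchor entity in the graph.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def hybrid_score(query_vec, cand, graph, anchor, alpha=0.6):
    structural = 1.0 if anchor in graph.get(cand["id"], set()) else 0.0
    return alpha * cosine(query_vec, cand["vec"]) + (1 - alpha) * structural
```

Two documents with identical embeddings then separate on whether they are actually connected to the account in question — the signal a vector-only pipeline can't see.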

Lena Hall (Akamai): behavior runtime for voice + motion

~268:08 Five principles from building a Reachi Mini compliment robot on OpenAI Realtime: (1) model picks intent, runtime picks action; (2) serialize responses or lose coherence; (3) latency is interaction design (a 4s post-tool delay reads as "hesitation"); (4) personality is policy; (5) infrastructure bugs show up as behavior — as when the robot started speaking Spanish after a tool-call overflow.

Dave Kiss (Mux): designing for agents as first-class users

~291:31 Switching costs have collapsed to a few prompts. Salesforce headless 360, Netlify ditching per-seat pricing, Stripe's agent-provisioned infra. Ship a pricing markdown file, add Link headers for discovery, rewrite your top 3 error messages with actionable fixes.

Alvin Payne (OutRival): agentic memory SOTA

~316:01 "DMD" (Dynamic Memory Discovery) scored SOTA on LongMemEval in 5 days — no vector DB, no reranker, no knowledge graph, just a filesystem of raw JSON sessions and an agent orchestrator with recursive LLM calls. Three unsolved: temporal reasoning, entity disambiguation, principled forgetting.

Every additional assumption is new surface area to be wrong.
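In that spirit, a minimal sketch of the filesystem-of-raw-JSON-sessions idea — with a plain substring scan standing in for the recursive LLM calls the talk describes:

```python
import json
from pathlib import Path

# Each session is one raw JSON file on disk; recall is an agent scanning
# them. A substring match stands in here for the recursive LLM calls.
def save_session(root: Path, session_id: str, messages: list[dict]) -> None:
    path = root / f"{session_id}.json"
    path.write_text(json.dumps({"id": session_id, "messages": messages}))

def recall(root: Path, query: str) -> list[str]:
    hits = []
    for f in sorted(root.glob("*.json")):
        session = json.loads(f.read_text())
        if any(query.lower() in m["text"].lower() for m in session["messages"]):
            hits.append(session["id"])
    return hits
```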

Eric (CodeRabbit): skills as a simple primitive

~333:22 "If you can one-shot it, it's not a skill — it's already in the weights." His DAW skills took ages on Opus 4.6, then just worked on 4.7. Live-codes music generation, bedtime stories with text-to-speech, and agent-metal via Claude controlling a DAW.

Hassan (Together AI): shipping 10-15 AI apps a year

~398:01 Harness stack: Codex for 30-min prototypes, Cursor for everyday work, Claude Code for PR review. Parallel agent gains stop after 2–3 threads. Spends 80%+ of time on UI even in AI-heavy apps.

Stefan Abram (OpenCode): the 3 R's of enterprise AI

~416:09 Resist, Rush, Reign. OpenCode's thesis: model flexibility from day one. Last week's top model was Kimi K2.5; this week it's 2.6. Case study: Ramp built "Inspect" on OpenCode — 30% of Ramp's merged PRs are now written by it.

Lori Voss (Arize AI): MCP vs Skills, 500 runs

~428:14 Opus 4.6 against a fake GitHub repo across 25 tasks x 4 difficulty tiers x 5 passes (500 total). Correctness is tied in the high 80s. MCP is 5x slower and 6x more expensive on tier-4, and on one test made 71 tool calls of which only 3 were actual MCP calls — the rest were bash/jq workarounds for JSON overflow. Skills lose on consumer use cases: OAuth, remote tools, non-developer users.

MCP vs the command line is the wrong question. It's MCP plus the command line.

David (Cursor): IDEs are dead

~456:36 Tab-complete usage went from 1,400 in September to 9 in December to ~zero today. Cursor 3.0 was built from scratch to escape VS Code layout baggage; cloud-agent usage has climbed steadily since November. Local compute is no longer the bottleneck.

Agents are writing 99% of our code, which means we have to rethink IDEs from first principles.
Tools: Cerebras, Codex Spark, OpenAI Codex, Claude Code, Claude Opus 4.5/4.6/4.7, Cursor 3.0, Cursor CLI, OpenCode, Arize AI, Morph, Agentuity, Neo4j, Mux, OutRival, CodeRabbit, Together AI, Reachi Mini, OpenAI Realtime API, Devon (Cognition), Kimi K2.5/K2.6, GLM 5.1, Flux Schnell, Nano Banana, Nvidia B200/H200, Groq LPU, Google TPU, AWS Trainium, MCP, Claude Agent SDK, ONNX Runtime, REAP, Turboquant, LongMemEval, SWE-bench Pro, Whisper Flow, Ramp Inspect
Podcast
AI Engineer

Sander Dieleman at AI Engineer: Veo, Nano Banana, and spectral autoregression

Google DeepMind's Sander Dieleman (Veo, Nano Banana) gives an eight-pillar tour of training large-scale diffusion models for audiovisual data. Central thesis: diffusion — not autoregression — dominates image and video generation because it naturally does coarse-to-fine "spectral autoregression" in the frequency domain, and tricks like classifier-free guidance let diffusion models "punch well above their weight"[14]AI Engineer, Sander Dieleman on Veo and Nano Banana.

Read more

Data curation is the most underrated lever

~02:15 Dieleman stresses that time spent on data often beats time tweaking optimizers or architectures — and data remains part of the "secret sauce" teams can't publish about.

Time spent on improving the data is sometimes a better investment of that time than actually trying to tweak the model and trying to make the optimizer better.

Latent representations via learned autoencoders

~04:15 30 seconds of 1080p video at 30fps is several GB per training example — raw pixels are a non-starter. Teams train their own autoencoders to produce latents (e.g. 256x256 RGB compresses to a 32x32 latent with extra channels for high-frequency detail), preserving the topological grid structure pixels have.
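The sizes check out with quick arithmetic (the latent channel count below is an assumed value for illustration — the talk only says "extra channels"):

```python
# 30 s of 1080p RGB video at 30 fps, one byte per channel:
frames = 30 * 30                              # 900 frames
raw_video_bytes = frames * 1920 * 1080 * 3    # ~5.6 GB per training example

# Image case: 256x256 RGB pixels vs a 32x32 latent grid.
pixel_values = 256 * 256 * 3                  # 196,608 values
latent_values = 32 * 32 * 16                  # 16 channels is an assumed count
compression = pixel_values / latent_values    # 12x fewer values to model
```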

Diffusion as iterative denoising

~10:17 The denoiser predicts a clean image from a noisy one, but because noise destroys information the prediction is blurry — a direction, not a destination. Sampling takes small steps in that direction and re-adds fresh noise each step to prevent error accumulation.
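That loop can be caricatured in one dimension — the "denoiser" below is a stand-in that always predicts the clean value, mimicking the blurry-direction behavior described above:

```python
import random

# Toy 1-D diffusion sampling: take a small step toward the (stand-in)
# denoiser's prediction, then re-inject fresh noise that anneals to zero.
def toy_sample(clean: float = 1.0, steps: int = 50, seed: int = 0) -> float:
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)                  # start from pure noise
    for t in range(steps):
        prediction = clean                   # "a direction, not a destination"
        x += 0.2 * (prediction - x)          # small step toward the prediction
        sigma = 0.1 * (1 - (t + 1) / steps)  # fresh noise, shrinking each step
        x += sigma * rng.gauss(0.0, 1.0)
    return x
```

Dropping the re-noising line makes errors from early, blurry predictions accumulate — which is exactly why real samplers keep it.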

Fourier analysis: diffusion as spectral autoregression

~16:19 Natural images have a power-law spectrum; Gaussian noise is spectrally flat. Noise progressively drowns out the highest frequencies first — so diffusion generates coarse-to-fine in frequency space.

Diffusion is basically spectral autoregression.
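A two-line signal-to-noise sketch shows why, assuming the power-law spectrum the talk cites (exponent ~2 is an assumption for illustration):

```python
# Natural-image signal power falls off like 1/f^2; Gaussian noise power is
# flat across frequencies. Per-frequency SNR therefore drops with f, so
# adding noise drowns the highest frequencies first.
def snr(f: float, noise_power: float) -> float:
    signal_power = 1.0 / f**2   # power-law spectrum (exponent assumed ~2)
    return signal_power / noise_power
```

At any fixed noise level, `snr(16.0, ...)` sits far below `snr(1.0, ...)` — so reversing the noising process fills in coarse structure before fine detail, i.e. coarse-to-fine generation in frequency space.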

Architecture and video

~21:23 U-Nets gave way to diffusion transformers with fully bidirectional attention. For video, there's a spectrum from fully autoregressive models to joint spatiotemporal diffusion, with a hybrid middle ground (autoregression in time + diffusion per frame) being ideal for real-time applications like Genie.

Classifier-free guidance

~24:25 The most important sampling trick: amplifies the delta between conditional and unconditional denoiser predictions, trading diversity for quality.

A lot of people would be surprised at how bad today's models are if you take guidance away.
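The trick itself is one line — extrapolate from the unconditional prediction through the conditional one by a guidance scale w:

```python
# Classifier-free guidance: guided = uncond + w * (cond - uncond).
# w = 1 recovers plain conditional sampling; w > 1 trades diversity
# for quality by amplifying the conditional delta.
def cfg(uncond: list[float], cond: list[float], w: float) -> list[float]:
    return [u + w * (c - u) for u, c in zip(uncond, cond)]
```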

Distillation and control

~27:26 Distillation in diffusion is about reducing sampling steps — consistency models try to predict the endpoint directly, though one-step sampling is "a tall order" and interval-based ~3-step variants work better. ~29:27 Text prompts alone are no longer sufficient — users want reference-based generation, explicit camera motion, and timing controls.

Tools: Veo, Nano Banana, Google DeepMind, Genie, Stable Diffusion, GLIDE, EQ-VAE, JAX, jax.pjit, TPUs, Rectified Flow, Consistency Models
Podcast
AI Engineer

Gergely Orosz at AI Engineer: token maxing and big-tech AI infra

Gergely Orosz (Pragmatic Engineer) opens with "token maxing" — the phenomenon at Meta, Microsoft, Salesforce, and Coinbase where engineers are measured on AI token output, sometimes tied to performance reviews[15]AI Engineer, Gergely Orosz conversation. His read: AI is individually productive but teams haven't figured out how to retrofit it yet, and big tech is rationally building massive internal infra (MCP gateways, background coding agents) because off-the-shelf tools can't handle their codebases.

Read more

Token maxing: leaderboards tied to performance reviews

~00:00 At Meta, Microsoft, and Salesforce, engineers' token output is being measured on public leaderboards. Meta had one tied to performance evaluations as "one data point of many." ~02:03 Meta killed the leaderboard publicly after an article dropped, but people are still token-maxing out of fear — asking agents to summarize docs they could just read, running autonomous agents to "build junk" to pump numbers. Salesforce has a $175/month minimum AI-spend target.

Have you ever wondered why big tech loves to do leetcode style interviews? It selects for the person who's smart and willing to put up with absolute [bullshit] to get the job.

Why leadership panicked into mandating AI

~06:05 Six months ago at a CTO dinner in Amsterdam, leaders were panicked their engineers weren't using AI while Anthropic's revenue went vertical — they conflated correlation with causation and pushed hard. Coinbase's Brian Armstrong fired an engineer a week after emailing "use AI or we talk."

Individual vs team productivity; the METR study

~09:06 Individually, yes; as teams, it's a question mark. The METR study's 30-person sample felt 20% more productive but was 20% slower — the one outlier who was actually faster was interviewed on the pod. His theory: the real unlock is enabling non-technical collaborators to code so engineers aren't the bottleneck.

Simon's "no manual" insight — theory doesn't transfer

~11:07 Quoting Simon Willison: "this thing AI is just so hard to get good at. There's no manual." Gergely's key insight: "understanding the theory will not make you better at using the tools" — which is an "absolute mindfuck" because engineers expect theory to map to practice.

Low ego, open to learning, open to leaving your priors behind.

Role collapse into software engineer

~13:08 AI is accelerating a pre-existing trend where VC-funded startups demand broader-range engineers (tester into SWE, devops into SWE, now product). John Deere's two-pizza teams are now one-pizza teams.

Pushback on "everyone's an EM now"

~15:11 You get orchestration without the people drama, conflict management, or 6-month feedback loops. It's closer to tech lead or "mech suit" (DHH's framing). Mitchell Hashimoto uses two agents max, one in the background.

Big tech infra buildout: MCP gateways, background agents

~17:11 Uber, Airbnb, Intercom, Meta, and Microsoft are rebuilding IAM, building custom background coding agents, integrating MCP gateways into service discovery, and risk-categorizing their code-review systems. Three reasons: (1) low-risk hands-on experience with AI; (2) their codebases never fit in a context window; (3) "anything with AI in it gets funded" — rename "developer platform" to "agent experience" and you get headcount.

If you're at a large company and you're not already building an MCP gateway, what are you even doing?

Pragmatic Engineer origin and growth

~22:12 Started during COVID after the Uber layoffs. Found product-market fit in week one with 100 paid annual subs before publishing, and hit 1,000 paid ($100k ARR — his Uber base salary) in six weeks. Said no to all interviews/collabs for two years to protect the 1–2 article/week cadence. Number-one paid tech newsletter for three years until SemiAnalysis overtook it.

Tools: GitHub Copilot, Cursor, Claude Code, Claude Opus 4.5, MCP gateway, Substack
Podcast
AI Engineer

Tuomas Artman (Linear CTO) on taste and craft

Linear's CTO argues the pendulum has swung too far toward shipping-everything-immediately now that agents can code anything on demand. His thesis: when every competitor can ship the same feature set in hours, taste, craft, and quality become the only durable moat — and AI still has none of those because it has no sense of time, no sense of animation feel, and no taste[16]AI Engineer, Taste & Craft with Tuomas Artman.

Read more

Engineering used to be the gate

~00:00 Agents shipping everything instantly is trending the wrong way. Steve Jobs' "saying no to 999 things" matters more, not less.

Engineering used to be hard. That was the gate that made us think before we built.

Uber hypergrowth analogy

~02:02 Uber's hypergrowth era was a winner-takes-all race where quality was sacrificed for speed. AI is recreating that dynamic for every product team, because a one-person shop with AI can now match your feature set.

How Linear operates today

~04:03 Linear still refuses most feature requests, still invests heavily in the root causes behind customer asks, and still puts design first. Where AI has actually accelerated them: bug fixing — ~05:04 10% of incoming bugs at Linear are now auto-fixed by a single-shot AI instance (assigned, PR'd, landed with no engineer involved). Artman expects that to trend toward 100% over the next few years.

Hot take: Claude Code's quality shows the cost of shipping fast

~06:06 Anthropic has said Claude Code was coded by Claude, "and it shows" — he can spot real bugs in either the CLI or desktop app within seconds. Attributes it to shipping too fast under competitive pressure with OpenAI.

Anthropic said all of Claude Code was coded by Claude — and it shows. You can spot actual bugs in a few seconds.

Quality Wednesdays and the Zero Bug Policy

~11:10 Every Wednesday the whole engineering team (~25 people) dials in for ~30 minutes and each engineer demos one quality fix they found and shipped that week — anything from one-pixel alignment to backend efficiency. Bug fixes don't count. Linear has fixed ~2,500–3,000 such micro-issues. Side effect he cares about most: engineers now watch for quality issues while building unrelated features.

~16:14 Zero Bug Policy: bugs auto-assigned to whoever owns the area, become that engineer's top priority, typically fixed within 2–3 hours (always within 7 days). Insight: bugs arrive at a constant rate, so fixing now vs. in three months costs the same total effort — you just need a one-time ~3-week pause on new features to drain the backlog, then maintain.

Bugs are created at a constant rate. Fixing them now or in three months costs the same total effort — so just stop for three weeks, get to zero, and stay there.

Why AI has no taste

~20:16 Two concrete failures: (1) AI has no concept of time — it reads that 1s is better than 2s but can't feel that 2s is frustratingly slow; (2) AI can't judge whether an animation feels natural. He points to Linear design engineer Emil's demo, where agent-built animations did the technically "right" things but felt off until a human polish pass.

They have no taste. They simply don't. AI doesn't have a concept of time — it knows one second is better than two, but it doesn't know whether two seconds is slow enough.

Hiring: paid one-week work trial

~22:17 Every engineering candidate does a full paid one-week greenfield project end-to-end (sometimes actually shipped). "Those people [who won't take a week] didn't want to be here in the first place." New engineers get firehosed: open Slack channels with every big customer, every customer meeting recorded and tagged.

One year out: every engineer becomes a product engineer

~26:20 Pipe-data-from-A-to-B work disappears; what remains is knowing what a good feature and good UX look like. Closing advice for ICs wanting product sense: build for yourself, ship it, and read Apple's Human Interface Guidelines — "the best book" on UX.

In a year, everybody will become a product engineer. You won't need engineers who pipe data from one place to another — you'll need engineers who know what a customer wants and what a good feature looks like.
Tools: Linear, Claude Code (CLI + desktop), Claude Opus 4.5, Apple Human Interface Guidelines
Podcast
Lenny's Podcast

Lenny's Podcast: shed 30K, rehire 8K AI-first

~00:00 A guest on Lenny's Podcast predicts a wave of net-negative tech headcount reductions over the next 12–24 months — companies will shed large workforces built during the 2020–2022 growth era and replace them with a much smaller, AI-first cohort[17]Lenny's Podcast, Are tech jobs safe in 2026?. Concrete scenario: 30,000 layoffs, 8,000 rehires, all AI-first.

Read more

The two-factor squeeze: (1) companies feel they never extracted enough value from headcount hired over the last five years; (2) AI demands an entirely different skill set than the people on payroll. Framing: a "judgement day" reset where firms deliberately downsize to rebuild around AI-native talent, not retrain existing staff.

We aren't getting as much for the staff that we grew in the last 5 years. And this AI thing requires a totally different skill set.
You might see a company shed 30,000 and hire 8,000. But the 8,000 people they're going to hire are going to all be AI-first.
AI Tools Industry Hot Take
Nate Herk | AI Automation Fireship Theo - t3.gg Nate B Jones

Claude Design comes for Figma: the mega-launch

The day after Opus 4.7, Anthropic launched Claude Design under a new "Anthropic Lab" sub-brand — a canvas-based product that ingests codebase + brand assets + Figma files and outputs logos, typography, color palettes, components, full UI kits, and an agent-readable skills.md[18]Nate B Jones, Your Prompts Didn't Change. Opus 4.7 Did.. Figma's stock dropped 7% on announcement day; Anthropic CPO Mike Krieger had resigned from Figma's board three days earlier[19]Fireship, Claude just got another superpower. Early testers: Theo calls it "the best software Anthropic's ever shipped"[20]Theo - t3.gg, Did Anthropic just kill Figma?; Nate Herk built a working 3D-scroll site in 20 minutes[21]Nate Herk, Claude Design Builds Beautiful 3D Websites Instantly.

Read more

What it is

Claude Design is built on Opus 4.7 (with "much more advanced vision capabilities than previous models"[21]Nate Herk, Claude Design) and sits alongside Claude Code the way Claude Code sits alongside the raw API — an opinionated, higher-level wrapper around the same model. It supports prototypes, slides, landing pages, animated promos, and 3D-scroll sites. Access requires a paid Claude plan; currently in "research preview." ~00:00

Fireship's framing: fully interactive outputs, shader-based animations, video animations longer than a minute, a 3.75 MP image-understanding ceiling on 4.7, and 87.6% on SWE-bench — though he also criticized it as slower than Stitch, Codex, or Cursor Composer[19]Fireship, Claude just got another superpower.

Figma stock dropped 7% within hours of the announcement. Longtime Adobe executives are just giving up and jumping out of high-rise windows.

The killer feature: skills.md as design infrastructure

Nate B Jones calls out the real strategic move: Claude Design doesn't just generate human-facing brand docs, it generates a machine-readable skills.md that any future AI agent can consume to produce on-brand output[18]Nate B Jones, Claude Design first drive. ~13:08

A design tool now produces a skill file natively from your codebase and brand assets to ensure that future projects are brand native. It's actually turning the design system into agent infrastructure.

Nate Herk's 20-minute 3D-scroll demo

~02:00 The workflow: prompt Claude Chat to invent a brand ("LOL" — a magnesium glycinate drink), generate an image + video prompt, feed the image prompt into Kling Nano Banana 2, then Kling Cance 2.0 for an 8-second looping video. Upload to Claude Design, sketch a wireframe on canvas marking the navbar, hero text, hero video. Opus 4.7 runs with a live to-do list. The scroll-animation technique: ~14:06 tell Claude to associate each video frame with a scroll position so scrolling forward/backward moves the video.
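The scroll-scrubbing technique boils down to mapping scroll position to a frame index; the core math as a pure function (the generated site would run the equivalent in JavaScript against the video element — the names here are mine, not Claude Design's output):

```python
def frame_for_scroll(scroll_px: float, page_height_px: float,
                     viewport_px: float, total_frames: int) -> int:
    """Map a scroll offset to a video frame index, clamped to the valid range."""
    scrollable = max(page_height_px - viewport_px, 1.0)
    fraction = min(max(scroll_px / scrollable, 0.0), 1.0)
    return round(fraction * (total_frames - 1))

# An 8-second loop at 30 fps has 240 frames; halfway down the page lands near
# the middle frame, and scrolling backward lowers the index again.
print(frame_for_scroll(2000, 5000, 1000, 240))
```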

Canvas features: comments, inline edits, drawings, tweaks panel

~11:04 Four interaction modes. (1) Comments on specific elements. (2) Inline text/size edits without a prompt. (3) Freehand drawing annotations ("add a transition overlay so this ends less abruptly"). (4) A generated tweaks panel — sliders for palette, font, layout, spacing, component style that are non-destructive and cost no tokens. Theo independently called the comment-batching workflow genuinely cool[20]Theo - t3.gg, Did Anthropic just kill Figma?.

If you put this prompt into Claude Code, you would maybe get something that visually looks like this. But in order to go back and forth and make tweaks, you would have to take some awkward screenshots or be very specific.

The failure mode everyone hit: logos and repeat corrections

~16:09 Nate B Jones's run: Claude Design silently reinterpreted his logo into a black square plus word mark, propagated it through the entire UI kit, and refused to fix it across five or six correction passes — each billable[18]Nate B Jones, Claude Design first drive. Fireship's live test also failed to fix a logo issue, just changing the background color instead[19]Fireship, Claude just got another superpower. Theo's session: ~22:02 designs disappeared mid-session after burning 10%+ of usage.

The moment it starts redesigning your logo without your permission or request, every downstream artifact becomes suspect.

Usage quota: brutal and separate

Claude Design has a separate weekly quota on top of normal Claude usage limits. Nate Herk (Max 20x): "a few video projects and a few websites" before the limit. ~27:14 Nate B Jones spent $42 in one afternoon — $5 for the initial system, $10 on review iterations, $2.50 for a 60-second overview, $23.29 for a 2-minute video with five review passes[18]Nate B Jones, Claude Design first drive. Theo burned 18% of his weekly quota fast; pro-tier users reportedly get ~2 prompts. Best practices: use Sonnet 4.6 for iterations, one visual change per prompt, use the tweaks panel instead of back-and-forth chat, export zip and open a fresh session when threads get long.

I've already eaten through my design quota and I've already spent over $200 in extra usage just playing around with the stuff.

The Claude Code handoff is the killer integration

~19:09 Two export paths: "Hand off to Claude Code" (a command to paste) and download-as-zip. The zip contains the background video, scraps, uploads, and the HTML. Open in VS Code with Claude Code, "push this to a private GitHub repository," connect Vercel, deploy. Theo highlights this same flow as the actual killer feature: ~31:12 "instead of building a super fancy bridge between the tools that's proprietary, the solution is literally just zip all the context and throw it at the agent with a link."

Theo's Iris story: why this matters beyond Figma

~10:05 Theo's long tribute to Iris — a non-coder designer at Twitch who built working prototypes herself — makes the emotional case: Claude Design's real unlock is empowering designers without frontend engineers as gatekeepers.

If you give a motivated person like her the tools they need to make something useful and playable... that's magical. And I know that if they get this right, people like Iris are going to just take over the world with it.

Theo's verdict

This is the best software Anthropic's ever shipped. Not that that's a high bar, but it is a challenge for them.
If I was Figma, I'd be scared as [f***] right now cuz this is actually very useful already.
Tools: Claude Design, Anthropic Lab, skills.md, Claude Opus 4.7, Claude Sonnet 4.6, Claude Code, Kling Nano Banana 2, Kling Cance 2.0, motions.ai, Figma, Adobe, Tailwind UI.sh, SVGL, Excalidraw, Canva, Vercel, GitHub, Google Stitch
AI Models Hot Take
Nate B Jones

Opus 4.7: the behavior changes no one warned you about

Nate B Jones argues 4.7 feels different because three things shifted at once: adaptive thinking under-invests on "simple" tasks, the model follows instructions literally (no more filling in the gaps), and the register is measurably more combative — Code Rabbit's tone harness clocked 4.7 at 77% assertiveness with 16% hedging[18]Nate B Jones, Your Prompts Didn't Change. Opus 4.7 Did.. Most user complaints collapse these three distinct changes into a single "got dumber" narrative.

Read more

Three compounding behavior changes

~20:12 (1) Adaptive thinking "underinvests on tasks it judges as simple." Hex's CTO calibration rule: "low-effort 4.7 is like medium effort 4.6." Anthropic removed temperature, top-P, top-K, and thinking budget entirely. (2) Literal instruction-following: per Anthropic's migration guide, "the model will not silently generalize an instruction from one item to another and it will not infer requests that you did not make." Concrete example: paste an article, ask "summarize this in three sentences and make it punchy." On 4.6 you got three sentences plus a header, kicker, and bolding. On 4.7 you get exactly three punchy sentences. (3) Combativeness: Code Rabbit's 77% assertiveness / 16% hedging measurement. Gergely Orosz of Pragmatic Engineer publicly rolled back to 4.6 over this.

Half the value you were getting from 4.6 was the model guessing at what you meant and filling it in. So that value didn't go away — the value moved because now you have to ask for it explicitly.
Low-effort 4.7 is like medium effort 4.6.

The tokenizer tax and the hidden cost story

~04:01 Per Anthropic's docs, the new tokenizer maps the same text to "up to 35% more tokens." Simon Willison measured 1.46x on the real Opus 4.7 system prompt; independent measurements range 1.29–1.47x. Ethan Lambert flagged that the tokenizer change "suggests that this is a new base model, not a finetune of 4.6" — if correct, 4.7 is architecturally much bigger than the version number implies.

You have a model that charges more and a system that decides how many tokens you get. Both levers moved in the same release. That's not an accident. That's definitely a monetization strategy.
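Both levers compound multiplicatively; quick arithmetic on the reported figures (the 1.25x price factor is a hypothetical, included only to show the compounding — it is not Anthropic's actual pricing):

```python
# Effective cost change = price-per-token change x token-count inflation.
def effective_cost_multiplier(price_mult: float, token_inflation: float) -> float:
    return price_mult * token_inflation

# With list price flat (1.0x), Simon Willison's measured 1.46x inflation alone
# bills ~46% more for the same prompt; pair it with a hypothetical 1.25x price
# increase and the two levers compound to ~1.83x:
tokenizer_only = effective_cost_multiplier(1.0, 1.46)
both_levers = effective_cost_multiplier(1.25, 1.46)
print(tokenizer_only, both_levers)
```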

Head-to-head vs ChatGPT 5.4: persistence fixed, audit trail hallucinated

~06:03 Nate ran 465 messy files (CSV, Excel, PDF, images, VCF) with planted traps ("Mickey Mouse," "asdf asdf," a $25M unit order) through both models in a single-shot migration pipeline. 4.7 finished in 33 minutes vs 53 for GPT 5.4 and built a shippable review UI. But it hallucinated processing a TSV file it never touched. Neither model caught the planted traps — the $25,000,000 order was silently normalized to $25 by 4.7. Benchmark deltas: SWE-Bench verified 80%→87%, Cursor bench 58→70, MCP Atlas 75→77, GDPval 1753 (vs GPT 5.4's 1674). Regressions: BrowseComp dropped 83→79 (GPT 5.4 leads at 89), and Terminal Bench 2.0 Opus trails GPT 5.4 by six points.

Opus 4.7 did not process a file it claimed to process... It's actually breaking trust in the whole agentic flow.

Four playbooks by surface

~28:21 Universal: frontload intent, don't pad prompts — clarity over length. Claude Code: default to extra-high effort, use plan mode, use /ultra review. API: delete temperature/top-P/top-K (they 400 now), flip thinking display to summarized, regression-test cost. Chat: no levers exist — ask for reasoning explicitly, upload the context instead of describing it, start fresh chats aggressively.

You must frontload intent with this model. Anthropic's own guidance on smarter models is that they need less prescriptive engineering, not more.
Delete temperature and top P and top K from your code. They will return 400 errors.
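A defensive shim in the spirit of that playbook — strip the removed knobs before the request leaves your code (the parameter names come from the video; the request dict is a generic sketch, not the actual SDK call):

```python
# Sampling knobs the video says the 4.7 API now rejects with a 400.
REMOVED_PARAMS = {"temperature", "top_p", "top_k", "thinking_budget"}

def migrate_request(params: dict) -> dict:
    """Drop the retired sampling parameters and report what was stripped."""
    stripped = sorted(k for k in params if k in REMOVED_PARAMS)
    if stripped:
        print(f"removed unsupported params: {stripped}")
    return {k: v for k, v in params.items() if k not in REMOVED_PARAMS}

legacy = {"model": "claude-opus-4-7", "max_tokens": 1024,
          "temperature": 0.3, "top_p": 0.9, "messages": []}
clean = migrate_request(legacy)
```

Running the migrated request through your regression-test suite afterward (as the playbook suggests) catches the cost and behavior drift the video warns about.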
Tools: Claude Opus 4.7, Claude Opus 4.6, Claude Code, Claude Code plan mode, /ultra review, Claude API, Claude Projects, ChatGPT 5.4, SWE-Bench verified, Cursor bench, MCP Atlas, GDPval, BrowseComp, Terminal Bench 2.0, Code Rabbit tone analysis, Ocean AI, Factory Droids, Genpark, Hex, Harvey, Databricks Office QA Pro
AI Tools
YouTube

Gemini and NotebookLM merge into unified notebooks

Google merged NotebookLM into Gemini's left sidebar as "notebooks," rolling out to paid plans now with free-tier coming[22]The Most Anticipated Gemini Feature is Here. Notebooks sync bidirectionally between Gemini and NotebookLM — same notebook, same saved chats, notebook-level custom instructions and memory. Sources include websites, PDFs, Drive files, YouTube, and copied text.

Read more

How the integration actually works

~00:00 Notebooks appear in Gemini's left sidebar; clicking shows all saved chats for that notebook. ~06:01 Renaming or deleting in one app propagates to the other. Up to 5 notebooks can be pinned. Existing chats from regular Gemini history can be moved into a notebook via "add to notebook." A notebook can also be used as a source in a standard Gemini chat.

When to use each surface

~05:00 NotebookLM is better for structured learning, inline citations, infographics, podcasts, slide decks, and mind maps — but slower for back-and-forth chat. Gemini is better for creative multi-step reasoning, image input (snap a photo of a garden bed), and canvas-based document/app creation. ~10:03 Canvas enables building custom dashboards or content calendars grounded in notebook data.

Demo: YouTube strategist notebook

~09:03 YouTube Studio export → sort by view count → paste top 25 into NotebookLM → transcripts auto-import. Custom instructions make it act as a YouTube strategist. In Gemini, this notebook analyzes top-performing patterns, writes creative hooks, and produces 30-day channel growth strategies — which outperforms NotebookLM for multi-step planning.

Tools: Gemini, NotebookLM, Gemini canvas, Gems, Google Drive
AI Future AI Tools
AI Daily Brief AI Jason

Agent trends: org charts, memory, and self-evolving harnesses

Nathan's Agent Madness bracket (~100 submissions) surfaces three architectural patterns: AI org charts with named employees and orchestrators, multi-agent debate as a reliability mechanism, and hyper-personal "markets of one" built by non-technical domain experts[23]AI Daily Brief, Agent Building Trends. The loudest infrastructure gap is memory — builders hack around it with markdown brain files, MCP memory servers, and vector DBs. AI Jason's deep-dive on self-evolving agents maps to the same gap: Claude Code's three-layer memory + autodream, and Hermes Agent's autonomous skill generation, are the field's current answer[24]AI Jason, This Agent Self-Evolves (Fully explained).

Read more

Three agent patterns from the bracket

~02:00 (1) AI org charts: Harold (AI chief of staff), Diamond Dozen.ai (Atlas as CEO, Nova for engineering, Blaze for marketing), The Fleet (7 agents + orchestrator), and Myze (employee IDs and a 3-strike termination policy — one agent was "fired" for fabricating business logic). (2) ~05:00 Argument as architecture: when a single LLM call is unreliable, builders made agents argue instead of adding retrieval — WikiTax.ai runs autonomous tax debates 3x/day. (3) ~03:00 Markets of one: a person with episodic Graves' disease gave Claude 9 years of Apple Health data and got a thyroid-flare detector that catches events 2–3 weeks early. An ADHD mom built LifeCoachOS. An Arkansas kayaker built Creek Intelligence.

In a very short amount of time, you've gone from AI assistant to AI employee to AI org chart.
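The "argument as architecture" pattern reduces to a propose-critique-revise loop; a minimal sketch with deterministic stand-ins for the model calls (none of this is WikiTax.ai's actual code):

```python
# Minimal debate loop: a proposer answers, a critic objects, the proposer revises.
# propose/critique/judge stand in for separate model calls (not a real API).
def debate(question: str, propose, critique, judge, rounds: int = 2) -> str:
    answer = propose(question)
    for _ in range(rounds):
        objection = critique(question, answer)
        answer = propose(f"{question}\n[objection] {objection}\nRevise if warranted.")
    return judge(question, answer)

# Deterministic stubs to show the control flow:
propose = lambda q: "deductible" if "[objection]" not in q else "partially deductible"
critique = lambda q, a: "only the business-use share qualifies"
judge = lambda q, a: a
print(debate("Is a home-office chair deductible?", propose, critique, judge))
```

The reliability claim is that a wrong first answer must survive repeated adversarial passes before it ships — disagreement substitutes for retrieval.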

Memory is the biggest unsolved primitive

~04:00 A meaningful share of submissions were essentially elaborate workarounds for agents not persisting state across sessions. Myze uses 50+ markdown "brain files." Carrier File is literally a plain text file users paste into any AI. Open Brain shares one MCP memory server across Claude Code, Cursor, and Windsurf.

All of these hacks — markdown files, knowledge graphs, vector DBs, copy-paste text — is kind of the diagnosis of the big problem facing the agent ecosystem.
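The carrier-file hack is small enough to sketch end-to-end — append facts to a markdown file, prepend the whole file to every prompt (the file name and format are illustrative, not any specific product's):

```python
import tempfile
from pathlib import Path

BRAIN = Path(tempfile.mkdtemp()) / "brain.md"   # Myze uses 50+ such files

def remember(fact: str) -> None:
    """Append a fact so it survives the session."""
    with BRAIN.open("a") as f:
        f.write(f"- {fact}\n")

def build_prompt(user_msg: str) -> str:
    """Paste the whole brain file ahead of the new message — the carrier-file trick."""
    memory = BRAIN.read_text() if BRAIN.exists() else ""
    return f"Known context:\n{memory}\nUser: {user_msg}"

remember("User prefers weekly summaries on Fridays")
print(build_prompt("Draft this week's summary"))
```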

AI Jason: two approaches to self-evolution

~00:01 Approach one (Auto Agent, Auto Research) is a for-loop that rewrites the agent harness itself: a vision file drives an LLM to modify its own runtime, runs an eval, keeps or discards changes. Approach two (Claude Code auto-memory, Open Claude, Hermes Agent) focuses on in-context learning — the agent continuously extracts facts, skills, and history into persistent memory. Jason argues approach two is more practically useful today because it doesn't require a large deterministic eval dataset.
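Approach one is, at heart, a propose-evaluate-keep loop over the harness source itself; a schematic sketch where mutate stands in for the LLM rewrite step and run_eval for the deterministic eval suite Jason says it requires:

```python
def evolve(harness: str, mutate, run_eval, generations: int = 5) -> str:
    """Keep a mutated harness only if it scores strictly better on the eval."""
    best, best_score = harness, run_eval(harness)
    for _ in range(generations):
        candidate = mutate(best)        # LLM rewrites its own runtime/prompts
        score = run_eval(candidate)     # needs a large deterministic eval set
        if score > best_score:          # keep improvements, discard regressions
            best, best_score = candidate, score
    return best

# Toy stand-ins: the "harness" is a string and the eval rewards length,
# purely to show the loop's shape.
result = evolve("v0", mutate=lambda h: h + "+", run_eval=len)
print(result)  # -> "v0+++++"
```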

Claude Code's three-layer memory + autodream

~05:05 Hot memory (CLAUDE.md + memory.md always in the system prompt), warm memory (individual topic files loaded on demand), and an async "autodream" consolidation process triggered after a session ends — spins up a fresh Claude Code session with a dedicated consolidation prompt, reviews existing memory, checks for outdated entries, and updates the index.
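The tiering reduces to a prompt assembler: hot files always included, warm topic files on demand, consolidation deferred to the async pass. A sketch (the file names follow the video; the loader itself is my assumption):

```python
import tempfile
from pathlib import Path

HOT = ["CLAUDE.md", "memory.md"]          # always loaded into the system prompt

def assemble_system_prompt(root: Path, warm_topics: list[str]) -> str:
    """Hot memory loads unconditionally; warm topic files only on demand."""
    parts = [(root / name).read_text() for name in HOT if (root / name).exists()]
    for topic in warm_topics:             # loaded because the task mentions them
        p = root / "memory" / f"{topic}.md"
        if p.exists():
            parts.append(p.read_text())
    return "\n\n".join(parts)

# Demo layout (a temp dir stands in for a real project):
root = Path(tempfile.mkdtemp())
(root / "CLAUDE.md").write_text("project conventions")
(root / "memory.md").write_text("memory index")
(root / "memory").mkdir()
(root / "memory" / "billing.md").write_text("billing quirks")
prompt = assemble_system_prompt(root, warm_topics=["billing"])
# The async "autodream" pass would later re-open memory/ in a fresh session,
# prune stale entries, and rewrite the index in memory.md.
```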

Hermes Agent's state-of-the-art self-learning

~10:07 Four memory tiers (user.md, memory.md, skills, SQLite raw history) plus two async processes: (1) autonomous skill generation — after every 10 steps with no skill created, a sub-agent reviews and decides whether to create or patch a skill, passing through a safety-scan Python guard; (2) memory reviewer — after every 10 turns with no extraction, a sub-agent reviews for preferences and writes to user.md and memory.md. "Skills unmaintained become liabilities."
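The 10-step trigger is plain bookkeeping; a sketch of the counter plus a toy stand-in for the safety-scan guard (the threshold comes from the video; everything else is illustrative):

```python
DANGEROUS = ("os.system", "subprocess", "eval(", "exec(")

def safety_scan(skill_code: str) -> bool:
    """Toy stand-in for Hermes's Python guard: reject obviously risky calls."""
    return not any(marker in skill_code for marker in DANGEROUS)

class SkillTrigger:
    """Fire the skill-review sub-agent after N steps with nothing created."""
    def __init__(self, every: int = 10):
        self.every, self.steps_since = every, 0
    def step(self, skill_created: bool) -> bool:
        self.steps_since = 0 if skill_created else self.steps_since + 1
        if self.steps_since >= self.every:
            self.steps_since = 0
            return True                   # time to spawn the reviewer sub-agent
        return False

t = SkillTrigger()
fired = [t.step(skill_created=False) for _ in range(10)]
print(fired[-1], safety_scan("def greet():\n    return 'hi'"))
```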

Tools: Claude, Claude Code, Opus 4.6, GPT-5.4, Harold, Diamond Dozen.ai, Myze, WikiTax.ai, LifeCoachOS, Creek Intelligence, Carrier File, Open Brain, MCP, Cursor, Windsurf, Hermes Agent, Auto Agent, Auto Research, Mem0, Letta
Industry Hot Take
Dwarkesh Patel

Jensen Huang: ASICs keep getting cancelled

~00:00 Dwarkesh presses Jensen on the fact that two of the top three frontier models (Claude and Gemini) were trained on TPUs. Jensen reframes the competitive question entirely: Nvidia built "accelerated computing," not a tensor processing unit, and Nvidia's hardware serves fluid dynamics, particle physics, and domains far beyond AI — giving it a market reach no ASIC can replicate[25]Dwarkesh Patel, Jensen Huang on Nvidia's Competition. The moat claim is velocity: "the only company in the world that's cranking it out every single year."

Read more

Jensen dismisses ASICs with two arguments: (1) a historical pattern of cancellations ("look at the number of ASICs that have been cancelled"), and (2) the sheer difficulty of building something better than Nvidia. He also deflects by noting that competitors experimenting with alternatives is fine — it validates Nvidia by comparison.

We built a very different thing. What Nvidia built is accelerated computing, not a tensor processing unit.
Just because you're going to build an ASIC, you still have to build something better than Nvidia. And it's not that easy building something better than Nvidia. It's not sensible.
We're the only company in the world that's cranking it out every single year. Big leaps every single year.
AI Models
Y Combinator

Poetic's meta-harness as an alternative to fine-tuning

~01:00 Ian Fischer (Poetic co-founder) argues fine-tuning frontier models is a losing strategy for startups: costs millions, takes months, obsoleted by the next frontier model. Poetic's alternative is a "meta system" — a recursively self-improving agentic harness (code + prompts + reasoning strategies) that sits on top of existing frontier models and auto-generates optimized harnesses at zero re-training cost when new models ship[26]Y Combinator Light Cone, The Powerful Alternative To Fine-Tuning.

Read more

Benchmark results

~05:02 On ARC-AGI v2, Poetic built on top of cheaper Gemini 3 Pro (not Deep Think) and scored 54% vs Gemini 3 Deep Think's 45% — a 9 percentage-point improvement at roughly half the cost ($32 vs ~$70/problem). On Humanity's Last Exam (2,500 expert-level questions), Poetic reached 55%, beating the previous SOTA of 53.1% from Claude Opus 4.6. Optimization cost under $100K — orders of magnitude less than a frontier training run. Team size: 7.

Harness = code + prompts + reasoning strategies

~08:05 The meta system outputs harnesses — combinations of code, prompts, and reasoning strategies layered over one or more LLMs. It can optimize entire agents or sub-components. ~12:12 Harness outputs are often non-human-intuitive (e.g. ARC-AGI prompts with intentional wrong examples). Key insight from prior DeepMind work: optimizing prompts alone left hard-task performance roughly flat at ~5%, but adding AI-generated reasoning strategies pushed it from 5% to 95%.

Tools: Poetic meta system, Gemini 3 Pro, Gemini 3 Deep Think, Claude Opus 4.6, ARC-AGI v2, Humanity's Last Exam, DSPy/JEPPA
Developer Tools AI Tools
AICodeKing Github Awesome Github Awesome

Claude Code skill packs: Addy-Skills, Kami, diagram-design

Three Claude Code skill packs drop the same day. Addy Osmani's Agent Skills repo packages a disciplined software lifecycle (spec → plan → build → test → review → ship) as reusable markdown skills[27]AICodeKing, Addy-Skills + Claude Code, Codex. Kami enforces a single design language across every document Claude generates — white papers, resumes, pitch decks[28]Github Awesome, Kami. diagram-design turns Claude into an editorial graphic designer generating zero-JS HTML/SVG diagrams[29]Github Awesome, diagram-design.

Read more

Addy-Skills: process discipline as reusable skills

~01:02 Seven main slash commands (/spec, /plan, /build, /test, /review, /code-simplify, /ship) map to a software lifecycle, invoking 20+ skill files underneath. Skills are plain markdown — portable across Claude Code, Cursor, Gemini CLI, Windsurf, Open Code, GitHub Copilot. ~04:03 Anti-pattern: don't dump the whole repo into one giant prompt; load the right skill for the right phase. ~09:04 The real argument: strong model + sloppy process = sloppy outcomes; decent model + disciplined workflow = reliable work.

A strong model with a sloppy process still produces sloppy outcomes. A decent model with a disciplined workflow can often produce much more reliable work than people expect.
It is not trying to magically replace engineering judgment. It is trying to encode it into a reusable operating system for the agent.
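The "right skill for the right phase" idea amounts to a lookup keyed on the active slash command; a schematic sketch (the file names are made up — the repo's actual layout may differ):

```python
# Hypothetical phase -> skill-file mapping in the Addy-Skills spirit.
PHASES = {
    "/spec":  ["requirements.md", "acceptance-criteria.md"],
    "/plan":  ["task-breakdown.md"],
    "/build": ["coding-standards.md", "error-handling.md"],
    "/test":  ["test-strategy.md"],
}

def skills_for(command: str) -> list[str]:
    """Load only the skills for the current phase — never the whole repo."""
    return PHASES.get(command, [])

print(skills_for("/build"))
```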

Kami: one design language across every doc type

Kami applies a branded aesthetic (warm parchment canvas, ink-blue accents, serif/sans-serif pairing) across six document types using three SVG diagram primitives and eight design invariants[28]Github Awesome, Kami. Documents auto-trigger the appropriate template by type.

diagram-design: editorial-quality architecture diagrams

Self-contained HTML and SVG diagrams with zero JavaScript or external dependencies. 13 built-in structural types including sequence diagrams, state machines, Venn diagrams, and swim lanes — designed as an alternative to manual Figma work for quick architecture sketches[29]Github Awesome, diagram-design.

Tools: Claude Code, Cursor, Gemini CLI, Windsurf, Open Code, GitHub Copilot, Verdin, Addy-Skills, Kami, diagram-design
Developer Tools
Better Stack Better Stack

Open-source builder tooling: Multica + PenPot

Two Better Stack deep-dives on open-source alternatives to closed SaaS. Multica wraps terminal coding agents (Claude Code, Open Code) with task management, scheduling, custom system prompts, and multi-machine orchestration — positioned as a self-hostable alternative to Claude managed agents and Routines[30]Better Stack, Multica: The Open Source Tool That Makes Claude Code 10x Better. PenPot is a browser-based, open-source Figma competitor built on real web standards (SVG, CSS, Flexbox, Grid, HTML) — so its inspect mode outputs real CSS, not a translation layer[31]Better Stack, PenPot.

Read more

Multica: self-hosted agent orchestration

~00:00 A daemon polls for tasks, spawns agents via git worktrees, and tracks execution through a Kanban-style UI. Custom system prompts per agent, custom skills (CLI-installed or UI-defined), and custom model flags. ~01:00 Self-hosting via Docker (three containers: Go backend, Next.js frontend, Postgres). Gotcha: set APP_ENVIRONMENT=development and clear the Resend API key to skip OAuth, then use 888888 as the login code. ~05:00 Autopilot (equivalent of Claude Routines) supports cron-scheduled tasks but lacks API/GitHub triggers. Reviewer recommends pairing with Tailscale.

If something is connected to the internet, then it's definitely hackable.
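The worktree trick is the load-bearing part: each task gets its own checkout, so parallel agents never touch the same working directory. A schematic sketch using standard git commands (the task queue and agent launch are elided):

```python
import subprocess
import tempfile
from pathlib import Path

def spawn_agent(repo: Path, task_id: str, branch: str) -> Path:
    """Check out an isolated git worktree per task so agents can't clobber each other."""
    workdir = repo.parent / f"agent-{task_id}"
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "-q", "-b", branch, str(workdir)],
        check=True,
    )
    # The real daemon would now launch a coding agent (e.g. Claude Code) inside
    # `workdir` and stream status back to the Kanban UI.
    return workdir

# Demo against a throwaway repo with one commit (worktrees need a HEAD):
repo = Path(tempfile.mkdtemp()) / "repo"
repo.mkdir()
subprocess.run(["git", "init", "-q", str(repo)], check=True)
subprocess.run(["git", "-C", str(repo), "-c", "user.email=d@example.com",
                "-c", "user.name=daemon", "commit", "--allow-empty",
                "-m", "init", "-q"], check=True)
workdir = spawn_agent(repo, "42", "task-42")
```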

PenPot: real web standards, real handoff

~00:00 PenPot uses real web standards under the hood (SVG, CSS, Flexbox, Grid, HTML) — designs are already expressed in the same language the web uses, not a simulation to be translated later[31]Better Stack, PenPot. ~02:02 Inspect mode outputs clean Flexbox CSS that developers can copy straight into a project — no "Dev Mode" or plugin step. ~03:03 Self-host with a single docker compose up; free, unlimited files, unlimited collaborators. Noted limitations: struggles with very large files, smaller plugin ecosystem.

Instead of designing inside something you have to decode and de-structure later, you're already closer to how the web actually works.
Tools: Multica, Claude Code, Open Code, Hetzner VPS, Docker, Tailscale, Postgres, Next.js, PenPot, Figma, Sketch
Industry
Y Combinator

BillionToOne: liquid biopsy at 600K tests a year

~00:01 BillionToOne detects cell-free DNA in blood to enable non-invasive prenatal genetic testing and cancer detection. Now processing 600K+ tests/year at ~20% prenatal cfDNA market share after going public at a $4B+ valuation[32]Y Combinator, BillionToOne.

Read more

The core technical insight: quantitative counting templates

~04:05 Fetal and tumor DNA is dilute and rare (potentially one altered base pair in 3 billion), and standard PCR amplification introduces so much noise that the signal is lost. Their solution: spike each sample with known synthetic DNA (quantitative counting templates, QCTs) before amplification — allowing them to measure and subtract amplification-introduced errors computationally.

There are 3 billion base pairs in the human genome. In a lot of human diseases we detect from mom's blood... it's usually only one base pair that's different. So you're looking for one base pair that's different out of billions. And that's where the billion to one name came from.
That converts a difficult biology problem to almost a simple mathematical problem.
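The QCT correction is ratio arithmetic: the synthetic spike-in suffers the same amplification bias as the target, so the bias cancels in the read ratio. A toy numerical sketch (numbers illustrative, not BillionToOne's assay):

```python
def estimate_true_copies(target_reads: float, qct_reads: float,
                         qct_copies_spiked: float) -> float:
    """Target and QCT amplify together, so amplification bias cancels in the ratio."""
    return (target_reads / qct_reads) * qct_copies_spiked

# Spike 1,000 synthetic QCT molecules into the sample. After noisy amplification
# we sequence 5e6 target reads and 2e7 QCT reads; the ratio recovers the
# pre-amplification abundance:
print(estimate_true_copies(5e6, 2e7, 1_000))  # -> 250.0 true target copies
```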

Roadmap: the Tesla master plan, in biotech

~16:15 Three steps: (1) prenatal testing — done, one in 11 US babies screened with their test. (2) Late-stage cancer / minimal residual disease — in progress, commercial launch within a year. (3) Early-stage cancer screening for stage 1/2 patients post-surgery, and eventually population-level.

Once we are there, I think technically we would have solved the holy grail of cancer detection.
Tools: PCR amplification, next-generation sequencing, quantitative counting templates, liquid-handling robots, Northstar Select
Productivity Hot Take
Nate B Jones DeepLearning.AI Real Python

Briefly: generative UI teaser, Real Python's 11 steps, and the J-curve

Three shorter items. Nate B Jones's short cites a METR randomized controlled trial showing developers using AI coding tools completed tasks 19% slower, even after controlling for task difficulty, experience, and familiarity[33]Nate B Jones, AI Tools Got Faster But Developers Didn't. DeepLearning.AI teased an upcoming course on generative UI — agents rendering charts and cards instead of plain text[34]DeepLearning.AI, Build Interactive Agents with Generative UI. And Real Python walks through its 11-step editorial process — 20–40 hours of expert review per tutorial[35]Real Python, 11-Step Editorial Process.

Read more

The J-curve of AI adoption

~00:00 Workflow disruption outweighed generation speed — developers lost time evaluating AI suggestions, fixing "almost right" code, context-switching, and debugging subtle errors. 46% of developers don't fully trust AI-generated code. Framing: bolting an AI assistant onto an existing workflow causes a productivity dip before improvement, because the workflow itself hasn't been redesigned around the tool[33]Nate B Jones, AI Tools Got Faster But Developers Didn't.

You're kind of running a new engine on old transmission. The gears are going to grind.

DeepLearning.AI: generative UI course teaser

A paradigm where AI agents render custom UI components (charts, cards) instead of plain text. The course covers defining components with simple schemas, instructing agents when to display them via natural language, and enabling interactive, visually rich outputs such as pie charts from data queries or flight cards from travel queries[34]DeepLearning.AI, Build Interactive Agents with Generative UI.
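The schema-driven component idea can be sketched with plain dicts — the agent emits JSON naming a component, and the renderer validates it against the schema (the component names and fields here are invented for illustration, not the course's API):

```python
# Hypothetical component registry; the agent returns JSON naming one of these.
COMPONENTS = {
    "pie_chart":   {"props": ["title", "slices"]},
    "flight_card": {"props": ["airline", "departs", "arrives", "price"]},
}

def render(agent_output: dict) -> str:
    """Validate the agent's component choice against its schema before rendering."""
    name, props = agent_output["component"], agent_output["props"]
    schema = COMPONENTS.get(name)
    if schema is None:                       # unknown component: fall back to text
        return f"[text] {agent_output.get('fallback_text', '')}"
    missing = [p for p in schema["props"] if p not in props]
    if missing:
        raise ValueError(f"{name} missing props: {missing}")
    return f"[{name}] {props}"

print(render({"component": "pie_chart",
              "props": {"title": "Spend by team", "slices": {"eng": 60, "sales": 40}}}))
```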

Real Python's 11-step editorial process

Every Real Python tutorial goes through 11 editorial steps involving 20–40 hours of expert review: technical experts verify code correctness, educators review for clarity, and editors polish writing — content kept updated as Python evolves[35]Real Python, 11-Step Editorial Process.

Sources

  1. Blog Kimi K2.6: The new leading open weights model — Artificial Analysis, Apr 21
  2. Blog Deep Research Max: a step change for autonomous research agents — Google Developers, Apr 21
  3. Blog Stitch's DESIGN.md format is now open-source so you can use it across platforms — Google Labs, Apr 21
  4. Blog Google brings Pomelli in English to small businesses in Europe — Google Labs, Apr 21
  5. YouTube ChatGPT Images 2.0 launch (keynote + 7 capability videos) — OpenAI, Apr 21
  6. Blog Where's the raccoon with the ham radio? (ChatGPT Images 2.0) — Simon Willison's Weblog, Apr 21
  7. Blog Quoting Andreas Påhlsson-Notini — Simon Willison's Weblog, Apr 21
  8. Blog scosman/pelicans_riding_bicycles — Simon Willison's Weblog, Apr 21
  9. Newsletter Apple bets on hardware in an AI world — Tech Brew, Apr 21
  10. Newsletter Autonomous wheels are deals — Sherwood Snacks, Apr 21
  11. YouTube McPeople | MCP Dev Summit | AI Engineering Europe — Prefect, Apr 21
  12. YouTube MCP Summer has AgentCraft — Prefect, Apr 21
  13. YouTube AIE Miami Day 2 ft. Cerebras, OpenCode, Cursor, Arize AI, and more! — AI Engineer, Apr 21
  14. YouTube Building Generative Image & Video Models at Scale — Sander Dieleman — AI Engineer, Apr 21
  15. YouTube How AI is changing Software Engineering: A Conversation with Gergely Orosz — AI Engineer, Apr 21
  16. YouTube Taste & Craft: A Conversation with Tuomas Artman, CTO Linear & Gergely Orosz — AI Engineer, Apr 21
  17. YouTube Are tech jobs safe in 2026? — Lenny's Podcast, Apr 21
  18. YouTube Your Prompts Didn't Change. Opus 4.7 Did. — Nate B Jones, Apr 21
  19. YouTube Claude just got another superpower... — Fireship, Apr 21
  20. YouTube Did Anthropic just kill Figma? — Theo - t3.gg, Apr 21
  21. YouTube Claude Design Builds Beautiful 3D Websites Instantly (full tutorial) — Nate Herk | AI Automation, Apr 21
  22. YouTube The Most Anticipated Gemini Feature is Here — YouTube, Apr 21
  23. YouTube Agent Building Trends — The AI Daily Brief, Apr 21
  24. YouTube This Agent Self-Evolves (Fully explained) — AI Jason, Apr 21
  25. YouTube Jensen Huang on Nvidia's Competition — Dwarkesh Patel, Apr 21
  26. YouTube The Powerful Alternative To Fine-Tuning — Y Combinator (Light Cone), Apr 21
  27. YouTube Addy-Skills + Claude Code, Codex: This 22-SKILL Setup CAN 10X PERFORMANCE! — AICodeKing, Apr 21
  28. YouTube Kami: a Claude Code skill that enforces one design language across every document — Github Awesome, Apr 21
  29. YouTube diagram-design: Claude Code skill for editorial-quality architecture diagrams — Github Awesome, Apr 21
  30. YouTube Multica: The Open Source Tool That Makes Claude Code 10x Better — Better Stack, Apr 21
  31. YouTube This Open-Source Tool Might Fix Design Handoff Forever (PenPot) — Better Stack, Apr 21
  32. YouTube BillionToOne Is Solving One of Biotech's Hardest Problems — Y Combinator, Apr 21
  33. YouTube AI Tools Got Faster But Developers Didn't #shorts — Nate B Jones, Apr 21
  34. YouTube Coming Soon: Build Interactive Agents with Generative UI — DeepLearning.AI, Apr 21
  35. YouTube Our 11-Step Editorial Process for Python Tutorials — Real Python, Apr 21