The Harness Era Begins

April 15, 2026

AI briefing — aggregated from 41 sources across YouTube channels, blogs, and newsletters

AI Models
Anthropic News The Rundown AI Theo - t3.gg

Claude Opus 4.7 Launches with xhigh Reasoning and Task Budgets

Anthropic released Claude Opus 4.7 on April 16 at the same price as 4.6 ($5/$25 per M tokens), posting a 13% jump on code resolution, 21% fewer errors on OfficeQA Pro, and 98.5% on XBOW's visual-acuity benchmark (up from 54.5%).[1]Anthropic News — Introducing Claude Opus 4.7 Vision input resolution is tripled to ~3.75 megapixels and a new xhigh effort tier plus a public-beta “task budgets” feature give developers finer control over long runs.[1]Anthropic News — Introducing Claude Opus 4.7 Theo’s reaction framed it as a “Cursor killer” moment alongside the desktop app launch.[3]Theo - t3.gg — Claude’s new Cursor killer just dropped

Read more

Benchmarks and Pricing

Claude Opus 4.7 is available via Claude.ai, the Claude API (claude-opus-4-7), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry at the same $5/M input, $25/M output pricing as Opus 4.6. Key benchmark deltas over 4.6: +13% on code resolution, 21% fewer errors on OfficeQA Pro document reasoning, state-of-the-art on a Finance Agent evaluation, and 98.5% on XBOW’s visual-acuity benchmark (vs. 54.5% for 4.6).[1]Anthropic News — Introducing Claude Opus 4.7

New Controls: xhigh, Task Budgets, /ultrareview

A new xhigh effort tier lets developers trade latency for deeper reasoning. “Task budgets” (public beta) caps token spend on long-running work. Claude Code picks up a /ultrareview command and an expanded auto mode for Max subscribers.[1]Anthropic News — Introducing Claude Opus 4.7

Safety Posture

The release notes call out better honesty, stronger resistance to prompt injection, and intentionally reduced cybersecurity capability relative to the Mythos Preview model — consistent with Anthropic’s restricted-access playbook for offensive capabilities.[1]Anthropic News — Introducing Claude Opus 4.7

“Code quality is noticeably improved, it’s cutting out meaningless wrapper functions. Fixes its own code as it goes. Works coherently for hours, pushes through hard problems rather than giving up.”
Tools: Claude.ai, Claude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, Claude Code
AI Tools Industry
Theo - t3.gg The Rundown AI The AI Daily Brief Nate Herk | AI Automation Better Stack

Claude Code Desktop 2.0: Routines, Multi-Session, and Managed Agents

Anthropic shipped a redesigned Claude Code desktop app built for parallel agentic work — sidebar for managing multiple local and cloud sessions, integrated terminal and editor, git worktrees on by default, and a new Routines research preview for GitHub-event or API-triggered cloud agents.[4]The Rundown AI — Anthropic redesigns Claude Code desktop[5]The AI Daily Brief — Vibe Coding Gets an Upgrade Theo roasted the launch UX as “slop” with ~5 bugs visible before any code change.[3]Theo - t3.gg — Claude’s new Cursor killer just dropped Separately, users noticed a ~40% jump in token burn after Claude Code updated to v2.1.100 — infrastructure for upcoming features loaded into every request.[7]Better Stack — Claude Code is Silently Burning Your $200 Subscription

Read more

What Shipped

Anthropic rebuilt the desktop experience inside the same app that hosts Claude chat and Claude Co-work (three products in one). New features include tiling/split view, thread forking, built-in terminal, sidebar diff view, remote control across Macs on your network, multi-folder project context, a Preview feature for dev servers with network/log inspection, and a plug-in store including an iMessage connector. ~00:00 Notably, the Claude desktop app itself uses less memory than the CLI despite spinning a ~2.5GB Claude VM service.[3]Theo - t3.gg — Claude’s new Cursor killer just dropped

Routines: Cloud-Hosted 24/7 Agents

Routines let users package a saved Claude Code config (prompt + repos + connectors) and trigger it on schedule, via API, or on GitHub events (PRs, pushes, issues, releases). Runs execute in short-lived Anthropic cloud environments — Pro gets 5/day, Max gets 15, Team/Enterprise gets 25 — each with 4 vCPUs, 16 GB RAM, 30 GB disk. The Register called it “a dynamic cron job or a trigger-driven short-lived event.”[6]Nate Herk — Claude Code Just Dropped Routines

“A routine is a saved Claude Code configuration, a prompt, one or more repositories, and a set of connectors packaged once and run automatically.”

Theo’s UX Review: ~5 Bugs Before a Single Change

~04:00 Theo documents: the stop icon stays visible after threads freeze, there are no copy buttons anywhere, pasting an image attaches it to the prior tool call (“Anthropic doesn’t know how to use their SDK”), bypass-permissions mode isn’t persisted, Ctrl+backtick opens the terminal in the wrong tab, default git worktrees live inside the project and force .gitignore edits, voice input types into all chat windows simultaneously, and the in-app Preview browser remains laggy even with bypass on. He argues Anthropic is failing at both of a lab’s options — ship open-source primitives so others can build, or ship something genuinely great.[3]Theo - t3.gg — UX review

“The level of slop that is being shipped by Anthropic is unfathomable and y’all just bear and grin it.”
“From a multi-billion dollar near trillion dollar company that has supposedly been working on this for months... if this was a vibecoded thing that two small devs at a recent YC company were showing me, I would probably pass on the investment.”

Claude Code Pipelines: Content Creation Case Study

~09:03 Nate Herk built an overnight content-production pipeline with Claude Code orchestrating HeyGen (Avatar 5 from 10M+ facial-expression data points), ElevenLabs, and Remotion. Scripts are chunked at sentence boundaries for ElevenLabs’ ~1-minute quality ceiling and HeyGen’s 3-minute cap; Remotion syncs word-level text overlays. Full stack: ~$250/mo subscriptions + ~$4/min API = roughly $6/hr of reclaimed time.[9]Nate Herk — Claude + HeyGen
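The sentence-boundary chunking step is easy to sketch. The code below is an illustrative sketch, not Herk’s actual pipeline: the `WORDS_PER_MIN` rate and function names are assumptions, and the real caps come from each service (~1 minute for ElevenLabs quality, 3 minutes for HeyGen).

```python
import re

# Rough speech rate used to convert word counts into minutes (assumption).
WORDS_PER_MIN = 150

def chunk_script(script: str, max_minutes: float) -> list[str]:
    """Split a script at sentence boundaries so no chunk overruns the cap."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    max_words = int(max_minutes * WORDS_PER_MIN)
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))  # flush before exceeding the cap
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# ~1-minute chunks for TTS; the same call with max_minutes=3.0 would yield
# the avatar-renderer segments.
narration = "This is one sentence of narration. " * 60
tts_chunks = chunk_script(narration, max_minutes=1.0)
```

Splitting only at sentence boundaries keeps each TTS request inside the quality ceiling without ever cutting mid-sentence, which would be audible after stitching.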

The Token-Burn Spike

~00:00 After updating from 2.1.98 to 2.1.100, per-request token cost jumped from ~50K to ~70K — a 40% increase billed as cache-creation tokens and invisible to /context. Two causes: a massively expanded system tool registry (MCP and agent-swarm support sent with every request) and a bug where the standalone Claude binary mangles cache fingerprints, forcing full codebase re-processing each turn. A source-code leak exposed feature flags for terminal pets, proactive mode, and background memory consolidation — infrastructure loaded per-request even when inactive.[7]Better Stack — Claude Code token burn

“This can turn a 10-cent request into a $2 request instantly.”

Leaked System Prompts

~00:00 A GitHub repo surfaced raw system prompts for 28 AI coding tools (Cursor, Claude Code, Devin). Cursor’s Agent Prompt 2.0 pattern: gather full context first, break tasks into numbered steps, enforce strict rules, double-check for edge cases. “It’s not the model that’s better, it’s the structure.”[8]Better Stack — Leaked system prompts

Tools: Claude Code, Claude Code desktop, Routines, HeyGen Avatar 5, ElevenLabs, Remotion, Playwright, MCP, GitHub, Cursor, Devin, T3 Code, Codex CLI
Developer Tools Hot Take
Better Stack

Claude Ultra Plan vs. Superpowers: Which Planning Layer Wins

Better Stack pitted Claude Code’s new cloud-based Ultra Plan against the open-source Superpowers plugin and called it “not even close”: Superpowers asks more clarifying questions, produces a more thorough two-phase TDD plan, and uses fewer tokens, while Ultra Plan shines only for async, cross-device workflows.[10]Better Stack — Claude Ultraplan vs Superpowers

Read more

Ultra Plan: Cloud-Based Planning

~00:01 Ultra Plan moves planning from the CLI to the Claude web UI, cloning your repo into a cloud environment to generate detailed implementation plans you can hand off to Claude Code locally. The win case is working away from your machine.

Superpowers: Local TDD Planning

~04:07 The Superpowers plugin asks more clarifying questions up front, produces a two-phase plan (analysis then implementation) with TDD structure baked in, and uses fewer tokens than Ultra Plan for equivalent work.

Verdict

~06:09 For local development, Superpowers wins. Ultra Plan has a narrow use case: async cross-device workflows when you’re not at your machine.

Tools: Claude Code Ultra Plan, Superpowers plugin
AI Future Developer Tools
The AI Daily Brief

Harness Engineering Becomes a Named Discipline

The AI Daily Brief traces the lineage from prompt engineering (2023–24) to context engineering (2025) to harness engineering — the systems, tooling, and access placed around a model so it can actually accomplish goals. Blitzcy’s knowledge-graph harness hit 66.5% on SWE-bench Pro vs. GPT-5.4’s 57.7%, and the field is splitting into “big model” and “big harness” camps.[11]The AI Daily Brief — Harness Engineering 101

Read more

Big Model vs. Big Harness

~00:00 Latent Space framed the central debate. The big-model camp (Boris Cherny and Cat Wu of Claude Code, Noam Brown of OpenAI) argues thin harnesses are best because reasoning models obviate scaffolding. The big-harness camp (Jerry Liu of LlamaIndex) argues models are blank slates and the harness is everything. Kyle at humanlayer.dev splits the difference: “It’s not a model problem, it’s a configuration problem.”

“Our approach is all the secret sauce, it’s all in the model. And this is the thinnest possible wrapper over the model. We literally could not build anything more minimal.” — Boris Cherny, Claude Code
“The Model Harness Is Everything. Agent reasoning is exponentially improving, but models are blank slates. The biggest barrier to AI value is the user’s own ability to context and workflow engineer the models.” — Jerry Liu, LlamaIndex

Anatomy of a Harness

~09:05 Anthropic Labs describes a three-layer architecture: (1) Information — memory, context management, tools, skills; (2) Execution — orchestration, coordination, infrastructure, guardrails; (3) Feedback — evaluation, verification, tracing, observability. LangChain’s Viv adds capabilities like bash/code execution, sandboxed environments, and long-horizon techniques (Karpathy auto-research, Ralph Wiggum loops). OpenAI’s own harness post frames progressive disclosure — skills that unfold context on demand — as essential to avoid crowding the context window.
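The three layers can be made concrete with a toy loop. This is a minimal sketch under the episode’s framing, not Anthropic’s actual interfaces: `Memory`, `toy_model`, and `run_agent` are illustrative stand-ins.

```python
class Memory:
    """Information layer: load and update the working context."""
    def load(self, goal):
        return [f"goal: {goal}"]
    def update(self, context, result):
        return context + [f"observed: {result}"]

def run_agent(goal, tools, model, memory, max_steps=10):
    trace = []                               # Feedback layer: event log / tracing
    context = memory.load(goal)              # Information layer: memory + context
    for _ in range(max_steps):
        action = model(goal, context, tools)
        if action["type"] == "finish":
            break
        if action["tool"] not in tools:      # Execution layer: a crude guardrail
            trace.append(("blocked", action))
            continue
        result = tools[action["tool"]](action["args"])  # Execution: orchestration
        trace.append((action, result))
        context = memory.update(context, result)        # Information: context mgmt
    return trace

def toy_model(goal, context, tools):
    """Stand-in for the LLM: call one tool, then finish once something is observed."""
    if len(context) == 1:
        return {"type": "tool", "tool": next(iter(tools)), "args": goal}
    return {"type": "finish"}

trace = run_agent("find docs", {"search": lambda q: f"results for {q}"},
                  toy_model, Memory())
```

The point of the separation is that each layer can be swapped independently — exactly the staleness problem the Managed Agents work addresses.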

Managed Agents as a Meta-Harness

~15:07 Anthropic’s “Scaling Managed Agents: Decoupling the Brain from the Hands” argues harnesses encode assumptions that go stale as models improve — e.g., context-anxiety fixes shipped for Sonnet 4.5 became dead weight on Opus 4.5. Managed Agents separates the agent loop, execution environment, and event log behind stable interfaces so the harness can evolve without breaking the contract. Nicolas Charrier’s “Great Convergence” post argues Linear, OpenAI, Anthropic, Notion, Google, Microsoft, Meta, Lovable, and Retool are all converging on the same shape: a looping agent harness with the right tools and context management.

“Harnesses encode assumptions that go stale as models improve. Managed agents is built around interfaces that stay stable as harnesses change.” — Anthropic
“The winners will not just have better models. They will have distribution, trusted workflow positioning, proprietary context, and the shortest path from observation to improvement.” — Nicolas Charrier
Tools: Claude Code, Claude Managed Agents, Codex, Open Claw, Cursor 3, LlamaIndex, Blitzcy, SWE-bench Pro, MCP servers
AI Tools Industry
The AI Daily Brief DeepLearningAI

Vibe Coding Matures: Lovable Desktop, Native Payments, Spec-Driven Dev

Lovable shipped a desktop app plus native payments — users describe what they want to sell in natural language and the agent wires it up — while Microsoft and Superblocks both launched enterprise-hardened vibe-coding platforms, signaling that 2026’s biggest building opportunity is making vibe-coded apps safe for production.[5]The AI Daily Brief — Vibe Coding Gets an Upgrade DeepLearningAI + JetBrains launched a Spec-Driven Development course framing spec writing as “the modern developer’s core skill — like writing code that compiled to assembly.”[12]DeepLearningAI — Spec-Driven Development

Read more

Lovable Desktop and Native Payments

~07:03 Lovable’s desktop app brings local MCP access, multi-project management, and native keyboard shortcuts. Native payments is the bigger story: users describe what they want to sell and the agent handles implementation. CEO Anton Osika framed it as “easier than ever to go from product to an actual business.” Critics were quick to note that PCI Level 1 compliance, multi-country acquirer partnerships, and global tax compliance aren’t “vibed” — they underpin the feature.

“You know what you’re not vibe coding? PCI level one. Partnerships with several acquirers per country all over the world, global tax compliance.”

Enterprise Hardening of Vibe Coding

~09:04 Microsoft is testing Open-Claw-inspired features under Corporate VP Omar Shahine’s new team, focused on limiting permissions and role-siloing. Superblocks 2.0 targets the same problem explicitly: business teams build AI-powered apps with permissions baked in, IT can audit and lock anything down instantly, engineering sets universal standards. CEO Brad Menezes warned that vibe-coded apps have become an enterprise attack vector similar to shadow AI — “employees building apps on production data with zero IT oversight.”

Spec-Driven Development

~00:00 The DeepLearningAI course (taught by JetBrains’ Paul Everett) teaches a disciplined agentic workflow: write a detailed markdown spec, then let the agent implement from it. Students build a sample app called “Agent Clinic.” Benefits: control large changes with a few sentences, preserve context across sessions, improve intent fidelity, reduce “cognitive debt” from reviewing large AI-generated diffs.

“In the past, a developer’s main job was writing code that will compile down to assembly. But nowadays, you spend most of your time writing a spec that compiles down to code.” — Andrew Ng
Tools: Lovable, Superblocks 2.0, Open Claw, Microsoft Copilot, JetBrains, MCP
Podcast Industry Hot Take
Dwarkesh Patel

Dwarkesh Patel Interviews Jensen Huang: Nvidia's Moat and the China Chip War

Jensen rejected commoditization fears, framing Nvidia as an electrons-to-tokens manufacturer with a five-layer cake and $250B of upstream supply commitments, and called export controls on China “illogical lunacy” that will concede the world’s second-largest market.[13]Dwarkesh Patel — Jensen Huang interview He also admitted his biggest miss: not realizing VCs could never fund foundation labs at $5–10B scale, letting Google and AWS lock up Anthropic.[13]Dwarkesh Patel — Jensen Huang interview

Read more

The Long China Argument Against Export Controls

~56:41 Jensen’s most heated exchange: China manufactures ~60% of the world’s mainstream chips, has ~50% of the world’s AI researchers, has “ghost data centers” already powered and empty, and has energy so abundant it can gang 7nm chips together to compensate for being a node or two behind. Huawei just posted its largest year ever, SMIC has enormous capacity, and DeepSeek proves Chinese labs compensate with smarter algorithms. He flatly asserted Anthropic’s Mythos was “trained on fairly mundane capacity... abundantly available in China.”[14]Dwarkesh Patel — Jensen Huang Fires Back on China Chip Ban

“The day that DeepSeek comes out on Huawei first, that is a horrible outcome for our nation.”
“We’re not enriched uranium. It’s a chip, and it’s a chip that they can make themselves. Comparing AI to anything that you just mentioned is lunacy.”
“You’re not talking to somebody who woke up a loser. And that loser attitude, that loser premise makes no sense to me.”

Nvidia's Moat: Electrons to Tokens

~00:00 Dwarkesh opened with the commoditization case: Nvidia sends GDSII files to TSMC and outsources the rest. Jensen reframed Nvidia as the electrons-to-tokens transformer, with a philosophy of “as much as necessary, as little as possible” and partnerships up and down the five-layer cake of AI. He argues tools like Synopsys, Cadence, Excel, and PowerPoint will see instances explode as agents multiply usage — the opposite of what markets are pricing.

Supply Chain Locks: $250B of Purchase Commitments

~03:51 Dwarkesh cited SemiAnalysis’s $250B estimate (filings show $100B). Jensen credited a unique feedback loop: because demand is so visible (GTC as “360° of AI”), Micron, SK Hynix, Samsung, TSMC, Lumentum, and Coherent make huge upstream investments specifically for Nvidia. CoWoS was a two-year choke point that Nvidia “swarmed” to mainstream. The hardest bottleneck today, he joked, is plumbers and electricians. Hopper to Blackwell is 30–50× more energy-efficient, enabled not by Moore’s law (~25%/yr) but by algorithmic and cross-stack co-design.

“You want an industry where the instantaneous demand is greater than the total supply of the industry.”

Why TPUs and ASICs Don't Break CUDA

~16:13 Jensen reframed the question: CUDA isn’t a tensor-processing story, it’s an accelerated-computing story — molecular dynamics, QCD, fluid dynamics, data frames, AI. Nvidia stacks are designed to be operated by third parties, which is why they appear in every cloud plus on-prem. ASIC margins (~65%) aren’t meaningfully lower than Nvidia’s (~70%), so the savings story is weak. He repeatedly called out InferenceMAX and MLPerf: “Not one TPU, not one Trainium will come. I encourage them to use InferenceMAX.”

The Anthropic Miss

~38:30 Jensen’s candid self-critique: when Anthropic needed a compute partner, Nvidia wasn’t ready to make multi-billion-dollar investments and Jensen hadn’t internalized that VCs wouldn’t either. “Without Anthropic, why would there be any TPU growth at all? It’s 100% Anthropic. Without Anthropic, why would there be any Trainium growth at all? It’s 100% Anthropic.” He now has a $30B OpenAI investment and an Anthropic position.

Premium Tokens and the Doomer Critique

~88:12 Nvidia is folding Grok into CUDA to extend the Pareto frontier: lower throughput, faster response, higher ASPs — so software engineers paying for responsive tokens can support premium pricing. On jobs, Jensen’s recurring frame is the radiologist prediction: doomers told people not to enter radiology; now we’re short radiologists. “If we scare everybody out of doing software engineering jobs, we’re doing a disservice to United States.”

Tools: CUDA, HBM, CoWoS, NVLink, Spectrum-X, Hopper, Blackwell, Vera Rubin, Feynman, Huawei Ascend 910C, SMIC 7nm, DeepSeek, Trainium, Grok, InferenceMAX, MLPerf
AI Future Podcast
Y Combinator

Robotics Foundation Models Are Finally Shipping

Physical Intelligence co-founder Quan Vuong walked through the research lineage (SayCan, PaLM-E, RT-2, Open X-Embodiment) and the production deployments that make a robotics GPT-1 moment feel imminent — Weave folding laundry in real laundromats, Ultra packing real customer orders in Amazon pouches, all with inference running from the cloud.[15]Y Combinator — Robots Are Finally Starting to Work Vuong sizes the prize at ~10% of US GDP (~$2.4T).

Read more

The Research Lineage

~03:02 SayCan first showed language-model common sense could port into robotics at the planning level. PaLM-E and RT-2 adapted powerful VLMs with robot data, transferring knowledge to low-level action — the robot could “pick up the Coke can and move it to Taylor Swift” despite her never appearing in robot training data. Open X-Embodiment / RT-X pooled data across ~10 robot platforms and achieved 50% better performance than specialists optimized per-platform.

Cross-Embodiment Data as the Strategy

~09:07 Vuong splits the robotics data problem into generation and capture. PI’s bet: rather than manufacture 1,000 of one robot, build infrastructure that absorbs data from thousands of different robot types. No two platforms in PI’s fleet are identical; even a single platform drifts every ~3 months. He teases unpublished zero-shot transfer results on tasks that required “hundreds and hundreds of hours” last year.

Real Deployments: Weave and Ultra

~15:10 Weave (founded by ex-Apple engineers) ships robots into homes; PI got a working laundry-folding model in ~2 weeks after setting the goal. Laundry has been a de facto robotics Turing test because deformable clothing creates an infinite observation space. Ultra runs autonomously in a real e-commerce warehouse, picking items into narrow Amazon soft pouches for a full day with minimal intervention — “autonomy at scale, not a demo station.”

Cloud Inference and Action Chunking

~23:13 Almost all PI robot evaluations run with the model hosted in a real cloud data center. Two tricks make this work: (1) the robot requests the next action chunk while still executing the current one (e.g., with 50 ms of action left, request the next 100 ms), and (2) real-time chunking ensures the new chunk stays consistent with in-flight motion despite network delay. This eliminates expensive on-device compute and decouples hardware from the semantics stack.
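The two tricks reduce to a few lines. The timing constants and the splice logic below are illustrative assumptions, not PI’s published implementation.

```python
CHUNK_MS = 100        # each chunk from the cloud covers 100 ms of motion
REQUEST_LEAD_MS = 50  # fire the next request with 50 ms still left to execute

def next_request_time(chunk_start_ms: int) -> int:
    """Trick 1 (overlap): when to request the chunk after this one."""
    return chunk_start_ms + CHUNK_MS - REQUEST_LEAD_MS

def splice_chunk(committed: list[float], new_chunk: list[float]) -> list[float]:
    """Trick 2 (real-time chunking): keep the actions already committed to
    the motors during network delay, and append only the tail of the new
    chunk, so the plan stays consistent with in-flight motion."""
    return committed + new_chunk[len(committed):]

# The chunk starting at t=0 triggers its successor's request at t=50 ms;
# a late-arriving chunk is spliced past the two actions already executed.
t = next_request_time(0)
merged = splice_chunk([0.10, 0.12], [0.11, 0.13, 0.15, 0.18])
```

Because the robot never waits idle for the network round-trip, cloud inference becomes viable and the expensive compute stays off the robot.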

Playbook for Vertical Robotics Startups

~28:17 Vuong lays out the recipe: (1) deeply understand existing workflows, (2) be scrappy on hardware — reactive models compensate for mechanical imprecision, (3) ensure deployment-time data collection and eval, (4) reach a mixed-autonomy break-even, (5) then scale fleet size. PI open-sourced pi0 and pi0.5 — “the same weights our researchers internally use.”

“It still blows my mind to see a robot actually folding laundry because until ChatGPT I didn’t know if this would exist even in my entire lifetime.”

LLM Agents Babysitting Pre-Training Runs

~46:39 PI uses a Claude skill as a pre-training on-call that monitors large runs, authorized to take remediation actions on errors — yielding ~50% improvement in compute utilization.

Tools: Physical Intelligence (PI), pi0, pi0.5, SayCan, PaLM-E, RT-2, Open X-Embodiment, Weave, Ultra, Claude, MCP
Podcast AI Tools AI Future
AI Engineer The Rundown AI

AI Engineer Interviews Notion's Sarah Sachs and Simon Last on Custom Agents

Notion has rebuilt its agent harness four or five times since late 2022. The latest shift — progressive disclosure scaling to 100+ tools without “saying hello in thousands of tokens” — ships in coming weeks, and Notion just released a Claude-powered Business Workspace Auditor template.[16]AI Engineer — Notion on Custom Agents, Evals, Future of Work Simon and Sarah also pushed back hard on collapsing “evals” into one thing, defended CLIs over MCP in coding contexts, and argued coding agents are “the kernel of AGI.”

Read more

Five Harness Rewrites and "Give the Model What It Wants"

~02:01 Notion started in late 2022 with a JavaScript coding agent, moved to a custom XML tool-calling format (failed because the model didn’t know the format), then to Notion-flavored markdown plus plain SQLite queries. Subsequent rewrites shifted to goal-driven tool definitions owned by individual product teams, culminating in progressive disclosure that scales to 100+ tools. MCP is treated as one type of integration. Notion has no built-in memory concept — “memory is just pages and databases.” Custom agents can invoke other custom agents, enabling manager-agent patterns; one internal user went from 70+ notifications/day to ~5 by wrapping 30 sub-agents in a manager.

“Give models what they want. That was a big learning — really be savvy about what the model wants in terms of its environment.”
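Progressive disclosure of this kind can be sketched as a registry that puts only one-line summaries into every request and expands a full definition on demand. An illustrative sketch, not Notion’s implementation; all names and schemas here are made up.

```python
class ToolRegistry:
    """Cheap index on every request; full schemas disclosed only when needed."""
    def __init__(self):
        self._tools = {}

    def register(self, name: str, summary: str, full_definition: str):
        self._tools[name] = (summary, full_definition)

    def index(self) -> str:
        """What the context window sees by default: one line per tool."""
        return "\n".join(f"- {n}: {s}" for n, (s, _) in self._tools.items())

    def expand(self, name: str) -> str:
        """Loaded only when the model decides to use this tool."""
        return self._tools[name][1]

registry = ToolRegistry()
registry.register("search_pages", "find workspace pages by query",
                  "search_pages(query: str, limit: int = 10) -> list ...")
registry.register("query_db", "run a read-only database query",
                  "query_db(database_id: str, filter: dict) -> list ...")
```

The index stays a few tokens per tool no matter how many are registered, which is how a harness scales past 100 tools without “saying hello in thousands of tokens.”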

Notion's Last Exam and Model Behavior Engineers

~25:10 Sarah runs three eval tiers: CI unit/regression tests; product-launch “report cards” requiring 80–90% pass; and a headroom set called “Notion’s Last Exam” deliberately calibrated to 30% pass, built with Anthropic/OpenAI partnerships. Dedicated headcount includes a data scientist, a model behavior engineer (MBE, often linguistics PhD dropouts), and a full-time evals engineer. Simon turned the eval system into an agent harness — “the same way you’d have a coding agent write the unit test, you should have a coding agent write the eval.”

“It’s a real issue when people say evals and it’s just like, that’s quality. It’s not just unit tests.”

MCP vs. CLI — Pricing-Driven Tool Choice

~40:17 Simon: CLIs inherit terminal environments, have progressive disclosure natively, and are self-healing (an agent can write its own browser in 100 lines if missing). MCP is great for narrow, lightweight, tightly-permissioned agents — “the dumb simple thing that works.” Sarah’s economic argument: using an LLM for deterministic tasks is wasteful and expensive on usage-based pricing. Linear and GitHub go through MCP; Slack, Mail, and Calendar are hand-built because MCP has no trigger protocol.

Pricing and the Middle-Market Gap

~62:36 Notion uses usage-based credits because features like agentic database autofill would otherwise “bankrupt the company — every autofill cell on Opus would be billions.” Frontier labs have abandoned the mid-market tier (Haiku isn’t much cheaper than Sonnet). Notion leans on open-source (MiniMax and others) to fill the intelligence/price/latency triangle. “Auto” is positioned not as the cheapest but the best-for-task selector.

Product Design: "Flippy," Demos Over Memos, and Refusing to Dumb Down

~57:32 Notion delayed the custom-agent launch a month to unify settings and chat into “flippy” — the editing chat is the using chat. Sarah: “We don’t try to make it as easy as possible to use, because the more we do that, the more we abstract away that interpretability that nerfs the agent from being super capable.” Security review is the first partner brought in on any new build.

Software Factories and the Coding-Agent Kernel

~29:12 Simon personally runs agents 24/7: “Every night before I go to bed I’m like, okay, did I start enough agents… my goal is to be confident they won’t be done by the time I wake up.” One coding-agent thread ran 17 days continuously (turned out to be a harness compaction bug). Every Notion engineer went through an identity crisis in summer 2025 realizing their ability to write code matters less than their ability to delegate.

“Coding agents are the kernel of AGI. Sort of everything is a coding agent.”

Claude-Powered Workspace Auditor

Alongside the interview, Notion shipped a Business Workspace Auditor template that uses Claude agents to analyze workspace efficiency and generate consultant-style reports — with permissions, it can execute its own recommended fixes.[4]The Rundown AI — Notion Claude Auditor

Tools: Notion custom agents, Claude Sonnet/Opus, Claude Code, GPT-4/5.2, SQLite, MCP, Linear/GitHub MCP, MiniMax, Gemini, AWS Bedrock
AI Tools AI Future
Sam Witteveen AI Engineer

Seven Things for Agents in Production (and Paperclip's Control Plane)

Sam Witteveen’s production checklist: model control, prompt management, guardrails, budget limiting, tools/MCP, observability, and evals — framed around real failures like leaked API keys and silent hallucinations affecting hundreds of users for a day.[17]Sam Witteveen — 7 Things For Agents in Production Paperclip, an open-source vendor-neutral orchestration platform, took a different angle: organizing agents into org charts (CEO, CTO, coders, CMO) with mandatory QA workflows and hit 50K GitHub stars in 34 days.[18]AI Engineer — Paperclip

Read more

The Seven-Point Checklist

~00:00

  1. Model Control — unified gateway to swap providers (Claude for tool-calling, Gemini for multimodal, fine-tuned open models for structured JSON); don’t hardcode model names given rapid deprecations.
  2. Prompt Management — treat prompts as IP; a prompt registry storing full config (text, model, temperature, guardrails, tools) with versioning, playground, and eval-gated publish.
  3. Guardrails — pre-LLM, post-LLM, pre-tool, post-tool hooks; PII/PHI redaction; preventing competitor mentions.
  4. Budget Limiting — per-model, per-project, per-day spend caps; per-developer limits to contain rogue loops.
  5. Tools & MCP — centralize MCP auth through a gateway; granular permissions; test each tool; special care for tools that cost money.
  6. Monitoring & Tracing — OpenTelemetry-compatible traces exportable to Datadog/New Relic.
  7. Evals — pre-production and continuous post-production; test cheaper models against historical traces to detect regressions.
“Your prompts are your intellectual property.”
“You could be one runaway loop away from a nightmare invoice.”
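Point 4 can be sketched as a tiny gateway. The cap values and class names below are illustrative, not any vendor’s API; a production version would persist state and track per-developer keys too.

```python
from collections import defaultdict
from datetime import date

class BudgetGateway:
    """Per-project, per-day spend cap; refuse calls that would exceed it."""
    def __init__(self, daily_cap_usd: float):
        self.cap = daily_cap_usd
        self.spend = defaultdict(float)  # (project, day) -> dollars

    def charge(self, project: str, cost_usd: float) -> bool:
        """Record the cost if within budget; False means block the call."""
        key = (project, date.today())
        if self.spend[key] + cost_usd > self.cap:
            return False  # contain the runaway loop here, not on the invoice
        self.spend[key] += cost_usd
        return True

gateway = BudgetGateway(daily_cap_usd=5.00)
allowed = [gateway.charge("demo", 2.00) for _ in range(4)]
```

The first two calls fit under the $5 cap; the third and fourth are refused, which is exactly the rogue-loop containment the checklist calls for.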

Paperclip: Human Control Plane for AI Labor

~00:01 Paperclip organizes agents into hierarchical org charts — users assign tasks to a CEO agent that delegates to subordinates. Vendor-neutral (Claude, Codex, Gemini, Hermes, Pi, OpenRouter). Launched March 4 2026, hit 50K GitHub stars by April 8. Creator Doda uses Paperclip to manage Paperclip itself.

“Do not worry about AI taking your job. When you use something like Paperclip, you will be in charge of thousands of agents helping you build your business.”

Quality Control: Plans, QA Workflows, Skills

~09:10 Individual agents are unreliable at self-enforcement — they skip browser testing, ignore instructions. Paperclip enforces a coder → QA reviewer → approver loop regardless of underlying model. A skills system (comparable to skills.sh) installs reusable capability bundles; a “skill consultant” meta-agent diagnoses and improves how other agents use their skills.
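The enforced loop can be sketched independently of any model: a reviewer gate that work cannot escape until it passes. `toy_coder` and `toy_reviewer` stand in for agents; none of this is Paperclip’s actual code.

```python
def enforce_qa(task, coder, reviewer, max_rounds=3):
    """Coder -> QA reviewer loop: only reviewer-approved work is returned,
    regardless of which underlying model produced it."""
    feedback = None
    for _ in range(max_rounds):
        work = coder(task, feedback)
        approved, feedback = reviewer(work)
        if approved:
            return work
    raise RuntimeError(f"QA did not pass within {max_rounds} rounds: {feedback}")

# Toy agents: the coder skips browser testing on the first pass and only
# adds it once the reviewer complains -- the failure mode the talk describes.
def toy_coder(task, feedback):
    return "patch + browser test" if feedback else "patch only"

def toy_reviewer(work):
    return ("browser test" in work, "run the browser test")

result = enforce_qa("fix login flow", toy_coder, toy_reviewer)
```

Moving the quality rule into the workflow rather than the agent’s instructions is the whole trick: the coder cannot self-certify.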

Roadmap: Maximizer Mode, Cloud, Multi-User

~21:14 Next 30 days: multi-user collaboration, cloud/sandboxed agent deployments via E2B, a free desktop app, Paperclip Cloud hosting, and “maximizer mode” — a high-autonomy mode for burning tokens toward a goal with minimal stops.

“Maximizer mode — when you’ve got a dream and tokens to burn, and you want the agents to work as hard as they can to do whatever it takes to create your business.”
Tools: TrueFoundry, OpenTelemetry, Datadog, New Relic, Anthropic Claude, Gemini, OpenRouter, MCP, Paperclip, E2B, Greptile, Remotion
Hot Take AI Future
Nate B Jones

The Real Problem with AI Agents: Tacit Knowledge and the "Now What?" Gap

Nate B Jones argues agent installation is a solved 10-second problem but productive use is a massive unsolved gap. The root cause is structural: senior knowledge workers have compiled expertise into tacit judgment they can’t articulate — and the fix is an interviewer agent that elicits your tacit knowledge first.[19]Nate B Jones — The Real Problem With AI Agents

Read more

The "Now What?" Gap

~00:00 OpenClaw has ~250K GitHub stars. The dominant question in community forums is literally “Now what?” Brad Mills spent 40 hours after a 10-minute install writing standards, definition of done, and transcribing 200 hours of video — and still ended up micromanaging the agent. Another user built an adversarial auditor agent to verify the first agent actually did the work.

“A generic agent with write access to your email is actually worse than no agent at all. It’s a liability with a chat interface.”

What Works: Markdown Files as the Operating System

~06:05 Successful deployments share a file structure inside .openclaw: soul.markdown (role, job, tone, boundaries), identity.markdown (name, personality), user.markdown (detailed human profile), heartbeat.markdown (cron-reviewed checklist). “None of what I just described is artificial intelligence. It’s just plain text. But the quality of those files determines whether your AI agent is actually any good at anything.”
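The “plain text as the operating system” claim implies a startup step that is almost trivially small. The loader below is an illustrative sketch, not OpenClaw’s actual code; only the file names come from the video.

```python
from pathlib import Path

# The four files the video describes, in precedence order.
AGENT_FILES = ["soul.markdown", "identity.markdown",
               "user.markdown", "heartbeat.markdown"]

def build_system_prompt(root: str = ".openclaw") -> str:
    """Concatenate whichever agent files exist into one system prompt."""
    parts = []
    for name in AGENT_FILES:
        path = Path(root) / name
        if path.exists():            # missing files are simply skipped
            parts.append(f"## {name}\n{path.read_text().strip()}")
    return "\n\n".join(parts)
```

Everything the agent “is” lives in those files, which is why their quality, not the loader, determines whether the agent is any good.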

Every Product Fights the Wrong Wall

~11:08 Nate surveys the landscape: Original OpenClaw (Peter Steinberger, developer-focused), Manis (Meta-owned, auto-decomposes but shallow context), Perplexity Personal Computer (dedicated Mac Mini routing to 20 frontier models — Aravind Srinivas: “a traditional operating system takes instructions and an AI operating system takes objectives”), Nemoclaw (Nvidia enterprise wrapper at GTC), Claude Dispatch (Anthropic mobile-first), plus dozens of hosted wrappers. All compete on installation/UI/security/pricing while “every product breaks against the same wall.”

The Tacit Knowledge Problem

~22:18 The root bottleneck: as you become more senior, your work migrates from explicit processes to tacit judgment — “compiled from source code into machine code, metaphorically, and you no longer have the source code.” The people with the most to gain from agents are the ones whose work is hardest to delegate; juniors may have an easier time (Shopify reportedly hires juniors intentionally for this reason).

“The first agent you run should not be your open claw assistant. The first agent you run should be a tool to prepare you to run agents the way you want.”

Nate built an interviewer agent that walks a 45-minute elicitation across five layers — operating rhythms, recurring decisions, inputs needed, dependencies, friction — and auto-generates soul.markdown, heartbeat.markdown, and user.markdown files. He frames agents as “the first universal selfish incentive” to externalize tacit knowledge, because the person who documents their own expertise now gets the compounding leverage.

Tools: OpenClaw, open brain, soul.markdown, heartbeat.markdown, user.markdown, MCP, Manis, Perplexity Personal Computer, Nemoclaw, Claude Dispatch
AI Models Developer Tools
AI Engineer

$1 AI Guardrails: Fine-Tuned ModernBERTs Beat LLM Classifiers

Diego Carpentero’s AI Engineer talk makes the case that a ModernBERT-large fine-tuned on the InjectGuard dataset (75K examples) hits ~85% accuracy at 35–40ms per classification — for under $1 in training cost — and can be self-hosted to avoid privacy risks.[20]AI Engineer — $1 AI Guardrails with ModernBERTs

Read more

LLM Attack Vector Taxonomy

~00:01 The talk catalogs six attack classes: direct prompt injection (Sydney/Bing Chat), indirect injection (Wikipedia or inbox content), adversarial suffix tokens via Greedy Coordinate Gradient (these transfer to black-box models because similar training data yields geometrically similar refusal boundaries), RAG poisoning (5 poisoned chunks in an 8M-document base suffice — 2025 Poison RAG paper), MCP exploitation (hidden instructions in the full tool description the LLM reads but users don’t), and agentic attacks, including an NPM supply-chain compromise via GitHub issue titles (4–5K developers affected) and RCE via “click a link” in computer-use agents.

“Model alignment is more a probabilistic preference. It’s not a hard constraint.”

Why ModernBERT

~17:15 Encoder models process full context in a single forward pass via bidirectional attention. ModernBERT’s tricks: alternating attention (local 128-token window + global every third layer up to 8,192), unpadding/sequence packing (cuts ~50% wasted compute), deep-and-narrow grid-searched architecture (22 or 28 layers, 768/1024 hidden), RoPE positional encoding, and FlashAttention — together a 70% memory reduction during fine-tuning.

The Pipeline

~35:29 Start with ModernBERT-base to validate the pipeline, switch to ModernBERT-large for a ~6 percentage point gain. Dynamic padding, bfloat16 (~40% memory cut, batch size 64), 8-bit Adam. Feedforward head on the CLS token output. Result: 35ms on CPU, ~85% accuracy on specialized benchmarks, correctly flagging Sydney prompts, Wikipedia redirect attacks, GCG suffixes, and MCP credential exfiltration.
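The dynamic-padding step can be sketched in a few lines: pad each batch only to its own longest sequence rather than the model’s global maximum (a generic illustration of the technique, not code from the talk):

```python
def pad_batch(token_ids, pad_id=0):
    """Dynamic padding: pad each batch to its own longest sequence,
    not the model's global maximum (8,192 for ModernBERT), so batches
    of short prompts waste no compute on padding tokens."""
    max_len = max(len(seq) for seq in token_ids)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in token_ids]

# Two short tokenized inputs: both rows are padded to length 4, not 8,192.
batch = [[101, 2023, 2003, 102], [101, 7592, 102]]
padded = pad_batch(batch)
```

Combined with sequence packing and bfloat16, this is where most of the claimed memory and latency savings come from.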

“An encoder model can be retrained cheaply within a matter of hours. This allows us to adapt and to ship faster, more advanced defensive layers.”
Tools: ModernBERT, FlashAttention, Rotary Positional Encoding (RoPE), Hugging Face Datasets, InjectGuard dataset, bfloat16, 8-bit Adam
AI Models Developer Tools
Better Stack AICodeKing AICodeKing AI Search The Rundown AI

Open-Model and Local-Agent Wave: Hermes v0.9, SuperGemma-4, ERNIE-Image

Hermes Agent v0.9.0 dropped with a local web dashboard, native Android/Termux support, 16 messaging platforms, and the deepest security-hardening pass in the project’s history — positioning it as a cross-platform agent system rather than a hobby project.[22]AICodeKing — Hermes Agent V0.9.0 Google’s Gemma 4 26B picked up a community uncensored fine-tune (SuperGemma-4, MLX for Apple Silicon) claiming 95.8 vs. 91.4 on QuickBench,[23]AICodeKing — SuperGemma-4 26B and Baidu’s new ERNIE-Image 8B beat Zimage to take the open-source text-to-image crown.[24]AI Search — ERNIE-Image

Read more

Hermes v0.9.0: Cross-Platform Agent System

~00:02 Released April 13, 2026. Adds a self-hosted local web dashboard (no cloud control panel), native Android via Termux with TUI optimizations, voice backend, and 16 messaging platforms (Telegram, Discord, Slack, WhatsApp, Signal, Matrix, iMessage via BlueBubbles, WeChat, WeCom, DingTalk, Feishu, Mattermost, Home Assistant, webhooks, SMS, email). Also adds fast-mode routing through OpenAI/Anthropic priority queues, background monitoring with watch patterns, a pluggable context engine, native XAI/Grok + Xiaomi MiMO support, backup/restore, and a debug toolkit.

~07:05 Security hardening covers path traversal, shell injection, SSRF redirect guards for Slack uploads, Twilio webhook signature validation, API auth enforcement, Git argument injection prevention, and approval-button authorization.

“Once you start giving an AI tool access to commands, integrations, webhooks, files, and messaging platforms, security is not optional anymore.”

Hermes as a Self-Improving Agent

~00:00 Better Stack’s deeper look: Hermes reflects on interactions, auto-saves style preferences, and creates its own skills. In the demo it learned a user’s tweet-writing style (length 210→400, specific emoji preferences) without being asked, and a fresh session recalled those preferences perfectly. ~04:01 Memory lives in memory.md or an external store (Supermemory, Mem Zero, Open Viking); each session preloads ~3,500 chars (~700 tokens); full history in SQLite FTS5. Compression triggers at 50% of context (vs. Claude Code’s 80%), removing tool-call outputs and compressing the middle.[21]Better Stack — Hermes self-improving agent
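The compaction rule can be sketched as follows (a generic illustration of the 50% threshold and middle-compression idea; the message shape and field names are assumptions, not Hermes internals):

```python
def maybe_compact(messages, used_tokens, context_window, threshold=0.5):
    """Compact history once usage crosses 50% of the context window
    (Claude Code reportedly waits until 80%). Tool-call outputs are
    dropped first; then the middle of the conversation is collapsed
    into a summary stub, keeping the first and last turns intact."""
    if used_tokens < threshold * context_window:
        return messages                       # still plenty of room
    kept = [m for m in messages if m.get("role") != "tool"]
    if len(kept) > 4:
        middle = kept[2:-2]
        stub = {"role": "system",
                "content": f"[compressed {len(middle)} earlier messages]"}
        kept = kept[:2] + [stub] + kept[-2:]
    return kept

history = (
    [{"role": "user", "content": "plan my week"},
     {"role": "assistant", "content": "on it"},
     {"role": "tool", "content": "huge calendar dump"}]
    + [{"role": "user", "content": f"step {i}"} for i in range(4)]
)
compacted = maybe_compact(history, used_tokens=600, context_window=1000)
```

Triggering at 50% rather than 80% trades some context fidelity for never hitting the wall mid-task.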

SuperGemma-4 26B Uncensored

~00:02 Community fine-tune by Jun Song on Hugging Face, built on Gemma 4 26B A4B (MoE with ~3.8B active). MLX 4-bit V2 targets Apple Silicon. Claims: QuickBench 95.8 vs. 91.4 baseline, 46.2 t/s vs. 42.5. Native system prompt, function calling, 256K context. GGUF Q4_K_M (~16.8 GB) for Windows/Linux via llama.cpp/LM Studio. ~04:05 Integrates with Hermes and Open Claw via MLX_LM.server OpenAI-compatible endpoint.

ERNIE-Image vs. Zimage

~00:00 Baidu’s ERNIE-Image beats Zimage in prompt adherence, text rendering, photorealism, complex composition, infographics, and comics — losing only on anatomy and some spatial reasoning. Base and Turbo variants (Turbo: 8 steps, ~16 GB, minimal quality loss). Requires Mistral 3B text encoder (7.5 GB) and Flux 2 VAE (~300 MB), totaling ~20 GB. ~14:08 Local install via ComfyUI, plus Unsloth GGUF quantizations (Q2K at ~3.18 GB up) via the ComfyUI-GGUF extension. The Rundown AI noted ERNIE-Image landed alongside OpenAI crossing ~1B weekly users across ChatGPT and Codex.[4]The Rundown AI — ERNIE-Image launches; OpenAI hits ~1B weekly users

Tools: Hermes Agent, Open Claw, Gemma 4 26B, SuperGemma-4, MLX_LM, llama.cpp, LM Studio, ERNIE-Image, ERNIE-Image Turbo, Mistral 3B, Flux 2 VAE, ComfyUI, ComfyUI-GGUF, Unsloth, Termux, BlueBubbles, Supermemory, Mem Zero, SQLite FTS5
Industry Hot Take
The AI Daily Brief Nate B Jones Nate B Jones

Anti-AI Violence Escalates: Altman Attacks, Red Queen Memo, Workforce Fallout

A 20-year-old threw a Molotov cocktail at Sam Altman’s home at 4:00 a.m. and was arrested carrying an anti-AI manifesto; he had posted under the handle “Butlerian Jihadist.” A second armed attack followed on Sunday. Altman tweeted a family photo conceding “I had underestimated the power of words and narratives.”[25]The AI Daily Brief — AI Populism Turns Violent The climate echoes Tobi Lütke’s Red Queen memo from early 2025 predicting AI-driven talent restructuring — 2026’s volume is “at 11”[26]Nate B Jones — Red Queen memo — while AI fluency emerges as a baseline job requirement and the entry-level pipeline breaks.[27]Nate B Jones — AI fluency as workplace standard

Read more

The Altman Attacks

~00:00 Daniel Moreno Gamma, 20, threw a Molotov cocktail at the gate of Altman’s home at 4:00 a.m. (no injuries) and was later arrested outside OpenAI HQ threatening arson, carrying kerosene and a lighter. An FBI raid on his Texas home found a document listing names and addresses of other AI executives, investors, and board members. He had posted under the handle “Butlerian Jihadist.” Charges include attempted murder plus 10 others, potentially to be treated as domestic terrorism. Four days earlier, 13 rounds had been fired at the front door of Indianapolis City Councilman Ron Gibson, with a note reading “No data centers left.” Amanda Thom and Muhammad Tariq Hussain were arrested Sunday for firing a gun at Altman’s home.

“If I’m going to advocate for others to kill and commit crimes, then I must lead by example.” — Daniel Moreno Gamma’s manifesto

Altman's One-Ring Response

~02:01 Altman tweeted a family photo: “words have power... I had underestimated the power of words and narratives.” He invoked Tolkien: “the only solution I can come up with is to orient towards sharing the technology with people broadly and for no one to have the ring.” Jordan Schachtel blamed EA-aligned AI doomers. Pause AI, Hinton, Krueger, and Soares condemned the attacks. Accelerate Harder countered that safetyist objections rest on cost-benefit analysis, not principled opposition.

“Once you see AGI, you can’t unsee it. It has a real ring of power dynamic to it and makes people do crazy things.”

The Grievance Pipeline

~12:08 NLW frames the violence as continuous with other populist violence — glee over the Titan submersible, the Luigi Mangione lionization (Emerson poll: 41% of 18–29-year-olds said killing a CEO was somewhat or completely acceptable). He cites Bandura’s moral disengagement, Gray/Wegner’s moral typecasting, Konrath’s work, the DARE project, and a Journal of Conflict Resolution paper finding projected economic decline drives violence more than static poverty. “AI CEOs constantly talking about job losses is directly activating this mechanism.”

“The majority of Americans hate AI. Of course, that shouldn’t be a surprise when the CEOs of the three biggest AI labs in America are all basically saying the entire white-collar labor force is just a few years away from getting brutally job-mogged by LLMs.”

Why UBI Backfires

~19:11 Citing Rachel Kleinfeld’s Carnegie Endowment review: reducing affective polarization has zero effect on violence. What works: political efficacy (AI labs’ anti-regulation lobbying undermines this), economic trajectory fixes (retraining, housing, portable benefits), and avoiding the UBI trap — Ginges’ “Moral Logic of Political Violence” found material incentives backfire when sacred values are at stake.

“UBI from AI leaders ratifies the domain-of-loss trajectory: we agree your labor has no future value. Here’s your check.”

The Red Queen Memo and Workforce Restructuring

~00:00 Tobi Lütke’s early-2025 memo warned that stagnation was almost certain without action — and that stagnation is slow-motion failure. A year later, roles are changing and dissolving, AI fluency is a baseline expectation, and compensation is polarizing.

“I do think the Red Queen memo is one of those we’re going to look back on in 10 years and realize this kicked off a new arc in talent restructuring that we’re all living with.”

AI Fluency as a Workplace Standard

~00:00 By end of 2026, AI fluency will appear on most knowledge-work postings (as Shopify has already standardized). Role boundaries dissolve — designers submit PRs, non-engineers prototype. New orchestration roles emerge (e.g., the Browser Company’s “design producer”). Compensation polarizes: workers with demonstrable AI leverage command premiums; others face wage pressure. The entry-level pipeline is under particular strain — companies can’t justify traditional training, but need AI-native early-career talent.

“Companies will pay premiums for workers that can demonstrate genuine AI leverage, not just usage.”
Tools: OpenAI, Pause AI, Center for AI Safety, MCP servers, LLM proxy
Hot Take AI Future
Nate B Jones

"Infinitely Fast" AI Still Bottlenecked by Human Tools; The Agent-Native Web

Google’s chief scientist Jeff Dean argued at GTC that making AI infinitely fast would only yield a 2–3× productivity gain because agents are bottlenecked by human-designed tools — APIs, CRMs, compilers calibrated for human pace. Nate B Jones extends this to a three-layer web rebuild and names four human roles that survive.[28]Nate B Jones — Google’s Chief Scientist on Infinitely Fast AI

Read more

The Tool Ceiling

~03:03 Dean (co-creator of TensorFlow and the TPU) stated that an AI 50× faster than humans would lose most of that speed advantage to the overhead of human-designed tools. NVIDIA’s Bill Dally corroborated: inference is now 90% of data-center power consumption and heading toward 10,000–20,000 tokens/sec per user. In many agentic loops, the majority of wall-clock time is the agent handling tool calls, not inference. MCP draws criticism as often being “a human-friendly API with an MCP wrapper slapped on top.”

“Making a model not 50 times faster, not 100 times faster, but infinitely fast would likely only yield a 2 to 3x fold improvement in our actual productivity.” — Jeff Dean
“We spent a trillion dollars on these agents... We got them to think. Now we’re bottlenecking them on tool calls that were designed for humans.”

The Three-Layer Web Rebuild

~07:07 Jones sketches three layers:

  • Layer 1 (incremental): existing tools rebuilt in Rust/Go — TypeScript 7 rewritten in Go for 10×+ speedup; Lee Robinson’s 38,000-line Rust image compressor built with only coding agents because Rust’s compiler acts as natural verification.
  • Layer 2 (agent-native primitives): OpenAI shipped persistent containers (install deps once, never restart) and server-side compaction keeping agents alive for hours or days. Researchers published branch-FS (copy-on-write with sub-second branch creation). A multi-agent paper showed KV-cache sharing achieves 3–4× lower latency than passing text.
  • Layer 3 (radical replacement): agent-native scaffolding replacing human scaffolding. “Every new generation of model effectively pinches off our human scaffolding.”
“You are losing ground by standing still because every model improvement shifts the ratio of the model capability against your human effort to contain the model and scaffold it.”

Four Human Roles That Survive

~14:10 Jones names four durable roles:

  1. Tool-using generalist — activates AI tools, drives to completion; today’s vibe coder scaled up.
  2. Pipeline engineer — understands infrastructure, data movement, security; keeps agentic systems running and measurable.
  3. Relationship-driven businessperson — closes deals over dinner; agent CEOs will hire high-quality human salespeople specifically to improve close rates.
  4. The adult in the room — maturity to pump the brakes, decide which inefficiencies are worth preserving, provide leadership judgment agents can’t.

A possible fifth: the creative visionary (“Steve Jobs chair”).

“I think it’s a promotion to the hardest and most valuable job in computing.”
Tools: MCP, Salesforce, SAP, SharePoint, Zendesk, TypeScript, Rust, Go, Zig, TensorFlow, branch-FS
Hot Take Industry
Theo - t3.gg

Theo's Letter to Tech CEOs: Open-Source the Business, Build for Weirdos, patch.md

Theo argues the classic SaaS moat — a thousand features locking customers in — is dead because AI lets customers build the missing 1% themselves. His advice to portfolio companies: open-source as aggressively as possible, and adopt a proposed patch.md standard for self-forking, self-healing software.[29]Theo - t3.gg — A letter to tech CEOs

Read more

The Case for Open-Sourcing Your Business

~00:00 Theo acknowledges real risks — agents finding security holes faster, nobodies self-hosting for free, competitors cloning via agents (Cal.com is already fighting exploit attempts). T3 Chat remains closed because small-team security risk could lose millions in seconds. Despite this, he’s been privately advising invested companies to go all-in on open source and decided to share the playbook publicly.

“If I’m wrong on this, my kids don’t get college money. But if I’m right about this, my kids will own colleges.”

The 1% Feature Moat Is Dead

~06:06 Salesforce’s math: 1,000 features, of which customers use 50 and only 25 are core — the other 25 are the bespoke long tail that keeps customers locked in. The Vercel model wins because customers just write code for the gaps (Cloudflare for firewall, Supabase for DB, Convex for backend). T3 Code data: 42K installs, 16K weekly users, 1,500 forks (10% of weekly users). Mitchell Hashimoto’s building-block economy: libghostty grew past Ghostty itself in 2 months, versus 18 months for Ghostty. Claude Code recommends Zustand 65% of the time and Vercel 100% of the time for JS deploys.

“You got to let your customers make their weird shit. Agents will more readily pick open and free software over closed and commercial. At the time of writing this article, that’s an objective truth.”

patch.md: Self-Healing Customizations

~32:20 Theo proposes patch.md — a plain-English file recording the intent of every user customization to a forked app. Workflow: when a user adds a feature, the agent edits code AND writes intent to patch.md. When upstream ships 500 commits and the fork breaks, the update button becomes an agent-driven process: try clean merge, and if it fails, spin up an agent to reapply customizations against the new base.
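A sketch of what a patch.md entry might look like under this proposal (the schema is entirely illustrative — Theo’s post doesn’t pin one down, and the file names and paths below are invented for the example):

```markdown
# patch.md

## 2026-04-12 — Add CSV export to the invoices table
Intent: a "Download CSV" button on /invoices that exports whatever the
current filter shows.
Touched: app/invoices/page.tsx, lib/export.ts
Reapply hint: the button sits next to the existing "Print" action; reuse
whatever table-state hook upstream currently ships rather than the one
this patch originally wired in.
```

Because the file records intent in plain English rather than diffs, an agent can reapply the customization against a base that has drifted 500 commits.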

“Software malleability didn’t matter when the average user couldn’t change it. Now that the average person can prompt their features into existence, the platforms that allow that are going to be the winners.”
Tools: Cal.com, T3 Chat, T3 Code, Salesforce, Retool, Vercel, Cloudflare, Supabase, Convex, Ghostty, libghostty, Zustand, Claude Code, CLAUDE.md, AGENTS.md, SOUL.md, patch.md
Industry
Morning Brew The Rundown AI

Allbirds Becomes NewBirdAI — Pivot from Sneakers to GPU Rentals

Weeks after selling its shoe assets to American Exchange Group for $39M, Allbirds ($BIRD) announced a $50M raise to rebrand as NewBirdAI and acquire GPUs for long-term lease — sending the stock up over 600% (peaking at +876%) in a single day.[30]Morning Brew — Allbirds pivots to AI[31]The Rundown AI — Allbirds ditches sneakers

Read more

The Pivot

Pending shareholder approval, Allbirds will raise $50M, rebrand to NewBirdAI, strip its “public benefit” status from the corporate charter, and acquire “high-performance, low-latency AI compute hardware” for long-term leasing. The stock jumped from under $3 to over $24 at peak (Morning Brew reports +876%; The Rundown reports +600%).

The Skepticism

Allbirds has no AI infrastructure experience, and $50M is a fraction of what established players spend — CoreWeave alone is spending up to $35B this year. Allbirds was valued at $4B at its 2021 IPO before declining through over-expansion and poorly received products. The move fits a pattern of distressed companies rebranding around hot tech — internet companies, then crypto, now AI. Former crypto miners have retooled facilities for AI workloads, and a former karaoke company reinvented itself as an AI logistics play.

“Every company will eventually be an AI company.”
AI Models
Google DeepMind Simon Willison

Gemini 3.1 Flash TTS Unlocks Expressive Speech

Google DeepMind released Gemini 3.1 Flash TTS (model id gemini-3.1-flash-tts-preview), topping the Artificial Analysis TTS leaderboard with an Elo of 1,211 — accepting natural-language scene direction, speaker profiles, and inline delivery notes across 70+ languages, with all output SynthID-watermarked.[32]Google DeepMind — Gemini 3.1 Flash TTS

Read more

Audio Tags for Scene and Speaker Control

Three primary control areas: (1) Scene direction — environment and consistent character dialogue instructions; (2) Speaker-level specificity — unique audio profiles with pace, tone, and accent; inline tags change expression mid-sentence; (3) Seamless export — configured parameters convert to Gemini API code for consistent voice deployment. Native multi-speaker dialogue generation across 70+ languages.

Availability

Developers: preview via Gemini API and Google AI Studio. Enterprise: preview via Vertex AI. Workspace: integrated into Google Vids. No specific pricing disclosed beyond a cost-efficient positioning on the leaderboard.

Simon Willison's Tests

Willison tested accent variations (London, Newcastle, Exeter) and built a multi-speaker conversation tool via Gemini 3.1 Pro to drive the TTS model.[33]Simon Willison — Gemini 3.1 Flash TTS notes

Tools: Gemini 3.1 Flash TTS, Gemini API, Google AI Studio, Vertex AI, Google Vids, SynthID, Gemini 3.1 Pro
AI Models Industry
The Rundown AI Tech Brew

OpenAI's GPT-5.4-Cyber Rejects Mythos Playbook; Microsoft Takes Stargate Norway

OpenAI released GPT-5.4-Cyber with broad access via ID verification through its Trusted Access for Cyber program — explicitly rejecting Anthropic’s Mythos playbook (40+ approved partners only, via Project Glasswing).[4]The Rundown AI — GPT-5.4-Cyber rejects Mythos playbook[34]Tech Brew — Calculated Risks on OpenAI GPT-5.4-Cyber Microsoft is taking over compute infrastructure from OpenAI’s abandoned Stargate Norway project.[34]Tech Brew — Microsoft takes Stargate Norway

Read more

Two Access Philosophies

GPT-5.4-Cyber can reverse-engineer compiled software to identify malware and vulnerabilities. OpenAI researcher Fouad Matin framed the approach: “Cybersecurity is a team sport and no one should be in the business of picking winners and losers.” Initial access still restricted to vetted vendors and researchers, with rollout planned to thousands of defenders and hundreds of teams — a parallel to OpenAI’s cautious 2019 GPT-2 playbook. The article notes open-weight models tend to match proprietary capabilities within months.

Mythos Context

The UK AI Safety Institute confirmed Claude Mythos was the first AI to complete a 32-step corporate hack simulation. Treasury Secretary Bessent convened Wall Street leaders for an emergency Mythos briefing. Anthropic’s Project Glasswing keeps access to ~40 approved partners. Mythos reportedly discovered thousands of vulnerabilities, some dating back decades.

Stargate Norway Handoff

Microsoft is absorbing compute capacity from the abandoned Stargate Norway project, continuing the realignment in the OpenAI–Microsoft compute relationship. No financial figures or timeline disclosed.

Tools: GPT-5.4-Cyber, Claude Mythos, Project Glasswing, Trusted Access for Cyber, Persona ID verification
Industry AI Tools
The Rundown AI The Rundown AI

Industry Churn: Snap 1,000 AI Layoffs, Hiro to OpenAI, Gemini Mac App, AWS Bio Discovery

A roundup from The Rundown AI: Snap cut 1,000 jobs (16% of workforce), OpenAI hit ~1B weekly users, Google launched a native Gemini Mac app, AWS introduced Amazon Bio Discovery for antibody drug design, and Nvidia’s Ising AI became an open-source “operating system for quantum machines.”[31]The Rundown AI — Gemini Mac, Snap layoffs, Notion Auditor[4]The Rundown AI — Hiro to OpenAI, AWS Bio Discovery, Nvidia Ising, ERNIE-Image

Read more

Snap Cuts 1,000 Jobs

Snap eliminated 1,000 jobs (16% of workforce), targeting $500M in annual savings by end of 2026. CEO Evan Spiegel explicitly attributed the cuts to AI productivity gains rather than business headwinds. At Snap, AI now writes 65% of new code and handles over 1M monthly queries. Over 70,000 tech jobs have been eliminated across the industry so far in 2026.

Gemini Native Mac App

Google launched a native Mac desktop app for Gemini with screen-sharing, Google Drive/Photos access, Gemini Nano image generation, and Veo video generation. Triggered via Option+Space. Rolling out globally on Mac; Windows is English-only. Analysts note the app remains chat-first and lacks task execution vs. Claude and ChatGPT Desktop.

Hiro Team Joins OpenAI

Hiro, an AI personal finance startup, is shutting down with the entire team joining OpenAI. No details on role or deal structure.

AWS Amazon Bio Discovery

AWS launched Amazon Bio Discovery for antibody drug design with integrated lab testing, positioning AWS in the life sciences AI market alongside other drug discovery players.

Nvidia Ising: Open-Source AI for Quantum

Nvidia released Ising, an open-source AI operational layer for quantum computing that automates calibration (days down to hours) and delivers 2.5× speed and 3× accuracy improvements in error correction. More than 20 institutions adopted it at launch, including Harvard, Cornell, and Fermilab.

“The operating system of quantum machines.” — Jensen Huang

OpenAI Hits ~1B Weekly Users

OpenAI reported approximately 1 billion weekly active users across ChatGPT and Codex — a significant consumer + developer milestone.

Tools: Gemini, Gemini Nano, Veo, ChatGPT, Codex, Ising, Amazon Bio Discovery
Industry Developer Tools
Fireship Low Level

Supply-Chain Attacks: 31 WordPress Plugins Weaponized and Windows Defender Zero-Days

An attacker bought 31 WordPress plugins on Flippa for mid-six figures, planted a dormant backdoor 8 months ago, then activated it — resolving C2 through an Ethereum smart contract so the domain could rotate instantly.[35]Fireship — 31 WordPress plugins supply chain attack Meanwhile, researcher Nightmare Eclipse dropped two Windows Defender zero-days (Blue Hammer and Red Sun) after a bug-bounty dispute with MSRC.[36]Low Level — Windows Defender zero-days

Read more

WordPress Plugin Acquisition Attack

~00:00 The attacker quietly acquired 31 WordPress plugins from the original developer on Flippa. After 8 months of dormancy, the backdoor activated, reached out to a C2 server, pulled down additional payloads, and in some cases modified core WordPress files like wp-config.php. The C2 domain was resolved through an Ethereum smart contract, letting the attacker rotate domains instantly if blocked. WordPress removed the affected plugins after discovery.

“The attacker didn’t exploit a vulnerability. Instead, they legitimately acquired and took control of a portfolio of plugins by simply purchasing them for money from the original developer on Flippa.”

Cloudflare's Mdash: A Sandboxed Alternative

~03:04 Cloudflare open-sourced Mdash, a WordPress-API-compatible replacement with no original PHP code. MIT-licensed, built on Astro. Plugins run in sandboxed dynamic workers and must explicitly declare capabilities in a manifest; the framework grants access only through specific bindings.

Nightmare Eclipse's Revenge

~00:00 A researcher using the handle Nightmare Eclipse published two Windows Defender privilege-escalation zero-days after claiming MSRC violated an agreement and left them “homeless with nothing.” Microsoft pays up to $250K for Hyper-V escapes but only ~$12K for a critical Defender escalation — a disparity the video flags as far too low given the severity.

“I was not bluffing Microsoft and I’ll do it again.”

Blue Hammer: TOCTOU Race in Defender Update

~03:02 When Defender downloads a VDM (virus definition) file, the attacker interposes a fake cloud-file placeholder to stall the update momentarily, then replaces the VDM file with a symlink to the SAM hive. When the cloud file resolves, Defender — running as SYSTEM — creates a VSS snapshot of the pre-update state, capturing the SAM hive’s contents. The attacker extracts the SAM hive from the VSS snapshot and uses pass-the-hash to reach SYSTEM.
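Blue Hammer is a classic time-of-check/time-of-use race. A harmless single-process Python illustration of the underlying pattern (not the Defender exploit; assumes a POSIX filesystem for symlinks): the path passes a check, is swapped for a symlink, and the subsequent open follows the link.

```python
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
secret = tmp / "secret.txt"              # stands in for the protected file
secret.write_text("password-hashes")
victim = tmp / "report.txt"              # the file the check approves
victim.write_text("just a report")

def swap_in_symlink(path):
    # Simulates the attacker winning the race between check and use.
    path.unlink()
    path.symlink_to(secret)

def read_if_regular_file(path):
    # TIME OF CHECK: the path looks like an ordinary regular file...
    if path.is_symlink() or not path.is_file():
        raise ValueError("refusing suspicious path")
    swap_in_symlink(path)    # ...but nothing freezes the filesystem here...
    # TIME OF USE: the read now follows the symlink to the protected file.
    return path.read_text()

leaked = read_if_regular_file(victim)    # returns "password-hashes"
```

As the closing quote notes, no memory-safe language prevents this: the bug is in the check-then-act logic, not in memory handling.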

Red Sun: Cloud-Tag Abuse to Write Arbitrary Files as SYSTEM

~06:04 When Defender detects a malicious file carrying a cloud tag, it writes the file back to its original cloud location before quarantining — as SYSTEM. The exploit uses an EICAR-like trigger, the same SyncRoot locking technique to create a race window, then swaps the target path via a mount point redirect to System32 and replaces file contents with a malicious executable. Defender writes the payload into System32 and installs it as a service.

“Would Rust have fixed this? No, not at all. Time of check time of use race conditions are not a memory safety issue. They’re more of like a logic or implementation issue.”
Tools: WordPress, Flippa, Ethereum smart contracts, Mdash, Cloudflare, Astro, Windows Defender, VDM files, SAM hive, VSS (Volume Shadow Copy), NTLM / Pass-the-Hash, SyncRoot cloud file API, NTFS mount points, EICAR, MSRC
Developer Tools
Simon Willison Simon Willison Simon Willison Simon Willison

Datasette Ecosystem Release Burst

Simon Willison shipped a cluster of Datasette-related releases on April 15: Datasette 1.0a27 (CSRF moved from tokens to Sec-Fetch-Site headers), datasette-export-database 0.3a1 (compatibility fix), datasette-ports 0.3 (cwd + full DB path in output), and a Claude-Artifacts-built datasette.io news YAML preview tool.[37]Simon Willison — Datasette 1.0a27

Read more

Datasette 1.0a27

CSRF protection now uses modern browser Fetch metadata headers (Sec-Fetch-Site) instead of cookie-based tokens, following Filippo Valsorda’s guidance. A new RenameTableEvent fires when tables are renamed during SQLite transactions, enabling plugins like datasette-comments to react. Other changes: actor parameter for datasette.client test methods, Database(is_temp_disk=True) option to reduce locking, null primary key rejection in upsert, improved API explorer examples, and documented call_with_supported_arguments().[37]Simon Willison — Datasette 1.0a27
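The idea behind the Sec-Fetch-Site change is that the browser, not the server, vouches for where a request came from. A minimal sketch of such a check (a generic illustration of the Fetch-metadata approach Valsorda recommends, not Datasette’s actual code):

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}
TRUSTED_SITES = {"same-origin", "none"}   # "none" = direct navigation

def allow_request(method, headers):
    """CSRF check via Fetch metadata instead of tokens. Browsers attach
    Sec-Fetch-Site to every request, so cross-site writes are rejected.
    A missing header means a non-browser client (curl, server-to-server),
    which carries no ambient cookies, so it is allowed through."""
    if method.upper() in SAFE_METHODS:
        return True                        # reads are not CSRF targets
    site = headers.get("sec-fetch-site")
    if site is None:
        return True
    return site in TRUSTED_SITES

allow_request("POST", {"sec-fetch-site": "cross-site"})  # rejected
```

No tokens means no `ds_csrftoken` cookie to mint, sign, or round-trip, which is exactly why the downstream plugin needed its compatibility release.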

datasette-export-database 0.3a1

Narrow compatibility release: because 1.0a27 removed the ds_csrftoken cookie, the plugin’s custom signed URL mechanism had to be updated.[38]Simon Willison — datasette-export-database 0.3a1

datasette-ports 0.3

Adds the working directory (derived from PID) and full database filesystem path to the output when listing running Datasette instances.[39]Simon Willison — datasette-ports 0.3

datasette.io News Preview via Claude Artifacts

Willison used a single prompt to have Claude clone simonw/datasette.io, examine news.yaml, and build an Artifact that previews the rendered homepage news section and highlights YAML/markdown errors — eliminating a manual editing loop.[40]Simon Willison — datasette.io news preview Artifact

“Clone simonw/datasette.io and look at the news.yaml file and how it is rendered on the homepage. Build an artifact I can paste that YAML into which previews what it will look like, and highlights any markdown errors or YAML errors.”
Tools: Datasette, SQLite, datasette-export-database, datasette-ports, Claude Artifacts, claude.ai
Podcast Hot Take
Better Stack Lenny's Podcast Lenny's Podcast

Engineer Leverage, Hiring Barrels, and the Case Against Customer Interviews

Better Stack argues the top-engineer-to-average-engineer leverage gap has widened from 50× to ~1,000× with Claude Code and Cursor — and companies that pay on output rather than tenure win.[41]Better Stack — How Software Should Be Built Lenny’s Podcast captured two opposing contrarian takes: hiring more headcount without “barrels” who can drive initiatives end-to-end is wasted,[42]Lenny’s Podcast — Hire barrels, not ammunition and talking to customers actively misleads product decisions because buying behavior is subconscious.[43]Lenny’s Podcast — You shouldn’t talk to customers

Read more

Engineer Leverage and Output-Based Comp

~00:00 Paul Graham’s 50× heuristic has widened to ~1,000× with Stack Overflow, Claude Code, and Cursor. Most companies still treat engineers as interchangeable commodities, scheduling work by months. The speaker argues for output-based compensation and small, tight-knit teams of highly senior engineers where learning compounds.

“The very, very best engineer, 1% of engineer on the bell curve, creates, say, 1,000 times as much value as the average engineer.”

Hire Barrels, Not Ammunition

~00:00 After raising money, companies reflexively hire more people — but output doesn’t scale because “barrels” (people who take ideas from inception to success) are scarce. Adding ammunition without barrels just stacks people behind the same initiatives, increasing the “collaboration tax.”

“Can they take an idea and make it happen? One way or the other, they’re going to get your company across that hill. That’s a barrel.”

Why Customer Interviews Can Mislead

~00:00 The provocative stance: customers can’t articulate what they want because buying decisions are subconscious. Asking them to consciously explain subconscious motivations produces misleading answers even with good intentions. The speaker goes further — Shakespeare offers deeper insight into human behavior than any customer research program.

“I hate talking to customers. I refuse to allow colleagues of mine to talk to customers.”
Tools: Claude Code, Cursor, Stack Overflow
Developer Tools Hot Take
Simon Willison Simon Willison Simon Willison LearnThatStack Better Stack Real Python Better Stack

Dev Fundamentals and Quotable Takes

A cluster of dev-fundamentals and quotable commentary: CORS explained, Docker builds dropped from 10 min to under 3, Redash as a SQL-first BI alternative, Zig 0.16’s “Juicy Main,” trapped-ion vs. photonic quantum trade-offs, plus Simon Willison quoting Gruber on Apple’s eroding app moat and Kingsbury on “meat shields” as an emerging AI-accountability job category.

Read more

CORS Explained

~00:00 CORS exists to defend against CSRF. The Same-Origin Policy blocks untrusted scripts from reading responses across origins; CORS is the controlled exception — servers declare trusted origins, the browser enforces. Tools like Postman and curl don’t care about CORS because they’re not browsers. ~06:04 An application/json Content-Type, an Authorization header, or a PUT/DELETE/PATCH method triggers an OPTIONS preflight. Missing OPTIONS handler → 404 → generic console error; always check the network tab.[47]LearnThatStack — Stop Copying CORS Headers

“CORS didn’t create the restriction. It created the exception.”
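The preflight rule the clip describes can be sketched as a small predicate. This is a simplified model, not the video's code: the method and header sets below follow the Fetch spec's "simple request" definition, and `needs_preflight` is a name chosen here for illustration.

```python
# Simplified model of when a browser sends an OPTIONS preflight
# before a cross-origin request (per the Fetch spec's "simple
# request" rules). Anything outside these sets forces a preflight.
SIMPLE_METHODS = {"GET", "HEAD", "POST"}
SIMPLE_CONTENT_TYPES = {
    "application/x-www-form-urlencoded",
    "multipart/form-data",
    "text/plain",
}
SAFE_HEADERS = {"accept", "accept-language", "content-language", "content-type"}


def needs_preflight(method: str, headers: dict[str, str]) -> bool:
    """Return True if a browser would send an OPTIONS preflight first."""
    if method.upper() not in SIMPLE_METHODS:
        return True  # PUT/DELETE/PATCH are never "simple"
    for name, value in headers.items():
        lname = name.lower()
        if lname not in SAFE_HEADERS:
            return True  # e.g. Authorization
        if lname == "content-type":
            if value.split(";")[0].strip().lower() not in SIMPLE_CONTENT_TYPES:
                return True  # e.g. application/json
    return False


# PUT, application/json, and Authorization all trigger a preflight;
# a bare GET or form POST does not.
print(needs_preflight("PUT", {}))                                    # True
print(needs_preflight("POST", {"Content-Type": "application/json"})) # True
print(needs_preflight("GET", {"Authorization": "Bearer t"}))         # True
print(needs_preflight("GET", {}))                                    # False
```

This is also why the 404 symptom in the clip is so confusing: the browser's failing request is the OPTIONS preflight, not the request your code made, so the error only makes sense in the network tab.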

Docker Build Optimization

~00:00 Three common mistakes: wrong COPY order (fix: copy package files, install deps, then copy source), no .dockerignore (500 MB context → 20 MB), no BuildKit cache mounts (3 min install → 8 s). Add multi-stage builds, and total build time drops from ~10 min to under 3 minutes.[48]Better Stack — Docker build optimization
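The first and third fixes can be sketched in one Dockerfile. This is a hypothetical Node app, not the video's example — the point is the layer order and the cache mount:

```dockerfile
# syntax=docker/dockerfile:1
FROM node:22-slim AS build
WORKDIR /app
# Copy only the dependency manifests first, so the install layer
# below stays cached until package.json actually changes.
COPY package.json package-lock.json ./
# BuildKit cache mount: npm's download cache survives across builds.
RUN --mount=type=cache,target=/root/.npm npm ci
# Source changes invalidate only the layers from here down.
COPY . .
RUN npm run build

# Multi-stage: the final image ships built output, not the toolchain.
FROM node:22-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
CMD ["node", "dist/server.js"]
```

Pair this with a .dockerignore listing node_modules, .git, and build artifacts so the context sent to the daemon stays small.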

Redash: Open-Source BI

~00:00 Redash (28K+ GitHub stars) is a SQL editor and dashboard builder in one. Connects to Postgres, MySQL, BigQuery, Snowflake. Single Docker Compose setup. Positioned between no-code Metabase and heavy Superset/Tableau/Power BI — trades polish and mobile for SQL speed and operational simplicity.[50]Better Stack — Redash BI tool

Zig 0.16.0 "Juicy Main"

Zig programs can now declare main() to accept a process.Init parameter, granting access to a general-purpose allocator, default IO, environment variables, and CLI argument handling without manual boilerplate. Willison praised the release notes as exemplary documentation.[46]Simon Willison — Zig 0.16 Juicy Main

Trapped Ions vs. Photonics

~00:00 Real Python clip: trapped ions are more accurate than superconducting qubits but slow; photonic systems work at room temperature but photons are hard to control. No single platform is clearly best — fundamental trade-offs remain.[49]Real Python — Quantum hardware trade-offs

Quotable: Gruber on Apple's Eroding Moat

Willison quotes Gruber: the real goldmine isn’t App Store fees, it’s that Apple’s platforms have the best apps — and that advantage is weakening as developer motivation declines.[44]Simon Willison — quoting John Gruber

Quotable: Kingsbury on "Meat Shields"

Willison quotes Kyle Kingsbury predicting a new employment category: humans hired specifically to absorb legal and reputational liability on behalf of ML systems — internal reviewers, Data Protection Officers, outsourced contractors positioned to be “thrown under the bus.”[45]Simon Willison — quoting Kyle Kingsbury

Tools: Postman, curl, wget, Docker, BuildKit, .dockerignore, Redash, Metabase, Apache Superset, Tableau, Power BI, Postgres, MySQL, BigQuery, Snowflake, Zig

Sources

  1. Blog Introducing Claude Opus 4.7 — Anthropic News, Apr 16
  2. Newsletter OpenAI's GPT-5.4-Cyber rejects Mythos playbook — The Rundown AI, Apr 15
  3. YouTube Claude's new Cursor killer just dropped — Theo - t3.gg, Apr 16
  4. Newsletter OpenAI's GPT-5.4-Cyber rejects Mythos playbook — The Rundown AI, Apr 15
  5. YouTube Vibe Coding Gets an Upgrade — The AI Daily Brief, Apr 16
  6. YouTube Claude Code Just Dropped Routines. 24/7 Agents. — Nate Herk, Apr 15
  7. YouTube Claude Code is Silently Burning Your $200 Subscription — Better Stack, Apr 15
  8. YouTube The Prompt That Makes AI 3x Better (Leaked) — Better Stack, Apr 16
  9. YouTube Claude + HeyGen Just Changed Content Creation Forever — Nate Herk | AI Automation, Apr 15
  10. YouTube Claude Ultraplan vs Superpowers: I Found a WINNER and It's Not Even Close — Better Stack, Apr 16
  11. YouTube Harness Engineering 101 — The AI Daily Brief, Apr 15
  12. YouTube New course! Spec-Driven Development — DeepLearningAI, Apr 15
  13. YouTube Jensen Huang – Will Nvidia's moat persist? — Dwarkesh Patel, Apr 15
  14. YouTube Jensen Huang Fires Back on China Chip Ban — Dwarkesh Patel, Apr 15
  15. YouTube Robots Are Finally Starting to Work — Y Combinator, Apr 16
  16. YouTube Notion's Sarah Sachs & Simon Last on Custom Agents, Evals, and the Future of Work — AI Engineer, Apr 15
  17. YouTube 7 Things For Agents in Production — Sam Witteveen, Apr 15
  18. YouTube Paperclip: Open Source Human Control Plane for AI Labor — Dotta Bippa — AI Engineer, Apr 15
  19. YouTube The Real Problem With AI Agents Nobody's Talking About — AI News & Strategy Daily | Nate B Jones, Apr 15
  20. YouTube $1 AI Guardrails: The Unreasonable Effectiveness of Finetuned ModernBERTs – Diego Carpentero — AI Engineer, Apr 16
  21. YouTube Hermes: The Self-Improving Agent That Gets Smarter Every Day — Better Stack, Apr 15
  22. YouTube Hermes Agent V0.9.0 (New Upgrades) + Free APIs — AICodeKing, Apr 15
  23. YouTube SuperGemma-4 (26B) UNCENSORED + Hermes, OpenClaw, OpenCode — AICodeKing, Apr 16
  24. YouTube New BEST local AI image generator is here! — AI Search, Apr 16
  25. YouTube AI Populism Turns Violent — The AI Daily Brief, Apr 15
  26. YouTube How the Red Queen memo exposed who will actually survive — AI News & Strategy Daily | Nate B Jones, Apr 16
  27. YouTube How top performers dodge AI replacement — AI News & Strategy Daily | Nate B Jones, Apr 15
  28. YouTube Google's Chief Scientist Says Infinitely Fast AI Won't Help You. — AI News & Strategy Daily | Nate B Jones, Apr 16
  29. YouTube A letter to tech CEOs — Theo - t3.gg, Apr 15
  30. Newsletter Allbirds pivots to AI after selling off assets last month — Morning Brew, Apr 16
  31. Newsletter Allbirds ditches sneakers for AI compute — The Rundown AI, Apr 16
  32. Blog Gemini 3.1 Flash TTS: the next generation of expressive AI speech — Google DeepMind, Apr 15
  33. Blog Gemini 3.1 Flash TTS — Simon Willison, Apr 15
  34. Newsletter Calculated risks — Tech Brew, Apr 15
  35. YouTube A rich hacker just penetrated 31 WordPress plugins... — Fireship, Apr 16
  36. YouTube Disgruntled Researcher Drops Windows Exploits for Revenge (Twice) — Low Level, Apr 16
  37. Blog Datasette 1.0a27 — Simon Willison, Apr 15
  38. Blog datasette-export-database 0.3a1 — Simon Willison, Apr 15
  39. Blog datasette-ports 0.3 — Simon Willison, Apr 15
  40. Blog datasette.io news preview tool built with Claude — Simon Willison, Apr 16
  41. YouTube How Software Should Be Built... — Better Stack, Apr 15
  42. YouTube Hire barrels, not ammunition — Lenny's Podcast, Apr 16
  43. YouTube You shouldn't talk to customers — Lenny's Podcast, Apr 15
  44. Blog Quoting John Gruber — Simon Willison, Apr 15
  45. Blog Quoting Kyle Kingsbury — Simon Willison, Apr 15
  46. Blog Zig 0.16.0 release notes: "Juicy Main" — Simon Willison, Apr 15
  47. YouTube Stop Copying CORS Headers - CORS Explained — LearnThatStack, Apr 15
  48. YouTube Your Docker Builds Are Slow… And It's Your Fault — Better Stack, Apr 16
  49. YouTube Trapped Ions vs Photonics: Quantum Hardware — Real Python, Apr 15
  50. YouTube The Open-Source Alternative to Tableau and Power BI (ReDash) — Better Stack, Apr 15