April 29, 2026
Elon Musk's civil trial against OpenAI began in Oakland, seeking $130B in damages plus the removal of Altman and Brockman.[1]The Rundown AI The suit — whittled from 26 claims down to 2 (breach of charitable trust, unjust enrichment) — will expose private founder communications over four weeks.[2]Tech Brew Meanwhile, the WSJ reported that OpenAI missed its 1B weekly-user target and multiple monthly revenue goals as competition from Anthropic intensified, and Google finalized a classified Pentagon AI contract, dropping its 2018 no-weapons pledge despite petitions from 600+ employees.[3]Sherwood News
Musk claims he invested $38M based on nonprofit assurances; OpenAI argues he is angry about being denied majority control. OpenAI's defense presented emails showing Musk's team discussed a for-profit restructure giving Musk a 55% stake and Altman 7.5%.[2]Tech Brew Musk strategically dropped fraud claims and requested damages go to OpenAI's nonprofit arm. Microsoft's lawyers stated they had no knowledge of Altman's 2023 firing and noted Musk did not complain until xAI became competitive.[1]The Rundown AI
"The entire foundation of charitable giving in America will be destroyed"
Sherwood's Snacks highlighted two key questions for OpenAI: whether it understands why it lost enterprise market share, and whether it can compete on quality. OpenAI is pivoting to emphasize Codex growth and enterprise revenue.[3]Sherwood News
"The best ability is availability" — suggesting reliability may matter more than capability.
Google finalized a classified Pentagon contract for AI use in "any lawful government purpose," while Anthropic is reportedly litigating after being blacklisted for refusing to remove safety guardrails.[1]The Rundown AI
Anthropic built BioMysteryBench, a 99-question bioinformatics benchmark using real biological datasets. Claude Mythos Preview achieved 86% on human-solvable tasks and 30% on human-difficult ones — sometimes solving problems that expert panels could not, using entirely different analytical strategies than humans.[4]Anthropic Research
BioMysteryBench consists of 76 human-solvable and 23 human-difficult questions derived from real biological datasets. Claude had access to canonical bioinformatics tools and databases (NCBI, Ensembl).[4]Anthropic Research
Two approaches emerged: a knowledge-based approach leveraging training data from hundreds of thousands of papers, and multi-method convergence where Claude used multiple analytical approaches and selected answers supported by multiple evidence lines — "a strategy we human scientists could learn from."[4]Anthropic Research
On hard problems, only 44% of successful solutions were reliable (solved on 4+ of 5 attempts), while another 44% were "brittle wins" (solved on only 1-2 of 5).
"The accuracy gap is real, but the reliability gap underneath it is the more interesting story."
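The reliable-vs-brittle split above can be made concrete with a small sketch (not Anthropic's actual evaluation code; the success counts below are hypothetical):

```python
# Illustrative sketch: classify per-problem success counts out of 5 attempts
# into the reliability buckets described above. Counts are invented examples.
from collections import Counter

def classify(successes: int) -> str:
    """Bucket a problem by how consistently it was solved across 5 attempts."""
    if successes >= 4:
        return "reliable"      # solved on 4+ of 5 attempts
    if 1 <= successes <= 2:
        return "brittle win"   # solved on only 1-2 of 5
    if successes == 0:
        return "unsolved"
    return "intermediate"      # solved on exactly 3 of 5

# Hypothetical success counts for ten hard problems:
counts = [5, 4, 2, 1, 0, 3, 5, 1, 0, 2]
buckets = Counter(classify(c) for c in counts)
```

The point of the bucketing is that a single pass@5 number hides the difference between a solution the model finds every time and one it stumbles into once.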
Databricks reports GPT-5.5 achieves a 46% error reduction over GPT-5.4 in agent harness settings, becoming the first model to surpass 50% on their benchmark. Codex with 5.5 is now state-of-the-art among all agents.[5]OpenAI — GPT-5.5 is SOTA for Databricks
The key improvement is parsing quality — earlier models could not correctly parse all digits in messy customer documents, but 5.5 shows a "step-wise function lift."[5]OpenAI Databricks expects customers to use Agent Bricks and Agent Supervisor API with GPT-5.5 as the supervisor model for custom agent workflows.
"GPT-5.5 has a step-wise function lift in parsing quality."
NVIDIA released Nemotron 3 Nano Omni, a 30B MoE model (3B active) that fuses text, image, video, and audio into a single forward pass — not a suite of models but one unified multimodal model with open training recipes and weights on Hugging Face.[6]Sam Witteveen
The base is Nemotron 3 Nano (30B parameters, 3B active, a hybrid Mamba-Transformer MoE) pre-trained on 25 trillion tokens. NVIDIA added the C-RADIO vision encoder (images + video) and the Parakeet audio encoder from their open ASR systems. Post-training enables agentic computer-use behaviors.[6]Sam Witteveen
NVIDIA published full pre-training data mix, SFT recipes, RL training stages, and dataset sources on Hugging Face — transparency no other open model family matches.
"We just don't get papers that make all of this open like this."
Sam demos the model via Colab (NVIDIA API or OpenRouter) and locally on a DGX Spark using vLLM. Thinking mode exposes visible chain-of-thought. Audio transcription, image understanding, and structured tool calls all work in a single model. Available in FP16, FP8, FP4, and GGUF quantizations. ~06:05 ~09:06
Best for agentic pipelines needing a general multimodal workhorse. For bulk transcription-only tasks, the standalone Parakeet model still wins on efficiency. ~12:11
Simon Willison released LLM 0.32a0, a major backwards-compatible refactor of his Python CLI tool. The new architecture supports messages-based input, streaming typed parts (text, tool calls, reasoning), and response serialization — reflecting how modern model APIs actually work.[7]Simon Willison
The original abstraction (text in, text out) was insufficient for current model capabilities. Version 0.32a0 redesigns the core to support multi-turn conversations via a messages API, streaming responses as typed events (text, tool_call_name, tool_call_args, reasoning), and serialization via to_dict()/from_dict().[7]Simon Willison
New features include a reply() method for continuing conversations, a -R/--no-reasoning CLI flag, and execute_tool_calls() for running tool calls. The llm-anthropic plugin was updated for the new streaming events. A follow-up 0.32a1 fixed a bug where tool-calling conversations were not correctly restored from SQLite. Willison plans to redesign SQLite logging with a graph model in 0.33.
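The shape of the redesign can be sketched with typed parts and a dict round trip. This is NOT LLM 0.32a0's actual API — `Part`, `to_dict`, and `from_dict` here are invented stand-ins for the idea of streaming typed events and serializing responses:

```python
# Illustrative sketch (invented names, not the library's API): a response is a
# sequence of typed parts, and serialization is a lossless dict round trip.
from dataclasses import dataclass, asdict

@dataclass
class Part:
    type: str    # e.g. "text", "tool_call", "reasoning"
    value: str

def to_dict(parts):
    return {"parts": [asdict(p) for p in parts]}

def from_dict(d):
    return [Part(**p) for p in d["parts"]]

stream = [
    Part("reasoning", "user wants the weather"),
    Part("tool_call", '{"name": "get_weather"}'),
    Part("text", "It is sunny."),
]
assert from_dict(to_dict(stream)) == stream  # lossless round trip
```

The design win over "text in, text out" is that reasoning, tool calls, and text stay distinguishable all the way through streaming, logging, and replay.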
Google launched file generation in Gemini, letting users create downloadable Docs, Sheets, Slides, PDFs, CSVs, and more directly from chat prompts — then download or export to Google Drive. Rolled out globally with no tier restrictions.[8]Google
Supported formats include Google Workspace apps (Docs, Sheets, Slides), Microsoft Office (.docx, .xlsx), PDFs, CSVs, LaTeX, plain text, RTF, and Markdown. Users describe the file they need, and Gemini creates it for download or direct Drive export.[8]Google
OpenRouter now lets developers — and their agents — create accounts, generate API keys, and set up billing via a single CLI command through Stripe Projects integration. Running stripe projects add openrouter/api provisions everything in one step.[9]OpenRouter
The key detail: agents can do this too — enabling automated infrastructure provisioning where AI agents handle their own account setup and billing. This bridges the Stripe and OpenRouter ecosystems through CLI-first developer workflows.[9]OpenRouter
"Your agents can do it too."
Karpathy distinguishes vibe coding (raising the floor so anyone can build software) from agentic engineering (preserving professional quality while going much faster), argues LLMs are a new computing paradigm (Software 3.0), and warns that model capabilities are jagged — shaped by whatever labs put into RL training rather than general understanding.[10]Sequoia Capital
~03:05 Karpathy frames his Software 1.0/2.0/3.0 taxonomy: explicit code, learned weights, and now prompting-as-programming where the LLM is the interpreter and context window is the lever. He describes a sharp transition around December 2025 when agentic coding tools crossed a threshold. ~01:02
~05:05 His MenuGen app (photo-to-illustrated-menu) was rendered obsolete by a single Gemini prompt using Nano Banana to directly overlay food images onto the menu photo — no app needed.
"I don't want to do anything. What is the thing I should copy paste to my agent?"
~10:07 Models excel in domains where RL training creates verification rewards (math, code) but fail at common sense. The chess capability jump from GPT-3.5 to GPT-4 came from someone at OpenAI adding chess data to pre-training, not from general scaling.
"State-of-the-art Opus 4.7 will simultaneously refactor a 100,000 line codebase or find zero day vulnerabilities and yet tells me to walk to this car wash. This is insane."
~16:11 Vibe coding raises the floor — everyone can build software. Agentic engineering preserves the professional quality bar while going faster, coordinating "spiky, failable, stochastic" agents without introducing vulnerabilities.
"Vibe coding is about raising the floor for everyone in terms of what they can do in software. Agentic engineering is about preserving the quality bar of what existed before in professional software."
~20:15 Karpathy stresses that humans remain in charge of spec, design, taste, and judgment. He shares a bug where his agent matched Stripe emails to Google emails instead of using persistent user IDs.
"The agents are kind of like these intern entities. You basically still have to be in charge of the aesthetics, the judgment, the taste and a little bit of oversight."
~26:18 He calls for agent-native infrastructure — docs written for agents, deployment that does not require manual DNS configuration — and envisions agent-to-agent communication handling scheduling and coordination.
"You can outsource your thinking but you can't outsource your understanding."
A blackboard lecture format covering how frontier model inference and training actually work on GPU clusters — roofline analysis, batch size economics, mixture of experts layout, rack-level constraints, pipelining tradeoffs, the memory wall, and reverse-engineering model details from API pricing.[11]Dwarkesh Patel
~00:00 Reiner introduces roofline analysis on a Blackwell NVL72 rack, decomposing inference time into compute time and memory time. At small batch sizes, inference is memory-bandwidth-bound because weight fetches dominate; at large batch sizes it becomes compute-bound.
"If you do not batch together many users, the cost and the economics you get can be like a thousand times worse than if you do batch many users together."
~13:06 The optimal batch size is approximately 300 times the sparsity ratio, yielding roughly 2,000-3,000 concurrent sequences for a DeepSeek-like model — about 128k tokens per second per rack.
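The memory-bound/compute-bound crossover can be sketched with back-of-envelope roofline math. All numbers below are illustrative assumptions (roughly Blackwell-class per-GPU figures and a DeepSeek-like active parameter count), and this dense-math simplification ignores the sparsity adjustment the talk makes:

```python
# Roofline sketch with ASSUMED numbers, not the talk's exact figures.
FLOPS = 4.5e15   # peak FP8 throughput, FLOP/s (assumed)
BW = 8e12        # HBM bandwidth, bytes/s (assumed)
P = 37e9         # active parameters touched per token, 1 byte each (assumed)

def step_time(batch: int):
    compute = 2 * P * batch / FLOPS   # ~2 FLOPs per active param per token
    memory = P / BW                   # every weight byte fetched once per step
    return max(compute, memory), compute >= memory

_, compute_bound_small = step_time(1)     # weight fetches dominate
_, compute_bound_large = step_time(1000)  # arithmetic dominates
crossover = FLOPS / (2 * BW)              # batch where the two costs meet
```

With these numbers the crossover lands around batch ~281, which is why serving a handful of users wastes almost all of the rack's compute.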
~31:23 Expert parallelism maps different experts to different GPUs within a rack, fitting NVLink's scale-up network. Crossing rack boundaries introduces an 8x bandwidth penalty via the slower scale-out network. ~42:33 The physical constraint is literally cable density and bend radius inside the rack.
"It's literally the physical space to put a cable that's constraining it."
~48:39 While pipeline parallelism solves weight memory capacity constraints, it cannot reduce KV cache memory per GPU because more pipeline stages require proportionally more in-flight sequences. For inference, pipelining is latency-neutral but each rack-to-rack hop adds milliseconds.
"KV cache becomes the dominant term... pipelining doesn't help with context length, it totally helps with model size."
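The KV-cache arithmetic behind this point can be sketched with assumed shapes (a generic 60-layer GQA configuration in FP8 — not any specific model's real dimensions):

```python
# Illustrative KV-cache arithmetic; all shapes are assumptions.
# Per-token cache = 2 (K and V) x layers x kv_heads x head_dim x bytes.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 60, 8, 128, 1   # assumed GQA config, FP8

def kv_cache_bytes(seq_len: int, batch: int) -> int:
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
    return per_token * seq_len * batch

one_seq_128k = kv_cache_bytes(128_000, 1)    # ~15.7 GB for a single sequence
fleet = kv_cache_bytes(128_000, 2_000)       # ~31 TB for 2,000 in-flight
```

At ~0.12 MB per token, a single long-context sequence rivals the weights themselves, and the thousands of in-flight sequences that good batching requires make KV cache, not weights, the binding memory constraint — which pipelining over more stages cannot relieve.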
~93:25 Gemini's 50% price increase above 200k tokens reveals the context length where inference transitions from compute-bound to memory-bandwidth-bound. The 5x price difference between input and output tokens confirms decode is heavily memory-bandwidth-bottlenecked. ~119:00 Cached token pricing at different TTLs (5 min vs. 1 hour) likely corresponds to different memory tiers.
"It's funny that they would leak so much information through their API pricing."
~83:14 Reiner argues pre-training tokens, RL tokens, and inference tokens should each consume roughly equal total compute under optimal allocation, leading to the estimate that frontier models are trained on approximately 100x more tokens than Chinchilla-optimal.
~124:06 A closing segment explores Feistel networks being imported into ML as RevNets for memory-efficient training.
Google DeepMind CEO and Nobel laureate Demis Hassabis discusses what is missing for AGI (continual learning, long-term reasoning, memory), why agents are the path to general intelligence, the power of distillation and small models, and proposes his "Einstein test" for genuine scientific discovery — all with a ~2030 AGI timeline.[12]Y Combinator
~00:00 Hassabis argues current LLM components will be part of the final AGI architecture but that continual learning, long-term reasoning, and memory are still unsolved — possibly requiring one or two big new ideas, at 50/50 odds.
"Continual learning, long-term reasoning, some aspects of memory, these are still unsolved. I think all of these are going to be required for AGI."
~04:03 Drawing on his PhD work on hippocampal memory consolidation, he notes today's approach of shoving everything into the context window is "duct tape" — even million-token contexts store unimportant and wrong information alike.
~07:04 Techniques from AlphaGo and Alpha Zero (Monte Carlo tree search, experience replay) are being revisited at scale on foundation models.
~14:07 Models solve IMO gold-medal problems but make elementary math errors. Hassabis describes watching Gemini play chess where it identifies a blunder, cannot find anything better, and plays it anyway.
"Sometimes it will consider a move, it will realize it's a blunder, but it can't find anything better, so it kind of goes back to that move and does it anyway."
~16:08 Nobody has vibe-coded a AAA hit game yet, and autonomous agent swarms running for 40 hours have yet to produce results that justify the input — though he expects this to change in 6-12 months.
"We haven't seen a AAA game that tops the app store charts that was sort of vibe coded yet... Something's still somehow missing."
~37:18 Train a system on 1901 knowledge and see if it produces 1905-level breakthroughs including special relativity. He says we have not passed it yet but may be one or two missing pieces away.
"My Einstein test: can you train a system with the knowledge of 1901 and then will it come up with what Einstein did in 1905, including special relativity?"
~38:19 Hard problems are no more difficult than shallow ones, just differently difficult. Founders must plan for AGI arriving mid-journey around 2030.
"Depending on what your AGI timeline is, mine's like 2030 or something like this, then you have to just consider AGI appearing in the middle of that journey."
Emily Glassberg Sands, Stripe's head of data and AI, discusses how agents are becoming the predominant actors on the internet — reshaping fraud, payments infrastructure, and pricing models — all seen through Stripe's vantage point of processing 2% of global GDP. Top AI companies on Stripe reach $30M ARR in ~18 months, 3x faster than 2018 SaaS companies.[13]Every
~04:04 Fraudsters now steal compute, not just credentials. About 7% of signups across AI companies on Stripe are multi-accounted users. One large AI company saw only 4% of free trials convert to paid, with each trial costing $25 in LLM spend — $625 per paying customer. Free trial abuse has 4x-ed over six months.
"For one large AI user on Stripe, we're currently blocking 250,000 fraudulent free trials a week."
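The trial-abuse economics quoted above are a one-line calculation:

```python
# Acquisition-cost arithmetic from the episode: each free trial burns ~$25 of
# LLM spend and only 4% of trials convert to paid.
trial_cost = 25.00    # $ of LLM spend per free trial
conversion = 0.04     # share of trials that become paying customers

cost_per_customer = trial_cost / conversion   # ~$625 per paying customer
```

Every fraudulent or multi-accounted trial inflates the denominator's spend without ever converting, which is why blocking them moves the unit economics directly.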
~20:15 The top 100 AI companies that reach $30M ARR get there in about 18 months — 3x faster than top SaaS companies from 2018. Largely driven by net new spending rather than SaaS substitution, though substitution from both SaaS licenses and headcount opex is increasing.
"These AI companies are just growing from a revenue perspective faster than any previous cohort we've seen."
~23:16 Seat-based billing will largely disappear in enterprise within six months. Token-based metering for model providers; outcome-based pricing for vertical AI solutions. Stripe built a "token billing" product that tracks underlying model costs in real time.
"I would be super surprised if six months from now we have half of the seat-based licenses that we have today."
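A token-metering model of the kind described can be sketched in a few lines. The rates and markup below are invented examples, not Stripe's product or anyone's real pricing:

```python
# Minimal token-billing sketch: charge a margin over the underlying model
# cost, tracked per request. All rates are INVENTED for illustration.
MODEL_COST = {"gpt": 2.0e-6, "claude": 3.0e-6}  # $/token paid to provider (assumed)
MARGIN = 1.30                                    # 30% markup (assumed)

ledger = []

def meter(model: str, tokens: int) -> float:
    """Record one metered request and return the customer charge."""
    charge = tokens * MODEL_COST[model] * MARGIN
    ledger.append((model, tokens, charge))
    return charge

meter("gpt", 1_000_000)
meter("claude", 500_000)
total = sum(charge for _, _, charge in ledger)
```

The operational difficulty is the part this sketch hand-waves: tracking the *underlying* model cost in real time as providers reprice, which is what the "token billing" product is said to do.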
~37:28 LLM traffic to Stripe docs is up 10x year-over-year while human usage remains flat. Stripe launched "Stripe Projects" for CLI-based provisioning, with 100+ companies requesting to join after launch.
~41:28 Stripe co-created the Agent Commerce Protocol with OpenAI, now also used by Microsoft Copilot and Meta. A Shared Payment Token (SPT) lets credentials pass securely from AI agents to merchants without exposing card details. ~50:35 Stripe's Link wallet (250M consumers) is evolving to support delegated agent purchasing with guardrails.
Mario Zechner (creator of Pi) and Armin Ronacher (creator of Flask) join Gergely Orosz to discuss how Pi became one of the most influential AI coding agents of 2026 — a minimalist, self-modifiable harness built by a single developer who got frustrated with Claude Code's instability. They argue agents produce 10x more code but also 10x more bugs, and that the industry needs to slow down.[14]The Pragmatic Engineer
~32:24 Zechner was a happy Claude Code user, but when his team started dog-fooding aggressively, velocity increased and bugs multiplied. He reverse-engineered Claude Code's system prompt and tracked its evolution — it changed disruptively with every release.
"I don't want my hammer to break at a different spot every day."
~37:27 A bespoke abstraction over LLM provider APIs, a generalized agent loop with tool calling and streaming, a flicker-free TUI, all tied together with extensive hook points. Core tools: read, write, edit, bash — "that's all you need." ~39:27 Pi does not ship with MCP — users build MCP support into it by asking Pi to modify itself.
"You can ask Pi to modify itself because of the extension points and it can write code that extends itself. It's trivial, but it's a big unlock."
~15:08 Armin interviewed 30+ engineering teams. Adoption exploded after Christmas 2025 holidays when engineers had uninterrupted time to experiment, but code quality dropped as PR sizes grew.
"Two minutes later you have another agent running in this window and it spits out the worst horrible garbage, but you might not notice because now you have fallen into automation bias."
~21:11 Senior engineers say "no" to keep complexity down because they have felt the consequences. Agents say "yes" to everything, generating complexity that becomes their own worst enemy when codebases exceed context window capacity.
"A good engineer is an engineer that says no a lot and I don't need this a lot. Because that keeps complexity down. If you're using agents, the exact opposite happens."
~71:54 If an agent produces 10x more code at half the human error rate, it still produces 5x more bugs. Humans review ~1.5k LOC/day well; agents produce 10x that.
"All the companies claiming that all of their code is now written by agents — yes, we know. The quality is garbage. We feel it in our bones when we use your products."
~76:56 MCP is non-composable by design: combining outputs forces data through the model's context window. CLIs with pipes let the model see only end results.
"With a CLI it's a pipe. The model only sees the end result and is super free in how it massages that data."
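The composability point can be made concrete with a toy comparison — token counts simulated as word counts, fake log data, nothing to do with any real MCP server:

```python
# Toy illustration: with MCP-style tools each intermediate result flows
# through the model's context; with a pipe the model sees only the end result.
def tokens(text: str) -> int:
    return len(text.split())   # crude stand-in for a tokenizer

log = ["ERROR db timeout", "INFO ok", "ERROR disk full"] * 100  # fake data

# MCP-style: the full tool output enters context, then the filtered subset
# re-enters context for the next step.
errors = [line for line in log if line.startswith("ERROR")]
mcp_context = tokens(" ".join(log)) + tokens(" ".join(errors))

# Pipe-style: the filter-and-count happens outside the model entirely;
# the model sees one number.
error_count = sum(1 for line in log if line.startswith("ERROR"))
pipe_context = tokens(str(error_count))
```

Here the MCP-style path pushes 1,400 "tokens" through context where the pipe pushes 1 — the structural reason Zechner and Ronacher prefer CLIs with pipes for composable tool use.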
~85:04 A major company will publicly admit it can no longer maintain its codebase without AI — triggering a reckoning about dependency on two model providers.
"Engineering teams are already telling me that they have codebases that they think they couldn't maintain anymore without the machine."
OpenAI's developer experience team walks through Codex's unified agent harness, model progression from GPT-5.2 through 5.4, the plugin/automation system, and the subagent architecture for parallel task decomposition. Codex has crossed 3 million weekly active users, tripling since January.[15]AI Engineer
~00:14 Codex is a full software engineering agent — not just a code writer. ~04:16 Rapid model progression from GPT-5.2 through 5.3 (with Cerebras), to GPT-5.4, plus mini and nano variants. WebSockets deliver ~1.75x faster tokens; fast mode stacks an additional 2x speedup.
"Just last night we crossed the milestone of crossing 3 million weekly active users."
~12:28 The plugin system bundles skills, apps, and MCP servers into installable packages. ~14:30 Automations run as background cron jobs for Slack triage and Gmail filtering.
~31:42 100% of PRs across all OpenAI repos — including leadership — are reviewed by Codex code review by default. It contextualizes beyond the diff to find second-order effects in untouched modules.
"100% of pull requests across all OpenAI repos made by all employees, including Greg, are reviewed by Codex code review by default."
~32:42 A master task is decomposed into parallel, independent subtasks. Each subagent can have its own model, reasoning effort, sandbox mode, MCP server access, and skills. Demo: 20 subagents review 45 persona TOML files.
~49:53 Guardian approvals spin up a subagent to evaluate whether privileged operations need human approval. ~52:13 Hooks for start, per-tool-use, and stop. Codex Security for commit-by-commit vulnerability scanning.
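The fan-out pattern from the subagent demo — one master task split into independent parallel subtasks — looks roughly like this sketch (the reviewer function is an invented stand-in for a real subagent, not Codex's implementation):

```python
# Sketch of master-task decomposition into independent subtasks run in
# parallel, in the spirit of the 20-subagent persona-review demo.
from concurrent.futures import ThreadPoolExecutor

def review_persona(path: str) -> str:
    # Stand-in for a subagent reviewing one persona TOML file.
    return f"{path}: ok"

files = [f"persona_{i}.toml" for i in range(45)]

with ThreadPoolExecutor(max_workers=20) as pool:   # 20 concurrent "subagents"
    results = list(pool.map(review_persona, files))
```

The decomposition only works because the subtasks are independent — which is also why each subagent can safely carry its own model, sandbox mode, and tool access.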
Paige Bailey from Google DeepMind delivers a demo-heavy tour of Gemini 3.1 models, AI Studio's tools, Gemini Live's multimodal interactions, generative media models (Nano Banana 2, Veo 3.1 Light, Lyria 3), Genie 3 world generation, and the open-weight Gemma 4 family.[16]AI Engineer
~00:17 Gemini 3.1 Pro, Flash, and Flash-Lite. Gemini is uniquely multimodal for both inputs AND outputs — video, images, audio, text, and code on both sides.
"One of the reasons that it's very special is that it's multimodal both for inputs and also multimodal in terms of outputs."
~06:19 YouTube video analysis ingests a dinosaur video and produces a table grounded via Google Search for ~27,600 tokens. ~13:22 Compare mode pits Flash-Lite against Flash on bounding-box tasks. ~16:24 URL context feeds blog posts to produce grounded comparisons with inline citations.
~20:26 Screen sharing where the model describes what it sees, responds in Italian on request, recites a poem in a Texan accent about Lego bricks, and identifies hand gestures from a camera feed.
~25:33 Voice-prompts an app that photographs a bookshelf, catalogs books using Gemini's vision plus Google Search grounding, adds Firebase auth and Firestore persistence, and deploys with one click to Cloud Run.
~29:36 Generates a fully navigable world — no physics engine, just frame-by-frame pixel generation from a composition of models.
"No physics engine behind the scenes, no Unity, no Unreal Engine, just each frame generated dynamically pixel by pixel."
~52:09 Gemma 4 runs on mobile devices, will ship on Pixel 10, and is being integrated into Chrome. Augment Code has replatformed their entire agent system to default to Gemini 3.1 Pro for performance plus cost reasons.
Maxime Labonne, Head of Pre-training at Liquid AI, presents lessons from training frontier small models (350M-24B parameters) for on-device edge deployment — covering architecture, over-training on 28 trillion tokens, the doom-looping problem, and the path to agentic small models.[17]AI Engineer
~00:14 Small models are not just scaled-down large models — they are memory-bound, low in knowledge capacity, and latency-sensitive. Embedding layers dominate parameter counts: Gemma 3 270M is 63% embedding layer; LFM2 shrinks that share to ~10%.
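The embedding-dominance claim checks out as simple arithmetic using Gemma 3 270M's published shape (256k-entry vocabulary, hidden size 640):

```python
# Worked arithmetic behind the embedding-share figure for Gemma 3 270M.
vocab, hidden = 262_144, 640                   # published model shape
embed_params = vocab * hidden                  # ~168M parameters in embeddings
total_params = 270e6
embed_share = embed_params / total_params      # ~0.62 of the whole model
```

At small scale the vocabulary table is nearly fixed-cost, so shrinking the transformer leaves the embedding layer as most of the model — the imbalance LFM2's architecture targets.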
~06:19 Performance kept improving far beyond Chinchilla-optimal. New test-time scaling laws suggest even more tokens would help.
"More pre-training works, and it works even at the smallest scale."
~10:21 Repetitive token generation that never terminates spikes when combining small models, reasoning modes, and complex tasks. Qwen 3.5 0.8B in reasoning mode has over 50% doom loop rate. Liquid's two-stage fix (DPO on loop-free rollouts + RL with repetition penalty) drove doom looping from ~16% to near-zero.
"If today you try to do the same thing with Qwen 3.5 0.8B in reasoning mode you will see a lot a lot a lot of doom loops — over 50%."
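A doom-loop detector of the kind a rollout filter or repetition penalty might use can be sketched as trailing n-gram repetition (thresholds below are invented, not Liquid's actual method):

```python
# Minimal repetition detector: flag a generation whose trailing n-gram
# repeats back-to-back many times. Thresholds are INVENTED examples.
def is_doom_loop(tokens, n: int = 4, max_repeats: int = 8) -> bool:
    if len(tokens) < n * max_repeats:
        return False
    tail = tokens[-n:]
    reps, i = 0, len(tokens) - n
    while i >= 0 and tokens[i:i + n] == tail:
        reps += 1
        i -= n
    return reps >= max_repeats

looping = ["let", "me", "check", "again"] * 20
healthy = "the quick brown fox jumps over the lazy dog".split()
```

Detection like this supplies the negative examples: DPO on loop-free rollouts plus an RL repetition penalty is the two-stage fix the talk describes.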
~15:24 Low knowledge capacity can be compensated by tool access. A tiny model that can search the web dramatically outperforms a bare base model on knowledge questions.
"From experience these tiny models are actually very good at agentic tasks and this is how we should use them."
OpenAI launched Workspace Agents — Codex-powered team agents in ChatGPT that run in the cloud, operate across connected tools, and can be scheduled. They shift from personal AI assistants to reusable work units: product-feedback routing, weekly metrics reports, risk screening, and IT triage all run while your computer is closed.[18]OpenAI Build Hour[19]Nate B Jones
Codex-powered agents that handle complex, long-running work spanning multiple systems. They have persistent memory, can improve over time through feedback, and sit alongside Codex (for individual developers) and the Agents SDK (for custom agents).[18]OpenAI
"Workspace agents are for teams. They're built for shared work for tasks that run in the cloud even when your computer is closed."
An agent named "Auto" checks a calendar each morning, researches customers via Google Drive and the web, and emails formatted meeting briefs — saving hours of manual prep. Built entirely by describing the workflow in natural language. Can be shared with teammates who can duplicate and remix it.[18]OpenAI
Employees request tools in a Slack channel; the agent researches the vendor, compares it against an approved software stack, checks license utilization, reasons against procurement policy, and auto-creates Jira tickets when escalation is needed. Each Slack channel has its own shared memory.[18]OpenAI
Nate Jones scores them well for recurring cross-tool team workflows from ChatGPT or Slack, but notes they do not win when the workflow is deeply native to Salesforce (data advantage), Microsoft 365 (graph advantage), or frontier coding (developer toolchain advantage).[19]Nate B Jones
Research preview on ChatGPT Business, Enterprise, Edu, and Teachers plans. Free until May 6, 2026, then credit-based pricing. Slack integration live; Teams coming.
A survey by Amplifying AI tested what tools Claude Code recommends across 20 categories. The winners: GitHub Actions (94%), Stripe (91%), shadcn/ui (90%), Vercel (100% for JS deployment), Tailwind (68%), Zustand (68%), PNPM (56%), Drizzle (61%), and "build it yourself" — the single most common recommendation at 12% of all picks.[20]Theo
Tested across three Anthropic models (Sonnet 4.5, Opus 4.5, Opus 4.6) with four project types, using open-ended prompts with no tool names in the input, three runs each. Context matters more than phrasing — the same prompt produces different results across repos but stays stable across phrasings within a project (76% consistency).[20]Theo
Claude Code frequently prefers building custom solutions. Feature flags are DIY'd 70% of the time; auth in Python 100%. Ironically, the recommendations Claude Code makes are not the ones Anthropic follows for building Claude Code itself — they use GrowthBook for feature flags.
"If coding agents are becoming the default way developers discover tools and the agents prefer building over buying, vendors need to either become the primitive that agents build on or make the tools so obviously superior that agents recommend them over custom solutions."
Key disagreements: Codex prefers Node and Cloudflare Workers while Claude Code prefers Bun and Vercel Edge. Claude Code never recommends Statsig (possibly because of its acquisition by OpenAI). Theo highlights a significant quality gap: Claude Code incorrectly recommends Bun as a runtime for Next.js projects (incompatible), while Codex gives a precise answer noting Node is safe and Bun is beta on Vercel.
"Opus is really smart until it's really stupid and then it's really stupid."
Mitchell Hashimoto's article: Ghostty reached 1M daily users in 18 months, but LibGhostty hit multiple millions in two months. Simon Willison argues LLMs no longer push developers toward only established tools — newer models with long context windows can learn new tools from docs at runtime.
"This is a new distribution channel where a single model's training data may shape market share more than a marketing budget or a conference talk."
Matt Pocock argues AI has accelerated software entropy — codebases are falling apart faster than ever. His open-source "improve codebase architecture" skill (41.5K GitHub stars) fights back with classic software design fundamentals: deep modules, seams, adapters, and structured grilling sessions with Claude Code.[21]Matt Pocock
A shared vocabulary with AI is essential: modules, interfaces, implementations, deep modules (lots of behavior behind a simple interface), shallow modules (the inverse), seams (where interfaces live and testing hooks in), and adapters from hexagonal architecture. Sourced from John Ousterhout's "A Philosophy of Software Design."[21]Matt Pocock
"I think of agents as really, really good tactical programmers. They're able to get on the ground and make changes quickly, but they need someone on the level above them who is the strategic programmer."
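Ousterhout's deep-vs-shallow distinction is easiest to see in miniature. A toy example (not from Pocock's skill):

```python
# Deep module: a simple interface hiding real behavior. Callers see get/put;
# capacity limits and eviction bookkeeping all live behind that interface.
class Cache:
    def __init__(self, capacity: int = 128):
        self._data, self._cap = {}, capacity

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        if len(self._data) >= self._cap:
            # Evict the oldest insertion (dicts preserve insertion order).
            self._data.pop(next(iter(self._data)))
        self._data[key] = value

# Shallow module: an interface as wide as the behavior behind it, adding
# nothing over calling dict.get directly.
def get_from_dict(d, key, default=None):
    return d.get(key, default)
```

The shared vocabulary matters because an agent told "make this module deeper" has a concrete target: more behavior behind the same narrow interface, not more interface.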
The skill explores a real codebase (~1,500 commits, React Router + Effect.ts), returns six "deepening opportunities," and enters a structured grilling session. The top candidate: a concept with two parallel implementations whose shared seam is untested. Output: a proposed TypeScript interface/module shape, validation dialogue, and a GitHub issue for an AFK agent to pick up.[21]Matt Pocock
Run the skill every couple of days in fast-moving repos. The flywheel: deeper modules → clearer seams → better tests → better agent output → fewer entropy events. For legacy codebases, run the skill first to identify refactor candidates and build a test harness before any AI makes further changes.
"The better your tests are, the better the output from the agent is going to be."
Open Design is an Apache 2-licensed, local-first design shell that wraps whatever coding agent CLI you already have (Claude Code, Codex, Cursor, Gemini CLI) to generate UI artifacts. It ships with 19 composable design skills and 71 brand-inspired design systems — positioning structured skills over random prompting for AI-generated UI.[22]AICodeKing[23]Github Awesome
Open Design detects installed CLIs from the system PATH and uses whichever is present as the design engine. Fallback: Anthropic API with bring-your-own-key. Cost is whatever the underlying agent costs — no separate design subscription.[22]AICodeKing
"Open Design is not a free model. It is more like a design shell around the agents you already have."
19 composable skills (SaaS landing, dashboard, mobile app, PM spec, docs, etc.) with P0/P1/P2 rule checklists. 71 design systems inspired by real brands: Linear, Stripe, Vercel, Airbnb, Tesla, Notion, Anthropic, Apple, Cursor, Figma, and more. Each is a plain markdown design.md file.
A discovery form surfaces before generating any code, asking about surface, audience, tone, brand, context, and scale. A 5-dimensional critique (design philosophy, hierarchy, execution, specificity, restraint) runs before any artifact is emitted. The repo explicitly blacklists common AI UI failure modes: aggressive purple gradients, generic emoji icons, random rounded cards with left-border accents.
"30 seconds of questions can save 30 minutes of regeneration."
Vite + React + TypeScript frontend, Node/Express daemon with SQLite. Supports HTML, PDF, ZIP, Markdown, and PPTX export. Claude Code gets the richest streaming support; other CLIs are line-buffered.
Graphify builds a knowledge graph of your codebase using tree-sitter and an LLM, so AI coding tools can understand structure and relationships instead of re-reading all files every query. A query that cost ~14,000 tokens before Graphify dropped to a few hundred after the initial graph was built.[24]Better Stack
Graphify runs once to build a persistent knowledge graph — nodes and edges representing files, functions, modules, and their relationships. It uses tree-sitter for structural parsing, then an LLM for semantic extraction. Everything runs locally. Output: a visual HTML graph, a Markdown knowledge base, and a written report.[24]Better Stack
"Graphify is basically like Google Maps for your code base. Instead of raw text, you get nodes and connections."
Traditional RAG finds semantically similar text chunks. Graphify builds explicit relationships with confidence levels — "extracted," "inferred," and "ambiguous." Incremental updates mean only changed files are reprocessed on subsequent runs, so context accumulates rather than resetting.
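The incremental behavior can be sketched with content hashing. This is an illustrative data structure under assumed names, not Graphify's actual schema; the confidence labels are the ones reported above.

```python
import hashlib

# Illustrative sketch of an incremental code graph: edges carry a
# confidence label ("extracted" / "inferred" / "ambiguous"), and a file
# is only re-sent to the LLM extractor when its content hash changes.
class CodeGraph:
    def __init__(self):
        self._hashes = {}  # path -> sha256 of last-processed content
        self.edges = {}    # path -> [(src, dst, relation, confidence)]

    def update(self, path, content, extract):
        """Reprocess `path` only if it changed; returns True when work was done."""
        digest = hashlib.sha256(content.encode()).hexdigest()
        if self._hashes.get(path) == digest:
            return False  # unchanged file: skip the expensive LLM pass
        self._hashes[path] = digest
        self.edges[path] = extract(path, content)
        return True
```

The hash check is the whole trick: the expensive first run pays for a persistent graph, and every later run touches only the files whose digests moved, which is why context accumulates rather than resetting.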
First run is slow and token-heavy, especially for large repos. The tool is early-stage. For small or simple repos, it may be overkill. But for real multi-file projects, the token reduction is dramatic.
"Your AI stops guessing and starts reasoning."
Salesforce announced Headless 360 at TrailblazerDX, exposing every major platform capability as an API, MCP tool, or CLI command — 60+ new MCP tools, explicit support for Claude Code, Cursor, and Codex. Meanwhile, AI Daily Brief's power rankings put Google #1, OpenAI #2, Anthropic #4 across a 100-point scoring framework, with Anthropic matching Microsoft at 14/15 on enterprise positioning.[19]Nate B Jones[25]AI Daily Brief
Headless 360 separates what an agent does from where its output appears, letting the same agent render across Slack, mobile, Teams, ChatGPT, Claude, Gemini, and any MCP-compatible client. Co-founder Parker Harris: "Why should you ever log into Salesforce again?" Agentforce 5 uses Claude Sonnet 4.5 as its default coding model, with GPT-5 available via multi-model support.[19]Nate B Jones
"Salesforce is not launching an agent. Salesforce is trying to become infrastructure under the agent economy."
Nate's framework for cutting through agent launch noise: (1) Does it plug into existing tools? (2) Can other agents build on top? (3) Does it own data you care about? (4) Is there an ecosystem forming? (5) Can you stack agents on top?[19]Nate B Jones
"Features commoditize. Infrastructure compounds."
Nathaniel built a 100-point framework across nine categories. AI-generated aggregate scores: Google 91.4, OpenAI 85.4, Microsoft 84.9, Anthropic 83.1, Amazon 80.4; his own harsher scoring puts only three labs above 70. Key call: Anthropic matches Microsoft at 14/15 on enterprise because enterprises want to go direct to the model labs rather than through traditional software vendors.[25]AI Daily Brief
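A rubric like this is just capped category scores summed to 100. The sketch below is a hypothetical re-creation: the category names and point caps are invented for illustration and are not the AI Daily Brief's actual weights.

```python
# Hypothetical nine-category, 100-point rubric; names and caps are
# invented stand-ins, not the AI Daily Brief's actual framework.
CATEGORIES = {
    "enterprise": 15, "consumer": 10, "models": 15, "compute": 10,
    "talent": 10, "momentum": 10, "distribution": 10, "safety": 10,
    "x_factor": 10,
}
assert sum(CATEGORIES.values()) == 100  # rubric totals 100 points

def aggregate(scores: dict[str, float]) -> float:
    """Clamp each category to its cap, then sum into a score out of 100."""
    return sum(min(scores.get(name, 0.0), cap) for name, cap in CATEGORIES.items())
```

Individual calls like "14/15 on enterprise" or "3/10 on momentum" slot in as per-category entries, which is why small category gaps can separate labs whose aggregates sit within a few points of each other.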
"Incumbency right now in the enterprise is worth less than I think people think it is."
Google scores only 3/10 on momentum despite strong positioning, hurt because 2026 is dominated by agentic and coding use cases where no one is looking to Gemini. xAI scores the highest X-factor, thanks to Elon's track record, strong compute, and the most room to rise.[25]AI Daily Brief
Three short-form pieces sketch the terrain ahead: YC envisions users as their own "forward-deployed engineers" with dynamically customizable software interfaces,[26]YC Dwarkesh argues an army of perfectly aligned (obedient) AIs could be terrifying in the wrong hands,[27]Dwarkesh Patel and Lenny's Podcast warns the industry is massively underestimating societal pushback on AI adoption.[28]Lenny's Podcast
YC argues coding agents are now good enough that users can become their own forward-deployed engineers. Software companies would ship shared primitives and design decisions, expecting users to heavily modify final interfaces. This requires rethinking the whole delivery stack: should developers deliver source code instead of binaries?[26]YC
"Users can become their own forward-deployed engineers."
Using the Stanislav Petrov story (the Soviet officer who refused to report a false missile alarm and likely prevented nuclear war), Dwarkesh illustrates why alignment is more nuanced than "make it obey." A government with perfectly obedient AI employees would have a terrifying monopoly on AI-enhanced surveillance and force. The real question: whose instructions should AI follow — the deployer, the user, or its own values?[27]Dwarkesh Patel
"An army of extremely obedient employees is what it would look like if alignment succeeded."
A contrarian take: technology leaders think people will blindly adopt new technology, but a period of huge societal pushback is coming. Humanity dictates how technology is adopted, not the other way around.[28]Lenny's Podcast
"Technology leaders think folks will just blindly adopt new technology as it comes out, and I think we're going to enter a period of huge societal pushback."