April 19, 2026
Simon Willison rounds up an emerging consensus — articulated by Matt Webb, Marc Benioff, and Brandur Leach — that "headless" software (API-first, no GUI) is becoming the preferred architecture for AI agents acting on behalf of users.[1]Simon Willison — Headless everything for personal AI Nate B Jones sharpens the same idea into an "agent fork" thesis: the web needs a new interface layer — structured, programmable, transactional — and Stripe, Coinbase, Cloudflare, Visa, Google, and OpenAI are already building it.[2]Nate B Jones — The web is about to look completely different
Willison highlights Marc Benioff announcing "Salesforce Headless 360" — exposing Salesforce's entire platform via APIs, MCP, and CLI for direct agent consumption — with the line "Our API is the UI."[1]Simon Willison — Headless everything for personal AI Matt Webb's earlier argument: personal AIs are a better UX than using services directly, and headless APIs are far more reliable for agents than bot-driven browser automation.
Analyst Brandur Leach frames this as the second API boom. An API is no longer an afterthought but "a major saleable vector" — the decisive differentiator when SaaS products are otherwise undifferentiated. Willison flags a structural disruption: most SaaS pricing is per-user, and if agents replace humans as the primary interface, that billing model breaks.[1]Simon Willison — Headless everything for personal AI
"An API is no longer a liability, but a major saleable vector… an API might just be the crucial deciding factor." — Brandur Leach, via Willison
In a companion short, Nate B Jones argues AI agents — "software that reads, decides, pays, and acts" — need interface primitives the human web doesn't provide, just as mobile needed tap-to-pay, GPS, and push. The companies building this layer aren't startups but infrastructure incumbents (Stripe, Coinbase, Cloudflare, Visa, PayPal, Google, OpenAI), whose scale will make their design choices into de facto standards.[2]Nate B Jones — The web is about to look completely different
"The businesses that emerge from [the agent fork] will be the ones that could not have existed on the human web." — Nate B Jones
A single day produced four distinct 3D world / spatial-AI releases. Tencent's HY-World 2.0 — fully open-source — unifies 3D generation and reconstruction from text, images, or video, matching the closed-source Marble on benchmarks.[3]HuggingFace — HY-World 2.0 paper Alibaba ATH Lab shipped Happy Oyster, a Genie-3 competitor that generates interactive 3D worlds in real time with diverse characters.[4]AI Search — Happy Oyster segment Nvidia's LRA 2 (131 MB!) converts any video into explorable 3D Gaussian-splat scenes loadable into Isaac Sim for robot training.[4]AI Search — Nvidia LRA 2
HY-World 2.0 is a multi-modal 3D world model that unifies generation and reconstruction in a single stack. Given text or a single image, a four-stage pipeline synthesizes navigable 3D scenes: HY-Pano 2.0 generates panoramas via a Multi-Modal Diffusion Transformer; WorldNav plans obstacle-aware camera trajectories; WorldStereo 2.0 expands views with Global-Geometric + Spatial-Stereo memory for consistency; and WorldMirror 2.0 composes the expanded views into 3D Gaussian Splatting representations. Reconstruction from multi-view or video input uses the same WorldMirror 2.0 backbone with normalized position encoding and explicit normal supervision.[3]HuggingFace — HY-World 2.0 paper Interactive exploration happens in WorldLens, an engine-agnostic renderer with automatic IBL lighting. Weights and code are all public, and benchmark performance is comparable to the closed-source Marble. AI Search also covers HY-World 2.0 at ~16:25.
Happy Oyster generates interactive 3D environments in real time from prompts, supporting horseback riding, paragliding, skateboarding, and dragon-riding. The same lab makes Happy Horse, which currently tops the Artificial Analysis video-generation leaderboard, beating Seedance 2.0. Access is waitlisted. ~13:26[4]AI Search — Happy Oyster segment
"In the next few years, video games are going to be drastically different. Instead of everything pre-designed, it might be just open-ended worlds powered by fine-tuned AI models." — AI Search
LRA 2 takes a scene video and emits a 3D point cloud / Gaussian-splat scene that stays consistent across viewpoints (long horizon). Exports drop directly into Nvidia Isaac Sim for robot training. At only 131 MB, it runs on most consumer devices and is already open-sourced on GitHub. ~14:58
Wild Det 3D ("3D detection in the wild") is a lightweight 3D object detector that runs on iPhone, producing accurate 3D bounding boxes with metric depth from live video or text prompts — use cases span AR, robotics, spatial AI. ~07:23 Annigen turns a single image into a rigged 3D model with an articulated skeleton ready for animation software, outperforming Animate and Puppeteer on segmentation and skeleton estimation. ~11:50
Researchers dissected Claude Code's publicly available TypeScript source and distilled it into a design-space analysis.[5]HuggingFace — Dive into Claude Code paper The headline number: only 1.6% of the codebase is AI decision logic — the other 98.4% is operational scaffolding (permissions, compaction, hooks, observability). They also found that 27% of AI-assisted tasks represent work users would not have attempted without the tool.
The paper traces five human values (decision authority, safety/security, reliable execution, capability amplification, contextual adaptability) through thirteen design principles into concrete implementation choices. Permission handling spans seven modes plus an ML classifier across seven independent safety layers; context management uses a five-layer compaction pipeline; extensibility is handled through four mechanisms — MCP, plugins, skills, and hooks — with 27 hook event types.[5]HuggingFace — Dive into Claude Code paper
Auto-approval rates climb from ~20% under 50 sessions to over 40% by 750 sessions — users learn what's safe to let through. The paper compares Claude Code against OpenClaw (an independent open-source harness) to show how the same recurring design questions produce different answers in different deployment contexts.
"Only about 1.6% of Claude Code's codebase constitutes AI decision logic, with the remaining 98.4% being operational infrastructure." — Dive into Claude Code
The authors close with six design questions for future agent systems: observability-evaluation gaps, cross-session persistence, harness boundary evolution, horizon scaling, governance, and long-term capability preservation.
RAD-2 pairs a diffusion-based trajectory generator with an RL-trained discriminator and reports a 56% reduction in collision rate versus strong diffusion-only baselines, with confirmed improvements in real-world vehicle deployment.[6]HuggingFace — RAD-2 paper
Diffusion-based motion planners model multimodal trajectory distributions well but suffer from stochastic instabilities and lack corrective feedback under pure imitation learning. RAD-2 decouples the generator (diffusion) from a reranker (an RL-trained discriminator) that scores long-term driving quality — avoiding the problem of applying sparse scalar rewards to the high-dimensional trajectory space.
Two algorithmic contributions: Temporally Consistent Group Relative Policy Optimization (TC-GRPO) uses temporal coherence to denoise advantage signals and solve credit assignment; On-policy Generator Optimization (OGO) turns closed-loop feedback into structured longitudinal signals that progressively shift the generator toward high-reward trajectory manifolds. BEV-Warp is a high-throughput simulator that performs closed-loop evaluation directly in Bird's-Eye-View feature space via spatial warping — enabling large-scale RL training without expensive rendering.[6]HuggingFace — RAD-2 paper
"RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners." — RAD-2
Nikhyl Singhal — ex-CPO at Credit Karma, ex-Meta, ex-Google, founder of the Skip community — argues the PM role is splitting into two camps: builders are having a renaissance (record comp, most open roles in 3+ years), and "information movers" — about half of all PMs — are about to become dinosaurs.[7]Lenny's Podcast — Nikhyl Singhal interview His prediction: 12–24 months of massive layoffs followed by AI-first rehiring — "you might see a company shed 30,000 and hire 8,000, and all 8,000 will be AI-first."
~00:00 Cold open: half of PMs are in trouble, builders are having the time of their lives.
~03:01 ZIRP era vs. today. Three years ago PMs had "responsibility without authority" and their day was mostly moving information up and down the org. Today, builders are having fun again — they can test ideas directly, comp is at all-time highs, and Nikhyl's peers are going founder, CEO, or cross-functional C-level.
~11:05 Where this heads: agents, judgment, 10–100x product velocity. In two years most mechanical parts of building product will be gone. "There won't be any more bad software because someone will sit down and tell Claude to fix it."[7]Lenny's Podcast — Nikhyl Singhal interview
~20:11 The prediction: shed 30,000, rehire 8,000 AI-first. Massive layoffs in the next 12–24 months — and diversity loses because AI hype is Bay-Area-centric and pace penalizes people with caregiving load (~35:20).
~25:13 Builders vs. information movers. Lenny's own data shows the most open PM roles in 3+ years — which makes sense once you bifurcate the role: builders are wanted everywhere; information movers are being shed.
"The information mover is essentially going to become a dinosaur." — Nikhyl Singhal
~38:22 Hot take: big-brand logos may now hurt you. If the company wasn't AI-forward, the logo on your resume is a liability. The new game rewards opinion and hands-on building — not the "leverage/scale/stop building yourself" playbook taught in senior PM career coaching.
~40:24 Crossing the threshold: reinvention and the "shadow superpower." The people who mastered the old system are least motivated to change. The unlock is finding a personal "moment of joy" with the tools — a chief-of-staff app, a home-automation hack, something with your partner — which flips fear into momentum.
~58:40 Nikhyl's AI stack and productivity system. All-in on Claude for the last three months. He vibe-codes during TV shows built on books (Jack Ryan, Alex Cross, a 24 rewatch). Obsoleting yourself is the job.
"I try to find TV shows I can vibe code in parallel to — because I want to watch TV, but I want to be vibe coding at the same time." — Nikhyl Singhal
~65:43 Specific advice: fire in the belly (treat your current job like a new job or new relationship), swallow your ego, play for the "skip job" — the role after the next title — not the next move.
~69:46 PMs as change agents — spreading beyond tech. Into HR, marketing, sales, even PE-owned HVAC companies. Engineers are becoming more PM-like because coding is being solved.
~71:46 Four jobs remain and design is plateauing. Surprisingly, design is not booming — companies conflate design with production rather than taste.
~89:00 Edison's maxim reframed for the AI era: "Genius is 1% inspiration, 99% perspiration." Now AI does the 99% perspiration — the future belongs to the inspired.
"Joy is the biggest antidote to burnout." — Nikhyl Singhal
Sunil Pai (Cloudflare) argues JSON tool-calling breaks at scale and introduces "code mode": the LLM generates JavaScript that runs in a capability-scoped sandbox. A two-tool demo (search + execute) compresses Cloudflare's ~2,600 API endpoints from 1.2M tokens down to ~1K — a 99.9% reduction.[8]AI Engineer — Sunil Pai Code Mode
~01:07 Why JSON tool calling breaks at scale. Once you stuff in Google services, Jira, wiki, etc., hundreds of tools fill the context, composition gets weird, and JSON round-tripping is slow.
~02:08 Code mode: generate JS, run it, one shot. Benefits include typed APIs, syntax checking, leveraging the terabytes of code in training data, and natural use of loops, state, and parallelization.
~03:09 The Cloudflare case study. Matt Carey exposed 2,600 endpoints as just two tools — search (reads the OpenAPI spec) and execute (returns callable functions) — cutting context from ~1.2M tokens to ~1K. The prompt "we're getting DDoS'd, find every offending IP and block them" resolves in one code shot, not ~8 MCP round trips.
"The Cloudflare API surface is about 2,600 API endpoints. If we exposed a tool for every single one of them, it's about 1.2 million tokens in your first call." — Sunil Pai
~05:12 Live demo of the mythical server — with on-stage pagination hiccup ("I need to pay for the Mythos model to make this work accurately").
~09:14 Kenton's tic-tac-toe — inhabiting the state machine. Kenton Varda drew a tic-tac-toe grid with an X on a canvas and asked the model to play. Instead of generating a tic-tac-toe app, the model read the stroke state in-place and drew a circle — "inhabiting the state machine" (a Ghost in the Shell reference). Opus also let Kenton win.
"It stopped generating a program and it instead inhabiting the state machine." — Sunil Pai
~11:17 Harness architecture and capability-based sandboxes. Cloudflare uses V8 isolates (fast startup, ~10 years of security hardening) with zero capabilities by default — no fetches, no APIs — grant only what's needed. Observability is required so you can audit why last Tuesday the agent made a $2.3M trade "for llama poop or something."
~14:19 Generative UI and long-running workflows. Every user can get a custom program backed by your existing APIs. Workflows can span days/months/years. Run the harness closer to the user — even on-device — to mash up services across systems.
~16:21 DX for agents and closing. Your "next billion users are these little robots generating code for you" (even though your customers are still humans). Markdown docs, actionable errors, search-discoverable endpoints — "you need to let the code do the talking."
"Your next billion users are these little robots that are generating code for you. To be clear, your customers are still humans." — Sunil Pai
Anthropic's David Soria Parra lays out the state and roadmap for the Model Context Protocol: 110M monthly downloads — 2x faster adoption than React.[9]AI Engineer — David Soria Parra Future of MCP He argues 2026 is the year of general knowledge-worker agents, and connectivity — via skills + MCP + CLI/computer use — is the bottleneck. Major protocol additions land in June: stateless transport (with Google), cross-app access, well-known-URL server discovery, skills over MCP, and SDK v2s.
~00:07 Opening demo. An "MCP application" — an agent shipping its own UI through an MCP server — runs unmodified across Claude, ChatGPT, VS Code, and Cursor. Shared semantics between client and server are what make that possible.
~02:09 18-month recap and 110M monthly downloads. MCP started as a tiny local-only spec with SDKs mostly written by Claude. Since then: remote transport, centralized authorization, new primitives (elicitation, tasks), experimental MCP applications. Ecosystem growth is ~2x faster than React's was — driven not just by Anthropic clients but by the OpenAI Agent SDK, Google ADK, LangChain, and thousands of frameworks pulling in MCP as a dependency.[9]AI Engineer — David Soria Parra Future of MCP
"We're now at 110 million monthly downloads. React took roughly double the amount of time to reach that volume." — David Soria Parra
~04:10 2026 is about general-purpose agents. 2024 was demos, 2025 was coding agents (the ideal case — local, verifiable, compiler-backed). 2026 shifts to analysts and marketers, which means connecting to 5+ SaaS apps.
~05:11 The connectivity stack. Three tools: skills (domain knowledge in files), CLIs (ideal for local coding agents and things in pre-training like git/GitHub), and MCP (rich semantics, UI for long-running tasks, resources, platform independence, authorization, experimental apps). The best 2026 agents will use all three seamlessly.
"If someone tells you there's one solution to all your connectivity problem — be it computer use, be it CLIs, be it MCP — they are probably pretty wrong." — David Soria Parra
~07:12 Client-side pattern 1: progressive discovery. Instead of stuffing every tool in-context, give the model a tool-search/tool-loading tool. Dramatic before/after in Claude Code.
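A sketch of what that registry could look like on the harness side; the types and names here are hypothetical, not the MCP SDK's.

```typescript
// Hypothetical registry shape for progressive discovery: expose one search
// tool at session start; matching definitions get loaded into context only
// when the model asks for them.

interface ToolDef {
  name: string;
  description: string;
  inputSchema: object; // JSON Schema for the tool's arguments
}

class ToolRegistry {
  private all: ToolDef[] = [];
  private loaded = new Map<string, ToolDef>();

  register(def: ToolDef): void {
    this.all.push(def);
  }

  // The single meta-tool visible up front. Marks hits as loaded so the
  // harness will accept subsequent calls to them.
  searchTools(query: string, limit = 5): ToolDef[] {
    const q = query.toLowerCase();
    const hits = this.all
      .filter((t) => t.name.includes(q) || t.description.toLowerCase().includes(q))
      .slice(0, limit);
    hits.forEach((t) => this.loaded.set(t.name, t));
    return hits;
  }

  isCallable(name: string): boolean {
    return this.loaded.has(name);
  }
}
```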
~09:14 Client-side pattern 2: programmatic tool calling / code mode. Give the model a REPL tool (V8, Moddable, Lua) and let it write scripts instead of orchestrating step-by-step. MCP's structured output primitive makes typed composition feasible; a cheap model can fill the gap otherwise.
~11:19 Server-side guidance: stop REST-to-MCP conversions. Design for the agent (often the same as designing for a human). Use MCP's richer semantics — MCP applications, skills, tasks, elicitations. Cloudflare's MCP server cited as the model.
"Every time I see someone building another REST-to-MCP conversion tool, it's a bit cringe." — David Soria Parra
~13:21 Roadmap — all landing around June: (1) Core — stateless transport protocol designed with Google so MCP servers deploy like normal stateless REST services on Cloud Run/Kubernetes; async task primitive for agent-to-agent communication; TypeScript and Python SDK v2s (he jokes FastMCP is "way better than the Python SDK I wrote"). (2) Integration — cross-app access via identity providers (Google/Okta single login), server discovery via well-known URLs. (3) Extensions — MCP applications (web-only) and "skills over MCP" letting server authors ship updated knowledge continuously.
"2026, I think, is all about connectivity, and the best agents use every available method." — David Soria Parra
A short clip from the Ada Palmer interview. Francis Bacon's taxonomy: ants (encyclopedists who just pile up information), spiders (system-weavers spinning hypnotic but barren logical webs), and honeybees (scientists who process nature into something "sweet and useful for humankind").[10]Dwarkesh Patel — Ada Palmer clip on Bacon Palmer frames this as the founding gesture of what became the Royal Society — and a pivotal shift in what counts as achievement.
"There are three types of knowledge wielders, says Bacon. First, there is the ant who is the encyclopedist… all he does is assemble it. A beautiful library. Nothing comes from it." — Ada Palmer
"The honeybee who, gathering from among the fruits of nature, processes what he gathers through the organ of his own being to produce something which is sweet and useful for humankind. And that is the scientist." — Ada Palmer
"You are a great achiever because you worked out how it can be done and you shared that sweet and useful thing with all of humankind." — Ada Palmer
Palmer's reading: Bacon's rhetorical move — alongside his portrait on the title page — redefined greatness from "building the dome" to sharing the method of how the dome could be built with everyone.
Anthropic's new flagship tops LMArena for text and coding, beats 4.6 on SWE-Bench Pro / Verified / Terminal-Bench, and accepts images 3x larger than prior Claude models.[4]AI Search — Claude Opus 4.7 segment But AI Search flags real regressions: business management, financial ops, entertainment/sports/media, and some instruction-following cases score lower than 4.6. On Artificial Analysis' intelligence index, Opus 4.7 Max only ties Gemini 3.1 Pro and GPT 5.4 extra-high — while being slower and more expensive than both.
Claude Opus 4.7 is focused on complex software engineering and agentic workflows, with better autonomy on long-running tasks. It outperforms Opus 4.6 on SWE-Bench Pro, SWE-Bench Verified, and Terminal-Bench; ranks #1 on LMArena for text and coding; accepts images of 2500+ pixels on the long side (3x prior Claude models); and has noticeably improved instruction-following for multi-step workflows. ~20:21
AI Search shows Arena category data where 4.7 underperforms 4.6 in business management, financial operations, entertainment/sports/media, hard prompts, longer queries, and some instruction-following cases. The presenter also reports 4.7 broke things in Claude Code that 4.6 handled cleanly.[4]AI Search — Claude Opus 4.7 segment
"Honestly, there's no reason to use Opus 4.7 given that Gemini and GPT 5.4 are just as performant, but they're way faster and they cost less." — AI Search
"I was also using Opus 4.7 today to vibe code stuff using Claude Code and it actually broke some stuff which Opus 4.6 would never do." — AI Search
Available now in Claude chat, API, and Claude Code with a 1M-token context. Pricing matches Opus 4.6.
A 12-minute Claude Design walkthrough. The differentiator isn't that it generates UI — LLMs all do that — it's that Claude Design crawls your repo, extracts a design system, and stores it as HTML/CSS assets used as context for every subsequent generation. Without that grounding, LLM front-ends look obviously AI-generated.[11]Developers Digest — Claude Design in 12 Minutes The Claude Design launch racked up 50M views in 36 hours.
At ~00:00, Claude Design ingests either a codebase (agentic crawl to extract buttons, cards, nav, typography, color palette) or a Figma import. From that system, it generates full pages — e.g. pricing pages with multiple layout variants (stacked cards, unified table, split hero) — plus banners, hero sections, slide decks, and animations.
Refinement happens through chat, inline comments, direct on-canvas edits, and voice input with DOM-hover context: hovering over elements automatically sends their DOM representation into the model. The tool self-QAs by taking screenshots and feeding them back. Exports: Canva, PDF, PowerPoint, or direct handoff to Claude Code as HTML/CSS. Runs on Opus 4.7 by default; included in the Claude subscription.
"The design files are effectively the HTML, the CSS, all of the different assets… it's not actually an arbitrary format. It's the technologies that have been around for a long time." — Developers Digest
~07:58 The presenter frames Anthropic's parallel launches — Claude Code, app building, Claude Design — as a deliberate AGI-path strategy. Design may have a larger TAM than coding tools because it reaches users who can't write code at all. Figma and Adobe are now under direct pressure.
"They're saying, 'We have AGI on the horizon. We're going to try and build a ton of different products on the way there.' And one of the things with this strategy is: it's working." — Developers Digest
OpenAI repositioned Codex from coding assistant to full software-workflow agent. New capabilities: background computer use (Mac; it sees, clicks, and types with its own cursor), in-app browser with comments, image generation, GitHub review comments, multi-terminal tabs, and SSH into remote dev boxes (alpha).[12]AICodeKing — Codex 2.0 + Free Tier The bigger shift: Codex is now available on ChatGPT Free and Go plans for a limited time — plus 3M+ weekly developers already using it and 6x growth in Business/Enterprise since January 2026.
~01:02 Background computer use, in-app browser (comment directly on pages), OpenAI-model image generation, GitHub review comments, multiple terminal tabs, SSH to remote dev boxes, and improved file previews for PDFs/spreadsheets/slides/docs. A summary pane tracks plans, sources, and artifacts across the session.
"Codex can now operate your computer alongside you. Background computer use. It can see, click, type, and interact with apps using its own cursor." — AICodeKing
~04:02 Codex preserves context across sessions, schedules future work, wakes itself up to continue tasks over days or weeks, and has a memory preview for preferences and corrections. It proactively suggests next steps based on project context.
"Instead of starting from zero every single time, it can pick up where it left off potentially across days or weeks." — AICodeKing
~06:04 Codex is in ChatGPT Plus, Pro, Business, Enterprise, and Edu — and for a limited time also Free and Go. Paid plans get higher rate limits; Business and Enterprise can now add Codex on a pay-as-you-go per-seat basis. OpenAI also improved Codex's prompting guide (better starter prompts, tool-use patterns, bias for action, fewer wasted thinking tokens).[12]AICodeKing — Codex 2.0 + Free Tier
"More than 3 million developers use Codex every week. Codex usage in ChatGPT Business and Enterprise has grown six times since January." — AICodeKing
A 35B-parameter mixture-of-experts with only 3B parameters active per token. It outperforms similar-sized peers on autonomous coding, agentic tasks (Terminal-Bench, SWE-Bench Pro, SWE-Agentic), graduate-level reasoning, competitive math, and multimodal reasoning. Weights (~72 GB) live on Hugging Face and ModelScope; free web access at chat.qwen.ai; integration instructions for OpenClaw Qwen Code and Claude Code. ~24:23[4]AI Search — Qwen 3.6 segment
Qwen 3.6 35B-A3B sits in Alibaba's mid-size open-source lineup. On average it beats similarly-sized competitors across autonomous coding, agentic tasks, graduate-level reasoning, competitive math, and multimodal/image reasoning. Pairs well with Qwen Code / OpenClaw Qwen Code for agentic harnesses; usable in Claude Code workflows too.
Three data points in one day. Unitree's H1 hit 10 m/s (~36 km/h) — claimed world record for a humanoid robot.[4]AI Search — Unitree H1 + Beijing Marathon Beijing's second humanoid robot marathon ran this weekend with 70+ teams, ~40% fully autonomous. And Leju Robotics opened what it claims is the first automated humanoid assembly line — one unit every 30 minutes, 10,000+ units/year, 92% of processes automated.
The 62 kg H1, with 0.8 m legs, ran at 10 m/s with fluid whole-body control and high-frequency stability. The leap from last year's clumsy, fall-prone robots to this year's smooth sprinters is stark. ~25:40
70+ teams, about 40% fully autonomous (no teleoperation), many approaching human sprint speeds. Second edition, noticeably more capable than the 2025 run.
~27:12 In Foshan, Guangdong: robotic arms, AGVs, and precision stations assemble torsos, limbs, and heads across 24 digitalized assembly stages. 77 quality/safety checks per unit, flexible design to switch between robot models without major downtime, some human-in-the-loop steps remain. The shift toward car-assembly-line-scale manufacturing promises sharp cost drops and faster warehouse/factory deployment.
A reasoning model specialized for life sciences — drug discovery, genomics, protein engineering — that outperforms GPT 5.4 on scientific subject benchmarks, with large gains in experimental design and analysis.[4]AI Search — GPT Rosalind segment OpenAI also launched a life-sciences Codex plug-in with access to 50+ scientific databases (protein structures, sequence DBs, literature search). Access is gated to labs via request.
~05:22 Rosalind sits in the middle of researcher workflows — connecting literature review, experimental planning, and data analysis in one thread — rather than forcing scientists to jump between tools. OpenAI is explicitly aiming at compressing the 10–15 year target-discovery-to-approved-drug pipeline.
"AI isn't just assisting coding or writing anymore. It's starting to actively participate in real scientific discovery pipelines." — AI Search
A family of open-source LLMs (1.7B, 4B, 8B) where every weight is restricted to -1, 0, or 1 with a small shared scaling factor — instead of standard 16-bit precision. The 8B model is 1.7–2.32 GB, ~10x smaller than Qwen3 8B, yet performs nearly as well and beats Ministmo, Llama 3.1, and GLM4 across reasoning, coding, and knowledge benchmarks.[4]AI Search — Ternary Bonsai segment Throughput exceeds 100 tokens/sec on consumer GPUs and mobile chips.
~02:33 Three sizes: 1.7B, 4B, 8B. Models and run instructions are on GitHub and Hugging Face. Real-time on-device inference is the headline use case — the 100+ tok/s throughput on mobile chips is the unlock.
"It's nine times smaller than standard 16-bit models while outperforming most of them." — AI Search
Google's new TTS accepts inline meta-tags (excited, amazed, whispers, panicked, sigh, laughs, dramatic pause) with natural breathing and fine-grained pacing. Supports 70+ languages and reportedly beats ElevenLabs v3 on expressiveness. No voice cloning — only default speakers. Free via API and Google AI Studio.[4]AI Search — Gemini 3.1 Flash TTS segment
~32:29 Demo examples: reading Shakespeare's "To be or not to be" as a neutral academic, as a troubled philosopher in internal debate, and as a deranged outcast. Style-transfer quality is clearly a step up.
"The model breathes. It even laughs." — AI Search
Meta released TRIBE v2, a foundation model described as a "digital mirror of the human brain." It predicts neural responses to video, audio, and text — and is 2–3x more accurate than actual fMRI scans because it filters noise from heartbeats, motion, and electrical interference.[13]Better Stack — Meta TRIBE v2 Paper, code, and weights all open-sourced.
Three-stage architecture: (1) encoding — translate video/audio/text into a model-native format; (2) universal integration — a transformer identifies shared patterns across all humans; (3) whole-brain mapping — project patterns onto 70,000 voxels. Follows AI scaling laws and can predict a new person's brain response to a new image without any real-world recording.
"TRIBE v2 is often more accurate than an actual fMRI scan — a two-to-three-times improvement over traditional methods." — Better Stack
Intended research uses: simulating brain disorders, studying emotion processing, designing AI architectures that mimic brain efficiency.
Four notable video/image gen releases from the AI Search roundup. Prompt Relay lets Alibaba's Wan transition seamlessly between multi-prompt scenes. Motif Video 2B is a tiny 2B diffusion transformer trained with 10x less data than Wan 2.1, near state-of-the-art on VBench. ByteDance Omni Show produces consistent UGC-style marketing videos from reference images + audio + optional pose skeletons. Adobe Token Relight gives continuous control over intensity, color, ambient, diffusion level, and 3D light position in a single photo.[4]AI Search — video/image gen roundup
~01:01 Training-free, plug-and-play method layered on Wan. Users supply a sequence of prompts each tied to start/end times (e.g., eagle flying → cyberpunk car → TV living-room zoom-out). Prompts are routed at cross-attention layers — active prompt dominates its segment while transitions are handled smoothly.
"Think of it like a relay race where each prompt hands off cleanly to the next person." — AI Search
~09:20 A 2B-parameter diffusion transformer trained with <100k GPU hours and <10M videos — ~7x fewer parameters and an order of magnitude less training data than Wan 2.1. Performs close to Wan 2.2 on VBench and beats Wan 2.1, HunyuanVideo, and StepVideo. 19 GB VRAM minimum with CPU offload; ComfyUI support planned.
~17:23 Generates realistic UGC-style videos of a (real or fictional) person talking about any product, driven by reference images of person/product, audio, and optional pose-skeleton videos for fine-grained hand/finger control. Beats HunyuanCustom, HuMo, and VACE on consistency. A GitHub repo exists; code is "under internal review" for possible open-sourcing.
~28:37 Tokenizes lighting attributes and feeds them through a transformer. Drag a light source in 3D, change its color, adjust diffusion, control ambient — realistic shadows and highlights update across the scene. One visible flaw: reflections in mirrors don't update. Only a technical paper is released; Adobe rarely open-sources work like this.
A standardized benchmark covering 34 browser games across runner, arcade, platformer, puzzle, and simulation genres — evaluating agentic models on both low-level controls and semantic actions. A live YouTube stream shows current runs. Among general multimodal models, Gemini / GPT / Claude take the top three. All remain well below the 64% novice-human baseline, though 30%+ from general chat models is notable.[4]AI Search — GameWorld benchmark
~30:46 Among computer-use agents: ByteDance Seed, Sonnet 4.6, and Gemini Computer Use lead. GitHub repo supports plugging in custom LLMs. Use cases: stress-testing real-time perception + planning in agents that aren't purpose-built for games.
Berkeley RDI published findings that popular AI agent benchmarks — SWE-Bench, GAIA — are fundamentally compromised. Two failure modes: (1) training data contamination — public datasets leak into model training, so models recall rather than reason; (2) evaluation code itself has security flaws. Some benchmarks use bare eval() on untrusted model output, letting a crafty agent inject a payload instructing the evaluator to return a perfect score.[14]Better Stack — Berkeley RDI benchmark exposé
Other benchmarks lack agent–evaluator isolation — both run in the same environment — letting the agent read the hidden answer key from disk. Berkeley's proposed framework has three pillars: strict sandbox isolation (the agent can't see evaluation scripts), dynamic task generation (randomized variables each run to prevent memorization), and adversarial auditing with a zero-capability agent (if it scores high, the benchmark is broken).
"A smart agent could literally hack its way to a perfect score by sending a payload that tells the evaluator, 'Hey, just mark this as a 100% correct score.'" — Better Stack
"If an agent does absolutely nothing and still gets a high score, your benchmark is essentially broken." — Better Stack
Practical takeaway: leaderboard scores can no longer be trusted at face value. Test agents for genuine reasoning, not rote recall.
A walkthrough of the most notable skills in the current trending set. Standouts include Harness (auto-generates a full multi-agent team for your project domain), Anti-Vibe (post-session explanations of why the AI wrote what it did, so you actually learn), Friday (24/7 autonomous assistant from a CLI + markdown + Flask RAG — sends Telegram briefings, monitors HF, creates self-healing cron jobs), Skill Claw (collective skill evolution via a local proxy + shared cloud repo), and Paper Finder (deep ML literature research tuned for hidden-gem papers).[15]Github Awesome — 35 Claude Code skills
Other mentions: Auto Skills (NPX) scans your package.json and auto-installs the best community skills from skills.sh. Manual SDD is a spec-driven development starter using symlinks so .claude, .cursor, and .codex all share one canonical AI specs folder.
"Your AI writes the code, you ship the feature, you learn absolutely nothing. Anti-Vibe is a Claude code skill that fixes that." — Github Awesome
A browser automation tool that connects to Chrome over the Chrome DevTools Protocol (WebSocket) and injects live JavaScript in real time when an agent hits an unexpected DOM change or pop-up.[16]Github Awesome — Browser Harness Designed to integrate with Claude Code so agents can drive the browser in response to live conditions — not pre-written scrapers.
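The core mechanism fits in a short sketch. This is not Browser Harness's actual code, just the standard CDP handshake: list targets over HTTP, attach to the page's WebSocket, and send a Runtime.evaluate message (using the WebSocket global available in Node 22+; older Node needs the ws package).

```typescript
// Minimal CDP sketch: attach to a Chrome started with
// --remote-debugging-port=9222 and inject JavaScript into the live page.

async function injectIntoChrome(expression: string): Promise<unknown> {
  // Each open tab exposes a webSocketDebuggerUrl on the /json endpoint.
  const targets = await (await fetch("http://localhost:9222/json")).json();
  const page = targets.find((t: any) => t.type === "page");

  const ws = new WebSocket(page.webSocketDebuggerUrl);
  await new Promise((resolve) => (ws.onopen = resolve));

  // CDP is JSON-RPC-like: responses are matched to requests by id.
  const result = new Promise((resolve) => {
    ws.onmessage = (ev) => {
      const msg = JSON.parse(ev.data as string);
      if (msg.id === 1) resolve(msg.result?.result?.value);
    };
  });
  ws.send(JSON.stringify({
    id: 1,
    method: "Runtime.evaluate",
    params: { expression, returnByValue: true },
  }));
  return result;
}

// Example: dismiss a pop-up the moment the agent notices it (the selector
// here is hypothetical).
injectIntoChrome(`document.querySelector('#cookie-banner button')?.click()`);
```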
A browser proposal — live in Chrome Canary behind a flag — that lets developers drop real, interactive DOM elements into WebGL or 2D canvas contexts. This bridges canvas's longstanding weaknesses around accessibility, i18n, and complex text rendering by making canvas children real layout participants via the layoutsubtree attribute.[17]Better Stack — HTML in Canvas
~02:01 Place HTML as a child of a canvas element, add layoutsubtree to the canvas, and use texElementImage2D (WebGL/Three.js) or drawElementImage (2D canvas) to render it as a live-updating texture. Updates happen automatically via a paint event whenever children re-render, or can be triggered manually, requestAnimationFrame-style. The demo showed a London Underground timetable with a live clock rendered inside a Three.js scene.
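A sketch of the 2D-canvas path, using the API names as reported in the video; the proposal is behind a Canary flag, so names and shapes may still change. The any cast reflects that TypeScript's DOM types don't include the API yet, and the element ids are hypothetical.

```typescript
// Sketch of the HTML-in-canvas proposal (experimental; API names follow
// the video and may change before shipping).

// Markup assumed: <canvas layoutsubtree><div id="timetable">...</div></canvas>
const canvas = document.querySelector<HTMLCanvasElement>("canvas")!;
canvas.setAttribute("layoutsubtree", ""); // children become layout participants

const timetable = canvas.querySelector<HTMLElement>("#timetable")!;
const ctx = canvas.getContext("2d") as any; // proposal API not in lib.dom yet

function paint(): void {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  // Draw the element subtree as a texture. It stays a real DOM node
  // (focusable, selectable, translatable) while rendering here.
  ctx.drawElementImage(timetable, 0, 0);
  requestAnimationFrame(paint); // or repaint only on the proposal's paint event
}
requestAnimationFrame(paint);
```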
~05:03 Known issues: performance instability, a bug where drawElementImage renders a frame behind the current DOM state (visual desync), and a crash when a scrollbar is placed inside canvas children.
Proposal includes privacy-preserving painting that excludes system colors, themes, visited-link state, and spelling markers from the rendered output — to avoid creating a new fingerprinting vector.
Jack Dorsey's 5M-view post framed "world models" as software replacing middle management. Nate B Jones pushes back: world models handle information flow well but fail silently at judgment.[18]Nate B Jones — Block + World Models Managers don't just route information — they edit it, decide what matters, distinguish seasonal blips from structural problems. A world model making those calls is exercising judgment it was never designed for, and the output looks identical on the surface.
~06:02 Three distinct approaches hide under the same label.
"High signal fidelity at the input layer creates an illusion of high judgment quality at the output layer — and the illusion is harder to see precisely because the inputs are really really good." — Nate B Jones
"It's tempting to build something that will look like intelligence. It's actually hard to build something that will act as intelligence." — Nate B Jones
A system-design tier list from an engineer with three years across Amazon and startups. S-tier (non-negotiable): REST APIs, load balancers, CDNs, Postgres, CI/CD, Docker, monitoring. Underrated: message queues. Overrated: caching, WebSockets. Delegate: auth (Auth0, Clerk, Supabase). Deprioritize: GraphQL, microservices, serverless, feature flags — all commonly over-applied in personal projects and small teams.[19]Arjay McCandless — system design tier list
"If you don't know what's happening inside your system, you're literally just waiting for your customers to find problems and send you a text." — Arjay McCandless
"Don't bother rolling up your own authentication. There are tons of great providers which will do this for you and will greatly lessen your risk of a security breach." — Arjay McCandless
"Feature flags — it's really not [important]. It can be pretty annoying to implement, and the amount of gain you get is usually not worth it unless you're working on a huge system." — Arjay McCandless
UX conventions were historically built around users uncomfortable with computers — practical workarounds like leaving Solitaire installed on hospital workstations so nurses could practice with a mouse. That era is ending: the only remaining generation that didn't grow up with personal computers is now 70+.[20]Real Python — UX assumptions Hamburger menus are now treated as obvious, not as learned behavior.