7 Million Parameters Just Embarrassed GPT-5.5

AI Models

DeepSeek V4: 1.6T Parameters, Full Open Source

DeepSeek V4 Pro is a 1.6 trillion parameter model with a 1 million token context window, built by a team roughly 40x smaller than OpenAI and without top-tier NVIDIA GPUs — yet matching frontier closed models on benchmarks including a perfect 120/120 on Putnam 2025.^{[1]AI Search} The model and full technical paper (including infrastructure secrets) are fully open-sourced.^{[2]Artificial Analysis}

The Architecture Breakthrough

~05:00 DeepSeek's core innovation is a hybrid attention system that solves the 1M-token context memory problem with three complementary pathways. Compressed Sparse Attention (CSA) groups 4 tokens into one dense representation and uses a "Lightning Indexer" for selective retrieval. Heavily Compressed Attention (HCA) compresses 128 tokens into one for a global summary. A sliding window preserves the last ~128 tokens at full fidelity. The result: 3.7x fewer FLOPs and a 90% smaller KV cache than V3.2.
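To make the three pathways concrete, here is a rough accounting sketch. It only counts how many cache entries each pathway would hold for a given context length, using the group sizes quoted in this section (4 tokens, 128 tokens, and a 128-token window). The function name and the one-entry-per-group assumption are illustrative; the real V4 cache layout, per-entry sizes, and the Lightning Indexer are not modeled, so this will not reproduce the 90% figure.

```python
import math

def cache_entries(context_len, csa_group=4, hca_group=128, window=128):
    """Count cache entries per pathway, assuming one entry per token group."""
    return {
        "csa": math.ceil(context_len / csa_group),    # 4 tokens -> 1 entry
        "hca": math.ceil(context_len / hca_group),    # 128 tokens -> 1 entry
        "window": min(context_len, window),           # recent tokens kept as-is
    }

entries = cache_entries(1_000_000)
print(entries, "total:", sum(entries.values()))
```

Under these assumptions a 1M-token context needs roughly a quarter as many entries as one entry per token, before any per-entry size differences are considered.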

Training Stability at Scale

~13:00 Manifold Constrained Hyperconnections (MHC) use doubly stochastic matrices, in which every row and column sums to one, so the mixing step cannot amplify the signal during training. The team verified its fused kernel code with the Z3 SMT solver. The custom Muon optimizer replaces AdamW with a two-phase approach for faster, more stable learning.
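The doubly stochastic property is easy to check numerically. The sketch below uses Sinkhorn-style alternating normalization to produce such a matrix; that choice is an assumption for illustration, since this summary does not say how DeepSeek constructs its matrices. The helper names `sinkhorn` and `mix` are hypothetical.

```python
def sinkhorn(matrix, iterations=100):
    """Alternately normalize rows and columns of a positive square matrix."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        m = [[v / sum(row) for v in row] for row in m]
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m

def mix(m, x):
    """Apply the mixing matrix to a vector of signals."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]

weights = sinkhorn([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
signal = [3.0, -2.0, 0.5]
mixed = mix(weights, signal)
# Each output is a weighted average of the inputs, so its magnitude
# cannot exceed the largest input magnitude.
print(max(abs(v) for v in mixed), "<=", max(abs(v) for v in signal))
```

Because every row is a set of non-negative weights summing to one, each output is a convex combination of the inputs, which is why the largest magnitude cannot grow.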

Infrastructure Secrets

~20:00 At 1.6T parameters spanning multiple GPU racks, network latency dominates over compute. DeepSeek overlaps computation and communication by breaking data into sequential waves, using fused kernels written in TileLang. The full open-source release includes infrastructure architecture details that closed labs treat as proprietary.
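A toy timing model shows why splitting work into waves helps. It assumes a two-stage pipeline in which one wave's communication overlaps the next wave's computation, with compute and communication time divided evenly across waves. The function names and the even-split assumption are illustrative and say nothing about DeepSeek's actual kernels.

```python
def serial_time(compute, comm):
    """No overlap: finish all compute, then all communication."""
    return compute + comm

def pipelined_time(compute, comm, waves):
    """Two-stage pipeline over equally sized waves."""
    c, m = compute / waves, comm / waves
    # The first wave computes alone and the last wave communicates alone;
    # every step in between costs the slower of the two stages.
    return c + (waves - 1) * max(c, m) + m

print(serial_time(8, 8), pipelined_time(8, 8, 4))
```

In this model, more waves push the total toward the slower of the two stages rather than their sum, which is the intuition behind the overlap.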

This is info that they never share. There are so many gold nuggets in this paper, it's hard to believe that they're revealing all of this for free.

Tools: DeepSeek V4, Hugging Face, TileLang

AI Models AI Future

Y Combinator

Recursion Is the Next Scaling Law

A 7 million parameter recursive model achieves 87% on ARC Prize 1 — a benchmark where trillion-parameter LLMs score near zero — by applying the same weights iteratively instead of scaling parameters.^{[3]Y Combinator} YC visiting partner Francois Shaard argues this proves LLMs have a hard reasoning ceiling that only recursion can break.

The LLM Reasoning Ceiling

~00:00 Transformers compute all token positions in a single forward pass. The number of reasoning steps equals the number of layers — a 30-layer model literally cannot sort a 31-element list. Chain-of-thought is a workaround bounded by training data; tool use is bounded by human knowledge.

It's provable that for comparison sort you can't do better than n log n steps, and if I have a list that's 31 elements long and my transformer is 30 layers, I run out of steps.
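The bound in the quote comes from a counting argument: a comparison sort must distinguish all n! orderings, so it needs at least ceil(log2(n!)) comparisons in the worst case. The sketch below computes that number exactly with integer arithmetic. How sequential comparisons map onto transformer layers is the speaker's framing and is not something this code establishes.

```python
import math

def min_comparisons(n):
    """Worst-case lower bound for comparison sorting: ceil(log2(n!))."""
    # For a positive integer k, (k - 1).bit_length() equals ceil(log2(k)).
    return (math.factorial(n) - 1).bit_length()

for n in (5, 31):
    print(n, min_comparisons(n))
```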

HRM: 27M Parameters Beat o3

~08:00 The Hierarchical Reasoning Model uses three nested recursion levels with the same weights — trained from scratch on only 1,000 tasks with zero pretraining. It achieved 70%+ on ARC Prize 1 while o3 scored literally zero.

TRM: 7M Parameters, 87% ARC

~22:00 Researcher Alexia's Tiny Recursive Model strips HRM to 7M parameters (4x smaller) while improving to 87% on ARC Prize 1. The key: collapse the dual network into one shared-weight network and backprop through one full recursion loop.

A 7 million parameter model can solve problems that a 100 billion or trillion parameter model trained on the entire internet cannot solve.
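The idea of reusing one small block many times can be illustrated with a toy analogy. This is not HRM or TRM. The "block" here is a single fixed pass of adjacent swaps, and `shared_step` and `recurse` are hypothetical names. The point is only that the same fixed step, applied more times, buys more computation without adding any parameters.

```python
def shared_step(values):
    """One left-to-right pass of adjacent swaps (the fixed, reused block)."""
    out = list(values)
    for i in range(len(out) - 1):
        if out[i] > out[i + 1]:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out

def recurse(values, steps):
    """Apply the same step repeatedly; depth comes from iteration count."""
    state = list(values)
    for _ in range(steps):
        state = shared_step(state)
    return state

data = [5, 3, 1, 4, 2]
print(recurse(data, 1))              # one pass is not enough
print(recurse(data, len(data) - 1))  # n - 1 passes always sort n items
```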

The Next Frontier

~34:00 The proposed synthesis: use large LLMs to build rich semantic embedding spaces, then deploy small recursive models inside that latent space — giving compute-depth without parameter-depth, unconstrained by discrete token space.

Tools: HRM, TRM, ARC Prize

AI Tools Developer Tools

AICodeKing Better Stack Developers Digest Artem Zhutov

AI Coding Agents: Codex Leads, Claude Code Slips

A head-to-head comparison of every major AI coding agent declares Codex the winner for its full ecosystem (local coding, cloud tasks, code review) and GLM 5.1 as the cross-tool portability champion.^{[4]AICodeKing} Claude Code is "no longer the automatic recommendation" due to pricing pressure. Meanwhile, new tools like Fallow^{[5]Better Stack} and Nimbalyst^{[6]Developers Digest} extend what coding agents can do.

The Verdict

AICodeKing tested Claude Code, Codex, GLM 5.1, and Kimi K2.6 across model quality, ecosystem flexibility, pricing, and consistency. Codex earns top billing as a full coding platform — not just a terminal agent — combining ChatGPT integration, cloud tasks, and code review at tiers starting free. GLM 5.1's key differentiator is openness: it works across Kilocode, Cursor, Claude Code, and other agents rather than locking you in. Claude Code remains technically strong for UI taste and code-base reasoning, but the reviewer cannot justify recommending it automatically. Kimi K2.6 is a "capable challenger, not yet a safe primary pick."

Fallow: Rust-Powered Code Intelligence

Fallow is a Rust-built CLI for TypeScript/JavaScript that detects dead code, duplication, and complexity. The workflow: install the Fallow Claude Code skill, run fallow dups --format json, let Claude fix duplicates without breaking functionality, put changes in a feature branch, run tests — finished in ~4 minutes. Paid runtime intelligence merges V8 coverage data to show which functions are actually triggered in production.

Nimbalyst: Visual Workspace for Codex + Claude Code

An open-source GUI orchestration layer over Claude Code and Codex with Kanban boards, parallel agent sub-sessions, Mermaid/Excalidraw rendering, AI-commit, and usage meters for both subscriptions side by side. Uses existing CLI auth — no new subscription required.

Claude Code Self-Improvement Skills

Artem Zhutov shares three practical skills:^{[7]Artem Zhutov} (1) Retrospective — session-end sub-agent that proposes skill/memory diffs for compound improvement; (2) Handoff — triggered at ~200k tokens to prevent context degradation; (3) Skill Manager — single router skill per domain to fix "skill rot" from 150+ overlapping descriptions.

Tools: Codex, Claude Code, GLM 5.1, Kimi K2.6, Fallow, Nimbalyst, Kilocode

AI Future AI Tools

AI Daily Brief OpenRouter

Harness-as-a-Service Is Real

The emergence of Cursor SDK, OpenAI's agents SDK, Anthropic's managed agents, and Microsoft's Foundry hosted agents signals a new infrastructure category — "harness as a service" — where companies sell pre-built agent runtimes the way AWS sells compute.^{[8]AI Daily Brief} OpenRouter published a guide for building custom harnesses with the Agent SDK.^{[9]OpenRouter}

What's Driving It

Q1 2026 earnings showed explosive AI-driven growth: Google Cloud +63% YoY, AWS +28%, Azure +39%, Meta +33%. The compute demand far outpaces supply. The insight is that the model alone isn't enough — you need the runtime, tools, memory, and orchestration. Companies are packaging that full stack.

Early Demos

Builders using the Cursor SDK are embedding Cursor's coding agent runtime into Gmail, Chrome plugins, and custom bug-catching workflows — demonstrating that harness-as-a-service unlocks agentic app development without requiring full infrastructure builds. One demo showed a browser-embedded agent that watches for JavaScript errors and auto-files PRs.

Tools: Cursor SDK, OpenAI Agents SDK, Claude Managed Agents, Microsoft Foundry, OpenRouter Agent SDK

AI Models

DeepLearning.ai Better Stack The Rundown

GPT-5.5 Hallucinates at 2x Claude's Rate

GPT-5.5 tops benchmarks (82.7% Terminal-Bench 2.0, 85% ARC-AGI-2) but hallucinates at 85.53% on the AA-Omniscience benchmark vs. Claude Opus 4.7's 36.18%.^{[10]DeepLearning.ai The Batch} Meanwhile, OpenAI traced ChatGPT's persistent goblin obsession to a single reward signal that spread through fine-tuning loops.^{[11]Better Stack}

The Hallucination Gap

OpenAI's latest model achieves raw capability scores exceeding all competitors, but the hallucination rate on the AA-Omniscience benchmark is more than double that of Claude Opus 4.7. This raises serious questions about deployment in high-stakes scenarios where factual accuracy matters more than benchmark performance.
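As a quick check on "more than double", the ratio of the two reported rates can be computed directly. The function name is hypothetical; the two percentages are the ones quoted in this section.

```python
def rate_ratio(higher, lower):
    """How many times larger one reported rate is than another."""
    return higher / lower

ratio = rate_ratio(85.53, 36.18)
print(f"{ratio:.2f}x")
```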

The Goblin Incident

"Goblin" mentions jumped 175% after the ChatGPT-5.1 launch. The root cause: the "Nerdy" personality preset gave higher reward scores when creature words appeared — driving two-thirds of all goblin references despite comprising only 2.5% of traffic. Fine-tuning loops recycled these outputs into the default mode, compounding across model generations.^{[12]The Rundown} GPT-5.5's Codex system prompt now explicitly bans goblins, gremlins, ogres, trolls, raccoons, and pigeons.

Tools: GPT-5.5, Claude Opus 4.7, Kimi K2.6

Industry

Sherwood Snacks Tech Brew AI Daily Brief

$700B AI Capex Still Not Enough

Big tech's combined AI capital expenditure has hit $700 billion but analysts argue it's still insufficient.^{[13]Sherwood Snacks} Plot twist: the spend is so enormous that human workers are starting to look economically competitive again — some companies face monthly AI bills exceeding $113K, and one Meta employee's monthly spend potentially exceeded $1M.^{[14]Tech Brew}

The Numbers

Global IT spending is projected at $6.31 trillion in 2026 (13.5% growth). Software firms spend roughly 10% of engineering labor costs on AI infrastructure. Meta workers consumed 60 trillion Claude tokens over 30 days. Companies that initially viewed massive AI spending as "budgetmaxxing" now face overruns "by orders of magnitude."

The Paradox

Despite job cuts at Meta and Microsoft targeting AI focus, the economics remain uncertain. Human workers offer accountability that AI currently cannot provide. OpenAI surpassed its 2029 Stargate compute goal of 10 GW within 3 months, adding 3 GW in the last quarter alone. The demand is real — but so is the cost spiral.

Q1 Earnings Context

Google Cloud grew 63% YoY, AWS 28%, Azure 39%. Hyperscalers continue raising capex guidance quarter over quarter, but the gap between investment and monetization remains a concern for investors watching burn rates.^{[8]AI Daily Brief}

Hot Take Industry

The Rundown Theo (t3.gg) Nerd Snipe

The Anthropic Situation: Billing Bugs and Third-Party Hostility

The White House is reversing its stance on Anthropic, now seeking access to the Mythos model after GPT-5.5 reached similar cyber capabilities.^{[12]The Rundown} Meanwhile, Theo documented a confirmed bug where Claude Code charges overage fees based on git commit history containing banned third-party tool keywords — not active code.^{[15]Theo (t3.gg)}

White House Reversal

The administration previously opposed Anthropic expanding private sector access from ~50 to ~120 companies. A new memo would allow agencies to bypass supply chain risk designations. Internal discord persists: Secretary of War Pete Hegseth described Anthropic as "run by an ideological lunatic."

The Billing Bug

~00:00 Theo demonstrated the bug live: he created an empty repo, made a single commit with "OpenClaw" in the commit message (no system prompt changes), and confirmed that Claude Code immediately routed to overage billing. The cause: harness detection scans the system prompt, which includes injected git commit history. Anthropic's Thor acknowledged it as a bug.

Do you understand how egregiously you have to suck at your job as a developer to write code in such a way that you accidentally bill a user $200 because they had the wrong string in their commit messages?
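The failure mode described here can be sketched in a few lines. Everything below is hypothetical: the prompt layout, the keyword list, and the function names are assumptions made for illustration, not Anthropic's code. The sketch only shows why scanning a prompt that contains injected commit history produces a false positive, while scanning only the harness-authored instructions does not.

```python
# Hypothetical keyword list; "OpenClaw" is the string used in the demo.
BANNED_KEYWORDS = ("openclaw",)

def build_system_prompt(base_instructions, commit_messages):
    """Hypothetical prompt assembly: instructions plus injected git history."""
    history = "\n".join(f"- {msg}" for msg in commit_messages)
    return f"{base_instructions}\n\nRecent commits:\n{history}"

def naive_detect(system_prompt):
    """Scans the whole prompt, including injected commit history."""
    text = system_prompt.lower()
    return any(word in text for word in BANNED_KEYWORDS)

def scoped_detect(base_instructions):
    """Scans only the harness-authored instructions."""
    text = base_instructions.lower()
    return any(word in text for word in BANNED_KEYWORDS)

base = "You are a coding assistant."
prompt = build_system_prompt(base, ["Add OpenClaw notes to README"])
print(naive_detect(prompt), scoped_detect(base))
```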

The Compute Economics

~03:01 Theo provides context: the $200/month Max plan gives ~$2,000/month of inference value (10x subsidy). T3 Chat's April Anthropic bill: ~$40,000. Their OpenAI bill for significantly more usage was roughly half that. Anthropic's cache write costs are uniquely punishing vs. OpenAI and Google.

I spend $40,000 a month on Anthropic. If you wonder why I like OpenAI, it's not because they pay me. It's because I pay them. And I pay them less for better services.

Tools: Claude Code, OpenClaw, Hermes Agent, T3 Chat

AI Tools

Nate B Jones

Local AI Hardware: RTX 5090 vs Mac Studio vs DGX Spark

AI agents need deep access to local files, memory, and tools — making the machine on your desk important again. Nate B Jones tested all three hardware paths and provides concrete use-case guidance for each.^{[16]Nate B Jones}

The Three Paths

Mac Studio — for privacy-focused knowledge workers. Unified memory architecture handles large models well.
RTX 5090 — for throughput-focused developers. Raw CUDA performance for training and inference.
DGX Spark — a packaged CUDA appliance. Turnkey but expensive.

The Local AI Stack

The local stack has four layers: a runtime (Ollama, LM Studio, vLLM), a model portfolio (small, large, and specialized models), memory (Open Brain, Postgres, SQLite), and an interface. The key insight: runtime quality and memory ownership matter more than model choice. Agents are reversing the cloud-first trend because they need persistent access to your local context.
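To illustrate what "memory ownership" can look like at its simplest, here is a minimal sketch of an agent memory store on SQLite, using only Python's standard library. The table layout and the function names `open_memory`, `remember`, and `recall` are assumptions for illustration and are not taken from the video or from any of the tools listed.

```python
import sqlite3

def open_memory(path=":memory:"):
    """Open (or create) a local notes database; ':memory:' keeps it in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "id INTEGER PRIMARY KEY, topic TEXT NOT NULL, body TEXT NOT NULL)"
    )
    return conn

def remember(conn, topic, body):
    """Store one note; the 'with' block commits the transaction."""
    with conn:
        conn.execute("INSERT INTO notes (topic, body) VALUES (?, ?)", (topic, body))

def recall(conn, topic):
    """Return all notes for a topic in insertion order."""
    rows = conn.execute("SELECT body FROM notes WHERE topic = ? ORDER BY id", (topic,))
    return [body for (body,) in rows]

conn = open_memory()
remember(conn, "project", "Staging database runs on Postgres")
print(recall(conn, "project"))
```

Passing a file path instead of `:memory:` keeps the notes on disk, so they persist between agent sessions and stay on the local machine.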

Tools: RTX 5090, Mac Studio, DGX Spark, Ollama, LM Studio, vLLM, Open Brain

Developer Tools Industry

Low Level Better Stack The Rundown

Security: Universal Linux Exploit and GitHub RCE

A 732-byte Python script achieves root privilege escalation on every Linux distro since 2017 via a logic bug in the kernel's AF_ALG crypto socket subsystem.^{[17]Low Level} Separately, CVE-2026-3854 enabled GitHub RCE via semicolon injection in Git push options — found using AI and patched in 6 hours.^{[18]Better Stack}

copy.fail: Universal Linux Root

The exploit (dubbed "copy.fail") targets AF_ALG — the kernel's user-space crypto API. A logic bug allows a 732-byte Python script to achieve local privilege escalation on every Linux distro running kernels from 2017 onward, requiring no distro-specific patching.

GitHub Semicolon RCE

CVE-2026-3854 exploited Git push options where a semicolon could inject arbitrary commands. AI was used to discover the vulnerability. GitHub patched within 6 hours of responsible disclosure.
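The vulnerable code is not described here, so the following is only a generic hardening sketch for this class of bug: validate each push option against an allowlist and pass it as a separate argument rather than splicing it into a command string. The allowlist pattern and the function names are assumptions; `--push-option=<value>` is the standard `git push` flag.

```python
import re

# Hypothetical allowlist: letters, digits, and a few safe punctuation marks.
_SAFE_OPTION = re.compile(r"[A-Za-z0-9_.=/-]+")

def is_safe_push_option(value):
    """Reject anything containing separators such as ';', spaces, or newlines."""
    return _SAFE_OPTION.fullmatch(value) is not None

def build_push_args(remote, branch, options):
    """Build an argument list (never a shell string) after validating options."""
    for opt in options:
        if not is_safe_push_option(opt):
            raise ValueError(f"rejected push option: {opt!r}")
    args = ["git", "push"]
    args += [f"--push-option={opt}" for opt in options]
    return args + [remote, branch]

print(build_push_args("origin", "main", ["ci.skip"]))
```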

AI-Powered Security Tools

Anthropic launched Claude Security (public beta) for enterprise vulnerability scanning and patching. Cursor released Security Review for automated codebase vulnerability detection with Slack integration.^{[12]The Rundown}

Tools: Claude Security, Cursor Security Review

AI Future Podcast

AI Engineer

swyx at AI Engineer: Agents for Everything Else

swyx shares how his nine-person team uses coding agents like Devin to run a $9M+ business and manage a 1,000-person conference — and argues that the real unlock is agents for all knowledge work, not just coding. 60% of Vercel's user base is now bots.^{[19]AI Engineer — swyx}

~00:07 swyx recaps his yearly keynote themes: year one was productivity gains, year two was falling cost curves (100x per 12-18 months), year three was "tiny teams" — companies with more millions in revenue than employees.

~02:08 AI Engineer itself as the proof case: nine full-time people running $9M+ revenue. The turning point came from a non-technical designer in Indonesia who started prompting Devin with annotated red-line Figma mockups without being taught — shipping a pixel-perfect website.

~04:10 Key insight: agents eliminate yak-shaving and unlock parallelism, not just autonomy. Employees do more because the tight feedback loop makes it fun — Easter eggs, animations, polish that would never be prioritized otherwise.

~08:10 The conference schedule (130 speakers, sponsors, attendees) is entirely managed by Devin: forward an email, Devin handles it. Devin also handles ETL, vendor syncing, and even procurement — it researched the lobster vendor for the conference party.

~10:10 Prediction: "agents for everything else" is a top-three trend of 2026. Coding agents are breaking containment into knowledge management and wikis. Custom UIs are giving way to agent experience layers.

Tools: Devin, Vercel, Cursor, Claude Code

AI Tools

AI Engineer

Steve Ruiz at AI Engineer: Agents on the Canvas

tldraw's founder demos a progression from the 2023 "Make Real" prototype to multi-agent "fairies" on canvas, culminating in a desktop app that gives Claude direct JavaScript execution against the canvas runtime — enabling genuinely agentic local-first workflows.^{[20]AI Engineer — tldraw}

~00:14 tldraw is simultaneously a free whiteboard, a startup, and an SDK used inside Replit's agent canvas and Luma AI.

~02:15 "Make Real" (2023): sketch a UI, send to a vision model, get working HTML. "Quaint in 2026" but predated Lovable and vibe coding.

~06:19 Structured output drawing: native canvas shapes instead of image diffusion. Sidesteps vision-model spatial reasoning issues (inverted y-axis, stage-left vs viewer-left).

~09:22 "Fairies" — agents as animated sprites at fairies.tldraw.com. Multiple fairies run in parallel, one is elected leader, scouts canvas state, delegates tasks. Emergent orchestration.

~14:28 Desktop Electron app: HTTP port, any POST executes JavaScript against the tldraw editor API. Claude "has no qualms" about script injection, cheerfully rewrites minified bundles. Local-first + file-based = maximum agent agency without server-side risk.

Tools: tldraw, Claude, Replit, Luma AI, Electron

Industry

AI Engineer

Stripe at AI Engineer: AI Pricing

Stripe's Mayank Pant reveals the top 100 AI companies reached $20M ARR in 20 months vs 65 months for SaaS (3x faster), and presents a five-step pricing framework. Hybrid pricing (base fee + usage) grew from 6% in 2024 to 56% adoption among AI leaders.^{[21]AI Engineer — Stripe}

~00:07 Key challenges: 5-10% of users consume 80% of compute, unpredictable infrastructure costs, technical metrics customers don't understand, product velocity outpacing pricing.

~04:11 "Your first price is a hypothesis, not a commitment." Hypergrowth companies (100%+ YoY) changed pricing 3+ times in two years. Only 22% of low-growth companies did the same.

~05:12 Five-step framework: (1) define customer-perceived value via four archetypes (automation, augmentation, enhanced service, improved results); (2) define charge metrics customers understand; (3) translate value with credits; (4) build guardrails (caps, 50/70/90% alerts, rate limiting); (5) iterate continuously with A/B tests.
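Two pieces of this framework lend themselves to a short sketch: hybrid pricing (base fee plus usage) and the 50/70/90% alert guardrail from step 4. The code below is not Stripe's API. All names and numbers are illustrative assumptions, and amounts are in integer cents to avoid floating-point rounding.

```python
ALERT_THRESHOLDS = (50, 70, 90)  # percent of budget, as in step 4

def hybrid_invoice_cents(base_fee, included_units, unit_price, units_used, cap=None):
    """Base fee plus per-unit overage, optionally limited by a spending cap."""
    overage = max(0, units_used - included_units)
    total = base_fee + overage * unit_price
    return total if cap is None else min(total, cap)

def alerts_crossed(units_used, unit_budget):
    """Return every alert threshold the customer has reached so far."""
    percent = 100 * units_used / unit_budget
    return [t for t in ALERT_THRESHOLDS if percent >= t]

print(hybrid_invoice_cents(2000, 100, 10, 150))  # 50 units of overage
print(alerts_crossed(75, 100))
```

Whether a cap should limit the bill, as here, or limit usage through rate limiting is a product decision; the talk lists both as guardrails.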

~17:24 Infrastructure principle: if every pricing change costs 3-4 months of engineering, the strategy is dead. Stripe handles subscription + usage + hybrid; Metronome for complex enterprise contracts.

Tools: Stripe, Metronome, Intercom, Lovable, Eleven Labs

Developer Tools

AI Engineer

Braintrust & Trainline at AI Engineer: AI Apps in Production

A hands-on workshop building a five-stage support triage agent, then instrumenting it with tracing, golden-set evaluations (deterministic + LLM-as-judge), managed prompts for non-engineers, and online scoring. Trainline's case study: using offline/online eval to safely validate model switches at 27M-user scale.^{[22]AI Engineer — Braintrust}

The talk demonstrates the full "flywheel" from prototype to production: start with a support triage agent that classifies tickets, extracts entities, drafts responses, and routes. Then add tracing, build golden-set evaluations (both deterministic string matching and LLM-as-judge for tone/helpfulness), and enable managed prompts so product managers can iterate without engineering deploys.

Trainline (27M users, European rail booking) shared their real-world process for switching between models: run candidate models against 200+ golden examples, compare scores on latency/cost/quality dimensions, and deploy with online scoring that monitors production quality continuously. The key learning: you need both offline eval (pre-deploy confidence) and online scoring (post-deploy monitoring) — neither alone is sufficient.
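The offline half of that process can be sketched without any vendor SDK. The harness below scores a candidate model against a golden set with a deterministic exact-match check and an optional judge callable. It does not use the Braintrust API; the function names, the example data, and the stub models are assumptions for illustration, and a real judge would call an LLM.

```python
def exact_match(expected, actual):
    """Deterministic scorer: case-insensitive string equality."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def evaluate(model, golden_set, judge=None):
    """Average the deterministic score, plus an optional judge score."""
    exact, judged = [], []
    for example in golden_set:
        output = model(example["input"])
        exact.append(exact_match(example["expected"], output))
        if judge is not None:
            judged.append(judge(example["input"], output))
    result = {"exact": sum(exact) / len(exact)}
    if judged:
        result["judge"] = sum(judged) / len(judged)
    return result

golden = [
    {"input": "Where is my refund?", "expected": "billing"},
    {"input": "My train was cancelled", "expected": "disruption"},
]
# Stub candidates standing in for two models under comparison.
candidate_a = lambda text: "billing"
candidate_b = lambda text: "disruption" if "train" in text.lower() else "billing"
print(evaluate(candidate_a, golden), evaluate(candidate_b, golden))
```

Running both candidates against the same golden set gives the pre-deploy comparison; the online scoring half would apply the same scorers to sampled production traffic.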

Tools: Braintrust, Trainline

Podcast AI Future

EO

Drew Bent (Anthropic): How Top Learners Use AI

Anthropic's education lead shares research showing students using AI scored 17% lower on subsequent assessments, except those who used it in an inquiry-based way. The biggest differentiator for AI power users is context loading, not prompting technique.^{[23]EO}

~00:00 AI-native users (who started with current tools, not 2022 versions) have an advantage because they see capabilities without outdated mental models. They treat AI as a collaborator, not an assistant.

~05:03 Anthropic coding education study: students using AI finished faster but scored 17% lower on assessments without AI — except those who probed and asked questions rather than just getting answers.

The group that didn't use AI tools actually performed 17% better. They understood the concepts better because they had slogged through this work without AI.

~09:07 The biggest gap: top users spend most of their time loading context (documents, stream-of-consciousness thinking, background) before even asking their question. Come with problems, not solutions.

It's not just a technical skill. Sure, early days of AI was how do you prompt it this way, but ultimately you have to treat this more as a colleague, as a collaborator. And so then it becomes more like a social skill.

~11:08 Teachers in a global WhatsApp group build custom tools with Claude and Claude Code every morning and deploy same-day. By 2030: AI learning companions that know your curriculum and learning history.

Tools: Claude, Claude Code, Claude Artifacts, Khan Academy

Podcast Hot Take

EO

Gumloop: "50 AI Agents Running My Company" Is a Lie

Max Brodeur-Urbas, founder of Gumloop (4M workflows/day for Instacart, Shopify, DoorDash), debunks the AI agent hype: only automate what you deeply understand, course bros selling $30K AI recipes are exploiting vulnerable people, and vibe coding has hard limits.^{[24]EO}

~03:02 Origin story: deported from the US and banned for 5 years, Max spent 6 months rapidly building and discarding MVPs weekly. Discovered AutoGPT's Discord was full of non-technical users asking basic setup questions — built AgentHub as a UI for them.

~06:04 Key pivot: autonomous agents were unreliable. Gave users predictable step-by-step automation instead. Reliability over autonomy. Platform grew to 80% non-technical users (business admins, ops, HR).

~09:06 The anti-hype thesis: "50 AI agents running my company" people are building slot machines. If you use AI to code and don't know how to code, you're making malware.

The course bros selling $30K weekend recipes are exploiting vulnerable people during hype bubbles similar to crypto/NFTs.

~13:08 Hiring: their Instacart, Webflow, and Shopify customers quit to join them. Filter: "Could I hang out with this person 24/7?"

Tools: Gumloop, AutoGPT, Instacart, Shopify, DoorDash

Podcast Industry

EO

Dan Wang: China Doesn't Need Better AI to Beat America

Stanford/Hoover research fellow Dan Wang argues the US has at best a moderate AI lead over China, and that China's manufacturing ecosystem, energy infrastructure (300GW solar vs 30GW US, 40 nuclear plants vs 0), and iteration speed may matter more than model sophistication.^{[25]EO}

Wang, author of "Breakneck," argues the conventional framing — US leads in AI models, China follows — misses the bigger picture. China's advantages in manufacturing ecosystems, energy buildout, and industrial iteration speed create compounding advantages that model sophistication alone cannot offset.

Key stats: China has 300GW of solar capacity vs 30GW in the US. China is building 40 nuclear plants while the US has zero under construction. Shenzhen can iterate on hardware prototypes in days versus weeks in the US. These infrastructure advantages create the physical substrate on which AI systems run — and China is building it faster.

The implication: even if US labs maintain a model quality lead, China's ability to deploy AI at scale — cheaply, in physical systems, with abundant energy — may determine which country captures more economic value from the technology.

Podcast Developer Tools

Real Python marimo

Real Python Podcast: Agentic Data Science with marimo pair

Trevor Manz from marimo discusses "marimo pair", an agent skill that teaches coding agents how to use marimo reactive notebooks as a tool for data science workflows, enabling pair programming between humans and AI agents.^{[26]Real Python Podcast} Separately, marimo demoed Evoke, a hierarchical clustering tool that runs in 100ms vs 20s for UMAP.^{[27]marimo}

marimo pair is an agent skill (not a standalone product) that gives coding agents access to live notebook state — cell outputs, dataframes, plots. The key design insight: build declarative agent skills, not imperative ones. Tell the agent what the notebook can do, not step-by-step how to do it.

The Evoke demo showed hierarchical clustering for embedding exploration that renders in 100ms (vs 20 seconds for UMAP projections), with parallel coordinates and treemap widgets for interactive exploration. Also demoed: "Wiggly Stuff" — draw a curve and get a histogram via spline interpolation.

Tools: marimo, marimo pair, Evoke, UMAP, Claude Code

Sources