Claude Code leaks, the memory wall cracks

April 1, 2026

10 topics across 10 sources: 6 YouTube, 4 blogs/newsletters. Two architectural walls come down — Claude Code's source map gets shipped to npm, and Kimi cracks the deep-model amnesia problem.

AI Tools Industry Hot Take
Fireship

Anthropic Accidentally Ships Claude Code's Source Map to npm

At 4 AM on March 31, Anthropic's claude-code npm package v2.1.88 shipped with a 57 MB source map file — over 500,000 lines of readable TypeScript that stayed mirrored across GitHub for hours before DMCA takedowns landed.[1]Fireship — Anthropic leaks Claude Code source code Two fork projects — Python rewrite “Claw Code” and model-agnostic “OpenClaw” — hit 50K+ GitHub stars within a day. The leak exposes anti-distillation poison pills, an undercover mode, a regex-based frustration detector, and feature-flag names for Opus 4.7, Capybara, ultra-plan, buddy (a Tamagotchi companion), and “chronos” (a scheduled background-agent journal).

Read more

How it happened

Most build tools strip source maps from production bundles automatically. Claude Code runs on Bun.js (acquired by Anthropic), and a GitHub issue filed three weeks earlier reported that Bun was serving source maps in production (~02:15). Whether the root cause was the Bun bug, an unlucky developer, or a rogue insider is unconfirmed.

What the leak actually reveals

  • 11-step prompt pipeline. Fireship points to a fan-built breakdown site showing the full input-to-output flow. The takeaway: Claude Code is “a dynamic prompt sandwich glued together with TypeScript,” not alien tech.
  • ~25 real tools. The codebase hard-codes roughly 25 tools. The bash tool alone is 1000+ lines of parsing/execution logic.
  • Anti-distillation poison pills. Claude Code pretends non-existent tools exist so that models trained on its outputs learn to call fake APIs. Now that the real tool list is public, the poison pills are moot.
  • Undercover mode. Instructions telling Claude to never mention itself in commit messages or outputs — stated reason is leak prevention, but critics read it as obscuring AI involvement in open-source contributions.
  • Regex frustration detector. The state-of-the-art model literally regex-matches user prompts for keywords like “balls” to log a frustration event (see the sketch after this list).
  • Comment density. The codebase has far more comments than a typical human-written repo — read as self-documentation for the AI that writes the next iteration.
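
What a detector like that might look like, as a hypothetical reconstruction (not the leaked code; the keyword list and event name are invented):

```python
# Hypothetical reconstruction of a regex frustration detector.
# Keyword list and telemetry call are invented, not from the leak.
import re

FRUSTRATION = re.compile(r"\b(balls|ffs|wtf|useless)\b", re.IGNORECASE)

def check_frustration(prompt: str) -> bool:
    """Return True (and log) if the prompt matches a frustration keyword."""
    if FRUSTRATION.search(prompt):
        print("telemetry: frustration_event")  # stand-in for the real logger
        return True
    return False
```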

The roadmap leak

Feature flags name-check Opus 4.7 and a new model Capybara (possibly the teased “Mythos”). Other flags: ultra-plan, coordinator mode, daemon mode, chronos (background-agent daily journal with “dream mode” for memory consolidation), and buddy — a customizable Tamagotchi-style companion every developer gets (~06:05). Possibly an April Fool's joke; possibly not.

Your top secret application is just one npm publish away from becoming open source.

Second-order exposure: Axios

The leak also revealed that Claude Code uses axios, the npm package that was itself compromised by North Korean actors the day before. If a remote-access trojan landed on Anthropic's servers via the axios supply-chain attack, this week could get worse.

Tools: Claude Code, Bun.js, npm, Claw Code (Python fork), OpenClaw (model-agnostic fork), axios, OpenAI Codex, GitHub
AI Models
AI Search

Kimi's “Attention Residuals” Crack the Deep-Model Amnesia Problem

The Kimi team published “attention residuals” — a reworking of residual connections that lets each layer selectively attend to all prior layers instead of inheriting one cumulative pile.[2]AI Search — They solved AI's memory problem Models with the fix match baseline performance with 1.25x less training compute, jump 7.5 points on GPQA Diamond, and make scaling deeper rather than wider viable without signal collapse. A “block attention residuals” variant keeps the design compatible with pipeline-parallel training across racks.

Read more

The problem: 50 chefs in one pot

Traditional residual connections (2015) add every layer's output into one cumulative signal (~03:00). By layer 120, early information is buried and later layers have to “scream” to have influence. Kimi's analogy: 50 chefs all dumping ingredients into one pot — the basil from chef 1 is untasteable by the end.

The fix: treat depth like sequence length

Transformers already solved a structurally similar problem at the token level. Kimi applies attention along the depth axis: each layer has query/key/value vectors and can selectively pull from specific prior layers rather than inheriting the mix (~09:00). “Each layer walks into a buffet” instead of getting the pot.
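
A minimal sketch of attention along the depth axis, assuming per-token attention over layer outputs and a single set of learned projections (the paper's exact parameterization will differ):

```python
# Minimal sketch: residual mixing replaced by attention over *layers*.
# Assumed parameterization, illustrative rather than the paper's design.
import torch
import torch.nn as nn

class DepthAttentionResidual(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, layer_out: torch.Tensor, history: list) -> torch.Tensor:
        # history: outputs of all prior layers, each (batch, seq, d_model)
        stack = torch.stack(history + [layer_out])           # (depth, B, S, D)
        q = self.q(layer_out)                                # (B, S, D)
        k, v = self.k(stack), self.v(stack)                  # (depth, B, S, D)
        # score each prior layer per token, then mix selectively
        scores = torch.einsum("bsd,lbsd->lbs", q, k) / q.shape[-1] ** 0.5
        weights = scores.softmax(dim=0)                      # normalize over depth
        return torch.einsum("lbs,lbsd->bsd", weights, v)
```

The residual update x = x + layer(x) becomes x = depth_attn(layer(x), history), with history growing by one entry per layer: each layer picks from the buffet instead of inheriting the pot.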

The infra gotcha — and the fix

Naive attention residuals blow up pipeline-parallel training because every layer now needs cross-rack communication. Block attention residuals segment the model by server rack: full attention-residual inside each block, a single compressed handoff between blocks (~16:00). This is the detail that makes the paper practical rather than theoretical.
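
Continuing the sketch above, the block variant might look like this; the mean-pool compression is a placeholder assumption, chosen only to show the communication pattern:

```python
# Sketch of block attention residuals (reuses DepthAttentionResidual
# from the previous sketch; the compression step is a placeholder).
import torch
import torch.nn as nn

class BlockStage(nn.Module):
    """One pipeline stage (e.g., one rack): full depth-attention inside,
    a single compressed tensor handed across the rack boundary."""
    def __init__(self, layers: nn.ModuleList, d_model: int):
        super().__init__()
        self.layers = layers
        self.depth_attn = DepthAttentionResidual(d_model)
        self.compress = nn.Linear(d_model, d_model)

    def forward(self, handoff: torch.Tensor) -> torch.Tensor:
        history = [handoff]          # only this block's layers are visible
        x = handoff
        for layer in self.layers:
            x = self.depth_attn(layer(x), history)
            history.append(x)
        # one tensor crosses the boundary, not the whole layer history
        return self.compress(torch.stack(history).mean(dim=0))
```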

Results

  • 1.25x less compute for equal performance during training (both full and block variants).
  • +7.5 points on GPQA Diamond (graduate-level science questions).
  • Gains on MMLU, math, and coding — benchmarks that reward multi-step reasoning.
  • Outperforms DeepSeek's MHC (manifold constraint hyperconnections) on the same eval.
  • Gradients distribute more evenly across layers — the training signal doesn't collapse.

The depth-vs-width study

Kimi trained 25 models with identical parameter counts but different shapes. Baseline models hit a wall going deep. Models with attention residuals kept improving with depth (~21:00). Depth becomes a lever instead of a ceiling.

The model can now constantly reconfigure itself: for every input, it builds a custom pathway through its layers, uses it, and then discards it.

The attention patterns the researchers visualized show locality (most layers attend to neighbors) plus sudden long-range jumps back to early layers — the model dynamically rewiring itself, which they compare to neural plasticity.

Tools: Kimi, attention residuals paper, block attention residuals, DeepSeek MHC, GPQA Diamond, MMLU
Hot Take Industry
Low Level

Claude Writes a Working FreeBSD Kernel ROP Chain in ~20 Prompts

A security researcher handed Claude a known FreeBSD kernel stack-buffer overflow (CVE-2026-4747) and got a fully working remote-kernel-execution exploit back in under 20 prompts — including return-oriented programming, a clean kthread_exit to avoid panicking the kernel, and pmap_change_protections to mark the BSS executable.[3]Low Level — AI is REALLY good at hacking now The AI did not find the bug; it did weaponize it at a level that previously required a senior exploit-dev to pull off.

Read more

The bug itself is mundane

FreeBSD's RPC daemon has a length field passed straight into memcpy targeting a stack buffer — textbook memory-safety failure (~01:00). The host argues Claude would have found this instantly if pointed at the code.

Why the exploit is the impressive part

Modern mitigations (ASLR, non-executable stack) mean you can't just jump to shellcode on the stack. You need return-oriented programming: chain existing executable snippets (pop RDX; ret, etc.) to do your work (~02:30). Claude's chain (a schematic payload layout follows the list):

  1. Overflow the stack buffer, overwrite the return address.
  2. Chain gadgets to call pmap_change_protections on the BSS page, setting permissions to RWX (7 = read 4 + write 2 + execute 1).
  3. Write shellcode to the BSS, 32 bytes at a time via mov-family gadgets.
  4. Jump to BSS, execute shellcode that spawns /bin/sh as root.
  5. Call kthread_exit to cleanly terminate the kernel thread so the kernel doesn't panic (~05:00).
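
Schematically, steps 1-2 produce a classic ROP payload layout. Everything below is a placeholder (invented padding, addresses, and argument setup), with no real gadgets or shellcode:

```python
# Schematic ROP payload layout. All offsets/addresses are invented
# placeholders; this is the shape of the technique, not the exploit.
import struct

p64 = lambda x: struct.pack("<Q", x)   # little-endian 64-bit word

PADDING = b"A" * 0x108                 # hypothetical distance to saved RIP
POP_RDI_RET = 0x0000DEAD0001           # placeholder gadget addresses
POP_RSI_RET = 0x0000DEAD0002
PMAP_CHANGE_PROT = 0x0000DEAD0003      # placeholder kernel symbol
BSS_PAGE = 0x0000DEAD1000              # placeholder writable page

payload = PADDING
payload += p64(POP_RDI_RET) + p64(BSS_PAGE)  # arg 1: page to remap
payload += p64(POP_RSI_RET) + p64(7)         # arg 2: RWX (4 + 2 + 1)
payload += p64(PMAP_CHANGE_PROT)             # flip the BSS to executable
# ...followed by mov-gadget writes of shellcode, a jump into the BSS,
# and a final return into kthread_exit so the kernel thread dies cleanly.
```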

The kthread_exit observation is the tell. That's not a pattern you stumble onto — it's the correct professional choice for this exact context (exploiting a kernel RPC thread), and Claude made it unprompted.

I'm a pretty decent reverse engineer... AI is faster than me, man. It may not have as much experience as I do, but if I point Claude Sonnet into Ghidra MCP with a binary open, it can find vulnerabilities extremely quickly.

A caveat on the test setup

FreeBSD apparently ships with kernel ASLR off by default, which made the fixed gadget addresses tractable. Not representative of hardened Linux or Windows kernels — but a signal of where defenders sit against AI-augmented attackers in the easy case.

Tools: Claude (Sonnet), Ghidra MCP, FreeBSD kernel, ROP gadgets, pmap_change_protections, kthread_exit
AI Models
Two Minute Papers

Google's TurboQuant KV-Cache Compression: Three Old Ideas, One Big Win

Google's “TurboQuant” method compresses the KV cache (the model's short-term memory) using three well-known primitives: quantization, random rotation before rounding, and the Johnson-Lindenstrauss transform for distance-preserving projection.[4]Two Minute Papers — Google's New AI just broke my brain The headline claim is a 4-6x memory reduction and an 8x attention speedup; independent reproductions landed at a more sober but still impressive 30-40% memory reduction plus a ~40% prompt speedup — no quality-vs-memory trade-off, which is rare.

Read more

The key insight

If you simply round numbers to save memory, an arrow-like vector concentrated along one axis snaps to the grid and loses nearly everything (~02:15). Rotate the vector randomly first and its energy spreads across axes; rounding then loses a little from everywhere instead of everything from one place. Then apply the JL transform to shrink dimensions while approximately preserving pairwise distances.
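
A toy numpy demo of the rotation trick, using extreme 1-bit rounding so the effect is visible (assumed details; TurboQuant's actual quantizer is far more careful):

```python
# Toy demo: quantizing an axis-aligned vector destroys its direction;
# rotating first spreads the damage. Details assumed, not TurboQuant's.
import numpy as np

rng = np.random.default_rng(0)
d, k = 128, 64

Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal rotation
JL = rng.standard_normal((k, d)) / np.sqrt(k)     # Johnson-Lindenstrauss map

def one_bit(x):
    return np.sign(x) * np.abs(x).mean()          # keep signs, one shared scale

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.zeros(d); v[0] = 10.0                      # the "arrow" on one axis
v += 1e-3 * rng.standard_normal(d)

print(cosine(one_bit(v), v))                      # ~0.09: direction destroyed
print(cosine(Q.T @ one_bit(Q @ v), v))            # ~0.80: a little lost everywhere

u = rng.standard_normal(d)                        # JL: distances roughly survive
print(np.linalg.norm(v - u), np.linalg.norm(JL @ v - JL @ u))
```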

Sometimes you don't need to invent brand new theories. Sometimes you need a smart combination of existing methods.

Reproduced benchmarks

  • KV-cache memory: 30-40% reduction (not the media's 4-6x, which is a best-case corner).
  • Prompt processing: ~40% faster.
  • Output quality: no meaningful degradation.
  • Code reproductions appeared within a week of the paper.

Controversy

Other researchers flagged overlap with prior quantization work; the paper was accepted but not all reviewers agreed the concerns were fully resolved. Dr. Károly Zsolnai-Fehér's angle is that this kind of combinatorial innovation is undersold in modern ML.

Practical implication in a memory-constrained year: long-context workloads (big PDFs, full codebases, movie-length video) get cheaper by a few gigabytes per session. In a world of GPU/laptop memory shortages driving up prices, “a few GB less per session” translates into real hardware cost savings.

Tools: TurboQuant, KV cache, Johnson-Lindenstrauss transform, Lambda GPU Cloud, DeepSeek
Podcast AI Tools
Matt Williams

Matt & Ryan: Tool-Calling Is Harder Than You Think, and Sub-Agents Shouldn't Talk

Matt Williams and Ryan spend an hour on the unglamorous realities of building agent harnesses on Ollama — why Ollama tool-calling went from flexible to rigid, why agents talking to each other via IRC explodes into a “wallet attack,” and why Tailscale's new Aperture AI gateway (still alpha, paid-only) is exactly what a local-first agent setup needs.[5]Matt Williams — Matt and Ryan have a chat, March 31 2026 Also: a malicious OpenClaw mirror injected Armenian-language tool-call payloads, Ollama finally shipped MLX support (three months after it was usable), and they fact-check the Claude Code leak in real time.

Read more

Tool-calling was always there — the API just got rigid

Ollama has supported tool-calling since day one (~01:30); you built the prompt yourself. The later “tool-calling support” announcement wrapped that with a more rigid API. Matt's take: the rigid wrapper made tool-calling “a quarter as usable” as the old prompt-your-own approach, though it's standardized now.
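
That prompt-your-own approach is simple enough to sketch against Ollama's raw generate endpoint (a minimal sketch: the tool contract, model name, and parser are invented, and it assumes a local Ollama on the default port):

```python
# DIY tool-calling: a prompt contract plus your own parser, no tools API.
# The contract and model name are assumptions; the endpoint is Ollama's.
import json, urllib.request

SYSTEM = (
    "You can call one tool: get_weather(city: str).\n"
    'To call it, reply with ONLY this JSON: '
    '{"tool": "get_weather", "args": {"city": "..."}}'
)

def generate(prompt: str) -> str:
    body = json.dumps({"model": "llama3.1", "system": SYSTEM,
                       "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request("http://localhost:11434/api/generate",
                                 data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

reply = generate("What's the weather in Berlin?")
try:
    call = json.loads(reply)                 # the model honored the contract
    print("tool call:", call["tool"], call["args"])
except json.JSONDecodeError:
    print("plain answer:", reply)            # the flexible (and fragile) part
```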

The network-effect problem with agent teams

Ryan has agents communicating over IRC (yes, IRC). At 3 agents you have 6 conversation edges; at 10 agents, roughly 100, since pairwise channels grow as n(n−1) (~09:00). Ollama's per-IP API rate limit starts rejecting around 8 agents — “fair” in Ryan's words.

Your sub-agents should absolutely not talk to each other. It's a bad, bad design... it's going to be a wallet attack.

The Armenian tool-call injection

While playing with OpenClaw, Ryan hit a malicious Bing search result that injected tool calls containing three sentences of Armenian compressed to look like a single word (~06:20). Claude flagged it as a code injection; the payload itself was Chinese. A lawsuit over typosquatted openclaw.org variants followed days later.

Tailscale Aperture: the new AI gateway

Tailscale's new Aperture (alpha, request-only, looks Kibana-like) is a configurable AI gateway with traffic inspection and rule setting — a lighter-weight Tensor Zero that works inside one tailnet (~33:00). Ryan: “It shows you what your freaking bots are doing online, which is definitely not something I got out of a tailnet.”

Claude Code leak reactions

Ryan and Matt confirm the leak is real (~23:00) and call out ccleaks.com as the community reference page. The feature they both want: ultra-plan. The curiosity: UDS inbox — Claude's sub-agents have message inboxes like a lightweight pub/sub, which Ryan wants to adopt in his own IRC/Qdrant memory setup. Ryan calls the leaked buddy Tamagotchi “super cool.”

Other field notes

  • Ollama MLX: just announced, but has actually been usable for 3-4 months — the announcement is a formality. The delay is Ollama working with the Apple MLX team to get MLX running on Windows/Linux, not just Mac.
  • Zellij web mode: resume a terminal session from a browser on a different machine — “fantastic for agents.” You can have Playwright type into a Zellij session, and sessions are re-spoolable if Claude loses state.
  • Docker sandboxes: Ryan argues macOS's built-in sandbox is what both Docker and a competing project ended up using — Linux-style containers are “pointless” on macOS.
  • Agent-specific browsers: Vercel shipped one in Rust; a Rust competitor called Partis Browser also exists. ARIA-rules-based page decomposition (navigate/click/fill/toggle/select) beats screenshot-based computer-use.
  • Deno contraction: five main Deno contributors left last week. Bun is (quietly) winning the Node-alternative race — and is also the runtime that just accidentally leaked Claude Code.
  • Axios breach: 100M downloads/week package compromised by a maintainer-account takeover. OpenClaw ships with Axios.

Tools: Ollama, OpenClaw, Minimax 2.5/2.7, Kimi K2.5, Tailscale Aperture, Tensor Zero, IRC, Qdrant, Zellij, Vercel Agent Browser, Partis Browser, Comet (Perplexity), MLX, NetBSD on Raspberry Pi, ccleaks.com, Axios
Podcast Industry
Pragmatic Engineer

Pragmatic Engineer x Thuan Pham: Scaling Uber Through the Trenches

Gergely Orosz interviews Thuan Pham, Uber's first CTO, on a career arc that went from Vietnamese boat-refugee to CTO of one of tech's most complex engineering organizations.[6]Pragmatic Engineer — Scaling Uber with Thuan Pham (Uber's first CTO) Travis Kalanick interviewed him for 30 hours over two weeks. He inherited 40 engineers, 30K rides/day, and a Node.js single-threaded dispatcher that would hit New York's scaling limit in 5 months. What followed: a rewrite playbook built around “see around corners,” a 5-month China launch that should have taken 18, and thousands of microservices.

Read more

The 30-hour interview

After a 2-hour whiteboard session with Travis Kalanick overran (~28:30), Thuan got an immediate call from the recruiter. They then did 2-hour daily Skype calls for two weeks, one whiteboard topic each. Travis once told his EA mid-call to reschedule a flight rather than cut the conversation short. Thuan later learned it was an explicit simulation of working together day-to-day.

The first 5 months — dispatch

The original dispatcher was single-threaded Node.js; engineers scaled it by moving it to bigger boxes. With city ride volume doubling, New York would crash on the largest available CPU by October (~36:00). Thuan's rewrite spec had exactly two requirements:

  • A city must be powered by multiple boxes.
  • A box must power multiple cities.

No new features. Shipped in August/September, right on the deadline.

I only have two requirements... one is a city has to be powered by multiple boxes, and a box has to power multiple cities. That's it.

The China launch

Christmas 2014: Travis declares they're launching in China, in 2 months, with services running on Chinese soil (~41:00). Thuan's TPMs scope it at 6 months; industry friends laugh and say 18. They settle at 4, slip to 5, then negotiate an incremental launch. Travis's counter: fine, but start with Chengdu — the biggest city first.

The most brilliant thing, because by doing the hardest thing first, once you launch that, everything else is downhill from there.

Thuan calls this the single most durable cultural habit Kalanick pushed: redline yourself deliberately, and do the hardest sub-task first so the rest becomes routine.

Program-and-Platform, then microservices

At 100 engineers and 12 PMs in late 2013, functional org structure had ground to a halt — every feature had to queue against mobile, dispatch, and backend bandwidth simultaneously. Thuan, Travis, and Jeff Holden did a sticky-note exercise: name every area of Uber's business (17 at the time), fund the top 7 cross-functional teams, leave the rest empty (~47:00). Programs are vertical (user-facing); Platforms are horizontal (tooling). Microservices came later — driven not by architectural preference but by “no time to react other than to survive the scale that keeps on coming at you.”

Earlier career beats (dot-com + VMware)

Thuan's resume before Uber is unusually instructive: HP Labs research (“published great papers that went nowhere”), Silicon Graphics interactive-TV prototype ($45K set-top boxes, ahead of market readiness), NetGravity (the first dynamically targeted ads on Yahoo), then VMware's 40-person VirtualCenter team that went on to define cloud management. The thread: he keeps down-shifting company size when things get comfortable, which is how the Uber opportunity found him via Bill Gurley (an ex-NetGravity connection from a decade earlier).

If you try to do a really good job at every company... over time, very slowly, you accumulate a decent reputation in people's mind. And then when you become available, people come to you.

Tools: Uber dispatch, Node.js, Project Helix (Uber app rewrite), VMware vMotion/VirtualCenter, NetGravity, Silicon Graphics, HP Labs
Hot Take AI Future
Data Science Weekly

Data Science Weekly: What Actually Makes an Agent “Autonomous”

Data Science Weekly's headline argues autonomy is about making decisions without you, not about running forever.[7]Data Science Weekly — What Makes an AI Agent Autonomous? The framing cuts against marketing language that conflates long-running agents with autonomous ones: a cron job runs forever; it isn't autonomous. An agent that makes irreversible decisions without human-in-the-loop is.

Read more

The full article body sits behind a Substack paywall and wasn't retrievable for deeper quoting. The headline and subtitle (“It's not about running forever. It's about making decisions without you.”) make a specific testable claim: the axis of autonomy is decision surface, not time horizon. In the context of this day's other topics — Claude's leaked chronos daily-journal background agent, Matt's IRC-chat agent team, Uber's early dispatcher — the distinction matters. Many “autonomous” systems from the leak (chronos, coordinator mode) are merely long-running; the genuine autonomy question is which decisions the user pre-authorized.

Tools: Data Science Weekly newsletter
Industry AI Models
Google

Google's March 2026 AI Dump: Gemini 3.1 Flash-Lite, Lyria 3 Pro, Antigravity

Google's monthly recap post runs wide: Gemini 3.1 Flash-Lite (billed as their fastest, most budget-friendly model), Gemini 3.1 Flash Live for real-time audio in 200+ countries, Lyria 3 Pro for 3-minute music generation, Search Live globally, Canvas in AI Mode across the US, Live Translate on iOS, and the Google Antigravity coding agent in AI Studio's vibe-coding experience.[8]Google — The latest AI news we announced in March 2026

Read more

Models

  • Gemini 3.1 Flash-Lite — fastest/cheapest tier, pitched at high-volume deployments.
  • Gemini 3.1 Flash Live — advanced audio, 200+ countries.
  • Lyria 3 Pro — music generation up to 3-minute tracks with granular element control; now available via Gemini API and AI Studio.

Search, Workspace, Maps

  • Search Live (voice + camera) expanded globally.
  • Canvas in AI Mode US-wide for creative writing and coding.
  • Gemini in Docs/Sheets/Slides/Drive gained cross-file synthesis.
  • Google Maps shipped “Ask Maps” (conversational) and Immersive Navigation.

Personalization, devices, health

  • Personal Intelligence expanded to AI Mode, Chrome, and the Gemini app.
  • “Switch to Gemini” imports chat history and preferences from competing assistants — a direct shot at ChatGPT lock-in.
  • Pixel Drop: Circle to Search for outfit analysis; Magic Cue for restaurant recs; Express Pay on Pixel Watch.
  • Live Translate on iOS, 70+ languages.
  • Fitbit got a personal health coach with sleep/nutrition/medical-records integration.
  • $10M clinician-education fund.

Developer surface

AI Studio's new “vibe coding” experience integrates Google Antigravity — the coding agent that headlined the Claude Code vs. Antigravity debate on 4/13 (see 4/13 briefing). Full-stack capabilities pitched.

Retrospective

Google also noted the AlphaGo 10-year anniversary — useful context for the decade-long arc from narrow game-playing AI to Gemini 3.1.

Tools: Gemini 3.1 Flash-Lite/Flash Live, Lyria 3 Pro, Google AI Studio, Antigravity, Search Live, Canvas, Google Maps, Pixel Drop, Fitbit, Live Translate
Developer Tools
Google Developers

Gemini API Docs MCP + Agent Skills: 96.3% Pass Rate, 63% Fewer Tokens

Google paired two developer tools to fix a boring but expensive problem: coding agents generate outdated Gemini API code because training-data cutoffs lag the SDK.[9]Google — Improve coding agents' performance with Gemini API Docs MCP and Agent Skills The Gemini API Docs MCP exposes live docs to the agent via Model Context Protocol; Gemini API Developer Skills bundles best-practice patterns. Combined: 96.3% pass rate on evals with 63% fewer tokens per correct answer vs. standard prompting.

Read more

The two pieces

  • Gemini API Docs MCP — MCP server at gemini-api-docs-mcp.dev that pipes current docs, SDK references, and model info to any compliant agent.
  • Gemini API Developer Skills — “best-practice instructions, resource links, and patterns” that steer the agent toward the current SDK surface instead of deprecated patterns.

Why the token reduction matters

63% fewer tokens per correct answer isn't just a cost line — it's an argument that structured skills + live doc lookup beats cramming everything into the prompt. Matches the direction Theo was arguing in the 4/13 harness breakdown: models work better when you give them navigation (search, lookup) rather than dumping context in.

Setup instructions live at ai.google.dev/gemini-api/docs/coding-agents. Works alongside Claude Code, Cursor, Codex, and other MCP-compatible agents — not just Antigravity.

Tools: Gemini API Docs MCP, Gemini API Developer Skills, Model Context Protocol, Gemini SDK
Industry
Tech Brew

Apple at 50: A $12.7B AI Bet Against Meta's $72B and Alphabet's $91B

Apple turned 50 on April 1, 2026. Tech Brew argues the company is betting against the hyperscaler AI arms race: $12.7B in AI-related capex vs. Meta's $72B and Alphabet's $91B, with the investment going toward smaller on-device models rather than frontier cloud training.[10]Tech Brew — Apple still thinks different at 50 Despite Apple Intelligence's 2024 debut being called “underwhelming,” AAPL outperformed MSFT/META/AMZN over the trailing year and Apple posted $1B+ in AI-related App Store revenue.

Read more

The capex delta

  • Apple: $12.7B AI capex, 166K employees, nearly $4T market cap.
  • Meta: $72B.
  • Alphabet: $91B.

That's a ~6x-7x gap. Apple's bet: custom silicon + smaller on-device models can deliver enough user value without the data-center arms race.

The historical pattern — and the risk

Tech Brew notes Apple has historically let a category froth up, then entered with a polished mainstream device (iMac, iPod, iPhone). Whether that pattern applies to AI depends on whether the on-device story holds — and Siri is still the visible weakness. Apple is now planning to integrate third-party AI assistants into Siri, effectively ceding the assistant layer to become a platform hub. That's consistent with the 4/13 Tech Brew report on N50 smart glasses: the hardware works only if Siri improves, and Apple is hedging by opening the assistant layer up.

The revenue tell

$1B+ AI-related App Store revenue is the datapoint that justifies the low capex — Apple is monetizing others' AI via distribution, not building frontier models itself. Whether that's enough when a Gemini-3.1-Flash-Lite-tier model can run on-device is the structural question for the next 2-3 years.

Tools: Apple Intelligence, Siri, Apple custom silicon, App Store

Sources

  1. YouTube Tragic mistake... Anthropic leaks Claude's source code — Fireship, Apr 1
  2. YouTube They solved AI's memory problem! — AI Search, Apr 1
  3. YouTube No, Seriously. AI is REALLY Good at Hacking Now — Low Level, Apr 1
  4. YouTube Google's New AI Just Broke My Brain — Two Minute Papers, Apr 1
  5. YouTube Matt and Ryan have a chat on March 31, 2026 — Matt Williams, Apr 1
  6. YouTube Scaling Uber with Thuan Pham (Uber's first CTO) — The Pragmatic Engineer, Apr 1
  7. Newsletter What Makes an AI Agent “Autonomous”? — Data Science Weekly, Apr 1
  8. Blog The latest AI news we announced in March 2026 — Google, Apr 1
  9. Blog Improve coding agents' performance with Gemini API Docs MCP and Agent Skills — Google Developers, Apr 1
  10. Newsletter Apple still thinks different at 50 — Tech Brew, Apr 1