Codex runs inboxes, DeepMind solves Erdos

June 5, 2026

17 topics · 40 sources

AI ToolsProductivity
OpenAIOpenAIEverySimon Willison

Codex grew a browser, a site builder, and an inbox habit

OpenAI showed Sites in Codex and a 1Password one-shot workflow, while Every's demo made the bigger point: Codex is becoming an agent you bring into browser-shaped work, not just a code generator.[2]Introducing Sites in Codex[3]1Password One Shots with Codex[4]Codex Runs My Inbox Now Simon Willison also highlighted ChatGPT Lockdown Mode, which attacks prompt-injection exfiltration by cutting outbound network paths rather than trusting another model to police them.[1]OpenAI Help: Lockdown Mode

Read more

Sites and one-shots

~00:00 OpenAI's Sites demo frames Codex as a way to turn a repo into a small deployed web surface: the agent can inspect, edit, preview, and publish inside the workflow instead of stopping at a patch. ~00:00 The 1Password clip is narrower but revealing: one-shot a concrete integration task, let the agent make the edit, then review the result like a small pull request.[2]Introducing Sites in Codex[3]1Password One Shots with Codex

The inbox version

~00:00 Every's demo is the more practical wedge. The app sweeps email into cards, drafts replies, checks calendar context, proposes times, and lets the human steer one decision at a time. The important pattern is not email automation by itself; it is that the browser becomes an agent workspace for everyday admin work.[4]Codex Runs My Inbox Now

Lockdown mode is the security tell

Willison reads OpenAI's Lockdown Mode as a deterministic defense against the final stage of prompt-injection data exfiltration: remove or constrain the network path that can leak private data. He also notes the uncomfortable implication: default ChatGPT settings are not being represented as robust protection against determined exfiltration attempts.[1]OpenAI Help: Lockdown Mode

Tools: Codex, ChatGPT Lockdown Mode, 1Password
ProductivityHot Take
Nate B JonesNate B JonesY Combinator

800 million tokens is a mirror, not a flex

Nate B Jones built a dashboard after burning 800M tokens in a day, but the useful claim is behavioral: token telemetry shows which tools actually expand your work rather than merely record spend.[5]My Codex Ran 800 Million Tokens in A Day His companion short argues that rejections are more valuable than prompts because taste has to be systematized as output volume explodes.[6]The most expensive AI mistake isn't prompting YC's Paxel launch points the same way, turning agent sessions into a builder profile for steering, execution, engineering, product instinct, and planning.[7]We just launched Paxel!

Read more

Telemetry as self-improvement

~00:00 Jones says the dashboard is not about bragging on token burn. It is a way to see his AI habits: where usage changed, which tools unlocked new behavior, and whether his actual work pattern matches his story about how he uses agents.[5]My Codex Ran 800 Million Tokens in A Day

Rejection is the scarce artifact

~00:00 The short version: taste does not scale if it stays in your head. As generated output rises 10x or 100x, the important artifact is the structured rejection: why something was bad, what rule it violated, and how that decision can be reused.

YC wants this as a recruiting signal

~00:00 Paxel reads local Claude, Codex, and Cursor sessions inside Docker, then reports how someone builds. YC explicitly invites Startup School applicants to attach a Paxel token, which turns agent-work traces into a talent signal rather than just private productivity analytics.[7]We just launched Paxel!

Tools: Codex, Claude, Cursor, Paxel
Developer ToolsAI Tools
Better StackAI EngineerPrefectmarimo

Agent interfaces are turning into their own ops layer

Herder brings agent awareness to tmux-like terminal multiplexing, Google's Chrome DevTools talk treats MCP as an interface-design problem for agents, and Prefect explains the MCP gateway as the control plane between agents and many tools.[8]herder: Is This the Ultimate Agent Multiplexer?[9]Building Agent Interfaces: Lessons from Chrome DevTools (MCP) for Agents[10]What is an MCP Gateway? marimo adds the sandbox angle: put a free GPU behind an agent and notebook workflows become runnable experiments, not static prompts.[11]Agents with a GPU

Read more

The terminal wants agent state

~00:00 Herder is a Rust terminal multiplexer built around the thing tmux cannot know: whether an agent is working, blocked, or done. It keeps terminal-native persistence and SSH friendliness, then layers in notifications and a socket API so agents can interact with the environment themselves.[8]herder: Is This the Ultimate Agent Multiplexer?

DevTools as a design template

~00:00 Michael Hablich's AI Engineer talk uses Chrome DevTools and MCP to argue that agent interfaces need debuggable affordances: inspectable state, clear tool boundaries, and interaction patterns that let humans understand what an agent believes it can do.[9]Building Agent Interfaces: Lessons from Chrome DevTools (MCP) for Agents

Gateways and GPUs

~00:00 Prefect's MCP Gateway framing is operational: do not wire every agent directly to every service; put a gateway in the middle to manage discovery and access. ~00:00 marimo's update complements that with compute: a notebook agent can now run inside a GPU-backed sandbox, making local experiments and model evaluations faster to delegate.[10]What is an MCP Gateway?[11]Agents with a GPU

Tools: Herder, tmux, MCP, Chrome DevTools, Prefect, marimo
AI ModelsDeveloper Tools
marimo

Reasoning helps less when the baseline is a boring classifier

marimo tested whether explicit reasoning improves open-source model performance on a prompt-injection classification dataset and found little accuracy gain, substantial latency cost, and occasional structured-output failures.[12]Does LLM Reasoning Still Matter? The practical lesson was old-fashioned: always compare the LLM against a simple scikit-learn baseline before you pay for fancy reasoning.

Read more

~00:00 The experiment pairs a GPU-backed marimo environment with an agent, then evaluates open-source models on benign versus jailbreak prompts. Adding reasoning made the runs slower and did not materially improve accuracy on this dataset. The more humbling result: a basic scikit-learn classifier is the baseline you should train before declaring an LLM useful for a text classification job.[12]Does LLM Reasoning Still Matter?

Tools: marimo, scikit-learn, Pydantic, Claude, Codex, OpenCode
AI FutureHot Take
Nate HerkLast Week in AI

The AGI debate is now about internal labor, not vibes

Nate Herk reads Anthropic's internal usage claims as practical AGI evidence: a general model taking open-ended work, researching, experimenting, and returning useful results inside the company that builds it.[13]AGI is Here. Anthropic Just Proved It. Last Week in AI leans into the acceleration narrative, citing a 4.7-month doubling-time claim and arguing that the AI capability exponential keeps steepening rather than fading.[14]The Exponential Only Steepens

Read more

AGI as open-ended work

~00:00 Herk rejects the sci-fi definition and uses a narrower operational one: can you hand the model an ambiguous problem with no clear answer and have it figure out the approach? He points to Anthropic's reported internal code generation and experimentation workflows as the meaningful signal.[13]AGI is Here. Anthropic Just Proved It.

The curve argument

~00:00 Last Week in AI compresses the same mood into the line that the exponential only steepens: if doubling time is around 4.7 months and accelerating with the latest GPT and Claude Mythos previews, then planning around slow linear progress is the bad bet.[14]The Exponential Only Steepens

AI Models
Two Minute PapersGoogle

DeepMind's Erdos run makes 95.7% failure look good

Two Minute Papers covered DeepMind's AlphaProof Nexus attempt on roughly 350 Erdos problems: only nine solved, a 95.7% failure rate, and still a remarkable result because these were long-open math problems checked through Lean-style formal proof machinery.[15]DeepMind's New AI Found A Strange New Way To Think Google's May AI recap also bundled the broader research cadence across Gemini, Labs, Search, Android, Health, and quantum updates.[16]The latest AI news we announced in May 2026

Read more

Why failure was the headline

~00:00 The video argues that solving nine long-open Erdos problems is stunning even if the raw failure rate is 95.7%. The method relies on formalizing problems in Lean, generating candidate proofs, using critique, and selecting among imperfect attempts with cheaper judge models.[15]DeepMind's New AI Found A Strange New Way To Think

Google's wider May bundle

Google's June 5 recap is a roundup rather than one new model announcement, but it shows the breadth of the month's AI push: Gemini app updates, Google Labs experiments, model and research work, Android and Fitbit integrations, Search, Shopping, Health, Cloud, and quantum-adjacent announcements.[16]The latest AI news we announced in May 2026

Tools: AlphaProof Nexus, Lean, Gemini
AI ModelsDeveloper Tools
GoogleAICodeKing

Gemma 4 got smaller, then moved into the editor

Google released Gemma 4 quantization-aware training checkpoints to reduce memory needs and improve on-device performance for laptops and phones.[17]Gemma 4 QAT models AICodeKing's Zed demo shows the end-user version of that trend: local models from LM Studio, Ollama, and llama.cpp wired directly into editor workflows for private coding tasks.[18]Zed + Gemma-4 12B & Qwen-3.6

Read more

QAT as distribution work

Google's Gemma 4 QAT release is about making models fit where developers actually work: compressed checkpoints with better on-device efficiency, rather than assuming every coding or assistant workflow should round-trip through a frontier cloud API.[17]Gemma 4 QAT models

Zed as the proving ground

~00:02 The Zed walkthrough connects LM Studio, Ollama, and llama.cpp to the editor's assistant features. The caveat is explicit: local models will not match top cloud models on hard tasks, but they are useful for private edits, explanations, smaller refactors, and experimentation.[18]Zed + Gemma-4 12B & Qwen-3.6

Tools: Gemma 4, Zed, LM Studio, Ollama, llama.cpp, Qwen, DeepSeek Coder, Codestral
Developer ToolsHot Take
Simon WillisonBetter StackNerd Snipe

Ladybird closed public PRs because effort stopped proving good faith

Simon Willison quoted Andreas Kling's explanation for Ladybird ending public pull requests: substantial patches used to imply substantial effort, and that proxy no longer holds when AI can mass-produce plausible code.[19]A quote from Andreas Kling Better Stack's data-flow-first prompt advice and Nerd Snipe's Claude hallucination clip show the same failure mode from the developer side: agents invent architecture unless the human supplies rails and verifies reality.[20]Do This Before AI Writes Any Code[21]Claude Hallucinates Trying To Be Human

Read more

The maintainer side

Kling's point is not whether code was typed by hand. It is responsibility: once a browser patch enters Ladybird, someone must own it for real users. AI-generated patch volume breaks the old social signal where a large patch implied effort and therefore probable good faith.[19]A quote from Andreas Kling

The operator side

~00:00 Better Stack's advice is to map entities and data flow before asking an agent to write code, then explicitly forbid new entities or flows unless requested. ~00:00 Nerd Snipe's clip gives the counterexample: Claude confidently describing a nonexistent Atio implementation and made-up file paths.[20]Do This Before AI Writes Any Code[21]Claude Hallucinates Trying To Be Human

AI ToolsDeveloper Tools
AI EngineerThe Pragmatic Engineer

OpenClaw and zero-token architecture chase agent throughput

Vincent Koc's OpenClaw talk sells the dark-factory version of coding agents: ship faster than a human can read the diff, then build review and control systems around that speed.[22]Dark Factory: OpenClaw Ships Faster Than You Can Read the Diff The Pragmatic Engineer's zero-token architecture short points at the adjacent optimization: do less model work by moving context, routing, and deterministic behavior outside the token stream.[23]Zero token architecture

Read more

~00:00 OpenClaw is presented as a high-throughput coding-agent environment where the bottleneck moves from writing code to supervising, reading, and accepting changes. The phrase dark factory is doing real work: the system is valuable precisely because much of the production happens out of direct human sight, which makes governance and review the central design problem.[22]Dark Factory: OpenClaw Ships Faster Than You Can Read the Diff

~00:00 Zero-token architecture compresses the complementary intuition: every repeated instruction, static context block, or deterministic transformation that can be moved out of the model call saves cost, latency, and error surface.[23]Zero token architecture

Tools: OpenClaw, coding agents
Developer Tools
Better StackBetter StackBetter StackReal Python

SQL got Git, strings got SIMD, Minecraft got Wayland

Better Stack's developer-tool lane was unusually dense: Dolt brings branch/diff/commit/merge workflows to SQL tables, a SIMD integer-to-string algorithm gets under two nanoseconds, and WaylandCraft runs real Linux windows inside Minecraft.[24]Dolt: This Makes SQL Feel Like Git[25]Integer to String in Under 2 Nanoseconds[26]Minecraft Is Somehow a Computer Now Real Python rounded out the practical side with Docker image slimming, exception-handling strategy, Django 6.1 alpha notes, and Python community releases.[27]Reducing the Size of Python Docker Containers

Read more

Dolt

~00:00 Dolt is the cleanest workflow idea: keep SQL semantics, constraints, and queries, but add branch, diff, commit, merge, and rollback behavior for data changes. It targets the awkward middle ground where CSVs are reviewable but weak, and databases are powerful but opaque to code-review workflows.[24]Dolt: This Makes SQL Feel Like Git

Micro-optimizations and weird systems

~00:01 The SIMD integer-to-string item matters because logs, JSON payloads, metrics, and traces all pay conversion costs at scale. ~00:00 WaylandCraft is the delightful systems project: a real Wayland compositor in Minecraft via Fabric, Java, Rust, and Smithay.[25]Integer to String in Under 2 Nanoseconds[26]Minecraft Is Somehow a Computer Now

Python packaging hygiene

~00:00 Real Python Podcast #298 covers runtime container analysis and image slimming, plus exception-handling boundaries and current Python ecosystem notes.[27]Reducing the Size of Python Docker Containers

Tools: Dolt, AVX512 IFMA, Simd Itoa, WaylandCraft, Fabric, Smithay, slim toolkit, Django
AI Tools
AI Engineer

Voice AI is moving past transcripts

Herve Bredin's AI Engineer talk argues that conversation understanding is not transcription with extra steps; it needs speaker diarization, turn structure, overlap handling, and semantic context to support useful voice agents and meeting workflows.[29]Beyond Transcription: Building Voice AI That Understands Conversations

Read more

~00:00 The pyannoteAI talk focuses on the layers between raw audio and useful automation. A transcript alone loses who spoke, when they overlapped, how turns relate, and which parts of a conversation are actionable. Voice AI that understands conversations has to preserve that structure before summarization or task extraction begins.[29]Beyond Transcription: Building Voice AI That Understands Conversations

Tools: pyannoteAI, speaker diarization, voice agents
AI ModelsPodcast
Adhoc

LeCun's world-model bet is still the anti-chatbot pole

The Yann LeCun adhoc interview is the day's long-form counterweight to agent hype: intelligence needs world models, planning, persistent memory, and objective-driven systems, not just next-token prediction plus bigger inference budgets.[30]Can Yann LeCun Reshape AI (again)?

Read more

~00:00 Why he is still dissatisfied

The discussion circles LeCun's familiar critique: autoregressive LLMs are useful but structurally limited because they do not learn compact predictive world models in the way animals do. His research bet is that future systems need latent representations that support planning, abstraction, and action without reducing everything to language.

~18:00 The architecture argument

The interview contrasts chatbot-style systems with models that can reason over the state of the world, remember goals, and plan over longer horizons. It is less a denial that current systems are powerful than a claim that scaling them is not the clean path to robust machine intelligence.[30]Can Yann LeCun Reshape AI (again)?

~44:00 What to watch

The practical watch item is whether objective-driven, world-model approaches start producing demos that compete with language-agent workflows on useful tasks, not just philosophical elegance.

Developer ToolsPodcast
Adhoc

Kleppmann's data systems lecture is still the antidote to agent mush

The Martin Kleppmann adhoc lecture revisits the durable parts of data-intensive design: logs, replication, consistency, stream processing, and tradeoffs that do not disappear just because an agent writes the code.[31]Designing Data-intensive Applications with Martin Kleppmann

Read more

~00:00 Data systems are tradeoff machines

The lecture's value in an AI briefing is grounding. Agents can generate application code quickly, but durable systems still depend on explicit choices around storage, replication, fault tolerance, and the boundaries between batch and stream processing.[31]Designing Data-intensive Applications with Martin Kleppmann

~26:00 Logs as the unifying idea

Kleppmann's recurring theme is that append-only logs and ordered events are a practical abstraction for replication, recovery, and stream processing. That matters more, not less, in agent-built systems where hidden state and unclear data flow become maintenance risks.

~58:00 Correctness before convenience

The lecture is a reminder that generated code still has to live inside distributed-systems constraints: retries, partial failure, stale reads, backfills, schema evolution, and the human need to understand what happened later.

ProductivityPodcast
AdhocReal Python

Senior engineering coaching stayed aggressively human

The live coaching session for an Amazon engineer focused on senior-level judgment: communication, ownership, prioritization, and making work visible rather than just producing more code.[32]I Coached an Amazon Engineer From Mid-Level to Senior (Live) Real Python's short about being punished for working too hard is the workplace-culture rhyme: output that embarrasses the system can be treated as a problem even when customers benefit.[28]Punished for Working Too Hard?

Read more

~00:00 What promotion work actually is

The coaching session spends its time on leverage: how to frame scope, show judgment, document decisions, align with stakeholders, and move from doing assigned work to owning ambiguous outcomes. It is a useful contrast to the day's agent tooling because the senior signal is not raw throughput; it is making the right work legible and durable.[32]I Coached an Amazon Engineer From Mid-Level to Senior (Live)

~00:00 The hall-monitor problem

Real Python's workplace anecdote is short but sharp: someone helping too quickly gets told to slow down because they are making others look bad. In agent-heavy teams, that social dynamic will show up again around both human and AI-amplified productivity.[28]Punished for Working Too Hard?

IndustryAI Tools
Y CombinatorSequoia

Legora's $100M ARR story is legal AI with movie-star packaging

YC's Legora interview says the legal-AI company reached $100M ARR in 18 months and then somehow recruited Jude Law to make legal software feel less bland.[33]How Legora Went From YC to $100M ARR in 18 Months Sequoia's David Senra short compresses the founder lesson to one word: focus, or the ability to mute the world and build your own.[34]The One Word That Defines Every Great Founder

Read more

Legora's wedge

~00:00 The interview's funniest opening is the Jude Law campaign, but the business signal is customer pull: lawyers using Legora to review large contract sets quickly enough to change their weekend, not just make demos look slick. The claimed scale, $100M ARR in 18 months, is the reason the marketing stunt matters.[33]How Legora Went From YC to $100M ARR in 18 Months

Founder focus

~00:00 Senra's Sequoia clip says the common trait across great founders is focus: unusually low distraction, low concern for how others are doing things, and a willingness to mute the world while building.[34]The One Word That Defines Every Great Founder

Tools: Legora
Industry
Tech BrewMorning BrewMorning BrewMorning BrewAcquired

Apple's reset attempt leads the non-agent business lane

Tech Brew previewed Apple's WWDC as an AI reset attempt after a slower rollout than peers.[36]Apple's AI reset attempt Morning Brew's business lane covered bitcoin weakness, the first $1B May box office without Marvel, and the first confirmed Texas screwworm case since 1966, while Acquired's short turned a 1% management fee into a 40-year compounding haircut.[37]Bitcoin is down horrendous[38]Screwworms have entered the US[39]May had its first $1b box office without a Marvel[40]Investment fees matter more than you think

Read more

Apple's WWDC pressure

Tech Brew frames WWDC as a reset attempt: Apple needs to show a clearer AI story after slower, more modest execution than other Big Tech companies spending heavily on frontier models and data centers.[36]Apple's AI reset attempt

Markets, box office, and biology

Morning Brew's non-AI items were broad: bitcoin under pressure from geopolitical uncertainty and other factors; May reaching a $1B box office without Marvel leading the month; and Texas confirming a screwworm case for the first time since 1966.[37]Bitcoin is down horrendous[38]Screwworms have entered the US[39]May had its first $1b box office without a Marvel

Fees compound harder than they sound

~00:00 Acquired's short does the spreadsheet version: on a 7% market return, a 1% annual management fee is about one-seventh of annual gains, and over 40 years turns a hypothetical $100,000 into roughly $1.0M instead of $1.5M.[40]Investment fees matter more than you think

Hot TakeProductivity
Matt Williams

Screens in school got the day's strongest anti-software rant

Matt Williams argued that schools drifted into screen-heavy teaching because devices and apps became cheap, not because anyone proved children learn better that way.[35]Do We Need Screens to Teach? The broader critique lands squarely in this briefing's AI theme: tech workers overfit on software as the solution to every human problem.

Read more

~00:00 Williams says he works in tech and uses computers constantly, but does not buy the school bargain that children should spend hours staring at screens. The most relevant line for the AI crowd is the closing critique: a lot of people live in a bubble where software is the only acceptable solution, then graduate to believing generative AI solves all human problems.[35]Do We Need Screens to Teach?

Sources

  1. Blog OpenAI Help: Lockdown Mode — Simon Willison, Jun 5
  2. YouTube Introducing Sites in Codex — OpenAI, Jun 5
  3. YouTube 1Password One Shots with Codex — OpenAI, Jun 5
  4. YouTube Codex Runs My Inbox Now — Every, Jun 5
  5. YouTube My Codex Ran 800 Million Tokens in A Day — Nate B Jones, Jun 5
  6. YouTube The most expensive AI mistake isn't prompting — Nate B Jones, Jun 5
  7. YouTube We just launched Paxel! — Y Combinator, Jun 5
  8. YouTube herder: Is This the Ultimate Agent Multiplexer? — Better Stack, Jun 5
  9. YouTube Building Agent Interfaces: Lessons from Chrome DevTools (MCP) for Agents — AI Engineer, Jun 5
  10. YouTube What is an MCP Gateway? — Prefect, Jun 5
  11. YouTube Agents with a GPU — marimo, Jun 5
  12. YouTube Does LLM Reasoning Still Matter? — marimo, Jun 5
  13. YouTube AGI is Here. Anthropic Just Proved It. — Nate Herk, Jun 5
  14. YouTube The Exponential Only Steepens — Last Week in AI, Jun 5
  15. YouTube DeepMind's New AI Found A Strange New Way To Think — Two Minute Papers, Jun 5
  16. Blog The latest AI news we announced in May 2026 — Google, Jun 5
  17. Blog Gemma 4 QAT models — Google, Jun 5
  18. YouTube Zed + Gemma-4 12B & Qwen-3.6 — AICodeKing, Jun 5
  19. Blog A quote from Andreas Kling — Simon Willison, Jun 5
  20. YouTube Do This Before AI Writes Any Code — Better Stack, Jun 5
  21. YouTube Claude Hallucinates Trying To Be Human — Nerd Snipe, Jun 5
  22. YouTube Dark Factory: OpenClaw Ships Faster Than You Can Read the Diff — AI Engineer, Jun 5
  23. YouTube Zero token architecture — The Pragmatic Engineer, Jun 5
  24. YouTube Dolt: This Makes SQL Feel Like Git — Better Stack, Jun 5
  25. YouTube Integer to String in Under 2 Nanoseconds — Better Stack, Jun 5
  26. YouTube Minecraft Is Somehow a Computer Now — Better Stack, Jun 5
  27. YouTube Reducing the Size of Python Docker Containers — Real Python, Jun 5
  28. YouTube Punished for Working Too Hard? — Real Python, Jun 5
  29. YouTube Beyond Transcription: Building Voice AI That Understands Conversations — AI Engineer, Jun 5
  30. YouTube Can Yann LeCun Reshape AI (again)? — Adhoc, Jun 5
  31. YouTube Designing Data-intensive Applications with Martin Kleppmann — Adhoc, Jun 5
  32. YouTube I Coached an Amazon Engineer From Mid-Level to Senior (Live) — Adhoc, Jun 5
  33. YouTube How Legora Went From YC to $100M ARR in 18 Months — Y Combinator, Jun 5
  34. YouTube The One Word That Defines Every Great Founder — Sequoia Capital, Jun 5
  35. YouTube Do We Need Screens to Teach? — Matt Williams, Jun 5
  36. Newsletter Apple's AI reset attempt — Tech Brew, Jun 5
  37. Newsletter Bitcoin is down horrendous — Morning Brew, Jun 5
  38. Newsletter Screwworms have entered the US — Morning Brew, Jun 5
  39. Newsletter May had its first $1b box office without a Marvel — Morning Brew, Jun 5
  40. YouTube Investment fees matter more than you think — Acquired, Jun 5