June 6, 2026
Nate Herk argues the Claude Mythos hype has a real basis: a model identifier briefly appeared, Project Glasswing access widened, and prediction markets started pricing a public launch. His base case is still quieter: Mythos capabilities flow into a future Opus rather than a public Mythos button.[1]Is Claude Mythos Coming? Last Week in AI adds the model-card angle: Opus 4.8 improved coding scores and shipped alongside dynamic workflows, while Mythos remains the higher-risk system Anthropic is holding back for safety work.[2]Last Week in AI #247 - Opus 4.8, MAI, Anthropic IPO, Minimax-M3
Herk frames Mythos as a tier above Opus with unusually strong cybersecurity capability. The public signal was a brief API identifier appearance, but he keeps coming back to Anthropic's earlier position that the preview was not planned for general release. His plausible path is a limited or renamed capability drop, especially if OpenAI ships a new model first.
Last Week in AI describes Opus 4.8 as a real but incremental improvement over 4.7, with stronger agentic coding numbers and a more verbose, argumentative personality. The more strategically important piece is dynamic workflows: a model writing a graph of sub-agent work for tasks that can run hours or days.[2]Last Week in AI #247 - Opus 4.8, MAI, Anthropic IPO, Minimax-M3
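As a rough illustration of what "a graph of sub-agent work" can mean in practice, the sketch below orders a small dependency graph into waves of steps that could run in parallel. The step names and the `execution_waves` helper are hypothetical; nothing here reflects how Anthropic's dynamic workflows are actually implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
GRAPH = {
    "plan": set(),
    "implement": {"plan"},
    "write_tests": {"plan"},
    "review": {"implement", "write_tests"},
}


def execution_waves(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into waves; every step in a wave has all dependencies done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave finished
    return waves


waves = execution_waves(GRAPH)
```

Here `waves` comes out as `[["plan"], ["implement", "write_tests"], ["review"]]`: the two middle steps have no dependency on each other, so a scheduler could hand them to separate sub-agents at the same time.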
The podcast's point is that model intelligence alone is no longer the whole frontier. Orchestration, workflow generation, and long-running task infrastructure become part of the product, and they also create the usage data labs need to keep improving agent behavior.
Nate B Jones says the valuable artifact in AI work is often the rejection: the moment you say no, encode a constraint, or correct taste. CommandCode's Taste system turns that idea into repo-local memory, learning repeated preferences from edits and merges so agents stop relearning the same micro-decisions.[3]The hidden value in your AI's worst outputs[4]The most expensive AI mistake you are making[5]Making DeepSeek v4 outperform Opus 4.7 with Taste
Jones argues that teams are generating useful negative examples constantly, then letting them disappear into chats, Slack threads, and comments. The capture has to happen where the work happens; otherwise nobody will maintain a separate spreadsheet of taste.
The companion short is blunt: as generated output grows, taste cannot stay implicit. Every no should be treated as a reusable rule, constraint, or preference that compounds across future work.
Ahmad Awais describes Taste as a compact, repository-local memory system that records repeated project preferences: package manager choices, CLI conventions, testing defaults, release habits, and other small decisions that developers rarely write down. The goal is not a giant rules file; it is a reviewed, evolving set of taste notes that can steer any coding agent.[5]Making DeepSeek v4 outperform Opus 4.7 with Taste
Latent Space's CommandCode interview explains why some open models feel worse in coding agents than their raw intelligence suggests: repeated schema and tool-call errors can trap them in loops. CommandCode repairs common mistakes deterministically, returns the result plus a hint, and reports that DeepSeek, Kimi, and MiniMax-class models become much more usable when the harness helps instead of only rejecting.[5]Making DeepSeek v4 outperform Opus 4.7 with Taste Last Week in AI's MiniMax-M3 discussion points in the same direction: open models are competing on cost, speed, context length, and deployability, not just benchmark rank.[2]Last Week in AI #247 - Opus 4.8, MAI, Anthropic IPO, Minimax-M3
Awais says DeepSeek V4 Pro would sometimes repeat the same malformed tool call dozens of times after receiving a strict schema error. The failure was not only model capability; it was also an agent harness that returned an unhelpful correction and then waited.
CommandCode's repair approach is deterministic: coerce a nullable or wrong-shaped argument when the intent is obvious, execute the tool, and return a repair hint explaining what should have been sent. The reported behavior is that later tool calls improve because the model got both the result and a pattern correction.[5]Making DeepSeek v4 outperform Opus 4.7 with Taste
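The repair-then-hint pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions, not CommandCode's code: `SCHEMA`, `repair_args`, `call_with_repair`, and the `count_paths` tool are all hypothetical names, and the only repairs handled are a null argument, a bare string where a list was expected, and a numeric string where an integer was expected.

```python
from typing import Any, Callable

# Hypothetical schema: argument name -> expected Python type.
SCHEMA = {"paths": list, "limit": int, "encoding": str}


def repair_args(args: dict[str, Any], schema: dict[str, type]) -> tuple[dict[str, Any], list[str]]:
    """Coerce obviously recoverable arguments and record a hint for each fix."""
    repaired: dict[str, Any] = {}
    hints: list[str] = []
    for name, value in args.items():
        if name not in schema:
            raise TypeError(f"unknown argument '{name}'")
        expected = schema[name]
        if value is None:
            # Null argument: drop it so the tool's default applies
            # (a required argument would still fail at call time).
            hints.append(f"'{name}' was null; omit it instead of sending null.")
        elif isinstance(value, expected):
            repaired[name] = value
        elif expected is list and isinstance(value, str):
            repaired[name] = [value]
            hints.append(f"'{name}' must be a list; the string was wrapped in one.")
        elif expected is int and isinstance(value, str):
            try:
                repaired[name] = int(value)
            except ValueError:
                raise TypeError(f"'{name}' cannot be repaired safely") from None
            hints.append(f"'{name}' must be an integer, not a string.")
        else:
            # Intent is not obvious: refuse rather than guess.
            raise TypeError(f"'{name}' cannot be repaired safely")
    return repaired, hints


def call_with_repair(tool: Callable[..., Any], args: dict[str, Any], schema: dict[str, type]) -> dict[str, Any]:
    """Execute the tool with repaired arguments and return result plus hints."""
    repaired, hints = repair_args(args, schema)
    return {"result": tool(**repaired), "repair_hints": hints}


def count_paths(paths: list, limit: int = 10, encoding: str = "utf-8") -> str:
    """Stand-in tool so the sketch runs without touching the filesystem."""
    return f"{len(paths[:limit])} path(s), encoding={encoding}"


response = call_with_repair(
    count_paths, {"paths": "a.txt", "limit": "5", "encoding": None}, SCHEMA
)
```

The malformed call still succeeds, and `response["repair_hints"]` carries three corrections the model can read alongside the result. Anything that is not an obvious coercion raises instead of guessing, which keeps the repair step deterministic.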
Last Week in AI covers MiniMax-M3 as part of the same competitive lane: million-token context, sparse-attention speedups, lower serving cost, and eventual open weights. The takeaway is that harness quality can make cheaper models feel much closer to frontier systems for real coding work.
Hermes 0.16 adds a native desktop app, remote gateway profiles, a fuller dashboard, model search, undo, and a trimmed skill set. GitHub's MCP Apps talk makes the same bet inside VS Code: tool results should render as sandboxed interactive UI, not just text. The result is a broader shift from hidden agent internals toward inspectable surfaces for models, sessions, skills, tools, and enterprise workflows.[6]Hermes Agent 5.0 (New Upgrades): HERMES BECAME ULTRA-HERMES![7]GitHub Trending Weekly #35[8]OpenAI Investor Innovation Day[9]Building Interactive UIs in VS Code with MCP Apps
AICodeKing calls Hermes 0.16 the Surface release because the agent platform now exposes its internals through a desktop app, web dashboard, remote gateways, profile switching, fuzzy model search, and a cleaner default skill system. The important usability change is that sessions, credentials, MCP catalog settings, webhooks, and messaging channels are no longer buried in config files.
GitHub's AI Engineer talk describes MCP Apps as tool results that include a UI resource reference. VS Code fetches and renders that HTML inside a sandboxed iframe, letting an Excalidraw diagram, flame graph, checkout, or data explorer stay interactive inside chat instead of flattening into text.[9]Building Interactive UIs in VS Code with MCP Apps
Github Awesome's weekly list is packed with the same surface-area theme: Butterbase exposes 43 backend tools to coding agents, Memory OS and Persistia add longer-lived context, Harness Terminal and Vigil add control planes, and Rift gives each agent an isolated workspace.[7]GitHub Trending Weekly #35
OpenAI's investor clip is light on technical detail, but the enterprise examples are operational: deal folders, Excel workflows, hundreds of internal GPTs, and 2,700 employees with enterprise licenses. The surface question matters because, at that scale, agent UX is no longer a side-project problem.
Ara Khan's AI Engineer talk says evals are neither scoreboard truth nor useless vibes: good agent evals require realistic tasks, isolated environments, failure attribution, and careful hill-climbing without overfitting. Stripe's autonomous-payments talk lands on a parallel principle: discovery can be probabilistic, but credentials, checkout, and spending limits need deterministic APIs and enforceable constraints.[10]Evals Are Broken, Use Them Anyway[11]Building safe Payment Infrastructure for the autonomous economy
Khan splits the bad eval discourse into two camps: people who over-trust benchmark dashboards and people who retreat entirely to taste. His recommendation is to use current, precise evals, let new model releases settle before switching, and build use-case-specific eval suites when public ones do not match the work.
For coding agents, a task can involve reading files, installing dependencies, debugging infrastructure, running tests, and making tradeoffs. Khan points to Terminal Bench-style tasks and infrastructure that fans isolated jobs out in parallel, then asks agents to classify failures from traces so the harness can improve.[10]Evals Are Broken, Use Them Anyway
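A toy version of that loop, assuming each task is a plain Python callable that receives its own scratch directory: jobs fan out across a thread pool, each runs in a fresh temporary directory, and failures are bucketed from their traces. The talk describes agents doing the classification; the keyword rule in `classify_failure` is a deliberately crude stand-in, and every name here is hypothetical.

```python
import tempfile
import traceback
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_isolated(task):
    """Run one (name, callable) task in its own temporary working directory."""
    name, fn = task
    with tempfile.TemporaryDirectory() as workdir:
        try:
            fn(Path(workdir))
            return {"task": name, "passed": True, "trace": ""}
        except Exception:
            return {"task": name, "passed": False, "trace": traceback.format_exc()}


def classify_failure(trace: str) -> str:
    """Keyword stand-in for the agent-based trace classification."""
    if "ModuleNotFoundError" in trace or "FileNotFoundError" in trace:
        return "environment"
    if "AssertionError" in trace:
        return "wrong_answer"
    return "unknown"


def run_suite(tasks, workers: int = 4):
    """Fan tasks out in parallel, then attach a failure category to each result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_isolated, tasks))  # preserves task order
    for result in results:
        result["category"] = None if result["passed"] else classify_failure(result["trace"])
    return results


def writes_file(workdir: Path) -> None:
    (workdir / "out.txt").write_text("ok")
    assert (workdir / "out.txt").read_text() == "ok"


def wrong_answer(workdir: Path) -> None:
    assert 1 + 1 == 3


def missing_input(workdir: Path) -> None:
    (workdir / "missing.txt").read_text()


results = run_suite(
    [("writes_file", writes_file), ("wrong_answer", wrong_answer), ("missing_input", missing_input)]
)
```

Separating "environment" failures from "wrong_answer" failures is the part that matters: the first bucket points at the harness, the second at the agent. A temporary directory only isolates files; real infrastructure of the kind Khan describes would need container-level isolation.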
Stripe's Steve Kaliski uses the same boundary for payments: let LLMs explore, search, and recommend, but do not let nondeterminism handle credentials, charge amounts, seller identity, or checkout state. Shared payment tokens scope credentials to seller, time, amount, and currency; 402-style machine payments describe paid API resources; and the Agentic Commerce Protocol gives agents structured checkout state instead of a fragile browser scrape.[11]Building safe Payment Infrastructure for the autonomous economy
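The deterministic side of that boundary is easy to picture as a plain check that runs before any charge. The sketch below is not Stripe's API: `ScopedToken`, its fields, and `authorize` are hypothetical names chosen to mirror the four scopes mentioned in the talk (seller, time, amount, currency).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScopedToken:
    seller_id: str
    max_amount_minor: int  # smallest currency unit, e.g. cents
    currency: str
    expires_at: datetime


def authorize(token: ScopedToken, seller_id: str, amount_minor: int,
              currency: str, now: datetime) -> tuple[bool, str]:
    """Return (allowed, reason); every rule is a hard check, never a model call."""
    if now >= token.expires_at:
        return False, "token expired"
    if seller_id != token.seller_id:
        return False, "wrong seller"
    if currency.upper() != token.currency.upper():
        return False, "wrong currency"
    if amount_minor <= 0 or amount_minor > token.max_amount_minor:
        return False, "amount outside limit"
    return True, "ok"


now = datetime(2026, 6, 6, 12, 0, tzinfo=timezone.utc)
token = ScopedToken("seller_123", 5000, "usd", now + timedelta(minutes=15))

allowed = authorize(token, "seller_123", 4999, "USD", now)
over_limit = authorize(token, "seller_123", 5001, "USD", now)
```

An agent can propose any purchase it likes; whether the charge goes through depends only on these comparisons, so a hallucinated seller or an inflated amount fails the same way every time.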
Theo's Cloudflare/VoidZero video argues the acquisition is not just about owning Vite, Vitest, Rolldown, Oxc, and the people around them. It is about making deployment legible to agents: code should describe infrastructure, a CLI should provision the cloud, and the gap between working locally and running publicly should shrink from hours to one command.[12]Cloudflare bought Vite to destroy Vercel
Theo describes Void as an attempt to make Vite apps deploy directly to real infrastructure. The developer writes code against platform primitives like database, KV, storage, queues, and AI; the platform reads that code and provisions the matching production resources.
His sharper claim is that agents make this kind of integrated cloud worth building. A human can tab into a dashboard and click through setup; an agent is much better at editing code than navigating bad dashboards. That makes code-first cloud primitives more valuable in an agent-built software world.
Cloudflare already has strong CDN, compute, and database primitives, but its developer experience has historically required platform-specific configuration and Wrangler knowledge. VoidZero gives it a credible path earlier in the toolchain, into the framework, bundler, and app-deployment experience.[12]Cloudflare bought Vite to destroy Vercel
Simon Willison released micropython-wasm, an alpha library for running MicroPython inside WebAssembly from Python, aimed at sandboxed plugin and agent execution. LearnThatStack's rate-limit guide and Arjay McCandless's AWS bill short supply the operational footnotes: when agents and APIs run code for users, CPU, memory, concurrency, retries, caching, logging, and idle resources all become product concerns.[13]Running Python code in a sandbox with MicroPython and WASM[14]micropython-wasm 0.1a2[15]How API Rate Limiting Actually Works and How to Build Your Own[16]How to reduce AWS cloud bills
Willison wants a Python code sandbox that installs cleanly from PyPI, enforces memory and CPU limits, controls file and network access, and exposes selected host functions. WebAssembly via wasmtime gives him a promising isolation layer, while MicroPython supplies a small interpreter that can be embedded as a WASM blob.[13]Running Python code in a sandbox with MicroPython and WASM
The prototype maintains interpreter state by running a loop inside MicroPython that blocks on host-provided code, evaluates it, and reports results. Version 0.1a2 adds a CLI so users can try simple code execution with uvx micropython-wasm.[14]micropython-wasm 0.1a2
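The state-keeping loop can be illustrated in ordinary Python, with a thread standing in for the interpreter and queues standing in for the host boundary. This is only a sketch of the control flow: it does not use the micropython-wasm API, it provides no sandboxing at all, and `interpreter_loop` and `Session` are hypothetical names.

```python
import queue
import threading


def interpreter_loop(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Block on host-provided code, evaluate it, and report the result."""
    namespace = {}  # persists across submissions, so variables survive
    while True:
        source = inbox.get()  # blocks until the host sends something
        if source is None:  # sentinel: the host asked the loop to stop
            break
        try:
            try:
                result = eval(source, namespace)  # expressions yield a value
            except SyntaxError:
                exec(source, namespace)  # statements such as assignments
                result = None
            outbox.put(("ok", result))
        except Exception as exc:
            outbox.put(("error", f"{type(exc).__name__}: {exc}"))


class Session:
    """Host-side handle that sends code in and waits for each result."""

    def __init__(self) -> None:
        self._inbox: queue.Queue = queue.Queue()
        self._outbox: queue.Queue = queue.Queue()
        self._thread = threading.Thread(
            target=interpreter_loop, args=(self._inbox, self._outbox), daemon=True
        )
        self._thread.start()

    def run(self, source: str):
        self._inbox.put(source)
        return self._outbox.get()

    def close(self) -> None:
        self._inbox.put(None)
        self._thread.join()


session = Session()
first = session.run("x = 20")
second = session.run("x + 1")
failed = session.run("1 / 0")
session.close()
```

The second call sees the `x` defined by the first because the namespace lives inside the loop rather than being rebuilt per request, which is the property the prototype gets from keeping one interpreter running.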
LearnThatStack covers the other side of safe execution: respect Retry-After, use exponential backoff with jitter, cap concurrency before you hit limits, cache stable data, and implement server-side limiters through Redis, Nginx, or gateways depending on whether you need per-user policy.[15]How API Rate Limiting Actually Works and How to Build Your Own
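The client-side half of that advice fits in a short sketch. It assumes a hypothetical `send` callable that returns `(status, headers, body)`, treats only HTTP 429 as retryable, and handles only the seconds form of `Retry-After` (the header may also be an HTTP date, which this sketch does not parse). Concurrency caps, caching, and server-side limiters are left out.

```python
import random
import time


def retry_delay(attempt: int, retry_after=None, base: float = 0.5,
                cap: float = 30.0, rng=random.random) -> float:
    """Seconds to wait before the next attempt (attempt counts from 0)."""
    if retry_after is not None:
        # The server said how long to wait; respect it over our own guess.
        return max(0.0, float(retry_after))
    # Exponential backoff with "full jitter": uniform in [0, capped backoff).
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retries(send, max_attempts: int = 5, sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(retry_delay(attempt, headers.get("Retry-After")))
    raise RuntimeError("still rate limited after retries")


# Fake transport: one 429 with Retry-After, then success.
responses = iter([(429, {"Retry-After": "2"}, ""), (200, {}, "payload")])
sleeps = []
outcome = call_with_retries(lambda: next(responses), sleep=sleeps.append)
```

Passing `sleep` and `rng` as parameters keeps the logic testable without real waiting. Jitter matters because many clients that back off on the same schedule will otherwise retry in lockstep and hit the limit together again.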
Arjay's cost advice is simple but relevant: downsize underused compute, cache and index expensive database paths, control logging volume, and delete idle resources. Agentic systems will make these boring practices more important, not less.
Y Combinator's Emergent interview is the strongest business story of the day: Mukund Jha says the nine-month-old platform has 8.5 million users, more than 10 million apps built, and over $100M in annualized revenue by helping non-programmers ship working software rather than demos.[17]Emergent: How Six Months of Tinkering Led To A $100M ARR Company
Jha says Emergent started as a coding-agent research lab, topped a coding benchmark with a four-person team, then turned that foundation into a consumer platform for building, hosting, deploying, and maintaining software through chat.
The system is described as a multi-agent orchestrated platform with design, testing, memory, and infrastructure components. The team built container and snapshotting technology so multiple parallel agents can work from the same state, and Jha says the architecture has already been rewritten three times as new model classes changed what was possible.[17]Emergent: How Six Months of Tinkering Led To A $100M ARR Company
The interview's advice is to think globally from day one and aim at harder problems, because AI gives small teams access to the same frontier primitives and makes ambition cheaper to test.
Last Week in AI's short on evaporative cooling argues that the most mobile people leave first, which makes AI labs and infrastructure teams weaker in ways headcount charts hide. The full podcast ties that labor-market point to model races, Anthropic's IPO positioning, MiniMax-M3, cyber risk, YouTube AI labels, and new research on model capacity and self-recognition.[18]The Evaporative Cooling Problem in Tech Teams[2]Last Week in AI #247 - Opus 4.8, MAI, Anthropic IPO, Minimax-M3
The short argues that when senior people or co-founders leave a company, the loss is not proportional to headcount. In AI work, tight coupling between RL infrastructure, evals, post-training, data systems, and product loops means senior coordination talent is especially hard to replace.
The episode's open-source lane covers MiniMax-M3 as another sign that open or open-weight models are attacking frontier labs through price, speed, long context, and deployability. The business lane discusses high revenue multiples and the pressure for AI application startups to entrench before labs move downmarket.[2]Last Week in AI #247 - Opus 4.8, MAI, Anthropic IPO, Minimax-M3
Later sections cover YouTube automatically labeling photorealistic AI videos, research on why larger models retain rarer tasks, and post-trained models recognizing their own generations. The recurring thread is that AI systems are becoming social, economic, and organizational infrastructure, not just model releases.
Github Awesome's weekly roundup covered the long tail of agent-era tools: autonomous dataset builders, open TTS, backend-as-a-service with MCP tools, memory layers, sandboxes, isolated workspaces, local agent safety, and terminal testing. Real Python pointed to free US datasets for analysis practice, while Two Minute Papers floated game-master-style agents as a new gameplay primitive.[7]GitHub Trending Weekly #35[19]Free US Datasets for Python[20]AI Agents as Games Masters?
The strongest developer-tool entries were BigSet for generated datasets, MisoTTS for open-weight emotional speech, Butterbase for backend provisioning through agent tools, Memory OS and Persistia for long-term context, Sandboxes and Rift for isolated workspaces, Vigil and Harness Terminal for policy controls, and Nullsec S1 for security review.
Real Python's quick item highlights a PyPI package of US datasets covering politics, crime, wages, income, elections, mortality, presidents, education, sports, stocks, and entertainment. It is not flashy, but it is useful raw material for notebooks, tutorials, and model-eval examples.
Two Minute Papers' short gestures at AI agents embedded in games as story drivers or player assistants. It is early, but the design question is obvious: when does an agent become a dynamic game master instead of a scripted NPC?
Morning Brew's business lane had two useful non-AI reads: US employers added 172,000 jobs in May, far above expectations, while GLP-1-driven weight loss is creating new apparel return costs as shoppers size down and reorder wardrobes.[21]Hiring surged well beyond all expectations in May[22]GLP-1-related returns leave retailers reeling
The May jobs report showed 172,000 jobs added versus roughly 80,000 expected, with upward revisions to March and April lifting the three-month average to 188,000. Morning Brew notes the strength came with warning signs: unemployment held at 4.3%, long-term unemployment rose, and wages lagged inflation.[21]Hiring surged well beyond all expectations in May
GLP-1 adoption is showing up in apparel returns. Morning Brew cites rising exchanges from shoppers sizing down, doubled merchandise-return value from 2020 to 2025, lower plus-size sales at Macy's, smaller sizes at Victoria's Secret, and later wedding-dress purchases.[22]GLP-1-related returns leave retailers reeling
Dwarkesh Patel's Adam Brown clip is a story about hitchhiking as radical listening: a short ride becomes a 1,000-mile detour and a visit to a father's grave. Acquired's Vanguard clip is the opposite energy: institutional power forcing Jack Bogle out even after he had become the public face of the index-investing movement.[23]A 10-mile ride turned into a 1,000-mile spiritual quest - Adam Brown[24]Vanguard forced out Saint Jack Bogle in 1999
Brown describes truckers opening up after decades alone in the cab. One planned 10-mile ride turned into a 1,000-mile spiritual quest, including a stop at the driver's father's grave in Baton Rouge. The story is only loosely tech-adjacent, but it is a useful reminder that trust can appear in very high-bandwidth, low-formality contexts.
Acquired's clip revisits Vanguard's 1999 age-limit move that forced Jack Bogle off the board despite his saint-like status with investors. The detail that the rule was not applied evenly makes the move read less like governance hygiene and more like a power struggle.[24]Vanguard forced out Saint Jack Bogle in 1999