ChatGPT moves into your bank account

AI Tools

OpenAI

ChatGPT gets a personal finance mode

OpenAI rolled out a Pro-only U.S. preview that links ChatGPT to 12,000+ financial institutions through Plaid, surfaces a spending/portfolio dashboard, and answers money questions grounded in your real accounts.^{[1]A new personal finance experience in ChatGPT — OpenAI} A new Financial memories layer stores qualitative goals so future conversations carry the context. OpenAI says 200M+ people already ask ChatGPT money questions monthly, framing this as the natural next step on top of GPT‑5.5.

U.S. Pro subscribers can now open a new Finances surface in the ChatGPT sidebar (or trigger it by typing "@Finances, connect my accounts") and link bank, brokerage, card, and loan accounts through Plaid. Intuit support is "coming soon." After auth, ChatGPT syncs and categorizes transactions and shows a dashboard of portfolio performance, spending, subscriptions, and upcoming payments. Users can also store qualitative goals, such as "I'm saving for a car early next year" or "I still owe my parents $X," in Financial memories so future conversations carry that context.^{[1]OpenAI — Personal Finance}

OpenAI's "save more" walkthrough breaks down spend into Groceries, Shopping, Transportation, Dining, and Subscriptions, then prescribes per-category caps targeting an extra $500–$750/month in savings. The preview will expand to Plus subscribers before going broader.

ChatGPT can help you stay informed and feel more confident managing your finances, but it is not a replacement for professional financial advice.

Tools: ChatGPT, Plaid, Intuit, GPT-5.5

Industry

Tech Brew

OpenAI lawyers up against Apple over the Siri flop

OpenAI is hiring lawyers to weigh legal action against Apple, blaming Cupertino for burying the 2024 ChatGPT–Siri integration and failing to convert iPhone users into paid subscribers.^{[2]OpenAI takes on Apple — Tech Brew} Apple just signed a ~$1B/year deal to put Gemini behind Siri instead, and iOS 27 will offer competing AI services from Claude and Gemini alongside ChatGPT.

OpenAI execs complain that the Apple Intelligence + ChatGPT integration shipped in 2024 was buried in the UI and inadequately promoted, blunting what they expected would be a massive new-subscriber pipeline. The friction is mutual: Apple has flagged OpenAI privacy concerns, publicly objected to OpenAI's $6.5B acquisition of Jony Ive's hardware startup (which competes with Apple devices), and complained about losing dozens of engineers to that startup.^{[2]Tech Brew — OpenAI vs. Apple}

The breakup is now operational: Apple recently signed a roughly $1B/year deal with Google to power Siri with Gemini, and iOS 27 will offer competing AI services from Claude and Gemini. The dispute lands as OpenAI eyes an IPO and continues defending the Musk trial in Oakland.

We have done everything from a product perspective. They have not, and worse, they haven't even made an honest effort.

Tools: ChatGPT, Siri, Apple Intelligence, Gemini, Claude

Industry

Morning Brew

Powell out, Warsh in — Fed bets on AI productivity

The Senate confirmed Kevin Warsh as the 17th Fed chair by the slimmest margin in history, ending Powell's eight-year run.^{[3]End of an era: Warsh replaces Powell as Fed chair — Morning Brew} Warsh wants rate cuts, but FedWatch puts the odds of any cut through year-end at no more than 2.8%. He's framing the case Greenspan-style: an AI productivity boom keeps inflation tame even as the Fed loosens.

After an eight-year run that included the 2018 rate-hike fights with Trump, the COVID emergency programs, and the post-pandemic inflation spike, Jerome Powell is out. Powell leaves with inflation at 3.8% vs. his 2% goal; Warsh wants cuts but the FOMC isn't there yet, with FedWatch giving ≤2.8% probability of any cut through year-end.^{[3]Morning Brew — Warsh replaces Powell}

The framing pitch: like Greenspan in the late '90s, Warsh believes the AI productivity boom is real enough to coexist with rate cuts without reigniting inflation. Powell, for his part, has reportedly called the Warsh confirmation a "really big mistake."

really big mistake

Industry

Morning Brew

Honda's first annual loss since 1957

Japan's #2 carmaker posted a $2.7B annual loss — its first since IPO'ing in 1957 — after a roughly $10B EV-investment writedown.^{[4]Honda's EV misfire leads to first loss in 70 years — Morning Brew} Honda abandoned its 2030 "20% EV sales" target and its 2040 ICE phaseout, pivoting to 15 new hybrids and leaning harder on its motorcycle business (about half of profits).

Like Ford and GM, Honda overestimated EV demand; the end of U.S. tax incentives, Trump's auto tariffs, and brutal competition in China compounded the pain.^{[4]Morning Brew — Honda} The strategy reset: drop the 20%-EV-by-2030 and ICE-phaseout-by-2040 targets, ship 15 new hybrids by 2030, and lean on the motorcycle franchise, which holds about a third of the global market, brings in roughly half of profits, and was up 11% last year (including electric two-wheelers).

Industry AI Tools

Morning Brew

Martha Stewart's AI home-management startup raises $10M

Martha Stewart launched Hint, an "always-on, AI-native home management platform" that pulls public data on your address and ingests your bills to nag you about insurance, roofs, and repairs.^{[5]Martha Stewart announces new AI home management company — Morning Brew} $10M seed led by Slow Ventures, chasing a $500B residential renovation and repair market.

Hint asks for your address, pulls public info like flood risk and soil data, and lets you upload bills and documents so it can remind you to refresh your insurance policy or replace your roof. Monetization: premium features plus affiliate/transaction fees for service recommendations — the company claims the platform is "blind to commercial deals."^{[5]Morning Brew — Hint} Goes live this summer.

Co-founders: former Red Ventures exec Yih-Han Ma and AI engineer Kyle Rush. Seed round backed by Slow Ventures, Montauk Capital, Tusk Venture Partners, and Points Guy founder Brian Kelly. Morning Brew slots Hint under what The Cut recently called "The Girlbossification of AI."

always-on, AI-native home management platform … blind to commercial deals

Tools: Hint

Industry AI Future

DeepLearning.AI The Batch

The Batch #353: China blocks Manus, US frontier evals, mammogram AI

Beijing's NDRC blocked Meta's $2.5B acquisition of AI-agent startup Manus even after Manus relocated to Singapore, signaling that "leave China to raise abroad" no longer preserves M&A optionality for strategic AI.^{[6]The Batch #353 — DeepLearning.AI} Meanwhile the White House stood up a frontier-model national-security task force, and Google's mammogram AI beat UK radiologists in a clinical trial.

China blocks Meta's $2.5B Manus buyout

China's National Development and Reform Commission blocked Meta's $2.5B acquisition of Manus, an AI agent startup with Chinese origins that had relocated headquarters to Singapore. The move asserts Beijing's claim over strategically important AI tech developed in China — even after corporate reflagging — and forces Chinese AI founders to reconsider whether moving abroad still preserves international fundraising and M&A optionality.^{[6]The Batch #353}

U.S. frontier-model national-security evals

Leading U.S. AI companies have agreed to submit upcoming frontier models for national-security evaluation before public deployment. Officials are weighing an executive order that would make this mandatory — a notable departure from the administration's earlier deregulatory posture. The Batch frames it as the U.S. moving from voluntary commitments toward UK AISI-style state-run evals.

Google mammogram AI beats radiologists in UK trial

Google's breast-cancer detection system identified slightly more cancers than expert radiologists with fewer false positives, and flagged about 25% of cancers human radiologists missed on first read. The Batch notes diagnostic workload could fall by ~40% if deployed, but physician trust remains the bottleneck.

Tools: Manus

AI Models AI Future

Hugging Face Daily Papers

Hugging Face daily papers — five highlights

Five papers worth flagging today: a 30B model hitting IMO gold via a clean recipe, frame-wise 1–2 step AR video diffusion, a sigmoid-gated GRPO + self-distillation method that avoids the instability of naive combinations on multi-turn agents, a memory-staleness benchmark where even Gemini-3.1-pro hits only 55%, and a survey arguing multi-agent research is missing causal dependencies between stages.^{[7]Hugging Face Daily Papers — May 15}

SU-01: 30B model hits olympiad gold

SU-01 (arXiv 2605.13301) reaches gold-medal-level scores on IMO 2025 (35 pts) and exceeds the gold line on USAMO 2026 (35 pts) at 10–30× fewer parameters than GPT-5.5 or Gemini 3.1. Recipe: rigorous SFT on 338K long-form reasoning trajectories with a reverse-perplexity curriculum, two-stage RL (coarse RLVR/GSPO then refined RL with generative proof-level rewards and self-refinement), and test-time solve→verify→refine loops sustaining 100K+ token reasoning traces.^{[7]HF Papers}

Causal Forcing++: 1–2 step frame-wise AR video

Tsinghua/RUC team (arXiv 2605.15141) pushes interactive video generation to 1–2-step frame-wise AR diffusion via "causal consistency distillation," beating SOTA 4-step chunk-wise causal forcing on VBench Quality (+0.3) and VisionReward (+0.335) while cutting first-frame latency 50% and Stage-2 training cost ~4×. The trick: a single online teacher ODE step between adjacent timesteps replaces precomputed PF-ODE trajectories.

SDAR: gated GRPO + self-distillation for agents

Zhejiang/Meituan/Tsinghua (arXiv 2605.15155) introduce SDAR — sigmoid-gated detached token-level OPSD layered on GRPO. Beats vanilla GRPO by +9.4% on ALFWorld, +7.0% on Search-QA, +10.2% WebShop-Acc, and avoids the instability that hits naive GRPO+OPSD on multi-turn agents.

STALE: agents are bad at noticing their memories are wrong

A 400-scenario benchmark (arXiv 2605.06527) where new evidence implicitly contradicts a stored memory. Top scorer Gemini-3.1-pro only hits 55.2% — models retrieve updated facts but still accept outdated assumptions. Proposed fix CUPMem (explicit write-time state adjudication) jumps a prototype from 8.7% to 68.0%, arguing memory needs deliberate state tracking rather than retrieval-augmentation alone.

LIFE framework survey

A survey (arXiv 2605.14892) organizes LLM-based multi-agent research as a four-stage progression — Lay, Integrate, Find, Evolve — and argues today's literature treats those stages in isolation, missing causal dependencies that let errors propagate and fault attribution stay broken.

Tools: SU-01, GSPO, RLVR, GRPO, OPSD, ALFWorld, WebShop, Search-QA, Qwen2.5, Qwen3, STALE, CUPMem, Gemini-3.1-pro

Developer Tools AI Tools

Simon Willison's Weblog

Simon Willison's day of small tools

Simon shipped four short posts in one day: datasette-llm-limits for per-user LLM spend caps inside Datasette, inaturalist-clumper 0.1, a Claude-vibe-coded QR code generator with Wi-Fi credential support, and a pre-PyCon birding post that doubles as a real-world demo of the clumper pipeline.^{[8]Simon Willison's Weblog — May 15, 2026}

datasette-llm-limits

A 0.1a0 Datasette plugin that pairs with datasette-llm and datasette-llm-accountant to enforce per-user or global LLM spend limits with configurable scopes and rolling time windows — a real cost guardrail on agentic Datasette workflows.^{[8]Simon Willison — datasette-llm-limits}

inaturalist-clumper 0.1

Small CLI Simon's been running for several weeks behind his iNaturalist publishing pipeline. Ingests sightings JSON and groups them into geographic/temporal clumps for cleaner display on his weblog.

QR code generator, vibe-coded with Claude

A new single-page tool on tools.simonwillison.net generates square or "liquid" QR codes for URLs, text, or Wi-Fi credentials (SSID, password, security type defaulting to WPA/WPA2/WPA3, hidden network flag) with custom colors, borders, and PNG/clipboard export. Classic "Claude built it in an afternoon" utility.

Birding before PyCon

Pre-PyCon morning bird walk in LA — Western Gull (one drinking from a Starbucks cup) and Rock Pigeon, both posts published via the inaturalist-clumper pipeline above.

Tools: Datasette, datasette-llm, datasette-llm-accountant, datasette-llm-limits, iNaturalist, Claude

Industry

Sherwood Snacks

Cerebras IPO hits the tape (article unavailable)

Sherwood Snacks led its May 15 newsletter with Cerebras' IPO.^{[9]Cerebras' monster IPO — Sherwood Snacks} The article body could not be retrieved for this briefing — Sherwood's domain was returning Cloudflare 522 / "Down for maintenance" errors across multiple fetch attempts. Flagged for follow-up.

Cerebras (wafer-scale accelerator vendor) had previously filed and shelved an IPO in 2024. Worth tracking valuation, pricing, and how the market reads it against Nvidia/AMD competitive dynamics — but the Sherwood post itself remains inaccessible at briefing time.^{[9]Sherwood Snacks}

Tools: Cerebras

Podcast

Dwarkesh Patel

Dwarkesh × Eric Jang: rebuilding AlphaGo from scratch

Eric Jang (ex-VP of AI at 1X, ex-Google DeepMind Robotics) walks through reproducing AlphaGo on a $10K Prime Intellect grant during his sabbatical, and uses it as a lens on why MCTS works for Go but not for LLM reasoning, how RL credit assignment really works, and what current frontier models can and can't do as automated researchers.^{[10]Building AlphaGo from scratch — Eric Jang on Dwarkesh}

~00:00 Why AlphaGo, and the $10K reproduction

What previously cost a DeepMind team millions and a TPU pod can now be reproduced for a few thousand dollars of rented compute thanks to LLM coding — Jang spent roughly $4K on exploration, $3K on the final run, the rest on serving the bot.

~02:02 Rules of Go and Tromp-Taylor scoring

A quick primer for Dwarkesh, including the Tromp-Taylor quirks where dead-shape recognition has to be played out for a computer.

~08:04 Game-tree explosion and the PUCT action-selection rule

Why naive search over a ~361^300 game tree is intractable, and how PUCT picks actions via prior-weighted exploration bonuses with value backups from terminal leaves.
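
For readers who want the selection rule spelled out, here is a minimal sketch of PUCT as published for AlphaGo Zero: each move is scored as its mean backed-up value plus a bonus proportional to its prior and inversely proportional to its visit count. The `Edge` class, the `puct_select` name, and the `c_puct` default are illustrative assumptions, not code from Jang's reproduction.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    """Statistics for one candidate move from the current node (hypothetical container)."""
    prior: float            # P(s, a) from the policy head
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # sum of values backed up through this move

    @property
    def q(self) -> float:
        # Mean backed-up value; 0.0 for a move that has never been visited.
        return self.value_sum / self.visits if self.visits else 0.0


def puct_select(edges: dict[str, Edge], c_puct: float = 1.5) -> str:
    """Return the move maximizing Q + U, where U is the prior-weighted exploration bonus."""
    total_visits = sum(edge.visits for edge in edges.values())

    def score(edge: Edge) -> float:
        bonus = c_puct * edge.prior * math.sqrt(total_visits) / (1 + edge.visits)
        return edge.q + bonus

    return max(edges, key=lambda move: score(edges[move]))
```

In this sketch an unvisited move with a decent prior can outrank a well-explored move with a higher average value, which is the exploration behavior the bonus term is meant to produce.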

~24:18 Why a value function makes search tractable

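The dual policy/value heads prune the search in both directions: the policy head narrows breadth to promising moves, and the value head cuts depth by scoring positions without playing them out. Together they turn a near-intractable problem into something a 10-layer net can amortize.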

~31:25 Policy/value architecture: ResNet vs Transformer

ResNets still beat transformers in his low-budget regime because Go rewards local convolutional inductive bias — though KataGo's trick of aggregating global features helps.

~42:36 MCTS four-step loop: select, expand, evaluate, backup

~58:41 MCTS as a policy-improvement operator

The core RL story: MCTS is best understood as a label-improvement operator, not a credit-assignment scheme. Every action you took during a game gets a strictly better target (the search-improved visit distribution) regardless of whether you won — which is why training is stable and supervised-learning-like.
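
As a rough illustration of the "search-improved visit distribution," the sketch below turns root visit counts into a training target for the policy head, following the AlphaGo Zero convention of normalizing counts raised to 1/temperature. The function name and the example counts in the usage note are assumptions for illustration, not values from the episode.

```python
def visit_count_target(visits: dict[str, int], temperature: float = 1.0) -> dict[str, float]:
    """Normalize MCTS root visit counts into a probability distribution over moves."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Lower temperatures sharpen the distribution toward the most-visited move.
    weights = {move: count ** (1.0 / temperature) for move, count in visits.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("search produced no visits")
    return {move: weight / total for move, weight in weights.items()}
```

Calling `visit_count_target({"a": 30, "b": 60, "c": 10})` gives move "b" a target probability of 0.6. The policy head is then trained toward this distribution with an ordinary cross-entropy loss, whatever the game's final result, which is why the update looks like supervised learning.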

~76:55 Profundity: nets amortizing NP-hard search

A 10-layer net compressing what looks like an NP-hard search into a single forward pass suggests our intuitions about computational hardness are incomplete. Many real-world chaotic problems (weather, protein folding, Go endgames) have macroscopic structure that lets neural nets amortize them.

~85:00 Why naive self-play RL fails

Contrast with Karpathy's "sucking supervision through a straw" critique of LLM RL: in self-play without search you might have 51 wins and 49 losses out of 100 games with only one game actually containing a meaningfully better move, and the variance of REINFORCE-style estimators scales quadratically with trajectory length. NFSP and Q-learning get walked through as the model-free analogs when search isn't available.

~106:22 Why MCTS doesn't translate to LLM reasoning

Go has a concrete, learnable value function and bounded branching, so PUCT works. In language the action space is so large that search never revisits a child node, PUCT's exploration term has the wrong shape, and there is no cheap local value estimate better than just running the trajectory.

~117:32 KataGo tricks the bitter lesson erased

Clever algorithmic tricks have largely been outpaced by faster GPUs and stronger initializations. Modern RL has converged back toward on-policy because off-policy is less stable. AlphaGo Lee's random-rollout grounding of the value network turned out to be unnecessary and was dropped in subsequent papers.

~131:45 Bits-per-flop: RL vs supervised vs distillation

Dwarkesh extends with a bits-per-sample analysis: supervised gives negative-log-pass-rate bits per sample while RL gives only binary-entropy bits. You spend most of training in the low-pass-rate regime where you learn almost nothing — which is also why distillation on soft targets, and AlphaGo training on the full MCTS distribution rather than just the argmax, is so much more sample-efficient.
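
The two quantities in that comparison can be computed directly. This is a small sketch under the framing given above: a supervised label is worth -log2(p) bits when the model's pass rate is p, while a pass/fail reward carries at most the binary entropy of p. The function names are illustrative.

```python
import math


def supervised_bits(pass_rate: float) -> float:
    """Surprisal of being shown the correct answer: -log2(p)."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return -math.log2(pass_rate)


def rl_bits(pass_rate: float) -> float:
    """Binary entropy of a pass/fail reward: -p*log2(p) - (1-p)*log2(1-p)."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    if pass_rate in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    p, q = pass_rate, 1.0 - pass_rate
    return -(p * math.log2(p) + q * math.log2(q))
```

At a 1% pass rate, `supervised_bits(0.01)` is about 6.64 bits while `rl_bits(0.01)` is about 0.08 bits, which is the low-pass-rate gap described above. At a 50% pass rate both equal exactly 1 bit.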

~141:52 LLM coding assistants as automated researchers

Jang used Claude Opus 4.6 and 4.7 throughout. They're excellent at open-ended hyperparameter search and end-to-end experiment execution via a custom "experiment" skill that takes an x/y-axis description, runs everything, and produces a report. Where they fall down: choosing what experiment to run next, and the lateral thinking to realize a whole line of investigation is pointless. He had to catch infra bugs himself by prompting the right diagnostic questions.

~149:57 Go as an RL env for automated science

Quick outer-loop verification (win rate vs. KataGo or scaling-law-prediction accuracy) with a rich inner loop of distributed-systems engineering and idea evaluation. Open question: how locally verifiable any given research idea is.

Tools: Claude Opus 4.6, Claude Opus 4.7, KataGo, Prime Intellect, MCTS, PUCT, ResNet, NFSP, Q-learning

Podcast Developer Tools

AI Engineer

AI Engineer — Brian Scanlan (Intercom): Project 2x

Intercom doubled engineering throughput in under a year by standardizing on Claude Code, treating it like a senior engineer onboarded onto their Ruby on Rails monolith, and investing in internal skills, hooks, plugins, and session telemetry.^{[11]Brian Scanlan, Intercom — AI Engineer World's Fair} 17.6% of PRs are now auto-approved (SOC 2 / ISO 27001 / HIPAA compliant), Codex handles code review, and Intercom's own model serves 100% of Finn's English conversations, with Finn revenue approaching $100M.

~00:07 Intercom's AI pivot and the Finn business

1,400-person, 15-year-old B2B SaaS startup, R&D led from Dublin. Pivoted to AI the weekend ChatGPT launched. Finn (support agent) launched the day GPT-4 came out — has 8,000+ customers, revenue approaching $100M, ~2M resolutions/week. Intercom now serves 100% of Finn's English conversations with an in-house model that reportedly outperforms frontier models while being cheaper and faster. Customers include Anthropic, Snowflake, Linear, Glean, LaunchDarkly.

~02:08 Project 2x: doubling throughput as an explicit goal

Mid-2024, after being unimpressed with Copilot, Cursor, and Augment, leadership set a goal: double engineering throughput in one year. Primary metric: code changes per R&D person (acknowledged Goodhart's law). Named the project, team, and everything "2x." Christmas '24 / early '25 model jump accelerated progress.

~05:10 Organizational change and executive decisiveness

Updated job descriptions: not adopting AI = not meeting expectations, binary, regardless of role. Repeat the message 100+ times. Hackathons, AI immersion days, full-time Team 2x. Argument: in medium/large orgs you need your best people on this full-time, not just telling everyone "AI everything, best of luck."

~07:10 Picking one platform: Claude Code over Cursor/Augment

"Multi-cloud doesn't compound; pick one platform and optimize it." Vision: Claude should act like a senior engineer on any technical task across Intercom. Onboard it on Rails conventions, architecture, React patterns, testing/security rules. Push internal Claude plugins to laptops, bypassing the Claude Code update mechanism.

~10:13 Give agents problems, not tasks

Build durable, testable, high-quality skills with backtesting against historical code/incidents. Story: during a Snowflake metadata leak incident, Scanlan told Claude to join a Slack channel; it auto-discovered an existing data-breach skill, ran a full analysis, concluded the leak was innocuous, and gave next steps in ~2 minutes, for what would otherwise have been a 20-minute task. Lesson: describe the problem, let the agent pick the skill.

~15:20 Engineer maturity ladder for AI adoption

Use Claude Code for everything → automate your work → move automation to a skill → get great at writing skills → approve skills → optimize the environment (architecture, docs) for agents.

~16:21 Results: 2x PR throughput, 17.6% auto-approval, compliance-clean

Decision made in December, rollout in January. Upward of 90% of PRs now come out of Claude Code. The 17.6% automatic code-approval rate was built via backtesting plus human-labeled outputs to calibrate confidence, with PRs steered toward safe, simple changes. Worked with auditors: fully SOC 2, ISO 27001, and HIPAA compliant, with no human in the loop required.
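
One way such a calibration could work, sketched under assumptions: replay historical PRs through the reviewer, pair each confidence score with a human "safe to merge" label, and pick the lowest cutoff whose auto-approved set still meets a precision target. The talk does not describe Intercom's actual procedure; `calibrate_threshold`, the record format, and the target value are all hypothetical.

```python
from typing import Optional


def calibrate_threshold(
    backtest: list[tuple[float, bool]],
    target_precision: float = 0.99,
) -> Optional[float]:
    """Return the lowest confidence cutoff whose approvals meet the precision target.

    Each backtest record is (model_confidence, human_says_safe). Returns None
    when no cutoff reaches the target.
    """
    best = None
    # Walk cutoffs from strictest to loosest. Precision is not monotone in the
    # cutoff, so every candidate is checked and the loosest passing one is kept.
    for cutoff in sorted({confidence for confidence, _ in backtest}, reverse=True):
        approved = [is_safe for confidence, is_safe in backtest if confidence >= cutoff]
        precision = sum(approved) / len(approved)
        if precision >= target_precision:
            best = cutoff
    return best
```

A lower cutoff approves more PRs automatically, so the returned value is the most permissive setting that the labeled history still supports.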

~17:23 Telemetry, defects, code quality, and what's next

Hooks send skill-invocation data to Honeycomb. All Claude session transcripts pulled into S3 for data mining and skill-effectiveness analysis. Defects closing faster than ever; some teams pursuing "backlog zero." Stanford research group has Intercom's code; their code-quality metrics are rising.

Tools: Claude Code, Cursor, Augment, Codex, Honeycomb, S3, Finn, Ruby on Rails

Podcast Developer Tools

AI Engineer

AI Engineer — Mike Spitz (PFF): killing scrum

PFF ran a Jan–March 2026 case study replacing a 10-engineer scrum team with 2 engineers running agents, hitting 25x deploys and 10x output while customer satisfaction climbed from ~7 to 8.6/10.^{[12]Mike Spitz, PFF — AI Engineer World's Fair} Spitz argues engineers are no longer the bottleneck, so most Agile ceremonies can be deleted in favor of agent-driven specs, LDDs, tickets, PRs, and QA.

~00:07 PFF context

Sports-data company: 100M annual page views, 9M drafts/year, ~20 engineers. Distributed team was falling behind on a packed roadmap.

~02:07 Case study results: 25x deploys, 10x blended output

Two of his strongest engineers, running agents, shipped 5 deploys per day vs. the 10-engineer team's 1 deploy per 5 days. Blended ticket count with code complexity to estimate 10x output. Finished a 4-month roadmap in under 2 months.

~04:08 Customer quality jumps from ~7 to 8.6/10

Treated as the only retrospective signal that matters.

It doesn't matter if the output's more. It doesn't matter if the number of deployments are higher. What really matters is if the customers are happy.

~05:10 Scrum dies: huddles replace standups, sprint planning, retros

Bi-daily 30–60 min huddles between engineers, product, and design. The rest is gone.

Engineers aren't the bottleneck. So we don't need to have all the old ceremonies that we had before.

~06:10 Dev flow: agent-led spec, LDD skill, auto-tickets, auto-PRs

Agent-led spec interview, an LDD (lightweight design document) generated by a skill that mirrors prior LDDs, automatic ticket creation with blocking-dependency detection, and auto-updated ticket statuses tied to PR state.
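
Blocking-dependency detection is, at its core, a graph-ordering problem. The sketch below uses Python's standard-library `graphlib` to order tickets so every blocker comes before the work it blocks, and to surface circular dependencies. The ticket IDs in the usage note and the `order_tickets` helper are hypothetical; the talk does not say how PFF implements this step.

```python
from graphlib import CycleError, TopologicalSorter


def order_tickets(blocked_by: dict[str, set[str]]) -> list[str]:
    """Return tickets in an order where every blocker precedes what it blocks.

    `blocked_by` maps a ticket ID to the set of ticket IDs that must finish first.
    """
    sorter = TopologicalSorter(blocked_by)
    try:
        return list(sorter.static_order())
    except CycleError as exc:
        # graphlib reports the offending cycle as the second element of args.
        raise ValueError(f"circular blocking dependency: {exc.args[1]}") from exc
```

For example, `order_tickets({"API-2": {"API-1"}, "UI-3": {"API-2"}, "API-1": set()})` places API-1 first and UI-3 last.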

~08:11 Rollout: curious senior engineers, slow, non-critical systems

Not everyone can drive a sports car. And that's all right.

~10:13 Agentic code review for style nits; skills as factory components

We use agents to do the code reviews that engineers hate getting any feedback from.

~13:15 QA agent on staging and the self-healing PR loop

After merge, a QA agent runs against acceptance criteria on staging; next planned step is having an agent auto-open PRs to fix failing acceptance criteria so the loop self-heals.

~14:18 Recommendations: kill redundant process, don't be too conservative

A few months behind at the moment might be six months behind in a few months, might be 12 months behind a little bit afterwards.

Tools: Claude Code, Codex, Agent skills, Feature flags, Trunk-based development, Service-repository pattern

Podcast Developer Tools

AI Engineer

AI Engineer — Pedro Rodrigues (Supabase): skills + MCP

Supabase shipped an official agent skill and ran evals across Claude and Codex models: MCP+skill outperformed MCP-only and baseline on every model.^{[13]Pedro Rodrigues, Supabase — AI Engineer World's Fair} Headline takeaway: the bottleneck isn't context, it's guidance — point at your single source of truth, be opinionated, start minimal.

~00:14 Reframing: MCP and skills play different roles

The MCP-vs-skills debate has settled, so Rodrigues instead walks through how Supabase actually wrote their skill and what they learned.

~01:15 Why agents need product-specific guidance

Agents are smart enough for mundane tasks but operate on stale training data, are lazy about admitting they don't know, and miss product-specific pitfalls — for Supabase, that means silently bypassing row-level security when creating SQL views.

~02:15 Skills primer: front matter, skill.md, bundled resources

~03:16 RLS demo: MCP-only fails, MCP+skill preserves security

With Claude Sonnet 4.6 and only the MCP, the agent created a view that exposed protected data. With MCP+skill, it correctly set security_invoker=true and preserved RLS.
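
To make the pitfall concrete: a Postgres view runs with its owner's privileges by default, so row-level security on the underlying tables is checked against the owner rather than the querying user. Postgres 15 and later accept `WITH (security_invoker = true)` on a view to change that. The sketch below builds such a statement and includes a rough lint check; the helper names are hypothetical, the view and table names in the usage note are made up, and the regex is a simplification rather than a SQL parser.

```python
import re

# Matches a WITH (...) option list that turns security_invoker on.
_SECURITY_INVOKER = re.compile(
    r"\bWITH\s*\([^)]*\bsecurity_invoker\s*=\s*(true|on|1)\b[^)]*\)",
    re.IGNORECASE,
)


def view_ddl(name: str, select_sql: str) -> str:
    """Build CREATE VIEW DDL that evaluates RLS as the querying user.

    Inputs are assumed to be trusted identifiers and SQL; nothing is escaped here.
    """
    return f"CREATE VIEW {name}\nWITH (security_invoker = true) AS\n{select_sql};"


def preserves_rls(ddl: str) -> bool:
    """Rough check that a view definition opts in to security_invoker."""
    return bool(_SECURITY_INVOKER.search(ddl))
```

Here `preserves_rls(view_ddl("public.my_orders", "SELECT id, total FROM orders"))` is true, while a plain `CREATE VIEW v AS SELECT * FROM orders;` fails the check.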

~05:18 Principle 1: don't duplicate docs — point to single source of truth

Be stubborn about making agents fetch docs. Supabase is experimenting with exposing docs over SSH so agents can navigate them like a filesystem.

~07:19 Principle 2: if it can be skipped, it will be

Agents are lazy about loading reference files (one file maybe, two almost never), so anything critical like a security checklist must live directly in skill.md, not a reference.

If something can get skipped, it will be skipped.

~09:21 Principle 3: be opinionated — Supabase's schema-management workflow

Run direct DDL on dev/staging, run the advisor for security/perf issues, fix them, and only then write the migration file — rather than producing a migration on every schema change.

Be opinionated. You know your product the best. Don't be afraid of guiding the agents on workflows that you think are the most effective.

~11:21 Evals on Braintrust: MCP+skill wins across Claude and Codex

Six scenarios, four agents (Claude Code with Opus 4.6 and Sonnet 4.6, Codex with GPT-5.4 and GPT-5.4 mini), three conditions: baseline, MCP-only, MCP+skill. MCP+skill won on every model.
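
The size of that grid is easy to underestimate: six scenarios across four agents and three conditions is 72 runs per pass. A minimal sketch of enumerating it follows. The scenario names are placeholders, since the talk summary does not list them, and the dictionary layout is an assumption rather than Braintrust's data model.

```python
from itertools import product

# Placeholder names; the six actual scenarios are not listed in this summary.
SCENARIOS = [f"scenario-{i}" for i in range(1, 7)]
AGENTS = [
    ("Claude Code", "Opus 4.6"),
    ("Claude Code", "Sonnet 4.6"),
    ("Codex", "GPT-5.4"),
    ("Codex", "GPT-5.4 mini"),
]
CONDITIONS = ["baseline", "mcp-only", "mcp+skill"]


def eval_matrix() -> list[dict[str, str]]:
    """Enumerate every (scenario, agent, condition) combination to run."""
    return [
        {"scenario": scenario, "harness": harness, "model": model, "condition": condition}
        for scenario, (harness, model), condition in product(SCENARIOS, AGENTS, CONDITIONS)
    ]
```

`len(eval_matrix())` is 72, with 24 runs per condition.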

~13:22 Takeaways: bottleneck is guidance, not context

The bottleneck is not the context, it's the guidance.

~16:22 Q&A: skill distribution is still unsolved

Supabase currently packages skills inside repos as .claude or .cursor plugins, with Vercel's skills package as one emerging option.

Tools: Supabase MCP server, Supabase agent skill, Claude Code, Claude Sonnet 4.6, Claude Opus 4.6, Codex, GPT-5.4, GPT-5.4 mini, Braintrust, Postgres RLS, Vercel skills package

Podcast Developer Tools

Real Python

Real Python #295: agentic architecture with Mikiko Bazeley

Mikiko Bazeley (MongoDB) joins host Christopher Bailey to debate files vs. databases for agent memory, unpack why million-token context windows collapse to 20–40% effective use, and share practical context-engineering and skill-building tactics for Python devs building agents.^{[14]Real Python Podcast #295}

~02:01 MongoDB's reinvention for AI workloads

Pivot from "internet scale / NoSQL" into an AI data platform — Atlas with vector + text search, the Voyage acquisition for embeddings and rerankers, partner integrations with LangChain, Mastra, and Agno, packaged in the 8.0 release.

~07:04 The "files are all you need" debate

Karpathy's all-markdown Obsidian setup went viral as "RIP vector databases," but Bazeley highlights the exceptions: multimodal data (PDFs, schematics), 100k+ doc corpora, and precision-on-source. Harrison Chase's LangChain file-storage interface still sits on top of a database.

~11:08 Three modes of agents seen in production

Assistant agents, workflow agents (no-code/low-code journey builders like Mailchimp's), and deep-research agents that spawn sub-agents.

~22:15 Anatomy of a simple agent: perception, planning, action, memory

~25:16 Big model vs. big harness, and context rot

A 1M-token model yields only 20–40% effective context after tool metadata, system instructions, and corrective turns eat the rest.

~32:19 Tool loadouts, context clash, starting with one agent

Limit tools to 5–8 per agent, use semantic tool search instead of dumping a whole MCP API into context, and resist the urge to start multi-agent. Successful teams scope one agent well, define evals, and only split into sub-agents once the tool count hits 20–30.
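
A toy version of a tool loadout, assuming nothing about any particular framework: score each tool description against the task and hand the agent only the top few. Word-overlap cosine similarity stands in here for a real embedding model, and the tool names in the usage note are invented for illustration.

```python
import math
import re
from collections import Counter


def _vector(text: str) -> Counter:
    # Bag-of-words counts; a production system would use embeddings instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_tools(task: str, tools: dict[str, str], max_tools: int = 8) -> list[str]:
    """Return up to `max_tools` tool names whose descriptions best match the task."""
    task_vec = _vector(task)
    scored = [(_cosine(task_vec, _vector(description)), name) for name, description in tools.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    # Tools with no overlap at all are dropped rather than padded in.
    return [name for score, name in ranked[:max_tools] if score > 0]
```

Given tools described as "run a sql query against the database", "send an email message to a user", and "list tables in the database schema", the task "query the database tables" selects the two database tools and leaves the email tool out of context.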

~41:26 Shared state in multi-agent systems

Files can really only be owned by one agent at a time; scratchpads add latency. Anthropic's deep research can take 3 hours and historically had 50 sub-agents duplicating work — exactly what ACID-transactional databases were built to handle.
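
The duplicated-work problem is the kind of thing a transaction solves. As a minimal sketch, the code below uses SQLite from Python's standard library as a stand-in for whatever database a team actually runs: each sub-agent claims the next unowned task inside a write transaction, so two agents cannot take the same one. The table layout, function names, and the topics in the usage note are assumptions for illustration.

```python
import sqlite3
from typing import Optional


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None puts the connection in autocommit mode so that
    # transactions are controlled explicitly with BEGIN / COMMIT below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY, topic TEXT UNIQUE, owner TEXT)"
    )
    return conn


def add_task(conn: sqlite3.Connection, topic: str) -> None:
    # The UNIQUE constraint means a topic proposed twice is stored once.
    conn.execute("INSERT OR IGNORE INTO tasks (topic) VALUES (?)", (topic,))


def claim_task(conn: sqlite3.Connection, agent: str) -> Optional[str]:
    """Atomically assign the oldest unowned task to `agent`; None if none remain."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, topic FROM tasks WHERE owner IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE tasks SET owner = ? WHERE id = ?", (agent, row[0]))
        conn.execute("COMMIT")
        return row[1] if row is not None else None
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

With tasks "pricing" and "competitors" queued, the first agent to call `claim_task` gets "pricing", the second gets "competitors", and a third gets `None` instead of repeating someone else's research.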

Tools: MongoDB Atlas, Voyage, LangChain, Mastra, Agno, Ollama, Claude Code

Podcast Industry

EO

The Kalshi Story (EO documentary)

EO profiles Kalshi co-founders Tarek Mansour and Luana Lopes Lara, who spent years pursuing a regulated US prediction market, sued the CFTC to launch election markets, and scaled the company to a $22B valuation — double its $11B mark five months earlier.^{[15]The Kalshi Story — EO}

~00:00 Why money makes prediction markets accurate

Tarek argues academic research on prediction accuracy is done in controlled environments — in the real world, incentives matter. People give more honest forecasts when they have to put money behind their convictions.

~02:10 Founder origins

Luana grew up in Brazil sleeping 4 hours a night to balance professional ballet with school. Tarek grew up in Lebanon amid war and instability, developing adaptability and a "don't take setbacks too seriously" mindset.

~06:12 The idea: Goldman, Bridgewater, and Five Rings

Tarek noticed at Goldman in 2016 that institutions wanted exposure to event outcomes (Brexit, Trump winning) but had to use clumsy proxy trades. Working together at Five Rings prop shop in 2017–18, the founders connected the dots on a regulated US prediction market.

~07:13 60 lawyers said no, then they cold-called Jeff

60–65 lawyers in a single day all said a regulated prediction market was impossible. They reached Jeff (former CFTC), who outlined the 23 core principles. Tarek and Luana produced a full compliance analysis in a weekend, convincing him to join.

~09:15 Regulatory-first principle and the YC slog

At a YC hackathon, Michael Seibel told them their demo was "illegal" but admired their motivation. While other YC batchmates shipped products, Kalshi stagnated for years on legal documents and regulator meetings.

~13:20 Suing the CFTC to win election markets

After two years of failed engagement and a public comment period that drew 200+ supporters, including a former chair of the Council of Economic Advisers, Kalshi sued its own regulator. It won in district court, but the government appealed, and Kalshi feared the clock would run out before the 2024 election; the appeals court then also sided with Kalshi.

~17:21 The 4-week scramble: 100x overnight

After winning the appeal, the third-party clearing house blocked the election market, forcing Kalshi to migrate to its own clearing house over a single weekend (normally a 6-month project). The team went 24/7, scaled 100x overnight, processed $2B+ in volume, and onboarded 2M+ customers in two weeks.

~23:27 Power users: mention markets, chart-scraping teachers, hurricane hedges

Joel made $170K in 5 months trading "mention markets" on Trump speeches using a historical speech database. Brandon, a 25-year-old Bucks County schoolteacher, made $150K by scraping HTML source for Travis Scott CD inventory to predict chart positions. A Gulf Coast meteorologist uses hurricane markets as homeowner's-insurance hedges.

Hot Take Industry

Theo - t3.gg Low Level

The npm supply-chain attack wave

A self-propagating worm called Shai-Hulud has been compromising npm, PyPI, and Cargo packages, including 84 TanStack packages poisoned via a CI cache exploit.^{[16]Shai Halud worm — Low Level} Theo argues AI has collapsed the security disclosure timeline: two parties independently found the CopyFail vuln within 9 hours of each other, and Claude/Gemini/GPT can identify a security fix from just the diff.^{[17]Everything is pwn'd now — Theo}

~00:01 Shai-Hulud: a package-manager worm

Once a maintainer account is compromised, the attacker gains credentials to publish signed packages and cascade further compromises. TanStack was specifically called out: attackers exploited a GitHub Actions pull_request_target workflow (which runs in the context of the base repository, with access to its secrets, rather than with a fork contributor's restricted permissions) to poison the Actions cache, extract a publish token, and push a malicious signed package.^{[16]Low Level}

Please stop using pull request target. I think generally these are a huge foot gun.
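A rough way to find the workflow pattern described above is to scan workflow files for the trigger. The sketch below is a crude text-based heuristic, not a YAML parser or a real security scanner; the function name and the finding messages are illustrative assumptions.

```python
import re

# The trigger itself. \b keeps plain "pull_request" from matching.
TRIGGER = re.compile(r"\bpull_request_target\b")

# Expressions that point at the pull request's head. Checking these out under
# pull_request_target means untrusted code runs with the base repo's privileges.
HEAD_REF = re.compile(
    r"github\.event\.pull_request\.head\.(sha|ref)|github\.head_ref"
)


def audit_workflow(text: str) -> list[str]:
    """Return human-readable findings for one workflow file's contents."""
    # Crude comment stripping: drops everything after '#' on each line.
    # This can misfire on '#' inside quoted strings; acceptable for a sketch.
    body = "\n".join(line.split("#", 1)[0] for line in text.splitlines())

    findings = []
    if not TRIGGER.search(body):
        return findings
    findings.append("uses pull_request_target")
    if HEAD_REF.search(body):
        findings.append("references the PR head under pull_request_target")
    return findings


if __name__ == "__main__":
    sample = (
        "on: pull_request_target\n"
        "jobs:\n"
        "  build:\n"
        "    steps:\n"
        "      - uses: actions/checkout@v4\n"
        "        with:\n"
        "          ref: ${{ github.event.pull_request.head.sha }}\n"
    )
    for finding in audit_workflow(sample):
        print(finding)
```

Running it over every file in .github/workflows gives a quick inventory of which workflows need a closer manual look.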

Mitigations: put a CDN middleman like Socket in front of installs to scan packages for anomalous behavior, sandbox installs via any.run, configure npm/pnpm/uv to refuse packages younger than one week, and avoid pull_request_target entirely.
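The "refuse packages younger than one week" rule can be illustrated with a small filter. This is a sketch of the rule itself, not of how any package manager implements it. It assumes input shaped like the `time` object in npm registry metadata (version mapped to an ISO 8601 publish timestamp, plus `created` and `modified` keys); the function name and sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def versions_old_enough(
    time_field: dict[str, str],
    now: datetime,
    min_age: timedelta = timedelta(days=7),
) -> list[str]:
    """Return versions published at least min_age before now."""
    allowed = []
    for version, stamp in time_field.items():
        # These two keys describe the package as a whole, not a version.
        if version in ("created", "modified"):
            continue
        # Normalize a trailing 'Z' so older Python versions can parse it.
        published = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if now - published >= min_age:
            allowed.append(version)
    return allowed


if __name__ == "__main__":
    # Hypothetical metadata for illustration only.
    sample = {
        "created": "2025-08-01T00:00:00.000Z",
        "modified": "2025-09-08T00:00:00.000Z",
        "1.0.0": "2025-08-01T00:00:00.000Z",
        "1.0.1": "2025-09-08T00:00:00.000Z",
    }
    now = datetime(2025, 9, 10, tzinfo=timezone.utc)
    print(versions_old_enough(sample, now))
```

The waiting period trades freshness for time: a malicious release is more likely to be detected and pulled before anyone behind the gate installs it.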

~00:00 Theo: the supply chain is on fire

Theo runs through a rapid-fire string of recent exploits: CopyFail (trivial Linux kernel LPE via 732 bytes of Python), CopyFail 2, Dirty Frag, 84 TanStack npm packages, an unpatched slab memory breakout, a curl vulnerability, and a GitHub.com RCE via a single git push that allowed unauthorized access to millions of repos. macOS 26.5 patched 79 separate CVEs in a single release, two credited to Claude/Anthropic.^{[17]Theo — Everything is pwn'd}

We cannot possibly survive long in a world where the people who are making the update we all need to install discover this information the same moment that the hackers do.

~04:03 AI is collapsing the disclosure timeline

Three pillars of software security are crumbling at once: only experts could find exploits (dead — agents in a loop with enough tokens can find real exploits today); the 90-day window is sufficient (broken — two parties independently found CopyFail within 9 hours of each other); patch-to-exploit is hard (broken — AI can read a diff and immediately identify it as a security fix). In Jeff Kaufman's test, Gemini 1.5 Pro, GPT o3 Thinking, and Claude Opus 4.7 each correctly identified CopyFail as a security fix from just the patch diff.

Anything that required a lot of attention from talented people over time can now be done in a for loop.

~17:12 Open source needs a trusted-actor tier

Theo proposes a verified middle tier between private security disclosure and public release: distribution maintainers and enterprise IT teams that can pass an audit and be privately notified before public disclosure. It could even be a revenue stream (charge Microsoft for certified early access). The idea extends to platform design: a "better GitHub" that supports staged openness, where projects can keep specific PRs, files, or branches private until patches are sufficiently distributed.

If you assume the killer's already in the house, you'll have a better time.

~03:01 A QEMU/KVM hypervisor escape

The flaw allows an attacker inside a VM to execute arbitrary code as root on the bare-metal host. The host (Low Level) attributes the rising frequency of such disclosures to AI, which lets skilled researchers operate "as if they were 100 researchers" and lowers the bar for less-experienced threat actors. He expects a difficult 5–7 year transition before AI-assisted defenses catch up.

I think we're going to enter a very weird like 5 to 7-year period where shit just gets really bad for a while.

Tools: Socket, any.run, npm, pnpm, uv, GitHub Actions, QEMU, KVM, Claude (Opus, 5.4, 5.5), Gemini 1.5 Pro, GPT o3 Thinking, Mythos, OpenAI Daybreak

Industry

Fireship

Musk vs. Altman closing arguments

The Musk vs. Altman federal trial reached closing arguments in Oakland on May 15, with Musk seeking $134B and the unwinding of OpenAI's for-profit conversion.^{[18]I can't believe this trial is real — Fireship} Polymarket gives Musk a 32% chance of winning; Fireship assesses his case as weak.

~00:00 Trial setup

Closing arguments in federal court in Oakland for Musk vs. Sam Altman, Greg Brockman, OpenAI, and Microsoft. Musk co-founded OpenAI in 2015 as a nonprofit and donated $38M before leaving the board in 2018. He claims Altman and Brockman breached a charitable trust by bolting a for-profit subsidiary onto the nonprofit and cashing a $13B check from Microsoft, with OpenAI now approaching a $1T valuation.^{[18]Fireship — Musk trial}

~02:01 The haunted-mansion proposal

Discovery exposed Elon's 2017 push to convert OpenAI to for-profit with himself as CEO and majority shareholder, pitched from a haunted mansion with Amber Heard serving whiskey.

~03:02 "He stormed around the table"

Greg Brockman's testimony described a dramatic 2017 confrontation where Elon allegedly stormed around a table and threatened him before leaving. Discovery also surfaced Shivon Zilis's conflict of interest as a former OpenAI board member and mother of four of Musk's children.

He stood up and stormed around the table. I thought he was going to hit me.

~04:03 "Directionally, very bad"

Mira Murati's reply to Altman during the November 2023 firing — "Directionally, very bad" — became a defining moment of the discovery dump.

Directionally, very bad. — Mira Murati, texting Altman during his firing

~05:04 Why Fireship thinks Musk's case is weak

His own 2017 for-profit push undercuts the charitable-trust claim, there's no hard contract, and he admitted xAI stole OpenAI models. Polymarket: 32% Musk wins.

Hot Take Industry

Nate B Jones

SaaS vendors switch on a second agentic billing meter

Salesforce, Microsoft, and ServiceNow are all layering a second agentic billing meter on top of traditional per-seat pricing, charging for agent actions, workflow units, and credits.^{[19]Your SaaS Bill Just Got a Second Meter — Nate B Jones} Nate argues the seat was always a proxy for human work and value, and the agent license is becoming a meter for that same unit of work, except that the work is now delegated to agents.

~00:00 The second meter

Salesforce Agentforce hit $800M ARR at 169% YoY growth by billing "agentic work units." Microsoft 365 Copilot keeps seat pricing, but Copilot Studio adds an explicit second meter via Copilot credits, where different agent actions consume credits at different rates. ServiceNow reframes the same idea through "action fabric": agents that provision access, escalate incidents, or open change requests generate billable operational work units. Pattern: keep the seat, switch on a second meter for delegated agentic work.^{[19]Nate B Jones}

Toll booths along the way to make sure that they get their piece of the value you're creating.
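The "keep the seat, add a second meter" pattern can be sketched as a simple invoice calculation. Everything here is hypothetical: the action names, credit rates, and prices are illustration values, not any vendor's actual rate card.

```python
from dataclasses import dataclass

# Hypothetical rate card: credits consumed per agent action type.
# Real vendors publish their own units and rates.
CREDITS_PER_ACTION = {"read": 1, "draft": 2, "execute": 5}


@dataclass
class Usage:
    seats: int
    actions: dict[str, int]  # action type -> number of agent actions


def monthly_invoice(usage: Usage, seat_price_cents: int, credit_price_cents: int) -> dict:
    """Combine the traditional seat meter with a per-credit agent meter."""
    # First meter: flat price per human seat.
    seat_total = usage.seats * seat_price_cents
    # Second meter: each action type burns credits at its own rate.
    credits = sum(CREDITS_PER_ACTION[a] * n for a, n in usage.actions.items())
    agent_total = credits * credit_price_cents
    return {
        "seat_total_cents": seat_total,
        "credits_used": credits,
        "agent_total_cents": agent_total,
        "total_cents": seat_total + agent_total,
    }


if __name__ == "__main__":
    usage = Usage(seats=10, actions={"read": 1000, "draft": 200, "execute": 50})
    print(monthly_invoice(usage, seat_price_cents=3000, credit_price_cents=1))
```

The seat total stays fixed while the agent total scales with delegated work, which is why forecasting the second meter matters more as usage grows.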

~07:06 SAP's API lockout policy

SAP's 2026 API policy restricts AI agents from "planning, selecting, or executing sequences of API calls outside SAP endorsed architectures." Any agent — internal or third-party — that needs to touch SAP data must first clear a contractual hurdle. Nate's advice: negotiate agent access paths during procurement, naming the specific AI providers (Claude, OpenAI) and required access patterns before signing.

Pricing follows platform control. The vendor that defines the new work primitive earns the argument that it should price the work.

~09:08 Fair vs. rent-seeking agent licenses

A fair license has nine traits: a visible meter, sensible units, forecastable usage, no billing for failed or low-value work, governed (not blocked) paths for third-party agents, granular billing by read/draft/write/approve/execute, buyer-set caps, exportable usage data, and a fixed rate card post-adoption. Rent-seeking licenses charge for vague "AI access," make the vendor's own agent the only practical route, bill failed attempts, hide the meter until renewal, and dress commercial lock-in in security language.

The worst move you could do is to wait until your usage is embedded because you have no leverage then.
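Several of the fair-license traits (no billing for failed work, granular billing by action type, buyer-set caps, exportable usage data) can be checked against an exported usage log. The sketch below assumes a hypothetical export format in which each record carries an action type, a unit count, and a success flag; no vendor's real export schema is implied.

```python
from collections import Counter


def audit_usage(records: list[dict], cap_units: int) -> dict:
    """Summarize an exported usage log against buyer-side expectations.

    Each record is assumed to look like:
        {"action": "read", "units": 3, "succeeded": True}
    """
    billable_by_action = Counter()
    failed_units = 0
    for record in records:
        if record["succeeded"]:
            # Granular view: billable units per action type.
            billable_by_action[record["action"]] += record["units"]
        else:
            # Failed work is tracked but should not be billed.
            failed_units += record["units"]

    billable = sum(billable_by_action.values())
    return {
        "billable_by_action": dict(billable_by_action),
        # A buyer-set cap limits what can be charged in the period.
        "billable_units": min(billable, cap_units),
        "units_over_cap": max(0, billable - cap_units),
        "failed_units_excluded": failed_units,
    }


if __name__ == "__main__":
    log = [
        {"action": "read", "units": 10, "succeeded": True},
        {"action": "write", "units": 5, "succeeded": True},
        {"action": "execute", "units": 8, "succeeded": False},
        {"action": "execute", "units": 4, "succeeded": True},
    ]
    print(audit_usage(log, cap_units=15))
```

If a vendor's invoice cannot be reconciled with a summary like this, that is a sign the meter is not visible enough.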

Tools: Salesforce Agentforce, Microsoft 365 Copilot, Copilot Studio, ServiceNow Action Fabric, Atlassian Rovo, SAP, Zendesk, HubSpot, Workday