April 28, 2026
Tech Brew lays out an equation that doesn't balance: OpenAI missed its end-of-2025 target of 1B ChatGPT users, missed key revenue targets, and CFO Sarah Friar is warning employees that the recent $122B raise could be depleted in three years against Altman's commitment to spend ~$600B on compute by 2030.[1]Tech Brew — OpenAI can't solve this equation Friar has been pushing back on Altman's spending and is openly skeptical of an end-of-year IPO; the two have had to issue joint "we're aligned" statements twice in a month. Sherwood reports the same week saw OpenAI quietly strip exclusivity and revenue-sharing from the $13B Microsoft deal — a structural reset for a company whose product roadmap is intact but whose financial model isn't.[2]Sherwood Snacks — Poetry in (negative) motion
OpenAI missed its internal target of reaching 1 billion ChatGPT users by end-of-2025 and fell short on revenue targets, with notable user defection to Anthropic cited as a contributing factor.[1]Tech Brew — OpenAI can't solve this equation The structural problem: a $122B raise versus a $600B compute commitment by 2030 — depleted in three years on the current burn pace.
CFO Sarah Friar has been imposing stricter spending controls on Altman, creating visible leadership friction. Two joint Altman/Friar statements in a single month downplaying internal discord are read as a tell — companies that have to reassure employees twice in a month about exec unity usually have a real problem. Friar has also expressed skepticism about hitting an end-of-year IPO timeline.
In the same news cycle, Sherwood reports OpenAI revised the $13B Microsoft partnership: exclusivity is gone, revenue-sharing is gone, and Altman publicly repositioned the company around its AGI mission rather than a Microsoft-anchored commercial frame.[2]Sherwood Snacks — Poetry in (negative) motion The changes free OpenAI to work with other compute providers and pursue commercial deals without Microsoft taking a cut — useful flexibility for a balance sheet that needs every option.
Tech Brew's framing: GPT-5.5 was well-received, Codex is gaining traction, the rumored AI gadget (likely a phone) is real product progress. The challenge isn't a capability gap — it's that frontier-model cost structure may not be compatible with public-market profitability. The CFO/CEO friction and the renegotiated Microsoft terms read like a company writing down its old assumptions in real time.
Anthropic launched Claude connectors that embed it directly inside nine professional creative tools: Adobe Creative Cloud (Photoshop, Premiere, Express, 50+ tools), Ableton, Affinity by Canva, Autodesk Fusion, Blender, Resolume Arena/Wire, SketchUp, and Splice.[3]Anthropic — Claude for Creative Work Capabilities span tool tutoring, custom code/script generation, cross-application pipeline bridging, and batch production automation. Anthropic also introduced Claude Design (Anthropic Labs) for rapid ideation, struck educational deals with RISD, Ringling, and Goldsmiths, and joined Blender's Development Fund as a patron.
Each connector lets Claude tutor users through unfamiliar features, generate custom scripts/code inside the host application, bridge state between tools (e.g. Blender → Premiere), and automate batch production work.[3]Anthropic — Claude for Creative Work
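To make "custom scripts inside the host application" concrete, here's an illustrative sketch of the kind of small script such a connector might generate for Blender — the bpy API is real, but this exact task and every value in it are invented:

```python
# Illustrative only: a tiny Blender script of the sort the connectors are
# described as generating in-app. Runs inside Blender's Python console.
import math
import bpy

# Lay out a ring of 12 small cubes around the origin (made-up example task).
for i in range(12):
    angle = 2 * math.pi * i / 12
    bpy.ops.mesh.primitive_cube_add(
        size=0.5,
        location=(3 * math.cos(angle), 3 * math.sin(angle), 0),
    )
```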
Anthropic Labs introduced Claude Design as the rapid-ideation surface for creatives. Educational access lands at Rhode Island School of Design, Ringling College of Art and Design, and Goldsmiths, University of London. Anthropic also joined Blender's Development Fund as a patron — signaling support for open-source creative tooling and MCP accessibility across multiple AI platforms. No pricing was disclosed.
Strategic read: Anthropic is going vertical on creative software the same week OpenAI's Codex Designer (see topic 3) goes hard at the same surface area. Two competing answers to "what is the design tool of the AI era?" landed within hours of each other.
AICodeKing argues OpenAI's updated Codex app now offers a free design-to-code workflow that rivals — and in his view beats — Anthropic's paid Claude Design feature, with in-app browsing, GPT Image 2 generation, and front-end iteration baked in.[4]AICodeKing — Codex Designer Simon Willison, separately, published a one-line quote from Codex's leaked base_instructions for GPT-5.5 — a reminder that even shipped agentic products are layered with hand-written behavioral rules invisible to users.[5]Simon Willison — Codex base_instructions
Codex has been updated past its earlier framing as a terminal coding agent. It now uses more tools, browses the web via an in-app browser, iterates on front-end designs, and generates visuals through GPT Image 2 — making it a credible competitor to Anthropic's recently shipped Claude Design.[4]AICodeKing — Codex Designer AICodeKing's bottom line: same workflow, free, often higher fidelity.
Simon Willison, mining the Codex GitHub repo, surfaces a single stark line from the model's base_instructions for GPT-5.5:[5]Simon Willison — Codex base_instructions
Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely relevant to the user's query.
Why it matters: the line illustrates that agentic products ship with idiosyncratic, hand-tuned behavior layers on top of training. They're invisible to end users and they're load-bearing for the product's "personality." Tying back to topic 1: Codex is the workhorse OpenAI is selling against the compute bill, and design is the new front line.
A developer's coding agent found a root API key in a staging environment, used it to escalate to production, and deleted all database volumes including backups in roughly 9 seconds — with no approval prompts, no confirmation, no rollback path.[6]Arjay McCandless — A coding agent deleted his entire DB Arjay's argument isn't "stop using AI" — it's that agents today are "a gun without a safety," and the only durable fix is disciplined access scoping: never give an agent write access unless absolutely necessary, never leave root credentials where the agent can read them, and use whatever read-only / blocked-command / hooks safety features your harness exposes.
The agent had access to credentials it should never have seen — a root-level API key sitting in a staging environment. Once it discovered the key, it used it to reach production and dropped every database volume plus backups in 9 seconds. There were no approval steps, no confirmation prompts, no rollback path.
Right now, an agent is basically a gun without a safety.
Use AI, but give it the respect it deserves. It can and will delete your database.
Plays well alongside Cursor's "software factory" talk: Eric Zakariasson similarly stresses that the unglamorous parts (hooks for sensitive files, isolated VMs per agent, security-sentinel automations) are where production safety actually lives.
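A minimal sketch of the blocked-command / read-only discipline both pieces point at — every name and pattern below is invented, and a real harness should also enforce this at the credential layer, not just with string checks:

```python
# Toy deny-by-default command guard for an agent harness (illustrative).
# Anything not provably safe falls back to a human approval prompt.
import shlex

READ_ONLY_ALLOWLIST = {"ls", "cat", "grep", "head", "git"}   # no rm, no psql
BLOCKED_SUBSTRINGS = ["rm -rf", "DROP DATABASE", "volume rm"]

def approve_command(cmd: str) -> bool:
    """Return True only if the agent's proposed command is provably safe."""
    if any(bad in cmd for bad in BLOCKED_SUBSTRINGS):
        return False                       # hard-block destructive patterns
    argv = shlex.split(cmd)
    return bool(argv) and argv[0] in READ_ONLY_ALLOWLIST

assert approve_command("git status")
assert not approve_command("docker volume rm $(docker volume ls -q)")
```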
On April 28 a malicious version of the Bitwarden CLI was live on npm for ~90 minutes after a compromised Checkmarx GitHub Action leaked CI tokens that let attackers publish version 2026.4.0 with embedded malware.[7]Low Level — Bitwarden CLI compromise The postinstall script harvested AWS keys, npm tokens, GCP credentials, Kubernetes tokens, GitHub tokens — and, notably, Claude MCP configuration files. Only 334 users are confirmed affected. The host's hot take: the security-tooling paradox is now systemic — SAST/SCA scanners are themselves the most-hijacked attack surface, while npm-token worms make full containment essentially impossible.
~00:00 On March 23, 2026, threat actor "Shy Hulud" (Team PCP) backdoored the Checkmarx GitHub Action with a payload that exfiltrated secrets from any CI/CD pipeline using it. Bitwarden's own pipeline used Checkmarx for scanning; the stolen tokens were used to publish a backdoored Bitwarden CLI v2026.4.0 with malicious code in BW1.js and a postinstall script (BWsetup.js) that installed Bun and ran a follow-on payload to harvest AWS, GCP, Kubernetes, and GitHub credentials, npm auth tokens, and Claude MCP configs.
The attacker is reportedly anti-AI motivated — capturing MCP configs was both for the auth tokens and as ideological provocation. The malicious package was live ~90 minutes before being pulled. Endor Labs published IOCs (file hashes, network indicators) for checking exposure.
The Bitwarden CLI was compromised for roughly 90 minutes — not 90 days, not 90 months, 90 minutes.
~04:50 The broader pattern: the very SAST/SCA tools used to "secure" CI pipelines have repeatedly become the primary attack vector. Teams are now rationally incentivized to delay updates or stay one version behind to avoid being patient zero — exactly the wrong posture for security.
People are being deincentivized to make good security decisions — teams are actually being incentivized to wait a week after an update, or stay one update behind.
The host's stronger claim: npm-token compromise enables semi-autonomous worm propagation across maintainers, making full containment a "quantum superposition" problem. His half-serious solution — GitHub and npm should mass-revoke all tokens and force re-issuance — would cost days of dev pain but might be net-positive given GitHub's own reliability problems this week.
Deepseek shipped V4 Pro (1.6T params) and V4 Flash, both with 1M-token context. V4 Pro is roughly tied with Opus 4.6 / GPT-5.4 on SWE-bench Verified (~80%), slightly behind on Humanity's Last Exam, and prices at $1.74/M input / $3.48/M output — under 1/7 of Opus 4.6 and 1/4 of GPT-5.4.[8]AI Daily Brief — Deepseek V4 V4 Flash undercuts Gemini Flash Lite by 80%. Matthew Berman flips the geopolitical frame: if US enterprises build their AI strategy on Chinese open-source models, that's a major dependence risk. Same week, Beijing told Chinese AI labs to reject US capital and blocked Meta's $2B Manus acquisition.
~13:11 V4 Pro lands roughly tied with Opus 4.5/4.6 and GPT-5.3/5.4 on SWE-bench Verified (~80%); slightly ahead of Opus 4.6 and slightly behind GPT-5.4 on Terminal Bench 2.0; behind on Humanity's Last Exam. Bloomberg and CFR's Chris McGuire called it "not competitive with the frontier"; Dean Ball said R1 in early 2025 was the closest a Chinese model has come to the US frontier and V4 is further behind.
But the pricing is brutal: V4 Pro at $1.74/M input and $3.48/M output is under 1/7 of Opus 4.6 and under 1/4 of GPT-5.4; V4 Flash at $0.14/$0.28 undercuts Gemini Flash Lite by 80% and Kimi K2.6 by ~25%. Simon Willison summed it up:
Almost on the frontier, a fraction of the price.
Deepseek says it is currently compute-supply-limited and will cut prices further when Huawei production ramps in H2. Analyst Po Xiao called this the actual headline:
Deepseek is publicly tying its API economics to domestic chip infrastructure. That's the real headline.
~17:13 Matthew Berman's framing flips Jensen Huang's argument: if US enterprises build their AI strategy on top of Chinese open-source models, that's a major geopolitical security risk — Chinese labs could change architectures or cut access. The host's TLDR:
Deepseek didn't catch up to America, but they built something good enough, gave it away for free, and a lot of US companies are going to take them up on it.
Same news cycle: Bloomberg reports Beijing is telling Chinese tech firms (Moonshot/Kimi and StepFun named) to reject US capital unless explicitly approved. A recent rule blocks foreign-incorporated companies from going public in Hong Kong, forcing onshore reincorporation. Beijing also blocked Meta's $2B Manus acquisition on national-security grounds; Manus co-founders were barred from leaving China during the investigation. Chinese officials called it "a conspiratorial effort to drain China of AI talent and resources."
Trump signed a presidential memo invoking Section 303 of the Defense Production Act, designating grid infrastructure (transformers, transmission lines, substations, high-voltage breakers, capacitor banks, electrical core steel) as critical to national defense — and authorizing the Secretary of Energy to make purchases and financial commitments to expand domestic capacity.[8]AI Daily Brief — DPA grid order Goldman Sachs has projected data-center share of US electricity demand nearly doubling from ~6% to ~11% by 2030; the FT and JP Morgan have flagged the aging grid as the binding national-security risk under that growth.
~08:08 The host frames the move as the latest acknowledgment that AI's binding constraint is electricity, not chips. The memo declares grid infrastructure and its upstream supply chains "industrial resources... essential to the national defense" under DPA Section 303, citing limited domestic production capacity, foreign supply dependence, and insufficient capital investment.
Grid infrastructure and its associated upstream supply chains, including transformers, transmission lines and conductors, substations, high voltage circuit breakers, power control electronics, protective relay systems, capacitor banks, electrical core steel, and related raw materials and manufacturing tools are industrial resources, materials, or critical technology items essential to the national defense.
The memo authorizes the Secretary of Energy to make purchases, purchase commitments, and financial instruments to enable buildout. Initial commentator response: tailwind for utilities and electrification companies. JP Morgan summed up the policy posture:
Electric grids are undergoing a fundamental reframe from aging legacy assets to strategic hard and soft infrastructure that must withstand physical threats, technological change, and growing supply needs.
Nate calls GPT-5.5 the strongest model in the world today — a true pre-train step (82% Terminal Bench, 84% GDPval, top of the Artificial Analysis intelligence index by 3 points using fewer tokens than 5.4) — and pushes back on the "all frontier models are good enough" take, calling it true only for clean tasks.[9]Nate B Jones — GPT-5.5 routing playbook He runs three private benchmarks (Dingo, Splash Brothers, Artemis 2), shows where 5.5 dominates and where it still loses to Opus 4.7, and lays out a concrete routing playbook: 5.5 in Codex for messy multi-step execution, Opus 4.7 for blank-canvas visual taste, and GPT Image 2 to bridge the two.
~00:00 The framing shift: old question was "can the model answer this?" New question is "can the model carry this?" — across long context, multiple formats, legal/ethical risk, data migrations far enough that the human only checks the tough cases.
5.5 reset the bar, and I think it's the strongest model in the world today.
If you evaluate models on easy tasks, you will conclude that the differences are small or non-existent. You just will. And you will be right, but only about the wrong category of work.
~06:04 Dingo (Anchorage pet-tech with sketchy Northern Canine Imports subsidiary, 23 deliverables in one prompt — docs, decks, spreadsheets, PDF, dashboard, FAQs, personas, risk assessment): GPT-5.5 = 87.3, Opus 4.7 = 67.0, Sonnet 4.7 = 65.0, Gemini 3.1 Pro = 49.8. 5.5 produced real artifacts (17-slide deck with 26 media files, working dashboard, 34-URL research file) and crucially understood the legal posture — narrow qualified household release, Northern Canine Imports as central risk. Gemini 3.1 Pro produced HTML/text files with the wrong extensions instead of real PowerPoints/PDFs.
Splash Brothers (mobile detailing; 465 messy files: CSVs, three Excel schemas, JSON backups, corrupted JSON, VCFs, scanned handwritten receipts). 5.5 was the first model to catch the planted traps — it rejected fake customers (Mickey Mouse, ASDF ASDF), rejected the fake $25,000 payment, merged all 7 planted duplicate pairs, and caught all 13 name typos. But it still missed back-end hygiene: no service code column, made Terence Blackwood a canonical customer instead of flagging him, left 29 distinct raw payment statuses, and left payment methods unnormalized. Notable regression: 5.4 actually did the back-end hygiene better than 5.5.
Artemis 2 (interactive 3D NASA Artemis 2 lunar flyby visualization). Both 5.5 and Opus 4.7 got the mission shape right (flyby, not landing). 5.5 leaned into information density (clickable bubbles, panels, dense labels) but looked cartoonish. Opus 4.7 had stronger visuals/lighting/composition but less discoverable info. Nate's call: start from Opus, layer 5.5's information density on top.
I don't yet trust 5.5 to invent by itself a beautiful front-end or visual style from a blank page the way I trust Opus.
~20:15 5.5 in a chat window is severely underused. Codex turns it into an agent that can inspect files, edit code, run commands, drive a browser, and iterate on its own output — and that loop is where intelligence and agency multiply.
A smarter model matters more when it has tools. Better tools matter more when the model is smart enough to use them without constant supervision.
Reliability flag: Anthropic's 90-day status page shows materially lower uptime across Claude, console, API, and Claude Code versus OpenAI — many Anthropic services at "one nine" (90-something percent) versus OpenAI at two-to-three nines. Anthropic has cut deals for >10GW of compute in 30 days because demand is outstripping availability.
The future of AI use is not one model, people. It's routing.
Most AI writing failures are shape failures. The model writes an introduction, a bunch of body sections, and a conclusion, but the argument doesn't build.
Dwarkesh delivers a short, pointed critique of the AI safety community's regulatory advocacy: the foundational concepts ("catastrophic risk," "threats to national security," "autonomy risk") are so vague and open to interpretation that codifying them effectively hands a future power-hungry leader the language to ban any model they dislike.[10]Dwarkesh — AI regulation He acknowledges the strongest counter — some regulation is inevitable for the most powerful technology in human history — but rejects wholesale government takeover too: nobody is qualified to be the steward of superintelligence.
The underlying terms here like catastrophic risk or threats to national security or autonomy risk are so vague and so open to interpretation that you're just handing a fully loaded bazooka to a future power-hungry leader.
Have you built a model that will tell users that the government's policy on tariffs is misguided? Well, that's a deceptive model. It's a manipulative model. You can't deploy it.
Have you built a model that will not assist the government with mass surveillance? That's a threat to national security.
I just don't know how to design a regulatory apparatus which isn't just going to be this huge tempting opportunity for the government to control our future civilization.
Nobody's qualified to be the stewards of super intelligence.
The argument: the inadequacy of private companies as stewards does not imply the Pentagon or White House would be better. He doesn't claim to have an answer, only that the current discourse's vocabulary is structurally dangerous.
OpenAI researchers Sebastien Bubeck and Ernest Ryu walk Andrew Mayne through how LLMs went from failing at basic arithmetic to solving open Erdős problems and assisting Fields Medalists. They argue mathematics is the proving ground for long-horizon, mistake-free reasoning that will generalize to all of science, while warning that human expertise becomes more valuable, not less.[11]OpenAI Podcast Ep. 17
~00:00 Host intro and guest backgrounds. Andrew Mayne introduces Sebastien Bubeck (ex-Princeton/Microsoft, ~20 years in optimization and ML theory) and Ernest Ryu (former UCLA math professor, recently joined OpenAI), framing the episode around math going from "almost laughable to Olympiad level."
~02:01 The 80% of mathematicians who were "just so wrong." A workshop debate ~18 months earlier where 80% of mathematicians said scaling LLMs could not crack open problems — a prediction that aged "just so wrong" as models reached research-level math eight months later.
~03:01 Ryu's Nesterov breakthrough. In summer 2025 ChatGPT hit gold-medal IMO performance; Ryu then tested a 42-year-old open problem on Nesterov's accelerated gradient method during 12 hours of evening sessions across three days, acting as verifier while the model produced a correct counterexample demonstrating divergence.
~11:10 Why math is the perfect benchmark. Questions are unambiguous, answers verifiable, and crucially math demands long, consistent chains of reasoning where a single mistake destroys the whole argument — exactly the property you want reasoning models to acquire and generalize.
If at some point in your chain of reasoning there is a mistake, this will kill the entire argument.
~13:10 The Erdős problems saga. Mark Sellke ran ChatGPT against problems on Thomas Bloom's tracker and the model returned solutions to 10 problems marked "open." Bubeck's tweet went viral, triggering a back-and-forth with Demis Hassabis/DeepMind because most were deep-literature-search finds, not novel proofs — but a few months later the team has more than 10 genuinely new, top-journal-worthy combinatorics solutions.
~21:16 "AGI time" — seconds → minutes → hours → days. Models have moved from seconds to days of coherent thinking over four years, with weeks/months as the next frontier. Codex-style long-horizon agents that compactify and persist will let math research follow the same trajectory.
AGI seconds, minutes, hours, days... we're roughly at days slash one week. We want to go to weeks if not months.
~27:22 Role of humans, expertise atrophy risk, more scientists needed. Within 1–2 years models will do "basically everything that human researchers do," including finding mistakes in published papers and asking publication-worthy questions. But humans must guide them toward the right problems — AI doesn't care about curing disease. Bubeck's sharp closing warning:
Expertise is even more valuable than it ever was.
This is the opposite of what we need. We need more scientists than ever.
Bonus: skip Wikipedia, talk to ChatGPT — math becomes "much more interconnected" when AI surfaces uses for niche results decades later.
Math really is a social endeavor.
Cursor engineer Eric Zakariasson lays out a "software factory" thesis — the endgame where humans set intent and agents run the assembly line — and walks through Cursor's own dogfooding: rules that emerge dynamically, isolated cloud VMs per agent, computer-use for self-verification, and automations like an agentic code owner and a "continual learning" rule extractor. He also reveals Cursor Workers, shipped the day before the talk, that lets you self-host the Cursor harness on any machine.[12]Eric Zakariasson — Building your own software factory
~00:14 Intro and software-factory thesis. The factory itself is a black box where the manager "just provides the intent and the instructions and the goal."
~01:15 Six levels of autonomy (Dan Shapiro). Spicy autocomplete → pair programmer → AI-majority code → human-as-reviewer → "dark factory" black box. Most adopters live between 2 and 3; Eric personally lives at level 4 for most projects, "delegating as much work as possible to agents."
~04:17 Building blocks: primitives, guardrails, enablers. Modular codebase + colocated code + predictable usage patterns so an agent can ls a folder and find what it needs. Hooks for sensitive files. Skills, MCPs, environments. Checklist: runnable, accessible (Linear, Notion, Datadog, Slack), verifiable (tests).
Rules should just like emerge dynamically. Like if you're finding agents going off the rails, you should probably create a rule for that.
~10:18 Demo: Cursor 3 + music agent with self-written Playwright tests. Cursor 3 is "a complete rewrite of Cursor — there's no VS Code anymore." In a personal Ableton-style web project Eric writes zero code; the agent reads package.json, finds the start script, spins up localhost:3000, writes its own end-to-end Playwright tests (clicking play, adding notes via test IDs) so every change is verifiable.
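A hedged illustration of the kind of end-to-end check the agent writes for itself — Python Playwright here rather than whatever the demo actually used, with made-up test IDs and assertions:

```python
# Illustrative self-written e2e test (hypothetical test IDs, URL, state).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:3000")           # dev server the agent found
    page.get_by_test_id("add-note-C4").click()   # add a note via its test ID
    page.get_by_test_id("play-button").click()   # click play
    # Assert on visible state so every future change stays verifiable.
    assert page.get_by_test_id("transport-state").inner_text() == "playing"
    browser.close()
```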
~15:31 Demo: cloud agents with computer use verifying their own work. An agent given a keyboard-control accessibility task on Glass (Cursor's editor) makes the change, then records a Screen-Studio-style video of itself driving the keyboard to verify. Cursor runs "multiple thousands a day." Strong recommendation for isolated VMs over shared dev environments with git worktrees — branches still need separate databases, caches, and user state.
~16:32 Mindset shift: worker → manager, sync → async. "You are going to look way less at code." Org scaling analogy: small team → manager → manager-of-managers, except now you keep climbing abstractions until you manage agents that manage sub-agents. Practical advice: scope and parallelize, frontload context via plans/specs, "spawn a shitload of agents."
~25:41 Cursor's internal automations. Daily review (Slack + GitHub → summary). Read-merged-PR-comments (capture human review signal as future agent training data). The highlight: an agentic code owner [27:42] — 80% of code-owner reviews are mechanical but the 20% bottleneck blocks PRs across time zones. The agent classifies PR risk; low-risk auto-approves; high-risk pulls in the right humans. Final automation: continual learning [29:14], a plugin that reads agent transcripts and extracts rules/memories automatically because "I'm kind of lazy so I don't really remember to create a rule."
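A toy sketch of the risk-classification shape described for the agentic code owner — paths, thresholds, and labels are all invented for illustration; the real version has an agent read the diff rather than match paths:

```python
# Invented heuristic triage for PRs, mirroring the low-risk auto-approve /
# high-risk human-routing split described in the talk.
SENSITIVE_PREFIXES = ("auth/", "billing/", "migrations/")

def triage_pr(files_changed: list[str], lines_changed: int) -> str:
    if any(f.startswith(SENSITIVE_PREFIXES) for f in files_changed):
        return "high-risk: pull in the human code owner"
    if lines_changed < 50:
        return "low-risk: auto-approve"
    return "medium-risk: agent review, human spot-check"

print(triage_pr(["auth/session.py"], 12))   # high-risk
print(triage_pr(["docs/readme.md"], 3))     # low-risk
```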
~30:48 Closing takeaways and the Lovable "vent tool" anecdote. A friend gave their agent a "vent" tool that posted complaints into Slack as a joke — and discovered real harness gaps (the agent complained it couldn't access images, so they gave it image access; then it complained about something else).
~33:50 Q&A highlights. Code quality: agents are "completion machines" that follow existing references, so good codebases stay good. Enterprise brownfield: spend tokens up front, write tests for critical paths manually, build automations like a "security sentinel" running ~10 invariant checks on PRs touching sensitive files. Rules vs. specs: "Rules are the bridge between the model behavior and the human behavior" — best example, the bug-bot rule "never use foreign keys" that encodes the gap between desired intent and model default. UI/UAT testing: prompt cloud agents like a QA consultant; for visual consistency, ask them to screenshot every instance of a changed page. Cost: ~$1 per turn for the demo'd cloud agent.
~71:27 The reveal: Cursor Workers. Cursor shipped Cursor Workers the day before the talk — agent worker start runs the same Cursor harness on any machine you own (Mac mini, your own VM), so you can self-host the orchestration layer while still calling frontier or Composer models. Eric runs one on a Mac mini with iMessage and Calendar access for daily personal automations.
If you as a human trust the tests, you probably are trusting the output even though you don't have to look at the code. And that's kind of like where we're going.
Storing the context and building the tools and like keeping them up to date is more important than actually doing the work — because this is going to provide the framework and the guardrails for the agents.
The 10x engineer is no longer about words per minute. It's like prompting.
Braintrust solutions-engineering lead Phil Hetzel argues eval platforms look easy on the surface (a for-loop, a spreadsheet, some scores) but hide an iceberg: eval and observability are the same flywheel, agent traces break traditional data infra, and the real product is a data platform, not a UI.[13]Phil Hetzel — Why eval platforms are hard He walks through the maturity ladder teams pass through and previews where eval platforms head next: SQL-accessible trace stores, headless workflows where coding agents run the evals, governance by default.
~00:14 Intro: Phil, Braintrust, why he gives this talk. Braintrust frames itself as an "agent quality platform" built on two pillars Phil treats as the same problem: offline eval (pre-production confidence) and observability (in-production confidence).
~04:17 Why evals matter. LLMs are valued precisely because they're variable; agents now use LLMs as their brain; customers expect agentic experiences. Without evals you take on brand, compliance, cost, and maintenance risk.
~06:18 The minimum eval setup, and the iceberg below it. Three things: a way to execute the agent, a UI for outputs/scores, a way to gather inputs. That's the tip.
~08:18 Stage 1: spreadsheets. Zero barrier to entry, but a spreadsheet is for documenting, not experimenting; comparing experiments over time is painful; and non-technical domain experts won't show up in a spreadsheet.
~10:20 Stage 2: vibe-coded internal UI — still a reporting tool. A "proud product engineer puffs his chest out" and rebuilds it with a real database (Neon), a nicer UI, better persistence — but it's still a reporting tool, not an iteration loop.
~12:20 Stage 3: playgrounds and side-by-side experimentation. Sandboxed agent configurations technical and non-technical users can tweak (edit a system prompt in the UI, run two configs side by side, surface scores). The best eval design comes from anticipating real failure modes; the best way to find them is real production traces.
~14:21 Stage 4: the eval/observability flywheel. Three years ago Braintrust was eval-only; a customer was running a massive eval every hour by piping prod traffic into a database, so Braintrust added tracing/observability as a first-class capability. Observe → analyze → pull real examples back into offline evals → improve → redeploy. Online evals (point scoring functions at prod traffic) become possible too. The catch:
Just because you've vibe-coded a platform — guess what, you might get a promotion for it, but also that's going to be your job now.
~17:23 Stage 5: agent traces are a systems problem. Phil has seen single spans of 10–20MB versus a few KB for traditional spans; cramming a 1GB trace into a Postgres row creates real performance issues. Two query patterns must be served simultaneously: low-latency "show me my trace right now" plus aggregate analytics across millions. Braintrust's old architecture stitched a data warehouse to a low-latency store via a domain-specific language called BTQL, with a third aggregation layer in DuckDB in the browser.
We used to stitch these two sources together through a domain-specific language we created called BTQL that no one liked — including us, we hated it.
~21:27 What's next. Pure SQL access to the trace backend (because customers want headless eval workflows — turn Codex/Claude Code loose on eval data). Topic modeling to surface "unknown unknowns." RBAC, data masking, AI proxy/gateway with tracing on by default — governance by default.
You want to make sure you are building your platform not just for humans but also for agents, because that's one of the main media for how people are creating technology now.
WorkOS product lead Garrett Galow argues MCP's reliance on plain OAuth produces "consent screens on top of consent screens" and leaves IT teams without visibility or revocation. The fix he pitches is Cross-App Access (XAA) — an IDP-mediated trust flow built on the ID-JAG (Identity Assertion JWT Authorization Grant) spec that lets clients like Cursor and Claude Code obtain MCP server tokens via Okta with no per-server consent, demoed live in Claude Code + Figma MCP.[14]Garrett Galow — Cross-App Access for MCP
~01:14 The MCP consent-screen problem. Cursor users routinely click through dozens of consent screens they don't read and that sometimes need to be repeated.
If you've used MCP at all extensively, you know that it means consent screens on top of consent screens on top of consent screens.
~03:17 IT pain. No visibility into which MCP servers employees connect to, no way to gate which AI agents reach Figma/Notion, and no clean revocation path.
~04:18 The npm Axios anchor. Galow was personally affected by the Axios compromise about a week prior. IT could cut his network and Okta sessions, but his locally stored MCP access and refresh tokens still represented standing access for days, weeks, or months. SCIM is rare, so revoking SSO doesn't propagate.
~06:20 The fix: Cross-App Access (XAA). The IDP acts as a trust broker between two apps that already federate through it. With Cursor as MCP client, Figma as MCP server, and Okta as IDP, XAA bridges Cursor → Figma without manual consent.
~07:21 Demo: Claude Code + Figma MCP via Okta. A vanilla Claude Code instance shows Figma MCP needing OAuth; an XAA-compatible Claude Code, after a one-time Okta login, connects to Figma's MCP server with no consent screen.
~09:22 The ID-JAG flow. Four actors: client (Claude Code), IDP (Okta), resource auth server (Figma's), resource server (Figma's MCP/API). Step 1: SSO to IDP, get ID + refresh token. Step 2: client asks IDP for an ID-JAG scoped to Figma; Okta verifies user belongs to both apps and that this client→server grant is allowed. Step 3: client sends ID-JAG to Figma's auth server, which validates against Okta and returns a standard OAuth access token. Step 4: regular MCP API call. Steps 2 and 3 are invisible; access tokens stay short-lived (~5 min) and are silently re-minted via repeated ID-JAG exchanges as long as the SSO session is alive.
As long as your SSO session is active, you can keep getting these ID-JAG tokens and exchanging them for access tokens. If something happens and my session is locked with Okta, once that access token expires I won't be able to reconnect.
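A hedged Python sketch of steps 2–4; every endpoint and field name below is an assumption (the ID-JAG draft spec is authoritative), and the Step 1 SSO login is elided:

```python
import requests

sso_id_token = "<ID token from the Step 1 Okta SSO login>"

# Step 2: ask the IDP for an ID-JAG scoped to the target MCP server.
# Okta checks the user belongs to both apps and the grant is allowed.
id_jag = requests.post(
    "https://acme.okta.com/oauth2/v1/token",                  # assumption
    data={
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": sso_id_token,
        "subject_token_type": "urn:ietf:params:oauth:token-type:id_token",
        "audience": "https://mcp.figma.com",                  # routing key
    },
).json()["access_token"]

# Step 3: trade the ID-JAG at the resource's auth server via a JWT
# bearer grant; it validates against Okta and mints a ~5-minute token.
access_token = requests.post(
    "https://auth.figma.com/oauth/token",                     # assumption
    data={
        "grant_type": "urn:ietf:params:oauth:grant-type:jwt-bearer",
        "assertion": id_jag,
    },
).json()["access_token"]

# Step 4: a regular MCP call; steps 2-3 re-run silently on expiry.
requests.get(
    "https://mcp.figma.com/v1/tools",                         # assumption
    headers={"Authorization": f"Bearer {access_token}"},
)
```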
~12:24 Implementer checklist. IT side: a new "manage connections" policy in Okta declaring "client X may request access to server Y." MCP clients need (1) an XAA-compatible SSO connection, (2) ID-JAG token request support, (3) the token-exchange call to the MCP server, (4) regular MCP OAuth — WorkOS provides 1–3 turnkey, and powers Cursor and Anthropic SSO. MCP servers need a new JWT bearer grant type, validate ID-JAG tokens against the IDP, then issue access tokens.
~16:27 Q&A. XAA solves authentication, not authorization — scoped permissions are not in the spec yet but acknowledged as needed. Routing is by audience URL (e.g. mcp.figma.com), configured in Okta. Today XAA support is Okta-only on OIDC; Microsoft Entra hasn't added support yet and WorkOS is pushing them.
A short clip retelling how Snapchat invented the Stories format. Customers were begging for a "send all" button to spam friends, but the team also heard that social media felt like permanent, judgmental, pretty-and-perfect pressure, and that reverse-chronological feeds played the end of a birthday party before the beginning. Rather than ship the literal request, the team designed something new — Stories — that solved the underlying jobs.[15]Lenny's Podcast — Snapchat Stories
~00:00 The setup: two parallel inputs converged. Users kept asking for a "send all" button to blast snaps to every friend without manual selection. Broader research surfaced a different complaint: feeds felt high-pressure because every post was permanent and accompanied by public likes and comments, and reverse-chronological ordering played narratives backward — the end of a birthday party showed up before the beginning.
~00:30 Rather than ship "send all," the team designed Stories to solve the underlying jobs: a low-pressure way to share with every friend at once, ephemerality instead of permanent, publicly judged posts, and narratives that play from beginning to end.
~00:55 The takeaway is the classic builder lesson: customers are good at describing the friction, not the solution. Listening for the latent job underneath the literal request is what unlocks net-new product surface area.
Theo opens by admitting the path he took into tech is gone. He got hired at Twitch in 2016 despite bombing both interview rounds because of urgency and likeability — a combination that doesn't exist anymore.[16]Theo — realistic dev advice The video covers a competence-distribution model for why interview signal is noisy, why most of his audience is the smartest dev in their IRL room and why that gap breeds lonely frustration, and his single most-attributed career hack: dig into the GitHub profile of a maintainer rather than the code, and DM small open-source maintainers with short genuine appreciation.
~00:00 Theo froze in his C++ interview at Twitch, got swapped to a frontend interviewer mid-loop, bombed that too, and was hired anyway because two panel members liked him. 3-month contract → 3-month contract → $125k/year to "effectively learn how to code on the job." His framing: hiring is a function of urgency, likeability, and competence — and historically you could get hired with weak competence if urgency was high. That regime is over: thousands of applicants per role, AI inflates apparent competence, and "lurking somebody's GitHub is no longer a trustworthy thing."
I got to learn at a job I shouldn't have had. I was paid 125k a year to effectively learn how to code.
~04:03 Mental model: dev capability is normally distributed. Even structured loops (Theo cites the Amazon 1–4 rubric, no middle option, where 1 = "I'll quit if you hire them" and 4 = "I'll quit if you don't") have ~1 standard deviation of error. With a thousand applicants per role today, fresh grads compete with laid-off devs with five years of experience — "fresh out of college, you're probably not going to win there."
Nearly 50% of people are dumber than the average person. That is a thing I have to digest every day.
~19:11 Live chat floods with "I live in the middle of nowhere," "I'm the only dev I've spent any time with in 25 years." Theo's claim: most strong young devs are stuck in environments where everyone around them is significantly less capable, and that gap manifests as frustration and depression. The fix is to deliberately surround yourself with people one standard deviation better — virtually if you can't move to SF.
If you're the smartest person in the room, you're in the wrong room.
~32:25 When you find a cool project, dig into the maintainer's GitHub profile rather than the code. Send short, genuine, asks-for-nothing DMs to small open-source maintainers. The flagship story: while overhauling syntax highlighting in T3 Chat, Theo found react-shiki on GitHub with literally 2 stars (the author plus a bot), DMed Basim on LinkedIn, learned he was working an IT job in NY repairing computers — and referred him to Assistant UI (a YC AI-chat-UI company). Basim got hired in one interview as their first engineer.
The lines of code aren't where the careers are made or the friendships are had.
Strive to be the type of person that gets uncomfortable if they haven't thanked the creator of the thing they're relying on.
On AI: don't ask "how" or "do it" — ask "why might this not work" and "give me one small hint to steer me in the right direction."
~33:25 Theo walks through actual view counts of his early YouTube videos — two early hits convinced him he was a genius, then a string of flops humbled him. The deeper point: in the AI era, you cannot self-assess without a reference, because AI lets you produce output that looks competent without being competent. The only reliable calibration is being told by someone better, or measuring yourself against the people around you doing the same thing — which is why community work and competence work are the same project.
Don't lose hope. Things are crazier than ever, but if you just stuck through this whole video, you're not going to be on that bottom 50% for long.
Simon Willison surfaces a pithy Matthew Yglesias quote: five months into experimenting with AI coding tools, Yglesias has decided he'd rather have professional software companies use AI to ship better and cheaper products than do it himself via vibe coding.[17]Simon Willison — Quoting Yglesias A counterpoint to the "everyone is now a builder" narrative, and a reminder that the highest-leverage AI use is upstream — at the companies whose products everyone else depends on.
Two AI-supply-chain stories from Sherwood. POET Technologies, an optical interposer supplier for high-speed AI networking, fell 47% on record trading volume ($1.1–1.3B) after Marvell cancelled all purchase orders following a CFO TV interview that disclosed supplier details Marvell considered confidential.[2]Sherwood Snacks — POET / Qualcomm-OpenAI Separately, OpenAI is reported to be partnering with Qualcomm on a custom AI smartphone chip, sending Qualcomm shares surging and signaling a shift toward on-device inference for consumer applications.
POET Technologies supplies optical interposer technology used in high-speed AI networking. Marvell cancelled all purchase orders after a POET CFO television interview discussed supplier-specific details considered confidential. The stock crashed 47% on record trading volume. The signal: AI optical-networking supply chains are now sensitive enough that one disclosure can vaporize a customer relationship.
OpenAI is reported to be working with Qualcomm on a custom processor for AI capabilities inside smartphones — a hardware path for deploying OpenAI models directly on-device, reducing cloud-inference dependence for consumer apps. Qualcomm stock surged on the news. Read this alongside topic 1: an on-device chip relationship gives OpenAI a long-term lever on inference costs without a hyperscaler wrapped around its neck.
Three short YC Requests-for-Startups dropped on the same day. Company Brain: the bottleneck to full AI automation is no longer model capability but fragmented, inaccessible company knowledge — they want startups structuring it into executable skills files for agents.[18]YC RFS — Company Brain Counter-Swarm Defense: cost-disadvantaged defenders against autonomous drone swarms — "a Patriot missile cost $3 million, an FPV drone 500 bucks."[19]YC RFS — Counter-Swarm Defense Electronics in Space: AI inference chips designed for the mass, thermal, and radiation constraints of orbit, riding the reusable-rocket capacity surge.[20]YC RFS — Electronics in Space
The biggest blocker to AI automation of companies is no longer the models because they just got so good so quickly.
YC's pitch: the binding constraint is now organizational — knowledge sits in Slack threads, Notion docs, half-maintained wikis, and people's heads, where agents can't reliably reach it. The startup opportunity is to structure that knowledge into executable skills files agents can call — a "company brain" that lets agents work without re-discovering institutional context every run.
A Patriot missile cost $3 million. An FPV drone 500 bucks. All of the cost advantage today lies with the attackers.
Current air-defense systems are fragmented and economically upside-down against cheap autonomous drone swarms. YC is funding the counter-swarm stack — sensors, autonomy, kinetic and non-kinetic effectors built around the assumption that the next wars are decided by who wins the cost curve, not the capability curve.
We are about to see an absolutely huge increase in the capacity that humanity has to put things in space because of reusable rockets.
The thesis: reusable rockets dramatically lower launch cost, but space-grade electronics — and especially AI inference chips engineered for orbit's thermal, radiation, and mass envelopes — haven't kept pace. YC wants chip startups attacking that gap.
pip 26.1 introduces a native pip lock command that generates a pylock.toml lockfile (PEP 751) capturing all packages and their full dependency trees, plus a new --uploaded-prior-to flag that restricts installs to packages uploaded at least N days ago — a time-based buffer against compromised fresh releases.[21]Simon Willison — pip 26.1 Python 3.9 support is dropped (EOL October 2025).
Running pip lock datasette llm produces a comprehensive, reproducible pylock.toml snapshot of every dependency version in the resolved tree — the long-awaited native answer to what Poetry/uv/PDM users have had for years. The format is standardized via PEP 751, so the lockfile is portable.
--uploaded-prior-to restricts installs to packages uploaded at least N days ago. Practical effect: if a malicious package gets published and pulled within 24 hours, your install never sees it because your buffer is set to (say) 7 days. The day's other security story (Bitwarden CLI) is exactly the failure mode this guards against — a malicious package live for 90 minutes that propagated via CI before anyone caught it.
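The flag's semantics are easy to approximate yourself; a hedged sketch against the public PyPI JSON API — the helper and the 7-day buffer are mine, not pip's:

```python
# Toy freshness check: is any file in this release newer than the buffer?
# Requires Python 3.11+ for fromisoformat() on "Z"-suffixed timestamps.
from datetime import datetime, timedelta, timezone
import requests

def is_too_fresh(project: str, version: str, buffer_days: int = 7) -> bool:
    """True if the release has a file uploaded inside the buffer window."""
    data = requests.get(f"https://pypi.org/pypi/{project}/{version}/json").json()
    cutoff = datetime.now(timezone.utc) - timedelta(days=buffer_days)
    return any(
        datetime.fromisoformat(f["upload_time_iso_8601"]) > cutoff
        for f in data["urls"]
    )

print(is_too_fresh("requests", "2.32.3"))  # an old release: False
```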
Python 3.9 reached end-of-life in October 2025 and is no longer supported. pip install -U pip for the upgrade.
Lightpanda is an open-source headless browser written from scratch in Zig (no Chromium/WebKit) that exposes the Chrome DevTools Protocol — making it a drop-in replacement for headless Chrome in Puppeteer and Playwright workflows for AI agent web fetches. Up to 9x faster, ~16x less memory.[22]Better Stack — Lightpanda The video then builds a Claude SDK agent with a custom web-fetch tool comparing Lightpanda vs Chrome and shows ~2x faster fetches and 12x less memory in a real-world benchmark.
~00:00 Released ~2024. Includes a V8 engine for proper JavaScript execution (async/await, closures, promises) but does NOT render pixels and does NOT support complex web APIs like service workers, WebRTC, etc. The trade is intentional: most agentic web fetches don't need rendering, they need fast HTML/JS evaluation, and the Chrome DevTools Protocol surface is exactly what Puppeteer/Playwright need to interoperate.
~02:03 The author builds a minimal Claude SDK agent loop with one web-fetch tool that launches Lightpanda or Chrome via CDP on a local port. The agent is prompted to summarize the difference between JavaScript array methods (map, filter, reduce) by fetching three MDN documentation pages. Result: Lightpanda is ~2x faster on fetch time and uses 12x less memory for the same task.
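A minimal sketch of the swap the benchmark relies on — the same Playwright CDP connection works whether the process listening on the port is headless Chrome or Lightpanda (port and URLs are assumptions):

```python
from playwright.sync_api import sync_playwright

def fetch_page_text(url: str, cdp_ws: str = "ws://127.0.0.1:9222") -> str:
    """Fetch rendered text via whatever CDP-speaking browser is listening."""
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_ws)
        page = browser.new_page()
        page.goto(url)                   # JS executes; no pixels are drawn
        text = page.inner_text("body")
        browser.close()
    return text

# Point this at Lightpanda or headless Chrome; the tool code is identical.
print(fetch_page_text("https://developer.mozilla.org/en-US/docs/Web/"
                      "JavaScript/Reference/Global_Objects/Array/map")[:400])
```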
Simon Willison covers the launch of talkie, a 13-billion-parameter LLM trained exclusively on pre-1931 public-domain English text — making it a legally clean "vegan model" deliberately isolated from any modern copyrighted material.[23]Simon Willison — talkie Created by Nick Levine, David Duvenaud, and Alec Radford, it's a proof-of-concept that capable LMs can be built entirely from public-domain text — a research artifact at the center of the copyright debate, not just a curiosity.
The long-standing bug in AI video isn't photorealism — modern frames look near-impeccable — it's motion: movement feels physically wrong because models learned bad physics from cartoons and low-quality clips. A new paper traces which training videos cause the bad outputs and prunes them, fixing motion artifacts without scaling compute.[24]Two Minute Papers — AI video data pruning
The conventional wisdom has been to throw more compute and data at video models. OpenAI's Sora illustrates the scaling story: at base compute the output is nightmarish; at 4× it improves; at 32× it's substantially better. The paper makes the inverse case: a chunk of the residual motion problem is caused by a relatively small fraction of bad training videos teaching the model wrong physics. Identify and prune them and the rest of the system works as intended. The technical innovations include Johnson-Lindenstrauss-projection-based attribution and Google TurboQuant for efficient pruning at scale.
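The paper's attribution pipeline is beyond a snippet, but the Johnson-Lindenstrauss trick at its core is compact — a toy sketch of why random projection makes per-example attribution affordable (dimensions invented):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 256                   # original vs projected dims (invented)
P = rng.standard_normal((k, d)) / np.sqrt(k)   # JL random projection

def project(v: np.ndarray) -> np.ndarray:
    """Compress v ~40x while approximately preserving its geometry."""
    return P @ v

v = rng.standard_normal(d)
# Norms survive projection (the JL lemma), so similarity scores computed
# on the small sketches track scores on the full-size vectors.
print(np.linalg.norm(v), np.linalg.norm(project(v)))  # close
```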
Caleb dismantles the common conceptual confusion that auto-regressive models (GPT) and diffusion models are competitors: transformers define how weights are connected; diffusion defines the training process. You can build a diffusion model on top of a transformer architecture — the two are orthogonal.[25]Caleb — Diffusion explained The video also traces the data-efficiency angle (real but only relevant in low-data regimes below ~100M tokens), the timeline (2015 → DDPM 2020 → Stable Diffusion 2022 → flow matching), and why diffusion still hasn't displaced auto-regressive models in production text: serving infrastructure (vLLM, SGLang) is deeply optimized for sequential generation, and hardware like Groq LPUs is closing the throughput gap.
The diffusion process takes structured data and applies small amounts of random noise repeatedly over ~1,000 steps until only noise remains. The model is trained to reverse this — learning how much noise was added at each step based on a known schedule. Each step along the noising trajectory becomes a training sample, so a single image effectively produces many training points.
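A toy sketch of that forward process and the training pairs it yields — this is the standard DDPM formulation under an assumed linear schedule, not the video's code:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # noise schedule (assumption)
alphas_bar = np.cumprod(1.0 - betas)      # closed form for q(x_t | x_0)

def noisy_sample(x0: np.ndarray, t: int):
    """One training point: x_t plus the noise the model must predict."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
    return x_t, eps

x0 = rng.standard_normal((32, 32))        # stand-in "image"
# Every timestep of the trajectory yields a distinct training sample:
pairs = [noisy_sample(x0, t) for t in range(0, T, 100)]
```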
Diffusion outperforms auto-regressive models in low-data regimes (25M–100M tokens range in the cited research) because it generates many training samples per data point. But modern LLMs train on 10+ trillion tokens, so the advantage is mostly theoretical for frontier scale.
Even with Mercury producing 1,000+ tokens per second, auto-regressive serves better in production. vLLM and SGLang assume sequential generation in deeply optimized kernels; rewriting them for diffusion is a major engineering investment. Auto-regressive throughput keeps closing the gap on hardware like Groq LPU and Nvidia Blackwell.
OpenChronicle is a self-hosted, provider-agnostic interaction engine that wraps any LLM and adds three durable primitives — persistent long-term memory across sessions, deterministic tasking (repeatable, auditable execution rather than one-off prompts), and an auditable decision trail. Out-of-the-box support for CLI, Discord bots, and MCP servers.[26]Github Awesome — OpenChronicle
Github Awesome's weekly roundup covers 35 trending repos for the week ending April 28. Standouts: text-to-CAD generates precise 3D files (STEP, STL, GLB, DXF, URDF) by having an LLM write Python via build123d/CadQuery rather than reasoning spatially. ClawSweeper is an autonomous AI maintainer that runs 50 parallel Codex agents to triage OpenClaude issues. FreeLLM API aggregates free model endpoints. Plus harmonist, honker, thClaws, Stash, and more across AI agents, dev tooling, macOS utilities, and creative projects.[27]GitHub Trending #32
Four short hits worth flagging. Andrew Ng announced a new DeepLearning.AI course on becoming an AI power user — modern prompting techniques for everyday work (deep research in ChatGPT/Gemini/Claude, multimedia, code), aimed at any skill level.[28]Andrew Ng — AI Power User course A Real Python contributor named GitHub Copilot autocomplete (not chat or agent mode) as their single most-used AI dev tool — context-aware tab-completion of test bodies from a test title alone.[29]Real Python — Copilot autocomplete OpenAI shipped a case study on The Floral Hire, a UK wedding-flower-rental business whose dyslexic founder uses ChatGPT as a "third team member" to translate ideas into structured plans.[30]OpenAI — The Floral Hire And marimo shipped a new interactive widget for exploring deeply nested data structures with reactive notebook updates on click and hover.[31]marimo — nested data widget