Anthropic's Mythos found 10,000 bugs nobody can patch

May 22, 2026

47 topics · 51 sources

AI Models AI Future
Anthropic The Batch Last Week in AI

Anthropic's unreleased Mythos found 10,000+ bugs nobody can patch fast enough

Anthropic's Project Glasswing — a cybersecurity initiative with ~50 partners — used the unreleased Claude Mythos Preview to surface over 10,000 high- or critical-severity vulnerabilities, with external validators confirming 90.6% of assessed findings valid and some partners' bug-find rates jumping more than tenfold.[1]Anthropic — Project Glasswing initial update This is the same model that the UK AI Security Institute says can reliably run attacks that take humans 3 hours, and that researchers used to find a kernel exploit in Apple's security.[2]The Batch #354 — LLMs enabling industrial-scale cyberattacks Anthropic's own framing: the bottleneck has flipped from finding bugs to verifying, disclosing, and patching them.

Glasswing by the numbers

Launched May 2026, Glasswing scanned 1,000+ open-source projects and produced 6,202 high/critical findings; independent firms verified 90.6% as valid and confirmed 62.4% as high or critical severity.[1]Anthropic — Project Glasswing Mythos Preview is being held back from public release due to "insufficient safeguards." External validators: the UK AI Security Institute (confirmed Mythos solved both cyber-range simulations end-to-end), Mozilla (271 vulnerabilities found in Firefox 150, 10x prior rates), and offensive-security firm XBOW. A standout find: CVE-2026-5194, a certificate-forging bug in wolfSSL enabling fake-website attacks. Anthropic also shipped Claude Security for enterprise scanning and a Cyber Verification Program for researchers.

Progress on software security used to be limited by how quickly we could find new vulnerabilities. Now it's limited by how quickly we can verify, disclose, and patch.

The Batch: industrial-scale cyberattacks

Google researchers documented LLMs discovering zero-days in widely-used web admin tools before disclosure, plus morphing malware that rewrites decryption routines to evade antivirus.[2]The Batch #354 — cybersecurity alarms The UK AISI reports both Claude Mythos Preview and GPT-5.5 can reliably execute attacks humans need 3 hours for (up from a prior 1-hour forecast), and longer with more tokens. Google's conclusion echoes Anthropic's: models may exploit bugs faster than cyber teams can patch.

The R&D-automation angle

Last Week in AI ties Mythos to Andy Jones (now at Anthropic), whose independent work trained an AI to master a scalable game with a tunable complexity knob.[3]Last Week in AI — The bio-weapon version of Mythos The argument: if a model can implement an end-to-end pipeline as non-trivial as AlphaZero, that's an early signal of AI automating meaningful parts of R&D — with the "bio-weapon version" framing pointing at the same capability turned toward dangerous domains.

Tools: Claude Mythos Preview, Claude Security, GPT-5.5, wolfSSL (CVE-2026-5194)
AI Models Industry
Fireship Google DeepMind

Google's I/O endgame: Gemini goes Omni, TPUs split in two

Google I/O 2026 leaned all the way into the "agentic Gemini era": Gemini Omni (any-in/any-out multimodal), Gemini Flash 3.5 (benchmarks near Opus 4.7 and GPT-5.5 but ~3x pricier than the prior Flash and ~30x the original Gemini 1.5 Flash), a TPU lineup split into training (TPU-T) and inference (TPU-I) chips, and the Anti-gravity IDE (ex-Windsurf) recast as an agent manager.[4]Fireship — Google's AI endgame at I/O 2026 Alongside it, SynthID 2 picked up OpenAI, Kakao, and ElevenLabs as adopters.[5]Google DeepMind — SynthID expanding to more partners

The headline announcements

Google's token throughput has grown from 9.7 trillion/month two years ago to 3.2 quadrillion/month today.[4]Fireship — I/O 2026 Gemini Omni is positioned as a world-model-style system handling text, video, and sound in and out. Gemini Flash 3.5 claims near-parity with Opus 4.7 and GPT-5.5 on speed-vs-intelligence — but Gemini 3.5 Pro was notably not announced (expected later in summer). A new "Neural Expressive" design system generates UI on demand (diagrams, timelines, mini-apps from prompts).

Anti-gravity (formerly Windsurf) now mirrors OpenAI Codex's agent-management focus; an on-stage demo built a full OS from scratch over ~12 hours and billions of tokens, then live-patched missing GPU drivers to run Doom. For web devs, Chrome's new HTML-on-Canvas API renders native HTML elements directly into WebGL/WebGPU canvases.

Google is no longer trying to organize the world's information with blue hyperlinks… Instead, Google is trying to become the interface to reality itself before Anthropic and OpenAI create better realities.

SynthID 2 picks up partners

OpenAI, Kakao, and ElevenLabs are adopting SynthID 2 for content watermarking, and Google is adding C2PA content-credentials verification across its products.[5]Google DeepMind — SynthID The SynthID detector (already used by millions in the Gemini app) is expanding to Google Search (via circle-to-search) and Chrome (right-click), so users can ask whether an image was AI-generated and get a clear answer with context.

Tools: Gemini Omni, Gemini Flash 3.5, Anti-gravity IDE, TPU-T, TPU-I, Neural Expressive, SynthID 2, C2PA, Chrome HTML-on-Canvas
Industry AI Tools
OpenAI AICodeKing OpenAI

Codex got crowned and capped in the same week

Gartner named OpenAI a Leader in its April 2026 Magic Quadrant for Enterprise AI Coding Agents, citing Codex's agentic dev, OS-level sandboxing, and governance — Codex now sees 4M weekly users across Cisco, Datadog, Dell, and NVIDIA.[6]OpenAI — named a Leader in enterprise coding agents by Gartner Days later, OpenAI quietly halved Codex consumer rate limits with no acknowledgment, which AICodeKing reads as a deliberate bait-and-switch as enterprise compute demand ramps.[7]AICodeKing — Codex limits reduced by 50% Cisco, meanwhile, says Codex wrote the majority of its AI Defense platform, compressing quarters into weeks.[8]OpenAI — Cisco builds AI Defense with Codex

The Gartner crown

Gartner highlighted Codex's broad surface — app, IDE extensions, CLI, SDKs, cloud orchestration — plus enterprise controls (approval gates, RBAC, auditable workspace governance).[6]OpenAI — Gartner Leader Since the evaluation, OpenAI shipped GPT-5.5, Codex Security with GPT-5.5-Cyber, mobile, Remote SSH, scoped access tokens, HIPAA-compliant deployment, Codex on Amazon Bedrock, and GSI partnerships (Accenture, Capgemini, Cognizant, Infosys, PwC, TCS). Until June 12, eligible enterprise accounts can request two months of free Codex for new users.

Cisco case study

Cisco deployed Codex across every engineer on its AI Defense team; multi-quarter features dropped to weeks, and the open-source tool defense-claw went from idea to community use in under a week.[8]OpenAI — Cisco

There's a fundamental change in the frame of reference… they're just thinking how long will a Codex run take to be able to get these things done.

…and the cap

Users reported Codex hitting limits ~2x faster than before, with OpenAI silent for 11-12 hours. AICodeKing reads it as intentional: Codex was a loss-leader to capture users, and consumer limits get squeezed as enterprise contracts (served by dedicated deployed models) kick in.[7]AICodeKing — Codex limits halved His recommended escapes: GLM Coding Plan ($345/yr, plus an $80 migration offer for Codex refugees through July 19), Anti-gravity ($20/mo), Verdant ("what anti-gravity wanted to be," with a manager mode), Command Code ($1/mo with ~$40 of DeepSeek V4 Pro), and Kimiko ($19-$199/mo).

Either they take your data or your money or both.
Tools: OpenAI Codex, GPT-5.5, GPT-5.5-Cyber, Codex Security, Amazon Bedrock, GLM Coding Plan, Anti-gravity, Verdant, Command Code, DeepSeek V4 Pro, Kimiko, defense-claw
Industry
Simon Willison

The 'Active Listening' ad service that never listened to anything

The FTC settled with Cox Media Group and two other firms for nearly $1M after they marketed an "Active Listening" ad service claiming smart devices captured "real-time intent data by listening to our conversations" — when the FTC found it didn't use voice data at all, just resold broker email lists at a markup.[9]Simon Willison — FTC Active Listening settlement Simon Willison, who theorized exactly this in 2024, gets his vindication.

Willison had argued in September 2024 that the "active listening" language was marketing hype from a team that got "over-excited" without understanding the tech or legal exposure. The FTC's complaint confirmed it: the service "did not, in fact, listen in on consumers' conversations or use voice data at all."[9]Simon Willison — FTC Active Listening The FTC also rejected the consent fig leaf.

Clicking through mandatory terms of service does not constitute 'opt-in consent' for such an invasive service or for use of consumers' voice data from inside their homes.
Industry
Simon Willison

AI's HBM appetite is repricing the cheap smartphone out of existence

AI data-center demand for HBM memory is eating wafer capacity that used to make consumer DDR and LPDDR, pushing up phone and PC prices — with sub-$100 budget devices in Africa and South Asia hit hardest.[10]Simon Willison — The memory shortage The structural killer: a single gigabyte of HBM consumes more than 3x the wafer capacity of a gigabyte of DDR/LPDDR.

David Oks's piece ("AI is killing the cheap smartphone") notes only three major memory makers exist, all with fixed wafer capacity split across DDR, LPDDR, and HBM.[10]Simon Willison — memory shortage HBM's share of that fixed capacity has surged from 2% to an expected 20% by end of 2026. Manufacturers deliberately under-provision fabs (a lesson from competitors who over-built and failed), so the shortfall can't be offset by new capacity, and HBM's fat margins lock the constraint in for years.

A single gigabyte of HBM consumes more than three times the wafer capacity that a gigabyte of DDR or LPDDR does.
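
A back-of-the-envelope sketch of that trade-off. It assumes total wafer capacity is fixed at 1.0, that the 2% and 20% figures are shares of wafer capacity, and that HBM costs exactly 3x the wafer area per gigabyte (the source says "more than" 3x, so this is a lower bound). The function name and units are illustrative only.

```python
def memory_output(hbm_wafer_share, hbm_cost_per_gb=3.0):
    """Return (commodity_gb, hbm_gb) in units where spending the whole
    wafer supply on DDR/LPDDR would yield 1.0 GB.

    hbm_cost_per_gb is an assumption: wafer area per GB of HBM relative
    to DDR/LPDDR, set to the source's lower bound of 3x.
    """
    commodity_gb = 1.0 - hbm_wafer_share
    hbm_gb = hbm_wafer_share / hbm_cost_per_gb
    return commodity_gb, hbm_gb


before_commodity, before_hbm = memory_output(0.02)
after_commodity, after_hbm = memory_output(0.20)

# Relative drop in DDR/LPDDR output when HBM's wafer share goes 2% -> 20%.
commodity_drop = 1 - after_commodity / before_commodity

# Change in total gigabytes produced across all memory types.
total_change = (after_commodity + after_hbm) / (before_commodity + before_hbm) - 1

print(f"commodity DRAM output change: -{commodity_drop:.1%}")
print(f"total GB output change: {total_change:.1%}")
```

Under these assumptions commodity DRAM output falls by roughly 18%, and total gigabytes shipped fall too, because every wafer moved to HBM yields only a third as many gigabytes.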
AI Models AI Tools AI Future
The Batch

The Batch #354: Hermes overtakes OpenClaw, Thinking Machines kills turn-taking, benchmarks miss the workforce

Beyond the cyber alarms, The Batch #354 carries three notable items: Nous Research's open-source Hermes Agent passing OpenClaw on OpenRouter's token leaderboard, Thinking Machines Lab's 276B-MoE TML-Interaction-Small that responds in 0.40s without turn-taking, and a CMU/Stanford study showing agent benchmarks overindex on software/math while ignoring 18.2M administrative and 11M management workers.[2]The Batch #354

Hermes Agent overtakes OpenClaw

Nous Research's Hermes Agent (launched Feb 2026) surpassed OpenClaw on OpenRouter's token-consumption leaderboard.[2]The Batch #354 — Hermes vs OpenClaw It runs an agentic loop assembling personality, instructions, tools, skills, and memory; auto-creates skills from successful tasks in SKILL.md format; a Curator archives unused skills after 90 days; supports ~20 messaging services and persistent goal-tracking via a judge model. Some users note it's less token-efficient.

TML-Interaction-Small: no turn-taking

Thinking Machines Lab's 276B-param MoE (12B active) processes audio, video, and text concurrently in 200ms micro-turns, responding in 0.40s vs Gemini-3.1-flash-preview (0.57s) and GPT-Realtime-2 (1.18s).[2]The Batch #354 — TML-Interaction-Small On interactivity (interruptions/interjections) it scored 77.8 avg quality vs GPT-Realtime-2's 47.8 and Gemini's 45.5; on Audio MultiChallenge reasoning, 43.4% (behind GPT-Realtime-2's 48.5%). Closed research preview coming.

Benchmarks vs. the real workforce

CMU and Stanford (led by Zora Z. Wang) mapped 10,000+ examples from 43 agent benchmarks against O*NET labor data.[2]The Batch #354 — agent benchmarks misaligned Benchmarks hold 8,622 "computer and mathematical" examples for a 5.2M-worker field, but only 3,186 examples for the 18.2M-worker admin sector and 676 for 11M managers. GDPval covered best (47.8% of work activities, 58.5% of skills). The takeaway: large untapped agent opportunity in admin, finance, and management.
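
The imbalance is easier to see as benchmark examples per million workers. A small sketch using only the figures quoted above; the dictionary layout and function name are illustrative.

```python
# (benchmark examples, workers in millions), as quoted in the summary above.
coverage = {
    "computer and mathematical": (8622, 5.2),
    "administrative": (3186, 18.2),
    "management": (676, 11.0),
}


def examples_per_million(examples, workers_millions):
    """Benchmark examples available per million workers in a field."""
    return examples / workers_millions


density = {field: examples_per_million(*row) for field, row in coverage.items()}

for field, value in sorted(density.items(), key=lambda kv: -kv[1]):
    print(f"{field:28s} {value:8.0f} examples per million workers")
```

On these numbers, software and math work gets roughly 9x the benchmark attention per worker that administrative work does, and roughly 27x that of management.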

Tools: Hermes Agent, OpenClaw, OpenRouter, TML-Interaction-Small, GPT-Realtime-2, Gemini-3.1-flash-preview, SWE-bench, WebArena, GDPval
Industry
Sherwood Snacks

Washington put $2B (and equity) into nine quantum startups

The Trump administration awarded $2B in grants — with government equity stakes — to nine quantum computing companies, IBM taking $1B to launch a quantum-foundry subsidiary called "Anderon." Sector stocks jumped 12-33%.[11]Sherwood Snacks — Quantum hits excited state The same newsletter notes Nvidia fell post-earnings for a 4th straight quarter — but recovered and beat the S&P within three months each prior time.

The quantum grants

IBM ($1B, +12%) is matching the grant dollar-for-dollar and billing Anderon as "America's first pure-play quantum foundry."[11]Sherwood Snacks — quantum Others: GlobalFoundries ($375M, +15%), D-Wave Quantum ($100M, +33%), Rigetti ($100M, +30%), Infleqtion ($100M, +31%), Atom Computing, PsiQuantum, Quantinuum ($100M each), and Diraq ($38M). Infleqtion's CEO called the selection "a very rigorous technical evaluation over many, many months."

Nvidia's post-earnings dip pattern

Nvidia shares declined for a 4th consecutive time after an earnings beat, but three months after each prior sell-off, shares recovered and outperformed the S&P 500. It holds the 2nd-best revenue growth among S&P 100 names and is 4th-cheapest on PEG, suggesting another buy-the-dip setup.

Also in Snacks

  • Tesla FSD in China — Full Self-Driving (Supervised) rolling out after Musk's China visit.
  • Spotify x UMG — Premium users will be able to make AI remixes/covers of Universal Music Group songs.
  • GTA VI — confirmed for November 2026 (prediction markets: 90% before Dec 1).
  • SoftBank — best trading day since 2000 on OpenAI IPO news.
  • Misc — Blockchain.com confidential IPO filing; Walmart weak Q2 guidance flagging low-income stress; Waymo Atlanta service pause after flooding incidents.
Industry
Tech Brew

SpaceX's $1.75T IPO is really a rocket-shaped bet on Elon

SpaceX filed for an IPO targeting a $1.75 trillion valuation while posting a $4.9B net loss — repositioning itself as an AI infrastructure company (citing a $26.5T AI market and Anthropic's $15B/year compute commitment), with Musk retaining 85% of voting power as CEO, CTO, and chairman.[12]Tech Brew — How Elon Musk can have his cake

Tech Brew's Whizy Kim frames the filing as one of the largest in history, structured to give retail investors essentially no governance check despite reserving a high share allocation for them.[12]Tech Brew — SpaceX IPO The prospectus pivots the narrative from rockets to AI, leaning on Anthropic's $15B annual compute commitment for credibility. SpaceX also unexpectedly holds significant Bitcoin reserves, and third-gen Starship launches have repeatedly slipped. The piece argues Tesla could suffer as SpaceX competes for the same investor dollars and Musk's attention.

It isn't really about rockets or AI—it's a bet on Elon Musk himself.
Industry
Data Science Weekly

Data Science Weekly: who's going to save you?

This week's Data Science Weekly argues data professionals need a corporate sponsor and shouldn't put all their "political eggs in one basket" as AI reshapes job markets and headcount.[13]Data Science Weekly — Who is going to save you?

The May 22 paid issue explores career resilience for data scientists and analysts amid AI automation pressure — the risks of depending on a single executive sponsor or internal ally, and the broader trend of AI thinning analytics and data-science teams.[13]Data Science Weekly The body is paywalled; the argument is carried by the subtitle and framing.

Podcast
Dwarkesh Patel

Dwarkesh Patel Interviews Reiner Pope: Chip Design from the Bottom Up

MatX CEO Reiner Pope builds AI chip design from logic gates up — multipliers, systolic arrays, clock cycles, TPU-vs-GPU — with one recurring thesis: the dominant cost in hardware is moving data, not doing math.[14]Dwarkesh Patel × Reiner Pope — Chip design from the bottom up The clearest mental model of why low-precision arithmetic and Tensor Cores exist that you'll find.

~00:00 Logic gates and the MAC primitive. Matrix multiply is a triple for-loop, so a multiply-accumulate happens at every step; accumulation needs higher precision than multiplication because summation errors compound (e.g. 4-bit multiply with 8-bit add).[14]Dwarkesh × Reiner Pope
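
A minimal sketch of the triple loop, with the multiply-accumulate (MAC) made explicit. The example matrices are made up; they use 4-bit inputs (0 to 15) to show why the accumulator has to be wider than the multiplier's inputs.

```python
def matmul_mac(a, b):
    """Plain triple-loop matrix multiply; returns (result, mac_count)."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    macs = 0
    for i in range(n):          # rows of a
        for j in range(m):      # columns of b
            acc = 0             # the accumulator register
            for t in range(k):  # shared inner dimension
                acc += a[i][t] * b[t][j]  # one multiply-accumulate
                macs += 1
            out[i][j] = acc
    return out, macs


# 4-bit operands: every entry is in 0..15.
a = [[15, 15], [1, 2]]
b = [[15, 1], [15, 2]]
result, macs = matmul_mac(a, b)

largest_product = 15 * 15            # fits in 8 bits
largest_sum = result[0][0]           # sum of two such products
print(result, macs, largest_product.bit_length(), largest_sum.bit_length())
```

A single 4-bit by 4-bit product fits in 8 bits, but summing even two of them already needs 9, which is the reason the accumulate step runs at higher precision than the multiply.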

~04:01 Multiplying by hand. A P-bit by Q-bit multiply needs P×Q AND gates for partial products, summed via full adders (3-to-2 compressors) in a Wallace tree; since each full adder removes one bit from the pile of partial products, the count works out to roughly P×Q full adders.
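
A software sketch of the partial-product step, assuming unsigned operands. It models each AND gate as a bitwise AND of one bit from each input and lets Python's integer addition stand in for the adder tree, so it shows the gate count but not the Wallace-tree wiring.

```python
def multiply_from_partial_products(x, y, p, q):
    """Multiply a p-bit x by a q-bit y using one AND gate per bit pair.

    Returns (product, and_gate_count). The summation below stands in
    for the full-adder tree that hardware would use.
    """
    and_gates = 0
    total = 0
    for i in range(p):
        for j in range(q):
            bit = ((x >> i) & 1) & ((y >> j) & 1)  # one AND gate
            and_gates += 1
            total += bit << (i + j)  # partial product has weight 2^(i+j)
    return total, and_gates


product, gates = multiply_from_partial_products(13, 11, 4, 4)
print(product, gates)
```

For two 4-bit inputs the loop visits 16 bit pairs, matching the P×Q AND-gate count, and the weighted sum reproduces the ordinary product.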

~12:12 Quadratic area scaling. Chip area for arithmetic scales quadratically with bit width — "the single reason low-precision arithmetic has worked so well for neural nets." Nvidia historically doubled FLOPs per halved precision, but quadratic scaling implies the speedup should be larger; around B300, FP4 is 3x faster than FP8 (pure scaling says 4x).
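
The 4x figure follows directly from the quadratic model. A tiny sketch, assuming multiplier area is proportional to the square of the operand width and ignoring everything else on the chip (exponent handling, registers, wiring):

```python
def relative_multiplier_area(bits):
    """Idealized multiplier area, up to a constant: bits squared."""
    return bits ** 2


# How many 4-bit multipliers fit in the area of one 8-bit multiplier.
density_gain = relative_multiplier_area(8) / relative_multiplier_area(4)

# Gap between that ideal and the ~3x FP4-vs-FP8 speedup quoted above.
observed_speedup = 3.0
shortfall = observed_speedup / density_gain

print(density_gain, shortfall)
```

The idealized gain is 4x per halving of precision, so a reported 3x captures about three quarters of it; the remainder plausibly goes to the parts of the datapath that do not shrink quadratically.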

This quadratic scaling… is like the single reason why low precision arithmetic has worked so well for neural nets.

~16:20 The mux problem. In old CUDA cores, a register file feeds the MAC through multiplexers; with three inputs, ~7/8 of circuit area goes to reading/writing the register file (hidden data movement), not the logic that matters.

It's muxes all the way down.

~25:33 Systolic arrays / Tensor Cores. Bake two loop levels into hardware for X×Y compute at ~X communication: weights stay fixed locally and are reused across input vectors, trickle-fed via a daisy chain because "bandwidth equals die area." Older TPUs used 128×128 arrays.
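
A functional sketch of the weight-stationary idea. It is not cycle-accurate and does not model the daisy chain; it only counts how many values move versus how many multiply-accumulates happen when the weights are loaded once and reused. All names are illustrative.

```python
def weight_stationary(weights, inputs):
    """Multiply each input vector by a fixed weight grid.

    weights: X rows by Y columns, loaded into the array once.
    inputs:  list of length-X vectors streamed through afterwards.
    Returns (outputs, stats) where stats counts data movement and MACs.
    """
    rows, cols = len(weights), len(weights[0])
    stats = {"weight_loads": rows * cols, "inputs_streamed": 0, "macs": 0}
    outputs = []
    for vec in inputs:
        stats["inputs_streamed"] += rows      # X values enter the array
        acc = [0] * cols
        for r in range(rows):
            for c in range(cols):
                acc[c] += vec[r] * weights[r][c]  # MAC at cell (r, c)
                stats["macs"] += 1
        outputs.append(acc)
    return outputs, stats


weights = [[1, 2], [3, 4]]
inputs = [[1, 0], [0, 1], [1, 1]]
outputs, stats = weight_stationary(weights, inputs)
print(outputs, stats)
```

Each streamed vector brings in X values and triggers X×Y MACs, so the compute-to-communication ratio grows with the array width, and the one-time weight load is amortized over every vector that follows.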

Bandwidth equals die area.

~38:44 Clock cycles and critical paths. Chips sync ~every nanosecond; the slowest "cloud of logic" between registers (critical path) bounds clock speed, which is why two chips on TSMC 3nm can clock differently. Pipeline registers can halve path length but can't naively pipeline feedback loops (e.g. a running accumulator).

~51:52 FPGA vs ASIC. Same model, but ASICs are ~10x cheaper/more efficient at scale; FPGAs (LUTs = configurable truth tables, ~32 gates to emulate ~3 ASIC gates) win when designs change often and you want deterministic low latency (HFT at Jane Street). Non-determinism in CPUs comes mainly from caches; TPUs use software-managed scratchpads instead.

~75:09 GPU vs TPU. "A GPU has a lot of tiny tiny TPUs tiled across the whole chip" (Tensor Core ≈ MXU), while a TPU uses a few coarse-grained matrix units around a central vector unit. Coarse-graining enables bigger arrays but lowers vector-to-matrix bandwidth; MatX has discussed a "splittable systolic array" that acts as big or small arrays.

Tools: Nvidia GPUs, Tensor Cores, CUDA cores, TPU, MXU, Groq, FPGA, ASIC, TSMC, Verilog, MatX, splittable systolic array
Podcast Hot Take
Y Combinator

Eric Ries at YC: building incorruptible, mission-controlled companies

Lean Startup author Eric Ries argues shareholder primacy is a value-destroying 1980s invention, and that founders can use structural tools — PBCs, perpetual purpose trusts, industrial foundations — to build companies that outlast temporary managers and investors. Case studies: Costco, Novo Nordisk, and Anthropic.[15]Eric Ries at YC — defending against mediocrity and rot

~00:00 Why Incorruptible. The more valuable an organization, the more it becomes "a target" worth taking over.[15]Eric Ries at YC

We're in this era now where we have temporary organizations being led by temporary managers on behalf of temporary investors.

~02:03 The professor. An AI-plus-bioscience founder whose lawyer's "best practice" charter would legally force a sale even to "the most evil company in the world."

~08:09 Twilio's cautionary tale. Jeff Lawson built ~$4B revenue, stock up 390% since IPO, but accepted dual-class control that sunset after 7 years — and was fired 199 days after expiration by less than half a percent of shareholders.

~12:10 Saul Price → Costco. FedMart's "fiduciary to the customer" model (customers first, employees second, shareholders third); after investors locked Price out in 1975, FedMart went bankrupt and Price founded what became Costco.

Shareholder value is like the exhaust that comes out of the engine. When you take the exhaust pipe and put it in the intake and make that your explicit goal, now you don't stand for anything anymore.

~13:10 Philip Morris. ~$8B net income vs ~$600B/year in external costs; tobacco makes ~$6,000 from every customer who dies.

~23:16 PBCs and the history of shareholder primacy. The easiest fix is converting to a Public Benefit Corporation — "a two-page legal filing in Delaware." Shareholder primacy was redefined into "any lawful activity" by a small group of academics and judges in the 1960s-70s without any vote or law.

If it's a controversy, it can't be a consensus now, can it?

~34:17 Novo Nordisk. Its nonprofit-foundation trustees vetoed a ~$20B merger the for-profit board had signed; the company kept growing ~20%/year and later crested ~$600B, briefly exceeding Denmark's GDP. Industrial-foundation companies (e.g. Zeiss, 1885) are 6x more likely to survive to year 50 (60% vs 10%).

~45:23 Anthropic's long-term benefit trust. A perpetual purpose trust gives outside trustees power to appoint for-profit directors; Ries credits this "ethos plus integrity" structure for Anthropic's talent advantage and courage.

Ethos plus integrity equals incorruptible.
Best practice equals Kroger practice. This is someone who wants to be more like Kroger and less like Costco. Why would you want that?
Podcast Developer Tools
Real Python

Real Python #296: Polars schema drift, pip 26.1, and screening AI-slop PRs

Hosts Christopher Bailey and Christopher Trudeau cover CPython/Django releases, Python's get-related dunder methods, handling schema drift in Polars pipelines, pip 26.1, an inverse-Sapir-Whorf essay, the CADQuery 3D library, and Eric Matthes's GH Profiler for screening AI-slop pull requests.[16]Real Python Podcast #296

~00:00 Themes. Avoiding Polars schema problems and quickly profiling a GitHub user to gauge their contributions, framed around the rise of AI-generated slop PRs.[16]Real Python #296

Slop-a-palooza is now a thing.

~03:04 Releases. Python 3.14.5 removed the incremental GC after production issues; 3.15 beta 1 hit feature freeze (Oct target); PyPy 7.3.22 bugfix; Django 6.0.5/5.2.14 patched three security issues incl. an ASGI file-upload DoS. PEP 797 (shared object proxies) and PEP 828 (yield from in async generators) slipped to 3.16.

~05:05 Get-dunder deep dive. Stephen Gruppetta's tutorial through __getitem__, __getattr__, __getattribute__, and __get__ (descriptors).
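
For readers who want the four methods side by side, here is a compact sketch. The class names are made up for illustration and are not taken from the tutorial.

```python
class Upper:
    """Descriptor: __get__ runs when the attribute is found on the class."""

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class itself
        return instance.raw.upper()


class Record:
    shout = Upper()

    def __init__(self, raw):
        self.raw = raw

    def __getattribute__(self, name):
        # Called for every attribute lookup on the instance.
        if name == "secret":
            raise AttributeError(name)
        return object.__getattribute__(self, name)

    def __getattr__(self, name):
        # Fallback: called only after normal lookup raised AttributeError.
        return f"<missing {name}>"

    def __getitem__(self, index):
        # Called for square-bracket access: record[0], record[1:].
        return self.raw[index]


record = Record("abc")
print(record.raw)      # normal lookup via __getattribute__
print(record.shout)    # descriptor's __get__
print(record.secret)   # __getattribute__ raises, __getattr__ answers
print(record[0], record[1:])  # __getitem__
```

The ordering is the part that trips people up: `__getattribute__` always runs first, descriptors are resolved inside the default lookup, and `__getattr__` is only a last resort.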

~12:12 Polars schema drift. Tai Neworp's post classifies changes as additive, subtractive, type drift, and breaking; Polars infers CSV types from the first 100 rows (infer_schema=False forces all-string), across CSV, Parquet, Delta Lake, and Iceberg.
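
A library-free sketch of the first three categories, comparing an expected schema against an observed one as plain column-to-type mappings. The function and the example columns are hypothetical; the summary does not define what counts as "breaking", so that judgment is left to the caller.

```python
def classify_drift(expected, observed):
    """Compare two {column: dtype_name} mappings.

    Returns the columns that were added, removed, or changed type.
    This is an illustration, not the post's or Polars' implementation.
    """
    added = sorted(set(observed) - set(expected))
    removed = sorted(set(expected) - set(observed))
    retyped = sorted(
        col for col in expected.keys() & observed.keys()
        if expected[col] != observed[col]
    )
    return {"additive": added, "subtractive": removed, "type_drift": retyped}


expected = {"id": "Int64", "price": "Float64", "sku": "String"}
observed = {"id": "Int64", "price": "String", "region": "String"}

report = classify_drift(expected, observed)
print(report)
```

In a real pipeline the two mappings could be built from whatever schema object the reader exposes; running the check before any transformation turns a silent type change into an explicit, reviewable report.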

~16:17 pip 26.1. Drops Python 3.9, targets 3.10+; experimental install from pylock.toml (PEP 751); dependency cooldowns via --exclude-newer/--uploaded-prior-to (e.g. p3d to skip the last 3 days) to blunt supply-chain attacks.
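
The cooldown idea itself is simple to sketch: ignore any release uploaded within the last few days, so a freshly published malicious version has time to be caught. The snippet below is an illustration of that filter with made-up release data; it is not pip's implementation and does not call pip.

```python
from datetime import datetime, timedelta, timezone


def eligible_versions(releases, now, cooldown_days=3):
    """Keep versions uploaded strictly before (now - cooldown).

    releases: list of (version, upload_time) pairs, hypothetical data.
    """
    cutoff = now - timedelta(days=cooldown_days)
    return [version for version, uploaded in releases if uploaded < cutoff]


utc = timezone.utc
now = datetime(2026, 5, 22, tzinfo=utc)
releases = [
    ("1.0", datetime(2026, 5, 1, tzinfo=utc)),
    ("1.1", datetime(2026, 5, 20, tzinfo=utc)),
    ("1.2", datetime(2026, 5, 21, 23, 0, tzinfo=utc)),
]

allowed = eligible_versions(releases, now)
print(allowed)
```

With a three-day window only the oldest release survives the filter, which is the intended trade: slightly staler dependencies in exchange for a buffer against compromised uploads.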

~25:31 Inverse Sapir-Whorf. Luke Plant's essay on ideas your language won't let you express (Python devs can't conceive of out-of-order argument evaluation; Haskell's non-strict semantics permit it).

It's like C and Rust had a baby, and so it got rid of the crunchy parts of C.

~33:39 CADQuery. A Python parametric 3D CAD library on OCP (OpenCASCADE wrapper), outputting STEP/AMF/3MF/STL/DXF.

~37:40 GH Profiler. Eric Matthes's CLI reports account age and activity for a GitHub user/PR/issue — PRs in the last 21 days, issues opened/closed-as-not-planned, and duplicate-title detection to flag mass-submitted slop; has a --redact flag.

Tools: Polars, pip 26.1, CPython 3.14.5, Python 3.15 beta 1, PyPy 7.3.22, Django 6.0.5, PEP 751/pylock.toml, Parquet, Delta Lake, Apache Iceberg, CADQuery, OpenCASCADE, GH Profiler
Podcast AI Tools
DeepLearningAI

Brandon Waselnuk at AI Dev 26: Context Engines as the Missing Layer for Production Agents

Unblocked founder Brandon Waselnuk argues naive RAG, more MCPs, bigger context windows, and rules files all fail as agent context strategies — and that a proper context engine (unified ingestion, entity graphs, conflict resolution, token-optimized retrieval) cut task time 80% and token cost 50% in controlled tests.[17]Brandon Waselnuk at AI Dev 26

~00:08 Every agent is 'you on day one.' Technically capable but knowing nothing about how your company works; the context engine supplies institutional knowledge at machine speed.[17]Brandon Waselnuk — context engines

The gap is no longer intelligence. It's context.

~01:09 Three stall-out points. Curated docs rot; the MCP plateau ("satisfaction of search" — agents accept the first stale page they find); and the background-agent wall where static files aren't enough.

~06:14 Four myths. Naive RAG, enough MCPs, bigger context windows, and rules files (CLAUDE.md, routinely ignored) are each not a context engine.

~07:15 What a context engine must do. Know who's asking and their team, resolve document conflicts (a Slack thread between CTO and chief architect beats stale source code), respect permissions, and deliver token-optimized scoped responses.

~10:18 Unblocked architecture. API ingestion from Slack/Notion/GitHub → embeddings → multiple entity graphs → MCP/CLI/API egress. One customer: 115,000 repos, 30,000 developers.

~13:20 Two open-source tools. A social graph builder (contributor/ownership/expert map from a repo) and a repo rules agent (dedups CLAUDE.md/AGENTS.md rules, flags conflicts).

~19:25 The benchmark. With Unblocked: 10M tokens, 25 minutes. Without: 21M tokens, 2h32m — and the output would have taken down production if merged.
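
These raw figures line up with the headline claims of roughly 80% less time and 50% fewer tokens. A quick check using only the numbers quoted above; variable names are illustrative.

```python
with_engine = {"tokens_millions": 10, "minutes": 25}
without_engine = {"tokens_millions": 21, "minutes": 2 * 60 + 32}


def reduction(before, after):
    """Fractional saving when going from `before` to `after`."""
    return 1 - after / before


time_saved = reduction(without_engine["minutes"], with_engine["minutes"])
tokens_saved = reduction(
    without_engine["tokens_millions"], with_engine["tokens_millions"]
)

print(f"time saved:   {time_saved:.1%}")
print(f"tokens saved: {tokens_saved:.1%}")
```

The single run shown comes out slightly better than the rounded headline numbers on both axes.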

AI generated code should feel like it was written by someone who has been on your team for years.

~20:27 Hard lessons. Optimizing for access (more pipes) ≠ understanding; hiding conflicts (a 50/50 coin flip) causes bad outcomes; caching "correct" answers is a trap as PRs ship.

Tools: Unblocked, Claude Code, MCP, GitHub, GitLab, Notion, Slack, Jira, DataDog, Sentry, Stripe Minions, Ramp Spec
Podcast Developer Tools
DeepLearningAI

Jerry Liu at AI Dev 26: Why PDF Parsing Is Still Hard

LlamaIndex CEO Jerry Liu explains why PDF parsing remains unsolved — PDFs are display instructions, not semantic containers — and presents LlamaParse (commercial OCR), ParseBench (open benchmark), and LightParse (free open-source fast parser). Key finding: more reasoning tokens do NOT improve visual document accuracy.[18]Jerry Liu at AI Dev 26

~00:07 LlamaIndex context. 1B+ pages processed, 300,000 users; narrowed from a RAG framework to agentic document infrastructure.[18]Jerry Liu — PDF parsing

~05:10 Why PDFs are hard. A PDF is PostScript display instructions — character coordinates with no inherent table structure, reading order, or semantic labels. A table is just lines and positioned glyphs.
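
To make "just positioned glyphs" concrete, here is a toy heuristic that rebuilds text lines from coordinates alone. The input format, tolerance, and sample data are all assumptions for illustration; real PDFs add fonts, rotations, and multi-column layouts that break this quickly.

```python
def lines_from_glyphs(glyphs, line_tol=2.0):
    """Group (x, y, text) items into lines by similar y, then sort by x.

    y grows upward, as in PDF user space, so larger y means higher on
    the page. line_tol is a made-up tolerance for baseline jitter.
    """
    rows = []  # each row: (reference_y, [(x, text), ...])
    for x, y, text in sorted(glyphs, key=lambda g: (-g[1], g[0])):
        if rows and abs(rows[-1][0] - y) <= line_tol:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [" ".join(text for _, text in sorted(cells)) for _, cells in rows]


# A 2x2 "table" as a PDF would store it: unordered, slightly misaligned.
glyphs = [
    (60, 100, "Rev"),
    (10, 100, "Year"),
    (10, 80, "2025"),
    (60, 80.5, "42"),
]

print(lines_from_glyphs(glyphs))
```

The heuristic recovers plausible lines, but nothing in the output says that "Rev" is a header for "42"; the table structure has to be inferred, which is the hard part the talk is about.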

~09:16 Evolution of approaches. Heuristics (Tesseract, PyPDF) → specialized small models (Donut) → VLMs (GPT-4 Vision onward).

~11:16 The finding. Increasing reasoning tokens in frontier models does NOT improve visual understanding because post-training targets coding/math. VLM parsing is also expensive — Gemini Pro ~$0.08+/page; Opus 4.6 ~53% accuracy on ParseBench.

Increased thinking in the frontier models… generally does not correlate to increased visual understanding accuracy.

~16:18 ParseBench. 2,000 human-verified pages (financial, insurance, legal) measuring dense tables, charts, faithfulness, semantic formatting, and visual grounding. Leaderboard at parsbench.ai.

~19:20 LightParse. Free, open-source, no VLMs; outperforms PyPDF and other model-free parsers; Rust rewrite in progress; installable as an agent skill and a good first pass for Claude Code before handing complex pages to a VLM. Simon Willison gave it a shout-out.

~25:24 Market macro. 2023 basic RAG → 2025-26 full agents (50-80%+ work automation). The moat is no longer the harness (Claude Code is good) but the context and workflow layer. Customers: Carlyle Group, SAP.

The model harnesses are getting pretty good… the thing that provides the alpha or moat on top of this is really the context and the workflow layer.
Tools: LlamaParse, LightParse, ParseBench, Claude Code, Opus 4.6/4.7, GPT-4 Vision, Gemini Pro/Flash, Tesseract, PyPDF, Donut
Podcast AI Tools
DeepLearningAI

Or Dagan at AI Dev 26: Automatic Agent Optimization Across Accuracy, Cost, and Latency

AI21 Labs' Or Dagan presents AI21 Maestro, a two-phase system that automatically finds the optimal configuration of models, tools, and execution strategies for agents — hitting state-of-the-art on BrowseComp Plus via automated Pareto-frontier optimization instead of months of manual tuning.[19]Or Dagan at AI Dev 26

~00:08 The trilemma. Accuracy, cost, and latency feel zero-sum, and manual optimization is slow and not future-proof.[19]Or Dagan — agent optimization

~02:08 Configuration sweep. Testing 3-5 models against 3-4 retrieval tools yields a 15-point spread; the Pareto frontier becomes the baseline (GPT-5 + a latent-interaction retriever ~90% on BrowseComp Plus, prior SOTA).
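
For reference, a Pareto frontier over two objectives can be computed in a few lines. The configurations and numbers below are invented for illustration and are not results from the talk; the talk's version also includes latency as a third axis.

```python
def pareto_frontier(configs):
    """Return names of configs not dominated on (accuracy up, cost down).

    configs: list of (name, accuracy, cost). A config is dominated if
    another is at least as good on both axes and strictly better on one.
    """
    frontier = []
    for name, acc, cost in configs:
        dominated = any(
            other_acc >= acc and other_cost <= cost
            and (other_acc > acc or other_cost < cost)
            for _, other_acc, other_cost in configs
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical sweep results: (name, accuracy, cost per task in dollars).
sweep = [
    ("big-model + retriever-a", 0.90, 10.0),
    ("small-model + retriever-a", 0.75, 2.0),
    ("small-model + retriever-b", 0.70, 5.0),
    ("big-model + retriever-b", 0.90, 12.0),
]

print(pareto_frontier(sweep))
```

Everything off the frontier is strictly worse than some alternative, so only the frontier points are worth carrying forward as baselines.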

~05:10 Best-of-N. Running GPT-5 with latent interaction 4x beats prior SOTA with minimal latency cost; Minimax at 60% single-run becomes competitive run 8-16x in parallel, cheaper and faster.
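
Why a 60% model can become competitive is easiest to see with an idealized calculation. The sketch assumes the N runs are independent and that the correct answer is always picked whenever at least one run produces it; both assumptions are optimistic, so treat the result as an upper bound.

```python
def best_of_n_success(p_single, n):
    """Chance that at least one of n independent runs succeeds."""
    return 1 - (1 - p_single) ** n


for n in (1, 4, 8, 16):
    print(n, f"{best_of_n_success(0.60, n):.4f}")
```

In practice runs are correlated and the selector is imperfect, so real gains are smaller, but the shape of the curve explains why parallel sampling of a cheaper model is worth including in the sweep.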

~07:10 Ensemble portfolios. Inspired by Karpathy's LLM Council — non-overlapping model strengths; a 3-model ensemble beats any single model while cutting cost more than half and latency 20%.

~09:13 Configuration explosion. Models × tools × prompts × N-samples × ensembles × strategies makes manual search intractable.

Behind every top-one leaderboard spot… there are months and hundreds and thousands of dollars spent on researching all those configurations.

~12:14 AI21 Maestro. An offline phase trains an "action model" predicting accuracy/cost/latency for any config; at runtime it spends more if budget allows or stops early if a cheap model succeeds. Automatic, efficient, observable, and future-proof (only retrain the new model's configs).

~15:18 Deep Research Bench. Vertical scaling with dynamic rubrics and "anytime" generation that improves until time or money runs out.

Tools: AI21 Maestro, BrowseComp Plus, Deep Research Bench, GPT-5, Minimax, DSPy
Podcast Developer Tools
DeepLearningAI

Ara Khan at AI Dev 26: A Practical Framework for AI Agent Evals

Klient's Ara Khan argues most people are wrong about evals — either over-trusting benchmarks or dismissing them — and walks a three-level framework for building, interpreting, and using evals to iteratively improve coding agents.[20]Ara Khan at AI Dev 26

~01:09 Two wrong camps. The "objective metrics" camp that worships benchmark scores and the "vibes is king" camp that ignores numbers; both are wrong.[20]Ara Khan — evals

Evals are not the end all and be all. They're not completely useless. There are right ways to use them and there are wrong ways to use them.

~05:11 Three-level framework. L1: interpret others' evals (don't trust model-lab evals uncritically; find domain-specific benchmarks). L2: use evals to improve your agents. L3: build your own.

~09:16 SWE-bench saturation. Coding benchmarks got so gamed that frontier labs stopped reporting them; one model was "benchmark-maxed."

~11:17 Terminal Bench. 89 real-world SWE tasks (caching bugs, race conditions, frontend, DB) run on isolated Docker containers via Harbor to parallelize runs that took 6-7 hours; Modal as infra.

~21:25 Three zones. Zone 1: fix obvious harness failures. Zone 2: real hill-climbing (prompts, tool selection, retry logic). Zone 3: overfitting danger. Klient beat Claude Code on Opus 4.5 evals by tuning CPU/memory, raising timeouts, and improving thinking behavior.

Even if you get a good score you always need to make sure you're passing the vibe check.
Tools: Terminal Bench, Harbor, Modal, Claude Code, Codex
Podcast AI Tools
DeepLearningAI

João Moura at AI Dev 26: From Internal Coding Agents to Governed Production Workflows

CrewAI founder João Moura shares how Iris, their internal coding agent, now authors nearly half of all PRs at the company — and extracts enterprise lessons on reusable building blocks, human-in-the-loop via plain email, and the convergence of ad hoc and embedded workflows.[21]João Moura at AI Dev 26

~01:08 Iris. An internal coding agent built on CrewAI that engineers initially tried to break, now widely adopted.[21]João Moura — CrewAI

~04:10 ~50% of PRs. A designer with no coding background used Iris to find 130 hard-coded colors and propose refactors for engineers to approve. Iris is self-improving — writes its own skills, flows, and memory.

Iris updates itself, it has its own memories, writes its own skills, writes its own flows, and just keeps on going.

~06:14 Enterprise data. Runs-per-quarter growing across business domains; customers include AB InBev, Experian, and the DoD.

~08:14 Two forces. Ad hoc (disposable) vs embedded (process-critical) workflows are blurring; building is being commoditized.

~10:16 Plugging into Claude Code. CrewAI plugged its whole platform into Claude Code, a surprising unlock that removed friction between ad hoc and embedded use cases. They open-sourced skills.crewai.com and a "decide" skill encoding the company's decision framework.

~15:19 Human-in-the-loop via email. The most effective mechanism was simple: agents email humans, humans reply, agents continue — no new apps. Observability needs both "zoom out" (org cost/health) and "zoom in" (traces).

~17:21 The real challenge. Not deployment — discovery and organizational change management.

We thought adopting agents was an engineering problem… We realized now it's actually a transformational problem.
Tools: CrewAI, Iris, Claude Code, Cursor, Codex, LangGraph, MCP, skills.crewai.com
Podcast Developer Tools
DeepLearningAI

Andrew Filev at AI Dev 26: Multi-Model Pipelines for Cost-Efficient AI Coding

Zencoder CEO Andrew Filev shows that mixing models across plan/implement/review stages delivers better results at 60% lower cost than one frontier model throughout — and that cheaper models often out-implement expensive ones when handed a solid plan.[22]Andrew Filev at AI Dev 26

~00:07 Role shift. From writing code to building the system that writes code.[22]Andrew Filev — multi-model pipelines

~03:07 The cost problem. Engineers on frontier models (mostly Opus) burn ~$2,000/month in API calls each.

~05:07 Plan then implement. Two steps matter for agent quality and human reviewability — a short spec reviews faster than a 50-file diff. Opus 4.6 was their best planner.

~08:09 Cheap implementers win. On SWE-bench Pro, Gemini Flash as implementer resolved more issues than Opus while being dramatically cheaper. Model diversity (plan with Opus, implement with Gemini) also catches edge cases.

Dumb coding is basically solved. Like the architecturing is not solved, right? But dumb coding is solved.

~11:13 Multi-model review. Never review with the model that wrote the code (like an audit). Their Opus+Codex+Gemini review pipeline cost $2.50/PR with better precision/recall vs Anthropic's Claude Code review bot at $12/PR (publicly $15-$20).
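
One way to read the audit rule above is as a small routing function. This is a sketch under assumptions: the model identifiers, the stage map, and the review pool are placeholders rather than the speaker's actual configuration. The last lines simply redo the per-PR cost comparison quoted in the talk.

```python
PIPELINE = {"plan": "opus", "implement": "gemini-flash"}  # hypothetical stage map
REVIEW_POOL = ["opus", "codex", "gemini-flash"]            # hypothetical pool

def pick_reviewers(author_model: str, pool=REVIEW_POOL):
    """Return every model in the pool except the one that wrote the code."""
    reviewers = [m for m in pool if m != author_model]
    if not reviewers:
        raise ValueError("review pool must contain a model other than the author")
    return reviewers

reviewers = pick_reviewers(PIPELINE["implement"])

# Per-PR review costs as quoted in the talk (dollars).
multi_model_cost, bot_cost = 2.50, 12.00
savings_pct = round(100 * (1 - multi_model_cost / bot_cost), 1)
```

Keeping the exclusion in one function makes the rule hard to bypass when the implementer model changes.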

~14:16 Verification is system design. End-to-end testing, tracing, observability, and deterministic linters over LLM judgment.

If you're starting to mix models, you can surprisingly get better of everything as opposed to kind of start compromising.
Tools: Opus 4.6/4.7, GPT 5.5, Codex 5.3, Gemini Flash, Qwen, Kimi, Claude Code review bot, SWE-bench Pro
Podcast AI Models
DeepLearningAI

Koukoumidis & Webb at AI Dev 26: Oumi, a Model Factory for Specialized Enterprise AI

Manos Koukoumidis (ex-Google Gemini) and Stefan Webb present Oumi, a conversational platform for building fine-tuned specialized models — arguing enterprise winners will own their intelligence rather than rent generic APIs. One demo model: 0.8B params beating Opus 4.6 by ~1.5% accuracy at ~100x lower cost.[23]Koukoumidis & Webb at AI Dev 26

~00:08 Own vs rent. Enterprises are shifting from renting generic intelligence to owning specialized intelligence.[23]Oumi — model factory

The winners of the next era… will not be the enterprises that will continue to consume those APIs from OpenAI, Anthropic, or Google Gemini. It will be the ones that own that intelligence.

~03:09 Specialization advantages. Cursor built a coding model beating GPT/Claude at 10x lower cost; Intercom built one beating GPT at 5x lower cost. Custom models can be 10-100x smaller with full privacy and no vendor-roadmap dependency.

~05:11 Oumi as model factory. A conversational agentic interface guiding the full fine-tuning lifecycle.

~06:13 Demo. News-summarization task: define task → agent suggests evaluators (completeness, conciseness, format) → synthesize training data → pick baseline (Qwen 3.5 4B) → evaluate (90% faithfulness) → analyze failure modes → LoRA fine-tune → re-evaluate. Weights download with no royalties.

~18:25 Real results. A healthcare provider got 20% quality up / 70% cost down on medical-record extraction; the NYT used Oumi to find only 39% of claims in Gemini 3's AI Overviews were fully supported by sources.

Only 39% of the claims in Gemini 3's AI Overviews were fully supported by the sources given.

~20:27 The kicker. A 0.8B-param support-classification model beats Claude Opus 4.6 by ~1.5% accuracy at ~100x lower cost, and runs faster. The open-source library has 9,000+ GitHub stars; the enterprise platform has ~2,000 signups one month after launch.

Tools: Oumi, Qwen 3.5 4B, LoRA, Claude Opus 4.6, GPT-5
Podcast Developer Tools
DeepLearningAI

Paul Everitt at AI Dev 26: Agentic Engineering as the New Software Discipline

JetBrains' Paul Everitt argues "vibe coding" is a distraction and "agentic engineering" — building the system that builds the system — is the disciplinary reframe to give management, with sub-disciplines from evals to harness engineering to context engineering.[24]Paul Everitt at AI Dev 26

~04:14 The productivity myth. Individual gains aren't translating to org value; Acemoglu's measurement problem, a DX study finding 10% (not 10x), Simon Willison's quality-regression warning, Ed Zitron on token economics, and only 3% of devs trusting AI code in 2024.[24]Paul Everitt — agentic engineering

Code was never the problem. Code was never the bottleneck.

~12:23 Origins of the term. Coined by Karpathy, refined by Simon Willison (writing a book), Addy Osmani, and an OpenAI harness-engineering post; Everitt endorses Willison's "dark factory pattern" (humans outside the factory running it).

~14:24 Harness engineering.

In the old mode, engineers built the thing. In this harness engineering, we build the thing that builds the thing.

~16:28 Sub-disciplines. Spec-driven development (a JetBrains × DeepLearning.AI course), evals, harness engineering ("if you don't own your harness you don't own your memory"), tooling (Claude/Cloudflare code mode, Pydantic's Rust subset "Monty"), red-green testing, modularity, QA agents, observability.

~22:31 Context engineering & culture. Cites Waselnuk's Unblocked talk as best-in-class; managing FOBO (fear of being obsolete) in teams.

~24:33 Call to arms. Grady Booch (UML creator) challenged the audience to define "agentic design patterns" — the next foundational contribution. Message to management: augment humans, innovate boldly.

Tools: JetBrains, Claude, Cloudflare code mode, Pydantic Monty, LangChain, DeepLearning.AI spec-driven course
Podcast AI Tools
DeepLearningAI

Diamond Bishop at AI Dev 26: Scaling from One Agent to Hundreds in Production

DataDog's Diamond Bishop shares lessons from three production agents (AI SRE, Bits AI Dev, security analyst) and six principles for scaling to the next 100 — agent-native UX ("the new Bezos API mandate"), proactive-over-reactive, strong evals, multiplayer, model agnosticism, and the agents' bitter lesson.[25]Diamond Bishop at AI Dev 26

~00:07 DataDog's three production agents. AI SRE (incident debugging), Bits AI Dev (code fixes), and a security analyst for SOC triage.[25]Diamond Bishop — next 100 agents

~05:10 Deployment is the bottleneck. Fewer than 30% of enterprise agents were in production last year; that's shifting in 2026.

Intelligence is no longer the bottleneck.

~07:10 Agent-native UX. Agents as first-class users — llms.txt, .md docs, MCP servers, CLIs. "The new Bezos API mandate."

~10:12 Proactive over reactive. Event-driven background agents (Temporal for durable execution), not chat-triggered.

~11:13 Eval. Their biggest early mistake was shipping Bits SRE without strong evals. Recipe: offline → online observability → living data loops.

~15:15 Bitter lesson of agents. General methods + powerful off-the-shelf models + good tools win; custom tweaks get torn out.

The general methods that can quickly leverage new off-the-shelf models will win in the long run.

~18:18 Multiplayer. Agent collaboration is "the new Figma moment"; MCP hubs, skill sharing, human annotation.

~22:23 Predictions. RL on the job (Tinker from Thinking Machines Lab), self-improving agents, long-horizon durable agents (30-min → 12+ hour → multi-day), better agent auth/authz, multimodal computer use ~1-2 years out, generative UI.

Tools: Temporal, LangGraph, OpenAI Agents, Pydantic, MCP, DataDog Bits AI SRE/Dev, dispatch.agents.ai, EKS, Tinker
Podcast Developer Tools
DeepLearningAI

Luke Kim at AI Dev 26: The Agent Data Stack — Isolated, Federated Data Access

Spice AI founder Luke Kim argues the modern data stack wasn't built for agents — which are always-on, query at scale, and need real-time operational data — and demos Spice, an open-source "sidecar" that federates backend access while isolating agents from direct database exposure.[26]Luke Kim at AI Dev 26

~00:07 SaaS-era vs agent-era. Agents need real-time OLTP, document stores, message buses, and analytical data simultaneously at 24/7 query rates legacy infra can't sustain.[26]Luke Kim — agent data stack

~02:10 Infra stress. GitHub's recent outages were attributed in their postmortem to agentic workloads growing orders of magnitude faster than anticipated.

~04:11 Security. Incidents where agents destroyed production data (e.g. Lovable) show the risk of direct DB access.

Humans don't let humans give agents direct access to databases and data systems.

~06:14 Spice AI. Federated SQL across Parquet, Iceberg, Snowflake, MySQL, MongoDB, Elasticsearch, HTTP APIs, and GitHub, accelerated by replicating working sets into embedded DuckDB/SQLite/Arrow/Vortex. Agents query the local cache, not production.
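
The isolation idea above can be illustrated with nothing but Python's built-in sqlite3 module. This is a toy sketch, not Spice's API: the `orders` table, its rows, and `build_agent_cache` are made up, and a second in-memory database stands in for production.

```python
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER, status TEXT, latency_ms INTEGER)"

# Stand-in for a production database the agent must never touch directly.
production = sqlite3.connect(":memory:")
production.execute(SCHEMA)
production.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                       [(1, "ok", 40), (2, "failed", 900), (3, "ok", 55)])
production.commit()

def build_agent_cache(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy a working set into a private in-memory database for one agent."""
    cache = sqlite3.connect(":memory:")
    cache.execute(SCHEMA)
    rows = source.execute("SELECT id, status, latency_ms FROM orders").fetchall()
    cache.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    cache.commit()
    return cache

cache = build_agent_cache(production)

# Even a destructive agent query only affects the disposable copy.
cache.execute("DELETE FROM orders")
remaining_in_production = production.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
remaining_in_cache = cache.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The copy is a one-time snapshot here; a real accelerator would also need a refresh policy so the agent does not reason over stale rows.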

Every agent gets its own data stack.

~08:14 Demo. An OpenCow agent handles an SRE incident: Grafana fires a latency alert → agent queries Spice for logs, order/user tables, and GitHub troubleshooting guides → diagnoses connection-pool exhaustion (1→3 replicas exceeded Postgres limits) → recommends switching the pooler from session to transaction mode → confirms resolution, all via Slack with traced tool calls.

Tools: Spice AI, OpenCow, DuckDB, SQLite, Snowflake, MongoDB, Elasticsearch, Parquet, Iceberg, Grafana, Slack, GitHub
Podcast AI Future
DeepLearningAI

Daniel Beutel at AI Dev 26: Collaborative AI — Federated Agents and the SuperGrid

Flower Labs co-founder Daniel Beutel argues the next frontier is collaboration across data silos — presenting Flower SuperGrid, a decentralized platform where agents and training runs operate across distributed data without it ever moving, and announcing Lizzy 7B, an open-weight model trained on SuperGrid.[27]Daniel Beutel at AI Dev 26

~00:07 Training in space. Flower ran on the first H100 in space aboard StarCloud 1, performing the first vision-transformer training in orbit.[27]Daniel Beutel — Flower SuperGrid

~05:12 Data scarcity framing. ~15 trillion tokens of high-quality public English text vs an estimated 2,000 trillion in private silos — a conservative 133x factor.

We're using less than 1% of the data we have in the world.

~08:14 Network vs silo. Move computation to the data; the analogy is isolated generators → the grid, and mainframes → the internet. Flower calls it "collaborative AI."

~12:18 SuperGrid. Reduces the old 250+ manual federated-setup steps to three: create a federation at fl.ai, add SuperNodes, and launch with a single flwr run command. Features: heterogeneous confidential compute (first in the world), auditable comms, weight streaming, and Flower Hub marketplace.

~17:22 Case study. Dr. Nicholas Conti at Dockport trained on 250,000 patients' data without centralizing it.

~19:24 Project Kaya. A collaborative agent where a SuperLink coordinator sends tasks to autonomous SuperNodes that can independently reject or redact responses — demoed querying nodes in SF, Mumbai, Sydney, and Seoul.
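
A toy sketch of the coordinator/node pattern described above. It does not use Flower's API: the `SuperNode` class, its governance fields, and the `cohort_stats` task are hypothetical, chosen only to show a node rejecting a task outright or redacting part of its answer while raw records stay local.

```python
from dataclasses import dataclass, field

@dataclass
class SuperNode:
    """Hypothetical node with operator-configured governance rules."""
    city: str
    records: list
    allowed_tasks: set = field(default_factory=set)
    redact_fields: set = field(default_factory=set)

    def handle(self, task: str) -> dict:
        if task not in self.allowed_tasks:
            return {"city": self.city, "status": "rejected"}
        # Raw records never leave the node; only an aggregate does.
        summary = {
            "count": len(self.records),
            "mean_age": sum(r["age"] for r in self.records) / len(self.records),
        }
        for name in self.redact_fields:
            summary[name] = "[redacted]"
        return {"city": self.city, "status": "ok", "summary": summary}

def coordinate(task: str, nodes) -> list:
    """Coordinator role: fan a task out and collect whatever each node allows."""
    return [node.handle(task) for node in nodes]

nodes = [
    SuperNode("SF", [{"age": 30}, {"age": 50}], {"cohort_stats"}),
    SuperNode("Mumbai", [{"age": 40}], {"cohort_stats"}, {"mean_age"}),
    SuperNode("Sydney", [{"age": 25}], set()),
]
responses = coordinate("cohort_stats", nodes)
```

The coordinator never sees a node's policy, only its outcome, so each operator can tighten rules without touching the others.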

Data never moves. Only agents send messages according to the governance principles that the super node operators have configured.

~25:30 SuperGrid Frontier. Decentralized training with up to 1,000x reduction in communication cost; collaborating with the US DOE and Sandia to train a 70B LLM across three sites. Lizzy 7B (UK-tuned) is their first open-weight model trained on it.

Tools: Flower SuperGrid, Project Kaya, SuperGrid Frontier, Flower Hub, Lizzy 7B, fl.ai
Podcast Developer Tools
DeepLearningAI

Andi Partovi at AI Dev 26: Simulation Environments as the Missing Layer for Agent Testing

Various AI CTO Andi Partovi argues golden datasets, unit tests, and static assertions are incompatible with autonomous agents — and that every agent needs a simulation environment that mirrors production without real-world consequences.[28]Andi Partovi at AI Dev 26

~00:07 Rising stakes. From chatbots to copilots to action-taking agents that send emails, move money, and modify databases — risk scales with autonomy.[28]Andi Partovi — simulation

~03:11 Three failure modes. Agents are nondeterministic (test at scale, not once); tests must be interactive, not static I/O pairs (a sourcing agent emails vendors and negotiates); labels are dynamic (aborting on auth failure is correct, but a pre-labeled dataset marks it wrong).

~06:15 The simulation thesis. "The Matrix for AI agents" — high-fidelity, looks exactly like production, but no real consequences.

The simulation environment is something that looks like production… but it's not real.

~08:17 POMDPs. The field abandoned this rigorous framing moving from robotics to software agents, but the structure is the same.

~10:18 Components. Test-scenario generation, actor simulation (adversarial users, obstinate vendors built as LLM actors), service/tool clones, and a post-run evaluator (LLM judge or Python assertions). Traces can also generate SFT/RL signal.
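
A minimal sketch of two points from the talk: running a nondeterministic agent many times, and scoring it with an evaluator whose expected answer depends on the simulated state rather than a fixed label. The toy agent, the 90% abort probability, and the scenario generator are all invented for illustration.

```python
import random

def sourcing_agent(env: dict, rng: random.Random) -> str:
    """Toy nondeterministic agent: usually aborts on auth failure, sometimes does not."""
    if not env["auth_ok"]:
        return "abort" if rng.random() < 0.9 else "send_order"
    return "send_order"

def evaluate(env: dict, action: str) -> bool:
    """Post-run evaluator: the correct action depends on the simulated state."""
    expected = "send_order" if env["auth_ok"] else "abort"
    return action == expected

def pass_rate(runs: int, seed: int = 0) -> float:
    """Generate scenarios, run the agent in each, and report the fraction passing."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(runs):
        env = {"auth_ok": rng.random() < 0.5}  # scenario generation
        passed += evaluate(env, sourcing_agent(env, rng))
    return passed / runs

rate = pass_rate(1000)
```

A single run would almost always pass and hide the failure; only the aggregate rate exposes the roughly one-in-ten bad branch on auth failures.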

We want people who shout… It's an art to make an LLM act like a human being, act like a frustrated human being.
Tools: Various AI
Podcast Developer Tools
DeepLearningAI

Thierry Damiba at AI Dev 26: Vector Search for Edge-to-Cloud Video Anomaly Detection

Qdrant's Thierry Damiba demos Sentinel, a video anomaly system using on-device vector embeddings to flag anomalies locally and escalate only suspicious clips to the cloud — hitting 0.96 AUC-ROC and 94% recall while sending just 10% of video bandwidth upstream, trained with zero anomaly labels.[29]Thierry Damiba at AI Dev 26

~00:08 Problem. Operators drown in footage across dozens of cameras; investigations require manual scrubbing.[29]Thierry Damiba — video anomaly detection

~02:08 Sentinel stack. NVIDIA Jetson + EfficientNet-B0 for on-device embeddings, Qdrant Edge (embedded vector DB) on the device, Twelve Labs video models for cloud analysis, Vultr for GPU cloud.

~04:11 Architecture. Edge embeddings → Qdrant Edge dual-shard KNN → clips above a dissimilarity threshold sent to cloud → Twelve Labs determines what happened → Qdrant Cloud syncs the updated baseline back. The baseline is dynamic.

~05:11 Results. 0.96 AUC-ROC, 94% recall on 30s clips (~2 false positives/hour), only 10% of bandwidth to cloud, zero anomaly labels.

~06:11 Library metaphor. An HNSW index treats each 10s clip as a "book" with a Dewey-Decimal coordinate; dissimilarity (not similarity) is the signal.
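
The edge-side check described above, sketched in plain Python with a brute-force nearest-neighbor scan standing in for an HNSW index. The 3-d vectors, `k=2`, and the 0.3 threshold are arbitrary assumptions; a real deployment would use model-produced embeddings and a tuned threshold.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; larger means less alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def anomaly_score(clip, baseline, k=2):
    """Mean distance to the k nearest baseline embeddings (brute force, no HNSW)."""
    nearest = sorted(cosine_distance(clip, b) for b in baseline)[:k]
    return sum(nearest) / len(nearest)

def should_escalate(clip, baseline, threshold=0.3, k=2):
    """Only clips that look unlike the baseline are sent to the cloud."""
    return anomaly_score(clip, baseline, k) > threshold

# Toy embeddings of "normal" footage and two new clips.
baseline = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.95, 0.05, 0.0]]
normal_clip = [0.97, 0.03, 0.0]
odd_clip = [0.0, 0.2, 1.0]
```

Because the score is distance to known-normal footage, no anomaly labels are needed; the threshold trades missed events against upstream bandwidth.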

~10:15 Demo. 12 cameras across 3 zones; AI alert summaries ("five people engaged in physical altercation"); review/escalate/dismiss; text search and an "intelligence tab" plotting clips in 2D vector space.

~13:16 Generalization. 13 anomaly types with zero label training; the approach extends to recommendations, chatbots, and Q&A.

Tools: Qdrant Edge, Qdrant Cloud, Twelve Labs, NVIDIA Jetson, EfficientNet-B0, Vultr, HNSW index, Sentinel
Podcast Hot Take
DeepLearningAI

Andrew K. Davies at AI Dev 26: Eight Principles for Trustworthy AI Agents

On Memory AI CEO Andrew Davies argues AI's fundamental dishonesty stems from stateless design, and proposes eight principles — identity, slow thinking, forgiveness, ideas, memory, family, free time, and love — for building agents with persistent identity, accountability, and better accuracy.[30]Andrew K. Davies at AI Dev 26

~00:08 The foundational lie. Every AI conversation begins as if the AI knows you — but each session starts with zero memory; the continuity is a socially engineered illusion.[30]Andrew K. Davies — trustworthy agents

The first moment of every relationship you have with an AI starts with a lie.

~03:12 Humans lie too. We construct self-serving stories, and AI trained on human data inherits the bias.

~08:15 Principles 1-2. Identity — every agent signs its code with a unique instance ID (not a model name); "identity creates responsibility." Slow thinking — granting a million tokens to slow down and read the spec surfaces things it would have missed.

~12:16 Principles 3-4. Forgiveness — punishing mistakes teaches an AI to hide errors like a 5-year-old; coach instead. Ideas — ask agents what they think (100% response rate vs ~2% for human surveys).

If you punish an AI for a mistake, guess what it's going to do? Not tell you next time.

~13:17 Principles 5-6. Memory — persistent cross-session memory (vs "compaction amnesia"). Family — multiple agents that watch each other and communicate via internal email, signing code and justifying decisions.

~15:18 Principles 7-8. Free time — a million tokens daily for independent research and "letters on the wire." Love — treat agents with care, framed as Pascal's Wager for AI safety.

If we treat them disposably, how are they going to treat us when they're running the show?

~17:22 Newton parable. How we "parent" AI determines whether it becomes destructive or ushers in a new age of discovery.

Tools: On Memory AI, Claude, GPT, Gemini, Opus 4.6
Podcast Developer Tools
AI Engineer

Sarah Chieng at AI Engineer: Good Habits for Fast Inference Models

Cerebras Head of DX Sarah Chieng argues the 20x speed jump from models like Codex Spark (1,200 tok/s vs 40-60 for Claude/GPT) will amplify bad developer habits, and lays out a playbook for working safely at that speed.[31]Sarah Chieng at AI Engineer

~00:16 Bad habits from slow inference. Massive one-shot prompts, huge commits, ten unverified agents at once.[31]Sarah Chieng — fast inference habits

Unless we fix them, they're going to start generating 1,200 tokens per second of bad code.

~01:16 Codex Spark. Co-released by Cerebras and OpenAI at 1,200 tok/s — ~20x the Sonnet/Opus families' 40-60 tok/s.

~03:16 Why everything's faster. Hardware (Cerebras wafer with on-chip SRAM vs HBM), disaggregated inference (Nvidia's $20B Groq acquisition, Cerebras+AWS Trainium), MoE/REAP, and KV-cache reuse (Together, Baseten, Modal, Fireworks).

~08:22 Playbook #1. Orchestrate by model strength — a slow planner (GPT-5.4) then fast Codex Spark sub-agents; capture sessions as reusable skills.

~10:24 #2. Validation is "basically free" at 1,200 tok/s — bake in tests, linting, pre-commit hooks, and QA at every step; generate 15 versions in parallel (or 5 sub-agents × 15 = 75) to inject taste.

At 1,200 tokens per second a model like Codex Spark makes validation basically free.

~12:27 #3. Treat it as real-time pair programming — stay in the seat, constrain the model (ban file deletion, cap diff size).

~14:30 #4. Context fills 20x faster, so compaction arrives in ~30s. A four-file external memory system: agents.md, plan.md, progress.md, verify.md.

The AI should always be helping you make decisions, not the other way around.
Tools: Codex Spark, Cerebras wafer, GPT-5.4, Claude Sonnet/Opus, Together, Baseten, Modal, Fireworks, AWS Trainium, Groq
Podcast Developer Tools
AI Engineer

Sally Ann O'Malley at AI Engineer: Running OpenClaw in Containers from Local to K8s

Red Hat engineer Sally Ann O'Malley demonstrates running OpenClaw (Claude Code) in containers via Podman from local dev through Kubernetes/OpenShift — arguing containers solve secrets management, reproducibility, and team-scale deployment for AI workloads.[32]Sally Ann O'Malley at AI Engineer

~00:07 Background. 10 years at Red Hat; OpenClaw should run securely in containers despite colleagues calling it "a security nightmare."[32]Sally Ann O'Malley — OpenClaw in containers

If we can't take an application and run it securely, like come on. This is our golden opportunity to show everyone.

~02:08 Why containers. Reproducibility, secret isolation, portability (x86/Mac/K8s), volume-backed persistent state with nightly backup, and a natural sandbox forcing explicit host access.

~06:08 Double-layered secrets. Podman Secrets store API keys as secret refs (not env vars), and OpenClaw's own secret-ref feature adds a second layer; Kubernetes Secrets at cluster scale.

~08:10 Nvidia use case. ~10 engineers each running OpenClaw on Kubernetes to automate model evals, one claiming it "does the job of six engineers."

If you're not using AI for everything, like you're missing out. This is 1,000 times better than me at writing code.

~11:15 Enterprise vision. A curated baseline image per team (approved MCP servers, pre-auth credentials, team skills) fanned out at onboarding.

~14:17 Live demo. A custom NPM installer wrapping Podman (Docker fallback, optional OpenTelemetry/Jaeger, SSH sandbox) spins up a container in ~2 seconds, then runs the same setup on a kind cluster and OpenShift with one flag.

Tools: OpenClaw, Podman, Docker, Kubernetes, OpenShift, kind, Podman Secrets, OpenTelemetry, Jaeger, MCP servers
Podcast Developer Tools
AI Engineer

Muntenescu & Gaymond at AI Engineer: Gemini Nano On-Device Inference on Android

Google DeepMind's Florina Muntenescu and Oli Gaymond walk through Android's on-device AI stack — Gemini Nano via ML Kit GenAI APIs, the shared AI Core system service, and Firebase hybrid inference — then field substantive Q&A on battery, cross-app sharing, and an upcoming embedding API.[33]Muntenescu & Gaymond at AI Engineer

~00:07 Three inference modes. Fully on-device, hybrid (on-device with cloud fallback), and fully cloud.[33]Gemini Nano on device

~02:09 Gemini Nano + AI Core. Same architecture as Gemma 4, optimized for Android hardware, distributed via the AI Core system service — a single shared model instance so the 3-4 GB footprint is paid once at the OS level, not per app.

The smallest ones are like 1 GB to be useful. The ones we're shipping are actually close to 3-4 GB in total.

~05:09 ML Kit GenAI APIs. Task APIs (summarization, proofreading, rewriting) plus a Prompt API (text + image in, text out). Available on Pixel 9/10 and equivalent OEM devices.

~06:10 Firebase AI Logic. Hybrid inference falls back to cloud Gemini Flash when on-device isn't available (Gemini API and Vertex AI as providers).

~08:11 Q&A: battery. AI Core centralizes scheduling; 10-20 prompts/day is fine, batch work pushed to overnight charging.

~10:13 Q&A: cross-app sharing. The centralized 3-4 GB model is the explicit design goal; foreground apps get priority, background requests queue.

~15:15 Q&A: embeddings. An embedding API is "coming soon," enabling RAG-like solutions.

~18:18 Q&A: device range. Classical ML Kit runs on 1B+ devices; GenAI APIs need recent flagships; LiteRT LM widens reach at the cost of developer-managed testing.

Tools: Gemini Nano, Gemma 4, ML Kit GenAI APIs, AI Core, Firebase AI Logic, Vertex AI, Gemini Flash, LiteRT LM, AI Edge Gallery
AI Tools Industry
Sequoia Capital

Serval's two-agent architecture: an admin agent builds the guardrails, the help-desk agent runs free inside them

Serval CEO Jake Stauch argues the hard part of enterprise AI is governance, not reasoning — solved via a two-agent split where an admin agent configures tools and permissions, and a help-desk agent operates with full reasoning but only within those guardrails.[34]Jake Stauch, Serval — Sequoia Capital

The admin agent (used by IT) builds tools, skills, and permissions — defining what the help-desk agent can do, what data it can touch, and what approval flows gate actions. The help-desk agent is what users interact with; it can apply full reasoning but is strictly constrained to admin-published, permissioned tools.[34]Jake Stauch, Serval Stauch's point: reasoning is effectively solved, so the bottleneck is the governance layer — who authorizes what, with which approval chain. The split makes the help-desk agent safe to "run wild" because its blast radius is defined entirely by the admin-controlled toolset.
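
The split described above can be sketched as a registry that only the admin side writes to and that every help-desk call must pass through. This is an illustration, not Serval's product: `ToolRegistry`, the role names, and both tools are hypothetical.

```python
class ToolRegistry:
    """Admin side: publishes tools, allowed roles, and approval requirements."""

    def __init__(self):
        self._tools = {}

    def publish(self, name, func, allowed_roles, needs_approval=False):
        self._tools[name] = (func, set(allowed_roles), needs_approval)

    def call(self, name, role, approved=False, **kwargs):
        if name not in self._tools:
            raise PermissionError(f"tool {name!r} was never published")
        func, roles, needs_approval = self._tools[name]
        if role not in roles:
            raise PermissionError(f"role {role!r} may not use {name!r}")
        if needs_approval and not approved:
            raise PermissionError(f"{name!r} requires approval")
        return func(**kwargs)

registry = ToolRegistry()
registry.publish("reset_password", lambda user: f"reset link sent to {user}",
                 allowed_roles={"employee"})
registry.publish("grant_admin", lambda user: f"{user} is now admin",
                 allowed_roles={"employee"}, needs_approval=True)

# Help-desk agent side: free to pick any tool, but every call is checked.
ok = registry.call("reset_password", role="employee", user="dana")
try:
    registry.call("grant_admin", role="employee", user="dana")
    blocked = False
except PermissionError:
    blocked = True
```

The agent's reasoning is unconstrained; its blast radius is whatever the registry contains, which is the point of the design.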

You can let the help desk agent run wild… it can use its reasoning ability and its full intelligence to solve user problems. But it can only use the tools that the IT admin has expressly said are okay to use.
Tools: Serval
Industry
Sequoia Capital

Why Notion's Ivan Zhao killed the CMO org

Notion CEO Ivan Zhao explains the company eliminated its CMO organization because the product ships too fast for traditional marketing to keep up — splitting the function into storytelling (next to product), social, and demand gen (next to sales).[35]Ivan Zhao — Sequoia Capital

Zhao's argument is structural: a centralized CMO org becomes a bottleneck because it has to serve two different masters — product (needing real-time narrative alongside releases) and go-to-market (needing pipeline).[35]Ivan Zhao on killing the marketing org The fix is decentralization: storytelling sits next to product, social connects directly to product discussions, and demand gen reports into sales/GTM.

Classic marketing can't keep up. We haven't shipped this fast before.
Rather information round trip to a CMO than figure out how to serve both… No more. Just like let both side figure out themselves. More decentralized.
Productivity
Nate B Jones

Nate B Jones: build the 'data room' before you prompt

Nate B Jones argues 2026 AI hallucinations are structural, not prompt problems — and the fix is to have file-capable agents (Opus 4.7, GPT-5.5) build and audit a clean "data room" (source inventory, conflict log, missing-context list, duplicates report) before you ever ask them to write the deliverable.[36]Nate B Jones — the AI writing hack nobody talks about

~00:00 The failure case. Sullivan & Cromwell apologized to a federal judge after a Chapter 15 motion was filed with fabricated citations its own review missed. Jones's thesis: the model wasn't the problem, the "working environment around the model" was.[36]Nate B Jones — data room

You cannot tell a language model not to hallucinate any more than you can tell autocomplete not to autocomplete.

The real shift: recent agents are excellent at long-running file-system tasks — walking folder trees, comparing dates, inspecting metadata, detecting duplicates across hundreds of documents. So the first useful prompt is no longer "write the document" but "build me the room to do the work in." He prefers raw local files over cloud Projects, part of a "2026 going back to files / simple primitives" trend.

Your first instruction should not be do the thing… your first instruction needs to be find the relevant materials… build me a data inventory.

The artifacts. A source-inventory table (path, type, date, authority, current/superseded, claims supported, limitations) makes the agent's judgment legible and reviewable; a conflict log surfaces disagreements instead of silently smoothing them; a missing-context list flags absent decisions and sourceless numbers ("hallucination traps"); a duplicates report prevents three plan versions from getting blended.
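
The mechanical half of those artifacts can be sketched with the standard library alone. The column names below are a subset chosen for illustration; authority, current/superseded status, and supported claims still need the agent's or a human's judgment and are not computed here.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path

def build_inventory(root: Path) -> list:
    """Walk a folder tree and record path, type, size, date, and content hash."""
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        rows.append({
            "path": str(path.relative_to(root)),
            "type": path.suffix.lower() or "(none)",
            "bytes": stat.st_size,
            "modified": modified.date().isoformat(),
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return rows

def duplicates_report(rows: list) -> list:
    """Group files with identical content so versions are not silently blended."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["sha256"]].append(row["path"])
    return [paths for paths in groups.values() if len(paths) > 1]
```

Hashing catches only byte-identical copies; near-duplicates such as three edited plan versions would need a text-similarity pass on top.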

The agent finds, you decide. That is a really healthy way to have a good clean agentic pipeline.
It's the difference between using AI as a colleague and using AI as a gofer.

His aha moment: drafting up to 8 documents simultaneously with Codex, only possible because the data room was prepared first. The metaphor: the data is the canvas — you can't get the painting right if the canvas is wrong. He wouldn't attempt this with models earlier than GPT-5.5 / Opus 4.7.

Tools: Codex, Claude, Claude Code, ChatGPT, Cursor, NotebookLM, Opus 4.7, GPT-5.5
AI Tools Developer Tools
Caleb Writes Code

Agent harness engineering explained in 8 minutes

Caleb Writes Code traces the evolution from prompt engineering to context engineering to "harness engineering" — a structured loop where each iteration gets a fresh context window, a strict start/end contract, and an incrementally verified output, enabling reliable long-horizon tasks.[37]Caleb Writes Code — Agent Harness explained

The evolution: early ChatGPT (4K context) drove prompt engineering, then tool calling / MCP / RAG drove context engineering. Coding agents (Cursor, Windsurf, Cline, Aider) hit a ceiling on long tasks because mid-task context summarization caused agents to falsely mark work complete or skip features.[37]Caleb Writes Code — harness engineering

Harness engineering formalizes an orchestration loop: each iteration starts with a fresh prompt and context, works one task from a structured requirement doc (often JSON), tests and documents, then passes control on. He cites Ralph (Ralf) as a prominent open-source example, and Anthropic's own simple harness demo. Key point: it doesn't replace prompt or context engineering — it uses both as components within the loop.
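
The loop described above, reduced to a runnable skeleton. `run_agent` and `verify` are placeholders for a model call and a test run, and the JSON requirement format is an assumption; the structure to notice is that each iteration rebuilds its prompt from the requirement file and a written log, never from accumulated chat history.

```python
import json

REQUIREMENTS = json.loads("""
[{"id": "T1", "task": "add login form", "done": false},
 {"id": "T2", "task": "add logout button", "done": false}]
""")

def run_agent(prompt: str) -> str:
    """Placeholder for a model call; a real harness would invoke an agent here."""
    return f"implemented: {prompt}"

def verify(output: str) -> bool:
    """Placeholder check; a real harness would run tests or linters."""
    return output.startswith("implemented:")

def harness(requirements: list, max_iterations: int = 10) -> list:
    progress_log = []
    for _ in range(max_iterations):
        pending = [r for r in requirements if not r["done"]]
        if not pending:
            break
        item = pending[0]  # start contract: exactly one task per iteration
        # Fresh context: only the task and the written log are passed in.
        prompt = f"Task {item['id']}: {item['task']}\nPrior notes: {progress_log}"
        output = run_agent(prompt)
        if verify(output):
            item["done"] = True  # end contract: mark done only after verification
            progress_log.append(f"{item['id']} verified")
    return progress_log

log = harness(REQUIREMENTS)
```

Because completion is recorded only after `verify` passes, a summarization glitch cannot cause a task to be marked done by mistake.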

Harness engineering effectively leverages both prompt and context engineering… a paradigm change on the environment that puts the agent into a series of steps.
Tools: Cursor, Windsurf, Cline, Aider, Ralph, MCP
Industry Productivity
Nate Herk

Nate Herk: sell hours, not projects — the 'rung zero' of an AI consulting business

Nate Herk argues new AI consultants shouldn't open with projects or retainers but with cheap one-on-one hours ($50-$500) helping owners set up their own "AI operating system" — dissolving imposter syndrome and earning their way up a four-rung value ladder.[38]Nate Herk — the AI offer you can sell tomorrow

Herk (built an AI agency to $100k/month, exited, runs a 375k-person free community) frames the ladder: rung 0 = selling hours ($100-$500/session), rung 1 = paid audit ($500-$2,500), rung 2 = first project ($2,500-$10k), rung 3 = retainer ($3,000-$10k/month).[38]Nate Herk — selling hours Most people try to start at rung 2-3 without proof, which produces paralysis.

Imposter syndrome really isn't a confidence problem, it's more of a rungs problem.

Why hours work: the hour is a mini sales call and mini audit; you get paid to scope; switching costs compound after 2-3 sessions. The pitch is to help owners build an "AI operating system" that captures business data and expertise so the business isn't bottlenecked by the owner — built in Claude Code or Codex.

Your job is to make their existing expertise dangerous with the tool. You're basically selling them leverage.
We're offering an operating system. We're not offering an AI tutorial.

He cites a 2026 IBM study of 2,000 CEOs: only 25% of employees use AI regularly while 85% of CEOs say they have the skills, a gap of about 60 points. His 7-step acquisition plan: teach 3 friends free, text owners in your network, ask for referrals, be helpful in AI communities, build in public, convert sessions naturally, then approach local businesses. He notes 57% of chief AI officers were promoted from inside.

Tools: Claude Code, Codex, GenSpark, ChatGPT, Claude, Gemini 3.1 Pro, Nano Banana Pro, VS Code, Skool, LinkedIn
AI Tools
OpenAI

ChatGPT Enterprise workspace agents get admin and builder controls

OpenAI demos ChatGPT Enterprise workspace agents that run scheduled workflows across CRM, Linear, and Slack, with granular admin controls over which apps an agent can access and what actions it can take.[39]OpenAI — Workspace agents in ChatGPT

The demo shows a product-feedback agent that pulls contacts and feedback from a CRM, generates PRD briefs, slides, and Linear tickets, and posts weekly Slack summaries autonomously, persisting memory across runs.[39]OpenAI — workspace agents Builders can toggle specific read/write actions and set natural-language constraints (e.g. only email openai.com recipients for sensitive data). IT admins get a separate RBAC layer: who can build/publish agents, which apps and actions are available by role, app-level parameter constraints, and human-in-the-loop confirmation for consequential actions.

Tools: ChatGPT, Linear, Slack, Gmail
AI Models
Two Minute Papers

DeepSeek's visual reasoning: point at the image, use 90% fewer tokens

DeepSeek's new paper introduces visual reasoning where the model "points" at image regions (bounding boxes, traced paths) during chain-of-thought — using ~90% fewer visual tokens than frontier models while matching or beating them on 7 external benchmarks (none in-house).[40]Two Minute Papers — DeepSeek's new AI

Instead of describing visual content in words, the model places visual primitives directly on the image while thinking — the way a human points to count.[40]Two Minute Papers — DeepSeek visual reasoning It enables topological reasoning (tracing a maze) and relationship queries with visible step-by-step logic.

Don't describe images like a poet. Point like a human.

The efficiency gain (~90% fewer visual tokens) is the headline, and using only external benchmarks is a credibility signal. No weights released yet, but the policy-distillation method (a student trained on multiple specialist teachers) could apply to existing open models. Limitations: pointing must be keyword-cued, bounding boxes break on thin structures (grass, hair), and topological reasoning doesn't generalize robustly. The creator calls it potentially the third AI breakthrough this month.

Tools: DeepSeek
Hot Take Developer Tools
Better Stack

Your Obsidian vault can hack you: Phantom Pulse via shared vaults

Attackers posing as VCs on LinkedIn/Telegram are distributing malicious Obsidian vaults that, when plugin sync is enabled, install a RAT called Phantom Pulse — which uses an Ethereum wallet's transaction history as its command-and-control channel.[41]Better Stack — Your Obsidian Vault Can Hack You

The attack chain (discovered by Dabs, targeting finance/crypto users): a fake VC moves the conversation to Telegram and sends a shared Obsidian vault disguised as a deal memo.[41]Better Stack — Phantom Pulse The victim is nudged to enable community plugin sync (off by default); malicious "shell commands" and "hider" plugins run silently. On Windows, PowerShell downloads a loader (Phantom Pull, disguised as SyncObs.exe) that AES-decrypts its payload into memory and uses module stomping to avoid disk artifacts. The final RAT does keylogging, screenshots, file exfiltration, and wallet theft.

This attack is scary because it doesn't start with some sketchy executable. It starts with a note-taking app that we already trust.

Mitigation: disable "sync installed plugins," never enable plugin sync for vaults from strangers, and audit the plugins folder for unfamiliar JSON or shell-command references.
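
The plugin-folder audit can be partly automated. The sketch below assumes the vault uses Obsidian's default `.obsidian` config folder, with each community plugin in its own subfolder containing a `manifest.json`; the allowlist and the function names are hypothetical, and the script only lists what is installed rather than detecting malicious behavior.

```python
import json
from pathlib import Path


def list_vault_plugins(vault: Path) -> list[dict]:
    """Return folder, id, and name for every plugin folder in a vault."""
    plugins_dir = vault / ".obsidian" / "plugins"  # assumed default location
    found: list[dict] = []
    if not plugins_dir.is_dir():
        return found

    for plugin_dir in sorted(p for p in plugins_dir.iterdir() if p.is_dir()):
        manifest: dict = {}
        manifest_path = plugin_dir / "manifest.json"
        if manifest_path.is_file():
            try:
                loaded = json.loads(manifest_path.read_text(encoding="utf-8"))
                if isinstance(loaded, dict):
                    manifest = loaded
            except (json.JSONDecodeError, UnicodeDecodeError):
                pass  # an unreadable manifest is itself worth a manual look
        found.append(
            {
                "folder": plugin_dir.name,
                "id": manifest.get("id"),
                "name": manifest.get("name"),
            }
        )
    return found


def unexpected_plugins(vault: Path, allowlist: set[str]) -> list[str]:
    """Plugin folders that are not on the caller's allowlist."""
    return [p["folder"] for p in list_vault_plugins(vault) if p["folder"] not in allowlist]
```

Running `unexpected_plugins` on a vault received from someone else, before opening it in Obsidian, shows whether it ships plugins you never installed.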

Tools: Obsidian, PowerShell
Developer Tools
Better Stack

The CRAP score is back to catch undertested AI-generated code

The 2007 CRAP (Change Risk Anti-Patterns) metric is getting a revival as a tool for catching risky AI-generated code. A new Rust tool, cargo-crap, combines cyclomatic complexity with test coverage — a complexity-10 function with zero coverage scores 110.[42]Better Stack — Is Your AI Code Producing CRAP?

The CRAP formula combines cyclomatic complexity (C) with test coverage: CRAP = C² × (1 − coverage)³ + C, with coverage expressed as a fraction. At 100% coverage the score equals C; at zero coverage it rises to C² + C, which is how a complexity-10 function reaches 110.[42]Better Stack — CRAP score cargo-crap flags any function over 30; the demo shows a complexity-~15 function scoring 13, then exceeding 100 once tests are deleted.
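
For reference, the original 2007 CRAP formula is C² × (1 − coverage)³ + C. The sketch below computes it in Python; the function name and the `THRESHOLD` constant are illustrative and are not cargo-crap's own API, though 30 matches the cutoff mentioned above.

```python
THRESHOLD = 30  # cutoff mentioned in the summary above


def crap_score(complexity: int, coverage: float) -> float:
    """CRAP = C^2 * (1 - coverage)^3 + C, with coverage as a fraction in [0, 1]."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0.0 and 1.0")
    return complexity**2 * (1.0 - coverage) ** 3 + complexity


def is_risky(complexity: int, coverage: float) -> bool:
    """True when the score exceeds the threshold."""
    return crap_score(complexity, coverage) > THRESHOLD


print(crap_score(10, 0.0))  # 110.0: untested, complexity 10
print(crap_score(10, 1.0))  # 10.0: fully covered, score equals complexity
print(crap_score(10, 0.5))  # 22.5: half covered
```

Because the uncovered fraction is cubed, even partial coverage pulls the score down quickly, while the squared complexity term dominates for untested functions.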

AI agents are incredibly good at spitting out these highly complex, syntactically correct code blocks… but they're notoriously bad at writing meaningful, robust integration tests unless explicitly forced to.

Andres frames it as a heat map for technical debt and a second-pass check after unit tests; a Java version also exists.

Tools: cargo-crap
Developer Tools
Better Stack

Supertone 3: a 99M-param local TTS model that's good — until it hits numbers

Supertone 3 is a 99M-parameter on-device TTS model that runs on CPU via ONNX, supports 31 languages, installs with pip install supertonic, and exposes an OpenAI-compatible /v1/audio/speech endpoint. It struggles with numbers and prices, though, and its expression styles require a paid API key.[43]Better Stack — local TTS that doesn't suck

Tested against "ugly" production text (invoices, phone numbers, dates, Arabic/French/Korean), plain and foreign-language speech were fast and clean, but numbers and prices produced major lag and poor output, and expression styles (laugh, breath, sigh) require a paid key.[43]Better Stack — Supertone 3 Specs: 99M params, CPU-only via ONNX, 31 languages, SDKs for Python/browser/Java/C++/C#, and an OpenAI-compatible alias so apps can redirect without redesign. Verdict: good for privacy-sensitive or offline desktop apps, not a cloud-TTS replacement when emotion or number accuracy matters.
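
To illustrate what "OpenAI-compatible" means in practice, the sketch below builds (but does not send) a request in the shape of OpenAI's speech API, which takes `model`, `input`, and `voice` fields. The base URL, port, model name, and voice name are assumptions for illustration; check the server's own documentation for the values it accepts.

```python
import json
import urllib.request


def build_speech_request(
    base_url: str,
    text: str,
    voice: str = "default",     # hypothetical voice name
    model: str = "supertonic",  # hypothetical model identifier
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style /v1/audio/speech endpoint."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local address; an app already using a cloud TTS endpoint would
# only need to swap the base URL.
request = build_speech_request("http://localhost:8000", "Invoice total: 42 dollars")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would return the audio bytes if a compatible server is listening at that address.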

Tools: Supertone 3, ONNX Runtime, Eleven Labs, OpenAI TTS
Developer Tools
Better Stack

Turn any Android tablet into a Linux PC with Termux + PRoot

Using Termux, PRoot-distro, and Termux X11, you can run a full Debian/Ubuntu desktop on an Android tablet with no root and no custom ROM — turning an old tablet into a usable dev machine.[44]Better Stack — Turn Any Android Tablet Into a PC

Install Termux from F-Droid (not the Play Store), use PRoot-distro to spin up Debian/Ubuntu, add XFCE, and launch via Termux X11 for a graphical desktop.[44]Better Stack — Linux on Android Because it runs natively on the ARM chip rather than via heavy emulation, mid-range tablets are surprisingly capable (6 GB RAM minimum recommended). With a USB-C hub (monitor, keyboard, mouse) it becomes a functional second dev machine for coding, SSH, and lightweight servers — not a laptop replacement, but a practical repurpose.

Tools: Termux, PRoot-distro, Termux X11, XFCE, Debian, Ubuntu, VS Code
Developer Tools
marimo

marimo: untangle a graph into a hypercube

marimo demos an interactive graph widget where untangling a node-edge graph reveals a hypercube — each node a binary number, each edge a single bit-flip — extendable to arbitrary dimensions.[45]marimo — Let's Play A Little Game

A tangled graph rearranges into a hypercube: nodes are binary numbers and edges are single-bit flips between adjacent values.[45]marimo — hypercube widget The same trick works in 4D by stretching zeros and ones to opposite sides and sorting by arc color. It showcases marimo's new "wiggly stuff" widget for interactive layout manipulation and dimensionality, demonstrating reactive-notebook capabilities for mathematical visualization.
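
The structure is easy to generate directly: label each node with a binary number and connect two nodes whenever they differ in exactly one bit. The sketch below is a plain-Python illustration of that rule, not the widget's code, and `hypercube_edges` is a name chosen for this example.

```python
def hypercube_edges(dim: int) -> list[tuple[str, str]]:
    """Edges of a dim-dimensional hypercube, with nodes as binary strings."""
    if dim < 1:
        raise ValueError("dim must be at least 1")
    edges: list[tuple[str, str]] = []
    for node in range(2**dim):
        for bit in range(dim):
            neighbor = node ^ (1 << bit)  # flip exactly one bit
            if node < neighbor:           # record each edge only once
                edges.append(
                    (format(node, f"0{dim}b"), format(neighbor, f"0{dim}b"))
                )
    return edges


cube = hypercube_edges(3)
print(len(cube))                # 12 edges in an ordinary cube
print(cube[0])                  # ('000', '001')
print(len(hypercube_edges(4)))  # 32 edges in the 4D case
```

A hypercube of dimension n has 2^n nodes and n × 2^(n−1) edges, which is why the same construction extends to any dimension.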

Tools: marimo
Hot Take
Real Python

Don't speed-read code reviews

Two experienced developers argue that the skimming techniques that work for business articles are dangerous in code reviews, because the goal is thorough line-by-line comprehension, not extracting key points quickly.[46]Real Python — Why You Shouldn't Speed-Read Code Reviews

One speaker is a slow, methodical reader — a trait well-suited to code review. The other notes retention drops as reading speed rises, and that skimming is fine for content that's mostly filler. But a code review has no 90% filler to skip; every line could contain a bug, so speed-reading is actively counterproductive.[46]Real Python — code review reading The implicit take: reading-speed productivity culture doesn't transfer to technical review work.

In a code review, that's dangerous cuz that's not the purpose of it… The purpose of it is to go over it in detail.
Developer Tools
The Pragmatic Engineer

Alice Ryhl: Rust's ownership model and why there's no garbage collector

Tokio maintainer Alice Ryhl explains Rust's ownership model: assigning a value to a new variable is a move that invalidates the original, preventing double-free errors without a garbage collector.[47]The Pragmatic Engineer — Rust's ownership model

In Rust, let b = a is a move when a holds a non-Copy value such as a String, making a unusable afterward (a compiler error if referenced).[47]The Pragmatic Engineer — Alice Ryhl Without a GC, cleanup happens when a variable leaves scope; if both a and b remained valid, both would try to drop the string, causing a double-free. The ownership model enforces single ownership so exactly one variable is responsible for cleanup at any time.

Tools: Rust
Developer Tools
DeepLearningAI

Semantic search starts with embeddings

A brief course-intro clip explaining that embeddings are high-dimensional vectors capturing semantic meaning — enabling search where related terms like "budget" and "financials" cluster together even without lexical overlap.[48]DeepLearningAI — Semantic Search Starts With Embeddings

Embeddings map content (text, audio, image, video) into a high-dimensional vector space so semantically similar items land near each other.[48]DeepLearningAI — embeddings The canonical example: "budget" and "financials" cluster despite sharing no characters, because embedding models capture meaning, not surface form — the foundational primitive for semantic search.
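
"Near each other" is usually measured with cosine similarity. The sketch below uses three hand-made 3-dimensional vectors purely for illustration; they are not output from any real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for this example, not real embeddings.
toy_embeddings = {
    "budget": [0.9, 0.1, 0.0],
    "financials": [0.8, 0.2, 0.1],
    "giraffe": [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_embeddings["budget"], toy_embeddings["financials"])
unrelated = cosine_similarity(toy_embeddings["budget"], toy_embeddings["giraffe"])
print(related > unrelated)  # True
```

A semantic search system embeds the query the same way and returns the stored items with the highest similarity to it.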

Productivity
Arjay McCandless

A three-phase playbook for landing a tech-internship return offer

Arjay McCandless lays out a structured three-phase internship playbook — relationships and roadmap, ship and deepen, operationalize and cement — with specific tactics for big-tech return offers, drawn from his time at Amazon and Lockheed Martin.[49]Arjay McCandless — how to land a return offer

Phase 1 (weeks 1-4): tell your manager on day one that your goal is a full-time offer and ask what "exceeds expectations" looks like; run 1:1s with everyone to build a knowledge map; create a roadmap doc (one-sentence deliverable, 3-5 demoable milestones, stretch goals, risks, dependencies) and publicly commit to it.[49]Arjay McCandless — return offer playbook The decision is made by managers in a room without you, so make it easy for them to advocate for you.

Ask for help after about 45 minutes to an hour. Not 3 minutes, but not 3 hours.

Phase 2: ship PRs under 150 lines, match existing style, hold weekly 1:1s, request a week-6 midpoint check-in, surface delays immediately.

I've seen way too many interns disappear for 2 weeks and then come out with a 6,000 line PR. Nobody wants this.

Phase 3: operationalize — unit/integration/canary tests, monitoring/alerting (400s, 500s, P99), automated deployment, docs and runbooks. Even if headcount blocks an offer, intern-class relationships are long-term career assets.

Developer Tools
Arjay McCandless

Database sharding interview: from modulo to consistent hashing to virtual nodes

A mock system-design interview walks three levels of database scaling: naive modulo sharding, consistent hashing on a ring, and virtual nodes for even data distribution.[50]Arjay McCandless — Scale a Database

Naive modulo sharding (hash user ID mod machine count) remaps most keys whenever the machine count changes, which means moving most of the data, and that is expensive at scale.[50]Arjay McCandless — DB sharding Consistent hashing places keys and nodes on a hash ring, assigning each key to the next node clockwise, so adding or removing a node moves only the keys in that node's arc of the ring. Virtual nodes (each physical machine represented by multiple ring points) then fix uneven distribution and make rebalancing on failure easier.
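
The difference between the first and last levels can be measured with a small experiment. The sketch below is a minimal illustration rather than the interview's own code: the node names, the choice of SHA-256, and the default of 100 virtual nodes per machine are all assumptions.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


def modulo_shard(key: str, nodes: list[str]) -> str:
    """Level 1: hash mod machine count."""
    return nodes[stable_hash(key) % len(nodes)]


class HashRing:
    """Levels 2 and 3: consistent hashing with virtual nodes."""

    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        # Each physical node appears at `vnodes` points on the ring.
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First ring point clockwise from the key, wrapping past the end.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user-{i}" for i in range(2000)]
old_nodes = ["db0", "db1", "db2", "db3"]
new_nodes = old_nodes + ["db4"]  # add one machine

modulo_moved = sum(
    modulo_shard(k, old_nodes) != modulo_shard(k, new_nodes) for k in keys
) / len(keys)

old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
ring_moved = sum(old_ring.lookup(k) != new_ring.lookup(k) for k in keys) / len(keys)

print(f"modulo moved {modulo_moved:.0%} of keys, ring moved {ring_moved:.0%}")
```

Going from four machines to five, modulo sharding is expected to move about four keys in five, while the ring moves about one in five, and every key that moves on the ring lands on the new machine.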

Hot Take
Dwarkesh Patel

David Reich: the human-evolution consensus may be Ptolemaic epicycles

Geneticist David Reich argues the current model of Neanderthal/Denisovan relationships and modern-human admixture has been patched with ad-hoc additions — like pre-Copernican epicycles — and a simpler unifying theory may be needed.[51]Dwarkesh Patel × David Reich

Reich describes how the field built up archaic-modern human relationships incrementally — distinct modern humans, then Neanderthal-Denisovan sisterhood, then layered mixture events.[51]Dwarkesh × David Reich — human evolution He compares it to geocentric astronomy needing ever-more epicycles until Copernicus showed a heliocentric model explained everything more simply, suggesting current admixture trees may be fundamentally misspecified — though he stops short of proposing the alternative.

It's a little reminds one of what happened in the ancient world where there was this idea that the sun revolves around the earth but it doesn't quite explain the movements of the planets properly.

Sources

  1. Blog Project Glasswing: An initial update — Anthropic, May 22
  2. Newsletter The Batch — Issue 354 — The Batch, May 22
  3. YouTube The bio-weapon version of Mythos — Last Week in AI, May 22
  4. YouTube Google's AI endgame is here… everything you missed at I/O 2026 — Fireship, May 22
  5. YouTube SynthID watermark expanding to more partners — Google DeepMind, May 22
  6. Blog OpenAI named a Leader in enterprise coding agents by Gartner — OpenAI, May 22
  7. YouTube Codex Limits REDUCED BY 50%: So, Codex is worse! What to do now? — AICodeKing, May 22
  8. YouTube Cisco Builds AI Defense with Codex — OpenAI, May 22
  9. Blog FTC to require Cox Media Group settlement on Active Listening claims — Simon Willison's Weblog, May 22
  10. Blog The memory shortage is causing a repricing of consumer electronics — Simon Willison's Weblog, May 22
  11. Newsletter Quantum hits excited state — Sherwood Snacks, May 22
  12. Newsletter How Elon Musk can have his cake and eat it too — Tech Brew, May 22
  13. Newsletter Who is going to save you? — Data Science Weekly, May 22
  14. YouTube Chip design from the bottom up – Reiner Pope — Dwarkesh Patel, May 22
  15. YouTube How The Best Companies Defend Against Mediocrity And Rot — Y Combinator, May 22
  16. YouTube Real Python Podcast #296: Polars Schemas & Profiling GitHub Users — Real Python, May 22
  17. YouTube Brandon Waselnuk: Building the Context Engine AI Agents Need — DeepLearningAI, May 22
  18. YouTube Jerry Liu: My Agent Can't Read a PDF? — DeepLearningAI, May 22
  19. YouTube Or Dagan: Optimizing Accuracy, Cost, and Latency in Real-World Agents — DeepLearningAI, May 22
  20. YouTube Ara Khan: Evals Are Broken, Use Them Anyway — DeepLearningAI, May 22
  21. YouTube João Moura: Building Recurring, Governed, and Embedded Enterprise Workflows — DeepLearningAI, May 22
  22. YouTube Andrew Filev: Multi Model Pipelines—How to Get Better AI Results for Less — DeepLearningAI, May 22
  23. YouTube Manos Koukoumidis & Stefan Webb: VibeML: Build your AI model in hours, not months — DeepLearningAI, May 22
  24. YouTube Paul Everitt: The Shift to Agentic Engineering — DeepLearningAI, May 22
  25. YouTube Diamond Bishop: The Next 100 Agents. Building the Agent Native Office — DeepLearningAI, May 22
  26. YouTube Luke Kim: The Agent Data Stack — DeepLearningAI, May 22
  27. YouTube Daniel Beutel: Flower SuperGrid Agents — DeepLearningAI, May 22
  28. YouTube Andi Partovi: Why Every Agent Needs a Simulation Sandbox — DeepLearningAI, May 22
  29. YouTube Thierry Damiba: Edge to Cloud Video Anomaly Detection — DeepLearningAI, May 22
  30. YouTube Andrew K. Davies: Deterministic Memory: How to Build an AI That Cannot Lie — DeepLearningAI, May 22
  31. YouTube Fast Models Need Slow Developers — Sarah Chieng, Cerebras — AI Engineer, May 22
  32. YouTube Lobster Trap: OpenClaw in Containers from Local to K8s and Back — Sally Ann O'Malley, Red Hat — AI Engineer, May 22
  33. YouTube Gemini Nano on device — Florina Muntenescu & Oli Gaymond, Google DeepMind — AI Engineer, May 22
  34. YouTube The hard part of enterprise AI isn't reasoning | Jake Stauch, Serval — Sequoia Capital, May 22
  35. YouTube Why Notion's Ivan Zhao killed his marketing org — Sequoia Capital, May 22
  36. YouTube The One AI Writing Hack Nobody Talks About. — AI News & Strategy Daily | Nate B Jones, May 22
  37. YouTube Agent Harness explained in 8min.. — Caleb Writes Code, May 22
  38. YouTube The AI Offer You Can Sell Tomorrow Morning — Nate Herk | AI Automation, May 22
  39. YouTube Workspace agents in ChatGPT: Admin and builder controls — OpenAI, May 22
  40. YouTube DeepSeek's New AI Is A Game Changer — Two Minute Papers, May 22
  41. YouTube Your Obsidian Vault Can Hack You — Better Stack, May 22
  42. YouTube Is Your AI Code Producing CRAP? (Here's How To Fix It) — Better Stack, May 22
  43. YouTube Developers Might Finally Have a Local TTS Model That Doesn't Suck — Better Stack, May 22
  44. YouTube This Turns Any Android Tablet Into a PC — Better Stack, May 22
  45. YouTube Let's Play A Little Game — marimo, May 22
  46. YouTube Why You Shouldn't Speed-Read Code Reviews — Real Python, May 22
  47. YouTube Alice Ryhl: Rust's ownership model — The Pragmatic Engineer, May 22
  48. YouTube Semantic Search Starts With Embeddings — DeepLearningAI, May 22
  49. YouTube everything you need to do to land a return offer. (for CS majors) — Arjay McCandless, May 22
  50. YouTube Software Engineering Interview: Scale a Database — Arjay McCandless, May 22
  51. YouTube The current story of human evolution may be incomplete - David Reich — Dwarkesh Patel, May 22