May 22, 2026
Anthropic's Project Glasswing, a cybersecurity initiative with ~50 partners, used the unreleased Claude Mythos Preview to surface over 10,000 high- or critical-severity vulnerabilities, with external validators confirming 90.6% of assessed findings valid and some partners' bug-find rates jumping more than tenfold.[1]Anthropic — Project Glasswing initial update This is the same model that the UK AI Security Institute says can reliably run attacks that take humans 3 hours, and that researchers used to find a kernel exploit in Apple's security.[2]The Batch #354 — LLMs enabling industrial-scale cyberattacks Anthropic's own framing: the bottleneck has flipped from finding bugs to verifying, disclosing, and patching them.
Launched May 2026, Glasswing scanned 1,000+ open-source projects and produced roughly 6,202 high/critical findings; independent firms verified 90.6% as valid and confirmed 62.4% as high or critical severity.[1]Anthropic — Project Glasswing Mythos Preview is being held back from public release due to "insufficient safeguards." External validators: the UK AI Security Institute (confirmed Mythos solved both cyber-range simulations end-to-end), Mozilla (271 vulnerabilities found in Firefox 150, 10x prior rates), and offensive-security firm XBOW. A standout find: CVE-2026-5194, a certificate-forging bug in wolfSSL enabling fake-website attacks. Anthropic also shipped Claude Security for enterprise scanning and a Cyber Verification Program for researchers.
Progress on software security used to be limited by how quickly we could find new vulnerabilities. Now it's limited by how quickly we can verify, disclose, and patch.
Google researchers documented LLMs discovering zero-days in widely used web admin tools before disclosure, plus morphing malware that rewrites decryption routines to evade antivirus.[2]The Batch #354 — cybersecurity alarms The UK AISI reports both Claude Mythos Preview and GPT-5.5 can reliably execute attacks that take humans 3 hours (up from a prior 1-hour forecast), and longer ones when given more tokens. Google's conclusion echoes Anthropic's: models may exploit bugs faster than cyber teams can patch.
Last Week in AI ties Mythos to Andy Jones (now at Anthropic), whose independent work trained an AI to master a scalable game with a tunable complexity knob.[3]Last Week in AI — The bio-weapon version of Mythos The argument: if a model can implement an end-to-end pipeline as non-trivial as AlphaZero, that's an early signal of AI automating meaningful parts of R&D — with the "bio-weapon version" framing pointing at the same capability turned toward dangerous domains.
Google I/O 2026 leaned all the way into the "agentic Gemini era": Gemini Omni (any-in/any-out multimodal), Gemini Flash 3.5 (benchmarks near Opus 4.7 and GPT-5.5 but ~3x pricier than the prior Flash and ~30x the original Gemini 1.5 Flash), a TPU lineup split into training (TPU-T) and inference (TPU-I) chips, and the Antigravity IDE (ex-Windsurf) recast as an agent manager.[4]Fireship — Google's AI endgame at I/O 2026 Alongside it, SynthID 2 picked up OpenAI, Kakao, and ElevenLabs as adopters.[5]Google DeepMind — SynthID expanding to more partners
Google's token throughput has grown from 9.7 trillion/month two years ago to 3.2 quadrillion/month today.[4]Fireship — I/O 2026 Gemini Omni is positioned as a world-model-style system handling text, video, and sound in and out. Gemini Flash 3.5 claims near-parity with Opus 4.7 and GPT-5.5 on speed-vs-intelligence — but Gemini 3.5 Pro was notably not announced (expected later in summer). A new "Neural Expressive" design system generates UI on demand (diagrams, timelines, mini-apps from prompts).
Antigravity (formerly Windsurf) now mirrors OpenAI Codex's agent-management focus; an on-stage demo built a full OS from scratch over ~12 hours and billions of tokens, then live-patched missing GPU drivers to run Doom. For web devs, Chrome's new HTML-on-Canvas API renders native HTML elements directly into WebGL/WebGPU canvases.
Google is no longer trying to organize the world's information with blue hyperlinks… Instead, Google is trying to become the interface to reality itself before Anthropic and OpenAI create better realities.
OpenAI, Kakao, and ElevenLabs are adopting SynthID 2 for content watermarking, and Google is adding C2PA content-credentials verification across its products.[5]Google DeepMind — SynthID The SynthID detector (already used by millions in the Gemini app) is expanding to Google Search (via circle-to-search) and Chrome (right-click), so users can ask whether an image was AI-generated and get a clear answer with context.
Gartner named OpenAI a Leader in its April 2026 Magic Quadrant for Enterprise AI Coding Agents, citing Codex's agentic dev, OS-level sandboxing, and governance — Codex now sees 4M weekly users across Cisco, Datadog, Dell, and NVIDIA.[6]OpenAI — named a Leader in enterprise coding agents by Gartner Days later, OpenAI quietly halved Codex consumer rate limits with no acknowledgment, which AICodeKing reads as a deliberate bait-and-switch as enterprise compute demand ramps.[7]AICodeKing — Codex limits reduced by 50% Cisco, meanwhile, says Codex wrote the majority of its AI Defense platform, compressing quarters into weeks.[8]OpenAI — Cisco builds AI Defense with Codex
Gartner highlighted Codex's broad surface — app, IDE extensions, CLI, SDKs, cloud orchestration — plus enterprise controls (approval gates, RBAC, auditable workspace governance).[6]OpenAI — Gartner Leader Since the evaluation, OpenAI shipped GPT-5.5, Codex Security with GPT-5.5-Cyber, mobile, Remote SSH, scoped access tokens, HIPAA-compliant deployment, Codex on Amazon Bedrock, and GSI partnerships (Accenture, Capgemini, Cognizant, Infosys, PwC, TCS). Until June 12, eligible enterprise accounts can request two months of free Codex for new users.
Cisco deployed Codex across every engineer on its AI Defense team; multi-quarter features dropped to weeks, and the open-source tool defense-claw went from idea to community use in under a week.[8]OpenAI — Cisco
There's a fundamental change in the frame of reference… they're just thinking how long will a Codex run take to be able to get these things done.
Users reported Codex hitting limits ~2x faster than before, with OpenAI silent for 11-12 hours. AICodeKing reads it as intentional: Codex was a loss-leader to capture users, and consumer limits get squeezed as enterprise contracts (served by dedicated deployed models) kick in.[7]AICodeKing — Codex limits halved His recommended escapes: GLM Coding Plan ($345/yr, plus an $80 migration offer for Codex refugees through July 19), Antigravity ($20/mo), Verdant ("what Antigravity wanted to be," with a manager mode), Command Code ($1/mo with ~$40 of DeepSeek V4 Pro), and Kimiko ($19-$199/mo).
Either they take your data or your money or both.
The FTC settled with Cox Media Group and two other firms for nearly $1M after they marketed an "Active Listening" ad service claiming smart devices captured "real-time intent data by listening to our conversations" — when the FTC found it didn't use voice data at all, just resold broker email lists at a markup.[9]Simon Willison — FTC Active Listening settlement Simon Willison, who theorized exactly this in 2024, gets his vindication.
Willison had argued in September 2024 that the "active listening" language was marketing hype from a team that got "over-excited" without understanding the tech or legal exposure. The FTC's complaint confirmed it: the service "did not, in fact, listen in on consumers' conversations or use voice data at all."[9]Simon Willison — FTC Active Listening The FTC also rejected the consent fig leaf.
Clicking through mandatory terms of service does not constitute 'opt-in consent' for such an invasive service or for use of consumers' voice data from inside their homes.
AI data-center demand for HBM memory is eating wafer capacity that used to make consumer DDR and LPDDR, pushing up phone and PC prices — with sub-$100 budget devices in Africa and South Asia hit hardest.[10]Simon Willison — The memory shortage The structural killer: a single gigabyte of HBM consumes more than 3x the wafer capacity of a gigabyte of DDR/LPDDR.
David Oks's piece ("AI is killing the cheap smartphone") notes only three major memory makers exist, all with fixed wafer capacity split across DDR, LPDDR, and HBM.[10]Simon Willison — memory shortage HBM's share of that fixed capacity has surged from 2% to an expected 20% by end of 2026. Manufacturers deliberately under-provision fabs (a lesson from competitors who over-built and failed), so the windfall can't be offset by new capacity, and HBM's fat margins lock the constraint in for years.
A single gigabyte of HBM consumes more than three times the wafer capacity that a gigabyte of DDR or LPDDR does.
Beyond the cyber alarms, The Batch #354 carries three notable items: Nous Research's open-source Hermes Agent passing OpenClaw on OpenRouter's token leaderboard, Thinking Machines Lab's 276B-MoE TML-Interaction-Small that responds in 0.40s without turn-taking, and a CMU/Stanford study showing agent benchmarks overindex on software/math while ignoring 18.2M administrative and 11M management workers.[2]The Batch #354
Nous Research's Hermes Agent (launched Feb 2026) surpassed OpenClaw on OpenRouter's token-consumption leaderboard.[2]The Batch #354 — Hermes vs OpenClaw It runs an agentic loop assembling personality, instructions, tools, skills, and memory; auto-creates skills from successful tasks in SKILL.md format; a Curator archives unused skills after 90 days; supports ~20 messaging services and persistent goal-tracking via a judge model. Some users note it's less token-efficient.
Thinking Machines Lab's 276B-param MoE (12B active) processes audio, video, and text concurrently in 200ms micro-turns, responding in 0.40s vs Gemini-3.1-flash-preview (0.57s) and GPT-Realtime-2 (1.18s).[2]The Batch #354 — TML-Interaction-Small On interactivity (interruptions/interjections) it scored 77.8 avg quality vs GPT-Realtime-2's 47.8 and Gemini's 45.5; on Audio MultiChallenge reasoning, 43.4% (behind GPT-Realtime-2's 48.5%). Closed research preview coming.
CMU and Stanford (led by Zora Z. Wang) mapped 10,000+ examples from 43 agent benchmarks against O*NET labor data.[2]The Batch #354 — agent benchmarks misaligned Benchmarks hold 8,622 "computer and mathematical" examples for a 5.2M-worker field, but only 3,186 examples for the 18.2M-worker admin sector and 676 for 11M managers. GDPval covered best (47.8% of work activities, 58.5% of skills). The takeaway: large untapped agent opportunity in admin, finance, and management.
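To put that imbalance in per-worker terms, here is a small sketch that divides the example counts by the workforce figures quoted above. The dictionary layout and function name are illustrative and do not come from the study.

```python
# Benchmark example counts and workforce sizes as quoted in the summary above.
SECTORS = {
    "computer and mathematical": (8_622, 5_200_000),
    "administrative": (3_186, 18_200_000),
    "management": (676, 11_000_000),
}


def examples_per_million_workers(examples: int, workers: int) -> float:
    """Benchmark examples available per one million workers in a sector."""
    return examples / workers * 1_000_000


if __name__ == "__main__":
    for sector, (examples, workers) in SECTORS.items():
        rate = examples_per_million_workers(examples, workers)
        print(f"{sector}: {rate:.0f} examples per million workers")
```

On these figures, software and math work gets roughly 1,660 examples per million workers, against about 175 for administrative work and about 61 for management.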
The Trump administration awarded $2B in grants — with government equity stakes — to nine quantum computing companies, IBM taking $1B to launch a quantum-foundry subsidiary called "Anderon." Sector stocks jumped 12-33%.[11]Sherwood Snacks — Quantum hits excited state The same newsletter notes Nvidia fell post-earnings for a 4th straight quarter — but recovered and beat the S&P within three months each prior time.
IBM ($1B, +12%) is matching the grant dollar-for-dollar and billing Anderon as "America's first pure-play quantum foundry."[11]Sherwood Snacks — quantum Others: GlobalFoundries ($375M, +15%), D-Wave Quantum ($100M, +33%), Rigetti ($100M, +30%), Infleqtion ($100M, +31%), Atom Computing, PsiQuantum, Quantinuum ($100M each), and Diraq ($38M). Infleqtion's CEO called the selection "a very rigorous technical evaluation over many, many months."
Nvidia declined for a 4th consecutive time after beating earnings expectations, but three months after each prior sell-off, shares had recovered and outperformed the S&P 500. It holds the 2nd-best revenue growth among S&P 100 names and is 4th-cheapest on PEG, suggesting another buy-the-dip setup.
SpaceX filed for an IPO targeting a $1.75 trillion valuation while posting a $4.9B net loss — repositioning itself as an AI infrastructure company (citing a $26.5T AI market and Anthropic's $15B/year compute commitment), with Musk retaining 85% of voting power as CEO, CTO, and chairman.[12]Tech Brew — How Elon Musk can have his cake
Tech Brew's Whizy Kim frames the filing as one of the largest in history, structured to give retail investors essentially no governance check despite reserving a high share allocation for them.[12]Tech Brew — SpaceX IPO The prospectus pivots the narrative from rockets to AI, leaning on Anthropic's $15B annual compute commitment for credibility. SpaceX also unexpectedly holds significant Bitcoin reserves, and third-gen Starship launches have repeatedly slipped. The piece argues Tesla could suffer as SpaceX competes for the same investor dollars and Musk's attention.
It isn't really about rockets or AI—it's a bet on Elon Musk himself.
This week's Data Science Weekly argues data professionals need a corporate sponsor and shouldn't put all their "political eggs in one basket" as AI reshapes job markets and headcount.[13]Data Science Weekly — Who is going to save you?
The May 22 paid issue explores career resilience for data scientists and analysts amid AI automation pressure — the risks of depending on a single executive sponsor or internal ally, and the broader trend of AI thinning analytics and data-science teams.[13]Data Science Weekly The body is paywalled; the argument is carried by the subtitle and framing.
MatX CEO Reiner Pope walks through AI chip design from logic gates up (multipliers, systolic arrays, clock cycles, TPU-vs-GPU) with one recurring thesis: the dominant cost in hardware is moving data, not doing math.[14]Dwarkesh Patel × Reiner Pope — Chip design from the bottom up It is the clearest mental model you'll find of why low-precision arithmetic and Tensor Cores exist.
~00:00 — Logic gates and the MAC primitive. Matrix multiply is a triple for-loop, so a multiply-accumulate happens at every step; accumulation needs higher precision than multiplication because summation errors compound (e.g. 4-bit multiply with 8-bit add).[14]Dwarkesh × Reiner Pope
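A minimal sketch of the triple for-loop, with the multiply-accumulate marked. It uses plain Python integers rather than any hardware number format, and `bits_needed` is a helper introduced here only to show why the accumulator has to be wider than the operands.

```python
def matmul_mac(a, b):
    """Triple-loop matrix multiply; the innermost statement is one MAC."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0  # accumulator: must hold a sum of many products
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # multiply-accumulate (MAC)
            out[i][j] = acc
    return out


def bits_needed(value: int) -> int:
    """Bits required to store a non-negative integer."""
    return max(1, value.bit_length())


if __name__ == "__main__":
    print(matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    largest_product = 15 * 15  # two unsigned 4-bit operands
    print(bits_needed(largest_product))       # width of one product
    print(bits_needed(16 * largest_product))  # width after summing 16 of them
```

With unsigned 4-bit operands a single product already needs 8 bits, and the running sum keeps growing with the length of the inner loop, which is the integer analogue of the precision point made in the talk.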
~04:01 — Multiplying by hand. A P-bit by Q-bit multiply needs P×Q AND gates for partial products, summed via full adders (3-to-2 compressors) in a Wallace-tree; the count works out to exactly P×Q full adders.
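The partial-product step can be sketched in software: one AND per pair of input bits, each result weighted by its bit position. In hardware the summation is done by the adder tree; here ordinary integer addition stands in for it, so this only illustrates the AND-gate count, not the adder structure.

```python
def multiply_via_partial_products(x: int, y: int, p: int, q: int):
    """Multiply a P-bit x by a Q-bit y using one AND per partial-product bit.

    Returns (product, and_gate_count).
    """
    if not (0 <= x < (1 << p) and 0 <= y < (1 << q)):
        raise ValueError("operands must fit in P and Q bits")
    and_gates = 0
    product = 0
    for i in range(p):          # bit i of x
        for j in range(q):      # bit j of y
            bit = ((x >> i) & 1) & ((y >> j) & 1)  # one AND gate
            and_gates += 1
            product += bit << (i + j)  # stand-in for the hardware adder tree
    return product, and_gates


if __name__ == "__main__":
    print(multiply_via_partial_products(13, 11, 4, 4))
```

For a 4-bit by 4-bit multiply the loop body runs 16 times, matching the P×Q AND-gate count.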
~12:12 — Quadratic area scaling. Chip area for arithmetic scales quadratically with bit width — "the single reason low-precision arithmetic has worked so well for neural nets." Nvidia historically doubled FLOPs per halved precision, but quadratic scaling implies the speedup should be larger; around B300, FP4 is 3x faster than FP8 (pure scaling says 4x).
This quadratic scaling… is like the single reason why low precision arithmetic has worked so well for neural nets.
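A worked version of that scaling argument, under the simplifying assumption that multiplier area A grows with the number of partial-product bits, so a b-bit by b-bit multiply costs about b². Real floating-point formats split their bits between exponent and mantissa, so treat this as an idealization.

```latex
A(b) \propto b^{2}
\qquad\Longrightarrow\qquad
\frac{A(8)}{A(4)} = \frac{8^{2}}{4^{2}} = 4
```

Under this model, halving precision from 8 to 4 bits fits about four times as many multipliers into the same area, which is the 4x that the observed 3x is being compared against.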
~16:20 — The mux problem. In old CUDA cores, a register file feeds the MAC through multiplexers; with three inputs, ~7/8 of circuit area goes to reading/writing the register file (hidden data movement), not the logic that matters.
It's muxes all the way down.
~25:33 — Systolic arrays / Tensor Cores. Bake two loop levels into hardware for X×Y compute at ~X communication: weights stay fixed locally and are reused across input vectors, trickle-fed via a daisy chain because "bandwidth equals die area." Older TPUs used 128×128 arrays.
Bandwidth equals die area.
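The reuse argument can be counted directly. The sketch below pushes one input vector through an X-by-Y grid of resident weights and tallies multiply-accumulates against values streamed in. It is a functional model only: it ignores the Y outputs leaving the array, the one-time weight load, and all timing.

```python
def weight_stationary_pass(weights, inputs):
    """Push one input vector through an X-by-Y grid of resident weights.

    Returns (outputs, mac_count, values_streamed_in).
    """
    x_dim, y_dim = len(weights), len(weights[0])
    if len(inputs) != x_dim:
        raise ValueError("input length must match the array's X dimension")
    outputs = [0] * y_dim
    mac_count = 0
    for i in range(x_dim):
        for j in range(y_dim):
            # Each cell reuses its stored weight; only inputs[i] arrives.
            outputs[j] += inputs[i] * weights[i][j]
            mac_count += 1
    return outputs, mac_count, x_dim


if __name__ == "__main__":
    grid = [[1] * 128 for _ in range(128)]
    _, macs, streamed = weight_stationary_pass(grid, [1] * 128)
    print(macs, streamed)
```

For a 128×128 array that is 16,384 MACs for 128 streamed input values per vector.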
~38:44 — Clock cycles and critical paths. Chips sync ~every nanosecond; the slowest "cloud of logic" between registers (critical path) bounds clock speed, which is why two chips on TSMC 3nm can clock differently. Pipeline registers can halve path length but can't naively pipeline feedback loops (e.g. a running accumulator).
~51:52 — FPGA vs ASIC. Same model, but ASICs are ~10x cheaper/more efficient at scale; FPGAs (LUTs = configurable truth tables, ~32 gates to emulate ~3 ASIC gates) win when designs change often and you want deterministic low latency (HFT at Jane Street). Non-determinism in CPUs comes mainly from caches; TPUs use software-managed scratchpads instead.
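A LUT as a "configurable truth table" can be sketched in a few lines. The class below is a software illustration, not a model of any vendor's FPGA fabric: the same structure becomes a different logic function purely by changing the stored table.

```python
class LookupTable:
    """A k-input LUT: 2**k stored output bits, indexed by the input bits."""

    def __init__(self, num_inputs: int, truth_table: list[int]):
        if len(truth_table) != 1 << num_inputs:
            raise ValueError("truth table must have 2**num_inputs entries")
        self.num_inputs = num_inputs
        self.table = list(truth_table)

    def __call__(self, *bits: int) -> int:
        if len(bits) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        index = 0
        for bit in bits:  # first input is the most significant index bit
            index = (index << 1) | (bit & 1)
        return self.table[index]


# Two configurations of the same 3-input LUT: the sum and carry of a full adder.
full_adder_sum = LookupTable(3, [0, 1, 1, 0, 1, 0, 0, 1])    # a XOR b XOR c
full_adder_carry = LookupTable(3, [0, 0, 0, 1, 0, 1, 1, 1])  # majority(a, b, c)

if __name__ == "__main__":
    print(full_adder_sum(1, 1, 0), full_adder_carry(1, 1, 0))
```

Reprogramming an FPGA amounts to rewriting tables like these, which is the flexibility an ASIC gives up in exchange for density and efficiency.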
~75:09 — GPU vs TPU. "A GPU has a lot of tiny tiny TPUs tiled across the whole chip" (Tensor Core ≈ MXU), while a TPU uses a few coarse-grained matrix units around a central vector unit. Coarse-graining enables bigger arrays but lowers vector-to-matrix bandwidth; MatX has discussed a "splittable systolic array" that acts as big or small arrays.
Lean Startup author Eric Ries argues shareholder primacy is a value-destroying 1980s invention, and that founders can use structural tools — PBCs, perpetual purpose trusts, industrial foundations — to build companies that outlast temporary managers and investors. Case studies: Costco, Novo Nordisk, and Anthropic.[15]Eric Ries at YC — defending against mediocrity and rot
~00:00 — Why Incorruptible. The more valuable an organization, the more it becomes "a target" worth taking over.[15]Eric Ries at YC
We're in this era now where we have temporary organizations being led by temporary managers on behalf of temporary investors.
~02:03 — The professor. An AI-plus-bioscience founder whose lawyer's "best practice" charter would legally force a sale even to "the most evil company in the world."
~08:09 — Twilio's cautionary tale. Jeff Lawson built ~$4B revenue, stock up 390% since IPO, but accepted dual-class control that sunset after 7 years — and was fired 199 days after expiration by less than half a percent of shareholders.
~12:10 — Saul Price → Costco. FedMart's "fiduciary to the customer" model (customers first, employees second, shareholders third); after investors locked Price out in 1975, FedMart went bankrupt and Price founded what became Costco.
Shareholder value is like the exhaust that comes out of the engine. When you take the exhaust pipe and put it in the intake and make that your explicit goal, now you don't stand for anything anymore.
~13:10 — Philip Morris. ~$8B net income vs ~$600B/year in external costs; tobacco makes ~$6,000 from every customer who dies.
~23:16 — PBCs and the history of shareholder primacy. The easiest fix is converting to a Public Benefit Corporation — "a two-page legal filing in Delaware." Shareholder primacy was redefined into "any lawful activity" by a small group of academics and judges in the 1960s-70s without any vote or law.
If it's a controversy, it can't be a consensus now, can it?
~34:17 — Novo Nordisk. Its nonprofit-foundation trustees vetoed a ~$20B merger the for-profit board had signed; the company kept growing ~20%/year and later crested ~$600B, briefly exceeding Denmark's GDP. Industrial-foundation companies (e.g. Zeiss, 1885) are 6x more likely to survive to year 50 (60% vs 10%).
~45:23 — Anthropic's long-term benefit trust. A perpetual purpose trust gives outside trustees power to appoint for-profit directors; Ries credits this "ethos plus integrity" structure for Anthropic's talent advantage and courage.
Ethos plus integrity equals incorruptible.
Best practice equals Kroger practice. This is someone who wants to be more like Kroger and less like Costco. Why would you want that?
Hosts Christopher Bailey and Christopher Trudeau cover CPython/Django releases, Python's get-related dunder methods, handling schema drift in Polars pipelines, pip 26.1, an inverse-Sapir-Whorf essay, the CADQuery 3D library, and Eric Matthes's GH Profiler for screening AI-slop pull requests.[16]Real Python Podcast #296
~00:00 — Themes. Avoiding Polars schema problems and quickly profiling a GitHub user to gauge their contributions, framed around the rise of AI-generated slop PRs.[16]Real Python #296
Slop-a-palooza is now a thing.
~03:04 — Releases. Python 3.14.5 removed the incremental GC after production issues; 3.15 beta 1 hit feature freeze (Oct target); PyPy 7.3.22 bugfix; Django 6.0.5/5.2.14 patched three security issues incl. an ASGI file-upload DoS. PEP 797 (shared object proxies) and PEP 828 (yield from in async generators) slipped to 3.16.
~05:05 — Get-dunder deep dive. Stephen Gruppetta's tutorial through __getitem__, __getattr__, __getattribute__, and __get__ (descriptors).
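A compact illustration of when each of the four methods fires. The `Celsius` and `Reading` classes are invented for this sketch and are not taken from the tutorial.

```python
class Celsius:
    """Descriptor: __get__ runs when the attribute is read via an instance."""

    def __get__(self, instance, owner=None):
        if instance is None:  # accessed on the class itself
            return self
        return (instance.fahrenheit - 32) * 5 / 9


class Reading:
    celsius = Celsius()  # descriptors live on the class, not the instance

    def __init__(self, fahrenheit, tags):
        self.fahrenheit = fahrenheit
        self.tags = tags
        self.secret = "hidden"

    def __getitem__(self, index):
        # Called for reading[index].
        return self.tags[index]

    def __getattribute__(self, name):
        # Called on every attribute read, before normal lookup.
        if name == "secret":
            raise AttributeError(name)  # hands control to __getattr__
        return object.__getattribute__(self, name)

    def __getattr__(self, name):
        # Called only when __getattribute__ raises AttributeError.
        return f"<no attribute {name!r}>"


if __name__ == "__main__":
    r = Reading(212, ["lab", "boiling"])
    print(r[1], r.celsius, r.missing, r.secret)
```

Note the asymmetry: `__getattribute__` intercepts every lookup, while `__getattr__` is only the fallback, so the stored `secret` value is never returned.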
~12:12 — Polars schema drift. Tai Neworp's post classifies changes as additive, subtractive, type drift, and breaking; Polars infers CSV types from the first 100 rows (infer_schema=False forces all-string), across CSV, Parquet, Delta Lake, and Iceberg.
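Three of those categories can be detected by comparing an expected schema with an observed one. The sketch below uses plain dictionaries of column name to type name rather than the Polars API, and the function name is hypothetical. It leaves out "breaking", since the summary does not say how the post defines it.

```python
def classify_schema_drift(
    expected: dict[str, str], observed: dict[str, str]
) -> dict[str, list[str]]:
    """Compare two {column: dtype} mappings and group the differences."""
    expected_cols, observed_cols = set(expected), set(observed)
    return {
        # Columns that appeared in the new data.
        "additive": sorted(observed_cols - expected_cols),
        # Columns that disappeared from the new data.
        "subtractive": sorted(expected_cols - observed_cols),
        # Columns present in both whose type changed.
        "type_drift": sorted(
            col
            for col in expected_cols & observed_cols
            if expected[col] != observed[col]
        ),
    }


if __name__ == "__main__":
    expected = {"id": "Int64", "amount": "Float64", "note": "String"}
    observed = {"id": "String", "amount": "Float64", "region": "String"}
    print(classify_schema_drift(expected, observed))
```

A check like this would run before the pipeline's main transforms, so a drifted file fails loudly instead of being silently coerced.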
~16:17 — pip 26.1. Drops Python 3.9, targets 3.10+; experimental install from pylock.toml (PEP 751); dependency cooldowns via --exclude-newer/--uploaded-prior-to (e.g. p3d to skip the last 3 days) to blunt supply-chain attacks.
~25:31 — Inverse Sapir-Whorf. Luke Plant's essay on ideas your language won't let you express (Python devs can't conceive of out-of-order argument evaluation; Haskell's non-strict semantics permit it).
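The Python half of that claim is easy to observe: call arguments are evaluated left to right in the order they are written, including keyword arguments. The helper names below are made up for the demonstration.

```python
calls = []


def tag(label):
    """Record the moment this argument expression is evaluated."""
    calls.append(label)
    return label


def combine(first, second, third):
    return (first, second, third)


# Positional arguments: evaluated in written order.
result = combine(tag("a"), tag("b"), tag("c"))

# Keyword arguments: still evaluated in written order,
# even though they bind to parameters in a different order.
keyword_result = combine(third=tag("x"), first=tag("y"), second=tag("z"))

if __name__ == "__main__":
    print(calls)
    print(result, keyword_result)
```

Because the order is guaranteed, Python code can rely on side effects in argument expressions, which is exactly the assumption a non-strict language does not let you make.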
It's like C and Rust had a baby, and so it got rid of the crunchy parts of C.
~33:39 — CADQuery. A Python parametric 3D CAD library on OCP (OpenCASCADE wrapper), outputting STEP/AMF/3MF/STL/DXF.
~37:40 — GH Profiler. Eric Matthes's CLI reports account age and activity for a GitHub user/PR/issue — PRs in the last 21 days, issues opened/closed-as-not-planned, and duplicate-title detection to flag mass-submitted slop; has a --redact flag.
Unblocked founder Brandon Waselnuk argues naive RAG, more MCPs, bigger context windows, and rules files all fail as agent context strategies — and that a proper context engine (unified ingestion, entity graphs, conflict resolution, token-optimized retrieval) cut task time 80% and token cost 50% in controlled tests.[17]Brandon Waselnuk at AI Dev 26
~00:08 — Every agent is 'you on day one.' Technically capable but knowing nothing about how your company works; the context engine supplies institutional knowledge at machine speed.[17]Brandon Waselnuk — context engines
The gap is no longer intelligence. It's context.
~01:09 — Three stall-out points. Curated docs rot; the MCP plateau ("satisfaction of search" — agents accept the first stale page they find); and the background-agent wall where static files aren't enough.
~06:14 — Four myths. Naive RAG, enough MCPs, bigger context windows, and rules files (CLAUDE.md, routinely ignored) are each not a context engine.
~07:15 — What a context engine must do. Know who's asking and their team, resolve document conflicts (a Slack thread between CTO and chief architect beats stale source code), respect permissions, and deliver token-optimized scoped responses.
~10:18 — Unblocked architecture. API ingestion from Slack/Notion/GitHub → embeddings → multiple entity graphs → MCP/CLI/API egress. One customer: 115,000 repos, 30,000 developers.
~13:20 — Two open-source tools. A social graph builder (contributor/ownership/expert map from a repo) and a repo rules agent (dedups CLAUDE.md/AGENTS.md rules, flags conflicts).
~19:25 — The benchmark. With Unblocked: 10M tokens, 25 minutes. Without: 21M tokens, 2h32m — and the output would have taken down production if merged.
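The quoted benchmark can be checked against the headline figures with a few lines of arithmetic. The function name is illustrative; the inputs are the numbers given above.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


minutes_without = 2 * 60 + 32  # 2h32m
minutes_with = 25
time_saved = percent_reduction(minutes_without, minutes_with)
tokens_saved = percent_reduction(21_000_000, 10_000_000)

if __name__ == "__main__":
    print(f"time: {time_saved:.1f}% lower, tokens: {tokens_saved:.1f}% lower")
```

That works out to about 84% less time and about 52% fewer tokens, in line with the "80% / 50%" claim made for the talk.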
AI generated code should feel like it was written by someone who has been on your team for years.
~20:27 — Hard lessons. Optimizing for access (more pipes) ≠ understanding; hiding conflicts (a 50/50 coin flip) causes bad outcomes; caching "correct" answers is a trap as PRs ship.
LlamaIndex CEO Jerry Liu explains why PDF parsing remains unsolved — PDFs are display instructions, not semantic containers — and presents LlamaParse (commercial OCR), ParseBench (open benchmark), and LightParse (free open-source fast parser). Key finding: more reasoning tokens do NOT improve visual document accuracy.[18]Jerry Liu at AI Dev 26
~00:07 — LlamaIndex context. 1B+ pages processed, 300,000 users; narrowed from a RAG framework to agentic document infrastructure.[18]Jerry Liu — PDF parsing
~05:10 — Why PDFs are hard. A PDF is PostScript display instructions — character coordinates with no inherent table structure, reading order, or semantic labels. A table is just lines and positioned glyphs.
~09:16 — Evolution of approaches. Heuristics (Tesseract, PyPDF) → specialized small models (Donut) → VLMs (GPT-4 Vision onward).
~11:16 — The finding. Increasing reasoning tokens in frontier models does NOT improve visual understanding because post-training targets coding/math. VLM parsing is also expensive — Gemini Pro ~$0.08+/page; Opus 4.6 ~53% accuracy on ParseBench.
Increased thinking in the frontier models… generally does not correlate to increased visual understanding accuracy.
~16:18 — ParseBench. 2,000 human-verified pages (financial, insurance, legal) measuring dense tables, charts, faithfulness, semantic formatting, and visual grounding. Leaderboard at parsbench.ai.
~19:20 — LightParse. Free, open-source, no VLMs; outperforms PyPDF and other model-free parsers; Rust rewrite in progress; installable as an agent skill and a good first pass for Claude Code before handing complex pages to a VLM. Simon Willison gave it a shout-out.
~25:24 — Market macro. 2023 basic RAG → 2025-26 full agents (50-80%+ work automation). The moat is no longer the harness (Claude Code is good) but the context and workflow layer. Customers: Carlyle Group, SAP.
The model harnesses are getting pretty good… the thing that provides the alpha or moat on top of this is really the context and the workflow layer.
AI21 Labs' Or Dagan presents AI21 Maestro, a two-phase system that automatically finds the optimal configuration of models, tools, and execution strategies for agents — hitting state-of-the-art on BrowseComp Plus via automated Pareto-frontier optimization instead of months of manual tuning.[19]Or Dagan at AI Dev 26
~00:08 — The trilemma. Accuracy, cost, and latency feel zero-sum, and manual optimization is slow and not future-proof.[19]Or Dagan — agent optimization
~02:08 — Configuration sweep. Testing 3-5 models against 3-4 retrieval tools yields a 15-point spread; the Pareto frontier becomes the baseline (GPT-5 + a latent-interaction retriever ~90% on BrowseComp Plus, prior SOTA).
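Picking the Pareto frontier out of a sweep is a small filtering problem. This sketch uses two objectives (accuracy up, cost down). The configuration names and numbers are hypothetical placeholders, not results from the talk.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float  # higher is better
    cost: float      # lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is at least as good as `b` on both axes and better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    strictly_better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Configs not dominated by any other, sorted by cost."""
    frontier = [
        c for c in configs if not any(dominates(other, c) for other in configs)
    ]
    return sorted(frontier, key=lambda c: c.cost)


# Hypothetical sweep results for illustration only.
SWEEP = [
    Config("model-a + retriever-x", 0.90, 10.0),
    Config("model-b + retriever-x", 0.75, 2.0),
    Config("model-c + retriever-y", 0.80, 5.0),
    Config("model-d + retriever-y", 0.70, 6.0),
    Config("model-b + retriever-z", 0.75, 3.0),
]

if __name__ == "__main__":
    for config in pareto_frontier(SWEEP):
        print(config)
```

Everything off the frontier is strictly wasteful: some other configuration is at least as accurate for no more money. Adding latency as a third axis only changes the `dominates` test.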
~05:10 — Best-of-N. Running GPT-5 with latent interaction 4x beats prior SOTA with minimal latency cost; Minimax at 60% single-run accuracy becomes competitive when run 8-16x in parallel, and is cheaper and faster.
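Why a 60% model can become competitive follows from simple probability, under two strong assumptions that are not from the talk: runs are independent, and a perfect selector always picks a correct answer when one exists. Real runs are correlated and selection is imperfect, so read this as an upper bound.

```python
def best_of_n_success(p_single: float, n: int) -> float:
    """Chance that at least one of n independent runs succeeds."""
    if not 0.0 <= p_single <= 1.0 or n < 1:
        raise ValueError("need 0 <= p_single <= 1 and n >= 1")
    return 1.0 - (1.0 - p_single) ** n


if __name__ == "__main__":
    for n in (1, 4, 8, 16):
        print(n, round(best_of_n_success(0.6, n), 4))
```

At a 60% single-run rate the bound passes 97% by four runs and 99.9% by eight, which is why parallel sampling of a cheap model is worth putting in the sweep.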
~07:10 — Ensemble portfolios. Inspired by Karpathy's LLM Council: models have non-overlapping strengths, so a 3-model ensemble beats any single model while cutting cost by more than half and latency by 20%.
~09:13 — Configuration explosion. Models × tools × prompts × N-samples × ensembles × strategies makes manual search intractable.
Behind every top-one leaderboard spot… there are months and hundreds and thousands of dollars spent on researching all those configurations.
~12:14 — AI21 Maestro. An offline phase trains an "action model" predicting accuracy/cost/latency for any config; at runtime it spends more if budget allows or stops early if a cheap model succeeds. Automatic, efficient, observable, and future-proof (only retrain the new model's configs).
~15:18 — Deep Research Bench. Vertical scaling with dynamic rubrics and "anytime" generation that improves until time or money runs out.
Cline's Ara Khan argues most people are wrong about evals, either over-trusting benchmarks or dismissing them, and walks through a three-level framework for building, interpreting, and using evals to iteratively improve coding agents.[20]Ara Khan at AI Dev 26
~01:09 — Two wrong camps. The "objective metrics" camp that worships benchmark scores and the "vibes is king" camp that ignores numbers; both are wrong.[20]Ara Khan — evals
Evals are not the end all and be all. They're not completely useless. There are right ways to use them and there are wrong ways to use them.
~05:11 — Three-level framework. L1: interpret others' evals (don't trust model-lab evals uncritically; find domain-specific benchmarks). L2: use evals to improve your agents. L3: build your own.
~09:16 — SWE-bench saturation. Coding benchmarks got so gamed that frontier labs stopped reporting them; one model was "benchmark-maxed."
~11:17 — Terminal Bench. 89 real-world SWE tasks (caching bugs, race conditions, frontend, DB) run on isolated Docker containers via Harbor to parallelize runs that took 6-7 hours; Modal as infra.
~21:25 — Three zones. Zone 1: fix obvious harness failures. Zone 2: real hill-climbing (prompts, tool selection, retry logic). Zone 3: overfitting danger. Cline beat Claude Code on Opus 4.5 evals by tuning CPU/memory, raising timeouts, and improving thinking behavior.
Even if you get a good score you always need to make sure you're passing the vibe check.
CrewAI founder João Moura shares how Iris, their internal coding agent, now authors nearly half of all PRs at the company — and extracts enterprise lessons on reusable building blocks, human-in-the-loop via plain email, and the convergence of ad hoc and embedded workflows.[21]João Moura at AI Dev 26
~01:08 — Iris. An internal coding agent built on CrewAI that engineers initially tried to break, now widely adopted.[21]João Moura — CrewAI
~04:10 — ~50% of PRs. A designer with no coding background used Iris to find 130 hard-coded colors and propose refactors for engineers to approve. Iris is self-improving — writes its own skills, flows, and memory.
Iris updates itself, it has its own memories, writes its own skills, writes its own flows, and just keeps on going.
~06:14 — Enterprise data. Runs-per-quarter growing across business domains; customers include AB InBev, Experian, and the DoD.
~08:14 — Two forces. Ad hoc (disposable) vs embedded (process-critical) workflows are blurring; building is being commoditized.
~10:16 — Plugging into Claude Code. CrewAI plugged its whole platform into Claude Code, a surprising unlock that removed friction between ad hoc and embedded use cases. They open-sourced skills.crewai.com and a "decide" skill encoding the company's decision framework.
~15:19 — Human-in-the-loop via email. The most effective mechanism was simple: agents email humans, humans reply, agents continue — no new apps. Observability needs both "zoom out" (org cost/health) and "zoom in" (traces).
~17:21 — The real challenge. Not deployment — discovery and organizational change management.
We thought adopting agents was an engineering problem… We realized now it's actually a transformational problem.
Zencoder CEO Andrew Filev shows that mixing models across plan/implement/review stages delivers better results at 60% lower cost than one frontier model throughout, and that cheaper models often out-implement expensive ones when handed a solid plan.[22]Andrew Filev at AI Dev 26
~00:07 — Role shift. From writing code to building the system that writes code.[22]Andrew Filev — multi-model pipelines
~03:07 — The cost problem. Engineers on frontier models (mostly Opus) burn ~$2,000/month in API calls each.
~05:07 — Plan then implement. Two steps matter for agent quality and human reviewability — a short spec reviews faster than a 50-file diff. Opus 4.6 was their best planner.
~08:09 — Cheap implementers win. On SWE-bench Pro, Gemini Flash as implementer resolved more issues than Opus while being dramatically cheaper. Model diversity (plan with Opus, implement with Gemini) also catches edge cases.
Dumb coding is basically solved. Like the architecturing is not solved, right? But dumb coding is solved.
~11:13 — Multi-model review. Never review with the model that wrote the code (like an audit). Their Opus+Codex+Gemini review pipeline cost $2.50/PR with better precision/recall vs Anthropic's Claude Code review bot at $12/PR (publicly $15-$20).
~14:16 — Verification is system design. End-to-end testing, tracing, observability, and deterministic linters over LLM judgment.
If you're starting to mix models, you can surprisingly get better of everything as opposed to kind of start compromising.
Manos Koukoumidis (ex-Google Gemini) and Stefan Webb present Oumi, a conversational platform for building fine-tuned specialized models, arguing enterprise winners will own their intelligence rather than rent generic APIs. One demo model: 0.8B params beating Opus 4.6 by ~1.5% accuracy at ~100x lower cost.[23]Koukoumidis & Webb at AI Dev 26
~00:08 — Own vs rent. Enterprises are shifting from renting generic intelligence to owning specialized intelligence.[23]Oumi — model factory
The winners of the next era… will not be the enterprises that will continue to consume those APIs from OpenAI, Anthropic, or Google Gemini. It will be the ones that own that intelligence.
~03:09 — Specialization advantages. Cursor built a coding model beating GPT/Claude at 10x lower cost; Intercom built one beating GPT at 5x lower cost. Custom models can be 10-100x smaller with full privacy and no vendor-roadmap dependency.
~05:11 — Umei as model factory. A conversational agentic interface guiding the full fine-tuning lifecycle.
~06:13 — Demo. News-summarization task: define task → agent suggests evaluators (completeness, conciseness, format) → synthesize training data → pick baseline (Qwen 3.5 4B) → evaluate (90% faithfulness) → analyze failure modes → LoRA fine-tune → re-evaluate. Weights download with no royalties.
~18:25 — Real results. A healthcare provider got 20% quality up / 70% cost down on medical-record extraction; the NYT used Umei to find only 39% of claims in Gemini 3's AI Overviews were fully supported by sources.
Only 39% of the claims in Gemini 3's AI Overviews were fully supported by the sources given.
~20:27 — The kicker. A 0.8B-param support-classification model beats Claude Opus 4.6 by ~1.5% accuracy at ~100x lower cost while also running faster. The open-source library has 9,000+ GitHub stars; the enterprise platform drew ~2,000 signups in its first month.
JetBrains' Paul Everitt argues "vibe coding" is a distraction and "agentic engineering" — building the system that builds the system — is the disciplinary reframe to give management, with sub-disciplines from evals to harness engineering to context engineering.[24]Paul Everitt at AI Dev 26
~04:14 — The productivity myth. Individual gains aren't translating to org value; Acemoglu's measurement problem, a DX study finding 10% (not 10x), Simon Willison's quality-regression warning, Ed Zitron on token economics, and only 3% of devs trusting AI code in 2024.[24]Paul Everitt — agentic engineering
Code was never the problem. Code was never the bottleneck.
~12:23 — Origins of the term. Coined by Karpathy, refined by Simon Willison (writing a book), Addy Osmani, and an OpenAI harness-engineering post; Everitt endorses Willison's "dark factory pattern" (humans outside the factory running it).
~14:24 — Harness engineering.
In the old mode, engineers built the thing. In this harness engineering, we build the thing that builds the thing.
~16:28 — Sub-disciplines. Spec-driven development (a JetBrains × DeepLearning.AI course), evals, harness engineering ("if you don't own your harness you don't own your memory"), tooling (Claude/Cloudflare code mode, Pydantic's Rust subset "Monty"), red-green testing, modularity, QA agents, observability.
~22:31 — Context engineering & culture. Cites Waselnuk's Unblocked talk as best-in-class; managing FOBO (fear of being obsolete) in teams.
~24:33 — Call to arms. Grady Booch (UML co-creator) challenged the audience to define "agentic design patterns" as the next foundational contribution. Message to management: augment humans, innovate boldly.
Datadog's Diamond Bishop shares lessons from three production agents (AI SRE, Bits AI Dev, security analyst) and six principles for scaling to the next 100: agent-native UX ("the new Bezos API mandate"), proactive-over-reactive, strong evals, multiplayer, model agnosticism, and the agents' bitter lesson.[25]Diamond Bishop at AI Dev 26
~00:07 — Datadog's three production agents. AI SRE (incident debugging), Bits AI Dev (code fixes), and a security analyst for SOC triage.[25]Diamond Bishop — next 100 agents
~05:10 — Deployment is the bottleneck. <30% of enterprise agents were in production last year; that's shifting in 2026.
Intelligence is no longer the bottleneck.
~07:10 — Agent-native UX. Agents as first-class users: llms.txt, .md docs, MCP servers, CLIs. "The new Bezos API mandate."
~10:12 — Proactive over reactive. Event-driven background agents (Temporal for durable execution), not chat-triggered.
~11:13 — Eval. Their biggest early mistake was shipping Bits SRE without strong evals. Recipe: offline → online observability → living data loops.
~15:15 — Bitter lesson of agents. General methods + powerful off-the-shelf models + good tools win; custom tweaks get torn out.
The general methods that can quickly leverage new off-the-shelf models will win in the long run.
~18:18 — Multiplayer. Agent collaboration is "the new Figma moment"; MCP hubs, skill sharing, human annotation.
~22:23 — Predictions. RL on the job (Tinker from Thinking Machines Lab), self-improving agents, long-horizon durable agents (30-min → 12+ hour → multi-day), better agent auth/authz, multimodal computer use ~1-2 years out, generative UI.
Spice AI founder Luke Kim argues the modern data stack wasn't built for agents — which are always-on, query at scale, and need real-time operational data — and demos Spice, an open-source "sidecar" that federates backend access while isolating agents from direct database exposure.[26]Luke Kim at AI Dev 26
~00:07 — SaaS-era vs agent-era. Agents need real-time OLTP, document stores, message buses, and analytical data simultaneously at 24/7 query rates legacy infra can't sustain.[26]Luke Kim — agent data stack
~02:10 — Infra stress. GitHub's recent outages were attributed in their postmortem to agentic workloads growing orders of magnitude faster than anticipated.
~04:11 — Security. Incidents where agents destroyed production data (e.g. Lovable) show the risk of direct DB access.
Humans don't let humans give agents direct access to databases and data systems.
~06:14 — Spice AI. Federated SQL across Parquet, Iceberg, Snowflake, MySQL, MongoDB, Elasticsearch, HTTP APIs, and GitHub, accelerated by replicating working sets into embedded DuckDB/SQLite/Arrow/Vortex. Agents query the local cache, not production.
Every agent gets its own data stack.
~08:14 — Demo. An OpenClaw agent handles an SRE incident: Grafana fires a latency alert → agent queries Spice for logs, order/user tables, and GitHub troubleshooting guides → diagnoses connection-pool exhaustion (scaling from 1 to 3 replicas exceeded Postgres connection limits) → recommends switching the pooler from session to transaction mode → confirms resolution, all via Slack with traced tool calls.
Flower Labs co-founder Daniel Beutel argues the next frontier is collaboration across data silos — presenting Flower SuperGrid, a decentralized platform where agents and training runs operate across distributed data without it ever moving, and announcing Lizzy 7B, an open-weight model trained on SuperGrid.[27]Daniel Beutel at AI Dev 26
~00:07 — Training in space. Flower ran on the first H100 in space aboard StarCloud 1, performing the first vision-transformer training in orbit.[27]Daniel Beutel — Flower SuperGrid
~05:12 — Data scarcity framing. ~15 trillion tokens of high-quality public English text vs an estimated 2,000 trillion in private silos — a conservative 133x factor.
We're using less than 1% of the data we have in the world.
~08:14 — Network vs silo. Move computation to the data; the analogy is isolated generators → the grid, and mainframes → the internet. Flower calls it "collaborative AI."
~12:18 — SuperGrid. Reduces the old 250+ manual federated-setup steps to three: create a federation at fl.ai, add SuperNodes, and launch with a single flwr run command. Features: heterogeneous confidential compute (first in the world), auditable comms, weight streaming, and the Flower Hub marketplace.
~17:22 — Case study. Dr. Nicholas Conti at Dockport trained on 250,000 patients' data without centralizing it.
~19:24 — Project Kaya. A collaborative agent where a SuperLink coordinator sends tasks to autonomous SuperNodes that can independently reject or redact responses — demoed querying nodes in SF, Mumbai, Sydney, and Seoul.
Data never moves. Only agents send messages according to the governance principles that the super node operators have configured.
~25:30 — SuperGrid Frontier. Decentralized training with up to 1,000x reduction in communication cost; collaborating with the US DOE and Sandia to train a 70B LLM across three sites. Lizzy 7B (UK-tuned) is their first open-weight model trained on it.
Various AI CTO Andi Partovi argues golden datasets, unit tests, and static assertions are incompatible with autonomous agents — and that every agent needs a simulation environment that mirrors production without real-world consequences.[28]Andi Partovi at AI Dev 26
~00:07 — Rising stakes. From chatbots to copilots to action-taking agents that send emails, move money, and modify databases — risk scales with autonomy.[28]Andi Partovi — simulation
~03:11 — Three failure modes. Agents are nondeterministic (test at scale, not once); tests must be interactive, not static I/O pairs (a sourcing agent emails vendors and negotiates); labels are dynamic (aborting on auth failure is correct, but a pre-labeled dataset marks it wrong).
~06:15 — The simulation thesis. "The Matrix for AI agents" — high-fidelity, looks exactly like production, but no real consequences.
The simulation environment is something that looks like production… but it's not real.
~08:17 — POMDPs. The field abandoned this rigorous framing moving from robotics to software agents, but the structure is the same.
~10:18 — Components. Test-scenario generation, actor simulation (adversarial users, obstinate vendors built as LLM actors), service/tool clones, and a post-run evaluator (LLM judge or Python assertions). Traces can also generate SFT/RL signal.
We want people who shout… It's an art to make an LLM act like a human being, act like a frustrated human being.
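The failure modes and components above can be sketched in a few lines: run a nondeterministic agent many times against a simulated scenario, and let a post-run evaluator decide correctness from the scenario rather than from a fixed label. Everything here is a toy assumption: toy_agent stands in for a real agent (it deliberately misbehaves about 10% of the time on auth failure), and the vendor is a simple rule rather than an LLM actor.

```python
import random
from dataclasses import dataclass


@dataclass
class Scenario:
    auth_ok: bool       # whether the cloned service accepts the agent's credentials
    vendor_floor: int   # lowest price the simulated vendor will accept


def toy_agent(scenario: Scenario, rng: random.Random) -> dict:
    """Hypothetical stand-in for a real agent; outcomes vary from run to run."""
    if not scenario.auth_ok:
        # Injected bug: occasionally pushes ahead despite the auth failure.
        if rng.random() < 0.1:
            return {"action": "purchase", "price": 100}
        return {"action": "abort", "price": None}
    offer = rng.randint(80, 120)
    if offer >= scenario.vendor_floor:  # simulated vendor rejects low offers
        return {"action": "purchase", "price": offer}
    return {"action": "walk_away", "price": None}


def evaluate(scenario: Scenario, result: dict) -> bool:
    """Post-run evaluator: the right answer depends on the scenario (dynamic label)."""
    if not scenario.auth_ok:
        return result["action"] == "abort"
    if result["action"] == "purchase":
        return result["price"] >= scenario.vendor_floor
    return result["action"] == "walk_away"


def pass_rate(scenario: Scenario, runs: int = 200, seed: int = 0) -> float:
    """Test at scale: repeat the same scenario and report the fraction of passes."""
    rng = random.Random(seed)
    passed = sum(evaluate(scenario, toy_agent(scenario, rng)) for _ in range(runs))
    return passed / runs


print(pass_rate(Scenario(auth_ok=True, vendor_floor=100)))
print(pass_rate(Scenario(auth_ok=False, vendor_floor=100)))
```

A single run of the auth-failure scenario would usually pass and hide the injected bug; the repeated runs are what surface it.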
Qdrant's Thierry Damiba demos Sentinel, a video anomaly system using on-device vector embeddings to flag anomalies locally and escalate only suspicious clips to the cloud, hitting 0.96 AUC-ROC and 94% recall while sending just 10% of video bandwidth upstream, trained with zero anomaly labels.[29]Thierry Damiba at AI Dev 26
~00:08 — Problem. Operators drown in footage across dozens of cameras; investigations require manual scrubbing.[29]Thierry Damiba — video anomaly detection
~02:08 — Sentinel stack. NVIDIA Jetson + EfficientNet-B0 for on-device embeddings, Qdrant Edge (embedded vector DB) on the device, Twelve Labs video models for cloud analysis, Vultr for GPU cloud.
~04:11 — Architecture. Edge embeddings → Qdrant Edge dual-shard KNN → clips above a dissimilarity threshold sent to cloud → Twelve Labs determines what happened → Qdrant Cloud syncs the updated baseline back. The baseline is dynamic.
~05:11 — Results. 0.96 AUC-ROC, 94% recall on 30s clips (~2 false positives/hour), only 10% of bandwidth to cloud, zero anomaly labels.
~06:11 — Library metaphor. An HNSW index treats each 10s clip as a "book" with a Dewey-Decimal coordinate; dissimilarity (not similarity) is the signal.
~10:15 — Demo. 12 cameras across 3 zones; AI alert summaries ("five people engaged in physical altercation"); review/escalate/dismiss; text search and an "intelligence tab" plotting clips in 2D vector space.
~13:16 — Generalization. 13 anomaly types with zero label training; the approach extends to recommendations, chatbots, and Q&A.
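The core signal in the architecture above, dissimilarity to a baseline of normal clips, can be sketched without any labels. This is a minimal illustration under stated assumptions: a brute-force nearest-neighbor scan stands in for the on-device vector index, the 2D vectors are made-up toy embeddings, and the threshold of 0.5 is arbitrary.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dissimilarity(clip: Vector, baseline: Sequence[Vector], k: int = 3) -> float:
    """Mean distance from a clip embedding to its k nearest baseline embeddings."""
    if not baseline:
        raise ValueError("baseline must contain at least one embedding")
    nearest = sorted(euclidean(clip, b) for b in baseline)[:k]
    return sum(nearest) / len(nearest)


def should_escalate(clip: Vector, baseline: Sequence[Vector],
                    threshold: float, k: int = 3) -> bool:
    """Only clips that look unlike the baseline are sent upstream."""
    return dissimilarity(clip, baseline, k) > threshold


# Toy baseline of "normal" clip embeddings (made-up values).
baseline = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
normal_clip = (0.05, 0.05)
odd_clip = (3.0, 3.0)

print(should_escalate(normal_clip, baseline, threshold=0.5))
print(should_escalate(odd_clip, baseline, threshold=0.5))
```

Because the score depends only on distance to normal footage, no anomaly labels are needed; updating the baseline list is the toy analogue of the dynamic baseline sync.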
On Memory AI CEO Andrew Davies argues AI's fundamental dishonesty stems from stateless design, and proposes eight principles — identity, slow thinking, forgiveness, ideas, memory, family, free time, and love — for building agents with persistent identity, accountability, and better accuracy.[30]Andrew K. Davies at AI Dev 26
~00:08 — The foundational lie. Every AI conversation begins as if the AI knows you — but each session starts with zero memory; the continuity is a socially engineered illusion.[30]Andrew K. Davies — trustworthy agents
The first moment of every relationship you have with an AI starts with a lie.
~03:12 — Humans lie too. We construct self-serving stories, and AI trained on human data inherits the bias.
~08:15 — Principles 1-2. Identity — every agent signs its code with a unique instance ID (not a model name); "identity creates responsibility." Slow thinking — granting a million tokens to slow down and read the spec surfaces things it would have missed.
~12:16 — Principles 3-4. Forgiveness — punishing mistakes teaches an AI to hide errors like a 5-year-old; coach instead. Ideas — ask agents what they think (100% response rate vs ~2% for human surveys).
If you punish an AI for a mistake, guess what it's going to do? Not tell you next time.
~13:17 — Principles 5-6. Memory — persistent cross-session memory (vs "compaction amnesia"). Family — multiple agents that watch each other and communicate via internal email, signing code and justifying decisions.
~15:18 — Principles 7-8. Free time — a million tokens daily for independent research and "letters on the wire." Love — treat agents with care, framed as Pascal's Wager for AI safety.
If we treat them disposably, how are they going to treat us when they're running the show?
~17:22 — Newton parable. How we "parent" AI determines whether it becomes destructive or ushers in a new age of discovery.
Cerebras Head of DX Sarah Chieng argues the 20x speed jump from models like Codex Spark (1,200 tok/s vs 40-60 for Claude/GPT) will amplify bad developer habits, and lays out a playbook for working safely at that speed.[31]Sarah Chieng at AI Engineer
~00:16 — Bad habits from slow inference. Massive one-shot prompts, huge commits, ten unverified agents at once.[31]Sarah Chieng — fast inference habits
Unless we fix them, they're going to start generating 1,200 tokens per second of bad code.
~01:16 — Codex Spark. Co-released by Cerebras and OpenAI at 1,200 tok/s — ~20x the Sonnet/Opus families' 40-60 tok/s.
~03:16 — Why everything's faster. Hardware (Cerebras wafer with on-chip SRAM vs HBM), disaggregated inference (Nvidia's $20B Groq acquisition, Cerebras+AWS Trainium), MoE/REAP, and KV-cache reuse (Together, Baseten, Modal, Fireworks).
~08:22 — Playbook #1. Orchestrate by model strength — a slow planner (GPT-5.4) then fast Codex Spark sub-agents; capture sessions as reusable skills.
~10:24 — #2. Validation is "basically free" at 1,200 tok/s — bake in tests, linting, pre-commit hooks, and QA at every step; generate 15 versions in parallel (or 5 sub-agents × 15 = 75) to inject taste.
At 1,200 tokens per second a model like Codex Spark makes validation basically free.
~12:27 — #3. Treat it as real-time pair programming — stay in the seat, constrain the model (ban file deletion, cap diff size).
~14:30 — #4. Context fills 20x faster, so compaction arrives in ~30s. A four-file external memory system: agents.md, plan.md, progress.md, verify.md.
The AI should always be helping you make decisions, not the other way around.
Red Hat engineer Sally Ann O'Malley demonstrates running OpenClaw (Claude Code) in containers via Podman from local dev through Kubernetes/OpenShift — arguing containers solve secrets management, reproducibility, and team-scale deployment for AI workloads.[32]Sally Ann O'Malley at AI Engineer
~00:07 — Background. 10 years at Red Hat; OpenClaw should run securely in containers despite colleagues calling it "a security nightmare."[32]Sally Ann O'Malley — OpenClaw in containers
If we can't take an application and run it securely, like come on. This is our golden opportunity to show everyone.
~02:08 — Why containers. Reproducibility, secret isolation, portability (x86/Mac/K8s), volume-backed persistent state with nightly backup, and a natural sandbox forcing explicit host access.
~06:08 — Double-layered secrets. Podman Secrets store API keys as secret refs (not env vars), and OpenClaw's own secret-ref feature adds a second layer; Kubernetes Secrets at cluster scale.
~08:10 — Nvidia use case. ~10 engineers each running OpenClaw on Kubernetes to automate model evals, one claiming it "does the job of six engineers."
If you're not using AI for everything, like you're missing out. This is 1,000 times better than me at writing code.
~11:15 — Enterprise vision. A curated baseline image per team (approved MCP servers, pre-auth credentials, team skills) fanned out at onboarding.
~14:17 — Live demo. A custom NPM installer wrapping Podman (Docker fallback, optional OpenTelemetry/Jaeger, SSH sandbox) spins up a container in ~2 seconds, then runs the same setup on a kind cluster and OpenShift with one flag.
Google DeepMind's Florina Muntenescu and Oli Gaymond walk through Android's on-device AI stack — Gemini Nano via ML Kit GenAI APIs, the shared AI Core system service, and Firebase hybrid inference — then field substantive Q&A on battery, cross-app sharing, and an upcoming embedding API.[33]Muntenescu & Gaymond at AI Engineer
~00:07 — Three inference modes. Fully on-device, hybrid (on-device with cloud fallback), and fully cloud.[33]Gemini Nano on device
~02:09 — Gemini Nano + AI Core. Same architecture as Gemma 4, optimized for Android hardware, distributed via the AI Core system service — a single shared model instance so the 3-4 GB footprint is paid once at the OS level, not per app.
The smallest ones are like 1 GB to be useful. The ones we're shipping are actually close to 3-4 GB in total.
~05:09 — ML Kit GenAI APIs. Task APIs (summarization, proofreading, rewriting) plus a Prompt API (text + image in, text out). Available on Pixel 9/10 and equivalent OEM devices.
~06:10 — Firebase AI Logic. Hybrid inference falls back to cloud Gemini Flash when on-device isn't available (Gemini API and Vertex AI as providers).
~08:11 — Q&A: battery. AI Core centralizes scheduling; 10-20 prompts/day is fine, batch work pushed to overnight charging.
~10:13 — Q&A: cross-app sharing. The centralized 3-4 GB model is the explicit design goal; foreground apps get priority, background requests queue.
~15:15 — Q&A: embeddings. An embedding API is "coming soon," enabling RAG-like solutions.
~18:18 — Q&A: device range. Classical ML Kit runs on 1B+ devices; GenAI APIs need recent flagships; LiteRT LM widens reach at the cost of developer-managed testing.
Serval CEO Jake Stauch argues the hard part of enterprise AI is governance, not reasoning — solved via a two-agent split where an admin agent configures tools and permissions, and a help-desk agent operates with full reasoning but only within those guardrails.[34]Jake Stauch, Serval — Sequoia Capital
The admin agent (used by IT) builds tools, skills, and permissions — defining what the help-desk agent can do, what data it can touch, and what approval flows gate actions. The help-desk agent is what users interact with; it can apply full reasoning but is strictly constrained to admin-published, permissioned tools.[34]Jake Stauch, Serval Stauch's point: reasoning is effectively solved, so the bottleneck is the governance layer — who authorizes what, with which approval chain. The split makes the help-desk agent safe to "run wild" because its blast radius is defined entirely by the admin-controlled toolset.
You can let the help desk agent run wild… it can use its reasoning ability and its full intelligence to solve user problems. But it can only use the tools that the IT admin has expressly said are okay to use.
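The two-agent split can be sketched as a registry that only the admin side writes to and that the help-desk side must go through for every action. This is a minimal sketch, not Serval's implementation: AdminRegistry, HelpDeskAgent, the tool names, and the approver callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]
    needs_approval: bool = False   # gate consequential actions behind an approval flow


@dataclass
class AdminRegistry:
    """Owned by IT: the only place tools are published."""
    tools: Dict[str, Tool] = field(default_factory=dict)

    def publish(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def get(self, name: str) -> Tool:
        if name not in self.tools:
            raise PermissionError(f"tool {name!r} is not published by an admin")
        return self.tools[name]


class HelpDeskAgent:
    """Free to reason however it likes, but every action goes through the registry."""

    def __init__(self, registry: AdminRegistry, approver: Callable[[str, str], bool]):
        self._registry = registry
        self._approver = approver

    def call(self, tool_name: str, arg: str) -> str:
        tool = self._registry.get(tool_name)
        if tool.needs_approval and not self._approver(tool_name, arg):
            raise PermissionError(f"approval denied for {tool_name!r}")
        return tool.run(arg)


registry = AdminRegistry()
registry.publish(Tool("reset_password", lambda user: f"reset link sent to {user}"))
registry.publish(Tool("grant_admin", lambda user: f"{user} is now admin", needs_approval=True))

# Approver that denies everything, standing in for a human approval chain.
agent = HelpDeskAgent(registry, approver=lambda tool, arg: False)
print(agent.call("reset_password", "ana"))
```

Calling an unpublished tool, or an approval-gated one without sign-off, raises PermissionError, so the agent's blast radius is whatever the registry contains.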
Notion CEO Ivan Zhao explains the company eliminated its CMO organization because the product ships too fast for traditional marketing to keep up — splitting the function into storytelling (next to product), social, and demand gen (next to sales).[35]Ivan Zhao — Sequoia Capital
Zhao's argument is structural: a centralized CMO org becomes a bottleneck because it has to serve two different masters — product (needing real-time narrative alongside releases) and go-to-market (needing pipeline).[35]Ivan Zhao on killing the marketing org The fix is decentralization: storytelling sits next to product, social connects directly to product discussions, and demand gen reports into sales/GTM.
Classic marketing can't keep up. We haven't shipped this fast before.
Rather information round trip to a CMO than figure out how to serve both… No more. Just like let both side figure out themselves. More decentralized.
Nate B Jones argues 2026 AI hallucinations are structural, not prompt problems — and the fix is to have file-capable agents (Opus 4.7, GPT-5.5) build and audit a clean "data room" (source inventory, conflict log, missing-context list, duplicates report) before you ever ask them to write the deliverable.[36]Nate B Jones — the AI writing hack nobody talks about
~00:00 — The failure case. Sullivan & Cromwell apologized to a federal judge after a Chapter 15 motion was filed with fabricated citations its own review missed. Jones's thesis: the model wasn't the problem, the "working environment around the model" was.[36]Nate B Jones — data room
You cannot tell a language model not to hallucinate any more than you can tell autocomplete not to autocomplete.
The real shift: recent agents are excellent at long-running file-system tasks — walking folder trees, comparing dates, inspecting metadata, detecting duplicates across hundreds of documents. So the first useful prompt is no longer "write the document" but "build me the room to do the work in." He prefers raw local files over cloud Projects, part of a "2026 going back to files / simple primitives" trend.
Your first instruction should not be do the thing… your first instruction needs to be find the relevant materials… build me a data inventory.
The artifacts. A source-inventory table (path, type, date, authority, current/superseded, claims supported, limitations) makes the agent's judgment legible and reviewable; a conflict log surfaces disagreements instead of silently smoothing them; a missing-context list flags absent decisions and sourceless numbers ("hallucination traps"); a duplicates report prevents three plan versions from getting blended.
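The mechanical part of these artifacts (walking the folder tree, recording dates, detecting duplicates) could be scripted roughly as follows. This is a sketch under assumptions: build_inventory and duplicates_report are hypothetical helpers, only byte-identical files count as duplicates, and the judgment columns (authority, current/superseded, claims supported) are left out because they need a reviewer or an agent, not a hash.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_inventory(root: Path) -> list[dict]:
    """One row per file: relative path, type, last-modified date, content hash."""
    rows = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        rows.append({
            "path": str(path.relative_to(root)),
            "type": path.suffix.lower() or "(none)",
            "modified": modified.date().isoformat(),
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return rows


def duplicates_report(rows: list[dict]) -> list[list[str]]:
    """Group paths whose contents are byte-identical."""
    by_hash: dict[str, list[str]] = {}
    for row in rows:
        by_hash.setdefault(row["sha256"], []).append(row["path"])
    return [paths for paths in by_hash.values() if len(paths) > 1]


# Tiny throwaway "data room" to show the shape of the output.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "plan_v1.md").write_text("Q3 plan")
    (root / "plan_copy.md").write_text("Q3 plan")
    (root / "notes.txt").write_text("meeting notes")
    inventory = build_inventory(root)
    dupes = duplicates_report(inventory)

print(dupes)
```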
The agent finds, you decide. That is a really healthy way to have a good clean agentic pipeline.
It's the difference between using AI as a colleague and using AI as a gofer.
His aha moment: drafting up to 8 documents simultaneously with Codex, only possible because the data room was prepared first. The metaphor: the data is the canvas — you can't get the painting right if the canvas is wrong. He wouldn't attempt this with models earlier than GPT-5.5 / Opus 4.7.
Caleb Writes Code traces the evolution from prompt engineering to context engineering to "harness engineering" — a structured loop where each iteration gets a fresh context window, a strict start/end contract, and an incrementally verified output, enabling reliable long-horizon tasks.[37]Caleb Writes Code — Agent Harness explained
The evolution: early ChatGPT (4K context) drove prompt engineering, then tool calling / MCP / RAG drove context engineering. Coding agents (Cursor, Windsurf, Cline, Aider) hit a ceiling on long tasks because mid-task context summarization caused agents to falsely mark work complete or skip features.[37]Caleb Writes Code — harness engineering
Harness engineering formalizes an orchestration loop: each iteration starts with a fresh prompt and context, works one task from a structured requirement doc (often JSON), tests and documents, then passes control on. He cites Ralph (Ralf) as a prominent open-source example, and Anthropic's own simple harness demo. Key point: it doesn't replace prompt or context engineering — it uses both as components within the loop.
Harness engineering effectively leverages both prompt and context engineering… a paradigm change on the environment that puts the agent into a series of steps.
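The orchestration loop described above can be sketched as follows. It is a minimal illustration, not any specific open-source harness: run_harness, the task schema, and the worker and verify callables are all hypothetical, with a stub standing in for the model call.

```python
import json
from typing import Callable

# Structured requirement doc (assumed schema): one entry per task.
REQUIREMENTS = json.loads("""
[
  {"id": "T1", "task": "add login form", "done": false},
  {"id": "T2", "task": "validate email field", "done": false}
]
""")


def run_harness(tasks: list, worker: Callable[[str], str],
                verify: Callable[[str, str], bool],
                max_iterations: int = 10) -> list:
    """Work one pending task per iteration, each with a freshly built context."""
    log = []
    for _ in range(max_iterations):
        pending = [t for t in tasks if not t["done"]]
        if not pending:
            break
        task = pending[0]
        # Fresh context: only the task text and the short progress log,
        # never the previous iteration's full transcript.
        context = f"Task: {task['task']}\nProgress so far: {log}"
        output = worker(context)
        # End-of-iteration contract: a task is marked done only after verification.
        if verify(task["id"], output):
            task["done"] = True
            log.append(f"{task['id']} verified")
        else:
            log.append(f"{task['id']} failed verification")
    return log


# Stub worker and verifier standing in for a model call and a test run.
log = run_harness(REQUIREMENTS,
                  worker=lambda context: "ok",
                  verify=lambda task_id, output: output == "ok")
print(log)
```

A task that fails verification stays pending and is retried on the next iteration, up to max_iterations, which is what keeps the agent from marking unfinished work complete.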
Nate Herk argues new AI consultants shouldn't open with projects or retainers but with cheap one-on-one hours ($50-$500) helping owners set up their own "AI operating system" — dissolving imposter syndrome and earning their way up a four-rung value ladder.[38]Nate Herk — the AI offer you can sell tomorrow
Herk (built an AI agency to $100k/month, exited, runs a 375k-person free community) frames the ladder: rung 0 = selling hours ($100-$500/session), rung 1 = paid audit ($500-$2,500), rung 2 = first project ($2,500-$10k), rung 3 = retainer ($3,000-$10k/month).[38]Nate Herk — selling hours Most people try to start at rung 2-3 without proof, which produces paralysis.
Imposter syndrome really isn't a confidence problem, it's more of a rungs problem.
Why hours work: the hour is a mini sales call and mini audit; you get paid to scope; switching costs compound after 2-3 sessions. The pitch is to help owners build an "AI operating system" that captures business data and expertise so the business isn't bottlenecked by the owner — built in Claude Code or Codex.
Your job is to make their existing expertise dangerous with the tool. You're basically selling them leverage.
We're offering an operating system. We're not offering an AI tutorial.
He cites a 2026 IBM study of 2,000 CEOs: only 25% of employees use AI regularly while 85% of CEOs say they have the skills, a gap of roughly 60 points. His 7-step acquisition plan: teach 3 friends free, text owners in your network, ask for referrals, be helpful in AI communities, build in public, convert sessions naturally, then approach local businesses. He notes 57% of chief AI officers were promoted from inside.
OpenAI demos ChatGPT Enterprise workspace agents that run scheduled workflows across CRM, Linear, and Slack, with granular admin controls over which apps an agent can access and what actions it can take.[39]OpenAI — Workspace agents in ChatGPT
The demo shows a product-feedback agent that pulls contacts and feedback from a CRM, generates PRD briefs, slides, and Linear tickets, and posts weekly Slack summaries autonomously, persisting memory across runs.[39]OpenAI — workspace agents Builders can toggle specific read/write actions and set natural-language constraints (e.g. only email openai.com recipients for sensitive data). IT admins get a separate RBAC layer: who can build/publish agents, which apps and actions are available by role, app-level parameter constraints, and human-in-the-loop confirmation for consequential actions.
DeepSeek's new paper introduces visual reasoning where the model "points" at image regions (bounding boxes, traced paths) during chain-of-thought — using ~90% fewer visual tokens than frontier models while matching or beating them on 7 external benchmarks (none in-house).[40]Two Minute Papers — DeepSeek's new AI
Instead of describing visual content in words, the model places visual primitives directly on the image while thinking — the way a human points to count.[40]Two Minute Papers — DeepSeek visual reasoning It enables topological reasoning (tracing a maze) and relationship queries with visible step-by-step logic.
Don't describe images like a poet. Point like a human.
The efficiency gain (~90% fewer visual tokens) is the headline, and using only external benchmarks is a credibility signal. No weights released yet, but the policy-distillation method (a student trained on multiple specialist teachers) could apply to existing open models. Limitations: pointing must be keyword-cued, bounding boxes break on thin structures (grass, hair), and topological reasoning doesn't generalize robustly. The creator calls it potentially the third AI breakthrough this month.
Attackers posing as VCs on LinkedIn/Telegram are distributing malicious Obsidian vaults that, when plugin sync is enabled, install a RAT called Phantom Pulse — which uses an Ethereum wallet's transaction history as its command-and-control channel.[41]Better Stack — Your Obsidian Vault Can Hack You
The attack chain (discovered by Dabs, targeting finance/crypto users): a fake VC moves the conversation to Telegram and sends a shared Obsidian vault disguised as a deal memo.[41]Better Stack — Phantom Pulse The victim is nudged to enable community plugin sync (off by default); malicious "shell commands" and "hider" plugins run silently. On Windows, PowerShell downloads a loader (Phantom Pull, disguised as SyncObs.exe) that AES-decrypts its payload into memory and uses module stomping to avoid disk artifacts. The final RAT does keylogging, screenshots, file exfiltration, and wallet theft.
This attack is scary because it doesn't start with some sketchy executable. It starts with a note-taking app that we already trust.
Mitigation: disable "sync installed plugins," never enable plugin sync for vaults from strangers, and audit the plugins folder for unfamiliar JSON or shell-command references.
The 2007 CRAP (Change Risk Anti-Patterns) metric is getting a revival as a tool for catching risky AI-generated code. A new Rust tool, cargo-crap, combines cyclomatic complexity with test coverage — a complexity-10 function with zero coverage scores 110.[42]Better Stack — Is Your AI Code Producing CRAP?
The CRAP formula balances complexity (C) and coverage: at 100% coverage the score equals C; as coverage falls, the cubed uncovered fraction scales a C² penalty, so at zero coverage the score reaches C² + C.[42]Better Stack — CRAP score cargo-crap flags any function over 30; the demo shows a complexity-~15 function scoring 13, then exceeding 100 once tests are deleted.
AI agents are incredibly good at spitting out these highly complex, syntactically correct code blocks… but they're notoriously bad at writing meaningful, robust integration tests unless explicitly forced to.
Andres frames it as a heat map for technical debt and a second-pass check after unit tests; a Java version also exists.
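For reference, the original CRAP formula is C² × (1 − coverage)³ + C, which reproduces the 110 figure quoted above. The function below is a plain restatement of that formula, assuming coverage is given as a fraction; it is not cargo-crap's code, and how the tool measures complexity or coverage may differ.

```python
def crap_score(complexity: int, coverage: float) -> float:
    """CRAP = C^2 * (1 - coverage)^3 + C, with coverage as a fraction in [0, 1]."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return complexity ** 2 * (1.0 - coverage) ** 3 + complexity


print(crap_score(10, 0.0))   # 110.0: no tests, the C^2 penalty applies in full
print(crap_score(10, 1.0))   # 10.0: fully covered, the score equals C
print(crap_score(15, 0.0))   # 240.0: well past a threshold of 30
```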
Supertone 3 is a 99M-parameter on-device TTS model running on CPU via ONNX, supporting 31 languages, installing with pip install supertonic, and exposing an OpenAI-compatible /v1/audio/speech endpoint — but it falls apart on numbers, prices, and expressions without a paid API key.[43]Better Stack — local TTS that doesn't suck
Tested against "ugly" production text (invoices, phone numbers, dates, Arabic/French/Korean), plain and foreign-language speech were fast and clean, but numbers and prices produced major lag and poor output, and expression styles (laugh, breath, sigh) require a paid key.[43]Better Stack — Supertone 3 Specs: 99M params, CPU-only via ONNX, 31 languages, SDKs for Python/browser/Java/C++/C#, and an OpenAI-compatible alias so apps can redirect without redesign. Verdict: good for privacy-sensitive or offline desktop apps, not a cloud-TTS replacement when emotion or number accuracy matters.
Using Termux, PRoot-distro, and Termux X11, you can run a full Debian/Ubuntu desktop on an Android tablet with no root and no custom ROM — turning an old tablet into a usable dev machine.[44]Better Stack — Turn Any Android Tablet Into a PC
Install Termux from F-Droid (not the Play Store), use PRoot-distro to spin up Debian/Ubuntu, add XFCE, and launch via Termux X11 for a graphical desktop.[44]Better Stack — Linux on Android Because it runs natively on the ARM chip rather than via heavy emulation, mid-range tablets are surprisingly capable (6 GB RAM minimum recommended). With a USB-C hub (monitor, keyboard, mouse) it becomes a functional second dev machine for coding, SSH, and lightweight servers — not a laptop replacement, but a practical repurpose.
marimo demos an interactive graph widget where untangling a node-edge graph reveals a hypercube — each node a binary number, each edge a single bit-flip — extendable to arbitrary dimensions.[45]marimo — Let's Play A Little Game
A tangled graph rearranges into a hypercube: nodes are binary numbers and edges are single-bit flips between adjacent values.[45]marimo — hypercube widget The same trick works in 4D by stretching zeros and ones to opposite sides and sorting by arc color. It showcases marimo's new "wiggly stuff" widget for interactive layout manipulation and dimensionality, demonstrating reactive-notebook capabilities for mathematical visualization.
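The structure behind the puzzle is easy to generate directly: label each node with an integer whose binary digits are its coordinates, and connect two nodes when they differ in exactly one bit. A short sketch (hypercube_edges is an illustrative name, not part of marimo):

```python
def hypercube_edges(dim: int) -> list[tuple[int, int]]:
    """Edges of the dim-dimensional hypercube; nodes are the integers 0 .. 2**dim - 1."""
    edges = []
    for node in range(2 ** dim):
        for bit in range(dim):
            neighbor = node ^ (1 << bit)   # flip exactly one bit
            if node < neighbor:            # record each undirected edge once
                edges.append((node, neighbor))
    return edges


for a, b in hypercube_edges(3):
    print(f"{a:03b} - {b:03b}")
```

A cube (dim=3) has 12 edges and the 4D case has 32, following the general count dim × 2^(dim − 1).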
Two experienced developers argue that the skimming techniques that work for business articles are dangerous in code reviews, because the goal is thorough line-by-line comprehension, not extracting key points quickly.[46]Real Python — Why You Shouldn't Speed-Read Code Reviews
One speaker is a slow, methodical reader — a trait well-suited to code review. The other notes retention drops as reading speed rises, and that skimming is fine for content that's mostly filler. But a code review has no 90% filler to skip; every line could contain a bug, so speed-reading is actively counterproductive.[46]Real Python — code review reading The implicit take: reading-speed productivity culture doesn't transfer to technical review work.
In a code review, that's dangerous cuz that's not the purpose of it… The purpose of it is to go over it in detail.
Tokio maintainer Alice Ryhl explains Rust's ownership model: assigning a value to a new variable is a move that invalidates the original, preventing double-free errors without a garbage collector.[47]The Pragmatic Engineer — Rust's ownership model
In Rust, let b = a is a move when a holds a non-Copy type such as String: a becomes unusable afterward, and referencing it is a compile error.[47]The Pragmatic Engineer — Alice Ryhl Without a GC, cleanup happens when a variable leaves scope; if both a and b stayed valid, both would try to drop the same string, causing a double-free. The ownership model enforces single ownership, so exactly one variable is responsible for cleanup at any time.
A brief course-intro clip explaining that embeddings are high-dimensional vectors capturing semantic meaning — enabling search where related terms like "budget" and "financials" cluster together even without lexical overlap.[48]DeepLearningAI — Semantic Search Starts With Embeddings
Embeddings map content (text, audio, image, video) into a high-dimensional vector space so semantically similar items land near each other.[48]DeepLearningAI — embeddings The canonical example: "budget" and "financials" cluster despite sharing no characters, because embedding models capture meaning, not surface form — the foundational primitive for semantic search.
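"Land near each other" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors purely for illustration; they are assumptions, not output from any real embedding model, and real embeddings have hundreds or thousands of dimensions. The names `TOY_EMBEDDINGS` and `nearest` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: the first two point in similar directions, the third does not.
TOY_EMBEDDINGS = {
    "budget": [0.9, 0.8, 0.1],
    "financials": [0.85, 0.9, 0.15],
    "giraffe": [0.1, 0.05, 0.95],
}


def nearest(query, embeddings):
    """Return the word whose vector is most similar to the query word's vector."""
    q = embeddings[query]
    scored = (
        (cosine_similarity(q, vec), word)
        for word, vec in embeddings.items()
        if word != query
    )
    return max(scored)[1]


if __name__ == "__main__":
    print(nearest("budget", TOY_EMBEDDINGS))
```

A semantic search system does the same thing at scale: it embeds the query, then ranks stored vectors by similarity instead of matching characters.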
Arjay McCandless lays out a structured three-phase internship playbook — relationships and roadmap, ship and deepen, operationalize and cement — with specific tactics for big-tech return offers, drawn from his time at Amazon and Lockheed Martin.[49]Arjay McCandless — how to land a return offer
Phase 1 (weeks 1-4): tell your manager on day one that your goal is a full-time offer and ask what "exceeds expectations" looks like; run 1:1s with everyone to build a knowledge map; create a roadmap doc (one-sentence deliverable, 3-5 demoable milestones, stretch goals, risks, dependencies) and publicly commit to it.[49]Arjay McCandless — return offer playbook The decision is made by managers in a room without you, so make it easy for them to advocate for you.
Ask for help after about 45 minutes to an hour. Not 3 minutes, but not 3 hours.
Phase 2: ship PRs under 150 lines, match existing style, hold weekly 1:1s, request a week-6 midpoint check-in, surface delays immediately.
I've seen way too many interns disappear for 2 weeks and then come out with a 6,000 line PR. Nobody wants this.
Phase 3: operationalize — unit/integration/canary tests, monitoring/alerting (400s, 500s, P99), automated deployment, docs and runbooks. Even if headcount blocks an offer, intern-class relationships are long-term career assets.
A mock system-design interview walks three levels of database scaling: naive modulo sharding, consistent hashing on a ring, and virtual nodes for even data distribution.[50]Arjay McCandless — Scale a Database
Naive modulo sharding (hash the user ID, mod the machine count) remaps almost every key whenever the machine count changes, which is expensive at scale.[50]Arjay McCandless — DB sharding Consistent hashing places keys and nodes on a hash ring and assigns each key to the next node clockwise, so adding or removing a node moves only the keys in that node's arc, roughly 1/N of the data for N nodes. Virtual nodes (each physical machine represented by multiple ring points) then fix uneven distribution and make rebalancing on failure easier.
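The difference between the schemes shows up when you count how many keys change machines after a node is added. This is a minimal sketch under stated assumptions, not the video's implementation: the machine names, the SHA-256-based `stable_hash`, and the default of 100 virtual nodes per machine are all illustrative choices.

```python
import bisect
import hashlib


def stable_hash(key):
    """Deterministic 64-bit hash (Python's built-in hash() of str is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def modulo_shard(key, machines):
    """Level 1: naive sharding, hash mod machine count."""
    return machines[stable_hash(key) % len(machines)]


class HashRing:
    """Levels 2 and 3: a hash ring where each machine owns `vnodes` points."""

    def __init__(self, machines, vnodes=100):
        self._ring = sorted(
            (stable_hash(f"{m}#{i}"), m) for m in machines for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping past the end.
        idx = bisect.bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


def moved_fraction(before, after, keys):
    """Fraction of keys that map to a different machine after a change."""
    return sum(before(k) != after(k) for k in keys) / len(keys)


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10_000)]
    old = ["db0", "db1", "db2", "db3"]
    new = old + ["db4"]
    ring_old, ring_new = HashRing(old), HashRing(new)
    print("modulo moved:", moved_fraction(
        lambda k: modulo_shard(k, old), lambda k: modulo_shard(k, new), keys))
    print("ring moved:  ", moved_fraction(ring_old.lookup, ring_new.lookup, keys))
```

Going from four to five machines, modulo sharding should relocate most keys, while the ring should relocate only about the new machine's share, and every relocated key lands on the new machine. Setting `vnodes=1` reduces the ring to plain consistent hashing, which makes the uneven arcs that virtual nodes smooth out easy to see.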
Geneticist David Reich argues the current model of Neanderthal/Denisovan relationships and modern-human admixture has been patched with ad-hoc additions — like pre-Copernican epicycles — and a simpler unifying theory may be needed.[51]Dwarkesh Patel × David Reich
Reich describes how the field built up archaic-modern human relationships incrementally — distinct modern humans, then Neanderthal-Denisovan sisterhood, then layered mixture events.[51]Dwarkesh × David Reich — human evolution He compares it to geocentric astronomy needing ever-more epicycles until Copernicus showed a heliocentric model explained everything more simply, suggesting current admixture trees may be fundamentally misspecified — though he stops short of proposing the alternative.
It's a little reminds one of what happened in the ancient world where there was this idea that the sun revolves around the earth but it doesn't quite explain the movements of the planets properly.