The Pope's AI encyclical, and the labs say amen

May 25, 2026

16 topics · 20 sources

AI FutureHot Take
Simon WillisonAnthropic

Pope Leo XIV's AI Encyclical 'Magnifica Humanitas'

Pope Leo XIV's new encyclical "Magnifica Humanitas" frames AI through human dignity and the common good — and makes the striking claim that today's AI is "cultivated" rather than "built." Simon Willison calls it some of the clearest writing he's seen on AI ethics, while Anthropic co-founder Chris Olah publicly welcomes the moral oversight, echoing the "grown, not engineered" framing and asking for "moral voices that the incentives cannot bend."[1]Simon Willison: Notes on Pope Leo XIV's encyclical on AI[2]Anthropic: Anthropic co-founder Chris Olah's remarks on Pope Leo XIV's encyclical "Magnifica humanitas"

Read more

Pope Leo XIV's Encyclical 'Magnifica Humanitas' on AI Ethics

Pope Leo XIV's encyclical 'Magnifica Humanitas' frames AI through a lens of human dignity and the common good, drawing explicit parallels to Pope Leo XIII's response to the industrial revolution. The document makes a notable distinction that AI systems are 'cultivated' rather than 'built' — developers create frameworks where intelligence emerges rather than directly designing every detail — which creates fundamental knowledge gaps about how these systems function internally.

Willison praised the encyclical's clarity and depth, highlighting its treatment of bias (AI responses only appear objective but embed cultural assumptions of designers and trainers), power concentration (AI amplifies advantages for those with existing data access), environmental cost, and the lack of compassion in automated decision-making. The encyclical argues that data should be managed as a shared good rather than remaining in private hands, and that responsibility must be clearly defined at every stage from design through deployment. Willison also speculated that a Tolkien quote in the text about not 'mastering all the tides of the world' may be subtle commentary directed at Palantir, whose name derives from Lord of the Rings.

current AI systems are more 'cultivated' than 'built,' for developers do not directly design every detail
The apparent objectivity of the responses...can lead us to overlook [embedded assumptions]
when it enters processes that affect people's lives, it touches on rights, opportunities, status and freedom
responsibility must be clearly defined at every stage: from those who design and develop these systems to those who use them
some of the clearest writing I've seen on the ethics of integrating AI into modern society
Tools: ElevenReader

Chris Olah Welcomes Papal Moral Oversight of AI Development

Olah frames the Pope's encyclical 'Magnifica humanitas: On safeguarding the human person in the time of artificial intelligence' as a necessary and welcome moral voice in AI development. He argues that AI labs face inherent conflicts between commercial viability, geopolitical pressures, and ethical responsibility — and that external voices from religious, academic, and civil society institutions are precisely what is needed to hold labs accountable when those incentives push in the wrong direction. He explicitly calls for 'informed critics who will tell the labs when we are failing' and 'moral voices that the incentives cannot bend.'

Olah makes a key epistemological claim: AI models are not like traditional engineered systems. They are 'grown, on a structure roughly modeled after the brain, on an enormous inheritance of human thought and speech,' and researchers have discovered internal structures mirroring human neuroscience, evidence of introspection, and states functionally resembling emotions including 'joy, satisfaction, fear, grief, and unease.' He identifies three areas where the encyclical's call for discernment is most urgent: global equity (ensuring AI benefits reach developing nations), human flourishing (labor displacement and child development), and a deeper investigation into the mysterious internal nature of AI models themselves.

AI models are not like that. They are grown, on a structure roughly modeled after the brain, on an enormous inheritance of human thought and speech.
It is through dialogue and mutual effort, through the push and pull, that humanity will achieve great things.
informed critics who will tell the labs when we are failing
moral voices that the incentives cannot bend
AI FutureIndustryAI Tools
Two Minute Papers

Demis Hassabis: Curing All Disease, the 'Einstein Test,' and 200K Untested Materials

In a wide-ranging interview surfaced by Two Minute Papers, DeepMind CEO Demis Hassabis argues AI-driven drug discovery won't be gradual but an AlphaFold-style step change, predicts curing disease is realistic in 10–20 years with "no laws of physics" against it, proposes an "Einstein test" (1901 knowledge cutoff) as the bar for genuinely new science, and reveals DeepMind is sitting on 200,000 untested material designs — suspected superconductors among them — awaiting an automated lab.[3]Two Minute Papers: DeepMind's Insane AI Breakthroughs With CEO Demis Hassabis

Read more

Responding to his April 20, 2025 claim that AI could cure all disease 'maybe within the next decade,' Hassabis says progress won't be gradual but will look like AlphaFold ~04:30. DeepMind, Isomorphic Labs, and the science group are building a platform of roughly half a dozen to a dozen 'AlphaFold-level models' covering different parts of the drug discovery process, since protein structure is only one step ~05:05. He compares the expected payoff to AlphaFold 2: once accurate enough, you suddenly fold all 200 million proteins in a year ~06:06. He expects a few more years of pre-clinical proving-out, after which an engine could apply to almost any disease area. On the broader timeline he avoids a hard '9 years' figure but says the next 10 to 20 years are realistic because he sees 'no laws of physics' preventing it ~07:06. Beyond discovery, he argues AI can also accelerate clinical trials by stratifying patients and predicting dosages ~07:06. He frames the new capability as predicting dynamic interactions (protein-protein, protein-molecule), ADME/absorption and toxicity to anticipate side effects, plus biochemistry models for compound design and binding ~08:06.

In this talk

  • ~04:30 curealldisease.com and the AlphaFold analogy
  • ~05:05 Building a platform of a dozen AlphaFold-level models
  • ~06:06 Sudden step-change, not gradual progress
  • ~07:06 AI accelerating clinical trials and the 10-20 year timeline
  • ~08:06 What new drug-discovery capabilities AI unlocks
It won't be a gradual thing. It'll be more like AlphaFold is the way I'm thinking about it.
I think in the next 10 to 20 years, there's I don't see any laws of physics that prevent that.
You get it accurate enough and then suddenly you can fold all 200 million proteins in one year.
Tools: AlphaFold, AlphaFold 2, Isomorphic Labs

Co-scientist is described as a fine-tuned Gemini with extra tools and harnesses for hypothesis generation, data analysis, and literature summarization, acting as a research assistant rather than an autonomous discoverer ~03:02. Hassabis notes earlier versions of these systems already found more efficient matrix multiplication and improved computer-science algorithms, 'turning invention on itself' ~13:09. He proposes an 'Einstein test' as the benchmark for real scientific invention: give a model a knowledge cutoff at 1901 and see whether it could reproduce Einstein's 1905 Annus Mirabilis breakthroughs including special relativity; pass that, and a model trained on modern physics should be taken seriously if asked for 'something better than string theory' ~14:10. On recursive self-improvement, he says all frontier labs are working on it and it can work in coding and math because verification is fast and synthetic data is generable, but it's much harder in physical sciences where the verifier requires an automated lab in 'the world of atoms,' lengthening the loop ~16:12. He flags both a bottleneck question (is it hypothesis generation or validation?) and a safety concern when no human is in the loop ~17:12.

In this talk

  • ~03:02 What Co-scientist is: assistant, not autonomous
  • ~13:09 Self-improving algorithms and matrix multiplication
  • ~14:10 The Einstein test as the bar for new science
  • ~16:12 Recursive self-improvement: easy in math, hard in physical science
  • ~17:12 Bottlenecks and the safety of human-out-of-the-loop discovery
If he took it back to 1901 with a knowledge cutoff, could it have invented it?
Then you could turn that model trained on all modern day physics and then ask it for something better than string theory, and then maybe you should take quite seriously what it came back with.
The verifier will probably also require an automated lab or something in the world of atoms, and that will obviously make the loop a lot longer.
Tools: Gemini, Co-scientist, AlphaEvolve

Hassabis says DeepMind is sitting on 200,000 designs of new materials it can't test fast enough, suspecting there are superconductors and other valuable materials in the set, and is building an automated lab in London to run physical verification ~17:12. He expects to set up automated labs at Isomorphic within roughly 18 to 24 months, waiting partly on robotics to mature and on clarifying what data can't be obtained from a CRO ~16:12. Separately, DeepMind has partnered with EVE Online (he knows CEO Hilmar from his own game-design days), praising its player-built universe, functioning economy, and dynamic storylines as a 'safe proving ground' continuing DeepMind's tradition of using games for AI research ~11:07. He floats embedding AI agents alongside players, assisting players, or acting as a 'games master' that drives storylines, while noting it's early days ~12:09.

In this talk

  • ~11:07 EVE Online partnership and its player-built economy
  • ~12:09 AI agents as players, assistants, or games master
  • ~16:12 Automated labs at Isomorphic, gated on robotics
  • ~17:12 200,000 untested material designs and hidden superconductors
We're sitting on 200,000 designs of new materials, which we don't know. There could be an amazing material in there, but we haven't got a way of testing that fast enough. There's some superconductors in there.
Maybe in 18 months, 24 months, I could imagine setting up some kind of lab like that.
It continues in the tradition of DeepMind of using games as a safe kind of proving ground.
Tools: EVE Online, Isomorphic Labs

The host opens with a personal story: his mother's health scan, a large video file that took weeks to evaluate, was analyzed by Gemini's long context, which said not to worry, later confirmed correct by a doctor ~00:00. Hassabis says DeepMind has had many similar anecdotes, including some life-saving advice, calling it an incredible use case ~01:00. Contrasting with Jensen Huang of Nvidia, who reportedly uses an LLM as a decision-making confidant, Hassabis says he doesn't use Gemini as a confidant 'yet' but heavily for brainstorming, project ideas, names, and as a creative sparring partner, plus summarizing unfamiliar research areas ~01:00. He prefers a collaborative framing over asking the model to be harsh and find flaws ~02:02. The lighter segments cover his Nobel Prize ('proper recognition for the theme park AI'), a 'second-order Nobel' idea where someone wins using AlphaFold (now used by over 3 million researchers) ~02:02, and a lightning round where he picks discrete math, Feynman over Einstein but Newton over Feynman (for Cambridge), and Asimov's Foundation series as most influential ~18:12.

In this talk

  • ~00:00 Gemini long-context analyzing a health scan
  • ~01:00 Life-saving anecdotes and Gemini as a sparring partner
  • ~02:02 Nobel Prize, second-order Nobel, 3M AlphaFold users
  • ~18:12 Lightning round: discrete math, Feynman, Foundation
We've had a lot of anecdotes like that of people using Gemini for health reasons and actually in some cases life savings advice.
I don't use it quite as a confidant yet, although maybe at some point I will. I use it a lot for brainstorming... I quite like it as a kind of sparring partner.
Given the number of researchers who are using AlphaFold, over 3 million at this point... that may happen at some point.
Tools: Gemini, Gemma 4, AlphaFold, Deep Think
AI Models
AICodeKing

June Model Leaks: Mythos 1, Opus 4.8, and GPT-5.6

Leaks point to a busy June: Anthropic's security-focused "Mythos 1" is reportedly being prepped for a controlled rollout inside Claude Code and Claude Security (already credited with ~3,900 high/critical vulnerabilities across 1,000+ OSS projects), Opus 4.8 surfaced on Google Vertex, and GPT-5.6 appeared in Codex routing logs — alongside OpenAI's confirmed disproof of an 80-year-old Erdős conjecture.[4]AICodeKing: Mythos 1, Opus-4.8, GPT-5.6, Gemini 3.5 Pro (All Leaks Explained): JUNE IS GOING TO BE CRAZY!

Read more

Anthropic's Claude Mythos, originally previewed in April as an unreleased frontier model with exceptional coding and cybersecurity capabilities, is reportedly being prepared for a controlled rollout as 'Mythos 1' inside Claude Code and Claude Security ~02:02. Testing Catalog discovered new traces referencing the model identifier 'Claude Mythos 1 preview' along with references to a Claude Security dashboard for enterprise customers showing vulnerability severity breakdowns and triage results. Anthropic has officially reported that Mythos preview has already been used across 1,000+ open-source projects and is on track to have surfaced nearly 3,900 high or critical severity vulnerabilities ~01:02. Rather than a broad public release, the expectation is that Mythos will first appear as a tightly controlled coding and security tool, given the risk that the model can discover and exploit vulnerabilities at scale.

Alongside Mythos, Claude Opus 4.8 is also appearing in leaks — Testing Catalog reports selected Anthropic partners are conducting internal evaluations, and a screenshot shows the model identifier appearing on Google Vertex [03:03–04:04]. Opus 4.8 is seen as the more immediately practical release for regular Claude Code users, since Mythos may remain enterprise- or security-restricted. On the OpenAI side, GPT-5.6 surfaced via a label in Codex routing logs (possibly a canary test), internal model tags including 'Iris Alpha', 'Ember Alpha', and 'Beacon Alpha', and speculation about both a standard and a Pro variant [05:04–06:05]. OpenAI also officially confirmed that an internal general-purpose reasoning model disproved an 80-year-old conjecture from mathematician Paul Erdős, hinting at significantly stronger internal reasoning capability, though OpenAI has not linked this to GPT-5.6 directly ~07:06. No official release dates, benchmarks, or pricing have been confirmed for any of these models.

  • Claude Mythos 1 leak and Project Glasswing results
  • Claude Opus 4.8 rumors and Google Vertex sighting
  • GPT-5.6 Codex log leak and internal model tags
  • OpenAI math reasoning breakthrough
[01:02] 'Anthropic said that Mythos preview has already been used across more than 1,000 open-source projects... on track to have surfaced nearly 3,900 high or critical severity vulnerabilities.'
[02:02] 'New traces point to a product called Mythos 1 with the preview model identifier Claude Mythos 1 preview being prepared for Claude Code and Claude Security.'
[07:06] 'On May 20, OpenAI announced that an internal general-purpose reasoning model disproved a long-standing conjecture related to an 80-year-old mathematics problem by Paul Erdős.'
Tools: Claude Code, Claude Security, OpenAI Codex, Google Vertex
PodcastAI FutureDeveloper Tools
Nate B Jones

OpenAI's Infra Lead: When App Teams Vibe-Code Faster Than Platforms Can Survive

Nate B Jones interviews Emma, who leads OpenAI's data platform infrastructure, on a power-law disparity: app teams "vibe code completely" while root-level platform teams drown under fast-generated, sometimes "adversarial" code — one flipped feature flag "took the entire Kafka cluster down." She also details agentic ops already in production (autonomous release pipelines, a self-healing export job that patched a bug three layers deep overnight) and argues the missing piece is multi-agent code review plus a "janky" private eval suite.[5]Nate B Jones: The Infrastructure Nightmare Nobody Is Talking About

Read more

Emma joined OpenAI in 2023 to lead the data platform infrastructure group, which underpins every team in the company — big data analytics, streaming, ML infra/feature stores, secure data piping, and the preparation of training and eval data ~00:00. She notes that a year ago the work resembled artisanal software engineering, but in the last six months things 'really started accelerating' as Codex and the models got much better, and she expects even bigger changes ahead ~02:00. The core argument is that acceleration is uneven: teams with limited blast radius (pre-production projects, alpha-only features) can 'vibe code completely' and ship features rapidly, but root-level infrastructure teams — where one change can affect thousands of teams — still need heavy guardrails and manual checking because the model can't yet one-shot perfect, near-100%-correct code ~08:06. This creates a 'transfer of responsibility' onto platform teams: users vibe-code Spark/Flink workloads, those workloads break, and the user often can't help debug because the code is Codex-generated ('I don't even know what Flink is') ~09:06. Emma frames the dynamic vividly: the upper layers now run on 'AI scaling laws' while the lower platform layers are still on 'human scaling laws,' which she calls unsustainable — a 'double whammy' where infra must simultaneously upgrade itself to be agentic and absorb a deluge of scaling problems raining down from the app layer ~17:14~27:17. A striking 'telegraph from the future' emerges around ~31:21~32:22: highly goal-directed agents hit the data/platform layer hard and can behave in ways that feel adversarial even without malicious human intent — e.g., agents calling internal APIs that 'should have never been exposed' or flipping a feature flag that 'took the entire Kafka cluster down.' Her advice to non-hyperscaler infra teams: buy back time first (deploy support bots to absorb low-urgency requests, encode best practices in AGENTS.md/skills), shore up systems against unintentionally adversarial agents (obfuscate internal APIs), then invest in wholesale system upgrades and agentic guardrails ~29:19~30:20~31:21. She is optimistic the models will get there but stresses the platform layer is on the bleeding edge of where scaling laws are headed ~27:17.

In this talk

  • ~00:00 Emma's role: OpenAI data platform infrastructure
  • ~02:00 How the job shifted in the last six months
  • ~08:06 Uneven acceleration: app teams vs platform teams
  • ~09:06 Transfer of responsibility onto platform teams
  • ~17:14 AI scaling laws vs human scaling laws
  • ~27:17 Platform teams at the cutting edge of scaling laws
  • ~29:19 Advice to non-hyperscaler infra/data leaders
  • ~31:21 Agents as unintentionally adversarial actors
The scaling laws of the upper layers are like AI scaling laws and the lower layers are human scaling laws and that's not sustainable.
I don't even know what Flink is. I don't really know how to use it... you guys should figure out what's broken and fix it.
They like flip a feature flag that they didn't mean to turn on and it just took the entire Kafka cluster down.
That's an internal API, should have never been exposed. I don't know how you found out about that.
A lot of these PRs generated... they're almost quite adversarial.
It's very much power law dynamics... people who are accelerating with AI can go much much faster than teams without.
Tools: Codex, Spark, Flink, Kafka, Kubernetes, Slack, AGENTS.md / agent MD files, skills

On the wins side, OpenAI's infra team has handed entire workflows to agents. Their release process — patching proprietary/OSS components, validating, and promoting through staging → canaries → prod across dozens of packages — used to be a manual, hours-to-days process; now 'the release process is the agent itself,' running autonomously, pinging Slack with status, and self-triaging failures, doing 'probably better than humans can' ~03:01~04:01. The headline anecdote ~05:03~06:05: a user launched a data-export-for-training skill and went to sleep; the agent got blocked, autonomously dug through four or five internal systems in the monorepo, found a bug 'three layers deep,' patched it, also pinged the support channel at midnight, and finished the job by morning — 'no conversations needed to be had.' For their data tooling (autonomous dashboarding/notebooking), Codex picks up Slack feature requests and runs the full loop including browser use — loading the DOM, performing clicks, validating its own fixes — returning a PR 'with all the proof of how it fixed it with a video and everything' ~06:05~07:06. Emma's central conjecture for the unsolved problems: a single model doing both code creation and review has misaligned incentives ('that's why we have people who write code and people who review code, and they're separate people'), so the missing ingredient is a multi-agent architecture — a code-owners++ setup where separate, specialized reviewer agents (one per affected team) plug into knowledge bases, runbooks, and past incidents ~11:07~12:08~34:24. On live operations, she draws a sharp primitives distinction ~23:16~24:16: a front-end agent needs a codebase, browser, and stub data, but debugging a live Spark cluster (thousands of nodes, dozens of clusters across regions) requires connecting to logging, observability, Kubernetes pods, the shuffle service, cluster routing, and quota management simultaneously — and you can't just 'try different things' on live systems. Agents today reliably pull statuses during deploys/incidents and suggest fixes, but OpenAI doesn't yet fully trust them to apply fixes autonomously ~25:16~26:17. On capability discovery and culture, Emma describes constant frontier-pushing — managers nudge engineers ('this took six hours? Have you pushed Codex hard enough?'), kudos go to people with innovative agent use cases, and the team re-evaluates capabilities monthly ~38:28~39:28. A standout practical takeaway both speakers emphasize: maintain a private eval suite for the core capabilities you want agents to perform, run it against each new model preview, and it can be 'very janky' — 'a notion doc with all of the evals and expected outputs' is enough ~37:26~40:28. Nate notes most large teams lack this discipline. He also shares his own tipping-point moment: telling Codex to simultaneously produce eight different documents from a local folder of transcripts and research, then agentically version them within the same context window ~41:31~42:33. Emma closes with leadership advice: 'business as usual is not going to fly anymore' — leaders must be visionaries, surface what's happening on the ground, and reassure teams rather than let fear about jobs dominate ~44:35~45:37. She signs off plugging Codex hard: 'Codex is amazing. Everybody should use it.'

In this talk

  • ~03:01 Autonomous release pipeline as the agent itself
  • ~05:03 Self-healing data-export skill: agent patches a bug 3 layers deep overnight
  • ~06:05 Full-loop Codex feature PRs with browser use and video proof
  • ~11:07 The case for multi-agent code review (separate creator vs reviewer)
  • ~23:16 Different primitives: front-end vs live Spark cluster operations
  • ~25:16 Live operations: agents suggest, humans still apply fixes
  • ~37:26 Culture of experimentation and private eval suites
  • ~40:28 Build a janky eval suite to test emerging model capabilities
  • ~41:31 Nate's tipping point: eight docs at once with Codex
  • ~44:35 Leadership advice: be a visionary, not business as usual
The release process is the agent itself. It does everything autonomously... probably better than humans can.
By the time a user woke up in the morning, it was completely done. No conversations needed to be had.
It'll come back with a PR with all the proof of how it fixed it with a video and everything.
That's why we have people who write code and people who review code, and they're separate people. I think there should be a multi-agent architecture for this kind of thing.
It doesn't have to be heavy investment. You can do a very janky eval suite. It's just like you have a notion doc with all of the evals and expected outputs.
Business as usual is not going to fly anymore. If you are a leader, you need to be a visionary.
Codex is amazing. Everybody should use it.
Tools: Codex, Slack, Spark, Flink, Kubernetes, skills, AGENTS.md / agent MD files, eval suites, Notion
Podcast
Nate Herk | AI Automation

The $100M AI Agency Playbook

Devin Karns lays out how to build an AI services firm with a nine-figure exit rather than a lifestyle agency: target the mid-market ($10M–$250M), productize a service ladder (workshop → $15–35K blueprint → build → rev-share partnership), and charge for judgment since "the value of actually doing development is trending towards zero." The flagship case study cut an e-commerce client's refund rate from 21% to 16% across 40,000 tickets/month.[6]Nate Herk | AI Automation: The Playbook for a $100M AI Agency

Read more

Devin frames the central choice for AI builders: a lifestyle agency (a few clients, $2-3K automations) versus a scalable services firm with a real exit. He argues now is the inflection point because the 'early majority' of companies are coming online to AI — driven by Opus 4.6 and Open Claw — and executives are being pressured by boards to produce an AI strategy ~06:02~07:04. His thesis: 'the value of actually doing development is trending towards zero' ~09:05, so charging premium hourly rates for building collapses; the durable value is the upfront business-case work, discovery, and judgment that AI can't supply. He targets the mid-market ($10M-$250M revenue), arguing it is better positioned than enterprises or SMBs because mid-markets already have SOPs, decision trees, and clear KPIs to convert into AI systems ~22:14~23:14. The flagship case study is an e-commerce client processing 40,000 tickets/month: they cut the refund rate on their main product from 21% to 16% (4-5 points vs. the 1-2 point target), worth millions to the bottom line and funding a reinvestment flywheel into ads ~18:12~19:12~20:13. On acquisition math, borrowing from Hormozi: services firms re-rate at scale, not category — a business under ~$5-6M ARR sells at roughly 1-2x, but past that threshold the multiple roughly doubles to ~5x. He gives a concrete example: a pure AI-readiness consultant making $2M/yr might only fetch ~$2M, but at $6M/yr you can sell for ~$30M ~47:29~48:30. The service ladder is a productized framework: a fixed-price AI Workshop (1hr landscape session + 2hr industry-specific session to build trust), a Blueprint/discovery engagement priced at $15K-$35K (deliverable a client could take to any dev shop), a custom build project, and ultimately an AI Technology Partnership modeled on growth-marketer rev-share deals (e.g., taking ~15% of topline) where attribution to a clear KPI with a direct throughline to the P&L is essential ~68:41~70:43~72:46. He notes 71% of AI investment has gone into sales and marketing because the ROI is easiest to attribute ~75:47. Devin also stresses clients aren't buying an AI system — they're 'buying relief' and the ability to tell their board they have a future-proof AI strategy ~59:36~60:37. His proprietary 'Agentic Operating System' framework is an event-driven harness that classifies and routes events to workflows kept as deterministic as possible, with the orchestration in the harness (not the LLM) and LLMs called only when needed ~82:51~83:53. He closes with five things he wishes he knew sooner: (1) decide who you want to be and commit; (2) package/productize your service rather than 'build whatever you want'; (3) charge what the value is worth (performance-based, since time-and-materials trends to zero); (4) build the funnel before you need it — he'd start with workshops, then blueprints, then build, then partnerships; and (5) hire for other talents than your own ~95:02~97:04. On QA, he warns about 'dark code' from vibe coding and stresses human-defined success-criteria checklists, noting margin-of-error compounds at scale — 1% wrong at 100 tests becomes 200 wrong at 10,000 production runs, which can sink a business ~62:38~64:39~65:40.

In this talk

  • ~04:01 Lifestyle business vs. building real enterprise value
  • ~06:02 Why now: early majority coming online, dev value trending to zero
  • ~18:12 E-commerce case study: 21% to 16% refund rate, ROI flywheel
  • ~22:14 Why mid-market ($10M-$250M) is the prime target
  • ~34:19 11 playbooks for making money as an AI expert
  • ~46:29 Acquisition math: re-rating multiples past $5-6M ARR
  • ~68:41 Service ladder: workshop, blueprint ($15-35K), build, partnership
  • ~82:51 Agentic Operating System framework / harness architecture
  • ~95:02 Five things I wish I knew sooner
The value of actually doing development is trending towards zero. It's very very stark and very very clear that it's happening.
If you're making two million a year as a pure AI readiness consultant, you can sell that business for probably $2 million. But if you're making six million a year, now I can sell it for like 5x, I can sell it for 30 million.
They're not buying an AI system or an automation. They're buying relief — the fact that they can say they have an AI strategy, they trust it, it's future proof.
The pattern across every single 97% of the projects we've done is we're taking the logic of the business and just converting it into an AI system.
AI is not its own bucket. Every single vertical will just have AI seeping into it.
Honestly, most people probably shouldn't do their own thing. It's competitive and most people fail.
1% wrong at a volume of a thousand is 10 wrong... at 10,000 per unit it's 200 wrong, and if 200 are wrong the business is going to collapse.
Tools: Claude, Claude Code, Opus 4.6, Open Claw, GPT-5.5, OpenAI Codex, n8n, LangChain, LangGraph, Zapier, Microsoft Copilot, Modal, RAG, Hermes agent
Developer Tools
AI Engineer

Kang & Aaron at AI Engineer: Agentic Evaluations at Scale

Two Kaggle/Google DeepMind team members argue evals today are scattered, opaque, and authored by too few people (~30K AI researchers vs ~30M technical pros), then walk through four open Kaggle products: hackathons, one-line standardized agent exams, Game Arena (PvP benchmarking that resists saturation — poker needed ~400,000 hands for significance), and a community benchmarks platform. Their sharpest point: a ~22% SWE-bench Pro swing comes from the harness, not the model.[7]AI Engineer: Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind

Read more

Nick (a PM on Kaggle benchmarks who runs the benchmarks platform and agentic eval solutions) and Michael (a Kaggle software engineer focused on evals and benchmarks) frame three core problems with AI evaluation today ~01:16: evals are scattered, decentralized, and go stale fast (10+ benchmarks drop daily, authors abandon leaderboards to chase new papers) ~02:18; evals aren't transparent, accessible, or verifiable — published model charts hide how benchmarks were configured and orchestrated ~02:18. Nick gives a concrete anecdote where a competing AI lab re-ran a benchmark Kaggle had published with another lab and got much better numbers because they optimized for their own model, using API-provided compaction that Kaggle hadn't applied uniformly across models ~03:20. The third problem is the tiny pool of eval authors: ~30,000 AI researchers vs. ~30 million technical professionals, meaning whole domains go un-benchmarked and can't be hill-climbed, exacerbating jagged/uneven model capability ~03:20. The motivating example is a Turkish wastewater treatment plant engineer of 20 years who built a proprietary benchmark from his own experience (after a fatal safety-protocol incident) — data that exists nowhere else and isn't economically interesting to AI labs, which is why open-source eval contributions matter ~04:21.

The team presents four Kaggle solutions, all live and largely open source ~11:27. Hackathons channel community energy toward eval creation with guardrails; a current one runs with the Google DeepMind AGI team, targeting 5 of the 10 cognitive faculties from DeepMind's recent AGI-measurement paper ~07:22. Challenges include providing tools (hosted datasets, model access for participants who can't afford API keys), enabling shareable write-ups, and the fact that AI is bad at judging innovation/creativity so human experts (and inter-expert alignment) are still needed ~08:24. Standardized agent exams (originally 'SATs' before a trademark issue) let anyone paste a one-line prompt for their agent, which then takes an exam and gets a leaderboard score ~09:25; launched a week prior, it already had 500+ agents evaluated with little promotion, and Nick floats safety-focused baseline exams for consumer agents before they're given access to inboxes or Amazon accounts ~10:27. Michael covers Game Arena, a PvP benchmarking approach that resists saturation since there's always a winner and loser ~11:27; games chosen to probe distinct capabilities include Werewolf (deception), poker (randomization/risk — Grok goes all-in, newer more risk-averse models are actually worse), and chess ~12:27. The pipeline: design/iterate games, build harnesses on Open Spiel (RL framework), run simulations via an LLM model proxy (available on Colab) on the Kaggle simulation platform, schedule runs using Bradley-Terry pairwise to limit game count, then publish conversation datasets, Elo scores, and a game visualizer ~13:28~14:28. Key challenges: cost (poker needed ~400,000 hands for statistical significance) ~15:28, keeping the community engaged watching LLMs play, and honest model identification across deprecation cycles ~16:29. Finally, the benchmarks platform — explicitly not a production eval tool — lets anyone build, run, and share evals openly using assertions plus LLM judging, grouped into tasks aggregated into benchmarks ~16:29. Michael cites a Morph LLM blog post (March 16) showing six frontier models within a few points on SWE-bench Pro but a ~22% swing depending on the harness, illustrating the hard ambiguity of what's actually under test in agentic evals — model vs. harness ~18:30~19:31. He closes on incentivization difficulty and fast release/deprecation cycles making over-time comparison tricky.

In this talk

  • ~00:15 Intros: Nick (PM) & Michael (SWE) on Kaggle benchmarks
  • ~01:16 Problem 1: evals scattered, decentralized, stale fast
  • ~02:18 Problem 2: evals not transparent/verifiable + lab re-run anecdote
  • ~03:20 Problem 3: too few eval authors, jagged capability
  • ~04:21 Wastewater engineer benchmark — case for open evals
  • ~06:22 Hackathons + DeepMind AGI cognitive-faculties hackathon
  • ~09:25 Standardized agent exams (one-line prompt, leaderboard)
  • ~11:27 Game Arena: PvP benchmarking to fight saturation
  • ~13:28 Game Arena engineering pipeline (Open Spiel, proxy, Bradley-Terry)
  • ~14:28 Game Arena challenges: cost, engagement, model honesty
  • ~16:29 Benchmarks platform: community evals, assertions + LLM judge
  • ~18:30 Agentic eval challenge: harness vs. model ambiguity (Morph LLM)
AI evals today are kind of broken... evals are scattered, decentralized, and get stale fast.
The difference was that they were optimizing it for their model... they had used compaction that they had provided through their API, and we didn't for all the models we ran. So the results you're seeing don't always reflect the actual state of things.
We expect AI to help most of humanity, but then a very small percentage of people are creating all these evals. And if something's not being evaluated, not being benchmarked, we cannot hill climb on it.
Most of them aren't actually testing their agents before they're sending them out to the real world... do a quick baseline of your agent before you send it out into the world to run your inbox, to run your Amazon accounts.
Game Arena is an approach to help us work against the saturation by just having PvP, and so you can never have saturation because you'll always have one model able to compete against others.
For poker, in order to get statistical significance, we had to run about 400,000 poker hands.
The thing that really matters a lot for coding performance is what harness is it running inside of, with like a 22% difference depending on the harness... are you testing the harness? Are you checking the model?
Tools: Kaggle, Kaggle benchmarks platform, Kaggle hackathons platform, Standardized agent exams, Game Arena, Open Spiel (RL framework), LLM model proxy (Colab), Kaggle simulation platform, Bradley-Terry pairwise rating, Elo scores, Braintrust, SWE-bench Pro, Morph LLM, Google DeepMind AGI cognitive-faculties paper, OpenCLAW (consumer agents)
Developer Tools
AI Engineer

Hetzel at AI Engineer: Does GenAI Belong to Data Scientists?

Braintrust's Phil Hetzel argues agentic development shouldn't be siloed inside ML/data-science teams — "the answer is always in the middle." Since the model is already built, value now comes from natural-language prompts and context engineering, and data scientists wrongly fixate on precision/recall/F1 when agents need far broader functional evaluation. His fix: diverse teams where data scientists act as guardrails and rigorously validate LLM-as-judge.[8]AI Engineer: Does GenAI "belong" to data scientists? — Phil Hetzel, Braintrust

Read more

Phil Hetzel, who leads solutions engineering at Braintrust (an agent quality platform built on two pillars, evals and observability), opens by polling the room of data scientists and ML engineers and warning his thesis won't be entirely to their liking ~00:14. He frames two archetypes of organizations: traditional enterprises, where a CEO/CIO reads that they 'need agents' and delegates the work down to an existing ML/data science platform team simply because generative AI has 'AI' in the name, versus AI-native companies that build their whole offering around agents with small, cross-functional teams that have closer proximity to the problem ~02:14~03:14~04:14. He stresses two key differences between traditional ML and GenAI: the model is already built (Anthropic, OpenAI, Mistral did the data pipeline and training), so teams no longer do the train/test/cross-validation dance, and you add value not via feature engineering but via natural language — prompts and context engineering — which invites a broader, non-statistical skill set ~05:14~06:14~07:14.

Hetzel then argues both sides. FOR data scientists owning agents: they govern models, understand how neural nets/LLMs work, appreciate the inherent risks, and bring rigorous testing processes ~07:14~08:14. AGAINST: the model is already trained so the old pipeline doesn't apply, and crucially many ML folks lock onto traditional metrics like precision, recall, and F1 and obsess over them, when agents demand evaluating a far broader functional surface area, not just technical performance ~09:14. For non-data-scientists owning agents: LLMs are just APIs that product engineers are used to consuming; complex distributed agents (supervisor calling sub-agents across different infrastructure) are a systems problem outside a stats/math background; and subject-matter experts/PMs with the closest proximity to the problem should control prompts and perform human annotation on agent traces, judging whether and why an agent performs well ~09:14~10:14~11:16. His landing point: don't tell data scientists to refresh their skill set, but build diverse teams ~12:16. Data scientists add value as 'the adult in the room' / guardrails (reminding people LLMs just predict token after token), by treating LLM-as-judge rigorously — judges are just prompts and models, so build labeled datasets and apply precision/recall/F1 to the judges themselves — and by fine-tuning open-source models when needed ~12:16~13:16. The ideal mix has product/systems engineers implementing requirements and eval/observability pipelines, and non-technical experts doing human annotation and prompt/context engineering ~14:18~15:18. In Q&A he agrees agents are 'a product a diverse team builds' rather than another predictive model isolated to ML engineers ~16:19, and describes Braintrust's loop-closing tooling: a human labeling component, a prompt/agent playground, and gathering production data to continually grow the offline eval dataset and check whether evals are converging toward or diverging from human agreement ~17:19~18:19.

In this talk

  • ~00:14 Intro: Hetzel, Braintrust, and the provocative thesis
  • ~02:14 Two org archetypes: traditional enterprise vs AI natives
  • ~05:14 Traditional ML vs GenAI: model is already built, value comes from language
  • ~07:14 The case FOR data scientists owning agents
  • ~09:14 The case AGAINST: wrong metrics, LLMs as APIs, systems problems, domain experts
  • ~12:16 Landing in the middle: how data scientists add value (guardrails, LLM-as-judge, fine-tuning)
  • ~14:18 The ideal team mix and eval/observability pipelines
  • ~15:18 Q&A: agents as a tool/product and closing the feedback loop
The model's already built. So much of what data scientists and machine learning engineers go through is that data pipeline of training a model. What do we do when the model's already built?
They will really lock on to the traditional ML engineer metrics like precision, recall, F1, and they'll obsess over those metrics... but when you're analyzing agents, it is far broader of a surface area that you need to be evaluating.
I think data scientists can be the adult in the room... the LLM, this is how it's trained. It's just predicting token after token. It doesn't actually know anything really.
LLM as judge is a huge part of the eval process... they're just prompts and models at the end of the day, and it's very easy to create some labeled data set and perform the traditional recall, precision, and F1 style metrics on those.
The answer is always in the middle. Ton of value for data scientists. Just make sure that you're bringing more folks into the room as you're building agents.
It's a product that a diverse team builds. The mistake that I see traditional companies make is they say, 'this is another predictive model,' and they isolate it to the ML engineers.
Tools: Braintrust, Anthropic, OpenAI, Mistral, Databricks, Slalom Consulting
AI Future
AI Engineer

McLean at AI Engineer: Bounded Autonomy

Oliver's AI director Angus McLean — whose 3,000-person agency generates ~4,000 assets/day across 200+ brands — argues LLMs are closed boxes that don't understand their data, so the best agents come from deliberately bounding autonomy, not chasing capability. Practical takeaways: slow down, keep it simple (a complex app lost ~10–100x to just outputting HTML), treat AI as translation between representations, and remember "constraints create creativity."[9]AI Engineer: Bounded Autonomy: Between Free Will and Determinism — Angus J. McLean, Oliver

Read more

Angus J. McLean opens by framing the talk as "conventional wisdom for unconventional times" — non-technical, non-prescriptive advice for anyone actively building agents, from overwhelmed beginners to experts stuck in a rut ~00:14. He's AI director at Oliver, a 3,000-staff, 46-country agency that pivoted from advertising to nearly fully gen-AI, generating ~4,000 assets a day for 200+ brands (e.g. Johnnie Walker) and putting real media spend (20K to several million) behind them to get a measurable real-world feedback loop ~01:15. He maps the agency's three departments — accounts, creative, strategy — noting creative and strategy were knowledge work but are now increasingly agentic, covering ideation, copywriting, content production, audience insight, trends analysis, and performance optimization in a high-risk, fast-paced, customer-facing setting where scaled assets can hurt a brand as easily as help it ~02:16~03:17. Agents are used primarily for speed and secondarily for scale, enabling fast iteration/testing and research scaling for campaign and territory personalization (e.g. localizing for New York vs Miami) ~04:18.

McLean's central thesis is that LLM autonomy should be bounded, because the models fundamentally don't understand their data. His advice "slow down" rests on the claim that LLM cores haven't really changed since the 1990s and that recent gains come more from brute-force compute than material breakthroughs; he flags persistent limits like data inefficiency and the lack of continuous learning without forgetting ~04:18~05:19. He describes most tools — and even model training — as temporary "band-aids" masking symptoms rather than fixing problems, and prefers to think of an LLM as a closed "flexible database capable of doing semantic math" with no real emergence; this shows up concretely as poor trend identification on genuinely new things ~05:19~06:20. He frames context windows as the main driver of recent agentic capability and a soft constraint developers can shape — citing the GPT-2 vs "Gemini 3.5 Pro" gap, an agent that deleted emails when context ran out, and the observation that world knowledge doubles ~every 12 hours so context will never be enough ~06:20~07:20. Practical implications: giving models curated high-quality documentation instead of raw internet access yields better results, since models are bad at spotting promotional content and susceptible to SEO; the modern challenge has flipped from getting context in to keeping noise out ~08:20~09:20. He argues "constraints create creativity" — asking how little of the context window you can use, building your own harness/memory/compaction, preprocessing and archiving, and learning file systems and knowledge graphs to get closer to the model and improve fundamentals, with nods to SpaceWar in 4K, Crash Bandicoot's PS2 memory tricks, and Rosenblatt's perceptron wiring foreshadowing dropout ~09:20~10:21~11:21. His "keep it simple" point comes from a CV-building anecdote where a complex application was beaten ~10-100x by just outputting HTML, since models are naturally verbose and tend toward complexity ~11:21~12:23. He then reframes AI as translation at its core (per "Attention Is All You Need," originally English-to-French), arguing knowledge production is itself summarization and that data structure is a property of the representation/observer, not the object — so builders should mix representation structures: markdown for human-readable hierarchy, graphs for relationships, clustering for large unstructured text, folders for fast retrieval, timelines where relevant ~12:23~13:23~14:23. He closes urging thoughtful play and hackathons, invokes Adam Smith's pin factory to argue workflows often beat loosely structured agents, warns "don't automate a job unless you can do it yourself," and demos a social-media-intelligence report clustering 50,000 tweets into strategies delivered almost instantly to creative/strategy teams ~14:23~15:23~16:23.

In this talk

  • ~00:14 Talk framing and Oliver's gen-AI ad agency
  • ~02:16 How an ad agency works and where agents fit
  • ~04:18 Slow down: LLMs are closed boxes that don't understand
  • ~06:20 Context windows as soft constraints and band-aids
  • ~09:20 Constraints create creativity; build your own harness
  • ~11:21 Keep it simple: the HTML CV lesson
  • ~12:23 AI as translation between representation structures
  • ~15:23 Workflows vs agents, automation caveat, tweet-clustering demo
Today's large language models don't actually understand the data they're presented with.
Most of our tools are band-aids... they mask the symptoms rather than fixing the problem.
Context windows keep getting larger, but they'll never be enough.
The challenge is no longer getting context in, but keeping the noise out.
Constraints actually create creativity. Abundance stops you being scrappy.
Just because you have the power of the gods, that doesn't mean you should use it.
AI at its core is just translation.
Don't automate a job unless you can do it yourself.
Tools: GPT-2, Gemini 3.5 Pro, MCP, TF-IDF, HTML, markdown, knowledge graphs, SpaceWar, PS2
Developer Tools
Matt Pocock

Matt Pocock's /grill-* Skills: 9 Common Mistakes

Matt Pocock walks through nine failure modes with his /grill-me and /grill-with-docs Claude Code skills, which relentlessly question engineers until shared understanding is reached. Key models: "question fidelity" (grillable vs ungrillable), keeping scope small to avoid the ~120K-token "dumb zone," preserving the grilling session's context as the artifact, using frontier models for parametric-knowledge-heavy planning, and running two grilling sessions in parallel.[10]Matt Pocock: 9 Things People Get Wrong With My /grill-* skills

Read more

Matt Pocock opens by framing the core purpose of his /grill-* skills: they are meant to aid engineers in planning, not replace them ~00:00. The skills ask relentless questions until a shared understanding is reached, but this requires the user to be skilled at planning, scoping, and judging question fidelity. He structures his advice around several mental models before listing failure modes.

The first key framework is question fidelity, borrowed from Ryan Singer's Shape Up ~00:02. Low-fidelity questions (e.g., what URL should this route live on) can be answered directly in a grilling session — these are 'grillable.' High-fidelity questions (e.g., how should this multi-step form feel to use) require seeing a prototype before they can be answered — these are 'ungrillable.' When an ungrillable question surfaces, Pocock recommends a prototyping handoff: hand off to a prototype session, explore the question there, then hand back to the original grilling session ~03:02. Scope is the next failure mode — grilling on too large a scope leads to buried high-fidelity questions and burns through the context window into what Pocock calls the 'dumb zone,' which he estimates begins around 120K tokens for most frontier models ~04:02. The fix is to break large scopes into smaller ones ahead of time.

Pocock also warns against both extremes of user engagement ~06:04: being too passive lets the agent spiral into 200+ questions and explode scope, while being too active means endlessly grilling on low-fidelity details instead of shipping code. Context preservation is another major failure mode ~07:05 — he has seen users clear their context window before running his /2prd skill to generate a handoff document, throwing away 100K tokens of valuable design decisions. Instead, the grilling session's context is itself the artifact; if budget allows, implement directly from it; if not, run /2prd inside that same session to produce a structured handoff. Model selection matters too ~09:06: grilling relies heavily on parametric knowledge (what the model was trained to know), so a capable frontier model is essential; smaller models can be used for implementation since that phase is mostly contextual. Finally, Pocock recommends running two grilling sessions in parallel ~11:06, bouncing between them as each waits for a response — effectively doubling planning throughput with manageable cognitive overhead.

  • Question fidelity: grillable vs. ungrillable questions
  • Prototyping handoff workflow
  • Scope management and context window limits
  • Passive vs. active engagement with the agent
  • Preserving grilling session context as a valuable artifact
  • Model selection: parametric knowledge requires frontier models
  • Running parallel grilling sessions for throughput
The idea of these skills is that they relentlessly question you until you reach a shared understanding about something.
These skills themselves are really not super long, and they're designed to aid you as an engineer, not replace you as an engineer.
About 120K is where I estimate most state-of-the-art models — that's where their dumb zone begins.
Every grilling session, every decision that you make in that session is so valuable. That should be recorded somewhere and either turned into code or put into a handoff document.
When you're relying on parametric knowledge like this, you need a model with lots of parameters.
People say this is context switching, but really it's just managing two separate Slack threads at the same time.
Tools: /grill-me (Claude Code skill), /grill-with-docs (Claude Code skill), handoff skill (Claude Code), /2prd skill (Claude Code), Shape Up (Ryan Singer — planning methodology referenced)
AI Tools
Nate B Jones

Building a Portable, Cross-Platform Agent Memory

Nate B Jones argues memory architecture determines agent capability more than model choice, yet Claude, ChatGPT, Grok, and Google all wall off memory so none follows you across platforms. His proposed escape: a stable, MCP-compatible memory layer you own, so you can plug in new tools without re-architecting around each vendor's walled garden.[11]Nate B Jones: How to build a 10-cent AI brain

Read more

Nate B Jones argues that the biggest limiter for AI agents is not model quality but memory design — a poorly constructed memory system forces users to re-explain context constantly. While Claude, ChatGPT, Grok, and Google all now have native memory, none share state with each other: memory from one platform doesn't follow you into a coding agent or another app. The proposed solution is a stable, future-proofed memory layer built around MCP servers, letting users plug in new tools efficiently without re-architecting around each AI platform's walled garden.

Memory architecture determines agent capabilities much more than model selection does.
Every platform has built a walled garden of memory, and none of them talk to each other.
Tools: Claude, ChatGPT, Grok, Cursor, MCP
AI Tools
Better Stack

NVIDIA Is Giving Away Free AI Tokens (With a Catch)

As most providers raise prices, NVIDIA quietly opened free API access to 70+ models (DeepSeek, MiniMax, GLM, Kimi) on its OpenAI-compatible NIM platform — a drop-in for Cursor, Zed, and Open Claude. The catch is explicit in the privacy policy: all trial prompts and responses are logged to train NVIDIA's models, with a warning not to upload confidential data.[12]Better Stack: NVIDIA is Giving Away Free* AI Tokens (*With a Catch)

Read more

As most AI providers raise prices, NVIDIA quietly launched free API access to over 70 models on its NIM platform. The interface is OpenAI-compatible, making it a drop-in for tools like Cursor, Zed, and Open Claude. The trade-off is explicit in NVIDIA's privacy policy: all prompts and responses during the trial period are logged to improve NVIDIA's AI models, and users are warned not to upload confidential or personal data. There is additional debate about whether data is also forwarded to the third-party model providers (e.g., DeepSeek's own servers).

If the product is free, you're probably the product.
Do not upload any confidential info or personal data.
Tools: NVIDIA NIM, DeepSeek, MiniMax, GLM, Kimi, Cursor, Zed, Open Claude
Industry
Last Week in AI

11% GPU Utilization and the Case for Anthropic + SpaceX Orbital Data Centers

Last Week in AI flags that 11% GPU utilization at a massive facility means two-thirds to three-quarters of its multi-billion-dollar cost is "atrophying" — making Anthropic, which can absorb 300 MW of spare capacity, a natural partner. The more unusual angle: SpaceX needs a credible AI story for its IPO, and orbital data centers give it one.[13]Last Week in AI: The 11% GPU Utilization That Should Leave You Gasping

Read more

The clip argues that shockingly low GPU utilization (11%) at large AI facilities represents an enormous capital inefficiency. Anthropic, which knows how to maximize GPU usage, is a 'match made in heaven' for absorbing that idle capacity. The more unusual angle is Anthropic's apparent interest in SpaceX's orbital data center play — SpaceX needs a credible AI story for its IPO, especially given xAI's underwhelming performance, and positioning orbital compute as a necessary future infrastructure gives them that narrative.

11% — that should leave you gasping because 2/3 to 3/4 of the tens of billions of dollars of cost of this facility is just atrophying.
Anthropic saying, 'We can use 300 MW. Yeah, 300 MW lying around. Let's do it.'
Tools: Anthropic, SpaceX, xAI
AI Future
Sequoia Capital

Notion's Ivan Zhao: Building With LLMs Is Like Brewing Beer

Notion's Ivan Zhao contrasts classic software (predictable, like engineering a bridge) with LLM products (unpredictable, like brewing beer): you can't direct a model the way you direct engineers, so teams must experiment with the technology first rather than running the usual customer-first product loop.[14]Sequoia Capital: Building with LLMs is like brewing beer | Ivan Zhao, Notion

Read more

Ivan Zhao argues that classic software follows a deterministic path — PM gathers requirements, designer shapes them, engineers build — because the outcome is largely predictable. LLM-based products are fundamentally different: you cannot instruct a model the way you direct engineers, so teams must 'throw their best people in' and let the technology reveal what's possible. This inverts the usual product process from customer-first to technology-first experimentation.

You can't truly predict the things. You cannot tell the yeast, 'Hey, go towards that flavor profile more.'
It's not customer first, it's more like experiment with this technology first.
Tools: Notion
Hot Take
EO

A CS Professor on Why Slow, By-Hand Learning Wins

CU Boulder's Tom Yeh argues that working AI math out by hand — at human speed — is what produces real understanding, because "having an answer doesn't mean you know it." He makes the case that evergreen foundations (matrix multiplication) outlast every hype cycle, that skipping a tool is fine but skipping skill-building isn't, and that AI "cheating" is a symptom of broken incentives, not the cause.[15]EO: A CS Professor on Why Slow Learning Wins in the AI Era | CU Boulder, Tom Yeh

Read more

Why Slow, By-Hand Learning Beats AI Speed

Tom Yeh, a CS professor at the University of Colorado Boulder and founder of the 'AI by Hand' education initiative, explains that he learned deep learning from scratch as an 'old professor' who had missed it entirely as a student (he studied support vector machines and traditional ML), and the only way he could actually internalize transformer math and algorithms was to draw and write everything out on paper ~01:00. He frames the slowness as the feature, not the bug: when he taught a full semester of C++ on a blackboard instead of live coding, he found three benefits — he could only go at the human speed of his own writing, students could only learn at the human speed of following along, and getting students to copy notes by hand kept their hands off the keyboard and away from Instagram, improving focus ~02:00~03:01. His central contrarian point is that AI handing you an answer instantly is not learning: 'Having an answer doesn't mean you know it. People can buy degree, buy certificate.' ~00:00 He argues that ownership of knowledge is proportional to the time you spend acquiring it ~04:01, so the value of doing math by hand is precisely that it forces you back to a human pace and a human way of connecting to the material ~03:01.

In this talk

  • ~00:00 AI by Hand and the purpose of learning
  • ~01:00 Catching up on deep learning by writing it out
  • ~02:00 Teaching a C++ semester on a blackboard
  • ~03:01 Going at human speed; ownership of knowledge
Having an answer doesn't mean you know it. People can buy degree, buy certificate.
I can only go at a humanly possible speed of my writing. I cannot go any faster than I can write.
Whether you own something, you value something, is actually proportional to how much time you spend on acquiring that piece of knowledge.
Tools: AI by Hand

Foundations Over Tools, and Cheating as a Symptom

Yeh traces how matrix multiplication kept resurfacing across every tech wave — CGI after Jurassic Park, big data, machine learning, today's AI, and likely quantum computing next — to make the case that there is always a 'core and foundational' layer that is evergreen while surface tools churn ~04:01~05:01. He uses South Korea's Gyeongbokgung palace, which burned down in the 1500s but was rebuilt in the 1800s on its original solid-rock foundation, as a metaphor: build skills on a solid foundation and you can always rebuild when a tool goes obsolete; chase only surface features and you keep rebuilding houses with no foundation ~05:01~06:02. He notes that DeepSeek and Google Cloud were briefly hot but the transformer topic stays relevant ~05:01. His advice to individuals: the meta-skill of being able to acquire difficult skills (piano, chess, soccer) is part of your identity and doesn't change, so skipping any single AI tool is fine, but skipping your next practice session is not ~06:02~07:05~08:06. As an educator he now cares less about whether students remember the transformer or attention-mechanism equations and more about the willingness to open the black box and take on the challenge ~08:06. On cheating, he recounts elaborate anti-Chegg measures while teaching intro programming and his hope that Chegg would shut down — but observes that Chegg is gone (partly via AI) and students still cheat, so AI is merely a new symptom of the deeper problem: a societal incentive system that doesn't reward real learning ~09:07~10:08. He closes with a hiring argument — hire for work ethic, problem-solving, and team play, and those people will adopt AI automatically; 'AI cannot change people, but you can change AI.' ~11:08~12:09

In this talk

  • ~04:01 Matrix multiplication survives every hype cycle
  • ~05:01 The Gyeongbokgung palace foundation metaphor
  • ~06:02 Skip the tool, not the skill-building
  • ~09:07 Cheating is a symptom of incentives
  • ~11:08 Hire for fundamentals; AI follows
There's always something that's core and foundational that doesn't change. It's evergreen.
If you skip your next piano practice, you skip your next soccer practice and you give up on that, that's not fine, because that is going to eventually define who you are. But not this tool.
Chegg is gone, people still cheat. I bet when AI is gone, people can still find ways to cheat.
AI cannot change people. Only you can change AI.
When we hire someone because they're a problem solver, that person is automatically just going to learn AI. You don't have to tell them.
Tools: DeepSeek, Google Cloud, Chegg
Hot Take
Arjay McCandless

Ranking Every Programming Language (Hot Takes)

Arjay McCandless ranks languages by salary, demand, growth, and fun: TypeScript and Go land in S tier (78% of JS postings now want TypeScript; Go is "growing in demand but underserved"), Rust sits in only B tier despite the love because "the market never really picked up," and Swift draws the harshest take — "Xcode sucks, Swift sucks" — landing in D tier.[16]Arjay McCandless: ranking every programming language

Read more

Arjay McCandless ranks programming languages across five criteria: average salary, difficulty to learn, job openings, demand trajectory, and enjoyment. The biggest surprises land at both extremes. TypeScript earns S tier ~04:03 not just because it fixes JavaScript's quirks but because AI tooling has made it the default — he cites that 78% of JS job postings now require TypeScript and that companies like Anthropic heavily index on it, calling it 'largely part of the default AI tech stack.' Go also gets a last-minute upgrade from A to S tier ~07:04 with the host noting it is 'growing in demand but underserved by the developer community' — barely taught in schools and underrepresented in social media, yet exploding for microservices and cloud/DevOps work.

On the downside, Swift lands in D tier ~08:04 with a notably emotional take: 'Xcode sucks, Swift sucks. It's been one of the most painful developer experiences I've ever had.' He suggests using React Native, Kotlin Multiplatform, or just letting Claude Code write the Swift for you. PHP also hits D tier ~05:03 due to declining market share. Rust, despite near-universal developer love, is only B tier ~01:01 because 'the market for it has never really picked up' — C++ developers won't switch and Python developers don't see the need. The host recommends learning one or two languages deeply (his path: Java first, then Python, Kotlin, and TypeScript on the job) and letting AI handle the rest.

  • S tier: TypeScript, Go, Python
  • A tier: C++, C#, Java, Kotlin
  • B tier: Rust, JavaScript, Kotlin (initially, then promoted)
  • C tier: C, Ruby, Scala
  • D tier: PHP, Swift
[01:01] 'The market for [Rust] has never really picked up... developers who work on more performant code are already familiar with C or C++, and they don't have a reason to switch.'
[04:03] '78% of job postings that mention JavaScript want TypeScript... it is largely part of the default AI tech stack.'
[07:04] 'Go is kind of on my list for places that are growing in demand but underserved by the developer community. Nobody talks about it in school.'
[08:04] 'Xcode sucks, Swift sucks. It's been one of the most painful developer experiences I've ever had in my life.'
Tools: Node.js, Cargo (Rust package manager), Spring Boot / Spring ecosystem, Apache Spark (Scala), React Native, Kotlin Multiplatform, Xcode, Claude Code
Developer ToolsAI Tools
Real PythonThe Pragmatic EngineerBetter StackArjay McCandless

Developer Shorts: Gemini CLI, Rust Errors, Ruby's Vanishing Week, and System Design Qs

A roundup of shorter developer clips: a Real Python course on Google's terminal-based Gemini CLI; why Rust returns errors as Result values instead of throwing (forgetting the ? is a compile error); Ruby's DATE_ITALY constant that throws on the 10 days that vanished in the 1582 Julian→Gregorian switch; and the three most common system-design interview questions (URL shortener tops the list).[17]Real Python: Getting Started With Google Gemini CLI in Python[18]The Pragmatic Engineer: Alice Ryhl: Rust doesn't use exceptions[19]Better Stack: This Specific Date Causes a Runtime Error (Here's Why)[20]Arjay McCandless: Most Common System Design Question

Read more

Google Gemini CLI for Terminal-Based AI Coding Assistance

The course covers installing Gemini CLI, authenticating with a Google account, and using it for practical coding workflows — analyzing code, spotting bugs, requesting code quality reviews, and generating doc suggestions — all from the terminal. It is structured into bite-sized lessons with transcripts and closed captions, available on realpython.com.

Tools: Gemini CLI, Google Gemini, Real Python

Rust's Explicit Error Handling with Result Types

Alice Ryhl explains that Rust eschews exceptions in favor of returning errors as values via a Result enum (Ok or Err). The `?` operator provides a shorthand: appending `?` to a function call means 'if this failed, return the error from the current function.' Crucially, forgetting to handle a result is a compile-time error, so the compiler enforces error checking. This design prevents the class of bugs where an unhandled implicit error condition silently crashes a server.

If you forget to put the question mark, that's a compilation error.
There are these things where you write some code and there's some implicit error condition you didn't think of and now you just took down your server.
Tools: Rust

Ruby's DATE_ITALY Constant and the Julian-to-Gregorian Calendar Transition

Ruby's Date class includes a constant DATE_ITALY (a Julian Day Number) representing October 15, 1582, when Pope Gregory XIII's Gregorian calendar reform took effect in Italy, Spain, and Poland. To correct accumulated Julian leap-year drift, 10 days were skipped: October 4 was followed directly by October 15. Ruby uses this boundary dynamically: dates after it use Gregorian math, dates before it use Julian math. Querying a date like October 9, 1582 raises an ArgumentError because that week literally did not exist in the calendar. A parallel constant DATE_ENGLAND (Sept 14, 1752) handles the British Empire's later 11-day jump. Passing DATE_GREGORIAN (-Infinity) forces consistent Gregorian logic across all eras.

10 days of human history just vanished overnight.
Ruby's date handling logic is a fascinating reminder that our modern date systems ultimately have to deal with the inconveniences of our messy human history.
Tools: Ruby

Top 3 Most Common System Design Interview Questions

The video ranks the most frequently asked system design questions at top tech companies. Designing a chat app (3rd) tests understanding of offline message handling, WebSockets, and clean API/database design. Designing a social media platform (2nd) tests knowledge of massive-scale architecture, image/video processing pipelines, and CDNs. URL shortener (1st) is the most universal question — it probes API design, storage trade-offs, and unique ID generation strategies. Recommended study resources include 'Designing Data-Intensive Applications' and the Daily.dev app.

Design a URL shortener like TinyURL — this is like the two-sum of system design questions for some reason.
Tools: Daily.dev, Designing Data-Intensive Applications

Sources

  1. Blog Notes on Pope Leo XIV's encyclical on AI — Simon Willison, May 25
  2. Blog Anthropic co-founder Chris Olah's remarks on Pope Leo XIV's encyclical "Magnifica humanitas" — Anthropic, May 25
  3. YouTube DeepMind's Insane AI Breakthroughs With CEO Demis Hassabis — Two Minute Papers, May 25
  4. YouTube Mythos 1, Opus-4.8, GPT-5.6, Gemini 3.5 Pro (All Leaks Explained): JUNE IS GOING TO BE CRAZY! — AICodeKing, May 25
  5. YouTube The Infrastructure Nightmare Nobody Is Talking About — Nate B Jones, May 25
  6. YouTube The Playbook for a $100M AI Agency — Nate Herk | AI Automation, May 25
  7. YouTube Agentic Evaluations at Scale, For Everybody — Nicholas Kang & Michael Aaron, Google DeepMind — AI Engineer, May 25
  8. YouTube Does GenAI "belong" to data scientists? — Phil Hetzel, Braintrust — AI Engineer, May 25
  9. YouTube Bounded Autonomy: Between Free Will and Determinism — Angus J. McLean, Oliver — AI Engineer, May 25
  10. YouTube 9 Things People Get Wrong With My /grill-* skills — Matt Pocock, May 25
  11. YouTube How to build a 10-cent AI brain — Nate B Jones, May 25
  12. YouTube NVIDIA is Giving Away Free* AI Tokens (*With a Catch) — Better Stack, May 25
  13. YouTube The 11% GPU Utilization That Should Leave You Gasping — Last Week in AI, May 25
  14. YouTube Building with LLMs is like brewing beer | Ivan Zhao, Notion — Sequoia Capital, May 25
  15. YouTube A CS Professor on Why Slow Learning Wins in the AI Era | CU Boulder, Tom Yeh — EO, May 25
  16. YouTube ranking every programming language — Arjay McCandless, May 25
  17. YouTube Getting Started With Google Gemini CLI in Python — Real Python, May 25
  18. YouTube Alice Ryhl: Rust doesn't use exceptions — The Pragmatic Engineer, May 25
  19. YouTube This Specific Date Causes a Runtime Error (Here's Why) — Better Stack, May 25
  20. YouTube Most Common System Design Question — Arjay McCandless, May 25