April 25, 2026
Two days after release, GPT-5.5 is sitting at #1 on Terminal-Bench (~12 percentage points ahead of Claude Opus 4.7), Artificial Analysis, LiveBench, and ARC-AGI2 (85%) — at roughly 2x the price of GPT-5.4 Extra High and with a 922K-token context.[1]AI Search — GPT-5.5 is a total freak. OpenAI's Romain Huet confirmed there will be no separate GPT-5.5-Codex; Codex was folded into the main model line back at GPT-5.4.[2]Simon Willison — Quoting Romain Huet. OpenAI also published a fresh prompting guide whose central recommendation is counterintuitive: don't port your old prompts — start over from a fresh baseline.[3]Simon Willison — GPT-5.5 prompting guide. The big asterisk: an 86% hallucination rate on SimpleQA versus Opus 4.7's 36%.[1]AI Search — GPT-5.5 is a total freak
Romain Huet (OpenAI) clarified for anyone wondering whether OpenAI would ship a coding-specialized GPT-5.5: "Since GPT-5.4, we've unified Codex and the main model into a single system, so there's no separate coding line anymore."[2]Simon Willison — Quoting Romain Huet. He pitches GPT-5.5 as carrying that further "with strong gains in agentic coding, computer use, and any task on a computer."
Simon Willison flags the most surprising recommendation in OpenAI's official prompting guide: "Begin migration with a fresh baseline instead of carrying over every instruction from an older prompt stack."[3]Simon Willison — GPT-5.5 prompting guide
Start minimal, then iteratively tune reasoning effort, verbosity, tool descriptions, and output formats. A second concrete UX rec: "Before any tool calls for a multi-step task, send a short user-visible update that acknowledges the request and states the first step." OpenAI also ships a Codex-app skill that runs $openai-docs migrate this project to gpt-5.5 against your codebase.
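A minimal sketch of that fresh-baseline setup, assuming the Responses API keeps the GPT-5-era reasoning/verbosity knobs — the model ID and parameter values here are placeholders to tune, not confirmed identifiers:

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Fresh baseline per the guide: a deliberately minimal prompt, with
// reasoning effort and verbosity as the first knobs to iterate on.
// "gpt-5.5" and the chosen values are assumptions, not confirmed IDs.
const response = await client.responses.create({
  model: "gpt-5.5",
  reasoning: { effort: "medium" }, // raise only if evals show it pays off
  text: { verbosity: "low" },
  instructions: [
    "You are a coding agent for this repository.",
    // The guide's UX rec, encoded as an instruction:
    "Before any tool calls in a multi-step task, send one short",
    "user-visible update acknowledging the request and naming the first step.",
  ].join(" "),
  input: "Migrate the auth module off the deprecated session API.",
});

console.log(response.output_text);
```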
~01:00 In the Codex app, GPT-5.5 built a fully interactive 3D Earth digital twin with zoom from space to street view in roughly three prompts, and a browser-based ray-tracing simulation with adjustable material sliders in another three.[1]AI Search — GPT-5.5 is a total freak. ~10:07 Two prompts produced a fully functional 3D third-person mecha-vs-aliens shooter on three.js with multiple waves and levels. ~18:24 An agentic prompt scraped three California roofing companies that had email but no website, then built a custom landing page for each with linked CTA buttons — all in about three minutes.
~22:27 GPT-5.5 takes #1 on Terminal-Bench (beating Opus 4.7 by ~12pp), Artificial Analysis (both Extra High and High variants over Opus 4.7 Max), LiveBench, and ARC-AGI2 (85%, the highest score on the leaderboard). It uses fewer tokens than GPT-5.4 while scoring higher. Context window: 922K tokens (~700K words). Pricing is roughly 2x GPT-5.4 Extra High and slightly above Opus 4.7 Max.
~25:27 The headline caveat: GPT-5.5 Extra High hallucinates 86% of the time on SimpleQA, against Opus 4.7's 36% and even-lower scores from open-weight models like GLM 5.1. Medical imaging tests were mixed — 3/4 chest CT lesions identified correctly, but 0/6 brain tumors classified correctly.
"At least to me, this is noticeably better than Opus 4.7. It just handles things more autonomously. It makes fewer mistakes, and it just runs a bit smoother. At least that's the vibe that I got." — AI Search
"If factual accuracy is super important for you, like for example if you're working in medical research or law, then GPT 5.5 might not be the best option for you." — AI Search
GPT Image 2 won 93% of blind pairwise comparisons in Image Arena — a 26-point gap over Google's Nano Banana 2 — by wrapping a thinking pass, live web search, and a self-verification step around the raw image model.[4]Nate B Jones — ChatGPT Images Just Replaced Three People on Your Team. Nate B Jones argues that this collapses the first-draft researcher, copywriter, and layout designer into a single prompt — and that the same machinery now produces convincing forgeries of receipts, Slack screenshots, boarding passes, and pharmacy labels from a free ChatGPT account. Simon Willison's contribution: ChatGPT Images 2.0 spontaneously generated a "WHY ARE YOU LIKE THIS" road sign that was never in the prompt — proof, in his telling, that the model now exercises something like editorial judgment.[5]Simon Willison — WHY ARE YOU LIKE THIS
~00:00 93% of blind pairwise comparisons in Image Arena vs Nano Banana 2 at 67% — image-generation leaders normally trade places by 3–4 points, so 26 points is unprecedented. Inkdrop's Takuya Matsuyama fed the model his app summary, V6 release notes, and blog posts about Japanese aesthetics in a single prompt and got back a complete Hokusai-inspired landing page mock-up — typography, hero illustration, and feature grid — in his actual written voice.[4]Nate B Jones — ChatGPT Images Just Replaced Three People on Your Team
~02:01 Thinking mode: a Pro/reasoning model spends 10–20 seconds planning composition, typography, object placement, and constraint satisfaction before committing pixels. ~03:01 Web search inside the generation loop: a demo rendered a geologically accurate depth chart of the Strait of Hormuz as a Richard Scarry children's book illustration; the knowledge cutoff is December 2025, but the model self-retrieves anything it's uncertain about. ~04:02 Eight coherent frames from a single prompt — Sam Altman demoed an eight-panel manga with consistent characters — replacing the old generate/screenshot/feed-back/stitch workflow. A self-verification pass corrects typos between first and second generation.
~05:02 A single session produced a French fashion magazine cover, a Japanese restaurant menu with hiragana and kanji (vertical flow respected), and a high-density Russian annotation — zero spelling errors. UI specs become Codex rendering targets: PMs describe a settings page, the model renders the mock-up, the coding agent implements against it inside the same environment. Microsoft Foundry demoed a fictional flower-brand subway-car ad campaign from a photo of an empty car in three prompts.
~09:04 With one prompt and a free ChatGPT account, you can now forge a restaurant receipt with a specific date and time, a Slack screenshot with a real user's avatar, a real-flight boarding pass, a pharmacy label with a real drug and dose, a government notice on real letterhead, or a competitor menu with undercut prices. Text renders at 99% accuracy; over 70% of blind testers during the Arena rollout believed the outputs were real photos. Content credentials and watermarking don't survive a screenshot and recrop.
~11:05 Anthropic shipped Claude Design four days earlier; the underlying insight is identical (reasoning joined the visual stack). OpenAI keeps pixels as the primitive; Anthropic skipped pixels entirely and renders editable HTML that Claude Code can implement directly. Decision rule: pick GPT Image 2 for end-state assets (posters, menus, packaging, social posts), pick Claude Design for working prototypes (landing pages, dashboards, interactive mocks).
~13:08 (1) First-draft researcher + copywriter + layout designer now collapse into one prompt — the word-processor moment for design. (2) Image generation is now an agent-callable subroutine: the next consumer is Claude's tool loop or Codex, not a human. (3) The image itself is a compressed reasoning trace; auditing AI visuals is a different discipline because failure modes shifted (an image can be wrong because the source was wrong, not just hallucination).
~17:10 Role-by-role: design leadership reweights toward briefing/brand/QA; founders/solo operators get "what a five-person agency did for you a month ago" for $20/month; trust/risk teams need to red-team their own evidence baselines now.
Willison fed ChatGPT Images 2.0 a prompt for "a horse riding an astronaut riding a pelican on a bicycle" and the model independently added a road sign reading "WHY ARE YOU LIKE THIS" — never in the prompt.[5]Simon Willison — WHY ARE YOU LIKE THIS
"Generation became a reasoning workload, and 93% of blind human judges could feel the difference without anybody explaining to them why." — Nate B Jones
"The work of a five-person agency a month ago is now a $20 subscription and a good brief." — Nate B Jones
"The new ceiling is specification. Your leverage now depends on how precisely you can describe the layout, the typography hierarchy, the text content, the constraints, the reference material, the audience, the format." — Nate B Jones
Anthropic's postmortem (April 23) confirmed Claude Code regressed for about a month due to three harness-level changes — default reasoning was silently downgraded from high to medium, a bug dropped reasoning context after idle sessions, and a verbosity-cutting system prompt change had to be reverted because it hurt code quality.[6]Better Stack — Claude ACTUALLY got dumber. Separately, Anthropic was caught testing the removal of Claude Code from the $20/mo Pro plan entirely (Max-only) before Anthropic's head of growth said it was a 2% test and reverted the page.[7]Better Stack — The Claude Price Hike They Didn't Announce. And on the verbosity question, a community-built "Caveman" prompt skill is making the rounds as the actual fix.[8]Better Stack — Kevin was right about Claude
Anthropic's three-cause postmortem: (1) default reasoning effort was downgraded from high to medium for latency, cutting capability on harder coding tasks; (2) a bug dropped accumulated reasoning context after every message in an idle session, producing forgetful and repetitive behavior; (3) a system-prompt change designed to reduce verbosity was found to hurt code quality and was reverted.[6]Better Stack — Claude ACTUALLY got dumber. The model itself didn't get dumber — the harness did.
"It wasn't actually Claude the model that got dumber, it was the harness, Claude code." — Better Stack
"It's kind of insane to me that they don't test these things before pushing out these changes." — Better Stack
~00:00 Pro subscribers ($20/mo) discovered via screenshots on X and Reddit that Claude Code had been quietly removed from their plan. Anthropic's head of growth later said this was a limited test on 2% of new sign-ups; the pricing page was reverted.[7]Better Stack — The Claude Price Hike They Didn't Announce. Stated rationale: the $20 plan wasn't designed for multi-hour agent workflows or always-on coding sessions. Implied rationale (per Better Stack): compute constraints — Pro users are reportedly hitting caps after just a few prompts at peak hours — and a signal that Claude Code becomes Max-only or that Pro pricing rises.
~00:00 A prompt skill called "Caveman" tells Claude or Codex to drop filler phrases like "you're absolutely right" and reply tersely while preserving technical detail.[8]Better Stack — Kevin was right about Claude. Configurable conciseness levels include an extreme "Wenglall mode" advertised as the most token-efficient language. The pitch is fewer output tokens, faster scanning, and more usable headroom inside a usage cap.
"Why waste time say lot word when few word do trick?" — Caveman skill
Per Nate B Jones, OpenAI describes Codex 5.3 as "the first frontier AI model that was instrumental in creating itself" — earlier Codex builds analyzed training logs, flagged failing tests, and suggested fixes, with the resulting model showing 25% speed gains and 93% fewer wasted tokens.[9]Nate B Jones — AI models writing their own successors. Anthropic has been even more direct: 90% of Claude Code is itself written by Claude Code, with that figure converging toward 100%. Boris Cherny (Anthropic) says he hasn't written code in months — his role is now specification, direction, and judgment.
~00:00 Specific claims: Codex 5.3 is described as the first frontier model "instrumental in creating itself" — earlier Codex iterations analyzed training logs, flagged failing tests, and suggested fixes to training scripts, contributing to a 25% speed improvement and 93% reduction in wasted tokens. Anthropic estimates the entire company will move to entirely AI-generated code around April 2026.
"Codex 5.3 is the first frontier AI model that was instrumental in creating itself, and that's not a metaphor." — Nate B Jones
"90% of the code in Claude Code, including the tool itself, was built by Claude Code, and that number is rapidly converging toward 100%." — Nate B Jones
"Boris Trenne isn't joking when he talks about not writing code in the last few months. He's simply saying his role has shifted to specification, to direction, to judgment." — Nate B Jones
Amjad Masad walks YC through Replit's pivot from dev environments to "vibe coding" and lands the headline thesis: in the company of the future, only two roles remain — builders and salespeople. Sales survives because customers still want to talk to humans they trust; builders persist because every employee becomes a generalist entrepreneur deputizing agents to solve their own problems.[10]Y Combinator — Replit's CEO On The Only Two Jobs Left. He sketches a "post-prompting world" where you tell Replit "every day build me a SaaS company and try to market it and make me some revenue" and flags computer-use models and continual learning as the two capabilities still gating full autonomy.
September 2024: Replit became the first product to abstract code entirely behind a natural-language agent and reframe the audience away from traditional engineers.
Masad explicitly walks away from React/Webpack stacks. "VB6 was better than setting up React and Webpack." The audience is product managers, designers, and entrepreneurs — "AI-native developers."
Concrete examples: a physical therapist building a 3D body-scan app after burning hundreds of thousands on offshore devs; pool-maintenance SaaS; sports clubs running on MS-DOS migrating to Replit-built tools.
Whoop testing 10x more product experiments; revops teams replacing six-figure SaaS spend with internal Replit builds.
"You should be able to tell Replit every day build me a SaaS company and try to market it and see what works and make me some revenue." — Amjad Masad
"Coding turned out to be a bit of a hack or workaround for computer use agents." — Amjad Masad
"I think the company of the future is made of builders and sales people broadly." — Amjad Masad
Sales survives as evangelism and trust-based transformation work — "a lot of other companies will want to talk to someone... they trust other humans." Builders persist because abstraction layers keep climbing — humans were once literal "computers," then operators, then software engineers, now agent-deputizers.
Replit's internal "vibe coding resident team" roams the company hunting problems (support queue prioritization, HR onboarding portals) and spawning agents to solve them. Masad's vision: "almost everyone is a founder. They wake up in the morning and they think how can I make the company more successful?"
"True product market fit is entirely different. It's like an explosive thing." — Amjad Masad
Nufar Gaspar's thesis on the AI Daily Brief: every agentic harness — Cursor, Claude Code, Codex, OpenCode, Windsurf, Antigravity, Hermes — is converging on the same primitives, all reading text files defining who you are, what you know, what you can do, what you remember, and what you can reach. So the tool you pick matters less; the system you build underneath is the real moat. She lays out a seven-layer "Agent OS" framework — identity, context, skills, memory, connections, verification, automations — using a running "Chloe the Chief of Staff" example.[11]AI Daily Brief — How To Build a Personal Agentic Operating System
"Every agentic tool is becoming every agentic tool... the tool you pick matters less and less and what matters much more is the system that you build underneath it." — Nufar Gaspar
The file the tool reads first (soul, AGENTS.md, CLAUDE.md, copilot-instructions). Don't write it from scratch — brain-dump to an AI and let it interview you with ~15 questions. A minimal sketch of such a file closes this section.
"Every time you catch yourself re-explaining something about your situation to AI, that thing should have been in a context file." — Nufar Gaspar
Stakeholders, strategy/priorities, operating principles. Not a 40-page novel.
Every knowledge worker has 20–30 of them — pre-reads, daily brief, voice match, commitment tracker. Ship MVP and patch.
Explicitly ask the tool "explain how your memory system works." Add specialized memory (decision logs, relationship context) only where the leverage justifies it.
Email, calendar, Slack, Jira. Start read-only and only grant write access after weeks of trust — "an agent gossiping in company Slack" is already a real incident pattern.
"Without it, your OS has a shelf life of maybe 8 weeks before everything goes stale. With it, your OS compounds even further and forever." — Nufar Gaspar
"Your first agent is hard... your chief of staff maybe took you a weekend. But the second agent... takes you an afternoon because it inherits everything." — Nufar Gaspar
Historian Ada Palmer hands Dwarkesh actual early-modern artifacts — a hand-stitched pamphlet, a copy of The Gentleman's Magazine, papyrus, parchment — and walks through how cheap, fast, contradictory print created the original information-overload crisis. Her punchline: when newspapers proliferated and contradicted each other, somebody invented the magazine as a weekly fact-checking roundup. The format was born to adjudicate.[12]Dwarkesh Patel — Pamphlets, Newspapers, and the Birth of the Magazine
Naked pages, hand-stitched, printable in two-to-four days, sold cheap around town and to traveling news writers.
"It's cheap. It's ephemeral. You print a thousand of them." — Ada Palmer
"My favorite ever title of a pamphlet was the scandalous tale of a doctor from Padua and how he seduced his maid, murdered his wife, murdered the maid, cut out her heart and ate it, and how he was justly punished by God." — Ada Palmer
Made from rag pulp — laundry-lint color. "Fundamentally laundry lint is what paper is."
"Every week they would publish a roundup of that week's news saying what each newspaper said about it and where they contradicted each other and analyzing who's right and wrong. It was the fact-checking. This is the first magazine." — Ada Palmer
"They wrote around the hole because too valuable to not use that sheet." — Ada Palmer
Cloudflare's Matt Carey argues that naively dumping every API endpoint into MCP tools blows up context windows — Cloudflare's 2.3M-token OpenAPI spec converts to ~1.1M tokens of tool defs, "never going to fly even with the biggest foundational models."[13]AI Engineer — MCP = Mega Context Problem (Matt Carey)
His proposed fix: "Code Mode" — generate a typed TypeScript SDK from OpenAPI, let the model write code against the types, execute inside lightweight V8 isolates (Cloudflare's WorkerD) with programmable guardrails. One tool called code replaces many tool calls. He predicts MCP becomes middleware (an MCP=true flag in Next.js by year-end).
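A compressed sketch of the pattern, using node:vm as a stand-in for the talk's WorkerD isolates — the SDK slice is an assumed shape, not Cloudflare's, and vm is a demo convenience, not a real security boundary:

```typescript
import { createContext, runInContext } from "node:vm";

// One "code" tool replaces thousands of endpoint tools: the model writes
// code against a typed `sdk` generated from the OpenAPI spec, and we run
// it in an isolated context. Production Code Mode spawns a fresh V8
// isolate per run (WorkerD builds one from a string) with switches for
// env visibility and outbound network; node:vm offers neither guarantee.

// Illustrative slice of a generated SDK (assumed shape).
const sdk = {
  zones: {
    async list(): Promise<{ id: string; name: string }[]> {
      return [{ id: "abc123", name: "example.com" }]; // canned demo data
    },
  },
};

async function runCode(source: string, timeoutMs = 1_000): Promise<unknown> {
  const context = createContext({ sdk, result: undefined });
  runInContext(`result = (async () => { ${source} })()`, context, {
    timeout: timeoutMs, // stops synchronous infinite loops only
  });
  return context.result; // the async IIFE's promise, awaited by the caller
}

// What the model might emit: several API "calls" in one round trip.
const answer = await runCode(`
  const zones = await sdk.zones.list();
  return zones.map(z => z.name);
`);
console.log(answer); // ["example.com"]
```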
Splitting an API into many product-specific MCP servers (Cloudflare ended up with 16 covering ~2,600 endpoints) forces users to pick the right server and still leaves coverage gaps.
Tool search (e.g., Claude Code's keyword-matched K-tool loader) burns ~2,100 tokens to surface ~500 actually used.
"Instead of doing tool calls, you can have one tool called code where the model generates the code of your choice and then you run it." — Matt Carey
"Running untrusted code is mega mega scary." — Matt Carey
File/secret exfiltration, infinite loops, crypto miners.
WorkerD spawns dynamic V8 isolates from a string. Toggling node compat hides process.env; flipping a boolean blocks or allows internet access.
"Your APIs have to be ready to take a beating because they have to have good rate limiting. Cuz I can run this in a for loop on multiple sandboxes at once and just hammer your API." — Matt Carey
"By the end of this year, we'll be like natively in every single at least TypeScript big full stack framework... they'll just have a native integration." — Matt Carey
Ido Salomon's thesis: scaling from one agent to dozens does not 100x productivity because the engineer is the bottleneck — managing reckless "employees" is not a skill most engineers have practiced. He demos AgentCraft, an RTS-game-inspired orchestrator that visualizes agents as units on a map of your file system, with hotkey cycling, agent-proposed quests, container-isolated campaigns, cron-driven idea channels, and shared workspaces where teammates' agents appear alongside yours.[14]AI Engineer — AgentCraft (Ido Salomon)
"Spinning them up isn't the problem. It's us. We are the bottleneck in orchestrating all of these agents." — Ido Salomon
"The role of the engineer to actually go and manage dozens of reckless employees is not typically what we do in most companies." — Ido Salomon
Each agent is a physical unit on screen, backed by a real coding agent session (Cursor, Claude Code, Codex, OpenCode). Buildings represent functionality (skills/plugins, integrated terminal, git).
"The map is actually a projection of my file system. Each directory is on the map, each file is a room — so I can track visually what the agent is working on." — Ido Salomon
"Once it's decomposed, I'm not the one doing the babysitting. Now I have the campaign orchestrator and that's his problem." — Ido Salomon
"How much time do I need to spend on the plan if I can just do it 10 times and pick the one that fits?" — Ido Salomon
Kimi K2.6 scales its agent swarm from ~100 sub-agents in K2.5 to 300, with up to 4,000 coordinated steps and a new "preserve thinking mode" to stop memory drift on long tasks; Moonshot reports a 13-hour task with a 185% throughput gain.[15]Better Stack — Kimi K2.6 vs Claude Code. The model also adds MoonVIT, an open-source native vision encoder, and the whole thing is on Hugging Face. The reviewer's "$39 plan" pitch: a 40-minute web-agency demo (find 20 Toronto notaries without sites, generate landing pages and outreach emails) would have torched Claude Code usage caps but ran fine on Kimi's Allegretto plan.
~100 sub-agents in K2.5 → 300 specialized agents in K2.6, up to 4,000 coordinated steps. Preserve thinking mode keeps reasoning consistent across multi-turn tasks.
"In K2.5, we were looking at about 100 sub-agents, but K2.6 scales this horizontally to 300 specialized agents that can execute up to 4,000 coordinated steps." — Better Stack
Native vision encoder for UI/UX reasoning. Generates fully functional interactive prototypes (GSAP animations, scroll-triggered effects) from a single visual reference. The model and encoder are both on Hugging Face.
Five sub-agents found notaries without websites (Google Maps + Canadian Yellow Pages), generated landing pages, and produced outreach emails plus a market-size report. A 17-minute follow-up added unique CSS animations and AI-generated headers per page. Pages still shared boilerplate structure under the visual differences.
"I have a feeling that I would certainly have burned through all of my usage limits by now if I used Claude to do the same thing." — Better Stack
Full-stack scraper across Amazon, Newegg, Best Buy via Axios + Cheerio. Bare-bones Node + Express + vanilla JS, no React. New token counter in K2.6's CLI.
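The scraper pattern, sketched — URL and selectors are placeholders (real retailer markup differs per site and changes often), not the video's code:

```typescript
import axios from "axios";
import * as cheerio from "cheerio";

// Fetch with Axios, parse with Cheerio, extract with CSS selectors.
async function scrapePrices(url: string): Promise<{ title: string; price: string }[]> {
  const { data: html } = await axios.get<string>(url, {
    headers: { "User-Agent": "Mozilla/5.0" }, // some sites block default agents
  });
  const $ = cheerio.load(html);
  const items: { title: string; price: string }[] = [];
  $(".product-card").each((_, el) => {
    items.push({
      title: $(el).find(".product-title").text().trim(),
      price: $(el).find(".price").text().trim(),
    });
  });
  return items;
}

console.log(await scrapePrices("https://example.com/search?q=gpu"));
```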
DeepSeek v4 ships in two open-weights variants: V4 Pro (1.6T params, 49B active) and V4 Flash (284B / 13B), both with native 1M-token context. The architecture uses a hybrid attention scheme — Compressed Sparse Attention plus Heavy Compressed Attention — that runs on 27% of the FLOPs and 10% of the KV cache of V3.2 at 1M tokens. V4 Pro Max benchmarks against Opus 4.6 and GPT-5.4. Pricing: V4 Pro at $1.74/$3.48 per million in/out tokens; V4 Flash at $0.14/$0.28.[16]Developers Digest — DeepSeek v4 in 4 Minutes
V4 Pro (1.6T / 49B active) and V4 Flash (284B / 13B) — both open weights on Hugging Face with 1M token native context. V4 Pro Max benchmarks against Opus 4.6 and GPT-5.4.
Two interleaved mechanisms: Compressed Sparse Attention (CSA) — 4-token collapse plus sparse top-K — and Heavy Compressed Attention (HCA) — 128-token collapse, no sparsity. The combination is the source of the memory savings.
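A back-of-envelope on the KV-cache claim, assuming the two mechanisms are simply interleaved across layers — the layer split below is our illustration, not a disclosed figure:

```latex
% Per-layer KV entries at a 1M-token context, relative to full attention:
% CSA collapses 4 tokens per entry, HCA collapses 128.
\[
  \text{CSA: } \frac{10^6}{4} = 2.5\times 10^5 \;(25\%), \qquad
  \text{HCA: } \frac{10^6}{128} \approx 7.8\times 10^3 \;(\approx 0.8\%)
\]
% With a fraction x of CSA layers and the rest HCA:
\[
  0.25\,x + 0.008\,(1-x) = 0.10 \;\Rightarrow\; x \approx 0.38
\]
% Roughly 40% CSA / 60% HCA layers reproduces the reported 10%-of-V3.2
% cache figure; the 27% FLOPs number would additionally reflect CSA's
% sparse top-K selection, which cuts compute but not storage.
```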
V4 Pro: $1.74/$3.48 per million in/out tokens. V4 Flash: $0.14/$0.28. Context caching included. Open weights on Hugging Face.
1M context enables long-horizon agent loops that were previously cost-prohibitive; Flash pricing makes the math viable.
Two Kilo blog posts dropped the same day: RooCode is shutting down on May 15 (VS Code extension, cloud, router) — credited with pioneering agentic coding modes (architect/code/debug) but pivoting away from IDEs toward remote cloud agents. Kilo positions itself as the natural migration target. Separately, SpaceX reportedly holds an option to acquire Cursor for $60B (or pay $10B for partnership work), and the precedent (Anthropic cutting Claude access to Windsurf when OpenAI acquisition rumors surfaced) puts model flexibility at risk.[17]AICodeKing — RIP Roo & Cursor
Shutting down May 15 — VS Code extension, cloud, router. Credited with pioneering architect/code/debug agentic modes; pivoting away from IDEs toward remote cloud agents.
Rebuilt VS Code extension on open code server (same core as its CLI/cloud). Features: parallel execution, sub-agent delegation, agent manager, inline diff review with line-level comments.
$60B option, or $10B for partnership work. Concern: coding tools are now "distribution layers for models," and consolidation with xAI threatens Cursor's model flexibility. Windsurf precedent — Anthropic cut Claude access when OpenAI acquisition rumors surfaced.
Hot take: best model shifts constantly (Claude vs GPT-5 Codex vs Gemini vs Grok vs Qwen depending on task and cost). Model-agnostic tools (Kilo, Cline, OpenCode, Aider) are the safest bet. Kilo's positioning is strong but still needs to prove it won't lock in later.
Better Stack tested Zilliz's Claude Context MCP plugin as a replacement for grep/glob in coding agents. It uses Tree-sitter (AST parsing), a Merkle DAG for incremental re-indexing, and hybrid vector + BM25 search across 9 languages via MCP. Claims 40% context reduction. Best fit: 20–30K-line codebases (sub-minute indexing for cents); large codebases like VS Code (1.5M lines) take ~50 minutes and $1.06 to index.[18]Better Stack — I Stopped Using Grep and My Agent Got 10x Faster
Tree-sitter AST parsing + Merkle DAG for incremental re-index + hybrid vector/BM25 search across 9 languages via MCP. Works with any agent harness. 40% context reduction claimed.
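The incremental-re-index idea, sketched — our reconstruction of the general Merkle technique, not Zilliz's implementation:

```typescript
import { createHash } from "node:crypto";
import { readFileSync, readdirSync, statSync } from "node:fs";
import { join } from "node:path";

// Hash every file; roll child hashes up into directory hashes. On the
// next run, identical hashes prune whole unchanged subtrees in O(1), so
// only changed files get re-chunked and re-embedded.
type HashTree = { hash: string; children?: Record<string, HashTree> };

function hashTree(path: string): HashTree {
  const h = createHash("sha256");
  if (statSync(path).isDirectory()) {
    const children: Record<string, HashTree> = {};
    for (const name of readdirSync(path).sort()) {
      const child = hashTree(join(path, name));
      children[name] = child;
      h.update(name).update(child.hash); // directory hash covers children
    }
    return { hash: h.digest("hex"), children };
  }
  h.update(readFileSync(path)); // leaf hash covers file contents
  return { hash: h.digest("hex") };
}

function changedPaths(prev: HashTree | undefined, next: HashTree, path = "."): string[] {
  if (prev?.hash === next.hash) return []; // unchanged subtree: skip it
  if (!next.children) return [path]; // changed file: re-embed this one
  return Object.entries(next.children).flatMap(([name, child]) =>
    changedPaths(prev?.children?.[name], child, join(path, name)),
  );
}

const current = hashTree("src"); // persist this; diff against it next run
console.log(current.hash);
```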
Requires Zilliz Cloud (paid serverless recommended; free tier timed out), OpenAI key for embeddings, Node v20–23. Indexing 1.5M-line VS Code repo: ~50 min, $1.06 in embeddings. 23K-line repo: <1 min, $0.01.
Best fit is 20–30K-line codebases where indexing is fast and quality gains are clear. Very large codebases impractical due to indexing time.
Nate B Jones contrasts the "55% lab speedup for GitHub Copilot" study with a recurring production reality: experienced developers take ~19% longer with AI tooling. His framing: this is a J-curve — the productivity dip is workflow-adaptation lag, not evidence that AI is hype. Specific production problems he cites: larger pull requests, higher review costs, more security vulnerabilities.[19]Nate B Jones — Experienced developers took 19% longer with AI
The sharpest line, from a senior engineer Nate quotes: "Copilot makes writing code cheaper, but owning it more expensive." Recurring sentiment across the industry, per his telling.
Apple is expanding enforcement of App Store guideline 4.2.6 against vibe-coded apps from Bolt, Lovable, and Replit Agent. Even though each AI-generated app produces a technically unique codebase, they share "hallucinated DNA" — identical logic errors, unoptimized assets, identical UI patterns — that triggers Apple's spam filters. The existing rule against "template-based functional clones" is being applied to this new category.[20]Better Stack — Why Apple is Cracking Down On Vibe-Coding Apps. Practical advice: vibe-coded apps need human-led engineering or unique architectural value to survive review.
GitHub had a data integrity incident where commits were generated from the wrong base state, causing previously merged changes to be randomly reverted. 2,804 pull requests affected. Remediation instructions sent to impacted customers. The optics are bad because the incident coincided with a Verge article reporting GitHub employee concerns about reliability and leadership.[21]Better Stack — GitHub just BROKE
Awesome Agents — building on a peer-reviewed Star Scout study from CMU, NC State, and Socket — identifies ~6M fake stars across 18,600 repos run by ~301,000 accounts, with 16%+ of all repos with 50+ stars involved in fake-star campaigns by July 2024. Stars cost as little as 6¢ at the low end; aged premium accounts run 80–90¢. Pre-built GitHub profiles with 5-year commit histories sell for ~$5,000 on Telegram. Social Plug claims 3.1M stars delivered to 53,000 clients. WeChat groups are making $3.4–4.4M/year.[22]Theo - t3.gg — Making millions of dollars on fake GitHub stars. Theo's strong pushback on the "stars-to-VC" examples (Lovable, Browser Use, Pinkalan) and his hot take that GitHub is "a place that holds your source code, kind of" round it out.
Star Scout analyzed 20TB of GitHub metadata, 6.7B events, 326M stars from 2019–2024. ~6M fake stars across 18,600 repos by ~301,000 accounts. By July 2024, 16%+ of all repos with 50+ stars were involved. GitHub itself validated detection by deleting 90% of flagged repos and 57% of flagged accounts as of Jan 2025. AI/LLM repos became the largest non-malicious category (177,000 fake stars), ahead of blockchain. 78 fake-star repos made GitHub trending.
Pricing tiers: 3–10¢ disposable, 20–50¢ mid-range (1–2 weeks), 80–90¢ premium aged. Dagster's 2023 research bought €85 per 100 stars from registered German company GitHub24 (all 100 persisted). Pre-built profiles with Arctic Code Vault badges go for ~$5,000 on Telegram. Social Plug claims 3.1M stars to 53,000 clients with a formal API.
Flask baseline (71K stars): median account age 4,481 days, 5.3% zero-repo, 10% zero-follower, fork-to-star ratio ~0.20, watcher-to-star ~0.03. Manipulated examples: Union Labs 47% suspected fake, FreedomDAO 81% zero-followers (watcher-to-star of 0.001), OpenAFM 66% suspicious accounts and 36% ghost. Heuristic: fork-to-star below 0.05 with 10K+ stars warrants scrutiny; organic watcher-to-star is 0.005–0.03.
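Those heuristics translate directly to the public GitHub REST API — the field names below are the real v3 fields; the thresholds are the video's rules of thumb, not an official detector:

```typescript
interface RepoStats {
  stargazers_count: number;
  forks_count: number;
  subscribers_count: number; // "watchers" in the meaningful sense
}

async function starSmellTest(owner: string, repo: string): Promise<void> {
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}`, {
    headers: { Accept: "application/vnd.github+json" },
  });
  const r = (await res.json()) as RepoStats;

  const forkToStar = r.forks_count / r.stargazers_count;
  const watcherToStar = r.subscribers_count / r.stargazers_count;
  console.log({ stars: r.stargazers_count, forkToStar, watcherToStar });

  if (r.stargazers_count >= 10_000 && forkToStar < 0.05)
    console.log("fork-to-star under 0.05 at 10K+ stars: warrants scrutiny");
  if (watcherToStar < 0.005)
    console.log("watcher-to-star below the organic 0.005-0.03 band");
}

await starSmellTest("pallets", "flask"); // the video's organic baseline
```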
"You can fake a star count, but you can't fake a bug fix that saves someone's weekend." — Theo
Redpoint's Jordan Segal: median GitHub stars at seed = 2,850; at Series A = 4,980. Buying $85–285 in budget stars hits the seed median; $1K–4.5K hits Series A. Returns: 3,500x–117,000x on a $1M–10M round. Runa Capital's ROSS index, GitHub Fund + M12 ($10M/yr), and an Organization Science paper (15pp more likely to raise if active on GitHub) all reinforce the loop.
Theo's pushback: Lovable raised on $400M/yr revenue. Browser Use (50K stars in 3 months, YC W25, $17M seed) — Theo invested, confirms it raised on demand and the agent-browser thesis, not stars. Pinkalan got into YC, $4.7M seed; Theo passed.
"Lovable did not raise based on their stars on GitHub. They raised based on their fucking unbelievable revenue." — Theo
Svelte's NPM downloads jumped from ~370K to 28M (clear manipulation). Andy Richardson demoed pushing his package to ~1M downloads/week using a single Lambda on the free tier. Aqua Security found 1,283 VS Code extensions with malicious deps totaling 229M installs. NBC News + Clemson identified a network of 686 X accounts posting 130,000+ LLM-generated replies (with the uncensored "dolphin" model leaking through artifacts) — promoting Blackbox AI / Claudex. Theo turned down a seven-digit Higgsfield sponsorship after they purged paying users' accounts and got banned from Twitter for ToS violations.
FTC Consumer Review Rule (effective Oct 21, 2024): up to $53,000 per violation for selling/buying fake social-influence indicators. SEC precedent: Headspin CEO charged with wire fraud (max 20 years) and securities fraud for inflating metrics to scam $80M from investors. Theo offers to be FTC expert witness against fake-viewership YouTubers (e.g., 4M views / 36 comments examples).
"GitHub doesn't know how to run a platform. They know how to run a place that holds your source code kind of." — Theo
CMU researchers recommend a network-centrality-weighted popularity metric. Jono Bacon (StateShift) recommends package downloads, issue quality, contributor retention, community discussion depth, usage telemetry. Healthy fork-to-star ratio: 100–200 forks per 1,000 stars.
"Star economy is a $50 problem with a $50 million consequence." — Theo
Two Minute Papers covers NVIDIA "Sonic," a multimodal teleoperation controller for humanoid robots. It takes video of human motion, voice commands, or music as input and translates them into joint/motor commands via a universal-token architecture (motion generator → human encoder → quantizer → decoder, with a "root trajectory spring model" damping rapid motions). Trained on 100M frames of human motion with no manual action labels, on 128 GPUs over 3 days. ~42M parameters — runs on a smartphone. Open and free.[23]Two Minute Papers — NVIDIA's New AI Broke My Brain. Demos: walking, crawling, kung fu, expressive gaits (happy / stealthy / injured), lawn mowing via voice, dancing to music. Led by Prof. Zhu and Jim Fan (NVIDIA humanoid robots lab).
Nate Herk's pattern for end-to-end browser automation: skip Chrome DevTools MCP (it floods context with tool definitions) and have Claude Code drive Playwright CLI directly. He demos six concrete workflows, among them automated QA loops that find and self-patch bugs, web scraping that auto-switches search engines after detecting bot-blocking, persistent-profile authenticated sessions, and a daily-scheduled community bot. The recommended end state: iterate a Playwright script to reliability, then wrap it as a named Claude Code skill.[24]Nate Herk — Claude Code + Playwright Automates Literally Anything
Token efficiency is the deciding factor; MCP floods context with tool descriptions.
Claude Code built a 12-question form, ran Playwright in headed mode, found 3 bugs (enter-key navigation, missing review page, stale overlay), self-patched, and re-ran to green.
Auto-switched from Google to DuckDuckGo after detecting bot-blocking; collected 5 phone numbers across dental office sites.
Tested on Skool; 4–5 iterations to reliably like posts (gray vs. yellow icon distinction, newest-sort filter).
Runs daily on a schedule via Claude Code desktop app: AI news roundups, wins engagement, notification replies, unprompted birthday post; self-extended by writing a new poll-voting script when it hit a capability gap.
Iterate a Playwright script to reliability, then wrap it as a named Claude Code skill for repeatable invocation.
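The end state, sketched — a persistent-profile Playwright script of the kind you would wrap as a skill; the profile path and target URL are placeholders:

```typescript
import { chromium } from "playwright";

// Persistent profile: an earlier manual login survives restarts, so
// authenticated runs skip the credential flow entirely.
const context = await chromium.launchPersistentContext(".pw-profile", {
  headless: false, // headed mode, as in the QA demo
});
const page = context.pages()[0] ?? (await context.newPage());

await page.goto("https://example.com/dashboard");
await page.waitForLoadState("networkidle");
console.log(await page.title()); // assertion point for the QA loop

await context.close();
```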
LearnThatStack walks through the four promises HTTP makes (stateless, short-lived, client-initiated, infrastructure-friendly) and how WebSockets break each one to enable persistent bidirectional channels — including the SHA-1 challenge-response (not for security; to prove the server is a real WebSocket endpoint), thread-pool collapse at 10K connections, sticky sessions + Redis pub/sub once stateless dies, NAT eviction as short as 30s on cellular, and the universal exponential-backoff + jitter + reset-on-success pattern for thundering herds on deploy.[25]LearnThatStack — A WebSocket Is an HTTP Request That Stops Being HTTP. Closing thesis: protocol ossification — WebSockets, HTTP/2, and QUIC all had to smuggle through existing infrastructure rather than negotiate on their own terms.
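The reconnect pattern the video calls universal, sketched — caps and URL are placeholder choices, not from the talk:

```typescript
// Exponential backoff with full jitter, reset-on-success.
function connectWithBackoff(url: string, baseMs = 500, maxMs = 30_000): void {
  let attempt = 0;

  const connect = () => {
    const ws = new WebSocket(url);

    ws.onopen = () => {
      attempt = 0; // reset-on-success: the next failure starts cheap again
    };

    ws.onclose = () => {
      // Full jitter spreads reconnects so a deploy that drops every
      // client doesn't produce a thundering herd against one server.
      const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
      attempt++;
      setTimeout(connect, Math.random() * ceiling);
    };
  };

  connect();
}

connectWithBackoff("wss://example.com/socket");
```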
MDN rebuilt its front-end. Out: Yari (React SPA, ejected CRA, heavy Webpack, dangerouslySetInnerHTML). In: Lit-based web components with custom elements embedded directly in content, plus custom server components for per-page CSS/JS delivery so unused JS never ships. The main nav dropdown runs on CSS alone, progressively enhanced with JS.[26]Better Stack — MDN's New Stack Is Nuts
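A sketch of the pattern — an illustrative Lit element of the kind that can sit directly in content, not MDN's actual component; the tag name and behavior are assumptions:

```typescript
import { LitElement, css, html } from "lit";
import { customElement } from "lit/decorators.js";

// Server-rendered content stays in the light DOM (the <slot>); the
// element only layers on the enhancement, here a copy button.
@customElement("code-copy")
export class CodeCopy extends LitElement {
  static styles = css`
    button { font: inherit; cursor: pointer; }
  `;

  render() {
    return html`
      <slot></slot>
      <button @click=${this.copy}>Copy</button>
    `;
  }

  private copy() {
    navigator.clipboard.writeText((this.textContent ?? "").trim());
  }
}
```

Usage: <code-copy><pre>npm install lit</pre></code-copy> — with JS disabled the content still renders; the button is pure progressive enhancement.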
Honker is a Rust-based loadable SQLite extension that ports Postgres's NOTIFY/LISTEN pub/sub mechanism to SQLite. The point: durable pub/sub and task queues live inside the existing DB file, transactions atomically span business logic and queue tasks, no polling, single-digit-millisecond latency.[27]Github Awesome — Honker: a Rust SQLite extension
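Honker's own API isn't shown in the post, so this sketch demonstrates the Postgres mechanism it ports, via node-postgres, assuming a task_queue table exists — including the property that makes the queue atomic: NOTIFY inside a transaction fires only on COMMIT:

```typescript
import pg from "pg";

const producer = new pg.Client();
const consumer = new pg.Client();
await Promise.all([producer.connect(), consumer.connect()]);

// Consumer: no polling — the server pushes notifications.
await consumer.query("LISTEN tasks");
consumer.on("notification", (msg) => {
  console.log("task ready:", msg.payload); // e.g. '{"order_id":42}'
});

// Producer: the business write and the queue wake-up commit or roll
// back together, which is the atomicity Honker brings to SQLite.
await producer.query("BEGIN");
await producer.query(
  "INSERT INTO task_queue (kind, payload) VALUES ($1, $2)",
  ["send_receipt", JSON.stringify({ order_id: 42 })],
);
await producer.query(`NOTIFY tasks, '{"order_id":42}'`);
await producer.query("COMMIT"); // listeners hear about it only now
```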