April 2, 2026
Google shipped four Gemma 4 models — a 31B dense, a 26B/3.8B-active MoE, and two edge models (E2B, E4B) — under an actual Apache 2.0 license, no custom clauses.[1] Google — "Gemma 4: Byte for byte, the most capable open models." The architecture is ported down from Gemini 3 research: native vision + audio, built-in thinking/reasoning, function calling, and context windows up to 256K on the workstation models. Google claims the 31B ranks #3 on the Arena leaderboard and "outcompetes models 20x its size"; Sam Witteveen's hands-on testing confirms real ASR and English→Japanese translation working on the E2B.[2] Sam Witteveen — "Gemma 4 Has Landed!"
Witteveen opens by flagging what actually changed: "Gemma 4 ships under an Apache 2 license, not a custom license with weird restrictions" (~00:30). Gemma 3 was capable but had enough license restrictions that many teams defaulted to Llama or Qwen; this move strips that friction. Notable backdrop: some Chinese open-model providers are simultaneously pulling back on openness, making Google's timing look strategic.
The audio encoder on the edge models shrank by more than half vs Gemma 3n: from 681M to 305M parameters, and from 390MB to 87MB on disk (~09:45). Frame duration moved from 160ms to 40ms, which should give noticeably more responsive transcription. The vision encoder on the edge models shrank from ~300-350M to 150M parameters, with native aspect-ratio processing — Witteveen reads this as meaningful for OCR and document understanding. "My guess is that this is the Gemma team realizing that not everything carries over necessarily from research to production."
Witteveen runs the E2B on a single T4 (~11:40): live ASR and English→Japanese translation both work, and thinking toggles on with enable_thinking=True in the chat template — same API, same weights. "Would I necessarily use this instead of an ASR model? Probably not. But if you plan to use these in a chain — ASR model then going into an LLM — you can certainly do that here."
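A minimal sketch of what that toggle looks like in practice, assuming Gemma 4 follows the chat-template convention other open models use in Hugging Face transformers; the model id and the enable_thinking kwarg here are assumptions based on the video, not confirmed API:

```python
# Hypothetical sketch: toggling Gemma 4 E2B "thinking" via the chat template.
# Model id and the enable_thinking kwarg are assumptions from the video;
# extra kwargs to apply_chat_template are forwarded to the template.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-e2b-it"  # assumed identifier
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user",
             "content": "Translate to Japanese: The cat sat on the mat."}]

# Same weights, same API: the template decides whether the model emits a
# reasoning trace before its answer.
inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # flip to False for direct answers
    return_tensors="pt",
).to(model.device)

out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```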
Available on Hugging Face, Google AI Studio, Kaggle, and Ollama, plus vLLM / llama.cpp / LiteRT-LM / NVIDIA NIM integrations. Cloud Run now supports a G4 GPU instance (NVIDIA RTX Pro 6000, 96GB VRAM), which lets you host the 31B dense model serverlessly with scale-to-zero — a legitimately new deployment option for open models of this size.
Google introduced two new inference tiers for the Gemini API: Flex at "half the price of Standard" with higher latency and lower reliability, and Priority for real-time workloads that need capacity guarantees even under peak load.[3] Google — "New ways to balance cost and reliability in the Gemini API." Both run through the same synchronous endpoint — no separate batch pipeline — and the API response tells you which tier actually served the request (Priority gracefully degrades to Standard under capacity pressure).
This is Google productizing the same split AWS and others have used for years in other domains: pay half to run on spare capacity, pay premium for guaranteed seats. Flex is pitched for background CRM updates, large-scale research simulations, and agent workflows that do a lot of background "browsing" or "thinking." Priority is pitched for live customer support, content moderation, and anything where tail latency matters.
Agent scaffolds typically burn tokens on background reasoning and search loops that don't need to be real-time. Routing those to Flex while keeping user-facing turns on Priority or Standard cuts effective spend on that background traffic in half, with no architectural change. And because the response surfaces the tier that actually served the request, cost accounting can attribute spend correctly after the fact.
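A sketch of that routing split. Tier names follow the post, but how a tier is requested and reported in the SDK is an assumption, so this stays at the bookkeeping level:

```python
# Tier routing for agent workloads. "flex"/"standard"/"priority" follow the
# post's naming; the request parameter and response field that carry the tier
# are assumptions, so this sketch shows only routing and attribution logic.

BACKGROUND_KINDS = {"search_loop", "reflection", "crm_update", "simulation"}

def pick_tier(kind: str, user_facing: bool) -> str:
    """Route background agent steps to Flex, interactive turns to Priority."""
    if user_facing:
        return "priority"  # tail latency matters: live support, moderation
    if kind in BACKGROUND_KINDS:
        return "flex"      # half price, tolerant of latency and retries
    return "standard"

def record_spend(ledger: dict, served_tier: str, tokens: int) -> None:
    """Attribute cost to the tier that actually served the request.

    Priority can degrade to Standard under capacity pressure, so the
    requested tier and the served tier (reported in the response) can differ.
    """
    ledger[served_tier] = ledger.get(served_tier, 0) + tokens
```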
Google's one-line pitch for Priority: "Highest criticality, leading to higher reliability, even during peak load."
Replit shipped Agent 4, reframing its product around four pillars: an infinite design canvas, parallel agents, multi-output (website / mobile / slides / animation), and team collaboration.[4] Developers Digest — "Replit Agent 4: Design-to-Full App." The product pitch has shifted: Agent 3 was about letting the AI run longer autonomously; Agent 4 is about running more things in parallel while you stay in the loop. A new Economy Mode is "up to 3x more cost-effective with nearly the same performance" and ships on by default.
The walk-through (~01:30) shows the canvas generating a fitness dashboard from a prompt, then accepting a brand-styling prompt pasted from an existing website — the canvas re-skins the dashboard to match. Layout nudges ("move these panels higher") work in-canvas, and side-by-side variants are a first-class feature. The progression is: design → web app → mobile → slides → social-media animation, all within one project.
Once you tell it to build, Agent 4 scaffolds backend first (DB schema, OpenAPI, endpoints, seed data), then frontend, with explicit in-scope / out-of-scope sections in the plan (~06:20). The self-test loop runs migrations, validates the API, type-checks, and retries on failure — which is why Replit runs "take a little bit longer" than competitors but with higher confidence that something actually works end-to-end. Automatic checkpoints let you roll back any step.
Push notifications from the Replit mobile app alert you when an agent finishes — a real workflow change for long jobs. Multi-stakeholder collaboration (designers, PMs, engineers) happens inside the same project, and the platform now supports WordPress-style element-level deterministic edits alongside the AI edits. Subdomain publishing on replit.app is one-click; custom domains and access controls are available.
The reviewer's framing (~12:00): personal software as a product entry point — build it for yourself, iterate, productize if useful. Replit's bet is that parallel agents + multi-output + one-click deploy compresses the design-to-SaaS distance enough that this becomes realistic for non-engineers.
Former React core member Cheng Lu (ex-Midjourney) released pretext, a pure-TypeScript library that measures text dimensions without triggering browser layout reflows.[5] Fireship — "He just crawled through hell to fix the browser." Width comes from the Canvas API (which lives outside the DOM); height comes from a custom line-break algorithm the author built by looping AI agents against real browser behavior across languages for weeks. The end result: virtualized lists, masonry layouts, and other text-heavy UIs stop paying the reflow tax.
Measuring text height normally forces the browser to recalculate the position and geometry of every element on the page (~00:55). For 10,000-message chat lists or infinite-scroll feeds, this is the single biggest performance landmine — you either render everything (slow), guess heights (wrong), or skip virtualization.
Width: Canvas API's text metrics give you pixel widths for any string in any font without touching the DOM. Height: no equivalent existed, so Cheng Lu built one — a line-break algorithm that handles every major browser's quirks across every language (~02:10). His reported method: have AI agents write the break logic, test against real browsers, compare, iterate for weeks until the algorithm matched reality.
Fireship's summary of the method: "He summoned a few clankers into a recursive hellscape and had them do the dirty work... until the clankers were begging for the sweet release of death."
The API is two calls: prepare segments the text and caches per-segment widths; layout returns total height and line count. Fireship's demo builds an ASCII-art video player that maps each cell of a character grid to one pixel of a live video feed, using pretext to know exactly where each character belongs. The whole thing runs without a single reflow.
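Pretext itself is TypeScript; here is the two-call shape as a Python sketch with hypothetical names, where prepare caches widths (standing in for canvas measureText) and layout does a greedy wrap. Real line breaking also has to handle CJK, hyphenation, and per-browser quirks, which is the part Cheng Lu iterated on:

```python
# Hypothetical sketch of the prepare/layout split described above; names,
# signatures, and the fixed-width "measurer" are illustrative, not pretext's
# real API. The point: once widths are cached, height is pure arithmetic and
# never asks the browser for layout.

def prepare(text: str, char_width: float = 8.0) -> list[tuple[str, float]]:
    """Split into segments and cache each segment's pixel width."""
    return [(w, char_width * len(w)) for w in text.split()]

def layout(segments, max_width: float, space_w: float, line_height: float):
    """Greedy line-break: returns (line_count, total_height_px)."""
    lines, cur = 1, 0.0
    for _, width in segments:
        needed = width if cur == 0 else cur + space_w + width
        if needed <= max_width or cur == 0:
            cur = needed
        else:              # segment doesn't fit: start a new line
            lines += 1
            cur = width
    return lines, lines * line_height

segs = prepare("the quick brown fox jumps over the lazy dog")
print(layout(segs, max_width=120, space_w=8, line_height=20))  # (3, 60)
```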
Pretext is proof that the browser doesn't have to own text measurement anymore.
TurboQuant is a new lossy KV-cache quantization method that moved Micron, Western Digital, and SanDisk stocks down ~7% when the paper landed — the market read it as a real reduction in inference-memory demand.[6] Caleb Writes Code — "TurboQuant Explained." The core move: apply a random rotation matrix to inputs so they converge to a Gaussian distribution via the central limit theorem, then quantize against a precomputed codebook optimized for that exact Gaussian. Compared to pruning methods like SnapKV and PyramidKV, which drop cache entries outright, TurboQuant keeps every attention entry.
Most local-LLM quantization discourse focuses on model weights. The KV cache — which scales with context length, not parameter count — is often the same size as the weights and has been largely untouched (~01:55). Quantizing it directly translates into either longer context per GPU or more concurrent users per GPU.
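A back-of-envelope with illustrative dimensions (a GQA 70B-class config, not numbers from the paper) shows why:

```python
# Why KV-cache quantization matters: illustrative dimensions for a
# Llama-70B-class model with grouped-query attention.
layers, kv_heads, head_dim = 80, 8, 128
seq_len, batch = 128_000, 1

def kv_bytes(bits: int) -> int:
    # 2x for K and V, per layer, per head, per token
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bits // 8

print(f"fp16 KV cache:  {kv_bytes(16) / 1e9:.1f} GB")  # ~41.9 GB
print(f"4-bit KV cache: {kv_bytes(4) / 1e9:.1f} GB")   # ~10.5 GB
```

At 128K context the fp16 cache alone rivals a whole quantized copy of the weights, and it grows linearly with both context and concurrent users.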
After rotation, each output coordinate is a weighted sum of all d input entries, with weights on the order of 1/√d and variance 1/d, so by the central limit theorem the coordinates converge to a Gaussian. This happens regardless of what the input looked like (HTML, legal docs, code — all converge). "What used to be unknown, because we don't really know what kind of input is contained in the context window, whether it's all A's or random characters or HTML code, we can essentially converge them into a Gaussian distribution with a high degree of certainty."
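A toy numpy demonstration of the rotation trick: not TurboQuant's actual rotation construction or codebook, just the statistical effect it relies on.

```python
# Toy demo: a random orthogonal rotation turns a highly non-Gaussian vector
# into one with near-Gaussian coordinates, which a codebook precomputed for
# N(0, 1) then quantizes well. Not TurboQuant's implementation.
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Adversarially non-Gaussian input: a spike plus a constant run ("all A's").
x = np.zeros(d)
x[0] = 50.0
x[1:100] = 3.0
x /= np.linalg.norm(x) / np.sqrt(d)  # scale so rotated coords have unit RMS

# Random rotation: Q from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
y = Q @ x  # each coordinate is a sum of d weighted terms -> CLT kicks in

# 2-bit codebook: the standard Lloyd-Max output levels for a unit Gaussian.
codebook = np.array([-1.510, -0.4528, 0.4528, 1.510])
quantized = codebook[np.abs(y[:, None] - codebook).argmin(axis=1)]

kurt = lambda v: float(((v - v.mean())**4).mean() / v.var()**2)
print("kurtosis before:", kurt(x))   # huge: dominated by the spike
print("kurtosis after :", kurt(y))   # ~3, i.e. Gaussian-like
print("quantization MSE:", float(((y - quantized)**2).mean()))
```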
Caleb's back-of-envelope (~10:55): "We could be needing two to three times less graphics cards in inference." That's the Micron/WDC/SanDisk sell-off in one sentence — HBM and flash demand is indexed to KV-cache size. A viable lossy KV quantizer that preserves attention quality could reshape inference hardware budgets for the rest of 2026.
Swyx interviews Moonlake co-founder Fan-yun Sun and Stanford NLP legend Chris Manning on their bet that world models need structure, not just scale. Their thesis: pixel-level diffusion models like Sora 2 have no spatial intelligence and no notion of causal consequence — they produce lovely videos with no understanding of what happens if you take an action.[7] Latent Space — "Moonlake: Interactive, Multimodal World Models." Moonlake splits the job across two models: a multimodal reasoning model that handles persistency, logic, and causality, and a diffusion model called Revery that handles photorealistic rendering on top — explicitly positioned to replace rasterizers and DLSS.
Sun pushes back on the framing (~14:30): the bitter-lesson-maximal approach would be next-byte prediction across all modalities, but the compute requirement is absurd. The question is just "what is the right abstraction level today?" Moonlake's answer: keep physics, state, and long-term consistency in a symbolic/reasoning layer; keep pixels in a separate fidelity layer you can swap out.
Chris Manning's sharpest point (~07:20): "You only actually have a world model if you can predict given some action is taken what is going to change in the world because of it." Observational video data (YouTube dumps) lacks actions entirely, which is why simulation-sourced action-conditioned data is such a scarce resource. Over minutes-long time horizons, you need abstracted semantic state — pixels alone blow up.
Manning: "Humans managed to develop language, and that gave a symbolic knowledge representation and reasoning level which just gave this sort of vaulting of what could be done with the intelligence in brains."
Manning names the philosophical split (~16:20): LeCun "has never appreciated the power of language in particular or symbolic representations in general." JEPA, in his read, is "a reasonable research bet... not really established it's the best one." He argues transformer weights themselves constitute a joint embedding, so the LeCun objection to autoregressive LMs doesn't actually land.
The product pitch that perks Swyx up (~30:10): Revery is positioned to replace rasterizers and DLSS. More radically, the renderer is part of the game loop — "upon getting 10 apples, my bullet turns into apples" — because rendering can be triggered by game state rather than being a one-way derivative of it. Skins, ray-tracing replacement, photorealistic GTA mods on demand all fall out of this.
A reasoning trace for a single bowling game prompt (~22:20) includes geometry, physics, affordances, symbolic logic, audio triggers, score increments, pin-falling sequencing, and persistent state — none of which Genie or World Labs' Marble provide. Multiplayer is already prompt-configurable.
Sun (~54:20): Sora 2's audio is a bolted-on, ElevenLabs-style voice track, not spatial. A dog running away from the camera in Sora doesn't fade. Moonlake's spatial audio falls out of the game-engine tool abstraction, not a standalone audio model.
Manning closes on the benchmarking problem (~37:40): just like language-model benchmarks failed to capture "recommend me a backpack for Europe," world-model benchmarks won't capture "can a game designer produce what they were imagining in a reasonable amount of time." The field will walk with its feet; evaluation converges to vibe checks and downstream use.
Moonlake is hiring: 18 people, moving from Saratoga to SF, looking for the intersection of code generation + computer vision + graphics. "If you've written a game engine before, if you've RL'd a variety of coding models on different objectives... if you've done multimodal latent-space alignment."
Over 100 Baidu Apollo Go robotaxis halted on Wuhan highways when a centralized cloud system handling routing, navigation, and remote assistance failed — vehicles stopped mid-lane without pulling to emergency shoulders, SOS buttons failed, and customer-service response lagged ~30 minutes.[8] Tech Brew — "The day the robotaxis froze." Some collisions, no reported injuries. This is the second major cloud-dependent AV failure in four months, following a December 2025 San Francisco blackout that disrupted Waymo.
What failed was not the individual vehicles' autonomy stacks but the shared cloud backbone. Apollo Go operates 1,000+ vehicles in Wuhan alone; Waymo is doing 500,000 paid rides weekly and targeting 1M by year-end. Both fleets rely on centralized remote-assistance services to confirm ambiguous road situations. When that layer went down, Wuhan's cars had no graceful fallback — no "pull to shoulder and wait" protocol fired, so they stopped in-lane.
San Francisco: a traffic-light blackout triggered so many remote-confirmation requests that cellular networks overloaded, which cascaded into Waymo fleet disruption. Same structural failure — centralized cloud plus remote-assistance plus no standardized emergency protocol — different trigger.
"It took 30 minutes to even connect to a customer service rep — and help never came."
Regulators have been evaluating per-vehicle safety (disengagements, crash rates) rather than fleet-level failure modes. Two fleet-wide outages in a quarter force the question of whether AV services need something like aviation-grade failover requirements — geographically distributed remote-ops, mandatory on-vehicle safe-halt behaviors, redundant connectivity — before scale targets like Waymo's 1M weekly are acceptable.
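A sketch of what a mandatory on-vehicle safe-halt behavior could look like as a local state machine; states, transitions, and timeouts are illustrative assumptions, not from any real AV stack:

```python
# Illustrative on-vehicle watchdog: degrade gracefully when the cloud layer
# goes quiet instead of stopping mid-lane. All thresholds are assumptions.
import enum

class Mode(enum.Enum):
    NOMINAL = "cloud-assisted driving"
    DEGRADED = "local-only driving, queue remote-assist requests"
    SAFE_HALT = "pull to shoulder, hazards on, await recovery"

HEARTBEAT_TIMEOUT_S = 5.0  # no cloud heartbeat for this long -> degraded
DEGRADED_LIMIT_S = 60.0    # degraded for this long -> controlled safe halt

def next_mode(now: float, last_heartbeat: float,
              degraded_since: float | None) -> Mode:
    """Pure transition function; the caller runs it on every control tick."""
    if now - last_heartbeat <= HEARTBEAT_TIMEOUT_S:
        return Mode.NOMINAL       # cloud reachable: normal operation
    if degraded_since is None or now - degraded_since < DEGRADED_LIMIT_S:
        return Mode.DEGRADED      # keep driving on local autonomy
    return Mode.SAFE_HALT         # shoulder stop, never a mid-lane freeze
```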
Issue 645's editor picks lean on a single theme: actual data science work is not model training — it's experiment design, metric engineering, and debugging stochastic systems.[9] Data Science Weekly — "Issue 645." The technical highlight is CarGurus' genStats framework, a Python orchestrator that auto-generates notebooks and centralizes statistical logic across hundreds of A/B tests a year. Also surfaced: a PyTorch autograd deep-dive on in-place operations + view aliasing, and a skeptical take on AI models that claim to "understand" genomics.