Google released Gemma 4 on April 2, 2026 with four model sizes, Apache 2.0 licensing, 256K context, and native multimodal support. It currently ranks #3 among all open models on Arena AI. Here are the benchmarks, the hardware you need, and five use cases you can run today.
## What Changed from Gemma 3
The biggest change is the license. Gemma 3 shipped with a restrictive custom license that made lawyers nervous and kept the model out of production pipelines. Gemma 4 is Apache 2.0. That alone makes it a different product.
The context window jumped from 128K to 256K tokens. Native function calling and structured JSON output are built in, not bolted on. All four variants handle text and image input natively.
The two smallest models (E2B and E4B) also support audio input, which is new for the Gemma family. Language coverage expanded to 140+, and the community reports noticeably better performance on DevOps tasks and multilingual prompts compared to Gemma 3.
Google claims over 400 million downloads across all Gemma versions. The Apache 2.0 switch will probably accelerate that.
## The Benchmarks That Matter
I care about three benchmarks for coding and reasoning: AIME 2026, LiveCodeBench v6, and GPQA Diamond. Here is how the Gemma 4 31B Dense stacks up against Llama 4 Maverick and the broader field.
*[Chart: Gemma 4 31B vs Llama 4 Maverick, % scores]*
Gemma 4 wins every category. The LiveCodeBench gap is the most interesting one. Llama 4 Maverick scored 77.1% on the standard eval, but some independent benchmarks put it as low as 43.4% on coding tasks. That inconsistency is pushing users toward Gemma 4, and the Reddit threads reflect it.
The 26B MoE variant (which activates only 3.8B parameters per token) ranks #6 among open models on Arena AI. For a model that runs on 16GB VRAM, that is remarkable.
## Pick the Right Model Size
Not every task needs 31 billion parameters. Here is a quick guide to picking the right variant for your hardware and workload.
*[Table: Gemma 4 Model Picker]*
If you have a 16GB GPU, the 26B MoE is probably the right call. It is the model most people on r/LocalLLaMA are running, and for good reason.
## Five Use Cases You Can Run in 5 Minutes
1. Local code review. Point Gemma 4 at a diff and ask for bugs. The 26B MoE handles this well at 80+ tok/s, fast enough to feel interactive. If you are already running local models with Claude Code, swapping in Gemma 4 is a one-line change.
2. Multimodal document parsing. Feed it a screenshot of a UI and ask it to generate the HTML. All variants support image input, so even the E4B can do this on a laptop.
3. Structured data extraction. Native JSON output means you can skip the "please respond in JSON" prompt hack. Pass an invoice image, get clean structured data back.
4. Function calling agents. Built-in function calling support makes Gemma 4 a solid backbone for tool-using agents. Define your tools as JSON schema, and the model will generate the correct call format without few-shot examples.
5. Multilingual customer support. 140+ languages with improved multilingual performance. The Apache 2.0 license means you can deploy this in production without worrying about Google's usage restrictions.
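The function-calling flow in use case 4 can be sketched without a model in the loop: you describe a tool as JSON schema, the model emits a structured call, and your code parses and dispatches it. Everything below (the `get_weather` tool, the response shape) is an illustrative stand-in, not Gemma 4's exact wire format:

```python
import json

# A tool described in JSON-schema style. In practice this definition
# goes into the prompt via the chat template; here it just illustrates
# the shape. (Hypothetical tool, not part of any Gemma 4 spec.)
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch(tool_call_json: str, tools: dict) -> str:
    """Parse a model-emitted tool call and run the matching function."""
    call = json.loads(tool_call_json)
    return tools[call["name"]](**call["arguments"])

# Pretend the model emitted this structured call:
model_output = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'
print(dispatch(model_output, {"get_weather": lambda city: f"Sunny in {city}"}))
# Sunny in Berlin
```

The point of native function calling is that the model produces that JSON reliably without few-shot examples; your side of the contract stays this simple.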
## Getting Started with Ollama and Transformers
Two paths. If you want to be running in 30 seconds, use Ollama. If you need fine-grained control, use Transformers.
### Ollama (fastest)
```bash
# Edge model — runs on anything
ollama run gemma4:e2b

# Community favorite — 16GB VRAM
ollama run gemma4:26b-a4b

# Full power — 24GB VRAM
ollama run gemma4:31b
```
That's it. Day-one support shipped for Ollama, llama.cpp, vLLM, LM Studio, MLX, and Unsloth.
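Once a model is pulled, Ollama also serves a local REST API on port 11434, which is how you script against it rather than typing into the CLI. A minimal sketch using Ollama's standard `/api/chat` endpoint and the `gemma4:26b-a4b` tag from above:

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str) -> dict:
    """Standard Ollama /api/chat request body (non-streaming)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt: str, model: str = "gemma4:26b-a4b") -> str:
    """Send one prompt to a locally running Ollama server."""
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# chat("Review this diff for bugs: ...")  # requires `ollama serve` running
```

No SDK required; the payload is plain JSON, which makes it easy to drop into whatever tooling you already have.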
### Transformers (multimodal)
Load the model, then run a multimodal prompt. The message format below is the standard Transformers chat-template convention for image-text models; the image path is illustrative.

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "google/gemma-4-26b-it"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "screenshot.png"},  # illustrative path
    {"type": "text", "text": "Generate the HTML for this UI."}]}]
inputs = processor.apply_chat_template(messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
If you are working with less VRAM, 4-bit quantization cuts weight memory to roughly a quarter of bf16:
```python
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4"
)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, quantization_config=quant_config
)
```
Tip: The community-recommended sampling settings are temp 0.3, top-p 0.9, min-p 0.1, top-k 20. These work well across all four variants.
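If you are calling Ollama's REST API rather than the CLI, those sampling settings travel in the request's `options` field. These are Ollama's standard parameter names; `min_p` support is a more recent addition, so check your Ollama version:

```python
# Community-recommended sampling settings, expressed as Ollama
# request options (standard /api/chat request shape).
sampling_options = {
    "temperature": 0.3,
    "top_p": 0.9,
    "min_p": 0.1,
    "top_k": 20,
}
payload = {
    "model": "gemma4:26b-a4b",
    "messages": [{"role": "user", "content": "Hello"}],
    "options": sampling_options,
    "stream": False,
}
```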

## Where This Fits
A week ago, Llama 4 was the default recommendation for local LLM work. Gemma 4 changes that calculus: better coding benchmarks, Apache 2.0 licensing, and an MoE variant that runs on consumer hardware at 80+ tokens per second.
The 26B MoE is the model I would point most developers toward right now. It hits a rare combination of quality, speed, and accessibility. If you are building developer tooling on Windows or trying to cut your token costs, running Gemma 4 locally is one of the highest-impact changes you can make this month.
Google got this one right.
Sources: Google AI Blog, DeepMind, HuggingFace, Latent Space
