M4 Pro 24 GB — Benchmarks verificados¶

Hardware: Apple M4 Pro (12C CPU / 16C GPU), 24 GB unified memory, 273 GB/s bandwidth. macOS: Darwin 24.6 Última verificación: mayo 2026.

Tabla maestra¶

Modelo	Runtime	Cuant.	Memoria	tok/s (corto)	tok/s (8K ctx)	TTFT
Gemma 4 E2B	Ollama 0.22	Q4_K_M	4.0 GB	95	~70	<300ms
Gemma 4 E2B	MLX-LM	4bit	4.0 GB	81	~60	~350ms
Gemma 4 E4B	Ollama 0.22	Q4_K_M	5.5 GB	57	~38	<500ms
Gemma 4 E4B	MLX-LM	4bit	5.5 GB	49	~32	~600ms
Gemma 4 26B-A4B	Ollama 0.22	Q4_K_M	18 GB	~2	OOM	>3s
Gemma 4 31B Dense	Ollama 0.22	Q4_K_M	—	— (no entra)	—	—
Gemma 3 4B	Ollama	Q4_K_M	4.0 GB	80	60	<400ms
Gemma 3 12B	Ollama	Q4_K_M	9 GB	30	20	~800ms
Gemma 3 27B	Ollama	Q4_K_M	18 GB	12	8	>2s
EmbeddingGemma 308M	Ollama	F16	800 MB	n/a (3-5ms/text)	n/a	<50ms

Interpretación¶

Velocidad referencia humana¶

<10 tok/s: lento, se nota.
10-25 tok/s: aceptable para batch / async.
25-50 tok/s: bueno para UI interactiva.
>50 tok/s: excelente, near-realtime.

Recomendación por producto¶

Producto	Modelo	Justificación
Chat asistente interactivo	E4B	57 tok/s en chat, 38 tok/s con RAG → UX fluida
Clasificación / NER batch	E2B	95 tok/s suficiente, ahorra memoria
Razonamiento complejo / agentic	31B vía API	M4 Pro no lo corre; usa Google AI Studio gratis
RAG semantic search	EmbeddingGemma	<50ms/embedding, sweet spot

Bandwidth bottleneck¶

M4 Pro 273 GB/s vs alternativas: - M3 Pro: 150 GB/s - M4 Max 32-core: 410 GB/s (1.5×) - M4 Max 40-core: 546 GB/s (2×) - M3 Ultra: 800 GB/s (2.9×) - RTX 4090: 1008 GB/s (3.7×) - H100 SXM: 3350 GB/s (12×)

Implicación: si tu producto necesita correr el 31B Dense localmente con holgura, considera M3 Ultra o RTX 6000.

Cuantización: comparación detallada (Gemma 4 E4B)¶

Cuant.	Memoria	tok/s	Calidad relativa	Recomendación
Q2_K	3.0 GB	70	~85%	Solo edge extremo
Q4_K_M	5.5 GB	57	~99%	✅ Default
Q5_K_M	6.5 GB	50	~99.5%	Si Q4 falla en eval
Q6_K	7.5 GB	45	~99.7%	Marginal vs Q4
Q8_0	9.5 GB	35	~99.9%	Solo si necesitas certificación
FP16	11 GB	25	100% (ref)	Investigación

Bugs verificados Apple Silicon + Gemma 4¶

Bug	Versión	Workaround
Flash Attention hang con prompts >500 tok en 31B	Ollama 0.20.x	Upgrade a 0.22+; o `OLLAMA_FLASH_ATTENTION=0`
/v1 endpoint manda content a `reasoning` field	Ollama 0.20.x	Parse both fields, concatenate
MLX runner no soporta Gemma4ForConditionalGeneration	algunas builds MLX	Use Metal backend (default)

Ref: GitHub issue ollama/ollama#15368.

KV cache cuantizado (sólo llama.cpp)¶

# Sin cuantización (default)
llama-server -m gemma4-e4b-q4_k_m.gguf -c 32768
# KV cache: ~3 GB en 32K tokens

# Con KV cache Q8
llama-server -m gemma4-e4b-q4_k_m.gguf -c 32768 -ctk q8_0 -ctv q8_0
# KV cache: ~1.5 GB en 32K tokens → libera memoria para modelo más grande

Cuándo el M4 Pro 24 GB NO basta¶

Necesitas correr Gemma 4 26B-A4B sostenido (>100 reqs/h): considera M4 Max 64 GB o cloud L4.
Necesitas Gemma 4 31B Dense: solo M3 Ultra 96+ GB, RTX 6000, o cloud.
Necesitas fine-tuning de 12B+: cloud H100 spot ($1-2/h) por horas.
Producción con >10 reqs/s sostenidos: migra a Cloud Run con L4/L40S.

Cómo re-verificar estos números¶

./scripts/bench.sh gemma4:e4b 20
# Output: benchmarks/runs/gemma4-e4b_<timestamp>.md

Si tus números son significativamente menores que esta tabla: 1. Verifica que OLLAMA_FLASH_ATTENTION=1 está activo. 2. Cierra apps de fondo (especialmente las que usan GPU: Chrome con muchas tabs, Spotify, etc.). 3. Conecta el Mac al power (no batería). 4. Confirma que el modelo está cuantizado a Q4_K_M (no Q8 por error).