06 · Recommended tech stack

What you install on Monday to start building.

Architecture diagram

┌──────────────────────────────────────────────────────────┐
│  Frontend                                                │
│  Next.js 15 (web) / Tauri 2 (desktop) /                  │
│  React Native + Expo (mobile)                            │
└────────────────────┬─────────────────────────────────────┘
                     │
┌────────────────────▼─────────────────────────────────────┐
│  Backend / API gateway                                   │
│  FastAPI (Python) + LiteLLM proxy                        │
│  Auth: Clerk or Better-Auth                              │
│  Payments: Stripe Billing (usage-based)                  │
└────┬─────────────┬─────────────┬─────────────────────────┘
     │             │             │
┌────▼──────┐ ┌────▼──────┐ ┌────▼───────────────────────┐
│ Inference │ │ RAG       │ │ Observability              │
│ Ollama    │ │ Qdrant    │ │ Langfuse self-hosted       │
│ (local M4)│ │ +Embedding│ │ + PostHog (product)        │
│ + Cloud   │ │ Gemma 308M│ │                            │
│ Run NIM   │ │           │ │                            │
└───────────┘ └───────────┘ └────────────────────────────┘
     │
┌────▼─────────────────────────────────────────────────────┐
│  Local fine-tuning: Unsloth (CLI) or MLX-LM (LoRA/QLoRA) │
│  Export GGUF → back into Ollama                          │
└──────────────────────────────────────────────────────────┘

Components and why each one

Local inference: Ollama 0.22+

Why: one-command setup, built-in MLX runner, OpenAI-compatible API, prepackaged models.

brew install ollama
ollama serve &  # start the server first (pulls need it running); exposes http://localhost:11434/v1
ollama pull gemma4:e4b
ollama pull gemma4:e2b
ollama pull embeddinggemma
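
As a rough illustration of what "OpenAI-compatible" means in practice, the sketch below builds a chat request against the /v1 base URL from the command above using only the Python standard library. It assumes the server exposes the usual OpenAI-style /chat/completions route and response shape; the helper names build_chat_request and chat are hypothetical, not part of Ollama.

```python
import json
import urllib.request

# Base URL taken from the `ollama serve` line above.
OLLAMA_BASE_URL = "http://localhost:11434/v1"


def build_chat_request(prompt: str, model: str = "gemma4:e4b") -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /chat/completions route (assumed)."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON response instead of a stream
    }
    return urllib.request.Request(
        f"{OLLAMA_BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "gemma4:e4b", timeout: float = 120.0) -> str:
    """Send the request to a running server and return the assistant's reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumes the OpenAI response shape: choices[0].message.content
    return payload["choices"][0]["message"]["content"]


# Usage (requires the server to be running and the model to be pulled):
# print(chat("Say hello in one sentence."))
```

Because the route and payload follow the OpenAI shape, the same request should work against other OpenAI-compatible servers by changing only the base URL.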

Recommended configuration, set as environment variables for the ollama serve process (the values below are in env-var form, not YAML):

OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_NUM_PARALLEL=2
OLLAMA_KEEP_ALIVE=10m

Advanced inference: llama.cpp

When: complex agentic tool calling, fine-grained control over quantization, prompts >500 tokens.

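brew install llama.cpp
# or build from source with CMake (the Makefile build is no longer supported; Metal is on by default on Apple Silicon)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j 8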

Embeddings: EmbeddingGemma 308M

Why: MRL (Matryoshka Representation Learning) at 768/512/256/128 dimensions, 100+ languages, a sweet spot for price/performance.

ollama pull embeddinggemma
# Via the OpenAI-compatible API:
curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "text to embed", "model": "embeddinggemma"}'

Tip: use MRL truncation to cut storage. Going from 768 to 256 dimensions shrinks the vector storage in Qdrant to 1/3, with <2% recall loss.
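
A minimal sketch of MRL truncation, assuming the embedding arrives as a plain list of floats: keep the leading components and rescale to unit length. The re-normalization step is an assumption on my part (it keeps dot-product and cosine scores comparable after truncation), and the helper names are hypothetical. The storage figure follows from float32 vectors taking 4 bytes per dimension.

```python
import math


def truncate_embedding(vector: list[float], dim: int = 256) -> list[float]:
    """Keep the first `dim` components of an MRL embedding and rescale to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError(f"dim must be between 1 and {len(vector)}, got {dim}")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero vector: nothing to rescale
    return [x / norm for x in head]


def float32_bytes(dim: int) -> int:
    """Raw size of one float32 vector, ignoring index and payload overhead."""
    return dim * 4


# 768 dims -> 3072 bytes per vector; 256 dims -> 1024 bytes, i.e. one third.
full_size = float32_bytes(768)
small_size = float32_bytes(256)
```

The collection in the vector store has to be created with the truncated size (256 here), and queries must be truncated the same way as the stored documents.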

Vector DB

Option	When	Pros	Cons
LanceDB	Desktop apps (Tauri/Electron)	Embedded, file-based, zero infra	Single-tenant
Qdrant self-hosted	Multi-tenant SaaS	Rich filtering, native multi-tenancy, fast	VPS $30-80/month
pgvector	You already run Postgres	A single database	Slower at scale

Initial recommendation: Qdrant in Docker for SaaS, LanceDB for desktop apps.
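
To make the multi-tenant filtering point concrete, here is a sketch of a tenant-scoped search request for Qdrant's REST API. It assumes the /collections/{name}/points/search endpoint and that each point carries a tenant_id payload field; that field name, the collection name and the helper are hypothetical. Newer Qdrant releases also offer a query endpoint, so check the API reference for the version you run.

```python
# Default port from Qdrant's Docker image.
QDRANT_URL = "http://localhost:6333"


def build_tenant_search(
    collection: str,
    tenant_id: str,
    query_vector: list[float],
    limit: int = 5,
) -> tuple[str, dict]:
    """Return the URL and JSON body for a search restricted to one tenant."""
    url = f"{QDRANT_URL}/collections/{collection}/points/search"
    body = {
        "vector": query_vector,
        "limit": limit,
        "with_payload": True,
        # Only points whose payload has tenant_id == <tenant_id> are considered.
        "filter": {
            "must": [
                {"key": "tenant_id", "match": {"value": tenant_id}},
            ]
        },
    }
    return url, body


url, body = build_tenant_search("documents", "acme", [0.1, 0.2, 0.3], limit=3)
```

Keeping the tenant filter inside one helper makes it harder to issue an unscoped query by accident, which is the main risk in a shared collection.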

API gateway: LiteLLM

Why: it abstracts Ollama/OpenAI/Anthropic/Google behind one interface, so you can switch providers without touching the calling code.

from litellm import completion

# Local, through Ollama
response = completion(
    model="ollama/gemma4:e4b",
    api_base="http://localhost:11434",
    messages=[{"role": "user", "content": "Hello"}]
)

# Switch to Gemma 4 31B via Google AI Studio by changing one line
response = completion(
    model="gemini/gemma-4-31b",
    messages=[{"role": "user", "content": "Hello"}]
)

Backend: FastAPI

Why: native async, type safety with Pydantic, auto-generated OpenAPI, the Python ecosystem.

Recommended structure:

backend/
├── pyproject.toml
├── app/
│   ├── __init__.py
│   ├── main.py              # FastAPI app
│   ├── settings.py          # Pydantic Settings (env vars)
│   ├── inference.py         # LiteLLM wrapper
│   ├── rag.py               # Qdrant + EmbeddingGemma
│   ├── routes/
│   │   ├── chat.py
│   │   ├── documents.py
│   │   └── health.py
│   ├── models/              # Pydantic models
│   ├── services/            # business logic
│   └── deps.py              # FastAPI dependencies
└── tests/

Fine-tuning

Mac-native (recommended on an M4 Pro 24 GB):

- Unsloth (stable Mac support since April 2026): LoRA and QLoRA, up to 4× faster than HF Transformers.
- mlx-tune (github.com/ARahim3/mlx-tune): native MLX, simple.
- gemma-tuner-multimodal (github.com/mattmireles/gemma-tuner-multimodal): text + image + audio on MPS.

Cloud (when you need 26B/31B):

- Modal or Lambda Cloud: rent 1× H100 spot for $1-2/h, run Unsloth, export the LoRA adapter.

More details in ../fine-tuning/.

Observability

Layer	Tool	Why
LLM traces	Langfuse self-hosted	Open source, traces + evals + prompt management
Product analytics	PostHog (cloud or self-hosted)	Funnels, retention, feature flags
Errors	Sentry	Auto-instrumentation for FastAPI/Next.js
Infra metrics	Prometheus + Grafana	At scale; optional at the start

Frontend

Platform	Stack	Why
Web	Next.js 15 + App Router + Server Actions	Standard, SSR, automatic bundle optimization
Desktop	Tauri 2	Much lighter than Electron, Rust + native WebView
Mobile	React Native + Expo + Google AI Edge	Edge inference with LiteRT-LM for Gemma 4 E2B

Auth and payments

Service	Why
Clerk	Best DX, social login, magic links, multi-factor
Better-Auth	Self-hosted, open source, if Clerk is too expensive
Stripe Billing	Usage-based pricing, subscriptions, EU friendly
Lemon Squeezy	If you sell to consumers, it handles EU VAT automatically

Production deployment

MVP stage (0-50 customers)

Frontend: Cloudflare Pages / Vercel.
Backend: Railway / Fly.io (CPU-only).
Inference: your Mac via Cloudflare Tunnel (temporary).

Traction stage (50-500 customers)

Frontend: Cloudflare Pages / Vercel.
Backend: Railway / Fly.io / Hetzner CPX31 (€8/month).
Inference: headless Mac Mini M4 Pro 48 GB over Tailscale, or on-demand cloud GPU.

Scale stage (>500 customers)

Frontend: own CDN + edge functions.
Backend: K8s cluster (3-5 nodes) on GKE/EKS.
Inference: Cloud Run with an NVIDIA L4 or RTX PRO 6000 (96 GB) GPU; scale-to-zero, OpenAI-compatible.

Final concrete stack (copy and run)

# Base setup (15 min)
brew install ollama
ollama serve &          # pulls need the server running
ollama pull gemma4:e4b
ollama pull embeddinggemma
ollama pull gemma4:e2b  # faster fallback

# Backend
mkdir mi-saas && cd mi-saas
uv init . && uv add fastapi "uvicorn[standard]" litellm qdrant-client \
  langfuse sqlmodel stripe python-multipart pypdf

# Local vector DB (5 min)
docker run -d -p 6333:6333 -v "$(pwd)/qdrant:/qdrant/storage" qdrant/qdrant

# Observability (10 min)
git clone https://github.com/langfuse/langfuse && cd langfuse
docker compose up -d
cd ..  # back to mi-saas before scaffolding the frontend

# Frontend
npx create-next-app@latest frontend --typescript --tailwind --app
cd frontend && npm install @clerk/nextjs @stripe/stripe-js

See ../stack/ for the full scaffolding.