FastAPI Backend

Python backend shared by all implementations. It exposes OpenAI-compatible and RAG endpoints, and abstracts the inference provider with LiteLLM.

Setup

# Requirements: Python 3.11+, uv (https://docs.astral.sh/uv/)

uv sync                          # install dependencies
cp ../../.env.example .env       # then fill in the variables
uv run uvicorn app.main:app --reload --port 8000

# Verify
curl http://localhost:8000/health
curl http://localhost:8000/docs   # OpenAPI UI

Structure

backend/
├── pyproject.toml
├── app/
│   ├── __init__.py
│   ├── main.py              # FastAPI app + lifespan
│   ├── settings.py          # Pydantic Settings (env vars)
│   ├── inference.py         # LiteLLM wrapper + retry logic
│   ├── rag.py               # Qdrant + EmbeddingGemma pipeline
│   ├── deps.py              # FastAPI dependencies (auth, db, etc)
│   ├── routes/
│   │   ├── __init__.py
│   │   ├── health.py
│   │   ├── chat.py
│   │   ├── embeddings.py
│   │   └── documents.py
│   ├── models/              # Pydantic schemas (request/response)
│   │   ├── __init__.py
│   │   └── chat.py
│   └── services/            # Business logic (per implementation)
│       └── __init__.py
└── tests/
    └── test_health.py

Included endpoints

Endpoint	Method	Description
`/health`	GET	Health check for Ollama + Qdrant + Postgres
`/v1/chat/completions`	POST	OpenAI-compatible chat (LiteLLM proxy → Ollama/Gemini/etc.)
`/v1/embeddings`	POST	OpenAI-compatible embeddings (EmbeddingGemma)
`/documents/ingest`	POST	Ingests PDF/text → chunks → embeddings → Qdrant
`/documents/search`	POST	Semantic search + filters
`/docs`	GET	OpenAPI UI (Swagger)
`/redoc`	GET	OpenAPI UI (ReDoc)
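
Because `/v1/chat/completions` is described as OpenAI-compatible, a client can presumably talk to it with a plain JSON POST. The sketch below only builds such a request with the standard library. The helper name `build_chat_request` is hypothetical, and it assumes that no auth header is required and that the endpoint accepts the same model identifiers used with LiteLLM; neither is confirmed by this page.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions.

    Hypothetical helper: the payload follows the OpenAI chat format,
    which the endpoint table says the backend is compatible with.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: a request aimed at the local dev server from the Setup section.
request = build_chat_request("http://localhost:8000", "ollama/gemma4:e4b", "Hola")
```

With the backend running, the request can be sent with `urllib.request.urlopen(request, timeout=30)`; an OpenAI-style response would carry the reply under `choices[0].message.content`.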

Provider switching (LiteLLM)

# Default call: local Gemma 4 E4B
response = await inference.complete(
    messages=[{"role": "user", "content": "Hola"}],
    model="ollama/gemma4:e4b",
)

# Override to the Gemini API if the query is large
if estimated_tokens > 30_000:
    response = await inference.complete(
        messages=...,
        model="gemini/gemma-4-31b",
    )
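
The override above can be folded into a small helper so that callers do not repeat the threshold check. This is only a sketch: `estimate_tokens` and `pick_model` are hypothetical names, and the "about four characters per token" heuristic is an assumption, since this page does not say how `estimated_tokens` is computed.

```python
LOCAL_MODEL = "ollama/gemma4:e4b"           # default model from the snippet above
LARGE_CONTEXT_MODEL = "gemini/gemma-4-31b"  # override model from the snippet above
TOKEN_THRESHOLD = 30_000                    # same cut-off as the snippet above


def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate (assumption: ~4 characters per token)."""
    total_chars = sum(len(m.get("content", "")) for m in messages)
    return total_chars // 4


def pick_model(messages: list[dict], threshold: int = TOKEN_THRESHOLD) -> str:
    """Return the large-context model only when the estimate exceeds the threshold."""
    if estimate_tokens(messages) > threshold:
        return LARGE_CONTEXT_MODEL
    return LOCAL_MODEL
```

The result can then be passed as the `model` argument of `inference.complete(...)`. A real tokenizer would give a more accurate count than the character heuristic.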

inference.py applies:
- Retry with exponential backoff.
- Configurable timeout.
- Automatic traces to Langfuse.
- Routing by task type (configured in settings.py).
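
To make the first two behaviours concrete, here is a minimal sketch of retry with exponential backoff combined with a per-attempt timeout, using only `asyncio`. It is not the code in inference.py: the name `call_with_retry` and its parameters (`max_attempts`, `base_delay`, `timeout`) are hypothetical, and the real wrapper may rely on LiteLLM's own retry options instead.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_retry(
    make_call: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    timeout: float = 30.0,
) -> T:
    """Await make_call() with a per-attempt timeout, retrying on failure.

    The wait between attempts doubles each time (base_delay, 2x, 4x, ...)
    with a little random jitter added.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            # wait_for cancels the call and raises a timeout error when it overruns.
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")
```

A caller would wrap the provider call in a zero-argument function, for example `await call_with_retry(lambda: inference.complete(messages=msgs, model=model))`, so that each retry creates a fresh coroutine.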

Tests

uv run pytest                    # all tests
uv run pytest tests/test_health.py -v
uv run pytest --cov=app          # with coverage

Production

# No reload, multi-worker
uv run gunicorn app.main:app \
  -w 4 \
  -k uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:8000 \
  --access-logfile - \
  --error-logfile -

# Or with Docker (the Dockerfile still has to be created):
docker build -t gemma4-backend .
docker run -p 8000:8000 --env-file .env gemma4-backend

Extending for an implementation

Create app/routes/<impl_name>.py:

from fastapi import APIRouter
router = APIRouter(prefix="/contracts", tags=["legaltech"])

@router.post("/review")
async def review_contract(...):
    ...

Register it in app/main.py:

from app.routes import contracts
app.include_router(contracts.router)

Add the schema in app/models/<impl>.py.
Put the logic in app/services/<impl>.py.

Do not copy the backend: all implementations share this runtime.