MLX-LM: native fine-tuning on Apple Silicon

A 100% Apple-stack alternative to Unsloth, with no heavy CUDA/PyTorch dependencies.

When to use MLX vs Unsloth

Criterion	MLX-LM	Unsloth
Raw speed	⚖️ Similar	⚖️ Similar
Setup simplicity	✅ Better	🟡 Good
Ecosystem (HF integration)	🟡 Weaker	✅ Excellent
GGUF export	🟡 Manual (extra step)	✅ Built-in
Gemma 4 vision/audio support	🟡 Via mlx-vlm	🟡 Text only
Memory efficiency	✅ Excellent	✅ Excellent
Debuggability	🟡 Young stack	✅ Mature

Recommendation: start with Unsloth. Migrate to MLX only if:
- You want a 100% Apple stack.
- You need to integrate with native Swift/macOS code.
- Unsloth gives you MPS-specific problems.

Installation

uv pip install "mlx-lm>=0.20" "mlx>=0.20"

# For vision-language:
uv pip install "mlx-vlm>=0.4.3"

LoRA with mlx_lm.lora

# 1. Prepare the dataset in JSONL format (see datasets-format.md)
ls data/
# train.jsonl  valid.jsonl  test.jsonl

# 2. LoRA training
mlx_lm.lora \
    --model mlx-community/gemma-4-e4b-4bit \
    --train \
    --data data/ \
    --num-layers 16 \
    --batch-size 2 \
    --iters 1000 \
    --learning-rate 2e-4 \
    --adapter-path outputs/lora-v1
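
The three files that step 1 expects can be produced from a single list of examples. The following is a minimal sketch, not the project's own tooling: the chat-style "messages" layout is an assumption (datasets-format.md defines the layout actually used here), and split_examples and write_splits are illustrative names.

```python
import json
import random
import tempfile
from pathlib import Path


def split_examples(examples, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle deterministically and split into train/valid/test lists."""
    rows = list(examples)
    random.Random(seed).shuffle(rows)
    n_valid = int(len(rows) * valid_frac)
    n_test = int(len(rows) * test_frac)
    return {
        "valid": rows[:n_valid],
        "test": rows[n_valid:n_valid + n_test],
        "train": rows[n_valid + n_test:],
    }


def write_splits(splits, out_dir):
    """Write one JSON object per line to <out_dir>/<split>.jsonl."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        with open(out / f"{name}.jsonl", "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")


# Hypothetical placeholder examples in an assumed "messages" layout.
examples = [
    {"messages": [
        {"role": "user", "content": f"Question {i}"},
        {"role": "assistant", "content": f"Answer {i}"},
    ]}
    for i in range(500)
]
splits = split_examples(examples)

# Demo: write into a temporary directory; pass out_dir="data" for real use.
with tempfile.TemporaryDirectory() as tmp:
    write_splits(splits, tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())
    first_line = (Path(tmp) / "train.jsonl").read_text(encoding="utf-8").splitlines()[0]
    first_row = json.loads(first_line)
```

A fixed seed keeps the split reproducible, so the test set stays the same between training runs.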

Important flags:
- --num-layers 16: how many layers get LoRA adapters (16 is a good default; raising it adds capacity but uses more memory).
- --batch-size 2: fits in 24 GB for E4B.
- --iters 1000: at batch size 2 this is 2,000 examples, roughly 4 passes over 500 samples.
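
How many passes over the data a given --iters value covers depends on the batch size as well as the dataset size. The arithmetic can be sketched as below, assuming one batch of --batch-size examples per iteration and no gradient accumulation; the helper names are illustrative.

```python
import math


def epochs_covered(iters, batch_size, num_examples):
    """Passes over the training set, assuming one batch per iteration."""
    return iters * batch_size / num_examples


def iters_for_epochs(epochs, batch_size, num_examples):
    """Iterations needed for a target number of passes (rounded up)."""
    return math.ceil(epochs * num_examples / batch_size)


# Values from the command above, with a 500-example training set.
print(epochs_covered(iters=1000, batch_size=2, num_examples=500))   # 4.0
print(iters_for_epochs(epochs=2, batch_size=2, num_examples=500))   # 500
```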

Fuse the adapter into the base model

# Produces a merged model (adapter folded into the base weights, no separate adapter)
mlx_lm.fuse \
    --model mlx-community/gemma-4-e4b-4bit \
    --adapter-path outputs/lora-v1 \
    --save-path outputs/gemma4-e4b-finetuned-v1

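Convert to GGUF for Ollama

# Requires a local clone of llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Fused MLX model (safetensors) → f16 GGUF. convert_hf_to_gguf.py has no q4_k_m
# output type, so quantize in a second step. The converter expects unquantized
# weights, so a fused 4-bit model may need to be dequantized first.
python convert_hf_to_gguf.py \
    ../outputs/gemma4-e4b-finetuned-v1 \
    --outfile ../outputs/gemma4-e4b-finetuned-v1-f16.gguf \
    --outtype f16

# Needs a built llama.cpp (the binary path may differ on your machine)
./build/bin/llama-quantize \
    ../outputs/gemma4-e4b-finetuned-v1-f16.gguf \
    ../outputs/gemma4-e4b-finetuned-v1.gguf Q4_K_M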

Then import it into Ollama (see unsloth-lora.md#cargar-adapter-de-vuelta-a-ollama).
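
For orientation, the import step boils down to a Modelfile that points at the GGUF file. The sketch below only renders that text; the linked page remains the reference for this project. The num_ctx value and the function name are assumptions, and the temperature simply mirrors the value used for inference later in this page.

```python
def render_modelfile(gguf_path, temperature=0.3, num_ctx=4096):
    """Return Modelfile text pointing Ollama at a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",
        f"PARAMETER temperature {temperature}",
        f"PARAMETER num_ctx {num_ctx}",  # assumed context size, adjust as needed
    ]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile("./outputs/gemma4-e4b-finetuned-v1.gguf")
print(modelfile_text)
# Save the text to a file named Modelfile, then in a shell:
#   ollama create <your-model-name> -f Modelfile
```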

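Evaluation with mlx-lm

# Test-set loss and perplexity for the fine-tuned adapter (reads data/test.jsonl).
# mlx_lm.evaluate runs lm-evaluation-harness tasks and does not take a local JSONL file,
# so the test pass goes through mlx_lm.lora. Repeat on the base model for a baseline.
mlx_lm.lora \
    --model mlx-community/gemma-4-e4b-4bit \
    --adapter-path outputs/lora-v1 \
    --data data/ \
    --test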
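
A simple way to summarize a base vs fine-tuned comparison is perplexity, the exponential of the mean per-token cross-entropy loss. The sketch below assumes you have already measured a mean test loss for each model; the function names are illustrative and the loss values are hypothetical placeholders, not measured results.

```python
import math


def perplexity(mean_loss):
    """Perplexity from a mean per-token cross-entropy loss (natural log)."""
    return math.exp(mean_loss)


def relative_improvement(base_loss, tuned_loss):
    """Fractional drop in perplexity from base to fine-tuned (positive is better)."""
    base_ppl = perplexity(base_loss)
    tuned_ppl = perplexity(tuned_loss)
    return (base_ppl - tuned_ppl) / base_ppl


# Hypothetical placeholder losses; substitute the values from your own runs.
base_loss, tuned_loss = 2.0, 1.0
print(f"base ppl:  {perplexity(base_loss):.2f}")
print(f"tuned ppl: {perplexity(tuned_loss):.2f}")
print(f"relative improvement: {relative_improvement(base_loss, tuned_loss):.1%}")
```

Perplexity only measures fit to the test distribution, so it is worth pairing with a manual look at generated outputs.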

Quick inference with mlx-lm

# Quick test from the CLI
mlx_lm.generate \
    --model outputs/gemma4-e4b-finetuned-v1 \
    --prompt "Your prompt here" \
    --max-tokens 256 \
    --temp 0.3

To serve it as an OpenAI-compatible API:

mlx_lm.server \
    --model outputs/gemma4-e4b-finetuned-v1 \
    --host 0.0.0.0 \
    --port 8080

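Then from code:

import litellm

# The "openai/" prefix routes the call through LiteLLM's OpenAI-compatible client.
response = litellm.completion(
    model="openai/local",
    api_base="http://localhost:8080/v1",
    api_key="not-needed",  # placeholder; avoids a missing-key error if OPENAI_API_KEY is unset
    messages=[{"role": "user", "content": "Hola"}],
)
print(response.choices[0].message.content)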
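
If you would rather not depend on LiteLLM, the same endpoint can be called with the standard library. This is a sketch that assumes the server started above is listening on localhost:8080 and exposes the OpenAI-style /v1/chat/completions route; build_chat_request and chat are illustrative names.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, max_tokens=256, temperature=0.3):
    """Build a POST request for an OpenAI-style chat completions route."""
    payload = {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def chat(base_url, messages, timeout=120):
    """Send the request and return the text of the first choice."""
    req = build_chat_request(base_url, messages)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not touch the network; chat() does.
request = build_chat_request(
    "http://localhost:8080/v1",
    [{"role": "user", "content": "Hola"}],
)
# With the server running:
# print(chat("http://localhost:8080/v1", [{"role": "user", "content": "Hola"}]))
```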

Mac-native alternatives

Tool	URL	Pros
mlx-tune	github.com/ARahim3/mlx-tune	Simple, opinionated wrapper
gemma-tuner-multimodal	github.com/mattmireles	Supports text + image + audio on MPS

Memory tips for M4 Pro 24 GB

With mlx-lm and E4B:
- --batch-size 1 + --num-layers 8: ~8 GB
- --batch-size 2 + --num-layers 16: ~14 GB (recommended)
- --batch-size 4 + --num-layers 16: ~22 GB (at the limit)

If you run out of memory:
1. Reduce --max-seq-length to 2048.
2. Reduce --batch-size to 1.
3. Enable --grad-checkpoint (a plain switch, it takes no value).
4. Quantize more aggressively (mlx-community/gemma-4-e4b-3bit).
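
The estimates above can be turned into a small lookup that picks the largest listed configuration fitting a memory budget. This is a rough sketch: the figures are the approximate values from this page, and the 4 GB headroom reserved for the OS and other apps is an assumption, as are the names LoraConfig and pick_config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraConfig:
    batch_size: int
    num_layers: int
    approx_gb: float


# Approximate figures from the list above (E4B with mlx-lm).
CONFIGS = [
    LoraConfig(batch_size=1, num_layers=8, approx_gb=8.0),
    LoraConfig(batch_size=2, num_layers=16, approx_gb=14.0),
    LoraConfig(batch_size=4, num_layers=16, approx_gb=22.0),
]


def pick_config(budget_gb, headroom_gb=4.0):
    """Largest listed config whose estimate fits in budget minus headroom."""
    usable = budget_gb - headroom_gb
    fitting = [c for c in CONFIGS if c.approx_gb <= usable]
    if not fitting:
        return None  # nothing fits; fall back to the out-of-memory steps
    return max(fitting, key=lambda c: c.approx_gb)


choice = pick_config(budget_gb=24)
print(choice)
```

With a 24 GB budget and the assumed headroom, this lands on the batch size 2, 16-layer configuration, which matches the recommended row.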

Decision: what to recommend to whom?

If your situation is...	Use...
Standard Python dev, HF ecosystem	Unsloth
Apple-first dev, Swift / Tauri integration	MLX-LM
Multimodal fine-tune (vision + audio)	gemma-tuner-multimodal or mlx-vlm
You just need a quick experiment	mlx_lm.lora (single CLI command)