llama-swap
llama-swap is an efficient LLM inference server based on llama.cpp. Unlike Ollama, it offers more control over model configuration, supports HuggingFace models directly, and is often significantly more performant.
Recommendation for Linux/AMD64 Users
If you are running a Linux server with AMD64 architecture and want maximum performance, llama-swap is the better choice over Ollama.
Installation
Add the following template to your docker-compose.yml and then run ei23 dc.
GPU Required
This template is configured for NVIDIA GPUs. Without a GPU, inference will be very slow.
Template
llama-swap:
image: ghcr.io/mostlygeek/llama-swap:cuda
container_name: llama-swap
restart: unless-stopped
ports:
- 9292:8080
- 10008:10008
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
- ./volumes/llama-swap/config.yaml:/app/config.yaml
- ./volumes/llama-swap/models:/models:ro
Configuration
Create the configuration file /home/[user]/ei23-docker/volumes/llama-swap/config.yaml:
# Example configuration for llama-swap
# Documentation: https://github.com/mostlygeek/llama-swap
models:
# Example: Mistral 7B
- name: mistral
model: /models/mistral-7b-instruct-v0.2.Q4_K_M.gguf
context_length: 8192
gpu_layers: 99 # All layers on GPU (nvidia)
# Example: Llama 3 8B
- name: llama3
model: /models/llama-3-8b-instruct.Q4_K_M.gguf
context_length: 8192
gpu_layers: 99
# Swap configuration: Models are loaded/unloaded on demand
swap:
strategy: timeout
timeout: 300 # Seconds until an unused model is unloaded
Download Models
Download GGUF models from HuggingFace to the models folder:
# Create folder
mkdir -p ~/ei23-docker/volumes/llama-swap/models
# Example: Download Qwen3.5-9B Instruct
cd ~/ei23-docker/volumes/llama-swap/models
wget https://huggingface.co/unsloth/Qwen3.5-9B-GGUF/resolve/main/Qwen3.5-9B-UD-Q4_K_XL.gguf?download=true
Where to find models?
- Unsloth on HuggingFace - Many quantized models
- gguf-my-repo - Convert your own model
- For 8GB VRAM: Q4_K_M quantizations recommended
- For 4GB VRAM: Q3_K_M or Q2_K quantizations
Notes
- After startup, the API is available at
http://[IP]:9292 - The API is compatible with the OpenAI API interface
- Advantages over Ollama:
- Direct support for HuggingFace GGUF models
- Finer control over context length and GPU layers
- Lower memory consumption through intelligent swapping
- Often faster inference
- Combine with Open WebUI for a chat interface
- Port 10008 is reserved for internal purposes
Or configure the connection in the Open WebUI settings under "Connections".
Further Information
- GitHub Repository
- llama.cpp Documentation
- HuggingFace - Model Repository