There's a reason every serious European startup is looking into self-hosted LLMs: data sovereignty. You cannot send customer emails, medical records, legal drafts, or internal Slack archives to an OpenAI endpoint in Virginia if you have a GDPR-sensitive workload. You also cannot burn €3,000 a month on token bills when you're bootstrapping.
The good news is that in 2026 you can run a capable 7B-parameter language model on a CPU-only VPS in Falkenstein, Germany, for around €50/month, with full data ownership, a stable HTTP API, and zero rate limits. The honest news is that it's slower than a GPU, and it won't replace GPT-4 for deep reasoning. But for 80% of real-world tasks—summarising, extracting, classifying, rewriting, simple RAG—it's more than good enough.
This guide shows you exactly how to do it with Ollama on a DanubeData VPS. No GPU required. Everything here is tested on AMD EPYC hardware with realistic numbers, not marketing fluff.
Why Self-Host an LLM in 2026?
There are five honest reasons, and you need at least one of them to justify the effort. If none apply, just pay OpenAI.
- Data privacy and GDPR. Your data never leaves your VPS. No third-party subprocessor agreement, no DPIA gymnastics, no Schrems-II landmines. For European healthcare, legal tech, public sector, and finance, this alone is decisive.
- Predictable cost. A fixed €49.99/month instead of a metered per-token bill. If you're running batch jobs that hit the model millions of times a day (document classification, ETL tagging, log summarisation), the fixed-cost model wins quickly.
- Offline availability. Your app keeps working when OpenAI has an outage, which happens surprisingly often. You also stop leaking production traffic patterns to a US provider.
- No training-on-your-data clauses. Even with enterprise tier settings, people worry. When the weights are on your disk, the question evaporates.
- Full control over the stack. Swap models freely, fine-tune, inject system prompts at the server level, add custom stop tokens, run your own guardrails. You own it.
What you are not getting: frontier-model quality. A local 7B model is roughly comparable to GPT-3.5, not GPT-4. We'll come back to this.
What Is Ollama?
Ollama is a single Go binary that wraps llama.cpp and gives you three things:
- A model registry. One command (ollama pull qwen2.5:7b) fetches a quantized GGUF model from their library.
- An HTTP API on port 11434 with a native endpoint and an OpenAI-compatible endpoint (so existing SDKs just work).
- A model runner that handles loading, caching, eviction, and concurrent requests—no Python required, no CUDA dance, no torch compile.
It runs on Linux, macOS, and Windows. On a CPU-only VPS it uses AVX2/AVX-512 vector instructions and multi-threaded tensor math from llama.cpp. On AMD EPYC (which is what DanubeData runs), you get AVX2 across every core, which makes CPU inference surprisingly usable.
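Since CPU inference speed rides on those vector instructions, it's worth verifying that a target box actually exposes AVX2 before you commit. A minimal sketch, assuming Linux's /proc/cpuinfo format:

```python
def has_avx2(cpuinfo_text: str) -> bool:
    """Return True if the 'flags' line of /proc/cpuinfo lists AVX2
    (the instruction set llama.cpp's fast CPU kernels rely on)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return "avx2" in line.split()
    return False

# On the VPS itself you would feed it the real file:
# with open("/proc/cpuinfo") as f:
#     print(has_avx2(f.read()))
```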
Honest Hardware Sizing for CPU Inference
The biggest misconception beginners bring: "I'll grab a 2GB VPS and run Llama 70B on it." You won't. Here's what actually works, measured on shared AMD EPYC cores (DanubeData VPS nodes).
| Model Size | Quant | RAM Needed | Realistic tok/sec (CPU) | DanubeData Plan |
|---|---|---|---|---|
| 3B (Llama 3.2, Phi-3.5 Mini) | Q4_K_M | ~4 GB | 15-25 tok/sec | DD Medium (€24.99/mo) |
| 7B (Qwen 2.5, Mistral, Llama 3.1) | Q4_K_M | ~8 GB | 5-10 tok/sec | DD Large (€49.99/mo) |
| 9B (Gemma 2) | Q4_K_M | ~10 GB | 4-7 tok/sec | DD Large (€49.99/mo) |
| 13B (Llama 2 13B, CodeLlama 13B) | Q4_K_M | ~14 GB | 2-4 tok/sec | DD Large + headroom (€49.99/mo) |
| 70B (Llama 3.1 70B) | Q4_K_M | ~42 GB | 0.5-1 tok/sec (painful) | Not recommended on CPU — get a GPU box |
To put "5-10 tokens/sec" in perspective: an average sentence is around 15-25 tokens. So a 7B model on DD Large gives you a sentence every 2-5 seconds. That feels slow for a chat UI but is excellent for background jobs, batch classification, and non-interactive pipelines.
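A back-of-envelope way to turn the table's tok/sec figures into response times (generation time only; prompt processing adds a little on top):

```python
def gen_time_s(output_tokens: int, tok_per_sec: float) -> float:
    """Wall-clock seconds to generate output_tokens at a sustained rate."""
    return round(output_tokens / tok_per_sec, 1)

# A three-bullet summary (~60 tokens) from a 7B model on DD Large at 6 tok/sec:
print(gen_time_s(60, 6))   # 10.0 seconds
# The same summary from a 3B model at 20 tok/sec:
print(gen_time_s(60, 20))  # 3.0 seconds
```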
DanubeData does not currently offer GPU VPS instances. We run AMD EPYC CPU-only servers. If you need sub-second response times for 7B+ models or you want to run 70B models, you'll need to rent a GPU server elsewhere (Hetzner GPU, Lambda, Runpod) and use DanubeData for the rest of your stack (Postgres, Redis, S3, static frontend). We're transparent about this.
Picking the Right Model
Don't just download Llama because it's famous. In 2026 the Chinese and European open-weight models are competitive. Here is what I actually recommend, ordered from smallest to largest.
Llama 3.2 3B — The Fast Default
Meta's smallest production-quality model. Great for classification, short rewrites, simple extraction, routing. Runs on 4GB RAM at Q4. On DD Medium you'll see 15-25 tok/sec, which is fast enough for real-time chat UIs. Knowledge cut-off is late 2024. Weakness: struggles with multi-step reasoning.
Phi-3.5 Mini (3.8B) — The Reasoner Runt
Microsoft's "punches above its weight" model. Roughly matches 7B models on many benchmarks despite being smaller. Use it when you have a DD Medium and want better quality than Llama 3.2 3B without going to DD Large.
Qwen 2.5 7B — The Best All-Rounder (My Default)
Alibaba's 7B model. In 2026 it's the model I reach for by default on DD Large. Excellent at instruction-following, tool use, multilingual (including German and Romanian), and code. There's also qwen2.5-coder:7b tuned specifically for code completion. At Q4_K_M you need ~8GB, so DD Large (32GB) gives you enormous headroom for context and concurrent requests.
Mistral 7B — The Reliable European Option
From the Paris-based Mistral lab. Slightly behind Qwen 2.5 on most benchmarks in 2026, but fully European weights if that matters to your procurement team. Good at French, Spanish, and German out of the box.
Gemma 2 9B — The Quality Upgrade
Google's open-weight model. Better reasoning than 7B models, but roughly 30-40% slower because of the extra parameters. Worth it if you're doing document Q&A or RAG where quality trumps speed.
CodeLlama / Qwen 2.5 Coder — For Internal Dev Tools
If you're building an internal code-completion bot for a small team, qwen2.5-coder:7b or codellama:13b are both excellent. For a 3-5 person team with non-frantic usage, a DD Large VPS handles it comfortably.
Quantization 101 (and Why You Want Q4_K_M)
Quantization means storing model weights in fewer bits. The original weights are 16-bit floats. Q4 means 4-bit integers—a 4x reduction in RAM at the cost of a small quality drop.
| Quant | Bits/weight | Quality loss | When to use |
|---|---|---|---|
| Q8_0 | ~8 | Negligible | You have RAM to burn and want max quality |
| Q4_K_M (default) | ~4.5 | Very small | Sweet spot. Pick this. |
| Q3_K_M | ~3.5 | Noticeable | You're squeezing onto smaller hardware |
| Q2_K | ~2.5 | Significant | Emergency size reduction only |
Ollama defaults to Q4_K_M when you run ollama pull qwen2.5:7b. You'd have to specifically request a different quant (e.g. qwen2.5:7b-instruct-q8_0). For 95% of use cases, don't change it.
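The RAM arithmetic behind quantization is simple enough to sketch. This counts the weights alone; the KV cache, context buffers, and the OS explain why the sizing table above quotes higher totals:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB."""
    return round(params_billion * bits_per_weight / 8, 1)

print(weights_gb(7, 16))   # 14.0 GB as raw fp16
print(weights_gb(7, 4.5))  # 3.9 GB at Q4_K_M, the roughly 4x reduction
```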
How Does This Compare to GPT-4 and Cloud APIs?
Time for some uncomfortable honesty. A local 7B model is not GPT-4. It's not even GPT-4o-mini. On hard reasoning tasks, code generation of novel architectures, long-context synthesis, or genuinely creative writing, frontier models crush local ones. If your product depends on model quality, self-hosting a 7B will disappoint you.
However, for a huge class of tasks—summarising an email into three bullets, extracting structured data, classifying support tickets, rewriting marketing copy to a target tone, answering questions over a small RAG corpus—a well-prompted Qwen 2.5 7B is ~90% as useful as GPT-4o-mini, at 0% of the privacy risk.
Cost Comparison (Real Numbers)
Let's assume a background job that processes 10,000 documents per day, each ~1,500 tokens in, ~300 tokens out.
| Path | Daily tokens | Monthly cost | Data leaves EU? |
|---|---|---|---|
| OpenAI GPT-4o-mini | 15M in / 3M out | ~$100/mo | Yes (US) |
| Anthropic Haiku | 15M in / 3M out | ~$50/mo | Yes (US) |
| DanubeData DD Large + Qwen 2.5 7B | unlimited | €49.99/mo (fixed) | No (Germany) |
Honest verdict: OpenAI's cheapest model often wins on pure cost until you care about privacy, predictable billing, or you're running massive throughput. If any of those three apply, self-host.
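You can run the break-even math for your own workload. The per-million-token prices below are hypothetical placeholders; plug in whatever your provider currently lists:

```python
def monthly_api_cost(docs_per_day: int, tokens_in: int, tokens_out: int,
                     usd_per_m_in: float, usd_per_m_out: float,
                     days: int = 30) -> float:
    """Metered monthly cost of a batch pipeline, given per-million-token prices."""
    m_in = docs_per_day * tokens_in * days / 1e6    # millions of input tokens
    m_out = docs_per_day * tokens_out * days / 1e6  # millions of output tokens
    return round(m_in * usd_per_m_in + m_out * usd_per_m_out, 2)

# The article's workload at hypothetical $0.15 / $0.60 per million tokens:
print(monthly_api_cost(10_000, 1500, 300, 0.15, 0.60))  # 121.5
```

Compare the result against the fixed €49.99/month and the decision usually makes itself.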
Step-by-Step: Provision a VPS and Install Ollama
We'll do this on a fresh DD Large (16 vCPU / 32 GB RAM) running Ubuntu 24.04.
Step 1: Provision the VPS
- Go to danubedata.ro/vps/create
- Pick DD Large (€49.99/mo) — 16 vCPU, 32 GB RAM, 160 GB NVMe
- OS: Ubuntu 24.04 LTS
- Region: Falkenstein (fsn1)
- Add your SSH key and click Create
New accounts get a €50 signup credit, which pays for the first month.
Step 2: Initial Server Hardening
ssh root@YOUR_VPS_IP
# Update and install basics
apt update && apt upgrade -y
apt install -y curl ufw fail2ban
# Firewall: SSH + HTTP/HTTPS only (Ollama itself stays internal)
ufw allow OpenSSH
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
# No need to create a service user manually: the Ollama installer
# in the next step creates its own 'ollama' system user.
Step 3: Install Ollama (One Command)
curl -fsSL https://ollama.com/install.sh | sh
That's it. The installer drops a Go binary at /usr/local/bin/ollama, creates an ollama system user, and installs a systemd service that starts on boot and listens on 127.0.0.1:11434.
# Verify
systemctl status ollama
# active (running)
curl http://127.0.0.1:11434/
# Ollama is running
Step 4: Pull Your First Model
# Download Qwen 2.5 7B (around 4.7 GB)
ollama pull qwen2.5:7b
# Or a smaller model to test
ollama pull llama3.2:3b
# List what's installed
ollama list
Models are cached under /usr/share/ollama/.ollama/models/. On DD Large you have 160 GB NVMe, so you can comfortably keep several models around.
Step 5: Test Inference from the CLI
ollama run qwen2.5:7b "Summarise in three bullet points: ..."
On the first run you'll see the model load into RAM (takes 10-30 seconds). Subsequent runs are instant because the model stays warm for 5 minutes by default (configurable via OLLAMA_KEEP_ALIVE).
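keep_alive can also be set per request in the API body, overriding the server default for that one call. A sketch that builds the JSON payload only (no HTTP here; the keep_alive values shown are the durations Ollama's API accepts):

```python
import json

def generate_payload(model: str, prompt: str, keep_alive: str = "30m") -> str:
    """Body for POST /api/generate. keep_alive accepts durations like "10m",
    "0" (unload immediately), or -1 (keep loaded indefinitely)."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    })

print(generate_payload("qwen2.5:7b", "Say hi.", keep_alive="0"))
```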
Step 6: Tune the systemd Service
We want three things: bind to localhost, use all available CPU threads, and keep the model warm longer for interactive use.
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_KEEP_ALIVE=30m"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF
systemctl daemon-reload
systemctl restart ollama
OLLAMA_NUM_PARALLEL=2 lets Ollama serve two concurrent requests. The model weights are shared between slots, but each parallel request gets its own context allocation, so RAM use grows and per-request speed drops. If you have only one concurrent user, leave it at 1 for max speed per request.
Step 7: Expose the API Safely with Caddy + Basic Auth
Ollama has no authentication. Never, under any circumstances, bind it to a public IP without a reverse proxy in front that handles auth. We'll use Caddy.
# Install Caddy
apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' |
gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' |
tee /etc/apt/sources.list.d/caddy-stable.list
apt update && apt install -y caddy
# Generate a bcrypt hash for your API password
caddy hash-password --plaintext 'your-strong-password-here'
# $2a$14$xxxxxxxxxxxxxxxxxxxxxx... — copy this
# /etc/caddy/Caddyfile
llm.yourdomain.com {
basic_auth {
apiuser $2a$14$THE_HASH_YOU_COPIED
}
reverse_proxy 127.0.0.1:11434 {
# Ollama streams responses — don't buffer
flush_interval -1
}
}
systemctl restart caddy
Caddy handles TLS via Let's Encrypt automatically. Point your DNS llm.yourdomain.com A record at the VPS; once the record resolves, Caddy obtains a certificate and serves HTTPS on its own.
Step 8: Use the API
Native Ollama Endpoint
curl -u apiuser:your-strong-password-here \
  https://llm.yourdomain.com/api/generate \
  -d '{
    "model": "qwen2.5:7b",
    "prompt": "Write a one-line GDPR compliance tagline.",
    "stream": false
  }'
OpenAI-Compatible Endpoint (the magic)
Ollama exposes /v1/chat/completions that speaks OpenAI's API dialect. Any existing SDK that targets OpenAI will work by swapping the base_url.
# Python with the official OpenAI SDK
import base64

from openai import OpenAI

auth = base64.b64encode(b"apiuser:your-strong-password-here").decode()

client = OpenAI(
    base_url="https://llm.yourdomain.com/v1",
    api_key="not-used-but-required-by-sdk",
    default_headers={"Authorization": f"Basic {auth}"},
)

response = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise GDPR Article 28 in three bullets."},
    ],
)
print(response.choices[0].message.content)
Yes, the Basic-auth-as-header trick is ugly. In production, terminate Basic auth at the Caddy layer and have the app talk to an internal endpoint over a private network, or put a token-based auth proxy in front. For prototypes this is fine.
Step 9: Add a Chat UI with Open WebUI
Open WebUI is a ChatGPT-style web interface that points at any Ollama server. Multi-user, conversations saved, model switcher, RAG built in.
# Run it with Docker, pointed at your local Ollama
docker run -d --name open-webui
-p 127.0.0.1:3000:8080
-e OLLAMA_BASE_URL=http://host.docker.internal:11434
-v open-webui:/app/backend/data
--add-host=host.docker.internal:host-gateway
--restart always
ghcr.io/open-webui/open-webui:main
Add another Caddy block to expose it behind your domain:
chat.yourdomain.com {
reverse_proxy 127.0.0.1:3000
}
First user to register becomes the admin. Turn off open signups in the admin panel immediately.
Real-World Use Cases That Actually Work
Here's what I've seen succeed on a CPU-only Ollama deployment:
- Internal knowledge-base chatbot for a ~20-person team. RAG with pgvector, Qwen 2.5 7B, Open WebUI. Response time: 3-8 seconds, acceptable for an internal tool even if it's not snappy.
- Support ticket classification and routing. Batch job runs every 5 minutes, processes ~500 tickets, tags them into 12 categories. Zero user-facing latency. Zero data leaving the company.
- Document summarisation. Long PDFs in, 5-bullet summaries out. Queue-driven, runs overnight. Nobody cares that it's 8 tok/sec.
- Code completion for a small team. qwen2.5-coder:7b behind a self-hosted Continue.dev extension. Works for 3-5 developers. Latency noticeable but usable for small completions, painful for whole-function generation.
- Meeting transcript analysis. Whisper transcribes, Ollama extracts action items. End-to-end privacy-preserving pipeline.
- Content moderation and PII redaction. Classifier runs before any data gets stored. CPU inference is fine because a classification verdict is only a handful of output tokens, so even a modest tok/sec rate returns quickly.
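The ticket-classification use case above is mostly plumbing around two pure pieces: building the request and normalising the reply. A sketch with hypothetical category names and the guide's default model:

```python
import json

# Hypothetical category set; replace with your own taxonomy.
CATEGORIES = ["billing", "bug", "feature-request", "account", "other"]

def classification_payload(ticket_text: str) -> str:
    """Build a /api/chat request body that forces a one-word category answer."""
    return json.dumps({
        "model": "qwen2.5:7b",
        "stream": False,
        "messages": [
            {"role": "system",
             "content": "Classify the support ticket into exactly one of: "
                        + ", ".join(CATEGORIES) + ". Reply with the category only."},
            {"role": "user", "content": ticket_text},
        ],
    })

def parse_category(model_reply: str) -> str:
    """Normalise the model's reply; fall back to 'other' on anything unexpected."""
    cleaned = model_reply.strip().lower().rstrip(".")
    return cleaned if cleaned in CATEGORIES else "other"
```

Small models occasionally ignore format instructions, which is why the defensive parse_category fallback matters more here than with frontier models.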
When NOT to Self-Host
I promised honesty, so here are the cases where self-hosting is the wrong call:
- You need 70B+ quality. A 70B model on CPU gives you ~1 tok/sec, which is unusable. Rent a GPU server, or pay OpenAI/Anthropic.
- You need sub-second latency. A chat UI with <1 second first-token is hard to deliver from CPU inference. If your product is a real-time voice assistant, this is not the stack.
- You have high concurrency. Each inference request saturates available CPU cores. Three simultaneous 7B users on DD Large and everyone's waiting. Above ~5 concurrent users, costs stop making sense vs. cloud.
- You need frontier model capabilities. Complex multi-step reasoning, novel code architectures, long-context synthesis of 50-page documents. Use GPT-4, Claude Sonnet, or Gemini Pro.
- You're a team of one and your time is expensive. The €49.99 is only part of the cost. Operating the stack, tuning prompts for smaller models, and monitoring it costs engineering hours. If you're not hitting €200+/month in API bills, just pay OpenAI.
Integrating with RAG
Ollama on its own is just the brain. For most real applications you want retrieval-augmented generation: find the relevant document chunks, stuff them into the prompt, ask the model. If you're building this on DanubeData, we have a separate guide on running pgvector on a managed PostgreSQL instance—pair that with Ollama and you have a full, European-hosted RAG stack.
The quick version: embed your documents with Ollama's nomic-embed-text or mxbai-embed-large model, store them in pgvector, query by cosine similarity, pass the top-k chunks to Qwen 2.5 as context. Ollama serves both the embeddings and the chat model, so you have a single HTTP endpoint for your whole AI stack.
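The retrieval step is just vector math. A dependency-free sketch of what pgvector's cosine search does (in production the database does this in SQL with an index, which is far faster at scale):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float],
          chunks: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the k chunk texts most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy 2-d embeddings for illustration; real ones have hundreds of dimensions.
docs = [("GDPR article", [1.0, 0.0]), ("Cooking recipe", [0.0, 1.0])]
print(top_k([0.9, 0.1], docs, k=1))  # ['GDPR article']
```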
Caching Models with Object Storage
If you run multiple Ollama VPS instances (for redundancy or horizontal scaling), save the model cache to DanubeData Object Storage (€3.99/month, S3-compatible) and pull it on each node instead of re-downloading from Ollama's registry. For a 7B model that's 4.7 GB—saves 15-30 minutes on every new VPS provision.
# Upload once
rclone copy /usr/share/ollama/.ollama/models/ \
  danubedata-s3:ollama-models/
# On a new VPS after installing Ollama
systemctl stop ollama
rclone copy danubedata-s3:ollama-models/ \
  /usr/share/ollama/.ollama/models/
systemctl start ollama
Monitoring and Operations
# How much RAM is Ollama using?
systemctl status ollama
ps aux | grep ollama
# Which models are loaded right now?
curl -s http://127.0.0.1:11434/api/ps | jq
# Journal logs
journalctl -u ollama -f
# If you run out of RAM mid-inference, Ollama will OOM-kill
# the model. Watch for it:
dmesg | grep -i "killed process"
Ollama doesn't ship native Prometheus metrics yet (as of early 2026). For proper observability, scrape /api/ps periodically with a small custom exporter, or put a metrics-emitting proxy in front of the API.
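For that /api/ps scrape, a small parser is enough. This assumes the response is a JSON object with a "models" list carrying "name" and "size" in bytes, which matches current Ollama but is worth re-checking against your version:

```python
import json

def loaded_models(api_ps_body: str) -> list[str]:
    """Summarise /api/ps output as 'name (x.x GB)' strings."""
    data = json.loads(api_ps_body)
    return [f"{m['name']} ({m['size'] / 1e9:.1f} GB)"
            for m in data.get("models", [])]

sample = '{"models": [{"name": "qwen2.5:7b", "size": 4700000000}]}'
print(loaded_models(sample))  # ['qwen2.5:7b (4.7 GB)']
```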
Frequently Asked Questions
How fast is CPU inference, really?
For a 7B model at Q4 on DD Large (AMD EPYC, 16 shared vCPU), expect 5-10 tokens/sec sustained, which works out to roughly 4-7 words per second. Fine for background jobs, usable-but-slow for chat. A 3B model hits 15-25 tok/sec, which feels snappy in a UI.
Can I run Llama 70B on CPU?
Technically yes, realistically no. You'd need ~42 GB RAM for Q4_K_M and would see ~1 tok/sec. A 500-word response takes 10+ minutes. For 70B use a GPU server. DanubeData does not currently offer GPU VPS—pair us with a GPU provider if you truly need frontier open models.
Does DanubeData have GPU VPS instances?
Not yet. All our VPS run on AMD EPYC CPUs in Falkenstein, Germany. GPU offerings are on our roadmap. For now, if you need GPU inference, consider using DanubeData for the data plane (Postgres, S3, queues, frontend) and renting a dedicated GPU server elsewhere.
What's the difference between Q4, Q5, Q8 quantization?
Fewer bits = smaller file and less RAM, at the cost of quality. Q4_K_M is the default in Ollama and the right choice 95% of the time. Q8 gives you near-original quality at ~2x the RAM. Q2/Q3 are only worth it if you're desperately trying to squeeze onto smaller hardware—quality drops become noticeable.
Can I fine-tune models on a DanubeData VPS?
Fine-tuning even a 7B model properly requires GPU. On CPU, a LoRA fine-tune on a 3B model is feasible but will take days. Most teams are better off using prompt engineering, few-shot examples, or a RAG layer rather than fine-tuning. If you truly need a fine-tune, rent a GPU for a day, train, then deploy the resulting GGUF model on Ollama.
How do I integrate Ollama with a RAG pipeline?
Use Ollama for both the embedding model (nomic-embed-text or mxbai-embed-large) and the chat model (qwen2.5:7b). Store embeddings in pgvector on a DanubeData managed PostgreSQL. Retrieve top-k chunks, pass them as context to the chat model. The whole stack stays in your DanubeData tenant, no external APIs.
How private is "self-hosted" really?
Your inference traffic, prompts, and completions never leave the VPS. DanubeData runs the physical server but has no access to your VM's memory or disk contents in day-to-day operations. For even stronger guarantees, encrypt your model cache at rest with dm-crypt and require signed SSH access only. GDPR: as the controller, you still need a DPA with DanubeData as the processor—we provide one.
Which model should I start with?
If you have DD Medium: llama3.2:3b for speed, phi3.5:3.8b for quality.
If you have DD Large: qwen2.5:7b is my unambiguous default in 2026. It's multilingual, fast-ish, and has excellent instruction following. Try gemma2:9b if you need a quality bump and can tolerate 30% more latency.
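The FAQ's decision logic, condensed into a routing helper. The RAM threshold and model tags mirror this guide's recommendations, not any official sizing rule; adjust to taste:

```python
def pick_model(ram_gb: int, prefer_quality: bool = False) -> str:
    """Map available RAM to this guide's recommended `ollama pull` tag."""
    if ram_gb < 16:  # DD Medium territory
        return "phi3.5:3.8b" if prefer_quality else "llama3.2:3b"
    if prefer_quality:
        return "gemma2:9b"   # better reasoning, ~30% more latency
    return "qwen2.5:7b"      # the default on DD Large

print(pick_model(32))                      # qwen2.5:7b
print(pick_model(8, prefer_quality=True))  # phi3.5:3.8b
```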
Ready to Run Your Own LLM?
Self-hosting a capable LLM in Europe is no longer exotic—it's a one-hour task. If your workload is privacy-sensitive, batch-heavy, or needs predictable cost, the economics and engineering are both finally on your side in 2026.
Recommended starting setup:
- DanubeData DD Large (€49.99/mo) — 16 vCPU / 32 GB RAM / 160 GB NVMe in Falkenstein, Germany
- Qwen 2.5 7B at Q4_K_M via Ollama
- Caddy reverse proxy with Basic auth + Let's Encrypt TLS
- Optional: Open WebUI for a chat frontend, pgvector for RAG, Object Storage (€3.99/mo) for model cache
New accounts get €50 signup credit—your first month of DD Large is effectively free.
👉 Create a DD Large VPS
👉 Read more DanubeData tutorials
👉 Talk to us about your AI stack
Running something interesting on Ollama + DanubeData? We'd love to hear about it—reply to any of our emails or drop us a line at the contact page.