
Self-Hosted AI Coding Assistants: Copilot Alternatives for Privacy-First Teams (2026)

Adrian Silaghi
April 20, 2026
14 min read
#ai-coding #self-hosted #copilot-alternative #continue #tabby #ollama #privacy #europe

Your source code is the single most concentrated piece of intellectual property your company owns. It encodes product strategy, architecture, security controls, customer-specific logic, and — increasingly — the prompts and fine-tuning recipes that differentiate your AI features. Yet most teams casually hand the entire codebase to a US-hosted large language model every time a developer hits Tab.

GitHub Copilot, Cursor, and cloud-hosted Codeium all send snippets of your private code to Azure OpenAI or equivalent US infrastructure for inference. That may be fine for a hobby project. It is not fine for a regulated European business post-Schrems II, for a defense contractor, for a company doing patent-sensitive work, or for anyone whose customers contractually forbid third-party code exfiltration.

This guide is a pragmatic, non-fanatical walkthrough of self-hosting AI coding assistants in 2026. We will cover the real options, the honest limitations of CPU-only inference, a concrete reference architecture you can deploy on a single European VPS, and the scenarios where self-hosting genuinely makes sense — and the ones where it does not.

Why Self-Host Coding AI in 2026?

Let us get the marketing out of the way first. Self-hosting is not automatically better. It has costs: hardware, operational work, slower inference on CPU, worse out-of-the-box completion quality than GitHub Copilot. You accept those trade-offs for four tangible benefits.

1. Your Code Is Your IP

Copilot's telemetry and prompt-construction pipeline sends the active file, neighboring open tabs, and project context to Microsoft's backend. Microsoft has published enterprise data-handling commitments, and they are genuinely good — but "we promise not to look" is not the same as "the data never leaves the building." For a fintech working on a novel risk model, or a biotech training on proprietary datasets, the difference matters.

2. Schrems II and GDPR Realities

The European Court of Justice invalidated the EU-US Privacy Shield in 2020 (Schrems II). The 2023 EU-US Data Privacy Framework restored a transfer mechanism, but it is under active legal challenge and many Data Protection Officers still flag US cloud AI as a transfer risk. Self-hosting on infrastructure in Germany eliminates the cross-border transfer question entirely — there is no transfer.

3. Cost Predictability at Team Scale

Copilot Business is $19 per user per month. A ten-person engineering team pays $190/month; fifty people pay $950/month; and pricing scales linearly forever. A self-hosted Tabby or Continue.dev + Ollama stack on a single DanubeData DD Large VPS (€49.99/month) serves an unlimited number of developers. Past about three seats, the math starts favoring self-hosting; past ten, it is not close.

4. Customization and Repository Awareness

Self-hosted assistants can be pointed at your internal documentation, fine-tuned on your codebase style, or wired up to your issue tracker. Tabby and Cody both offer repository-aware retrieval. Copilot offers "Copilot Enterprise" with GitHub.com-hosted repo context, but it requires your code to live on github.com — a non-starter for many regulated teams using self-hosted GitLab, Gitea, or Bitbucket Data Center.

The Honest Limitation: CPU Inference Is Slow

This section matters more than anything else in this post. If you take one thing away, make it this.

There are two fundamentally different AI coding workflows:

  • Interactive chat: "Explain this function", "write a unit test for this", "refactor this to use early returns". Latency budget: a few seconds is fine.
  • Inline Fill-In-the-Middle (FIM) completion: The grey-italic ghost text that appears as you type, sometimes called tab-completion. Latency budget: under 500ms, ideally under 300ms. Anything slower and developers turn it off because it disrupts flow.

CPU inference on a 7B parameter coding model using a quantized build (GGUF Q4_K_M) produces roughly 8-15 tokens per second on a modern 16-vCPU VPS. That is fast enough to feel conversational for chat. It is nowhere near fast enough for inline completion, where you need the first token in 200-300ms and a short completion finished in under a second.
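The gap is easy to quantify. A rough sketch, taking 10 tok/s as the midpoint of the range above and a 300ms ghost-text budget:

```shell
# Rough latency math for CPU inference: a typical 40-token inline
# completion at 10 tokens/sec takes ~4s, versus a ~0.3s budget.
tokens=40
rate=10          # tokens per second on a 16-vCPU VPS (midpoint estimate)
budget=0.3       # seconds developers tolerate for ghost text
awk -v t="$tokens" -v r="$rate" -v b="$budget" 'BEGIN {
  secs = t / r
  printf "completion: %.1fs, budget: %.1fs, overshoot: %.0fx\n", secs, b, secs / b
}'
```

Four seconds per ghost-text suggestion is more than an order of magnitude over budget, and no amount of tuning closes that gap on CPU.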

What you need for real-time inline completion is a GPU — specifically, a consumer GPU with 8-24GB of VRAM running a small (1.5-3B) code model. RTX 3060 12GB, RTX 4060 Ti 16GB, or a datacenter A10/L4 all work. DanubeData does not currently offer GPU VPS, so for inline completion you have three options:

  1. Run the GPU at your office and tunnel to it (works for on-site teams).
  2. Rent a GPU-equipped VPS from a European GPU provider and co-locate the chat stack on DanubeData.
  3. Accept that self-hosted inline completion is not yet viable and use self-hosted chat-style assistance for sensitive work, while leaving inline completion to Copilot on non-sensitive repos — or disabling it entirely.

The pragmatic 2026 stance taken by many European teams: use Copilot on private Azure OpenAI EU (stays in-region but is still a US vendor) or a similar EU-hosted service for inline completion on ordinary work, and fully self-host a chat assistant for sensitive repositories. Purism is tempting; shipped product matters more.

The Self-Hosted AI Coding Landscape

There are six options worth knowing about in 2026. We will walk through each with a clear verdict on when it fits.

1. Continue.dev (VSCode + JetBrains extension)

Continue.dev is an open-source IDE extension that acts as a client for any LLM backend. It supports chat, inline edit, code explanation, and commit message generation, and it can talk to Ollama, vLLM, OpenAI-compatible endpoints, Anthropic, or any HTTP model server. It does not ship a backend — you bring your own. This is a huge advantage: you get a well-maintained IDE UX on top of whatever inference stack fits your policy.

Best for: Individual developers and small teams who want full control over the model backend and are comfortable managing Ollama or vLLM separately.

2. Tabby

Tabby is a full-stack self-hosted coding assistant: a Rust-based server that hosts models, exposes a web UI for chat and repository search, and ships plugins for VSCode and JetBrains. It has native Fill-In-the-Middle support, repository indexing, team admin, and SSO. It is the closest thing to a drop-in Copilot replacement you can run on your own hardware.

Best for: Teams that want a single product with a web UI, user management, and repository awareness out of the box.

3. Twinny

Twinny is a free, open-source VSCode extension that connects to an Ollama or OpenAI-compatible backend for both chat and FIM completion. It is lighter-weight than Continue.dev and focuses specifically on the Copilot-style experience.

Best for: Solo developers who want a minimalist Copilot clone and do not need JetBrains support.

4. Codeium Enterprise (self-hosted)

Codeium offers an enterprise tier that lets you run their inference and retrieval stack on your own infrastructure. It is a commercial product with a polished UX and strong IDE coverage, but it is not open source and requires a sales conversation.

Best for: Large enterprises that want a vendor-supported, SLA-backed product and can write a six-figure PO.

5. Cody by Sourcegraph (self-hosted)

Cody is Sourcegraph's AI coding assistant. It is deeply integrated with Sourcegraph's code search and graph, which makes it exceptional at repository-aware questions — "where else do we handle retries?", "show me all uses of this interface". The self-hosted tier requires a Sourcegraph Enterprise license.

Best for: Organizations already running Sourcegraph who want to extend it with AI.

6. llama.cpp + llama.vim / llama.vscode

The llama.cpp project ships lightweight IDE plugins (llama.vim for Vim/Neovim, and a VSCode extension) that talk to a running llama.cpp server. It is bare-bones, highly efficient, and ideal if you like composing your own stack. Expect to do more wiring yourself.

Best for: Vim/Neovim users and hackers who prefer minimal, composable pieces.

Which Code Model Should You Run?

The model matters as much as the runtime. In 2026, the four code-specialized families worth considering are:

| Model Family | Sizes | FIM Support | Strengths | Notes |
|---|---|---|---|---|
| Qwen 2.5 Coder | 1.5B / 3B / 7B / 14B / 32B | Yes | Best all-rounder in 2026; strong chat + FIM; excellent multilingual code | Apache 2.0, recommended default |
| DeepSeek Coder V2 Lite | 16B MoE (2.4B active) | Yes | Strong reasoning for a self-hosted model; good math/algorithms | MoE architecture needs more RAM but runs faster than dense 16B |
| StarCoder2 | 3B / 7B / 15B | Yes | Permissive license (BigCode); trained on The Stack v2 | Solid FIM, weaker chat than Qwen |
| CodeLlama | 7B / 13B / 34B / 70B | Partial (7B/13B only) | Mature ecosystem, well-documented | Falling behind in 2026; Qwen is usually a better pick |

Recommendation for CPU-only inference: qwen2.5-coder:7b in Q4_K_M quantization. It fits comfortably in 8GB RAM, streams at 10-15 tok/s on 16 vCPUs, supports FIM if you ever move to GPU, and gives the best chat quality in that class.

Recommendation for CPU with 32GB+ RAM: qwen2.5-coder:14b in Q4_K_M. Slower (5-8 tok/s) but noticeably better at architectural questions and refactoring tasks.

For inline completion (requires GPU): qwen2.5-coder:1.5b or starcoder2:3b, both of which hit sub-300ms first-token latency on consumer GPUs.

Comparison: Cloud Copilots vs Self-Hosted Options

| Product | Data Path | Cost (10 devs) | Inline Completion | Chat | Repo Awareness |
|---|---|---|---|---|---|
| GitHub Copilot Business | US Azure OpenAI | $190/mo | Excellent | Excellent | Enterprise tier only |
| Cursor | US (Anthropic/OpenAI) | $200/mo | Excellent | Excellent (agent mode) | Strong |
| Tabby self-hosted (CPU) | Your VPS, EU | €49.99/mo flat | Not viable on CPU | Good | Built-in indexing |
| Continue.dev + Ollama (CPU) | Your VPS, EU | €49.99/mo flat | Not viable on CPU | Good | Limited (config-driven) |
| Codeium Enterprise (self-hosted) | Your infra | Contact sales | Excellent (with GPU) | Excellent | Strong |

The table tells the story: if your main concern is data residency and you are willing to give up best-in-class inline completion, a Tabby or Continue.dev + Ollama stack on a single European VPS replaces Copilot Chat for an entire team at a fraction of the cost. That is the architecture we will build next.

Reference Architecture: Continue.dev + Ollama + Tabby on DanubeData

Here is the stack we recommend for a team of 5-50 developers that needs GDPR-clean, chat-style AI coding help on sensitive repositories:

  • VPS: DanubeData DD Large (16 vCPU / 32GB RAM) at €49.99/month, in Falkenstein, Germany.
  • OS: Ubuntu 24.04 LTS.
  • Inference engine: Ollama (simple) or llama.cpp server (more control).
  • Model: qwen2.5-coder:7b Q4_K_M as the default; qwen2.5-coder:14b for heavier teams that tolerate 5-8 tok/s.
  • Team chat UI: Tabby (optional, adds web UI + repository search).
  • IDE extension: Continue.dev in every developer's VSCode or JetBrains.
  • Public entry point: Caddy reverse proxy with HTTPS and basic auth for chat access; tailnet (Tailscale/Headscale) for developer IDE traffic.
  • Persistence: Managed Postgres at €19.99/month for Tabby's relational data (users, chat history, settings).

Monthly cost: €49.99 + €19.99 = €69.98 for unlimited developers, EU-resident data. Compare to $190/month Copilot Business for ten developers.

Step-by-Step: Deploying the Stack

Step 1: Provision a DD Large VPS

  1. Create a VPS on DanubeData.
  2. Choose the DD Large plan (16 vCPU / 32GB / 160GB NVMe).
  3. Select Ubuntu 24.04 LTS.
  4. Note the IPv4 and IPv6 addresses.

Step 2: Base System Setup

# SSH in
ssh root@YOUR_SERVER_IP

# Update
apt update && apt upgrade -y

# Essentials
apt install -y curl wget git ufw htop tmux

# Firewall: allow SSH, HTTP, and HTTPS only (Ollama stays bound to loopback)
ufw allow OpenSSH
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable

# Hostname
hostnamectl set-hostname dd-coding-ai

Step 3: Install Ollama

Ollama is the quickest path to a working local model server. It is a single Go binary that handles model downloads, quantization selection, and exposes an OpenAI-compatible HTTP API on port 11434.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify service
systemctl status ollama

# Bind only to localhost (we will proxy via Caddy)
# Edit the systemd unit:
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=2"
EOF

systemctl daemon-reload
systemctl restart ollama

Step 4: Pull the Coding Model

# 7B model, ~4.4GB download, fits comfortably in 8GB RAM
ollama pull qwen2.5-coder:7b

# Optional: also pull a smaller model for quick completions
ollama pull qwen2.5-coder:1.5b

# Verify
ollama list
ollama run qwen2.5-coder:7b "Write a Python function that reverses a string."

The first interactive run loads the model into RAM and caches it. Subsequent requests reuse the warm model.
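Before wiring up any IDE, it is worth confirming the OpenAI-compatible API answers. A quick smoke test from the VPS itself, assuming Ollama is bound to 127.0.0.1:11434 as configured in Step 3:

```shell
# Smoke-test Ollama's OpenAI-compatible chat endpoint locally.
payload='{
  "model": "qwen2.5-coder:7b",
  "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
  "stream": false
}'

# Confirm the request body is valid JSON before sending
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"

curl -s http://127.0.0.1:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$payload"
```

A JSON response with a `choices` array means the server is ready for Continue.dev and Tabby to talk to.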

Step 5: Install Caddy for HTTPS + Basic Auth

# Install Caddy
apt install -y debian-keyring debian-archive-keyring apt-transport-https
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | tee /etc/apt/sources.list.d/caddy-stable.list
apt update
apt install -y caddy

# Generate a bcrypt password hash for basic auth
caddy hash-password --plaintext 'YOUR_STRONG_PASSWORD'
# Copy the output hash

# Create Caddyfile
cat > /etc/caddy/Caddyfile <<EOF
ai.yourdomain.com {
    # Basic auth gate for anyone hitting the chat endpoint directly
    basic_auth {
        yourteam JDJhJDE0JEhhc2hPZllvdXJQYXNzd29yZA...
    }

    # Route Ollama OpenAI-compatible API to localhost:11434
    reverse_proxy /v1/* 127.0.0.1:11434
    reverse_proxy /api/* 127.0.0.1:11434

    # Security headers
    header {
        X-Content-Type-Options nosniff
        X-Frame-Options DENY
        Referrer-Policy strict-origin-when-cross-origin
    }
}
EOF

systemctl reload caddy

Point your DNS at the server, wait for propagation, and Caddy will grab a Let's Encrypt certificate automatically.

Step 6: Install Continue.dev in VSCode

On each developer's machine:

  1. Open VSCode and install the Continue extension from the marketplace.
  2. Open the Continue settings (~/.continue/config.json).
  3. Replace the default config with:
{
  "models": [
    {
      "title": "Qwen 2.5 Coder 7B (DanubeData)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "https://ai.yourdomain.com",
      "requestOptions": {
        "headers": {
          "Authorization": "Basic BASE64_OF_user:password"
        }
      }
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 Coder 1.5B",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b",
    "apiBase": "https://ai.yourdomain.com",
    "requestOptions": {
      "headers": {
        "Authorization": "Basic BASE64_OF_user:password"
      }
    }
  },
  "allowAnonymousTelemetry": false,
  "customCommands": [
    {
      "name": "review",
      "prompt": "Review this code for correctness, clarity, and potential bugs. Be specific and concise."
    }
  ]
}

Continue uses tabAutocompleteModel for inline FIM completion and the main models entry for chat. On CPU-only hosting, we recommend leaving the tabAutocompleteModel block out of the config entirely (the example above includes it for teams that later add a GPU) and relying on chat-triggered edits.
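The BASE64_OF_user:password placeholder in the config is the standard HTTP Basic scheme: base64 of `user:password`. One way to generate it, using the same credentials you hashed into the Caddyfile:

```shell
# Build the Authorization header value for Continue's requestOptions.
# Replace both values with the user/password pair configured in Caddy.
user="yourteam"
pass="YOUR_STRONG_PASSWORD"
token=$(printf '%s:%s' "$user" "$pass" | base64)
echo "Authorization: Basic $token"
```

Paste everything after `Authorization: ` into the headers block of config.json.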

Step 7: Optional — Install Tabby for a Team Web UI

Continue.dev is great per-developer, but if you want a shared chat UI, repository search, and user admin, add Tabby:

# Create Tabby data directory
mkdir -p /opt/tabby
cd /opt/tabby

# docker-compose.yml
cat > docker-compose.yml <<EOF
services:
  tabby:
    image: tabbyml/tabby:latest
    restart: unless-stopped
    command: serve --model qwen2.5-coder/7b --chat-model qwen2.5-coder/7b --device cpu
    volumes:
      - tabby-data:/data
    ports:
      - "127.0.0.1:8080:8080"
    environment:
      - TABBY_DISABLE_USAGE_COLLECTION=1

volumes:
  tabby-data:
EOF

docker compose up -d

Then add a second Caddy route for tabby.yourdomain.com pointing at 127.0.0.1:8080. Tabby has its own user management, so you can remove basic auth for that hostname.
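That second route is a short addition to the Caddyfile from Step 5 (tabby.yourdomain.com is a placeholder hostname; point a DNS record at the same server first):

```shell
# Append a second site block so Tabby gets its own hostname with automatic TLS.
# Tabby handles its own login, so no basic_auth here.
cat >> /etc/caddy/Caddyfile <<EOF

tabby.yourdomain.com {
    reverse_proxy 127.0.0.1:8080
}
EOF

systemctl reload caddy
```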

Step 8: Secure Developer Access with Tailscale

Basic auth works for casual protection, but for serious deployments use a mesh VPN like Tailscale (or self-hosted Headscale). Install the Tailscale agent on the VPS, join your tailnet, and have developers connect via the private tailnet IP. You can then drop basic auth and bind Ollama to the tailnet interface only.
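A minimal sketch of that setup with Tailscale, reusing the systemd drop-in from Step 3 (the tailnet IP is discovered at runtime):

```shell
# Join the tailnet, then rebind Ollama from loopback to the tailnet IP
# so only enrolled devices can reach it.
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up

TS_IP=$(tailscale ip -4)    # e.g. 100.x.y.z

cat > /etc/systemd/system/ollama.service.d/override.conf <<EOF
[Service]
Environment="OLLAMA_HOST=${TS_IP}:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_NUM_PARALLEL=2"
EOF
systemctl daemon-reload && systemctl restart ollama
```

Developers then point Continue's apiBase at `http://<tailnet-ip>:11434` directly, with no basic auth and no public exposure.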

Performance Tuning for CPU Inference

Key levers for getting 10-15 tok/s out of Qwen 2.5 Coder 7B on a 16-vCPU VPS:

  • Q4_K_M quantization (the Ollama default) is the best quality/speed trade-off on CPU.
  • Pin OLLAMA_NUM_THREADS to physical core count, not vCPU count. 8-12 usually wins over 16 on shared cores.
  • Keep the model warm with OLLAMA_KEEP_ALIVE=24h. The cold-start penalty is ~20 seconds.
  • Watch RAM. 7B Q4 takes ~6GB; 14B Q4 takes ~10GB; plus 2-4GB for context windows.
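The keep-alive and thread settings above can be applied via the same systemd drop-in mechanism as Step 3; the values here are starting points, not universal answers:

```shell
# Tuning drop-in: keep the model resident and cap threads at physical cores.
cat > /etc/systemd/system/ollama.service.d/tuning.conf <<EOF
[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_NUM_THREADS=8"
EOF
systemctl daemon-reload && systemctl restart ollama
```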

Cost Math: Copilot Business vs Self-Hosted

| Team Size | Copilot Business | Self-Hosted (DD Large + Postgres) | Annual Savings |
|---|---|---|---|
| 5 devs | $95/mo ($1,140/yr) | €69.98/mo (~€840/yr) | ~$230/yr |
| 10 devs | $190/mo ($2,280/yr) | €69.98/mo (~€840/yr) | ~$1,370/yr |
| 25 devs | $475/mo ($5,700/yr) | €69.98/mo (~€840/yr) | ~$4,790/yr |
| 50 devs | $950/mo ($11,400/yr) | €69.98/mo (~€840/yr) | ~$10,490/yr |

The break-even is around 3-4 developers. Beyond that, self-hosting is cheaper by an order of magnitude. Beyond ten developers, you are effectively burning money on Copilot — and still shipping your code to the US.
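As a sanity check on the ten-developer row (the 1.09 USD/EUR rate is an assumption; plug in the current rate):

```shell
# Annual savings for a 10-developer team, self-hosted vs Copilot Business.
awk 'BEGIN {
  copilot    = 19 * 10 * 12         # $19/seat x 10 devs x 12 months
  selfhosted = 69.98 * 12 * 1.09    # EUR/month x 12, converted at 1.09 USD/EUR
  printf "annual savings: $%.0f\n", copilot - selfhosted
}'
```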

Security and Governance Checklist

Before rolling out to your team:

  • TLS on everything. Caddy handles this automatically.
  • Strong auth. Basic auth is the minimum; prefer OIDC via oauth2-proxy or Authentik for real deployments.
  • Private network for IDE traffic. Tailscale, Headscale, or WireGuard keeps developer machines off the public API.
  • Server-side request logging for incident response.
  • Disable telemetry in Ollama and Continue (allowAnonymousTelemetry: false).
  • Backup Postgres if you run Tabby — daily snapshots suffice.
  • Document the data flow. Your security policy should spell out that Continue.dev sends the active buffer only to your self-hosted endpoint.

When Self-Hosting Is NOT the Right Answer

Self-hosting is a bad fit for a few scenarios:

  • Solo developer on a non-sensitive project. Copilot at $10/month is probably the better deal, and its inline completion is sharper.
  • You need inline tab-completion and cannot afford a GPU. CPU is not fast enough in 2026.
  • You have no ops capability. A few hours a month of maintenance beats zero with a cloud product.
  • Your repos are already on GitHub.com. The marginal privacy gain from self-hosting AI while hosting code on GitHub is limited — fix the root cause first.

FAQ

Can a self-hosted stack completely replace GitHub Copilot?

For chat, code explanation, refactoring, test generation, and commit message drafting, yes — Qwen 2.5 Coder 14B running on a DD Large VPS is very close to GPT-4-class Copilot Chat quality in 2026. For real-time inline ghost-text completion, you need a GPU, which DanubeData does not currently offer. Many teams pair a self-hosted chat stack with Copilot on Azure OpenAI EU for inline completion, keeping the most sensitive repositories on the self-hosted path only.

Qwen 2.5 Coder vs DeepSeek Coder V2 — which is better?

Qwen 2.5 Coder 7B/14B is the stronger choice for most teams in 2026: better chat quality, tighter FIM performance, and smaller RAM footprint. DeepSeek Coder V2 Lite (MoE 16B, 2.4B active) has an edge on algorithmic and math-heavy problems and runs surprisingly fast given its parameter count, but needs ~14GB RAM and is harder to tune. Start with Qwen; swap to DeepSeek if you have a specific benchmark where it wins.

Is inline completion really not feasible on CPU?

Feasible is not the same as usable. A 1.5B model quantized to Q4 will technically return completions on CPU in 1-2 seconds. That is 3-5x the latency budget developers tolerate before they turn off ghost text. We have seen teams try it, swear it is fine for a week, and then quietly disable it. If you want inline completion, plan for GPU. If you only have CPU, lean into Continue.dev's chat-triggered edits instead.

How strong are the privacy guarantees vs Copilot Enterprise on Azure EU?

GitHub Copilot Enterprise on Azure OpenAI EU keeps data in a European Azure region and contractually commits that prompts are not used for training. That is materially better than default Copilot, but the data-controller is still Microsoft, the infrastructure is still Azure, and the US CLOUD Act remains a theoretical exposure under Schrems II discussions. A self-hosted stack on DanubeData has no third-party data processor at all. For highly regulated industries (defense, health, some financial services), that distinction matters; for most others, Azure EU is a reasonable compromise.

Can I fine-tune the model on my own codebase?

Yes, and it is one of the strongest arguments for self-hosting. LoRA fine-tuning on 10,000-100,000 internal code samples, using a tool like axolotl or unsloth on a rented GPU for a few hours, produces a model that understands your framework conventions, internal APIs, and naming style. You then merge the LoRA into the base model and serve the result from the same Ollama instance. Budget a GPU day (~€50 on-demand) for a first run. DanubeData's Managed Postgres is a good place to store training metadata and experiment tracking.

What about Retrieval-Augmented Generation (RAG) over our docs?

Tabby ships with code-repo indexing out of the box. For custom document RAG — wiki pages, Confluence, internal runbooks — the cleanest 2026 stack is pgvector on your Managed Postgres plus a small embedder (Ollama's nomic-embed-text works well), plus Continue.dev's context provider API. The whole thing runs on the same VPS.
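A minimal pgvector schema for that setup might look like this (table and column names are illustrative; nomic-embed-text emits 768-dimensional vectors, and the hnsw index requires pgvector 0.5+):

```shell
# Create an embeddings table on the Managed Postgres instance.
# $DATABASE_URL is your Managed Postgres connection string.
psql "$DATABASE_URL" <<'SQL'
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS doc_chunks (
    id        bigserial PRIMARY KEY,
    source    text NOT NULL,          -- wiki page, runbook path, etc.
    content   text NOT NULL,
    embedding vector(768)             -- nomic-embed-text dimension
);

-- Cosine-distance index for nearest-neighbour retrieval
CREATE INDEX IF NOT EXISTS doc_chunks_embedding_idx
    ON doc_chunks USING hnsw (embedding vector_cosine_ops);
SQL
```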

What if I outgrow a single VPS?

Two paths. Vertical: move to a dedicated-CPU plan for more consistent throughput. Horizontal: run Ollama on multiple VPSes behind a simple round-robin load balancer (or vLLM, which has much better batching than Ollama but is more operational work). Most teams never need this — a single DD Large with Qwen 2.5 Coder 7B comfortably serves 20+ concurrent chat users.
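The horizontal path can stay inside the Caddyfile from Step 5: reverse_proxy accepts multiple upstreams with a load-balancing policy. A sketch, with placeholder private-network IPs for two Ollama VPSes:

```
ai.yourdomain.com {
    reverse_proxy /v1/* 10.0.0.2:11434 10.0.0.3:11434 {
        lb_policy round_robin
    }
    reverse_proxy /api/* 10.0.0.2:11434 10.0.0.3:11434 {
        lb_policy round_robin
    }
}
```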

Can I use this with JetBrains IDEs (IntelliJ, PyCharm, GoLand)?

Yes. Continue.dev has first-class JetBrains plugins. The config file lives at ~/.continue/config.json regardless of IDE, so your team ships one configuration and it works everywhere. Tabby also has JetBrains plugins.

Get Started

Ready to bring your AI coding assistant in-house?

  1. Create a DD Large VPS on DanubeData (€49.99/month, Falkenstein Germany).
  2. Follow the step-by-step above: install Ollama, pull qwen2.5-coder:7b, wire up Caddy.
  3. Roll out Continue.dev to your team; optionally add Tabby for a shared web UI and repo search.
  4. Point your DPO at the deployment — they will thank you.

What you get with DanubeData: DD Large (16 vCPU / 32GB / 160GB NVMe) at €49.99/month comfortably runs Qwen 2.5 Coder 7B for a 20-person team. Managed Postgres at €19.99/month for Tabby or pgvector RAG. Falkenstein data center (GDPR-friendly, no US transfer). 20TB included traffic per VPS. €50 signup credit to test risk-free.

Thinking at organisation scale, or unsure whether you need chat-only or a full FIM stack? Talk to our team — we have deployed this architecture for several European engineering teams.

