Retrieval-Augmented Generation (RAG) has become the default pattern for building AI products on top of your own data. In 2024 every RAG tutorial shipped you off to Pinecone, Weaviate, or Qdrant — a dedicated, expensive, US-hosted vector database. In 2026 the smart move is simpler: use the Postgres you already have.
The pgvector extension is now mature, fast, and used in production by OpenAI, Supabase, Neon, and thousands of European teams that need their data to stay on European soil. On DanubeData Managed PostgreSQL from €19.99/mo, you get pgvector preinstalled, Germany-based storage, daily snapshots, and every Postgres feature you already know. No second database. No extra bill. No data leaving the EU.
This guide walks you through building a full production RAG stack on managed Postgres: schema design, embedding choice, HNSW indexing, a Python ingestion pipeline, a FastAPI retrieval endpoint with LLM synthesis, hybrid search, and the 2026 optimizations (Matryoshka embeddings, halfvec quantization) that let you scale to millions of chunks on a single database server.
What is RAG, in 90 seconds?
Large language models are brilliant generalists but terrible specialists. Ask GPT-4 about your internal HR policy and it will hallucinate. Ask it about a document you uploaded five minutes ago and it cannot see it.
RAG (Retrieval-Augmented Generation) fixes this by stapling a search engine onto the LLM. The flow is always the same four steps:
- Chunk — split your documents (Markdown, PDFs, web pages, tickets) into small overlapping passages of 200–800 tokens.
- Embed — turn each chunk into a high-dimensional vector (typically 384–3072 floats) using an embedding model like text-embedding-3-small. Semantically similar text ends up close together in vector space.
- Retrieve — when a user asks a question, embed their question with the same model, then find the k nearest chunks by cosine distance.
- Generate — stuff those retrieved chunks into the LLM prompt as context and let it answer.
The magic is step 3. Vector similarity search is what turns a dumb keyword lookup into "find me the paragraphs that mean roughly what the user just asked, even if the exact words are different." That is the job pgvector does inside Postgres.
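To make the idea concrete, here is nearest-neighbor retrieval in miniature, with hand-made 3-dimensional vectors standing in for real embeddings (real models produce hundreds to thousands of dimensions, but the distance math is identical):

```python
# Toy nearest-neighbor retrieval: cosine distance over hand-made 3-d "embeddings".
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm  # 0 = same direction, 2 = opposite

chunks = {
    "refund policy": [0.9, 0.1, 0.0],
    "api rate limits": [0.1, 0.9, 0.1],
    "office address": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend this embeds "how do refunds work?"

best = min(chunks, key=lambda k: cosine_distance(query, chunks[k]))
print(best)  # refund policy
```

The pgvector `<=>` operator computes exactly this distance, just over 1536 dimensions and with an index instead of a linear scan.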
Why pgvector (and not a dedicated vector DB)
Every team building RAG hits the same fork in the road: do I spin up Pinecone / Weaviate / Qdrant, or do I just add a vector column to my existing Postgres? In 2023 the answer leaned toward dedicated vector DBs. In 2026 pgvector has caught up on every metric that matters for <10M vector workloads — and crushes everyone on operational simplicity.
| Feature | pgvector on DanubeData | Pinecone Starter | Weaviate Cloud | Qdrant Cloud |
|---|---|---|---|---|
| Monthly price (≈1M vectors, 1536d) | €19.99 | $70+ (Starter) | $25+ (Serverless, usage-based) | $25+ (1GB tier) |
| Data residency | Germany (Falkenstein) | US / some EU regions | US / EU | US / EU |
| GDPR-native | Yes (EU provider, DPA included) | Via DPA, US controller | Via DPA | Via DPA |
| Transactional joins (vector + SQL) | Native, single query | No — separate store | Limited | Limited payload filters |
| Hybrid search (vector + BM25) | Yes (tsvector, pg_trgm, RRF) | Yes (v2) | Yes | Yes |
| Index types | HNSW, IVFFlat | Proprietary (HNSW-ish) | HNSW | HNSW |
| Quantization (halfvec / binary) | Yes (pgvector 0.7+) | Yes | Yes | Yes |
| Operational burden | One database to run | +1 service, +1 vendor | +1 service, +1 vendor | +1 service, +1 vendor |
| Backups & PITR | Included, snapshots | Managed, opaque | Managed, opaque | Managed, opaque |
The killer argument is the third row: transactional joins. When your RAG system needs to filter by tenant_id, user_id, document_type, or created_at > now() - interval '30 days', you just write SQL. In Pinecone you are stuck with pre-filter metadata and two network hops. In pgvector, it is one query, one round-trip, one transaction.
Below ~10 million vectors — which covers the vast majority of B2B RAG products — pgvector is faster end-to-end, cheaper, and operationally simpler. The dedicated vector DB argument only really wins above 50M vectors or when you need globally distributed, multi-region HNSW.
Step 1: Enable pgvector on DanubeData Managed PostgreSQL
DanubeData Managed PostgreSQL ships with the vector extension available out of the box. Connect with any client (DBeaver, psql, your app) and enable it:
-- Connect to your database, then:
CREATE EXTENSION IF NOT EXISTS vector;
-- Verify
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';
-- extname | extversion
-- ---------+------------
-- vector | 0.8.0
While you are at it, enable the two extensions that make hybrid search easy:
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- fuzzy/trigram search
CREATE EXTENSION IF NOT EXISTS unaccent; -- strip diacritics for better matching
That is it. No sidecar, no cluster, no separate billing line. On a DD Small Managed PostgreSQL (€19.99) you now have a production-grade vector database.
Step 2: Design the schema
Good RAG schemas separate documents (the source of truth — a blog post, PDF, support ticket) from chunks (the retrievable units). This lets you re-chunk without re-uploading, and lets you attach rich metadata for filtering.
-- documents: one row per ingested source
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
tenant_id UUID NOT NULL,
source_type TEXT NOT NULL, -- 'markdown', 'pdf', 'web', 'ticket'
source_uri TEXT NOT NULL, -- s3://..., https://..., file path
title TEXT,
content_sha256 CHAR(64) NOT NULL, -- dedup + change detection
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (tenant_id, source_uri)
);
CREATE INDEX documents_tenant_idx ON documents(tenant_id);
CREATE INDEX documents_metadata_gin ON documents USING GIN (metadata jsonb_path_ops);
-- chunks: the unit we embed and retrieve
CREATE TABLE chunks (
id BIGSERIAL PRIMARY KEY,
document_id BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
tenant_id UUID NOT NULL, -- denormalized for fast filtering
chunk_index INT NOT NULL, -- 0, 1, 2, ... within the document
content TEXT NOT NULL, -- the actual chunk text
token_count INT NOT NULL,
embedding vector(1536), -- OpenAI text-embedding-3-small dim
content_tsv tsvector GENERATED ALWAYS AS
(to_tsvector('english', content)) STORED,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (document_id, chunk_index)
);
-- Indexes: one for each access pattern
CREATE INDEX chunks_tenant_idx ON chunks(tenant_id);
CREATE INDEX chunks_document_idx ON chunks(document_id);
CREATE INDEX chunks_fts_idx ON chunks USING GIN (content_tsv);
Three design choices worth calling out:
- Denormalized tenant_id on chunks — lets every retrieval query filter by tenant without joining to documents.
- content_sha256 on documents — idempotent re-ingestion. Hash the content, skip if unchanged.
- Generated tsvector column — free full-text index, always in sync, used later for hybrid search.
Step 3: Pick an embedding model (and dimension)
Your embedding model decides three things: quality, cost, and vector dimension. Dimension matters because it drives storage and index speed — a 3072-dim vector takes 6× the space of a 512-dim one.
| Model | Dim | Cost | Self-hostable | Best for |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 (Matryoshka: 256–1536) | $0.02 / 1M tokens | No | Default pick, great quality/price |
| OpenAI text-embedding-3-large | 3072 | $0.13 / 1M tokens | No | Maximum recall, multilingual |
| Cohere embed-multilingual-v3 | 1024 | $0.10 / 1M tokens | No | Strong EU-language coverage |
| BGE bge-m3 | 1024 | Free | Yes (Ollama, TEI) | GDPR-strict, on-prem |
| nomic-embed-text-v1.5 | 768 (Matryoshka: 64–768) | Free | Yes (Ollama) | Small, fast, local |
| mxbai-embed-large | 1024 | Free | Yes (Ollama) | Open, high MTEB score |
For most teams starting out, text-embedding-3-small at 1536 dimensions is the right default: excellent quality, cheap, and Matryoshka-compatible so you can truncate to 512 or 768 later without re-embedding.
For GDPR-strict projects where you cannot send content to the OpenAI API, run bge-m3 or nomic-embed-text on a DanubeData VPS (DD Small at €12.49/mo is plenty for a few hundred thousand chunks) via Ollama or Hugging Face text-embeddings-inference. Then your embeddings, your vectors, and your data all stay on German servers.
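A quick back-of-envelope sketch of why dimension drives storage: pgvector's vector type stores 4 bytes (float32) per dimension, so the raw payload scales linearly with dimension (row headers and index overhead come on top of this):

```python
# Raw vector payload for 1M chunks at different embedding dimensions.
# pgvector's `vector` type = 4 bytes (float32) per dimension.
def table_bytes(n_vectors: int, dim: int, bytes_per_dim: int = 4) -> int:
    return n_vectors * dim * bytes_per_dim

for dim in (512, 768, 1536, 3072):
    gb = table_bytes(1_000_000, dim) / 1024**3
    print(f"{dim:>4} dims: {gb:.1f} GB of raw vector data per 1M chunks")
```

At 3072 dimensions you pay roughly 11 GB per million chunks before any index, which is why truncation (Matryoshka) and halfvec matter later in this guide.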
Step 4: Choose an index — HNSW vs IVFFlat
A vector(1536) column without an index works fine up to ~10,000 rows — Postgres will do a sequential scan. Beyond that, you need an approximate nearest neighbor (ANN) index. pgvector ships two:
| Aspect | HNSW | IVFFlat |
|---|---|---|
| Recall @ 10 | 95–99% | 80–95% (tune lists) |
| Build time | Slower (graph construction) | Faster |
| Query latency | Very low, sub-ms to ms | Higher, tunable |
| Insert cost | Moderate | Low |
| Memory use | Higher (graph in RAM) | Lower |
| Requires data to build? | No — can build on empty | Yes — needs data to cluster |
| 2026 default | Yes | Legacy workloads only |
HNSW is the 2026 default. It has higher recall, much lower query latency, and — unlike IVFFlat — you can build the index before you have any data, which matters for CI/CD and blue/green deploys.
-- Create the HNSW index (cosine distance is what OpenAI embeddings expect)
CREATE INDEX chunks_embedding_hnsw_idx
ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- For larger datasets, increase both:
-- WITH (m = 32, ef_construction = 128) -- higher recall, slower build
-- At query time, tune the search width:
SET hnsw.ef_search = 100; -- default 40. Higher = more recall, slower.
Only reach for IVFFlat if you are ingesting hundreds of thousands of new rows per hour and HNSW insert throughput becomes the bottleneck. For ~99% of RAG apps, HNSW is what you want.
Step 5: The ingestion pipeline (Python)
Here is a minimal, production-shaped ingestion script. It chunks Markdown files with overlap, embeds them with OpenAI, and upserts into Postgres. Run it locally, in a CI job, or on a DanubeData VPS.
#!/usr/bin/env python3
# ingest.py — chunk + embed + upsert into pgvector
import hashlib
import os
import re
from pathlib import Path

import numpy as np
import psycopg
import tiktoken
from openai import OpenAI
from pgvector.psycopg import register_vector

DATABASE_URL = os.environ["DATABASE_URL"]  # postgres://user:pass@db.danubedata.ro:5432/rag
TENANT_ID = os.environ["TENANT_ID"]        # a UUID
EMBED_MODEL = "text-embedding-3-small"
EMBED_DIM = 1536
CHUNK_TOKENS = 500
CHUNK_OVERLAP = 75

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")


def chunk_text(text: str, max_tokens: int = CHUNK_TOKENS, overlap: int = CHUNK_OVERLAP):
    """Split text into overlapping token windows. Good enough for 95% of cases."""
    # Preserve paragraph boundaries when possible
    paragraphs = re.split(r"\n\s*\n", text.strip())
    token_chunks, buf = [], []
    for p in paragraphs:
        p_tokens = enc.encode(p)
        if len(buf) + len(p_tokens) > max_tokens and buf:
            token_chunks.append(buf)
            # keep last `overlap` tokens as bridge to next chunk
            buf = buf[-overlap:] + p_tokens
        else:
            buf.extend(p_tokens)
    if buf:
        token_chunks.append(buf)
    return [enc.decode(t) for t in token_chunks]


def embed_batch(texts: list[str]) -> list[list[float]]:
    """Embed up to 100 strings in a single API call."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return [d.embedding for d in resp.data]


def ingest_file(conn, path: Path):
    content = path.read_text(encoding="utf-8")
    sha = hashlib.sha256(content.encode()).hexdigest()
    with conn.cursor() as cur:
        # Upsert the document. The WHERE clause makes the update a no-op when the
        # stored hash already matches, so RETURNING yields no row and we can skip
        # re-embedding entirely. When content changed, the new hash is stored.
        cur.execute(
            """
            INSERT INTO documents (tenant_id, source_type, source_uri, title, content_sha256)
            VALUES (%s, 'markdown', %s, %s, %s)
            ON CONFLICT (tenant_id, source_uri)
            DO UPDATE SET title = EXCLUDED.title,
                          content_sha256 = EXCLUDED.content_sha256,
                          updated_at = now()
            WHERE documents.content_sha256 IS DISTINCT FROM EXCLUDED.content_sha256
            RETURNING id
            """,
            (TENANT_ID, str(path), path.stem, sha),
        )
        row = cur.fetchone()
        if row is None:
            print(f"skip {path} (unchanged)")
            return
        doc_id = row[0]
        # Re-chunk: wipe old chunks, insert new ones
        cur.execute("DELETE FROM chunks WHERE document_id = %s", (doc_id,))
        chunks = chunk_text(content)
        # Batch embed in groups of 100
        for i in range(0, len(chunks), 100):
            batch = chunks[i:i + 100]
            vectors = embed_batch(batch)
            rows = [
                (doc_id, TENANT_ID, i + j, batch[j],
                 len(enc.encode(batch[j])),
                 np.array(vectors[j]))  # np.array: what pgvector's psycopg adapter dumps as vector
                for j in range(len(batch))
            ]
            cur.executemany(
                """
                INSERT INTO chunks
                    (document_id, tenant_id, chunk_index, content, token_count, embedding)
                VALUES (%s, %s, %s, %s, %s, %s)
                """,
                rows,
            )
        print(f"ingested {path}: {len(chunks)} chunks")
    conn.commit()


def main():
    with psycopg.connect(DATABASE_URL) as conn:
        register_vector(conn)
        for md in Path("./docs").rglob("*.md"):
            ingest_file(conn, md)


if __name__ == "__main__":
    main()
Notes on this script:
- Idempotent — hash-based skip means you can re-run on every CI build without re-embedding unchanged docs.
- Batch embeds — 100 texts per API call. OpenAI allows up to 2048 inputs but 100 is a sweet spot for latency and error isolation.
- Overlap — 75 tokens of overlap means important sentences never fall on a chunk boundary.
- Tenant isolation — tenant_id on every chunk. Always filter on it at query time.
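The overlap behaviour is easy to sanity-check with a simplified version of the windowing logic, using plain string tokens as a stand-in for tiktoken token IDs:

```python
# Sanity-check the sliding-window overlap logic with toy string "tokens"
# (a stand-in for tiktoken IDs so the behaviour is easy to inspect).
def chunk_tokens(tokens, max_tokens=8, overlap=3):
    chunks, buf = [], []
    for t in tokens:
        if len(buf) >= max_tokens:
            chunks.append(buf)
            buf = buf[-overlap:]  # the bridge into the next chunk
        buf.append(t)
    if buf:
        chunks.append(buf)
    return chunks

tokens = [f"t{i}" for i in range(20)]
chunks = chunk_tokens(tokens)
# Every chunk after the first starts with the last `overlap` tokens
# of its predecessor, so no sentence is lost at a boundary.
assert chunks[1][:3] == chunks[0][-3:]
```

With 500-token windows and 75-token overlap, the same property holds in the real pipeline: any sentence near a chunk boundary appears in full in at least one of the two adjacent chunks.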
Step 6: The retrieval query
pgvector overloads three distance operators:
- <=> — cosine distance (1 - cosine similarity). What you want for OpenAI / most modern embeddings.
- <-> — L2 (Euclidean) distance.
- <#> — negative inner product.
The basic retrieval query — "find the top 5 chunks closest to my question embedding" — looks like this:
-- $1 = query embedding as vector(1536), $2 = tenant_id
SELECT
c.id,
c.content,
c.document_id,
d.title,
d.source_uri,
c.embedding <=> $1 AS distance
FROM chunks c
JOIN documents d ON d.id = c.document_id
WHERE c.tenant_id = $2
ORDER BY c.embedding <=> $1
LIMIT 5;
The ORDER BY ... <=> $1 clause is what lets the planner use the HNSW index. EXPLAIN ANALYZE should show an Index Scan using chunks_embedding_hnsw_idx — if you see a sequential scan instead, the table may simply be small enough that a scan is cheaper, or your ORDER BY expression does not match the index's operator class (vector_cosine_ops pairs with <=>). Note that hnsw.ef_search tunes recall at query time; it does not influence whether the index is chosen.
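A small numpy sketch makes the relationship between the three operators concrete. For unit-length vectors (OpenAI embeddings come pre-normalized), all three produce the same ranking, because squared L2 distance equals twice the cosine distance and negative inner product is just cosine distance shifted by a constant:

```python
# For unit-length vectors, pgvector's three operators agree on ordering:
#   cosine distance  = 1 - a.b          (<=>)
#   squared L2       = 2 * (1 - a.b)    (square of <->)
#   neg inner product = -a.b            (<#>)
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # normalize to unit length
q = vecs[0]

cos_dist = 1 - vecs @ q
l2_sq = np.sum((vecs - q) ** 2, axis=1)
neg_ip = -(vecs @ q)

assert np.allclose(l2_sq, 2 * cos_dist)
assert (np.argsort(cos_dist) == np.argsort(neg_ip)).all()
```

Practical consequence: with normalized embeddings it does not matter much which operator you pick, but the operator in your query must match the operator class the index was built with.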
Hybrid search: vector + BM25 via tsvector
Pure vector search misses exact-match queries like product SKUs, version numbers, or unusual proper nouns. Hybrid search combines vector similarity with traditional keyword scoring via Reciprocal Rank Fusion (RRF) — and because your content_tsv column is already there, it is a single SQL statement:
WITH
vector_hits AS (
SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
FROM chunks
WHERE tenant_id = $2
ORDER BY embedding <=> $1
LIMIT 50
),
keyword_hits AS (
SELECT id, ROW_NUMBER() OVER
(ORDER BY ts_rank_cd(content_tsv, plainto_tsquery('english', $3)) DESC) AS rank
FROM chunks
WHERE tenant_id = $2
AND content_tsv @@ plainto_tsquery('english', $3)
LIMIT 50
)
SELECT c.id, c.content, d.title, d.source_uri,
COALESCE(1.0 / (60 + v.rank), 0) + COALESCE(1.0 / (60 + k.rank), 0) AS rrf_score
FROM chunks c
LEFT JOIN vector_hits v ON v.id = c.id
LEFT JOIN keyword_hits k ON k.id = c.id
JOIN documents d ON d.id = c.document_id
WHERE v.id IS NOT NULL OR k.id IS NOT NULL
ORDER BY rrf_score DESC
LIMIT 5;
$1 is the query embedding, $2 the tenant, $3 the raw query text. RRF with k = 60 is the standard fusion weight from the original paper and works well without tuning.
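The fusion arithmetic itself is simple enough to sketch in a few lines of Python; this mirrors the COALESCE(1.0 / (60 + rank), 0) terms in the SQL above:

```python
# Reciprocal Rank Fusion in miniature, with the standard k = 60.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:                      # each ranking is best-first
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c3", "c1", "c7"]   # top of the vector ranking
keyword_hits = ["c7", "c9"]        # top of the keyword ranking

fused = rrf([vector_hits, keyword_hits])
print(fused[0])  # c7
```

A chunk that appears in both rankings ("c7" here) collects two reciprocal-rank terms and rises to the top, which is exactly why RRF rescues exact-match queries that pure vector search ranks poorly.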
Step 7: Wrap it in a FastAPI endpoint
Now glue it together into a real /ask endpoint that embeds the question, retrieves chunks, and calls an LLM to synthesize the answer.
# api.py — FastAPI RAG endpoint
import os
from contextlib import asynccontextmanager
from typing import List

import numpy as np
import psycopg
from fastapi import FastAPI, HTTPException
from openai import OpenAI
from pgvector.psycopg import register_vector
from pydantic import BaseModel

DATABASE_URL = os.environ["DATABASE_URL"]
EMBED_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o-mini"
TOP_K = 5

client = OpenAI()
conn: psycopg.Connection | None = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    global conn
    conn = psycopg.connect(DATABASE_URL, autocommit=True)
    register_vector(conn)
    with conn.cursor() as cur:
        cur.execute("SET hnsw.ef_search = 100")
    yield
    conn.close()


app = FastAPI(lifespan=lifespan)


class AskRequest(BaseModel):
    question: str
    tenant_id: str


class Citation(BaseModel):
    title: str
    source_uri: str
    excerpt: str


class AskResponse(BaseModel):
    answer: str
    citations: List[Citation]


SYSTEM_PROMPT = """You are a helpful assistant. Answer the user's question using ONLY
the provided context snippets. If the answer is not in the context, say you don't know.
Always cite sources inline like [1], [2] matching the snippet numbers."""


@app.post("/ask", response_model=AskResponse)
async def ask(req: AskRequest) -> AskResponse:
    # 1. Embed the question
    emb = np.array(
        client.embeddings.create(model=EMBED_MODEL, input=[req.question]).data[0].embedding
    )
    # 2. Retrieve top-k chunks (hybrid search, RRF fusion)
    with conn.cursor() as cur:
        cur.execute(
            """
            WITH
            vh AS (SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> %s) AS r
                   FROM chunks WHERE tenant_id = %s
                   ORDER BY embedding <=> %s LIMIT 50),
            kh AS (SELECT id, ROW_NUMBER() OVER
                       (ORDER BY ts_rank_cd(content_tsv, plainto_tsquery('english', %s)) DESC) AS r
                   FROM chunks
                   WHERE tenant_id = %s
                     AND content_tsv @@ plainto_tsquery('english', %s)
                   LIMIT 50)
            SELECT c.content, d.title, d.source_uri,
                   COALESCE(1.0/(60+vh.r),0) + COALESCE(1.0/(60+kh.r),0) AS rrf
            FROM chunks c
            LEFT JOIN vh ON vh.id = c.id
            LEFT JOIN kh ON kh.id = c.id
            JOIN documents d ON d.id = c.document_id
            WHERE vh.id IS NOT NULL OR kh.id IS NOT NULL
            ORDER BY rrf DESC LIMIT %s
            """,
            (emb, req.tenant_id, emb, req.question, req.tenant_id, req.question, TOP_K),
        )
        rows = cur.fetchall()
    if not rows:
        raise HTTPException(404, "No relevant context found")
    # 3. Build the prompt
    context_block = "\n\n".join(
        f"[{i+1}] {content}\n(source: {title})"
        for i, (content, title, *_) in enumerate(rows)
    )
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Context:\n{context_block}\n\nQuestion: {req.question}"},
    ]
    # 4. Call the LLM
    completion = client.chat.completions.create(
        model=CHAT_MODEL, messages=messages, temperature=0.2
    )
    answer = completion.choices[0].message.content
    return AskResponse(
        answer=answer,
        citations=[
            Citation(title=title, source_uri=uri, excerpt=content[:200])
            for content, title, uri, _ in rows
        ],
    )
Run it on a DanubeData DD Small VPS (€12.49/mo) with uvicorn api:app --host 0.0.0.0 --port 8000 and put it behind Caddy or nginx with automatic HTTPS. Your total infrastructure for a production RAG stack is now one database (€19.99) and one VPS (€12.49): €32.48/month, all in Germany.
Step 8: Optimizations that matter at scale
1. Matryoshka: truncate dimensions without re-embedding
text-embedding-3-small and nomic-embed-text-v1.5 are Matryoshka Representation Learning (MRL) models. That means the first N dimensions of the vector are themselves a valid (lower-quality) embedding. You can store the full 1536 dims but index only 512 or 768 for a 3× speedup with negligible recall loss.
-- Store full embeddings, index only the first 512 dims.
-- Note: a direct cast vector(1536)::vector(512) fails the dimension check;
-- use subvector() (pgvector 0.7+) to truncate.
ALTER TABLE chunks ADD COLUMN embedding_512 vector(512)
    GENERATED ALWAYS AS (subvector(embedding, 1, 512)::vector(512)) STORED;
CREATE INDEX chunks_embedding_512_hnsw
    ON chunks USING hnsw (embedding_512 vector_cosine_ops);
-- Query: retrieve 50 with the cheap 512-dim index, then re-rank with full 1536 dims
WITH candidates AS (
    SELECT id, content, embedding FROM chunks
    WHERE tenant_id = $2
    ORDER BY embedding_512 <=> subvector($1, 1, 512)::vector(512)
    LIMIT 50
)
SELECT id, content FROM candidates
ORDER BY embedding <=> $1
LIMIT 5;
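Client-side, Matryoshka truncation is nothing more than slicing off a prefix and re-normalizing (the re-normalization is optional for cosine distance, which is scale-invariant, but it keeps raw dot products meaningful). A sketch with a toy 4-dimensional vector:

```python
# Matryoshka truncation: the first n dims of an MRL embedding are themselves
# a valid (lower-quality) embedding. Slice, then re-normalize to unit length.
import math

def truncate(embedding, n):
    prefix = embedding[:n]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix]

full = [0.5, 0.5, 0.5, 0.5]   # toy stand-in for a 1536-dim MRL embedding
short = truncate(full, 2)

assert len(short) == 2
assert abs(sum(x * x for x in short) - 1.0) < 1e-9  # unit length again
```

This only works with MRL-trained models (text-embedding-3-*, nomic-embed-text-v1.5); truncating a non-Matryoshka embedding this way destroys its geometry.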
2. halfvec quantization (pgvector 0.7+)
halfvec stores each dimension as a 16-bit float instead of 32-bit. You cut storage and index memory in half with essentially no recall hit for most embedding models. On DanubeData this turns a 1M-chunk, 1536-dim index from ~6 GB to ~3 GB.
-- Add halfvec column alongside the main one
ALTER TABLE chunks ADD COLUMN embedding_half halfvec(1536)
GENERATED ALWAYS AS (embedding::halfvec(1536)) STORED;
CREATE INDEX chunks_embedding_half_hnsw
ON chunks USING hnsw (embedding_half halfvec_cosine_ops);
-- Query the halfvec index for half the memory pressure
SELECT id, content
FROM chunks
WHERE tenant_id = $2
ORDER BY embedding_half <=> ($1::halfvec(1536))
LIMIT 5;
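The effect is easy to demonstrate with numpy's float16, the same 16-bit IEEE representation halfvec uses: storage halves, and the rounding error per component is tiny relative to typical embedding values.

```python
# halfvec in miniature: float16 halves storage with small rounding error.
import numpy as np

full = np.random.default_rng(1).normal(size=1536).astype(np.float32)
half = full.astype(np.float16)

assert half.nbytes == full.nbytes // 2  # 3072 vs 6144 bytes per vector
# Rounding error stays well below typical inter-vector distances
assert float(np.max(np.abs(half.astype(np.float32) - full))) < 0.01
```

Because nearest-neighbor ranking only depends on relative distances, this sub-percent perturbation rarely changes which chunks land in the top k.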
3. Partial indexes per tenant or content type
If one tenant dominates the table, a partial index keeps HNSW small and fast:
CREATE INDEX chunks_embedding_acme_hnsw
ON chunks USING hnsw (embedding vector_cosine_ops)
WHERE tenant_id = '11111111-2222-3333-4444-555555555555';
4. Periodic VACUUM and ANALYZE
HNSW does not delete gracefully. When chunks get re-ingested frequently, the index accumulates dead tuples. DanubeData's managed Postgres runs autovacuum, but for very write-heavy tables it is worth scheduling a weekly REINDEX INDEX CONCURRENTLY.
How many vectors can Postgres really handle?
Real-world numbers on a DanubeData Managed PostgreSQL (DD Standard, 4 vCPU, 8GB RAM) with pgvector 0.8 and HNSW:
- 100K vectors, 1536d — p95 latency 2–5 ms, index ~1 GB, builds in under 2 minutes.
- 1M vectors, 1536d — p95 latency 8–15 ms, index ~6 GB (3 GB with halfvec), builds in 20–30 minutes.
- 5M vectors, 1536d — p95 latency 20–40 ms, fits on a DD Large (16GB) with halfvec quantization.
- 10M+ vectors, 1536d — still fine with halfvec + dedicated CPU plan; beyond this consider a dedicated vector store or sharding by tenant.
The honest threshold where a dedicated vector DB becomes worth the operational cost is around 25–50M vectors or heavy multi-region requirements. Everything below that, pgvector wins.
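As a rough sizing sketch (an approximation of our own, not pgvector's exact on-disk layout), HNSW memory is roughly the vector payload plus the graph's neighbor links, which lets you predict whether an index fits in RAM before you build it:

```python
# Rough HNSW RAM estimate: vector payload + neighbor links.
# Assumption: ~m * 2 links per node at 8 bytes each; real overhead varies
# with pgvector version and fill state, so treat this as a floor, not a ceiling.
def hnsw_ram_gb(n, dim, m=16, bytes_per_dim=4):
    vector_bytes = n * dim * bytes_per_dim   # float32 = 4, halfvec = 2
    graph_bytes = n * m * 2 * 8
    return (vector_bytes + graph_bytes) / 1024**3

print(f"{hnsw_ram_gb(1_000_000, 1536):.1f} GB")  # ~6 GB for 1M x 1536d float32
```

Plugging in bytes_per_dim=2 shows why halfvec matters: the same 1M-vector index drops to roughly half, which is what moves a 5M-chunk workload into a 16 GB instance.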
GDPR, embeddings, and the OpenAI question
An under-discussed RAG pitfall: sending your customer content to OpenAI to embed it is a GDPR data transfer. You need a DPA with OpenAI, Standard Contractual Clauses, and a legal basis. Two practical options:
- Stay on OpenAI — OpenAI's API does not train on API inputs by default (since March 2023) and they offer an enterprise DPA. Document this in your Record of Processing Activities.
- Self-host embeddings on a DanubeData VPS — run bge-m3 or nomic-embed-text via Ollama or text-embeddings-inference on a DD Small (€12.49/mo) or DD Standard. Zero cross-border transfers, zero third-party processor. Your vectors, your database, your data center — all in Germany.
Because your Postgres lives on DanubeData in Falkenstein, the vectors themselves never leave the EU either way. Only the ephemeral embedding API call (option 1) leaves, and even that can be kept in-EU with OpenAI's EU data residency options or by switching to option 2.
What it actually costs
| Stack piece | DanubeData | Pinecone + Render equivalent |
|---|---|---|
| Vector DB (1M chunks, 1536d) | €19.99 (Managed Postgres with pgvector) | $70 (Pinecone Starter) |
| API / ingestion host | €12.49 (DD Small VPS) | ~$19 (Render Starter) |
| Document storage (100GB S3) | €3.99 (includes 1TB traffic) | ~$23 (AWS S3 + egress) |
| Monthly total | €36.47 | $112+ |
| GDPR / data residency | Germany, EU provider | Mixed US/EU |
Roughly 3× cheaper, simpler to operate, and GDPR-native — for anything below 10 million chunks.
FAQ
How many vectors can Postgres handle with pgvector?
Comfortably up to 10 million 1536-dimensional vectors on a single DanubeData Managed Postgres instance with HNSW and halfvec quantization. Many teams run tens of millions with careful tuning and partial indexes. The rule of thumb: if your HNSW index fits in RAM, you are fine.
When should I actually use a dedicated vector DB like Pinecone or Qdrant?
Three scenarios: (1) you have >50M vectors and need distributed HNSW, (2) you need globally replicated vector search with low latency on every continent, or (3) your team explicitly does not want to run Postgres. Below those bars, pgvector on managed Postgres is cheaper and simpler.
HNSW or IVFFlat — which should I pick?
HNSW, always, in 2026. Higher recall, lower latency, can be built on empty tables. IVFFlat only makes sense for insert-heavy workloads above ~100K writes/hour and even there HNSW 0.8 is usually competitive.
Which embedding model should I start with?
Default to OpenAI text-embedding-3-small at 1536 dimensions. Quality is excellent, cost is negligible ($0.02 per 1M tokens), and it supports Matryoshka truncation so you can shrink to 512 later without re-embedding. If you need 100% self-hosted, use bge-m3 or nomic-embed-text via Ollama on a DanubeData VPS.
Can I use Ollama / self-hosted embeddings with pgvector?
Absolutely. Run Ollama or text-embeddings-inference on a DanubeData VPS, point your ingestion script at http://vps-ip:11434/api/embeddings instead of OpenAI, and store the resulting vectors in the same schema. The only thing that changes is the embedding dimension and the API client.
Is sending content to OpenAI for embeddings a GDPR problem?
It is a data transfer to a US processor, so you need OpenAI's DPA, SCCs, a Transfer Impact Assessment, and to document it in your Record of Processing Activities. OpenAI does not train on API inputs. For truly strict setups (health, legal, public sector), self-host embeddings on a DanubeData VPS — that way no customer content ever leaves the EU.
How do I update or re-index chunks when a document changes?
Use the content_sha256 pattern from the ingestion script: hash the document, compare to what is stored, and if it changed, DELETE FROM chunks WHERE document_id = ? and re-embed. pgvector's HNSW handles deletes and reinserts fine; schedule a weekly REINDEX INDEX CONCURRENTLY if you rewrite a lot.
What about hybrid search with BM25?
Postgres's built-in tsvector full-text search gives you BM25-flavored keyword ranking out of the box. Combine it with vector similarity via Reciprocal Rank Fusion (RRF, see the SQL example above). You can also add pg_trgm for fuzzy matching or the paradedb extension for true BM25 if you need it.
Get started today
A full RAG stack on DanubeData takes about 20 minutes:
- Create a Managed PostgreSQL on DanubeData (€19.99/mo, pgvector included)
- Run CREATE EXTENSION vector;
- Apply the schema from Step 2
- Run the ingestion script from Step 5 against your Markdown, PDFs, or Notion export
- Deploy the FastAPI endpoint on a DD Small VPS (€12.49/mo)
- Ask your data questions
Why DanubeData for RAG workloads:
- Managed PostgreSQL 15–17 with pgvector, pg_trgm, and unaccent preinstalled
- From €19.99/mo with NVMe storage and automatic snapshots
- Germany-based (Falkenstein), GDPR-native, EU-based company
- Pair with DD Small VPS (€12.49) and S3 storage (€3.99) for a full in-house AI stack
- 20TB included traffic, €50 signup credit
You do not need Pinecone. You do not need Weaviate. You already have Postgres. Turn it on.
👉 Create your Managed PostgreSQL on DanubeData
Building a RAG product in regulated industry? Talk to us — we help design GDPR-safe AI stacks on European infrastructure.