Build a RAG System with pgvector on Managed PostgreSQL (2026)

Adrian Silaghi
April 20, 2026
15 min read
#pgvector #postgresql #rag #ai #llm #embeddings #vector-database #managed-postgres

Retrieval-Augmented Generation (RAG) has become the default pattern for building AI products on top of your own data. In 2024 every RAG tutorial shipped you off to Pinecone, Weaviate, or Qdrant — dedicated, expensive, usually US-hosted vector databases. In 2026 the smart move is simpler: use the Postgres you already have.

The pgvector extension is now mature, fast, and used in production by OpenAI, Supabase, Neon, and thousands of European teams that need their data to stay on European soil. On DanubeData Managed PostgreSQL from €19.99/mo, you get pgvector preinstalled, Germany-based storage, daily snapshots, and every Postgres feature you already know. No second database. No extra bill. No data leaving the EU.

This guide walks you through building a full production RAG stack on managed Postgres: schema design, embedding choice, HNSW indexing, a Python ingestion pipeline, a FastAPI retrieval endpoint with LLM synthesis, hybrid search, and the 2026 optimizations (Matryoshka embeddings, halfvec quantization) that let you scale to millions of chunks on a single database server.

What is RAG, in 90 seconds?

Large language models are brilliant generalists but terrible specialists. Ask GPT-4 about your internal HR policy and it will hallucinate. Ask it about a document you uploaded five minutes ago and it cannot see it.

RAG (Retrieval-Augmented Generation) fixes this by stapling a search engine onto the LLM. The flow is always the same four steps:

  1. Chunk — split your documents (Markdown, PDFs, web pages, tickets) into small overlapping passages of 200–800 tokens.
  2. Embed — turn each chunk into a high-dimensional vector (typically 384–3072 floats) using an embedding model like text-embedding-3-small. Semantically similar text ends up close together in vector space.
  3. Retrieve — when a user asks a question, embed their question with the same model, then find the k nearest chunks by cosine distance.
  4. Generate — stuff those retrieved chunks into the LLM prompt as context and let it answer.

The magic is step 3. Vector similarity search is what turns a dumb keyword lookup into "find me the paragraphs that mean roughly what the user just asked, even if the exact words are different." That is the job pgvector does inside Postgres.
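To make that concrete, here is the whole idea in a few lines of plain Python — the three-dimensional "embeddings" and chunk texts are made up for illustration; pgvector's <=> operator computes this same 1 - cosine-similarity, just inside the database:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity -- what pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# Toy 3-dim "embeddings" (real models emit 384-3072 dims)
chunks = {
    "refund policy": [0.9, 0.1, 0.0],
    "api rate limits": [0.1, 0.9, 0.1],
    "office dog rules": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend this embeds "how do refunds work?"

best = min(chunks, key=lambda k: cosine_distance(query, chunks[k]))
print(best)  # -> refund policy
```

Scale the chunk dict up to millions of rows and you need an index instead of a linear scan — which is exactly what pgvector's HNSW provides.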

Why pgvector (and not a dedicated vector DB)

Every team building RAG hits the same fork in the road: do I spin up Pinecone / Weaviate / Qdrant, or do I just add a vector column to my existing Postgres? In 2023 the answer leaned toward dedicated vector DBs. In 2026 pgvector has caught up on every metric that matters for <10M vector workloads — and crushes everyone on operational simplicity.

| Feature | pgvector on DanubeData | Pinecone Starter | Weaviate Cloud | Qdrant Cloud |
|---|---|---|---|---|
| Monthly price (≈1M vectors, 1536d) | €19.99 | $70+ (Starter) | $25+ (Serverless, usage-based) | $25+ (1GB tier) |
| Data residency | Germany (Falkenstein) | US / some EU regions | US / EU | US / EU |
| GDPR-native | Yes (EU provider, DPA included) | Via DPA, US controller | Via DPA | Via DPA |
| Transactional joins (vector + SQL) | Native, single query | No — separate store | Limited | Limited payload filters |
| Hybrid search (vector + BM25) | Yes (tsvector, pg_trgm, RRF) | Yes (v2) | Yes | Yes |
| Index types | HNSW, IVFFlat | Proprietary (HNSW-ish) | HNSW | HNSW |
| Quantization (halfvec / binary) | Yes (pgvector 0.7+) | Yes | Yes | Yes |
| Operational burden | One database to run | +1 service, +1 vendor | +1 service, +1 vendor | +1 service, +1 vendor |
| Backups & PITR | Included, snapshots | Managed, opaque | Managed, opaque | Managed, opaque |

The killer argument is the transactional joins row. When your RAG system needs to filter by tenant_id, user_id, document_type, or created_at > now() - interval '30 days', you just write SQL. In Pinecone you are stuck with metadata filters and two network hops. In pgvector, it is one query, one round-trip, one transaction.

Below ~10 million vectors — which covers the vast majority of B2B RAG products — pgvector is faster end-to-end, cheaper, and operationally simpler. The dedicated vector DB argument only really wins above 50M vectors or when you need globally distributed, multi-region HNSW.

Step 1: Enable pgvector on DanubeData Managed PostgreSQL

DanubeData Managed PostgreSQL ships with the vector extension available out of the box. Connect with any client (DBeaver, psql, your app) and enable it:

-- Connect to your database, then:
CREATE EXTENSION IF NOT EXISTS vector;

-- Verify
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';
--  extname | extversion
-- ---------+------------
--  vector  | 0.8.0

While you are at it, enable the two extensions that make hybrid search easy:

CREATE EXTENSION IF NOT EXISTS pg_trgm;    -- fuzzy/trigram search
CREATE EXTENSION IF NOT EXISTS unaccent;   -- strip diacritics for better matching

That is it. No sidecar, no cluster, no separate billing line. On a DD Small Managed PostgreSQL (€19.99) you now have a production-grade vector database.

Step 2: Design the schema

Good RAG schemas separate documents (the source of truth — a blog post, PDF, support ticket) from chunks (the retrievable units). This lets you re-chunk without re-uploading, and lets you attach rich metadata for filtering.

-- documents: one row per ingested source
CREATE TABLE documents (
    id              BIGSERIAL PRIMARY KEY,
    tenant_id       UUID NOT NULL,
    source_type     TEXT NOT NULL,              -- 'markdown', 'pdf', 'web', 'ticket'
    source_uri      TEXT NOT NULL,              -- s3://..., https://..., file path
    title           TEXT,
    content_sha256  CHAR(64) NOT NULL,          -- dedup + change detection
    metadata        JSONB NOT NULL DEFAULT '{}'::jsonb,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (tenant_id, source_uri)
);

CREATE INDEX documents_tenant_idx ON documents(tenant_id);
CREATE INDEX documents_metadata_gin ON documents USING GIN (metadata jsonb_path_ops);

-- chunks: the unit we embed and retrieve
CREATE TABLE chunks (
    id              BIGSERIAL PRIMARY KEY,
    document_id     BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    tenant_id       UUID NOT NULL,              -- denormalized for fast filtering
    chunk_index     INT NOT NULL,               -- 0, 1, 2, ... within the document
    content         TEXT NOT NULL,              -- the actual chunk text
    token_count     INT NOT NULL,
    embedding       vector(1536),               -- OpenAI text-embedding-3-small dim
    content_tsv     tsvector GENERATED ALWAYS AS
                    (to_tsvector('english', content)) STORED,
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (document_id, chunk_index)
);

-- Indexes: one for each access pattern
CREATE INDEX chunks_tenant_idx     ON chunks(tenant_id);
CREATE INDEX chunks_document_idx   ON chunks(document_id);
CREATE INDEX chunks_fts_idx        ON chunks USING GIN (content_tsv);

Three design choices worth calling out:

  • Denormalized tenant_id on chunks — lets the HNSW index filter by tenant in-memory without a join.
  • content_sha256 on documents — idempotent re-ingestion. Hash the content, skip if unchanged.
  • Generated tsvector column — free full-text index, always in sync, used later for hybrid search.
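The content_sha256 pattern is worth a tiny sketch: hash the raw content, compare against what is stored, and only re-chunk on a mismatch. The dict below stands in for the documents table:

```python
import hashlib

def content_hash(text: str) -> str:
    """The same hex digest the documents.content_sha256 column stores."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

stored: dict[str, str] = {}  # source_uri -> sha, standing in for documents

def needs_reingest(uri: str, text: str) -> bool:
    sha = content_hash(text)
    if stored.get(uri) == sha:
        return False   # unchanged: skip chunking and embedding entirely
    stored[uri] = sha  # new or changed: record the hash and re-ingest
    return True

print(needs_reingest("docs/faq.md", "v1"))  # True  (new document)
print(needs_reingest("docs/faq.md", "v1"))  # False (unchanged)
print(needs_reingest("docs/faq.md", "v2"))  # True  (content changed)
```

The full ingestion script in Step 5 does exactly this, just against the real documents table.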

Step 3: Pick an embedding model (and dimension)

Your embedding model decides three things: quality, cost, and vector dimension. Dimension matters because it drives storage and index speed — a 3072-dim vector takes 6× the space of a 512-dim one.

| Model | Dim | Cost | Self-hostable | Best for |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 (Matryoshka: 256–1536) | $0.02 / 1M tokens | No | Default pick, great quality/price |
| OpenAI text-embedding-3-large | 3072 | $0.13 / 1M tokens | No | Maximum recall, multilingual |
| Cohere embed-multilingual-v3 | 1024 | $0.10 / 1M tokens | No | Strong EU-language coverage |
| BGE bge-m3 | 1024 | Free | Yes (Ollama, TEI) | GDPR-strict, on-prem |
| nomic-embed-text-v1.5 | 768 (Matryoshka: 64–768) | Free | Yes (Ollama) | Small, fast, local |
| mxbai-embed-large | 1024 | Free | Yes (Ollama) | Open, high MTEB score |

For most teams starting out, text-embedding-3-small at 1536 dimensions is the right default: excellent quality, cheap, and Matryoshka-compatible so you can truncate to 512 or 768 later without re-embedding.

For GDPR-strict projects where you cannot send content to the OpenAI API, run bge-m3 or nomic-embed-text on a DanubeData VPS (DD Small at €12.49/mo is plenty for a few hundred thousand chunks) via Ollama or Hugging Face text-embeddings-inference. Then your embeddings, your vectors, and your data all stay on German servers.

Step 4: Choose an index — HNSW vs IVFFlat

A vector(1536) column without an index works fine up to ~10,000 rows — Postgres will do a sequential scan. Beyond that, you need an approximate nearest neighbor (ANN) index. pgvector ships two:

| Aspect | HNSW | IVFFlat |
|---|---|---|
| Recall @ 10 | 95–99% | 80–95% (tune lists) |
| Build time | Slower (graph construction) | Faster |
| Query latency | Very low, sub-ms to ms | Higher, tunable |
| Insert cost | Moderate | Low |
| Memory use | Higher (graph in RAM) | Lower |
| Requires data to build? | No — can build on empty | Yes — needs data to cluster |
| 2026 default | Yes | Legacy workloads only |

HNSW is the 2026 default. It has higher recall, much lower query latency, and — unlike IVFFlat — you can build the index before you have any data, which matters for CI/CD and blue/green deploys.

-- Create the HNSW index (cosine distance is what OpenAI embeddings expect)
CREATE INDEX chunks_embedding_hnsw_idx
    ON chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- For larger datasets, increase both:
-- WITH (m = 32, ef_construction = 128)  -- higher recall, slower build

-- At query time, tune the search width:
SET hnsw.ef_search = 100;  -- default 40. Higher = more recall, slower.

Only reach for IVFFlat if you are ingesting hundreds of thousands of new rows per hour and HNSW insert throughput becomes the bottleneck. For ~99% of RAG apps, HNSW is what you want.

Step 5: The ingestion pipeline (Python)

Here is a minimal, production-shaped ingestion script. It chunks Markdown files with overlap, embeds them with OpenAI, and upserts into Postgres. Run it locally, in a CI job, or on a DanubeData VPS.

#!/usr/bin/env python3
# ingest.py — chunk + embed + upsert into pgvector
import hashlib
import os
import re
from pathlib import Path

import psycopg
import tiktoken
from openai import OpenAI
from pgvector.psycopg import register_vector

DATABASE_URL = os.environ["DATABASE_URL"]  # postgres://user:pass@db.danubedata.ro:5432/rag
TENANT_ID = os.environ["TENANT_ID"]         # a UUID
EMBED_MODEL = "text-embedding-3-small"
EMBED_DIM = 1536
CHUNK_TOKENS = 500
CHUNK_OVERLAP = 75

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = CHUNK_TOKENS, overlap: int = CHUNK_OVERLAP):
    """Split text into overlapping token windows. Good enough for 95% of cases."""
    # Preserve paragraph boundaries when possible
    paragraphs = re.split(r"\n\s*\n", text.strip())
    tokens, buf = [], []
    for p in paragraphs:
        p_tokens = enc.encode(p)
        if len(buf) + len(p_tokens) > max_tokens and buf:
            tokens.append(buf)
            # keep last `overlap` tokens as bridge to next chunk
            buf = buf[-overlap:] + p_tokens
        else:
            buf.extend(p_tokens)
    if buf:
        tokens.append(buf)
    return [enc.decode(t) for t in tokens]

def embed_batch(texts: list[str]) -> list[list[float]]:
    """Embed up to 100 strings in a single API call."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return [d.embedding for d in resp.data]

def ingest_file(conn, path: Path):
    content = path.read_text(encoding="utf-8")
    sha = hashlib.sha256(content.encode()).hexdigest()

    with conn.cursor() as cur:
        # Skip early if the stored hash matches (idempotent re-runs)
        cur.execute(
            "SELECT id, content_sha256 FROM documents WHERE tenant_id = %s AND source_uri = %s",
            (TENANT_ID, str(path)),
        )
        existing = cur.fetchone()
        if existing and existing[1] == sha:
            print(f"skip {path} (unchanged)")
            return

        # Upsert the document, recording the new content hash
        cur.execute(
            """
            INSERT INTO documents (tenant_id, source_type, source_uri, title, content_sha256)
            VALUES (%s, 'markdown', %s, %s, %s)
            ON CONFLICT (tenant_id, source_uri)
            DO UPDATE SET title = EXCLUDED.title,
                          content_sha256 = EXCLUDED.content_sha256,
                          updated_at = now()
            RETURNING id
            """,
            (TENANT_ID, str(path), path.stem, sha),
        )
        doc_id = cur.fetchone()[0]

        # Re-chunk: wipe old chunks, insert new ones
        cur.execute("DELETE FROM chunks WHERE document_id = %s", (doc_id,))
        chunks = chunk_text(content)
        # Batch embed in groups of 100
        for i in range(0, len(chunks), 100):
            batch = chunks[i:i + 100]
            vectors = embed_batch(batch)
            rows = [
                (doc_id, TENANT_ID, i + j, chunks[i + j],
                 len(enc.encode(chunks[i + j])), vectors[j])
                for j in range(len(batch))
            ]
            cur.executemany(
                """
                INSERT INTO chunks
                    (document_id, tenant_id, chunk_index, content, token_count, embedding)
                VALUES (%s, %s, %s, %s, %s, %s)
                """,
                rows,
            )
        print(f"ingested {path}: {len(chunks)} chunks")

    conn.commit()

def main():
    with psycopg.connect(DATABASE_URL) as conn:
        register_vector(conn)
        for md in Path("./docs").rglob("*.md"):
            ingest_file(conn, md)

if __name__ == "__main__":
    main()

Notes on this script:

  • Idempotent — hash-based skip means you can re-run on every CI build without re-embedding unchanged docs.
  • Batch embeds — 100 texts per API call. OpenAI allows up to 2048 inputs but 100 is a sweet spot for latency and error isolation.
  • Overlap — 75 tokens of overlap means important sentences never fall on a chunk boundary.
  • Tenant isolation — tenant_id on every chunk. Always filter on it at query time.

Step 6: The retrieval query

pgvector overloads three distance operators:

  • <=> — cosine distance (1 - cosine similarity). What you want for OpenAI / most modern embeddings.
  • <-> — L2 (Euclidean) distance.
  • <#> — negative inner product.

The basic retrieval query — "find the top 5 chunks closest to my question embedding" — looks like this:

-- $1 = query embedding as vector(1536), $2 = tenant_id
SELECT
    c.id,
    c.content,
    c.document_id,
    d.title,
    d.source_uri,
    c.embedding <=> $1 AS distance
FROM chunks c
JOIN documents d ON d.id = c.document_id
WHERE c.tenant_id = $2
ORDER BY c.embedding <=> $1
LIMIT 5;

The ORDER BY ... <=> $1 clause is what triggers the HNSW index. EXPLAIN ANALYZE should show an Index Scan using chunks_embedding_hnsw_idx — if you see a sequential scan instead, check that the index exists and that your WHERE filter is not so selective that the planner falls back to a plain scan. Recall at query time is governed by hnsw.ef_search.

Hybrid search: vector + BM25 via tsvector

Pure vector search misses exact-match queries like product SKUs, version numbers, or unusual proper nouns. Hybrid search combines vector similarity with traditional keyword scoring via Reciprocal Rank Fusion (RRF) — and because your content_tsv column is already there, it is a single SQL statement:

WITH
vector_hits AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
    FROM chunks
    WHERE tenant_id = $2
    ORDER BY embedding <=> $1
    LIMIT 50
),
keyword_hits AS (
    SELECT id, ROW_NUMBER() OVER
        (ORDER BY ts_rank_cd(content_tsv, plainto_tsquery('english', $3)) DESC) AS rank
    FROM chunks
    WHERE tenant_id = $2
      AND content_tsv @@ plainto_tsquery('english', $3)
    LIMIT 50
)
SELECT c.id, c.content, d.title, d.source_uri,
       COALESCE(1.0 / (60 + v.rank), 0) + COALESCE(1.0 / (60 + k.rank), 0) AS rrf_score
FROM chunks c
LEFT JOIN vector_hits  v ON v.id = c.id
LEFT JOIN keyword_hits k ON k.id = c.id
JOIN documents d ON d.id = c.document_id
WHERE v.id IS NOT NULL OR k.id IS NOT NULL
ORDER BY rrf_score DESC
LIMIT 5;

$1 is the query embedding, $2 the tenant, $3 the raw query text. RRF with k = 60 is the standard fusion weight from the original paper and works well without tuning.
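The fusion itself is tiny. Here it is in Python, with ranks starting at 1 to match ROW_NUMBER() and made-up chunk ids:

```python
def rrf(rank_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranks in rank_lists:
        for rank, doc_id in enumerate(ranks, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_hits = ["c7", "c2", "c9"]    # ids ranked by cosine distance
keyword_hits = ["c2", "c5", "c7"]   # ids ranked by ts_rank_cd

print(rrf([vector_hits, keyword_hits]))  # -> ['c2', 'c7', 'c5', 'c9']
```

Note how c2 wins overall: second in one list and first in the other beats c7's first-and-third. Rewarding consistent placement across both rankings is exactly what you want from fusion.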

Step 7: Wrap it in a FastAPI endpoint

Now glue it together into a real /ask endpoint that embeds the question, retrieves chunks, and calls an LLM to synthesize the answer.

# api.py — FastAPI RAG endpoint
import os
from contextlib import asynccontextmanager
from typing import List

import psycopg
from fastapi import FastAPI, HTTPException
from openai import OpenAI
from pgvector.psycopg import register_vector
from pydantic import BaseModel

DATABASE_URL = os.environ["DATABASE_URL"]
EMBED_MODEL = "text-embedding-3-small"
CHAT_MODEL = "gpt-4o-mini"
TOP_K = 5

client = OpenAI()
conn: psycopg.Connection | None = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global conn
    conn = psycopg.connect(DATABASE_URL, autocommit=True)
    register_vector(conn)
    with conn.cursor() as cur:
        cur.execute("SET hnsw.ef_search = 100")
    yield
    conn.close()

app = FastAPI(lifespan=lifespan)

class AskRequest(BaseModel):
    question: str
    tenant_id: str

class Citation(BaseModel):
    title: str
    source_uri: str
    excerpt: str

class AskResponse(BaseModel):
    answer: str
    citations: List[Citation]

SYSTEM_PROMPT = """You are a helpful assistant. Answer the user's question using ONLY
the provided context snippets. If the answer is not in the context, say you don't know.
Always cite sources inline like [1], [2] matching the snippet numbers."""

@app.post("/ask", response_model=AskResponse)
async def ask(req: AskRequest) -> AskResponse:
    # 1. Embed the question
    emb = client.embeddings.create(model=EMBED_MODEL, input=[req.question]).data[0].embedding

    # 2. Retrieve top-k chunks (hybrid search)
    with conn.cursor() as cur:
        cur.execute(
            """
            WITH
            vh AS (SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> %s) AS r
                   FROM chunks WHERE tenant_id = %s
                   ORDER BY embedding <=> %s LIMIT 50),
            kh AS (SELECT id, ROW_NUMBER() OVER
                     (ORDER BY ts_rank_cd(content_tsv, plainto_tsquery('english', %s)) DESC) AS r
                   FROM chunks
                   WHERE tenant_id = %s
                     AND content_tsv @@ plainto_tsquery('english', %s)
                   LIMIT 50)
            SELECT c.content, d.title, d.source_uri,
                   COALESCE(1.0/(60+vh.r),0) + COALESCE(1.0/(60+kh.r),0) AS rrf
            FROM chunks c
            LEFT JOIN vh ON vh.id = c.id
            LEFT JOIN kh ON kh.id = c.id
            JOIN documents d ON d.id = c.document_id
            WHERE vh.id IS NOT NULL OR kh.id IS NOT NULL
            ORDER BY rrf DESC LIMIT %s
            """,
            (emb, req.tenant_id, emb, req.question, req.tenant_id, req.question, TOP_K),
        )
        rows = cur.fetchall()

    if not rows:
        raise HTTPException(404, "No relevant context found")

    # 3. Build the prompt
    context_block = "\n\n".join(
        f"[{i+1}] {content}\n(source: {title})"
        for i, (content, title, *_) in enumerate(rows)
    )
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user",
         "content": f"Context:\n{context_block}\n\nQuestion: {req.question}"},
    ]

    # 4. Call the LLM
    completion = client.chat.completions.create(
        model=CHAT_MODEL, messages=messages, temperature=0.2
    )
    answer = completion.choices[0].message.content

    return AskResponse(
        answer=answer,
        citations=[
            Citation(title=title, source_uri=uri, excerpt=content[:200])
            for content, title, uri, _ in rows
        ],
    )

Run it on a DanubeData DD Small VPS (€12.49/mo) with uvicorn api:app --host 0.0.0.0 --port 8000 and put it behind Caddy or nginx with automatic HTTPS. Your total infrastructure for a production RAG stack is now one database (€19.99) and one VPS (€12.49): €32.48/month, all in Germany.

Step 8: Optimizations that matter at scale

1. Matryoshka: truncate dimensions without re-embedding

text-embedding-3-small and nomic-embed-text-v1.5 are Matryoshka Representation Learning (MRL) models. That means the first N dimensions of the vector are themselves a valid (lower-quality) embedding. You can store the full 1536 dims but index only 512 or 768 for a 3× speedup with negligible recall loss.

-- Store full embeddings, index only first 512 dims
ALTER TABLE chunks ADD COLUMN embedding_512 vector(512)
    GENERATED ALWAYS AS (subvector(embedding, 1, 512)::vector(512)) STORED;

CREATE INDEX chunks_embedding_512_hnsw
    ON chunks USING hnsw (embedding_512 vector_cosine_ops);

-- Query: retrieve 50 with cheap 512-dim index, then re-rank with full 1536-dim
WITH candidates AS (
    SELECT id, content, embedding FROM chunks
    WHERE tenant_id = $2
    ORDER BY embedding_512 <=> subvector($1, 1, 512)::vector(512)
    LIMIT 50
)
SELECT id, content FROM candidates
ORDER BY embedding <=> $1
LIMIT 5;
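On the client side the same trick is a two-liner. A sketch, assuming plain Python lists — the L2 renormalization is only strictly required for inner-product search (cosine operators normalize implicitly), but it is cheap and keeps truncated vectors well-behaved:

```python
import math

def truncate_mrl(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` dims of a Matryoshka embedding, L2-renormalized."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

v = [3.0, 4.0, 100.0]      # pretend this is a 1536-dim embedding
print(truncate_mrl(v, 2))  # -> [0.6, 0.8]  (unit length again)
```

This only works for MRL-trained models; truncating a non-Matryoshka embedding destroys its geometry.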

2. halfvec quantization (pgvector 0.7+)

halfvec stores each dimension as a 16-bit float instead of 32-bit. You cut storage and index memory in half with essentially no recall hit for most embedding models. On DanubeData this turns a 1M-chunk, 1536-dim index from ~6 GB to ~3 GB.

-- Add halfvec column alongside the main one
ALTER TABLE chunks ADD COLUMN embedding_half halfvec(1536)
    GENERATED ALWAYS AS (embedding::halfvec(1536)) STORED;

CREATE INDEX chunks_embedding_half_hnsw
    ON chunks USING hnsw (embedding_half halfvec_cosine_ops);

-- Query the halfvec index for half the memory pressure
SELECT id, content
FROM chunks
WHERE tenant_id = $2
ORDER BY embedding_half <=> ($1::halfvec(1536))
LIMIT 5;
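You can see the precision/size trade-off with nothing but the standard library — struct's "e" format is the same IEEE 754 half-precision (binary16) that halfvec stores per dimension:

```python
import struct

x = 0.123456789
f32 = struct.pack("f", x)  # 4 bytes per dimension, like vector
f16 = struct.pack("e", x)  # 2 bytes per dimension, like halfvec

print(len(f32), len(f16))  # -> 4 2

roundtrip = struct.unpack("e", f16)[0]
print(abs(roundtrip - x) < 1e-3)  # -> True: tiny rounding error, harmless for ANN
```

A relative error around 1e-3 per dimension barely moves cosine distances, which is why recall is essentially unchanged.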

3. Partial indexes per tenant or content type

If one tenant dominates the table, a partial index keeps HNSW small and fast:

CREATE INDEX chunks_embedding_acme_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WHERE tenant_id = '11111111-2222-3333-4444-555555555555';

4. Periodic VACUUM and ANALYZE

HNSW does not reclaim space on delete. When chunks get re-ingested frequently, the index accumulates dead tuples. DanubeData's managed Postgres runs autovacuum, but for very write-heavy tables schedule a REINDEX INDEX CONCURRENTLY during a quiet window once a week.

How many vectors can Postgres really handle?

Real-world numbers on a DanubeData Managed PostgreSQL (DD Standard, 4 vCPU, 8GB RAM) with pgvector 0.8 and HNSW:

  • 100K vectors, 1536d — p95 latency 2–5 ms, index ~1 GB, builds in under 2 minutes.
  • 1M vectors, 1536d — p95 latency 8–15 ms, index ~8 GB (3 GB with halfvec), builds in 20–30 minutes.
  • 5M vectors, 1536d — p95 latency 20–40 ms, fits on a DD Large (16GB) with halfvec quantization.
  • 10M+ vectors, 1536d — still fine with halfvec + dedicated CPU plan; beyond this consider a dedicated vector store or sharding by tenant.

The honest threshold where a dedicated vector DB becomes worth the operational cost is around 25–50M vectors or heavy multi-region requirements. Everything below that, pgvector wins.

GDPR, embeddings, and the OpenAI question

An under-discussed RAG pitfall: sending your customer content to OpenAI to embed it is a GDPR data transfer. You need a DPA with OpenAI, Standard Contractual Clauses, and a legal basis. Two practical options:

  1. Stay on OpenAI — OpenAI's API does not train on API inputs by default (since March 2023) and they offer an enterprise DPA. Document this in your Record of Processing Activities.
  2. Self-host embeddings on a DanubeData VPS — run bge-m3 or nomic-embed-text via Ollama or text-embeddings-inference on a DD Small (€12.49/mo) or DD Standard. Zero cross-border transfers, zero third-party processor. Your vectors, your database, your data center — all in Germany.

Because your Postgres lives on DanubeData in Falkenstein, the vectors themselves never leave the EU either way. Only the ephemeral embedding API call (option 1) leaves, and even that can be kept in-EU with OpenAI's EU data residency options or by switching to option 2.

What it actually costs

| Stack piece | DanubeData | Pinecone + Render equivalent |
|---|---|---|
| Vector DB (1M chunks, 1536d) | €19.99 (Managed Postgres with pgvector) | $70 (Pinecone Starter) |
| API / ingestion host | €12.49 (DD Small VPS) | ~$19 (Render Starter) |
| Document storage (100GB S3) | €3.99 (includes 1TB traffic) | ~$23 (AWS S3 + egress) |
| Monthly total | €36.47 | $112+ |
| GDPR / data residency | Germany, EU provider | Mixed US/EU |

Roughly 3× cheaper, simpler to operate, and GDPR-native — for anything below 10 million chunks.

FAQ

How many vectors can Postgres handle with pgvector?

Comfortably up to 10 million 1536-dimensional vectors on a single DanubeData Managed Postgres instance with HNSW and halfvec quantization. Many teams run tens of millions with careful tuning and partial indexes. The rule of thumb: if your HNSW index fits in RAM, you are fine.
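A back-of-envelope for that rule of thumb — a crude lower bound (real HNSW indexes carry per-tuple overhead on top, so pad the result), but good enough for capacity planning:

```python
def hnsw_ram_gib(n_vectors: int, dim: int, bytes_per_dim: int = 4, m: int = 16) -> float:
    """Rough HNSW footprint: raw vectors plus ~2*m neighbor links per node."""
    vectors = n_vectors * dim * bytes_per_dim  # 4 bytes/dim for vector, 2 for halfvec
    links = n_vectors * 2 * m * 8              # crude estimate of graph-link overhead
    return (vectors + links) / 1024**3

print(round(hnsw_ram_gib(1_000_000, 1536), 1))                   # full-precision vector
print(round(hnsw_ram_gib(1_000_000, 1536, bytes_per_dim=2), 1))  # halfvec: roughly half
```

If the estimate for your corpus exceeds the plan's RAM, reach for halfvec or Matryoshka truncation before reaching for a bigger box.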

When should I actually use a dedicated vector DB like Pinecone or Qdrant?

Three scenarios: (1) you have >50M vectors and need distributed HNSW, (2) you need globally replicated vector search with low latency on every continent, or (3) your team explicitly does not want to run Postgres. Below those bars, pgvector on managed Postgres is cheaper and simpler.

HNSW or IVFFlat — which should I pick?

HNSW, always, in 2026. Higher recall, lower latency, and it can be built on empty tables. IVFFlat only makes sense for insert-heavy workloads above ~100K writes/hour, and even there pgvector 0.8's HNSW is usually competitive.

Which embedding model should I start with?

Default to OpenAI text-embedding-3-small at 1536 dimensions. Quality is excellent, cost is negligible ($0.02 per 1M tokens), and it supports Matryoshka truncation so you can shrink to 512 later without re-embedding. If you need 100% self-hosted, use bge-m3 or nomic-embed-text via Ollama on a DanubeData VPS.

Can I use Ollama / self-hosted embeddings with pgvector?

Absolutely. Run Ollama or text-embeddings-inference on a DanubeData VPS, point your ingestion script at http://vps-ip:11434/api/embeddings instead of OpenAI, and store the resulting vectors in the same schema. The only thing that changes is the embedding dimension and the API client.

Is sending content to OpenAI for embeddings a GDPR problem?

It is a data transfer to a US processor, so you need OpenAI's DPA, SCCs, a Transfer Impact Assessment, and to document it in your Record of Processing Activities. OpenAI does not train on API inputs. For truly strict setups (health, legal, public sector), self-host embeddings on a DanubeData VPS — that way no customer content ever leaves the EU.

How do I update or re-index chunks when a document changes?

Use the content_sha256 pattern from the ingestion script: hash the document, compare to what is stored, and if it changed, DELETE FROM chunks WHERE document_id = ? and re-embed. pgvector's HNSW handles deletes and reinserts fine; schedule a weekly REINDEX INDEX CONCURRENTLY if you rewrite a lot.

What about hybrid search with BM25?

Postgres's built-in tsvector full-text search gives you BM25-flavored keyword ranking out of the box. Combine it with vector similarity via Reciprocal Rank Fusion (RRF, see the SQL example above). You can also add pg_trgm for fuzzy matching or the paradedb extension for true BM25 if you need it.

Get started today

A full RAG stack on DanubeData takes about 20 minutes:

  1. Create a Managed PostgreSQL on DanubeData (€19.99/mo, pgvector included)
  2. Run CREATE EXTENSION vector;
  3. Apply the schema from Step 2
  4. Run the ingestion script from Step 5 against your Markdown, PDFs, or Notion export
  5. Deploy the FastAPI endpoint on a DD Small VPS (€12.49/mo)
  6. Ask your data questions

Why DanubeData for RAG workloads:

  • Managed PostgreSQL 15–17 with pgvector, pg_trgm, and unaccent preinstalled
  • From €19.99/mo with NVMe storage and automatic snapshots
  • Germany-based (Falkenstein), GDPR-native, EU-based company
  • Pair with DD Small VPS (€12.49) and S3 storage (€3.99) for a full in-house AI stack
  • 20TB included traffic, €50 signup credit

You do not need Pinecone. You do not need Weaviate. You already have Postgres. Turn it on.

👉 Create your Managed PostgreSQL on DanubeData

Building a RAG product in regulated industry? Talk to us — we help design GDPR-safe AI stacks on European infrastructure.

Share this article

Ready to Get Started?

Deploy your infrastructure in minutes with DanubeData's managed services.