Running Qdrant at Scale: Lessons from a 40M-Vector Deployment
Most teams adopt a vector database after a comfortable proof of concept on a single laptop, then panic when the corpus crosses a few million vectors. We hit that wall at 12 million embeddings and ended up rebuilding our entire production Qdrant setup over a six-week sprint. The lessons below are the ones I wish someone had handed me on day one.
Qdrant’s documentation is excellent for getting started, but the operational details (shard sizing, Raft quorum, quantization trade-offs, the Kubernetes operator quirks) only emerge once you’re running it 24/7. Let’s walk through what actually matters when the pager goes off at 3 AM.
Distributed mode and raft consensus
Qdrant supports a single-node mode that’s fine for development, but production requires distributed mode. Specifically, you set cluster.enabled: true in each node’s config, bootstrap every new peer against an existing member, and let Raft handle leader election and metadata replication. Three nodes is the minimum for tolerating a single failure; five is more comfortable if you can afford the hardware.
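The bootstrap itself is short. Roughly this, with the hostnames as placeholders for your own nodes and 6335 as the default p2p port:
# First peer starts the cluster and advertises its own p2p URI
./qdrant --uri 'http://qdrant-node-0.internal:6335'
# Each additional peer bootstraps against any existing member
./qdrant --bootstrap 'http://qdrant-node-0.internal:6335' --uri 'http://qdrant-node-1.internal:6335'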
Importantly, Raft only replicates collection metadata, not the actual vector data. Shard placement and replication are configured per collection. Therefore, a healthy Raft quorum tells you nothing about whether your search is up; that requires monitoring shard placement and transfer status directly.
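In practice that means two separate health checks:
# Raft peer status (metadata layer only)
curl 'http://qdrant.internal:6333/cluster'
# Per-collection shard placement, replica state, and ongoing transfers
curl 'http://qdrant.internal:6333/collections/documents/cluster'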
Sharding and replication strategy
Each collection is split into shards, and each shard can have multiple replicas. For our 40M-vector collection at 768 dimensions, we settled on 6 shards with replication factor 2. That gave us roughly 6.7M vectors per shard, well below the 10M soft ceiling we’d hit during stress tests on r6i.4xlarge nodes.
Conversely, smaller collections under 5M vectors don’t need more than 2 or 3 shards. Over-sharding hurts because each query fans out to every shard, and the slowest shard sets the latency floor. Plan shard count for your peak corpus size, not your launch-day numbers.
# Create a production-grade collection with quantization and payload indexes
curl -X PUT 'http://qdrant.internal:6333/collections/documents' \
-H 'Content-Type: application/json' \
-d '{
"vectors": {
"size": 768,
"distance": "Cosine",
"on_disk": true
},
"shard_number": 6,
"replication_factor": 2,
"write_consistency_factor": 1,
"hnsw_config": {
"m": 32,
"ef_construct": 256,
"full_scan_threshold": 10000,
"on_disk": false
},
"quantization_config": {
"scalar": {
"type": "int8",
"quantile": 0.99,
"always_ram": true
}
},
"optimizers_config": {
"default_segment_number": 4,
"indexing_threshold": 20000
}
}'
# Create a payload index so filters run before vector search
curl -X PUT 'http://qdrant.internal:6333/collections/documents/index' \
-H 'Content-Type: application/json' \
-d '{
"field_name": "tenant_id",
"field_schema": "keyword"
}'
# Snapshot a collection (run on each shard host or via S3 endpoint)
curl -X POST 'http://qdrant.internal:6333/collections/documents/snapshots'
Quantization: where the memory savings come from
Vector data is expensive in RAM. A 40M corpus at 768 float32 dimensions is roughly 115 GB raw, before HNSW graph overhead. Consequently, quantization is non-negotiable at scale. Qdrant offers three flavors: scalar (int8), product, and binary.
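Scalar is what the collection body above configures; the other two are selected with creation-time fragments roughly like these (field names per recent Qdrant releases, so verify against your version):
# Product quantization: compression ratio selectable, here 16x, at the cost of slower scoring
"quantization_config": { "product": { "compression": "x16", "always_ram": true } }
# Binary quantization: 1 bit per dimension, pair it with rescoring
"quantization_config": { "binary": { "always_ram": true } }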
In practice, scalar int8 quantization with always_ram=true is the sweet spot for most teams. It cuts memory by roughly 4x with a recall drop under 1 percent on dense embeddings from models like BGE or text-embedding-3. Binary quantization is more aggressive — 32x reduction — but recall drops noticeably unless you re-rank with the original vectors.
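Rescoring against the original vectors is a per-request decision rather than a collection setting. A sketch, with the query vector truncated to three values for brevity:
curl -X POST 'http://qdrant.internal:6333/collections/documents/points/search' \
-H 'Content-Type: application/json' \
-d '{
"vector": [0.02, -0.11, 0.37],
"limit": 10,
"params": {
"quantization": {
"rescore": true,
"oversampling": 2.0
}
}
}'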
On our deployment, scalar quantization brought RAM from a projected 140 GB to 38 GB per replica, letting us run r6i.4xlarge nodes instead of r6i.8xlarge. That alone saved roughly $2,800 per month per cluster.
HNSW tuning: ef_construct, m, and the recall trade-off
HNSW has two indexing parameters that materially affect recall and build time: m and ef_construct. Higher values yield better recall but cost more memory and longer index builds. For our 768-dim corpus, m=32 and ef_construct=256 gave 98.4 percent recall@10 against an exhaustive search baseline.
At query time, ef (hnsw_ef in the search params) is the knob. We expose it per request and let high-precision queries push it to 256 while bulk reranking jobs use 64. This kind of per-request tuning is impossible in some competing engines and is one reason we picked Qdrant.
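Both the exhaustive baseline and the per-request ef are just search params; sketches with the query vector truncated to three values:
# Exhaustive baseline for measuring recall (slow; not for production traffic)
curl -X POST 'http://qdrant.internal:6333/collections/documents/points/search' \
-H 'Content-Type: application/json' \
-d '{ "vector": [0.02, -0.11, 0.37], "limit": 10, "params": { "exact": true } }'
# High-precision path: push hnsw_ef up; bulk reranking jobs drop it to 64
curl -X POST 'http://qdrant.internal:6333/collections/documents/points/search' \
-H 'Content-Type: application/json' \
-d '{ "vector": [0.02, -0.11, 0.37], "limit": 10, "params": { "hnsw_ef": 256 } }'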
Payload indexing and hybrid search
Filtering is where naive vector setups fall apart. Without a payload index, filtering by tenant_id means Qdrant scans every match candidate post-search, which destroys latency under multi-tenant load. Therefore, always create payload indexes on fields used in filters before launching.
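With the tenant_id index from the earlier snippet in place, a filtered search looks like this (the tenant value is a placeholder, query vector truncated):
curl -X POST 'http://qdrant.internal:6333/collections/documents/points/search' \
-H 'Content-Type: application/json' \
-d '{
"vector": [0.02, -0.11, 0.37],
"limit": 10,
"filter": {
"must": [
{ "key": "tenant_id", "match": { "value": "tenant-a" } }
]
}
}'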
Additionally, hybrid search — combining dense vectors with sparse BM25 or SPLADE signals — meaningfully boosts recall on keyword-heavy queries. Qdrant’s named vectors feature lets you store both representations on the same point and fuse them with reciprocal rank fusion at query time. For background on choosing the right embedding model, see my notes on embedding model comparison.
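Wiring that up needs a collection with a named dense vector and a sparse vector (the single-vector collection created above won’t do) plus the query API’s fusion support in recent Qdrant releases. A sketch, with the collection name, vector names, and values all placeholders:
curl -X POST 'http://qdrant.internal:6333/collections/documents_hybrid/points/query' \
-H 'Content-Type: application/json' \
-d '{
"prefetch": [
{ "query": [0.02, -0.11, 0.37], "using": "dense", "limit": 50 },
{ "query": { "indices": [1032, 4096], "values": [0.81, 0.42] }, "using": "splade", "limit": 50 }
],
"query": { "fusion": "rrf" },
"limit": 10
}'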
Kubernetes operator and cluster lifecycle
The official Qdrant operator handles rolling upgrades, snapshots, and scaling without the manual StatefulSet choreography. We use it with hostPath-provisioned NVMe on bare-metal nodes for the hot tier, and a separate S3-backed cold tier for snapshots.
However, the operator has rough edges. Specifically, scaling up shards on an existing collection is still not automated — you must create a new collection, replicate data, and switch traffic. Plan capacity headroom (we keep 40 percent free) so you can defer that operation. Pair this with the broader patterns I’ve covered in Redis 8 streams in production when you need a unified data plane.
Real latency numbers and pgvector comparison
On our 40M-vector collection with scalar quantization, p50 latency for top-10 search is 22 ms, p99 is 78 ms, all measured at the gRPC client. REST adds roughly 4 ms of overhead per request, which matters for high-QPS paths but not for human-facing search.
By contrast, pgvector on the same data and hardware delivered 140 ms p50 with ivfflat, and roughly 65 ms p50 with HNSW (added in pgvector 0.5.0). pgvector wins on operational simplicity if you’re already on Postgres; Qdrant wins on raw throughput, hybrid search, and quantization sophistication. Weaviate and Milvus sit in between on most metrics. The official Qdrant optimizer documentation goes deeper on segment-level tuning.
In conclusion, a successful production Qdrant deployment hinges on three decisions made early: shard count sized for peak corpus, scalar quantization enabled from day one, and payload indexes on every filter field. Get those right and Qdrant scales gracefully into the tens of millions of vectors with sub-100 ms p99 latency. Skip them and you’ll rebuild the cluster six weeks in, just like we did.