Vector databases became the "must-have" infrastructure for AI applications in 2024. Every RAG tutorial starts with "first, set up your vector database." But here's what most tutorials won't tell you: many AI applications don't need a dedicated vector database at all.
This guide explains what vector databases do under the hood, compares six major options with real pricing and performance data, walks through concrete architecture patterns, breaks down costs, and gives you a clear framework for deciding whether you need one.
What Vector Databases Do
The Problem They Solve
Traditional databases are built for exact matches. You search for a customer ID, a product name, or a date range, and the database returns rows that match precisely. But AI applications need similarity search: "find the documents most similar to this question."
Similarity search works on embeddings: numerical representations of text (or images, or any data) where similar items are close together in high-dimensional space. The sentence "How do I reset my password?" and "I forgot my login credentials" have different words but similar embeddings because they mean similar things.
A vector database stores these embeddings and finds the closest ones to a query vector quickly, even across millions of items. That's the core functionality. Everything else (metadata filtering, hybrid search, multi-tenancy) is an optimization on top of that foundation.
How Embeddings Work
An embedding model converts text into a fixed-length array of floating point numbers. OpenAI's text-embedding-3-small produces 1536-dimensional vectors. Cohere's embed-v3 outputs 1024 dimensions. Open-source models like BGE-large and E5-large produce 1024 dimensions. Smaller models like all-MiniLM-L6-v2 output 384 dimensions and run on a laptop CPU.
Two pieces of text with similar meaning produce vectors that are close together when you measure distance between them. The model learns these relationships during training on large text corpora.
Embedding Model Selection Matters More Than Database Choice
This is worth emphasizing early. The embedding model determines retrieval quality. The database stores and retrieves whatever vectors you give it. If your embedding model doesn't capture the semantic relationships in your domain, switching from pgvector to Pinecone won't help.
General-purpose models (OpenAI, Cohere) work well for most text. Domain-specific fine-tuned models improve results for specialized content like medical records, legal documents, or code. If your RAG system returns irrelevant results, test a different embedding model before changing your database.
Distance Metrics
The most common distance metrics are:
- Cosine similarity: Measures the angle between vectors. Most popular for text similarity. Score ranges from -1 (opposite) to 1 (identical). Direction matters, magnitude doesn't.
- Euclidean distance (L2): Straight-line distance between points. Works well when vector magnitude carries meaning, like when comparing document lengths or signal strengths.
- Dot product (inner product): Computationally fastest. Equivalent to cosine similarity when vectors are normalized, and most embedding models output unit-length vectors by default.
For most text-based AI applications, cosine similarity with a standard embedding model works well. Don't overthink the distance metric choice unless you're seeing specific retrieval quality issues.
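Each of these metrics is a few lines of NumPy. The toy sketch below uses made-up 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions) and also shows why dot product and cosine similarity agree on normalized vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Angle-based: vector magnitude cancels out.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance: magnitude matters.
    return float(np.linalg.norm(a - b))

def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    # Fastest: no norms to compute.
    return float(np.dot(a, b))

# Toy "embeddings" -- illustrative values, not output from a real model.
q = np.array([0.2, 0.9, 0.1])
d = np.array([0.25, 0.85, 0.05])

# Normalize to unit length, as most embedding models do by default.
qn, dn = q / np.linalg.norm(q), d / np.linalg.norm(d)

# On unit vectors, dot product and cosine similarity give the same score.
assert abs(dot_product(qn, dn) - cosine_similarity(q, d)) < 1e-9
```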
How Indexing Works Under the Hood
Finding the nearest neighbor in high-dimensional space by brute force means computing the distance to every single vector. At 1 million vectors with 1536 dimensions, that's 1 million distance calculations per query. Brute force works for small datasets (under 50K vectors) but falls apart at scale.
Vector databases use approximate nearest neighbor (ANN) algorithms that trade a small amount of accuracy for massive speed improvements. The two dominant approaches, plus a compression technique often layered on top:
HNSW (Hierarchical Navigable Small World). Builds a multi-layer graph where each vector connects to its nearby neighbors. Queries traverse the graph from coarse layers to fine layers, narrowing the search space at each step. HNSW delivers excellent recall (typically 95-99%) with fast queries. The tradeoff: it's memory-intensive because the entire graph lives in RAM. At 1M vectors with 1536 dimensions, expect 6-8GB of memory for the index alone.
IVF (Inverted File Index). Partitions vectors into clusters using k-means. During a query, only the nearest clusters are searched. IVF uses less memory than HNSW and supports disk-based storage, but requires tuning the number of clusters and probes. Recall depends on how many clusters you search: more clusters means higher recall but slower queries.
Product Quantization (PQ). Compresses vectors by splitting them into sub-vectors and replacing each with a codebook entry. Reduces memory usage by 4-8x with some accuracy loss. Often combined with IVF (IVF-PQ) for large-scale deployments where memory is the constraint.
Most managed vector databases use HNSW internally because it delivers the best query performance without tuning. pgvector supports both IVFFlat and HNSW indexes.
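The cluster-and-probe tradeoff behind IVF is concrete in a toy NumPy sketch. This illustrates the idea only: real implementations train centroids with k-means and use optimized distance kernels, whereas here centroids are just sampled vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 64)).astype(np.float32)

# "Train" centroids by sampling (a real IVF index runs k-means here).
n_clusters = 16
centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)]

# Assign each vector to its nearest centroid: these are the inverted lists.
assignments = np.argmin(
    np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1
)
lists = {c: np.where(assignments == c)[0] for c in range(n_clusters)}

def ivf_search(query, k=10, nprobe=4):
    # Probe only the nprobe nearest clusters instead of scanning every vector.
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))
    candidates = np.concatenate([lists[c] for c in order[:nprobe]])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

def exact_search(query, k=10):
    # Brute force: distance to every vector (the baseline IVF approximates).
    return np.argsort(np.linalg.norm(vectors - query, axis=1))[:k]

query = rng.normal(size=64).astype(np.float32)
approx = set(ivf_search(query, nprobe=4))
exact = set(exact_search(query))
recall = len(approx & exact) / 10  # higher nprobe -> higher recall, slower query
```

Probing all clusters makes IVF equivalent to brute force; probing fewer trades recall for speed, which is exactly the knob managed databases expose.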
The Six Major Vector Databases Compared
| Feature | Pinecone | Weaviate | Chroma | pgvector | Qdrant | Milvus |
|---|---|---|---|---|---|---|
| Type | Managed only | Open source + managed | Open source + managed | Postgres extension | Open source + managed | Open source + managed |
| License | Proprietary | BSD-3 | Apache 2.0 | PostgreSQL License | Apache 2.0 | Apache 2.0 |
| Free Tier | Yes (2GB) | Sandbox | Yes | Free (extension) | Yes (1GB cloud) | Free (Zilliz Cloud) |
| Paid Starting | $70/mo | $25/mo | Usage-based | Your Postgres cost | $25/mo | $65/mo (Zilliz) |
| Hybrid Search | Sparse vectors | Built-in (BM25 + vector) | Limited | Combine with tsvector | Sparse vectors + payload | Built-in |
| Max Dimensions | 20,000 | 65,535 | No limit | 16,000 | 65,536 | 32,768 |
| Index Type | Proprietary | HNSW | HNSW | IVFFlat, HNSW | HNSW | IVF, HNSW, DiskANN |
| Metadata Filtering | Yes | Yes | Yes | SQL WHERE clauses | Yes (rich filters) | Yes |
| Multi-tenancy | Namespaces | Native | Collections | Row-level security | Collection-level | Partitions |
| SDK Languages | Python, JS, Go, Java | Python, JS, Go, Java | Python, JS | Any Postgres client | Python, JS, Go, Rust, Java | Python, JS, Go, Java |
| Best For | Zero-ops teams | Hybrid search | Prototyping | Postgres shops | Filtering-heavy workloads | Billion-scale datasets |
Pinecone
The most popular managed vector database. Fully hosted, no infrastructure to manage. You send vectors through the API, and Pinecone handles indexing, replication, and scaling.
Pricing: Free tier (1 index, 2GB storage). Starter at $70/month. Standard from $231/month. Enterprise custom pricing.
Strengths: Easiest setup (5 minutes to first query). Excellent documentation. Reliable uptime. Metadata filtering built in. Serverless option reduces costs for sporadic workloads.
Weaknesses: Vendor lock-in (proprietary, no self-hosted option). Gets expensive at scale. Limited query flexibility compared to self-hosted options. No hybrid search with BM25 (uses sparse vectors instead).
Best for: Teams that want zero infrastructure management. Startups and mid-size companies building RAG applications where ops simplicity matters more than cost optimization.
Pinecone's serverless offering (launched 2024) changed the pricing model. You pay per query and storage rather than provisioning fixed pods. For applications with variable traffic, this can reduce costs by 50-80% compared to pod-based pricing. For steady high-throughput workloads, pods are still more predictable.
Weaviate
Open-source vector database with both self-hosted and managed cloud options. Weaviate's differentiator is built-in hybrid search that combines BM25 keyword scoring with vector similarity in a single query.
Pricing: Free (self-hosted). Weaviate Cloud: free sandbox, Standard from $25/month for small workloads. Enterprise pricing scales with usage.
Strengths: Open source (can self-host for free). Built-in hybrid search (BM25 + vector). GraphQL API. Supports multiple embedding models natively with vectorizer modules. Named vectors allow multiple embedding spaces per object.
Weaknesses: Self-hosting requires DevOps knowledge. Cloud pricing can surprise at scale. More complex setup than Pinecone. Resource-heavy for small deployments.
Best for: Teams that want hybrid search, need to self-host for compliance, or want to run embedding models alongside the database.
Weaviate's hybrid search deserves a closer look. Pure vector search sometimes misses results that contain exact keywords users expect. Hybrid search runs both a BM25 keyword query and a vector similarity query, then fuses the results. For customer support applications and documentation search, hybrid search consistently outperforms pure vector search in retrieval quality benchmarks.
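One common way to fuse the two ranked lists is reciprocal rank fusion (RRF). The sketch below assumes you already have the BM25 and vector result lists; Weaviate also supports score-based fusion, so treat this as one illustrative method rather than its exact internals:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs; k=60 is the conventional damping constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank); highly ranked docs dominate.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two sub-queries.
bm25_hits = ["doc7", "doc2", "doc9"]    # keyword (BM25) ranking
vector_hits = ["doc2", "doc4", "doc7"]  # vector similarity ranking

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
# doc2 and doc7 appear in both lists, so they rise to the top.
```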
Chroma
Lightweight, developer-friendly vector database designed for rapid prototyping and small-to-medium production workloads.
Pricing: Free and open source. Hosted offering available with free tier.
Strengths: Simplest API of any vector database (4 main functions: add, query, update, delete). Runs in-process (no server needed for development). Great Python integration. Embeds documents for you if you provide an embedding function.
Weaknesses: Limited production track record at large scale (10M+ vectors). Fewer enterprise features (no built-in replication, backup). Smaller ecosystem compared to Pinecone or Weaviate.
Best for: Prototyping, hackathons, small to medium applications (under 1M vectors), developers who want the simplest possible setup.
Chroma's in-process mode is its killer feature for development. You import it, create a collection, and start adding documents. No Docker, no server process, no config files. When you're iterating on chunking strategies or embedding models, this removes friction that slows down experimentation. For production, switch to client-server mode or the hosted offering.
pgvector (PostgreSQL Extension)
A PostgreSQL extension that adds vector similarity search to your existing Postgres database. This is the "boring technology" option, and for many teams it's the right one.
Pricing: Free (it's an extension for Postgres you already run). Managed Postgres services (Supabase, Neon, RDS) support it at their standard pricing.
Strengths: No new infrastructure. Vectors live alongside your relational data. Full SQL for querying. ACID transactions. JOIN vectors with user tables, permission tables, anything. You already know Postgres.
Weaknesses: Slower than dedicated vector databases at scale (10M+ vectors). HNSW index uses significant memory. No built-in sharding for horizontal scaling. Tuning requires Postgres knowledge.
Best for: Teams already using Postgres. Applications under 5M vectors. Situations where you need vectors + relational data in the same query. Regulated industries where adding new infrastructure requires security review.
pgvector added HNSW indexing in version 0.5.0, which closed the performance gap significantly. With HNSW, pgvector handles 1M vectors at 30-80ms p95 latency. That's fast enough for any RAG application where the LLM generation step takes 500ms or more.
The underrated advantage of pgvector is transactional consistency. When you update a document and its embedding in the same transaction, you never have stale vectors pointing to deleted content. With separate vector databases, you need a sync pipeline, and sync pipelines eventually drift.
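To make the pgvector setup concrete, here is a minimal schema and query sketch held in Python string constants. The table and column names are hypothetical, and the HNSW options assume pgvector 0.5.0 or later:

```python
# pgvector's distance operators, keyed by metric name.
PGVECTOR_OPS = {
    "l2": "<->",      # Euclidean distance
    "cosine": "<=>",  # cosine distance
    "ip": "<#>",      # negative inner product
}

# Hypothetical schema -- adjust names and dimensions to your own data.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    content text,
    embedding vector(1536)
);
-- HNSW index (pgvector >= 0.5.0); m / ef_construction are the standard HNSW knobs.
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);
"""

def knn_query(metric: str = "cosine", limit: int = 5) -> str:
    # Parameterized nearest-neighbor query; $1 is the query vector.
    op = PGVECTOR_OPS[metric]
    return (
        f"SELECT id, content FROM documents "
        f"ORDER BY embedding {op} $1 LIMIT {limit}"
    )
```

You would execute these through any Postgres client; because it's plain SQL, the vector query can also JOIN against permission or user tables in the same statement.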
Qdrant
Open-source vector database written in Rust, known for strong filtering performance and flexible payload storage.
Pricing: Free (self-hosted). Qdrant Cloud: 1GB free tier, then from ~$25/month. Enterprise pricing available.
Strengths: Written in Rust (fast, memory-efficient). Excellent metadata filtering, supports nested objects and geo-filters. Quantization built in (binary, scalar, product). Snapshot and backup support. gRPC API for low-latency access.
Weaknesses: Smaller community than Pinecone or Weaviate. Cloud offering is newer. Fewer built-in integrations (no native embedding module).
Best for: Applications with complex filtering requirements (e-commerce, multi-tenant SaaS). Teams that want high performance with self-hosted control. Rust-native infrastructure stacks.
Qdrant's filtering is worth highlighting. Most vector databases apply filters after the ANN search, which means you search for nearest neighbors first and then remove results that don't match your filter. Qdrant applies filters during the search using a technique called payload indexing. For queries like "find similar documents, but only from workspace X, created after January 2026, tagged as 'engineering'," Qdrant handles this without the recall degradation that post-filtering causes.
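The recall degradation from post-filtering is easy to demonstrate in a toy NumPy sketch: when only a small fraction of vectors match the filter, filtering after a top-k search can return far fewer than k valid results. This illustrates the concept, not Qdrant's actual payload-index implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(10_000, 32))
# Only ~2% of documents belong to the tenant we're filtering for.
tenant_mask = rng.random(10_000) < 0.02
query = rng.normal(size=32)
k = 10

dists = np.linalg.norm(vectors - query, axis=1)

# Post-filtering: take the global top-k, THEN drop non-matching results.
top_k = np.argsort(dists)[:k]
post_filtered = [i for i in top_k if tenant_mask[i]]  # usually far fewer than k

# Pre-filtering: restrict the candidate set first, then take top-k.
candidates = np.where(tenant_mask)[0]
pre_filtered = candidates[np.argsort(dists[candidates])[:k]]  # always k results
```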
Qdrant also supports quantization out of the box. Binary quantization reduces memory usage by 32x with moderate recall loss (good for re-ranking pipelines). Scalar quantization offers a 4x reduction with minimal recall impact. This makes Qdrant practical for larger datasets on smaller machines.
Milvus
Open-source vector database designed for billion-scale deployments. Backed by Zilliz, which offers a managed cloud version.
Pricing: Free (self-hosted). Zilliz Cloud: free tier, Standard from $65/month. Enterprise pricing for large clusters.
Strengths: Built for massive scale (billions of vectors). Supports GPU-accelerated indexing. Multiple index types (IVF, HNSW, DiskANN, SCANN). Horizontal scaling with separation of storage and compute. Strong in the Chinese tech ecosystem.
Weaknesses: Complex to self-host (depends on etcd, MinIO, Pulsar/Kafka). Overkill for most applications under 10M vectors. Cloud offering (Zilliz) costs more than alternatives. Documentation can be inconsistent across versions.
Best for: Large-scale applications with 100M+ vectors. Teams that need GPU-accelerated index building. Organizations with the infrastructure team to manage a distributed system.
Milvus separates storage and compute, which means you can scale query nodes independently of data nodes. For applications with bursty query patterns (a product catalog that gets hammered during sales events), you can scale query capacity without duplicating your entire dataset. This architecture is powerful but adds operational complexity that most teams don't need.
Milvus Lite is a lighter option that runs in-process (similar to Chroma) for development and small deployments. It's a good way to prototype with Milvus before committing to the full distributed setup.
Performance Benchmarks
Here are realistic numbers for a common use case: 1M document chunks, 1536-dimension embeddings, returning top 10 results. These come from ANN-Benchmarks and vendor-published data, so treat them as indicative rather than definitive. Your results will vary based on hardware, index configuration, and data distribution.
| Database | p95 Latency (1M vectors) | Recall @10 | Memory Usage | Index Build Time |
|---|---|---|---|---|
| Pinecone (serverless) | 15-30ms | ~98% | Managed | Minutes |
| Weaviate (HNSW) | 20-50ms | ~97% | ~8GB | 10-20 min |
| Chroma (in-process) | 10-25ms | ~97% | ~6GB | 5-10 min |
| pgvector (HNSW) | 30-80ms | ~95% | ~8GB | 15-30 min |
| Qdrant (HNSW) | 10-30ms | ~98% | ~7GB | 8-15 min |
| Milvus (HNSW) | 15-40ms | ~97% | ~8GB | 10-20 min |
All of these are fast enough for production RAG applications where the LLM generation step takes 500-3000ms. The vector search latency is rarely your bottleneck. If your application feels slow, profile the full pipeline before blaming the vector database.
At 10M vectors, the picture changes. pgvector latency climbs to 200-400ms without careful tuning. Dedicated vector databases stay in the 30-80ms range through more sophisticated indexing and memory management. At 100M+ vectors, Milvus and Qdrant with quantization pull ahead of the pack.
Real Architecture Examples
Abstract comparisons only go so far. Here's how vector databases fit into three common production architectures.
Architecture 1: RAG Chatbot for Internal Documentation
Use case: An internal chatbot that answers employee questions using company documentation (Confluence, Notion, Google Docs). 50K documents, 200K chunks after splitting.
Stack:
- Embedding model: OpenAI text-embedding-3-small (1536 dimensions, $0.02 per 1M tokens)
- Vector storage: pgvector on existing Supabase instance
- LLM: GPT-4o or Claude for answer generation
- Ingestion: Python script runs nightly, pulls from APIs, chunks documents, embeds, upserts to Postgres
Why pgvector works here: 200K vectors is well within pgvector's comfort zone. The company already uses Supabase for its main application, so pgvector adds zero infrastructure. Document access permissions map to Postgres row-level security. The nightly sync is a single cron job, no message queue or sync pipeline needed.
Query flow:
- User asks a question in Slack or a web UI
- Embed the question using the same model used for documents
- Query pgvector: `SELECT content, metadata FROM documents ORDER BY embedding <=> $query_vector LIMIT 5`
- Pass the top 5 chunks + the original question to the LLM
- Return the generated answer with source links
Cost: $0 for vector storage (part of existing Supabase plan). Embedding the corpus costs ~$4 one-time. Ongoing query costs are the LLM calls, typically $50-200/month for a team of 100.
Architecture 2: Semantic Search for an E-Commerce Catalog
Use case: A product search engine that understands natural language queries like "warm waterproof jacket for hiking" across a catalog of 2M products. Results need filtering by category, price range, brand, and availability.
Stack:
- Embedding model: Cohere embed-v3 (1024 dimensions, multilingual support)
- Vector storage: Qdrant (self-hosted, 3-node cluster)
- Re-ranker: Cohere rerank-v3 on top 50 results
- Ingestion: Kafka pipeline from product catalog updates
Why Qdrant here: 2M products with complex filters (category, price, brand, in-stock status, warehouse location) makes Qdrant's payload filtering a strong fit. Post-filtering at this scale drops too many relevant results. Qdrant's pre-filtering maintains recall while applying business constraints. Scalar quantization keeps memory manageable on 3 nodes with 64GB RAM each.
Query flow:
- User types "warm waterproof jacket for hiking" and selects filters (price under $300, in-stock only)
- Embed the query with Cohere
- Search Qdrant with vector + payload filter: category IN [outerwear, jackets], price <= 300, in_stock = true
- Retrieve top 50 results, pass to Cohere rerank with the original query
- Return top 20 re-ranked results to the user
Cost: Qdrant self-hosted on 3x r6i.2xlarge (64GB RAM) costs ~$1,200/month on AWS. Cohere embedding + reranking runs ~$500/month at 100K daily queries. Total: ~$1,700/month. A managed alternative (Pinecone) would run $2,500-4,000/month for equivalent throughput.
Architecture 3: Recommendation Engine for a Content Platform
Use case: A news/content platform recommends articles based on reading history. 500K articles, 5M user preference vectors (updated as users read and engage).
Stack:
- Embedding model: Custom fine-tuned E5-large (1024 dimensions, trained on engagement data)
- Vector storage: Milvus on Zilliz Cloud
- Feature store: Redis for real-time user signals
- Ingestion: Flink pipeline processes engagement events, updates user vectors every 15 minutes
Why Milvus here: 5.5M total vectors with frequent updates (user vectors change as they read). Milvus handles the write throughput from continuous user vector updates while maintaining query performance. Separation of storage and compute means query latency stays stable during bulk index rebuilds. Partitioning by content category speeds up "recommend within this section" queries.
Query flow:
- User opens the app or finishes an article
- Fetch the user's preference vector from Milvus
- Query Milvus: find 100 nearest article vectors, filtered to articles published in the last 7 days and not already read
- Blend vector similarity scores with real-time signals from Redis (trending, recency boost)
- Return top 20 recommendations
Cost: Zilliz Cloud for this workload runs ~$800/month. Custom embedding model hosting (2x A10G on AWS) adds ~$1,500/month. Redis for feature store adds ~$200/month. Total: ~$2,500/month. Self-hosting Milvus would save on the Zilliz cost but adds 0.5-1 FTE of DevOps effort.
Cost Analysis: Managed vs Self-Hosted
Vector database costs break down into three categories: infrastructure, embedding generation, and engineering time. Most teams focus only on the first and underestimate the third.
Infrastructure Costs at Different Scales
| Scale | pgvector | Pinecone | Weaviate Cloud | Qdrant Cloud | Self-Hosted (AWS) |
|---|---|---|---|---|---|
| 100K vectors | $0 (existing DB) | $0 (free tier) | $0 (sandbox) | $0 (free tier) | ~$50/mo (t3.medium) |
| 1M vectors | $0-50 (existing DB) | $70-231/mo | $25-100/mo | $25-75/mo | ~$150/mo (r6i.large) |
| 10M vectors | $200-400/mo (larger instance) | $500-1,500/mo | $300-800/mo | $200-500/mo | ~$500/mo (r6i.xlarge) |
| 100M vectors | Not recommended | $2,000-5,000/mo | $1,500-4,000/mo | $1,000-3,000/mo | ~$2,000/mo (multi-node) |
Embedding Generation Costs
Embedding your corpus is a one-time cost (plus incremental updates). Re-embedding happens when you switch models or dimensions.
| Model | Cost per 1M Tokens | Cost to Embed 1M Documents (~500 tokens each) |
|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | ~$10 |
| OpenAI text-embedding-3-large | $0.13 | ~$65 |
| Cohere embed-v3 | $0.10 | ~$50 |
| Self-hosted (e.g., BGE-large on GPU) | $0 (infra cost only) | ~$5-15 (compute time) |
For most applications, embedding costs are negligible compared to LLM inference costs and infrastructure. Don't optimize embedding costs unless you're processing tens of millions of documents.
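The table's arithmetic is simple enough to sanity-check in a few lines (prices are the per-million-token rates quoted above; your actual token counts will vary):

```python
def embedding_cost(n_docs: int, avg_tokens: float, price_per_1m_tokens: float) -> float:
    """One-time cost to embed a corpus, in dollars."""
    total_tokens = n_docs * avg_tokens
    return total_tokens / 1_000_000 * price_per_1m_tokens

# 1M documents at ~500 tokens each:
small = embedding_cost(1_000_000, 500, 0.02)  # text-embedding-3-small: ~$10
large = embedding_cost(1_000_000, 500, 0.13)  # text-embedding-3-large: ~$65
```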
The Hidden Cost: Engineering Time
Self-hosting a vector database saves on licensing but costs engineering time. A realistic breakdown:
- Initial setup (self-hosted): 2-5 days for deployment, monitoring, backups
- Ongoing maintenance: 2-4 hours/month for updates, capacity planning, incident response
- Index tuning: 1-2 days when performance issues arise
- Scaling events: 1-3 days for major capacity changes
At an engineering cost of $150/hour, self-hosting overhead runs $400-1,000/month in labor. For teams under 50 engineers, managed services often save money when you account for engineering time. For larger infrastructure teams that already manage Kubernetes clusters and database deployments, self-hosting adds minimal incremental work.
When Managed Services Make Financial Sense
Use managed services (Pinecone, Weaviate Cloud, Qdrant Cloud, Zilliz) when:
- Your team is under 10 engineers and doesn't have dedicated DevOps
- Your vector count is under 10M (managed pricing is reasonable at this scale)
- You need production uptime without on-call rotation for the vector layer
- You're in a startup where engineering time is the scarcest resource
Self-host when:
- You have 50M+ vectors and managed pricing becomes prohibitive
- You have compliance requirements that prevent sending data to third parties
- Your team already manages Kubernetes and has infrastructure expertise
- You need custom configurations that managed services don't expose
When You Need a Dedicated Vector Database
You need a dedicated vector database when:
You have more than 5 million vectors
At this scale, pgvector performance degrades noticeably, and dedicated vector databases maintain consistent latency through purpose-built indexing and memory management. If your document corpus is large (millions of pages), or you're storing per-user vectors alongside content vectors, a dedicated solution makes sense.
You need sub-10ms query latency
Real-time applications like autocomplete, live recommendations, or streaming retrieval need the fastest possible vector lookup. Dedicated vector databases with in-memory HNSW indexing and gRPC APIs deliver consistent single-digit millisecond latency that pgvector can't match.
You're doing complex filtered search at scale
If your workload combines vector similarity with multiple metadata filters (category, date range, tenant ID, permissions), and you have millions of vectors, dedicated databases handle this more efficiently. Qdrant and Milvus apply filters during the search rather than after, maintaining recall quality.
You need horizontal scaling
Single-node Postgres has limits. If your data grows beyond what a single large instance can handle (typically 10-20M vectors depending on dimensions), you need a database that supports sharding across nodes. Milvus, Qdrant, and Weaviate all support distributed deployments.
When You Don't Need One
This is the section most vector database marketing materials skip. You probably don't need a dedicated vector database when:
Your corpus is under 100K documents
For small to medium document sets, pgvector works perfectly well. Query latency at 100K vectors is under 20ms with a properly configured HNSW index. You avoid adding a new piece of infrastructure to your stack, which means less to monitor, maintain, and pay for.
You already use Postgres
If your application already runs on Postgres (and most web applications do), pgvector keeps everything in one database. Your vectors join with your relational data in a single query. You don't need a separate data pipeline to sync between systems. For most startups and early-stage products, this simplicity is worth more than the performance gains of a dedicated solution.
You're prototyping
During prototyping and early development, use Chroma (in-process, zero setup) or pgvector (if you already have Postgres). Don't add infrastructure complexity before you've validated that your RAG approach works. You can migrate to a dedicated vector database later if you need the scale.
Your retrieval quality issues aren't about the database
This is the most common mistake. Teams switch vector databases hoping to improve RAG quality. But poor retrieval quality is almost always one of these problems:
- Bad chunking strategy: Chunks too large (lose specificity) or too small (lose context). Try 256-512 tokens with 50-token overlap as a starting point.
- Wrong embedding model: A general-purpose model on domain-specific content. Fine-tuned or domain-specific models improve recall by 10-30% in specialized domains.
- Poor query formulation: The user's query doesn't match the style of the embedded documents. Query expansion, HyDE (Hypothetical Document Embeddings), or query rewriting with an LLM can help.
- Missing metadata: You're not using metadata filters to narrow the search space before vector similarity. Adding filters for document type, date, or section improves precision.
The database stores and retrieves vectors. Switching from pgvector to Pinecone won't fix a bad chunking strategy.
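The chunking starting point above (fixed-size windows with token overlap) fits in a few lines. Here tokens are approximated by whitespace-split words for simplicity; a real pipeline would use the embedding model's tokenizer:

```python
def chunk_tokens(tokens, chunk_size=256, overlap=50):
    """Split a token list into fixed-size chunks with overlapping boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window reached the end; avoid tiny trailing chunks
    return chunks

# Crude word-level stand-in for real tokenization:
words = ("lorem ipsum " * 300).split()  # 600 "tokens"
chunks = chunk_tokens(words, chunk_size=256, overlap=50)
# Consecutive chunks share a 50-token boundary region, so a fact that
# straddles a chunk boundary still appears whole in at least one chunk.
```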
Common Mistakes and How to Avoid Them
Mistake 1: Choosing the database before validating the approach
Teams spend weeks evaluating vector databases before confirming that vector search solves their problem. Start with Chroma or pgvector. Build a prototype. Measure retrieval quality. If the approach works and you hit scale limits, migrate. The migration cost is 1-3 days, and you'll make a better database decision with production data.
Mistake 2: Storing too much in the vector database
Vector databases are optimized for similarity search, not general-purpose storage. Store the embedding, a chunk ID, and minimal metadata in the vector database. Keep the full document content, user data, and application state in your primary database. Query the vector database for IDs, then hydrate results from your main database.
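The ID-then-hydrate pattern looks like this in outline. The vector results and primary-store lookups are stand-ins; in production the second step is typically a SQL `WHERE id IN (...)` query against your main database:

```python
# Stand-ins for the two stores; names here are illustrative only.
vector_results = [("chunk_42", 0.91), ("chunk_7", 0.88)]  # (id, score) from ANN search
primary_store = {                                         # your main database
    "chunk_42": {"text": "Full chunk text...", "doc": "handbook.pdf"},
    "chunk_7": {"text": "Another chunk...", "doc": "faq.md"},
}

def hydrate(results, store):
    """Resolve lightweight vector-search hits into full records."""
    return [
        {"id": chunk_id, "score": score, **store[chunk_id]}
        for chunk_id, score in results
    ]

hydrated = hydrate(vector_results, primary_store)
```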
Mistake 3: Ignoring index configuration
The default HNSW parameters (M=16, ef_construction=200) work for most cases. But if you're seeing poor recall or high latency, tuning these matters. Higher M increases connections per node (better recall, more memory). Higher ef_construction builds a more accurate index (slower build, better queries). Higher ef_search checks more candidates at query time (better recall, slower queries). Run benchmarks with your data, because optimal parameters depend on your dataset's characteristics.
Mistake 4: Not monitoring recall quality over time
As your dataset grows, recall can degrade if your index parameters were tuned for a smaller corpus. Build an evaluation set of query-result pairs and measure recall weekly. If recall drops below your threshold (95% is a common target), rebuild the index with updated parameters or consider scaling your infrastructure.
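A recall@k evaluation needs only a labeled set of queries with known relevant chunk IDs. A minimal harness might look like this, where the search function and eval data are stand-ins for your own:

```python
def recall_at_k(eval_set, search_fn, k=10):
    """eval_set: list of (query, set_of_relevant_ids). Returns mean recall@k."""
    total = 0.0
    for query, relevant in eval_set:
        retrieved = set(search_fn(query, k))
        total += len(retrieved & relevant) / len(relevant)
    return total / len(eval_set)

# Stand-in search function and eval pairs for illustration:
fake_index = {
    "reset password": ["doc1", "doc4", "doc9"],
    "billing cycle": ["doc2", "doc7", "doc3"],
}
def search_fn(query, k):
    return fake_index[query][:k]

eval_set = [
    ("reset password", {"doc1", "doc9"}),  # both retrieved -> recall 1.0
    ("billing cycle", {"doc2", "doc8"}),   # one of two retrieved -> recall 0.5
]
score = recall_at_k(eval_set, search_fn, k=3)  # (1.0 + 0.5) / 2 = 0.75
```

Run this weekly against the live index; a downward trend is your signal to revisit index parameters before users notice.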
Mistake 5: Overengineering multi-tenancy
If you have 10 tenants, separate collections work fine. At 1,000 tenants, you need metadata-filtered search within a shared collection. At 100,000 tenants, you need a database that supports efficient partition pruning (Milvus partitions or Qdrant's payload indexing). Match the multi-tenancy strategy to your scale, not to what you think you'll need in two years.
Migration Considerations
Starting with pgvector and migrating later is a valid strategy. Here's what the migration involves:
- Re-embed your documents: You might want to use a different embedding model anyway. Budget a few dollars for the embedding API calls and 1-2 hours of compute time.
- Abstraction layer: If you build a thin retrieval interface from the start (a function that takes a query string and returns relevant chunks), the migration surface is small. All the database-specific code lives behind one function.
- Update your retrieval code: Swap the query logic from SQL to the new database's SDK. Typically 50-200 lines of code if you used an abstraction layer.
- Set up the new infrastructure: Managed services (Pinecone, Weaviate Cloud) take 30 minutes. Self-hosted takes 2-5 days including monitoring and backups.
- Test retrieval quality: Ensure results are equivalent. Run your eval suite against both systems. Look for edge cases in filtered queries and metadata handling.
- Parallel run: For production systems, run both databases in parallel for a week. Query both, compare results, switch traffic gradually.
Total migration time: 1-3 days for managed, 3-7 days for self-hosted. It's not a major undertaking, which is why starting simple and scaling up is usually the right call.
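The abstraction-layer advice from step 2 amounts to one small interface: if all retrieval goes through something like this, swapping pgvector for a managed service touches only one class. The names and the toy scoring below are illustrative:

```python
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int = 5) -> list[str]:
        """Return the k most relevant chunk texts for a query."""
        ...

class InMemoryRetriever:
    """Toy backend; a real one would wrap pgvector, Qdrant, Pinecone, etc."""
    def __init__(self, chunks: dict[str, str]):
        self.chunks = chunks

    def retrieve(self, query: str, k: int = 5) -> list[str]:
        # Placeholder word-overlap scoring: real backends embed the
        # query and run an ANN search instead.
        query_words = set(query.lower().split())
        scored = sorted(
            self.chunks.values(),
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,
        )
        return scored[:k]

# Application code depends only on the Retriever interface:
retriever: Retriever = InMemoryRetriever({
    "c1": "How to reset your password",
    "c2": "Quarterly billing cycle overview",
})
hits = retriever.retrieve("password reset help", k=1)
```

Migrating then means writing a new class with the same `retrieve` signature and changing one constructor call; the rest of the application never sees the database SDK.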
How to Choose: A Decision Framework
Answer these five questions to narrow your choice:
1. How many vectors do you have (or expect within 12 months)?
- Under 100K: Chroma (prototyping) or pgvector (production)
- 100K-5M: pgvector, Pinecone, Weaviate, or Qdrant
- 5M-50M: Pinecone, Weaviate, Qdrant, or Milvus
- 50M+: Milvus, Qdrant, or Weaviate (all self-hosted or enterprise cloud)
2. Do you need hybrid search (keyword + vector)?
- Yes: Weaviate (best built-in hybrid) or Qdrant (sparse vectors)
- No: Any option works
3. Do you already use PostgreSQL?
- Yes and under 5M vectors: pgvector is the default choice. No new infrastructure, no sync pipeline, transactional consistency.
- Yes but over 5M vectors: Evaluate a dedicated database, but consider pgvector with HNSW tuning first.
4. Do you have DevOps capacity to self-host?
- Yes: Self-hosted Qdrant, Weaviate, or Milvus (save on managed service markup)
- No: Pinecone, Weaviate Cloud, Qdrant Cloud, or Zilliz
5. Do you have compliance constraints (data residency, no third-party storage)?
- Yes: pgvector (existing infra) or self-hosted Weaviate/Qdrant/Milvus
- No: Managed services are fair game
A quick summary by scenario:
- Prototyping: Chroma (zero setup, runs in your Python process)
- Production, under 5M vectors, already on Postgres: pgvector
- Production, under 5M vectors, want managed: Pinecone (easiest) or Qdrant Cloud (best filtering)
- Production, need hybrid search: Weaviate
- Production, 5M-50M vectors: Qdrant or Weaviate (self-hosted or cloud)
- Production, 50M+ vectors: Milvus or Qdrant with quantization
- On-premises / compliance: pgvector, Weaviate, Qdrant, or Milvus (all open source)
- Complex filtering (e-commerce, multi-tenant): Qdrant
The 2026 Vector Database Landscape
The vector database market has matured significantly since the initial RAG boom of 2023-2024. Several trends are reshaping which databases developers should consider for new projects.
PostgreSQL with pgvector Has Become the Default Starting Point
The pgvector extension has improved dramatically. HNSW index support, better parallel query execution, and the pgvector 0.7+ releases with quantization support have closed much of the performance gap with purpose-built vector databases. For teams already running PostgreSQL (which is most teams), starting with pgvector eliminates an entire category of infrastructure complexity. You can always migrate to a dedicated vector database later if you hit scale limits, but most applications never reach the point where pgvector becomes a bottleneck.
Hybrid Search Is No Longer Optional
Pure vector similarity search has a well-documented weakness: it misses exact keyword matches that users expect. Searching for "error code E-4021" using only vector search may return documents about errors generally rather than the specific error code. The industry has converged on hybrid search (combining vector similarity with BM25 keyword matching) as the standard approach. Weaviate and Qdrant both offer native hybrid search. Pinecone added keyword search support in late 2025. If you're evaluating vector databases today, hybrid search capability should be a requirement, not a nice-to-have.
Serverless Pricing Changed the Cost Equation
Pinecone's serverless offering and similar pay-per-query models from other vendors have made managed vector databases accessible to small teams and prototypes. You no longer need to provision and pay for always-on instances during development. For production workloads with predictable traffic, dedicated instances still offer better economics. But for development, staging, and low-traffic applications, serverless vector databases have eliminated the cost barrier that previously pushed teams toward self-hosted options.
Multi-Modal Embeddings Are Expanding Use Cases
Vector databases are no longer just for text. Models like OpenAI's CLIP and Google's multimodal embeddings allow you to store and search across text, images, and video in the same vector space. This opens up use cases like visual product search, cross-modal document retrieval (search text to find relevant images), and multi-modal RAG systems. Most major vector databases now support the higher-dimensional vectors these models produce, though storage and query costs scale accordingly.
The best vector database is the one that fits your current needs without adding unnecessary complexity. Start simple. Scale when you have data showing you need to. And spend more time on your embedding model and chunking strategy than on database selection, because those choices affect retrieval quality far more than the storage layer.
For more on building AI applications with retrieval, see our RAG architecture guide and the vector database glossary entry.