Performance and Compatibility Summary
Quick reference answers for DuckDB vector search performance and DuckLake compatibility.
Question 1: How does speed compare to OpenSearch?
Short Answer
DuckDB is one to two orders of magnitude slower than OpenSearch for vector search, and is not recommended for production search applications.
Performance Comparison
| System | Single Query Latency | Concurrent Performance | Use Case |
|---|---|---|---|
| OpenSearch | <100ms | Excellent (sub-second) | ✅ Production search |
| Elasticsearch | <100ms | Excellent (sub-second) | ✅ Production search |
| pgvector | ~1.9s | Excellent (9.8s for 100 concurrent) | ✅ Production hybrid |
| DuckDB VSS | ~23s | Poor (1,416s for 100 concurrent) | ❌ Production search |
Key Findings
DuckDB vs pgvector Benchmark (1M vectors):
- Single query: DuckDB is ~12× slower (23s vs 1.9s)
- Concurrent queries: DuckDB is ~144× slower (1,416s vs 9.8s)
- Latency breakdown: 98% overhead from point query inefficiency, only 2% in actual HNSW search
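The slowdown factors above follow directly from the raw benchmark timings; a quick sanity check in Python (numbers are the figures quoted above, nothing else is measured here):

```python
# Benchmark figures quoted above (1M vectors)
duckdb_single, pgvector_single = 23.0, 1.9            # seconds per single query
duckdb_conc, pgvector_conc = 1416.0, 9.8              # seconds for 100 concurrent queries

single_slowdown = duckdb_single / pgvector_single      # ~12x
concurrent_slowdown = duckdb_conc / pgvector_conc      # ~144x

print(f"Single query: {single_slowdown:.0f}x slower")
print(f"Concurrent:   {concurrent_slowdown:.0f}x slower")
```

Note how the gap widens roughly 12× again under concurrency, which is the "massive degradation under load" described below.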
Why DuckDB is Slow for Vector Search:
- Not optimized for point queries - designed for analytical workloads
- Poor concurrent performance - massive degradation under load
- Architecture mismatch - batch processing engine, not search engine
- No query optimizer for vectors - inefficient single-record lookups
OpenSearch/Elasticsearch Advantages:
- Purpose-built for search workloads
- Distributed architecture for scale
- Optimized for low-latency point queries
- Excellent concurrent performance
- Production-ready clustering
When to Use Each
Use OpenSearch/Elasticsearch when:
- ✅ Real-time search API (need <100ms response)
- ✅ Production user-facing applications
- ✅ High concurrent query load
- ✅ RAG systems requiring low latency
- ✅ Scale beyond single machine
Use DuckDB VSS when:
- ✅ Offline analytics with vectors
- ✅ Data exploration and prototyping
- ✅ Complex SQL + vector queries (hybrid analytics)
- ✅ Batch processing (single-threaded)
- ✅ Research and experimentation
- ❌ NOT for production search APIs
- ❌ NOT for real-time user queries
- ❌ NOT for concurrent workloads
Bottom Line
OpenSearch is 100-200× faster than DuckDB for typical vector search workloads. DuckDB should only be used for offline analytics where you need powerful SQL combined with vector operations, not for production search applications.
Question 2: Will it work if using DuckLake?
Short Answer
No, the VSS extension does NOT work with DuckLake. You must choose between in-memory vector search and persistent DuckLake storage.
The Compatibility Problem
| Feature | In-Memory DuckDB | DuckLake Storage |
|---|---|---|
| VSS Extension | ✅ Works | ❌ Does NOT work |
| Vector Search | ✅ Full support | ❌ Not supported |
| HNSW Indexes | ✅ Supported | ❌ Not supported |
| Persistence | ⚠️ Experimental (risky) | ✅ Production-ready |
| ACID Transactions | Limited | ✅ Full support |
| Time Travel | No | ✅ Supported |
| Data Loss Risk | High (on restart) | None |
Why They Don’t Work Together
DuckLake is a persistent storage format with:
- ACID transactions
- Snapshot isolation
- Time-travel queries
- Production-grade durability
VSS Extension requires:
- In-memory database
- No filesystem persistence
- Experimental persistence flag may corrupt storage
Technical Issue: VSS indexes cannot be safely persisted to the DuckLake format. Enabling experimental persistence (`SET hnsw_enable_experimental_persistence = true`) may corrupt file storage and is not production-ready.
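For reference, the experimental flag mentioned above is set as follows; do not use it outside throwaway experiments:

```sql
-- Experimental and NOT production-ready: may corrupt file storage
SET hnsw_enable_experimental_persistence = true;
```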
Workarounds and Solutions
Option 1: In-Memory DuckDB (Rebuild on Startup)
Best if: Data fits in RAM, acceptable to rebuild indexes
```python
import duckdb

# Connect to an in-memory database
conn = duckdb.connect(':memory:')

# Load data from DuckLake
conn.execute("ATTACH 'ducklake:data.ducklake' AS source")
conn.execute("CREATE TABLE docs AS SELECT * FROM source.docs")
conn.execute("DETACH source")

# Build the VSS index in memory
conn.execute("INSTALL vss")
conn.execute("LOAD vss")
conn.execute("CREATE INDEX vec_idx ON docs USING HNSW (embedding)")
```

Pros:
- ✅ VSS works perfectly
- ✅ Persistent source data
- ✅ Best of both worlds
Cons:
- ❌ Startup time for index rebuild
- ❌ Requires RAM for full dataset
- ❌ Double storage
Option 2: Use pgvector Instead
Best if: Need both persistence AND vector search in production
```sql
-- PostgreSQL with pgvector
CREATE EXTENSION vector;

CREATE TABLE docs (
    id SERIAL PRIMARY KEY,
    text TEXT,
    embedding vector(384)
);

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
```

Pros:
- ✅ Full persistence support
- ✅ ACID transactions
- ✅ 12× faster than DuckDB
- ✅ Production-ready
Cons:
- ❌ Requires PostgreSQL server
- ❌ More setup complexity
- ❌ Weaker analytical SQL
Option 3: Separate Systems
Best if: High-scale production, need best-in-class for both
```python
# Vector search: OpenSearch/Pinecone
import opensearchpy

client = opensearchpy.OpenSearch(...)
vector_results = client.search(
    index="vectors",
    body={"query": {"knn": {"embedding": {"vector": query_vec, "k": 10}}}},
)
```

```python
# Analytics: DuckLake
import duckdb

conn = duckdb.connect()
conn.execute("ATTACH 'ducklake:analytics.ducklake' AS data")

# Pass the matching document IDs, not the raw hit objects
hit_ids = [hit['_id'] for hit in vector_results['hits']['hits']]
analytical_results = conn.execute(
    "SELECT * FROM data.documents WHERE id IN (SELECT unnest(?))",
    [hit_ids],
).fetchall()
```

Pros:
- ✅ Best tool for each job
- ✅ Scales independently
- ✅ Production-ready
Cons:
- ❌ More complex architecture
- ❌ Operational overhead
- ❌ Cannot do hybrid queries in SQL
Option 4: DuckLake Only (No Vectors)
Best if: Don’t actually need vector search
```sql
INSTALL ducklake;
LOAD ducklake;

ATTACH 'ducklake:data.ducklake' AS db;

-- Use DuckLake for analytics without vectors
SELECT *
FROM db.documents
WHERE category = 'tech'
ORDER BY created_at DESC;
```

Pros:
- ✅ Production-ready persistence
- ✅ ACID transactions
- ✅ Time-travel queries
Cons:
- ❌ No vector search capability
- ❌ Must use traditional WHERE clauses
Decision Matrix
Choose Based on Your Requirements
| Requirement | Recommended Solution |
|---|---|
| Vector search + Persistence | pgvector (PostgreSQL) |
| Vector search + Analytics | DuckDB in-memory (rebuild on startup) |
| Production search API | OpenSearch or Elasticsearch |
| Analytics without vectors | DuckLake |
| Offline analytics + vectors | DuckDB in-memory |
| High-scale production | Separate systems (OpenSearch + DuckLake) |
| Prototyping/Research | DuckDB in-memory |
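The matrix above can be encoded as a small lookup helper. This is an illustrative sketch, not part of any library; the requirement flags and the `recommend` function are hypothetical names:

```python
# Hypothetical helper encoding the decision matrix above.
# Keys: (needs_vectors, needs_persistence, production_scale) -> recommendation.
RECOMMENDATIONS = {
    (True,  True,  False): "pgvector (PostgreSQL)",
    (True,  False, False): "DuckDB in-memory",
    (True,  True,  True):  "Separate systems (OpenSearch + DuckLake)",
    (False, True,  False): "DuckLake",
}

def recommend(needs_vectors: bool, needs_persistence: bool,
              production_scale: bool) -> str:
    """Look up a recommendation, defaulting to OpenSearch for production search."""
    return RECOMMENDATIONS.get(
        (needs_vectors, needs_persistence, production_scale),
        "OpenSearch or Elasticsearch",
    )

print(recommend(True, False, False))  # DuckDB in-memory
print(recommend(True, True, True))    # Separate systems (OpenSearch + DuckLake)
```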
Architecture Recommendations
Scenario 1: RAG System (Production)
✅ Use: OpenSearch or pgvector
❌ Don't use: DuckDB
Reason: Need <1s latency for retrieval
Scenario 2: Data Exploration
✅ Use: DuckDB in-memory
❌ Don't use: DuckLake
Reason: Need flexible vector queries with SQL
Scenario 3: Analytics Dashboard
✅ Use: DuckLake (without vectors)
Or: Separate OpenSearch (vectors) + DuckLake (analytics)
❌ Don't use: DuckDB VSS
Reason: Need persistence and concurrent queries
Scenario 4: Batch Recommendation Pipeline
✅ Use: DuckDB in-memory (load from DuckLake on startup)
Reason: Offline processing, rich SQL, acceptable rebuild time
Summary
Performance: DuckDB vs OpenSearch
- OpenSearch is 100-200× faster for production vector search
- DuckDB takes ~23 seconds per query vs <100ms for OpenSearch
- Concurrent performance: OpenSearch stable, DuckDB degrades 144×
- Use DuckDB for: Offline analytics only
- Use OpenSearch for: Production search applications
Compatibility: DuckDB VSS + DuckLake
- They do NOT work together
- Must choose: In-memory + vectors OR DuckLake + persistence
- Best workaround: Rebuild VSS indexes on startup from DuckLake
- Alternative: Use pgvector for both persistence + vectors
- For production: Consider separate systems (OpenSearch + DuckLake)
Key Takeaway
DuckDB VSS is excellent for offline analytics combining rich SQL with vector operations, but it’s:
- ❌ Not suitable for production search APIs
- ❌ Not compatible with DuckLake persistent storage
- ❌ Not optimized for low-latency or concurrent queries
Use the right tool for the job:
- Production search: OpenSearch, Elasticsearch, or pgvector
- Offline analytics: DuckDB in-memory
- Persistent analytics: DuckLake (without vectors) or pgvector
Last Updated: 2025-11-05
Based on: DuckDB v1.1.3+, OpenSearch benchmarks 2025, pgvector comparisons