Performance and Compatibility Summary
Quick reference answers for DuckDB vector search performance and DuckLake compatibility.
Question 1: How does speed compare to OpenSearch?
Short Answer
DuckDB is one to two orders of magnitude slower than OpenSearch for vector search, and is not recommended for production search applications.
Performance Comparison
| System | Single Query Latency | Concurrent Performance | Use Case |
|---|---|---|---|
| OpenSearch | <100ms | Excellent (sub-second) | ✅ Production search |
| Elasticsearch | <100ms | Excellent (sub-second) | ✅ Production search |
| pgvector | ~1.9s | Excellent (9.8s for 100 concurrent) | ✅ Production hybrid |
| DuckDB VSS | ~23s | Poor (1,416s for 100 concurrent) | ❌ Production search |
Key Findings
DuckDB vs pgvector Benchmark (1M vectors):
- Single query: DuckDB is ~12× slower (23s vs 1.9s)
- Concurrent queries: DuckDB is ~144× slower (1,416s vs 9.8s)
- Latency breakdown: 98% overhead from point query inefficiency, only 2% in actual HNSW search
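The slowdown factors above follow directly from the raw benchmark timings; a quick sanity check in Python (numbers are the figures quoted above, nothing else is measured here):

```python
# Benchmark figures quoted above (1M vectors)
duckdb_single, pgvector_single = 23.0, 1.9            # seconds per single query
duckdb_conc, pgvector_conc = 1416.0, 9.8              # seconds for 100 concurrent queries

single_slowdown = duckdb_single / pgvector_single      # ~12x
concurrent_slowdown = duckdb_conc / pgvector_conc      # ~144x

print(f"Single query: {single_slowdown:.0f}x slower")
print(f"Concurrent:   {concurrent_slowdown:.0f}x slower")
```

Note how the gap widens roughly 12× again under concurrency, which is the "massive degradation under load" described below.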
Why DuckDB is Slow for Vector Search:
- Not optimized for point queries - designed for analytical workloads
- Poor concurrent performance - massive degradation under load
- Architecture mismatch - batch processing engine, not search engine
- No query optimizer for vectors - inefficient single-record lookups
OpenSearch/Elasticsearch Advantages:
- Purpose-built for search workloads
- Distributed architecture for scale
- Optimized for low-latency point queries
- Excellent concurrent performance
- Production-ready clustering
When to Use Each
Use OpenSearch/Elasticsearch when:
- ✅ Real-time search API (need <100ms response)
- ✅ Production user-facing applications
- ✅ High concurrent query load
- ✅ RAG systems requiring low latency
- ✅ Scale beyond single machine
Use DuckDB VSS when:
- ✅ Offline analytics with vectors
- ✅ Data exploration and prototyping
- ✅ Complex SQL + vector queries (hybrid analytics)
- ✅ Batch processing (single-threaded)
- ✅ Research and experimentation
- ❌ NOT for production search APIs
- ❌ NOT for real-time user queries
- ❌ NOT for concurrent workloads
Bottom Line
OpenSearch is 100-200× faster than DuckDB for typical vector search workloads. DuckDB should only be used for offline analytics where you need powerful SQL combined with vector operations, not for production search applications.
Question 2: Will it work if using DuckLake?
Short Answer
No, the VSS extension does NOT work with DuckLake. You must choose between in-memory vector search and persistent DuckLake storage.
The Compatibility Problem
| Feature | In-Memory DuckDB | DuckLake Storage |
|---|---|---|
| VSS Extension | ✅ Works | ❌ Does NOT work |
| Vector Search | ✅ Full support | ❌ Not supported |
| HNSW Indexes | ✅ Supported | ❌ Not supported |
| Persistence | ⚠️ Experimental (risky) | ✅ Production-ready |
| ACID Transactions | Limited | ✅ Full support |
| Time Travel | No | ✅ Supported |
| Data Loss Risk | High (on restart) | None |
Why They Don’t Work Together
DuckLake is a persistent storage format with:
- ACID transactions
- Snapshot isolation
- Time-travel queries
- Production-grade durability
VSS Extension requires:
- In-memory database
- No filesystem persistence
- Experimental persistence flag may corrupt storage
Technical Issue: VSS indexes cannot be safely persisted to the DuckLake format. Enabling experimental persistence (`SET hnsw_enable_experimental_persistence = true`) may corrupt file storage and is not production-ready.
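For reference, the experimental flag mentioned above is set as follows; do not use it outside throwaway experiments:

```sql
-- Experimental and NOT production-ready: may corrupt file storage
SET hnsw_enable_experimental_persistence = true;
```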
Workarounds and Solutions
Option 1: In-Memory DuckDB (Rebuild on Startup)
Best if: Data fits in RAM, acceptable to rebuild indexes
```python
import duckdb

# Connect to an in-memory database
conn = duckdb.connect(':memory:')

# Load data from DuckLake
conn.execute("ATTACH 'ducklake:data.ducklake' AS source")
conn.execute("CREATE TABLE docs AS SELECT * FROM source.docs")
conn.execute("DETACH source")

# Build the VSS index in memory
conn.execute("INSTALL vss")
conn.execute("LOAD vss")
conn.execute("CREATE INDEX vec_idx ON docs USING HNSW (embedding)")
```

Pros:
- ✅ VSS works perfectly
- ✅ Persistent source data
- ✅ Best of both worlds
Cons:
- ❌ Startup time for index rebuild
- ❌ Requires RAM for full dataset
- ❌ Double storage
Option 2: Use pgvector Instead
Best if: Need both persistence AND vector search in production
```sql
-- PostgreSQL with pgvector
CREATE EXTENSION vector;

CREATE TABLE docs (
    id SERIAL PRIMARY KEY,
    text TEXT,
    embedding vector(384)
);

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
```

Pros:
- ✅ Full persistence support
- ✅ ACID transactions
- ✅ 12× faster than DuckDB
- ✅ Production-ready
Cons:
- ❌ Requires PostgreSQL server
- ❌ More setup complexity
- ❌ Weaker analytical SQL
Option 3: Separate Systems
Best if: High-scale production, need best-in-class for both
```python
# Vector search: OpenSearch/Pinecone
import opensearchpy

client = opensearchpy.OpenSearch(...)
vector_results = client.search(
    index="vectors",
    body={"query": {"knn": {"embedding": {"vector": query_vec, "k": 10}}}},
)
```

```python
# Analytics: DuckLake
import duckdb

conn = duckdb.connect()
conn.execute("ATTACH 'ducklake:analytics.ducklake' AS data")

# Pass the matching document IDs, not the raw hit objects
hit_ids = [hit['_id'] for hit in vector_results['hits']['hits']]
analytical_results = conn.execute(
    "SELECT * FROM data.documents WHERE id IN (SELECT unnest(?))",
    [hit_ids],
).fetchall()
```

Pros:
- ✅ Best tool for each job
- ✅ Scales independently
- ✅ Production-ready
Cons:
- ❌ More complex architecture
- ❌ Operational overhead
- ❌ Cannot do hybrid queries in SQL
Option 4: DuckLake Only (No Vectors)
Best if: Don’t actually need vector search
```sql
INSTALL ducklake;
LOAD ducklake;

ATTACH 'ducklake:data.ducklake' AS db;

-- Use DuckLake for analytics without vectors
SELECT *
FROM db.documents
WHERE category = 'tech'
ORDER BY created_at DESC;
```

Pros:
- ✅ Production-ready persistence
- ✅ ACID transactions
- ✅ Time-travel queries
Cons:
- ❌ No vector search capability
- ❌ Must use traditional WHERE clauses
Decision Matrix
Choose Based on Your Requirements
| Requirement | Recommended Solution |
|---|---|
| Vector search + Persistence | pgvector (PostgreSQL) |
| Vector search + Analytics | DuckDB in-memory (rebuild on startup) |
| Production search API | OpenSearch or Elasticsearch |
| Analytics without vectors | DuckLake |
| Offline analytics + vectors | DuckDB in-memory |
| High-scale production | Separate systems (OpenSearch + DuckLake) |
| Prototyping/Research | DuckDB in-memory |
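The matrix above can be encoded as a small lookup helper. This is an illustrative sketch, not part of any library; the requirement flags and the `recommend` function are hypothetical names:

```python
# Hypothetical helper encoding the decision matrix above.
# Keys: (needs_vectors, needs_persistence, production_scale) -> recommendation.
RECOMMENDATIONS = {
    (True,  True,  False): "pgvector (PostgreSQL)",
    (True,  False, False): "DuckDB in-memory",
    (True,  True,  True):  "Separate systems (OpenSearch + DuckLake)",
    (False, True,  False): "DuckLake",
}

def recommend(needs_vectors: bool, needs_persistence: bool,
              production_scale: bool) -> str:
    """Look up a recommendation, defaulting to OpenSearch for production search."""
    return RECOMMENDATIONS.get(
        (needs_vectors, needs_persistence, production_scale),
        "OpenSearch or Elasticsearch",
    )

print(recommend(True, False, False))  # DuckDB in-memory
print(recommend(True, True, True))    # Separate systems (OpenSearch + DuckLake)
```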
Architecture Recommendations
Scenario 1: RAG System (Production)
✅ Use: OpenSearch or pgvector
❌ Don't use: DuckDB
Reason: Need <1s latency for retrieval
Scenario 2: Data Exploration
✅ Use: DuckDB in-memory
❌ Don't use: DuckLake
Reason: Need flexible vector queries with SQL
Scenario 3: Analytics Dashboard
✅ Use: DuckLake (without vectors)
Or: Separate OpenSearch (vectors) + DuckLake (analytics)
❌ Don't use: DuckDB VSS
Reason: Need persistence and concurrent queries
Scenario 4: Batch Recommendation Pipeline
✅ Use: DuckDB in-memory (load from DuckLake on startup)
Reason: Offline processing, rich SQL, acceptable rebuild time
Summary
Performance: DuckDB vs OpenSearch
- OpenSearch is 100-200× faster for production vector search
- DuckDB takes ~23 seconds per query vs <100ms for OpenSearch
- Concurrent performance: OpenSearch stable, DuckDB degrades 144×
- Use DuckDB for: Offline analytics only
- Use OpenSearch for: Production search applications
Compatibility: DuckDB VSS + DuckLake
- They do NOT work together
- Must choose: In-memory + vectors OR DuckLake + persistence
- Best workaround: Rebuild VSS indexes on startup from DuckLake
- Alternative: Use pgvector for both persistence + vectors
- For production: Consider separate systems (OpenSearch + DuckLake)
Key Takeaway
DuckDB VSS is excellent for offline analytics combining rich SQL with vector operations, but it’s:
- ❌ Not suitable for production search APIs
- ❌ Not compatible with DuckLake persistent storage
- ❌ Not optimized for low-latency or concurrent queries
Use the right tool for the job:
- Production search: OpenSearch, Elasticsearch, or pgvector
- Offline analytics: DuckDB in-memory
- Persistent analytics: DuckLake (without vectors) or pgvector
Last Updated: 2025-11-05
Based on: DuckDB v1.1.3+, OpenSearch benchmarks 2025, pgvector comparisons