Quick reference answers for DuckDB vector search performance and DuckLake compatibility.

Question 1: How does speed compare to OpenSearch?

Short Answer

DuckDB is orders of magnitude slower than OpenSearch for vector search and is not recommended for production search applications.

Performance Comparison

| System | Single Query Latency | Concurrent Performance | Use Case |
|---|---|---|---|
| OpenSearch | <100ms | Excellent (sub-second) | ✅ Production search |
| Elasticsearch | <100ms | Excellent (sub-second) | ✅ Production search |
| pgvector | ~1.9s | Excellent (9.8s for 100 concurrent) | ✅ Production hybrid |
| DuckDB VSS | ~23s | Poor (1,416s for 100 concurrent) | ❌ Not for production search |

Key Findings

DuckDB vs pgvector Benchmark (1M vectors):

  • Single query: DuckDB is ~12× slower (23s vs 1.9s)
  • Concurrent queries: DuckDB is ~144× slower (1,416s vs 9.8s)
  • Latency breakdown: 98% overhead from point query inefficiency, only 2% in actual HNSW search

Why DuckDB is Slow for Vector Search:

  1. Not optimized for point queries - designed for analytical workloads
  2. Poor concurrent performance - massive degradation under load
  3. Architecture mismatch - batch processing engine, not search engine
  4. No query optimizer for vectors - inefficient single-record lookups
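The point-query pattern that incurs this overhead can be sketched with the VSS extension's `array_distance` function (the table, columns, and vector values here are illustrative, not taken from the benchmark):

```sql
-- A search API issues one of these per user request; in DuckDB,
-- per-query overhead dominates and only ~2% of time is HNSW search.
SELECT id, text
FROM docs
ORDER BY array_distance(embedding, [0.1, 0.2, 0.3]::FLOAT[3])
LIMIT 10;
```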

OpenSearch/Elasticsearch Advantages:

  • Purpose-built for search workloads
  • Distributed architecture for scale
  • Optimized for low-latency point queries
  • Excellent concurrent performance
  • Production-ready clustering

When to Use Each

Use OpenSearch/Elasticsearch when:

  • ✅ Real-time search API (need <100ms response)
  • ✅ Production user-facing applications
  • ✅ High concurrent query load
  • ✅ RAG systems requiring low latency
  • ✅ Scale beyond single machine

Use DuckDB VSS when:

  • ✅ Offline analytics with vectors
  • ✅ Data exploration and prototyping
  • ✅ Complex SQL + vector queries (hybrid analytics)
  • ✅ Batch processing (single-threaded)
  • ✅ Research and experimentation
  • ❌ NOT for production search APIs
  • ❌ NOT for real-time user queries
  • ❌ NOT for concurrent workloads
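As a sketch of the hybrid SQL + vector queries DuckDB VSS is good at (schema and names are hypothetical; assumes the VSS extension's HNSW index, its `array_distance` function, and a prepared-statement parameter `$query_vec`):

```sql
INSTALL vss;
LOAD vss;

-- Hypothetical table: docs(id, category, created_at, embedding FLOAT[384])
CREATE INDEX vec_idx ON docs USING HNSW (embedding);

-- Relational filters, aggregation, and vector similarity in one statement
SELECT category,
       count(*) AS docs_in_category,
       min(array_distance(embedding, $query_vec)) AS best_distance
FROM docs
WHERE created_at >= DATE '2025-01-01'
GROUP BY category
ORDER BY best_distance
LIMIT 5;
```

This kind of query is awkward to express in a dedicated search engine but is ordinary SQL in DuckDB.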

Bottom Line

OpenSearch is 100-200× faster than DuckDB for typical vector search workloads. DuckDB should only be used for offline analytics where you need powerful SQL combined with vector operations, not for production search applications.

Question 2: Will it work if using DuckLake?

Short Answer

No, the VSS extension does NOT work with DuckLake. You must choose between vector search (in-memory DuckDB) and persistent storage (DuckLake).

The Compatibility Problem

| Feature | In-Memory DuckDB | DuckLake Storage |
|---|---|---|
| VSS Extension | ✅ Works | ❌ Does NOT work |
| Vector Search | ✅ Full support | ❌ Not supported |
| HNSW Indexes | ✅ Supported | ❌ Not supported |
| Persistence | ⚠️ Experimental (risky) | ✅ Production-ready |
| ACID Transactions | Limited | ✅ Full support |
| Time Travel | No | ✅ Supported |
| Data Loss Risk | High (on restart) | None |

Why They Don’t Work Together

DuckLake is a persistent storage format with:

  • ACID transactions
  • Snapshot isolation
  • Time-travel queries
  • Production-grade durability

VSS Extension requires:

  • An in-memory database (HNSW indexes are not safely persisted)
  • No filesystem persistence by default
  • An experimental persistence flag that may corrupt storage

Technical Issue: VSS indexes cannot be safely persisted to DuckLake format. Enabling experimental persistence (SET hnsw_enable_experimental_persistence = true) may corrupt file storage and is not production-ready.
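For completeness, the flag looks like this; it is shown only so it can be recognized and avoided in production:

```sql
-- Experimental and unsafe: persisted HNSW indexes can corrupt the database file
SET hnsw_enable_experimental_persistence = true;
```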

Workarounds and Solutions

Option 1: In-Memory DuckDB (Rebuild on Startup)

Best if: Data fits in RAM, acceptable to rebuild indexes

```python
import duckdb

# Connect to an in-memory database
conn = duckdb.connect(':memory:')

# Load source data from DuckLake
conn.execute("INSTALL ducklake")
conn.execute("LOAD ducklake")
conn.execute("ATTACH 'ducklake:data.ducklake' AS source")
conn.execute("CREATE TABLE docs AS SELECT * FROM source.docs")
conn.execute("DETACH source")

# Build the VSS index in memory
conn.execute("INSTALL vss")
conn.execute("LOAD vss")
conn.execute("CREATE INDEX vec_idx ON docs USING HNSW (embedding)")
```

Pros:

  • ✅ VSS works perfectly
  • ✅ Persistent source data
  • ✅ Best of both worlds

Cons:

  • ❌ Startup time for index rebuild
  • ❌ Requires RAM for full dataset
  • ❌ Double storage

Option 2: Use pgvector Instead

Best if: Need both persistence AND vector search in production

```sql
-- PostgreSQL with pgvector
CREATE EXTENSION vector;

CREATE TABLE docs (
    id SERIAL PRIMARY KEY,
    text TEXT,
    embedding vector(384)
);

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
```

Pros:

  • ✅ Full persistence support
  • ✅ ACID transactions
  • ✅ 12× faster than DuckDB
  • ✅ Production-ready

Cons:

  • ❌ Requires PostgreSQL server
  • ❌ More setup complexity
  • ❌ Weaker analytical SQL

Option 3: Separate Systems

Best if: High-scale production, need best-in-class for both

```python
# Vector search: OpenSearch (or another dedicated vector store)
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
response = client.search(
    index="vectors",
    body={"query": {"knn": {"embedding": {"vector": query_vec, "k": 10}}}},
)
doc_ids = [hit["_id"] for hit in response["hits"]["hits"]]

# Analytics: DuckLake
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL ducklake")
conn.execute("LOAD ducklake")
conn.execute("ATTACH 'ducklake:analytics.ducklake' AS data")
analytical_results = conn.execute(
    "SELECT * FROM data.documents WHERE id IN (SELECT unnest(?))",
    [doc_ids],
).fetchall()
```

Pros:

  • ✅ Best tool for each job
  • ✅ Scales independently
  • ✅ Production-ready

Cons:

  • ❌ More complex architecture
  • ❌ Operational overhead
  • ❌ Cannot do hybrid queries in SQL

Option 4: DuckLake Only (No Vectors)

Best if: Don’t actually need vector search

```sql
INSTALL ducklake;
LOAD ducklake;

ATTACH 'ducklake:data.ducklake' AS db;

-- Use DuckLake for analytics without vectors
SELECT *
FROM db.documents
WHERE category = 'tech'
ORDER BY created_at DESC;
```

Pros:

  • ✅ Production-ready persistence
  • ✅ ACID transactions
  • ✅ Time-travel queries

Cons:

  • ❌ No vector search capability
  • ❌ Must use traditional WHERE clauses

Decision Matrix

Choose Based on Your Requirements

| Requirement | Recommended Solution |
|---|---|
| Vector search + persistence | pgvector (PostgreSQL) |
| Vector search + analytics | DuckDB in-memory (rebuild on startup) |
| Production search API | OpenSearch or Elasticsearch |
| Analytics without vectors | DuckLake |
| Offline analytics + vectors | DuckDB in-memory |
| High-scale production | Separate systems (OpenSearch + DuckLake) |
| Prototyping/Research | DuckDB in-memory |

Architecture Recommendations

Scenario 1: RAG System (Production)

✅ Use: OpenSearch or pgvector
❌ Don't use: DuckDB
Reason: Need <1s latency for retrieval

Scenario 2: Data Exploration

✅ Use: DuckDB in-memory
❌ Don't use: DuckLake
Reason: Need flexible vector queries with SQL

Scenario 3: Analytics Dashboard

✅ Use: DuckLake (without vectors)
Or: Separate OpenSearch (vectors) + DuckLake (analytics)
❌ Don't use: DuckDB VSS
Reason: Need persistence and concurrent queries

Scenario 4: Batch Recommendation Pipeline

✅ Use: DuckDB in-memory (load from DuckLake on startup)
Reason: Offline processing, rich SQL, acceptable rebuild time
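A minimal SQL sketch of this pipeline, assuming hypothetical `users` and `items` tables with FLOAT-array embeddings and using DuckDB's `QUALIFY` clause for per-user top-k:

```sql
INSTALL ducklake; LOAD ducklake;
INSTALL vss; LOAD vss;

-- Startup: copy persistent data into memory, then rebuild the index
-- (the acceptable startup cost mentioned above)
ATTACH 'ducklake:recs.ducklake' AS lake;
CREATE TABLE items AS SELECT * FROM lake.items;
CREATE TABLE users AS SELECT * FROM lake.users;
DETACH lake;
CREATE INDEX item_vec_idx ON items USING HNSW (embedding);

-- Batch step: top-10 nearest items per user (brute-force cross join)
CREATE TABLE recommendations AS
SELECT u.user_id,
       i.id AS item_id,
       array_distance(i.embedding, u.embedding) AS dist
FROM users u, items i
QUALIFY row_number() OVER (
    PARTITION BY u.user_id
    ORDER BY array_distance(i.embedding, u.embedding)
) <= 10;
```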

Summary

Performance: DuckDB vs OpenSearch

  • OpenSearch is 100-200× faster for production vector search
  • DuckDB takes ~23 seconds per query vs <100ms for OpenSearch
  • Concurrent performance: OpenSearch stable, DuckDB degrades 144×
  • Use DuckDB for: Offline analytics only
  • Use OpenSearch for: Production search applications

Compatibility: DuckDB VSS + DuckLake

  • They do NOT work together
  • Must choose: In-memory + vectors OR DuckLake + persistence
  • Best workaround: Rebuild VSS indexes on startup from DuckLake
  • Alternative: Use pgvector for both persistence + vectors
  • For production: Consider separate systems (OpenSearch + DuckLake)

Key Takeaway

DuckDB VSS is excellent for offline analytics combining rich SQL with vector operations, but it’s:

  • ❌ Not suitable for production search APIs
  • ❌ Not compatible with DuckLake persistent storage
  • ❌ Not optimized for low-latency or concurrent queries

Use the right tool for the job:

  • Production search: OpenSearch, Elasticsearch, or pgvector
  • Offline analytics: DuckDB in-memory
  • Persistent analytics: DuckLake (without vectors) or pgvector

Last Updated: 2025-11-05
Based on: DuckDB v1.1.3+, OpenSearch benchmarks 2025, pgvector comparisons