README

Purpose

Comprehensive research on DuckDB’s analytical capabilities, vector similarity search, hybrid storage architecture, and performance characteristics for both traditional SQL analytics and modern AI/ML embeddings use cases.

Core Research Documents

vector-capabilities.md - Comprehensive guide to DuckDB’s VSS (Vector Similarity Search) extension, HNSW indexing, distance functions, full-text search integration, and performance benchmarks against pgvector and OpenSearch
hnsw-persistence-issues.md - Deep dive into HNSW index persistence limitations, WAL recovery issues, open GitHub bugs, timeline analysis, and production workarounds
hybrid-memory-persistent-tables.md - Deep dive into DuckDB’s hybrid storage model: attaching multiple databases, temporary tables, cross-database joins across engines, and multi-database transaction constraints
performance-and-compatibility-summary.md - Detailed analysis of DuckDB vector search performance vs OpenSearch/pgvector, DuckLake compatibility issues, decision matrix for architecture selection, and practical workarounds
float-array-where-clause-bug.md - Critical bug documentation: WHERE clauses fail to filter NULL values in FLOAT[512] fixed-size arrays, affecting vector search and embedding repair operations with workarounds

Executive Summary

DuckDB is a powerful embedded analytical SQL database that combines:

Strengths

Vector Search Capabilities: Native VSS extension with HNSW indexing for similarity search
Hybrid Storage: Mix in-memory and persistent tables in same session for hot/cold data separation
Rich SQL Analytics: Complex queries combining vector operations with traditional SQL
Full-Text Search: Integrated FTS extension with BM25 scoring for hybrid search (semantic + keyword)
Cross-Database Queries: Join across multiple database engines (PostgreSQL, SQLite, DuckDB) in single query
Embedded Architecture: No separate server needed, works directly in Python/applications
Data Format Support: Seamless reading/writing of Parquet, CSV, JSON, and remote files

Critical Limitations

Vector Search Performance: ~12× slower than pgvector on single queries (23s vs 1.9s)
Concurrent Performance: 144× degradation under concurrent load (unsuitable for production search APIs)
DuckLake Incompatibility: VSS extension does NOT work with DuckLake persistent format
Point Query Optimization: Designed for analytical workloads, not low-latency individual vector lookups
Memory Constraints: In-memory vectors limited by RAM, experimental disk persistence is risky

Key Findings

Vector Search

Supported Distance Metrics: L2 Euclidean (default), Cosine distance, Inner product
Index Type: HNSW (Hierarchical Navigable Small World) with tunable parameters (M, ef_construction, ef_search)
Data Type: Fixed-size FLOAT arrays (384, 512, etc. dimensions)
Recent Improvement: 66× speedup for LATERAL joins with HNSW_INDEX_JOIN operator (October 2024)
Hybrid Search: Can combine BM25 full-text search with vector embeddings in single query

Hybrid Architecture

Multiple Database Attachment: Attach both persistent (*.duckdb) and in-memory (‘:memory:’) databases simultaneously
Cross-Database Joins: Query and join across different database engines in single SQL statement
Temporary Tables: Session-local tables that spill to disk if needed, auto-dropped on disconnect
Write Constraints: Can only write to ONE database per transaction (critical limitation)
Query Optimization: Pushes filters to source databases, uses specialized scanners, cost-based join optimization

Performance Benchmarks

Scenario	DuckDB	pgvector	OpenSearch	Recommendation
Single Vector Query	~23s	~1.9s	<100ms	❌ DuckDB too slow
100 Concurrent Queries	~1,416s	~9.8s	<100ms	❌ DuckDB fails under load
Offline Analytics	✅ Excellent	Good	Poor	✅ Use DuckDB
Hybrid SQL+Vector	✅ Excellent	Limited	Limited	✅ Use DuckDB
Production Search API	❌ Not suitable	Good	Excellent	Use pgvector or OpenSearch

DuckLake Compatibility Issue

Problem: VSS extension requires in-memory database; DuckLake is persistent storage format
Solution Options:
1. Use in-memory DuckDB with VSS (lose persistence)
2. Rebuild indexes on startup from DuckLake data (hybrid approach)
3. Use pgvector for persistent + vector support
4. Separate systems: OpenSearch/Pinecone for vectors + DuckLake for analytics

Use Cases

When to Use DuckDB

✅ Offline Analytics - Complex SQL queries combining vectors with tabular data ✅ Data Exploration - Ad-hoc analysis of embeddings with rich SQL ✅ Batch Processing - Entire dataset transformations with complex logic ✅ Hybrid Search - Native FTS + vector embeddings in single query ✅ Research/Prototyping - Quick experimentation with embeddings ✅ Recommendation Systems (batch-mode) - Complex analytical recommendations

When NOT to Use DuckDB

❌ Production APIs - High latency unacceptable (23s vs required <100ms) ❌ Real-Time Search - Concurrent performance degrades 144× ❌ RAG Systems - Sub-second retrieval needed for LLM prompts ❌ Point Query Workloads - Not optimized for individual vector lookups ❌ High-Throughput Systems - Poor concurrent performance ❌ Persistent Vector Storage - Use pgvector or specialized vector DBs

Technical Highlights

Vector Search Features

HNSW index with configurable accuracy/speed tradeoff
Multiple distance metrics (L2, cosine, inner product)
Index-accelerated top-k queries with min_by()
Fuzzy vector joins without indexes via vss_join() and vss_match()
Recent 66× speedup for lateral join operations

Hybrid Storage

Attach same session: persistent + in-memory databases
Cross-database joins with filter pushdown optimization
Automatic query plan generation across engines
Temporary tables with optional disk spillover

Text Analytics

Full-text search with BM25 scoring
Stemming support for multiple languages
Integrated hybrid search combining BM25 + vector embeddings
Text preprocessing with tokenization and stopword filtering

Architecture Decision Guide

Need	Recommended	Alternative
Vector search + Persistence	pgvector	Rebuild DuckDB indexes on startup
Production search API	OpenSearch/Pinecone	pgvector if SQL important
Analytics + Vectors (batch)	DuckDB in-memory	pgvector
Analytics only (persistent)	DuckLake	PostgreSQL
All requirements (scale)	OpenSearch + DuckLake	Elasticsearch + pgvector

agents - Agent systems and database-backed architectures
bun-nestjs - Performance optimization and full-stack systems
automated-reasoning - Database query optimization and verification
webgpu - GPU computing for analytics and parallel processing

Resources

Official Documentation:

Key Blog Posts:

Community Resources:

Last updated: 2025-11-24 DuckDB Version: 1.1.3+ (VSS extension experimental) Status: VSS and FTS extensions actively developed; persistence improvements planned