README
Purpose
Comprehensive research on DuckDB’s analytical capabilities, vector similarity search, hybrid storage architecture, and performance characteristics for both traditional SQL analytics and modern AI/ML embeddings use cases.
Contents
Core Research Documents
- vector-capabilities.md - Comprehensive guide to DuckDB’s VSS (Vector Similarity Search) extension, HNSW indexing, distance functions, full-text search integration, and performance benchmarks against pgvector and OpenSearch
- hybrid-memory-persistent-tables.md - Deep dive into DuckDB’s hybrid storage model: attaching multiple databases, temporary tables, cross-database joins across engines, and multi-database transaction constraints
- performance-and-compatibility-summary.md - Detailed analysis of DuckDB vector search performance vs OpenSearch/pgvector, DuckLake compatibility issues, decision matrix for architecture selection, and practical workarounds
- float-array-where-clause-bug.md - Critical bug documentation: WHERE clauses fail to filter NULL values in FLOAT[512] fixed-size arrays, affecting vector search and embedding repair operations with workarounds
Executive Summary
DuckDB is a powerful embedded analytical SQL database that combines:
Strengths
- Vector Search Capabilities: Native VSS extension with HNSW indexing for similarity search
- Hybrid Storage: Mix in-memory and persistent tables in same session for hot/cold data separation
- Rich SQL Analytics: Complex queries combining vector operations with traditional SQL
- Full-Text Search: Integrated FTS extension with BM25 scoring for hybrid search (semantic + keyword)
- Cross-Database Queries: Join across multiple database engines (PostgreSQL, SQLite, DuckDB) in single query
- Embedded Architecture: No separate server needed, works directly in Python/applications
- Data Format Support: Seamless reading/writing of Parquet, CSV, JSON, and remote files
Critical Limitations
- Vector Search Performance: ~12× slower than pgvector on single queries (23s vs 1.9s)
- Concurrent Performance: 144× degradation under concurrent load (unsuitable for production search APIs)
- DuckLake Incompatibility: VSS extension does NOT work with DuckLake persistent format
- Point Query Optimization: Designed for analytical workloads, not low-latency individual vector lookups
- Memory Constraints: In-memory vectors limited by RAM, experimental disk persistence is risky
Key Findings
Vector Search
- Supported Distance Metrics: L2 Euclidean (default), Cosine distance, Inner product
- Index Type: HNSW (Hierarchical Navigable Small World) with tunable parameters (M, ef_construction, ef_search)
- Data Type: Fixed-size FLOAT arrays (384, 512, etc. dimensions)
- Recent Improvement: 66× speedup for LATERAL joins with HNSW_INDEX_JOIN operator (October 2024)
- Hybrid Search: Can combine BM25 full-text search with vector embeddings in single query
Hybrid Architecture
- Multiple Database Attachment: Attach both persistent (*.duckdb) and in-memory (‘:memory:’) databases simultaneously
- Cross-Database Joins: Query and join across different database engines in single SQL statement
- Temporary Tables: Session-local tables that spill to disk if needed, auto-dropped on disconnect
- Write Constraints: Can only write to ONE database per transaction (critical limitation)
- Query Optimization: Pushes filters to source databases, uses specialized scanners, cost-based join optimization
Performance Benchmarks
| Scenario | DuckDB | pgvector | OpenSearch | Recommendation |
|---|---|---|---|---|
| Single Vector Query | ~23s | ~1.9s | <100ms | ❌ DuckDB too slow |
| 100 Concurrent Queries | ~1,416s | ~9.8s | <100ms | ❌ DuckDB fails under load |
| Offline Analytics | ✅ Excellent | Good | Poor | ✅ Use DuckDB |
| Hybrid SQL+Vector | ✅ Excellent | Limited | Limited | ✅ Use DuckDB |
| Production Search API | ❌ Not suitable | Good | Excellent | Use pgvector or OpenSearch |
DuckLake Compatibility Issue
- Problem: VSS extension requires in-memory database; DuckLake is persistent storage format
- Solution Options:
- Use in-memory DuckDB with VSS (lose persistence)
- Rebuild indexes on startup from DuckLake data (hybrid approach)
- Use pgvector for persistent + vector support
- Separate systems: OpenSearch/Pinecone for vectors + DuckLake for analytics
Use Cases
When to Use DuckDB
✅ Offline Analytics - Complex SQL queries combining vectors with tabular data ✅ Data Exploration - Ad-hoc analysis of embeddings with rich SQL ✅ Batch Processing - Entire dataset transformations with complex logic ✅ Hybrid Search - Native FTS + vector embeddings in single query ✅ Research/Prototyping - Quick experimentation with embeddings ✅ Recommendation Systems (batch-mode) - Complex analytical recommendations
When NOT to Use DuckDB
❌ Production APIs - High latency unacceptable (23s vs required <100ms) ❌ Real-Time Search - Concurrent performance degrades 144× ❌ RAG Systems - Sub-second retrieval needed for LLM prompts ❌ Point Query Workloads - Not optimized for individual vector lookups ❌ High-Throughput Systems - Poor concurrent performance ❌ Persistent Vector Storage - Use pgvector or specialized vector DBs
Technical Highlights
Vector Search Features
- HNSW index with configurable accuracy/speed tradeoff
- Multiple distance metrics (L2, cosine, inner product)
- Index-accelerated top-k queries with
min_by() - Fuzzy vector joins without indexes via
vss_join()andvss_match() - Recent 66× speedup for lateral join operations
Hybrid Storage
- Attach same session: persistent + in-memory databases
- Cross-database joins with filter pushdown optimization
- Automatic query plan generation across engines
- Temporary tables with optional disk spillover
Text Analytics
- Full-text search with BM25 scoring
- Stemming support for multiple languages
- Integrated hybrid search combining BM25 + vector embeddings
- Text preprocessing with tokenization and stopword filtering
Architecture Decision Guide
| Need | Recommended | Alternative |
|---|---|---|
| Vector search + Persistence | pgvector | Rebuild DuckDB indexes on startup |
| Production search API | OpenSearch/Pinecone | pgvector if SQL important |
| Analytics + Vectors (batch) | DuckDB in-memory | pgvector |
| Analytics only (persistent) | DuckLake | PostgreSQL |
| All requirements (scale) | OpenSearch + DuckLake | Elasticsearch + pgvector |
Related Research
- agents - Agent systems and database-backed architectures
- bun-nestjs - Performance optimization and full-stack systems
- automated-reasoning - Database query optimization and verification
- webgpu - GPU computing for analytics and parallel processing
Resources
Official Documentation:
Key Blog Posts:
- What’s New in VSS (October 2024)
- Text Analytics Workflows (June 2025)
- Multi-Database Support in DuckDB
Community Resources:
- MotherDuck Vector Search Series
- Simon Willison: DuckDB for Embeddings
- Sören Brunk: Similarity Search with DuckDB
Last updated: 2025-11-24 DuckDB Version: 1.1.3+ (VSS extension experimental) Status: VSS and FTS extensions actively developed; persistence improvements planned