Purpose

Comprehensive research on DuckDB’s analytical capabilities, vector similarity search, hybrid storage architecture, and performance characteristics for both traditional SQL analytics and modern AI/ML embeddings use cases.

Contents

Core Research Documents

  • vector-capabilities.md - Comprehensive guide to DuckDB’s VSS (Vector Similarity Search) extension, HNSW indexing, distance functions, full-text search integration, and performance benchmarks against pgvector and OpenSearch
  • hybrid-memory-persistent-tables.md - Deep dive into DuckDB’s hybrid storage model: attaching multiple databases, temporary tables, cross-database joins across engines, and multi-database transaction constraints
  • performance-and-compatibility-summary.md - Detailed analysis of DuckDB vector search performance vs OpenSearch/pgvector, DuckLake compatibility issues, decision matrix for architecture selection, and practical workarounds
  • float-array-where-clause-bug.md - Critical bug documentation: WHERE clauses fail to filter NULL values in FLOAT[512] fixed-size arrays, affecting vector search and embedding repair operations with workarounds

Executive Summary

DuckDB is a powerful embedded analytical SQL database that combines:

Strengths

  • Vector Search Capabilities: Native VSS extension with HNSW indexing for similarity search
  • Hybrid Storage: Mix in-memory and persistent tables in same session for hot/cold data separation
  • Rich SQL Analytics: Complex queries combining vector operations with traditional SQL
  • Full-Text Search: Integrated FTS extension with BM25 scoring for hybrid search (semantic + keyword)
  • Cross-Database Queries: Join across multiple database engines (PostgreSQL, SQLite, DuckDB) in single query
  • Embedded Architecture: No separate server needed, works directly in Python/applications
  • Data Format Support: Seamless reading/writing of Parquet, CSV, JSON, and remote files

Critical Limitations

  • Vector Search Performance: ~12× slower than pgvector on single queries (23s vs 1.9s)
  • Concurrent Performance: 144× degradation under concurrent load (unsuitable for production search APIs)
  • DuckLake Incompatibility: VSS extension does NOT work with DuckLake persistent format
  • Point Query Optimization: Designed for analytical workloads, not low-latency individual vector lookups
  • Memory Constraints: In-memory vectors limited by RAM, experimental disk persistence is risky

Key Findings

  • Supported Distance Metrics: L2 Euclidean (default), Cosine distance, Inner product
  • Index Type: HNSW (Hierarchical Navigable Small World) with tunable parameters (M, ef_construction, ef_search)
  • Data Type: Fixed-size FLOAT arrays (384, 512, etc. dimensions)
  • Recent Improvement: 66× speedup for LATERAL joins with HNSW_INDEX_JOIN operator (October 2024)
  • Hybrid Search: Can combine BM25 full-text search with vector embeddings in single query

Hybrid Architecture

  • Multiple Database Attachment: Attach both persistent (*.duckdb) and in-memory (‘:memory:’) databases simultaneously
  • Cross-Database Joins: Query and join across different database engines in single SQL statement
  • Temporary Tables: Session-local tables that spill to disk if needed, auto-dropped on disconnect
  • Write Constraints: Can only write to ONE database per transaction (critical limitation)
  • Query Optimization: Pushes filters to source databases, uses specialized scanners, cost-based join optimization

Performance Benchmarks

ScenarioDuckDBpgvectorOpenSearchRecommendation
Single Vector Query~23s~1.9s<100ms❌ DuckDB too slow
100 Concurrent Queries~1,416s~9.8s<100ms❌ DuckDB fails under load
Offline Analytics✅ ExcellentGoodPoor✅ Use DuckDB
Hybrid SQL+Vector✅ ExcellentLimitedLimited✅ Use DuckDB
Production Search API❌ Not suitableGoodExcellentUse pgvector or OpenSearch

DuckLake Compatibility Issue

  • Problem: VSS extension requires in-memory database; DuckLake is persistent storage format
  • Solution Options:
    1. Use in-memory DuckDB with VSS (lose persistence)
    2. Rebuild indexes on startup from DuckLake data (hybrid approach)
    3. Use pgvector for persistent + vector support
    4. Separate systems: OpenSearch/Pinecone for vectors + DuckLake for analytics

Use Cases

When to Use DuckDB

Offline Analytics - Complex SQL queries combining vectors with tabular data ✅ Data Exploration - Ad-hoc analysis of embeddings with rich SQL ✅ Batch Processing - Entire dataset transformations with complex logic ✅ Hybrid Search - Native FTS + vector embeddings in single query ✅ Research/Prototyping - Quick experimentation with embeddings ✅ Recommendation Systems (batch-mode) - Complex analytical recommendations

When NOT to Use DuckDB

Production APIs - High latency unacceptable (23s vs required <100ms) ❌ Real-Time Search - Concurrent performance degrades 144× ❌ RAG Systems - Sub-second retrieval needed for LLM prompts ❌ Point Query Workloads - Not optimized for individual vector lookups ❌ High-Throughput Systems - Poor concurrent performance ❌ Persistent Vector Storage - Use pgvector or specialized vector DBs

Technical Highlights

Vector Search Features

  • HNSW index with configurable accuracy/speed tradeoff
  • Multiple distance metrics (L2, cosine, inner product)
  • Index-accelerated top-k queries with min_by()
  • Fuzzy vector joins without indexes via vss_join() and vss_match()
  • Recent 66× speedup for lateral join operations

Hybrid Storage

  • Attach same session: persistent + in-memory databases
  • Cross-database joins with filter pushdown optimization
  • Automatic query plan generation across engines
  • Temporary tables with optional disk spillover

Text Analytics

  • Full-text search with BM25 scoring
  • Stemming support for multiple languages
  • Integrated hybrid search combining BM25 + vector embeddings
  • Text preprocessing with tokenization and stopword filtering

Architecture Decision Guide

NeedRecommendedAlternative
Vector search + PersistencepgvectorRebuild DuckDB indexes on startup
Production search APIOpenSearch/Pineconepgvector if SQL important
Analytics + Vectors (batch)DuckDB in-memorypgvector
Analytics only (persistent)DuckLakePostgreSQL
All requirements (scale)OpenSearch + DuckLakeElasticsearch + pgvector
  • agents - Agent systems and database-backed architectures
  • bun-nestjs - Performance optimization and full-stack systems
  • automated-reasoning - Database query optimization and verification
  • webgpu - GPU computing for analytics and parallel processing

Resources

Official Documentation:

Key Blog Posts:

Community Resources:


Last updated: 2025-11-24 DuckDB Version: 1.1.3+ (VSS extension experimental) Status: VSS and FTS extensions actively developed; persistence improvements planned