vector-capabilities
Comprehensive guide to DuckDB’s vector similarity search, embeddings, and text analytics capabilities.
Table of Contents
- Quick Answer
- Overview
- Installation
- Vector Similarity Search (VSS)
- Full-Text Search (FTS)
- Hybrid Search: Combining FTS + Vector Similarity
- Text Analytics Workflow
- Performance Characteristics
- Persistence and Storage
- DuckLake Compatibility
- Use Cases
- Recent Updates (2024-2025)
- Performance Comparison with Other Vector Databases
- Best Practices
- Limitations and Gotchas
- Future Roadmap
- Resources
Quick Answer
Yes, DuckDB has vector capabilities through its experimental VSS (Vector Similarity Search) extension, but with important performance considerations:
✅ Capabilities
- HNSW indexing for similarity search
- Multiple distance metrics (L2, cosine, inner product)
- Native ARRAY type for fixed-size embeddings (32-bit FLOAT)
- Full-text search (FTS) extension for hybrid search
- Recent updates with 66× speedup for joins (Oct 2024)
⚠️ Performance Reality
- ~12× slower than pgvector for single queries (23s vs 1.9s in benchmarks)
- Not optimized for point queries - designed for analytical workloads
- Poor concurrent performance - significant degradation under load
- High latency - unsuitable for real-time/production vector search
- Best for: Analytical queries combining vectors with rich SQL, not low-latency search
❌ DuckLake Compatibility
- VSS extension does NOT work with DuckLake (persistent storage format)
- Only works with in-memory databases
- Experimental persistence flag may cause data corruption
- Must choose between DuckLake or VSS, cannot use both
Overview
DuckDB provides a complete solution for vector similarity search and text analytics through two experimental extensions:
- VSS Extension - Vector similarity search with HNSW indexing
- FTS Extension - Full-text search with BM25 scoring
Combined, these enable powerful hybrid search capabilities for RAG systems, recommendation engines, and semantic search applications.
Installation
VSS Extension
```sql
INSTALL vss;
LOAD vss;
```
FTS Extension
```sql
INSTALL fts;
LOAD fts;
```
Note: The FTS extension is transparently auto-loaded on first use from the official extension repository.
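To confirm both extensions are installed and loaded, you can query DuckDB's extension catalog; a quick check using the built-in `duckdb_extensions()` table function:

```sql
-- List install/load status for the two extensions
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name IN ('vss', 'fts');
```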
Vector Similarity Search (VSS)
Supported Data Types
ARRAY Type (Since DuckDB 0.10)
- Fixed-size arrays for storing embeddings
- Currently only supports FLOAT (32-bit, single precision)
- All elements in a column must have the same length and type

```sql
-- Create table with embedding column
CREATE TABLE documents (
    id INTEGER,
    text TEXT,
    embedding FLOAT[384]  -- 384-dimensional vector
);

-- Insert embeddings
INSERT INTO documents VALUES
    (1, 'DuckDB is fast', [0.1, 0.2, ..., 0.384]),
    (2, 'Vectors are cool', [0.3, 0.4, ..., 0.385]);
```
Distance Functions
Three distance metrics are available:
| Function | Metric | Use Case | Formula |
|---|---|---|---|
| `array_distance()` | L2 Euclidean (default) | General similarity | √Σ(a-b)² |
| `array_cosine_distance()` | Cosine distance | Text/semantic similarity | 1 - cos(θ) |
| `array_negative_inner_product()` | Negative inner product | Ranking/scoring | -Σ(a×b) |
Alias: The `<=>` operator is aliased to `array_cosine_distance()` (compatible with PostgreSQL pgvector).
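To see the metrics side by side, here is a minimal sketch that evaluates each function on a pair of literal 3-dimensional vectors (assuming DuckDB 1.1+, where these functions are built in); `cos_dist` and `cos_op` should return identical values:

```sql
SELECT
    array_distance([1.0, 2.0, 3.0]::FLOAT[3], [4.0, 5.0, 6.0]::FLOAT[3])               AS l2,
    array_cosine_distance([1.0, 2.0, 3.0]::FLOAT[3], [4.0, 5.0, 6.0]::FLOAT[3])        AS cos_dist,
    array_negative_inner_product([1.0, 2.0, 3.0]::FLOAT[3], [4.0, 5.0, 6.0]::FLOAT[3]) AS neg_ip,
    [1.0, 2.0, 3.0]::FLOAT[3] <=> [4.0, 5.0, 6.0]::FLOAT[3]                            AS cos_op;
```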
HNSW Index Creation
Basic Index:
```sql
CREATE INDEX doc_idx ON documents
USING HNSW (embedding);
```
With Custom Metric:
```sql
CREATE INDEX doc_cosine_idx ON documents
USING HNSW (embedding)
WITH (metric = 'cosine');
```
Available Metrics:
- `l2sq` - L2 Euclidean distance (default)
- `cosine` - Cosine distance
- `ip` - Inner product
Configurable Hyperparameters
| Parameter | Default | Description | Impact |
|---|---|---|---|
| `ef_construction` | 128 | Candidates during index build | Higher = better accuracy, slower build |
| `ef_search` | 64 | Candidates during search | Higher = better recall, slower queries |
| `M` | 16 | Max neighbors per vertex | Higher = better accuracy, more memory |
| `M0` | 2 × M | Zero-level connectivity | Typically 2× the M value |
Example with Custom Parameters:
```sql
CREATE INDEX doc_idx ON documents
USING HNSW (embedding)
WITH (
    metric = 'cosine',
    ef_construction = 256,
    M = 32
);
```
Runtime Configuration:
```sql
-- Override ef_search at query time
SET hnsw_ef_search = 128;
```
Similarity Search Queries
Basic Top-K Search:
```sql
-- Find the 5 most similar documents
SELECT id, text,
    array_distance(embedding, [0.1, 0.2, ..., 0.384]::FLOAT[384]) AS distance
FROM documents
ORDER BY distance
LIMIT 5;
```
Using Cosine Distance:
```sql
-- Semantic search with cosine distance (smaller = more similar)
SELECT id, text,
    embedding <=> [0.1, 0.2, ..., 0.384]::FLOAT[384] AS cosine_dist
FROM documents
ORDER BY cosine_dist
LIMIT 10;
```
Top-K Aggregates (New Feature):
```sql
-- Efficient top-k using min_by with index acceleration
SELECT min_by(id, array_distance(embedding, query_vec), 5) AS top_5_ids
FROM documents;
```
Query Optimization
Verify Index Usage:
```sql
EXPLAIN
SELECT * FROM documents
ORDER BY array_distance(embedding, [0.1, 0.2, ..., 0.384]::FLOAT[384])
LIMIT 5;
```
Look for `HNSW_INDEX_SCAN` in the query plan to confirm index usage.
Optimized Query Patterns:
- `ORDER BY distance_function(...) LIMIT n` - Uses index
- `min_by(column, distance_function(...), n)` - Uses index (new)
- LATERAL joins with distance functions - Uses `HNSW_INDEX_JOIN` (new; see the sketch below)
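A minimal sketch of the LATERAL pattern, assuming a hypothetical `queries` table whose `embedding` column matches the type of `documents.embedding`; with an HNSW index in place, recent versions can plan this as an `HNSW_INDEX_JOIN` rather than a brute-force nested loop:

```sql
-- For each query vector, fetch its 3 nearest documents
SELECT q.id AS query_id, hits.id AS doc_id, hits.dist
FROM queries q,
LATERAL (
    SELECT d.id, array_distance(d.embedding, q.embedding) AS dist
    FROM documents d
    ORDER BY dist
    LIMIT 3
) AS hits;
```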
Fuzzy Vector Joins
Two table macros for joining without indexes:
`vss_join()` - Match vectors between tables:
```sql
SELECT d1.id, d2.id, distance
FROM vss_join(
    documents AS d1,
    products AS d2,
    ON embedding,
    k = 5,
    metric = 'cosine'
);
```
`vss_match()` - Lateral join variant:
```sql
SELECT d.id, matches.similar_id
FROM documents d,
LATERAL vss_match(
    products,
    embedding,
    d.embedding,
    k = 3
) AS matches;
```
Note: These perform brute-force searches but may receive index-based optimizations in future versions.
Full-Text Search (FTS)
Creating FTS Index
```sql
-- Create FTS index
PRAGMA create_fts_index(
    'documents',              -- table name
    'id',                     -- unique identifier
    'text',                   -- text column(s) to index
    stemmer = 'english',
    stopwords = 'english',
    ignore = '(\\.|[^a-z])+',
    strip_accents = true,
    lower = true
);
```
BM25 Scoring
```sql
-- Full-text search with BM25 relevance scoring
SELECT id, text, score
FROM (
    SELECT *,
        fts_main_documents.match_bm25(id, 'vector search') AS score
    FROM documents
)
WHERE score IS NOT NULL
ORDER BY score DESC;
```
Stemming
```sql
-- Extract word stems
SELECT stem('feeling', 'english');  -- Returns: 'feel'
SELECT stem('running', 'english');  -- Returns: 'run'

-- Use in queries
SELECT * FROM documents
WHERE stem(text, 'english') LIKE '%' || stem('running', 'english') || '%';
```
Hybrid Search: Combining FTS + Vector Similarity
Three-Method Approach
1. Keyword Search - Basic pattern matching (see the sketch below)
2. Full-Text Search - BM25 scoring
3. Semantic Search - Vector embeddings
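The first method is ordinary SQL pattern matching, which needs no extension at all; a minimal sketch against the running `documents` table:

```sql
-- Method 1: simple case-insensitive keyword matching
SELECT id, text
FROM documents
WHERE text ILIKE '%vector%';
```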
Weighted Hybrid Query
```sql
-- Combine BM25 and vector similarity with weighted scoring
WITH keyword_scores AS (
    SELECT id,
        fts_main_documents.match_bm25(id, 'query text') AS bm25_score
    FROM documents
),
vector_scores AS (
    SELECT id,
        1 - array_cosine_distance(embedding, $query_embedding) AS vector_score
    FROM documents
),
normalized_scores AS (
    SELECT
        COALESCE(k.id, v.id) AS id,
        -- Min-max normalization for BM25 (unbounded)
        (k.bm25_score - MIN(k.bm25_score) OVER ())
            / (MAX(k.bm25_score) OVER () - MIN(k.bm25_score) OVER ()) AS norm_bm25,
        v.vector_score AS norm_vector
    FROM keyword_scores k
    FULL OUTER JOIN vector_scores v ON k.id = v.id
)
SELECT d.id, d.text,
    -- Weighted combination (70% semantic, 30% keyword)
    (0.7 * norm_vector + 0.3 * norm_bm25) AS final_score
FROM normalized_scores n
JOIN documents d ON n.id = d.id
ORDER BY final_score DESC
LIMIT 10;
```
Text Analytics Workflow
1. Text Preprocessing
Tokenization:
```sql
-- Split text into tokens (regexp_split_to_array returns a LIST to unnest)
SELECT unnest(regexp_split_to_array(text, '\s+')) AS token
FROM documents;
```
Stopword Filtering:
```sql
-- Remove common words
WITH tokens AS (
    SELECT id, unnest(regexp_split_to_array(text, '\s+')) AS token
    FROM documents
),
stopwords AS (
    SELECT unnest(['the', 'is', 'at', 'which', 'on']) AS word
)
SELECT t.id, t.token
FROM tokens t
WHERE t.token NOT IN (SELECT word FROM stopwords);
```
2. Generate Embeddings
Using sentence-transformers (Python):
```python
from sentence_transformers import SentenceTransformer
import duckdb

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
texts = ["DuckDB is fast", "Vectors are cool"]
embeddings = model.encode(texts)

# Insert into DuckDB
conn = duckdb.connect('database.db')
conn.execute("""
    CREATE TABLE documents (
        id INTEGER,
        text TEXT,
        embedding FLOAT[384]
    )
""")

for i, (text, emb) in enumerate(zip(texts, embeddings)):
    conn.execute(
        "INSERT INTO documents VALUES (?, ?, ?)",
        [i, text, emb.tolist()]
    )
```
3. Create Indexes
```sql
-- Create vector index
CREATE INDEX vec_idx ON documents USING HNSW (embedding);

-- Create FTS index
PRAGMA create_fts_index('documents', 'id', 'text');
```
4. Query
```sql
-- Hybrid search combining both methods
SELECT d.id, d.text,
    -- Semantic similarity
    1 - (d.embedding <=> $query_embedding) AS semantic_score,
    -- Keyword relevance
    fts_main_documents.match_bm25(d.id, 'search terms') AS keyword_score
FROM documents d
WHERE semantic_score > 0.7 OR keyword_score > 5.0
ORDER BY (0.7 * semantic_score + 0.3 * keyword_score) DESC
LIMIT 20;
```
Performance Characteristics
HNSW Index Performance
Build Time:
- Improved multi-threading in recent versions
- Better parallelism when indexing after table population
- Progress indicator shows build status (see the sketch below)
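The progress display is controlled by a general DuckDB setting; a minimal sketch, assuming the standard `enable_progress_bar` option applies to index builds in your version:

```sql
-- Show a progress bar for long-running statements in the CLI
SET enable_progress_bar = true;
CREATE INDEX vec_idx ON documents USING HNSW (embedding);
```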
Search Performance:
- Sub-millisecond queries for millions of vectors (typical)
- Configurable accuracy/speed tradeoff via `ef_search`
- Approximately 66× speedup for LATERAL joins with `HNSW_INDEX_JOIN`
Memory Usage:
- Indexes remain in RAM
- Do not count toward `memory_limit`
- Higher `M` values increase memory consumption (see the check below)
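Because the index is accounted for outside the buffer pool, DuckDB's own numbers understate the total footprint; still, a quick look at what DuckDB believes it is using can help (a sketch, assuming `PRAGMA database_size` is available in your version):

```sql
-- Configured limit and DuckDB's tracked memory usage (index RAM is extra)
SELECT current_setting('memory_limit') AS configured_limit;
PRAGMA database_size;
```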
Optimization Tips
1. Build Indexes After Population
```sql
-- Better: Insert first, index later
CREATE TABLE documents (...);
INSERT INTO documents SELECT * FROM source;
CREATE INDEX vec_idx ON documents USING HNSW (embedding);

-- Slower: Create index on an empty table, then insert
```
2. Handle Deletions
```sql
-- Deletes mark rows as deleted but don't compact the index immediately
DELETE FROM documents WHERE id < 1000;

-- Manually trigger compaction after significant deletions
PRAGMA hnsw_compact_index('vec_idx');
```
3. Tune Hyperparameters
```sql
-- For high recall (slower):
SET hnsw_ef_search = 256;

-- For speed (lower recall):
SET hnsw_ef_search = 32;
```
4. Monitor Query Plans
```sql
-- Always verify index usage
EXPLAIN SELECT * FROM documents
ORDER BY array_distance(embedding, $vec)
LIMIT 10;
```
Persistence and Storage
Current Limitations
Default Behavior:
- HNSW indexes work best on in-memory databases
- Persistence to disk is experimental
Enable Experimental Persistence:
```sql
SET hnsw_enable_experimental_persistence = true;
```
Risks:
- WAL recovery not fully implemented
- Unexpected shutdowns risk data loss
- May corrupt file storage
- Not recommended for production
Recommended Approach:
- Use in-memory databases for production workloads
- Or rebuild indexes on startup if using disk persistence
- Monitor DuckDB releases for stable persistence support
DuckLake Compatibility
❌ VSS Extension Does NOT Work with DuckLake
Critical Limitation: The VSS (Vector Similarity Search) extension is incompatible with DuckLake, DuckDB’s persistent storage format released in May 2025.
What is DuckLake?
DuckLake is a DuckDB core extension that adds support for:
- Attaching databases stored in the DuckLake format
- ACID transactions with snapshot isolation
- Time-travel queries across snapshots
- Efficient storage management and compaction
- Production-grade persistent storage
Basic DuckLake Usage:
```sql
INSTALL ducklake;
LOAD ducklake;

-- Attach a DuckLake database
ATTACH 'ducklake:metadata.ducklake' AS my_db (DATA_PATH 'data_files');
USE my_db;

-- Query snapshot history
SELECT * FROM ducklake_snapshots('my_db');
```
The Compatibility Problem
You must choose one or the other:
| Feature | In-Memory DuckDB | DuckLake Storage |
|---|---|---|
| VSS Extension | ✅ Works | ❌ Does NOT work |
| Vector Search | ✅ Full support | ❌ Not supported |
| HNSW Indexes | ✅ Supported | ❌ Not supported |
| Persistence | ⚠️ Experimental (risky) | ✅ Production-ready |
| ACID Transactions | Limited | ✅ Full support |
| Time Travel | No | ✅ Supported |
| Snapshot Isolation | No | ✅ Supported |
Workarounds and Alternatives
Option 1: Use In-Memory DuckDB (Recommended for VSS)
```sql
-- Connect to an in-memory database
.open :memory:

-- Use VSS normally
INSTALL vss;
LOAD vss;
CREATE INDEX vec_idx ON docs USING HNSW (embedding);
```
Pros:
- ✅ VSS works perfectly
- ✅ Fast performance
- ✅ No persistence issues
Cons:
- ❌ Data lost on shutdown
- ❌ Must reload data on startup
- ❌ Limited by RAM
Option 2: Rebuild Indexes on Startup
```sql
-- Load data from DuckLake
ATTACH 'ducklake:data.ducklake' AS source;

-- Copy to an in-memory database
CREATE TABLE docs AS SELECT * FROM source.docs;
DETACH source;

-- Build the VSS index in-memory
INSTALL vss;
LOAD vss;
CREATE INDEX vec_idx ON docs USING HNSW (embedding);
```
Pros:
- ✅ Persistent source data in DuckLake
- ✅ VSS works in-memory
- ✅ Best of both worlds
Cons:
- ❌ Startup time for index rebuild
- ❌ Requires scripting
- ❌ Double storage (DuckLake + in-memory)
Option 3: Use pgvector with PostgreSQL Instead
```sql
-- PostgreSQL with pgvector supports both persistence and vectors
CREATE EXTENSION vector;

CREATE TABLE docs (
    id SERIAL PRIMARY KEY,
    embedding vector(384)
);

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
```
Pros:
- ✅ Full persistence support
- ✅ ACID transactions
- ✅ Better vector search performance (12× faster than DuckDB)
- ✅ Production-ready
Cons:
- ❌ Not embedded (requires PostgreSQL server)
- ❌ More complex setup
- ❌ Less powerful analytical SQL than DuckDB
Option 4: Separate Systems for Vectors and Analytics
```python
# Use a specialized vector DB for search
import pinecone
pinecone.init(api_key="...")
index = pinecone.Index("vectors")

# Use DuckLake for analytical data
import duckdb
conn = duckdb.connect()
conn.execute("ATTACH 'ducklake:analytics.ducklake' AS data")

# Combine results in the application layer
vector_results = index.query(query_embedding, top_k=100)
analytical_results = conn.execute("""
    SELECT * FROM data.documents
    WHERE id IN (?)
""", [vector_results.ids])
```
Pros:
- ✅ Best tool for each job
- ✅ Scales independently
- ✅ Production-ready for both
Cons:
- ❌ More complex architecture
- ❌ Additional operational overhead
- ❌ Cannot use SQL for hybrid queries
Architecture Decision Guide
Choose In-Memory DuckDB + VSS when:
- Data fits in RAM
- Acceptable to rebuild indexes on startup
- Need rich SQL + vector search in single query
- Offline/batch analytics workload
- No persistence requirements
Choose DuckLake (without VSS) when:
- Need persistent storage
- ACID transactions required
- Time-travel queries needed
- Vector search not required
- Analytical workload without embeddings
Choose pgvector when:
- Need both persistence AND vectors
- Production vector search required
- Transactional workload
- Can accept PostgreSQL complexity
- Want better vector search performance
Choose Separate Systems when:
- High-scale production system
- Need best-in-class for both vectors and analytics
- Can manage multiple databases
- Budget allows operational complexity
Use Cases
1. RAG (Retrieval-Augmented Generation)
```sql
-- Find relevant context for LLM prompts
WITH query_embedding AS (
    SELECT $user_query_embedding AS vec
)
SELECT d.text, d.metadata
FROM documents d, query_embedding q
ORDER BY d.embedding <=> q.vec
LIMIT 5;
```
2. Recommendation Systems
```sql
-- Find similar products/content
SELECT p2.id, p2.name,
    array_distance(p1.embedding, p2.embedding) AS distance
FROM products p1, products p2
WHERE p1.id = $current_product_id
  AND p2.id != p1.id
ORDER BY distance
LIMIT 10;
```
3. Semantic Search
```sql
-- Search documents by meaning, not just keywords
SELECT id, title, snippet,
    1 - (embedding <=> $query_embedding) AS relevance
FROM articles
WHERE relevance > 0.6
ORDER BY relevance DESC;
```
4. Duplicate Detection
```sql
-- Find near-duplicate content
SELECT d1.id AS doc1, d2.id AS doc2,
    array_cosine_distance(d1.embedding, d2.embedding) AS similarity
FROM documents d1, documents d2
WHERE d1.id < d2.id
  AND similarity < 0.1  -- Very similar
ORDER BY similarity;
```
5. Image Search (with embeddings from vision models)
```sql
-- Find visually similar images
CREATE TABLE images (
    id INTEGER,
    filename TEXT,
    embedding FLOAT[512]  -- From CLIP, ResNet, etc.
);

SELECT filename,
    array_distance(embedding, $query_image_embedding) AS distance
FROM images
ORDER BY distance
LIMIT 20;
```
Recent Updates (2024-2025)
October 2024 Release
1. Performance Improvements
- Enhanced multi-threaded index building
- Fine-grained work distribution
- Better memory allocation
- Progress indicators during index creation
2. New Distance Functions
- `array_cosine_distance()` - 1 - cosine_similarity
- `array_negative_inner_product()` - negative inner product
- `<=>` operator for cosine distance (pgvector compatibility)
3. Top-K Aggregates
- `min_by()` / `max_by()` now accept an optional `n` parameter (see the sketch after this list)
- Index-accelerated top-k queries
- Avoids full table scans
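A minimal sketch of the `max_by` variant for a higher-is-better score, assuming the `documents` table from earlier and a bound `$query_embedding` parameter; the top ids come back as a list:

```sql
-- Top 5 ids by cosine similarity (n-argument aggregate)
SELECT max_by(id, 1 - array_cosine_distance(embedding, $query_embedding), 5) AS top_5_ids
FROM documents;
```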
4. LATERAL Join Optimization
- New `HNSW_INDEX_JOIN` operator
- 66× speedup for vector joins (10s → 0.15s in benchmarks)
- Efficient cross-table similarity searches
June 2025 - Text Analytics
Lightweight Text Analytics Workflows:
- Demonstrated hybrid search combining keyword, FTS, and vector embeddings
- Integration examples with sentence-transformers
- Best practices for text preprocessing and normalization
Performance Comparison with Other Vector Databases
Benchmark Results: DuckDB vs pgvector vs OpenSearch
DuckDB vs PostgreSQL pgvector
Single Query Performance (1M vector dataset):
- pgvector: 1.9s average
- DuckDB: 23s average (~12× slower)
- ChromaDB: Better than DuckDB for single queries
Concurrent Performance (100 queries, 100 concurrent requests):
- pgvector: 9.81s average - excellent, stable performance
- DuckDB: 1,416s average (144× slower) - massive degradation under load
- ChromaDB: 23.08s average
Key Finding: DuckDB’s architecture optimized for analytical workloads makes it unsuitable for low-latency, high-concurrency vector search scenarios.
Latency Analysis:
- Most DuckDB query overhead comes from point query inefficiency (~98% overhead)
- Only ~2% of runtime spent in actual HNSW search (usearch library)
- DuckDB not designed for individual vector lookups
OpenSearch vs Elasticsearch (For Reference)
While no direct DuckDB comparison exists, 2025 benchmarks show:
Elastic’s Benchmarks:
- Elasticsearch: 2-12× faster than OpenSearch
- With Binary Quantization (BBQ): 5× faster than OpenSearch with FAISS
Independent Study (Trail of Bits - March 2025):
- OpenSearch v2.17.1: 1.6× faster on Big5 workload
- OpenSearch: 11% faster on Vectorsearch workload
- Results vary by workload and configuration
Key Difference: Both OpenSearch and Elasticsearch are orders of magnitude faster than DuckDB for vector search due to purpose-built architecture.
Detailed Feature Comparison
| Feature | DuckDB VSS | pgvector | OpenSearch | Elasticsearch | Pinecone |
|---|---|---|---|---|---|
| Single Query Latency | ~23s | ~1.9s | <100ms | <100ms | <100ms |
| Concurrent Performance | Poor (144× slower) | Excellent | Excellent | Excellent | Excellent |
| Deployment | Embedded | PostgreSQL | Distributed | Distributed | Cloud |
| SQL Support | Full SQL | Full SQL | Limited | Limited | None |
| Index Type | HNSW | HNSW, IVFFlat | HNSW, FAISS | HNSW, BBQ | Proprietary |
| Max Dimensions | Unlimited | 2,000 (recommended) | 16,000 | 4,096 | 20,000+ |
| Distance Metrics | L2, Cosine, IP | L2, Cosine, IP | Multiple | Multiple | Multiple |
| FTS Integration | Native | Via PostgreSQL | Native | Native | No |
| Analytical Queries | Excellent | Good | Limited | Limited | Limited |
| Point Query Performance | Poor | Excellent | Excellent | Excellent | Excellent |
| Setup Complexity | Minimal | Medium | High | High | Low |
| Cost | Free/OSS | Free/OSS | Free/OSS | Paid + OSS | $70+/month |
| Best For | Offline analytics | Hybrid transactional/vector | Real-time search | Real-time search | Scale + managed |
Performance Recommendations by Use Case
| Use Case | Recommended Solution | Why |
|---|---|---|
| Real-time search API | OpenSearch, Elasticsearch, Pinecone | Sub-100ms latency required |
| Concurrent user queries | pgvector, OpenSearch, Elasticsearch | Need stable performance under load |
| RAG production system | pgvector, Pinecone, Weaviate | Low latency + high availability |
| Offline analytics | DuckDB ✅ | Complex SQL + vector queries |
| Data exploration | DuckDB ✅ | Rich SQL analytical capabilities |
| Batch processing | DuckDB, pgvector | Single-threaded bulk operations |
| Hybrid search (keyword + vector) | DuckDB, OpenSearch, Elasticsearch | Native FTS + vector integration |
| Recommendation systems (real-time) | OpenSearch, Pinecone | Need <100ms response |
| Recommendation systems (batch) | DuckDB ✅ | Complex analytical queries |
When to Use DuckDB for Vectors
✅ Good Use Cases:
- Offline analytics - Complex SQL queries combining vectors with tabular data
- Data exploration - Ad-hoc analysis of embeddings with rich SQL
- Batch processing - Process entire datasets with complex transformations
- Research/prototyping - Quick experimentation with embeddings
- Hybrid analytics - Combine FTS, vector search, and analytical queries
- Single-threaded workloads - No concurrent query requirements
❌ Poor Use Cases:
- Production APIs - High latency unacceptable for user-facing apps
- Real-time search - 23s query time vs <100ms requirement
- Concurrent queries - 144× performance degradation under load
- Point queries - Not optimized for individual vector lookups
- Low-latency RAG - Need sub-second retrieval for LLM prompts
- High-throughput systems - Poor concurrent performance
Best Practices
Data Modeling
1. Use Fixed-Size Arrays
```sql
-- Good: Fixed size
CREATE TABLE docs (embedding FLOAT[384]);

-- Bad: Variable size not supported
-- CREATE TABLE docs (embedding FLOAT[]);
```
2. Normalize Embeddings for Cosine Distance
```python
# Python: Normalize before storing
import numpy as np
normalized = embedding / np.linalg.norm(embedding)
```
3. Store Metadata Alongside Embeddings
```sql
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    text TEXT,
    embedding FLOAT[384],
    created_at TIMESTAMP,
    source TEXT,
    category TEXT
);
```
Indexing Strategy
1. Build After Bulk Insert
```sql
-- Efficient: Bulk insert, then index
COPY documents FROM 'data.parquet';
CREATE INDEX vec_idx ON documents USING HNSW (embedding);
```
2. Choose Appropriate Parameters
```sql
-- For high accuracy (slower):
CREATE INDEX vec_idx ON documents USING HNSW (embedding)
WITH (ef_construction = 256, M = 32);

-- For speed (lower accuracy):
CREATE INDEX vec_idx ON documents USING HNSW (embedding)
WITH (ef_construction = 64, M = 16);
```
3. Regular Maintenance
```sql
-- After significant deletions/updates
PRAGMA hnsw_compact_index('vec_idx');
```
Query Patterns
1. Always Use LIMIT
```sql
-- Good: Uses index efficiently
SELECT * FROM docs
ORDER BY embedding <=> $query
LIMIT 10;

-- Bad: Full scan without LIMIT
SELECT * FROM docs
ORDER BY embedding <=> $query;
```
2. Filter Before Vector Search
```sql
-- Efficient: Filter reduces search space
SELECT * FROM docs
WHERE category = 'technology'
ORDER BY embedding <=> $query
LIMIT 10;
```
3. Batch Queries When Possible
```sql
-- Process multiple queries in one pass
-- (QUALIFY keeps the 10 nearest documents per query;
--  a plain LIMIT 10 would return only 10 rows total)
SELECT q.id AS query_id, d.id AS doc_id,
    array_distance(d.embedding, q.embedding) AS distance
FROM queries q, documents d
QUALIFY row_number() OVER (PARTITION BY q.id ORDER BY distance) <= 10
ORDER BY query_id, distance;
```
Hybrid Search Configuration
1. Tune Weights Based on Use Case
```sql
-- News/articles: Higher semantic weight
SELECT *, (0.8 * semantic + 0.2 * bm25) AS score
FROM results;

-- Technical docs: Higher keyword weight
SELECT *, (0.4 * semantic + 0.6 * bm25) AS score
FROM results;
```
2. Normalize Scores Properly
```sql
-- BM25 is unbounded and needs min-max normalization
-- Cosine similarity is bounded in [-1, 1] ([0, 1] for non-negative
-- embeddings), so at most a linear rescale is needed
```
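A minimal sketch of the min-max step on its own, assuming a hypothetical `results` table with an unbounded `bm25` column; `NULLIF` guards against division by zero when all scores are equal:

```sql
-- Rescale bm25 into [0, 1] before weighting
SELECT id,
    (bm25 - MIN(bm25) OVER ())
        / NULLIF(MAX(bm25) OVER () - MIN(bm25) OVER (), 0) AS norm_bm25
FROM results;
```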
Limitations and Gotchas
Current Limitations
- 32-bit FLOAT Only
  - No support for DOUBLE (64-bit) or INT arrays
  - Precision limitations for certain applications
- Experimental Persistence
  - Disk persistence not production-ready
  - Risk of data loss on unexpected shutdown
  - In-memory recommended for production
- Index Memory Management
  - Indexes don't count toward `memory_limit`
  - Can cause OOM if not monitored separately
  - No automatic eviction policy
- No Dynamic Dimensionality
  - All embeddings must be the same dimension
  - Cannot mix 384- and 512-dimensional vectors (see the sketch after this list)
- Limited Distance Metrics
  - Only L2, cosine, and inner product
  - No Manhattan, Hamming, or custom metrics
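The fixed-dimension rule is easy to see in miniature; a sketch with a hypothetical 3-dimensional table (the second insert should fail with a conversion error because the list length does not match the array size):

```sql
CREATE TABLE tiny (v FLOAT[3]);
INSERT INTO tiny VALUES ([1.0, 2.0, 3.0]);  -- OK: length matches
INSERT INTO tiny VALUES ([1.0, 2.0]);       -- Error: wrong dimensionality
```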
Gotchas
1. Index Build Time
-- Large datasets take time-- Millions of vectors can take minutes-hours-- Use progress indicator to monitor2. Deletes Don’t Compact Automatically
-- Manual compaction requiredDELETE FROM docs WHERE old = true;PRAGMA hnsw_compact_index('vec_idx');3. Query Pattern Matters
-- Uses index: ✅SELECT * FROM docs ORDER BY distance LIMIT 10;
-- No index: ❌SELECT * FROM docs WHERE distance < 0.5;4. Case Sensitivity in FTS
-- FTS is case-sensitive by default-- Use lower=true in create_fts_indexPRAGMA create_fts_index(..., lower = true);Future Roadmap
Based on documentation and community discussions:
Planned/Potential Features:
- Stable disk persistence (high priority)
- Additional distance metrics
- Quantization support (reduced memory)
- Filtered vector search improvements
- Index-accelerated fuzzy joins
- Better memory management integration
Community Requests:
- Support for other numeric types (INT8, DOUBLE)
- Approximate filtering (pre-filter before vector search)
- Multi-vector queries
- Batch insertion optimization
- Distributed vector search
Resources
Official Documentation:
- VSS Extension Reference
- FTS Extension Reference
- Vector Similarity Search Blog
- What’s New in VSS
- Text Analytics Workflows
Community Resources:
- DuckDB Embeddings Discussion
- MotherDuck Vector Search Series
- Simon Willison: DuckDB for Embeddings
- Sören Brunk: Similarity Search with DuckDB
Integration Examples:
- sentence-transformers (Python)
- OpenAI embeddings API
- Hugging Face models
- CLIP for image embeddings
Related Extensions:
- `httpfs` - Load embeddings from remote files (example below)
- `json` - Parse embedding arrays from JSON
- `parquet` - Efficient storage for embeddings
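A sketch combining two of these, assuming a hypothetical remote Parquet file of precomputed embeddings and that your DuckDB version supports casting LIST columns to fixed-size arrays:

```sql
INSTALL httpfs;
LOAD httpfs;

-- Load id/text/embedding straight from a remote Parquet file
CREATE TABLE documents AS
SELECT id, text, embedding::FLOAT[384] AS embedding
FROM read_parquet('https://example.com/embeddings.parquet');
```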
Last Updated: 2025-11-05
DuckDB Version: 1.1.3+ (VSS extension experimental)
Status: VSS and FTS extensions are experimental but actively developed