Comprehensive guide to DuckDB’s vector similarity search, embeddings, and text analytics capabilities.

Quick Answer

Yes, DuckDB has vector capabilities through its experimental VSS (Vector Similarity Search) extension, but with important performance considerations:

✅ Capabilities

  • HNSW indexing for similarity search
  • Multiple distance metrics (L2, cosine, inner product)
  • Native ARRAY type for fixed-size embeddings (32-bit FLOAT)
  • Full-text search (FTS) extension for hybrid search
  • Recent updates with 66× speedup for joins (Oct 2024)

⚠️ Performance Reality

  • ~12× slower than pgvector for single queries (23s vs 1.9s in benchmarks)
  • Not optimized for point queries - designed for analytical workloads
  • Poor concurrent performance - significant degradation under load
  • High latency - unsuitable for real-time/production vector search
  • Best for: Analytical queries combining vectors with rich SQL, not low-latency search

❌ DuckLake Compatibility

  • VSS extension does NOT work with DuckLake (persistent storage format)
  • Only works with in-memory databases
  • Experimental persistence flag may cause data corruption
  • Must choose between DuckLake or VSS, cannot use both

Overview

DuckDB provides a complete solution for vector similarity search and text analytics through two experimental extensions:

  1. VSS Extension - Vector similarity search with HNSW indexing
  2. FTS Extension - Full-text search with BM25 scoring

Combined, these enable powerful hybrid search capabilities for RAG systems, recommendation engines, and semantic search applications.

Installation

VSS Extension

INSTALL vss;
LOAD vss;

FTS Extension

INSTALL fts;
LOAD fts;

Note: The FTS extension will be transparently auto-loaded on first use from the official extension repository.

Vector Similarity Search (VSS)

Supported Data Types

ARRAY Type (Since DuckDB 0.10)

  • Fixed-size arrays for storing embeddings
  • HNSW indexing currently supports only FLOAT (32-bit, single precision) elements
  • All elements in a column must have the same length and type
-- Create table with embedding column
CREATE TABLE documents (
    id INTEGER,
    text TEXT,
    embedding FLOAT[384] -- 384-dimensional vector
);

-- Insert embeddings
INSERT INTO documents VALUES
    (1, 'DuckDB is fast', [0.1, 0.2, ..., 0.384]),
    (2, 'Vectors are cool', [0.3, 0.4, ..., 0.385]);

Distance Functions

Three distance metrics are available:

| Function | Metric | Use Case | Formula |
|---|---|---|---|
| array_distance() | L2 Euclidean (default) | General similarity | √Σ(a−b)² |
| array_cosine_distance() | Cosine distance | Text/semantic similarity | 1 − cos(θ) |
| array_negative_inner_product() | Negative inner product | Ranking/scoring | −Σ(a×b) |

Alias: The <=> operator is aliased to array_cosine_distance() (compatible with PostgreSQL pgvector)
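
For intuition, here is a minimal sketch evaluating all three metrics on a pair of literal 3-dimensional vectors (the values are illustrative only):

-- L2 ≈ 1.414, cosine distance ≈ 0.018, negative inner product = -18
SELECT
    array_distance([1.0, 2.0, 3.0]::FLOAT[3], [2.0, 2.0, 4.0]::FLOAT[3]) AS l2,
    array_cosine_distance([1.0, 2.0, 3.0]::FLOAT[3], [2.0, 2.0, 4.0]::FLOAT[3]) AS cosine_dist,
    array_negative_inner_product([1.0, 2.0, 3.0]::FLOAT[3], [2.0, 2.0, 4.0]::FLOAT[3]) AS neg_ip;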

HNSW Index Creation

Basic Index:

CREATE INDEX doc_idx ON documents
USING HNSW (embedding);

With Custom Metric:

CREATE INDEX doc_cosine_idx ON documents
USING HNSW (embedding)
WITH (metric = 'cosine');

Available Metrics:

  • l2sq - L2 Euclidean distance (default)
  • cosine - Cosine distance
  • ip - Inner product
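
An HNSW index only accelerates queries whose distance function matches its metric (e.g. array_cosine_distance() requires metric = 'cosine'), so you may need one index per metric you query with. A sketch of an additional inner-product index on the same column:

-- Hypothetical second index for inner-product ranking
CREATE INDEX doc_ip_idx ON documents
USING HNSW (embedding)
WITH (metric = 'ip');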

Configurable Hyperparameters

| Parameter | Default | Description | Impact |
|---|---|---|---|
| ef_construction | 128 | Candidates during index build | Higher = better accuracy, slower build |
| ef_search | 64 | Candidates during search | Higher = better recall, slower queries |
| M | 16 | Max neighbors per vertex | Higher = better accuracy, more memory |
| M0 | 2 × M | Zero-level connectivity | Typically left at 2 × M |

Example with Custom Parameters:

CREATE INDEX doc_idx ON documents
USING HNSW (embedding)
WITH (
    metric = 'cosine',
    ef_construction = 256,
    M = 32
);

Runtime Configuration:

-- Override ef_search at query time
SET hnsw_ef_search = 128;

Similarity Search Queries

Basic Top-K Search:

-- Find 5 most similar documents
SELECT id, text, array_distance(embedding, [0.1, 0.2, ..., 0.384]::FLOAT[384]) AS distance
FROM documents
ORDER BY distance
LIMIT 5;

Using Cosine Similarity:

-- Semantic search with cosine distance (smaller = more similar)
SELECT id, text, embedding <=> [0.1, 0.2, ..., 0.384]::FLOAT[384] AS cosine_dist
FROM documents
ORDER BY cosine_dist
LIMIT 10;

Top-K Aggregates (New Feature):

-- Efficient top-k using min_by with index acceleration
SELECT min_by(id, array_distance(embedding, $query_vec), 5) AS top_5_ids
FROM documents;

Query Optimization

Verify Index Usage:

EXPLAIN
SELECT * FROM documents
ORDER BY array_distance(embedding, [0.1, 0.2, ..., 0.384]::FLOAT[384])
LIMIT 5;

Look for HNSW_INDEX_SCAN in the query plan to confirm index usage.

Optimized Query Patterns:

  • ORDER BY distance_function(...) LIMIT n - uses the index
  • min_by(column, distance_function(...), n) - uses the index (new)
  • LATERAL joins with distance functions - use HNSW_INDEX_JOIN (new; see the sketch below)
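
A minimal sketch of the LATERAL pattern, assuming a queries table whose embedding column matches documents.embedding in dimension:

-- For each query row, fetch its 3 nearest documents via the HNSW index
SELECT q.id AS query_id, nn.id AS doc_id, nn.dist
FROM queries q,
    LATERAL (
        SELECT d.id, array_distance(d.embedding, q.embedding) AS dist
        FROM documents d
        ORDER BY dist
        LIMIT 3
    ) AS nn;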

Fuzzy Vector Joins

Two table macros for joining without indexes:

vss_join() - Match vectors between tables:

-- Match each row in documents against its 5 nearest products (brute force)
SELECT *
FROM vss_join(documents, products, embedding, embedding, 5, metric := 'cosine') AS res;

vss_match() - Lateral join variant:

-- For each document row, find its 3 nearest products
SELECT d.id, res.*
FROM documents d,
    vss_match(products, d.embedding, embedding, 3) AS res;

Note: These perform brute-force searches but may receive index-based optimizations in future versions.

Full-Text Search (FTS)

Creating FTS Index

-- Create FTS index
PRAGMA create_fts_index(
    'documents',  -- table name
    'id',         -- unique identifier
    'text',       -- text column(s) to index
    stemmer = 'english',
    stopwords = 'english',
    ignore = '(\\.|[^a-z])+',
    strip_accents = true,
    lower = true
);

BM25 Scoring

-- Full-text search with BM25 relevance scoring
SELECT id, text, score
FROM (
    SELECT *, fts_main_documents.match_bm25(id, 'vector search') AS score
    FROM documents
)
WHERE score IS NOT NULL
ORDER BY score DESC;

Stemming

-- Extract word stems
SELECT stem('feeling', 'english'); -- Returns: 'feel'
SELECT stem('running', 'english'); -- Returns: 'run'
-- Use in queries
SELECT * FROM documents
WHERE stem(text, 'english') LIKE '%' || stem('running', 'english') || '%';

Hybrid Search: Combining FTS + Vector Similarity

Three-Method Approach

  1. Keyword Search - basic pattern matching
  2. Full-Text Search - BM25 scoring
  3. Semantic Search - vector embeddings
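
Methods 2 and 3 are shown elsewhere in this guide; method 1 is plain SQL. A minimal sketch using case-insensitive pattern matching:

-- Keyword search via ILIKE (no index assistance)
SELECT id, text
FROM documents
WHERE text ILIKE '%vector search%';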

Weighted Hybrid Query

-- Combine BM25 and vector similarity with weighted scoring
WITH keyword_scores AS (
    SELECT id,
        fts_main_documents.match_bm25(id, 'query text') AS bm25_score
    FROM documents
),
vector_scores AS (
    SELECT id,
        1 - array_cosine_distance(embedding, $query_embedding) AS vector_score
    FROM documents
),
normalized_scores AS (
    SELECT
        COALESCE(k.id, v.id) AS id,
        -- Min-max normalization for BM25 (unbounded)
        (k.bm25_score - MIN(k.bm25_score) OVER ()) /
            (MAX(k.bm25_score) OVER () - MIN(k.bm25_score) OVER ()) AS norm_bm25,
        v.vector_score AS norm_vector
    FROM keyword_scores k
    FULL OUTER JOIN vector_scores v ON k.id = v.id
)
SELECT
    d.id,
    d.text,
    -- Weighted combination (70% semantic, 30% keyword);
    -- COALESCE treats a missing score as 0 so one-sided matches still rank
    (0.7 * COALESCE(n.norm_vector, 0) + 0.3 * COALESCE(n.norm_bm25, 0)) AS final_score
FROM normalized_scores n
JOIN documents d ON n.id = d.id
ORDER BY final_score DESC
LIMIT 10;

Text Analytics Workflow

1. Text Preprocessing

Tokenization:

-- Split text into tokens
SELECT unnest(regexp_split_to_array(text, '\s+')) AS token
FROM documents;

Stopword Filtering:

-- Remove common words
WITH tokens AS (
    SELECT id, unnest(regexp_split_to_array(text, '\s+')) AS token
    FROM documents
),
stopwords AS (
    SELECT unnest(['the', 'is', 'at', 'which', 'on']) AS word
)
SELECT t.id, t.token
FROM tokens t
WHERE t.token NOT IN (SELECT word FROM stopwords);

2. Generate Embeddings

Using sentence-transformers (Python):

from sentence_transformers import SentenceTransformer
import duckdb

# Load the embedding model (384-dimensional output)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
texts = ["DuckDB is fast", "Vectors are cool"]
embeddings = model.encode(texts)

# Insert into DuckDB
conn = duckdb.connect('database.db')
conn.execute("""
    CREATE TABLE documents (
        id INTEGER,
        text TEXT,
        embedding FLOAT[384]
    )
""")
for i, (text, emb) in enumerate(zip(texts, embeddings)):
    conn.execute(
        "INSERT INTO documents VALUES (?, ?, ?)",
        [i, text, emb.tolist()]
    )

3. Create Indexes

-- Create vector index
CREATE INDEX vec_idx ON documents USING HNSW (embedding);
-- Create FTS index
PRAGMA create_fts_index('documents', 'id', 'text');

4. Query

-- Hybrid search combining both methods
SELECT d.id, d.text,
    -- Semantic similarity
    1 - (d.embedding <=> $query_embedding) AS semantic_score,
    -- Keyword relevance (NULL when the document does not match)
    fts_main_documents.match_bm25(d.id, 'search terms') AS keyword_score
FROM documents d
WHERE semantic_score > 0.7 OR keyword_score > 5.0
ORDER BY (0.7 * semantic_score + 0.3 * COALESCE(keyword_score, 0)) DESC
LIMIT 20;

Performance Characteristics

HNSW Index Performance

Build Time:

  • Improved multi-threading in recent versions
  • Better parallelism when indexing after table population
  • Progress indicator shows build status

Search Performance:

  • Sub-millisecond queries for millions of vectors (typical)
  • Configurable accuracy/speed tradeoff via ef_search
  • Approximately 66× speedup for LATERAL joins with HNSW_INDEX_JOIN

Memory Usage:

  • Indexes remain in RAM
  • Do not count toward memory_limit
  • Higher M values increase memory consumption

Optimization Tips

1. Build Indexes After Population

-- Better: Insert first, index later
CREATE TABLE documents (...);
INSERT INTO documents SELECT * FROM source;
CREATE INDEX vec_idx ON documents USING HNSW (embedding);
-- Slower: Create index on empty table then insert

2. Handle Deletions

-- Deletes mark rows as deleted but don't compact index immediately
DELETE FROM documents WHERE id < 1000;
-- Manually trigger compaction after significant deletions
PRAGMA hnsw_compact_index('vec_idx');

3. Tune Hyperparameters

-- For high recall (slower):
SET hnsw_ef_search = 256;
-- For speed (lower recall):
SET hnsw_ef_search = 32;

4. Monitor Query Plans

-- Always verify index usage
EXPLAIN SELECT * FROM documents
ORDER BY array_distance(embedding, $vec)
LIMIT 10;

Persistence and Storage

Current Limitations

Default Behavior:

  • HNSW indexes work best on in-memory databases
  • Persistence to disk is experimental

Enable Experimental Persistence:

SET hnsw_enable_experimental_persistence = true;

Risks:

  • WAL recovery not fully implemented
  • Unexpected shutdowns risk data loss
  • May corrupt file storage
  • Not recommended for production

Recommended Approach:

  • Use in-memory databases for production workloads
  • Or rebuild indexes on startup if using disk persistence
  • Monitor DuckDB releases for stable persistence support

DuckLake Compatibility

❌ VSS Extension Does NOT Work with DuckLake

Critical Limitation: The VSS (Vector Similarity Search) extension is incompatible with DuckLake, DuckDB’s persistent storage format released in May 2025.

What is DuckLake?

DuckLake is a DuckDB core extension that adds support for:

  • Attaching databases stored in the DuckLake format
  • ACID transactions with snapshot isolation
  • Time-travel queries across snapshots
  • Efficient storage management and compaction
  • Production-grade persistent storage

Basic DuckLake Usage:

INSTALL ducklake;
LOAD ducklake;
-- Attach DuckLake database
ATTACH 'ducklake:metadata.ducklake' AS my_db (DATA_PATH 'data_files');
USE my_db;
-- Query snapshot history
SELECT * FROM ducklake_snapshots('my_db');
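
DuckLake also supports time-travel reads; a minimal sketch, assuming a documents table and an existing snapshot version:

-- Query the table as of snapshot version 2
SELECT * FROM my_db.documents AT (VERSION => 2);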

The Compatibility Problem

You must choose one or the other:

| Feature | In-Memory DuckDB | DuckLake Storage |
|---|---|---|
| VSS Extension | ✅ Works | ❌ Does NOT work |
| Vector Search | ✅ Full support | ❌ Not supported |
| HNSW Indexes | ✅ Supported | ❌ Not supported |
| Persistence | ⚠️ Experimental (risky) | ✅ Production-ready |
| ACID Transactions | Limited | ✅ Full support |
| Time Travel | No | ✅ Supported |
| Snapshot Isolation | No | ✅ Supported |

Workarounds and Alternatives

Option 1: Use In-Memory DuckDB (Recommended for VSS)

-- Connect to in-memory database
.open :memory:
-- Use VSS normally
INSTALL vss;
LOAD vss;
CREATE INDEX vec_idx ON docs USING HNSW (embedding);

Pros:

  • ✅ VSS works perfectly
  • ✅ Fast performance
  • ✅ No persistence issues

Cons:

  • ❌ Data lost on shutdown
  • ❌ Must reload data on startup
  • ❌ Limited by RAM

Option 2: Rebuild Indexes on Startup

-- Load data from DuckLake
ATTACH 'ducklake:data.ducklake' AS source;
-- Copy to in-memory database
CREATE TABLE docs AS SELECT * FROM source.docs;
DETACH source;
-- Build VSS index in-memory
INSTALL vss;
LOAD vss;
CREATE INDEX vec_idx ON docs USING HNSW (embedding);

Pros:

  • ✅ Persistent source data in DuckLake
  • ✅ VSS works in-memory
  • ✅ Best of both worlds

Cons:

  • ❌ Startup time for index rebuild
  • ❌ Requires scripting
  • ❌ Double storage (DuckLake + in-memory)

Option 3: Use pgvector with PostgreSQL Instead

-- PostgreSQL with pgvector supports both persistence and vectors
CREATE EXTENSION vector;
CREATE TABLE docs (
    id SERIAL PRIMARY KEY,
    embedding vector(384)
);
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

Pros:

  • ✅ Full persistence support
  • ✅ ACID transactions
  • ✅ Better vector search performance (12× faster than DuckDB)
  • ✅ Production-ready

Cons:

  • ❌ Not embedded (requires PostgreSQL server)
  • ❌ More complex setup
  • ❌ Less powerful analytical SQL than DuckDB

Option 4: Separate Systems for Vectors and Analytics

# Use a specialized vector DB for search
import pinecone
pinecone.init(api_key="...")
index = pinecone.Index("vectors")

# Use DuckLake for analytical data
import duckdb
conn = duckdb.connect()
conn.execute("ATTACH 'ducklake:analytics.ducklake' AS data")

# Combine results in the application layer
vector_results = index.query(query_embedding, top_k=100)
analytical_results = conn.execute(
    # Bind the ID list as a single DuckDB LIST parameter
    "SELECT * FROM data.documents WHERE list_contains(?, id)",
    [list(vector_results.ids)],
).fetchall()

Pros:

  • ✅ Best tool for each job
  • ✅ Scales independently
  • ✅ Production-ready for both

Cons:

  • ❌ More complex architecture
  • ❌ Additional operational overhead
  • ❌ Cannot use SQL for hybrid queries

Architecture Decision Guide

Choose In-Memory DuckDB + VSS when:

  • Data fits in RAM
  • Acceptable to rebuild indexes on startup
  • Need rich SQL + vector search in single query
  • Offline/batch analytics workload
  • No persistence requirements

Choose DuckLake (without VSS) when:

  • Need persistent storage
  • ACID transactions required
  • Time-travel queries needed
  • Vector search not required
  • Analytical workload without embeddings

Choose pgvector when:

  • Need both persistence AND vectors
  • Production vector search required
  • Transactional workload
  • Can accept PostgreSQL complexity
  • Want better vector search performance

Choose Separate Systems when:

  • High-scale production system
  • Need best-in-class for both vectors and analytics
  • Can manage multiple databases
  • Budget allows operational complexity

Use Cases

1. RAG (Retrieval-Augmented Generation)

-- Find relevant context for LLM prompts
WITH query_embedding AS (
    SELECT $user_query_embedding AS vec
)
SELECT d.text, d.metadata
FROM documents d, query_embedding q
ORDER BY d.embedding <=> q.vec
LIMIT 5;

2. Recommendation Systems

-- Find similar products/content
SELECT p2.id, p2.name,
    array_distance(p1.embedding, p2.embedding) AS distance
FROM products p1, products p2
WHERE p1.id = $current_product_id
    AND p2.id != p1.id
ORDER BY distance
LIMIT 10;

3. Semantic Search

-- Search documents by meaning, not just keywords
SELECT id, title, snippet,
    1 - (embedding <=> $query_embedding) AS relevance
FROM articles
WHERE relevance > 0.6
ORDER BY relevance DESC;

4. Duplicate Detection

-- Find near-duplicate content
SELECT d1.id AS doc1, d2.id AS doc2,
    array_cosine_distance(d1.embedding, d2.embedding) AS cosine_dist
FROM documents d1, documents d2
WHERE d1.id < d2.id
    AND cosine_dist < 0.1 -- very similar (small distance)
ORDER BY cosine_dist;

5. Image Search (with embeddings from vision models)

-- Find visually similar images
CREATE TABLE images (
    id INTEGER,
    filename TEXT,
    embedding FLOAT[512] -- from CLIP, ResNet, etc.
);

SELECT filename, array_distance(embedding, $query_image_embedding) AS distance
FROM images
ORDER BY distance
LIMIT 20;

Recent Updates (2024-2025)

October 2024 Release

1. Performance Improvements

  • Enhanced multi-threaded index building
  • Fine-grained work distribution
  • Better memory allocation
  • Progress indicators during index creation

2. New Distance Functions

  • array_cosine_distance() - computes 1 − cosine similarity
  • array_negative_inner_product() - computes the negative inner product
  • <=> operator for cosine distance (pgvector compatibility)

3. Top-K Aggregates

  • min_by()/max_by() now accept optional n parameter
  • Index-accelerated top-k queries
  • Avoids full table scans

4. LATERAL Join Optimization

  • New HNSW_INDEX_JOIN operator
  • 66× speedup for vector joins (10s → 0.15s in benchmarks)
  • Efficient cross-table similarity searches

June 2025 - Text Analytics

Lightweight Text Analytics Workflows:

  • Demonstrated hybrid search combining keyword, FTS, and vector embeddings
  • Integration examples with sentence-transformers
  • Best practices for text preprocessing and normalization

Performance Comparison with Other Vector Databases

Benchmark Results: DuckDB vs pgvector vs OpenSearch

DuckDB vs PostgreSQL pgvector

Single Query Performance (1M vector dataset):

  • pgvector: 1.9s average
  • DuckDB: 23s average (~12× slower)
  • ChromaDB: Better than DuckDB for single queries

Concurrent Performance (100 queries, 100 concurrent requests):

  • pgvector: 9.81s average - excellent, stable performance
  • DuckDB: 1,416s average (144× slower) - massive degradation under load
  • ChromaDB: 23.08s average

Key Finding: DuckDB’s architecture optimized for analytical workloads makes it unsuitable for low-latency, high-concurrency vector search scenarios.

Latency Analysis:

  • Most DuckDB query overhead comes from point query inefficiency (~98% overhead)
  • Only ~2% of runtime spent in actual HNSW search (usearch library)
  • DuckDB not designed for individual vector lookups

OpenSearch vs Elasticsearch (For Reference)

While no direct DuckDB comparison exists, 2025 benchmarks show:

Elastic’s Benchmarks:

  • Elasticsearch: 2-12× faster than OpenSearch
  • With Binary Quantization (BBQ): 5× faster than OpenSearch with FAISS

Independent Study (Trail of Bits - March 2025):

  • OpenSearch v2.17.1: 1.6× faster on Big5 workload
  • OpenSearch: 11% faster on Vectorsearch workload
  • Results vary by workload and configuration

Key Difference: Both OpenSearch and Elasticsearch are orders of magnitude faster than DuckDB for vector search due to purpose-built architecture.

Detailed Feature Comparison

| Feature | DuckDB VSS | pgvector | OpenSearch | Elasticsearch | Pinecone |
|---|---|---|---|---|---|
| Single Query Latency | ~23s | ~1.9s | <100ms | <100ms | <100ms |
| Concurrent Performance | Poor (144× slower) | Excellent | Excellent | Excellent | Excellent |
| Deployment | Embedded | PostgreSQL | Distributed | Distributed | Cloud |
| SQL Support | Full SQL | Full SQL | Limited | Limited | None |
| Index Type | HNSW | HNSW, IVFFlat | HNSW, FAISS | HNSW, BBQ | Proprietary |
| Max Dimensions | Unlimited | 2,000 (recommended) | 16,000 | 4,096 | 20,000+ |
| Distance Metrics | L2, Cosine, IP | L2, Cosine, IP | Multiple | Multiple | Multiple |
| FTS Integration | Native | Via PostgreSQL | Native | Native | No |
| Analytical Queries | Excellent | Good | Limited | Limited | Limited |
| Point Query Performance | Poor | Excellent | Excellent | Excellent | Excellent |
| Setup Complexity | Minimal | Medium | High | High | Low |
| Cost | Free/OSS | Free/OSS | Free/OSS | Paid + OSS | $70+/month |
| Best For | Offline analytics | Hybrid transactional/vector | Real-time search | Real-time search | Scale + managed |

Performance Recommendations by Use Case

| Use Case | Recommended Solution | Why |
|---|---|---|
| Real-time search API | OpenSearch, Elasticsearch, Pinecone | Sub-100ms latency required |
| Concurrent user queries | pgvector, OpenSearch, Elasticsearch | Need stable performance under load |
| RAG production system | pgvector, Pinecone, Weaviate | Low latency + high availability |
| Offline analytics | DuckDB | Complex SQL + vector queries |
| Data exploration | DuckDB | Rich SQL analytical capabilities |
| Batch processing | DuckDB, pgvector | Single-threaded bulk operations |
| Hybrid search (keyword + vector) | DuckDB, OpenSearch, Elasticsearch | Native FTS + vector integration |
| Recommendation systems (real-time) | OpenSearch, Pinecone | Need <100ms response |
| Recommendation systems (batch) | DuckDB | Complex analytical queries |

When to Use DuckDB for Vectors

✅ Good Use Cases:

  1. Offline analytics - Complex SQL queries combining vectors with tabular data
  2. Data exploration - Ad-hoc analysis of embeddings with rich SQL
  3. Batch processing - Process entire datasets with complex transformations
  4. Research/prototyping - Quick experimentation with embeddings
  5. Hybrid analytics - Combine FTS, vector search, and analytical queries
  6. Single-threaded workloads - No concurrent query requirements

❌ Poor Use Cases:

  1. Production APIs - High latency unacceptable for user-facing apps
  2. Real-time search - 23s query time vs <100ms requirement
  3. Concurrent queries - 144× performance degradation under load
  4. Point queries - Not optimized for individual vector lookups
  5. Low-latency RAG - Need sub-second retrieval for LLM prompts
  6. High-throughput systems - Poor concurrent performance

Best Practices

Data Modeling

1. Use Fixed-Size Arrays

-- Good: Fixed size
CREATE TABLE docs (embedding FLOAT[384]);
-- Bad: Variable size not supported
-- CREATE TABLE docs (embedding FLOAT[]);

2. Normalize Embeddings for Cosine Distance

# Python: Normalize before storing
import numpy as np
normalized = embedding / np.linalg.norm(embedding)

3. Store Metadata Alongside Embeddings

CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    text TEXT,
    embedding FLOAT[384],
    created_at TIMESTAMP,
    source TEXT,
    category TEXT
);

Indexing Strategy

1. Build After Bulk Insert

-- Efficient: Bulk insert then index
COPY documents FROM 'data.parquet';
CREATE INDEX vec_idx ON documents USING HNSW (embedding);

2. Choose Appropriate Parameters

-- For high accuracy (slower):
CREATE INDEX vec_idx ON documents USING HNSW (embedding)
WITH (ef_construction = 256, M = 32);
-- For speed (lower accuracy):
CREATE INDEX vec_idx ON documents USING HNSW (embedding)
WITH (ef_construction = 64, M = 16);

3. Regular Maintenance

-- After significant deletions/updates
PRAGMA hnsw_compact_index('vec_idx');

Query Patterns

1. Always Use LIMIT

-- Good: Uses index efficiently
SELECT * FROM docs
ORDER BY embedding <=> $query
LIMIT 10;
-- Bad: Full scan without LIMIT
SELECT * FROM docs
ORDER BY embedding <=> $query;

2. Filter Before Vector Search

-- A selective filter shrinks the candidate set, but note that the HNSW
-- index is not used when extra WHERE predicates are present; DuckDB
-- falls back to scanning the filtered rows
SELECT * FROM docs
WHERE category = 'technology'
ORDER BY embedding <=> $query
LIMIT 10;

3. Batch Queries When Possible

-- Process multiple queries in one pass, keeping the 10 nearest per query
SELECT q.id AS query_id, d.id AS doc_id,
    array_distance(d.embedding, q.embedding) AS distance
FROM queries q, documents d
QUALIFY row_number() OVER (
    PARTITION BY q.id ORDER BY array_distance(d.embedding, q.embedding)
) <= 10
ORDER BY query_id, distance;

Hybrid Search Configuration

1. Tune Weights Based on Use Case

-- News/articles: Higher semantic weight
SELECT *, (0.8 * semantic + 0.2 * bm25) AS score
FROM results;
-- Technical docs: Higher keyword weight
SELECT *, (0.4 * semantic + 0.6 * bm25) AS score
FROM results;

2. Normalize Scores Properly

-- BM25 is unbounded, so min-max normalize it
-- Cosine similarity lies in [-1, 1] (roughly [0, 1] for typical text
-- embeddings), so it needs little or no rescaling
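
A sketch of the min-max step, assuming a keyword_scores relation like the one in the weighted hybrid query above:

-- Rescale BM25 into [0, 1]; NULLIF guards against a zero score range
SELECT id,
    (bm25_score - MIN(bm25_score) OVER ())
        / NULLIF(MAX(bm25_score) OVER () - MIN(bm25_score) OVER (), 0) AS norm_bm25
FROM keyword_scores;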

Limitations and Gotchas

Current Limitations

  1. 32-bit FLOAT Only

    • No support for DOUBLE (64-bit) or INT arrays
    • Precision limitations for certain applications
  2. Experimental Persistence

    • Disk persistence not production-ready
    • Risk of data loss on unexpected shutdown
    • In-memory recommended for production
  3. Index Memory Management

    • Indexes don’t count toward memory_limit
    • Can cause OOM if not monitored separately
    • No automatic eviction policy
  4. No Dynamic Dimensionality

    • All embeddings must be same dimension
    • Cannot mix 384 and 512 dimensional vectors
  5. Limited Distance Metrics

    • Only L2, cosine, and inner product
    • No Manhattan, Hamming, or custom metrics
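
Because only 32-bit FLOAT arrays are indexable, embeddings arriving as DOUBLEs or variable-length lists need an explicit cast before insertion; a sketch (staging_embeddings is a hypothetical source table):

-- Cast LIST/DOUBLE values to a fixed-size FLOAT array on insert
INSERT INTO documents
SELECT id, text, embedding::FLOAT[384]
FROM staging_embeddings;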

Gotchas

1. Index Build Time

-- Large datasets take time
-- Millions of vectors can take minutes-hours
-- Use progress indicator to monitor

2. Deletes Don’t Compact Automatically

-- Manual compaction required
DELETE FROM docs WHERE old = true;
PRAGMA hnsw_compact_index('vec_idx');

3. Query Pattern Matters

-- Uses index: ✅
SELECT * FROM docs ORDER BY distance LIMIT 10;
-- No index: ❌
SELECT * FROM docs WHERE distance < 0.5;
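
A common workaround sketch: pull a generous top-k through the index first, then apply the distance threshold to that candidate set:

-- Index-friendly: top-k first, then filter by distance
SELECT * FROM (
    SELECT *, array_distance(embedding, $query_vec) AS dist
    FROM docs
    ORDER BY dist
    LIMIT 100
) candidates
WHERE dist < 0.5;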

4. Case Sensitivity in FTS

-- FTS is case-sensitive by default
-- Use lower=true in create_fts_index
PRAGMA create_fts_index(..., lower = true);

Future Roadmap

Based on documentation and community discussions:

Planned/Potential Features:

  • Stable disk persistence (high priority)
  • Additional distance metrics
  • Quantization support (reduced memory)
  • Filtered vector search improvements
  • Index-accelerated fuzzy joins
  • Better memory management integration

Community Requests:

  • Support for other numeric types (INT8, DOUBLE)
  • Approximate filtering (pre-filter before vector search)
  • Multi-vector queries
  • Batch insertion optimization
  • Distributed vector search

Resources

Integration Examples:

  • sentence-transformers (Python)
  • OpenAI embeddings API
  • Hugging Face models
  • CLIP for image embeddings

Related Extensions:

  • httpfs - Load embeddings from remote files
  • json - Parse embedding arrays from JSON
  • parquet - Efficient storage for embeddings
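
For example, httpfs lets you pull pre-computed embeddings straight from a remote Parquet file (a sketch; the URL is hypothetical):

INSTALL httpfs;
LOAD httpfs;
CREATE TABLE documents AS
SELECT * FROM read_parquet('https://example.com/embeddings.parquet');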

Last Updated: 2025-11-05
DuckDB Version: 1.1.3+ (VSS extension experimental)
Status: VSS and FTS extensions are experimental but actively developed