Comprehensive guide to DuckDB’s vector similarity search, embeddings, and text analytics capabilities.

Quick Answer

Yes, DuckDB has vector capabilities through its experimental VSS (Vector Similarity Search) extension, but with important performance considerations:

✅ Capabilities

  • HNSW indexing for similarity search
  • Multiple distance metrics (L2, cosine, inner product)
  • Native ARRAY type for fixed-size embeddings (32-bit FLOAT)
  • Full-text search (FTS) extension for hybrid search
  • Recent updates with 66× speedup for joins (Oct 2024)

⚠️ Performance Reality

  • ~12× slower than pgvector for single queries (23s vs 1.9s in benchmarks)
  • Not optimized for point queries - designed for analytical workloads
  • Poor concurrent performance - significant degradation under load
  • High latency - unsuitable for real-time/production vector search
  • Best for: Analytical queries combining vectors with rich SQL, not low-latency search

❌ DuckLake Compatibility

  • VSS extension does NOT work with DuckLake (persistent storage format)
  • Only works with in-memory databases
  • Experimental persistence flag may cause data corruption
  • Must choose between DuckLake or VSS, cannot use both

Overview

DuckDB provides a complete solution for vector similarity search and text analytics through two experimental extensions:

  1. VSS Extension - Vector similarity search with HNSW indexing
  2. FTS Extension - Full-text search with BM25 scoring

Combined, these enable powerful hybrid search capabilities for RAG systems, recommendation engines, and semantic search applications.

Installation

VSS Extension

INSTALL vss;
LOAD vss;

FTS Extension

INSTALL fts;
LOAD fts;

Note: The FTS extension will be transparently auto-loaded on first use from the official extension repository.

Vector Similarity Search (VSS)

Supported Data Types

ARRAY Type (Since DuckDB 0.10)

  • Fixed-size arrays for storing embeddings
  • HNSW indexing currently supports only FLOAT (32-bit, single precision) elements
  • All elements in a column must have the same length and type
-- Create table with embedding column
CREATE TABLE documents (
    id INTEGER,
    text TEXT,
    embedding FLOAT[384] -- 384-dimensional vector
);

-- Insert embeddings
INSERT INTO documents VALUES
    (1, 'DuckDB is fast', [0.1, 0.2, ..., 0.384]),
    (2, 'Vectors are cool', [0.3, 0.4, ..., 0.385]);

Distance Functions

Three distance metrics are available:

| Function | Metric | Use Case | Formula |
|---|---|---|---|
| array_distance() | L2 Euclidean (default) | General similarity | √Σ(a−b)² |
| array_cosine_distance() | Cosine distance | Text/semantic similarity | 1 − cos(θ) |
| array_negative_inner_product() | Negative inner product | Ranking/scoring | −Σ(a×b) |

Alias: The <=> operator is aliased to array_cosine_distance() (compatible with PostgreSQL pgvector)
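
For intuition, here is a minimal sketch evaluating all three metrics on a pair of literal 3-dimensional vectors (the values are illustrative only):

-- L2 ≈ 1.414, cosine distance ≈ 0.018, negative inner product = -18
SELECT
    array_distance([1.0, 2.0, 3.0]::FLOAT[3], [2.0, 2.0, 4.0]::FLOAT[3]) AS l2,
    array_cosine_distance([1.0, 2.0, 3.0]::FLOAT[3], [2.0, 2.0, 4.0]::FLOAT[3]) AS cosine_dist,
    array_negative_inner_product([1.0, 2.0, 3.0]::FLOAT[3], [2.0, 2.0, 4.0]::FLOAT[3]) AS neg_ip;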

HNSW Index Creation

Basic Index:

CREATE INDEX doc_idx ON documents
USING HNSW (embedding);

With Custom Metric:

CREATE INDEX doc_cosine_idx ON documents
USING HNSW (embedding)
WITH (metric = 'cosine');

Available Metrics:

  • l2sq - L2 Euclidean distance (default)
  • cosine - Cosine distance
  • ip - Inner product
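
An HNSW index only accelerates queries whose distance function matches its metric (e.g. array_cosine_distance() requires metric = 'cosine'), so you may need one index per metric you query with. A sketch of an additional inner-product index on the same column:

-- Hypothetical second index for inner-product ranking
CREATE INDEX doc_ip_idx ON documents
USING HNSW (embedding)
WITH (metric = 'ip');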

Configurable Hyperparameters

| Parameter | Default | Description | Impact |
|---|---|---|---|
| ef_construction | 128 | Candidates during index build | Higher = better accuracy, slower build |
| ef_search | 64 | Candidates during search | Higher = better recall, slower queries |
| M | 16 | Max neighbors per vertex | Higher = better accuracy, more memory |
| M0 | 2 × M | Zero-level connectivity | Typically left at 2 × M |

Example with Custom Parameters:

CREATE INDEX doc_idx ON documents
USING HNSW (embedding)
WITH (
    metric = 'cosine',
    ef_construction = 256,
    M = 32
);

Runtime Configuration:

-- Override ef_search at query time
SET hnsw_ef_search = 128;

Similarity Search Queries

Basic Top-K Search:

-- Find 5 most similar documents
SELECT id, text, array_distance(embedding, [0.1, 0.2, ..., 0.384]::FLOAT[384]) AS distance
FROM documents
ORDER BY distance
LIMIT 5;

Using Cosine Similarity:

-- Semantic search with cosine distance (smaller = more similar)
SELECT id, text, embedding <=> [0.1, 0.2, ..., 0.384]::FLOAT[384] AS cosine_dist
FROM documents
ORDER BY cosine_dist
LIMIT 10;

Top-K Aggregates (New Feature):

-- Efficient top-k using min_by with index acceleration
SELECT min_by(id, array_distance(embedding, $query_vec), 5) AS top_5_ids
FROM documents;

Query Optimization

Verify Index Usage:

EXPLAIN
SELECT * FROM documents
ORDER BY array_distance(embedding, [0.1, 0.2, ..., 0.384]::FLOAT[384])
LIMIT 5;

Look for HNSW_INDEX_SCAN in the query plan to confirm index usage.

Optimized Query Patterns:

  • ORDER BY distance_function(...) LIMIT n - uses the index
  • min_by(column, distance_function(...), n) - uses the index (new)
  • LATERAL joins with distance functions - use HNSW_INDEX_JOIN (new; see the sketch below)
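
A minimal sketch of the LATERAL pattern, assuming a queries table whose embedding column matches documents.embedding in dimension:

-- For each query row, fetch its 3 nearest documents via the HNSW index
SELECT q.id AS query_id, nn.id AS doc_id, nn.dist
FROM queries q,
    LATERAL (
        SELECT d.id, array_distance(d.embedding, q.embedding) AS dist
        FROM documents d
        ORDER BY dist
        LIMIT 3
    ) AS nn;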

Fuzzy Vector Joins

Two table macros for joining without indexes:

vss_join() - Match vectors between tables:

-- Match each row in documents against its 5 nearest products (brute force)
SELECT *
FROM vss_join(documents, products, embedding, embedding, 5, metric := 'cosine') AS res;

vss_match() - Lateral join variant:

-- For each document row, find its 3 nearest products
SELECT d.id, res.*
FROM documents d,
    vss_match(products, d.embedding, embedding, 3) AS res;

Note: These perform brute-force searches but may receive index-based optimizations in future versions.

Full-Text Search (FTS)

Creating FTS Index

-- Create FTS index
PRAGMA create_fts_index(
    'documents',  -- table name
    'id',         -- unique identifier
    'text',       -- text column(s) to index
    stemmer = 'english',
    stopwords = 'english',
    ignore = '(\\.|[^a-z])+',
    strip_accents = true,
    lower = true
);

BM25 Scoring

-- Full-text search with BM25 relevance scoring
SELECT id, text, score
FROM (
    SELECT *, fts_main_documents.match_bm25(id, 'vector search') AS score
    FROM documents
)
WHERE score IS NOT NULL
ORDER BY score DESC;

Stemming

-- Extract word stems
SELECT stem('feeling', 'english'); -- Returns: 'feel'
SELECT stem('running', 'english'); -- Returns: 'run'
-- Use in queries
SELECT * FROM documents
WHERE stem(text, 'english') LIKE '%' || stem('running', 'english') || '%';

Hybrid Search: Combining FTS + Vector Similarity

Three-Method Approach

  1. Keyword Search - basic pattern matching
  2. Full-Text Search - BM25 scoring
  3. Semantic Search - vector embeddings
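
Methods 2 and 3 are shown elsewhere in this guide; method 1 is plain SQL. A minimal sketch using case-insensitive pattern matching:

-- Keyword search via ILIKE (no index assistance)
SELECT id, text
FROM documents
WHERE text ILIKE '%vector search%';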

Weighted Hybrid Query

-- Combine BM25 and vector similarity with weighted scoring
WITH keyword_scores AS (
    SELECT id,
        fts_main_documents.match_bm25(id, 'query text') AS bm25_score
    FROM documents
),
vector_scores AS (
    SELECT id,
        1 - array_cosine_distance(embedding, $query_embedding) AS vector_score
    FROM documents
),
normalized_scores AS (
    SELECT
        COALESCE(k.id, v.id) AS id,
        -- Min-max normalization for BM25 (unbounded)
        (k.bm25_score - MIN(k.bm25_score) OVER ()) /
            (MAX(k.bm25_score) OVER () - MIN(k.bm25_score) OVER ()) AS norm_bm25,
        v.vector_score AS norm_vector
    FROM keyword_scores k
    FULL OUTER JOIN vector_scores v ON k.id = v.id
)
SELECT
    d.id,
    d.text,
    -- Weighted combination (70% semantic, 30% keyword);
    -- COALESCE treats a missing score as 0 so one-sided matches still rank
    (0.7 * COALESCE(n.norm_vector, 0) + 0.3 * COALESCE(n.norm_bm25, 0)) AS final_score
FROM normalized_scores n
JOIN documents d ON n.id = d.id
ORDER BY final_score DESC
LIMIT 10;

Text Analytics Workflow

1. Text Preprocessing

Tokenization:

-- Split text into tokens
SELECT unnest(regexp_split_to_array(text, '\s+')) AS token
FROM documents;

Stopword Filtering:

-- Remove common words
WITH tokens AS (
    SELECT id, unnest(regexp_split_to_array(text, '\s+')) AS token
    FROM documents
),
stopwords AS (
    SELECT unnest(['the', 'is', 'at', 'which', 'on']) AS word
)
SELECT t.id, t.token
FROM tokens t
WHERE t.token NOT IN (SELECT word FROM stopwords);

2. Generate Embeddings

Using sentence-transformers (Python):

from sentence_transformers import SentenceTransformer
import duckdb

# Load the embedding model (384-dimensional output)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
texts = ["DuckDB is fast", "Vectors are cool"]
embeddings = model.encode(texts)

# Insert into DuckDB
conn = duckdb.connect('database.db')
conn.execute("""
    CREATE TABLE documents (
        id INTEGER,
        text TEXT,
        embedding FLOAT[384]
    )
""")
for i, (text, emb) in enumerate(zip(texts, embeddings)):
    conn.execute(
        "INSERT INTO documents VALUES (?, ?, ?)",
        [i, text, emb.tolist()]
    )

3. Create Indexes

-- Create vector index
CREATE INDEX vec_idx ON documents USING HNSW (embedding);
-- Create FTS index
PRAGMA create_fts_index('documents', 'id', 'text');

4. Query

-- Hybrid search combining both methods
SELECT d.id, d.text,
    -- Semantic similarity
    1 - (d.embedding <=> $query_embedding) AS semantic_score,
    -- Keyword relevance (NULL when the document does not match)
    fts_main_documents.match_bm25(d.id, 'search terms') AS keyword_score
FROM documents d
WHERE semantic_score > 0.7 OR keyword_score > 5.0
ORDER BY (0.7 * semantic_score + 0.3 * COALESCE(keyword_score, 0)) DESC
LIMIT 20;

Performance Characteristics

HNSW Index Performance

Build Time:

  • Improved multi-threading in recent versions
  • Better parallelism when indexing after table population
  • Progress indicator shows build status

Search Performance:

  • Sub-millisecond queries for millions of vectors (typical)
  • Configurable accuracy/speed tradeoff via ef_search
  • Approximately 66× speedup for LATERAL joins with HNSW_INDEX_JOIN

Memory Usage:

  • Indexes remain in RAM
  • Do not count toward memory_limit
  • Higher M values increase memory consumption

Optimization Tips

1. Build Indexes After Population

-- Better: Insert first, index later
CREATE TABLE documents (...);
INSERT INTO documents SELECT * FROM source;
CREATE INDEX vec_idx ON documents USING HNSW (embedding);
-- Slower: Create index on empty table then insert

2. Handle Deletions

-- Deletes mark rows as deleted but don't compact index immediately
DELETE FROM documents WHERE id < 1000;
-- Manually trigger compaction after significant deletions
PRAGMA hnsw_compact_index('vec_idx');

3. Tune Hyperparameters

-- For high recall (slower):
SET hnsw_ef_search = 256;
-- For speed (lower recall):
SET hnsw_ef_search = 32;

4. Monitor Query Plans

-- Always verify index usage
EXPLAIN SELECT * FROM documents
ORDER BY array_distance(embedding, $vec)
LIMIT 10;

Persistence and Storage

Current Limitations

Default Behavior:

  • HNSW indexes work best on in-memory databases
  • Persistence to disk is experimental

Enable Experimental Persistence:

SET hnsw_enable_experimental_persistence = true;

Risks:

  • WAL recovery not fully implemented
  • Unexpected shutdowns risk data loss
  • May corrupt file storage
  • Not recommended for production

Recommended Approach:

  • Use in-memory databases for production workloads
  • Or rebuild indexes on startup if using disk persistence
  • Monitor DuckDB releases for stable persistence support

DuckLake Compatibility

❌ VSS Extension Does NOT Work with DuckLake

Critical Limitation: The VSS (Vector Similarity Search) extension is incompatible with DuckLake, DuckDB’s persistent storage format released in May 2025.

What is DuckLake?

DuckLake is a DuckDB core extension that adds support for:

  • Attaching databases stored in the DuckLake format
  • ACID transactions with snapshot isolation
  • Time-travel queries across snapshots
  • Efficient storage management and compaction
  • Production-grade persistent storage

Basic DuckLake Usage:

INSTALL ducklake;
LOAD ducklake;
-- Attach DuckLake database
ATTACH 'ducklake:metadata.ducklake' AS my_db (DATA_PATH 'data_files');
USE my_db;
-- Query snapshot history
SELECT * FROM ducklake_snapshots('my_db');
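
DuckLake also supports time-travel reads; a minimal sketch, assuming a documents table and an existing snapshot version:

-- Query the table as of snapshot version 2
SELECT * FROM my_db.documents AT (VERSION => 2);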

The Compatibility Problem

You must choose one or the other:

| Feature | In-Memory DuckDB | DuckLake Storage |
|---|---|---|
| VSS Extension | ✅ Works | ❌ Does NOT work |
| Vector Search | ✅ Full support | ❌ Not supported |
| HNSW Indexes | ✅ Supported | ❌ Not supported |
| Persistence | ⚠️ Experimental (risky) | ✅ Production-ready |
| ACID Transactions | Limited | ✅ Full support |
| Time Travel | No | ✅ Supported |
| Snapshot Isolation | No | ✅ Supported |

Workarounds and Alternatives

Option 1: Use In-Memory DuckDB (Recommended for VSS)

-- Connect to in-memory database
.open :memory:
-- Use VSS normally
INSTALL vss;
LOAD vss;
CREATE INDEX vec_idx ON docs USING HNSW (embedding);

Pros:

  • ✅ VSS works perfectly
  • ✅ Fast performance
  • ✅ No persistence issues

Cons:

  • ❌ Data lost on shutdown
  • ❌ Must reload data on startup
  • ❌ Limited by RAM

Option 2: Rebuild Indexes on Startup

-- Load data from DuckLake
ATTACH 'ducklake:data.ducklake' AS source;
-- Copy to in-memory database
CREATE TABLE docs AS SELECT * FROM source.docs;
DETACH source;
-- Build VSS index in-memory
INSTALL vss;
LOAD vss;
CREATE INDEX vec_idx ON docs USING HNSW (embedding);

Pros:

  • ✅ Persistent source data in DuckLake
  • ✅ VSS works in-memory
  • ✅ Best of both worlds

Cons:

  • ❌ Startup time for index rebuild
  • ❌ Requires scripting
  • ❌ Double storage (DuckLake + in-memory)

Option 3: Use pgvector with PostgreSQL Instead

-- PostgreSQL with pgvector supports both persistence and vectors
CREATE EXTENSION vector;
CREATE TABLE docs (
    id SERIAL PRIMARY KEY,
    embedding vector(384)
);
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

Pros:

  • ✅ Full persistence support
  • ✅ ACID transactions
  • ✅ Better vector search performance (12× faster than DuckDB)
  • ✅ Production-ready

Cons:

  • ❌ Not embedded (requires PostgreSQL server)
  • ❌ More complex setup
  • ❌ Less powerful analytical SQL than DuckDB

Option 4: Separate Systems for Vectors and Analytics

# Use a specialized vector DB for search
import pinecone
pinecone.init(api_key="...")
index = pinecone.Index("vectors")

# Use DuckLake for analytical data
import duckdb
conn = duckdb.connect()
conn.execute("ATTACH 'ducklake:analytics.ducklake' AS data")

# Combine results in the application layer
vector_results = index.query(query_embedding, top_k=100)
analytical_results = conn.execute(
    # Bind the ID list as a single DuckDB LIST parameter
    "SELECT * FROM data.documents WHERE list_contains(?, id)",
    [list(vector_results.ids)],
).fetchall()

Pros:

  • ✅ Best tool for each job
  • ✅ Scales independently
  • ✅ Production-ready for both

Cons:

  • ❌ More complex architecture
  • ❌ Additional operational overhead
  • ❌ Cannot use SQL for hybrid queries

Architecture Decision Guide

Choose In-Memory DuckDB + VSS when:

  • Data fits in RAM
  • Acceptable to rebuild indexes on startup
  • Need rich SQL + vector search in single query
  • Offline/batch analytics workload
  • No persistence requirements

Choose DuckLake (without VSS) when:

  • Need persistent storage
  • ACID transactions required
  • Time-travel queries needed
  • Vector search not required
  • Analytical workload without embeddings

Choose pgvector when:

  • Need both persistence AND vectors
  • Production vector search required
  • Transactional workload
  • Can accept PostgreSQL complexity
  • Want better vector search performance

Choose Separate Systems when:

  • High-scale production system
  • Need best-in-class for both vectors and analytics
  • Can manage multiple databases
  • Budget allows operational complexity

Use Cases

1. RAG (Retrieval-Augmented Generation)

-- Find relevant context for LLM prompts
WITH query_embedding AS (
    SELECT $user_query_embedding AS vec
)
SELECT d.text, d.metadata
FROM documents d, query_embedding q
ORDER BY d.embedding <=> q.vec
LIMIT 5;

2. Recommendation Systems

-- Find similar products/content
SELECT p2.id, p2.name,
    array_distance(p1.embedding, p2.embedding) AS distance
FROM products p1, products p2
WHERE p1.id = $current_product_id
    AND p2.id != p1.id
ORDER BY distance
LIMIT 10;

3. Semantic Search

-- Search documents by meaning, not just keywords
SELECT id, title, snippet,
    1 - (embedding <=> $query_embedding) AS relevance
FROM articles
WHERE relevance > 0.6
ORDER BY relevance DESC;

4. Duplicate Detection

-- Find near-duplicate content
SELECT d1.id AS doc1, d2.id AS doc2,
    array_cosine_distance(d1.embedding, d2.embedding) AS cosine_dist
FROM documents d1, documents d2
WHERE d1.id < d2.id
    AND cosine_dist < 0.1 -- very similar (small distance)
ORDER BY cosine_dist;

5. Image Search (with embeddings from vision models)

-- Find visually similar images
CREATE TABLE images (
    id INTEGER,
    filename TEXT,
    embedding FLOAT[512] -- from CLIP, ResNet, etc.
);

SELECT filename, array_distance(embedding, $query_image_embedding) AS distance
FROM images
ORDER BY distance
LIMIT 20;

Recent Updates (2024-2025)

October 2024 Release

1. Performance Improvements

  • Enhanced multi-threaded index building
  • Fine-grained work distribution
  • Better memory allocation
  • Progress indicators during index creation

2. New Distance Functions

  • array_cosine_distance() - computes 1 − cosine similarity
  • array_negative_inner_product() - computes the negative inner product
  • <=> operator for cosine distance (pgvector compatibility)

3. Top-K Aggregates

  • min_by()/max_by() now accept optional n parameter
  • Index-accelerated top-k queries
  • Avoids full table scans

4. LATERAL Join Optimization

  • New HNSW_INDEX_JOIN operator
  • 66× speedup for vector joins (10s → 0.15s in benchmarks)
  • Efficient cross-table similarity searches

June 2025 - Text Analytics

Lightweight Text Analytics Workflows:

  • Demonstrated hybrid search combining keyword, FTS, and vector embeddings
  • Integration examples with sentence-transformers
  • Best practices for text preprocessing and normalization

Performance Comparison with Other Vector Databases

Benchmark Results: DuckDB vs pgvector vs OpenSearch

DuckDB vs PostgreSQL pgvector

Single Query Performance (1M vector dataset):

  • pgvector: 1.9s average
  • DuckDB: 23s average (~12× slower)
  • ChromaDB: Better than DuckDB for single queries

Concurrent Performance (100 queries, 100 concurrent requests):

  • pgvector: 9.81s average - excellent, stable performance
  • DuckDB: 1,416s average (144× slower) - massive degradation under load
  • ChromaDB: 23.08s average

Key Finding: DuckDB’s architecture optimized for analytical workloads makes it unsuitable for low-latency, high-concurrency vector search scenarios.

Latency Analysis:

  • Most DuckDB query overhead comes from point query inefficiency (~98% overhead)
  • Only ~2% of runtime spent in actual HNSW search (usearch library)
  • DuckDB not designed for individual vector lookups

OpenSearch vs Elasticsearch (For Reference)

While no direct DuckDB comparison exists, 2025 benchmarks show:

Elastic’s Benchmarks:

  • Elasticsearch: 2-12× faster than OpenSearch
  • With Binary Quantization (BBQ): 5× faster than OpenSearch with FAISS

Independent Study (Trail of Bits - March 2025):

  • OpenSearch v2.17.1: 1.6× faster on Big5 workload
  • OpenSearch: 11% faster on Vectorsearch workload
  • Results vary by workload and configuration

Key Difference: Both OpenSearch and Elasticsearch are orders of magnitude faster than DuckDB for vector search due to purpose-built architecture.

Detailed Feature Comparison

| Feature | DuckDB VSS | pgvector | OpenSearch | Elasticsearch | Pinecone |
|---|---|---|---|---|---|
| Single Query Latency | ~23s | ~1.9s | <100ms | <100ms | <100ms |
| Concurrent Performance | Poor (144× slower) | Excellent | Excellent | Excellent | Excellent |
| Deployment | Embedded | PostgreSQL | Distributed | Distributed | Cloud |
| SQL Support | Full SQL | Full SQL | Limited | Limited | None |
| Index Type | HNSW | HNSW, IVFFlat | HNSW, FAISS | HNSW, BBQ | Proprietary |
| Max Dimensions | Unlimited | 2,000 (recommended) | 16,000 | 4,096 | 20,000+ |
| Distance Metrics | L2, Cosine, IP | L2, Cosine, IP | Multiple | Multiple | Multiple |
| FTS Integration | Native | Via PostgreSQL | Native | Native | No |
| Analytical Queries | Excellent | Good | Limited | Limited | Limited |
| Point Query Performance | Poor | Excellent | Excellent | Excellent | Excellent |
| Setup Complexity | Minimal | Medium | High | High | Low |
| Cost | Free/OSS | Free/OSS | Free/OSS | Paid + OSS | $70+/month |
| Best For | Offline analytics | Hybrid transactional/vector | Real-time search | Real-time search | Scale + managed |

Performance Recommendations by Use Case

| Use Case | Recommended Solution | Why |
|---|---|---|
| Real-time search API | OpenSearch, Elasticsearch, Pinecone | Sub-100ms latency required |
| Concurrent user queries | pgvector, OpenSearch, Elasticsearch | Need stable performance under load |
| RAG production system | pgvector, Pinecone, Weaviate | Low latency + high availability |
| Offline analytics | DuckDB | Complex SQL + vector queries |
| Data exploration | DuckDB | Rich SQL analytical capabilities |
| Batch processing | DuckDB, pgvector | Single-threaded bulk operations |
| Hybrid search (keyword + vector) | DuckDB, OpenSearch, Elasticsearch | Native FTS + vector integration |
| Recommendation systems (real-time) | OpenSearch, Pinecone | Need <100ms response |
| Recommendation systems (batch) | DuckDB | Complex analytical queries |

When to Use DuckDB for Vectors

✅ Good Use Cases:

  1. Offline analytics - Complex SQL queries combining vectors with tabular data
  2. Data exploration - Ad-hoc analysis of embeddings with rich SQL
  3. Batch processing - Process entire datasets with complex transformations
  4. Research/prototyping - Quick experimentation with embeddings
  5. Hybrid analytics - Combine FTS, vector search, and analytical queries
  6. Single-threaded workloads - No concurrent query requirements

❌ Poor Use Cases:

  1. Production APIs - High latency unacceptable for user-facing apps
  2. Real-time search - 23s query time vs <100ms requirement
  3. Concurrent queries - 144× performance degradation under load
  4. Point queries - Not optimized for individual vector lookups
  5. Low-latency RAG - Need sub-second retrieval for LLM prompts
  6. High-throughput systems - Poor concurrent performance

Best Practices

Data Modeling

1. Use Fixed-Size Arrays

-- Good: Fixed size
CREATE TABLE docs (embedding FLOAT[384]);
-- Bad: Variable size not supported
-- CREATE TABLE docs (embedding FLOAT[]);

2. Normalize Embeddings for Cosine Distance

# Python: Normalize before storing
import numpy as np
normalized = embedding / np.linalg.norm(embedding)

3. Store Metadata Alongside Embeddings

CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    text TEXT,
    embedding FLOAT[384],
    created_at TIMESTAMP,
    source TEXT,
    category TEXT
);

Indexing Strategy

1. Build After Bulk Insert

-- Efficient: Bulk insert then index
COPY documents FROM 'data.parquet';
CREATE INDEX vec_idx ON documents USING HNSW (embedding);

2. Choose Appropriate Parameters

-- For high accuracy (slower):
CREATE INDEX vec_idx ON documents USING HNSW (embedding)
WITH (ef_construction = 256, M = 32);
-- For speed (lower accuracy):
CREATE INDEX vec_idx ON documents USING HNSW (embedding)
WITH (ef_construction = 64, M = 16);

3. Regular Maintenance

-- After significant deletions/updates
PRAGMA hnsw_compact_index('vec_idx');

Query Patterns

1. Always Use LIMIT

-- Good: Uses index efficiently
SELECT * FROM docs
ORDER BY embedding <=> $query
LIMIT 10;
-- Bad: Full scan without LIMIT
SELECT * FROM docs
ORDER BY embedding <=> $query;

2. Filter Before Vector Search

-- A selective filter shrinks the candidate set, but note that the HNSW
-- index is not used when extra WHERE predicates are present; DuckDB
-- falls back to scanning the filtered rows
SELECT * FROM docs
WHERE category = 'technology'
ORDER BY embedding <=> $query
LIMIT 10;

3. Batch Queries When Possible

-- Process multiple queries in one pass, keeping the 10 nearest per query
SELECT q.id AS query_id, d.id AS doc_id,
    array_distance(d.embedding, q.embedding) AS distance
FROM queries q, documents d
QUALIFY row_number() OVER (
    PARTITION BY q.id ORDER BY array_distance(d.embedding, q.embedding)
) <= 10
ORDER BY query_id, distance;

Hybrid Search Configuration

1. Tune Weights Based on Use Case

-- News/articles: Higher semantic weight
SELECT *, (0.8 * semantic + 0.2 * bm25) AS score
FROM results;
-- Technical docs: Higher keyword weight
SELECT *, (0.4 * semantic + 0.6 * bm25) AS score
FROM results;

2. Normalize Scores Properly

-- BM25 is unbounded, so min-max normalize it
-- Cosine similarity lies in [-1, 1] (roughly [0, 1] for typical text
-- embeddings), so it needs little or no rescaling
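
A sketch of the min-max step, assuming a keyword_scores relation like the one in the weighted hybrid query above:

-- Rescale BM25 into [0, 1]; NULLIF guards against a zero score range
SELECT id,
    (bm25_score - MIN(bm25_score) OVER ())
        / NULLIF(MAX(bm25_score) OVER () - MIN(bm25_score) OVER (), 0) AS norm_bm25
FROM keyword_scores;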

Limitations and Gotchas

Current Limitations

  1. 32-bit FLOAT Only

    • No support for DOUBLE (64-bit) or INT arrays
    • Precision limitations for certain applications
  2. Experimental Persistence

    • Disk persistence not production-ready
    • Risk of data loss on unexpected shutdown
    • In-memory recommended for production
  3. Index Memory Management

    • Indexes don’t count toward memory_limit
    • Can cause OOM if not monitored separately
    • No automatic eviction policy
  4. No Dynamic Dimensionality

    • All embeddings must be same dimension
    • Cannot mix 384 and 512 dimensional vectors
  5. Limited Distance Metrics

    • Only L2, cosine, and inner product
    • No Manhattan, Hamming, or custom metrics
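
Because only 32-bit FLOAT arrays are indexable, embeddings arriving as DOUBLEs or variable-length lists need an explicit cast before insertion; a sketch (staging_embeddings is a hypothetical source table):

-- Cast LIST/DOUBLE values to a fixed-size FLOAT array on insert
INSERT INTO documents
SELECT id, text, embedding::FLOAT[384]
FROM staging_embeddings;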

Gotchas

1. Index Build Time

-- Large datasets take time
-- Millions of vectors can take minutes-hours
-- Use progress indicator to monitor

2. Deletes Don’t Compact Automatically

-- Manual compaction required
DELETE FROM docs WHERE old = true;
PRAGMA hnsw_compact_index('vec_idx');

3. Query Pattern Matters

-- Uses index: ✅
SELECT * FROM docs ORDER BY distance LIMIT 10;
-- No index: ❌
SELECT * FROM docs WHERE distance < 0.5;
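
A common workaround sketch: pull a generous top-k through the index first, then apply the distance threshold to that candidate set:

-- Index-friendly: top-k first, then filter by distance
SELECT * FROM (
    SELECT *, array_distance(embedding, $query_vec) AS dist
    FROM docs
    ORDER BY dist
    LIMIT 100
) candidates
WHERE dist < 0.5;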

4. Case Sensitivity in FTS

-- FTS is case-sensitive by default
-- Use lower=true in create_fts_index
PRAGMA create_fts_index(..., lower = true);

Future Roadmap

Based on documentation and community discussions:

Planned/Potential Features:

  • Stable disk persistence (high priority)
  • Additional distance metrics
  • Quantization support (reduced memory)
  • Filtered vector search improvements
  • Index-accelerated fuzzy joins
  • Better memory management integration

Community Requests:

  • Support for other numeric types (INT8, DOUBLE)
  • Approximate filtering (pre-filter before vector search)
  • Multi-vector queries
  • Batch insertion optimization
  • Distributed vector search

Resources

Integration Examples:

  • sentence-transformers (Python)
  • OpenAI embeddings API
  • Hugging Face models
  • CLIP for image embeddings

Related Extensions:

  • httpfs - Load embeddings from remote files
  • json - Parse embedding arrays from JSON
  • parquet - Efficient storage for embeddings
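
For example, httpfs lets you pull pre-computed embeddings straight from a remote Parquet file (a sketch; the URL is hypothetical):

INSTALL httpfs;
LOAD httpfs;
CREATE TABLE documents AS
SELECT * FROM read_parquet('https://example.com/embeddings.parquet');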

Last Updated: 2025-11-05
DuckDB Version: 1.1.3+ (VSS extension experimental)
Status: VSS and FTS extensions are experimental but actively developed