hnsw-persistence-issues

Deep dive into the persistence limitations of DuckDB’s VSS extension HNSW indexes, current status, and timeline for fixes.

Summary

HNSW index persistence in DuckDB VSS is limited by a known architectural constraint. The team is working on it, but there is no official timeline for a stable release. As of December 2025, persistence remains experimental and is not recommended for production use.

The Core Problem

Why Persistence is Experimental

DuckDB’s HNSW indexes face three fundamental challenges:

1. WAL Recovery Not Implemented
- Custom extension indexes don’t integrate with DuckDB’s crash recovery
- If a crash occurs with uncommitted changes, index corruption is likely
- Recovery requires manual intervention

2. Full Index Serialization
- No incremental updates to persisted indexes
- Every checkpoint rewrites the entire index to disk
- Causes significant I/O overhead for large indexes

3. Custom Index Architecture
- VSS uses a custom index type outside DuckDB’s standard index framework
- Requires special handling that isn’t fully integrated with core persistence

Enabling Experimental Persistence

-- Enable at your own risk
SET hnsw_enable_experimental_persistence = true;

-- Create index on disk-backed database
CREATE INDEX vec_idx ON documents USING HNSW (embedding);

Recovery After Crash

If you experience an unexpected shutdown:

-- 1. Start DuckDB separately (do not open the database file directly)
-- 2. Load VSS extension FIRST
LOAD vss;

-- 3. Then attach the database (allows WAL playback with index support)
ATTACH 'mydata.duckdb' AS db;
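The same sequence can be scripted so that the ordering is never accidentally reversed. The following is a minimal sketch, not an official recovery tool: `recovery_statements` and `sql_string_literal` are hypothetical helper names, and the commented usage assumes the `duckdb` Python package is installed.

```python
def sql_string_literal(value):
    """Quote a Python string as a SQL string literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def recovery_statements(db_path, alias="db"):
    """Return the recovery statements in the order they must be executed."""
    if not alias.isidentifier():
        raise ValueError(f"alias must be a plain identifier, got {alias!r}")
    return [
        # Load the extension first so WAL playback can handle the HNSW index.
        "LOAD vss",
        # Only then attach the file that was open during the crash.
        f"ATTACH {sql_string_literal(db_path)} AS {alias}",
    ]


# Hypothetical usage (assumes the duckdb package; not imported here):
#   import duckdb
#   con = duckdb.connect(":memory:")  # start without opening the crashed file
#   for statement in recovery_statements("mydata.duckdb"):
#       con.execute(statement)
```

Returning the statements as a list keeps the ordering explicit and easy to check before anything touches the database file.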

Current Status (December 2025)

Open GitHub Issues

Issue	Description	Status	Maintainer Response
#70	Does buffer managed index in 1.4 improve persistence?	Open	No response
#32	HNSW index causes 15-100× data inflation	Open	No response
#54	Creating VSS fails without warning on large databases	Open	No response
#19	“Could not find node” error at ~300k rows	Open	Unknown

Work is happening in the main DuckDB repository:

- PR #18313: “Buffer index appends during WAL replay”
- PR #18700: “Correctly handle table and index chunks in WAL replay buffering”

However, it is unclear whether these PRs fully enable VSS HNSW persistence. Issue #70 asks this exact question and has received no response.

Historical Timeline

Date	Event
May 2024	DuckDB team stated it was actively working on persistence, targeting v0.10.3
Oct 2024	VSS extension update with performance improvements, persistence still experimental
Sep 2025	Issue #70 opened asking about buffer managed index improvements
Dec 2025	Still experimental, no official timeline

Known Bugs and Limitations

1. Data Inflation (Issue #32)

Creating HNSW indexes causes severe storage bloat:

Scenario	Original Size	With Index	Inflation
Index before data	78 MB	7,055 MB	~100×
Index after data	78 MB	1,328 MB	~15×

Workaround: Always create indexes AFTER populating data.
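To compare your own database against the figures above, the inflation factor is simply the size with the index divided by the original size. The sketch below applies that to the two rows of the table; `inflation_factor` is a hypothetical helper, and the inputs are the table's values in MB. The computed ratios (about 90× and 17×) show that the table's factors are rounded.

```python
def inflation_factor(original_mb, with_index_mb):
    """Ratio of database size with the index to the size without it."""
    if original_mb <= 0:
        raise ValueError("original size must be positive")
    return with_index_mb / original_mb


# Values taken from the table above (MB).
index_before_data = inflation_factor(78, 7055)
index_after_data = inflation_factor(78, 1328)

print(round(index_before_data, 1))  # 90.4
print(round(index_after_data, 1))   # 17.0
```

Measuring the database file before and after `CREATE INDEX` and feeding both sizes to this function gives a quick check of whether the workaround is helping in your setup.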

2. Silent Failures on Large Databases (Issue #54)

VSS can fail silently without warning when:

- The database exceeds certain size thresholds
- Complex queries combine with HNSW operations
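Because these failures are silent, it can help to confirm that a query actually uses the index rather than falling back to a full scan. One possible check, sketched below, inspects the `EXPLAIN` output for the `HNSW_INDEX_SCAN` operator. `plan_uses_hnsw` is a hypothetical helper; it assumes the plan arrives as rows of text, which is how the `duckdb` Python package returns `EXPLAIN` results.

```python
def plan_uses_hnsw(explain_rows):
    """Return True if any cell of the EXPLAIN output mentions HNSW_INDEX_SCAN.

    `explain_rows` is expected to be an iterable of row tuples, such as the
    result of fetchall() on an EXPLAIN query.
    """
    plan_text = "\n".join(str(cell) for row in explain_rows for cell in row)
    return "HNSW_INDEX_SCAN" in plan_text


# Hypothetical usage (assumes the duckdb package and a `documents` table
# with a FLOAT[3] `embedding` column; not executed here):
#   rows = con.execute(
#       "EXPLAIN SELECT * FROM documents "
#       "ORDER BY array_distance(embedding, [1, 2, 3]::FLOAT[3]) LIMIT 10"
#   ).fetchall()
#   if not plan_uses_hnsw(rows):
#       print("warning: query is not using the HNSW index")
```

This does not explain why an index was skipped or failed to build, but it turns a silent fallback into something a startup check or test can detect.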

3. Row Count Limits (Issue #19)

Reported errors at ~300,000 rows:

Error: Could not find node in column segment tree!

The same workload runs without errors at 100,000 rows, which suggests a scaling issue.

Assessment: Bug vs Design Limitation

What We Know

Aspect	Assessment
Is it a bug?	Partially - some issues are bugs (#32, #19), but core limitation is architectural
Will it be fixed?	Likely yes - PRs exist in core DuckDB, but VSS integration unclear
Timeline?	Unknown - no official roadmap, May 2024 target (v0.10.3) was missed
Priority level?	Unclear - open issues receiving no maintainer responses

Evidence Suggests

- Active Development: core DuckDB PRs for WAL buffering exist
- Resource Constraints: VSS-specific issues are not receiving maintainer attention
- Architectural Complexity: custom index integration is non-trivial
- No Committed Timeline: despite promises in May 2024, persistence is still experimental

Workarounds

Option 1: In-Memory Only (Recommended for Production)

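-- Use an in-memory database (.open is a DuckDB CLI dot-command;
-- starting the CLI without a file argument has the same effect)
.open :memory: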

-- Load data on startup
CREATE TABLE documents AS SELECT * FROM read_parquet('data.parquet');

-- Create index in-memory
CREATE INDEX vec_idx ON documents USING HNSW (embedding);

Pros: Stable, fast
Cons: Data lost on shutdown, limited by RAM

Option 2: Hybrid Architecture

-- Store data in persistent DuckDB
ATTACH 'persistent.duckdb' AS source;

-- Copy to in-memory for vector operations
CREATE TABLE docs AS SELECT * FROM source.documents;

-- Build index in-memory
CREATE INDEX vec_idx ON docs USING HNSW (embedding);

Pros: Data persists, index rebuilt on startup
Cons: Startup time, double memory usage
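For an application that rebuilds the index on every start, the hybrid steps can be generated in one place. This is a sketch under stated assumptions: `hybrid_startup_statements` is a hypothetical helper, the table and column names mirror the SQL above, the `READ_ONLY` attach option is added here as a precaution and is not part of the original steps, and the commented usage assumes the `duckdb` Python package with an in-memory connection.

```python
def hybrid_startup_statements(persistent_path, alias="source"):
    """Return the startup statements for the hybrid layout, in order."""
    if not alias.isidentifier():
        raise ValueError(f"alias must be a plain identifier, got {alias!r}")
    quoted_path = "'" + persistent_path.replace("'", "''") + "'"
    return [
        "LOAD vss",
        # Attach the persistent file read-only; it holds data but no HNSW index.
        f"ATTACH {quoted_path} AS {alias} (READ_ONLY)",
        # Copy the rows into the in-memory default database.
        f"CREATE TABLE docs AS SELECT * FROM {alias}.documents",
        # Build the index only on the in-memory copy.
        "CREATE INDEX vec_idx ON docs USING HNSW (embedding)",
    ]


# Hypothetical usage (assumes the duckdb package; not imported here):
#   import duckdb
#   con = duckdb.connect(":memory:")
#   for statement in hybrid_startup_statements("persistent.duckdb"):
#       con.execute(statement)
```

Timing this loop at startup gives a direct measure of the rebuild cost listed under Cons.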

Option 3: Use pgvector Instead

-- PostgreSQL with pgvector has stable persistence
CREATE EXTENSION vector;
CREATE TABLE docs (id SERIAL, embedding vector(384));
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

Pros: Production-ready, 12× faster queries
Cons: Requires PostgreSQL server, less analytical SQL

Option 4: Separate Systems

Use specialized vector database for search, DuckDB for analytics:

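# Vector DB for search (assumes `index` is a Pinecone index handle)
pinecone_results = index.query(vector=embedding, top_k=100)
ids = [match.id for match in pinecone_results.matches]

# DuckDB for analytics on results: one "?" placeholder per id
# (assumes `ids` is non-empty; an empty list would produce invalid SQL)
placeholders = ", ".join("?" for _ in ids)
conn.execute(f"""
    SELECT * FROM analytics.documents
    WHERE id IN ({placeholders})
""", ids)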

Recommendations

For Production Use

- Do NOT rely on experimental persistence; the data loss risk is real
- Use in-memory databases with data reload on startup
- Consider pgvector if you need persistence + vectors
- Monitor GitHub issues for updates on stability

For Development/Research

- Experimental persistence is acceptable for non-critical data
- Keep backups of source data (Parquet, CSV)
- Test recovery procedures before relying on them

For Future Planning

- Watch duckdb-vss issues for persistence updates
- Monitor DuckDB release notes for custom index improvements
- Plan architecture assuming persistence may never be production-ready

Sources

DuckDB VSS Extension Documentation
GitHub: duckdb/duckdb-vss Issues
Issue #70: Buffer managed index persistence
Issue #32: Data inflation bug
Vector Similarity Search Blog (May 2024)

Last Updated: 2025-12-26
Status: Experimental - No production use recommended