hnsw-persistence-issues
Deep dive into the persistence limitations of DuckDB’s VSS extension HNSW indexes, current status, and timeline for fixes.
Summary
The HNSW index persistence issue in DuckDB VSS is a known architectural limitation. The team is working on it, but there is no official timeline for a stable release. As of December 2025, persistence remains experimental and is not recommended for production use.
The Core Problem
Why Persistence is Experimental
DuckDB’s HNSW indexes face three fundamental challenges:
- WAL Recovery Not Implemented
  - Custom extension indexes don’t integrate with DuckDB’s crash recovery
  - If a crash occurs with uncommitted changes, index corruption is likely
  - Recovery requires manual intervention
- Full Index Serialization
  - No incremental updates to persisted indexes
  - Every checkpoint rewrites the entire index to disk
  - Causes significant I/O overhead for large indexes
- Custom Index Architecture
  - VSS uses a custom index type outside DuckDB’s standard index framework
  - Requires special handling that isn’t fully integrated with core persistence
Enabling Experimental Persistence
    -- Enable at your own risk
    SET hnsw_enable_experimental_persistence = true;
    -- Create index on disk-backed database
    CREATE INDEX vec_idx ON documents USING HNSW (embedding);

Recovery After Crash
If you experience an unexpected shutdown:
    -- 1. Start DuckDB separately
    -- 2. Load VSS extension FIRST
    LOAD vss;
    -- 3. Then attach the database (allows WAL playback with index support)
    ATTACH 'mydata.duckdb' AS db;

Current Status (December 2025)
Open GitHub Issues
| Issue | Description | Status | Maintainer Response |
|---|---|---|---|
| #70 | Does buffer managed index in 1.4 improve persistence? | Open | No response |
| #32 | HNSW index causes 15-100× data inflation | Open | No response |
| #54 | Creating VSS fails without warning on large databases | Open | No response |
| #19 | “Could not find node” error at ~300k rows | Open | Unknown |
Related Core DuckDB PRs
Work is happening in the main DuckDB repository:
- PR #18313: “Buffer index appends during WAL replay”
- PR #18700: “Correctly handle table and index chunks in WAL replay buffering”
However, it’s unclear if these PRs fully enable VSS HNSW persistence - Issue #70 asks this exact question with no response.
Historical Timeline
| Date | Event |
|---|---|
| May 2024 | DuckDB team stated it was actively working on persistence, targeting v0.10.3 |
| Oct 2024 | VSS extension update with performance improvements, persistence still experimental |
| Sep 2025 | Issue #70 opened asking about buffer managed index improvements |
| Dec 2025 | Still experimental, no official timeline |
Known Bugs and Limitations
1. Data Inflation (Issue #32)
Creating HNSW indexes causes severe storage bloat:
| Scenario | Original Size | With Index | Inflation |
|---|---|---|---|
| Index before data | 78 MB | 7,055 MB | ~100× |
| Index after data | 78 MB | 1,328 MB | ~15× |
Workaround: Always create indexes AFTER populating data.
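The inflation column in the table is rounded. As a quick check, the factors can be recomputed directly from the listed sizes. The sketch below uses only the numbers from the table; the variable and function names are illustrative.

```python
# Sizes in MB, copied from the table above (Issue #32).
ORIGINAL_MB = 78
SCENARIOS = {
    "index before data": 7055,
    "index after data": 1328,
}


def inflation_factor(original_mb: float, with_index_mb: float) -> float:
    """Return how many times larger the database is once the index exists."""
    return with_index_mb / original_mb


factors = {
    name: inflation_factor(ORIGINAL_MB, size_mb)
    for name, size_mb in SCENARIOS.items()
}

for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x")
```

From these sizes the factors come out at roughly 90× and 17×, so building the index after loading the data leaves a file about a fifth of the size of the index-first case.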
2. Silent Failures on Large Databases (Issue #54)
VSS can fail silently without warning when:
- Database exceeds certain size thresholds
- Complex queries combine with HNSW operations
3. Row Count Limits (Issue #19)
Reported errors at ~300,000 rows:
    Error: Could not find node in column segment tree!

Works fine with 100,000 rows, suggesting scaling issues.
Assessment: Bug vs Design Limitation
What We Know
| Aspect | Assessment |
|---|---|
| Is it a bug? | Partially - some issues are bugs (#32, #19), but core limitation is architectural |
| Will it be fixed? | Likely yes - PRs exist in core DuckDB, but VSS integration unclear |
| Timeline? | Unknown - no official roadmap, May 2024 target (v0.10.3) was missed |
| Priority level? | Unclear - open issues receiving no maintainer responses |
Evidence Suggests
- Active Development - Core DuckDB PRs for WAL buffering exist
- Resource Constraints - VSS-specific issues not receiving maintainer attention
- Architectural Complexity - Custom index integration is non-trivial
- No Committed Timeline - Despite promises in May 2024, still experimental
Workarounds
Option 1: In-Memory Only (Recommended for Production)
    -- Use in-memory database
    .open :memory:
    -- Load data on startup
    CREATE TABLE documents AS SELECT * FROM read_parquet('data.parquet');
    -- Create index in-memory
    CREATE INDEX vec_idx ON documents USING HNSW (embedding);

Pros: Stable, fast
Cons: Data lost on shutdown, limited by RAM
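To judge whether the "limited by RAM" constraint matters for a given dataset, a rough lower bound on memory is the size of the raw vectors alone. The sketch below assumes 4-byte FLOAT elements. The row count is a hypothetical example, and the 384 dimensions are borrowed from the pgvector example in this document. It ignores the HNSW graph and any DuckDB overhead, so real usage will be higher.

```python
BYTES_PER_FLOAT32 = 4  # a single-precision FLOAT occupies 4 bytes


def raw_vector_bytes(rows: int, dims: int) -> int:
    """Bytes needed for the embedding values alone (no index, no overhead)."""
    return rows * dims * BYTES_PER_FLOAT32


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


# Hypothetical dataset: 1,000,000 rows of 384-dimensional embeddings.
estimate = raw_vector_bytes(rows=1_000_000, dims=384)
print(f"raw vectors: {estimate:,} bytes (~{to_gib(estimate):.2f} GiB)")
```

Treat the result as a floor when sizing a machine, not as a prediction of actual memory use.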
Option 2: Hybrid Architecture
    -- Store data in persistent DuckDB
    ATTACH 'persistent.duckdb' AS source;
    -- Copy to in-memory for vector operations
    CREATE TABLE docs AS SELECT * FROM source.documents;
    -- Build index in-memory
    CREATE INDEX vec_idx ON docs USING HNSW (embedding);

Pros: Data persists, index rebuilt on startup
Cons: Startup time, double memory usage
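In an application, the hybrid steps would typically run once at startup. The following is a minimal sketch, assuming a connection object that exposes `execute(sql)` (as DuckDB's Python connections do). The function names, the `source` alias default, and the choice to attach the persistent file with `READ_ONLY` are illustrative assumptions, not part of the original procedure.

```python
def hybrid_startup_statements(source_path: str, alias: str = "source") -> list[str]:
    """Build the SQL statements for the hybrid setup, in execution order."""
    quoted = source_path.replace("'", "''")  # escape quotes for the SQL literal
    return [
        "INSTALL vss",
        "LOAD vss",
        # READ_ONLY is an assumption: the persistent file only serves as a source.
        f"ATTACH '{quoted}' AS {alias} (READ_ONLY)",
        # The copy lands in the connection's default (in-memory) database.
        f"CREATE TABLE docs AS SELECT * FROM {alias}.documents",
        "CREATE INDEX vec_idx ON docs USING HNSW (embedding)",
    ]


def run_hybrid_startup(conn, source_path: str) -> None:
    """Execute the startup statements on any object with an execute(sql) method."""
    for statement in hybrid_startup_statements(source_path):
        conn.execute(statement)


# Usage with the duckdb package (not imported here so the sketch stays stdlib-only):
#   import duckdb
#   conn = duckdb.connect(":memory:")
#   run_hybrid_startup(conn, "persistent.duckdb")
```

Keeping the statement list separate from execution makes the ordering easy to review and to test without a database.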
Option 3: Use pgvector Instead
    -- PostgreSQL with pgvector has stable persistence
    CREATE EXTENSION vector;
    CREATE TABLE docs (id SERIAL, embedding vector(384));
    CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

Pros: Production-ready, 12× faster queries
Cons: Requires PostgreSQL server, less analytical SQL
Option 4: Separate Systems
Use specialized vector database for search, DuckDB for analytics:
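    # Vector DB for search (the search call itself is not shown here)
    # DuckDB for analytics on results
    ids = list(pinecone_results.ids)  # assumes at least one id was returned
    # Bind one placeholder per id; a single "?" would bind the whole list as one value
    placeholders = ", ".join("?" for _ in ids)
    conn.execute(f"""
        SELECT * FROM analytics.documents
        WHERE id IN ({placeholders})
    """, ids)

Recommendations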
For Production Use
- Do NOT rely on experimental persistence - data loss risk is real
- Use in-memory databases with data reload on startup
- Consider pgvector if you need persistence + vectors
- Monitor GitHub issues for updates on stability
For Development/Research
- Experimental persistence is acceptable for non-critical data
- Keep backups of source data (Parquet, CSV)
- Test recovery procedures before relying on them
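One way to rehearse recovery is to encode the order from the Recovery After Crash steps (load VSS first, then attach) in a small helper, so it cannot be run out of sequence. This is a sketch under the same assumption as the procedure itself. The `recover` name and its parameters are hypothetical, and the connection is any object with an `execute(sql)` method.

```python
def recover(conn, db_path: str, alias: str = "db") -> list[str]:
    """Load VSS, then attach the database, and return the statements run."""
    quoted = db_path.replace("'", "''")  # escape quotes for the SQL literal
    statements = [
        "LOAD vss",  # must come first so WAL playback has index support
        f"ATTACH '{quoted}' AS {alias}",
    ]
    for statement in statements:
        conn.execute(statement)
    return statements


# Usage with the duckdb package (not imported here so the sketch stays stdlib-only):
#   import duckdb
#   conn = duckdb.connect(":memory:")  # start DuckDB separately from the data file
#   recover(conn, "mydata.duckdb")
```

Running this against a copy of the database file, not the original, keeps the drill from touching production data.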
For Future Planning
- Watch duckdb-vss issues for persistence updates
- Monitor DuckDB release notes for custom index improvements
- Plan architecture assuming persistence may never be production-ready
Sources
- DuckDB VSS Extension Documentation
- GitHub: duckdb/duckdb-vss Issues
- Issue #70: Buffer managed index persistence
- Issue #32: Data inflation bug
- Vector Similarity Search Blog (May 2024)
Last Updated: 2025-12-26
Status: Experimental - No production use recommended