Deep dive into the persistence limitations of DuckDB’s VSS extension HNSW indexes, current status, and timeline for fixes.

Summary

The HNSW index persistence issue in DuckDB VSS is a known architectural limitation. The team is working on it, but there is no official timeline for a stable release. As of December 2025, persistence remains experimental and is not recommended for production use.

The Core Problem

Why Persistence is Experimental

DuckDB’s HNSW indexes face three fundamental challenges:

  1. WAL Recovery Not Implemented

    • Custom extension indexes don’t integrate with DuckDB’s crash recovery
    • If a crash occurs with uncommitted changes, index corruption is likely
    • Recovery requires manual intervention
  2. Full Index Serialization

    • No incremental updates to persisted indexes
    • Every checkpoint rewrites the entire index to disk
    • Causes significant I/O overhead for large indexes
  3. Custom Index Architecture

    • VSS uses a custom index type outside DuckDB’s standard index framework
    • Requires special handling that isn’t fully integrated with core persistence
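
To make the cost of full index serialization concrete, the arithmetic can be sketched in a few lines of Python. The index size, checkpoint count, and per-checkpoint change volume are illustrative assumptions, not measurements, and the incremental scheme is hypothetical (it only serves as a point of comparison).

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def full_rewrite_bytes(index_bytes: int, checkpoints: int) -> int:
    """Bytes written when every checkpoint rewrites the whole index."""
    return index_bytes * checkpoints


def incremental_bytes(changed_bytes_per_checkpoint: int, checkpoints: int) -> int:
    """Bytes written by a hypothetical scheme that persists only changes."""
    return changed_bytes_per_checkpoint * checkpoints


# Assumed workload: 1 GiB index, 100 checkpoints, 10 MiB of changes each.
full = full_rewrite_bytes(1 * GIB, 100)
incremental = incremental_bytes(10 * MIB, 100)
print(f"full rewrite: {full / GIB:.2f} GiB")         # full rewrite: 100.00 GiB
print(f"incremental:  {incremental / GIB:.2f} GiB")  # incremental:  0.98 GiB
```

Under these assumed numbers, rewriting the whole index writes roughly a hundred times more data than the changes themselves, which is why checkpoint frequency matters for large indexes.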

Enabling Experimental Persistence

-- Load the extension (run INSTALL vss; once beforehand)
LOAD vss;
-- Enable at your own risk
SET hnsw_enable_experimental_persistence = true;
-- Create index on disk-backed database
CREATE INDEX vec_idx ON documents USING HNSW (embedding);

Recovery After Crash

If you experience an unexpected shutdown:

-- 1. Start DuckDB separately
-- 2. Load VSS extension FIRST
LOAD vss;
-- 3. Then attach the database (allows WAL playback with index support)
ATTACH 'mydata.duckdb' AS db;
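
The ordering of these steps matters more than the statements themselves, so a small Python helper can make it harder to get wrong. This is a minimal sketch, assuming `conn` is a DuckDB connection that was opened without pointing at the affected file (for example via `duckdb.connect(":memory:")`); the function name `recover_after_crash` and the default alias are hypothetical.

```python
def recover_after_crash(conn, db_path, alias="db"):
    """Load VSS first, then attach the database file.

    `conn` must expose an `execute(sql)` method, as a DuckDB connection
    does, and must not already have `db_path` open.
    """
    # Escape single quotes so the path is a valid SQL string literal.
    safe_path = db_path.replace("'", "''")
    statements = [
        "LOAD vss",                            # extension must be loaded first
        f"ATTACH '{safe_path}' AS {alias}",    # WAL playback happens on attach
    ]
    for sql in statements:
        conn.execute(sql)
    return statements
```

With the `duckdb` package installed, a call might look like `recover_after_crash(duckdb.connect(":memory:"), "mydata.duckdb")`.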

Current Status (December 2025)

Open GitHub Issues

| Issue | Description | Status | Maintainer Response |
|  | Does buffer managed index in 1.4 improve persistence? | Open | No response |
|  | HNSW index causes 15-100× data inflation | Open | No response |
|  | Creating VSS fails without warning on large databases | Open | No response |
|  | “Could not find node” error at ~300k rows | Open | Unknown |

Work is happening in the main DuckDB repository:

  • PR : “Buffer index appends during WAL replay”
  • PR : “Correctly handle table and index chunks in WAL replay buffering”

However, it’s unclear if these PRs fully enable VSS HNSW persistence - Issue asks this exact question with no response.

Historical Timeline

| Date | Event |
| May 2024 | DuckDB team stated it was actively working on persistence, targeting v0.10.3 |
| Oct 2024 | VSS extension update with performance improvements; persistence still experimental |
| Sep 2025 | Issue opened asking about buffer managed index improvements |
| Dec 2025 | Still experimental, no official timeline |

Known Bugs and Limitations

1. Data Inflation (Issue )

Creating HNSW indexes causes severe storage bloat:

| Scenario | Original Size | With Index | Inflation |
| Index before data | 78 MB | 7,055 MB | ~100× |
| Index after data | 78 MB | 1,328 MB | ~15× |

Workaround: Always create indexes AFTER populating data.
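
As a quick check of the arithmetic, the inflation factors can be recomputed from the sizes reported in the table. The sketch below uses only those three numbers; the exact ratios come out somewhat different from the rounded figures quoted in the table.

```python
ORIGINAL_MB = 78

# Database size in MB after index creation, as reported in the table.
SIZES_WITH_INDEX_MB = {
    "index before data": 7055,
    "index after data": 1328,
}


def inflation_factor(original_mb: float, with_index_mb: float) -> float:
    """Ratio of the size with the index to the original size."""
    return with_index_mb / original_mb


for scenario, size_mb in SIZES_WITH_INDEX_MB.items():
    print(f"{scenario}: {inflation_factor(ORIGINAL_MB, size_mb):.1f}x")
# index before data: 90.4x
# index after data: 17.0x
```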

2. Silent Failures on Large Databases (Issue )

VSS can fail without any warning when:

  • Database exceeds certain size thresholds
  • Complex queries combine with HNSW operations

3. Row Count Limits (Issue )

Reported errors at ~300,000 rows:

Error: Could not find node in column segment tree!

The same operation works with 100,000 rows, which suggests a scaling issue.

Assessment: Bug vs Design Limitation

What We Know

| Aspect | Assessment |
| Is it a bug? | Partially - some issues are bugs (, ), but core limitation is architectural |
| Will it be fixed? | Likely yes - PRs exist in core DuckDB, but VSS integration unclear |
| Timeline? | Unknown - no official roadmap, May 2024 target (v0.10.3) was missed |
| Priority level? | Unclear - open issues receiving no maintainer responses |

Evidence Suggests

  1. Active Development - Core DuckDB PRs for WAL buffering exist
  2. Resource Constraints - VSS-specific issues not receiving maintainer attention
  3. Architectural Complexity - Custom index integration is non-trivial
  4. No Committed Timeline - Despite the plans stated in May 2024, still experimental

Workarounds

Option 1: In-Memory Database

-- Use in-memory database
.open :memory:
-- Load data on startup
CREATE TABLE documents AS SELECT * FROM read_parquet('data.parquet');
-- Create index in-memory
CREATE INDEX vec_idx ON documents USING HNSW (embedding);

Pros: Stable, fast
Cons: Data lost on shutdown, limited by RAM

Option 2: Hybrid Architecture

-- Store data in persistent DuckDB
ATTACH 'persistent.duckdb' AS source;
-- Copy to in-memory for vector operations
CREATE TABLE docs AS SELECT * FROM source.documents;
-- Build index in-memory
CREATE INDEX vec_idx ON docs USING HNSW (embedding);

Pros: Data persists, index rebuilt on startup
Cons: Startup time, double memory usage
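
Because startup time is the main cost of this option, it can help to measure the rebuild. The following is a minimal sketch, assuming `conn` is an in-memory DuckDB connection and that the persistent file holds a `documents` table with an `embedding` column as in the SQL above; the function name and its parameters are hypothetical, and the `LOAD vss` step is added on the assumption that the extension is not yet loaded.

```python
import time


def rebuild_index_on_startup(conn, persistent_path,
                             table="documents", column="embedding"):
    """Copy data into memory, build the HNSW index, and return elapsed seconds.

    `conn` must expose an `execute(sql)` method, as a DuckDB connection does.
    """
    # Escape single quotes so the path is a valid SQL string literal.
    safe_path = persistent_path.replace("'", "''")
    statements = [
        "LOAD vss",
        f"ATTACH '{safe_path}' AS source",
        f"CREATE TABLE docs AS SELECT * FROM source.{table}",
        f"CREATE INDEX vec_idx ON docs USING HNSW ({column})",
    ]
    start = time.perf_counter()
    for sql in statements:
        conn.execute(sql)
    return time.perf_counter() - start
```

With the `duckdb` package installed, `rebuild_index_on_startup(duckdb.connect(":memory:"), "persistent.duckdb")` would return the rebuild time in seconds, which can be logged and tracked as the dataset grows.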

Option 3: Use pgvector Instead

-- PostgreSQL with pgvector has stable persistence
CREATE EXTENSION vector;
CREATE TABLE docs (id SERIAL, embedding vector(384));
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);

Pros: Production-ready, 12× faster queries
Cons: Requires PostgreSQL server, less analytical SQL

Option 4: Separate Systems

Use specialized vector database for search, DuckDB for analytics:

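# Vector DB for search; assumes `index` is a Pinecone index and `embedding` is the query vector
pinecone_results = index.query(vector=embedding, top_k=100)
ids = [match.id for match in pinecone_results.matches]
# DuckDB for analytics on results; the id list is bound as a single LIST parameter
conn.execute("""
SELECT * FROM analytics.documents
WHERE id IN (SELECT unnest(?))
""", [ids])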

Recommendations

For Production Use

  1. Do NOT rely on experimental persistence - data loss risk is real
  2. Use in-memory databases with data reload on startup
  3. Consider pgvector if you need persistence + vectors
  4. Monitor GitHub issues for updates on stability

For Development/Research

  1. Experimental persistence is acceptable for non-critical data
  2. Keep backups of source data (Parquet, CSV)
  3. Test recovery procedures before relying on them

For Future Planning

  1. Watch duckdb-vss issues for persistence updates
  2. Monitor DuckDB release notes for custom index improvements
  3. Plan architecture assuming persistence may never be production-ready

Sources

  1. DuckDB VSS Extension Documentation
  2. GitHub: duckdb/duckdb-vss Issues
  3. Issue : Buffer managed index persistence
  4. Issue : Data inflation bug
  5. Vector Similarity Search Blog (May 2024)

Last Updated: 2025-12-26
Status: Experimental - No production use recommended