duckdb-hybrid-search
Research on combining DuckDB’s VSS (Vector Similarity Search) and DuckPGQ (Property Graph Queries) for hybrid search in a single embedded database.
The Question
Can DuckDB serve as an all-in-one embedded solution for GraphRAG by combining:
- VSS extension: Vector similarity search with HNSW indexes
- DuckPGQ extension: Property graph queries with SQL/PGQ syntax
Using CTEs (Common Table Expressions) to compose both in a single query?
Research Findings
Can Both Extensions Work Together?
Theoretically: Yes - Both extensions load successfully.
Practically: CRASHES when combined in CTE (validated 2025-12-06)
Both extensions:
- ✅ Load into the same DuckDB instance
- ✅ Work independently with standard SQL
- ❌ CRASH when combined in CTE (segmentation fault)
Validation Results (packages/duckpgq-vss)
| Test | Result | Notes |
|---|---|---|
| VSS extension loads | ✅ Pass | DuckDB 1.3.1 |
| DuckPGQ extension loads | ✅ Pass | Custom S3 repo required |
| Both coexist | ✅ Pass | Same instance |
| Vector search alone | ✅ Pass | HNSW index works |
| Graph search alone | ✅ Pass | SQL/PGQ syntax works |
| Hybrid CTE | ❌ CRASH | Segfault in both Node.js and Bun |
| Separate queries | ✅ Pass | ~100ms total latency |
Critical finding: The crash is caused by a DuckPGQ bug (issue #276) where GRAPH_TABLE CTEs that reference each other via subqueries crash. This is NOT a VSS+GRAPH_TABLE conflict.
| Pattern | Works? |
|---|---|
| VSS CTE → single GRAPH CTE | ✅ Yes |
| Multiple independent GRAPH CTEs | ✅ Yes |
| GRAPH CTE referencing another GRAPH CTE | ❌ Crashes |
VSS Extension Limitations
From DuckDB VSS docs:
| Aspect | Constraint |
|---|---|
| Index optimization | Only works with simple ORDER BY distance LIMIT k patterns |
| Complex queries | HNSW index may not be used in CTEs/subqueries |
| Join macros | vss_join and vss_match do NOT use HNSW index |
| Memory | Index must fit in RAM |
Key insight: VSS is optimized for straightforward vector search, not complex query compositions.
DuckPGQ Capabilities
From DuckPGQ docs:
| Aspect | Support |
|---|---|
| GRAPH_TABLE in queries | Yes |
| WHERE filters after GRAPH_TABLE | Yes |
| Subqueries with results | Yes (external to GRAPH_TABLE) |
| CTEs | Standard SQL CTEs work |
Key insight: GRAPH_TABLE results can be filtered and joined with other SQL operations, but filtering inside MATCH clauses is limited.
Proposed Hybrid Query Pattern
-- THEORETICAL: May work but HNSW index optimization uncertainWITH-- Step 1: Vector search (VSS)vector_matches AS ( SELECT id, content, embedding <=> $query_embedding AS distance FROM documents ORDER BY distance LIMIT 5),
-- Step 2: Get entities from matched docs (DuckPGQ)doc_entities AS ( SELECT * FROM GRAPH_TABLE (kg MATCH (d:Document)-[:MENTIONS]->(e:Entity) WHERE d.id IN (SELECT id FROM vector_matches) -- May need external filter COLUMNS (d.id AS doc_id, e.name AS entity_name, e.type AS entity_type) )),
-- Step 3: Expand via graph - find related docsexpanded_docs AS ( SELECT * FROM GRAPH_TABLE (kg MATCH (d:Document)-[:MENTIONS]->(e:Entity) WHERE e.name IN (SELECT entity_name FROM doc_entities) AND d.id NOT IN (SELECT id FROM vector_matches) COLUMNS (d.id, d.content, e.name AS via_entity) ))
-- Final: Combine original + expandedSELECT id, content, distance, NULL as via_entity, 'vector' as sourceFROM vector_matchesUNION ALLSELECT id, content, NULL as distance, via_entity, 'graph' as sourceFROM expanded_docs;Potential Issues
- WHERE IN subquery inside GRAPH_TABLE: May need to move to external filter
- HNSW index usage: Uncertain if index is used when VSS is in a CTE
- Query planner optimization: Cross-extension optimization untested
Alternative: External Filter Pattern
-- More likely to work: filter GRAPH_TABLE results externallyWITHvector_matches AS ( SELECT id, content, embedding <=> $query_embedding AS distance FROM documents ORDER BY distance LIMIT 5),
graph_results AS ( SELECT * FROM GRAPH_TABLE (kg MATCH (d:Document)-[:MENTIONS]->(e:Entity) COLUMNS (d.id AS doc_id, e.name AS entity_name) ))
-- Filter graph results using vector matchesSELECT g.*FROM graph_results gWHERE g.doc_id IN (SELECT id FROM vector_matches);Current Industry Approach: Separate Systems
Cognee Architecture (DuckDB + Kuzu)
Cognee uses:
┌─────────────────────────────────────────────────┐│ Cognee │├─────────────────────────────────────────────────┤│ ┌─────────────┐ ┌─────────────┐ ││ │ DuckDB │ │ Kuzu │ ││ │ (Vectors) │ │ (Graph) │ ││ │ │ │ │ ││ │ - Embeddings│ │ - Entities │ ││ │ - Metadata │ │ - Relations │ ││ │ - Analytics │ │ - Traversal │ ││ └─────────────┘ └─────────────┘ ││ ↓ ↓ ││ Vector Search Graph Queries ││ └───────────┬───────────┘ ││ ↓ ││ Application Layer ││ (combines results) │└─────────────────────────────────────────────────┘Why separate systems?
- Each optimized for its workload
- No cross-extension complexity
- Proven in production
Performance Expectations
Single Query (If It Works)
| Metric | Estimate | Notes |
|---|---|---|
| Latency | ~50-150ms | One round-trip |
| VSS index usage | Uncertain | May fall back to scan |
| Complexity | High | Query planner untested |
Separate Queries (Recommended)
| Metric | Estimate | Notes |
|---|---|---|
| Total latency | ~50-150ms | 3 queries × ~20-50ms each |
| VSS index usage | Guaranteed | Simple pattern |
| Complexity | Low | Well-understood |
For small knowledge bases (~1000 docs): Difference is negligible.
Practical Recommendation
For Your Use Case (Lattice-style KB)
// Recommended: Separate queries, simple implementationasync function hybridSearch(query: string): Promise<SearchResult[]> { const embedding = await embed(query);
// Query 1: Vector search (uses HNSW index) const vectorResults = await db.query(` SELECT id, content, embedding <=> ? AS distance FROM documents ORDER BY distance LIMIT 5 `, [embedding]);
// Query 2: Get entities from matches const docIds = vectorResults.map(r => r.id); const entities = await db.query(` SELECT * FROM GRAPH_TABLE (kg MATCH (d:Document)-[:MENTIONS]->(e:Entity) COLUMNS (d.id, e.name, e.type) ) WHERE id IN (?) `, [docIds]);
// Query 3: Expand via graph const entityNames = entities.map(e => e.name); const expanded = await db.query(` SELECT * FROM GRAPH_TABLE (kg MATCH (d:Document)-[:MENTIONS]->(e:Entity) COLUMNS (d.id, d.content, e.name AS via_entity) ) WHERE name IN (?) AND id NOT IN (?) `, [entityNames, docIds]);
return [...vectorResults, ...expanded];}Why This Over Single CTE?
- Guaranteed index usage - VSS pattern stays simple
- Debuggable - Each step can be tested independently
- Proven pattern - This is how cognee and others do it
- Same performance - At your scale, difference is ~20ms
When Single-Query CTE Makes Sense
Consider the CTE approach if:
- You verify HNSW index is used in your CTE pattern
- You need absolute minimum latency (<20ms matters)
- You’re willing to test and debug edge cases
- DuckDB team confirms extension interop
Until then, separate queries are more reliable.
Summary
| Approach | Works? | Performance | Recommended? |
|---|---|---|---|
| VSS + DuckPGQ in CTE | ⚠️ Limited | N/A | Works if single GRAPH CTE (see bug) |
| Separate queries | ✅ Yes | ~100ms | Yes |
| DuckDB + Kuzu | Yes | Varies | For production |
| FalkorDB (native) | Yes | ~5-10ms | If you need speed |
Bottom line: DuckDB can do hybrid search. Single-CTE works if using only one GRAPH_TABLE CTE. Multi-GRAPH CTEs with cross-references crash due to DuckPGQ bug #276. Use separate queries for complex graph expansion patterns (validated 2025-12-06).