Research on combining DuckDB’s VSS (Vector Similarity Search) and DuckPGQ (Property Graph Queries) for hybrid search in a single embedded database.

The Question

Can DuckDB serve as an all-in-one embedded solution for GraphRAG by combining:

  • VSS extension: Vector similarity search with HNSW indexes
  • DuckPGQ extension: Property graph queries with SQL/PGQ syntax

Using CTEs (Common Table Expressions) to compose both in a single query?


Research Findings

Can Both Extensions Work Together?

Theoretically: Yes - Both extensions load successfully.

Practically: CRASHES when combined in CTE (validated 2025-12-06)

Both extensions:

  • ✅ Load into the same DuckDB instance
  • ✅ Work independently with standard SQL
  • CRASH when combined in CTE (segmentation fault)

Validation Results (packages/duckpgq-vss)

TestResultNotes
VSS extension loads✅ PassDuckDB 1.3.1
DuckPGQ extension loads✅ PassCustom S3 repo required
Both coexist✅ PassSame instance
Vector search alone✅ PassHNSW index works
Graph search alone✅ PassSQL/PGQ syntax works
Hybrid CTECRASHSegfault in both Node.js and Bun
Separate queries✅ Pass~100ms total latency

Critical finding: The crash is caused by a DuckPGQ bug (issue ) where GRAPH_TABLE CTEs that reference each other via subqueries crash. This is NOT a VSS+GRAPH_TABLE conflict.

PatternWorks?
VSS CTE → single GRAPH CTE✅ Yes
Multiple independent GRAPH CTEs✅ Yes
GRAPH CTE referencing another GRAPH CTE❌ Crashes

VSS Extension Limitations

From DuckDB VSS docs:

AspectConstraint
Index optimizationOnly works with simple ORDER BY distance LIMIT k patterns
Complex queriesHNSW index may not be used in CTEs/subqueries
Join macrosvss_join and vss_match do NOT use HNSW index
MemoryIndex must fit in RAM

Key insight: VSS is optimized for straightforward vector search, not complex query compositions.

DuckPGQ Capabilities

From DuckPGQ docs:

AspectSupport
GRAPH_TABLE in queriesYes
WHERE filters after GRAPH_TABLEYes
Subqueries with resultsYes (external to GRAPH_TABLE)
CTEsStandard SQL CTEs work

Key insight: GRAPH_TABLE results can be filtered and joined with other SQL operations, but filtering inside MATCH clauses is limited.


Proposed Hybrid Query Pattern

-- THEORETICAL: May work but HNSW index optimization uncertain
WITH
-- Step 1: Vector search (VSS)
vector_matches AS (
SELECT id, content, embedding <=> $query_embedding AS distance
FROM documents
ORDER BY distance
LIMIT 5
),
-- Step 2: Get entities from matched docs (DuckPGQ)
doc_entities AS (
SELECT * FROM GRAPH_TABLE (kg
MATCH (d:Document)-[:MENTIONS]->(e:Entity)
WHERE d.id IN (SELECT id FROM vector_matches) -- May need external filter
COLUMNS (d.id AS doc_id, e.name AS entity_name, e.type AS entity_type)
)
),
-- Step 3: Expand via graph - find related docs
expanded_docs AS (
SELECT * FROM GRAPH_TABLE (kg
MATCH (d:Document)-[:MENTIONS]->(e:Entity)
WHERE e.name IN (SELECT entity_name FROM doc_entities)
AND d.id NOT IN (SELECT id FROM vector_matches)
COLUMNS (d.id, d.content, e.name AS via_entity)
)
)
-- Final: Combine original + expanded
SELECT id, content, distance, NULL as via_entity, 'vector' as source
FROM vector_matches
UNION ALL
SELECT id, content, NULL as distance, via_entity, 'graph' as source
FROM expanded_docs;

Potential Issues

  1. WHERE IN subquery inside GRAPH_TABLE: May need to move to external filter
  2. HNSW index usage: Uncertain if index is used when VSS is in a CTE
  3. Query planner optimization: Cross-extension optimization untested

Alternative: External Filter Pattern

-- More likely to work: filter GRAPH_TABLE results externally
WITH
vector_matches AS (
SELECT id, content, embedding <=> $query_embedding AS distance
FROM documents
ORDER BY distance
LIMIT 5
),
graph_results AS (
SELECT * FROM GRAPH_TABLE (kg
MATCH (d:Document)-[:MENTIONS]->(e:Entity)
COLUMNS (d.id AS doc_id, e.name AS entity_name)
)
)
-- Filter graph results using vector matches
SELECT g.*
FROM graph_results g
WHERE g.doc_id IN (SELECT id FROM vector_matches);

Current Industry Approach: Separate Systems

Cognee Architecture (DuckDB + Kuzu)

Cognee uses:

┌─────────────────────────────────────────────────┐
│ Cognee │
├─────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ │
│ │ DuckDB │ │ Kuzu │ │
│ │ (Vectors) │ │ (Graph) │ │
│ │ │ │ │ │
│ │ - Embeddings│ │ - Entities │ │
│ │ - Metadata │ │ - Relations │ │
│ │ - Analytics │ │ - Traversal │ │
│ └─────────────┘ └─────────────┘ │
│ ↓ ↓ │
│ Vector Search Graph Queries │
│ └───────────┬───────────┘ │
│ ↓ │
│ Application Layer │
│ (combines results) │
└─────────────────────────────────────────────────┘

Why separate systems?

  • Each optimized for its workload
  • No cross-extension complexity
  • Proven in production

Performance Expectations

Single Query (If It Works)

MetricEstimateNotes
Latency~50-150msOne round-trip
VSS index usageUncertainMay fall back to scan
ComplexityHighQuery planner untested
MetricEstimateNotes
Total latency~50-150ms3 queries × ~20-50ms each
VSS index usageGuaranteedSimple pattern
ComplexityLowWell-understood

For small knowledge bases (~1000 docs): Difference is negligible.


Practical Recommendation

For Your Use Case (Lattice-style KB)

// Recommended: Separate queries, simple implementation
async function hybridSearch(query: string): Promise<SearchResult[]> {
const embedding = await embed(query);
// Query 1: Vector search (uses HNSW index)
const vectorResults = await db.query(`
SELECT id, content, embedding <=> ? AS distance
FROM documents
ORDER BY distance LIMIT 5
`, [embedding]);
// Query 2: Get entities from matches
const docIds = vectorResults.map(r => r.id);
const entities = await db.query(`
SELECT * FROM GRAPH_TABLE (kg
MATCH (d:Document)-[:MENTIONS]->(e:Entity)
COLUMNS (d.id, e.name, e.type)
) WHERE id IN (?)
`, [docIds]);
// Query 3: Expand via graph
const entityNames = entities.map(e => e.name);
const expanded = await db.query(`
SELECT * FROM GRAPH_TABLE (kg
MATCH (d:Document)-[:MENTIONS]->(e:Entity)
COLUMNS (d.id, d.content, e.name AS via_entity)
) WHERE name IN (?) AND id NOT IN (?)
`, [entityNames, docIds]);
return [...vectorResults, ...expanded];
}

Why This Over Single CTE?

  1. Guaranteed index usage - VSS pattern stays simple
  2. Debuggable - Each step can be tested independently
  3. Proven pattern - This is how cognee and others do it
  4. Same performance - At your scale, difference is ~20ms

When Single-Query CTE Makes Sense

Consider the CTE approach if:

  • You verify HNSW index is used in your CTE pattern
  • You need absolute minimum latency (<20ms matters)
  • You’re willing to test and debug edge cases
  • DuckDB team confirms extension interop

Until then, separate queries are more reliable.


Summary

ApproachWorks?PerformanceRecommended?
VSS + DuckPGQ in CTE⚠️ LimitedN/AWorks if single GRAPH CTE (see bug)
Separate queries✅ Yes~100msYes
DuckDB + KuzuYesVariesFor production
FalkorDB (native)Yes~5-10msIf you need speed

Bottom line: DuckDB can do hybrid search. Single-CTE works if using only one GRAPH_TABLE CTE. Multi-GRAPH CTEs with cross-references crash due to DuckPGQ bug . Use separate queries for complex graph expansion patterns (validated 2025-12-06).


Sources

  1. DuckDB VSS Documentation
  2. DuckPGQ Documentation
  3. DuckDB × Cognee: SQL Analytics + Graph RAG
  4. DuckDB Graph Queries with DuckPGQ
  5. What’s New in VSS Extension
  6. DuckPGQ: SQL/PGQ for DuckDB (VLDB Paper)
  7. DuckPGQ GitHub Repository