duckdb-hybrid-search

Research on combining DuckDB’s VSS (Vector Similarity Search) and DuckPGQ (Property Graph Queries) for hybrid search in a single embedded database.

The Question

Can DuckDB serve as an all-in-one embedded solution for GraphRAG by combining:

VSS extension: Vector similarity search with HNSW indexes
DuckPGQ extension: Property graph queries with SQL/PGQ syntax

Using CTEs (Common Table Expressions) to compose both in a single query?

Research Findings

Can Both Extensions Work Together?

Theoretically: Yes - Both extensions load successfully.

Practically: CRASHES when combined in CTE (validated 2025-12-06)

Both extensions:

✅ Load into the same DuckDB instance
✅ Work independently with standard SQL
❌ CRASH when combined in CTE (segmentation fault)

Validation Results (packages/duckpgq-vss)

Test	Result	Notes
VSS extension loads	✅ Pass	DuckDB 1.3.1
DuckPGQ extension loads	✅ Pass	Custom S3 repo required
Both coexist	✅ Pass	Same instance
Vector search alone	✅ Pass	HNSW index works
Graph search alone	✅ Pass	SQL/PGQ syntax works
Hybrid CTE	❌ CRASH	Segfault in both Node.js and Bun
Separate queries	✅ Pass	~100ms total latency

Critical finding: The crash is caused by a DuckPGQ bug (issue #276) where GRAPH_TABLE CTEs that reference each other via subqueries crash. This is NOT a VSS+GRAPH_TABLE conflict.

Pattern	Works?
VSS CTE → single GRAPH CTE	✅ Yes
Multiple independent GRAPH CTEs	✅ Yes
GRAPH CTE referencing another GRAPH CTE	❌ Crashes

VSS Extension Limitations

From DuckDB VSS docs:

Aspect	Constraint
Index optimization	Only works with simple `ORDER BY distance LIMIT k` patterns
Complex queries	HNSW index may not be used in CTEs/subqueries
Join macros	`vss_join` and `vss_match` do NOT use HNSW index
Memory	Index must fit in RAM

Key insight: VSS is optimized for straightforward vector search, not complex query compositions.

DuckPGQ Capabilities

From DuckPGQ docs:

Aspect	Support
GRAPH_TABLE in queries	Yes
WHERE filters after GRAPH_TABLE	Yes
Subqueries with results	Yes (external to GRAPH_TABLE)
CTEs	Standard SQL CTEs work

Key insight: GRAPH_TABLE results can be filtered and joined with other SQL operations, but filtering inside MATCH clauses is limited.

Proposed Hybrid Query Pattern

-- THEORETICAL: May work but HNSW index optimization uncertain
WITH
-- Step 1: Vector search (VSS)
vector_matches AS (
  SELECT id, content, embedding <=> $query_embedding AS distance
  FROM documents
  ORDER BY distance
  LIMIT 5
),

-- Step 2: Get entities from matched docs (DuckPGQ)
doc_entities AS (
  SELECT * FROM GRAPH_TABLE (kg
    MATCH (d:Document)-[:MENTIONS]->(e:Entity)
    WHERE d.id IN (SELECT id FROM vector_matches)  -- May need external filter
    COLUMNS (d.id AS doc_id, e.name AS entity_name, e.type AS entity_type)
  )
),

-- Step 3: Expand via graph - find related docs
expanded_docs AS (
  SELECT * FROM GRAPH_TABLE (kg
    MATCH (d:Document)-[:MENTIONS]->(e:Entity)
    WHERE e.name IN (SELECT entity_name FROM doc_entities)
      AND d.id NOT IN (SELECT id FROM vector_matches)
    COLUMNS (d.id, d.content, e.name AS via_entity)
  )
)

-- Final: Combine original + expanded
SELECT id, content, distance, NULL as via_entity, 'vector' as source
FROM vector_matches
UNION ALL
SELECT id, content, NULL as distance, via_entity, 'graph' as source
FROM expanded_docs;

Potential Issues

WHERE IN subquery inside GRAPH_TABLE: May need to move to external filter
HNSW index usage: Uncertain if index is used when VSS is in a CTE
Query planner optimization: Cross-extension optimization untested

Alternative: External Filter Pattern

-- More likely to work: filter GRAPH_TABLE results externally
WITH
vector_matches AS (
  SELECT id, content, embedding <=> $query_embedding AS distance
  FROM documents
  ORDER BY distance
  LIMIT 5
),

graph_results AS (
  SELECT * FROM GRAPH_TABLE (kg
    MATCH (d:Document)-[:MENTIONS]->(e:Entity)
    COLUMNS (d.id AS doc_id, e.name AS entity_name)
  )
)

-- Filter graph results using vector matches
SELECT g.*
FROM graph_results g
WHERE g.doc_id IN (SELECT id FROM vector_matches);

Current Industry Approach: Separate Systems

Cognee Architecture (DuckDB + Kuzu)

Cognee uses:

┌─────────────────────────────────────────────────┐
│                    Cognee                        │
├─────────────────────────────────────────────────┤
│  ┌─────────────┐         ┌─────────────┐       │
│  │   DuckDB    │         │    Kuzu     │       │
│  │  (Vectors)  │         │   (Graph)   │       │
│  │             │         │             │       │
│  │ - Embeddings│         │ - Entities  │       │
│  │ - Metadata  │         │ - Relations │       │
│  │ - Analytics │         │ - Traversal │       │
│  └─────────────┘         └─────────────┘       │
│         ↓                       ↓              │
│    Vector Search         Graph Queries         │
│         └───────────┬───────────┘              │
│                     ↓                          │
│              Application Layer                  │
│           (combines results)                    │
└─────────────────────────────────────────────────┘

Why separate systems?

Each optimized for its workload
No cross-extension complexity
Proven in production

Performance Expectations

Single Query (If It Works)

Metric	Estimate	Notes
Latency	~50-150ms	One round-trip
VSS index usage	Uncertain	May fall back to scan
Complexity	High	Query planner untested

Separate Queries (Recommended)

Metric	Estimate	Notes
Total latency	~50-150ms	3 queries × ~20-50ms each
VSS index usage	Guaranteed	Simple pattern
Complexity	Low	Well-understood

For small knowledge bases (~1000 docs): Difference is negligible.

Practical Recommendation

For Your Use Case (Lattice-style KB)

// Recommended: Separate queries, simple implementation
async function hybridSearch(query: string): Promise<SearchResult[]> {
  const embedding = await embed(query);

  // Query 1: Vector search (uses HNSW index)
  const vectorResults = await db.query(`
    SELECT id, content, embedding <=> ? AS distance
    FROM documents
    ORDER BY distance LIMIT 5
  `, [embedding]);

  // Query 2: Get entities from matches
  const docIds = vectorResults.map(r => r.id);
  const entities = await db.query(`
    SELECT * FROM GRAPH_TABLE (kg
      MATCH (d:Document)-[:MENTIONS]->(e:Entity)
      COLUMNS (d.id, e.name, e.type)
    ) WHERE id IN (?)
  `, [docIds]);

  // Query 3: Expand via graph
  const entityNames = entities.map(e => e.name);
  const expanded = await db.query(`
    SELECT * FROM GRAPH_TABLE (kg
      MATCH (d:Document)-[:MENTIONS]->(e:Entity)
      COLUMNS (d.id, d.content, e.name AS via_entity)
    ) WHERE name IN (?) AND id NOT IN (?)
  `, [entityNames, docIds]);

  return [...vectorResults, ...expanded];
}

Why This Over Single CTE?

Guaranteed index usage - VSS pattern stays simple
Debuggable - Each step can be tested independently
Proven pattern - This is how cognee and others do it
Same performance - At your scale, difference is ~20ms

When Single-Query CTE Makes Sense

Consider the CTE approach if:

You verify HNSW index is used in your CTE pattern
You need absolute minimum latency (<20ms matters)
You’re willing to test and debug edge cases
DuckDB team confirms extension interop

Until then, separate queries are more reliable.

Summary

Approach	Works?	Performance	Recommended?
VSS + DuckPGQ in CTE	⚠️ Limited	N/A	Works if single GRAPH CTE (see bug)
Separate queries	✅ Yes	~100ms	Yes
DuckDB + Kuzu	Yes	Varies	For production
FalkorDB (native)	Yes	~5-10ms	If you need speed

Bottom line: DuckDB can do hybrid search. Single-CTE works if using only one GRAPH_TABLE CTE. Multi-GRAPH CTEs with cross-references crash due to DuckPGQ bug #276. Use separate queries for complex graph expansion patterns (validated 2025-12-06).