Purpose

Document a critical bug in DuckDB where WHERE clauses fail to correctly filter NULL values in fixed-size FLOAT array columns, discovered while implementing the @zabaca/lattice knowledge graph system.

Problem Statement

DuckDB has a query-dependent bug with FLOAT[N] fixed-size array columns where:

  1. WHERE clauses return incorrect results: WHERE embedding[1] IS NULL returns 0 rows even when NULL embeddings exist
  2. Aggregate functions work correctly: SUM(CASE WHEN embedding[1] IS NULL THEN 1 ELSE 0 END) accurately counts NULL values
  3. Multi-row queries return incorrect NULL values: Array values appear as NULL in result sets even when they have valid data

Discovered Behavior

What Works ✅

-- Aggregates correctly identify NULL values
SELECT label,
       SUM(CASE WHEN embedding[1] IS NULL THEN 1 ELSE 0 END) AS null_count
FROM nodes
GROUP BY label;
-- Returns: 38 nodes with NULL embeddings

-- Single-row queries return correct values
SELECT embedding[1]
FROM nodes
WHERE name = 'Agent Bridge Layer';
-- Returns: -0.016005931422114372 (correct value)

What Fails ❌

-- WHERE clause with NULL check returns 0 rows
SELECT * FROM nodes WHERE embedding[1] IS NULL;
-- Expected: 38 rows
-- Actual: 0 rows

-- WHERE clause with IS NOT NULL check returns incorrect NULLs
SELECT name, embedding[1]
FROM nodes
WHERE embedding[1] IS NOT NULL
LIMIT 10;
-- Returns rows with NULL embedding[1] values (should be filtered out)

-- IN clause returns NULL for valid data
SELECT embedding[1]
FROM nodes
WHERE name IN ('Claude', 'Agent Bridge Layer');
-- Returns: NULL, NULL (both should have valid float values)

This bug matches the GitHub issue “Array With Values Shows Null Depending On Query” (October 2025):

  • Array columns return NULL for certain entries under some queries but not others
  • Using LIMIT and OFFSET can trigger the bug
  • WHERE clauses sometimes act as a “secret key” that unlocks missing data
  • Suspected root cause: query optimization issues where DuckDB’s optimizer makes incorrect assumptions about array columns

Related issues:

  • Unable to correctly filter out rows with NULL values using IS NOT NULL
  • ARRAY/LIST returns NULL with the read_parquet schema parameter
  • Inconsistent array concatenation with NULL values
  • Unnesting does not work on array columns with fixed sizes
  • Parquet difficulties with ARRAY: DuckDB writes fixed-size lists incompatibly

Impact on Lattice

This bug affects two critical operations in @zabaca/lattice:

1. Vector Search Results

When searching, nodes with NULL embeddings slip through the WHERE embedding[1] IS NOT NULL filter and appear in results with 0% similarity:

13. [Tool] Claude
Similarity: 0.00%
14. [Process] Agent Bridge Layer
Similarity: 0.00%

2. Embedding Repair

The repair function cannot find nodes to fix because the WHERE clause returns nothing:

// This query should return 38 nodes but returns 0
async findNodesWithMissingEmbeddings(labels: string[]) {
  // Quote each label so the IN (...) list is valid SQL
  const labelList = labels.map((l) => `'${l}'`).join(', ');
  const result = await conn.runAndReadAll(`
    SELECT label, name FROM nodes
    WHERE label IN (${labelList})
      AND embedding[1] IS NULL
  `);
  return result.getRows(); // Returns []
}
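Interpolating a raw string array into the IN (...) list produces unquoted identifiers and breaks on labels containing quotes. A small helper (hypothetical name `toSqlStringList`, not part of @zabaca/lattice) makes that step safe regardless of the NULL-filter bug:

```typescript
// Hypothetical helper: build a quoted, escaped SQL string list
// suitable for use inside an IN (...) clause.
function toSqlStringList(values: string[]): string {
  return values
    .map((v) => `'${v.replace(/'/g, "''")}'`) // double single quotes per SQL rules
    .join(', ');
}

// Example: toSqlStringList(['Tool', "O'Reilly"]) → "'Tool', 'O''Reilly'"
```

Prepared statements would be preferable where the client supports binding list parameters; this sketch only covers the string-interpolation style used above.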

Workarounds

1. Post-Query Filtering (Search Results)

Filter out 0% similarity results after the query:

async vectorSearchAll(queryVector: number[], k: number = 10) {
  // Serialize the query vector as a DuckDB array literal
  const vectorStr = `[${queryVector.join(', ')}]`;
  const reader = await conn.runAndReadAll(`
    SELECT name, label,
           array_cosine_similarity(embedding, ${vectorStr}::FLOAT[512]) AS similarity
    FROM nodes
    WHERE embedding[1] IS NOT NULL
    ORDER BY similarity DESC
    LIMIT ${k}
  `);
  // Filter out NULL embeddings that slipped through the WHERE clause
  return reader.getRows()
    .map(([name, label, similarity]) => ({ name, label, score: similarity }))
    .filter((r) => r.score !== null && r.score !== 0);
}
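The filtering step can be isolated as a pure function so the workaround is testable without a DuckDB connection. A minimal sketch (the `SearchHit` shape and `dropNullSimilarity` name are illustrative, not part of @zabaca/lattice):

```typescript
interface SearchHit {
  name: string;
  label: string;
  score: number | null;
}

// Drop hits whose similarity is NULL or exactly 0, i.e. rows with
// NULL embeddings that slipped through the IS NOT NULL filter.
function dropNullSimilarity(hits: SearchHit[]): SearchHit[] {
  return hits.filter((h) => h.score !== null && h.score !== 0);
}
```

Note the trade-off: a legitimate hit with an exact similarity of 0 would also be dropped, which is acceptable here because such hits would rank last anyway.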

2. Individual Row Checks (Repair Function)

Get all candidate nodes, then check each one individually:

async findNodesWithMissingEmbeddings(labels: string[]) {
  // Get all nodes for the specified labels (quote each label for the IN list)
  const labelList = labels.map((l) => `'${l}'`).join(', ');
  const allNodes = await conn.runAndReadAll(`
    SELECT label, name FROM nodes WHERE label IN (${labelList})
  `);
  // Check each node individually (single-row queries work correctly)
  const missing = [];
  for (const [label, name] of allNodes.getRows()) {
    const check = await conn.runAndReadAll(`
      SELECT embedding[1] IS NULL AS is_null
      FROM nodes
      WHERE label = '${label}' AND name = '${name}'
    `);
    if (check.getRows()[0][0] === true) {
      missing.push({ label, name });
    }
  }
  return missing;
}

3. Use WHERE Clause as Query Plan Hack

Adding an unrelated WHERE clause can sometimes trigger a different query plan that avoids the bug:

-- May work better than unfiltered query
SELECT name, embedding[1]
FROM nodes
WHERE label = 'Tool' AND embedding[1] IS NOT NULL;

4. Avoid LIMIT/OFFSET

Use range-based filtering instead of LIMIT and OFFSET when paginating:

-- Instead of: LIMIT 10 OFFSET 20
-- Use: WHERE id BETWEEN 21 AND 30
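The LIMIT/OFFSET-to-range translation above can be captured in a small helper. A minimal sketch, assuming dense, 1-based integer ids (the function name `pageToIdRange` is illustrative):

```typescript
// Hypothetical helper: translate a (page, pageSize) pair into an id range
// for WHERE id BETWEEN ... AND ..., assuming dense, 1-based integer ids.
function pageToIdRange(page: number, pageSize: number): { from: number; to: number } {
  const from = page * pageSize + 1; // page 0 → ids 1..pageSize
  return { from, to: from + pageSize - 1 };
}

// Equivalent of LIMIT 10 OFFSET 20:
// pageToIdRange(2, 10) → { from: 21, to: 30 }
```

If ids are sparse (after deletes), keyset pagination (`WHERE id > lastSeenId ORDER BY id LIMIT n`) is the more robust variant, though it still uses LIMIT and so may not sidestep the bug.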

Performance Impact

The individual row check workaround is slow:

  • 2051 total nodes in knowledge graph
  • ~1500 nodes in entity labels to check (Tool, Concept, Process, Technology, Document)
  • 1500+ individual queries required to find 38 NULL embeddings
  • Estimated time: 1-2 seconds vs. <10ms for a proper WHERE clause
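Since the single-row checks are independent and each is individually correct, wall-clock time can be reduced by running them with bounded concurrency rather than strictly sequentially. A minimal sketch (the helper `mapWithConcurrency` is an assumption, not part of @zabaca/lattice; the concurrency limit should respect the DuckDB connection's capabilities):

```typescript
// Hypothetical helper: map an async function over items with at most
// `limit` invocations in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker claims the next unprocessed index until the list is exhausted.
  // Safe without locks: index claiming is synchronous between awaits.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

With a limit of, say, 8, the ~1500 checks would proceed in parallel batches instead of one at a time, at the cost of holding more queries in flight.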

Recommendations

  1. Report to DuckDB: File a new issue linking to the existing report, with our specific FLOAT[512] reproduction case
  2. Monitor DuckDB releases: Check for fixes in upcoming versions (currently using DuckDB Node API)
  3. Consider alternative types: Test if FLOAT[] (variable-length) behaves better than FLOAT[512] (fixed-length)
  4. Use workarounds: Implement post-query filtering and individual checks until fixed

Technical Details

  • DuckDB Version: @duckdb/node-api (latest as of December 2025)
  • Column Type: FLOAT[512] (fixed-size array for 512-dimensional embeddings)
  • Use Case: Vector embeddings for semantic search with Voyage AI
  • Extension: VSS (Vector Similarity Search) for HNSW indexing

Sources

  1. Array With Values Shows Null Depending On Query · Issue
  2. Unable to correctly filter out rows with null values · Issue
  3. DuckDB Array Null Issue: Query-Dependent?
  4. Array Type – DuckDB
  5. unnesting does not work on array column with fixed size · Issue
  6. Parquet Difficulties with Array · Issue