float-array-where-clause-bug
Purpose
Document a critical bug in DuckDB where WHERE clauses fail to correctly filter NULL values in fixed-size FLOAT array columns, discovered while implementing the @zabaca/lattice knowledge graph system.
Problem Statement
DuckDB has a query-dependent bug with FLOAT[N] fixed-size array columns where:
- WHERE clauses return incorrect results: `WHERE embedding[1] IS NULL` returns 0 rows even when NULL embeddings exist
- Aggregate functions work correctly: `SUM(CASE WHEN embedding[1] IS NULL THEN 1 ELSE 0 END)` accurately counts NULL values
- Multi-row queries return incorrect NULL values: array values appear as NULL in result sets even when the stored data is valid
Discovered Behavior
What Works ✅
```sql
-- Aggregates correctly identify NULL values
SELECT label, SUM(CASE WHEN embedding[1] IS NULL THEN 1 ELSE 0 END) as null_count
FROM nodes
GROUP BY label;
-- Returns: 38 nodes with NULL embeddings
```
```sql
-- Single-row queries return correct values
SELECT embedding[1]
FROM nodes
WHERE name = 'Agent Bridge Layer';
-- Returns: -0.016005931422114372 (correct value)
```

What Fails ❌
```sql
-- WHERE clause with NULL check returns 0 rows
SELECT * FROM nodes WHERE embedding[1] IS NULL;
-- Expected: 38 rows
-- Actual: 0 rows
```
```sql
-- WHERE clause with IS NOT NULL check returns incorrect NULLs
SELECT name, embedding[1]
FROM nodes
WHERE embedding[1] IS NOT NULL
LIMIT 10;
-- Returns rows with NULL embedding[1] values (should be filtered out)
```
```sql
-- IN clause returns NULL for valid data
SELECT embedding[1]
FROM nodes
WHERE name IN ('Claude', 'Agent Bridge Layer');
-- Returns: NULL, NULL (both should have valid float values)
```

Related DuckDB Issues
This bug matches GitHub Issue #19377 - “Array With Values Shows Null Depending On Query” (October 2025):
- Array columns return NULL for certain entries under some queries but not others
- Using LIMIT and OFFSET can trigger the bug
- WHERE clauses sometimes act as a “secret key” that unlocks missing data
- Root cause: Query optimization issues where DuckDB’s optimizer makes incorrect assumptions with arrays
Other Related Issues
- #9717: Unable to correctly filter out rows with NULL values using IS NOT NULL
- #15504: ARRAY/LIST returns NULL with read_parquet schema parameter
- #11781: Inconsistent array concatenation with NULL values
- #16672: Unnesting does not work on array columns with fixed sizes
- #17238: Parquet difficulties with ARRAY - DuckDB writes fixed-size lists incompatibly
Impact on Lattice
This bug affects two critical operations in @zabaca/lattice:
1. Vector Search Results
When searching, nodes with NULL embeddings slip through the `WHERE embedding[1] IS NOT NULL` filter and appear in results with 0% similarity:
```text
13. [Tool] Claude                 Similarity: 0.00%
14. [Process] Agent Bridge Layer  Similarity: 0.00%
```

2. Embedding Repair
The repair function cannot find nodes to fix because the WHERE clause returns nothing:
```typescript
// This query should return 38 nodes but returns 0
async findNodesWithMissingEmbeddings(labels: string[]) {
  const labelList = labels.map((l) => `'${l}'`).join(', ');
  const result = await conn.runAndReadAll(`
    SELECT label, name
    FROM nodes
    WHERE label IN (${labelList})
      AND embedding[1] IS NULL
  `);
  return result.getRows(); // Returns []
}
```

Workarounds
1. Post-Query Filtering (Search Results)
Filter out 0% similarity results after the query:
```typescript
async vectorSearchAll(queryVector: number[], k: number = 10) {
  const vectorStr = `[${queryVector.join(', ')}]`;
  const reader = await conn.runAndReadAll(`
    SELECT name, label,
           array_cosine_similarity(embedding, ${vectorStr}::FLOAT[512]) as similarity
    FROM nodes
    WHERE embedding[1] IS NOT NULL
    ORDER BY similarity DESC
    LIMIT ${k}
  `);

  // Filter out NULL embeddings that slipped through
  return reader.getRows()
    .map(([name, label, similarity]) => ({ name, label, score: similarity }))
    .filter((r) => r.score !== null && r.score !== 0);
}
```

2. Individual Row Checks (Repair Function)
Get all candidate nodes, then check each one individually:
```typescript
async findNodesWithMissingEmbeddings(labels: string[]) {
  // Get all nodes for the specified labels
  const labelList = labels.map((l) => `'${l}'`).join(', ');
  const allNodes = await conn.runAndReadAll(`
    SELECT label, name
    FROM nodes
    WHERE label IN (${labelList})
  `);

  // Check each node individually (single-row queries work correctly)
  const missing: { label: string; name: string }[] = [];
  for (const [label, name] of allNodes.getRows()) {
    const check = await conn.runAndReadAll(`
      SELECT embedding[1] IS NULL as is_null
      FROM nodes
      WHERE label = '${label}' AND name = '${name}'
    `);
    if (check.getRows()[0][0] === true) {
      missing.push({ label: String(label), name: String(name) });
    }
  }
  return missing;
}
```

3. Use WHERE Clause as Query Plan Hack
Adding an unrelated WHERE clause can sometimes trigger a different query plan that avoids the bug:
```sql
-- May work better than unfiltered query
SELECT name, embedding[1]
FROM nodes
WHERE label = 'Tool' AND embedding[1] IS NOT NULL;
```

4. Avoid LIMIT/OFFSET
Use range-based filtering instead of LIMIT and OFFSET when paginating:
```sql
-- Instead of: LIMIT 10 OFFSET 20
-- Use: WHERE id BETWEEN 21 AND 30
```
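A minimal sketch of what range-based paging could look like from the application side, assuming a monotonically increasing integer `id` column (which the lattice schema may not actually have; DuckDB's implicit `rowid` pseudo-column could substitute). The helper name is hypothetical:

```typescript
import { DuckDBConnection } from '@duckdb/node-api';

// Hypothetical helper: fetch one "page" of nodes by id range instead of LIMIT/OFFSET.
// Note that gaps in the id sequence (e.g. from deletes) produce short pages.
async function pageNodes(conn: DuckDBConnection, page: number, pageSize: number) {
  const lo = page * pageSize + 1;
  const hi = (page + 1) * pageSize;
  const reader = await conn.runAndReadAll(`
    SELECT id, label, name
    FROM nodes
    WHERE id BETWEEN ${lo} AND ${hi}  -- range filter, no LIMIT/OFFSET
    ORDER BY id
  `);
  return reader.getRows();
}
```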
Performance Impact
The individual row check workaround is slow:
- 2051 total nodes in knowledge graph
- ~1500 nodes in entity labels to check (Tool, Concept, Process, Technology, Document)
- 1500+ individual queries required to find 38 NULL embeddings
- Estimated time: 1-2 seconds (roughly 1 ms of round-trip overhead per query across ~1500 queries) vs. <10ms for a working WHERE clause
Recommendations
- Report to DuckDB: File a new issue linking to #19377 with our specific FLOAT[512] reproduction case
- Monitor DuckDB releases: Check for fixes in upcoming versions (currently using DuckDB Node API)
- Consider alternative types: Test whether `FLOAT[]` (variable-length) behaves better than `FLOAT[512]` (fixed-size); see the sketch after this list
- Use workarounds: Implement post-query filtering and individual checks until fixed
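A minimal sketch of a reproduction harness that covers both the bug report and the array-type comparison, assuming the same @duckdb/node-api client used above; the tiny 4-element arrays stand in for `FLOAT[512]`, the table and values are illustrative, and, per #19377, whether the bug actually triggers can depend on data size and the chosen query plan:

```typescript
import { DuckDBInstance } from '@duckdb/node-api';

// Create a tiny table with the given embedding type, insert one NULL and one
// non-NULL embedding, then compare the aggregate NULL count (reported to work)
// against the WHERE-clause filter (reported to fail for FLOAT[N]).
async function checkNullFiltering(embeddingType: string) {
  const instance = await DuckDBInstance.create(':memory:');
  const conn = await instance.connect();

  await conn.run(`CREATE TABLE nodes (name VARCHAR, embedding ${embeddingType})`);
  await conn.run(`INSERT INTO nodes VALUES ('has-embedding', [0.1, 0.2, 0.3, 0.4]::${embeddingType})`);
  await conn.run(`INSERT INTO nodes VALUES ('missing-embedding', NULL)`);

  const agg = await conn.runAndReadAll(
    `SELECT SUM(CASE WHEN embedding[1] IS NULL THEN 1 ELSE 0 END) FROM nodes`
  );
  const where = await conn.runAndReadAll(
    `SELECT name FROM nodes WHERE embedding[1] IS NULL`
  );

  console.log(embeddingType, {
    aggregateNullCount: agg.getRows()[0][0], // expected: 1
    whereClauseRows: where.getRows().length, // expected: 1; the bug would show 0
  });
}

// Compare fixed-size vs. variable-length array columns.
await checkNullFiltering('FLOAT[4]');
await checkNullFiltering('FLOAT[]');
```

Note that if `FLOAT[]` does behave better, the VSS HNSW index would no longer apply, since it only indexes fixed-size ARRAY columns.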
Technical Details
- DuckDB Version: @duckdb/node-api (latest as of December 2025)
- Column Type: `FLOAT[512]` (fixed-size array for 512-dimensional embeddings)
- Use Case: Vector embeddings for semantic search with Voyage AI
- Extension: VSS (Vector Similarity Search) for HNSW indexing
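For reference, a sketch of the VSS/HNSW setup implied by these details; the database path, index name, and the experimental-persistence setting are assumptions drawn from the DuckDB VSS documentation rather than lattice's actual code:

```typescript
import { DuckDBInstance } from '@duckdb/node-api';

// Load the VSS extension and build an HNSW index over the FLOAT[512] column.
const instance = await DuckDBInstance.create('lattice.duckdb'); // hypothetical path
const conn = await instance.connect();

await conn.run(`INSTALL vss`);
await conn.run(`LOAD vss`);

// HNSW indexes on persistent (non in-memory) databases require this opt-in.
await conn.run(`SET hnsw_enable_experimental_persistence = true`);

// Cosine metric matches array_cosine_similarity() used in the search queries above.
await conn.run(`
  CREATE INDEX nodes_embedding_hnsw
  ON nodes USING HNSW (embedding)
  WITH (metric = 'cosine')
`);
```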
Sources
- Array With Values Shows Null Depending On Query · Issue #19377
- Unable to correctly filter out rows with null values · Issue #9717
- DuckDB Array Null Issue: Query-Dependent?
- Array Type – DuckDB
- unnesting does not work on array column with fixed size · Issue #16672
- Parquet Difficulties with Array · Issue #17238