Purpose

Document a critical bug in DuckDB where WHERE clauses fail to correctly filter NULL values in fixed-size FLOAT array columns, discovered while implementing the @zabaca/lattice knowledge graph system.

Problem Statement

DuckDB has a query-dependent bug with FLOAT[N] fixed-size array columns where:

  1. WHERE clauses return incorrect results: WHERE embedding[1] IS NULL returns 0 rows even when NULL embeddings exist
  2. Aggregate functions work correctly: SUM(CASE WHEN embedding[1] IS NULL THEN 1 ELSE 0 END) accurately counts NULL values
  3. Multi-row queries return incorrect NULL values: Array values appear as NULL in result sets even when they have valid data

Discovered Behavior

What Works ✅

-- Aggregates correctly identify NULL values
SELECT label,
       SUM(CASE WHEN embedding[1] IS NULL THEN 1 ELSE 0 END) AS null_count
FROM nodes
GROUP BY label;
-- Returns: 38 nodes with NULL embeddings

-- Single-row queries return correct values
SELECT embedding[1]
FROM nodes
WHERE name = 'Agent Bridge Layer';
-- Returns: -0.016005931422114372 (correct value)

What Fails ❌

-- WHERE clause with NULL check returns 0 rows
SELECT * FROM nodes WHERE embedding[1] IS NULL;
-- Expected: 38 rows
-- Actual: 0 rows

-- WHERE clause with IS NOT NULL check returns incorrect NULLs
SELECT name, embedding[1]
FROM nodes
WHERE embedding[1] IS NOT NULL
LIMIT 10;
-- Returns rows with NULL embedding[1] values (should be filtered out)

-- IN clause returns NULL for valid data
SELECT embedding[1]
FROM nodes
WHERE name IN ('Claude', 'Agent Bridge Layer');
-- Returns: NULL, NULL (both should have valid float values)

This bug matches the GitHub issue “Array With Values Shows Null Depending On Query” (October 2025):

  • Array columns return NULL for certain entries under some queries but not others
  • Using LIMIT and OFFSET can trigger the bug
  • WHERE clauses sometimes act as a “secret key” that unlocks missing data
  • Suspected root cause: query optimization issues where DuckDB’s optimizer makes incorrect assumptions about array columns

Related issues:

  • Unable to correctly filter out rows with NULL values using IS NOT NULL
  • ARRAY/LIST returns NULL with the read_parquet schema parameter
  • Inconsistent array concatenation with NULL values
  • Unnesting does not work on array columns with fixed sizes
  • Parquet difficulties with ARRAY: DuckDB writes fixed-size lists incompatibly

Impact on Lattice

This bug affects two critical operations in @zabaca/lattice:

1. Vector Search Results

When searching, nodes with NULL embeddings slip through the WHERE embedding[1] IS NOT NULL filter and appear in results with 0% similarity:

13. [Tool] Claude
Similarity: 0.00%
14. [Process] Agent Bridge Layer
Similarity: 0.00%

2. Embedding Repair

The repair function cannot find nodes to fix because the WHERE clause returns nothing:

// This query should return 38 nodes but returns 0
async findNodesWithMissingEmbeddings(labels: string[]) {
  // Quote each label so the IN (...) list is valid SQL
  const labelList = labels.map((l) => `'${l}'`).join(', ');
  const result = await conn.runAndReadAll(`
    SELECT label, name FROM nodes
    WHERE label IN (${labelList})
      AND embedding[1] IS NULL
  `);
  return result.getRows(); // Returns []
}
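Interpolating a raw string array into the IN (...) list produces unquoted identifiers and breaks on labels containing quotes. A small helper (hypothetical name `toSqlStringList`, not part of @zabaca/lattice) makes that step safe regardless of the NULL-filter bug:

```typescript
// Hypothetical helper: build a quoted, escaped SQL string list
// suitable for use inside an IN (...) clause.
function toSqlStringList(values: string[]): string {
  return values
    .map((v) => `'${v.replace(/'/g, "''")}'`) // double single quotes per SQL rules
    .join(', ');
}

// Example: toSqlStringList(['Tool', "O'Reilly"]) → "'Tool', 'O''Reilly'"
```

Prepared statements would be preferable where the client supports binding list parameters; this sketch only covers the string-interpolation style used above.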

Workarounds

1. Post-Query Filtering (Search Results)

Filter out 0% similarity results after the query:

async vectorSearchAll(queryVector: number[], k: number = 10) {
  // Serialize the query vector as a DuckDB array literal
  const vectorStr = `[${queryVector.join(', ')}]`;
  const reader = await conn.runAndReadAll(`
    SELECT name, label,
           array_cosine_similarity(embedding, ${vectorStr}::FLOAT[512]) AS similarity
    FROM nodes
    WHERE embedding[1] IS NOT NULL
    ORDER BY similarity DESC
    LIMIT ${k}
  `);
  // Filter out NULL embeddings that slipped through the WHERE clause
  return reader.getRows()
    .map(([name, label, similarity]) => ({ name, label, score: similarity }))
    .filter((r) => r.score !== null && r.score !== 0);
}
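The filtering step can be isolated as a pure function so the workaround is testable without a DuckDB connection. A minimal sketch (the `SearchHit` shape and `dropNullSimilarity` name are illustrative, not part of @zabaca/lattice):

```typescript
interface SearchHit {
  name: string;
  label: string;
  score: number | null;
}

// Drop hits whose similarity is NULL or exactly 0, i.e. rows with
// NULL embeddings that slipped through the IS NOT NULL filter.
function dropNullSimilarity(hits: SearchHit[]): SearchHit[] {
  return hits.filter((h) => h.score !== null && h.score !== 0);
}
```

Note the trade-off: a legitimate hit with an exact similarity of 0 would also be dropped, which is acceptable here because such hits would rank last anyway.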

2. Individual Row Checks (Repair Function)

Get all candidate nodes, then check each one individually:

async findNodesWithMissingEmbeddings(labels: string[]) {
  // Get all nodes for the specified labels (quote each label for the IN list)
  const labelList = labels.map((l) => `'${l}'`).join(', ');
  const allNodes = await conn.runAndReadAll(`
    SELECT label, name FROM nodes WHERE label IN (${labelList})
  `);
  // Check each node individually (single-row queries work correctly)
  const missing = [];
  for (const [label, name] of allNodes.getRows()) {
    const check = await conn.runAndReadAll(`
      SELECT embedding[1] IS NULL AS is_null
      FROM nodes
      WHERE label = '${label}' AND name = '${name}'
    `);
    if (check.getRows()[0][0] === true) {
      missing.push({ label, name });
    }
  }
  return missing;
}

3. Use WHERE Clause as Query Plan Hack

Adding an unrelated WHERE clause can sometimes trigger a different query plan that avoids the bug:

-- May work better than unfiltered query
SELECT name, embedding[1]
FROM nodes
WHERE label = 'Tool' AND embedding[1] IS NOT NULL;

4. Avoid LIMIT/OFFSET

Use range-based filtering instead of LIMIT and OFFSET when paginating:

-- Instead of: LIMIT 10 OFFSET 20
-- Use: WHERE id BETWEEN 21 AND 30
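The LIMIT/OFFSET-to-range translation above can be captured in a small helper. A minimal sketch, assuming dense, 1-based integer ids (the function name `pageToIdRange` is illustrative):

```typescript
// Hypothetical helper: translate a (page, pageSize) pair into an id range
// for WHERE id BETWEEN ... AND ..., assuming dense, 1-based integer ids.
function pageToIdRange(page: number, pageSize: number): { from: number; to: number } {
  const from = page * pageSize + 1; // page 0 → ids 1..pageSize
  return { from, to: from + pageSize - 1 };
}

// Equivalent of LIMIT 10 OFFSET 20:
// pageToIdRange(2, 10) → { from: 21, to: 30 }
```

If ids are sparse (after deletes), keyset pagination (`WHERE id > lastSeenId ORDER BY id LIMIT n`) is the more robust variant, though it still uses LIMIT and so may not sidestep the bug.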

Performance Impact

The individual row check workaround is slow:

  • 2051 total nodes in knowledge graph
  • ~1500 nodes in entity labels to check (Tool, Concept, Process, Technology, Document)
  • 1500+ individual queries required to find 38 NULL embeddings
  • Estimated time: 1-2 seconds vs. <10ms for a proper WHERE clause
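Since the single-row checks are independent and each is individually correct, wall-clock time can be reduced by running them with bounded concurrency rather than strictly sequentially. A minimal sketch (the helper `mapWithConcurrency` is an assumption, not part of @zabaca/lattice; the concurrency limit should respect the DuckDB connection's capabilities):

```typescript
// Hypothetical helper: map an async function over items with at most
// `limit` invocations in flight at once.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker claims the next unprocessed index until the list is exhausted.
  // Safe without locks: index claiming is synchronous between awaits.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

With a limit of, say, 8, the ~1500 checks would proceed in parallel batches instead of one at a time, at the cost of holding more queries in flight.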

Recommendations

  1. Report to DuckDB: File a new issue linking to the existing report, with our specific FLOAT[512] reproduction case
  2. Monitor DuckDB releases: Check for fixes in upcoming versions (currently using DuckDB Node API)
  3. Consider alternative types: Test if FLOAT[] (variable-length) behaves better than FLOAT[512] (fixed-length)
  4. Use workarounds: Implement post-query filtering and individual checks until fixed

Technical Details

  • DuckDB Version: @duckdb/node-api (latest as of December 2025)
  • Column Type: FLOAT[512] (fixed-size array for 512-dimensional embeddings)
  • Use Case: Vector embeddings for semantic search with Voyage AI
  • Extension: VSS (Vector Similarity Search) for HNSW indexing

Sources

  1. Array With Values Shows Null Depending On Query · Issue
  2. Unable to correctly filter out rows with null values · Issue
  3. DuckDB Array Null Issue: Query-Dependent?
  4. Array Type – DuckDB
  5. unnesting does not work on array column with fixed size · Issue
  6. Parquet Difficulties with Array · Issue