Data Integration

Google Sheets DuckDB Integration: What It Actually Does

Marc Sean · April 4, 2026 · 7 min read

The pitch is real SQL — joins, window functions, aggregations — running against live spreadsheet data. The reality is more nuanced, and worth understanding before you wire it up.

What DuckDB Actually Adds to Google Sheets

Google Sheets has QUERY(), which speaks a limited SQL dialect. It handles basic SELECT and WHERE, but stumbles on anything involving multiple source ranges, aggregations across joins, or window functions. DuckDB doesn't have those restrictions — it's a full OLAP engine that supports standard SQL, including CTEs, lateral joins, and QUALIFY for filtering window function results.

The architectural difference matters. QUERY() evaluates inside the Sheets formula engine, which means it's limited to what that engine exposes. DuckDB runs as a separate in-process database that reads your sheet data as tabular input. Your spreadsheet becomes a data source rather than the execution environment.

In practice, this means you can do things like:

  • Join two ranges from different sheets on a common key column
  • Compute rolling averages or running totals with OVER (PARTITION BY ... ORDER BY ...)
  • Filter aggregated results with HAVING instead of wrapping in a subquery
  • Run GROUP BY on multiple dimensions at once and pivot the output
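As a concrete example of that last point about QUALIFY: it filters on a window function's result without a wrapping subquery. A sketch, assuming the range data is available as a table named sales with region, product, and revenue columns (illustrative names, not a fixed schema):

```sql
-- Keep the top 3 products by revenue within each region,
-- filtering directly on the window function's output
SELECT region, product, revenue,
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC) AS rn
FROM sales
QUALIFY rn <= 3;
```

The same filter in Sheets' QUERY() would require materializing the ranking in helper columns first.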

A financial analyst building a contribution margin analysis across product lines and regions — the kind of thing that currently lives in a 12-tab workbook with VLOOKUP chains — can collapse that into a single SQL query against the raw data ranges.
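A sketch of what that collapse can look like, assuming the raw data has been loaded into DuckDB as two tables, sales (product_id, region, revenue, variable_cost) and products (product_id, product_line) — illustrative names, not a prescribed schema:

```sql
-- Contribution margin by product line and region in a single pass,
-- replacing a chain of VLOOKUPs and intermediate calculation tabs
SELECT
  p.product_line,
  s.region,
  SUM(s.revenue)                   AS revenue,
  SUM(s.revenue - s.variable_cost) AS contribution,
  ROUND(100.0 * SUM(s.revenue - s.variable_cost) / SUM(s.revenue), 1) AS margin_pct
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY p.product_line, s.region
ORDER BY margin_pct DESC;
```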

How the Integration Works

There are a few ways to get DuckDB operating against Sheets data, covered in more detail in How to Connect DuckDB to Google Sheets. The short version: you're either exporting data to a format DuckDB can read (CSV, Parquet), or you're using an integration layer that handles the data bridging automatically.
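The export path is the simplest to sketch: dump the sheet to CSV, then point DuckDB at the file. The filename here is illustrative:

```sql
-- read_csv_auto infers column names and types from the exported file
SELECT region, SUM(amount) AS total_amount
FROM read_csv_auto('sheet_export.csv')
GROUP BY region
ORDER BY total_amount DESC;
```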

The more interesting approach is embedding DuckDB directly in the toolchain — so the query engine has live access to sheet ranges rather than stale exports. This is what ModelMonkey does internally: a sheet_range() table function registered as a DuckDB UDF lets you reference spreadsheet data in A1 notation directly inside SQL queries.

SELECT 
  category,
  SUM(revenue) as total_revenue,
  AVG(margin_pct) as avg_margin
FROM sheet_range('P&L!B2:E500')
GROUP BY category
ORDER BY total_revenue DESC

That query runs against live spreadsheet data. No export step, no stale CSV. The sheet_range() function fetches the range at query time and returns it as a typed tabular source DuckDB can operate on.

Column types get inferred from the data — numbers come through as DOUBLE, booleans as BOOLEAN, and anything mixed defaults to VARCHAR. As of April 2026, DuckDB's type inference handles most finance-oriented data well, though you'll occasionally need explicit casts when a column has header text mixed with numeric data.
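When inference lands on VARCHAR for a column that is mostly numeric, an explicit cast cleans it up. A sketch using DuckDB's TRY_CAST — the range reference and column name are illustrative:

```sql
-- TRY_CAST returns NULL instead of erroring on non-numeric values,
-- so a stray header or note in the column won't break the query
SELECT SUM(TRY_CAST(amount AS DOUBLE)) AS total
FROM sheet_range('Data!A1:B500')
WHERE TRY_CAST(amount AS DOUBLE) IS NOT NULL;
```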

The Limits That Actually Matter

Google Sheets caps at 10 million cells. DuckDB can handle datasets orders of magnitude larger — but inside a Sheets integration, you're bounded by what Sheets itself can hold. According to Google's Sheets documentation, that 10M cell limit is hard. So the scenario where you'd need to "promote" the workload to a server-side database (MotherDuck, for instance) just doesn't arise if your data lives entirely in Sheets.

What does matter is output size. Returning a 50,000-row result set inline to a spreadsheet is slow and usually pointless — you wanted the aggregate, not the raw rows. A sensible integration caps inline results at around 200 rows and routes larger outputs to a destination sheet. ModelMonkey enforces this: queries return 200 rows by default, and larger outputs require specifying output:"sheet" to write results directly to a named range. This keeps the LLM context clean while still making full result sets accessible.
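In practice that means writing the query to return the aggregate you actually want rather than the rows behind it. A sketch with illustrative range and column names:

```sql
-- One row per region instead of 50,000 raw order rows
SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
FROM sheet_range('Orders!A2:F50000')
GROUP BY region
ORDER BY revenue DESC
LIMIT 200;  -- stays under the inline result cap
```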

Query timeouts are the other practical limit. Complex joins against large ranges can take several seconds. The DuckDB instance runs in-memory, which is fast, but the bottleneck is usually fetching the range data from the Sheets API — that round trip adds latency proportional to range size.

Where the Native Embedding Wins

The export-and-query approach (dump to CSV, run DuckDB locally, import results back) works, but it has a friction problem: the data is stale the moment you export it. If you're building analysis that needs to refresh as the spreadsheet updates, you're back to manual steps.

An embedded integration where DuckDB reads ranges at query time solves this. The query always reflects the current state of the sheet. This matters most for operational dashboards — inventory trackers, sales pipelines, appointment schedules — where the underlying data changes daily and the analysis needs to stay current without manual intervention.

The other advantage is provenance. When DuckDB runs as part of an AI agent workflow (as it does in ModelMonkey), the system can track which ranges were accessed for each query. That lets the agent explain its reasoning: "I queried ranges P&L!B2:E500 and Products!A2:C200 and joined on product_id." Compare that to a formula in a cell — you know the output, but the derivation is buried in nested functions.

What a Real Workflow Looks Like

Consider an operations analyst with a surgery scheduling dashboard: one sheet has procedure codes and duration estimates, another has actual start/end times, a third has surgeon assignments. Calculating on-time performance by surgeon and procedure type is genuinely hard with formulas. It involves multiple VLOOKUP chains, intermediate calculation columns, and pivot tables that need manual refreshing.

With DuckDB:

SELECT 
  s.surgeon_name,
  p.procedure_type,
  COUNT(*) as total_cases,
  ROUND(
    100.0 * SUM(CASE WHEN a.start_time <= s.scheduled_start THEN 1 ELSE 0 END) / COUNT(*), 
    1
  ) as on_time_pct
FROM sheet_range('Actuals!A2:D500') a
JOIN sheet_range('Schedule!A2:C500') s ON a.case_id = s.case_id
JOIN sheet_range('Procedures!A2:B100') p ON s.procedure_code = p.code
GROUP BY s.surgeon_name, p.procedure_type
HAVING COUNT(*) >= 5
ORDER BY on_time_pct ASC

That's one query. No intermediate columns, no VLOOKUP chains, no manually refreshed pivot. The analyst asks for it, it runs against the current sheet data, and the results land in a summary sheet.

The formula equivalent would take an afternoon to build and another afternoon to debug when someone adds a surgeon with a name that has a comma in it.

DuckDB in Google Sheets adds real SQL capability — joins, window functions, aggregations — against live spreadsheet data. Google's 10M cell limit means you're not running petabyte queries, but you're also not exporting and reimporting CSVs. The embedded approach (where DuckDB reads ranges directly via a UDF) beats the export approach because data stays current. The practical constraints are inline result size (cap at ~200 rows and write larger outputs to a destination sheet) and fetch latency for large ranges.

ModelMonkey ships this as a built-in capability — the duckdb.query tool is available in the agent sidebar and works against any ranges in your spreadsheet. You describe the analysis you need, the agent writes and runs the SQL, and the results appear in your sheet.

Try ModelMonkey free for 14 days — it works in both Google Sheets and Excel.


Frequently Asked Questions