Data Management · Intermediate · 5 min read

How to Connect DuckDB to Google Sheets

Query Google Sheets data with DuckDB SQL using Python and the Sheets API. Run joins, aggregations, and GROUP BY on live spreadsheet data in under 50 lines of code.

Google Sheets is where your data lives. DuckDB is where serious SQL analysis happens. This guide connects them — so you can run GROUP BY, window functions, and multi-table joins against your live spreadsheet data, then write the results back to a new sheet. By the end, you'll have a working Python script that reads from Sheets into an in-memory DuckDB instance, queries it, and pushes results back.

What You'll Need

  • Python 3.9+ installed locally
  • A Google Cloud project with the Sheets API enabled
  • A Google Service Account with a downloaded JSON credentials file
  • Basic familiarity with SQL and running Python scripts from a terminal
  • A Google Sheet with tabular data (column headers in row 1)

Step-by-Step Guide

Step 1: Install Python Dependencies

Install the three packages this pipeline depends on: duckdb for the SQL engine, gspread for the Sheets API wrapper, and google-auth for service account authentication.

```bash
pip install duckdb gspread google-auth
```

  • duckdb 0.10+ is recommended — earlier versions lack some CSV parsing improvements
  • gspread 6.x changed its auth API; confirm you're on 6.x with pip show gspread
  • google-auth handles the OAuth token refresh automatically — you don't need to manage tokens manually

Pro Tip

Use a virtual environment (python -m venv venv && source venv/bin/activate) to keep these dependencies isolated from your system Python.

Step 2: Create a Google Service Account

A service account lets your script authenticate as a non-human identity that can read and write spreadsheets without user interaction — essential for scripts running on schedules or servers.

  • Go to Google Cloud Console → IAM & Admin → Service Accounts
  • Click Create Service Account, name it something like duckdb-sheets-connector
  • Skip the optional IAM role assignment — permissions are controlled at the spreadsheet level
  • Under the service account's Keys tab, click Add Key → Create new key → JSON
  • Save the downloaded .json file as credentials.json in your project directory

Pro Tip

Never commit credentials.json to a git repository. Add it to .gitignore immediately.

Step 3: Share Your Spreadsheet with the Service Account

The service account has an email address (visible in the JSON file under client_email, format: name@project-id.iam.gserviceaccount.com). Google Sheets treats it like any other user.

  • Open the spreadsheet you want to query
  • Click Share, paste the service account email address
  • Grant Editor access (required if you want to write results back; Viewer works for read-only)
  • Copy the spreadsheet ID from the URL — it's the long string between /d/ and /edit
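
Both identifiers above can be pulled out programmatically rather than by hand. A small sketch — the `credentials.json` filename comes from Step 2, and the helper names are illustrative, not part of gspread:

```python
import json
import re

def service_account_email(path="credentials.json"):
    """Read the client_email field from a service-account key file."""
    with open(path) as f:
        return json.load(f)["client_email"]

def spreadsheet_id(url):
    """Extract the spreadsheet ID between /d/ and the next slash."""
    match = re.search(r"/d/([a-zA-Z0-9_-]+)", url)
    if match is None:
        raise ValueError(f"No spreadsheet ID found in {url!r}")
    return match.group(1)
```

For example, `spreadsheet_id("https://docs.google.com/spreadsheets/d/abc123XYZ/edit#gid=0")` returns `"abc123XYZ"`.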

Step 4: Load Sheets Data into DuckDB

Google Sheets returns data as a 2D array of strings. The code below fetches a named worksheet, converts it to a list of dicts using the first row as headers, then registers it as a DuckDB relation.

```python
import duckdb
import gspread
import pyarrow as pa  # needed to hand the records to DuckDB: pip install pyarrow
from google.oauth2.service_account import Credentials

SCOPES = ["https://www.googleapis.com/auth/spreadsheets"]
SPREADSHEET_ID = "your-spreadsheet-id-here"
WORKSHEET_NAME = "Sheet1"

# Authenticate and open the spreadsheet
creds = Credentials.from_service_account_file("credentials.json", scopes=SCOPES)
gc = gspread.authorize(creds)
sheet = gc.open_by_key(SPREADSHEET_ID).worksheet(WORKSHEET_NAME)

# Fetch all records as a list of dicts (row 1 becomes the keys)
records = sheet.get_all_records()

# Load into DuckDB: register the Arrow table as a queryable relation
con = duckdb.connect()
con.register("sheet_data", pa.Table.from_pylist(records))
```

  • get_all_records() returns every row as a dict — a 1,000-row sheet takes roughly 300ms over the API
  • DuckDB's register() makes the data queryable without copying it to disk
  • Column types are inferred from the data; numeric-looking strings become DOUBLE automatically
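
For reference, `get_all_records()` is essentially this transformation — a simplified stdlib sketch (gspread additionally coerces numeric-looking strings), useful if you fetch raw values with `get_all_values()` instead:

```python
def rows_to_records(values):
    """Turn a 2D array into a list of dicts, using row 1 as the headers."""
    headers, *rows = values
    return [dict(zip(headers, row)) for row in rows]

raw = [
    ["region", "revenue"],
    ["EMEA", "1200"],
    ["APAC", "900"],
]
print(rows_to_records(raw))
# → [{'region': 'EMEA', 'revenue': '1200'}, {'region': 'APAC', 'revenue': '900'}]
```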

Pro Tip

For sheets with more than 5,000 rows, consider batching with sheet.get('A1:Z5000') and sheet.get('A5001:Z10000') to stay within the Sheets API's per-request size limits.
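
One way to generate those batch ranges, assuming your data fits within columns A–Z (a sketch; the helper is not part of gspread):

```python
def batch_ranges(total_rows, batch=5000, first_col="A", last_col="Z"):
    """Yield A1-style ranges covering total_rows in fixed-size chunks."""
    start = 1
    while start <= total_rows:
        end = min(start + batch - 1, total_rows)
        yield f"{first_col}{start}:{last_col}{end}"
        start = end + 1

print(list(batch_ranges(12000)))
# → ['A1:Z5000', 'A5001:Z10000', 'A10001:Z12000']
```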

Step 5: Run SQL Queries Against Your Sheets Data

Once the data is registered, you write standard SQL. DuckDB supports window functions, CTEs, LATERAL joins, and most PostgreSQL-compatible syntax — none of which exist in Sheets formulas.

```python
# Aggregation example: total revenue by region
result = con.execute("""
    SELECT
        region,
        COUNT(*) AS deal_count,
        SUM(CAST(revenue AS DOUBLE)) AS total_revenue,
        AVG(CAST(revenue AS DOUBLE)) AS avg_deal_size
    FROM sheet_data
    WHERE status = 'Closed Won'
    GROUP BY region
    ORDER BY total_revenue DESC
""").fetchdf()
print(result)

  • DuckDB processes 10 million rows in under 2 seconds on a laptop — significantly faster than a pivot table on the same data
  • fetchdf() returns a pandas DataFrame (install pandas if you don't already have it); use fetchall() for a plain list of tuples
  • String columns from Sheets often need explicit CAST — Sheets doesn't enforce types, so revenue might arrive as "$12,000" rather than 12000

Pro Tip

Clean dirty Sheets data in SQL using REPLACE(REPLACE(revenue, '$', ''), ',', '') before casting. Easier than cleaning it in the sheet itself.

Step 6: Write Query Results Back to Google Sheets

After running your analysis, push the result DataFrame back to a new or existing worksheet. The pattern below clears the target sheet and rewrites it entirely — reliable for scheduled jobs, though not ideal for sheets people are actively editing.

```python
# Create or clear the output worksheet
spreadsheet = gc.open_by_key(SPREADSHEET_ID)
try:
    output_sheet = spreadsheet.worksheet("DuckDB Output")
    output_sheet.clear()
except gspread.exceptions.WorksheetNotFound:
    output_sheet = spreadsheet.add_worksheet(
        title="DuckDB Output", rows=1000, cols=20
    )

# Write headers + data in a single update call
headers = result.columns.tolist()
rows = result.fillna("").values.tolist()
output_sheet.update([headers] + rows)
```

  • update() accepts a 2D list and writes it starting from A1
  • fillna("") converts NaN to empty string — the Sheets API rejects Python's NaN type
  • The Sheets API quota is 300 write requests per minute per project; for large result sets, use batch_update() instead
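
The NaN scrub can also be done without pandas. A stdlib sketch of what `fillna("")` accomplishes here (the helper name is illustrative):

```python
import math

def sheets_safe(rows):
    """Replace float NaN with '' so the Sheets API accepts the payload."""
    return [["" if isinstance(v, float) and math.isnan(v) else v for v in row]
            for row in rows]

print(sheets_safe([["EMEA", 1200.0], ["APAC", float("nan")]]))
# → [['EMEA', 1200.0], ['APAC', '']]
```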

Step 7: Automate the Pipeline

A one-off script is useful. A scheduled one is a data pipeline. Wrap the logic in a function and run it on a timer (`pip install schedule`) to keep your output sheet fresh. This reuses `sheet`, `con`, and `output_sheet` from the earlier steps.

```python
import time
from datetime import datetime

import pyarrow as pa
import schedule

def refresh_analysis():
    # Re-fetch the sheet and refresh the registered relation
    records = sheet.get_all_records()
    con.register("sheet_data", pa.Table.from_pylist(records))
    result = con.execute("""
        SELECT region, SUM(revenue) AS total
        FROM sheet_data
        GROUP BY region
    """).fetchdf()
    output_sheet.clear()
    output_sheet.update([result.columns.tolist()] + result.fillna("").values.tolist())
    print(f"Refreshed at {datetime.now()}")

schedule.every(30).minutes.do(refresh_analysis)

while True:
    schedule.run_pending()
    time.sleep(60)
```

  • Run this on a small VM (a $5/month instance handles this easily) to keep your Sheets dashboard live
  • According to Google's Sheets API documentation, read requests are capped at 300 per minute per project — a 30-minute refresh interval keeps you well within quota
  • For production use, add error handling around the API calls and log failures to a separate monitoring sheet
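
A minimal sketch of that error handling — a generic retry wrapper for flaky API calls (the function name and defaults are illustrative, not from any library):

```python
import time

def with_retries(fn, attempts=3, delay=2.0):
    """Call fn(), retrying on any exception; re-raise after the final attempt."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

# Usage: records = with_retries(lambda: sheet.get_all_records())
```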

Pro Tip

Store the spreadsheet ID and worksheet names in environment variables (os.environ.get("SPREADSHEET_ID")) rather than hardcoding them. Makes the script reusable across projects.

Wrapping Up

You've wired DuckDB to Google Sheets: fetching live data via the Sheets API, loading it into an in-memory DuckDB instance, running SQL that Sheets formulas can't touch, and writing results back. The same pattern scales to multiple worksheets, cross-sheet joins, and scheduled pipelines that keep a reporting tab current without any manual exports.

The friction this exposes is real: every time your source sheet changes schema (someone renames a column, adds a tab), your SQL breaks silently. That's the gap between ad hoc analysis and a proper data pipeline. If you're hitting that problem repeatedly across a complex model, it's worth looking at ModelMonkey — it keeps the DuckDB query layer inside Sheets where your collaborators can see what's running.

Frequently Asked Questions

Do I need to publish my Google Sheet to the web to query it with DuckDB?

No. The service account approach authenticates via OAuth — your sheet stays private and shared only with the service account email. Publishing to the web is only needed if you're using DuckDB's `read_csv` function to fetch a public export URL directly, which skips authentication entirely but exposes your data publicly.

How many rows can DuckDB handle from Google Sheets?

DuckDB itself handles hundreds of millions of rows in memory (limited by RAM, not DuckDB). The practical bottleneck is the Sheets API: `get_all_records()` on a 50,000-row sheet takes 3-8 seconds and hits payload limits. Google's own documentation notes that Sheets caps at 10 million cells per spreadsheet. For sheets above ~20,000 rows, batch your `get()` calls or export to CSV first, then use `duckdb.read_csv()` directly.

Can I join data from two different Google Sheets in a single DuckDB query?

Yes. Fetch each sheet into a separate list of dicts, register each as a different DuckDB relation (`con.register("orders", ...)`, `con.register("customers", ...)`), then join them with standard SQL. DuckDB handles the join in memory — no database server needed.

What happens if a column in my sheet has mixed types (some cells numbers, some text)?

DuckDB infers column types from a sample of the data. If a column has mixed types, it defaults to `VARCHAR`. Cast explicitly in your SQL with `TRY_CAST(revenue AS DOUBLE)` — `TRY_CAST` returns NULL instead of erroring on unconvertible values, which is safer for messy Sheets data.

Is there a way to query Google Sheets with DuckDB without Python?

DuckDB's `read_csv` function can fetch any public CSV URL directly: `SELECT * FROM read_csv('https://docs.google.com/spreadsheets/d/{ID}/export?format=csv')`. This requires the sheet to be published or shared publicly. For private sheets, Python with a service account is the only supported path as of March 2026.