SQL Queries with DuckDB
Query your spreadsheet data and external databases using SQL
Overview
ModelMonkey includes a built-in SQL engine powered by DuckDB, allowing you to run SQL queries directly on your spreadsheet data. You can also query external databases (PostgreSQL, BigQuery) and even join data across multiple sources.
This is useful when you need precise control over data transformations, want to perform complex aggregations, or need to combine data from different sources in ways that natural language might not express clearly.
Tip
You don't need to write SQL yourself. Just describe what you want in plain English, and ModelMonkey will generate and execute the appropriate SQL query for you.
When SQL Queries Are Used
ModelMonkey automatically uses SQL queries when your request involves:
Complex Aggregations
Grouping data, calculating sums, averages, counts, or other statistical operations across many rows.
Filtering and Sorting
Finding specific records based on multiple conditions, or sorting large datasets by various criteria.
Joining Data
Combining data from your spreadsheet with external databases, or merging multiple sheets together.
Data Transformations
Reformatting, calculating new columns, or restructuring data in ways that require row-by-row processing.
You can also explicitly ask ModelMonkey to "use SQL" or "run a SQL query" if you prefer the precision of SQL for a particular operation.
Querying Spreadsheet Data
Your spreadsheet data is accessible through the sheet_range() function. This function takes an A1 notation reference and returns the data as a queryable table.
Important Requirements
- The first row of your range must contain column headers
- Column names are derived from the header row
- Empty headers result in columns named like "column_1", "column_2", etc.
Supported Range Formats
- Full sheet:
sheet_range('Sheet1!A:Z') - Specific range:
sheet_range('Sales!A1:D100') - Named ranges with special characters are automatically escaped
-- Query all data from a sheet
SELECT * FROM sheet_range('Sheet1!A1:D100')
-- Filter and aggregate
SELECT category, SUM(amount) as total
FROM sheet_range('Sales!A1:E500')
WHERE date >= '2024-01-01'
GROUP BY category
-- Sort by multiple columns
SELECT * FROM sheet_range('Customers!A:F')
ORDER BY region ASC, revenue DESCExample spreadsheet queries
Querying External Databases
If you have external data connections configured, you can query them using their connection name:
PostgreSQL
Reference tables as connection_name.schema.table. The default schema is usually "public".
BigQuery
Reference tables as connection_name.dataset.table.
External database queries work exactly like standard SQL. You can use all the features supported by DuckDB, including joins, subqueries, window functions, and CTEs.
Tip
Set up your external connections first in the External Data Connections panel before querying them.
-- Query a PostgreSQL table
SELECT * FROM my_postgres.public.customers
WHERE country = 'USA'
-- Query a BigQuery table
SELECT event_date, COUNT(*) as events
FROM my_bigquery.analytics.page_views
GROUP BY event_date
-- Join PostgreSQL with BigQuery
SELECT p.name, b.total_events
FROM my_postgres.public.products p
JOIN my_bigquery.analytics.product_stats b
ON p.id = b.product_idExample external database queries
Joining Spreadsheet and Database Data
One of the most powerful features is the ability to join data across different sources. You can combine spreadsheet data with external databases in a single query.
Common Use Cases
- Enrich spreadsheet records with customer data from your database
- Compare local tracking data against your data warehouse
- Validate spreadsheet entries against external reference tables
- Combine sales data from sheets with product catalog from database
-- Join spreadsheet orders with database customers
SELECT
s.order_id,
s.order_date,
s.amount,
c.customer_name,
c.email,
c.tier
FROM sheet_range('Orders!A1:D500') s
JOIN my_postgres.public.customers c
ON s.customer_id = c.id
WHERE s.amount > 1000
-- Cross-database aggregation
SELECT
c.country,
COUNT(*) as order_count,
SUM(s.amount) as total_sales
FROM sheet_range('Sales!A1:C1000') s
JOIN my_postgres.public.customers c
ON s.customer_id = c.id
GROUP BY c.country
ORDER BY total_sales DESCExample cross-source joins
Query Output Options
Query results can be delivered in two ways:
Inline Results (Default)
Results are returned directly to ModelMonkey for analysis and reasoning. This is useful for exploratory queries where you want to understand the data before deciding what to do with it.
- Returns up to 20 rows by default (configurable up to 100)
- Includes column types and metadata
- Ideal for quick data exploration
Sheet Output
Results are written directly to a new sheet in your spreadsheet. This is useful when you want to keep the query results for further work.
- Creates a new sheet with your specified name
- Includes all rows (no row limit)
- Header row added automatically
- Perfect for creating reports or staging data
Tip
Ask ModelMonkey to "save the results to a new sheet" or "write the output to a sheet called Results" to use sheet output mode.
Data Type Handling
ModelMonkey automatically infers column types from your spreadsheet data:
Supported Types
- Numbers: Integers and decimals are detected and treated as numeric types
- Booleans: TRUE/FALSE values are recognized as booleans
- Text: Everything else is treated as text (VARCHAR)
- Dates: Date values in cells are typically handled as text; use SQL date functions to parse them
Type Inference Process
ModelMonkey samples the first several rows of data to determine the appropriate type for each column. If a column contains mixed types, it defaults to text to avoid data loss.
-- Parse date strings in queries
SELECT
order_id,
strptime(order_date, '%Y-%m-%d') as parsed_date,
amount
FROM sheet_range('Orders!A1:C100')
WHERE strptime(order_date, '%Y-%m-%d') >= DATE '2024-01-01'
-- Cast numeric strings
SELECT
product_name,
CAST(price AS DOUBLE) as numeric_price
FROM sheet_range('Products!A1:B50')Working with data types
Performance Considerations
Query Execution
- Queries have a 30-second timeout
- Maximum result set is 16,384 rows for safety
- Large spreadsheet ranges are loaded into memory for querying
Best Practices
- Specify precise ranges instead of entire columns (e.g.,
A1:D500instead ofA:D) - Use WHERE clauses to filter early and reduce data processed
- For very large datasets, consider using sheet output to avoid memory issues
- Index large external database tables for faster joins
Caching
Query results are not cached between requests. Each query re-reads the current spreadsheet data, ensuring you always get the latest values.
Warning
Very large ranges (tens of thousands of rows) may slow down query execution. Consider filtering data or using external databases for large-scale analytics.
Supported SQL Features
ModelMonkey supports a wide range of SQL features through DuckDB:
Standard Operations
- SELECT, FROM, WHERE, ORDER BY, LIMIT
- GROUP BY with aggregate functions (SUM, COUNT, AVG, MIN, MAX, etc.)
- DISTINCT and HAVING clauses
- UNION, INTERSECT, EXCEPT
Joins
- INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN
- Cross joins and self joins
- Multiple table joins
Advanced Features
- Subqueries and CTEs (WITH clauses)
- Window functions (ROW_NUMBER, RANK, LAG, LEAD, etc.)
- CASE expressions
- String functions, date functions, mathematical functions
Not Supported
- INSERT, UPDATE, DELETE (queries are read-only)
- CREATE TABLE, ALTER TABLE (no schema modifications)
- Stored procedures or user-defined functions
Example Queries
Here are some practical examples of SQL queries you might use:
-- Monthly sales summary
SELECT
strftime(strptime(date, '%Y-%m-%d'), '%Y-%m') as month,
COUNT(*) as num_orders,
SUM(amount) as total_revenue,
AVG(amount) as avg_order_value
FROM sheet_range('Orders!A1:D1000')
GROUP BY month
ORDER BY month DESC
-- Find duplicate entries
SELECT email, COUNT(*) as count
FROM sheet_range('Contacts!A1:E500')
GROUP BY email
HAVING COUNT(*) > 1
-- Top customers by spending
SELECT
customer_id,
SUM(amount) as total_spent,
COUNT(*) as num_orders
FROM sheet_range('Orders!A1:D1000')
GROUP BY customer_id
ORDER BY total_spent DESC
LIMIT 10
-- Running total with window function
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date) as running_total
FROM sheet_range('Transactions!A1:C100')Practical SQL query examples
Troubleshooting
"Column not found" errors
- Check that your header row contains the expected column names
- Column names are case-sensitive
- Spaces in column names work but may need quoting in complex queries
"Invalid range" errors
- Verify the sheet name is spelled correctly
- Ensure the range exists and contains data
- Check for special characters in sheet names that need escaping
"Query timeout" errors
- Simplify your query or reduce the data range
- Add WHERE clauses to filter data earlier
- Consider using sheet output for large result sets
"Connection not found" errors
- Verify the external connection name is correct
- Check that the connection is active in the External Data Connections panel
- Ensure you have permission to access the external database
Empty or unexpected results
- Check data types match your WHERE conditions
- Verify date formats match what's in your spreadsheet
- Look for leading/trailing spaces in text data