- Published on
BigQuery Query Optimisations
- Authors
- Name
- Shubham Jain
- https://x.com/shubhrjain
Optimizing BigQuery Queries for Performance
Optimizing BigQuery queries is crucial for reducing execution times and costs. This blog post will explore key parameters for query optimization and discuss specific techniques to enhance query performance.
Key Parameters for Query Optimization
When optimizing BigQuery queries, focus on the following metrics:
- Bytes Scanned: The amount of data read from storage
- Slot Time: The total time slots spend processing the query
- Bytes Shuffled: The amount of data exchanged between slots
- Query Execution Time: The overall time taken to complete the query
- Number of Rows Processed: The total rows handled by the query
There are some standard optimisation we are usually aware of
- Selecting only required columns - It can reduce bytes scanned.
- Filtering data as much as possible early if it is possible - It reduces processing time
- Limit doesn't bring huge advantage as main query is fully executed.
Join Order Optimization
When joining multiple tables, follow these best practices:
- Place the largest table first in the JOIN query
- Follow with the smallest table
- Continue in decreasing order of table size
This approach optimizes the join operation, especially for broadcast joins where BigQuery distributes smaller tables to slots processing larger tables.
Pattern Matching vs. Exact Matching
While pattern matching using LIKE operators may seem convenient, exact matching often yields better performance. Consider the following example:
Original Query (Pattern Matching with regexp_contains):
SELECT
year,
COUNT(*)
FROM
`bigquery-public-data.noaa_gsod.gsod*`
WHERE
regexp_contains(year, '.*11.*')
GROUP BY
year
Results:
- Rows processed: 3.82 Million Records
- Bytes processed: 1,003.28 MB
- Slot Time: 11 seconds
- Bytes shuffled: 89 B
- Query execution time: 5 seconds
like
:
Improved query with SELECT
year,
COUNT(*)
FROM
`bigquery-public-data.noaa_gsod.gsod*`
WHERE
year like '%11%'
GROUP BY
year
Results:
- Rows processed: 3.82 Million Records
- Bytes processed: 1,003.28 MB
- Slot Time: 8 seconds
- Bytes shuffled: 89 B
- Query execution time: 3 seconds
Optimized Query (Exact Matching):
SELECT
year,
COUNT(*)
FROM
`bigquery-public-data.noaa_gsod.gsod*`
WHERE
year = '2011'
GROUP BY
year
Results:
- Rows processed: 3.82 Million Records
- Bytes processed: 1,003.28 MB
- Slot Time: 1 seconds
- Bytes shuffled: 89 B
- Query execution time: 5 seconds
While the execution time increased slightly, the optimized query significantly reduced slot time. This demonstrates that exact matching can be possibly more efficient than pattern matching when possible. On a small data set, we won't see significant difference but on large table scans, exact matching would be beneficial.
Performance considerations:
Prefer EXACT matching as it results in faster query execution
LIKE is generally faster than REGEXP_CONTAINS
Using LIKE with a wildcard at the end (e.g., 'abc%') is SARGable[1] and can perform a range seek on an index
For simple pattern matching, prefer LIKE over REGEXP_CONTAINS
Filter data as much as possible before applying pattern matching
Sorting Optimisations
ORDER BY is computationally expensive, so use it judiciously and only when necessary for your specific use case.
- Place ORDER BY at the end of complex queries involving multiple subqueries5. Sorting the final result set is more efficient than sorting intermediate results.
- Be cautious when using ORDER BY on large datasets, as it can lead to "resources exceeded" errors. This occurs because final sorting must be done on a single slot.
- If possible, apply filters (WHERE clauses) before sorting to reduce the data volume.
- When possible, leverage BigQuery's automatic sorting capabilities by using clustered tables based on common query patterns
Adding Primary Key and Foreign Keys
Partitioning and Clustering
References
[1] Sargable is a term used in relational databases to describe a query condition that can use an index to speed up query execution. It's an abbreviation for "Search ARGument ABLE".