Optimizing BigQuery Queries for Performance

Optimizing BigQuery queries is crucial for reducing execution times and costs. This blog post will explore key parameters for query optimization and discuss specific techniques to enhance query performance.

Key Parameters for Query Optimization

When optimizing BigQuery queries, focus on the following metrics:

Bytes Scanned: The amount of data read from storage
Slot Time: The total time slots spend processing the query
Bytes Shuffled: The amount of data exchanged between slots
Query Execution Time: The overall time taken to complete the query
Number of Rows Processed: The total rows handled by the query

There are some standard optimisation we are usually aware of

Selecting only required columns - It can reduce bytes scanned.
Filtering data as much as possible early if it is possible - It reduces processing time
Limit doesn't bring huge advantage as main query is fully executed.

Join Order Optimization

When joining multiple tables, follow these best practices:

Place the largest table first in the JOIN query
Follow with the smallest table
Continue in decreasing order of table size

This approach optimizes the join operation, especially for broadcast joins where BigQuery distributes smaller tables to slots processing larger tables.

Pattern Matching vs. Exact Matching

While pattern matching using LIKE operators may seem convenient, exact matching often yields better performance. Consider the following example:

Original Query (Pattern Matching with regexp_contains):

SELECT
  year,
  COUNT(*)
FROM
  `bigquery-public-data.noaa_gsod.gsod*`
WHERE
regexp_contains(year, '.*11.*')
GROUP BY
  year

Results:

Rows processed: 3.82 Million Records
Bytes processed: 1,003.28 MB
Slot Time: 11 seconds
Bytes shuffled: 89 B
Query execution time: 5 seconds

Improved query with `like`:

SELECT
  year,
  COUNT(*)
FROM
  `bigquery-public-data.noaa_gsod.gsod*`
WHERE
  year like '%11%'
GROUP BY
  year

Results:

Rows processed: 3.82 Million Records
Bytes processed: 1,003.28 MB
Slot Time: 8 seconds
Bytes shuffled: 89 B
Query execution time: 3 seconds

Optimized Query (Exact Matching):

SELECT
  year,
  COUNT(*)
FROM
  `bigquery-public-data.noaa_gsod.gsod*`
WHERE
  year = '2011'
GROUP BY
  year

Results:

Rows processed: 3.82 Million Records
Bytes processed: 1,003.28 MB
Slot Time: 1 seconds
Bytes shuffled: 89 B
Query execution time: 5 seconds

While the execution time increased slightly, the optimized query significantly reduced slot time. This demonstrates that exact matching can be possibly more efficient than pattern matching when possible. On a small data set, we won't see significant difference but on large table scans, exact matching would be beneficial.

Performance considerations:

Prefer EXACT matching as it results in faster query execution
LIKE is generally faster than REGEXP_CONTAINS
Using LIKE with a wildcard at the end (e.g., 'abc%') is SARGable[1] and can perform a range seek on an index
For simple pattern matching, prefer LIKE over REGEXP_CONTAINS
Filter data as much as possible before applying pattern matching

Sorting Optimisations

ORDER BY is computationally expensive, so use it judiciously and only when necessary for your specific use case.

Place ORDER BY at the end of complex queries involving multiple subqueries5. Sorting the final result set is more efficient than sorting intermediate results.
Be cautious when using ORDER BY on large datasets, as it can lead to "resources exceeded" errors. This occurs because final sorting must be done on a single slot.
If possible, apply filters (WHERE clauses) before sorting to reduce the data volume.
When possible, leverage BigQuery's automatic sorting capabilities by using clustered tables based on common query patterns

Adding Primary Key and Foreign Keys

Partitioning and Clustering

References

[1] Sargable is a term used in relational databases to describe a query condition that can use an index to speed up query execution. It's an abbreviation for "Search ARGument ABLE".

Optimizing BigQuery Queries for Performance

Key Parameters for Query Optimization

Join Order Optimization

Pattern Matching vs. Exact Matching

Original Query (Pattern Matching with regexp_contains):

Improved query with like:

Optimized Query (Exact Matching):

Performance considerations:

Sorting Optimisations

Adding Primary Key and Foreign Keys

Partitioning and Clustering

References

Improved query with `like`: