Published on

BigQuery Query Optimisations

Authors

Optimizing BigQuery Queries for Performance

Optimizing BigQuery queries is crucial for reducing execution times and costs. This blog post will explore key parameters for query optimization and discuss specific techniques to enhance query performance.

Key Parameters for Query Optimization

When optimizing BigQuery queries, focus on the following metrics:

  1. Bytes Scanned: The amount of data read from storage
  2. Slot Time: The total time slots spend processing the query
  3. Bytes Shuffled: The amount of data exchanged between slots
  4. Query Execution Time: The overall time taken to complete the query
  5. Number of Rows Processed: The total rows handled by the query

There are some standard optimisation we are usually aware of

  • Selecting only required columns - It can reduce bytes scanned.
  • Filtering data as much as possible early if it is possible - It reduces processing time
  • Limit doesn't bring huge advantage as main query is fully executed.

Join Order Optimization

When joining multiple tables, follow these best practices:

  1. Place the largest table first in the JOIN query
  2. Follow with the smallest table
  3. Continue in decreasing order of table size

This approach optimizes the join operation, especially for broadcast joins where BigQuery distributes smaller tables to slots processing larger tables.

Pattern Matching vs. Exact Matching

While pattern matching using LIKE operators may seem convenient, exact matching often yields better performance. Consider the following example:

Original Query (Pattern Matching with regexp_contains):

SELECT
  year,
  COUNT(*)
FROM
  `bigquery-public-data.noaa_gsod.gsod*`
WHERE
regexp_contains(year, '.*11.*')
GROUP BY
  year

Results:

  • Rows processed: 3.82 Million Records
  • Bytes processed: 1,003.28 MB
  • Slot Time: 11 seconds
  • Bytes shuffled: 89 B
  • Query execution time: 5 seconds

Improved query with like:

SELECT
  year,
  COUNT(*)
FROM
  `bigquery-public-data.noaa_gsod.gsod*`
WHERE
  year like '%11%'
GROUP BY
  year

Results:

  • Rows processed: 3.82 Million Records
  • Bytes processed: 1,003.28 MB
  • Slot Time: 8 seconds
  • Bytes shuffled: 89 B
  • Query execution time: 3 seconds

Optimized Query (Exact Matching):

SELECT
  year,
  COUNT(*)
FROM
  `bigquery-public-data.noaa_gsod.gsod*`
WHERE
  year = '2011'
GROUP BY
  year

Results:

  • Rows processed: 3.82 Million Records
  • Bytes processed: 1,003.28 MB
  • Slot Time: 1 seconds
  • Bytes shuffled: 89 B
  • Query execution time: 5 seconds

While the execution time increased slightly, the optimized query significantly reduced slot time. This demonstrates that exact matching can be possibly more efficient than pattern matching when possible. On a small data set, we won't see significant difference but on large table scans, exact matching would be beneficial.

Performance considerations:

  • Prefer EXACT matching as it results in faster query execution

  • LIKE is generally faster than REGEXP_CONTAINS

  • Using LIKE with a wildcard at the end (e.g., 'abc%') is SARGable[1] and can perform a range seek on an index

  • For simple pattern matching, prefer LIKE over REGEXP_CONTAINS

  • Filter data as much as possible before applying pattern matching

Sorting Optimisations

ORDER BY is computationally expensive, so use it judiciously and only when necessary for your specific use case.

  • Place ORDER BY at the end of complex queries involving multiple subqueries5. Sorting the final result set is more efficient than sorting intermediate results.
  • Be cautious when using ORDER BY on large datasets, as it can lead to "resources exceeded" errors. This occurs because final sorting must be done on a single slot.
  • If possible, apply filters (WHERE clauses) before sorting to reduce the data volume.
  • When possible, leverage BigQuery's automatic sorting capabilities by using clustered tables based on common query patterns

Adding Primary Key and Foreign Keys

Partitioning and Clustering

References

[1] Sargable is a term used in relational databases to describe a query condition that can use an index to speed up query execution. It's an abbreviation for "Search ARGument ABLE".