BigQuery SQL optimization

BigQuery is a fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. Even so, how you write your BigQuery SQL still matters: the tips and tricks below can further improve query performance.

1. Use standard SQL instead of legacy SQL
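
  • Example
    • A minimal illustration, using the public sample table bigquery-public-data.samples.shakespeare; the #legacySQL / #standardSQL directives select the dialect for a query
    • Legacy SQL (avoid): bracketed table names

      #legacySQL
      select word from [bigquery-public-data:samples.shakespeare] limit 10;

    • Standard SQL (preferred): backtick-quoted names, and the dialect all of the tips below assume

      #standardSQL
      select word from `bigquery-public-data.samples.shakespeare` limit 10;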

2. Best practices for query structure

2.1 select necessary columns

  • Original code
      select
        * 
      from 
        dataset.table
      
  • Optimized code
      select
        dim1,
        metric1
      from 
        dataset.table
      
  • Reason
    • Only select the columns you need, especially in inner queries. SELECT * is cost-inefficient and may also hurt performance
    • If the number of columns to return is large, consider using SELECT * EXCEPT to exclude unneeded columns, as sketched below
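    • For illustration, a minimal sketch of SELECT * EXCEPT (unused_col1 and unused_col2 are hypothetical column names):

      select
        * except(unused_col1, unused_col2)
      from
        dataset.table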

2.2 ORDER BY with LIMIT

  • Original code
      select
        t.dim1,
        t.dim2,
        t.metric1
      from 
        dataset.table t
      order by metric1 desc
      
  • Optimized code
      select
        t.dim1,
        t.dim2,
        t.metric1
      from 
        dataset.table t
      order by metric1 desc
      limit 1000
      
  • Reason
    • Writing results for a query with an ORDER BY clause can result in Resources Exceeded errors. Because the final sorting must be done on a single slot, if you are attempting to order a very large result set, the final sorting can overwhelm the slot that is processing the data
    • If you are sorting a very large number of values, use a LIMIT clause.

2.3 WHERE clause: Expression order matters

  • Original code

      select
        text
      from
        stackoverflow.comments
      where
        text LIKE '%java%'
        AND user_display_name = 'anon'
      
  • Optimized code

      select
        text
      from
        stackoverflow.comments
      where
        user_display_name = 'anon'
        AND text LIKE '%java%'
      

    (The expression user_display_name = 'anon' filters out much more data than text LIKE '%java%'.)

  • Reason

    • BigQuery assumes that the user has provided the best order of expressions in the WHERE clause and does not attempt to reorder them. Expressions in your WHERE clauses should be ordered with the most selective expression first
    • The optimized example is faster because it doesn't evaluate the expensive LIKE expression against the entire text column, but only against the rows from user 'anon'.

3. Best practices for JOINS

3.1 WHERE, ASAP

  • Original code
      select
        t1.dim1,
        sum(t1.metric1)
      from 
        dataset.table1 t1
      left join
        dataset.table2 t2
      on
        t1.dim1 = t2.dim1
      where ifnull(t2.dim2, 'def') = 'abc'
      group by 1;
      
  • Optimized code
      select
        t1.dim1,
        sum(t1.metric1)
      from 
        dataset.table1 t1
      left join
        dataset.table2 t2
      on
        t1.dim1 = t2.dim1
      where t2.dim2 = 'abc'
      group by 1;
      
  • Reason
    • WHERE clauses should be executed as soon as possible, especially within joins, so the tables being joined are as small as possible
    • Wrapping the filtered column in a function such as IFNULL can prevent the filter from being pushed down below the join; removing the wrapper lets table2 be filtered before it is joined
    • Explicit early filtering may not always be necessary, as standard SQL will do its best to push down filters.

3.2 late aggregation

  • Original code
      select
        t1.dim1,
        sum(t1.m1),
        sum(t2.m2)
      from (select
          dim1,
          sum(metric1) m1
        from dataset.table1 group by 1) t1
      join (select
          dim1,
          sum(metric2) m2
        from dataset.table2 group by 1) t2
      on t1.dim1 = t2.dim1
      group by 1;
      
  • Optimized code
      select
        t1.dim1,
        sum(t1.m1),
        sum(t2.m2)
      from (select
          dim1,
          metric1 m1
        from dataset.table1) t1
      join (select
          dim1,
          metric2 m2
        from dataset.table2) t2
      on t1.dim1 = t2.dim1
      group by 1;
      
  • Reason
    • Aggregate as late and as seldom as possible, because aggregation is very costly
    • BUT: if a table can be reduced drastically by aggregation in preparation for being joined, then aggregate it early
    • Caution: with JOINs, this only works if the two tables are already aggregated to the same level (i.e., if there is only one row for every join key value)

3.3 JOIN pattern

  • Original code
      select
        t1.dim1,
        sum(t1.metric1),
        sum(t2.metric2)
      from
        small_table t1
      join
        large_table t2
      on
        t1.dim1 = t2.dim1
      where t1.dim1 = 'abc'
      group by 1;
      
  • Optimized code
      select
        t1.dim1,
        sum(t1.metric1),
        sum(t2.metric2)
      from
        large_table t2
      join
        small_table t1
      on
        t1.dim1 = t2.dim1
      where t1.dim1 = 'abc'
      group by 1;
      
  • Reason
    • When you create a query by using a JOIN, consider the order in which you are merging the data. The standard SQL query optimizer can determine which table should be on which side of the join, but it is still recommended to order your joined tables appropriately
    • The best practice is to place the largest table first, followed by the smallest, and then by decreasing size.

3.4 Filter before JOINs

  • Original code
      select
        t1.dim1,
        sum(t1.metric1)
      from
        dataset.table1 t1
      left join
        dataset.table2 t2
      on
        t1.dim1 = t2.dim1
      where t2.dim2 = 'abc'
      group by 1;
      
  • Optimized code
      select
        t1.dim1,
        sum(t1.metric1)
      from
        dataset.table1 t1
      left join
        dataset.table2 t2
      on
        t1.dim1 = t2.dim1
      where t1.dim2 = 'abc' AND t2.dim2 = 'abc'
      group by 1;
      
  • Reason
    • WHERE clauses should be executed as soon as possible, especially within joins, so the tables to be joined are as small as possible
    • WHERE clauses may not always be necessary, as standard SQL will do its best to push down filters. Review the query plan to see if filtering is happening as early as possible, and either fix the condition or use a subquery to filter in advance

4. Best practices for functions

4.1 case-insensitive search

  • Original code
      select
        count(*) c
      from
        dataset.table
      where
        lower(text) LIKE '%bigquery%' OR
        lower(text) LIKE '%big query%'
      
  • Optimized code
      select
        count(*) c
      from
        dataset.table
      where
        regexp_contains(
          text, '(?i)(bigquery|big query)'
        )
      
  • Reason
    • LOWER() and UPPER() have a hard time with Unicode text: each character needs to be mapped individually, and characters can be multibyte
    • Use REGEXP_CONTAINS() and add the case-insensitive (?i) modifier to your regular expression to do case-insensitive searches. Multiple search values over the same field can be combined into a single regex

4.2 approximate functions

  • Original code
      select
        dim1,
        count(distinct dim2)
      from
        dataset.table
      group by 1;
      
  • Optimized code
      select
        dim1,
        approx_count_distinct(dim2)
      from
        dataset.table
      group by 1;
      
  • Reason
    • If the SQL aggregation function you're using has an equivalent approximation function, the approximation function will yield faster query performance
    • Approximate functions produce a result that is typically within about 1% of the exact number

4.3 latest record

  • Original code
      select
        * except(rn)
      from (
        select *,
          row_number() over(
            partition by id
            order by created_at desc) rn
        from
          dataset.table t
      )
      where rn = 1
      order by created_at
      
  • Optimized code
      select
        event.*
      from (
        select array_agg(
          t order by t.created_at desc limit 1
        )[offset(0)] event
        from
          dataset.table t
        group by
          id
      )
      order by created_at
      
  • Reason
    • Using the ROW_NUMBER() function can fail with Resources Exceeded errors as data volume grows if there are too many elements to ORDER BY in a single partition.
    • Using ARRAY_AGG() in standard SQL allows the query to run more efficiently because the ORDER BY is allowed to drop everything except the top record on each GROUP BY.

4.4 NTILE functions

  • Original code
      select
        individual_id,
        ntile(3) over
          (order by sales desc) as sales_third
      from
        dataset.table
      
  • Optimized code
      with QuantInfo AS (
        select
          o, qval
        from unnest((
          select APPROX_QUANTILES(sales, 3)
          from dataset.table
        )) AS qval
        with offset o
        where o > 0
      )
      select
        individual_id,
        (select
          MIN(o)
        from QuantInfo
        where t.sales <= QuantInfo.qval
        ) as sales_third
      from dataset.table t
      
  • Reason
    • Using the NTILE() function can fail with Resources Exceeded errors as data volume grows, if there are too many elements to ORDER BY in a single partition
    • Using APPROX_QUANTILES() in standard SQL allows the query to run more efficiently because it doesn't require a global ORDER BY for all rows in the table

4.5 SQL UDFs > JavaScript UDFs

  • Original code
      create temporary function
        multiply(x INT64, y INT64)
      returns INT64
      language js
      as """
        return x * y;
      """;
      select multiply(2, 2) as result;
      
  • Optimized code
      create temporary function
        multiply(x INT64, y INT64)
      as
        (x * y);
      select multiply(2, 2) as result;
      
  • Reason
    • JavaScript UDFs are a performance killer because BigQuery has to spin up a V8 subprocess to evaluate them.
    • Prefer SQL UDFs where possible

4.6 Persistent UDFs

  • Original code
      CREATE TEMP FUNCTION addFourAndDivide(x INT64, y INT64) AS ((x + 4) / y);
      WITH numbers AS
        (SELECT 1 as val
        UNION ALL
        SELECT 3 as val
        UNION ALL
        SELECT 4 as val
        UNION ALL
        SELECT 5 as val)
      SELECT val, addFourAndDivide(val, 2) AS result
      FROM numbers;
      
  • Optimized code
      WITH numbers AS
        (SELECT 1 as val
        UNION ALL
        SELECT 3 as val
        UNION ALL
        SELECT 4 as val
        UNION ALL
        SELECT 5 as val)
      SELECT val,
        your_project.your_dataset.addFourAndDivide(val, 2) AS result
      FROM numbers;
      
  • Reason
    • Create persistent user-defined SQL and JavaScript functions in a centralized BigQuery dataset; they can then be invoked across queries and in logical views (a sketch of the one-time DDL follows this list)
    • Create org-wide libraries of business logic within shared datasets
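
  • Example
    • A minimal sketch of the one-time DDL that the optimized query assumes has already been run (your_project.your_dataset is the placeholder carried over from the example above):

      CREATE OR REPLACE FUNCTION
        `your_project.your_dataset.addFourAndDivide`(x INT64, y INT64)
      AS ((x + 4) / y);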

5. Interpreting the query plan

  • Significant difference between avg and max time
    • Probably data skew - use APPROX_TOP_COUNT to check (see the sketch after this list)
    • Filter early to work around it
  • Most time spent reading from intermediate stages
    • Consider filtering earlier in the query
  • Most time spent on CPU tasks
    • Consider approximate functions, inspect UDF usage, filter earlier
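
  • Sketch of the skew check (dataset.table and join_key are placeholder names): APPROX_TOP_COUNT returns the most frequent values with their counts, and one key dominating the rest indicates skew

      select
        approx_top_count(join_key, 10) as top_keys
      from
        dataset.table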

6. Nest repeated data

  • Customers often default to "flat" denormalization even when it is not the most beneficial model
    • Requires a GROUP BY to analyze data
  • Example: Orders table with a row for each line item
    • {order_id1, item_id1}, {order_id1, item_id2},...
  • If you model one order per row and nest line items in a repeated field, the GROUP BY is no longer required (see the sketch after this list)
    • {order_id1, [ {item_id1}, {item_id2} ]}
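
  • Sketch of both steps (dataset.flat_orders and dataset.nested_orders are hypothetical tables with the order_id / item_id fields above)

      -- build the nested model once: one row per order
      select
        order_id,
        array_agg(struct(item_id)) as line_items
      from dataset.flat_orders
      group by order_id;

      -- analyzing nested orders needs no GROUP BY:
      -- UNNEST expands the repeated line_items field
      select
        order_id,
        item.item_id
      from dataset.nested_orders, unnest(line_items) as item;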

7. Join explosions

  • Caused by JOIN with non-unique key on both sides
  • SQL relational algebra gives the Cartesian product of rows that share the same join key
    • Worst case: Number of output rows is number of rows in left table multiplied by number of rows in right table
    • In extreme cases, query will not finish
  • If job finishes then query explanation will show output rows versus input rows
  • Confirm the diagnosis by modifying the query to print the number of rows on each side of the JOIN, grouped by the join key (see the sketch after this list)
  • The workaround is to use GROUP BY to pre-aggregate, so the join key is unique on at least one side
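
  • Sketch of the per-key diagnosis (dataset.table1, dataset.table2 and join_key are placeholder names): keys with large counts on both sides are the ones that explode

      select 'left' as side, join_key, count(*) as n
      from dataset.table1
      group by join_key
      union all
      select 'right', join_key, count(*)
      from dataset.table2
      group by join_key
      order by n desc;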

8. Skewed JOINS

  • Dremel shuffles data on each side of the join
    • All data with the same join key goes to the same shard
    • Data can overload the shard
  • Typically result from data skew
  • Workarounds
    • Pre-filter rows with the unbalanced key early in the query
    • Potentially split into two queries, one for the hot key and one for the rest (see the sketch after this list)
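
  • Sketch of the split (dataset.table1, dataset.table2 and their columns are placeholders, and 'hot_key' stands for the known unbalanced value)

      -- query 1: the skewed key on its own
      select t1.dim1, t2.metric2
      from dataset.table1 t1
      join dataset.table2 t2 on t1.dim1 = t2.dim1
      where t1.dim1 = 'hot_key';

      -- query 2: the remaining keys shuffle evenly
      select t1.dim1, t2.metric2
      from dataset.table1 t1
      join dataset.table2 t2 on t1.dim1 = t2.dim1
      where t1.dim1 != 'hot_key';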
