BigQuery SQL optimization
BigQuery is a fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. Despite its scalability, there are tips and tricks for writing BigQuery SQL that can further improve query performance.
1. Use standard SQL instead of legacy SQL
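For illustration, a minimal sketch of the syntactic difference, using the public `bigquery-public-data:samples.shakespeare` table:

```sql
-- Legacy SQL: square brackets and a colon-separated project qualifier
SELECT word FROM [bigquery-public-data:samples.shakespeare] LIMIT 10;

-- Standard SQL: backticks and a dot-separated qualifier
SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 10;
```

Standard SQL also unlocks features the rest of this document relies on, such as `SELECT * EXCEPT`, `ARRAY_AGG`, and SQL UDFs.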
2. Best practices for query structure
2.1 select necessary columns
- Original code
select * from dataset.table
- Optimized
select dim1, metric1 from dataset.table
- Reason
- Only select the columns necessary, especially in inner queries. SELECT * is cost-inefficient and may also hurt performance
- If the number of columns to return is large, consider using SELECT * EXCEPT to exclude unneeded columns
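A minimal sketch of `SELECT * EXCEPT` (the column names are hypothetical):

```sql
-- Return every column except two audit columns we don't need
SELECT * EXCEPT (created_at, updated_at)
FROM dataset.table;
```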
2.2 ORDER BY with LIMIT
- Original code
select t.dim1, t.dim2, t.metric1 from dataset.table t order by metric1 desc
- Optimized code
select t.dim1, t.dim2, t.metric1 from dataset.table t order by metric1 desc limit 1000
- Reason
- Writing results for a query with an ORDER BY clause can result in Resources Exceeded errors. Because the final sorting must be done on a single slot, if you are attempting to order a very large result set, the final sorting can overwhelm the slot that is processing the data
- If you are sorting a very large number of values, use a LIMIT clause.
2.3 WHERE clause: Expression order matters
- Original code
select text from stackoverflow.comments where text LIKE '%java%' AND user_display_name = 'anon'
- Optimized code
select text from stackoverflow.comments where user_display_name = 'anon' AND text LIKE '%java%'
- (The expression user_display_name = 'anon' filters out much more data than the expression text LIKE '%java%')
- Reason
- BigQuery assumes that the user has provided the best order of expressions in the WHERE clause and does not attempt to reorder them. Expressions in your WHERE clauses should be ordered with the most selective expression first
- The optimized example is faster because it does not execute the expensive LIKE expression on the entire column, but only on the comments from user 'anon'
3. Best practices for JOINS
3.1 WHERE, ASAP
- Original code
select t1.dim1, sum(t1.metric1) from dataset.table1 t1 left join dataset.table2 t2 on t1.dim1 = t2.dim1 where ifnull(t2.dim2, 'def') = 'abc' group by 1;
- Optimized code
select t1.dim1, sum(t1.metric1) from dataset.table1 t1 left join dataset.table2 t2 on t1.dim1 = t2.dim1 where t2.dim2 = 'abc' group by 1;
- Reason
- WHERE clauses should be executed as soon as possible, especially within joins, so the tables to be joined are as small as possible
- WHERE clauses may not always be necessary, as standard SQL will do its best to push down filters.
3.2 late aggregation
- Original code
select t1.dim1, sum(t1.m1), sum(t2.m2) from (select dim1, sum(metric1) m1 from dataset.table1 group by 1) t1 join (select dim1, sum(metric2) m2 from dataset.table2 group by 1) t2 on t1.dim1 = t2.dim1 group by 1;
- Optimized code
select t1.dim1, sum(t1.m1), sum(t2.m2) from (select dim1, metric1 m1 from dataset.table1) t1 join (select dim1, metric2 m2 from dataset.table2) t2 on t1.dim1 = t2.dim1 group by 1;
- Reason
- Aggregate as late and as seldom as possible, because aggregation is very costly
- BUT: if a table can be reduced drastically by aggregation in preparation for being joined, then aggregate it early
- Caution: with JOINs, this only works if the two tables are already aggregated to the same level (i.e., there is only one row for every join key value)
3.3 JOIN pattern
- Original code
select t1.dim1, sum(t1.metric1), sum(t2.metric2) from small_table t1 join large_table t2 on t1.dim1 = t2.dim1 where t1.dim1 = 'abc' group by 1;
- Optimized code
select t1.dim1, sum(t1.metric1), sum(t2.metric2) from large_table t2 join small_table t1 on t1.dim1 = t2.dim1 where t1.dim1 = 'abc' group by 1;
- Reason
- When you create a query by using a JOIN, consider the order in which you are merging the data. The standard SQL query optimizer can determine which table should be on which side of the join, but it is still recommended to order your joined tables appropriately
- The best practice is to place the largest table first, followed by the smallest, and then by decreasing size.
3.4 Filter before JOINs
- Original code
select t1.dim1, sum(t1.metric1) from dataset.table1 t1 left join dataset.table2 t2 on t1.dim1 = t2.dim1 where t2.dim2 = 'abc' group by 1;
- Optimized code
select t1.dim1, sum(t1.metric1) from dataset.table1 t1 left join dataset.table2 t2 on t1.dim1 = t2.dim1 where t1.dim2 = 'abc' AND t2.dim2 = 'abc' group by 1;
- Reason
- WHERE clauses should be executed as soon as possible, especially within joins, so the tables to be joined are as small as possible
- WHERE clauses may not always be necessary, as standard SQL will do its best to push down filters. Review the query explanation to see whether filtering happens as early as possible, and either fix the condition or use a subquery to filter in advance
4. Best practices for functions
4.1 case-insensitive search
- Original code
select count(*) c from dataset.table where lower(text) LIKE '%bigquery%' OR lower(text) LIKE '%big query%'
- Optimized code
select count(*) c from dataset.table where regexp_contains( text, '(?i)(bigquery|big query)' )
- Reason
- LOWER() and UPPER() operations have a hard time with Unicode text: each character must be mapped individually, and characters can be multi-byte
- Use REGEXP_CONTAINS() with the case-insensitive (?i) modifier in your regular expression to do case-insensitive searches. Multiple search values over the same field can be combined into a single regex
4.2 approximate functions
- Original code
select dim1, count(distinct dim2) from dataset.table group by 1;
- Optimized code
select dim1, approx_count_distinct(dim2) from dataset.table group by 1;
- Reason
- If the SQL aggregation function you're using has an equivalent approximation function, the approximation function will yield faster query performance
- Approximate functions produce a result that is generally within 1% of the exact number
4.3 latest record
- Original code
select * except(rn) from ( select *, row_number() over( partition by id order by created_at desc) rn from dataset.table t ) where rn = 1 order by created_at
- Optimized code
select event.* from ( select array_agg( t order by t.created_at desc limit 1 )[offset(0)] event from dataset.table t group by id ) order by created_at
- Reason
- Using the ROW_NUMBER() function can fail with Resources Exceeded errors as data volume grows if there are too many elements to ORDER BY in a single partition.
- Using ARRAY_AGG() in standard SQL allows the query to run more efficiently because the ORDER BY is allowed to drop everything except the top record on each GROUP BY.
4.4 NTILE functions
- Original code
select individual_id, ntile(3) over (order by sales desc) as sales_third from dataset.table
- Optimized code
with QuantInfo AS ( select o, qval from unnest(( select approx_quantiles(sales, 3) from dataset.table )) AS qval with offset o where o > 0 ) select individual_id, (select min(o) from QuantInfo where sales <= QuantInfo.qval) as sales_third from dataset.table
- Reason
- Using NTILE() function can fail with Resources Exceeded errors as data volume grows if there are too many elements to ORDER BY in a single partition
- Using APPROX_QUANTILES() in standard SQL allows the query to run more efficiently because it doesn't require a global ORDER BY for all rows in the table
4.5 SQL UDFs > JavaScript UDFs
- Original code
create temporary function multiply(x INT64, y INT64) returns INT64 language js as """ return x * y; """; select multiply(2, 2) as result;
- Optimized code
create temporary function multiply(x INT64, y INT64) as (x * y); select multiply(2, 2) as result;
- Reason
- JavaScript UDFs are a performance killer because BigQuery has to spin up a V8 subprocess to evaluate them
- Prefer SQL UDFs where possible
4.6 Persistent UDFs
- Original code
CREATE TEMP FUNCTION addFourAndDivide(x INT64, y INT64) AS ((x + 4) / y); WITH numbers AS (SELECT 1 as val UNION ALL SELECT 3 as val UNION ALL SELECT 4 as val UNION ALL SELECT 5 as val) SELECT val, addFourAndDivide(val, 2) AS result FROM numbers;
- Optimized code
WITH numbers AS (SELECT 1 as val UNION ALL SELECT 3 as val UNION ALL SELECT 4 as val UNION ALL SELECT 5 as val) SELECT val, your_project.your_dataset.addFourAndDivide(val, 2) AS result FROM numbers;
- Reason
- Create persistent user-defined SQL and JavaScript functions in a centralized BigQuery dataset; they can then be invoked across queries and in logical views
- Create org-wide libraries of business logic within shared datasets
5. Interpreting the query plan
- Significant difference between avg and max time
- Likely data skew: use APPROX_TOP_COUNT to check
- Filter early to work around it
- Most time spent reading from intermediate stages
- Consider filtering earlier in the query
- Most time spent on CPU tasks
- Consider approximate functions, inspect UDF usage, filter earlier
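A minimal skew check using APPROX_TOP_COUNT (dataset.table and join_key are hypothetical names): if a handful of values dominate the counts, the key is skewed and worth filtering or handling separately.

```sql
-- Show the 10 most frequent key values and their approximate counts
SELECT APPROX_TOP_COUNT(join_key, 10) AS top_keys
FROM dataset.table;
```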
6. Nest repeated data
- Customers often default to "flat" denormalization even if it is not the most beneficial
- Requires a GROUP BY to analyze data
- Example: Orders table with a row for each line item
- {order_id1, item_id1}, {order_id1, item_id2},...
- If you model one order per row and nest line items in a nested field, GROUP BY no longer required
- {order_id1, [ {item_id1}, {item_id2} ]}
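A sketch of the nested approach, assuming hypothetical flat and nested order tables: ARRAY_AGG with STRUCT builds one row per order with its line items nested, and UNNEST re-expands the array only when item-level detail is actually needed.

```sql
-- Build one row per order, with line items nested in an ARRAY of STRUCTs
SELECT
  order_id,
  ARRAY_AGG(STRUCT(item_id, quantity, price)) AS line_items
FROM dataset.flat_order_items
GROUP BY order_id;

-- Reading it back: UNNEST expands the nested field when needed
-- SELECT o.order_id, li.item_id
-- FROM dataset.nested_orders o, UNNEST(o.line_items) AS li;
```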
7. Join explosions
- Caused by JOIN with non-unique key on both sides
- SQL relational algebra gives cartesian product of rows which have the same join key
- Worst case: Number of output rows is number of rows in left table multiplied by number of rows in right table
- In extreme cases, query will not finish
- If job finishes then query explanation will show output rows versus input rows
- Confirm diagnosis by modifying query to print number of rows on each side of the JOIN grouped by the JOIN key
- Workaround is to use GROUP BY to pre-aggregate
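To confirm the diagnosis, a query along these lines (table and column names are placeholders) counts the rows per join key on each side; any key with a count above 1 on both sides multiplies rows in the output.

```sql
-- Row counts per join key on each side of the JOIN
SELECT 'left' AS side, dim1, COUNT(*) AS n FROM dataset.table1 GROUP BY 1, 2
UNION ALL
SELECT 'right', dim1, COUNT(*) FROM dataset.table2 GROUP BY 1, 2
ORDER BY n DESC;
```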
8. Skewed JOINS
- Dremel shuffles data on each side of the join
- All data with the same join key goes to the same shard
- Data can overload the shard
- Typically result from data skew
- Workarounds
- Pre-filter rows from query with the unbalanced key
- Potentially split into two queries
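One way to sketch the split, assuming the skew comes from a single hot key value (all names hypothetical): join the balanced keys normally, handle the overloaded key in a separate branch, and UNION the results.

```sql
-- Branch 1: join everything except the overloaded key
SELECT t1.dim1, t2.metric2
FROM dataset.table1 t1
JOIN dataset.table2 t2 ON t1.dim1 = t2.dim1
WHERE t1.dim1 != 'hot_key'

UNION ALL

-- Branch 2: the skewed key on its own (pre-aggregate here if possible)
SELECT t1.dim1, t2.metric2
FROM dataset.table1 t1
JOIN dataset.table2 t2 ON t1.dim1 = t2.dim1
WHERE t1.dim1 = 'hot_key';
```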