BigQuery SQL optimization

BigQuery is a fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. Even so, how you write your BigQuery SQL still matters: the tips and tricks below can further improve query performance.

1. Use standard SQL instead of legacy SQL
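
  • Example
    • A minimal illustration, using the public sample table bigquery-public-data.samples.shakespeare; the #legacySQL / #standardSQL directives select the dialect for a query
    • Legacy SQL (avoid): bracketed table names

      #legacySQL
      select word from [bigquery-public-data:samples.shakespeare] limit 10;

    • Standard SQL (preferred): backtick-quoted names, and the dialect all of the tips below assume

      #standardSQL
      select word from `bigquery-public-data.samples.shakespeare` limit 10;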

2. Best practices for query structure

2.1 select necessary columns

  • Original code
      select
        * 
      from 
        dataset.table
      
  • Optimized code
      select
        dim1,
        metric1
      from 
        dataset.table
      
  • Reason
    • Only select the columns you need, especially in inner queries. SELECT * is cost-inefficient and may also hurt performance
    • If the number of columns to return is large, consider using SELECT * EXCEPT to exclude unneeded columns, as sketched below
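    • For illustration, a minimal sketch of SELECT * EXCEPT (unused_col1 and unused_col2 are hypothetical column names):

      select
        * except(unused_col1, unused_col2)
      from
        dataset.table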

2.2 ORDER BY with LIMIT

  • Original code
      select
        t.dim1,
        t.dim2,
        t.metric1
      from 
        dataset.table t
      order by metric1 desc
      
  • Optimized code
      select
        t.dim1,
        t.dim2,
        t.metric1
      from 
        dataset.table t
      order by metric1 desc
      limit 1000
      
  • Reason
    • Writing results for a query with an ORDER BY clause can result in Resources Exceeded errors. Because the final sorting must be done on a single slot, if you are attempting to order a very large result set, the final sorting can overwhelm the slot that is processing the data
    • If you are sorting a very large number of values, use a LIMIT clause.

2.3 WHERE clause: Expression order matters

  • Original code

      select
        text
      from
        stackoverflow.comments
      where
        text LIKE '%java%'
        AND user_display_name = 'anon'
      
  • Optimized code

      select
        text
      from
        stackoverflow.comments
      where
        user_display_name = 'anon'
        AND text LIKE '%java%'
      

    (The expression user_display_name = 'anon' filters out much more data than text LIKE '%java%'.)

  • Reason

    • BigQuery assumes that the user has provided the best order of expressions in the WHERE clause and does not attempt to reorder them. Expressions in your WHERE clauses should be ordered with the most selective expression first
    • The optimized example is faster because it doesn't evaluate the expensive LIKE expression against the entire text column, but only against the rows from user 'anon'.

3. Best practices for JOINS

3.1 WHERE, ASAP

  • Original code
      select
        t1.dim1,
        sum(t1.metric1)
      from 
        dataset.table1 t1
      left join
        dataset.table2 t2
      on
        t1.dim1 = t2.dim1
      where ifnull(t2.dim2, 'def') = 'abc'
      group by 1;
      
  • Optimized code
      select
        t1.dim1,
        sum(t1.metric1)
      from 
        dataset.table1 t1
      left join
        dataset.table2 t2
      on
        t1.dim1 = t2.dim1
      where t2.dim2 = 'abc'
      group by 1;
      
  • Reason
    • WHERE clauses should be executed as soon as possible, especially within joins, so the tables being joined are as small as possible
    • Wrapping the filtered column in a function such as IFNULL can prevent the filter from being pushed down below the join; removing the wrapper lets table2 be filtered before it is joined
    • Explicit early filtering may not always be necessary, as standard SQL will do its best to push down filters.

3.2 late aggregation

  • Original code
      select
        t1.dim1,
        sum(t1.m1),
        sum(t2.m2)
      from (select
          dim1,
          sum(metric1) m1
        from dataset.table1 group by 1) t1
      join (select
          dim1,
          sum(metric2) m2
        from dataset.table2 group by 1) t2
      on t1.dim1 = t2.dim1
      group by 1;
      
  • Optimized code
      select
        t1.dim1,
        sum(t1.m1),
        sum(t2.m2)
      from (select
          dim1,
          metric1 m1
        from dataset.table1) t1
      join (select
          dim1,
          metric2 m2
        from dataset.table2) t2
      on t1.dim1 = t2.dim1
      group by 1;
      
  • Reason
    • Aggregate as late and as seldom as possible, because aggregation is very costly
    • BUT: if a table can be reduced drastically by aggregation in preparation for being joined, then aggregate it early
    • Caution: with JOINs, this only works if the two tables are already aggregated to the same level (i.e., if there is only one row for every join key value)

3.3 JOIN pattern

  • Original code
      select
        t1.dim1,
        sum(t1.metric1),
        sum(t2.metric2)
      from
        small_table t1
      join
        large_table t2
      on
        t1.dim1 = t2.dim1
      where t1.dim1 = 'abc'
      group by 1;
      
  • Optimized code
      select
        t1.dim1,
        sum(t1.metric1),
        sum(t2.metric2)
      from
        large_table t2
      join
        small_table t1
      on
        t1.dim1 = t2.dim1
      where t1.dim1 = 'abc'
      group by 1;
      
  • Reason
    • When you create a query by using a JOIN, consider the order in which you are merging the data. The standard SQL query optimizer can determine which table should be on which side of the join, but it is still recommended to order your joined tables appropriately
    • The best practice is to place the largest table first, followed by the smallest, and then by decreasing size.

3.4 Filter before JOINs

  • Original code
      select
        t1.dim1,
        sum(t1.metric1)
      from
        dataset.table1 t1
      left join
        dataset.table2 t2
      on
        t1.dim1 = t2.dim1
      where t2.dim2 = 'abc'
      group by 1;
      
  • Optimized code
      select
        t1.dim1,
        sum(t1.metric1)
      from
        dataset.table1 t1
      left join
        dataset.table2 t2
      on
        t1.dim1 = t2.dim1
      where t1.dim2 = 'abc' AND t2.dim2 = 'abc'
      group by 1;
      
  • Reason
    • WHERE clauses should be executed as soon as possible, especially within joins, so the tables to be joined are as small as possible
    • WHERE clauses may not always be necessary, as standard SQL will do its best to push down filters. Review the query plan to see if filtering is happening as early as possible, and either fix the condition or use a subquery to filter in advance

4. Best practices for functions

4.1 case-insensitive search

  • Original code
      select
        count(*) c
      from
        dataset.table
      where
        lower(text) LIKE '%bigquery%' OR
        lower(text) LIKE '%big query%'
      
  • Optimized code
      select
        count(*) c
      from
        dataset.table
      where
        regexp_contains(
          text, '(?i)(bigquery|big query)'
        )
      
  • Reason
    • LOWER() and UPPER() have a hard time with Unicode text: each character needs to be mapped individually, and characters can be multibyte
    • Use REGEXP_CONTAINS() and add the case-insensitive (?i) modifier to your regular expression to do case-insensitive searches. Multiple search values over the same field can be combined into a single regex

4.2 approximate functions

  • Original code
      select
        dim1,
        count(distinct dim2)
      from
        dataset.table
      group by 1;
      
  • Optimized code
      select
        dim1,
        approx_count_distinct(dim2)
      from
        dataset.table
      group by 1;
      
  • Reason
    • If the SQL aggregation function you're using has an equivalent approximation function, the approximation function will yield faster query performance
    • Approximate functions produce a result that is typically within about 1% of the exact number

4.3 latest record

  • Original code
      select
        * except(rn)
      from (
        select *,
          row_number() over(
            partition by id
            order by created_at desc) rn
        from
          dataset.table t
      )
      where rn = 1
      order by created_at
      
  • Optimized code
      select
        event.*
      from (
        select array_agg(
          t order by t.created_at desc limit 1
        )[offset(0)] event
        from
          dataset.table t
        group by
          id
      )
      order by created_at
      
  • Reason
    • Using the ROW_NUMBER() function can fail with Resources Exceeded errors as data volume grows if there are too many elements to ORDER BY in a single partition.
    • Using ARRAY_AGG() in standard SQL allows the query to run more efficiently because the ORDER BY is allowed to drop everything except the top record on each GROUP BY.

4.4 NTILE functions

  • Original code
      select
        individual_id,
        ntile(3) over
          (order by sales desc) as sales_third
      from
        dataset.table
      
  • Optimized code
      with QuantInfo AS (
        select
          o, qval
        from unnest((
          select APPROX_QUANTILES(sales, 3)
          from dataset.table
        )) AS qval
        with offset o
        where o > 0
      )
      select
        individual_id,
        (select
          MIN(o)
        from QuantInfo
        where t.sales <= QuantInfo.qval
        ) as sales_third
      from dataset.table t
      
  • Reason
    • Using the NTILE() function can fail with Resources Exceeded errors as data volume grows, if there are too many elements to ORDER BY in a single partition
    • Using APPROX_QUANTILES() in standard SQL allows the query to run more efficiently because it doesn't require a global ORDER BY for all rows in the table

4.5 SQL UDFs > JavaScript UDFs

  • Original code
      create temporary function
        multiply(x INT64, y INT64)
      returns INT64
      language js
      as """
        return x * y;
      """;
      select multiply(2, 2) as result;
      
  • Optimized code
      create temporary function
        multiply(x INT64, y INT64)
      as
        (x * y);
      select multiply(2, 2) as result;
      
  • Reason
    • JavaScript UDFs are a performance killer because BigQuery has to spin up a V8 subprocess to evaluate them.
    • Prefer SQL UDFs where possible

4.6 Persistent UDFs

  • Original code
      CREATE TEMP FUNCTION addFourAndDivide(x INT64, y INT64) AS ((x + 4) / y);
      WITH numbers AS
        (SELECT 1 as val
        UNION ALL
        SELECT 3 as val
        UNION ALL
        SELECT 4 as val
        UNION ALL
        SELECT 5 as val)
      SELECT val, addFourAndDivide(val, 2) AS result
      FROM numbers;
      
  • Optimized code
      WITH numbers AS
        (SELECT 1 as val
        UNION ALL
        SELECT 3 as val
        UNION ALL
        SELECT 4 as val
        UNION ALL
        SELECT 5 as val)
      SELECT val,
        your_project.your_dataset.addFourAndDivide(val, 2) AS result
      FROM numbers;
      
  • Reason
    • Create persistent user-defined SQL and JavaScript functions in a centralized BigQuery dataset; they can then be invoked across queries and in logical views (a sketch of the one-time DDL follows this list)
    • Create org-wide libraries of business logic within shared datasets
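
  • Example
    • A minimal sketch of the one-time DDL that the optimized query assumes has already been run (your_project.your_dataset is the placeholder carried over from the example above):

      CREATE OR REPLACE FUNCTION
        `your_project.your_dataset.addFourAndDivide`(x INT64, y INT64)
      AS ((x + 4) / y);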

5. Interpreting the query plan

  • Significant difference between avg and max time
    • Probably data skew - use APPROX_TOP_COUNT to check (see the sketch after this list)
    • Filter early to work around it
  • Most time spent reading from intermediate stages
    • Consider filtering earlier in the query
  • Most time spent on CPU tasks
    • Consider approximate functions, inspect UDF usage, filter earlier
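
  • Sketch of the skew check (dataset.table and join_key are placeholder names): APPROX_TOP_COUNT returns the most frequent values with their counts, and one key dominating the rest indicates skew

      select
        approx_top_count(join_key, 10) as top_keys
      from
        dataset.table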

6. Nest repeated data

  • Customers often default to "flat" denormalization even when it is not the most beneficial model
    • Requires a GROUP BY to analyze data
  • Example: Orders table with a row for each line item
    • {order_id1, item_id1}, {order_id1, item_id2},...
  • If you model one order per row and nest line items in a repeated field, the GROUP BY is no longer required (see the sketch after this list)
    • {order_id1, [ {item_id1}, {item_id2} ]}
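
  • Sketch of both steps (dataset.flat_orders and dataset.nested_orders are hypothetical tables with the order_id / item_id fields above)

      -- build the nested model once: one row per order
      select
        order_id,
        array_agg(struct(item_id)) as line_items
      from dataset.flat_orders
      group by order_id;

      -- analyzing nested orders needs no GROUP BY:
      -- UNNEST expands the repeated line_items field
      select
        order_id,
        item.item_id
      from dataset.nested_orders, unnest(line_items) as item;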

7. Join explosions

  • Caused by JOIN with non-unique key on both sides
  • SQL relational algebra gives the Cartesian product of rows that share the same join key
    • Worst case: Number of output rows is number of rows in left table multiplied by number of rows in right table
    • In extreme cases, query will not finish
  • If job finishes then query explanation will show output rows versus input rows
  • Confirm the diagnosis by modifying the query to print the number of rows on each side of the JOIN, grouped by the join key (see the sketch after this list)
  • The workaround is to use GROUP BY to pre-aggregate, so the join key is unique on at least one side
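
  • Sketch of the per-key diagnosis (dataset.table1, dataset.table2 and join_key are placeholder names): keys with large counts on both sides are the ones that explode

      select 'left' as side, join_key, count(*) as n
      from dataset.table1
      group by join_key
      union all
      select 'right', join_key, count(*)
      from dataset.table2
      group by join_key
      order by n desc;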

8. Skewed JOINS

  • Dremel shuffles data on each side of the join
    • All data with the same join key goes to the same shard
    • Data can overload the shard
  • Typically result from data skew
  • Workarounds
    • Pre-filter rows with the unbalanced key early in the query
    • Potentially split into two queries, one for the hot key and one for the rest (see the sketch after this list)
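
  • Sketch of the split (dataset.table1, dataset.table2 and their columns are placeholders, and 'hot_key' stands for the known unbalanced value)

      -- query 1: the skewed key on its own
      select t1.dim1, t2.metric2
      from dataset.table1 t1
      join dataset.table2 t2 on t1.dim1 = t2.dim1
      where t1.dim1 = 'hot_key';

      -- query 2: the remaining keys shuffle evenly
      select t1.dim1, t2.metric2
      from dataset.table1 t1
      join dataset.table2 t2 on t1.dim1 = t2.dim1
      where t1.dim1 != 'hot_key';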
