Optimizing BigQuery Table Partitions

Different partitioning strategies within BigQuery have various pros and cons. A typical strategy is to partition by DAY on the pseudo column _PARTITIONTIME. For context, _PARTITIONTIME contains the UTC day on which the partitioned data was loaded. In many cases, this is a good default. However, there may be cases where you want to optimize a table beyond the Google defaults.
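For example, a query against a table partitioned on _PARTITIONTIME can limit the data scanned (and therefore the cost) by filtering on that pseudo column. A sketch, using the table and columns from the example later in this post:

```sql
-- Only the partitions for the listed days are scanned,
-- which reduces the bytes processed and therefore cost.
SELECT
  campaign_name,
  SUM(clicks) AS total_clicks
FROM
  `my-google-project.mws.amzadvertising_sp_productads_v5`
WHERE
  _PARTITIONTIME BETWEEN TIMESTAMP('2021-05-01') AND TIMESTAMP('2021-05-31')
GROUP BY
  campaign_name;
```

Without the _PARTITIONTIME filter, the same query would scan the full table.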

For example, if you have a high volume of loads for a common data source, using a different strategy can lower query costs. Be aware that some partitioning strategies, while improving costs, may increase the complexity of your queries.

There is no universal approach for all use cases. In this post, we detail how to migrate from one type of partitioned table to another.

Before proceeding, please review the documentation from Google on the topic: https://cloud.google.com/bigquery/docs/partitioned-tables

Convert Partitioning On An Existing BigQuery Table

In our example, we will show you how to convert partitioning from _PARTITIONTIME to a different field, ob_date. We will use an Amazon Advertising table called amzadvertising_sp_productads_v5 for reference.

Step 1. Back-Up Your SOURCE Table

The first step is to make sure you back up your table. If something goes wrong, having a backup will be critical. We suggest you place the file(s) in an organized, easy-to-understand structure within Google Cloud Storage. For example, here is a structure:

  • gs://<bucket-name>/<parent-directory>/<child-directory>/<file-name>

Using the data source name usually makes sense for these. For example, you create a bucket called “openbridge-exports.” Next, you define a parent directory called amazonadvertising. The child directory would be the specific table name for the source, so use amzadvertising_sp_productads. Lastly, the file name should be a combination of the child and the export date.

  • gs://openbridge-exports/amazonadvertising/amzadvertising_sp_productads_v5/amzadvertising_sp_productads_v5_06012021

Once you have your structure planned out, you are ready to back up your table:

  • Go to the BigQuery WebUI.
  • Select the amzadvertising_sp_productads_v5 table for export.
  • Click on Export Table in the top-right.
  • Select the Export format (CSV) and Compression (GZIP).
  • Adjust the Google Cloud Storage path to match the bucket, directories, and file name you want to use.
  • Click OK and wait for the job to complete.
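If you prefer SQL over the UI, the same export can be sketched with BigQuery's EXPORT DATA statement. The bucket path below reuses the example structure from above; adjust it to your own project:

```sql
-- Export the table as GZIP-compressed CSV to Cloud Storage.
-- The wildcard (*) lets BigQuery shard large exports across multiple files.
EXPORT DATA OPTIONS (
  uri = 'gs://openbridge-exports/amazonadvertising/amzadvertising_sp_productads_v5/amzadvertising_sp_productads_v5_06012021_*.csv.gz',
  format = 'CSV',
  compression = 'GZIP',
  overwrite = false,
  header = true
) AS
SELECT * FROM `my-google-project.mws.amzadvertising_sp_productads_v5`;
```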

Once you verify the export completed, proceed to Step 2.

Step 2: Create Temp Table Based On SOURCE

After the data is safely backed up to Cloud Storage, we need to create an empty TEMP table reflecting the table you want to convert.

The TEMP table should follow the same schema definition as your SOURCE table. For amzadvertising_sp_productads_v5, the schema and create table definition would be as follows.

The BigQuery UI allows you to inspect the schema, so you have access to all the columns and data types needed. Just follow the formatting for each in our example CREATE TABLE statement:

CREATE TABLE
  `my-google-project.mws.amzadvertising_sp_productads_TEMP01` (
    ad_group_id INT64,
    ad_group_name STRING,
    ad_id INT64,
    asin STRING,
    attributed_conversions14d INT64,
    attributed_conversions14d_same_sku INT64,
    attributed_conversions1d INT64,
    attributed_conversions1d_same_sku INT64,
    attributed_conversions30d INT64,
    attributed_conversions30d_same_sku INT64,
    attributed_conversions7d INT64,
    attributed_conversions7d_same_sku INT64,
    attributed_sales14d FLOAT64,
    attributed_sales14d_same_sku FLOAT64,
    attributed_sales1d FLOAT64,
    attributed_sales1d_same_sku FLOAT64,
    attributed_sales30d FLOAT64,
    attributed_sales30d_same_sku FLOAT64,
    attributed_sales7d FLOAT64,
    attributed_sales7d_same_sku FLOAT64,
    attributed_units_ordered14d INT64,
    attributed_units_ordered14d_same_sku INT64,
    attributed_units_ordered1d INT64,
    attributed_units_ordered1d_same_sku INT64,
    attributed_units_ordered30d INT64,
    attributed_units_ordered30d_same_sku INT64,
    attributed_units_ordered7d INT64,
    attributed_units_ordered7d_same_sku INT64,
    campaign_id INT64,
    campaign_name STRING,
    clicks INT64,
    cost FLOAT64,
    currency STRING,
    impressions INT64,
    profile_id INT64,
    sku STRING,
    ob_date DATE,
    ob_transaction_id STRING,
    ob_file_name STRING,
    ob_processed_at STRING,
    ob_modified_date DATETIME
  )
PARTITION BY
  ob_date;

Notice we are using ob_date as the PARTITION BY field. BigQuery should confirm the table was created, and you should see it in your tables list.
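To double-check which column the new TEMP table is partitioned on, you can query the dataset's INFORMATION_SCHEMA. A sketch:

```sql
-- is_partitioning_column = 'YES' identifies the partition column.
SELECT
  column_name,
  is_partitioning_column
FROM
  `my-google-project.mws.INFORMATION_SCHEMA.COLUMNS`
WHERE
  table_name = 'amzadvertising_sp_productads_TEMP01'
  AND is_partitioning_column = 'YES';
```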

Step 3: Load TEMP Table From SOURCE

Next, we want to load all the data from SOURCE into TEMP. To do this, we SELECT everything and set TEMP as the target destination.

SELECT * FROM `my-google-project.mws.amzadvertising_sp_productads_v5`

Don’t run it yet! Go to Query Settings as you need to set a target destination for the results (“Set a destination table for query results”).

Follow the process they define: https://cloud.google.com/bigquery/docs/writing-results#writing_large_results_using_legacy_sql
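Alternatively, the UI's destination-table setting can be sketched directly in SQL with an INSERT that targets the TEMP table:

```sql
-- Copy every row from SOURCE into the newly partitioned TEMP table.
-- Rows land in partitions according to their ob_date values.
INSERT INTO `my-google-project.mws.amzadvertising_sp_productads_TEMP01`
SELECT * FROM `my-google-project.mws.amzadvertising_sp_productads_v5`;
```

This works here because TEMP was created with the same schema (and column order) as SOURCE.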

Once complete, the process will load everything from the SOURCE table to TEMP. Spot check the table details to make sure the size of SOURCE and TEMP are the same. Also, it would be best if you ran some test queries to confirm parity between TEMP and SOURCE.
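A simple parity check is to compare row counts between the two tables:

```sql
-- source_rows and temp_rows should match exactly.
SELECT
  (SELECT COUNT(*)
   FROM `my-google-project.mws.amzadvertising_sp_productads_v5`) AS source_rows,
  (SELECT COUNT(*)
   FROM `my-google-project.mws.amzadvertising_sp_productads_TEMP01`) AS temp_rows;
```

For a stronger check, group both counts by ob_date and compare day by day.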

Step 4: Delete the Source Table

Did you confirm TEMP and SOURCE are the same? Good. BigQuery does not allow you to rename a table, so we need to delete SOURCE so we can recreate it based on TEMP. See this doc for more info: https://cloud.google.com/bigquery/docs/managing-tables
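The delete can also be done in SQL. Be certain your backup and TEMP checks passed before running this:

```sql
-- This permanently removes the table. BigQuery's time travel may allow
-- recovery for a limited window, but do not rely on it -- keep the backup.
DROP TABLE `my-google-project.mws.amzadvertising_sp_productads_v5`;
```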

Step 5: Copy TEMP into New SOURCE

In the BigQuery UI, expand your project and dataset, then select the newly created amzadvertising_sp_productads_TEMP01 table, which holds all your data.

In the details panel, click Copy table. In the Copy table dialog, under Destination, set the Project name and Dataset name. Since we are recreating the original table in place, keep both TEMP and the new SOURCE in the same project and dataset.

For Table name, enter a name for the new table. In this case, we will want to use the original table name we deleted: amzadvertising_sp_productads_v5.

All set? Click Copy to start the copy job.
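The copy job can also be expressed as DDL, a sketch using BigQuery's CREATE TABLE ... COPY statement:

```sql
-- Recreate the original table name as a copy of TEMP.
-- The copy preserves the ob_date partitioning.
CREATE TABLE `my-google-project.mws.amzadvertising_sp_productads_v5`
COPY `my-google-project.mws.amzadvertising_sp_productads_TEMP01`;
```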

Confirm that your new amzadvertising_sp_productads_v5 matches your TEMP table amzadvertising_sp_productads_TEMP01. Run the same checks as Step 3 to verify everything looks good. Feel free to delete TEMP when you are confident everything is aligned.

That is it! You can check billing to see whether the new PARTITION strategy yields cost improvements.


Optimizing BigQuery Table Partitions was originally published in Openbridge on Medium, where people are continuing the conversation by highlighting and responding to this story.


