banner website

BigQuery and Data Studio cost optimization

introduction

We want to share with you a cost optimization that you can do if you are using Data Studio on BigQuery (aka BQ). If you don’t care about your money, this article is not for you 😉

To use BQ, most of the time, you have loaded your data into BQ tables and run your queries on them. A best practice is to partition your data to be able to read data for a particular day or on a specific date range. It’s also an effective way to reduce your processing costs.

BQ have 2 main mechanisms to partition your table:

  • partitioned tables based on a dedicated pseudo column
  • sharded tables based on tables suffixes with a naming convention

Google recommends to use partitioned tables instead of sharded tables but this feature has still a beta status. Then most of the GCP products (GA 360, Firebase, …) use sharded tables to export data to BQ.

According to the documentation, you should use “View” to query your data from Data Studio. For example, to transform field integer to string representation with your business rule or something else. The problem with this approach is how Data Studio uses this view to filter the data to build your dashboards.

When you are using a real table to construct your dashboard, Data Studio is able to use BigQuery sharded table natively. But when you use a view based on sharded table, BQ is not able to do this.

Here are some tests that we have done to illustrate this.

Scenario: We have loaded data for only 2 days: on 25/03/2018 and on 29/03/2018. For our dashboard we choose a date range (from 23/03/2018 to 27/03/2018) to exclude the 29/03/2018 then we expect to read the data for only 1 day. On this specific day (25/03/2018), we only keep 100 differents visitors to limit the data processing.

tableau article technique

Test on a sharded table

Note the date range on top right corner in the following dashboard.

– Lien avec les campagnes publicitaires : ROI/campagne/source…

				
					SELECT
    APPROX_COUNT_DISTINCT(t0.user_id) AS t0_qt_eq1v6zjqlb
FROM(
    SELECT
    *
    FROM
    'op-rate-labs.test.myusers_*'
    WHERE
    _TABLE_SUFFIX BETWEEN '20180323'
    AND '20180327' AS t0
ORDER BY 
    t0_qt_eq1v6zjqlb ASC;
				
			

You can find these queries in BQ, in the query history section. Here you can see with the “_TABLE_SUFFIX” statement, how Data Studio does to limit the data processing. It uses the sharded table naming convention to scope to the expected date range.

Here you can see the query execution details. Notice that even we have only requested 339KB we will be billed for 10MB. This is the minimum amount billed per request. For information, at the writing time, the pricing is 5$/TB processed and the first TB is free. Then you can launch a lot of queries before paying something (100k free requests like this one).

Test on a view

Now we will use a copy of this dashboard but use a view instead of my sharded tables. This is a simple view that retrieves the same fields than our table.

Table
				
					SELECT
    APPROX_COUNT_DISTINCT(t0.user_id) AS t0_qt_quytwu7w1b
FROM
    'op-rate-labs.test.myusers_view_v1' AS t0
WHERE
    (FORMAT_TIMESTAMP ('%Y%m%d', TIMESTAMP(t0.Date)) >= '20180323'
    AND FORMAT_TIMESTAMP ('%Y%m%d', TIMESTAMP(t0.Date)) < = '20180327')
ORDER BY 
    t0_qt_quytwu7w1b ASC;
				
			

Then you can see that the request, built by Data Studio, doesn’t use the same WHERE clause to limit our data processing. It uses a simple field instead of table suffix mechanism.

Ligne de codage

Here you can see that BQ has read the data on 29/03/2018 that should be outside of our date range. Then you can also see that we will be billed for 41MB instead of 339KB because it read data on 29/03/2018 and because we don’t “clean” the data on this date. But you can imagine this processing if you have loaded a lot of data each day during several months… It will always scan all your sharded tables.

In fact, you have this amount of processing because BQ scans all the sharded tables and filters the data once it has read them. In the first case, it can filter the data before reading them which is very more efficient.

Conclusion

Don’t use views with Data Studio on sharded tables for the moment. If you really need a view, make your view on a partioned table. You can convert a sharded table into a partitioned table like this: 

https://cloud.google.com/bigquery/docs/creating-partitioned-tables#converting_dated_tables_into_a_partitioned_table

Note that this feature is relatively new…

Data Studio and partitioned tables are still in Beta then this behaviour can change in the future. This issue will be probably resolved in the next releases 😀

If you need any help to use your data properly, don’t hesitate to contact us

Nous rejoindre

Faites partis de l’aventure OP-Rate en rejoignant une équipe avec des valeurs fortes de sens. Vous voulez en savoir plus sur nos offres, rdv juste ici 👇

ARTICLES

contacter un expert