BigQuery and Data Studio cost optimization
We want to share with you a cost optimization that you can do if you are using Data Studio on BigQuery (aka BQ). If you don’t care about your money, this article is not for you 😉
To use BQ, most of the time, you have loaded your data into BQ tables and run your queries on them. A best practice is to partition your data to be able to read data for a particular day or on a specific date range. It’s also an effective way to reduce your processing costs.
BQ have 2 main mechanisms to partition your table:
- partitioned tables based on a dedicated pseudo column
- sharded tables based on tables suffixes with a naming convention
Google recommends to use partitioned tables instead of sharded tables but this feature has still a beta status. Then most of the GCP products (GA 360, Firebase, …) use sharded tables to export data to BQ.
According to the documentation, you should use “View” to query your data from Data Studio. For example, to transform field integer to string representation with your business rule or something else. The problem with this approach is how Data Studio uses this view to filter the data to build your dashboards.
When you are using a real table to construct your dashboard, Data Studio is able to use BigQuery sharded table natively. But when you use a view based on sharded table, BQ is not able to do this.
Here are some tests that we have done to illustrate this.
Scenario: We have loaded data for only 2 days: on 25/03/2018 and on 29/03/2018. For our dashboard we choose a date range (from 23/03/2018 to 27/03/2018) to exclude the 29/03/2018 then we expect to read the data for only 1 day. On this specific day (25/03/2018), we only keep 100 differents visitors to limit the data processing.
Test on a sharded table
Note the date range on top right corner in the following dashboard.
– Lien avec les campagnes publicitaires : ROI/campagne/source…
SELECT APPROX_COUNT_DISTINCT(t0.user_id) AS t0_qt_eq1v6zjqlb FROM( SELECT * FROM 'op-rate-labs.test.myusers_*' WHERE _TABLE_SUFFIX BETWEEN '20180323' AND '20180327' AS t0 ORDER BY t0_qt_eq1v6zjqlb ASC;
You can find these queries in BQ, in the query history section. Here you can see with the “_TABLE_SUFFIX” statement, how Data Studio does to limit the data processing. It uses the sharded table naming convention to scope to the expected date range.
Here you can see the query execution details. Notice that even we have only requested 339KB we will be billed for 10MB. This is the minimum amount billed per request. For information, at the writing time, the pricing is 5$/TB processed and the first TB is free. Then you can launch a lot of queries before paying something (100k free requests like this one).
Test on a view
Now we will use a copy of this dashboard but use a view instead of my sharded tables. This is a simple view that retrieves the same fields than our table.
SELECT APPROX_COUNT_DISTINCT(t0.user_id) AS t0_qt_quytwu7w1b FROM 'op-rate-labs.test.myusers_view_v1' AS t0 WHERE (FORMAT_TIMESTAMP ('%Y%m%d', TIMESTAMP(t0.Date)) >= '20180323' AND FORMAT_TIMESTAMP ('%Y%m%d', TIMESTAMP(t0.Date)) < = '20180327') ORDER BY t0_qt_quytwu7w1b ASC;
Then you can see that the request, built by Data Studio, doesn’t use the same WHERE clause to limit our data processing. It uses a simple field instead of table suffix mechanism.
Here you can see that BQ has read the data on 29/03/2018 that should be outside of our date range. Then you can also see that we will be billed for 41MB instead of 339KB because it read data on 29/03/2018 and because we don’t “clean” the data on this date. But you can imagine this processing if you have loaded a lot of data each day during several months… It will always scan all your sharded tables.
In fact, you have this amount of processing because BQ scans all the sharded tables and filters the data once it has read them. In the first case, it can filter the data before reading them which is very more efficient.
Don’t use views with Data Studio on sharded tables for the moment. If you really need a view, make your view on a partioned table. You can convert a sharded table into a partitioned table like this:
Note that this feature is relatively new…
Data Studio and partitioned tables are still in Beta then this behaviour can change in the future. This issue will be probably resolved in the next releases 😀
If you need any help to use your data properly, don’t hesitate to contact us
Faites partis de l’aventure OP-Rate en rejoignant une équipe avec des valeurs fortes de sens. Vous voulez en savoir plus sur nos offres, rdv juste ici 👇
Déployer des modèles de Machine Learning grâce à Vertex AI Introduction Ça y est, vous avez un modèle de machine learning qui tourne sur votre
Licence open source, entre stratégie et enjeux, que faut-il savoir pour faire le bon choix ? Depuis 2017, OP-Rate, Centre de compétences Cloud et Big Data
Détection d’intrusion dans un environnement Kubernetes avec FALCO Introduction Selon le rapport du CESIN de Janvier 2022, 1 entreprise sur 2 a été victime d’une cyberattaque en
Traitement et Analyse de données : Clickhouse & Apache Superset® introduction Le flux de données collectées de nos jours est de plus en plus conséquent,
Le couple Garden.io & Kubernetes simplifie l’automatisation de vos déploiements introduction Quelle équipe de développement ne s’est jamais confrontée à la complexité de Kubernetes et Helm ?