Google Cloud Platform
1. Dive into BigQuery
2. An idea for ETL - DataPrep
Paweł Mitruś
Warsaw, 5th April 2018
About me
• Team Manager @
• Almost 4 years with Data
• Triathlon freak
• Soft skills optimist
• Human brain explorer
Agenda
BigQuery (~45 min)
1. What BQ is for
2. Pricing
3. BQ in action
• Using
• Loading
• Querying
4. Best practices
5. Limits
6. Summary
DataPrep (~15 min)
1. What DataPrep is for
2. Pricing
3. Wrangling data with DataPrep
4. Creating / scheduling Flows
5. Limits
6. Summary
What BQ is for - Overview BigQuery
• Storage usage – based on data consumption
• Query – BQ estimates how many slots (CPU & RAM) a query needs
  • Default: 2,000 slots
  • Spend over $40k / month – the slot allocation can be increased
• Maintenance – fully managed
• Backup and recovery
  • 7-day history of changes
  • Point-in-time snapshot (Legacy SQL)
• Monitoring
  • Stackdriver (GCP service)
  • Export to BQ
• Security – IAM & admin (GCP)
BQ Pricing BigQuery
BQ Pricing – DML non-partitioned tables BigQuery
DEMO
BQ Pricing – DML partitioned tables BigQuery
BQ in action BigQuery
Using BQ:
• Web
• Command line (SDK)
• REST API / Client libraries (C#, Go, Node.js, PHP, Python, Java, Ruby)
File types:
• CSV
• JSON (newline delimited)
• Avro
• Parquet
• Cloud Datastore Backup
• Google Cloud Bigtable
• Google Sheets
Sources:
• Google Cloud Storage
• Google Drive
• Google Cloud Bigtable
• Local File
• Streaming
BQ in action - demos BigQuery
DEMO
BQ in action – CSV, schema autodetection BigQuery
Encoding
• BigQuery expects CSV data to be UTF-8 encoded. If you have CSV files with data encoded in ISO-8859-1 (also known as Latin-1) format, you should explicitly specify the encoding when you load your data so it can be converted to UTF-8. Note: the encoding cannot be set in the web UI, and only UTF-8 and ISO-8859-1 are supported.
• Delimiters in CSV files can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. BigQuery converts the string to ISO-8859-1 encoding and uses the first byte of the encoded string to split the data in its raw, binary state.
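The delimiter rule above can be illustrated with a minimal Python sketch. It only mimics the described behaviour locally and does not call BigQuery; the helper names `delimiter_byte` and `split_raw_row` and the sample row are hypothetical.

```python
from typing import List


def delimiter_byte(delimiter: str) -> bytes:
    """Return the single byte used to split raw rows (hypothetical helper)."""
    supplied = delimiter.encode("utf-8")  # the delimiter as supplied, UTF-8 encoded
    latin1 = supplied.decode("utf-8").encode("iso-8859-1")  # converted to ISO-8859-1
    return latin1[:1]  # only the first byte is used


def split_raw_row(row: bytes, delimiter: str) -> List[bytes]:
    """Split an undecoded row on the delimiter byte."""
    return row.split(delimiter_byte(delimiter))


# Made-up ISO-8859-1 row that uses the thorn character (code point 254) as delimiter.
raw_row = "caf\u00e9\u00fe42".encode("iso-8859-1")
fields = [f.decode("iso-8859-1") for f in split_raw_row(raw_row, "\u00fe")]
print(fields)  # ['café', '42']
```

The delimiter takes two bytes in UTF-8 but a single byte (0xFE) in ISO-8859-1, and that single byte is what the raw row is split on.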
Schema autodetection
• Delimiters: comma (,), pipe (|), tab (\t)
• Header – if the 1st row contains only strings and the 2nd does not, then the 1st is treated as the header
• Quoted new lines – detected; a new line inside quotes is not treated as a row boundary
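The header and quoted-new-line rules above can be sketched locally with Python's standard `csv` module. This is only an approximation of the heuristic, not BigQuery's detector: the function names are hypothetical, and "not a string" is simplified to "parses as a number".

```python
import csv
import io


def is_number(value: str) -> bool:
    """Simplification: any value that float() accepts counts as non-string."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def first_row_is_header(text: str, delimiter: str = ",") -> bool:
    """True if row 1 holds only strings while row 2 holds at least one number."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if len(rows) < 2:
        return False
    first_only_strings = not any(is_number(v) for v in rows[0])
    second_only_strings = not any(is_number(v) for v in rows[1])
    return first_only_strings and not second_only_strings


# Made-up sample: the quoted field contains a new line but is still one row.
sample = 'name,note,age\n"Anna","line one\nline two",31\n'
rows = list(csv.reader(io.StringIO(sample)))
print(len(rows))                    # 2
print(first_row_is_header(sample))  # True
```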
BQ in action – append, job demos BigQuery
DEMO
BQ in action – querying BigQuery
DEMO
BQ best practices – cost control BigQuery
• Avoid SELECT *
• Don't run queries to explore or preview table data – use the preview option
• Before running queries, preview them to estimate costs
• Use the maximum bytes billed setting to limit query costs – adding a LIMIT clause changes NOTHING about the bytes billed
• Partition your tables by date
• If possible, materialize your query results in stages
• Use expiration time to remove the data when it's no longer needed
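A toy model of the points above, assuming a columnar table where the bytes billed are the total size of the referenced columns. The column sizes and the helper names `bytes_scanned` and `within_budget` are made up for illustration; nothing here calls BigQuery.

```python
from typing import Dict, Optional, Sequence


def bytes_scanned(column_bytes: Dict[str, int],
                  selected: Sequence[str],
                  limit: Optional[int] = None) -> int:
    """Toy model: scanned bytes depend on referenced columns, not on LIMIT."""
    del limit  # intentionally unused: LIMIT does not reduce the bytes read
    columns = list(column_bytes) if list(selected) == ["*"] else list(selected)
    return sum(column_bytes[name] for name in columns)


def within_budget(estimated_bytes: int, maximum_bytes_billed: int) -> bool:
    """Mimic a maximum-bytes-billed guard: refuse queries above the cap."""
    return estimated_bytes <= maximum_bytes_billed


table = {"id": 8_000, "name": 40_000, "payload": 900_000}  # made-up sizes

print(bytes_scanned(table, ["*"]))            # 948000
print(bytes_scanned(table, ["*"], limit=10))  # 948000, LIMIT changes nothing
print(bytes_scanned(table, ["id", "name"]))   # 48000
print(within_budget(bytes_scanned(table, ["*"]), 100_000))  # False
```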
BQ best practices – query performance BigQuery
• Denormalize whenever possible (use nested / repeated fields)
• Use the _PARTITIONTIME pseudo column to filter the partitions
• Do not overuse wildcards
• BQ doesn’t like joins
• Use approximate aggregation functions
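To show why approximate aggregation is cheaper, here is a minimal sketch of an approximate distinct count using a k-minimum-values estimator. This is a swapped-in textbook technique chosen for brevity, not the algorithm BigQuery uses; the function names and the default `k` are assumptions.

```python
import hashlib
import heapq
from typing import Hashable, Iterable, List, Set


def _unit_hash(value: Hashable) -> float:
    """Map a value to a pseudo-random float in [0, 1)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2.0 ** 64


def approx_count_distinct(values: Iterable[Hashable], k: int = 1024) -> int:
    """Estimate the number of distinct values while keeping only k hashes."""
    heap: List[float] = []  # negated hashes, so -heap[0] is the largest kept hash
    kept: Set[float] = set()
    for value in values:
        h = _unit_hash(value)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    return int((k - 1) / -heap[0])  # k-th smallest hash sets the estimate


print(approx_count_distinct(i % 50 for i in range(1000)))  # 50 (exact, below k)
print(approx_count_distinct(i % 20_000 for i in range(40_000)))  # roughly 20000
```

Memory stays bounded by `k` regardless of input size, at the price of a small relative error; that is the same trade-off the approximate functions offer.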
BQ Limits - querying BigQuery
• Concurrent rate limit for on-demand, interactive queries — 50 concurrent queries
• Concurrent rate limit for queries that contain user-defined functions (UDFs) — 6 concurrent
queries
• Maximum number of tables referenced per query — 1,000
• Maximum concurrent slots per project for on-demand pricing — 2,000
BQ Limits – load, copy, export jobs BigQuery
• Load jobs per project per day — 50,000 (including failures)
• Wildcard URIs — 500 wildcard URIs per export
• Exports per day — 1,000 exports per project and up to 10 TB per day (the 10TB data limit is
cumulative across all exports)
BQ Limits – DML BigQuery
• Maximum UPDATE/DELETE statements per day per table — 96
• Maximum UPDATE/DELETE statements per day per project — 10,000
• Maximum INSERT statements per day per table — 1,000
BQ Summary BigQuery
Pros:
• Ease of pricing
• Computing resources
• Documentation
• Price
Cons:
• Troubleshooting
• ETL (Matillion?)
• Production cases, competences
• Available datasources
DataPrep - Agenda DataPrep
1. What DataPrep is for
2. Pricing
3. Wrangling data with DataPrep
4. Creating / scheduling Flows
5. Limits
6. Summary
What DataPrep is for - Overview DataPrep
• ETL
• Extract
• Transform
• Load
• Data Cleansing
• Data Profiling
• Data Enrichment
• Data Flows
DataPrep - Pricing DataPrep
DataPrep in action - demos
DEMO
DataPrep Performance - optimize
• Limit rows / filter data / drop unused columns
• Late unions
• Joins:
  • Join operations should be performed early in your recipe. These steps bring together your data into a single consistent dataset. By doing them early in the process, you reduce the chance of having changes to your join keys impacting the results of your join operations.
  • Tip: You should perform your join operation as late as possible in your recipe steps. If your joined dataset has not been completely transformed, subsequent steps might impact the data in the dataset to which it was joined. If needed, you can modify your join after its creation.
DataPrep Limits
• Sampling - Sample sizes are 10 MB (All values displayed or generated in the application are based on the currently displayed sample)
• Sampling - Random samples are derived from up to the first 1 GB of the source file.
• Encoding - Within the application, UTF-8 encodings are displayed
• User-defined functions are not supported
• Integrations with datastores other than BigQuery, Google Cloud Storage, and the local filesystem are not supported
• The Command Line Interface is not supported
• Sharing is not supported
• If you're using a Free Trial project, your project has a maximum of 8 cores available. You must specify a combination of numWorkers, workerMachineType, and maxNumWorkers that fits within your trial limit
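The free-trial core limit above can be checked before submitting a job. The sketch below assumes that machine type names end in their vCPU count (as in `n1-standard-4`) and that the peak worker count is what has to fit within the limit; both helper functions are hypothetical.

```python
import re


def cores_per_worker(worker_machine_type: str) -> int:
    """Read the vCPU count from a name such as 'n1-standard-4' (assumed pattern)."""
    match = re.fullmatch(r"[a-z0-9]+-[a-z]+-(\d+)", worker_machine_type)
    if match is None:
        raise ValueError(f"unrecognised machine type: {worker_machine_type}")
    return int(match.group(1))


def fits_trial_limit(num_workers: int, worker_machine_type: str,
                     max_num_workers: int, core_limit: int = 8) -> bool:
    """Assumption: the job may scale to max_num_workers, so check the peak."""
    peak_workers = max(num_workers, max_num_workers)
    return peak_workers * cores_per_worker(worker_machine_type) <= core_limit


print(fits_trial_limit(1, "n1-standard-4", 2))  # True: 2 x 4 = 8 cores
print(fits_trial_limit(1, "n1-standard-4", 3))  # False: 3 x 4 = 12 cores
```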
DataPrep Summary
Pros:
• Multiple encodings supported
• Profiling results
Cons:
• Integration
• SDK / Command Line not supported
• Dependencies between flows
• "Parametrized" jobs only through DataFlow (code-driven)
Contact me!
www.linkedin.com/in/pawelmitrus
www.facebook.com/pawelmitrus
pawel.mitrus@gmail.com

Introduction to GCP BigQuery and DataPrep

  • 1. Google Cloud Platform 1. Dive into BiqQuery 2. An idea for ETL - DataPrep Paweł Mitruś Warsaw, 5th April 2018
  • 2. About me  Team Manager @  Almost 4 years with Data  Triathlon freak  Softskills optimist Human brain explorer
  • 3. Agenda BigQuery DataPrep 1. What BQ is for 2. Pricing 3. BQ in action • Using • Loading • Querying 4. Best practises 5. Limits 6. Summary 1. What DataPrep is for 2. Pricing 3. Wrangling data with DataPrep 4. Creating / scheduling Flows 5. Limits 6. Summary 45min ~ 15min ~
  • 4. What BQ is for - Overview BigQuery
  • 5. What BQ is for - Overview BigQuery
  • 6. What BQ is for - Overview BigQuery • Storage usage – based on data consumption • Query – estimates how much slots it needs for query (CPU & RAM) • Default 2 000 slots • Spend over 40k $ / month – increase • Maintenance - fully managed • Backup and recovery • 7 day history changes • Point-in-time snapshot (Legacy SQL) • Monitoring • Stackdriver (GCP service) • Export to BQ • Security – IAM & admin (GCP)
  • 8. BQ Pricing – DML non-partitioned tables BigQuery
  • 9. BQ Pricing – DML non-partitioned tables BigQuery DEMO
  • 10. BQ Pricing – DML partitioned tables BigQuery
  • 11. BQ in action BigQuery Using BQ: • Web • Command line (SDK) • REST API / Client libraries (C#, GO, Node.js, PHP, Python, Java, Ruby) File types: • CSV • JSON (newline delimited) • Avro • Parquet • Cloud Datastore Backup • Google Cloud Bigtable • Google Sheets Sources: • Google Cloud Storage • Google Drive • Google Cloud Bigtable • Local File • Streaming
  • 12. BQ in action - demos BigQuery DEMO
  • 13. BQ in action – CSV, schema autodetection BigQuery Encoding • BigQuery expects CSV data to be UTF-8 encoded. If you have CSV files with data encoded in ISO- 8859-1 (also known as Latin-1) format, you should explicitly specify the encoding when you load your data so it can be converted to UTF-8. – NOT IN WEB UI, only UTF-8 and ISO-8859-1 supported • Delimiters in CSV files can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF-8. BigQuery converts the string to ISO-8859-1 encoding and uses the first byte of the encoded string to split the data in its raw, binary state. Schema autodetection • Delimiters: comma (,), pipe(|), tab(t) • Header – if 1st row contains only strings and 2nd does not, then 1st is header • Quoted new lines – detects, but is not recognized as a row boundary
  • 14. BQ in action – append, job demos BigQuery DEMO
  • 15. BQ in action – querying BigQuery DEMO
  • 16. BQ best practices – cost control BigQuery • Avoid SELECT * • Don't run queries to explore or preview table data – use the preview option • Before running queries, preview them to estimate costs • Use the maximum bytes billed setting to limit query costs – adding a LIMIT clause changes NOTHING • Partition your tables by date • If possible, materialize your query results in stages • Use expiration time to remove the data when it's no longer needed
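
As a rough illustration of "preview to estimate costs", the sketch below turns a bytes-processed figure (as reported by a query preview) into an on-demand cost and checks it against a byte cap. The default price of 5 USD per TiB and the helper names are assumptions for the example; check the current pricing page before relying on the number.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def estimate_query_cost(bytes_processed: int, usd_per_tib: float = 5.0) -> float:
    """Estimate the on-demand cost of a query.

    usd_per_tib is an assumed price, not a value taken from the slides.
    """
    return bytes_processed / TIB * usd_per_tib


def within_byte_cap(bytes_processed: int, maximum_bytes_billed: int) -> bool:
    """Mimic the effect of a maximum-bytes-billed cap: a query that would
    scan more than the cap should not be run."""
    return bytes_processed <= maximum_bytes_billed


if __name__ == "__main__":
    scanned = 250 * 2 ** 30  # a query that scans 250 GiB
    print(round(estimate_query_cost(scanned), 2))
    print(within_byte_cap(scanned, maximum_bytes_billed=100 * 2 ** 30))
```

The estimate depends only on the bytes scanned, which is why a LIMIT clause does not lower it.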
  • 17. BQ best practices – query performance BigQuery • Denormalize whenever possible (use nested / repeated fields) • Use the _PARTITIONTIME pseudo column to filter partitions • Do not overuse wildcards • BQ doesn’t like joins • Use approximate aggregation functions
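
A minimal sketch of the `_PARTITIONTIME` advice, assuming standard SQL and an ingestion-time partitioned table: the helper builds a query that names its columns explicitly and restricts the scan to a half-open date range. The function name and the table name in the usage example are hypothetical, and the string formatting is only suitable for trusted inputs.

```python
from datetime import date
from typing import Sequence


def partition_filtered_query(
    table: str, columns: Sequence[str], start: date, end: date
) -> str:
    """Build a standard SQL query that scans only partitions in [start, end)."""
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list} "
        f"FROM `{table}` "
        f"WHERE _PARTITIONTIME >= TIMESTAMP('{start.isoformat()}') "
        f"AND _PARTITIONTIME < TIMESTAMP('{end.isoformat()}')"
    )


if __name__ == "__main__":
    # Hypothetical table name, for illustration only.
    print(
        partition_filtered_query(
            "my_project.sales.factSales",
            ["order_id", "amount"],
            date(2018, 4, 1),
            date(2018, 4, 5),
        )
    )
```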
  • 18. BQ Limits - querying BigQuery • Concurrent rate limit for on-demand, interactive queries — 50 concurrent queries • Concurrent rate limit for queries that contain user-defined functions (UDFs) — 6 concurrent queries • Maximum number of tables referenced per query — 1,000 • Maximum concurrent slots per project for on-demand pricing — 2,000
  • 19. BQ Limits – load, copy, export jobs BigQuery • Load jobs per project per day — 50,000 (including failures) • Wildcard URIs — 500 wildcard URIs per export • Exports per day — 1,000 exports per project and up to 10 TB per day (the 10TB data limit is cumulative across all exports)
  • 20. BQ Limits – DML BigQuery • Maximum UPDATE/DELETE statements per day per table — 96 • Maximum UPDATE/DELETE statements per day per project — 10,000 • Maximum INSERT statements per day per table — 1,000
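
To make these daily quotas more tangible, the short calculation below converts a per-day statement limit into the average spacing between statements that would just stay within it. The function name is made up for the example; the limits are the ones listed on the slide.

```python
from datetime import timedelta


def min_average_interval(statements_per_day: int) -> timedelta:
    """Average spacing between statements that exactly uses up a daily quota."""
    return timedelta(days=1) / statements_per_day


if __name__ == "__main__":
    print(min_average_interval(96))    # UPDATE/DELETE per table -> 0:15:00
    print(min_average_interval(1000))  # INSERT per table
```

In other words, a table can absorb an UPDATE or DELETE only about once every 15 minutes on average, so row-by-row DML does not fit; changes need to be batched into few statements.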
  • 21. BQ Summary BigQuery Pros: • Ease of pricing • Computing resources • Documentation • Price Cons: • Troubleshooting • ETL (Matillion?) • Production cases, competences • Available datasources
  • 22. DataPrep - Agenda DataPrep 1. What DataPrep is for 2. Pricing 3. Wrangling data with DataPrep 4. Creating / scheduling Flows 5. Limits 6. Summary
  • 23. What DataPrep is for - Overview DataPrep • ETL • Extract • Transform • Load • Data Cleansing • Data Profiling • Data Enrichment • Data Flows
  • 24. DataPrep - Pricing DataPrep
  • 25. DataPrep in action - demos DEMO DataPrep
  • 26. DataPrep Performance - optimize • Limit rows / filter data / drop unused columns • Late unions • Joins: • Join operations should be performed early in your recipe. These steps bring together your data into a single consistent dataset. By doing them early in the process, you reduce the chance of having changes to your join keys impacting the results of your join operations. • Tip: You should perform your join operation as late as possible in your recipe steps. If your joined dataset has not been completely transformed, subsequent steps might impact the data in the dataset to which it was joined. If needed, you can modify your join after its creation. DataPrep
  • 27. DataPrep Limits • Sampling - Sample sizes are 10 MB (All values displayed or generated in the application are based on the currently displayed sample) • Sampling - Random samples are derived from up to the first 1 GB of the source file. • Encoding - Within the application, UTF-8 encodings are displayed • User-defined functions are not supported • Integrations with datastores other than BigQuery, Google Cloud Storage, and the local filesystem are not supported • The Command Line Interface is not supported • Sharing is not supported • If you're using a Free Trial project, your project has a maximum of 8 cores available. You must specify a combination of numWorkers, workerMachineType, and maxNumWorkers that fits within your trial limit DataPrep
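
The free-trial core limit can be checked before submitting a job. The sketch below assumes that the peak core count is `maxNumWorkers` multiplied by the vCPUs of `workerMachineType`, and that the machine type is a predefined `n1-*` type whose name ends in its vCPU count; both are assumptions for the example, as are the function names.

```python
import re

TRIAL_CORE_LIMIT = 8  # free-trial limit quoted on the slide


def vcpus_for(machine_type: str) -> int:
    """Read the vCPU count from a predefined n1 machine type name,
    e.g. 'n1-standard-4' -> 4. Other naming schemes are rejected."""
    match = re.fullmatch(r"n1-(standard|highmem|highcpu)-(\d+)", machine_type)
    if match is None:
        raise ValueError(f"unsupported machine type: {machine_type}")
    return int(match.group(2))


def fits_trial(max_num_workers: int, worker_machine_type: str) -> bool:
    """Assumed rule: peak cores = maxNumWorkers * vCPUs per worker."""
    return max_num_workers * vcpus_for(worker_machine_type) <= TRIAL_CORE_LIMIT


if __name__ == "__main__":
    print(fits_trial(2, "n1-standard-4"))
    print(fits_trial(3, "n1-standard-4"))
```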
  • 28. DataPrep Summary Pros: • Multiple encodings supported • Profiling results Cons: • Integration • SDK / Command Line not supported • Dependencies between flows • "Parameterized" jobs only through Dataflow (code-driven) DataPrep

Editor's Notes

  1. Serverless Links: https://cloud.google.com/solutions/bigquery-data-warehouse
  2. DEMO – how it looks in the web UI
  3. https://cloud.google.com/solutions/bigquery-data-warehouse#costs https://cloud.google.com/bigquery/docs/slots Hardware: https://cloud.google.com/blog/big-data/2016/01/bigquery-under-the-hood Data Structures: https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format
  4. Long-term storage – table not edited for 90 days (There is no degradation of performance, durability, availability, or any other functionality when a table is considered long term storage)
  5. Demo - Costs of querying. Cancelling a running query job may incur charges up to the full cost for the query were it allowed to run to completion. You don’t pay if the query returns an error. You don’t pay if the query uses cached results.
  6. https://cloud.google.com/bigquery/pricing#samplecosts
  7. Demo
  8. DEMO
  9. - Windows Unicode = "UTF-16LE" <> "UTF-8", ISO-8859-1 - DEMO: WEB UI – change schema after initial load
  10. WEB: Rerun job Append data to table Overwrite table
  11. WEB: Cached / not cached – when cached, no fee (factSales_denormalized vs factSales_normalized query) Wildcard (factSales_wildcard) Different dataset locations (EU & US, differentRegion_EU_US) Shared view – queried each time (not materialized)
  12. https://cloud.google.com/bigquery/docs/best-practices-costs
  13. DEMO https://cloud.google.com/bigquery/docs/best-practices-performance-overview
  14. https://cloud.google.com/bigquery/quotas
  15. https://cloud.google.com/bigquery/quotas
  16. https://cloud.google.com/bigquery/quotas
  17. Serverless Desktop (Trifacta)
  18. https://cloud.google.com/dataprep/docs/concepts/gcs-buckets#removing_service_account_access_to_a_bucket https://cloud.google.com/dataprep/docs/html/Run-Job-on-Dataflow_99745844