Apache Beam in Production
June 10th, 2020
Who we are
● USA-based monetization platform for mobile game developers.
● Growing our engineering office in Barcelona with a strong tech-driven culture:
- Visibility of the impact of your code.
- Not afraid of implementing tech, like Apache Beam ; )
- Fostering best practices and a true care for code quality.
● 2011: founding year
● +300k mobile game integrations
● 900M unique users
● 200B ad requests
● 100TBs data scale
Hi, I’m Ferran!
Data Engineer @ Chartboost
+5 years of Big Data experience
Agenda
● What is Apache Beam
● Basic Requirements
● Production use cases:
○ Ingest data into BigQuery or GCS
○ BigQuery to BigTable
○ BigQuery to S3 (Parquet)
● Dealing with Streaming
● Questions
What is Apache Beam
“Apache Beam is an open source,
unified model for defining both batch
and streaming data-parallel processing
pipelines.”
Write once, run anywhere
Basic Requirements
- Reduce cluster provisioning overhead.
- Create generic, reusable code.
- The architecture must be agnostic of the source (streaming or batch).
- The architecture must support different configurable sinks.
Text files or Parquet
Ingest data into BQ or GCS
Parameter Input
inputPath gs://bucket_x/*.parquet
isPartitionedTable false
disposition TRUNCATE
outputTableSpec project:dataset.table
batchJobInstance NO_TRANSFORMATION
howToParse {"original_app_id":"original_app_id","app.id":"app_id","src_name":"src_name","id":"id"}
You have loaded data into BigQuery!
I wanted upper case...
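import com.google.api.services.bigquery.model.TableRow;

public class TestTransformation {
    // Upper-cases the src_name column; rows without src_name pass through unchanged.
    public TableRow customTransformation(TableRow tr) {
        Object srcName = tr.get("src_name");
        if (srcName != null) {
            tr.set("src_name", srcName.toString().toUpperCase());
        }
        return tr;
    }
}

// Name passed as the batchJobInstance parameter to select this transformation.
class Types {
    static final String TEST = "test";
}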
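The slides do not show how the batchJobInstance value is resolved to a transformation class. A minimal sketch of one way to do it, assuming a simple name-to-function registry: the class TransformationRegistrySketch and the method forInstance are hypothetical, and plain Map rows stand in for TableRow so the example runs without the Beam or BigQuery libraries.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical registry, not the talk's code: maps a batchJobInstance name to a row transformation.
final class TransformationRegistrySketch {
    static final String NO_TRANSFORMATION = "NO_TRANSFORMATION";
    static final String TEST = "test";

    private static final Map<String, UnaryOperator<Map<String, Object>>> REGISTRY = new HashMap<>();

    static {
        // Rows pass through untouched.
        REGISTRY.put(NO_TRANSFORMATION, UnaryOperator.identity());
        // Upper-cases src_name when present, mirroring the TestTransformation slide.
        REGISTRY.put(TEST, row -> {
            Object srcName = row.get("src_name");
            if (srcName != null) {
                row.put("src_name", srcName.toString().toUpperCase(Locale.ROOT));
            }
            return row;
        });
    }

    // Looks up the transformation for a job parameter value; unknown names fail fast.
    static UnaryOperator<Map<String, Object>> forInstance(String batchJobInstance) {
        UnaryOperator<Map<String, Object>> transformation = REGISTRY.get(batchJobInstance);
        if (transformation == null) {
            throw new IllegalArgumentException("Unknown batchJobInstance: " + batchJobInstance);
        }
        return transformation;
    }
}
```

Failing fast on an unknown name is a design choice of this sketch: a typo in the job parameters surfaces at launch rather than as silently untransformed rows.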
Parameter Input
inputPath gs://bucket_x/*.parquet
isPartitionedTable false
disposition TRUNCATE
batchJobInstance test
outputTableSpec project:dataset.table
howToParse {"original_app_id":"original_app_id","app.id":"app_id","src_name":"src_name","id":"id"}
You have loaded custom data into BigQuery!
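The howToParse parameter in the tables above looks like a mapping from a source field path (for example "app.id") to an output column name ("app_id"). That reading is an assumption; the talk does not show the parser. Under that assumption, the projection could be sketched as follows, with HowToParseSketch, lookup and project as hypothetical names and nested Maps standing in for the parsed input record.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the talk's JsonParseToBigQuery: it only illustrates
// how a howToParse mapping could turn dotted source paths into output columns.
final class HowToParseSketch {

    // Follows a dotted path such as "app.id" through nested maps; returns null if any step is missing.
    static Object lookup(Map<String, Object> record, String dottedPath) {
        Object current = record;
        for (String key : dottedPath.split("\\.")) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    // Builds one output row: for each "sourcePath -> columnName" entry, copies the value if present.
    static Map<String, Object> project(Map<String, Object> record, Map<String, String> howToParse) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : howToParse.entrySet()) {
            Object value = lookup(record, entry.getKey());
            if (value != null) {
                row.put(entry.getValue(), value);
            }
        }
        return row;
    }
}
```

Skipping missing paths, rather than writing nulls, is a choice made for this sketch only.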
How custom transformations work
[Diagram labels: Core; batchJobInstance; streamingJobInstance; customTransA; customTransB; customTransC]
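@ProcessElement
public void processElement(ProcessContext context) {
    try {
        String element = context.element();
        if (element != null) {
            // Parse the raw JSON string into the intermediate model.
            JsonModel jsonModel = Utils.parseRawJson(element, JsonModel.class);
            if (jsonModel != null) {
                // Map the details to a TableRow using howToParse, then apply the custom transformation.
                TableRow tr = customTransformation(JsonParseToBigQuery.getInstance()
                        .getJsonParse(jsonModel.getDetails(), howToParse));
                if (tr != null) {
                    context.output(tr);
                }
            }
        }
    } catch (Exception e) {
        // Elements that fail to parse are logged and dropped.
        LOG.error("Error on parsing", e);
    }
}
} // closes the enclosing DoFn class (its declaration is not shown on the slide)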
Main Batch Jobs Parameters
Parameter Default Value
inputPath -
isParquet true
outputTableSpec -
howToParse -
disposition APPEND
isPartitionedTable true
writeIntoBq true
writeIntoGCS false
outputDirectory -
numShards 20
writeAsParquet true
Parameter Input
inputPath gs://bucket_x/*.parquet
batchJobInstance test
outputDirectory gs://bucket_y/test
outputTableSpec project:dataset.table
writeIntoGCS true
parquetSchema original_app_id:STRING,app_id:LONG
howToParse {"original_app_id":"original_app_id","app.id":"app_id","src_name":"src_name","id":"id"}
Main Streaming Jobs Parameters
Parameter Default Value
streamSource pub_sub
inputTopic -
subscription -
kafkaBrokers -
kafkaGroupId -
filePartitionPolicy DAILY
writeIntoBq true
writeIntoGCS false
outputDirectory -
numShards 20
windowDuration 5m
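To make windowDuration and filePartitionPolicy concrete, here is a small sketch of how a 5-minute fixed window and a DAILY output directory could be derived from an event timestamp. The talk does not show this logic: the class WindowPathSketch, the epoch-aligned windows, the use of UTC and the yyyy/MM/dd directory layout are all assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration of windowDuration=5m and filePartitionPolicy=DAILY.
final class WindowPathSketch {
    // Assumed directory layout: one folder per UTC day.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

    // Start of the fixed window containing the timestamp, with windows aligned to the epoch.
    static Instant windowStart(Instant timestamp, Duration windowDuration) {
        long sizeMillis = windowDuration.toMillis();
        long millis = timestamp.toEpochMilli();
        return Instant.ofEpochMilli(millis - Math.floorMod(millis, sizeMillis));
    }

    // Output directory for files of a window under a daily partition policy.
    static String dailyDirectory(String outputDirectory, Instant windowStart) {
        return outputDirectory + "/" + DAY.format(windowStart);
    }
}
```

With these assumptions, an event at 2020-06-10T12:07:30Z falls in the window starting at 12:05:00Z, and its files would land under gs://bucket_y/test/2020/06/10.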
Problems / Tips
- Debugging failures was not always easy.
- If you want to create templates, remember that ValueProvider values are only available at runtime, not when the pipeline is constructed.
- Be careful with non-thread-safe classes.
- Default GCP instances are okay, but try to use custom ones.
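The talk does not say which non-thread-safe classes caused trouble. A well-known example in Java is java.text.SimpleDateFormat, which must not be shared between threads, whereas java.time.format.DateTimeFormatter is immutable and thread-safe. The sketch below shows the safe variant for state shared through a static field; the class name ThreadSafeFormattingSketch is hypothetical.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative only: a formatter that is safe to share across worker threads.
final class ThreadSafeFormattingSketch {
    // DateTimeFormatter is immutable and thread-safe, so a single static instance is fine.
    // A static SimpleDateFormat in the same position could corrupt results under concurrency.
    private static final DateTimeFormatter DAY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    // Formats an epoch-millisecond timestamp as a UTC day string.
    static String day(long epochMillis) {
        return DAY.format(Instant.ofEpochMilli(epochMillis));
    }
}
```

When a non-thread-safe class cannot be replaced, creating one instance per DoFn instance instead of sharing it through a static field is a common alternative.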
BigQuery to BigTable
Going from SQL to NoSQL
What is BigTable
“Bigtable is a compressed, high
performance, proprietary data storage
system built on Google File System, Chubby
Lock Service, SSTable (log-structured
storage like LevelDB) and a few other
Google technologies”
Key-value storage
BigTable table example
ColumnFamily (info)
RowKey | Qualifier (name) | Qualifier (email) | Qualifier (phone)
ofsehn28u492 | Bill Green | bgreen@gmail.com | 555-958-382
kfgiuiu5937je3 | - | jdoe@gmail.com | 555-738-234
iojcou9wujd77 | Rick Sanchez | - | -
Parameter Input
sqlQuery SELECT X, Y, Z FROM ...
rowKeyMap -
bqToBtMap cf:qualifier:something,cf2...
bigTableInstanceId chartboost
externalSinkProject project-id-x
bigTableAppProfileId batch
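The bqToBtMap value is only partly visible on the slide. Assuming each comma-separated entry has the form columnFamily:qualifier:bigQueryColumn (an interpretation, not something the talk confirms), the parameter could be parsed as sketched below. BqToBtMapSketch and CellMapping are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for a bqToBtMap-style parameter; the entry format is an assumption.
final class BqToBtMapSketch {

    // One target cell: which BigQuery column feeds which column family and qualifier.
    static final class CellMapping {
        final String family;
        final String qualifier;
        final String sourceColumn;

        CellMapping(String family, String qualifier, String sourceColumn) {
            this.family = family;
            this.qualifier = qualifier;
            this.sourceColumn = sourceColumn;
        }
    }

    // Parses "family:qualifier:column,family:qualifier:column" into a list of mappings.
    static List<CellMapping> parse(String spec) {
        List<CellMapping> mappings = new ArrayList<>();
        for (String entry : spec.split(",")) {
            String[] parts = entry.trim().split(":");
            if (parts.length != 3) {
                throw new IllegalArgumentException(
                        "Expected family:qualifier:column but got '" + entry + "'");
            }
            mappings.add(new CellMapping(parts[0], parts[1], parts[2]));
        }
        return mappings;
    }
}
```

For the example table above, a value such as info:name:full_name,info:email:email would describe two cells in the info column family; both the value and the BigQuery column names are made up for illustration.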
Problems / Tips
● For heavy-load jobs, always use BigTable application profiles.
BigQuery to S3 (Parquet)
From BQ Dataflow to S3 Parquet
Parameter Input
AWSRegion us-east-1
inputTableSpec project:dataset.table
outputPath gs://bucket_y/test
successFileName _SUCCESS
schema original_app_id:STRING,app_id:LONG
writeIntoS3 true
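The schema parameter above is a compact name:TYPE list. How the job turns it into a Parquet schema is not shown in the talk; as a first step, the string could be parsed into an ordered field-to-type map, as in this sketch. SchemaSpecSketch is a hypothetical name and the validation rules are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical parser for a "name:TYPE,name:TYPE" schema parameter.
final class SchemaSpecSketch {

    // Returns field names mapped to upper-cased type names, preserving declaration order.
    static Map<String, String> parse(String spec) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String part : spec.split(",")) {
            String[] nameAndType = part.trim().split(":", 2);
            if (nameAndType.length != 2 || nameAndType[0].isEmpty() || nameAndType[1].isEmpty()) {
                throw new IllegalArgumentException("Expected name:TYPE but got '" + part + "'");
            }
            fields.put(nameAndType[0].trim(), nameAndType[1].trim().toUpperCase(Locale.ROOT));
        }
        return fields;
    }
}
```

A LinkedHashMap keeps the column order of the parameter, which matters when the map is later used to build a file schema.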
What we need to connect to AWS
// Pipeline options for the job.
public interface Options extends BigQueryToParquetOptions {
}

// Build static AWS credentials from the pipeline options and register them,
// together with the region, on AwsOptions.
BasicAWSCredentials awsCred =
    new BasicAWSCredentials(options.getAWSAccessKey(), options.getAWSSecretKey());
options.as(AwsOptions.class).setAwsCredentialsProvider(new AWSStaticCredentialsProvider(awsCred));
options.as(AwsOptions.class).setAwsRegion(options.getAWSRegion());
How to read from BigQuery
// Runs a standard SQL query and reads the result through the BigQuery Storage API (DIRECT_READ).
private static PCollection<TableRow> executeSql(Pipeline p, String sql) {
    return p.apply(
        BigQueryIO.readTableRows()
            .fromQuery(sql)
            .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ)
            .usingStandardSql());
}
Problems / Tips
- Choose the right region in order to reduce latency and cost.
- To avoid BigQuery extraction quota issues, use DIRECT_READ.
- FileIO only writes.
- Be careful with complex types (arrays, nested arrays).
Dealing with Streaming
Custom Airflow plugin to trigger Dataflow Jobs
Questions?
Thank you!
