Apache Beam in Production
June 10th, 2020
● USA-based monetization platform for mobile game developers.
● Growing our engineering office in Barcelona with a strong tech-driven culture:
- Visibility of the impact of your code.
- Not afraid of implementing new tech, like Apache Beam ;)
- Fostering best practices and true care for code quality.
Who we are
+300k Mobile game integrations
900M Unique users
200B Ad requests
100TBs Data scale
2011 Founding year
Hi, I’m Ferran!
Data Engineer @ Chartboost
+5 years of Big Data experience
Agenda
● What is Apache Beam
● Basic Requirements
● Production use cases:
○ Ingest data into BigQuery or GCS
○ BigQuery to BigTable
○ BigQuery to S3 (Parquet)
● Dealing with Streaming
● Questions
What is Apache Beam
“Apache Beam is an open source,
unified model for defining both batch
and streaming data-parallel processing
pipelines.”
Write once, run anywhere
Basic Requirements
- Reduce cluster provisioning overhead.
- Create generic, reusable code.
- The architecture must be agnostic of the source (streaming or batch).
- The architecture must support different configurable sinks.
Text files or Parquet
Ingest data into BQ or GCS
Parameter Input
inputPath gs://bucket_x/*.parquet
isPartitionedTable false
disposition TRUNCATE
outputTableSpec project:dataset.table
batchJobInstance NO_TRANSFORMATION
howToParse {"original_app_id":"original_app_id","app.id":"app_id","src_name":"src_name","id":"id"}
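The howToParse parameter maps source fields (including dot paths such as app.id) to flat output column names. A minimal stand-in sketch of how such a mapping could be applied, using plain java.util.Map in place of TableRow — the field names come from the example above, but the helper itself is hypothetical, not Chartboost's actual code:

```java
import java.util.HashMap;
import java.util.Map;

public class HowToParseSketch {
    // Resolve a possibly dotted path ("app.id") against nested maps.
    @SuppressWarnings("unchecked")
    static Object resolve(Map<String, Object> record, String path) {
        Object current = record;
        for (String part : path.split("\\.")) {
            if (!(current instanceof Map)) return null;
            current = ((Map<String, Object>) current).get(part);
        }
        return current;
    }

    // Apply the howToParse mapping: source path -> output column name.
    static Map<String, Object> apply(Map<String, Object> record, Map<String, String> howToParse) {
        Map<String, Object> out = new HashMap<>();
        for (Map.Entry<String, String> e : howToParse.entrySet()) {
            out.put(e.getValue(), resolve(record, e.getKey()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> app = new HashMap<>();
        app.put("id", "abc123");
        Map<String, Object> record = new HashMap<>();
        record.put("app", app);
        record.put("src_name", "adcolony");

        Map<String, String> howToParse = new HashMap<>();
        howToParse.put("app.id", "app_id");
        howToParse.put("src_name", "src_name");

        System.out.println(apply(record, howToParse)); // flat row keyed by output column names
    }
}
```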
You have loaded data into BigQuery!
I wanted upper case...
import com.google.api.services.bigquery.model.TableRow;

public class TestTransformation {
    // Upper-case the src_name column before the row reaches BigQuery.
    public TableRow customTransformation(TableRow tr) {
        Object src = tr.get("src_name");
        if (src != null) {
            tr.set("src_name", src.toString().toUpperCase());
        }
        return tr;
    }
}

class Types {
    static final String TEST = "test";
}
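Since TableRow behaves like a map of column names to values, the effect of the transformation above can be sketched with a plain java.util.Map — a toy stand-in for illustration, not the production class:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class UpperCaseSketch {
    // Same logic as customTransformation, on a plain map standing in for TableRow.
    static Map<String, Object> customTransformation(Map<String, Object> row) {
        Object src = row.get("src_name");
        if (src != null) {
            row.put("src_name", src.toString().toUpperCase(Locale.ROOT));
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("src_name", "adcolony");
        System.out.println(customTransformation(row).get("src_name")); // prints ADCOLONY
    }
}
```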
Parameter Input
inputPath gs://bucket_x/*.parquet
isPartitionedTable false
disposition TRUNCATE
batchJobInstance test
outputTableSpec project:dataset.table
howToParse {"original_app_id":"original_app_id","app.id":"app_id","src_name":"src_name","id":"id"}
You have loaded custom data into BigQuery!
Core → batchJobInstance / streamingJobInstance → customTransA, customTransB, customTransC
How custom transformations work
@ProcessElement
public void processElement(ProcessContext context) {
    try {
        String element = context.element();
        if (element != null) {
            JsonModel jsonModel = Utils.parseRawJson(element, JsonModel.class);
            if (jsonModel != null) {
                TableRow tr = customTransformation(JsonParseToBigQuery.getInstance()
                        .getJsonParse(jsonModel.getDetails(), howToParse));
                if (tr != null) {
                    context.output(tr);
                }
            }
        }
    } catch (Exception e) {
        LOG.error("Error on parsing", e);
    }
}
Parameter Default Value
inputPath -
isParquet true
outputTableSpec -
howToParse -
disposition APPEND
isPartitionedTable true
writeIntoBq true
writeIntoGCS false
outputDirectory -
numShards 20
writeAsParquet true
Main Batch Jobs Parameters
Parameter Input
inputPath gs://bucket_x/*.parquet
batchJobInstance test
outputDirectory gs://bucket_y/test
outputTableSpec project:dataset.table
writeIntoGCS true
parquetSchema original_app_id:STRING,app_id:LONG
howToParse {"original_app_id":"original_app_id","app.id":"app_id","src_name":"src_name","id":"id"}
Parameter Default Value
streamSource pub_sub
inputTopic -
subscription -
kafkaBrokers -
kafkaGroupId -
filePartitionPolicy DAILY
writeIntoBq true
writeIntoGCS false
outputDirectory -
numShards 20
windowDuration 5m
Main Streaming Jobs Parameters
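Shorthand values like windowDuration 5m have to be turned into a concrete duration before windowing can be applied. A minimal parser sketch for that shorthand — a hypothetical helper, assuming s/m/h suffixes, not the actual job code:

```java
import java.time.Duration;

public class WindowDurationSketch {
    // Parse shorthands like "30s", "5m", "2h" into a java.time.Duration.
    static Duration parse(String value) {
        long amount = Long.parseLong(value.substring(0, value.length() - 1));
        switch (value.charAt(value.length() - 1)) {
            case 's': return Duration.ofSeconds(amount);
            case 'm': return Duration.ofMinutes(amount);
            case 'h': return Duration.ofHours(amount);
            default: throw new IllegalArgumentException("Unknown unit in " + value);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("5m").getSeconds()); // prints 300
    }
}
```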
Problems / Tips
- Debugging failures was not always easy.
- If you want to create templates, remember that ValueProviders are only available at runtime.
- Be careful with non-thread-safe classes.
- Default GCP instances are okay, but consider custom machine types.
BigQuery to BigTable
Going from SQL to NoSQL
What is BigTable
“Bigtable is a compressed, high
performance, proprietary data storage
system built on Google File System, Chubby
Lock Service, SSTable (log-structured
storage like LevelDB) and a few other
Google technologies”
Key-value storage
BigTable table example — ColumnFamily (info)
RowKey         Qualifier (name)  Qualifier (email)  Qualifier (phone)
ofsehn28u492   Bill Green        bgreen@gmail.com   555-958-382
kfgiuiu5937je3 -                 jdoe@gmail.com     555-738-234
iojcou9wujd77  Rick Sanchez      -                  -
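Bigtable's sparse key-value layout in the table above can be modeled as nested maps: row key → qualifier → value within a column family. A toy illustration using plain Java collections (not the Bigtable client API):

```java
import java.util.HashMap;
import java.util.Map;

public class WideColumnSketch {
    // rowKey -> (qualifier -> value), within a single column family "info".
    static Map<String, Map<String, String>> sampleTable() {
        Map<String, Map<String, String>> info = new HashMap<>();

        Map<String, String> row1 = new HashMap<>();
        row1.put("name", "Bill Green");
        row1.put("email", "bgreen@gmail.com");
        row1.put("phone", "555-958-382");
        info.put("ofsehn28u492", row1);

        // Sparse rows simply omit qualifiers instead of storing nulls.
        Map<String, String> row2 = new HashMap<>();
        row2.put("email", "jdoe@gmail.com");
        row2.put("phone", "555-738-234");
        info.put("kfgiuiu5937je3", row2);

        return info;
    }

    public static void main(String[] args) {
        // Point lookup by row key is the primary access pattern.
        System.out.println(sampleTable().get("ofsehn28u492").get("name")); // prints Bill Green
    }
}
```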
Parameter Input
sqlQuery SELECT X, Y, Z FROM ...
rowKeyMap -
bqToBtMap cf:qualifier:something,cf2...
bigTableInstanceId chartboost
externalSinkProject project-id-x
bigTableAppProfileId batch
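The bqToBtMap value reads as a comma-separated list of columnFamily:qualifier:bigQueryColumn triples — the exact format is abbreviated on the slide, so that reading is an assumption. A minimal parser sketch under that assumption:

```java
import java.util.HashMap;
import java.util.Map;

public class BqToBtMapSketch {
    // Parse "cf:qualifier:bqColumn,cf2:qualifier2:bqColumn2" into
    // bqColumn -> "cf:qualifier" target cells. Format is assumed, not confirmed.
    static Map<String, String> parse(String spec) {
        Map<String, String> mapping = new HashMap<>();
        for (String entry : spec.split(",")) {
            String[] parts = entry.split(":");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Bad entry: " + entry);
            }
            mapping.put(parts[2], parts[0] + ":" + parts[1]);
        }
        return mapping;
    }

    public static void main(String[] args) {
        Map<String, String> m = parse("info:name:user_name,info:email:user_email");
        System.out.println(m.get("user_name")); // prints info:name
    }
}
```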
Problems / Tips
● For heavy-load jobs, always use BigTable application profiles.
BigQuery to S3 (Parquet)
From BQ Dataflow to S3 Parquet
Parameter Input
AWSRegion us-east-1
inputTableSpec project:dataset.table
outputPath gs://bucket_y/test
successFileName _SUCCESS
schema original_app_id:STRING,app_id:LONG
writeIntoS3 true
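The schema parameter is a comma-separated list of name:TYPE pairs. A minimal sketch of parsing it into an ordered field map — a hypothetical helper mirroring the example value above, not the template's actual parser:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaStringSketch {
    // Parse "field:TYPE,field2:TYPE2" into an ordered field -> type map.
    static Map<String, String> parse(String schema) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String pair : schema.split(",")) {
            String[] kv = pair.split(":");
            fields.put(kv[0].trim(), kv[1].trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> f = parse("original_app_id:STRING,app_id:LONG");
        System.out.println(f.get("app_id")); // prints LONG
    }
}
```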
public interface Options extends BigQueryToParquetOptions {
}
----------------------------------------------------------------------------------------------------
// Wire AWS credentials and region into Beam's AwsOptions.
BasicAWSCredentials awsCred =
        new BasicAWSCredentials(options.getAWSAccessKey(), options.getAWSSecretKey());
options.as(AwsOptions.class).setAwsCredentialsProvider(new AWSStaticCredentialsProvider(awsCred));
options.as(AwsOptions.class).setAwsRegion(options.getAWSRegion());
What we need to connect to AWS
private static PCollection<TableRow> executeSql(Pipeline p, String sql) {
    return p.apply(BigQueryIO.readTableRows()
            .fromQuery(sql)
            .withMethod(BigQueryIO.TypedRead.Method.DIRECT_READ)
            .usingStandardSql());
}
How to read from BigQuery
Problems / Tips
- Choose the right region to reduce latency and cost.
- To avoid extraction quota issues, use DIRECT_READ.
- FileIO only writes.
- Be careful with complex types (arrays, nested arrays).
Dealing with Streaming
Custom Airflow plugin to trigger Dataflow Jobs
Questions?
Thank you!
