Best Practices for Building and Deploying Data Pipelines in Apache Spark


Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.

We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.


1. Best Practices for Building and Deploying Data Pipelines in Apache Spark
   Vicky Avison, Cox Automotive UK
   Alex Bush, KPMG Lighthouse New Zealand
   #UnifiedDataAnalytics #SparkAISummit
2. Cox Automotive Data Platform
3. KPMG Lighthouse
   Centre of Excellence for Information and Analytics
   We provide services across the data value chain including:
   ● Data strategy and analytics maturity assessment
   ● Information management
   ● Data engineering
   ● Data warehousing, business intelligence (BI) and data visualisation
   ● Data science, advanced analytics and artificial intelligence (AI)
   ● Cloud-based analytics services
4. What is this talk about?
   - What are data pipelines and who builds them?
   - Why is data pipeline development difficult to get right?
   - How have we changed the way we develop and deploy our data pipelines?
5. What are data pipelines and who builds them?
6. What do we mean by ‘Data Pipeline’?
   Data moves from the Data Sources (e.g. files, relational databases, REST APIs) into the Data Platform (storage + compute) in two steps:
   1. Ingest the raw data (raw tables: table_a, table_b, table_c, table_d, table_e, table_f, ...)
   2. Prepare the data for use in further analysis and dashboards, meaning:
      a. Deduplication
      b. Cleansing
      c. Enrichment
      d. Creation of data models, i.e. joins, aggregations etc. (data model_a, data model_b, data model_c, data model_d)
7. Who is in a data team?
   - Data Engineering: deep understanding of the technology; know how to build robust, performant data pipelines
   - Business Intelligence and Data Science: deep understanding of the data; know how to extract business value from the data
8. Why is data pipeline development difficult to get right?
9. What do we need to think about when building a pipeline?
   1. How do we handle late-arriving or duplicate data?
   2. How can we ensure that if the pipeline fails part-way through, we can run it again without any problems? (see the sketch after this list)
   3. How do we avoid the small-file problem?
   4. How do we monitor data quality?
   5. How do we configure our application? (credentials, input paths, output paths etc.)
   6. How do we maximise performance?
   7. How do we extract only what we need from the source (e.g. only extract new records from an RDBMS)?
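A minimal sketch of points 1 and 2, in plain Spark rather than Waimak code: keep only the latest record per primary key, so that re-running an ingest over late-arriving or duplicate data converges to the same result. The pk and last_updated column names are illustrative assumptions.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // Keep only the most recent record per primary key, so a re-run over
    // late-arriving or duplicate data produces the same output (idempotent).
    // Column names (pk, last_updated) are illustrative, not a Waimak convention.
    def deduplicateLatest(df: DataFrame): DataFrame = {
      val latestFirst = Window.partitionBy("pk").orderBy(col("last_updated").desc)
      df.withColumn("rn", row_number().over(latestFirst))
        .filter(col("rn") === 1)
        .drop("rn")
    }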
10. What about the business logic?
    Starting from the deduplicated raw data (table_a, table_b, table_c, table_d): cleanup, convert data types, create user-friendly column names, add derived columns etc., then build the models (sketched below):
    - a_b_model: join a and b together
    - d_counts: group by and perform counts
    - b_c_d_model: join b, c and the aggregated d together
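A minimal sketch of that model-building logic in plain DataFrame code; the join keys (a_id, b_id, c_id) are assumptions and the cleanup step is elided.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.count

    // Illustrative only: join keys and column names are assumed.
    def buildModels(tableA: DataFrame, tableB: DataFrame,
                    tableC: DataFrame, tableD: DataFrame): Map[String, DataFrame] = {
      // a_b_model: join a and b together
      val aBModel = tableA.join(tableB, Seq("a_id"))
      // d_counts: group by and perform counts
      val dCounts = tableD.groupBy("b_id").agg(count("*").as("d_count"))
      // b_c_d_model: join b, c and the aggregated d together
      val bCDModel = tableB.join(tableC, Seq("c_id")).join(dCounts, Seq("b_id"))
      Map("a_b_model" -> aBModel, "d_counts" -> dCounts, "b_c_d_model" -> bCDModel)
    }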
11. What about deployments?
    - Traditional software development (e.g. web development): the deployment location is an environment (a server); the deployed software handles interaction and response.
    - Data development: the deployment location is an environment defined by data locations (paths and Hive databases); the deployed software handles input data.
12. What are the main challenges?
    - A lot of overhead for every new data pipeline, even when the problems are very similar each time
    - Production-grade business logic is hard to write without specialist Data Engineering skills
    - No tools or best practice around deploying and managing environments for data pipelines
13. How have we changed the way we develop and deploy our data pipelines?
14. A long(ish) time ago, in an office quite far away….
15. How were we dealing with the main challenges?
    A lot of overhead for every new data pipeline, even when the problems are very similar each time.
    We were… shoehorning new pipeline requirements into a single application in an attempt to avoid the overhead.
16. How were we dealing with the main challenges?
    Production-grade business logic is hard to write without specialist Data Engineering skills.
    We were… taking business logic defined by our BI and Data Science colleagues and reimplementing it.
17. How were we dealing with the main challenges?
    No tools or best practice around deploying and managing environments for data pipelines.
    We were… manually deploying jars, passing environment-specific configuration to our applications each time we ran them.
18. Could we make better use of the skills in the team?
    Split the responsibilities along the team's strengths:
    - Business Intelligence and Data Science (deep understanding of the data): business engagement, consulting, modelling, data exploration, business logic definition and business logic applications
    - Data Engineering (deep understanding of the technology): the data platform, data ingestion applications, and tools and frameworks
19. What tools and frameworks would we need to provide?
    - Third-party services and libraries: Spark and Hadoop, KMS, Delta Lake, Deequ, etc.
    - Data Engineering frameworks: configuration management, idempotency and atomicity, deduplication, compaction, table metadata management, action coordination, boilerplate and structuring
    - Data Engineering tools: environment management, application deployment
    - Data Engineering applications: data ingestion
    - Data Science and Business Intelligence applications: business logic
20. How would we design a Data Engineering framework?
    Business logic is expressed as { Inputs } → { Transformations } → { Outputs } against high-level APIs, while the framework sits between that logic and Spark and Hadoop, handling performance optimisations, data quality monitoring, deduplication, compaction, configuration management etc. The key properties:
    - Complexity hidden behind high-level APIs
    - Intuitive structuring of business logic code
    - Injection of optimisations and monitoring
    - Efficient scheduling of transformations and actions
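Purely as a hypothetical sketch of the shape of such an API (this is not Waimak's actual interface; see slide 24 for real Waimak code): business logic registers labelled inputs, transformations and outputs, and the framework owns execution.

    import org.apache.spark.sql.DataFrame

    // Hypothetical interface, for illustration only.
    trait DataFlow {
      // Register a labelled input, e.g. a table read from the storage layer
      def addInput(label: String, df: DataFrame): DataFlow
      // Register a transformation from input labels to a new output label
      def transform(inputLabels: Seq[String], outputLabel: String)
                   (f: Seq[DataFrame] => DataFrame): DataFlow
      // Register an output action for a label, e.g. write to a Hive table
      def write(label: String, target: String): DataFlow
      // Scheduling, caching, compaction and monitoring are injected here
      def execute(): Unit
    }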
21. How would we like to manage deployments?
    Branch              | Paths                            | Hive databases             | Deployed jars
    master (v0.1, v0.2) | /data/prod/my_project            | prod_my_project            | my_project-0.1.jar, my_project-0.2.jar
    feature/one         | /data/dev/my_project/feature_one | dev_my_project_feature_one | my_project_feature_one-0.2-SNAPSHOT.jar
    feature/two         | /data/dev/my_project/feature_two | dev_my_project_feature_two | my_project_feature_two-0.2-SNAPSHOT.jar
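A minimal sketch of the naming convention implied by the table above, deriving the data path and Hive database name from project, environment and branch; the helper itself is illustrative (the Waimak environment support on slide 25 derives these for you).

    // Illustrative helper: maps (project, environment, branch) to a base path
    // and a Hive database name following the convention in the table above.
    def environmentNames(project: String, environment: String, branch: String): (String, String) = {
      val branchSuffix = branch.replaceAll("[^A-Za-z0-9]", "_") // e.g. feature/one -> feature_one
      if (environment == "prod")
        (s"/data/prod/$project", s"prod_$project")
      else
        (s"/data/$environment/$project/$branchSuffix", s"${environment}_${project}_$branchSuffix")
    }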
22. What does this look like in practice?
23. Simpler data ingestion

    case class SQLServerConnectionDetails(server: String, user: String, password: String)

    // Retrieve server configuration from combination of Spark conf and Databricks Secrets
    val dbConf = CaseClassConfigParser[SQLServerConnectionDetails](
      SparkFlowContext(spark), "app1.dbconf"
    )

    // Pull deltas from SQLServer temporal tables and store in storage layer;
    // the storage layer will capture last updated values and primary keys.
    // Small files will be compacted once between 11pm and 4am.
    val flow = Waimak.sparkFlow(spark)
      .extractToStorageFromRDBM(
        rdbmExtractor = new SQLServerTemporalExtractor(spark, dbConf),
        dbSchema = ...,
        storageBasePath = ...,
        tableConfigs = ...,
        extractDateTime = ZonedDateTime.now(),
        doCompaction = runSingleCompactionDuringWindow(23, 4)
      )("table1", "table2", "table3")

    // Flow is lazy, nothing happens until execute is called
    flow.execute()
24. Simpler business logic development

    val flow = Waimak.sparkFlow(spark)
      // Read from storage layer and deduplicate
      .snapshotFromStorage(basePath, tsNow)("table1", "table2", "table3")
      // Perform two transformations on labels using the DataFrame API, generating two more labels
      .transform("table1", "table2")("model1")(
        (t1, t2) => t1.join(t2, "pk1")
      )
      .transform("table3", "model1")("model2")(
        (t3, m1) => t3.join(m1, "pk2")
      )
      // Perform a Spark SQL transformation on a table and a label generated during a transform, generating one more label
      .sql("table1", "model2")("reporting1",
        """select m2.pk1, count(t1.col1)
           from model2 m2
           left join table1 t1 on m2.pk1 = t1.fpk1
           group by m2.pk1"""
      )
      // Write labels to two different databases
      .writeHiveManagedTable("model_db")("model1", "model2")
      .writeHiveManagedTable("reporting_db")("reporting1")
      // Add explicit caching and data quality monitoring actions
      .sparkCache("model2")
      .addDeequCheck(
        "reporting1",
        Check(Error, "Not valid PK").isPrimaryKey("pk1")
      )(ExceptionQualityAlert())

    // Execute with flow.execute()
    flow.execute()
25. Simpler environment management

    // An environment consists of a base path, e.g. /data/dev/my_spark_app/feature_one/,
    // and a Hive database, e.g. dev_my_spark_app_feature_one
    case class MySparkAppEnv(project: String,     //e.g. my_spark_app
                             environment: String, //e.g. dev
                             branch: String       //e.g. feature/one
                            ) extends HiveEnv

    // Define application logic given a SparkSession and an environment
    object MySparkApp extends SparkApp[MySparkAppEnv] {
      override def run(sparkSession: SparkSession, env: MySparkAppEnv): Unit =
        //We have a base path and Hive database available in env via env.basePath and env.baseDBName
        ???
    }

    Use MultiAppRunner to run apps individually or together with dependencies:

    spark.waimak.apprunner.apps = my_spark_app
    spark.waimak.apprunner.my_spark_app.appClassName = com.example.MySparkApp
    spark.waimak.environment.my_spark_app.environment = dev
    spark.waimak.environment.my_spark_app.branch = feature/one
26. Simpler deployments
27. Questions?
    github.com/CoxAutomotiveDataSolutions/waimak