Best Practices for Building and Deploying Data Pipelines in Apache Spark
Vicky Avison, Cox Automotive UK
Alex Bush, KPMG Lighthouse New Zealand
#UnifiedDataAnalytics #SparkAISummit
Cox Automotive Data Platform
KPMG Lighthouse
Centre of Excellence for Information and Analytics
We provide services across the data value chain including:
● Data strategy and analytics maturity assessment
● Information management
● Data engineering
● Data warehousing, business intelligence (BI) and data visualisation
● Data science, advanced analytics and artificial intelligence (AI)
● Cloud-based analytics services
What is this talk about?
- What are data pipelines and who builds them?
- Why is data pipeline development difficult to get right?
- How have we changed the way we develop and deploy our data pipelines?
What are data pipelines and who builds them?
What do we mean by ‘Data Pipeline’?
Data Sources (e.g. files, relational databases, REST APIs) feed a Data Platform (storage + compute), in two steps:
1. Ingest the raw data (table_a, table_b, table_c, table_d, table_e, table_f, ...)
2. Prepare the data for use in further analysis and dashboards (producing data model_a, data model_b, data model_c, data model_d, ...), meaning:
   a. Deduplication (see the sketch below)
   b. Cleansing
   c. Enrichment
   d. Creation of data models, i.e. joins, aggregations etc.
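As one concrete illustration of the deduplication step (2a), here is a minimal Spark sketch that keeps only the latest version of each record. The path and the pk/last_updated column names are illustrative assumptions, not from the talk:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().getOrCreate()

// Raw ingested data may contain several versions of the same record.
val raw = spark.read.parquet("/data/raw/table_a")

// Rank versions of each record, most recent first.
val latestFirst = Window.partitionBy(col("pk")).orderBy(col("last_updated").desc)

// Keep only the most recent version per primary key.
val deduplicated = raw
  .withColumn("rn", row_number().over(latestFirst))
  .filter(col("rn") === 1)
  .drop("rn")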
Who is in a data team?
Data Engineering: deep understanding of the technology; they know how to build robust, performant data pipelines.
Business Intelligence and Data Science: deep understanding of the data; they know how to extract business value from the data.
Why is data pipeline development difficult to get right?
What do we need to think about when building a pipeline?
1. How do we handle late-arriving or duplicate data?
2. How can we ensure that if the pipeline fails part-way through, we can run it again without any problems? (See the sketch after this list.)
3. How do we avoid the small-file problem?
4. How do we monitor data quality?
5. How do we configure our application? (credentials, input paths, output paths etc.)
6. How do we maximise performance?
7. How do we extract only what we need from the source (e.g. only extract new records from an RDBMS)?
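On question 2, one common answer is to make each run overwrite exactly the partitions it produces, so a failed run can simply be re-executed. A minimal sketch, assuming the data is partitioned by an extract_date column; all paths and names are illustrative:

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().getOrCreate()

// Only overwrite the partitions present in this batch, not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val batch = spark.read.parquet("/data/staging/table_a") // illustrative input

batch.write
  .mode(SaveMode.Overwrite)
  .partitionBy("extract_date")
  .parquet("/data/prepared/table_a")

Re-running the same batch now rewrites the same partitions with the same data, making the pipeline idempotent.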
What about the business logic?
Starting from the deduplicated raw data (table_a, table_b, table_c, table_d): cleanup, convert data types, create user-friendly column names, add derived columns etc., then build models:
- a_b_model: join a and b together
- d_counts: group by and perform counts on d
- b_c_d_model: join b, c and the aggregated d together
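To make the shape of that logic concrete, the same DAG written directly against the DataFrame API might look like the sketch below; the database name, join keys and grouping column are invented for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().getOrCreate()

// Deduplicated raw tables (names as in the diagram above).
val tableA = spark.table("raw_db.table_a")
val tableB = spark.table("raw_db.table_b")
val tableC = spark.table("raw_db.table_c")
val tableD = spark.table("raw_db.table_d")

// a_b_model: join a and b together
val aBModel = tableA.join(tableB, "pk")

// d_counts: group by and perform counts
val dCounts = tableD.groupBy("group_col").agg(count("*").as("cnt"))

// b_c_d_model: join b, c and the aggregated d together
val bCDModel = tableB.join(tableC, "pk").join(dCounts, "group_col")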
What about deployments?
Traditional software development (e.g. web development): software is deployed to an environment (a server) and handles interactions and responses.
Data development: software is deployed to a deployment location, but it also consumes input data, so the environment is really the data location: the paths and Hive databases the application reads from and writes to.
What are the main challenges?
- A lot of overhead for every new data pipeline, even when the problems are very similar each time
- Production-grade business logic is hard to write without specialist Data Engineering skills
- No tools or best practice around deploying and managing environments for data pipelines
How have we changed the way we develop and deploy our data pipelines?
A long(ish) time ago, in an office quite far away….
How were we dealing with the main challenges?
A lot of overhead for every new data pipeline, even when the problems are very similar each time.
We were… shoehorning new pipeline requirements into a single application in an attempt to avoid the overhead.
How were we dealing with the main challenges?
Production-grade business logic is hard to write without specialist Data Engineering skills.
We were… taking business logic defined by our BI and Data Science colleagues and reimplementing it.
How were we dealing with the main challenges?
No tools or best practice around deploying and managing environments for data pipelines.
We were… manually deploying jars, passing environment-specific configuration to our applications each time we ran them.
Could we make better use of the skills in the team?
Data Engineering (deep understanding of the technology): Data Platform, Data Ingestion Applications, Tools and Frameworks.
Shared by both: Business Logic Applications.
Business Intelligence and Data Science (deep understanding of the data): Business Engagement, Modelling, Consulting, Business Logic Definition, Data Exploration.
What tools and frameworks would we need to provide?
Third-party services and libraries: Spark and Hadoop, KMS, Delta Lake, Deequ, etc.
Data Engineering frameworks: Configuration Management, Idempotency and Atomicity, Deduplication, Compaction, Table Metadata Management, Action Coordination, Boilerplate and Structuring.
Data Engineering tools: Environment Management, Application Deployment.
Data Engineering Applications: Data Ingestion.
Data Science and Business Intelligence Applications: Business Logic.
How would we design a Data Engineering framework?
High-level APIs sit on top of Spark and Hadoop. Business logic is expressed as { Inputs } → { Transformations } → { Outputs }, while performance optimisations, data quality monitoring, deduplication, compaction, configuration management etc. are handled beneath the API. The goals:
- Complexity hidden behind high-level APIs
- Intuitive structuring of business logic code
- Injection of optimisations and monitoring
- Efficient scheduling of transformations and actions
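In outline, such an API could look like the trait below. This is a sketch only, not Waimak's actual interface: labelled inputs, pure transformations between labels, and explicit output actions, so the framework can schedule, optimise and monitor the resulting DAG:

import org.apache.spark.sql.DataFrame

// Illustrative only: a flow accumulates labelled datasets and the
// transformations between them, then executes the whole DAG at once.
trait DataFlow {
  def addInput(label: String, df: DataFrame): DataFlow
  def transform(inputs: String*)(output: String)(f: Seq[DataFrame] => DataFrame): DataFlow
  def write(label: String)(action: DataFrame => Unit): DataFlow
  def execute(): Unit // scheduling, caching and monitoring are injected here
}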
How would we like to manage deployments?
master (v 0.1 → v 0.2):
- Paths: /data/prod/my_project
- Hive databases: prod_my_project
- Deployed jars: my_project-0.1.jar, then my_project-0.2.jar
feature/one:
- Paths: /data/dev/my_project/feature_one
- Hive databases: dev_my_project_feature_one
- Deployed jars: my_project_feature_one-0.2-SNAPSHOT.jar
feature/two:
- Paths: /data/dev/my_project/feature_two
- Hive databases: dev_my_project_feature_two
- Deployed jars: my_project_feature_two-0.2-SNAPSHOT.jar
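This convention can be captured in a few lines. A sketch, assuming an environment is fully determined by (project, environment, branch); the helper and its names are hypothetical:

// Hypothetical helper mirroring the path/database convention above.
case class DeploymentEnv(project: String, environment: String, branch: String) {
  // feature/one -> feature_one
  private val branchSlug = branch.replaceAll("[^A-Za-z0-9]+", "_")

  def basePath: String =
    if (environment == "prod") s"/data/prod/$project"
    else s"/data/$environment/$project/$branchSlug"

  def baseDBName: String =
    if (environment == "prod") s"prod_$project"
    else s"${environment}_${project}_$branchSlug"
}

For example, DeploymentEnv("my_project", "dev", "feature/one").basePath yields /data/dev/my_project/feature_one and baseDBName yields dev_my_project_feature_one, matching the table above.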
What does this look like in practice?
Simpler data ingestion

case class SQLServerConnectionDetails(server: String, user: String, password: String)

val dbConf = CaseClassConfigParser[SQLServerConnectionDetails](
  SparkFlowContext(spark), "app1.dbconf"
)

val flow = Waimak.sparkFlow(spark)
  .extractToStorageFromRDBM(
    rdbmExtractor = new SQLServerTemporalExtractor(spark, dbConf),
    dbSchema = ...,
    storageBasePath = ...,
    tableConfigs = ...,
    extractDateTime = ZonedDateTime.now(),
    doCompaction = runSingleCompactionDuringWindow(23, 4)
  )("table1", "table2", "table3")

flow.execute()

- Retrieve server configuration from a combination of Spark conf and Databricks Secrets
- Pull deltas from SQL Server temporal tables and store them in the storage layer
- The storage layer captures last-updated values and primary keys
- Small files are compacted once between 11pm and 4am
- The flow is lazy: nothing happens until execute is called
Simpler business logic development

val flow = Waimak.sparkFlow(spark)
  .snapshotFromStorage(basePath, tsNow)("table1", "table2", "table3")
  .transform("table1", "table2")("model1")(
    (t1, t2) => t1.join(t2, "pk1")
  )
  .transform("table3", "model1")("model2")(
    (t3, m1) => t3.join(m1, "pk2")
  )
  .sql("table1", "model2")("reporting1",
    """select m2.pk1, count(t1.col1) from model2 m2
       left join table1 t1 on m2.pk1 = t1.fpk1
       group by m2.pk1"""
  )
  .writeHiveManagedTable("model_db")("model1", "model2")
  .writeHiveManagedTable("reporting_db")("reporting1")
  .sparkCache("model2")
  .addDeequCheck(
    "reporting1",
    Check(Error, "Not valid PK").isPrimaryKey("pk1")
  )(ExceptionQualityAlert())

- Read from the storage layer and deduplicate
- Perform two transformations on labels using the DataFrame API, generating two more labels
- Perform a Spark SQL transformation on a table and a label generated during a transform, generating one more label
- Write labels to two different databases
- Add explicit caching and data quality monitoring actions
- Execute with flow.execute()
Simpler environment management

case class MySparkAppEnv(project: String,     //e.g. my_spark_app
                         environment: String, //e.g. dev
                         branch: String       //e.g. feature/one
                        ) extends HiveEnv

An environment consists of a base path, /data/dev/my_spark_app/feature_one/, and a Hive database, dev_my_spark_app_feature_one.

Define application logic given a SparkSession and an environment:

object MySparkApp extends SparkApp[MySparkAppEnv] {
  override def run(sparkSession: SparkSession, env: MySparkAppEnv): Unit =
    //We have a base path and Hive database available in env via env.basePath and env.baseDBName
    ???
}

Use MultiAppRunner to run apps individually or together with dependencies:

spark.waimak.apprunner.apps = my_spark_app
spark.waimak.apprunner.my_spark_app.appClassName = com.example.MySparkApp
spark.waimak.environment.my_spark_app.environment = dev
spark.waimak.environment.my_spark_app.branch = feature/one
Simpler deployments
Questions?
github.com/CoxAutomotiveDataSolutions/waimak