Native Spark Executors on Kubernetes
Diving into the Data Lake
Grace Chang
Mariano Gonzalez
Chicago Cloud Conference 2019
bit.ly/spark-k8s-code
The Presenters
Who are we?
Grace Chang, Senior Big Data Engineer at Glassdoor
Grace is an engineer with years of experience ingesting, transforming, and storing data. Before that, she spent her time building machine learning models as a data scientist.
Mariano Gonzalez, Principal Data Engineer at Jellyvision
Mariano is an engineer with more than 15 years of experience with the JVM. He enjoys working with and exploring a variety of big data technologies. He is an avid and prolific open-source contributor.
Most importantly, we are just two people trying to learn about and
share big data technologies and approaches.
Agenda
What are we going to talk about?
● Clarification of Assumptions
● Sharing of Motivations
● Discussion on Data Lakes
● Challenges Description
● Inspiration Explanation
● Description of Solution
● Demo of Solution
The Goal
Let’s start from the beginning: What are we trying to achieve?
Diagram: Different Types and Formats of Data → Data Pipelines → Data Storage → User
Ingest, process, and surface large amounts of data in an accessible way.
Our Motivation
Why are we talking about this?
We have a complicated relationship with infrastructure.
● We Have Tried Three Different Spark Infrastructure Implementations: We observed the strengths and weaknesses of each of the implementations.
● We Have Tried Both Managed and Unmanaged Solutions: We tried out the popular solutions and observed the pros and cons of the technologies used.
● We searched for new, elegant ways to set up Spark infrastructure on a data lake.
Data Lake Introduction
Where did the term come from?
The concept of a data lake has been
around for a while.
The term “data lake” was first
introduced in 2010 by James Dixon,
CTO at Pentaho.
“If you think of a data mart as a store of
bottled water – cleansed and packaged
and structured for easy consumption –
the data lake is a large body of water in a
more natural state. The contents of the
data lake stream in from a source to fill
the lake, and various users of the lake
can come to examine, dive in, or take
samples.”
James Dixon, CTO of Pentaho
Characteristics of a Data Lake
What makes up a data lake?
● Consolidation: Centralizing data brings a number of benefits, including easier governance and management, and it makes it easier to discover heterogeneous data sets non-disruptively.
● Agility: Extending the architecture to different workloads is not difficult.
● Collect and Store All Data at Any Scale: Cloud object storage services provide virtually unlimited space at very low cost.
● Locate, Curate, and Secure Data: Having a centralized data lake makes it easier to keep track of what data you have, who has access to it, what type of data you are storing, and what it’s being used for.
Data Lake vs Data Warehouse
What are the differences?
Data Stored in Its Native Format
Flexibility When Accessing Data
A data lake is not a direct replacement for a data warehouse; they are complementary technologies that serve different use cases with some overlap.
● Data can be loaded faster and accessed more quickly since it does not need to go through an initial transformation process.
● For traditional relational databases, data would need to be processed and
manipulated before being stored.
● Data scientists and engineers can access data faster than it would be possible in
a traditional BI architecture.
Data Lake vs Data Warehouse cont.
What are the differences?
Schema on Read
● Traditional data warehouses employ Schema-on-Write.
○ This requires an upfront data modeling exercise to define the schema for the
data.
● Schema-on-Read allows the schema to be developed and tailored on a case-by-case basis.
○ The schema is developed and projected on the data sets required for a
particular use case.
○ This means that the data required is processed as needed.
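To make Schema-on-Read concrete, here is a minimal PySpark sketch (the field names and bucket path are illustrative assumptions, not taken from the talk): the schema is declared by the reader at query time and projected onto raw JSON files that were stored without any upfront modeling.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-example").getOrCreate()

# The consumer decides the schema when reading; nothing was enforced at write time.
tweet_schema = StructType([
    StructField("id", StringType()),
    StructField("text", StringType()),
    StructField("created_at", TimestampType()),
])

# Hypothetical bucket path; the raw files were landed as-is.
tweets = spark.read.schema(tweet_schema).json("s3a://example-bronze-bucket/tweets/")
tweets.printSchema()
```

A different use case can read the same files with a different, narrower schema, which is exactly the "developed and tailored on a case-by-case basis" idea above.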
Avoiding the Data Swamp
How do we navigate challenges?
Although a data lake is a great
solution to manage data in a modern
data-driven environment, it is not
without its significant challenges.
“We see customers creating big data
graveyards, dumping everything into
the data lake and hoping to do
something with it down the road. But
then they just lose track of what’s there.
The main challenge is not creating a data lake but taking advantage of the
opportunities it presents.”
Sean Martin, CTO of Cambridge Semantics
Data Lake Governance Challenges
What should we be aware of?
● Data Lineage: Where did the data come from and what has happened to it?
● Data Quality: Is the data accurate and fit for its purposes?
● Data Security: Is the data protected from unauthorized access?
● Data Catalog: What data do you have and where is it stored?
The Main Challenges
How do we balance and address these?
1. Operational Complexity of Maintaining a Data Lake Architecture
2. Static Architecture Resulting in Unextendable Pipelines
3. User Complexity in Accessing Large and Fragmented Data
Operational Complexity of Maintaining Data Lake Architecture
Some Cluster Management Solutions: YARN and Spark Standalone Clusters
Some Challenges:
● Cluster resizing difficulties
● Forced to scale compute and storage at the same time
● Often requires a vendor-specific bundle
● Lack of dynamic resource allocation
● Security setup labyrinth (AD, Kerberos, Centrify)
● (Spark standalone) Difficulty maintaining a cluster with different runtimes
● (YARN) Exorbitant amount of configuration (massive XML files)
Static Architecture Resulting in Unextendable Pipelines
Difficult to evolve an architecture initially designed for one type of data ingestion
○ e.g. Adding a streaming component to an existing batch architecture is an involved undertaking
User Complexity in Accessing Large and Fragmented Data
● Friction between users (data scientists, analysts, etc.) and convenient data access patterns
● Data from different sources can be hard to merge together
○ Pipelines are often limited to one data type (e.g. can only ingest Avro)
Inspiration
How did we get to our solution?
● Medium article on Data Infrastructure at Airbnb (2016)
● Databricks Delta Lake (2019)
Solution
How did we get to our solution?
A Central Place Where Engineers and Data Scientists Can Collaborate and Run Diverse
Workflows
● K8s to do cluster management/scaling
● Spark to do data transformations
● Managed Services
● Bronze/Silver/Gold pipeline organization
Solution: Storage
How did we get to our solution?
Most data lake implementations use cloud object storage as the underlying storage
technology. It is recommended for the following reasons:
Durable: You can typically expect eleven 9’s of durability.
Scalable: Object storage is particularly well suited to storing vast amounts of
unstructured data, and storage capacity is virtually unlimited.
Affordable: You can store data for approximately $0.01 per GB/Month.
Secure: Granular access down to the object level.
Integrated: Most processing technologies support object storage as both a source
and a data sink.
Can it be a managed service? YES!
Solution: Compute
How did we get to our solution?
K8s has positioned itself as the de facto container orchestrator, and Spark
continues to be the tool of choice for running big data workloads. So why
don’t we run Spark on K8s?
Extendible: No extra infrastructure is needed to run your data workloads.
Upgradable: Running Spark executors on K8s means we no longer need to
upgrade the cluster; all runtimes are available right from the beginning via
Docker images (Scala, Python, R).
Can it be a managed service? YES!
Why Use Kubernetes as a Cluster Manager?
Are your data analytics pipelines already containerized?
If a typical pipeline requires a broker, Spark, a database, and some visualization
layer, and everything already runs in containers, it makes sense to also run Spark in
containers and use K8s to manage your entire pipeline.
Resource sharing is better optimized
Instead of running your pipeline on dedicated hardware, running it on a Kubernetes
cluster gives much better resource sharing, since not all components of a pipeline
are running all the time.
Why Use Kubernetes as a Cluster Manager?
Leveraging the Kubernetes Ecosystem
Spark workloads can make direct use of Kubernetes features for multi-tenancy
and sharing, such as Namespaces and Quotas, as well as administrative features
such as Pluggable Authorization and Logging.
It requires no changes or new installations on the cluster; simply create a
container image and set up the right RBAC roles for your Spark Application and
you’re all set.
So How Does This Actually Work?
spark-submit can be directly used to submit a Spark
application to a Kubernetes cluster. The submission
mechanism works as follows:
● Spark creates a Spark driver running within a Kubernetes pod.
● The driver creates executors, which also run within Kubernetes pods, connects to them, and executes the application code.
● When the application completes, the executor pods
terminate and are cleaned up, but the driver pod
persists logs and remains in “completed” state in the
Kubernetes API until it’s eventually garbage collected
or manually cleaned up.
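As a rough sketch of what such a submission can look like (the API server address, image name, application path, and service account below are placeholder assumptions, not the talk's actual values), the Kubernetes scheduler is selected with a k8s:// master URL plus a handful of spark.kubernetes.* settings; here it is driven from Python to keep all examples in one language:

```python
# A minimal sketch of submitting a PySpark application to Kubernetes in cluster mode.
# Every concrete value below (API server, image, service account, app path) is a placeholder.
import subprocess

subprocess.run([
    "spark-submit",
    "--master", "k8s://https://example-k8s-api-server:6443",
    "--deploy-mode", "cluster",
    "--name", "spark-consumer",
    "--conf", "spark.executor.instances=2",
    "--conf", "spark.kubernetes.container.image=example-registry/spark-py:2.4.3",
    "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName=spark",
    "local:///opt/spark/app/spark_consumer.py",   # application code baked into the image
], check=True)
```

Once submitted, the driver and executor pods show up like any other pods, so the usual kubectl tooling (kubectl get pods, kubectl logs) applies.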
Advantages of the Solution
Decoupling of compute and storage, which
means they can scale independently!
Advantages of the Solution: Compute
● A managed Kubernetes engine is available from all major cloud providers (AWS, GCP, Azure)
● Multiple data science and engineering workloads:
○ Streaming / Real Time: CPU intensive
○ Batch / Analytic: Storage intensive
● Minimize operational burden with a managed service while maximizing flexibility by taking advantage of Kubernetes
○ Dynamic and lightweight scaling
● Since we are deploying Docker images, there is no need to “patch” the cluster
○ Developers are responsible for updating runtimes and libraries:
■ Scala/Java/Python/R dependencies
Advantages of the Solution: Storage
● Object storage as a managed service!
● Using S3 provides a stable and unified source of data
● Option to ingest data of different types:
○ Structured
○ Semi-structured
○ Unstructured
● Real-time and historical data are available without many changes
○ No roll-up processes to another storage technology
Our Example Implementation
Stream Data From Twitter → Save Data to Data Lake → Transform Saved Data in Data Lake → Aggregate Data in Data Lake → Visualize Data in Data Lake
Architecture Diagram
Architecting the Data Lake Storage
A common misconception, and a potential mistake, is that a data lake is one giant,
centralized bucket where everything needs to land. That is not true!
Combining the Airbnb and Databricks ideas, we arrive at the following architecture:
Bronze Bucket: raw data lands here (Avro, JSON, CSV, etc.)
Silver Bucket: columnar data (Parquet, ORC)
Gold Bucket: columnar data (aggregated)
Finally, time for some code!
bit.ly/spark-k8s-code
Bronze Pipeline
● Producer (twitter_producer.py): a pod in Kubernetes; tweets are streamed into the producer.
● Consumer (spark_consumer.py): a Spark Streaming job in Kubernetes; the producer sends it data over TCP.
● S3 Bucket (Bronze): a JSON file is appended.
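A minimal sketch of the consumer side of this pipeline is shown below. It is not the repository's spark_consumer.py verbatim, and the host, port, and bucket names are assumptions; it reads the producer's TCP socket with Structured Streaming and appends the raw JSON lines to the bronze bucket untouched.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze-consumer").getOrCreate()

# Each record arriving over the socket is one raw tweet as a JSON string.
raw_tweets = (spark.readStream
              .format("socket")
              .option("host", "twitter-producer")   # hypothetical service name of the producer pod
              .option("port", 9999)                 # hypothetical port
              .load())

# Append the raw JSON lines to the bronze bucket without transforming them.
query = (raw_tweets.writeStream
         .format("text")
         .option("path", "s3a://example-bronze-bucket/tweets/")
         .option("checkpointLocation", "s3a://example-bronze-bucket/_checkpoints/tweets/")
         .start())

query.awaitTermination()
```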
Silver Pipeline
● S3 Bucket (Bronze): JSON file.
● Spark Streaming job in Kubernetes: the transformer pulls data from the bronze bucket, converts it to Parquet, and saves it to the silver bucket.
● S3 Bucket (Silver): a Parquet file is written.
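A minimal sketch of the transformer (bucket names are assumptions): read the raw JSON that landed in bronze, let Spark project a schema over it, and write Parquet to the silver bucket.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver-transformer").getOrCreate()

# Read the raw JSON from the bronze bucket; Spark infers the schema on read.
tweets = spark.read.json("s3a://example-bronze-bucket/tweets/")

# Write the same data in a columnar format to the silver bucket.
(tweets.write
    .mode("append")
    .parquet("s3a://example-silver-bucket/tweets/"))
```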
Gold Pipeline
● S3 Bucket (Silver): Parquet file.
● Spark Streaming job in Kubernetes: the aggregator pulls data from the silver bucket, aggregates it, then saves it to the gold bucket.
● S3 Bucket (Gold): a Parquet file containing the aggregated data is written.
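A minimal sketch of the aggregator (the bucket names and the lang column are assumptions): read the silver Parquet, aggregate, and write the result to the gold bucket.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gold-aggregator").getOrCreate()

tweets = spark.read.parquet("s3a://example-silver-bucket/tweets/")

# Aggregate tweet counts per language (a hypothetical column in the tweet data).
tweet_counts = (tweets
                .groupBy("lang")
                .agg(F.count("*").alias("tweet_count")))

tweet_counts.write.mode("overwrite").parquet("s3a://example-gold-bucket/tweet_counts_by_lang/")
```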
Visualization
● S3 Bucket (Gold): Parquet file of aggregated data.
● Zeppelin in Kubernetes: allows the user to query the data.
● The user queries the data.
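A Zeppelin paragraph querying the gold bucket might look roughly like this (same hypothetical bucket and column names as above; in Zeppelin the spark session is provided by the interpreter):

```python
# Sketch of a %pyspark paragraph in Zeppelin; `spark` is predefined by the interpreter.
gold = spark.read.parquet("s3a://example-gold-bucket/tweet_counts_by_lang/")
gold.createOrReplaceTempView("tweet_counts_by_lang")

spark.sql("""
    SELECT lang, tweet_count
    FROM tweet_counts_by_lang
    ORDER BY tweet_count DESC
    LIMIT 10
""").show()
```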
Future Work
What about all the management, monitoring, and scheduling for Spark jobs?
● State
● Metrics
● Retries
● Logs
Thankfully, Google has announced the Spark Kubernetes Operator, which will take care of all that (and more).
Spark K8s operator Roadmap: GoogleCloudPlatform/spark-on-k8s-operator/issues/338
Questions?
Links
https://kubernetes.io/docs/home/
https://spark.apache.org/docs/latest/
https://zeppelin.apache.org/docs/latest/
https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
https://databricks.com/product/databricks-delta
https://docs.aws.amazon.com/s3/index.html
https://docs.aws.amazon.com/eks/index.html
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator