Native Spark Executors on Kubernetes
Diving into the Data Lake
Grace Chang
Mariano Gonzalez
Chicago Cloud Conference 2019
bit.ly/spark-k8s-code
The Presenters
Who are we?
Grace Chang, Senior Big Data Engineer at Glassdoor
Grace is an engineer with years of experience ingesting, transforming, and storing data. Before that, she spent her time building machine learning models as a data scientist.
Mariano Gonzalez, Principal Data Engineer at Jellyvision
Mariano is an engineer with more than 15 years of experience with the JVM. He enjoys working with and exploring a variety of big data technologies. He is an avid and prolific open-source contributor.
Most importantly, we are just two people trying to learn about and
share big data technologies and approaches.
Agenda
What are we going to talk about?
● Clarification of Assumptions
● Sharing of Motivations
● Discussion on Data Lakes
● Challenges Description
● Inspiration Explanation
● Description of Solution
● Demo of Solution
The Goal
Let’s start from the beginning: What are we trying to achieve?
Diagram: Different Types and Formats of Data → Data Pipelines → Data Storage → User
Ingest, process, and surface large amounts of data in an accessible way.
Our Motivation
Why are we talking about this?
We have a complicated relationship with infrastructure.
● We Have Tried Three Different Spark Infrastructure Implementations: We observed the strengths and weaknesses of each of the implementations.
● We Have Tried Both Managed and Unmanaged Solutions: We tried out the popular solutions and observed the pros and cons of the technologies used.
● We searched for new, elegant ways to set up Spark infrastructure on a data lake.
Data Lake Introduction
Where did the term come from?
The concept of a data lake has been
around for a while.
The term “data lake” was first
introduced in 2010 by James Dixon,
CTO at Pentaho.
“If you think of a data mart as a store of
bottled water – cleansed and packaged
and structured for easy consumption –
the data lake is a large body of water in a
more natural state. The contents of the
data lake stream in from a source to fill
the lake, and various users of the lake
can come to examine, dive in, or take
samples.”
James Dixon, CTO of Pentaho
Characteristics of a Data Lake
What makes up a data lake?
● Consolidation: Centralizing data brings a number of benefits, including easier governance and management, and it makes it easier to discover heterogeneous data sets non-disruptively.
● Agility: Extending the architecture to different workloads is not difficult.
● Collect and Store All Data at Any Scale: Cloud object storage services provide virtually unlimited space at very low cost.
● Locate, Curate, and Secure Data: Having a centralized data lake makes it easier to keep track of what data you have, who has access to it, what type of data you are storing, and what it’s being used for.
Data Lake vs Data Warehouse
What are the differences?
Data Stored in Its Native Format
Flexibility When Accessing Data
A data lake is not a direct replacement for a data warehouse; they are complementary technologies that serve different use cases with some overlap.
● Data can be loaded faster and accessed more quickly since it does not need to go through an initial transformation process.
● For traditional relational databases, data would need to be processed and
manipulated before being stored.
● Data scientists and engineers can access data faster than it would be possible in
a traditional BI architecture.
Data Lake vs Data Warehouse cont.
What are the differences?
Schema on Read
● Traditional data warehouses employ Schema-on-Write.
○ This requires an upfront data modeling exercise to define the schema for the
data.
● Schema-on-Read allows the schema to be developed and tailored on a case-by-case basis.
○ The schema is developed and projected on the data sets required for a
particular use case.
○ This means that the data required is processed as needed.
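To make Schema-on-Read concrete, here is a minimal PySpark sketch (the field names and bucket path are illustrative assumptions, not taken from the talk): the schema is declared by the reader at query time and projected onto raw JSON files that were stored without any upfront modeling.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-example").getOrCreate()

# The consumer decides the schema when reading; nothing was enforced at write time.
tweet_schema = StructType([
    StructField("id", StringType()),
    StructField("text", StringType()),
    StructField("created_at", TimestampType()),
])

# Hypothetical bucket path; the raw files were landed as-is.
tweets = spark.read.schema(tweet_schema).json("s3a://example-bronze-bucket/tweets/")
tweets.printSchema()
```

A different use case can read the same files with a different, narrower schema, which is exactly the "developed and tailored on a case-by-case basis" idea above.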
Avoiding the Data Swamp
How do we navigate challenges?
Although a data lake is a great
solution to manage data in a modern
data-driven environment, it is not
without its significant challenges.
“We see customers creating big data
graveyards, dumping everything into
the data lake and hoping to do
something with it down the road. But
then they just lose track of what’s there.
The main challenge is not creating a data lake but taking advantage of the
opportunities it presents.”
Sean Martin, CTO of Cambridge Semantics
Data Lake Governance Challenges
What should we be aware of?
● Data Lineage: Where did the data come from and what has happened to it?
● Data Quality: Is the data accurate and fit for its purposes?
● Data Security: Is the data protected from unauthorized access?
● Data Catalog: What data do you have and where is it stored?
The Main Challenges
How do we balance and address these?
1. Operational Complexity of Maintaining a Data Lake Architecture
2. Static Architecture Resulting in Unextendable Pipelines
3. User Complexity in Accessing Large and Fragmented Data
Operational Complexity of Maintaining Data Lake Architecture
Some Cluster Management Solutions: YARN and Spark Standalone Clusters
Some Challenges:
● Cluster resizing difficulties
● Forced to scale compute and storage at the same time
● Often requires a vendor-specific bundle
● Lack of dynamic resource allocation
● Security setup labyrinth (AD, Kerberos, Centrify)
● (Spark standalone) Difficulty maintaining a cluster with different runtimes
● (YARN) Exorbitant amount of configuration (massive XML files)
Static Architecture Resulting in Unextendable Pipelines
Difficult to evolve an architecture initially designed for one type of data ingestion
○ e.g. Adding a streaming component to an existing batch architecture is an involved undertaking
User Complexity in Accessing Large and Fragmented Data
● Friction between users (data scientists, analysts, etc.) and convenient data access patterns
● Data from different sources can be hard to merge together
○ Pipelines are often limited to one data type (e.g. can only ingest Avro)
Inspiration
How did we get to our solution?
● Medium article on Data Infrastructure at Airbnb (2016)
● Databricks Delta Lake (2019)
Solution
How did we get to our solution?
A Central Place Where Engineers and Data Scientists Can Collaborate and Run Diverse
Workflows
● K8s to do cluster management/scaling
● Spark to do data transformations
● Managed Services
● Bronze/Silver/Gold pipeline organization
Solution: Storage
How did we get to our solution?
Most data lake implementations use cloud object storage as the underlying storage
technology. It is recommended for the following reasons:
Durable: You can typically expect eleven 9’s of durability.
Scalable: Object storage is particularly well suited to storing vast amounts of
unstructured data, and storage capacity is virtually unlimited.
Affordable: You can store data for approximately $0.01 per GB/Month.
Secure: Granular access down to the object level.
Integrated: Most processing technologies support object storage as both a source
and a data sink.
Can it be a managed service? YES!
Solution: Compute
How did we get to our solution?
K8s has positioned itself as the de facto container orchestrator, and Spark
continues to be the tool of choice for running big data workloads. So why
don’t we run Spark on K8s?
Extendible: No extra infrastructure is needed to run your data workloads.
Upgradable: Running Spark executors on K8s means we no longer need to
upgrade the cluster; all runtimes are available right from the beginning via
Docker images (Scala, Python, R).
Can it be a managed service? YES!
Why Use Kubernetes as a Cluster Manager?
Are your data analytics pipelines already containerized?
If a typical pipeline requires a broker, Spark, a database, and some visualization
layer, and everything already runs in containers, it makes sense to also run Spark in
containers and use K8s to manage your entire pipeline.
Resource sharing is better optimized
Instead of running your pipeline on dedicated hardware, running it on a Kubernetes
cluster gives much better resource sharing, since not all components of a pipeline
are running all the time.
Why Use Kubernetes as a Cluster Manager?
Leveraging the Kubernetes Ecosystem
Spark workloads can make direct use of Kubernetes features for multi-tenancy
and sharing, such as Namespaces and Quotas, as well as administrative features
such as Pluggable Authorization and Logging.
It requires no changes or new installations on the cluster; simply create a
container image and set up the right RBAC roles for your Spark Application and
you’re all set.
So How Does This Actually Work?
spark-submit can be directly used to submit a Spark
application to a Kubernetes cluster. The submission
mechanism works as follows:
● Spark creates a Spark driver running within a Kubernetes pod.
● The driver creates executors, which also run within Kubernetes pods, connects to them, and executes the application code.
● When the application completes, the executor pods
terminate and are cleaned up, but the driver pod
persists logs and remains in “completed” state in the
Kubernetes API until it’s eventually garbage collected
or manually cleaned up.
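As a rough sketch of what such a submission can look like (the API server address, image name, application path, and service account below are placeholder assumptions, not the talk's actual values), the Kubernetes scheduler is selected with a k8s:// master URL plus a handful of spark.kubernetes.* settings; here it is driven from Python to keep all examples in one language:

```python
# A minimal sketch of submitting a PySpark application to Kubernetes in cluster mode.
# Every concrete value below (API server, image, service account, app path) is a placeholder.
import subprocess

subprocess.run([
    "spark-submit",
    "--master", "k8s://https://example-k8s-api-server:6443",
    "--deploy-mode", "cluster",
    "--name", "spark-consumer",
    "--conf", "spark.executor.instances=2",
    "--conf", "spark.kubernetes.container.image=example-registry/spark-py:2.4.3",
    "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName=spark",
    "local:///opt/spark/app/spark_consumer.py",   # application code baked into the image
], check=True)
```

Once submitted, the driver and executor pods show up like any other pods, so the usual kubectl tooling (kubectl get pods, kubectl logs) applies.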
Advantages of the Solution
Decoupling of compute and storage, which
means they can scale independently!
Advantages of the Solution: Compute
● A managed Kubernetes engine is available from all major cloud providers (AWS, GCP, Azure)
● Multiple data science and engineering workloads:
○ Streaming / Real Time: CPU intensive
○ Batch / Analytic: Storage intensive
● Minimize operational burden with a managed service while maximizing flexibility by taking advantage of Kubernetes
○ Dynamic and lightweight scaling
● Since we are deploying Docker images, there is no need to “patch” the cluster
○ Developers are responsible for updating runtimes and libraries:
■ Scala/Java/Python/R dependencies
Advantages of the Solution: Storage
● Object storage as a managed service!
● Using S3 provides a stable and unified source of data
● Option to ingest data of different types:
○ Structured
○ Semi-structured
○ Unstructured
● Real-time and historical data are available without many changes
○ No roll-up processes to another storage technology
Our Example Implementation
Stream Data From Twitter → Save Data to Data Lake → Transform Saved Data in Data Lake → Aggregate Data in Data Lake → Visualize Data in Data Lake
Architecture Diagram
Architecting the Data Lake Storage
A common misconception, and a potential mistake, is that a data lake is one giant,
centralized bucket where everything needs to land. That is not true!
Combining the Airbnb and Databricks ideas, we arrive at the following architecture:
Bronze Bucket: raw data lands here (Avro, JSON, CSV, etc.)
Silver Bucket: columnar data (Parquet, ORC)
Gold Bucket: columnar data (aggregated)
Finally, time for some code!
bit.ly/spark-k8s-code
Bronze Pipeline
● Producer (twitter_producer.py): a pod in Kubernetes; tweets are streamed into the producer.
● Consumer (spark_consumer.py): a Spark Streaming job in Kubernetes; the producer sends it data over TCP.
● S3 Bucket (Bronze): a JSON file is appended.
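A minimal sketch of the consumer side of this pipeline is shown below. It is not the repository's spark_consumer.py verbatim, and the host, port, and bucket names are assumptions; it reads the producer's TCP socket with Structured Streaming and appends the raw JSON lines to the bronze bucket untouched.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze-consumer").getOrCreate()

# Each record arriving over the socket is one raw tweet as a JSON string.
raw_tweets = (spark.readStream
              .format("socket")
              .option("host", "twitter-producer")   # hypothetical service name of the producer pod
              .option("port", 9999)                 # hypothetical port
              .load())

# Append the raw JSON lines to the bronze bucket without transforming them.
query = (raw_tweets.writeStream
         .format("text")
         .option("path", "s3a://example-bronze-bucket/tweets/")
         .option("checkpointLocation", "s3a://example-bronze-bucket/_checkpoints/tweets/")
         .start())

query.awaitTermination()
```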
Silver Pipeline
● S3 Bucket (Bronze): JSON file.
● Spark Streaming job in Kubernetes: the transformer pulls data from the bronze bucket, converts it to Parquet, and saves it to the silver bucket.
● S3 Bucket (Silver): a Parquet file is written.
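A minimal sketch of the transformer (bucket names are assumptions): read the raw JSON that landed in bronze, let Spark project a schema over it, and write Parquet to the silver bucket.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver-transformer").getOrCreate()

# Read the raw JSON from the bronze bucket; Spark infers the schema on read.
tweets = spark.read.json("s3a://example-bronze-bucket/tweets/")

# Write the same data in a columnar format to the silver bucket.
(tweets.write
    .mode("append")
    .parquet("s3a://example-silver-bucket/tweets/"))
```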
Gold Pipeline
● S3 Bucket (Silver): Parquet file.
● Spark Streaming job in Kubernetes: the aggregator pulls data from the silver bucket, aggregates it, then saves it to the gold bucket.
● S3 Bucket (Gold): a Parquet file containing the aggregated data is written.
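A minimal sketch of the aggregator (the bucket names and the lang column are assumptions): read the silver Parquet, aggregate, and write the result to the gold bucket.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("gold-aggregator").getOrCreate()

tweets = spark.read.parquet("s3a://example-silver-bucket/tweets/")

# Aggregate tweet counts per language (a hypothetical column in the tweet data).
tweet_counts = (tweets
                .groupBy("lang")
                .agg(F.count("*").alias("tweet_count")))

tweet_counts.write.mode("overwrite").parquet("s3a://example-gold-bucket/tweet_counts_by_lang/")
```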
Visualization
● S3 Bucket (Gold): Parquet file of aggregated data.
● Zeppelin in Kubernetes: allows the user to query the data.
● The user queries the data.
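A Zeppelin paragraph querying the gold bucket might look roughly like this (same hypothetical bucket and column names as above; in Zeppelin the spark session is provided by the interpreter):

```python
# Sketch of a %pyspark paragraph in Zeppelin; `spark` is predefined by the interpreter.
gold = spark.read.parquet("s3a://example-gold-bucket/tweet_counts_by_lang/")
gold.createOrReplaceTempView("tweet_counts_by_lang")

spark.sql("""
    SELECT lang, tweet_count
    FROM tweet_counts_by_lang
    ORDER BY tweet_count DESC
    LIMIT 10
""").show()
```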
Future Work
What about all the management, monitoring, and scheduling for Spark jobs?
● State
● Metrics
● Retries
● Logs
Thankfully, Google has announced the Spark Kubernetes Operator, which will take care of all that (and more).
Spark K8s operator Roadmap: GoogleCloudPlatform/spark-on-k8s-operator/issues/338
Questions?
Links
https://kubernetes.io/docs/home/
https://spark.apache.org/docs/latest/
https://zeppelin.apache.org/docs/latest/
https://medium.com/airbnb-engineering/data-infrastructure-at-airbnb-8adfb34f169c
https://databricks.com/product/databricks-delta
https://docs.aws.amazon.com/s3/index.html
https://docs.aws.amazon.com/eks/index.html
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator