Spark:
Migration Story
About me
Roman Chukh
• 11+ years of experience
• Java / PHP / Ruby / etc.
• ~1 year with Apache Spark
• Interested in
  • Data Storage / Data Flow
  • Monitoring
  • Provisioning Tools
Agenda
• Why Spark?
• Our Migration to Spark
• Issues
  • … and solutions
  • … or workarounds
  • … or at least the lessons learnt
Why Spark?
“[Spark is a] Fast and general-purpose cluster computing platform for large-scale data processing”
Source: http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940
Why Spark?
API
Why Spark?
Active Development
Source: https://github.com/apache/spark/pulse/monthly
Why Spark?
Community Growth
Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
Why Spark?
Real-World Usage
Source: http://www.slideshare.net/databricks/apache-spark-15-presented-by-databricks-cofounder-patrick-wendell/6
Largest cluster: 8,000 nodes (Tencent)
Largest single job: 1 PB (Alibaba.com, Databricks)
Top streaming intake: 1 TB / hour (Janelia.org)
Source: http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940
Why Spark?
Real-World Usage
Migrating to Spark
Migrating To Spark
Before We Start
[Diagram: Spark cluster architecture. The Application's SparkContext connects to a Cluster Manager, which allocates Executors on Worker Nodes; each Executor runs Tasks.]
Migrating To Spark
The Product
 Cloud-based analytics application
 Won the Big Data Startup Challenge
 In-house computation engine
Migrating To Spark
Reasons
• More data
• More granular data
• Support various data backends
• Support Machine Learning algorithms
Migrating To Spark
Use Cases
❏ Supplement the graph database used to store/query big dimensions
❏ Supplement the RDBMS for querying high volumes of data
❏ Represent the existing computation graph as a flow of Spark-based operations
Migrating To Spark
Star Schema
[Diagram: star-schema computation flow. Metrics and Dimensions feed Process/Filter steps, whose outputs are combined by Data Processing into the Result.]
Issues
Issue #1
Low-Level API
Issue #1: Low-Level API
RDD
“Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”
Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Issue #1: Low-Level API
RDD: Resilient Distributed Dataset
❏ Immutable
❏ Statically typed: RDD<MyClass>
❏ Fault-Tolerant: Automatically rebuilt on failure
❏ Lazily evaluated
Issue #1: Low-Level API
Example workflow
Read file line-by-line → Get line length → Sum lengths → Result
Issue #1: Low-Level API
RDD: Example
lines.txt
some
lines
for
test
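The same workflow in code, as a minimal sketch: the PySpark version is shown in comments (it assumes a running SparkContext `sc`), and the plain-Python equivalent below it is what actually runs here, since the RDD API applies the same map/reduce steps, just lazily and distributed.

```python
# PySpark sketch (assumes a SparkContext named `sc`):
#   total = sc.textFile("lines.txt").map(lambda line: len(line)) \
#             .reduce(lambda a, b: a + b)
#
# The same map/reduce pipeline over a plain Python list, for illustration:
from functools import reduce

lines = ["some", "lines", "for", "test"]      # contents of lines.txt
lengths = map(len, lines)                     # "Get line length" step
total = reduce(lambda a, b: a + b, lengths)   # "Sum lengths" step
print(total)  # → 16  (4 + 5 + 3 + 4)
```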
Issue #1: Low-Level API
RDD: Issues
• Functional transformations (e.g. map/reduce) are not as intuitive
• Manual memory management
• High (dev) maintenance cost
Issue #1: Low-Level API
DataFrame: Overview
❏ (Semi-) Structured data
❏ Columnar Storage
❏ Graph mutation
❏ Code generation
❏ "on" by default in 1.5+
❏ "always on" in latest master
Issue #1: Low-Level API
DataFrame: Example
lines.json
{"line":"some"}
{"line":"lines"}
{"line":"for"}
{"line":"test"}
Issue #1: Low-Level API
DataFrame vs RDD
Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
Issue #1: Low-Level API
DataFrame: Graph Mutation
Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
Issue #1: Low-Level API
Lessons Learnt
❏ Be aware of the new features
❏ … especially why they were introduced
❏ Low-Level API != Better Performance
Issue #2
DataSource Predicates
“The fastest way to process big data is to never read it”
Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
Issue #2: DataSource Predicates
Use Cases
SQL:
SELECT * FROM Table WHERE x > 0
[Diagram: Spark flow. The predicate WHERE x > 0 is pushed down to the RDBMS; Spark receives the already-filtered Result.]
Issue #2: DataSource Predicates
Use Cases
SQL:
SELECT * FROM Table WHERE x > 0 AND y < 10
[Diagram: Spark flow. WHERE x > 0 is pushed down to the RDBMS; WHERE y < 10 is applied in Spark and AND-combined before the Result.]
Issue #2: DataSource Predicates
Use Cases
SQL:
SELECT * FROM Table WHERE x > 0 OR y < 10
[Diagram: Spark flow. WHERE x > 0 sits at the RDBMS, WHERE y < 10 in Spark, OR-combined before the Result.]
Issue #2: DataSource Predicates
JDBC
JDBC predicate pushdown is at a very early stage:
❏ Only simple predicates: <, <=, >, >=, =
❏ Only ‘AND’ predicate groups (no OR support)
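The split the JDBC source has to make can be sketched in a few lines; this is a hypothetical illustration (not Spark's actual implementation): from an AND-group of predicates, keep only the simple comparisons that can be pushed down, and leave everything else to be evaluated in Spark after the read.

```python
# Hypothetical sketch of predicate splitting for a pushdown-capable source.
PUSHABLE_OPS = {"<", "<=", ">", ">=", "="}

def split_predicates(and_group):
    """and_group: list of (column, op, value) triples, implicitly AND-ed.
    Returns (pushed-to-source, residual-evaluated-in-Spark)."""
    pushed, residual = [], []
    for col, op, value in and_group:
        (pushed if op in PUSHABLE_OPS else residual).append((col, op, value))
    return pushed, residual

pushed, residual = split_predicates([("x", ">", 0), ("y", "LIKE", "a%")])
print(pushed)    # → [('x', '>', 0)]
print(residual)  # → [('y', 'LIKE', 'a%')]
```

Splitting is only sound for AND-groups: any residual predicate can safely be re-applied on top of the pushed ones, which is exactly why OR-groups cannot be pushed piecemeal.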
Issue #2: DataSource Predicates
Apache Parquet
Parquet predicate pushdown is buggy:
❏ Parquet < 1.7: PARQUET-136 (NPE if all column values are null)
❏ Parquet 1.7: PARQUET-251 (possible incorrect results for String/Decimal/Binary columns)
Issue #2: DataSource Predicates
Lessons Learnt
❏ Know your data format / data storage features
❏ ... and issues
❏ It’s hard to check predicate pushdown behavior
❏ SPARK-11390: Pushdown information
❏ Even simple aggregation operations are not pushed down
❏ Check out the talk “The Pushdown of Everything”
Issue #3
Spark SQL
Issue #3: Spark (sort of) SQL
Missing Functionality
❏ Window functions (e.g. row_number)
❏ Introduced for HiveContext in 1.4
❏ Introduced for SQLContext in 1.5
❏ Subquery (e.g. NOT EXISTS) support is still missing
❏ Can sometimes be replaced with a left semi join
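A left semi join keeps the rows of the left side that have a match on the right, which is the relational equivalent of an EXISTS subquery. A tiny illustration of the semantics with plain Python lists (the Spark SQL join type name is real; the data and helper are illustrative):

```python
# EXISTS-style filtering expressed as a left semi join.
# In Spark SQL this would be:
#   SELECT * FROM a LEFT SEMI JOIN b ON a.id = b.id
orders = [(1, "pen"), (2, "book"), (3, "mug")]
paid_ids = [(1,), (3,)]

def left_semi_join(left, right, key_left, key_right):
    """Keep left rows whose key appears on the right; emit left columns only."""
    right_keys = {r[key_right] for r in right}
    return [row for row in left if row[key_left] in right_keys]

print(left_semi_join(orders, paid_ids, 0, 0))  # → [(1, 'pen'), (3, 'mug')]
```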
Issue #3: Spark (sort of) SQL
Lessons Learnt
❏ Know your use-case
❏ Spark SQL is still quite young
❏ SQL grammar is incomplete
❏ … but actively extended
Issue #4
Round Trips
Issue #4: Round Trips
Background
[Diagram: background architecture. Metric data flows through Data Processing and Filter steps into the Result; Dimensions are resolved to ids through an Internal API and used in the Process/Filter steps.]
Issue #4: Round Trips
Resolving Dimensions
Get ID for the ‘Year 2015’:
[Diagram: query the Dimension with WHERE key = ‘2015’ to obtain the Result.]
Issue #4: Round Trips
Resolving Dimensions
Get IDs of all passed months of the current year:
[Diagram: first resolve the Dim. id of ‘2015’ (WHERE key = ‘2015’), then query WHERE parent = 2015 AND level = month for the Result.]
Issue #4: Round Trips
Resolving Dimensions
Get IDs of all passed months of the current year AND their siblings from the previous year:
[Diagram: resolve the Dim. id of ‘2015’ (WHERE key = ‘2015’), then the month ids Jan, Feb, … (WHERE parent = 2015 AND level = month), then their previous-year siblings (WHERE sibling_id = sibling_id - 1), giving the Result.]
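The round trips above can often be collapsed: instead of resolving ‘2015’, then its months, then their siblings in three separate requests, issue one self-joined query that Spark can plan as a single job. A hedged sketch only; the table and column names are hypothetical, not our actual schema:

```python
# Three sequential round trips (each waits for the previous result)...
queries_round_trip = [
    "SELECT id FROM dimension WHERE key = '2015'",
    "SELECT id FROM dimension WHERE parent = :year_id AND level = 'month'",
    "SELECT id FROM dimension WHERE id IN (:month_sibling_ids)",
]

# ...vs. the same logic as one self-joined request:
combined = """
SELECT m.id AS month_id, s.id AS sibling_id
FROM dimension y
JOIN dimension m ON m.parent = y.id AND m.level = 'month'
LEFT JOIN dimension s ON s.id = m.sibling_id
WHERE y.key = '2015'
"""
print(len(queries_round_trip), "round trips collapsed into 1 request")
```

Each round trip pays scheduling and serialization overhead; the single query lets the engine pick the execution plan instead.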
Issue #4: Round Trips
Lessons Learnt
❏ Spark is better suited for a single complex request
❏ … though not too complex yet
❏ Invest time in architecture analysis and data flow
❏ It might be better to replace fine-grained calls with a more high-level API
Issue #5
Out of Memory
“RAM's cheap, but not that cheap”
Source: http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load-everything-to-ram-and-run-it-from-there
Issue #5: OOM
Background
❏ Receive request
❏ Select / Filter / Process data (on Spark)
❏ Collect results
❏ … Out Of Memory
Issue #5: OOM
Workaround: Requirements
❏ Same data as before
❏ Same external API
Issue #5: OOM
Workaround: Before
❏ Result holds ~1M objects
❏ (Average) object size: 928 bytes
❏ Result size: ~880 MB
Issue #5: OOM
Workaround: After
❏ Result holds ~1M objects
❏ (Average) object size: 272 bytes
❏ Result size: ~261 MB
Issue #5: OOM
Lessons Learnt
❏ Invest (more) time in data structures
❏ Some Java performance tips: http://java-performance.com/
❏ Know your serializer
❏ E.g. Kryo (v2.2.1) instantiates objects for deserialization via the default constructor
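A rough Python analogue of the data-structure work (the product's code was on the JVM, so this only illustrates the idea, not the actual fix): dropping the per-object attribute dictionary with `__slots__` shrinks every instance, and at ~1M objects that per-instance saving is what turns into hundreds of MB.

```python
import sys

class FatRow:                      # default: every instance carries a __dict__
    def __init__(self, a, b):
        self.a, self.b = a, b

class SlimRow:                     # __slots__: fixed fields, no per-instance dict
    __slots__ = ("a", "b")
    def __init__(self, a, b):
        self.a, self.b = a, b

fat, slim = FatRow(1, 2), SlimRow(1, 2)
print(sys.getsizeof(fat) + sys.getsizeof(fat.__dict__))  # instance + its dict
print(sys.getsizeof(slim))                               # noticeably smaller
```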
Instead Of Epilogue
“The fact that there is a highway to hell and only a stairway to heaven says a lot about the traffic trends”
Source: https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_that_there_is_a_highway_to_hell_and_only
Thanks!
Any questions?
Resources
• http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• https://databricks.com/resources/slides
• https://databricks.com/spark/developer-resources
• https://github.com/apache/spark/pulse/monthly
• http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940
• http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
• http://www.slideshare.net/databricks/apache-spark-15-presented-by-databricks-cofounder-patrick-wendell/6
• http://www.slideshare.net/databricks/spark-whats-new-whats-coming
• http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load-everything-to-ram-and-run-it-from-there
• https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_that_there_is_a_highway_to_hell_and_only
