Spark:
Migration Story
About me
Roman Chukh
• 11+ years of experience
• Java / PHP / Ruby / etc.
• ~1 year with Apache Spark
• Interested in
  • Data Storage / Data Flow
  • Monitoring
  • Provisioning Tools
Agenda
• Why Spark?
• Our Migration to Spark
• Issues
  • … and solutions
  • … or workarounds
  • … or at least the lessons learnt
Why Spark?
“[Spark is a] Fast and general-purpose cluster computing platform for large-scale data processing”
Source: http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940
Why Spark?
API
Why Spark?
Active Development
Source: https://github.com/apache/spark/pulse/monthly
Why Spark?
Community Growth
Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
Why Spark?
Real-World Usage
Source: http://www.slideshare.net/databricks/apache-spark-15-presented-by-databricks-cofounder-patrick-wendell/6
Largest cluster: 8,000 nodes (Tencent)
Largest single job: 1 PB (Alibaba.com, Databricks)
Top streaming intake: 1 TB / hour (Janelia.org)
Source: http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940
Why Spark?
Real-World Usage
Migrating to Spark
Migrating To Spark
Before We Start
[Diagram: Spark cluster architecture. The Application's SparkContext connects to a Cluster Manager, which allocates Executors on Worker Nodes; each Executor runs Tasks.]
Migrating To Spark
The Product
 Cloud-based analytics application
 Won the Big Data Startup Challenge
 In-house computation engine
Migrating To Spark
Reasons
• More data
• More granular data
• Support various data backends
• Support Machine Learning algorithms
Migrating To Spark
Use Cases
❏ Supplement the graph database used to store/query big dimensions
❏ Supplement the RDBMS for querying high volumes of data
❏ Represent the existing computation graph as a flow of Spark-based operations
Migrating To Spark
Star Schema
[Diagram: star-schema computation flow. Metrics and Dimensions feed Process/Filter steps, whose outputs are combined by Data Processing into the Result.]
Issues
Issue #1
Low-Level API
Issue #1: Low-Level API
RDD
“Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing”
Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Issue #1: Low-Level API
RDD: Resilient Distributed Dataset
❏ Immutable
❏ Statically typed: RDD<MyClass>
❏ Fault-Tolerant: Automatically rebuilt on failure
❏ Lazily evaluated
Issue #1: Low-Level API
Example workflow
Read file line-by-line → Get line length → Sum lengths → Result
Issue #1: Low-Level API
RDD: Example
lines.txt
some
lines
for
test
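The same workflow in code, as a minimal sketch: the PySpark version is shown in comments (it assumes a running SparkContext `sc`), and the plain-Python equivalent below it is what actually runs here, since the RDD API applies the same map/reduce steps, just lazily and distributed.

```python
# PySpark sketch (assumes a SparkContext named `sc`):
#   total = sc.textFile("lines.txt").map(lambda line: len(line)) \
#             .reduce(lambda a, b: a + b)
#
# The same map/reduce pipeline over a plain Python list, for illustration:
from functools import reduce

lines = ["some", "lines", "for", "test"]      # contents of lines.txt
lengths = map(len, lines)                     # "Get line length" step
total = reduce(lambda a, b: a + b, lengths)   # "Sum lengths" step
print(total)  # → 16  (4 + 5 + 3 + 4)
```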
Issue #1: Low-Level API
RDD: Issues
• Functional transformations (e.g. map/reduce) are not as intuitive
• Manual memory management
• High (dev) maintenance cost
Issue #1: Low-Level API
DataFrame: Overview
❏ (Semi-) Structured data
❏ Columnar Storage
❏ Graph mutation
❏ Code generation
❏ "on" by default in 1.5+
❏ "always on" in latest master
Issue #1: Low-Level API
DataFrame: Example
lines.json
{"line":"some"}
{"line":"lines"}
{"line":"for"}
{"line":"test"}
Issue #1: Low-Level API
DataFrame vs RDD
Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
Issue #1: Low-Level API
DataFrame: Graph Mutation
Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
Issue #1: Low-Level API
Lessons Learnt
❏ Be aware of the new features
❏ … especially why they were introduced
❏ Low-Level API != Better Performance
Issue #2
DataSource Predicates
“The fastest way to process big data is to never read it”
Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
Issue #2: DataSource Predicates
Use Cases
SQL:
SELECT * FROM Table WHERE x > 0
[Diagram: Spark flow. The predicate WHERE x > 0 is pushed down to the RDBMS; Spark receives the already-filtered Result.]
Issue #2: DataSource Predicates
Use Cases
SQL:
SELECT * FROM Table WHERE x > 0 AND y < 10
[Diagram: Spark flow. WHERE x > 0 is pushed down to the RDBMS; WHERE y < 10 is applied in Spark and AND-combined before the Result.]
Issue #2: DataSource Predicates
Use Cases
SQL:
SELECT * FROM Table WHERE x > 0 OR y < 10
[Diagram: Spark flow. WHERE x > 0 sits at the RDBMS, WHERE y < 10 in Spark, OR-combined before the Result.]
Issue #2: DataSource Predicates
JDBC
JDBC predicate pushdown is at a very early stage:
❏ Only simple predicates: <, <=, >, >=, =
❏ Only ‘AND’ predicate groups (no OR support)
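The split the JDBC source has to make can be sketched in a few lines; this is a hypothetical illustration (not Spark's actual implementation): from an AND-group of predicates, keep only the simple comparisons that can be pushed down, and leave everything else to be evaluated in Spark after the read.

```python
# Hypothetical sketch of predicate splitting for a pushdown-capable source.
PUSHABLE_OPS = {"<", "<=", ">", ">=", "="}

def split_predicates(and_group):
    """and_group: list of (column, op, value) triples, implicitly AND-ed.
    Returns (pushed-to-source, residual-evaluated-in-Spark)."""
    pushed, residual = [], []
    for col, op, value in and_group:
        (pushed if op in PUSHABLE_OPS else residual).append((col, op, value))
    return pushed, residual

pushed, residual = split_predicates([("x", ">", 0), ("y", "LIKE", "a%")])
print(pushed)    # → [('x', '>', 0)]
print(residual)  # → [('y', 'LIKE', 'a%')]
```

Splitting is only sound for AND-groups: any residual predicate can safely be re-applied on top of the pushed ones, which is exactly why OR-groups cannot be pushed piecemeal.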
Issue #2: DataSource Predicates
Apache Parquet
Parquet predicate pushdown is buggy:
❏ Parquet < 1.7: PARQUET-136 (NPE if all column values are null)
❏ Parquet 1.7: PARQUET-251 (possible incorrect results for String/Decimal/Binary columns)
Issue #2: DataSource Predicates
Lessons Learnt
❏ Know your data format / data storage features
❏ ... and issues
❏ It’s hard to check predicate pushdown behavior
❏ SPARK-11390: Pushdown information
❏ Even simple aggregation operations are not pushed down
❏ Check out the talk “The Pushdown of Everything”
Issue #3
Spark SQL
Issue #3: Spark (sort of) SQL
Missing Functionality
❏ Window functions (e.g. row_number)
❏ Introduced for HiveContext in 1.4
❏ Introduced for SQLContext in 1.5
❏ Subquery (e.g. NOT EXISTS) support is still missing
❏ Can sometimes be replaced with a left semi join
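A left semi join keeps the rows of the left side that have a match on the right, which is the relational equivalent of an EXISTS subquery. A tiny illustration of the semantics with plain Python lists (the Spark SQL join type name is real; the data and helper are illustrative):

```python
# EXISTS-style filtering expressed as a left semi join.
# In Spark SQL this would be:
#   SELECT * FROM a LEFT SEMI JOIN b ON a.id = b.id
orders = [(1, "pen"), (2, "book"), (3, "mug")]
paid_ids = [(1,), (3,)]

def left_semi_join(left, right, key_left, key_right):
    """Keep left rows whose key appears on the right; emit left columns only."""
    right_keys = {r[key_right] for r in right}
    return [row for row in left if row[key_left] in right_keys]

print(left_semi_join(orders, paid_ids, 0, 0))  # → [(1, 'pen'), (3, 'mug')]
```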
Issue #3: Spark (sort of) SQL
Lessons Learnt
❏ Know your use-case
❏ Spark SQL is still quite young
❏ SQL grammar is incomplete
❏ … but actively extended
Issue #4
Round Trips
Issue #4: Round Trips
Background
[Diagram: background architecture. Metric data flows through Data Processing and Filter steps into the Result; Dimensions are resolved to ids through an Internal API and used in the Process/Filter steps.]
Issue #4: Round Trips
Resolving Dimensions
Get ID for the ‘Year 2015’:
[Diagram: query the Dimension with WHERE key = ‘2015’ to obtain the Result.]
Issue #4: Round Trips
Resolving Dimensions
Get IDs of all passed months of the current year:
[Diagram: first resolve the Dim. id of ‘2015’ (WHERE key = ‘2015’), then query WHERE parent = 2015 AND level = month for the Result.]
Issue #4: Round Trips
Resolving Dimensions
Get IDs of all passed months of the current year AND their siblings from the previous year:
[Diagram: resolve the Dim. id of ‘2015’ (WHERE key = ‘2015’), then the month ids Jan, Feb, … (WHERE parent = 2015 AND level = month), then their previous-year siblings (WHERE sibling_id = sibling_id - 1), giving the Result.]
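The round trips above can often be collapsed: instead of resolving ‘2015’, then its months, then their siblings in three separate requests, issue one self-joined query that Spark can plan as a single job. A hedged sketch only; the table and column names are hypothetical, not our actual schema:

```python
# Three sequential round trips (each waits for the previous result)...
queries_round_trip = [
    "SELECT id FROM dimension WHERE key = '2015'",
    "SELECT id FROM dimension WHERE parent = :year_id AND level = 'month'",
    "SELECT id FROM dimension WHERE id IN (:month_sibling_ids)",
]

# ...vs. the same logic as one self-joined request:
combined = """
SELECT m.id AS month_id, s.id AS sibling_id
FROM dimension y
JOIN dimension m ON m.parent = y.id AND m.level = 'month'
LEFT JOIN dimension s ON s.id = m.sibling_id
WHERE y.key = '2015'
"""
print(len(queries_round_trip), "round trips collapsed into 1 request")
```

Each round trip pays scheduling and serialization overhead; the single query lets the engine pick the execution plan instead.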
Issue #4: Round Trips
Lessons Learnt
❏ Spark is better suited for a single complex request
❏ … though not too complex yet
❏ Invest time in architecture analysis and data flow
❏ It might be better to replace fine-grained calls with a more high-level API
Issue #5
Out of Memory
“RAM's cheap, but not that cheap”
Source: http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load-everything-to-ram-and-run-it-from-there
Issue #5: OOM
Background
❏ Receive request
❏ Select / Filter / Process data (on Spark)
❏ Collect results
❏ … Out Of Memory
Issue #5: OOM
Workaround: Requirements
❏ Same data as before
❏ Same external API
Issue #5: OOM
Workaround: Before
❏ Result holds ~1M objects
❏ (Average) object size: 928 bytes
❏ Result size: ~880 MB
Issue #5: OOM
Workaround: After
❏ Result holds ~1M objects
❏ (Average) object size: 272 bytes
❏ Result size: ~261 MB
Issue #5: OOM
Lessons Learnt
❏ Invest (more) time in data structures
❏ Some Java performance tips: http://java-performance.com/
❏ Know your serializer
❏ E.g. Kryo (v2.2.1) instantiates objects for deserialization via the default constructor
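A rough Python analogue of the data-structure work (the product's code was on the JVM, so this only illustrates the idea, not the actual fix): dropping the per-object attribute dictionary with `__slots__` shrinks every instance, and at ~1M objects that per-instance saving is what turns into hundreds of MB.

```python
import sys

class FatRow:                      # default: every instance carries a __dict__
    def __init__(self, a, b):
        self.a, self.b = a, b

class SlimRow:                     # __slots__: fixed fields, no per-instance dict
    __slots__ = ("a", "b")
    def __init__(self, a, b):
        self.a, self.b = a, b

fat, slim = FatRow(1, 2), SlimRow(1, 2)
print(sys.getsizeof(fat) + sys.getsizeof(fat.__dict__))  # instance + its dict
print(sys.getsizeof(slim))                               # noticeably smaller
```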
Instead Of Epilogue
“The fact that there is a highway to hell and only a stairway to heaven says a lot about the traffic trends”
Source: https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_that_there_is_a_highway_to_hell_and_only
Thanks!
Any questions?
Resources
• http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• https://databricks.com/resources/slides
• https://databricks.com/spark/developer-resources
• https://github.com/apache/spark/pulse/monthly
• http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940
• http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
• http://www.slideshare.net/databricks/apache-spark-15-presented-by-databricks-cofounder-patrick-wendell/6
• http://www.slideshare.net/databricks/spark-whats-new-whats-coming
• http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load-everything-to-ram-and-run-it-from-there
• https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_that_there_is_a_highway_to_hell_and_only
