SlideShare a Scribd company logo
Spark:
Migration Story
About me
Roman Chukh
 11+ years of experience
 Java / PHP / Ruby / etc.
 ~1 year with Apache Spark
 Interested in
 Data Storage / Data Flow
 Monitoring
 Provisioning Tools
Agenda
 Why Spark?
 Our Migration to Spark
 Issues
 … and solutions
 … or workarounds
 … or at least the lessons learnt
Why Spark?
“
[Spark is a] Fast and general-purpose
cluster computing platform for large-scale
data processing
Source: http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940
Why Spark?
API
Why Spark?
Active Development
Source: https://github.com/apache/spark/pulse/monthly
Why Spark?
Community Growth
Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
Why Spark?
Real-World Usage
Source: http://www.slideshare.net/databricks/apache-spark-15-presented-by-databricks-cofounder-patrick-
wendell/6
Largest Cluster 8000 nodes Tencent
Largest single job 1 PB
Alibaba.com
Databricks
Top streaming intake 1 TB / hour Janelia.org
Source: http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940
Why Spark?
Real-World Usage
Migrating to Spark
Cluster Manager
Application
SparkContext
Worker Node
Executor
Task
Executor
Task
Worker Node
Executor
Task
Executor
Task
Migrating To Spark
Before We Start
Migrating To Spark
The Product
 Cloud-based analytics application
 Won the Big Data Startup Challenge
 In-house computation engine
Migrating To Spark
Reasons
 More data
 More granular data
 Support various data backends
 Support Machine Learning algorithms
Migrating To Spark
Use Cases
❏ supplement Graph database used to
store/query big dimensions
❏ supplement RDBMS for querying of high
volumes of data
❏ represent existing computation graph as
flow of Spark-based operations
Migrating To Spark
Star Schema
Dimension DimensionMetric
Process /
Filter
Dimension
Filter
Metric
Process /
Filter
Dimension
Result
Data
Processing
...
Issues
Issue #1
Low-Level API
Issue #1: Low-Level API
RDD
“Resilient Distributed Datasets:
A Fault-Tolerant Abstraction for In-
Memory Cluster Computing”
Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Issue #1: Low-Level API
RDD: Resilient Distributed Dataset
❏ Immutable
❏ Statically typed: RDD<MyClass>
❏ Fault-Tolerant: Automatically rebuilt on failure
❏ Lazily evaluated
Issue #1: Low-Level API
Example workflow
Read File
line-by-line
Get line length
Sum lengths
Result
Issue #1: Low-Level API
RDD: Example
lines.txt
some
lines
for
test
Issue #1: Low-Level API
RDD: Issues
 Functional transformations (e.g. map/reduce)
are not as intuitive
 Manual memory management
 High (dev) maintenance cost
Issue #1: Low-Level API
DataFrame: Overview
❏ (Semi-) Structured data
❏ Columnar Storage
❏ Graph mutation
❏ Code generation
❏ "on" by default in 1.5+
❏ "always on" in latest master
Issue #1: Low-Level API
DataFrame: Example
lines.json
{"line":"some"}
{"line":"lines"}
{"line":"for"}
{"line":"test"}
Issue #1: Low-Level API
DataFrame vs RDD
Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
Issue #1: Low-Level API
DataFrame: Graph Mutation
Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
Issue #1: Low-Level API
Lessons Learnt
❏ Be aware of the new features
❏ … especially why they were introduced
❏ Low-Level API != Better Performance
Issue #2
DataSource
Predicates
“
“The fastest way to process big
data is to never read it”
Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
Spark Flow
RDBMS
WHERE
x > 0
Result
Issue #2: DataSource Predicates
Use Cases
SQL
SELECT *
FROM Table
WHERE x > 0
Spark Flow
RDBMS
WHERE
x > 0
Result
Issue #2: DataSource Predicates
Use Cases
SQL
SELECT *
FROM Table
WHERE x > 0
AND y < 10
WHERE
y < 10
AND
Spark Flow
RDBMS
WHERE
x > 0
Result
Issue #2: DataSource Predicates
Use Cases
SQL
SELECT *
FROM Table
WHERE x > 0
OR y < 10
WHERE
y < 10
OR
Spark Flow
RDBMS
WHERE
x > 0
Result
Issue #2: DataSource Predicates
Use Cases
SQL
SELECT *
FROM Table
WHERE x > 0
OR y < 10
WHERE
y < 10
OR
… is at a very early stage
❏ Only simple predicates
<, <=, >, >=, =
❏ Only ‘AND’ predicate groups
(no OR support)
Issue #2: DataSource Predicates
JDBC
… is buggy
❏ Parquet < 1.7
❏ PARQUET-136 - NPE if all column values are
null
❏ Parquet 1.7
❏ PARQUET-251 - Possible incorrect results
for String/Decimal/Binary columns
Issue #2: DataSource Predicates
Apache Parquet
Issue #2: DataSource Predicates
Lessons Learnt
❏ Know your data format / data storage features
❏ ... and issues
❏ Its hard to check predicate pushdown behavior
❏ SPARK-11390: Pushdown information
❏ Simple aggregation operations are not supported
❏ Check out the talk “The Pushdown of Everything”
Issue #3
Spark SQL
❏ Window functions (e.g. row_number)
❏ Introduced for HiveContext in 1.4
❏ Introduced for SparkContext in 1.5
❏ Subquery (e.g. not exists) support is still missing
❏ Can sometimes be replaced with left semi join
Issue #3: Spark (sort of) SQL
Missing Functionality
Issue #3: Spark (sort of) SQL
Lessons Learnt
❏ Know your use-case
❏ Spark SQL is still quite young
❏ SQL grammar is incomplete
❏ … but actively extended
Issue #4
Round Trips
Issue #4: Round Trips
Background
Metric
Data Processing
...
Filter
Metric
Result
Internal API
Process / Filter
Dimension
Dimension
ids
Dimension
Issue #4: Round Trips
Background
Metric
Data Processing
...
Filter
Metric
Result
Internal API
Process / Filter
Dimension
Dimension
ids
Dimension
Get ID for the ‘Year 2015’
Issue #4: Round Trips
Resolving Dimensions
Dimension
WHERE
key = ‘2015’
Result
Get IDs of all passed months of the current year
Dimension
WHERE parent = 2015
and level = month
Dim. id
of ‘2015’
WHERE
key = ‘2015’
Issue #4: Round Trips
Resolving Dimensions
Result
Get IDs of all passed months of the current year
AND their siblings from the previous year
Dimension
WHERE
parent = 2015
and
level = month
Dim. id
of ‘2015’
Jan,
Feb,
…
WHERE
key = ‘2015’
WHERE
sibling_id =
sibling_id - 1
Result
Issue #4: Round Trips
Resolving Dimensions
❏ Spark is better suited for a single complex request
❏ … though not too complex yet
❏ Invest time in architecture analysis and data flow
❏ It might be better to replace a more high-level API
Issue #4: Round Trips
Lessons Learnt
Issue #5
Out of Memory
“
“RAM's cheap, but not that cheap”
Source: http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load-everything-to-ram-and-run-it-from-there
Issue #5: OOM
Background
❏ Receive request
❏ Select / Filter / Process data (on Spark)
❏ Collect results
❏ … Out Of Memory
❏ Same data as before
❏ Same external API
Issue #5: OOM
Workaround: Requirements
❏ Result holds ~ 1M objects
❏ (Average) Object size 928 bytes
❏ Result size ~880 MB
Issue #5: OOM
Workaround: Before
Issue #5: OOM
Workaround: After
❏ Result holds ~ 1M objects
❏ (Average) Object size 272 bytes
❏ Result size ~261 MB
❏ Invest (more) time in data structures
❏ Some java performance tips:
http://java-performance.com/
❏ Know your serializer
❏ E.g. Kryo (v2.2.1) prepares object for
deserialization by using default constructor.
Issue #5: OOM
Lessons Learnt
Instead Of
Epilogue
“
“The fact that there is a highway to hell
and only a stairway to heaven says a lot
about the traffic trends”
Source: https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_that_there_is_a_highway_to_hell_and_only
Thanks!
Any questions?
Resources
 http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
 https://databricks.com/resources/slides
 https://databricks.com/spark/developer-resources
 https://github.com/apache/spark/pulse/monthly
 http://www.slideshare.net/databricks/building-a-modern-application-with-
dataframes-52776940
 http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-
zaharia-keynote
 http://www.slideshare.net/databricks/apache-spark-15-presented-by-
databricks-cofounder-patrick-wendell/6
 http://www.slideshare.net/databricks/building-a-modern-application-with-
dataframes-52776940
 http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-
zaharia-keynote
 http://www.slideshare.net/databricks/spark-whats-new-whats-coming
 http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load-
everything-to-ram-and-run-it-from-there
 https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_tha
t_there_is_a_highway_to_hell_and_only

More Related Content

What's hot

201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning
Mark Tabladillo
 
Redash: Open Source SQL Analytics on Data Lakes
Redash: Open Source SQL Analytics on Data LakesRedash: Open Source SQL Analytics on Data Lakes
Redash: Open Source SQL Analytics on Data Lakes
Databricks
 
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Databricks
 
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Databricks
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Databricks
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
Mykola Zerniuk
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
Data Con LA
 
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Databricks
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
Databricks
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Databricks
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
Spark Summit
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIO
Jozo Kovac
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
Databricks
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
Anant Corporation
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
DataWorks Summit/Hadoop Summit
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Databricks
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Databricks
 
Building Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesBuilding Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta Lakes
Databricks
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Vasu S
 

What's hot (20)

201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning201905 Azure Databricks for Machine Learning
201905 Azure Databricks for Machine Learning
 
Redash: Open Source SQL Analytics on Data Lakes
Redash: Open Source SQL Analytics on Data LakesRedash: Open Source SQL Analytics on Data Lakes
Redash: Open Source SQL Analytics on Data Lakes
 
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Q...
 
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
 
Machine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta LakeMachine Learning Data Lineage with MLflow and Delta Lake
Machine Learning Data Lineage with MLflow and Delta Lake
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
Realtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIORealtime streaming architecture in INFINARIO
Realtime streaming architecture in INFINARIO
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Data Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data EngineeringData Engineer's Lunch #55: Get Started in Data Engineering
Data Engineer's Lunch #55: Get Started in Data Engineering
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 
Building Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesBuilding Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta Lakes
 
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | QuboleEbooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
Ebooks - Accelerating Time to Value of Big Data of Apache Spark | Qubole
 

Viewers also liked

Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
Serg Masyutin
 
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
NoSQLmatters
 
AWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika
AWS Simple Workflow: Distributed Out of the Box! - Morning@LohikaAWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika
AWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika
Serhiy Batyuk
 
Big data analysis in java world
Big data analysis in java worldBig data analysis in java world
Big data analysis in java world
Serg Masyutin
 
Tweaking performance on high-load projects
Tweaking performance on high-load projectsTweaking performance on high-load projects
Tweaking performance on high-load projects
Dmitriy Dumanskiy
 
Apache HBase Workshop
Apache HBase WorkshopApache HBase Workshop
Apache HBase Workshop
Valerii Moisieienko
 
React. Flux. Redux
React. Flux. ReduxReact. Flux. Redux
React. Flux. Redux
Andrey Kolodnitsky
 
Marionette talk 2016
Marionette talk 2016Marionette talk 2016
Marionette talk 2016
Kseniya Redunova
 
Migration from Redshift to Spark
Migration from Redshift to SparkMigration from Redshift to Spark
Migration from Redshift to Spark
Sky Yin
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
Taras Matyashovsky
 
NLP: a peek into a day of a computational linguist
NLP: a peek into a day of a computational linguistNLP: a peek into a day of a computational linguist
NLP: a peek into a day of a computational linguist
Mariana Romanyshyn
 
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
Spark Summit
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Anastasiia Kornilova
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
Edureka!
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Cassandra under the hood
Cassandra under the hoodCassandra under the hood
Cassandra under the hood
Andriy Rymar
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
IBM
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
Sofian Hadiwijaya
 

Viewers also liked (18)

Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa...
 
AWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika
AWS Simple Workflow: Distributed Out of the Box! - Morning@LohikaAWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika
AWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika
 
Big data analysis in java world
Big data analysis in java worldBig data analysis in java world
Big data analysis in java world
 
Tweaking performance on high-load projects
Tweaking performance on high-load projectsTweaking performance on high-load projects
Tweaking performance on high-load projects
 
Apache HBase Workshop
Apache HBase WorkshopApache HBase Workshop
Apache HBase Workshop
 
React. Flux. Redux
React. Flux. ReduxReact. Flux. Redux
React. Flux. Redux
 
Marionette talk 2016
Marionette talk 2016Marionette talk 2016
Marionette talk 2016
 
Migration from Redshift to Spark
Migration from Redshift to SparkMigration from Redshift to Spark
Migration from Redshift to Spark
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
NLP: a peek into a day of a computational linguist
NLP: a peek into a day of a computational linguistNLP: a peek into a day of a computational linguist
NLP: a peek into a day of a computational linguist
 
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)How Spark Fits into Baidu's Scale-(James Peng, Baidu)
How Spark Fits into Baidu's Scale-(James Peng, Baidu)
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Cassandra under the hood
Cassandra under the hoodCassandra under the hood
Cassandra under the hood
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
 

Similar to Spark - Migration Story

Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Edureka!
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
Paco Nathan
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Lillian Pierson
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
Mostafa Majidpour
 
Presto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop MeetupPresto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop Meetup
Wojciech Biela
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
Paco Nathan
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
Burak Yavuz
 
Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why
Edureka!
 
BDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkBDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of spark
Jerry Wen
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
Databricks
 
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax Academy
 
Microservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with KafkaMicroservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with Kafka
VMware Tanzu
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
Anyscale
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Edureka!
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
DataWorks Summit
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Slim Baltagi
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesPaco Nathan
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
DataKitchen
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Paco Nathan
 

Similar to Spark - Migration Story (20)

Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?Spark is going to replace Apache Hadoop! Know Why?
Spark is going to replace Apache Hadoop! Know Why?
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
Big Data 2.0 - How Spark technologies are reshaping the world of big data ana...
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
Presto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop MeetupPresto for the Enterprise @ Hadoop Meetup
Presto for the Enterprise @ Hadoop Meetup
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why Spark Will Replace Hadoop ! Know Why
Spark Will Replace Hadoop ! Know Why
 
BDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of sparkBDTC2015 databricks-辛湜-state of spark
BDTC2015 databricks-辛湜-state of spark
 
New directions for Apache Spark in 2015
New directions for Apache Spark in 2015New directions for Apache Spark in 2015
New directions for Apache Spark in 2015
 
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
DataStax & O'Reilly Media: Large Scale Data Analytics with Spark and Cassandr...
 
Microservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with KafkaMicroservices, Events, and Breaking the Data Monolith with Kafka
Microservices, Events, and Breaking the Data Monolith with Kafka
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
PayPal merchant ecosystem using Apache Spark, Hive, Druid, and HBase
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
Open Data Science Conference Big Data Infrastructure – Introduction to Hadoop...
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 

Recently uploaded

ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
Vijay Dialani, PhD
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
Kerry Sado
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
zwunae
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
ChristineTorrepenida1
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
AJAYKUMARPUND1
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
Aditya Rajan Patra
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
VENKATESHvenky89705
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
Robbie Edward Sayers
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
ssuser7dcef0
 
space technology lecture notes on satellite
space technology lecture notes on satellitespace technology lecture notes on satellite
space technology lecture notes on satellite
ongomchris
 

Recently uploaded (20)

ML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptxML for identifying fraud using open blockchain data.pptx
ML for identifying fraud using open blockchain data.pptx
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
Hierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power SystemHierarchical Digital Twin of a Naval Power System
Hierarchical Digital Twin of a Naval Power System
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单专业办理
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
Unbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptxUnbalanced Three Phase Systems and circuits.pptx
Unbalanced Three Phase Systems and circuits.pptx
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
Pile Foundation by Venkatesh Taduvai (Sub Geotechnical Engineering II)-conver...
 
Recycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part IIIRecycled Concrete Aggregate in Construction Part III
Recycled Concrete Aggregate in Construction Part III
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
 
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
NUMERICAL SIMULATIONS OF HEAT AND MASS TRANSFER IN CONDENSING HEAT EXCHANGERS...
 
space technology lecture notes on satellite
space technology lecture notes on satellitespace technology lecture notes on satellite
space technology lecture notes on satellite
 

Spark - Migration Story

  • 2. About me Roman Chukh  11+ years of experience  Java / PHP / Ruby / etc.  ~1 year with Apache Spark  Interested in  Data Storage / Data Flow  Monitoring  Provisioning Tools
  • 3. Agenda  Why Spark?  Our Migration to Spark  Issues  … and solutions  … or workarounds  … or at least the lessons learnt
  • 5. “ [Spark is a] Fast and general-purpose cluster computing platform for large-scale data processing
  • 7. Why Spark? Active Development Source: https://github.com/apache/spark/pulse/monthly
  • 8. Why Spark? Community Growth Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
  • 9. Why Spark? Real-World Usage Source: http://www.slideshare.net/databricks/apache-spark-15-presented-by-databricks-cofounder-patrick- wendell/6
  • 10. Largest Cluster 8000 nodes Tencent Largest single job 1 PB Alibaba.com Databricks Top streaming intake 1 TB / hour Janelia.org Source: http://www.slideshare.net/databricks/building-a-modern-application-with-dataframes-52776940 Why Spark? Real-World Usage
  • 12. Cluster Manager Application SparkContext Worker Node Executor Task Executor Task Worker Node Executor Task Executor Task Migrating To Spark Before We Start
  • 13. Migrating To Spark The Product  Cloud-based analytics application  Won the Big Data Startup Challenge  In-house computation engine
  • 14. Migrating To Spark Reasons  More data  More granular data  Support various data backends  Support Machine Learning algorithms
  • 15. Migrating To Spark Use Cases ❏ supplement Graph database used to store/query big dimensions ❏ supplement RDBMS for querying of high volumes of data ❏ represent existing computation graph as flow of Spark-based operations
  • 16. Migrating To Spark Star Schema Dimension DimensionMetric Process / Filter Dimension Filter Metric Process / Filter Dimension Result Data Processing ...
  • 19. Issue #1: Low-Level API RDD “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In- Memory Cluster Computing” Source: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  • 20. Issue #1: Low-Level API RDD: Resilient Distributed Dataset ❏ Immutable ❏ Statically typed: RDD<MyClass> ❏ Fault-Tolerant: Automatically rebuilt on failure ❏ Lazily evaluated
  • 21. Issue #1: Low-Level API Example workflow Read File line-by-line Get line length Sum lengths Result
  • 22. Issue #1: Low-Level API RDD: Example lines.txt some lines for test
  • 23. Issue #1: Low-Level API RDD: Issues  Functional transformations (e.g. map/reduce) are not as intuitive  Manual memory management  High (dev) maintenance cost
  • 24. Issue #1: Low-Level API DataFrame: Overview ❏ (Semi-) Structured data ❏ Columnar Storage ❏ Graph mutation ❏ Code generation ❏ "on" by default in 1.5+ ❏ "always on" in latest master
  • 25. Issue #1: Low-Level API DataFrame: Example lines.json {"line":"some"} {"line":"lines"} {"line":"for"} {"line":"test"}
  • 26. Issue #1: Low-Level API DataFrame vs RDD Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
  • 27. Issue #1: Low-Level API DataFrame: Graph Mutation Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
  • 28. Issue #1: Low-Level API Lessons Learnt ❏ Be aware of the new features ❏ … especially why they were introduced ❏ Low-Level API != Better Performance
  • 30. “ “The fastest way to process big data is to never read it” Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
  • 31. Spark Flow RDBMS WHERE x > 0 Result Issue #2: DataSource Predicates Use Cases SQL SELECT * FROM Table WHERE x > 0
  • 32. Spark Flow RDBMS WHERE x > 0 Result Issue #2: DataSource Predicates Use Cases SQL SELECT * FROM Table WHERE x > 0 AND y < 10 WHERE y < 10 AND
  • 33. Spark Flow RDBMS WHERE x > 0 Result Issue #2: DataSource Predicates Use Cases SQL SELECT * FROM Table WHERE x > 0 OR y < 10 WHERE y < 10 OR
  • 34. Spark Flow RDBMS WHERE x > 0 Result Issue #2: DataSource Predicates Use Cases SQL SELECT * FROM Table WHERE x > 0 OR y < 10 WHERE y < 10 OR
  • 35. … is at a very early stage ❏ Only simple predicates <, <=, >, >=, = ❏ Only ‘AND’ predicate groups (no OR support) Issue #2: DataSource Predicates JDBC
  • 36. … is buggy ❏ Parquet < 1.7 ❏ PARQUET-136 - NPE if all column values are null ❏ Parquet 1.7 ❏ PARQUET-251 - Possible incorrect results for String/Decimal/Binary columns Issue #2: DataSource Predicates Apache Parquet
  • 37. Issue #2: DataSource Predicates Lessons Learnt ❏ Know your data format / data storage features ❏ ... and issues ❏ Its hard to check predicate pushdown behavior ❏ SPARK-11390: Pushdown information ❏ Simple aggregation operations are not supported ❏ Check out the talk “The Pushdown of Everything”
  • 39. ❏ Window functions (e.g. row_number) ❏ Introduced for HiveContext in 1.4 ❏ Introduced for SparkContext in 1.5 ❏ Subquery (e.g. not exists) support is still missing ❏ Can sometimes be replaced with left semi join Issue #3: Spark (sort of) SQL Missing Functionality
  • 40. Issue #3: Spark (sort of) SQL Lessons Learnt ❏ Know your use-case ❏ Spark SQL is still quite young ❏ SQL grammar is incomplete ❏ … but actively extended
  • 42. Issue #4: Round Trips Background Metric Data Processing ... Filter Metric Result Internal API Process / Filter Dimension Dimension ids Dimension
  • 43. Issue #4: Round Trips Background Metric Data Processing ... Filter Metric Result Internal API Process / Filter Dimension Dimension ids Dimension
  • 44. Get ID for the ‘Year 2015’ Issue #4: Round Trips Resolving Dimensions Dimension WHERE key = ‘2015’ Result
  • 45. Get IDs of all passed months of the current year Dimension WHERE parent = 2015 and level = month Dim. id of ‘2015’ WHERE key = ‘2015’ Issue #4: Round Trips Resolving Dimensions Result
  • 46. Get IDs of all passed months of the current year AND their siblings from the previous year Dimension WHERE parent = 2015 and level = month Dim. id of ‘2015’ Jan, Feb, … WHERE key = ‘2015’ WHERE sibling_id = sibling_id - 1 Result Issue #4: Round Trips Resolving Dimensions
  • 47. ❏ Spark is better suited for a single complex request ❏ … though not too complex yet ❏ Invest time in architecture analysis and data flow ❏ It might be better to replace a more high-level API Issue #4: Round Trips Lessons Learnt
  • 48. Issue #5 Out of Memory
  • 49. “ “RAM's cheap, but not that cheap” Source: http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load-everything-to-ram-and-run-it-from-there
  • 50. Issue #5: OOM Background ❏ Receive request ❏ Select / Filter / Process data (on Spark) ❏ Collect results ❏ … Out Of Memory
  • 51. ❏ Same data as before ❏ Same external API Issue #5: OOM Workaround: Requirements
  • 52. ❏ Result holds ~ 1M objects ❏ (Average) Object size 928 bytes ❏ Result size ~880 MB Issue #5: OOM Workaround: Before
  • 53. Issue #5: OOM Workaround: After ❏ Result holds ~ 1M objects ❏ (Average) Object size 272 bytes ❏ Result size ~261 MB
  • 54. ❏ Invest (more) time in data structures ❏ Some java performance tips: http://java-performance.com/ ❏ Know your serializer ❏ E.g. Kryo (v2.2.1) prepares object for deserialization by using default constructor. Issue #5: OOM Lessons Learnt
  • 56. “ “The fact that there is a highway to hell and only a stairway to heaven says a lot about the traffic trends” Source: https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_that_there_is_a_highway_to_hell_and_only
  • 58. Resources  http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf  https://databricks.com/resources/slides  https://databricks.com/spark/developer-resources  https://github.com/apache/spark/pulse/monthly  http://www.slideshare.net/databricks/building-a-modern-application-with- dataframes-52776940  http://www.slideshare.net/databricks/spark-summit-eu-2015-matei- zaharia-keynote  http://www.slideshare.net/databricks/apache-spark-15-presented-by- databricks-cofounder-patrick-wendell/6  http://www.slideshare.net/databricks/building-a-modern-application-with- dataframes-52776940  http://www.slideshare.net/databricks/spark-summit-eu-2015-matei- zaharia-keynote  http://www.slideshare.net/databricks/spark-whats-new-whats-coming  http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load- everything-to-ram-and-run-it-from-there  https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_tha t_there_is_a_highway_to_hell_and_only