Performance of Spark vs MapReduce

www.edureka.co/apache-spark-scala-training

What will you learn today?
• Beyond Hadoop MapReduce
• How is Spark better than MapReduce?
• Benchmark: Spark vs MapReduce
• Hands-On: Analyzing data with Spark

Word Count Problem - MapReduce
MapReduce Code for a Simple Word Count Problem
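The code from this slide is not reproduced here. As a stand-in, the following is a minimal sketch of the map, shuffle and reduce phases of a word count, written in plain Python rather than against the Hadoop API. The function names (map_phase, shuffle_phase, reduce_phase, word_count) are illustrative assumptions, not Hadoop classes.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


print(word_count(["to be or not to be"]))
# prints {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In a real Hadoop job the mapper and reducer are separate classes, the shuffle is done by the framework across machines, and a driver wires them together, which is where most of the extra code comes from.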

Apache Spark
Apache Spark is a general-purpose data processing
engine with in-memory computing.
Spark provides APIs for Scala, Java, Python and R, which
has helped it become widely adopted for data processing.

How does Spark fit into the Hadoop ecosystem?
Spark is intended to enhance, not replace, the Hadoop stack.
Spark is designed to read and write data to HDFS as well as other storage systems and formats, such as
CSV files, Amazon S3 and NoSQL databases.

Word Count Problem - Spark
Spark Scala Code for Word Count Problem
Spark Python Code for Word Count Problem
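The code from these slides is not reproduced here. As a stand-in, the following is a minimal sketch of what a Python word count on Spark's RDD API might look like. The helper names (tokenize, word_count, run_local), the application name and the input path are assumptions for illustration. The calls textFile, flatMap, map, reduceByKey and collect are part of the PySpark RDD API.

```python
from operator import add


def tokenize(line):
    """Split one line of text into lowercase words."""
    return line.lower().split()


def word_count(sc, path):
    """Build an RDD of (word, count) pairs from a text file.

    `sc` is an existing pyspark.SparkContext; `path` is a hypothetical
    input location (local file, HDFS path, etc.).
    """
    return (
        sc.textFile(path)
        .flatMap(tokenize)             # one record per word
        .map(lambda word: (word, 1))   # pair each word with a count of 1
        .reduceByKey(add)              # sum the counts per word
    )


def run_local(path):
    """Run the job on a local master; requires pyspark to be installed."""
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCount")
    try:
        return word_count(sc, path).collect()
    finally:
        sc.stop()
```

Calling run_local("input.txt") would return a list of (word, count) tuples for that hypothetical file. The whole job is one chained expression, with no separate mapper, reducer and driver classes.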
Processing data with Spark takes much less
code than MapReduce, and Spark gives you
the flexibility to choose your preferred
language: Scala, Java, Python or R.

Why Spark for Big Data Analytics?
What makes Spark suitable for Big Data Analytics?

Why Spark for Big Data Analytics?
The following features make Spark a good fit for Big Data Analytics:
• Spark simplifies data analysis
• Spark provides built-in libraries for advanced analytics
• Spark speaks more than one language
• Spark provides faster results
• Spark works with distributions from different Hadoop vendors

Benchmark: Spark is Blazingly Fast

Isn’t Spark In-Memory Only?
But I have heard Spark is good only for in-memory processing?

Spark: Best of Both Worlds
It’s a common misconception that Spark is only for in-memory processing. From its inception,
Spark was designed to be a general execution engine that works both in memory and on disk.
Almost all Spark operators perform external operations, spilling to disk, when data does not fit in
memory.
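To make the idea of an "external operation" concrete, here is a minimal sketch of an external merge sort in plain Python: sorted chunks are spilled to temporary files once an in-memory buffer fills up, then merged back as a stream. This is not Spark's implementation, which is internal and more elaborate. The names external_sort, _spill and max_in_memory are assumptions made for illustration.

```python
import heapq
import tempfile


def _spill(sorted_chunk):
    """Write one sorted chunk to a temporary file and rewind it for reading."""
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(f"{value}\n" for value in sorted_chunk)
    f.seek(0)
    return f


def external_sort(values, max_in_memory):
    """Yield integers in sorted order, buffering at most `max_in_memory` at once."""
    runs, buffer = [], []
    for value in values:
        buffer.append(value)
        if len(buffer) >= max_in_memory:
            # Buffer is full: sort it and spill it to disk as one run.
            runs.append(_spill(sorted(buffer)))
            buffer = []
    # Each spilled run is read back lazily, one line at a time.
    streams = [(int(line) for line in run) for run in runs]
    streams.append(iter(sorted(buffer)))  # the final partial chunk stays in memory
    try:
        # heapq.merge combines the already-sorted streams without loading them fully.
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()


print(list(external_sort([5, 3, 9, 1, 4, 8, 2], max_in_memory=3)))
# prints [1, 2, 3, 4, 5, 8, 9]
```

The same pattern, process what fits in memory, write intermediate results to disk, then merge, is what lets an engine handle datasets larger than RAM at the cost of extra I/O.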

Spark Libraries
• Spark SQL: Spark’s module for working with structured data
• MLlib: Spark’s machine learning library
• GraphX: Spark’s API for graph computation
• Spark Streaming: Spark’s API to process streaming data

Spark in one Snapshot

Spark Use Cases
Companies use Spark to solve
a variety of problems, e.g.
recommendation systems, business
intelligence and fraud detection.

Who is using Spark?
A list of companies using Spark can be found here: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

Spark is here to stay
Spark is not one of those "here today, gone tomorrow" technologies. It is here to stay for the foreseeable
future, and it is well worth getting your teeth into it to get value out of your data.

Hands-on
Analyzing data with Spark

References
IBM backs Apache Spark for Big Data Analytics :
http://www.forbes.com/sites/paulmiller/2015/06/15/ibm-backs-apache-spark-for-big-data-analytics/
Why Cloudera is saying 'Goodbye, MapReduce' and 'Hello, Spark' :
http://fortune.com/2015/09/09/cloudera-spark-mapreduce/
5 reasons to turn to Spark for Big Data Analytics :
http://www.infoworld.com/article/2897287/big-data/5-reasons-to-turn-to-spark-for-big-data-analytics.html

Spark new record for large scale sorting :
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
How eBay uses Spark to ignite Data Analytics :
http://www.ebaytechblog.com/2014/05/28/using-spark-to-ignite-data-analytics/
Spark is fast on disk too :
https://gigaom.com/2014/10/10/databricks-demolishes-big-data-benchmark-to-prove-spark-is-fast-on-disk-too/

