www.edureka.co/apache-spark-scala-training
Apache Spark: Beyond Hadoop MapReduce
Agenda
By the end of this webinar, you will know about:
• Strengths of MapReduce
• Things beyond MapReduce
• How MapReduce limitations can be overcome
• How Spark fits the bill
• Other exciting features in Spark
Strengths of MapReduce
• Simple: independence of language of choice, such as Java, C++ or Python.
• Scalability: can process petabytes of data, stored in HDFS on one cluster.
• Fault tolerance: MapReduce takes care of failures using the replicated copies.
• Minimal data motion: the process moves to the data, minimizing network I/O.
Limitations Of MapReduce (MR)
• Real-time processing
• Complex algorithms
• Re-reading and parsing data
• Minimal data motion
• Graph processing
• Iterative tasks
• Random access
Feature Comparison with Spark
Hadoop MapReduce | Spark
Fast | Up to 100x faster than MapReduce
Batch processing | Batch and real-time processing
Stores data on disk | Stores data in memory
Written in Java | Written in Scala
Source: Databricks
How MR limitations can be overcome
Overcoming MR limitations
• Real time: cutting down on the number of reads and writes to disk
• Complex algorithms: libraries for machine learning, streaming and graph processing
• Random access: cyclic data flows
How Spark Implements Features To Make Its Architecture Better Than MR
Spark Cuts Down Read/Write I/O To Disk
Spark tries to keep data in the memory of its distributed workers, allowing for significantly faster, lower-latency computations, whereas MapReduce keeps shuffling data to and from disk.
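The effect of keeping data in memory is easiest to see on an iterative job. The following is a minimal sketch in plain Python, not Spark or Hadoop code: `DiskDataset`, `iterate_from_disk` and `iterate_in_memory` are hypothetical names invented for this illustration, and a "disk read" is simply counted rather than performed.

```python
class DiskDataset:
    """Toy stand-in for a file on disk; counts how often it is read."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def read(self):
        self.reads += 1  # every call models one full scan from disk
        return list(self._records)


def iterate_from_disk(dataset, iterations):
    """MapReduce-style loop: each iteration re-reads the input from disk."""
    total = 0
    for _ in range(iterations):
        total += sum(dataset.read())
    return total


def iterate_in_memory(dataset, iterations):
    """Spark-style loop: read once, keep the data in memory, then reuse it."""
    cached = dataset.read()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


disk_job = DiskDataset(range(1000))
memory_job = DiskDataset(range(1000))

disk_total = iterate_from_disk(disk_job, iterations=10)
memory_total = iterate_in_memory(memory_job, iterations=10)

# Same answer either way; only the number of full scans differs.
print(disk_job.reads, memory_job.reads)
```

Both loops compute the same result, but the first scans the input once per iteration while the second scans it once in total. In real Spark the in-memory copy is also partitioned across workers, which this toy model leaves out.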
Libraries For ML, Graph Programming …
• MLlib: machine learning library
• GraphX: graph programming
• Spark SQL: Spark interface for RDBMS lovers
• Spark Streaming: utility for continuous ingestion of data
Cyclic Data Flows
• All jobs in Spark comprise a series of operators and run on a set of data.
• All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
• The DAG is optimized by rearranging and combining operators where possible.
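The idea of recording operators first and combining them at execution time can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's scheduler: the `Pipeline` class and its `map`, `filter` and `collect` methods are hypothetical names chosen to resemble the style of Spark's API, the "DAG" is a simple linear chain, and only operator combining (not rearranging) is shown.

```python
class Pipeline:
    """Toy lazy pipeline: operators are recorded first and run later."""

    def __init__(self, data, operators=()):
        self._data = data
        self.operators = tuple(operators)  # linear chain, the simplest DAG

    def map(self, fn):
        # Record the operator; do not touch the data yet.
        return Pipeline(self._data, self.operators + (("map", fn),))

    def filter(self, fn):
        return Pipeline(self._data, self.operators + (("filter", fn),))

    def collect(self):
        """Run all recorded operators in a single pass over the data."""
        results = []
        for item in self._data:
            keep = True
            for kind, fn in self.operators:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []


def double(x):
    calls.append(x)  # lets us observe when the operator actually runs
    return x * 2


plan = Pipeline(range(6)).map(double).filter(lambda x: x % 4 == 0)
planned_calls = len(calls)  # nothing has executed while building the plan
result = plan.collect()     # map and filter run together, one pass
```

Building `plan` only records two operators; `double` is not called until `collect()` runs, at which point both operators are applied to each element in the same pass instead of materializing an intermediate dataset between them.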
Other Spark Features In Demand
Spark Features/Modules In Demand
Source: Typesafe
New Features In 2015
DataFrames
• Similar API to data frames in R and Pandas
• Automatically optimised via Spark SQL
• Released in Spark 1.3
SparkR
• Released in Spark 1.4
• Exposes DataFrames, RDDs & ML library in R
Machine Learning Pipelines
• High Level API
• Featurization
• Evaluation
• Model Tuning
External Data Sources
• Platform API to plug data sources into Spark
• Pushes logic into sources
Source: Databricks
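"Pushes logic into sources" can be pictured with a small plain-Python sketch. It assumes a hypothetical `InMemorySource` class with a `scan(columns, predicate)` method; neither is part of Spark's actual data source API, and the rows are made-up sample data. The point is only that a filter and a column list handed to the source reduce what the engine has to receive.

```python
class InMemorySource:
    """Hypothetical data source that can filter and prune columns itself."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_returned = 0

    def scan(self, columns=None, predicate=None):
        for row in self._rows:
            if predicate is not None and not predicate(row):
                continue  # filtered at the source, never sent to the engine
            self.rows_returned += 1
            if columns is None:
                yield dict(row)
            else:
                yield {name: row[name] for name in columns}


# Made-up sample rows for the illustration.
rows = [
    {"id": 1, "country": "IN", "amount": 120},
    {"id": 2, "country": "US", "amount": 80},
    {"id": 3, "country": "IN", "amount": 45},
]


def is_india(row):
    return row["country"] == "IN"


# Without pushdown: the engine pulls every row, then filters.
plain = InMemorySource(rows)
without_pushdown = [r["amount"] for r in plain.scan() if is_india(r)]

# With pushdown: the filter and the column list travel to the source.
pushed = InMemorySource(rows)
with_pushdown = [
    r["amount"] for r in pushed.scan(columns=["amount"], predicate=is_india)
]
```

Both queries return the same amounts, but the second source hands over fewer rows and only the requested column, which is the kind of saving the slide refers to.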
Questions
Survey
Your feedback is vital to us, be it a compliment, a suggestion or a complaint. It helps us make your experience better!
Please spare a few minutes to take the survey after the webinar.

Editor's Notes

  • #19 http://www.information-management.com/gallery/Big-Data-Hadoop-2015-Predictions-Forrester-10026357-1.html https://www.forrester.com/Predictions+2015+Hadoop+Will+Become+A+Cornerstone+Of+Your+Business+Technology+Agenda/fulltext/-/E-RES117705