
Apache spark



Apache Spark: Beyond Hadoop MapReduce

Published in: Technology


  1. Apache Spark: Beyond Hadoop MapReduce
  2. Agenda. At the end of this webinar you will know about:
       • Strengths of MapReduce
       • Things beyond MapReduce
       • How MapReduce limitations can be overcome
       • How Spark fits the bill
       • Other exciting features in Spark
  3. Strengths of MapReduce
  4. Strengths of MapReduce
       • Simple: independence of language of choice, such as Java, C++ or Python.
       • Scalability: can process petabytes of data, stored in HDFS on one cluster.
       • Fault tolerance: MapReduce takes care of failures using the replicated copies.
       • Minimal data motion: the computation moves to the data, which minimizes data transfer and I/O.
  5. Limitations Of MapReduce (MR)
  6. Limitations Of MR
       • Real time
       • Complex algorithms
       • Re-reading and parsing data
       • Minimal data motion
       • Graph processing
       • Iterative tasks
       • Random access
  7. Feature Comparison with Spark (Source: Databricks)
       • Speed: Hadoop MapReduce is fast; Spark is up to 100x faster than MapReduce
       • Processing: Hadoop MapReduce does batch processing; Spark does batch and real-time processing
       • Storage: Hadoop MapReduce stores data on disk; Spark stores data in memory
       • Language: Hadoop MapReduce is written in Java; Spark is written in Scala
  8. How MR limitations can be overcome
  9. Overcoming MR limitations, real time: cutting down on the number of reads and writes to disk
  10. Overcoming MR limitations, complex algorithms: libraries for machine learning, streaming, and graph processing
  11. Overcoming MR limitations, random access: cyclic data flows
  12. How Spark Implements Features To Make Its Architecture Better Than MR
  13. Spark Cuts Down Read/Write I/O To Disk: Spark tries to keep data in the memory of its distributed workers, allowing for significantly faster, lower-latency computations, whereas MapReduce keeps moving intermediate data in and out of disk.
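
To make the point on slide 13 concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use Spark's API; `CountingReader`, `iterate_from_disk`, and `iterate_in_memory` are hypothetical names invented for illustration. It only shows why an iterative job that re-reads and re-parses its input on every pass touches disk far more often than one that loads the data once and keeps it in memory.

```python
import json
import os
import tempfile


def write_dataset(path, values):
    """Write one JSON record per line, standing in for a file on disk."""
    with open(path, "w") as f:
        for v in values:
            f.write(json.dumps({"value": v}) + "\n")


class CountingReader:
    """Hypothetical helper: reads and parses a file, counting disk reads."""

    def __init__(self, path):
        self.path = path
        self.reads = 0

    def load(self):
        self.reads += 1
        with open(self.path) as f:
            return [json.loads(line)["value"] for line in f]


def iterate_from_disk(reader, iterations):
    """MapReduce-like pattern: every iteration re-reads and re-parses the input."""
    total = 0
    for _ in range(iterations):
        total += sum(reader.load())
    return total


def iterate_in_memory(reader, iterations):
    """Spark-like pattern: read once, keep the parsed data in memory, reuse it."""
    cached = reader.load()
    total = 0
    for _ in range(iterations):
        total += sum(cached)
    return total


def demo():
    """Run both patterns on the same data and report totals and read counts."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.jsonl")
        write_dataset(path, range(10))
        disk_reader = CountingReader(path)
        mem_reader = CountingReader(path)
        disk_total = iterate_from_disk(disk_reader, 5)
        mem_total = iterate_in_memory(mem_reader, 5)
    return disk_total, disk_reader.reads, mem_total, mem_reader.reads


if __name__ == "__main__":
    print(demo())
```

Both patterns compute the same total; the only difference is the number of disk reads, which grows with the iteration count in the first pattern and stays at one in the second.
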
  14. Libraries For ML, Graph Programming …
       • Machine learning library
       • Graph programming
       • Spark interface for RDBMS lovers
       • Utility for continuous ingestion of data
  15. Cyclic Data Flows
       • All jobs in Spark comprise a series of operators and run on a set of data.
       • All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
       • The DAG is optimized by rearranging and combining operators where possible.
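
The idea on slide 15 can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Spark's scheduler or API: the `Pipeline` class and its `optimize` method are hypothetical, the "DAG" is only a linear chain of operators (the simplest possible case), and the only optimization shown is combining consecutive `map` operators into one.

```python
class Pipeline:
    """Hypothetical lazy pipeline: operators are recorded, not run, until collect()."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def map(self, fn):
        # Record the operator and return a new pipeline; nothing is computed yet.
        return Pipeline(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        return Pipeline(self.source, self.ops + (("filter", pred),))

    def plan(self):
        """Return the operator kinds in execution order."""
        return [kind for kind, _ in self.ops]

    def optimize(self):
        """Combine consecutive map operators into a single composed map."""
        fused = []
        for kind, fn in self.ops:
            if kind == "map" and fused and fused[-1][0] == "map":
                prev = fused[-1][1]
                # Default arguments bind the current functions at definition time.
                fused[-1] = ("map", lambda x, f=prev, g=fn: g(f(x)))
            else:
                fused.append((kind, fn))
        return Pipeline(self.source, fused)

    def collect(self):
        """Run all recorded operators in one pass over the source."""
        out = []
        for item in self.source:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


job = (
    Pipeline(range(6))
    .map(lambda x: x + 1)
    .map(lambda x: x * 10)
    .filter(lambda x: x > 20)
)
optimized = job.optimize()
```

Here `job.plan()` lists three operators while `optimized.plan()` lists two, and both pipelines produce the same result from `collect()`. A real engine works on a general graph and applies many more rewrites; the sketch only shows the shape of the idea.
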
  16. Other Spark Features In Demand
  17. Spark Features/Modules In Demand (Source: Typesafe)
  18. New Features In 2015 (Source: Databricks)
       • DataFrames: similar API to data frames in R and pandas; automatically optimised via Spark SQL; released in Spark 1.3
       • SparkR: released in Spark 1.4; exposes DataFrames, RDDs and the ML library in R
       • Machine Learning Pipelines: high-level API; featurization; evaluation; model tuning
       • External Data Sources: platform API to plug data sources into Spark; pushes logic into sources
  19. Questions
  20. Survey: Your feedback is vital for us, be it a compliment, a suggestion or a complaint. It helps us to make your experience better! Please spare a few minutes to take the survey after the webinar.