
Apache Spark: The Next Gen toolset for Big Data Processing

The Spark project from Apache (spark.apache.org) is the next generation of Big Data processing systems. It uses a new architecture and in-memory processing to deliver orders-of-magnitude improvements in performance; some would call it the successor to the Hadoop set of tools. Hadoop is a batch-mode Big Data processor that depends on disk-based files. Spark improves on this, supporting real-time and interactive processing in addition to batch processing.

Table of contents:
1. The Big Data triangle
2. Hadoop stack and its limitations
3. Spark: An Overview
3.a. Spark Streaming
3.b. GraphX: Graph processing
3.c. MLib: Machine Learning
4. Performance characteristics of Spark



  1. Apache Spark: The Next Gen toolset for Big Data Processing
     Prajod Vettiyattil, Architect, Open Source, Wipro (in.linkedin.com/in/prajod, @prajods)
     Namitha M S, Architect, Advanced Technologies, Wipro (in.linkedin.com/in/namithams)
     Open Source India, Nov 2014, Bangalore
  2. Agenda
     • Big Data
     • Hadoop stack and its limitations
     • Spark: An overview
     • Streaming, GraphX and MLlib
     • Performance characteristics of Spark
  3. Big Data
     • Data too huge for normal systems
     • 3 Vs: Volume, Variety, Velocity
     • Storage challenge
     • Analysis challenge
     • Query results take hours, days or months
     [Image: data disks]
  4. The Big Data Analysis Triad
     [Diagram: a triad of Batch, Interactive and Streaming processing]
  5. The Hadoop stack
     • Distributed data processing
     • Fault tolerant
     • Processes petabyte data sets
     • Ecosystem tools: Hive DB, HBase, Pig, Storm
     • Hadoop core: Map, Reduce, Shuffle/partition/sort, HDFS
  6. Hadoop: Data flow
     [Diagram: MapReduce data flow, highlighting high disk I/O. On Map nodes: read input data files, buffer in memory, partition for the target reducers, sort each partition by key (with potential spill to disk), then merge all partitions and write to disk. On Reduce nodes: HTTP fetch from the map nodes, then merge-sort across multiple merge rounds (round 1 to round N) to produce the output.]
  7. Limitations of Hadoop
     • Batch mode only: covers just the batch layer of the Lambda pattern
     • No real time processing
     • No repetitive queries
     • Weak support for iterative algorithms
     • Weak support for interactive data querying
     • Poor support for distributed memory
  8. Spark: An overview
     • "Over time, fewer projects will use MapReduce, and more will use Spark" (Doug Cutting, creator of Hadoop)
     • New architecture: scales better and simplifies development
     • In-memory processing for Big Data, with cached intermediate data sets
     • Multi-step DAG based execution
     • Resilient Distributed Datasets (RDDs): the core innovation in Spark
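To make the RDD idea concrete, here is a minimal word-count sketch, not from the deck, written against the Spark 1.x-era Scala API of the talk's 2014 timeframe; the app name, local master and input path are illustrative placeholders:

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.SparkContext._   // pair-RDD implicits on older 1.x

      object WordCount {
        def main(args: Array[String]): Unit = {
          // Local master and the input path are placeholders for this sketch.
          val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
          val sc   = new SparkContext(conf)

          // Each transformation returns a new RDD; nothing executes until an
          // action (collect) triggers the job.
          val counts = sc.textFile("input.txt")
            .flatMap(_.split("\\s+"))
            .map(word => (word, 1))
            .reduceByKey(_ + _)

          counts.collect().foreach(println)
          sc.stop()
        }
      }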
  9. Spark ecosystem tools
     Apache Spark core, with Spark SQL, Spark Streaming, MLlib, GraphX, SparkR, BlinkDB, Shark and Bagel
  10. DAG Execution Engine (DAG = Directed Acyclic Graph)
      [Diagram: an example DAG of operations: Map -> Collect; Filter -> Map -> Reduce -> Sort -> Collect]
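A hypothetical pipeline mirroring the operations named on this slide, assuming an existing SparkContext `sc`. The chained transformations are only recorded as a DAG; the single action at the end runs the whole graph as one job:

      val result = sc.parallelize(1 to 1000)
        .map(_ * 2)            // Map
        .filter(_ % 3 == 0)    // Filter
        .map(n => (n % 10, n)) // Map (to key/value pairs)
        .reduceByKey(_ + _)    // Reduce
        .sortByKey()           // Sort
        .collect()             // Collect: the only action; triggers execution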
  11. RDD: Resilient Distributed Datasets
      • Features: read only; fault tolerance without replication, using data lineage for recovery; low network I/O
      • Partitions/slices allow parallel tasks
      [Diagram: Disk -> Transform 1 -> RDD 1 -> Transform 2 -> RDD 2, with each RDD split into data partitions]
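A short sketch of the lineage idea, again with placeholder paths and an assumed `sc`: a lost partition of `byHost` would be recomputed by replaying the recorded transformations, not restored from a replica:

      val base   = sc.textFile("events.log")                    // placeholder path
      val errors = base.filter(_.contains("ERROR"))             // Transform 1 -> RDD 1
      val byHost = errors.map(line => (line.split(" ")(0), 1))  // Transform 2 -> RDD 2

      byHost.cache()                // keep the computed partitions in memory
      println(byHost.toDebugString) // prints the lineage chain of this RDD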
  12. Lambda architecture pattern
      • Spark is used to implement the Lambda architecture: the batch layer, the speed layer and the serving layer
      [Diagram: input data flows into the batch and speed layers; the serving layer answers queries from data consumers]
  13. Spark Streaming
      • For stream processing in Spark: real time data, like Twitter queries
      • Discretized streams (DStreams): micro batches, i.e. a sequence of RDDs
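A minimal DStream sketch against the Spark Streaming 1.x API; the 2-second batch interval, host and port are illustrative assumptions. Each micro batch arrives as an RDD and is processed with the same operators as batch Spark:

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setAppName("StreamingCount").setMaster("local[2]")
      val ssc  = new StreamingContext(conf, Seconds(2))   // 2 s micro batches

      // Any line-oriented socket source works; host/port are placeholders.
      val lines  = ssc.socketTextStream("localhost", 9999)
      val counts = lines.flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)

      counts.print()          // emit each micro batch's counts
      ssc.start()
      ssc.awaitTermination()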
  14. Discretized Streams
      [Diagram: input -> Spark Streaming -> batches of x seconds -> Spark -> output]
  15. Why Spark Streaming
      • Near real time processing (0.5 to 2 s latency)
      • Parallel recovery of lost nodes and stragglers
      • Implements the Lambda architecture with a single engine for both batch and stream
      • Not suited for low latency requirements (e.g. 100 ms)
  16. Apache Storm vs Spark Streaming

      Feature                   | Spark Streaming                                   | Storm
      --------------------------|---------------------------------------------------|-------------------------------------------
      Processing model          | Micro-batching                                    | Event stream processing
      Message delivery options  | Inherently fault tolerant, exactly-once delivery  | At least once, at most once, exactly once
      Flexibility               | Coarse grained transformations                    | Fine grained transformations
      Implemented in            | Scala                                             | Clojure
      Development cost          | Common platform for both batch and stream         | Stream only; separate setup for batch
      Applicability             | Machine learning, interactive analytics, near real time analytics | Near real time analytics, natural language processing
  17. GraphX & MLlib
      • Data parallel vs graph parallel processing: e.g., a Wikipedia search vs a Facebook connection search or PageRank
      • Spark MLlib implements high quality machine learning algorithms
      • Iterative algorithm paradigm: x(t+1) = f(x(t)), leveraging Spark's in-memory data sets
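A toy illustration, not MLlib code, of the iterative paradigm x(t+1) = f(x(t)): a one-dimensional gradient descent where the training data is cached once and reused on every pass instead of being re-read from disk as a Hadoop job would. The input path, learning rate and iteration count are assumptions, and `sc` is an existing SparkContext:

      // Placeholder input: one "x y" pair per line.
      val points = sc.textFile("points.txt")
        .map { line =>
          val Array(x, y) = line.split(" ").map(_.toDouble)
          (x, y)
        }
        .cache()                              // reused on every iteration

      var w  = 0.0                            // x(0): initial parameter
      val lr = 0.01                           // learning rate (illustrative)
      for (t <- 1 to 100) {
        // One application of f: a gradient step computed over the cached RDD.
        val grad = points.map { case (x, y) => (w * x - y) * x }.mean()
        w -= lr * grad
      }
      println(s"fitted slope w = $w")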
  18. Performance characteristics of Spark
      • Up to 100x faster in memory
      • Up to 10x faster on disk
      (Graph courtesy: spark.apache.org)
  19. Hadoop vs Spark: sort benchmark

                     | Hadoop World Record | Spark 100 TB * | Spark 1 PB
      ---------------|---------------------|----------------|------------
      Data size      | 102.5 TB            | 100 TB         | 1000 TB
      Elapsed time   | 72 mins             | 23 mins        | 234 mins
      # Nodes        | 2100                | 206            | 190
      # Cores        | 50400               | 6592           | 6080
      # Reducers     | 10,000              | 29,000         | 250,000
      Rate           | 1.42 TB/min         | 4.27 TB/min    | 4.27 TB/min
      Rate/node      | 0.67 GB/min         | 20.7 GB/min    | 22.5 GB/min

      (Data courtesy: databricks.com)
  20. [Graph: 1 TB performance test, data processed per second]
  21. [Graph: 1 TB performance test, data rate vs RAM size]
  22. Summary
      • Apache Spark: new architecture (RDD, DAG); in-memory processing; map reduce and more (GraphX, MLlib, Spark Streaming)
      • Ecosystem tools: SparkR, BlinkDB, Storm
      • Spark performance: GBs per second; RAM to data size; inflexion point
  23. Questions
      Prajod Vettiyattil, Architect, Open Source, Wipro (@prajods, in.linkedin.com/in/prajod)
      Namitha M S, Architect, Advanced Technologies, Wipro (in.linkedin.com/in/namithams)
