
Introduction to Spark


This is a short introduction to spark presented at Exebit 2014 at IIT Madras.



  1. Introduction to Spark
      Sriram and Amritendu, DOS Lab, IIT Madras
      "Introduction to Spark" by Sriram and Amritendu is licensed under a Creative Commons Attribution 4.0 International License.
  2. Motivation
      • In Hadoop, the programmer writes a job using the MapReduce abstraction
      • The runtime distributes the work and handles fault tolerance
      • This makes analysis of large data sets easy and reliable
      • Emerging class of applications:
        - Machine learning, e.g. k-means clustering
        - Graph algorithms, e.g. PageRank
  3. Nature of the emerging class of applications
      • Iterative computation: intermediate results are reused across multiple computations
  4. Problem with Hadoop MapReduce
      • Results are written to HDFS at the end of every iteration
      • A new job is launched for each iteration
      • Incurs substantial storage and job-launch overheads
      [Diagram: each iteration reads (R) its input from HDFS and writes (W) its results back to HDFS]
  5. Can we do away with these overheads?
      • Persist intermediate results in memory
      • Memory is 10-100X faster than disk/network
      • What if a node fails? Challenge: how to handle faults efficiently
      [Diagram: iteration 1 loads input from HDFS into memory; iteration 2 reads it from memory; a failed node (X) loses its in-memory partitions]
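      The idea can be seen in a minimal sketch, assuming a Spark shell with a SparkContext sc; the input path and the per-iteration update rule are hypothetical:

      // cache() keeps the parsed data in memory, so each pass avoids re-reading HDFS;
      // without it, every iteration would go back to disk.
      val points = sc.textFile("hdfs://input/points")              // hypothetical path
                     .map(line => line.split(" ").map(_.toDouble))
                     .cache()
      var weight = 1.0
      for (i <- 1 to 10) {
        // each pass reads the cached partitions, not HDFS
        val delta = points.map(p => p(0) * weight).reduce(_ + _)
        weight += delta / points.count()
      }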
  6. Approaches to handle faults
      • Replication: the master writes each partition to r replicas; can tolerate r-1 failures
        Issues: requires more storage, more network traffic
      • Using lineage: log the operation that produced each dataset and re-compute lost partitions from the lineage information
        Issues: recovery time can be high if re-computation is very costly, i.e. high iteration time or wide dependencies
      [Diagram: a write replicated from the master to two replicas; a lineage graph D1, D2, D3 -> C1, C2 in which the lost partition C2 is re-computed from D2 and D3]
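      Why wide dependencies make lineage recovery expensive can be shown in a small sketch (variable names are illustrative, assuming a SparkContext sc):

      val nums    = sc.parallelize(1 to 1000, 4)      // 4 partitions
      val doubled = nums.map(_ * 2)                   // narrow: one parent partition per child
      val grouped = doubled.map(n => (n % 10, n))
                           .groupByKey()              // wide: each child depends on all parents
      // Losing one partition of `doubled` requires re-reading only one partition of `nums`;
      // losing one partition of `grouped` forces re-computation across all parent partitions.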
  7. Spark
      • RDD: Resilient Distributed Datasets
      • Read-only, partitioned collection of records
      • Supports only coarse-grained operations, e.g. map and group-by transformations, reduce action
      • Uses the lineage graph to recover from faults
      [Diagram: a value split into 3 partitions D11, D12, D13]
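      A minimal sketch of what "coarse-grained" means in practice, assuming a Spark shell with a SparkContext sc:

      // Transformations apply to every record; individual elements are never updated in place.
      val records = sc.parallelize(Seq("a b", "b c", "c d"), 3)   // 3 partitions
      val pairs   = records.flatMap(_.split(" ")).map(w => (w, 1))
      val counts  = pairs.reduceByKey(_ + _)
      println(counts.toDebugString)   // prints the lineage graph Spark uses for recovery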
  8. Spark contd.
      • Control placement of the partitions of an RDD:
        - can specify the number of partitions
        - can partition based on a key in each record (useful in joins); see the sketch below
      • In-memory storage: up to 100X speedup over Hadoop for iterative applications
      • Spark can run on Hadoop YARN and read files from HDFS
      • Spark is written in Scala
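      A sketch of key-based partitioning, with hypothetical input paths and an assumed 8-way HashPartitioner:

      import org.apache.spark.HashPartitioner
      // Co-partitioning both pair RDDs by key lets the join match keys locally
      // instead of shuffling both sides across the network.
      val users  = sc.textFile("hdfs://input/users")               // hypothetical path
                     .map(l => (l.split(",")(0), l))
                     .partitionBy(new HashPartitioner(8))
                     .cache()
      val events = sc.textFile("hdfs://input/events")              // hypothetical path
                     .map(l => (l.split(",")(0), l))
                     .partitionBy(new HashPartitioner(8))
      val joined = users.join(events)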
  9. Scala overview
      • Functional programming meets object orientation
      • "No side effects" aids concurrent programming
      • Every variable is an object
      • Every function is a value
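      The last point, functions as values, in a short sketch:

      val inc: Int => Int = x => x + 1          // a function stored in a val, like any value
      def twice(f: Int => Int, x: Int) = f(f(x))
      twice(inc, 5)                             // 7: the function is passed as an argument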
  10. Variables and Functions
      var obj: java.lang.String = "Hello"
      var x = new A()                  // assumes some class A is in scope
      def square(x: Int): Int = {      // the ": Int" after the parameter list is the return type
        x * x
      }
  11. Execution of a function
      def square(x: Int): Int = {
        x * x
      }
      scala> square(2)
      res0: Int = 4
      scala> square(square(6))
      res1: Int = 1296
  12. Nested Functions
      def factorial(i: Int): Int = {
        def fact(i: Int, acc: Int): Int = {
          if (i <= 1) acc
          else fact(i - 1, i * acc)
        }
        fact(i, 1)
      }
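      A quick check of the nested helper in the REPL, assuming the definition above has been pasted in:

      scala> factorial(5)
      res2: Int = 120   // fact(5,1) -> fact(4,5) -> fact(3,20) -> fact(2,60) -> fact(1,120)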
  14. Higher order map functions
      val add = (x: Int) => x + 1
      val lst = List(1, 2, 3)
      lst.map(add)          // List(2, 3, 4)
      lst.map(x => x + 1)   // List(2, 3, 4)
      lst.map(_ + 1)        // List(2, 3, 4)
  15. Defining Objects
      object Example {
        def main(args: Array[String]) {
          val logData = sc.textFile(logFile, 2).cache()
          -------
          -------
        }
      }
      Example.main(Array("master", "noOfMap", "noOfReducer"))
  16. Spark: Filter transformation in RDD
      val logData = sc.textFile(logFile, 2).cache()
      val numAs = logData.filter(line => line.contains("a"))
      The filter is applied to each line and returns a new RDD ("give me those lines which contain 'a'").
      Input file:
        Here is a example of filter Transformation, you can notice that the filter method will be applied on each line and return a new RDD
        test
      Result (only the first line contains an 'a'):
        Here is a example of filter Transformation, you can notice that the filter method will be applied on each line and return a new RDD
  17. Count
      val logData = sc.textFile(logFile, 2).cache()
      val numAs = logData.filter(line => line.contains("a"))
      numAs.count()   // returns 5 in the slide's example
      Sample input:
        Here is a example of filter Transformation, you can notice that the filter method will be applied on each line and return a new RDD
        test
  18. flatMap
      val logData = sc.textFile(logFile, 2).cache()
      val words = logData.flatMap(line => line.split(" "))
      Take each line, split it on spaces, and return the resulting array of words:
        "Here is a example of filter map"  ->  (Here, is, a, example, of, filter, map)
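      The filter, count, and flatMap examples above can be tried without an HDFS file; a minimal sketch using an in-memory RDD built from the slides' two sample lines:

      val lines = sc.parallelize(Seq(
        "Here is a example of filter map",
        "test"))
      val withA = lines.filter(_.contains("a"))   // lazy transformation
      withA.count()                               // 1: only the first line contains an 'a'
      lines.flatMap(_.split(" ")).collect()
      // Array(Here, is, a, example, of, filter, map, test)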
  19. Wordcount Example in Spark
      val spark = new SparkContext(master, appName, [sparkHome], [jars])
      val file = spark.textFile("hdfs://[input_path_to_textfile]")
      val counts = file.flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://[output_path]")
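      One design note on the pipeline above: reduceByKey combines the (word, 1) pairs within each partition before the shuffle, so only one partial count per word per partition crosses the network; grouping all pairs first and summing afterwards would ship every pair.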
  20. Limitations
      • RDDs are not suitable for applications that require fine-grained updates to shared state, e.g. the storage layer of a web application
  21. References
      • http://www.slideshare.net/tpunder/a-brief-intro-to-scala
      • Joshua D. Suereth, "Scala in Depth", Manning Publications.
      • Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI '12), USENIX Association, Berkeley, CA, USA, 2012.
      • Pictures:
        - http://www.xbitlabs.com/images/news/2011-04/hard_disk_drive.jpg
        - http://www.thecomputercoach.net/assets/images/256_MB_DDR_333_Cl2_5_Pc2700_RAM_Chip_Brand_New_Chip.jpg
