Introduction to Spark
This is a short introduction to Spark, presented at Exebit 2014 at IIT Madras.

Transcript

  • 1. Introduction to Spark. Sriram and Amritendu, DOS Lab, IIT Madras. "Introduction to Spark" by Sriram and Amritendu is licensed under a Creative Commons Attribution 4.0 International License.
  • 2. Motivation
    – In Hadoop, the programmer writes jobs using the MapReduce abstraction, and the runtime distributes the work and handles fault tolerance. This makes analysis of large data sets easy and reliable.
    – Emerging class of applications: machine learning (e.g. k-means clustering) and graph algorithms (e.g. PageRank).
  • 3. Nature of the emerging class of applications: iterative computation, in which intermediate results are reused across multiple computations.
  • 4. Problem with Hadoop MapReduce
    – [Diagram: each iteration is a full read from HDFS, map/reduce, and write back to HDFS.]
    – Results are written to HDFS, and a new job is launched for each iteration.
    – This incurs substantial storage and job-launch overheads.
  • 5. Can we do away with these overheads?
    – Persist intermediate results in memory: memory is 10-100X faster than disk/network.
    – [Diagram: iteration 1 reads from HDFS; later iterations read from memory; an X marks a failed node.]
    – But what if a node fails? Challenge: how to handle faults efficiently.
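A minimal sketch of this idea in Spark's Scala API (not from the original deck; it assumes a live SparkContext sc and a made-up input path). The parsed data is cached once, and every later iteration reads it from memory instead of from HDFS:

        // Sketch only: `sc` and the HDFS path are assumptions.
        val points = sc.textFile("hdfs://.../points.txt")
                       .map(line => line.split(" ").map(_.toDouble))
                       .cache()                 // parsed data stays in memory

        var weight = 1.0
        for (i <- 1 to 10) {
          // each pass reuses `points` from memory rather than re-reading HDFS
          val delta = points.map(p => p(0) * weight).reduce(_ + _)
          weight += delta / points.count()
        }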
  • 6. Approaches to handle faults
    – Replication: the master writes each partition to r replicas, so up to r-1 failures can be tolerated. [Diagram: master plus replicas 1 and 2, with one replica failed.] Issues: requires more storage and causes more network traffic.
    – Lineage: log the operations and re-compute lost partitions using the lineage information. [Diagram: partitions D1-D3 feed C1-C2; when a partition is lost, only its dependents are re-computed.] Issues: recovery time can be high if re-computation is very costly, e.g. with a high iteration time or wide dependencies.
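To make the wide-dependencies caveat concrete, a small sketch (assuming spark-shell, where sc and the pair-RDD operations are available): map has narrow dependencies, so a lost partition needs only its one parent partition recomputed, while groupByKey has wide dependencies, so a lost partition may need all parent partitions:

        val nums   = sc.parallelize(1 to 1000000, 4)          // 4 partitions
        val narrow = nums.map(_ * 2)                          // narrow: one parent per partition
        val wide   = nums.map(n => (n % 10, n)).groupByKey()  // wide: shuffles across partitions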
  • 7. Spark
    – RDD: Resilient Distributed Datasets. A read-only, partitioned collection of records.
    – Supports only coarse-grained operations, e.g. map and group-by transformations and the reduce action.
    – Uses the lineage graph to recover from faults.
    – [Diagram: a value val stored as an RDD with 3 partitions D11, D12, D13.]
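A sketch of these coarse-grained operations (again assuming spark-shell); toDebugString prints the lineage Spark keeps for fault recovery:

        val records = sc.parallelize(Seq("a", "b", "a", "c"), 3)  // an RDD with 3 partitions
        val counts  = records.map(word => (word, 1))              // coarse-grained transformation
                             .reduceByKey(_ + _)
        println(counts.toDebugString)                             // the lineage graph
        counts.collect().foreach(println)                         // action: materializes the result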
  • 8. Spark contd.
    – Control placement of the partitions of an RDD: you can specify the number of partitions, or partition based on a key in each record (useful in joins).
    – In-memory storage: up to 100X speedup over Hadoop for iterative applications.
    – Spark can run on Hadoop YARN and read files from HDFS.
    – Spark is written in Scala.
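A sketch of controlling partitioning by key (assuming spark-shell; the data is made up). A HashPartitioner co-locates records with the same key, which is what makes the later join cheap:

        import org.apache.spark.HashPartitioner

        val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")), 4)   // explicit partition count
        val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (2, "ink")))

        val byKey  = users.partitionBy(new HashPartitioner(4))  // partition by key
        val joined = byKey.join(orders)                         // same-key records meet in one place
        joined.collect().foreach(println)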
  • 9. Scala overview
    – Functional programming meets object orientation.
    – "No side effects" aids concurrent programming.
    – Every variable is an object.
    – Every function is a value.
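Two short illustrations of the last two points:

        val n: Int = 3                       // even "primitive" values are objects with methods
        println(n.to(7))                     // an inclusive range from 3 to 7

        val double = (x: Int) => x * 2       // a function is an ordinary value ...
        println(List(1, 2, 3).map(double))   // ... that can be passed around: List(2, 4, 6)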
  • 10. Variables and Functions

        var obj: java.lang.String = "Hello"  // explicitly typed variable
        var x = new A()                      // type inferred from the constructor

        def square(x: Int): Int = {          // the ": Int" after the parameter list is the return type
          x * x
        }
  • 11. Execution of a function

        def square(x: Int): Int = {
          x * x
        }

        scala> square(2)
        res0: Int = 4

        scala> square(square(6))
        res1: Int = 1296
  • 12. Nested Functions

        def factorial(i: Int): Int = {
          def fact(i: Int, acc: Int): Int = {
            if (i <= 1) acc
            else fact(i - 1, i * acc)
          }
          fact(i, 1)
        }
  • 13. Nested Functions (same code, with the outer call fact(i, 1) highlighted: it starts the recursion with the accumulator set to 1).
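Calling the function in the REPL:

        scala> factorial(5)
        res0: Int = 120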
  • 14. Higher-order map functions

        val add = (x: Int) => x + 1
        val lst = List(1, 2, 3)

        lst.map(add)          // List(2, 3, 4)
        lst.map(x => x + 1)   // List(2, 3, 4)
        lst.map(_ + 1)        // List(2, 3, 4)
  • 15. Defining Objects

        object Example {
          def main(args: Array[String]) {
            val logData = sc.textFile(logFile, 2).cache()
            // -------
            // -------
          }
        }

        Example.main(Array("master", "noOfMap", "noOfReducer"))

    (The snippet assumes a SparkContext sc and a path logFile are in scope; main takes an Array[String], so the caller passes an Array of arguments.)
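For completeness, a self-contained version of the same shape (a sketch only; the constructor is the pre-1.0 form shown on slide 19, and the app name and input path are invented):

        import org.apache.spark.SparkContext

        object Example {
          def main(args: Array[String]) {
            // "local[2]" runs Spark locally with 2 threads; a cluster URL also works
            val sc      = new SparkContext("local[2]", "Example")
            val logFile = "hdfs://.../access.log"          // hypothetical input path
            val logData = sc.textFile(logFile, 2).cache()  // 2 partitions, kept in memory
            println(logData.count())
            sc.stop()
          }
        }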
  • 16. Spark: filter transformation on an RDD

        val logData = sc.textFile(logFile, 2).cache()
        val numAs   = logData.filter(line => line.contains("a"))

    "Give me those lines which contain 'a'": the filter predicate is applied to each line of the sample file (whose text reads "Here is a example of filter Transformation, you can notice that the filter method will be applied on each line and return a new RDD / test"), and a new RDD of the matching lines is returned.
  • 17. Count

        val logData = sc.textFile(logFile, 2).cache()
        val numAs   = logData.filter(line => line.contains("a"))
        numAs.count()   // 5

    Five lines of the sample file contain 'a', so count() returns 5.
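One point the slides use implicitly: filter is a lazy transformation, while count is an action that triggers the actual computation. A sketch, reusing logData from above:

        val numAs = logData.filter(_.contains("a"))  // lazy: returns at once, nothing computed yet
        val n     = numAs.count()                    // action: only now is the file read and filtered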
  • 18. flatMap

        val logData = sc.textFile(logFile, 2).cache()
        val words   = logData.flatMap(line => line.split(" "))

    "Take each line, split it on spaces, and give me the array": the line "Here is a example of filter map" becomes (Here, is, a, example, of, filter, map).
  • 19. Word-count example in Spark

        val spark = new SparkContext(master, appName, [sparkHome], [jars])
        val file  = spark.textFile("hdfs://[input_path_to_textfile]")

        val counts = file.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs://[output_path]")
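To try this interactively: in the spark-shell REPL a SparkContext is already bound to the name sc, so the constructor line can be dropped (the bracketed paths are still placeholders):

        val counts = sc.textFile("hdfs://[input_path_to_textfile]")
                       .flatMap(line => line.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
        counts.take(3).foreach(println)  // inspect a few (word, count) pairs without writing to HDFS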
  • 20. Limitations
    – RDDs are not suitable for applications that require fine-grained updates, e.g. a web storage system.
  • 21. References
    – http://www.slideshare.net/tpunder/a-brief-intro-to-scala
    – Joshua D. Suereth, Scala in Depth.
    – Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI'12), USENIX Association, Berkeley, CA, USA, 2012.
    – Pictures:
      – http://www.xbitlabs.com/images/news/2011-04/hard_disk_drive.jpg
      – http://www.thecomputercoach.net/assets/images/256_MB_DDR_333_Cl2_5_Pc2700_RAM_Chip_Brand_New_Chip.jpg
