
Apache Spark RDD 101

A very short set of slides describing the RDD data structure.
Extracted from my 3-day course: www.sparkInternals.com
There is also a video of this on YouTube: http://youtu.be/odcEg515Ne8

  1. Spark Illustrated and Illuminated. Tony Duarte, Spark and Hadoop Training. tony@sparkInternals.com, www.sparkInternals.com, (650) 223-3397
  2. newRDD = myRDD.map(myfunc)
  3. What is an RDD? (Diagram: myRDD : RDD holds an Array of four Partition references, each backed by an in-memory partition.) Some RDD characteristics: • An RDD holds references to Partition objects • Each Partition object references a subset of your data • Partitions are assigned to nodes on your cluster • Each partition/split will be in RAM (by default)
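
To make slide 3 concrete, here is a minimal sketch (assuming a running SparkContext named sc, as in spark-shell; the data and partition count are illustrative):

     val myRDD = sc.parallelize(1 to 1000, 4)  // ask Spark to split the data into 4 partitions
     println(myRDD.partitions.length)          // 4 -- myRDD holds references to 4 Partition objects
     // Each Partition object references a subset of the data (here ~250 numbers);
     // partitions are assigned to nodes on the cluster and, by default, are
     // materialized in RAM when computed.
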
  4. What happens when you execute newRDD = myRDD.map(myfunc)? (Diagram: calling map() on myRDD : RDD constructs new mappedRDD(myRDD, myfunc); the resulting newRDD : mappedRDD records a dependency on myRDD, stores the operation map(myfunc), and applies myfunc when its compute() method runs.)
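
The diagram on slide 4 boils down to a small pattern. Below is a self-contained sketch of that pattern using hypothetical types (MyRDD, MyPartition, MyMappedRDD), not Spark's real classes, which carry more machinery (early Spark versions did have a MappedRDD of roughly this shape):

     trait MyPartition { def index: Int }

     abstract class MyRDD[T] {
       // compute() produces the data for one partition
       def compute(split: MyPartition): Iterator[T]
       // map() does no work: it just wraps this RDD in a new one
       def map[U](f: T => U): MyRDD[U] = new MyMappedRDD(this, f)
     }

     // The new RDD keeps a dependency on its parent and stores the operation map(f)
     class MyMappedRDD[T, U](parent: MyRDD[T], f: T => U) extends MyRDD[U] {
       def compute(split: MyPartition): Iterator[U] =
         parent.compute(split).map(f)  // f is applied only when compute() runs
     }
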
  5. After executing newRDD = myRDD.map(myfunc): (Diagram: newRDD : mappedRDD holds an Array of Partition references, each backed by an in-memory partition, plus a dependency on myRDD : RDD and the stored operation map(myfunc).) This architecture enables: • You can chain operations on RDDs and Spark will keep generating new RDDs • Job scheduling can be lazy, since a dependency chain of operations can be submitted
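
Slide 5's chaining and lazy scheduling look like this in practice (again assuming a SparkContext sc; the operations are illustrative):

     val base    = sc.parallelize(1 to 10)     // RDD with no parent
     val doubled = base.map(_ * 2)             // new RDD, dependency on base
     val evens   = doubled.filter(_ % 4 == 0)  // new RDD, dependency on doubled

     // Nothing has run yet. Only an action submits the whole dependency
     // chain (base -> doubled -> evens) to the scheduler:
     val result = evens.collect()              // Array(4, 8, 12, 16, 20)
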
  6. Spark Illustrated and Illuminated. Tony Duarte, Spark and Hadoop Training. tony@sparkInternals.com, www.sparkInternals.com, (650) 223-3397
