Anatomy of RDD: Deep dive into Spark RDD abstraction


An in-depth discussion of the Apache Spark RDD abstraction. Presented at the Bangalore Apache Spark Meetup by Madhukara Phatak on 28/03/2015.



  1. Anatomy of RDD: A deep dive into the RDD data structure
  2. ● Madhukara Phatak ● Big data consultant and trainer ● Consults in Hadoop, Spark and Scala
  3. Agenda ● What is RDD? ● Immutable and Distributed ● Partitions ● Laziness ● Caching ● Extending Spark API
  4. What is RDD? Resilient Distributed Dataset: a big collection of data with the following properties - Immutable - Distributed - Lazily evaluated - Type inferred - Cacheable
  5. Immutable and Distributed
  6. Partitions ● Logical division of data ● Derived from Hadoop Map/Reduce ● All input, intermediate and output data is represented as partitions ● Partitions are the basic unit of parallelism ● RDD data is just a collection of partitions
  7. Partitions from Input Data [Diagram: data in HDFS (Chunk 1, Chunk 2, Chunk 3) flows through an InputFormat to become the RDD's partitions (Partition 1, Partition 2, Partition 3)]
  8. Partition example
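  The slide's code isn't in the transcript; here is a minimal sketch of inspecting partitions, assuming the spark-shell (where sc is predefined) and a hypothetical input path:

      // Each HDFS chunk becomes one partition of the resulting RDD.
      val dataRDD = sc.textFile("hdfs:///data/input.txt", 3)

      // Number of partitions backing the RDD.
      println(dataRDD.partitions.length)

      // glom() turns each partition into an array, exposing the split.
      dataRDD.glom().collect().foreach(p => println(p.length))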
  9. Partition and Immutability ● All partitions are immutable ● Every transformation generates a new partition ● Partition immutability is driven by the underlying storage, like HDFS ● Partition immutability allows for fault recovery
  10. Partitions and Distribution ● Partitions derived from HDFS are distributed by default ● Partitions are also location aware ● Location awareness of partitions allows for data locality ● For computed data, caching lets us distribute partitions in memory as well
  11. Accessing partitions ● We can access a whole partition at a time rather than a single row ● The mapPartitions API of RDD allows us to do that ● Processing a partition at a time enables partition-wise operations that cannot be expressed by accessing a single row
  12. Map partition example
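  A sketch of mapPartitions (names and data are illustrative): summing each partition in one pass, the kind of partition-wise work that a per-row map cannot express:

      val numbers = sc.parallelize(1 to 100, 4)

      // The function receives a whole partition as an iterator and must
      // return an iterator; per-partition setup (e.g. opening a database
      // connection once per partition) would go at the top.
      val perPartitionSums = numbers.mapPartitions { iter =>
        Iterator(iter.sum)
      }

      perPartitionSums.collect().foreach(println)   // four partial sums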
  13. Partitions for transformed Data ● Partitioning will be different for key/value pairs generated by a shuffle operation ● Partitioning is driven by the partitioner specified ● By default HashPartitioner is used ● You can also use your own partitioner
  14. Hash Partitioning [Diagram: keys from the input RDD (Partition 1, Partition 2, Partition 3) are routed by Hash(key) % no.of.partitions into the output RDD (Partition 1, Partition 2)]
  15. Hash partition example
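  A sketch of hash partitioning a pair RDD, matching the Hash(key) % no.of.partitions rule in the diagram (data is illustrative):

      import org.apache.spark.HashPartitioner

      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))

      // Keys with the same hash value land in the same partition.
      val hashed = pairs.partitionBy(new HashPartitioner(2))

      hashed.glom().collect().foreach(p => println(p.mkString(", ")))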
  16. Custom Partitioner ● Partition the data according to your data structure ● Custom partitioning allows control over the number of partitions and the distribution of data across partitions when grouping or reducing is done
  17. Custom partition example
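  A sketch of a custom partitioner (the class and the routing rule are our own invention): negative keys go to partition 0, everything else to partition 1:

      import org.apache.spark.Partitioner

      class SignPartitioner extends Partitioner {
        override def numPartitions: Int = 2
        override def getPartition(key: Any): Int =
          if (key.asInstanceOf[Int] < 0) 0 else 1
      }

      val pairs = sc.parallelize(Seq((-3, "x"), (5, "y"), (-1, "z"), (7, "w")))
      val custom = pairs.partitionBy(new SignPartitioner)
      custom.glom().collect().foreach(p => println(p.mkString(", ")))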
  18. Lookup operation ● Partitioning allows faster lookups ● The lookup operation lets us find the values for a given key ● Using the partitioner, lookup determines which partition to look in ● It then only needs to scan that single partition ● If no partitioner is set, it falls back to a filter over the whole RDD
  19. Lookup example
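  A sketch of lookup on a partitioned pair RDD (data is illustrative): because a partitioner is set, lookup hashes the key and scans only the owning partition:

      import org.apache.spark.HashPartitioner

      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
        .partitionBy(new HashPartitioner(2))

      // Returns Seq(2) without touching the other partition; with no
      // partitioner set this would degrade to a full filter.
      println(pairs.lookup("b"))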
  20. Laziness
  21. Parent (Dependency) ● Each RDD has access to its parent RDD ● Nil is the parent value for the first RDD ● Before computing its value, an RDD always computes its parent ● This chain of computation allows for laziness
  22. Subclassing ● Each Spark operator creates an instance of a specific subclass of RDD ● The map operator results in a MappedRDD, flatMap in a FlatMappedRDD, etc. ● The subclass allows the RDD to remember the operation performed in the transformation
  23. RDD transformations
      val dataRDD = sc.textFile(args(1))
      val splitRDD = dataRDD.flatMap(value => value.split(" "))
      Lineage: Nil ← HadoopRDD ← dataRDD: MappedRDD ← splitRDD: FlatMappedRDD
  24. Laziness example
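  A sketch of laziness in action (data is illustrative): the side effect inside flatMap runs only when an action forces the lineage to be evaluated:

      val dataRDD = sc.parallelize(Seq("hello world", "lazy eval"))

      val splitRDD = dataRDD.flatMap { value =>
        println("evaluating a record...")   // nothing printed yet
        value.split(" ")
      }

      // Only collect(), an action, walks the parent chain and triggers
      // the computation above.
      splitRDD.collect().foreach(println)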
  25. Compute ● compute is the function for evaluating each partition of an RDD ● compute is an abstract method of RDD ● Each subclass of RDD, like MappedRDD or FilteredRDD, has to override this method
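  For reference, the method as declared in org.apache.spark.rdd.RDD (the enclosing trait name below is ours, for illustration only):

      import org.apache.spark.{Partition, TaskContext}

      trait ComputeContract[T] {
        // Evaluate a single partition and return its rows as an iterator;
        // subclasses typically call the parent's iterator for the same
        // split and layer their own transformation on top.
        def compute(split: Partition, context: TaskContext): Iterator[T]
      }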
  26. RDD actions
      val dataRDD = sc.textFile(args(1))
      val flatMapRDD = dataRDD.flatMap(value => value.split(" "))
      flatMapRDD.collect()
      collect() calls runJob, which triggers compute down the lineage: FlatMappedRDD → MappedRDD → HadoopRDD → Nil
  27. runJob API ● The runJob API of RDD is the API used to implement actions ● runJob takes each partition and lets you evaluate it ● All Spark actions internally use the runJob API
  28. Run job example
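  A sketch of calling runJob directly, the way actions do internally (assuming the spark-shell):

      val numbers = sc.parallelize(1 to 100, 4)

      // Evaluate a function against each partition's iterator; Spark's
      // own count() is implemented in essentially this way.
      val perPartitionCounts: Array[Long] =
        sc.runJob(numbers, (iter: Iterator[Int]) => iter.size.toLong)

      println(perPartitionCounts.sum)   // 100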
  29. Caching ● cache internally uses the persist API ● persist sets a specific storage level for a given RDD ● The Spark context tracks persistent RDDs ● On first evaluation, the partition is put into memory by the block manager
  30. Block manager ● Handles all in-memory data in Spark ● Responsible for ○ cached data (BlockRDD) ○ shuffle data ○ broadcast data ● A partition is stored in a block with id (RDD.id, partition_index)
  31. How caching works? ● The partition iterator checks the storage level ● If a storage level is set, it calls cacheManager.getOrCompute(partition) ● Since the iterator runs for every RDD evaluation, caching is transparent to the user
  32. Caching example
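  A sketch of caching (the input path is hypothetical): the first action computes and stores the partitions through the block manager, the second is served from memory:

      val wordsRDD = sc.textFile("hdfs:///data/input.txt")
        .flatMap(_.split(" "))
        .cache()   // same as persist(StorageLevel.MEMORY_ONLY)

      wordsRDD.count()   // computes, then caches each partition
      wordsRDD.count()   // reads the cached blocks, no recomputation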
  33. Extending Spark API
  34. Why? ● Domain-specific operators ○ Allow developers to express domain-specific calculations in a cleaner way ○ Improve code readability ○ Easy to maintain ● Domain-specific RDDs ○ A better way of expressing domain data ○ Control over partitioning and distribution
  35. DSL Example ● salesRecordRDD: RDD[SalesRecord] ● To compute the total sales ○ in plain Spark ■ ○ in our DSL ■ salesRecord.totalSales ● Our DSL hides the internal representation and improves readability
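  The plain-Spark bullet's code isn't in the transcript; a minimal sketch of the contrast, assuming salesRecordRDD: RDD[SalesRecord] from the slide and a hypothetical SalesRecord case class with an amount field:

      // Plain Spark: the aggregation logic is spelled out at the call site.
      val total = salesRecordRDD.map(_.amount).reduce(_ + _)

      // DSL: the same computation behind a domain operator (defined via
      // the implicits shown on the slides that follow).
      val totalDsl = salesRecordRDD.totalSales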
  36. How to Extend ● Custom operators on RDDs ○ Domain-specific operators on specific RDDs ○ Uses the Scala implicits mechanism ○ Feels and works like a built-in operator ● Custom RDDs ○ Extend the RDD API to create our own RDD ○ Very powerful when combined with custom operators
  37. Implicits in Scala ● A way of extending types on the fly ● Implicits are also used to pass parameters to functions, read from the environment ● In our example, we just use the type-extension facility ● All implicits are checked at compile time
  38. Implicits example
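  A minimal, Spark-free sketch of the type-extension facility, using an implicit class (names are illustrative):

      object ImplicitsExample {
        // Adds a squared method to Int on the fly; the compiler rewrites
        // 5.squared to new RichSquare(5).squared at compile time.
        implicit class RichSquare(val self: Int) extends AnyVal {
          def squared: Int = self * self
        }

        def main(args: Array[String]): Unit =
          println(5.squared)   // 25
      }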
  39. Adding operators to RDDs ● We use the Scala implicits facility to add custom operators to our RDDs ● These operators only show up on our RDDs ● All implicit conversions are handled by Scala, not by Spark ● Spark internally uses the same trick for PairRDDs
  40. Custom operator example
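  A sketch of the operator behind the DSL slide, assuming the hypothetical SalesRecord case class again (paste into the spark-shell):

      import org.apache.spark.rdd.RDD
      import scala.language.implicitConversions

      case class SalesRecord(id: String, customerId: String, amount: Double)

      class SalesRecordRDDFunctions(rdd: RDD[SalesRecord]) {
        // Visible only on RDD[SalesRecord], nowhere else.
        def totalSales: Double = rdd.map(_.amount).reduce(_ + _)
      }

      // Scala, not Spark, applies this conversion; Spark does the same
      // for PairRDDFunctions.
      implicit def toSalesRDDFunctions(rdd: RDD[SalesRecord]): SalesRecordRDDFunctions =
        new SalesRecordRDDFunctions(rdd)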
  41. Extending RDD ● Extending the RDD API allows us to create our own custom RDD structure ● Custom RDDs allow control over computation ● You can change partitions, locality and evaluation depending upon your requirements
  42. Discount RDD example
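  A sketch of what a DiscountRDD might look like (class and field names are our assumptions, reusing the hypothetical SalesRecord): a custom RDD that applies a discount while computing each partition:

      import org.apache.spark.{Partition, TaskContext}
      import org.apache.spark.rdd.RDD

      class DiscountRDD(prev: RDD[SalesRecord], discountPercentage: Double)
          extends RDD[SalesRecord](prev) {

        // Reuse the parent's partitioning unchanged.
        override protected def getPartitions: Array[Partition] =
          firstParent[SalesRecord].partitions

        // Evaluate the parent's partition, rewriting each record on the fly.
        override def compute(split: Partition, context: TaskContext): Iterator[SalesRecord] =
          firstParent[SalesRecord].iterator(split, context).map { record =>
            record.copy(amount = record.amount * (1 - discountPercentage))
          }
      }

      // Usage sketch: new DiscountRDD(salesRecordRDD, 0.1).collect()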