4. INTRO
▸ To build better intuition about the differences between these
numbers, humanize the durations
▸ Then, we can map each latency number to a human activity
8. INTRO
Distributed data parallel programming adds two more worries:
▸ Partial failure: a subset of the machines crash
▸ Latency: certain operations have much higher latency than others
RUN ON: JVM
IMPLEMENTED IN: SCALA
SDKS: PYTHON / SCALA / JAVA
OTHER LIBRARIES: R
9. INTRO
Spark components
Spark
▸ A distributed data parallel programming framework
▸ Implements a data-parallel model called RDD (Resilient Distributed Dataset)
RDD Core
12. INTRO
WHAT’S AN RDD
▸ Formal def: an interface with five properties
▸ A set of partitions
▸ A list of dependencies on parent RDDs
▸ A function to compute a partition from its parents
▸ (Optional) A partitioner
▸ (Optional) Preferred locations for each partition
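The five properties can be sketched as a minimal Python interface. This is a simplified illustration of the idea, not Spark's actual class hierarchy; `ParallelCollection` is a hypothetical source RDD used only to show the shape of the contract.

```python
from abc import ABC, abstractmethod

class RDD(ABC):
    """Simplified sketch of the five-property RDD interface."""
    @abstractmethod
    def partitions(self): ...           # a set of partitions
    @abstractmethod
    def dependencies(self): ...         # a list of parent RDDs
    @abstractmethod
    def compute(self, partition): ...   # compute one partition from its parents
    def partitioner(self):              # optional: how keys map to partitions
        return None
    def preferred_locations(self, partition):  # optional: data-locality hints
        return []

class ParallelCollection(RDD):
    """A source RDD: splits an in-memory list into n partitions."""
    def __init__(self, data, n):
        self._parts = [data[i::n] for i in range(n)]
    def partitions(self):
        return list(range(len(self._parts)))
    def dependencies(self):
        return []                       # a source has no parents
    def compute(self, p):
        return self._parts[p]
```

A source RDD has an empty dependency list; transformed RDDs would list their parents, which is what makes lineage-based recovery possible.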
24. PYTHON EXAMPLE - COUNT RATING NUM FOR EACH STAR
For more detailed steps, see:
https://www.udemy.com/taming-big-data-with-apache-spark-hands-on/learn/v4/overview
39. LAZY-EVALUATION
▸ Def:
▸ Transformations are lazy. Their result RDD is not immediately computed.
▸ Actions are eager. Their result is immediately computed.
▸ Example:
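The lazy/eager split can be sketched without Spark at all. In this toy `LazyRDD` (a simplified illustration, not the PySpark API), `map` only records a plan and `collect` walks the plan; the `calls` list proves nothing runs until the action.

```python
calls = []  # records when the mapped function actually runs

class LazyRDD:
    def __init__(self, data):
        self._data = data
    def map(self, f):            # transformation: lazy, just records the plan
        child = LazyRDD(None)
        child._plan = (self, f)
        return child
    def collect(self):           # action: eager, walks the plan and computes
        if self._data is not None:
            return list(self._data)
        parent, f = self._plan
        return [f(x) for x in parent.collect()]

nums = LazyRDD([1, 2, 3])
doubled = nums.map(lambda x: (calls.append(x), 2 * x)[1])
assert calls == []               # nothing has been computed yet
result = doubled.collect()       # the action triggers the computation
```

Real Spark behaves the same way: `rdd.map(f)` returns instantly, and work only happens when an action such as `collect()` or `count()` is called.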
43. LAZY-EVALUATION
▸ Why:
▸ Limit network communication
▸ Only when you perform an action does its result return to the driver program
[Diagram: RDD transformations and actions; results flow back to the driver only on actions]
44. LAZY-EVALUATION
▸ What happens when loads/transformations are performed:
▸ Records RDD dependencies in a directed acyclic graph (DAG)
▸ Enables failure recovery: lost partitions are recomputed from lineage
▸ Lets the scheduler pipeline operations in execution order
▸ Optimizes network traffic based on partitioning and shuffling
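The failure-recovery point above can be sketched in a few lines: if every node only records its parents and a compute function, the lineage alone is enough to rebuild any lost result. This is a conceptual illustration, not Spark's scheduler code.

```python
# Each node records its parents and how to compute itself from them.
# Losing a partition just means re-running compute() along the recorded lineage.
class Node:
    def __init__(self, name, parents, fn):
        self.name, self.parents, self.fn = name, parents, fn
    def compute(self):
        return self.fn([p.compute() for p in self.parents])

source   = Node("source", [], lambda _: [1, 2, 3])
mapped   = Node("map", [source], lambda inp: [x * 10 for x in inp[0]])
filtered = Node("filter", [mapped], lambda inp: [x for x in inp[0] if x > 10])

# The DAG alone is enough to rebuild the result after a failure:
assert filtered.compute() == [20, 30]
```

Because narrow dependencies like `map` then `filter` form a straight chain, a scheduler can also pipeline them into a single pass over each partition.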
49. CACHE
▸ Why
▸ Avoid network IO and recomputation
▸ How
▸ Eviction is implemented with a Least Recently Used (LRU) cache policy
▸ So users do not have to worry about caching too much data
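The LRU policy itself is easy to demonstrate. This is a generic sketch of LRU eviction in the spirit of Spark's block cache, not Spark's actual storage code; in real PySpark you would simply call `rdd.cache()` or `rdd.persist()`.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU eviction policy: oldest untouched entry is dropped first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()
    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)   # evict least recently used
    def get(self, key):
        if key not in self._store:
            return None                       # miss: the caller recomputes
        self._store.move_to_end(key)          # a hit makes the entry "recent"
        return self._store[key]

cache = LRUCache(2)
cache.put("rdd1", [1]); cache.put("rdd2", [2])
cache.get("rdd1")             # touch rdd1, so rdd2 becomes least recent
cache.put("rdd3", [3])        # over capacity: rdd2 is evicted
```

A cache miss is never an error here, which is why users need not budget memory precisely: an evicted partition is simply recomputed from lineage.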
87. BROADCAST VARIABLES
▸ Motivation:
▸ Normally, Spark ships each closure, including the variables it uses, separately with every task
▸ But in some cases, the same variable is used by multiple parallel operations
▸ By default, Spark's task-launching mechanism is optimized for small task sizes
▸ But sometimes a large read-only variable needs to be shared across tasks, or across operations
▸ Examples: large lookup tables
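The savings can be estimated with simple arithmetic. The task and node counts below are made-up illustration numbers; the real PySpark calls are `sc.broadcast(lookup)` on the driver and `bcast.value` inside tasks.

```python
import pickle

lookup = {i: i * i for i in range(10_000)}     # a large read-only lookup table
table_bytes = len(pickle.dumps(lookup))        # size of the serialized table
n_tasks, n_nodes = 200, 4                      # hypothetical job shape

# Without broadcast: the table rides inside every task's serialized closure.
naive_traffic = n_tasks * table_bytes
# With broadcast: the table is shipped once per worker node and reused.
broadcast_traffic = n_nodes * table_bytes
assert broadcast_traffic < naive_traffic
```

With 200 tasks on 4 nodes, broadcasting cuts the table's network traffic by a factor of 50, and the gap grows with the number of tasks.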
90. ACCUMULATORS
▸ Motivation:
▸ Normally, Spark ships each closure, including the variables it uses, separately with every task
▸ But values updated on worker nodes are not propagated back to the driver
▸ Normally, we could aggregate results back to the driver with reduce(),
▸ But sometimes we just need a simpler way to collect a few statistics
▸ Examples: counters
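The counter pattern can be sketched as follows. This is a conceptual simulation of per-partition tasks merging their counts on the driver, not Spark's implementation; the real PySpark calls are `sc.accumulator(0)` on the driver and `acc.add(1)` inside tasks.

```python
class Accumulator:
    """Simplified counter: workers only add, the driver reads the total."""
    def __init__(self, value=0):
        self.value = value
    def add(self, n):
        self.value += n

def run_task(partition):
    """Simulated task: count bad records in one partition locally."""
    local = Accumulator()
    for record in partition:
        if record is None:
            local.add(1)
    return local.value

partitions = [[1, None, 3], [None, None], [7]]
bad_records = Accumulator()
for count in map(run_task, partitions):   # per-task counts merge on the driver
    bad_records.add(count)
```

Tasks never read the accumulator, they only add to it; that write-only discipline is what lets Spark merge the per-task updates safely.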