Spark core

1.
SPARK CORE SHIJIE ZHANG

2.
OUTLINE ▸ Intro ▸ Standalone-viewRDD concepts ▸ Basic RDD ▸ Key-Pair RDD ▸ Distributed-view RDD concepts ▸ Lazy evaluation ▸ Cache ▸ Shufﬂe/Partition ▸ Broadcast ▸ Accumulator ▸ Conclusion

3.
INTRO - LEGACYNUMBERS

4.
INTRO ▸ To geta better intuition about differences of these numbers, humanize these durations ▸ Then, we can map each latency number to a human activity

5.
INTRO

6.
INTRO Humanized numbers

7.
INTRO

8.
INTRO Distributed data parallelprogramming, two more worries ▸ Partial failure: a subset of machine crash ▸ Latency: certain operations have higher latency RUN ON: JVM IMPLEMENTED BY: SCALA SDKS: PYTHON / SCALA / JAVA OTHER LIBRARIES: R

9.
INTRO Spark components Spark ▸ Adistributed data parallel programming framework ▸ Implements a data parallel model called RDD RDD Core

10.
INTRO WHAT’S RDD ▸ Justlooks like Scala collections (but distributed across machines)

11.
INTRO WHAT’S RDD ▸ Whilesignatures differ a bit, their semantics are the same

12.
INTRO WHAT’S RDD ▸ ScientiﬁcDef: An interface containing ﬁve properties ▸ Set of partitions ▸ List of dependencies ▸ Function to compute a partition given its parents ▸ (Optional) Partition method ▸ (Optional) Preferred locations

13.
INTRO Spark Core components- code base (2012) RDD

14.
OUTLINE ▸ Intro ▸ Standalone-viewRDD concepts ▸ Basic RDD ▸ Key-Pair RDD ▸ Distributed-view RDD concepts ▸ Lazy evaluation ▸ Cache ▸ Partition/Shufﬂe ▸ Broadcast ▸ Accumulator ▸ Conclusion Operators Scheduler Block manager Networking Broadcast Accumulator

15.

16.
INTRO Execution of Sparkprogram ▸ Execution mode: Standalone / Cluster Yarn / Mesos Your program

17.

18.
STANDALONE-VIEW RDD transformations another RDD actions result values creation Note:actions/transformations will not modify input RDD, but will always create a new one

19.
STANDALONE-VIEW RDD CREATION ▸ Loadingan external dataset ▸ Parallelizing a collection

20.
STANDALONE-VIEW TRANSFORMATIONS

21.
STANDALONE-VIEW TRANSFORMATIONS

22.
STANDALONE-VIEW ACTIONS

23.
transformations another RDD actions single values

24.
PYTHON EXAMPLE -COUNT RATING NUM FOR EACH STAR https://www.udemy.com/taming-big-data-with-apache-spark-hands-on/learn/v4/overview For more detailed step, see:

25.
PYTHON EXAMPLE -COUNT RATING NUM FOR EACH STAR

26.

27.

28.

29.

30.

31.

32.
KEY-PAIR RDD ▸ Def:RDDs containing key/value pairs ▸ Motivation: allows to act on each key

33.
KEY-PAIR RDD CREATION ▸Create from RDDs by transforming to tuples ▸ Load external data

34.
KEY-PAIR RDD TRANSFORMATIONS ▸Key-Pair RDDs have all transformations available to standard RDDs

35.
KEY-PAIR RDD TRANSFORMATIONS ▸Key-Pair RDDs have all transformations available to standard RDDs

36.
KEY-PAIR RDD ACTIONS

37.
OUTLINE ▸ Intro ▸ Standalone-viewRDD concepts ▸ Basic RDD ▸ Key-Pair RDD ▸ Distributed-view RDD concepts ▸ Lazy evaluation ▸ Cache ▸ Partition/Shufﬂe ▸ Broadcast ▸ Accumulator ▸ Conclusion Operators Scheduler Networking Broadcast Accumulator Block manager

38.

39.
LAZY-EVALUATION ▸ Def: ▸ Transformationsare lazy. Their result RDD is not immediately computed. ▸ Actions are eager. Their result is immediately computed. ▸ Example:

40.

41.

42.
LAZY-EVALUATION ▸ Why: ▸ Enablecomputation optimization

43.
LAZY-EVALUATION ▸ Why: ▸ Limitnetwork communication ▸ If you perform an action, its result returns back to the driver program RDD results return back RDD transformations and actions

44.
LAZY-EVALUATION ▸ What doesit do when loading/transformation are performed: ▸ Record RDD dependencies with a directed acyclic graph (DAG) ▸ For failure recovery ▸ Pipeline operations execution order ▸ Optimize by utilizing network trafﬁc based on partition and shufﬂing

45.
LAZY EVALUATION Life timeof a job in Spark

46.

47.
CACHE ▸ Def ▸ Bydefault, RDDs are recomputed each time you run an action on them

48.
CACHE ▸ Cache options

49.
CACHE ▸ Why ▸ Avoidnetwork IO and recomputation ▸ How ▸ Implemented with Least Recently Used (LRU) cache policy ▸ Users do not have to worry about caching too much data

50.
OUTLINE ▸ Intro ▸ Standalone-viewRDD concepts ▸ Basic RDD ▸ Key-Pair RDD ▸ Distributed-view RDD concepts ▸ Lazy evaluation ▸ Cache ▸ Shufﬂe/Partition ▸ Broadcast ▸ Accumulator ▸ Conclusion Operators Scheduler Networking Broadcast Accumulator Block manager

51.

52.
SHUFFLE

53.
SHUFFLE

54.
SHUFFLE

55.
SHUFFLE

56.
SHUFFLE

57.
SHUFFLE

58.
SHUFFLE

59.
REMINDER: LATENCY MATTERS(HUMANIZED)

60.
SHUFFLE ▸ Can wedo a better job

61.
SHUFFLE

62.
SHUFFLE

63.
SHUFFLE

64.
SHUFFLE

65.
SHUFFLE

66.

67.
PARTITION

68.
HASH PARTITION

69.
HASH PARTITION

70.
RANGE PARTITION

71.
RANGE PARTITION

72.
RANGE PARTITION

73.
RANGE PARTITION

74.
RANGE PARTITION

75.
RANGE PARTITION

76.
RANGE PARTITION

77.
PARTITIONING DATA:

78.
PARTITIONING DATA:

79.
PARTITIONING DATA:

80.
PARTITIONING DATA:

81.
OPTIMIZING PREVIOUS EXAMPLEUSING RANGE PARTITIONING

82.
HOW DO IKNOW A SHUFFLE WILL OCCUR?

83.
HOW DO IKNOW A SHUFFLE WILL OCCUR?

84.
HOW DO IKNOW A SHUFFLE WILL OCCUR? ▸ Operations that might cause a shufﬂe

85.

86.
SHARED VARIABLES ▸ Startwith an example

87.
BROADCAST VARIABLES ▸ Motivation: ▸Normally, Spark closures, including variables they use, are sent separately within each task. ▸ But in same cases, the same variable will be used in multiple parallel operations ▸ By default, Spark task launching mechanism is optimized for small task sizes. ▸ But in some cases, a large read-only variable needs to be shared across tasks, or across operations ▸ Examples: large lookup tables

88.

89.
BROADCAST VARIABLES

90.
ACCUMULATORS ▸ Motivation: ▸ Normally,Spark closures, including variables they use, are sent separately within each task. ▸ But values inside worker nodes are not propagated back to the driver ▸ Normally, we could get the entire RDD back with reduce(), ▸ But sometimes, we just need a simpler way ▸ Examples: counters

91.

92.
CONCLUSION ▸ Spark core ▸basic Spark RDDs ▸ Things to take care when programming in clusters ▸ What’s next? ▸ Spark SQL/Hive

93.
REFERENCES ▸ Udemy hands-onecourse ▸ https://www.udemy.com/taming-big-data-with-apache-spark-hands-on/ learn/v4/overview ▸ Several slides: ▸ http://heather.miller.am/teaching/cs212/slides ▸ http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia- amp-camp-2012-advanced-spark.pdf ▸ https://databricks-training.s3.amazonaws.com/slides/advanced-spark- training.pdf ▸ <<Learning Spark>>

Spark core

More Related Content

What's hot

Viewers also liked

Similar to Spark core

Recently uploaded

Spark core