Beneath RDD in Apache Spark by Jacek Laskowski
Spark Summit East Talk
1. BENEATH RDD IN APACHE SPARK
   USING SPARK-SHELL AND WEBUI
   Jacek Laskowski / @jaceklaskowski / GitHub / Mastering Apache Spark Notes
2. Jacek Laskowski is an independent consultant
   Contact me at jacek@japila.pl or @JacekLaskowski
   Delivering Development Services | Consulting | Training
   Building and leading development teams
   Mostly Apache Spark and Scala these days
   Leader of Warsaw Scala Enthusiasts and Warsaw Apache Spark
   Java Champion
   Blogger at blog.jaceklaskowski.pl and jaceklaskowski.pl
3. http://bit.ly/mastering-apache-spark
4. http://bit.ly/mastering-apache-spark
5. SPARKCONTEXT
   THE LIVING SPACE FOR RDDS
6. SPARKCONTEXT AND RDDS
   An RDD belongs to one and only one Spark context.
   You cannot share RDDs between contexts.
   SparkContext tracks how many RDDs were created.
   You may see it in toString output.
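A quick sketch of how to observe this in spark-shell: every RDD gets the next id from its SparkContext, and that id shows up in the RDD's toString (the exact ids below depend on what you have already run in your session):

    scala> val a = sc.parallelize(0 to 10)   // toString shows e.g. ParallelCollectionRDD[0]
    scala> val b = sc.parallelize(0 to 10)   // toString shows e.g. ParallelCollectionRDD[1]
    scala> b.id                              // the numeric id the SparkContext assigned to b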
7. SPARKCONTEXT AND RDDS (2)
8. RDD
   RESILIENT DISTRIBUTED DATASET
9. CREATING RDD - SC.PARALLELIZE
   sc.parallelize(col, slices) to distribute a local collection of any elements.
   scala> val rdd = sc.parallelize(0 to 10)
   rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at ...
   Alternatively, sc.makeRDD(col, slices)
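A one-line sketch of the sc.makeRDD alternative mentioned above; the collection and the slice count are arbitrary examples:

    scala> val rdd = sc.makeRDD(Seq(1, 2, 3), 2)   // 2 is the number of slices (partitions)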
10. CREATING RDD - SC.RANGE
    sc.range(start, end, step, slices) to create an RDD of long numbers.
    scala> val rdd = sc.range(0, 100)
    rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[14] at range at <console>:...
11. CREATING RDD - SC.TEXTFILE
    sc.textFile(name, partitions) to create an RDD of lines from a file.
    scala> val rdd = sc.textFile("README.md")
    rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[16] at textFile at ...
12. CREATING RDD - SC.WHOLETEXTFILES
    sc.wholeTextFiles(name, partitions) to create an RDD of pairs of a file name and its content from a directory.
    scala> val rdd = sc.wholeTextFiles("tags")
    rdd: org.apache.spark.rdd.RDD[(String, String)] = tags MapPartitionsRDD[18] at wholeTextFiles at ...
13. There are many more, more advanced functions in SparkContext to create RDDs.
14. PARTITIONS (AND SLICES)
    Did you notice the words slices and partitions as parameters?
    Partitions (aka slices) are the level of parallelism.
    We're going to talk about the level of parallelism later.
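A minimal sketch showing that the slices argument controls the number of partitions (the collection and the count are arbitrary):

    scala> val rdd = sc.parallelize(0 to 10, 4)   // ask for 4 slices
    scala> rdd.partitions.size                    // => 4, one partition per slice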
15. CREATING RDD - DATAFRAMES
    RDDs are so last year :-) Use DataFrames...early and often!
    A DataFrame is a higher-level abstraction over RDDs and semi-structured data.
    DataFrames require a SQLContext.
16. FROM RDDS TO DATAFRAMES
    scala> val rdd = sc.parallelize(0 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at ...
    scala> val df = rdd.toDF
    df: org.apache.spark.sql.DataFrame = [_1: int]
    scala> val df = rdd.toDF("numbers")
    df: org.apache.spark.sql.DataFrame = [numbers: int]
17. ...AND VICE VERSA
    scala> val rdd = sc.parallelize(0 to 10)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at ...
    scala> val df = rdd.toDF("numbers")
    df: org.apache.spark.sql.DataFrame = [numbers: int]
    scala> df.rdd
    res23: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[70] ...
18. CREATING DATAFRAMES - SQLCONTEXT.CREATEDATAFRAME
    sqlContext.createDataFrame(rowRDD, schema)
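A minimal sketch of the call above; the column names, types, and rows are made up for illustration:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}

    // an RDD of Rows plus an explicit schema describing the columns
    val rowRDD = sc.parallelize(Seq(Row(1, "one"), Row(2, "two")))
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)))

    val df = sqlContext.createDataFrame(rowRDD, schema)   // DataFrame with columns id and name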
19. CREATING DATAFRAMES - SQLCONTEXT.READ
    sqlContext.read is the modern yet experimental way.
    sqlContext.read.format(f).load(path), where f is one of:
    jdbc, json, orc, parquet, text
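A short sketch of the reader API above; json is one of the built-in formats listed, and the file path is hypothetical:

    // generic form: pick a format, then load a path
    val df = sqlContext.read.format("json").load("people.json")
    // equivalent shortcut for the built-in json format
    val df2 = sqlContext.read.json("people.json")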
20. EXECUTION ENVIRONMENT
21. PARTITIONS AND LEVEL OF PARALLELISM
    The number of partitions of an RDD is (roughly) the number of tasks.
    Partitions are the hint to size jobs.
    Tasks are the smallest unit of execution.
    Tasks belong to TaskSets. TaskSets belong to Stages. Stages belong to Jobs.
    Jobs, stages, and tasks are displayed in the web UI.
    We're going to talk about the web UI later.
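A small sketch of the partitions-to-tasks relationship; run it in spark-shell and check the Stages tab of the web UI (the numbers are illustrative):

    val rdd = sc.parallelize(1 to 100, 8)   // 8 partitions
    rdd.count()                             // one job, one stage, 8 tasks (one per partition)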
22. PARTITIONS AND LEVEL OF PARALLELISM CD.
    In local[*] mode, the number of partitions equals the number of cores (the default in spark-shell).
    scala> sc.defaultParallelism
    res0: Int = 8
    scala> sc.master
    res1: String = local[*]
    Not necessarily true when you use local or local[n] master URLs.
23. LEVEL OF PARALLELISM IN SPARK CLUSTERS
    TaskScheduler controls the level of parallelism.
    DAGScheduler, TaskScheduler, SchedulerBackend work in tandem.
    DAGScheduler manages a "DAG" of RDDs (aka RDD lineage).
    SchedulerBackends manage TaskSets.
24. DAGSCHEDULER
25. TASKSCHEDULER AND SCHEDULERBACKEND
26. RDD LINEAGE
    RDD lineage is a graph of RDD dependencies.
    Use toDebugString to know the lineage.
    Be careful with the hops - they introduce shuffle barriers.
    Why is the RDD lineage important? This is the R in RDD - resiliency.
    But deep lineage costs processing time, doesn't it? Persist (aka cache) it early and often!
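A sketch of toDebugString and cache working together; the RDD is arbitrary, and the cached partitions should be flagged in the lineage output once an action has materialized them:

    val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
    rdd.cache()                   // mark the RDD for persistence (honored after the first action)
    rdd.count()                   // materialize it so the partitions actually get cached
    println(rdd.toDebugString)    // the lineage, with shuffle hops shown as indentation changes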
27. RDD LINEAGE - DEMO
    What does the following do?
    val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
28. RDD LINEAGE - DEMO CD.
    How many stages are there?
    // val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
    scala> rdd.toDebugString
    res2: String =
    (2) ShuffledRDD[3] at groupBy at <console>:24 []
     +-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []
        |  MapPartitionsRDD[1] at map at <console>:24 []
        |  ParallelCollectionRDD[0] at parallelize at <console>:24 []
    Nothing happens yet - processing time-wise.
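To actually see the two stages (the shuffle introduced by groupBy splits the job into a map-side stage and a result stage), run an action and open the web UI mentioned later in the deck:

    rdd.count()   // triggers one job with 2 stages
    // then browse http://localhost:4040/ while the spark-shell session is still running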
29. SPARK CLUSTERS
    Spark supports the following clusters:
    one-JVM local cluster
    Spark Standalone
    Apache Mesos
    Hadoop YARN
    You use --master to select the cluster.
    spark://hostname:port is for Spark Standalone.
    And you know the local master URLs, don't you? local, local[n], or local[*]
30. MANDATORY PROPERTIES OF SPARK APP
    Your task: Fill in the gaps below.
    Any Spark application must specify application name (aka appName) and master URL.
    Demo time! => spark-shell is a Spark app, too!
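A minimal sketch of a standalone Spark application setting the two mandatory properties; the application name is a placeholder and the master URL is just an example (spark-shell sets both for you):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("beneath-rdd-demo")   // hypothetical application name
      .setMaster("local[*]")            // or e.g. spark://hostname:7077 for Spark Standalone
    val sc = new SparkContext(conf)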
31. SPARK STANDALONE CLUSTER
    The built-in Spark cluster.
    Start the standalone Master with sbin/start-master. Use -h to control the host name to bind to.
    Start a standalone Worker with sbin/start-slave.
    Run a single worker per machine (aka node).
    http://localhost:8080/ = web UI for the Standalone cluster. Don't confuse it with the web UI of the Spark application.
    Demo time! => Run Standalone cluster
32. SPARK-SHELL
    SPARK REPL APPLICATION
33. SPARK-SHELL AND SPARK STANDALONE
    You can connect to Spark Standalone using spark-shell through the --master command-line option.
    Demo time! => we've already started the Standalone cluster.
34. WEBUI
    WEB USER INTERFACE FOR SPARK APPLICATION
35. WEBUI
    It is available under http://localhost:4040/.
    You can disable it using the spark.ui.enabled flag.
    All the events are captured by Spark listeners.
    You can register your own Spark listener.
    Demo time! => webUI in action with different master URLs
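A sketch of registering your own Spark listener, as mentioned above; this hypothetical listener only prints finished jobs, and onJobEnd is just one of the callbacks you can override:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

    // a tiny listener that logs every completed job
    class JobEndLogger extends SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"Job ${jobEnd.jobId} finished with ${jobEnd.jobResult}")
    }

    sc.addSparkListener(new JobEndLogger)   // register with the active SparkContext
    sc.parallelize(1 to 10).count()         // runs a job; the listener prints its result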
36. QUESTIONS?
    - Visit Jacek Laskowski's blog
    - Follow @jaceklaskowski at twitter
    - Use Jacek's projects at GitHub
    - Read Mastering Apache Spark notes