Beneath RDD in Apache Spark by Jacek Laskowski

Spark Summit East Talk


  1. BENEATH RDD IN APACHE SPARK - USING SPARK-SHELL AND WEBUI / Jacek Laskowski / @jaceklaskowski / GitHub / Mastering Apache Spark Notes
  2. Jacek Laskowski is an independent consultant. Contact me at jacek@japila.pl or @JacekLaskowski. Delivering Development Services | Consulting | Training. Building and leading development teams. Mostly Apache Spark and Scala these days. Leader of Warsaw Scala Enthusiasts and Warsaw Apache Spark. Java Champion. Blogger at blog.jaceklaskowski.pl and jaceklaskowski.pl
  3. http://bit.ly/mastering-apache-spark
  4. http://bit.ly/mastering-apache-spark
  5. SPARKCONTEXT - THE LIVING SPACE FOR RDDS
  6. SPARKCONTEXT AND RDDS - An RDD belongs to one and only one Spark context. You cannot share RDDs between contexts. SparkContext tracks how many RDDs were created. You may see it in toString output.
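
A rough illustration of that bookkeeping, as a sketch for spark-shell (where sc is already in scope): each new RDD gets the next id from the SparkContext that created it, and that id is the bracketed number in the RDD's toString.

```scala
// Sketch: RDD ids are assigned by the owning SparkContext, so they grow
// as the context creates more RDDs (exact values depend on session history).
val first  = sc.parallelize(1 to 5)
val second = sc.parallelize(1 to 5)
println(first.id)        // e.g. 0
println(second.id)       // e.g. 1
println(second.toString) // e.g. ParallelCollectionRDD[1] ... - the bracketed number is the id
```
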
  7. SPARKCONTEXT AND RDDS (2)
  8. RDD - RESILIENT DISTRIBUTED DATASET
  9. CREATING RDD - SC.PARALLELIZE
     sc.parallelize(col, slices) to distribute a local collection of any elements.
       scala> val rdd = sc.parallelize(0 to 10)
       rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at ...
     Alternatively, sc.makeRDD(col, slices)
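
A short sketch of the slices parameter and the makeRDD alternative, assuming a running spark-shell:

```scala
// Ask for 4 partitions (slices) explicitly.
val fourSlices = sc.parallelize(0 to 10, 4)
println(fourSlices.partitions.length)  // 4

// sc.makeRDD accepts the same (collection, slices) arguments for a plain collection.
val viaMakeRDD = sc.makeRDD(0 to 10, 4)
```
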
  10. CREATING RDD - SC.RANGE
      sc.range(start, end, step, slices) to create an RDD of Long numbers.
        scala> val rdd = sc.range(0, 100)
        rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[14] at range at <console>: ...
  11. CREATING RDD - SC.TEXTFILE
      sc.textFile(name, partitions) to create an RDD of lines from a file.
        scala> val rdd = sc.textFile("README.md")
        rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[16] at textFile at ...
  12. CREATING RDD - SC.WHOLETEXTFILES
      sc.wholeTextFiles(name, partitions) to create an RDD of (file name, file content) pairs from a directory.
        scala> val rdd = sc.wholeTextFiles("tags")
        rdd: org.apache.spark.rdd.RDD[(String, String)] = tags MapPartitionsRDD[18] at wholeTextFiles at ...
  13. There are many more advanced functions in SparkContext to create RDDs.
  14. PARTITIONS (AND SLICES) - Did you notice the words slices and partitions as parameters? Partitions (aka slices) are the level of parallelism. We're going to talk about the level of parallelism later.
  15. CREATING RDD - DATAFRAMES - RDDs are so last year :-) Use DataFrames... early and often! A DataFrame is a higher-level abstraction over RDDs and semi-structured data. DataFrames require a SQLContext.
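
In a Spark 1.x spark-shell a SQLContext is already available as sqlContext; outside the shell you would build one yourself. A minimal sketch:

```scala
import org.apache.spark.sql.SQLContext

// Build a SQLContext on top of an existing SparkContext (spark-shell does this for you).
val sqlContext = new SQLContext(sc)
// The implicits bring in rdd.toDF, used on the next slides.
import sqlContext.implicits._
```
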
  16. FROM RDDS TO DATAFRAMES
        scala> val rdd = sc.parallelize(0 to 10)
        rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at ...
        scala> val df = rdd.toDF
        df: org.apache.spark.sql.DataFrame = [_1: int]
        scala> val df = rdd.toDF("numbers")
        df: org.apache.spark.sql.DataFrame = [numbers: int]
  17. ...AND VICE VERSA
        scala> val rdd = sc.parallelize(0 to 10)
        rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at ...
        scala> val df = rdd.toDF("numbers")
        df: org.apache.spark.sql.DataFrame = [numbers: int]
        scala> df.rdd
        res23: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[70] ...
  18. CREATING DATAFRAMES - SQLCONTEXT.CREATEDATAFRAME - sqlContext.createDataFrame(rowRDD, schema)
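
A hedged sketch of createDataFrame with an RDD of Rows and an explicit schema (the column name "number" is just an example):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// An RDD of Rows plus a matching schema gives a DataFrame.
val rowRDD = sc.parallelize(0 to 10).map(n => Row(n))
val schema = StructType(Seq(StructField("number", IntegerType, nullable = false)))
val df = sqlContext.createDataFrame(rowRDD, schema)
df.printSchema()
```
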
  19. CREATING DATAFRAMES - SQLCONTEXT.READ - sqlContext.read is the modern yet experimental way. sqlContext.read.format(f).load(path), where f is one of: jdbc, json, orc, parquet, text
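
For example, reading JSON with the generic reader ("people.json" is a made-up path):

```scala
// Generic form: pick a format, then load a path.
val people = sqlContext.read.format("json").load("people.json")
// Most formats also have a shortcut method that does the same thing.
val samePeople = sqlContext.read.json("people.json")
```
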
  20. EXECUTION ENVIRONMENT
  21. PARTITIONS AND LEVEL OF PARALLELISM - The number of partitions of an RDD is (roughly) the number of tasks. Partitions are the hint to size jobs. Tasks are the smallest unit of execution. Tasks belong to TaskSets. TaskSets belong to Stages. Stages belong to Jobs. Jobs, stages, and tasks are displayed in the web UI. We're going to talk about the web UI later.
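
A quick way to see that relationship in the shell (a sketch; the task count shows up in the web UI covered later):

```scala
// 8 partitions means the single stage of count() runs 8 tasks.
val rdd = sc.parallelize(0 to 100, 8)
println(rdd.partitions.length)  // 8
rdd.count()                     // one job, one stage, 8 tasks
```
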
  22. PARTITIONS AND LEVEL OF PARALLELISM (CONT'D)
      In local[*] mode, the number of partitions equals the number of cores (the default in spark-shell).
        scala> sc.defaultParallelism
        res0: Int = 8
        scala> sc.master
        res1: String = local[*]
      Not necessarily true when you use local or local[n] master URLs.
  23. LEVEL OF PARALLELISM IN SPARK CLUSTERS
      TaskScheduler controls the level of parallelism.
      DAGScheduler, TaskScheduler, and SchedulerBackend work in tandem.
      DAGScheduler manages a "DAG" of RDDs (aka RDD lineage).
      SchedulerBackends manage TaskSets.
  24. DAGSCHEDULER
  25. TASKSCHEDULER AND SCHEDULERBACKEND
  26. RDD LINEAGE - RDD lineage is a graph of RDD dependencies. Use toDebugString to know the lineage. Be careful with the hops - they introduce shuffle barriers. Why is the RDD lineage important? This is the R in RDD - resiliency. But deep lineage costs processing time, doesn't it? Persist (aka cache) it early and often!
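
A small sketch of persisting early so a long lineage is not recomputed on every action (README.md as in the earlier slides):

```scala
// cache() marks the RDD; the first action materializes it, later ones reuse it.
val words = sc.textFile("README.md").flatMap(_.split("\\s+")).cache()
words.count()                      // computes and caches the partitions
words.filter(_.nonEmpty).count()   // served from the cache
println(words.toDebugString)       // the lineage is still there for resiliency
```
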
  27. RDD LINEAGE - DEMO
      What does the following do?
        val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
  28. RDD LINEAGE - DEMO (CONT'D)
      How many stages are there?
        // val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
        scala> rdd.toDebugString
        res2: String =
        (2) ShuffledRDD[3] at groupBy at <console>:24 []
         +-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []
            |  MapPartitionsRDD[1] at map at <console>:24 []
            |  ParallelCollectionRDD[0] at parallelize at <console>:24 []
      Nothing happens yet - processing time-wise.
  29. SPARK CLUSTERS
      Spark supports the following clusters: one-JVM local cluster, Spark Standalone, Apache Mesos, Hadoop YARN.
      You use --master to select the cluster.
      spark://hostname:port is for Spark Standalone.
      And you know the local master URLs, don't you? local, local[n], or local[*]
  30. MANDATORY PROPERTIES OF SPARK APP - Your task: fill in the gaps below. Any Spark application must specify the application name (aka appName) and the master URL. Demo time! => spark-shell is a Spark app, too!
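
For reference, a minimal sketch of the two mandatory settings in code for a standalone app (the app name is hypothetical; spark-shell sets both for you, the master via --master):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Every Spark application needs an appName and a master URL.
val conf = new SparkConf()
  .setAppName("beneath-rdd-demo")   // hypothetical name
  .setMaster("local[*]")            // or e.g. spark://hostname:7077
val sc = new SparkContext(conf)
```
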
  31. SPARK STANDALONE CLUSTER
      The built-in Spark cluster.
      Start the standalone Master with sbin/start-master.sh. Use -h to control the host name to bind to.
      Start a standalone Worker with sbin/start-slave.sh. Run a single worker per machine (aka node).
      http://localhost:8080/ = web UI for the Standalone cluster. Don't confuse it with the web UI of a Spark application.
      Demo time! => Run the Standalone cluster.
  32. SPARK-SHELL - SPARK REPL APPLICATION
  33. SPARK-SHELL AND SPARK STANDALONE - You can connect to Spark Standalone using spark-shell through the --master command-line option. Demo time! => we've already started the Standalone cluster.
  34. WEBUI - WEB USER INTERFACE FOR SPARK APPLICATION
  35. WEBUI
      It is available under http://localhost:4040/.
      You can disable it using the spark.ui.enabled flag.
      All the events are captured by Spark listeners. You can register your own Spark listener.
      Demo time! => webUI in action with different master URLs.
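
A hedged sketch of registering your own listener (the listener API is a developer-level API; onJobEnd is just one of the available callbacks):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

// A listener that logs every finished job, much like the web UI's own listeners do.
class JobLogger extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished: ${jobEnd.jobResult}")
}

sc.addSparkListener(new JobLogger)   // register on the live SparkContext
sc.parallelize(1 to 10).count()      // triggers onJobEnd
```
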
  36. QUESTIONS? - Visit Jacek Laskowski's blog. - Follow @jaceklaskowski at twitter. - Use Jacek's projects at GitHub. - Read Mastering Apache Spark notes.
