
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules Damji

Of all the developers' delights, none is more attractive than a set of APIs that make developers productive, that are easy to use, and that are intuitive and expressive. Apache Spark offers these APIs across components such as Spark SQL, Streaming, Machine Learning, and Graph Processing to operate on large data sets in languages such as Scala, Java, Python, and R for doing distributed big data processing at scale. In this talk, I will explore the evolution of the three sets of APIs available in Apache Spark 2.x: RDDs, DataFrames, and Datasets. In particular, I will emphasize three takeaways: 1) why and when you should use each API, as a matter of best practice; 2) the performance and optimization benefits of each; and 3) the scenarios in which to use DataFrames and Datasets instead of RDDs for your distributed big data processing. Through simple notebook demonstrations with API code examples, you'll learn how to process big data using RDDs, DataFrames, and Datasets and how to interoperate among them. (This is a vocalization of the blog post, along with the latest developments in the Apache Spark 2.x DataFrame/Dataset and Spark SQL APIs: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)


  1. 1. A Tale of Three Apache Spark APIs: RDDs, DataFrames & Datasets Jules S. Damji Spark Summit EU, Dublin 2017 @2twitme
  2. 2. Spark Community Evangelist & Developer Advocate @ Databricks Developer Advocate @ Hortonworks Software engineering @: Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest
  3. 3. Agenda… Why are we here today? • Resilient Distributed Datasets (RDDs) • Structure in Spark • DataFrames and Datasets • Demo • Q & A
  4. 4. Not the Tale of Three Kings..
  5. 5. Resilient Distributed Dataset (RDD)
  6. 6. What are RDDs?
  7. 7. A Resilient Distributed Dataset (RDD) 1. Distributed Data Abstraction: a logical model across distributed storage such as S3 or HDFS
  8. 8. 2. Resilient & Immutable: RDD → T → RDD → T → RDD (T = Transformation); each transformation produces a new, immutable RDD
  9. 9. 3. Compile-time Type-safe: Integer RDD, String or Text RDD, Double or Binary RDD
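A minimal sketch of that type safety, assuming a live SparkContext named sc (not from the slides): the element type is part of the RDD's type, so misuse is caught at compile time.

      import org.apache.spark.rdd.RDD
      val intRDD: RDD[Int]     = sc.parallelize(Seq(1, 2, 3))
      val textRDD: RDD[String] = sc.parallelize(Seq("a", "b"))
      val doubled = intRDD.map(_ * 2)   // still an RDD[Int]
      // intRDD.map(_.toUpperCase)      // would not compile: Int has no toUpperCase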
  10. 10. 4. Unstructured/Structured Data: Text (logs, tweets, articles, social)
  11. 11. Structured Tabular data..
  12. 12. 5. Lazy: RDD → T → RDD → T → RDD → A (T = Transformation, A = Action); transformations build up lineage, and nothing executes until an action is invoked
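A minimal sketch of that laziness, assuming a live SparkContext named sc (the file path is illustrative): transformations only record lineage, and the distributed job runs when an action is called.

      val lines   = sc.textFile("data.txt")       // transformation: nothing read yet
      val words   = lines.flatMap(_.split(" "))   // transformation: still lazy
      val lengths = words.map(_.length)           // transformation: still lazy
      val total   = lengths.reduce(_ + _)         // action: the job actually runs here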
  13. 13. Why Use RDDs? • Offer control & flexibility • Low-level API • Type-safe • Encourage the how-to
  14. 14. Some code to read Wikipedia
      val rdd = sc.textFile("/mnt/wikipediapagecounts.gz")
      val parsedRDD = rdd.flatMap { line =>
        line.split("""\s+""") match {
          case Array(project, page, numRequests, _) => Some((project, page, numRequests))
          case _ => None
        }
      }
      // filter only English pages; count pages and requests to them
      parsedRDD.filter { case (project, page, numRequests) => project == "en" }.
        map { case (_, page, numRequests) => (page, numRequests) }.
        reduceByKey(_ + _).
        take(100).
        foreach { case (page, requests) => println(s"$page: $requests") }
  15. 15. When to Use RDDs? • You want a low-level API & control of the dataset • You are dealing with unstructured data (media streams or text) • You prefer manipulating data with lambda functions rather than a DSL • You don't care about the schema or structure of the data • You are willing to sacrifice optimization and performance, and accept inefficiencies
  16. 16. What’s the Problem?
  17. 17. What's the problem? • RDDs express the how-to of a solution, not the what-to • Not optimized by Spark • Slow for non-JVM languages like Python • Prone to inadvertent inefficiencies
  18. 18. Inadvertent inefficiencies in RDDs
      parsedRDD.filter { case (project, page, numRequests) => project == "en" }.
        map { case (_, page, numRequests) => (page, numRequests) }.
        reduceByKey(_ + _).
        filter { case (page, _) => ! isSpecialPage(page) }.
        take(100).
        foreach { case (page, requests) => println(s"$page: $requests") }
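A hedged sketch of the hand-fixed version this slide implies (not from the deck): because RDD lambdas are opaque, Spark cannot reorder the work for you, so the developer must move the cheap isSpecialPage check ahead of the expensive reduceByKey shuffle (isSpecialPage is the same hypothetical helper used on the slide).

      parsedRDD.filter { case (project, page, _) => project == "en" && !isSpecialPage(page) }.
        map { case (_, page, numRequests) => (page, numRequests) }.
        reduceByKey(_ + _).
        take(100).
        foreach { case (page, requests) => println(s"$page: $requests") }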
  19. 19. Structure in Spark: DataFrames & Datasets APIs
  20. 20. Background: What is in an RDD? • Dependencies • Partitions (with optional locality info) • Compute function: Partition => Iterator[T] • Opaque computation & opaque data
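A small sketch of how those pieces surface on any RDD, assuming a live SparkContext named sc (not from the deck):

      val rdd = sc.textFile("data.txt").map(_.length)
      println(rdd.getNumPartitions)   // its partitions
      println(rdd.dependencies)       // its dependencies on parent RDDs
      println(rdd.toDebugString)      // the full lineage, still opaque to Spark's optimizer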
  21. 21. Structured APIs in Spark: when are errors caught?
      SQL:        syntax errors at runtime,      analysis errors at runtime
      DataFrames: syntax errors at compile time, analysis errors at runtime
      Datasets:   syntax errors at compile time, analysis errors at compile time
      Analysis errors are reported before a distributed job starts.
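A hedged sketch of that difference (the column typo and the people.json path are illustrative, not from the deck):

      import spark.implicits._
      val df = spark.read.json("people.json")
      // df.select("nmae")          // compiles, but fails at runtime with an AnalysisException
      case class Person(name: String, age: Long)
      val ds = df.as[Person]
      // ds.map(p => p.nmae)        // does not compile: the typo is caught at compile time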
  22. 22. Unification of APIs in Spark 2.0
  23. 23. DataFrame API code
      // convert RDD -> DF with column names
      val df = parsedRDD.toDF("project", "page", "numRequests")
      // filter, groupBy, then agg(sum)
      df.filter($"project" === "en").
        groupBy($"page").
        agg(sum($"numRequests").as("count")).
        limit(100).
        show(100)
      (slide shows a sample result table with columns project, page, numRequests)
  24. 24. Take DataFrame → SQL Table → Query
      df.createOrReplaceTempView("edits")
      val results = spark.sql("""SELECT page, sum(numRequests) AS count FROM edits
                                 WHERE project = 'en' GROUP BY page LIMIT 100""")
      results.show(100)
      (slide shows a sample result table with columns project, page, numRequests)
  25. 25. Easy to write code... Believe it!
      from pyspark.sql.functions import avg
      dataRDD = sc.parallelize([("Jim", 20), ("Anne", 31), ("Jim", 30)])
      dataDF = dataRDD.toDF(["name", "age"])
      # Using RDD code to compute the average age per name
      (dataRDD.map(lambda kv: (kv[0], (kv[1], 1)))
              .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
              .map(lambda kv: (kv[0], kv[1][0] / kv[1][1])))
      # Using the DataFrame API
      dataDF.groupBy("name").agg(avg("age"))
      (slide shows the sample name/age rows)
  26. 26. Why structure APIs? The same average-age-by-department query, three ways:
      // RDD
      data.map { case (dept, age) => dept -> (age, 1) }
        .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
        .map { case (dept, (age, c)) => dept -> age / c }
      // DataFrame
      data.groupBy("dept").avg("age")
      -- SQL
      select dept, avg(age) from data group by 1
  27. 27. Using Catalyst in Spark SQL
      SQL AST / DataFrame / Dataset → Unresolved Logical Plan → (Analysis, using the Catalog) → Logical Plan →
      (Logical Optimization) → Optimized Logical Plan → (Physical Planning) → Physical Plans →
      (Cost Model) → Selected Physical Plan → (Code Generation) → RDDs
      Analysis: analyzing a logical plan to resolve references
      Logical Optimization: optimizing the logical plan
      Physical Planning: generating physical plans and choosing one via the cost model
      Code Generation: compiling parts of the query to Java bytecode
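A quick, hedged way to watch Catalyst do this (not from the deck) is to ask any DataFrame or Dataset for its plans:

      import spark.implicits._
      val df = spark.range(1000).filter($"id" % 2 === 0).groupBy().count()
      df.explain(true)   // prints the parsed, analyzed, and optimized logical plans plus the physical plan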
  28. 28. Physical Plan with Predicate Pushdown and Column Pruning (DataFrame optimization)
      users.join(events, users("id") === events("uid"))
           .filter(events("date") > "2015-01-01")
      Logical plan: scan the users table and events file, join, then filter. Optimized physical plan:
      the date predicate is pushed down into an optimized scan of events and only the needed columns
      are read, so far less data flows into the join.
  29. 29. Dataset API in Spark 2.x: type-safe, operate on domain objects with compiled lambda functions
      val df = spark.read.json("people.json")
      // Convert data to domain objects
      case class Person(name: String, age: Int)
      val ds: Dataset[Person] = df.as[Person]
      val filterDS = ds.filter(p => p.age > 3)
  30. 30. DataFrames are Faster than RDDs
  31. 31. Datasets Use Less Memory than RDDs
  32. 32. Datasets Faster…
  33. 33. Why & When to Use DataFrames & Datasets • Structured data with a schema • Code optimization & performance • Space efficiency with Tungsten • High-level APIs and DSL • Strong type-safety • Ease-of-use & readability • Express the what-to-do
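A small sketch of the interoperation among the three APIs, assuming an active SparkSession named spark (the Person class is illustrative, not from the deck):

      import spark.implicits._
      case class Person(name: String, age: Long)
      val ds  = Seq(Person("Jim", 20), Person("Anne", 31)).toDS()  // typed Dataset[Person]
      val df  = ds.toDF()                                          // untyped DataFrame (Dataset[Row])
      val ds2 = df.as[Person]                                      // back to a typed Dataset
      val rdd = ds.rdd                                             // drop down to an RDD[Person] when needed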
  34. 34. Foundational Spark 2.x components: Spark Core (RDD) at the base, Catalyst and the DataFrame/Dataset and SQL APIs on top, and above them Spark SQL, ML Pipelines, Structured Streaming, GraphFrames, TensorFrames, and DL Pipelines, with the DataSource API connecting to { JSON }, JDBC, and more
  35. 35. Putting It All Together: Conclusion (Source: Michael Malak)
  36. 36. Demo
  37. 37. http://dbricks.co/2sK35XT
  38. 38. https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html http://scala-phase.org/talks/rdds-dataframes-datasets-2016-06-16/#/ + Resources
  39. 39. Resources • Getting Started Guide with Apache Spark on Databricks • docs.databricks.com • Spark Programming Guide • https://databricks.com/blog/2016/01/04/introducing-apache-spark- datasets.html • https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark- apis-rdds-dataframes-and-datasets.html • https://github.com/bmc/rdds-dataframes-datasets-presentation-2016 • Databricks Engineering Blogs
  40. 40. Thank you! Do you have any questions for my prepared answers?
  41. 41. About Databricks. TEAM: started the Spark project (now Apache Spark) at UC Berkeley in 2009. PRODUCT: Unified Analytics Platform. MISSION: Making Big Data Simple.
  42. 42. The Unified Analytics Platform
