Apache Spark 2.0: Faster, Easier, and Smarter

  1. Apache Spark 2.0: Faster, Easier, and Smarter
     Reynold Xin @rxin
     2016-05-05 Webinar
  2. About Databricks
     Founded by creators of Spark in 2013
     Cloud enterprise data platform
     - Managed Spark clusters
     - Interactive data science
     - Production pipelines
     - Data governance, security, …
  3. What is Apache Spark?
     Unified engine across data workloads and platforms:
     SQL, Streaming, ML, Graph, Batch, …
  4. A slide from 2013 …
  5. Spark 2.0
     Steps to bigger & better things…
     Builds on all we learned in the past 2 years
  6. Versioning in Spark
     In reality, we hate breaking APIs! Will not do so except for dependency conflicts (e.g. Guava) and experimental APIs
     Version scheme: 1.6.0
     - Major version (may change APIs)
     - Minor version (adds APIs / features)
     - Patch version (only bug fixes)
  7. Major Features in 2.0
     - Tungsten Phase 2: speedups of 5-20x
     - Structured Streaming
     - SQL 2003 & unifying Datasets and DataFrames
  8. API Foundation for the Future Dataset, DataFrame, SQL, ML
  9. Towards SQL 2003
     As of this week, Spark branch-2.0 can run all 99 TPC-DS queries!
     - New standard-compliant parser (with good error messages!)
     - Subqueries (correlated & uncorrelated)
     - Approximate aggregate stats
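A minimal sketch of the kinds of subqueries the new parser accepts, using Spark 2.0's SparkSession entry point (introduced on a later slide); the table and column names (store_sales, items, …) are illustrative, not a real dataset:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sql2003-sketch").getOrCreate()

    // Uncorrelated scalar subquery in a WHERE clause
    spark.sql("""
      SELECT ss_item_sk, ss_quantity
      FROM store_sales
      WHERE ss_quantity > (SELECT avg(ss_quantity) FROM store_sales)
    """).show()

    // Correlated subquery using EXISTS
    spark.sql("""
      SELECT i_item_id
      FROM items i
      WHERE EXISTS (SELECT 1 FROM store_sales s WHERE s.ss_item_sk = i.i_item_sk)
    """).show()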
  10. Datasets and DataFrames
     In 2015, we added DataFrames & Datasets as structured data APIs
     • DataFrames are collections of rows with a schema
     • Datasets add static types, e.g. Dataset[Person]
     • Both run on Tungsten
     Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
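A minimal sketch of the merged API in Scala, assuming a hypothetical people.json input and a Person case class defined only for this example:

    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    case class Person(name: String, age: Long)

    val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
    import spark.implicits._

    // In 2.0, DataFrame is simply an alias for Dataset[Row]
    val df: DataFrame = spark.read.json("people.json")   // untyped rows with a schema
    val people: Dataset[Person] = df.as[Person]           // the same data with static types

    people.filter(_.age > 21).show()                      // typed lambda, checked at compile time
    df.filter($"age" > 21).show()                         // untyped column expression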
  11. SparkSession – a new entry point
     SparkSession is the “SparkContext” for Dataset/DataFrame
     - Entry point for reading data
     - Working with metadata
     - Configuration
     - Cluster resource management
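A minimal sketch of the new entry point; the app name, config key, and input path are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("spark-2.0-entry-point")               // illustrative app name
      .config("spark.sql.shuffle.partitions", "8")    // configuration set at build time
      .getOrCreate()

    // Entry point for reading data
    val logs = spark.read.json("logs.json")           // hypothetical input

    // Working with metadata
    spark.catalog.listTables().show()

    // Configuration at runtime
    spark.conf.set("spark.sql.shuffle.partitions", "4")

    // The classic SparkContext is still reachable when lower-level APIs are needed
    val sc = spark.sparkContext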
  12. Notebook demo http://bit.ly/1SMPEzQ and http://bit.ly/1OeqdSn
  13. Long-Term
     RDD will remain the low-level API in Spark
     Datasets & DataFrames give richer semantics and optimizations
     • New libraries will increasingly use these as the interchange format
     • Examples: Structured Streaming, MLlib, GraphFrames
  14. Other notable API improvements
     DataFrame-based ML pipeline API becoming the main MLlib API
     ML model & pipeline persistence with almost complete coverage
     • In all programming languages: Scala, Java, Python, R
     Improved R support
     • (Parallelizable) user-defined functions in R
     • Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means
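A minimal sketch of pipeline persistence with the DataFrame-based API, assuming a hypothetical training DataFrame that already has "features" (vector) and "label" columns, and an illustrative save path:

    import org.apache.spark.ml.{Pipeline, PipelineModel}
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.StandardScaler
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ml-persistence-sketch").getOrCreate()
    val training = spark.read.parquet("training.parquet")   // hypothetical: features + label

    val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaled")
    val lr = new LogisticRegression().setFeaturesCol("scaled")
    val pipeline = new Pipeline().setStages(Array(scaler, lr))

    val model = pipeline.fit(training)

    // Save the fitted pipeline and load it back (also loadable from Python/Java/R)
    model.write.overwrite().save("/tmp/lr-pipeline-model")
    val restored = PipelineModel.load("/tmp/lr-pipeline-model")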
  15. Structured Streaming How do we simplify streaming?
  16. Background
     Real-time processing is vital for streaming analytics
     Apps need a combination: batch & interactive queries
     • Track state using a stream, then run SQL queries
     • Train an ML model offline, then update it
  17. Integration Example
     A streaming engine reads a stream of page-view events:
     (home.html, 10:08), (product.html, 10:09), (home.html, 10:10), …
     and writes per-minute visit counts to MySQL:
     Page      Minute   Visits
     home      10:09    21
     pricing   10:10    30
     ...       ...      ...
     What can go wrong?
     • Late events
     • Partial outputs to MySQL
     • State recovery on failure
     • Distributed reads/writes
     • ...
  18. Complex Programming Models
     Data: late arrival, varying distribution over time, …
     Processing: business logic change & new ops (windows, sessions)
     Output: how do we define output over time & correctness?
  19. The simplest way to perform streaming analytics is not having to reason about streaming.
  20. Single API!
     Spark 1.3: static DataFrames  →  Spark 2.0: infinite DataFrames
  21. Example: Batch Aggregation

     logs = ctx.read.format("json").open("s3://logs")

     logs.groupBy(logs.user_id).agg(sum(logs.time))
         .write.format("jdbc")
         .save("jdbc:mysql//...")
  22. Example: Continuous Aggregation

     logs = ctx.read.format("json").stream("s3://logs")

     logs.groupBy(logs.user_id).agg(sum(logs.time))
         .write.format("jdbc")
         .startStream("jdbc:mysql//...")
  23. Structured Streaming
     High-level streaming API built on the Spark SQL engine
     • Declarative API that extends DataFrames / Datasets
     • Event time, windowing, sessions, sources & sinks
     Supports interactive & batch queries
     • Aggregate data in a stream, then serve using JDBC
     • Change queries at runtime
     • Build and apply ML models
     Not just streaming, but “continuous applications”
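A minimal sketch of an event-time windowed aggregation, written against the readStream/writeStream form of the API that the 2.0 release ultimately exposed (the preceding slides show the earlier proposed spelling); the schema, input path, and console sink are illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("structured-streaming-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical page-view events carrying an event-time column "ts" and a "page" column
    val schema = new StructType()
      .add("ts", TimestampType)
      .add("page", StringType)

    val views = spark.readStream.schema(schema).json("s3://logs")   // infinite DataFrame

    // Visits per page per one-minute event-time window
    val counts = views
      .groupBy(window($"ts", "1 minute"), $"page")
      .count()

    // Continuously emit updated counts; a JDBC or file sink would replace "console" in practice
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()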
  24. Goal: end-to-end continuous applications
     Example pipeline: Kafka → ETL → database, feeding reporting applications, an ML model, and ad-hoc queries
     (traditional streaming covers the ETL step; the consumers are other processing types)
  25. Tungsten Phase 2 Can we speed up Spark by 10X?
  26. Demo http://bit.ly/1X8LKmH
  27. Going back to the fundamentals
     Difficult to get order-of-magnitude performance speedups with profiling techniques
     • For a 10x improvement, you would need to find top hotspots that add up to 90% of the runtime and make them instantaneous
     • For 100x, 99%
     Instead, look bottom-up: how fast should it run?
  28. Query plan: Scan → Filter → Project → Aggregate
     select count(*) from store_sales where ss_item_sk = 1000
  29. Volcano Iterator Model
     Standard for 30 years: almost all databases do it
     Each operator is an “iterator” that consumes records from its input operator

     class Filter {
       // Advance to the next row that passes the predicate, pulling from the child operator
       def next(): Boolean = {
         var found = false
         while (!found && child.next()) {
           found = predicate(child.fetch())
         }
         return found
       }
       // Return the current row produced by the child
       def fetch(): InternalRow = {
         child.fetch()
       }
       …
     }
  30. What if we hire a college freshman to implement this query in Java in 10 mins?
     select count(*) from store_sales where ss_item_sk = 1000

     var count = 0
     for (ss_item_sk in store_sales) {
       if (ss_item_sk == 1000) {
         count += 1
       }
     }
  31. Volcano model (30+ years of database research) vs. college freshman (hand-written code in 10 mins)
  32. High throughput
     Volcano: 13.95 million rows/sec
     College freshman: 125 million rows/sec
     Note: end-to-end, single thread, single column, and data originated in Parquet on disk
  33. How does a student beat 30 years of research?
     Volcano:
     1. Many virtual function calls
     2. Data in memory (or cache)
     3. No loop unrolling, SIMD, pipelining
     Hand-written code:
     1. No virtual function calls
     2. Data in CPU registers
     3. Compiler loop unrolling, SIMD, pipelining
     Take advantage of all the information that is known after query compilation
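A minimal sketch of how one might observe this effect in Spark 2.0: operators that Tungsten Phase 2 fuses into a single generated function are marked with a leading asterisk in the physical plan. The data here comes from spark.range rather than the slide's store_sales table, so it is only an illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("whole-stage-codegen-sketch").getOrCreate()

    // A query similar in spirit to the slide's filtered count(*)
    val q = spark.range(1000L * 1000 * 1000)
      .filter("id = 1000")
      .selectExpr("count(*)")

    // Fused operators (e.g. *Filter, *HashAggregate) carry a '*' prefix in the plan
    q.explain()

    q.show()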
  34. Tungsten Phase 2: Spark as a “Compiler”
     The plan Scan → Filter → Project → Aggregate collapses into generated code like:

     long count = 0;
     for (ss_item_sk in store_sales) {
       if (ss_item_sk == 1000) {
         count += 1;
       }
     }

     Functionality of a general-purpose execution engine; performance as if a system had been hand-built just to run your query
  35. Performance of Core Primitives
     Cost per row (single thread):

     primitive               Spark 1.6   Spark 2.0
     filter                  15 ns       1.1 ns
     sum w/o group           14 ns       0.9 ns
     sum w/ group            79 ns       10.7 ns
     hash join               115 ns      4.0 ns
     sort (8-bit entropy)    620 ns      5.3 ns
     sort (64-bit entropy)   620 ns      40 ns
     sort-merge join         750 ns      700 ns

     Intel Haswell i7 4960HQ 2.6GHz, HotSpot 1.8.0_60-b27, Mac OS X 10.11
  36. Chart: Preliminary TPC-DS, Spark 2.0 vs 1.6 – lower is better (runtime in seconds; one bar per query for 1.6 and for 2.0)
  37. Databricks Community Edition Best place to try & learn Spark.
  38. Release Schedule
     Today: work-in-progress source code available on GitHub
     Next week: preview of Spark 2.0 in Databricks Community Edition
     Early June: Apache Spark 2.0 GA
  39. Today’s talk
     Spark 2.0 doubles down on what made Spark attractive:
     • Faster: Project Tungsten Phase 2, i.e. “Spark as a compiler”
     • Easier: unified APIs & SQL 2003
     • Smarter: Structured Streaming
     • Only scratched the surface here, as Spark 2.0 will resolve ~2000 tickets
     Learn Spark on Databricks Community Edition
     • Join the beta waitlist: https://databricks.com/ce/
  40. Discount code: Meetup16SF
  41. Thank you. Don’t forget to register for Spark Summit SF!