Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Jump Start with Apache Spark 2.0 on Databricks

659 views

Published on

Apache Spark 2.x has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.

In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What’s new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs

Published in: Software
  • Be the first to comment

Jump Start with Apache Spark 2.0 on Databricks

  1. 1. Jump Start with Apache® Spark™ 2.x on Databricks Jules S. Damji Spark Community Evangelist Big Data Trunk Meetup, Fremont 12/17/2016 @2twitme
  2. 2. I have used Apache Spark Before…
  3. 3. I know the difference between DataFrame and RDDs…
  4. 4. $ whoami Spark Community Evangelist @ Databricks Developer Advocate @ Hortonworks Software engineering @: Sun Microsystems, Netscape, @Home, VeriSign, Scalix, Centrify, LoudCloud/Opsware, ProQuest https://www.linkedin.com/in/dmatrix @2twitme
  5. 5. Agenda for the next 3+ hours • Get to know Databricks • Overview of Spark Fundamentals& Architecture • What’s New in Spark 2.0 • UnifiedAPIs: SparkSessions, SQL, DataFrames, Datasets… • Workshop Notebook1 • Lunch Hour 1.5 • Introduction to DataFrames, DataSets and Spark SQL • Workshop Notebook2 • Break • Introduction to Structured StreamingConcepts • Workshop Notebook3 • Go Home… Hour 1.5
  6. 6. Get to know Databricks • Get Databricks communityedition http://databricks.com/try-databricks
  7. 7. We are Databricks, the company behind Apache Spark Founded by the creators of Apache Spark in 2013 Share of Spark code contributed by Databricks in 2014 75% 7 Data Value Created Databricks on top of Spark to make big data simple. The Best Place to Run Apache Spark
  8. 8. MapReduce Generalbatch processing Pregel Dremel Mahout Drill Giraph ImpalaStorm S4 . . . Specialized systems for newworkloads Why Spark? Big Data Systems of Yesterday… Hard to manage, tune, deployHard to combine in pipelines
  9. 9. MapReduce Generalbatch processing Unified engine Why Spark? Big Data Systems Yesterday.. ? Pregel Dremel Drill Giraph Impala Storm Mahout. . . Specialized systems for newworkloads
  10. 10. An Analogy Specialized devices Unified device New applications
  11. 11. Unified engine across diverse workloads& environments
  12. 12. Unified engineacross diverse workloads & environments
  13. 13. Apache Spark Fundamentals & Architecture
  14. 14. A Resilient Distributed Dataset (RDD)
  15. 15. 2 kinds of Actions collect, count, reduce, take, show..saveAsTextFile, (HDFS, S3, SQL, NoSQL, etc.)
  16. 16. Apache Spark Architecture Deployments Modes • Local • Standalone • YARN • Mesos
  17. 17. Driver + Executor Driver + Executor Container EC2 Machine Student-1 Notebook Student-2 Notebook Container JVM JVM
  18. 18. 30 GB Container 30 GB Container 22 GB JVM 22 GB JVM S S S S S S S S Ex. Ex. 30 GB Container 30 GB Container 22 GB JVM 22 GB JVM S S S S Dr Ex. ... ... Standalone Mode:
  19. 19. Apache Spark Architecture An Anatomy ofan Application Spark Application • Jobs • Stages • Tasks
  20. 20. S S Container S* * * * * * * * JVM T * * DF/RDD
  21. 21. How did we Get Here..? Where we Going..?
  22. 22. A Brief History 29 2012 Started @ UC Berkeley 2010 2013 Databricks started & donated to ASF 2014 Spark 1.0 & libraries (SQL, ML,GraphX) 2015 DataFrames/Datasets Tungsten ML Pipelines 2016 Apache Spark 2.0 Easier Smarter Faster
  23. 23. Apache Spark 2.0 • Steps to Bigger& Better Things…. Builds on all we learned in past 2 years
  24. 24. Major Themes in Apache Spark 2.0 TungstenPhase 2 speedupsof 5-10x & Catalyst Optimizer Faster StructuredStreaming real-time engine on SQL / DataFrames Smarter Unifying Datasets and DataFrames & SparkSessions Easier
  25. 25. Unified API Foundation for the Future: Spark Sessions, DataFrame Dataset, MLlib, Structured Streaming…
  26. 26. SparkSession – A Unified entry point to Spark • Conduit to Spark – Creates Datasets/DataFrames – Reads/writes data – Works with metadata – Sets/gets Spark Configuration – Driver uses for Cluster resource management
  27. 27. SparkSession vs SparkContext SparkSessions Subsumes • SparkContext • SQLContext • HiveContext • StreamingContext • SparkConf
  28. 28. SparkSession – A Unified entry point to Spark
  29. 29. Datasets and DataFrames
  30. 30. Impose Structure
  31. 31. Long Term • RDD as the low-levelAPI in Spark • For controland certain type-safety inJava/Scala • Datasets & DataFrames give richer semantics& optimizations • For semi-structureddataand DSL like operations • New libraries will increasingly use these as interchange format • Examples: Structured Streaming,MLlib, GraphFrames
  32. 32. Spark 1.6 vs Spark 2.x
  33. 33. Spark 1.6 vs Spark 2.x
  34. 34. Towards SQL 2003 • Today, Spark can run all 99 TPC-DS queries! - New standard compliant parser(with good error messages!) - Subqueries(correlated& uncorrelated) - Approximate aggregatestats - https://databricks.com/blog/2016/07/26/introducing-apache- spark-2-0.html - https://databricks.com/blog/2016/06/17/sql-subqueries-in- apache-spark-2-0.html
  35. 35. 0 100 200 300 400 500 600 Runtime(seconds) Preliminary TPC-DS Spark2.0 vs 1.6 – Lower is Better Time (1.6) Time (2.0)
  36. 36. Other notable API improvements • DataFrame-based ML pipeline API becomingthe main MLlib API • ML model & pipeline persistence with almost complete coverage • In all programming languages: Scala, Java, Python, R • ImprovedR support • (Parallelizable) User-defined functions in R • Generalized Linear Models (GLMs), Naïve Bayes, Survival Regression, K-Means
  37. 37. Workshop: Notebook on SparkSession • Import Notebook into your Spark 2.0 Cluster – http://dbricks.co/sswksh1 – http://docs.databricks.com – http://spark.apache.org/docs/latest/api/scala/index.html#org.a pache.spark.sql.SparkSession • Familiarize your self with Databricks Notebook environment • Work through each cell • CNTR +<return> / Shift +Return • Try challenges • Break…
  38. 38. DataFrames/Datasets & Spark SQL & Catalyst Optimizer
  39. 39. The not so secret truth… SQL is not about SQL is about more thanSQL
  40. 40. Spark SQL: The whole story 10 Is About Creating and Running Spark Programs Faster: •  Write less code •  Read less data •  Do less work • optimizerdo the hard work
  41. 41. Spark SQL Architecture Logical Plan Physical Plan Catalog Optimizer RDDs … Data Source API SQL DataFrames Code Generator Datasets
  42. 42. 52 Using Catalyst in Spark SQL Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning CostModel Physical Plans Code Generation Catalog Analysis: analyzinga logicalplan to resolve references Logical Optimization: logicalplan optimization Physical Planning: Physical planning Code Generation:Compileparts of the query to Java bytecode SQL AST DataFrame Datasets
  43. 43. Catalyst Optimizations Logical Optimizations Create Physical Plan & generate JVM bytecode • Push filter predicates down to data source, so irrelevant data can be skipped • Parquet: skip entire blocks, turn comparisons on strings into cheaper integer comparisons via dictionary encoding • RDBMS: reduce amount of data traffic by pushing predicates down • Catalyst compiles operations into physical plans for execution and generates JVM bytecode • Intelligently choose betweenbroadcast joins and shuffle joins to reduce network traffic • Lower level optimizations: eliminate expensive object allocations and reduce virtual function calls
  44. 44. # Load partitioned Hive table def add_demographics(events): u = sqlCtx.table(" users") events . jo in ( u , events.user_id == u.user_id) .withColumn("c ity" , zipToCity( df .z ip)) # Join on user_id # Run udf to add c it y column PhysicalPlan with Predicate Pushdown and Column Pruning join optimized scan (events) optimized scan (users) events = add_demographics(sqlCtx.load("/data/events", "parquet")) training_data = events.where(events.city == "New York").select(even ts. times ta mp) .co llec t() LogicalPlan filter join PhysicalPlan join scan (users)events file userstable 54 scan (events) filter
  45. 45. Columns: Predicate pushdown spark.read .format("jdbc") .option("url", "jdbc:postgresql:dbserver") .option("dbtable", "people") .load() .where($"name" === "michael") 55 You Write Spark Translates For Postgres SELECT * FROM people WHERE name = 'michael'
  46. 46. 43 Spark Core (RDD) Catalyst DataFrame/DatasetSQL MLPipelines Structured Streaming { JSON } JDBC andmore… FoundationalSpark2.0 Components Spark SQL GraphFrames
  47. 47. http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
  48. 48. Dataset Spark 2.0 APIs
  49. 49. Background: What is in an RDD? •Dependencies • Partitions (with optional localityinfo) • Compute function: Partition =>Iterator[T] Opaque Computation & Opaque Data
  50. 50. Structured APIs In Spark 60 SQL DataFrames Datasets Syntax Errors Analysis Errors Runtime Compile Time Runtime Compile Time Compile Time Runtime Analysis errors are reported before a distributed job starts
  51. 51. Type-safe:operate on domain objects with compiled lambda functions 8 Dataset API in Spark 2.0 val df = spark.read.j son("people.json") / / Convert data to domain obj ects. case cl ass Person(name: Stri ng, age: I n t ) val ds: Dataset[Person] = df.as[Person] val fi l terD S = d s . f i l t er ( _ . age > 30) // will return DataFrame=Dataset[Row] val groupDF = ds.filter(p=> p.name.startsWith(“M”)) .groupBy(“name”) .avg(“age”)
  52. 52. Source: michaelmalak
  53. 53. Project Tungsten II
  54. 54. Project Tungsten • Substantially speed up execution by optimizing CPU efficiency, via: SPARK-12795 (1) Runtime code generation (2) Exploiting cachelocality (3) Off-heap memory management
  55. 55. 6 “bricks” Tungsten’s Compact Row Format 0x0 123 32L 48L 4 “data” (123, “data”, “bricks”) Nullbitmap Offset to data Offset to data Fieldlengths 20
  56. 56. Encoders 6 “bricks”0x0 123 32L 48L 4 “data” JVM Object Internal Representation MyClass(123, “data”, “ br i c ks”) Encoders translate between domain objects and Spark's internal format
  57. 57. Datasets: Lightning-fast Serialization with Encoders
  58. 58. Performance of Core Primitives cost per row (single thread) primitive Spark 1.6 Spark 2.0 filter 15 ns 1.1 ns sum w/o group 14 ns 0.9 ns sum w/ group 79 ns 10.7 ns hash join 115 ns 4.0 ns sort (8 bit entropy) 620 ns 5.3 ns sort (64 bit entropy) 620 ns 40 ns sort-merge join 750 ns 700 ns Intel Haswell i7 4960HQ 2.6GHz, HotSpot 1.8.0_60-b27, Mac OS X 10.11
  59. 59. Workshop: Notebook on DataFrames/Datasets & Spark SQL • Import Notebook into your Spark 2.0 Cluster – http://dbricks.co/sswksh2A – http://dbricks.co/sswksh2 – https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql. Dataset • Workthrough each Notebookcell • Try challenges • Break..
  60. 60. Introduction to Structured Streaming
  61. 61. Streaming in Apache Spark Streaming demands newtypes of streaming requirements… 3 SQL MLlib Spark Core GraphX Functional, conciseand expressive Fault-tolerant statemanagement Unified stack with batch processing More than 51%users say most important partof Apache Spark Spark Streaming in production jumped to 22%from 14% Streaming
  62. 62. Streaming apps are growing more complex 4
  63. 63. Streaming computations don’t run in isolation • Need to interact with batch data, interactive analysis, machine learning, etc.
  64. 64. Use case: IoT Device Monitoring IoT events from Kafka ETL into long term storage - Preventdata loss - PreventduplicatesStatus monitoring - Handlelate data - Aggregateon windows on eventtime Interactively debug issues - consistency event stream Anomaly detection - Learn modelsoffline - Use online+continuous learning
  65. 65. Use case: IoT Device Monitoring Anomalydetection - Learn modelsoffline - Useonline + continuous learning IoT events event stream fromKafka ETL into long term storage - Prevent dataloss Status monitoring - Preventduplicates Interactively - Handle late data debugissues - Aggregateon windows -consistency on eventtime Continuous Applications Not just streaming any more
  66. 66. The simplest way to perform streaming analytics is not having to reason about streaming at all
  67. 67. Static, bounded table Stream as a unbound DataFrame Streaming, unbounded table Single API !
  68. 68. Stream as unbounded DataFrame
  69. 69. Gist of Structured Streaming High-level streaming API built on SparkSQL engine Runs the same computation as batch queriesin Datasets / DataFrames Eventtime, windowing,sessions,sources& sinks Guaranteesan end-to-end exactlyonce semantics Unifies streaming, interactive and batch queries Aggregate data in a stream, then serve using JDBC Add, remove,change queriesat runtime Build and apply ML modelsto your Stream
  70. 70. Advantages over DStreams 1. Processingwith event-time,dealingwith late data 2. Exactly same API for batch,streaming,and interactive 3. End-to-endexactly-once guaranteesfromthe system 4. Performance through SQL optimizations - Logical plan optimizations, Tungsten, Codegen, etc. - Faster state management for stateful stream processing 82
  71. 71. Structured Streaming ModelTrigger: every 1 sec 1 2 3 Time data up to 1 Input data up to 2 data up to 3 Query Input: data from source as an append-only table Trigger: newrows appended to table Query: operations on input usual map/filter/reduce newwindow, session ops
  72. 72. Structured Streaming ModelTrigger: every 1 sec 1 2 3 output for data up to 1 Result Query Time data up to 1 Input data up to 2 output for data up to 2 data up to 3 output for data up to 3 Result: final operated table updated every triggerinterval Output: what part of result to write to data sink after every trigger Complete output: Write full result table every time Output complete output
  73. 73. Structured Streaming ModelTrigger: every 1 sec 1 2 3 result for data up to 1 Result Query Time data up to 1 Input data up to 2 result for data up to 2 data up to 3 result for data up to 3 Output [complete mode] output only new rows since last trigger Result: final operated table updated every triggerinterval Output: what part of result to write to data sink after every trigger Complete output: Write full result table every time Append output: Write only new rows that got added to result table since previous batch *Not all output modes are feasible withall queries
  74. 74. Example WordCount https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
  75. 75. Batch ETL with DataFrame inputDF = spark.read .format("json") .load("source-path") resultDF = input .select("device", "signal") .where("signal > 15") resultDF.write .format("parquet") .save("dest-path") Read from JSON file Select some devices Write to parquet file
  76. 76. Streaming ETL with DataFrame input = ctxt.read .format("json") .stream("source-path") result = input .select("device", "signal") .where("signal > 15") result.write .format("parquet") .outputMode("append") .startStream("dest-path") read…stream() creates a streaming DataFrame, doesnot start any of the computation write…startStream() defineswhere & how to outputthe data and starts the processing
  77. 77. Streaming ETL with DataFrames 1 2 3 Result Input Output [append mode] new rows in result of 2 new rows in result of 3 input = spark.readStream .format("json") .load("source-path") result = input .select("device", "signal") .where("signal > 15") result.writeStream .format("parquet") .start("dest-path")
  78. 78. Continuous Aggregations Continuously compute average signal of each type of device 90 input.groupBy("device-type") .avg("signal")
  79. 79. Continuous Windowed Aggregations 91 input.groupBy( $"device-type", window($"event-time", "10 min")) .avg("signal") Continuously compute average signal of each type of device in last 10 minutes using event-time column
  80. 80. Joining streams with static data kafkaDataset = spark.readStream .format("kafka") .option("subscribe", "iot-updates") .option("kafka.bootstrap.servers", ...) .load() staticDataset = ctxt.read .jdbc("jdbc://", "iot-device-info") joinedDataset = kafkaDataset.join(staticDataset, "device-type") 92 Join streaming data from Kafka with static data via JDBC to enrich the streaming data … … withouthaving to thinkthat you are joining streaming data
  81. 81. Query Management query = result.write .format("parquet") .outputMode("append") .startStream("dest-path") query.stop() query.awaitTermination() query.exception() query.sourceStatuses() query.sinkStatus() 93 query: a handle to the running streaming computation for managingit - Stop it, wait for it to terminate - Get status - Get error, if terminated Multiple queries can be active at the same time Each query has unique name for keepingtrack
  82. 82. Watermarking for handling late data [Spark 2.1] 94 input.withWatermark("event-time", "15 min") .groupBy( $"device-type", window($"event-time", "10 min")) .avg("signal") Continuously compute 10 minute average while data can be 15 minutes late
  83. 83. Structured Streaming Underneath the Hood
  84. 84. Logically: Dataset operations on table (i.e. as easy to understand asbatch) Physically: Spark automatically runs the queryin streaming fashion (i.e. incrementally andcontinuously) DataFrame LogicalPlan Catalystoptimizer Continuous, incrementalexecution Query Execution
  85. 85. Batch/Streaming Execution on Spark SQL 97 DataFrame/ Dataset Logical Plan Planner SQL AST DataFrame Unresolved Logical Plan Logical Plan Optimized Logical Plan RDDs Selected Physical Plan Analysis Logical Optimization Physical Planning CostModel Physical Plans Code Generation CatalogDataset Helluvalotofmagic!
  86. 86. Batch Execution on Spark SQL 98 DataFrame/ Dataset Logical Plan Execution PlanPlanner Run super-optimized Spark jobsto compute results Bytecode generation JVM intrinsics, vectorization Operations on serialized data Code Optimizations MemoryOptimizations Compact and fastencoding Offheap memory Project Tungsten -Phase 1 and 2
  87. 87. Continuous Incremental Execution Planner knows how to convert streaming logical plans to a continuous series of incremental execution plans 99 DataFrame/ Dataset Logical Plan Incremental Execution Plan 1 Incremental Execution Plan 2 Incremental Execution Plan 3 Planner Incremental Execution Plan 4
  88. 88. Continuous Incremental Execution 100 Planner Incremental Execution 2 Offsets:[106-197] Count: 92 Plannerpollsfor new data from sources Incremental Execution 1 Offsets:[19-105] Count: 87 Incrementally executes new data and writesto sink
  89. 89. Continuous Aggregations Maintain runningaggregate as in-memory state backed by WAL in file system for fault-tolerance 101 state data generated and used across incremental executions Incremental Execution 1 state: 87 Offsets:[19-105] Running Count: 87 memory Incremental Execution 2 state: 179 Offsets:[106-179] Count: 87+92 = 179
  90. 90. Structured Streaming: Recap • High-level streaming API built on Datasets/DataFrames • Eventtime, windowing,sessions,sources& sinks End-to-end exactly once semantics • Unifies streaming, interactive and batch queries • Aggregate data in a stream, then serveusing JDBC Add, remove,change queriesat runtime • Build and applyML models
  91. 91. Current Status: Spark 2.0.2 Basic infrastructureand API - Eventtime, windows,aggregations - Append and Complete output modes - Support for a subsetof batch queries Sourceand sink - Sources: Files,Kafka, Flume,Kinesis - Sinks: Files,in-memory table, unmanaged sinks(foreach) Experimental releaseto set the future direction Not ready for production but good for experiment
  92. 92. Coming Soon (Dec 2016): Spark 2.1.0 Event-time watermarks and old state evictions Status monitoring and metrics - Current status (processing,waiting fordata, etc.) - Current metrics (inputrates, processing rates, state data size, etc.) - Codahale / Dropwizard metrics (reporting to Ganglia,Graphite, etc.) Many stability & performance improvements 104
  93. 93. Future Direction Stability, stability, stability Supportfor more queries Sessionization More outputmodes Supportfor more sinks ML integrations Make Structured Streaming readyfor production workloads by Spark 2.2/2.3
  94. 94. Blogs for Deeper Technical Dive Blog1: https://databricks.com/blog/2016/07/28/ continuous-applications-evolving- streaming-in-apache-spark-2-0.html Blog2: https://databricks.com/blog/2016/07/28/ structured-streaming-in-apache- spark.html
  95. 95. Resources • Getting Started Guide w ApacheSpark on Databricks • docs.databricks.com • Spark Programming Guide • StructuredStreaming Programming Guide • Databricks EngineeringBlogs • sparkhub.databricks.com • spark-packages.org
  96. 96. https://spark-summit.org/east-2017/
  97. 97. Do you have any questions for my prepared answers?
  98. 98. Demo & Workshop: Structured Streaming • Import Notebook into your Spark 2.0 Cluster • http://dbricks.co/sswksh3 (Demo) • http://dbricks.co/sswksh4 (Workshop) • Done!
  99. 99. 1. Processing with event-time, dealing with late data - DStream API exposes batch time, hard to incorporate event-time 2. Interoperatestreaming with batch AND interactive - RDD/DStream hassimilar API, butstill requirestranslation 3. Reasoning about end-to-end guarantees - Requirescarefully constructing sinks that handle failurescorrectly - Data consistency in the storage while being updated Pain points with DStreams
  100. 100. Output Modes Defines what is written every time there is a trigger Different output modes make sense for differentqueries 22 i n p u t.select ("dev ic e", "s i g n al ") .w r i te .outputMode("append") .fo r m a t( "parq uet") .startStrea m( "de st-pa th ") Append modewith non-aggregationqueries i n p u t.agg( cou nt("* ") ) .w r i te .outputMode("complete") .fo r m a t( "parq uet") .startStrea m( "de st-pa th ") Complete mode with aggregationqueries

×