
Introduction to Apache Spark 2.0


Spark 2.0 is a major release of Apache Spark that brings many changes to Spark's APIs and libraries. In this KnolX session, we look at some of the improvements made in Spark 2.0 and introduce new features such as the SparkSession API and Structured Streaming.



  1. Introduction to Apache Spark 2.0. Himanshu Gupta, Sr. Software Consultant, Knoldus Software LLP
  2. Agenda: Part 1 (SparkSession), Part 2 (Structured Streaming)
  3. Agenda: Part 1 (SparkSession), Part 2 (Structured Streaming)
  4. What is Apache Spark? ● A fast and general engine for large-scale data processing. ● Offers a rich set of APIs and libraries in Scala, Java, Python and R. ● The most active Apache Big Data project. Img Src: https://www.google.com/
  5. Spark Survey 2015 ● Reflected the answers and opinions of over 1,417 respondents from 842 organizations. ● Indicated rapid growth of the Spark community. ● Showed a positive attitude towards a concise and unified API for Big Data processing. ● https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html
  6. Apache Spark 2.0 ● Released in July 2016; version 2.1.0 is already under development. ● Provides a unified API for SQL, Streaming and Graph operations.
  7. Apache Spark 2.0 ● Released in July 2016; version 2.1.0 is already under development. ● Provides a unified API for SQL, Streaming and Graph operations. SparkSession
  8. What is SparkSession? Img Src: https://www.google.com/
  9. What is SparkSession? SparkContext: for the Core API
  10. What is SparkSession? SparkContext: for the Core API. StreamingContext: for the Streaming API
  11. What is SparkSession? SparkContext: for the Core API. StreamingContext: for the Streaming API. SQLContext: for the SQL API
  12. What is SparkSession? SparkContext: for the Core API. StreamingContext: for the Streaming API. SQLContext: for the SQL API. SparkSession: a unified API over all of them
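
A minimal sketch of the unified entry point (the app name and local master are assumptions for a demo, not part of the slides):

    import org.apache.spark.sql.SparkSession

    // One builder replaces the old SparkContext/SQLContext/StreamingContext setup.
    val spark = SparkSession.builder()
      .appName("spark-2.0-intro")   // assumed app name
      .master("local[*]")           // assumption: local mode, just for the demo
      .getOrCreate()

    // The underlying SparkContext is still reachable when the Core API is needed.
    val sc = spark.sparkContext

    // SQL and DataFrame operations hang off the same object.
    spark.range(5).createOrReplaceTempView("numbers")
    spark.sql("SELECT id * 2 AS doubled FROM numbers").show()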
  13. Benefits of Spark 2.0 ● Unified DataFrames and Datasets: DataFrame = Dataset[Row]. ● Up to 10x faster than Spark 1.6, due to Whole-Stage Code Generation. ● Smarter than Spark Streaming 1.6, as streaming is structured too. Img Src: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
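
A small illustration of the unification, as it would look in a spark-shell session (the Person case class is made up for the example):

    import org.apache.spark.sql.{DataFrame, Dataset, Row}
    import spark.implicits._   // uses the `spark` session from the sketch above

    case class Person(name: String, age: Int)

    // A Dataset is strongly typed ...
    val ds: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()

    // ... and a DataFrame is just a Dataset of generic Rows.
    val df: DataFrame = ds.toDF()          // DataFrame = Dataset[Row]
    val back: Dataset[Person] = df.as[Person]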
  14. Why is Spark 2.0 Faster? Img Src: https://www.google.com/
  15. Why is Spark 2.0 Faster? The reason is "Whole-Stage Code Generation".
  16. Example Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
  17. Example Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
  18. Example Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
  19. Example: the Volcano Model. Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
  20. What's wrong here? The Volcano Model. Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
  21. For the answer, let's compare the same code with hand-written code: System Generated vs Hand-Written. Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
  22. Volcano Model vs Hand-Written Code: Volcano | Hand-Written. Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
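
The gap can be sketched outside Spark too: a Volcano-style pipeline pays a virtual next() call per row per operator, while a hand-written loop fuses everything (illustrative only, not a benchmark):

    // Volcano-style: each operator pulls rows from its child one at a time,
    // so every row crosses several virtual-call boundaries.
    val viaIterators = (1L to 1000000L).iterator
      .filter(_ % 2 == 0)
      .map(_ * 2)
      .sum

    // Hand-written style: the same logic collapsed into one tight loop with
    // no per-row dispatch. This is roughly what code generation aims to emit.
    var total = 0L
    var i = 1L
    while (i <= 1000000L) {
      if (i % 2 == 0) total += i * 2
      i += 1
    }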
  23. Solution Img Src: https://www.google.com/
  24. Solution: of course, Whole-Stage Code Generation. It provides the performance of hand-written code with the functionality of a general-purpose engine.
  25. What is Whole-Stage Code Generation? ● It works over the same query plans as the Volcano Model. ● The difference: earlier, Spark applied code generation only to expression evaluation (e.g., "1 + a"); now it generates code for the entire query.
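
One way to see it in a Spark 2.0 shell: explain() prefixes operators that run inside generated code with an asterisk.

    // Physical plan for a simple aggregation over the `spark` session above.
    val q = spark.range(1000L * 1000).selectExpr("sum(id)")
    q.explain()
    // Operators printed with a leading '*' (e.g. *HashAggregate, *Range)
    // have been compiled together into a single generated function.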
  26. Spark 1.x vs Spark 2.0 Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
  27. Demo 1
  28. Agenda: Part 1 (SparkSession), Part 2 (Structured Streaming). Questions?
  29. Agenda: Part 1 (SparkSession), Part 2 (Structured Streaming)
  30. Streaming Applications. Pros: ● Consistent ● In-order data ● No shuffling. Cons: ● Not scalable ● No fault tolerance. Pros: ● Scalable ● Fault tolerant. Cons: ● Inconsistent ● Out-of-order data ● Too much shuffling.
  31. Continuous Applications Img Src: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
  32. How to Achieve It? Img Src: https://www.google.com/
  33. Solution: Structured Streaming. Structured Streaming guarantees that at any time, the output of the application is equivalent to executing a batch job on a prefix of the data.
  34. How? Conceptually, Structured Streaming treats all arriving data as an infinite input table. Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  35. How? ● The developer defines a query on the input table, as if it were a static table. ● Results are computed into a Result Table, which is then written to an output sink. ● Finally, developers define triggers to control when the result table is updated. Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  36. How? ● The developer defines a query on the input table, as if it were a static table. ● Results are computed into a Result Table, which is then written to an output sink. ● Finally, developers define triggers to control when the result table is updated. Incremental execution. Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
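
A minimal word-count sketch over the built-in socket source (host and port are assumptions; feed it with `nc -lk 9999`):

    import spark.implicits._

    // Every line arriving on the socket becomes a new row of the unbounded input table.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // The query is written exactly as it would be against a static table.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()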
  37. Output Modes ● Append: only the rows appended to the result table since the last trigger are written to external storage. ● Complete: the entire updated result table is written to external storage. ● Update: only the rows that were updated in the result table since the last trigger are changed in external storage.
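
Continuing the word-count sketch above, complete mode rewrites the whole result table on every trigger (append mode would reject this query, since existing count rows keep changing):

    // Start the query, printing the full result table after each trigger.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()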
  38. Other Benefits ● Easy to use, as it is simply Spark's DataFrame/Dataset API.
  39. Other Benefits ● Easy to use, as it is simply Spark's DataFrame/Dataset API. Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  40. Other Benefits ● Easy to use, as it is simply Spark's DataFrame/Dataset API. Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  41. Other Benefits ● Easy to use, as it is simply Spark's DataFrame/Dataset API. ● Uses Spark's existing DataFrame/Dataset API, so we can map, filter and aggregate data as we do in Spark SQL.
  42. Other Benefits ● Easy to use, as it is simply Spark's DataFrame/Dataset API. ● Uses Spark's existing DataFrame/Dataset API, so we can map, filter and aggregate data as we do in Spark SQL. ● Joins streams with static data: a stream can be joined with a static DataFrame.
  43. Other Benefits ● Easy to use, as it is simply Spark's DataFrame/Dataset API. ● Uses Spark's existing DataFrame/Dataset API, so we can map, filter and aggregate data as we do in Spark SQL. ● Joins streams with static data: a stream can be joined with a static DataFrame. Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  44. Other Benefits ● Easy to use, as it is simply Spark's DataFrame/Dataset API. ● Uses Spark's existing DataFrame/Dataset API, so we can map, filter and aggregate data as we do in Spark SQL. ● Joins streams with static data: a stream can be joined with a static DataFrame. There are many more...
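
A hedged sketch of a stream-static join (the paths, schema and customerId column are all assumptions for illustration):

    import org.apache.spark.sql.types._

    // Static side: an ordinary batch DataFrame.
    val customers = spark.read.parquet("/data/customers")   // assumed path

    // Streaming side: a directory watched for new JSON files.
    // File sources need an explicit schema in Structured Streaming.
    val clickSchema = new StructType()
      .add("customerId", StringType)
      .add("url", StringType)

    val clicks = spark.readStream
      .schema(clickSchema)
      .json("/data/clicks")                                 // assumed path

    // The ordinary join API works even though one side is unbounded.
    val enriched = clicks.join(customers, "customerId")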
  45. Requirements ● Input sources must be replayable, so that recent data can be re-read if the job crashes. ● Output sinks must support transactional updates, so that the system can make a set of records appear atomically.
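
In practice, Structured Streaming tracks progress through a per-query checkpoint directory; continuing the join sketch above (output and checkpoint paths are assumptions):

    // Progress is recorded in a write-ahead log under the checkpoint directory,
    // so a restarted query resumes where it left off instead of
    // re-emitting or losing records.
    val sinkQuery = enriched.writeStream
      .outputMode("append")
      .format("parquet")
      .option("path", "/data/enriched")                   // assumed output path
      .option("checkpointLocation", "/data/checkpoints")  // assumed checkpoint path
      .start()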
  46. Comparison with Other Engines Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
  47. Demo 2
  48. Agenda: Part 1 (SparkSession), Part 2 (Structured Streaming). Questions?
  49. Code: https://github.com/knoldus/Sparkathon
  50. References ● https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html ● https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html ● https://www.youtube.com/watch?v=ZFBgY0PwUeY ● http://spark.apache.org/docs/latest/
  51. Thank You!
