
Introduction to real time big data with Apache Spark


This presentation will be useful to those who would like to get acquainted with the Apache Spark architecture and its top features, and to see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases from one of our commercial projects and recalls the roadmap of how we integrated Apache Spark into it.

Presented at the Morning@Lohika tech talks in Lviv.

Design by Yarko Filevych: http://www.filevych.com/



  1. Introduction to Real-time Big Data with Apache Spark
  2. Introduction
  3. About Me https://ua.linkedin.com/in/tarasmatyashovsky
  4. Agenda • Buzzwords • Spark in a Nutshell • Spark Concepts • Spark Core • live demo session • Spark SQL • live demo session • Road to Production • Spark Drawbacks • Our Spark Integration • Spark is on the Rise
  5. Big Data: a buzzword for data sets so large and complex that they are difficult to process using on-hand database management tools or traditional data processing applications https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
  6. http://www.ibmbigdatahub.com/infographic/four-vs-big-data
  7. "Jesus Christ, It is Big Data, Get Hadoop!" by Sergey Shelpuk (https://ua.linkedin.com/in/shelpuk) at the AI Club Meetup in Lviv
  8. To Hadoop? • Batch mode, not real-time • Unstructured or semi-structured data • MapReduce programming model, e.g. key/value pairs http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
  9. Not to Hadoop? • Real-time, streaming • Structures which cannot be decomposed into key/value pairs • Jobs/algorithms which do not fit the MapReduce programming model http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
  10. Not to Hadoop? • A subset of the data is enough: remove excessive complexity or shrink the data set via other processing techniques, e.g. hashing or clustering • Random, interactive access to data: for well-structured data a bunch of scalable, mature (No)SQL DB solutions exist (HBase, Cassandra, columnar scalable DW engines) • Sensitive data: security is still very challenging and immature
  11. Why Spark? As of mid-2014, Spark is the most active Big Data project (contributors per month to Spark) http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
  12. Spark: a fast and general-purpose cluster computing platform for large-scale data processing
  13. History
  14. Time to Sort 100 TB http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
  15. Why is Spark Faster? Spark processes data in-memory, while Hadoop persists back to disk after each map/reduce action
  16. Powered by Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
  17. Components Stack https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
  18. Core Concepts: automatically distribute data across the cluster and parallelize the operations performed on it
  19. Distributed Application https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
  20. http://spark.apache.org/docs/latest/cluster-overview.html
  21. Spark Core Abstractions
  22. RDD API Transformations: • filter() • map() • flatMap() • distinct() • union() • intersection() • subtract() • etc. Actions: • collect() • reduce() • count() • countByValue() • first() • take() • top() • etc.
  23. RDD Operations • transformations are executed on workers • actions may transfer data from the workers to the driver • collect() sends all the partitions to the single driver http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
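A minimal sketch of the transformation/action split described on the two slides above, assuming the Spark 1.3-era Java API with Java 8 lambdas; the app name and data are illustrative, not part of the original demo:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasics {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "hadoop", "spark"));

        // Transformations are lazy and run on the workers; nothing executes yet.
        JavaRDD<Integer> lengths = words
                .filter(s -> !s.isEmpty())
                .map(s -> s.length())
                .distinct();

        // Actions trigger execution; collect() ships all partitions to the driver.
        List<Integer> result = lengths.collect();
        long count = lengths.count();

        System.out.println(result + ", count = " + count);
        sc.stop();
    }
}
```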
  24. Pair RDD Transformations: • reduceByKey() • groupByKey() • sortByKey() • keys() • values() • join() • etc. Actions: • countByKey() • collectAsMap() • lookup() • etc.
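The same idea for pair RDDs, again a hedged sketch against the Spark 1.3 Java API; the keys and values are made up for illustration:

```java
import java.util.Arrays;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class PairRddBasics {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("pair-rdd-basics").setMaster("local[*]"));

        JavaRDD<String> companies = sc.parallelize(
                Arrays.asList("Lohika", "GlobalLogic", "Lohika"));

        // mapToPair() turns a plain RDD into a pair RDD of (key, value) tuples.
        JavaPairRDD<String, Integer> ones = companies.mapToPair(c -> new Tuple2<>(c, 1));

        // reduceByKey() is a transformation: values are combined per key on the workers.
        JavaPairRDD<String, Integer> counts = ones.reduceByKey((a, b) -> a + b);

        // collectAsMap() is an action that brings the result to the driver.
        Map<String, Integer> byCompany = counts.collectAsMap();
        System.out.println(byCompany);

        sc.stop();
    }
}
```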
  25. Sample Application https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
  26. Requirements Analytics about Morning@Lohika events: • unique participants by companies • most loyal participants • participants by position • etc. https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
  27. Data Format Simple CSV files; all fields are optional:
      First Name | Last Name    | Company      | Position          | Email                             | Present
      Vladimir   | Tsukur       | GlobalLogic  | Tech/Team Lead    | flushdia@gmail.com                | 1
      Mikalai    | Alimenkou    | XP Injection | Tech Lead         | mikalai.alimenkou@xpinjection.com | 1
      Taras      | Matyashovsky | Lohika       | Software Engineer | taras.matyashovsky@gmail.com      | 0
      https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
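One plausible way to turn such CSV lines into an RDD of beans; the Participant class and its fields are assumptions mirroring the columns above, not the actual classes from the sample project:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CsvParsing {

    // Bean mirroring the CSV columns on the slide; names are hypothetical.
    public static class Participant implements java.io.Serializable {
        String firstName, lastName, company, position, email;
        boolean present;
    }

    public static JavaRDD<Participant> load(JavaSparkContext sc, String path) {
        return sc.textFile(path).map(line -> {
            // All fields are optional, so guard against short rows.
            String[] cols = line.split(",", -1);
            Participant p = new Participant();
            p.firstName = cols.length > 0 ? cols[0].trim() : "";
            p.lastName  = cols.length > 1 ? cols[1].trim() : "";
            p.company   = cols.length > 2 ? cols[2].trim() : "";
            p.position  = cols.length > 3 ? cols[3].trim() : "";
            p.email     = cols.length > 4 ? cols[4].trim() : "";
            p.present   = cols.length > 5 && "1".equals(cols[5].trim());
            return p;
        });
    }
}
```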
  28. Technologies • Spring Boot 1.2.3.RELEASE • Spark 1.3.1, released April 17, 2015 • 2 Spark jar dependencies • Apache 2.0 license, i.e. free to use https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
  29. Features • simple HTTP-based API • file system: local and HDFS • data formats: CSV and Parquet • 3 compatible implementations based on: • RDD (Spark Core) • Data Frame DSL (Spark SQL) • Data Frame SQL (Spark SQL) • serialization: default Java and Kryo https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
  30. Demo Time https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
  31. Demo Explained [cluster diagram: Driver (Spark Context) → Cluster Manager → Workers, each running Executors with Tasks] http://spark.apache.org/docs/latest/cluster-overview.html
  32. Functional Programming API Drawback: limited opportunities for automatic optimization
  33. Spark SQL: structured data processing
  34. Data Frame: a distributed collection of data organized into named columns
  35. Data Frame API • selecting columns • joining different data sources • aggregation, e.g. sum, count, average • filtering
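A small sketch of the two Spark SQL styles from the feature list above (Data Frame DSL and plain SQL), assuming the Spark 1.3 Java API; the data source and column names are illustrative:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DataFrameDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("df-demo").setMaster("local[*]"));
        SQLContext sqlContext = new SQLContext(sc);

        // jsonFile() is used here only to get a DataFrame quickly;
        // the sample project builds DataFrames from CSV and Parquet instead.
        DataFrame df = sqlContext.jsonFile("participants.json");

        // Data Frame DSL: filtering, column selection, and aggregation.
        DataFrame byCompany = df
                .filter(df.col("present").equalTo(1))
                .groupBy("company")
                .count();
        byCompany.show();

        // Data Frame SQL: the same query as plain SQL against a temp table.
        df.registerTempTable("participants");
        sqlContext.sql("SELECT company, COUNT(*) AS total "
                + "FROM participants WHERE present = 1 GROUP BY company").show();

        sc.stop();
    }
}
```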
  36. Plan Optimization & Execution http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
  37. Faster than RDD http://www.slideshare.net/databricks/spark-sqlsse2015public
  38. Demo Time https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
  39. Persistence & Caching • by default stores the data in the JVM heap as unserialized objects • possibility to store on disk as unserialized/serialized objects • off-heap caching is experimental and uses Tachyon
  40. https://spark.apache.org/docs/latest/running-on-mesos.html
  41. A cluster manager should be chosen and configured properly
  42. Monitoring via web UI(s) and metrics
  43. Monitoring • master web UI • worker web UI • driver web UI, available only during execution • history server, spark.eventLog.enabled = true
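A sketch of enabling event logging programmatically so the history server can show finished applications, as the slide above suggests; the log directory path is an example, not from the slides:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class EventLogConfig {
    public static void main(String[] args) {
        // With event logging on, a finished application can be inspected through
        // the history server instead of only the short-lived driver web UI.
        SparkConf conf = new SparkConf()
                .setAppName("monitored-app")
                .set("spark.eventLog.enabled", "true")
                // The directory must exist and be readable by the history server;
                // this path is purely illustrative.
                .set("spark.eventLog.dir", "hdfs:///spark-event-logs");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job code ...
        sc.stop();
    }
}
```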
  44. Metrics • based on the Coda Hale Metrics library • can be reported via HTTP, JMX, and CSV files
  45. https://spark.apache.org/docs/latest/tuning.html
  46. Serialization https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization
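A hedged example of switching from default Java serialization to Kryo in Spark 1.x; the registered class is a placeholder:

```java
import org.apache.spark.SparkConf;

public class KryoConfig {

    // Placeholder for a domain class shipped between nodes; name is illustrative.
    public static class Participant {}

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("kryo-app")
                // Switch from default Java serialization to Kryo.
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                // Registering classes up front avoids writing full class names
                // with every serialized object.
                .registerKryoClasses(new Class<?>[]{Participant.class});
    }
}
```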
  47. Memory Management Tune the executor memory fractions: • RDD storage (60%) • shuffle and aggregation buffers (20%) • user code (20%) https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
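A sketch of adjusting that 60/20/20 split, assuming the Spark 1.x configuration keys spark.storage.memoryFraction and spark.shuffle.memoryFraction; the values below are arbitrary examples:

```java
import org.apache.spark.SparkConf;

public class MemoryFractions {
    public static void main(String[] args) {
        // Spark 1.x defaults: 60% of executor heap for cached RDDs,
        // 20% for shuffle/aggregation buffers, the rest for user code.
        SparkConf conf = new SparkConf()
                .setAppName("memory-tuning")
                // Shrink the cache share if jobs cache little but shuffle heavily.
                .set("spark.storage.memoryFraction", "0.4")
                .set("spark.shuffle.memoryFraction", "0.4");
    }
}
```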
  48. Memory Management Tune the storage level: • store in memory and/or on disk • store as unserialized/serialized objects • replicate each partition on 1 or 2 cluster nodes • store in Tachyon https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
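Those storage-level choices map onto persist() calls; a minimal sketch with toy data:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class StorageLevelDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("storage-levels").setMaster("local[*]"));

        JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a", "b", "c"));

        // Default cache(): MEMORY_ONLY, unserialized objects in the JVM heap.
        rdd.cache();

        // Serialized in memory, spilling to disk when memory runs out.
        rdd.unpersist();
        rdd.persist(StorageLevel.MEMORY_AND_DISK_SER());

        // The *_2 levels replicate each partition on two cluster nodes.
        rdd.unpersist();
        rdd.persist(StorageLevel.MEMORY_ONLY_2());

        sc.stop();
    }
}
```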
  49. Level of Parallelism • spark.task.cpus: 1 task per partition using 1 core to execute • spark.default.parallelism • can be controlled via: • repartition() and coalesce() functions • degree of parallelism as an operation parameter • the storage system matters
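A short sketch of controlling the number of partitions with repartition() and coalesce(); partition counts and data are illustrative:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelismDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("parallelism")
                .setMaster("local[4]")
                // Default number of partitions for shuffles when not set per operation.
                .set("spark.default.parallelism", "8");

        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 2);

        // repartition() reshuffles to more (or fewer) partitions, at shuffle cost.
        JavaRDD<Integer> wider = numbers.repartition(8);

        // coalesce() narrows to fewer partitions and avoids a full shuffle.
        JavaRDD<Integer> narrower = wider.coalesce(2);

        System.out.println(narrower.partitions().size());
        sc.stop();
    }
}
```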
  50. Data Locality • check data locality via the UI • configure data locality settings if needed: spark.locality.wait timeout • execute certain jobs on a driver: spark.localExecution.enabled
  51. Java API Drawbacks • parts of the API can be experimental or intended for development use only • the Spark Java API may lag behind, as the Scala API is the main focus
  52. Our Spark Integration
  53. Product: a cloud-based analytics application
  54. Use Cases • supplement the Neo4j database used to store/query big dimensions • supplement an RDBMS for querying high volumes of data
  55. Use Cases • represent an existing computational graph as a flow of Spark-based operations • predictive analytics based on the Spark MLlib component
  56. Lessons Learned • Spark's simplicity is deceptive: Spark is kind of magic • each use case is unique • be really aware of: • the Databricks blog • mailing lists & Jira • pull requests
  57. Spark is on the Rise
  58. http://www.techrepublic.com/article/can-anything-dim-apache-spark/
  59. Project Tungsten • the largest change to Spark's execution engine since the project's inception • focuses on substantially improving the efficiency of memory and CPU for Spark applications • sun.misc.Unsafe https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
  60. Thank you! Taras Matyashovsky taras.matyashovsky@gmail.com @tmatyashovsky http://www.filevych.com/
  61. References
      https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
      http://www.ibmbigdatahub.com/infographic/four-vs-big-data
      http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
      http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
      Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly Media)
      https://spark-prs.appspot.com/#all
      https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
      http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
      https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
      http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
      http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
      http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
      http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
      http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
      http://www.slideshare.net/databricks/spark-sqlsse2015public
      https://spark.apache.org/docs/latest/running-on-mesos.html
      http://spark.apache.org/docs/latest/cluster-overview.html
      http://www.techrepublic.com/article/can-anything-dim-apache-spark/
      http://spark-packages.org/
