JEEConf 2016 - Lambda Architecture with Apache Spark

Many players on the market have built successful MapReduce workflows that process terabytes of historical data daily. But who wants to wait 24 hours to get updated analytics? This talk introduces the lambda architecture, which is designed to take advantage of both batch and stream processing methods. We will combine fast access to historical data with real-time streaming data using Spark (Core, SQL, Streaming), Twitter, Apache Parquet, etc.

Clear code and an intuitive demo are also included: https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv

Presented at JEEConf 2016 in Kyiv on 20/05/2016.

Design by Yarko Filevych: http://www.filevych.com/

  1. Lambda Architecture with Apache Spark
  2. About Me https://ua.linkedin.com/in/tarasmatyashovsky
  3. Apache Hadoop: A Brief History http://www.slideshare.net/fadicce/hadoop-user-group-uae-meeting
  4. Many customers have implemented successful Hadoop-based MapReduce pipelines that are still operating today
  5. Examples from Real Life • An Oozie workflow that runs daily and processes up to 150 TB to generate analytics • A bash-managed workflow that runs daily and processes up to 8 TB to generate analytics
  6. Examples from Real Life http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
  7. Lambda Architecture: a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods. http://lambda-architecture.net/
  8. https://www.manning.com/books/big-data
  9. https://www.manning.com/books/big-data
  10. Layers of Lambda Architecture • Batch layer: manages the master dataset (an immutable, append-only set of raw data) and pre-computes the batch views • Serving layer: indexes the batch views so that they can be queried ad hoc with low latency • Speed layer: deals with recent data only http://lambda-architecture.net/
  11. https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
  12. Relevance of Data • batch view = function(all data) • real time view = function(real time view, new data) • query = function(batch view, real time view) http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
  13. Trade-offs • Full recomputation vs. partial recomputation, e.g. using Bloom filters • Additive algorithms vs. approximation algorithms, e.g. HyperLogLog for the count-distinct problem
  14. Implementation of Lambda Architecture
  15. https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
  16. Integrated solution for processing on all lambda architecture layers
  17. Apache Spark: A Brief History
  18. Spark Streaming • Enables scalable, high-throughput, fault-tolerant stream processing of live data streams • 50% of users consider it the most important part of Spark http://spark.apache.org/docs/latest/streaming-programming-guide.html
  19. Streaming Architecture http://spark.apache.org/docs/latest/streaming-programming-guide.html
  20. https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
  21. http://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
  22. http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
  23. DStream as a Continuous Series of RDDs http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
  24. http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
  25. Sample Application • Provide statistics on the hashtags used in #jeeconf tweets • All time until today + right now https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
  26. Batch View • apache: 6 • architecture: 12 • aws: 3 • java: 4 • jeeconf: 7 • lambda: 6 • morningatlohika: 15 • simpleworkflow: 14 • spark: 5 https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
  27. Real-time View “Cool presentation by @tmatyashovsky about #lambda #architecture using #apache #spark at #jeeconf” • apache: 1 • architecture: 1 • jeeconf: 1 • lambda: 1 • spark: 1 https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
  28. Batch View + Real-time View • apache: 7 • architecture: 13 • aws: 3 • java: 4 • jeeconf: 8 • lambda: 7 • morningatlohika: 15 • simpleworkflow: 14 • spark: 6 https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
  29. Simplified Steps • Create the batch view (.parquet) via Apache Spark • Cache the batch view in Apache Spark • Start a streaming application connected to Twitter • Focus on real-time #jeeconf tweets* • Build incremental real-time views • Query, i.e. merge the batch and real-time views on the fly. * A stream from the file system (used for testing) can serve as a backup. https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
  30. Demo Time https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
  31. http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
  32. Structured Streaming in Spark 2.0 • The simplest way to perform streaming analytics is not having to reason about streaming • Static DataFrame API = Infinite DataFrame API http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
  33. Structured Streaming • Introduces a streaming API built on top of Spark SQL • Unifies streaming, interactive and batch queries • Code shown on the slide: logs = context.read.format("json").stream("s3://logs") followed by logs.groupBy(logs.user_id).agg(sum(logs.time)).write.format("jdbc").stream("jdbc:mysql://...") https://www.youtube.com/watch?v=oXkxXDG0gNk
  34. Instead of Epilogue
  35. http://milinda.pathirage.org/kappa-architecture.com/
  36. http://milinda.pathirage.org/kappa-architecture.com/
  37. Thank you! Taras Matyashovsky • taras.matyashovsky@gmail.com • @tmatyashovsky • http://www.filevych.com/
  38. References
      • http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
      • https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
      • https://www.manning.com/books/big-data
      • Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly Media)
      • http://spark.apache.org/docs/latest/streaming-programming-guide.html
      • http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
      • http://www.rittmanmead.com/2015/08/combining-spark-streaming-and-data-frames-for-near-real-time-log-analysis/
      • https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
      • https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html
      • http://spark.apache.org/docs/latest/cluster-overview.html
      • http://milinda.pathirage.org/kappa-architecture.com/
      • http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia
      • http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
      • http://thenewstack.io/spark-2-0-will-offer-interactive-querying-live-data/
      • http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
      • https://databricks.com/blog/2015/10/13/interactive-audience-analytics-with-spark-and-hyperloglog.html
      • https://www.youtube.com/watch?v=ZFBgY0PwUeY
      • https://www.youtube.com/watch?v=oXkxXDG0gN
      • https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html
      • http://www.slideshare.net/Typesafe_Inc/four-things-to-know-about-reliable-spark-streaming-with-typesafe-and-databricks
      • http://spark.apache.org/docs/latest/configuration.html#spark-streaming
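
The relations on the "Relevance of Data" slide and the hashtag counts on the "Batch View", "Real-time View" and "Batch View + Real-time View" slides can be reproduced with a small sketch in plain Python. This is not the code from the talk's repository, which uses Apache Spark; the function names `extract_hashtags`, `update_realtime_view` and `query` are hypothetical, and lower-casing the hashtags is an assumption. Only the counts and the sample tweet come from the slides.

```python
import re
from collections import Counter

# Matches a hashtag and captures the word after '#'.
HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(tweet):
    """Return the hashtags in a tweet, lower-cased and without '#'."""
    return [tag.lower() for tag in HASHTAG.findall(tweet)]


def update_realtime_view(realtime_view, tweet):
    """real time view = function(real time view, new data)."""
    updated = realtime_view.copy()
    updated.update(extract_hashtags(tweet))
    return updated


def query(batch_view, realtime_view):
    """query = function(batch view, real time view): sum counts per hashtag."""
    return batch_view + realtime_view


# Batch view as shown on the "Batch View" slide.
batch_view = Counter({
    "apache": 6, "architecture": 12, "aws": 3, "java": 4, "jeeconf": 7,
    "lambda": 6, "morningatlohika": 15, "simpleworkflow": 14, "spark": 5,
})

# New data arriving in the speed layer: the tweet from the "Real-time View" slide.
tweet = ("Cool presentation by @tmatyashovsky about #lambda #architecture "
         "using #apache #spark at #jeeconf")

realtime_view = update_realtime_view(Counter(), tweet)
merged = query(batch_view, realtime_view)

for hashtag in sorted(merged):
    print(hashtag, merged[hashtag])
```

Hashtag counting is additive, so the merge is a per-key sum and neither view needs to be recomputed at query time.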

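
The "Trade-offs" slide names Bloom filters as one aid for partial recomputation but does not say how the talk uses them. One possible use, assumed here purely for illustration, is remembering which items a view has already counted so that a distinct count can be extended incrementally instead of being recomputed from all data. The `BloomFilter` class, its default sizes and the sample user names below are all hypothetical.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Incremental distinct count: only items the filter has not seen are counted.
seen = BloomFilter()
distinct_estimate = 0
for user in ["alice", "bob", "alice", "carol", "bob"]:
    if user not in seen:
        seen.add(user)
        distinct_estimate += 1

print(distinct_estimate)
```

Because a Bloom filter can report a false positive, the estimate can undercount but never overcount; that is the accuracy given up in exchange for not keeping every item in memory.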