Lambda Architecture with Apache Spark

Many companies have built successful MapReduce workflows that process terabytes of historical data every day. But who wants to wait 24 hours for updated analytics? This talk introduces the lambda architecture, which is designed to take advantage of both batch and stream processing methods. We will combine fast access to historical data with real-time streaming data using Spark (Core, SQL, Streaming), Twitter, Apache Parquet, etc.

Clear code and an intuitive demo are also included: https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv

Presented at the Morning@Lohika tech talks in Lviv on 14/05/2016.

Design by Yarko Filevych: http://www.filevych.com/

Published in: Technology

Lambda Architecture with Apache Spark

  1. 1. Lambda Architecture with Apache Spark
  2. 2. About Me https://ua.linkedin.com/in/tarasmatyashovsky
  3. 3. Apache Hadoop: A Brief History http://www.slideshare.net/fadicce/hadoop-user-group-uae-meeting
  4. 4. Many customers have implemented successful Hadoop-based M/R pipelines that are still operating today
  5. 5. Examples from Real Life
      • Oozie workflow that runs daily and processes up to 150 TB to generate analytics
      • Bash-managed workflow that runs daily and processes up to 8 TB to generate analytics
  6. 6. It Is 2016 Now!
      • Making decisions faster is more valuable
      • Kafka, Storm, Trident, Samza, Spark, Flink, Parquet, Avro, Cloud providers, etc.
  7. 7. Examples from Real Life http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
  8. 8. Lambda Architecture A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods http://lambda-architecture.net/
  9. 9. https://www.manning.com/books/big-data
  10. 10. https://www.manning.com/books/big-data
  11. 11. Layers of Lambda Architecture
      Batch layer
      • manages the master dataset (an immutable, append-only set of raw data)
      • pre-computes the batch views
      Serving layer
      • indexes the batch views so that they can be queried ad hoc with low latency
      Speed layer
      • deals with recent data only
      http://lambda-architecture.net/
  12. 12. https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
  13. 13. Relevance of Data
      query = function(batch view, real time view)
      real time view = function(real time view, new data)
      batch view = function(all data)
      http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
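The three functions on this slide can be sketched in plain Python. This is only a minimal illustration, assuming the views are simple per-key counts; the names `batch_view`, `update_realtime_view` and `query`, as well as the sample events, are hypothetical and do not come from the talk's repository.

```python
from collections import Counter


def batch_view(all_data):
    """batch view = function(all data): recompute counts from the full master dataset."""
    return Counter(all_data)


def update_realtime_view(realtime_view, new_data):
    """real time view = function(real time view, new data): fold new events into the view."""
    return realtime_view + Counter(new_data)


def query(batch, realtime):
    """query = function(batch view, real time view): merge both views at read time."""
    return batch + realtime


# Hypothetical events: keys seen before the last batch run, then two later arrivals.
historical = ["a", "b", "a", "c"]
batch = batch_view(historical)

realtime = Counter()
realtime = update_realtime_view(realtime, ["a"])
realtime = update_realtime_view(realtime, ["b", "d"])

result = query(batch, realtime)
print(result)
```

Because counting is additive, the merged result equals a full recomputation over all events, which is why the speed layer only has to cover data that arrived after the last batch run.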
  14. 14. Trade-offs
      Full recomputation vs. partial recomputation, e.g. using Bloom filters
      Recomputational algorithms vs. incremental algorithms
      Additive algorithms vs. approximation algorithms, e.g. HyperLogLog for the count-distinct problem
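To make the count-distinct trade-off concrete, here is a compact HyperLogLog sketch in plain Python. It is an illustration under stated assumptions rather than a production implementation: the class name, the use of SHA-1 as the hash function, the precision `p=12` and the `user-N` identifiers are all choices made for this example. It shows that two sketches, for instance one from the batch layer and one from the speed layer, can be merged without revisiting the raw data.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a SHA-1 digest as the hash value (an assumption of this sketch).
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)                # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Register-wise maximum gives the sketch of the union of both inputs.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Hypothetical data: users seen by the batch layer and by the speed layer overlap.
batch_sketch = HyperLogLog()
for i in range(7000):
    batch_sketch.add(f"user-{i}")

realtime_sketch = HyperLogLog()
for i in range(5000, 10000):
    realtime_sketch.add(f"user-{i}")

merged = batch_sketch.merge(realtime_sketch)
print(round(merged.estimate()))  # approximate count of the 10,000 distinct users
```

An exact count would need the full set of identifiers from both layers; the sketch trades a small estimation error for a fixed-size structure that merges cheaply.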
  15. 15. Implementation of Lambda Architecture
  16. 16. https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
  17. 17. Apache Spark can be considered an integrated solution for processing on all lambda architecture layers
  18. 18. Apache Spark: a Brief History
  19. 19. Why Apache Spark? As of mid-2014, Spark is the most active Big Data project (chart: total contributors) http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
  20. 20. Core Concepts: automatically distribute data across the cluster and parallelize the operations performed on it
  21. 21. Components Stack https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
  22. 22. Spark Streaming
      • Enables scalable, high-throughput, fault-tolerant stream processing of live data streams
      • 50% of users consider it the most important part of Spark
      http://spark.apache.org/docs/latest/streaming-programming-guide.html
  23. 23. Streaming Architecture
      • micro-batch architecture
      • series of batch computations on small chunks of data
      • batch interval is configurable
      • exactly-once semantics
      http://spark.apache.org/docs/latest/streaming-programming-guide.html
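The micro-batch idea can be illustrated without Spark. The sketch below only shows how a configurable batch interval slices a stream of timestamped events into small chunks; it is not how Spark Streaming is implemented, and the function name `micro_batches` and the sample events are assumptions made for this example.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, payload) events into consecutive batches.

    Each batch covers `batch_interval` seconds. Intervals without events are
    skipped in this sketch.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        batches[int(timestamp // batch_interval)].append(payload)
    return [batches[index] for index in sorted(batches)]


# Hypothetical stream of five events arriving over three seconds.
events = [(0.2, "t1"), (0.9, "t2"), (1.1, "t3"), (2.5, "t4"), (2.7, "t5")]

batches = micro_batches(events, batch_interval=1.0)
for batch in batches:
    # Each chunk is processed as an ordinary small batch computation.
    print(len(batch), batch)
```

A larger interval produces fewer, bigger batches and higher latency; a smaller one does the opposite.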
  24. 24. Streaming Architecture http://spark.apache.org/docs/latest/streaming-programming-guide.html
  25. 25. https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
  26. 26. http://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
  27. 27. http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
  28. 28. DStream as a Continuous Series of RDDs http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
  29. 29. http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
  30. 30. Sample Application
      Provide statistics on the hashtags used in #morningatlohika tweets
      All time until today + right now
      https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
  31. 31. Batch View
      apache – 6
      architecture – 12
      aws – 3
      java – 4
      jeeconf – 7
      lambda – 6
      morningatlohika – 15
      simpleworkflow – 14
      spark – 5
      https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
  32. 32. Real-time View
      “Cool presentation by @tmatyashovsky about #lambda #architecture using #apache #spark at #morningatlohika”
      apache – 1
      architecture – 1
      morningatlohika – 1
      lambda – 1
      spark – 1
      https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
  33. 33. Batch View + Real-time View
      apache – 7
      architecture – 13
      aws – 3
      java – 4
      jeeconf – 7
      lambda – 7
      morningatlohika – 16
      simpleworkflow – 14
      spark – 6
      https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
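The numbers on slides 31 to 33 can be reproduced with a few lines of plain Python. This is a sketch only: the sample application in the repository is built on Spark, and the regular-expression hashtag extraction and the variable names used here are assumptions made for this example.

```python
import re
from collections import Counter

# Batch view as shown on slide 31.
batch_view = Counter({
    "apache": 6, "architecture": 12, "aws": 3, "java": 4, "jeeconf": 7,
    "lambda": 6, "morningatlohika": 15, "simpleworkflow": 14, "spark": 5,
})

# The tweet shown on slide 32.
tweet = ("Cool presentation by @tmatyashovsky about #lambda #architecture "
         "using #apache #spark at #morningatlohika")


def hashtags(text):
    """Return the lowercased hashtags in a tweet, without the leading '#'."""
    return re.findall(r"#(\w+)", text.lower())


# Real-time view: counts for tweets that arrived after the batch view was built.
realtime_view = Counter(hashtags(tweet))

# Query: merge both views on the fly.
merged_view = batch_view + realtime_view
for tag in sorted(merged_view):
    print(tag, merged_view[tag])
```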
  34. 34. Simplified Steps
      • Create batch view (.parquet) via Apache Spark
      • Cache batch view in Apache Spark
      • Start streaming application connected to Twitter
      • Focus on real-time #morningatlohika tweets*
      • Build incremental real-time views
      • Query, i.e. merge batch and real-time views on the fly
      * A stream from the file system (used for testing) can serve as a backup
      https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
  35. 35. Demo Time https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
  36. 36. http://shop.oreilly.com/product/0636920028512.do
  37. 37. http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
  38. 38. Structured Streaming in Spark 2.0
      The simplest way to perform streaming analytics is not having to reason about streaming
      Static DataFrame API = Infinite DataFrame API
      http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
  39. 39. Structured Streaming
      • Introduces a streaming API built on top of Spark SQL
      • Unifies streaming, interactive and batch queries
      logs = context.read.format("json")
                 .stream("s3://logs")
      logs.groupBy(logs.user_id)
          .agg(sum(logs.time))
          .write.format("jdbc")
          .stream("jdbc:mysql//...")
      https://www.youtube.com/watch?v=oXkxXDG0gNk
  40. 40. Instead of Epilogue
  41. 41. http://milinda.pathirage.org/kappa-architecture.com/
  42. 42. http://milinda.pathirage.org/kappa-architecture.com/
  43. 43. Taras Matyashovsky taras.matyashovsky@gmail.com @tmatyashovsky http://www.filevych.com/ Thank you!
  44. 44. References
      • http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
      • https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
      • https://www.manning.com/books/big-data
      • Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly Media)
      • http://spark.apache.org/docs/latest/streaming-programming-guide.html
      • http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
      • http://www.rittmanmead.com/2015/08/combining-spark-streaming-and-data-frames-for-near-real-time-log-analysis/
      • https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
      • https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html
      • http://spark.apache.org/docs/latest/cluster-overview.html
      • http://milinda.pathirage.org/kappa-architecture.com/
      • http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia
      • http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
      • http://thenewstack.io/spark-2-0-will-offer-interactive-querying-live-data/
      • http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
      • https://databricks.com/blog/2015/10/13/interactive-audience-analytics-with-spark-and-hyperloglog.html
      • https://www.youtube.com/watch?v=ZFBgY0PwUeY
      • https://www.youtube.com/watch?v=oXkxXDG0gN
