A lot of players on the market have built successful MapReduce workflows to daily process terabytes of historical data. But who wants to wait for 24h to get updated analytics? This talk will introduce you to the lambda architecture designed to take advantages of both batch and streaming processing methods. So we will leverage fast access to historical data with real-time streaming data using Spark (Core, SQL, Streaming), Twitter, Apache Parquet, etc.
Clear code plus intuitive demo are also included - https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Was presented on Morning@Lohika tech talks in Lviv on 14/05/2016.
Design by Yarko Filevych: http://www.filevych.com/
3. Apache Hadoop: A Brief History
http://www.slideshare.net/fadicce/hadoop-user-group-uae-meeting
4. A lot of customers implemented
successful Hadoop-based M/R pipelines
which are operating today
5. Examples from Real Life
• Oozie workflow, operates daily and processes up to
150 TB to generate analytics
• bash managed workflow, operates daily and processes
up to 8 TB to generate analytics
6. It Is 2016 Now!
• Making decisions faster is more valuable
• Kafka, Storm, Trident, Samza, Spark, Flink, Parquet,
Avro, Cloud providers, etc.
7. Examples from Real Life
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
8. Lambda Architecture
A data-processing architecture
designed to handle massive quantities of data
by taking advantage of both
batch and stream processing methods
http://lambda-architecture.net/
11. Layers of Lambda Architecture
Batch layer
• manages the master dataset (an immutable, append-only set of
raw data)
• pre-computes the batch views
Serving layer
• indexes the batch views so that they can be queried in ad-hoc with
low-latency
Speed layer
• deals with recent data only
http://lambda-architecture.net/
14. Trade-offs
Full recomputation vs. partial recomputation
e.g. using Bloom filters
Recomputational algorithms vs. incremental algorithms
Additive algorithms vs. approximation algorithms
e.g. HyperLogLog for count-distinct problem
21. Why Apache Spark?
As of mid 2014,
Spark is the most active Big Data project
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Total Contributors
24. Enables scalable, high-throughput, fault-tolerant
stream processing of live data streams
50% users consider most important part of Spark
Spark Streaming
http://spark.apache.org/docs/latest/streaming-programming-guide.html
25. Streaming Architecture
• micro-batch architecture
• series of batch computations on small chunks of data
• batch interval is configurable
• exactly once semantics
http://spark.apache.org/docs/latest/streaming-programming-guide.html
32. Provide hashtags statistics
used in a #morningatlohika tweets
All time till today + right now
Sample Application
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
36. Simplified Steps
• Create batch view (.parquet) via Apache Spark
• Cache batch view in Apache Spark
• Start streaming application connected to Twitter
• Focus on real-time #morningatlohika tweets*
• Build incremental real-time views
• Query, i.e. merge batch and real-time views on a fly
* Stream from file system (used for testing) can be used as a backup
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
43. Structured Streaming in Spark 2.0
The simplest way to perform streaming analytics
is not having to reason about streaming
Static DataFrame API = Infinite DataFrame API
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
44. Structured Streaming
• Introduces streaming API built on top of Spark SQL
• Unifies streaming, interactive and batch queries
logs = context.read.format("json")
.stream("s3://logs")
logs.groupBy(logs.user_id)
.agg(sum(logs.time))
.write.format("jdbc")
.stream("jdbc:mysql//...")
https://www.youtube.com/watch?v=oXkxXDG0gNk
49. References
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
https://www.manning.com/books/big-data
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly
Media)
http://spark.apache.org/docs/latest/streaming-programming-guide.html
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
http://www.rittmanmead.com/2015/08/combining-spark-streaming-and-data-frames-for-near-real-time-log-analysis/
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://milinda.pathirage.org/kappa-architecture.com/
http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
http://thenewstack.io/spark-2-0-will-offer-interactive-querying-live-data/
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
https://databricks.com/blog/2015/10/13/interactive-audience-analytics-with-spark-and-hyperloglog.html
https://www.youtube.com/watch?v=ZFBgY0PwUeY
https://www.youtube.com/watch?v=oXkxXDG0gN
http://milinda.pathirage.org/kappa-architecture.com/
Editor's Notes
Receiver:
Task that collects data from the input source and represents it as RDDs
Is launched automatically for each input source
Replicates data to another executor for fault tolerance
Cluster Manager: Standalone, Apache Mesos, Hadoop Yarn
Cluster Manager should be chosen and configured properly
Monitoring via web UI(s) and metrics
Web UI:
master web UI
worker web UI
driver web UI - available only during execution
history server - spark.eventLog.enabled = true
Metrics based on Coda Hale Metrics library. Can be reported via HTTP, JMX, and CSV files.
Spark 2.0:
Project Tungsten 2.0
Whole stage code generation
Optimized input / output -> Parquet + built-in cache
Spark Streaming
DataFrame API unified with Dataset API