What is Fast Data & how the SMACK
stack plays a major role in implementing
a Fast Data strategy
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.
De oude Prodentfabriek
Trivento Summercamp 2016
Amersfoort
Introduction
Introduction: Who Am I?
Agenda
A bit of history: Big Data Processing
What is Fast Data?
Batch Systems vs Streaming Systems & Streaming Evolution
Event Log, Message Queues
A Fast Data Architecture
Example Application
Last warning...
Data Processing
Batch processing: processing done on a bounded dataset.
Stream processing (streaming): processing done on an unbounded dataset.
Data items are pushed or pulled.
Two categories of systems: batch vs streaming systems.
Big Data - The story
Internet-scale apps moved data sizes from gigabytes to petabytes.
Once upon a time there were traditional RDBMSs like Oracle and data
warehouses, but volume, velocity and variety changed the game.
Big Data - The story
MapReduce was a major breakthrough (Google published the seminal paper in 2004).
The Nutch project already had an implementation in 2005.
In 2006 it became a subproject of Lucene under the name Hadoop.
In 2008 Yahoo brought Hadoop to production with a 10K cluster. The same year it
became a top-level Apache project.
Hadoop is good for batch processing.
Big Data - The story
Word Count example - Inverted Index.
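[Figure: MapReduce word count. The input documents (doc1, doc2, ..., doc300) are divided into Split 1 ... Split N. MAP emits one (word, 1) pair per occurrence, e.g. (w1,1), (w20,1), (w41,1). Shuffle groups the pairs by word: (w1, (1,1,1…)), (w41, (1,1,…)). REDUCE sums each group, e.g. (w1, 13).]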
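The map, shuffle and reduce stages of the word count example can be sketched in plain Scala collections. This is only an in-memory illustration, not Hadoop code: the object name `WordCountSketch`, the function names and the tokenisation rule (lower-casing and splitting on non-word characters) are assumptions made for the example.

```scala
object WordCountSketch {
  // MAP: emit a (word, 1) pair for every word found in one input split.
  def mapPhase(split: Seq[String]): Seq[(String, Int)] =
    split
      .flatMap(doc => doc.toLowerCase.split("\\W+").toSeq)
      .filter(_.nonEmpty)
      .map(word => (word, 1))

  // SHUFFLE: group the emitted pairs by word, giving (word, (1, 1, ...)).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // REDUCE: sum the ones of each group, giving (word, count).
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  // Each inner Seq plays the role of one split of documents.
  def wordCount(splits: Seq[Seq[String]]): Map[String, Int] =
    reducePhase(shuffle(splits.flatMap(mapPhase)))

  def main(args: Array[String]): Unit = {
    val splits = Seq(Seq("fast data", "big data"), Seq("data in motion"))
    println(wordCount(splits))
  }
}
```

In a real cluster the map calls run in parallel on different machines and the shuffle moves data over the network; here all three stages run in one process.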
Big Data - The story
Giuseppe DeCandia et al., "Dynamo: Amazon's Highly Available Key-value
Store", changed the database world in 2007.
NoSQL databases, along with general-purpose systems like Hadoop, solve problems
that cannot be solved with traditional RDBMSs.
Technology facts: cheap memory, SSDs, HDDs are the new tape, more CPUs
over more powerful CPUs.
Big Data - The story
There is a major shift in the industry as batch processing is no longer enough.
Batch jobs usually take hours, if not days, to complete. In many applications that
is not acceptable.
Big Data - The story
The trend now is near-real-time computation, which implies streaming
algorithms and needs new semantics. Fast Data (data in motion) & Big
Data (data at rest) at the same time.
The enterprise needs to get smarter: all major players across industries
use ML on top of massive datasets to make better decisions.
Images: https://www.tesla.com/sites/default/files/pictures/thumbs/model_s/red_models.jpg?201501121530
https://i.ytimg.com/vi/cj83dL72cvg/maxresdefault.jpg
Big Data - The story
OpsClarity report:
92% plan to increase their investment in stream processing applications in the
next year
79% plan to reduce or eliminate investment in batch processing
32% use real time analysis to power core customer-facing applications
44% agreed that it is tedious to correlate issues across the pipeline
68% identified lack of experience and underlying complexity of new data
frameworks as their barrier to adoption
http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html
Big Data - The story
Image: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html
Big Data - The story
In the OpsClarity report:
● Apache Kafka is the most popular broker technology (ingestion queue).
● HDFS is the most used data sink.
● Apache Spark is the most popular data processing tool.
Big Data Landscape
Image: http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png
Big Data System
A Big Data System must have at least the following components at its core:
A distributed file system (DFS) such as S3 or HDFS, or a distributed database system (DDS).
A distributed data processing tool such as Spark or MapReduce.
Tools and services to manage the previous systems.
Batch Systems - The Hadoop Ecosystem
YARN (Yet Another Resource Negotiator) was deployed in production at Yahoo in
March 2013.
The same year Cloudera, the dominant Hadoop vendor, embraced Spark as the
next-generation replacement for MapReduce.
Image: Lightbend Inc.
Batch Systems - The Hadoop Ecosystem
Hadoop clusters have been the gold standard for big data from ~2008 to the present (the project started back in 2005).
Strengths:
Lowest CapEx system for Big Data.
Excellent for ingesting and integrating diverse datasets.
Flexible: from classic analytics (aggregations and data warehousing) to machine learning.
Batch Systems - The Hadoop Ecosystem
Weaknesses:
Complex administration.
YARN can’t manage all distributed services.
MapReduce has poor performance, a difficult programming model, and doesn’t support stream
processing.
Analyzing Infinite Data Streams
What does it mean to run a SQL query on an unbounded dataset?
How should I deal with late data?
What kind of time measurement should I use: event time, processing time or
ingestion time?
How accurate are computations on unbounded datasets compared with bounded ones?
Which algorithms are suitable for streaming computations?
Analyzing Infinite Data Streams
Two cases for processing:
Single event processing: event transformation, trigger an alarm on an error event
Event aggregations: summary statistics, group-by, join and similar queries. For example
compute the average temperature for the last 5 minutes from a sensor data stream.
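The two cases above can be contrasted with a small sketch in plain Scala. The names `Reading`, `isAlarm` and `averageLastFiveMinutes` are hypothetical, and the readings are held in memory rather than consumed from a real stream.

```scala
object EventProcessingSketch {
  final case class Reading(timestampMs: Long, celsius: Double)

  val FiveMinutesMs: Long = 5 * 60 * 1000L

  // Single event processing: the decision depends on one event only.
  def isAlarm(reading: Reading, thresholdCelsius: Double): Boolean =
    reading.celsius > thresholdCelsius

  // Event aggregation: average over the readings in (now - 5 min, now].
  // Returns None when no reading falls inside that interval.
  def averageLastFiveMinutes(readings: Seq[Reading], nowMs: Long): Option[Double] = {
    val recent = readings.filter { r =>
      r.timestampMs > nowMs - FiveMinutesMs && r.timestampMs <= nowMs
    }
    if (recent.isEmpty) None
    else Some(recent.map(_.celsius).sum / recent.size)
  }
}
```

The aggregation needs state (the recent readings) and a notion of time, which is what the following slides on windowing address.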
Analyzing Infinite Data Streams
Event aggregation introduces the concept of windowing with respect to the chosen
notion of time:
Event time (the time that events happen): Important for most use cases where context and
correctness matter at the same time. Example: billing applications, anomaly detection.
Processing time (the time they are observed during processing): Use cases where I only care
about what I process in a window. Example: accumulated clicks on a page per second.
System arrival or ingestion time (the time that events arrived at the streaming system).
Ideally event time = processing time. In reality there is skew.
Analyzing Infinite Data Streams
Windows come in different flavors:
Tumbling windows discretize a stream into non-overlapping windows.
Sliding windows slide over the stream of data and may overlap.
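A minimal sketch of how an element's timestamp could be mapped to tumbling and sliding windows. It assumes non-negative timestamps in milliseconds and positive `size` and `slide`; the `Window` type and both function names are made up for the illustration and are not a streaming engine's API.

```scala
object WindowAssignmentSketch {
  // A window covers the half-open interval [start, end) in milliseconds.
  final case class Window(start: Long, end: Long)

  // Tumbling: every timestamp falls into exactly one window of length `size`.
  def tumbling(timestamp: Long, size: Long): Window = {
    val start = timestamp - (timestamp % size)
    Window(start, start + size)
  }

  // Sliding: a new window of length `size` starts every `slide` ms, so one
  // timestamp can belong to several overlapping windows.
  def sliding(timestamp: Long, size: Long, slide: Long): List[Window] = {
    val lastStart = timestamp - (timestamp % slide)
    Iterator
      .iterate(lastStart)(start => start - slide)
      .takeWhile(start => start > timestamp - size)
      .map(start => Window(start, start + size))
      .toList
  }
}
```

With `size == slide` the sliding assignment degenerates to the tumbling one, which is one way to see tumbling windows as a special case.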
Analyzing Infinite Data Streams
Watermarks indicate that no elements with a timestamp older than or equal to the
watermark timestamp should arrive for the specific window of data.
Triggers decide when the window is evaluated or purged.
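How a watermark can drive a trigger is sketched below for tumbling event-time windows. This is a simplified model under stated assumptions: the class `WindowedMax` is hypothetical, events at or below the current watermark are simply dropped as late, and a window fires and is purged as soon as the watermark covers its last timestamp. Real engines offer more options, such as allowed lateness.

```scala
object WatermarkSketch {
  final case class Event(timestamp: Long, value: Double)

  // Buffers events per tumbling window of length `size` (ms) and evaluates
  // a window once the watermark says no more of its events are expected.
  final class WindowedMax(size: Long) {
    private var buffers = Map.empty[Long, Vector[Event]] // window start -> events
    private var currentWatermark = Long.MinValue

    // Returns false for late events (timestamp <= watermark), which are dropped.
    def onEvent(event: Event): Boolean =
      if (event.timestamp <= currentWatermark) false
      else {
        val start = event.timestamp - (event.timestamp % size)
        buffers = buffers.updated(start, buffers.getOrElse(start, Vector.empty) :+ event)
        true
      }

    // Trigger: fire and purge every window whose last timestamp (end - 1)
    // is covered by the watermark. Returns (window start, max value) pairs.
    def onWatermark(watermark: Long): List[(Long, Double)] = {
      currentWatermark = math.max(currentWatermark, watermark)
      val (ready, pending) =
        buffers.partition { case (start, _) => start + size - 1 <= currentWatermark }
      buffers = pending
      ready.toList.sortBy(_._1).map { case (start, events) =>
        (start, events.map(_.value).max)
      }
    }
  }
}
```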
Analyzing Infinite Data Streams
Given the advances in streaming we can:
Trade off latency against cost and accuracy.
In certain use cases, replace batch processing with streaming.
Analyzing Infinite Data Streams
Recent advances in streaming are a result of pioneering work:
MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.
The Dataflow Model: A Practical Approach to Balancing Correctness,
Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data
Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp.
1792-1803
Analyzing Infinite Data Streams
Apache Beam is the open source successor of Google’s Dataflow.
It is becoming the standard API for streaming and provides the advanced semantics
that current streaming applications need.
Streaming Systems Architecture
The user provides a graph of computations through a high-level API, where data
flows on the edges of this graph. Each vertex is an operator which executes
a user operation or computation. For example: stream.map().keyBy()...
Operators can run in multiple instances and preserve state (unlike batch
processing where we have immutable datasets).
State can be persisted and restored in the presence of failures.
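The idea of an operator that keeps state which can be persisted and restored can be sketched as follows. `CountPerKey`, `snapshot` and `restore` are hypothetical names chosen for the illustration; a real engine would write the snapshot to durable storage and coordinate it across all operator instances.

```scala
object StatefulOperatorSketch {
  // A keyed operator that counts the elements seen per key.
  final class CountPerKey {
    private var counts = Map.empty[String, Long]

    // Process one element and emit the updated (key, count) pair.
    def process(key: String): (String, Long) = {
      val updated = counts.getOrElse(key, 0L) + 1
      counts = counts.updated(key, updated)
      (key, updated)
    }

    // The state is an immutable Map, so handing it out is a consistent copy.
    def snapshot(): Map[String, Long] = counts

    // After a failure a fresh instance continues from the last snapshot.
    def restore(saved: Map[String, Long]): Unit = counts = saved
  }
}
```

A new instance that calls `restore` with the last snapshot continues counting where the failed one stopped, provided the input is replayed from the point of the snapshot.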
Analyzing Infinite Data Streams - Flink Example
// Sensor model shared by the source and the job.
sealed trait SensorType { def stype: String }
case object TemperatureSensor extends SensorType { val stype = "TEMP" }
case object HumiditySensor extends SensorType { val stype = "HUM" }

// One reading; timestamp is the event time in milliseconds.
case class SensorData(sensorId: String, value: Double, sensorType: SensorType, timestamp: Long)
https://github.com/skonto/trivento-summercamp-2016
Analyzing Infinite Data Streams - Flink Example
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
import org.apache.flink.streaming.api.watermark.Watermark
import scala.util.Random

// Emits random readings for `numberOfSensors` sensors, one every 10 ms.
// numberOfElements = -1 means "run until cancelled".
@SerialVersionUID(1L)
class SensorDataSource(val sensorType: SensorType, val numberOfSensors: Int,
                       val watermarkTag: Int, val numberOfElements: Int = -1)
    extends SourceFunction[SensorData] {

  require(numberOfSensors > 0)
  require(watermarkTag > 0)
  require(numberOfElements >= -1)

  @volatile var isRunning = true
  var counter = 1
  var timestamp = 0L // event time in milliseconds, starts at 0
  val randomGen = Random

  lazy val initialReading: Double = sensorType match {
    case TemperatureSensor => 27.0
    case HumiditySensor    => 0.75
  }

  // Keep going until cancelled or, if a limit is set, until it is reached.
  private def shouldContinue: Boolean =
    isRunning && (numberOfElements == -1 || counter <= numberOfElements)

  override def run(ctx: SourceContext[SensorData]): Unit = {
    while (shouldContinue) {
      Thread.sleep(10) // send sensor data every 10 milliseconds
      val dataId = randomGen.nextInt(numberOfSensors) + 1
      val reading = initialReading + randomGen.nextGaussian() / initialReading
      val data = SensorData(dataId.toString, reading, sensorType, timestamp)
      ctx.collectWithTimestamp(data, timestamp)
      timestamp += 1
      if (timestamp % watermarkTag == 0) { // one watermark every watermarkTag ms of event time
        ctx.emitWatermark(new Watermark(timestamp))
      }
      counter += 1
    }
  }

  override def cancel(): Unit = {
    isRunning = false // no other cleanup needed
  }
}
The Source
https://github.com/skonto/trivento-summercamp-2016
Analyzing Infinite Data Streams - Flink Example
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object SensorSimple {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // set default env parallelism for all operators
    env.setParallelism(2)
    // window by the timestamps assigned in the source, not by wall-clock time
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val numberOfSensors = 2
    val watermarkTag = 10
    val numberOfElements = 1000

    val sensorDataStream =
      env.addSource(new SensorDataSource(TemperatureSensor, numberOfSensors, watermarkTag, numberOfElements))
    sensorDataStream.writeAsText("inputData.txt")

    // 10 ms tumbling event-time windows per sensor
    val windowedKeyed = sensorDataStream
      .keyBy(data => data.sensorId)
      .timeWindow(Time.milliseconds(10))

    windowedKeyed.max("value")
      .writeAsText("outputMaxValue.txt")
    windowedKeyed.apply(new SensorAverage())
      .writeAsText("outputAverage.txt")

    env.execute("Sensor Data Simple Statistics")
  }
}

// Emits one record per key and window whose value is the window average.
class SensorAverage extends WindowFunction[SensorData, SensorData, String, TimeWindow] {
  def apply(key: String, window: TimeWindow, input: Iterable[SensorData], out: Collector[SensorData]): Unit = {
    if (input.nonEmpty) {
      val average = input.map(_.value).sum / input.size
      out.collect(input.head.copy(value = average))
    }
  }
}
The Job
https://github.com/skonto/trivento-summercamp-2016
Analyzing Infinite Data Streams - Flink Example
Operators run the operations defined by the graph of
the streaming computation. Example operators:
KeyBy, Map, FlatMap, etc.
[Figure: Operator 1 and Operator 2, two instances of the same operator with parallelism 2 (previous example). Each instance receives part of the elements with timestamps 0 to 9 before Watermark 1 (10); Watermark N carries timestamp 10*N. A time axis (0, 1, 2, ..., 22, ...) is divided into window 1 and window 2; the outputs are labelled file1 and file2.]
Streaming vs Batch Systems
Metric                                      Batch                   Streaming
Data size per job                           TB to PB                MB to TB (in flight)
Time between data arrival and processing    Many minutes to hours   Microseconds to minutes
Job execution times                         Minutes to hours        Microseconds to minutes
Event Log as the Core Abstraction
Logging is everywhere:
Write-ahead log (WAL) in databases for durability.
The distributed log can be seen as the data structure which models the problem of consensus.
It reduces the problem of making multiple machines all do the same thing to the problem of
implementing a distributed consistent log. The log feeds the processes their input (state-machine
replication model).
Real-time streaming implementations use logs as a natural means of recording events as they
are processed according to the computation graph. This assists in implementing consistency
algorithms like ABS (Asynchronous Barrier Snapshotting) in Apache Flink and other functionality for
all major streaming engines like Google Dataflow, Apache Storm, Apache Samza.
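The state-machine replication idea can be shown with a tiny sketch: if every replica applies the same log of commands in the same order with a deterministic transition function, all replicas reach the same state. The bank-balance commands and the names `applyCommand` and `replay` are assumptions made for the example; agreeing on the log itself (consensus) is outside the sketch.

```scala
object ReplicatedLogSketch {
  sealed trait Command
  final case class Deposit(amount: Long) extends Command
  final case class Withdraw(amount: Long) extends Command

  // Deterministic transition: the same state and command always give the
  // same result. A withdrawal larger than the balance is ignored.
  def applyCommand(balance: Long, command: Command): Long = command match {
    case Deposit(amount)  => balance + amount
    case Withdraw(amount) => if (amount <= balance) balance - amount else balance
  }

  // A replica's state is fully determined by the log prefix it has consumed.
  def replay(log: Seq[Command]): Long =
    log.foldLeft(0L)((balance, command) => applyCommand(balance, command))
}
```

A replica that has only consumed a prefix of the log lags behind but is never inconsistent: it holds a state that every other replica has also passed through.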
Event Log as the Core Abstraction
Enabler of architecture patterns:
Event Sourcing (ES)
Command Query Responsibility Segregation (CQRS)
Message Queues as the Integration Tool
FIFO data structures, the natural way to process logs.
User data is organised in topics; each topic has its own queue.
Benefits
Decouples consumers from producers
Arbitrary number of producers and consumers are supported
Easy to use
Kafka is the most popular implementation for Big Data Systems.
Message Queues - Kafka
Kafka is a “...distributed, partitioned, replicated commit log service”
No deletion of messages on read, allows replay of data.
Production tested at LinkedIn at scale.
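Why keeping messages after a read allows replay can be illustrated with an in-memory, single-partition log addressed by offset. This is not the Kafka client API: `Partition`, `append` and `read` are hypothetical names, and replication, partitioning across brokers and retention are left out.

```scala
object CommitLogSketch {
  // One partition of a topic: an append-only sequence addressed by offset.
  final class Partition[A] {
    private var records = Vector.empty[A]

    // Appends a record at the end of the log and returns its offset.
    def append(record: A): Long = {
      records = records :+ record
      records.size - 1L
    }

    // Reading never removes records. Each consumer tracks its own offset,
    // so it can re-read from any earlier position.
    def read(fromOffset: Long, maxRecords: Int): Vector[A] =
      records.slice(fromOffset.toInt, fromOffset.toInt + maxRecords)
  }
}
```

Because the consumer, not the log, owns the read position, two consumers can read the same partition independently and a restarted consumer can replay from an older offset.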
Delivery/Processing Semantics
In distributed systems failure is part of the game. What semantics can I achieve for message delivery?
At-most-once delivery: each message sent is delivered zero or one times; messages may be lost but not duplicated.
At-least-once delivery: for each message sent, potentially multiple attempts are made at delivering it,
such that at least one succeeds; messages may be duplicated but not lost.
Exactly-once delivery: for each message sent, exactly one delivery is made to the recipient; the
message can neither be lost nor duplicated.
In theory it is impossible to have exactly-once delivery.
In practice we might care more about exactly-once state changes combined with at-least-once delivery. Example:
keeping state at some operator of the streaming graph.
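One common way to get exactly-once state changes on top of at-least-once delivery is to make the state update idempotent. The sketch below assumes every message carries a unique id and that the set of applied ids is kept with the state; `Message`, `State` and `applyOnce` are hypothetical names, and the unbounded `seen` set is a simplification.

```scala
object ExactlyOnceStateSketch {
  final case class Message(id: Long, amount: Long)

  // Operator state: the running total plus the ids already applied.
  final case class State(total: Long, seen: Set[Long])

  val empty: State = State(0L, Set.empty)

  // Applying the same message a second time leaves the state unchanged,
  // so duplicates caused by redelivery do not affect the result.
  def applyOnce(state: State, message: Message): State =
    if (state.seen.contains(message.id)) state
    else State(state.total + message.amount, state.seen + message.id)
}
```

Folding a delivery sequence that contains duplicates gives the same total as folding the duplicate-free sequence, which is the property the slide calls exactly-once state changes.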
The SMACK Stack
Technologies which, combined, deliver high-performing streaming systems:
Spark
Mesos
Akka
Cassandra
Kafka
A Fast Data Architecture using the SMACK stack
Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
Example IoT Application
Adding ML support to the IoT Application
Anomaly detection
Voice interface
Image classification
Recommendations
Automatic tuning of the IoT environment
What about Lambda Architecture?
Eventually we want to replace it; it is more of a traditional model.
Problems
Hard to maintain
Duplication of code & systems
Special systems for unifying views
In certain cases we can replace it with streaming based architectures.
Streaming Implementations Status
Apache Spark: Structured Streaming in v2 starts the improvement of the
streaming engine. It is still based on micro-batches, but event-time support was
added.
Apache Flink: SQL API supported from v0.9 onwards. Important features are still
on the roadmap: scaling streaming jobs, Mesos support, dynamic allocation.
Picking the Right Tool for Streaming
Criteria to choose:
Processing semantics (strong consistency is needed for correctness)
Latency guarantees
Deployment / Operation
Ecosystem built around it
Complex event processing (CEP)
Batch & Streaming API support
Community & Support
Picking the Right Tool for Streaming
Some tips
Pick Flink if you need sub-second latency and Beam support.
Pick Spark Streaming for its integration with the Spark ML libraries, its micro-batch mode, which is
ideal for training models, and its mature deployment capabilities.
Pick Gearpump for materializing Akka Streams in a distributed fashion.
Pick Kafka Streams for low-level, simple transformations of Kafka messages (it is a distributed
solution out of the box). Check the Confluent Platform for many useful tools around Kafka.
Questions?
Thank you!
References
Watermarks: Time and progress in streaming dataflow and beyond: Big Data Conference - Strata + Hadoop World, May 31 - June 3,
2016, London, UK
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-
Order Data Processing
MillWheel: Fault-Tolerant Stream Processing at Internet Scale
Executive Summary: Data Growth, Business Opportunities, and the IT Imperatives | The Digital Universe of Opportunities: Rich Data
and the Increasing Value of the Internet of Things
Asynchronous Distributed Snapshots for Distributed Dataflows | the morning paper
State machine replication - Wikipedia, the free encyclopedia
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-
Order Data Processing | the morning paper
The Log: What every software engineer should know about real-time data's unifying abstraction | LinkedIn Engineering
Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
How Apache Flink™ enables new streaming applications – data Artisans

Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationWave PLM
 
Workforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdfWorkforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdfDeskTrack
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Andrea Goulet
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabbereGrabber
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1KnowledgeSeed
 
What need to be mastered as AI-Powered Java Developers
What need to be mastered as AI-Powered Java DevelopersWhat need to be mastered as AI-Powered Java Developers
What need to be mastered as AI-Powered Java DevelopersEmilyJiang23
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAlluxio, Inc.
 
IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024vaibhav130304
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...rajkumar669520
 
How to pick right visual testing tool.pdf
How to pick right visual testing tool.pdfHow to pick right visual testing tool.pdf
How to pick right visual testing tool.pdfTestgrid.io
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024Shane Coughlan
 
CompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfCompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfFurqanuddin10
 
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Soroosh Khodami
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Krakówbim.edu.pl
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationHelp Desk Migration
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfkalichargn70th171
 
APVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purityAPVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purityamy56318795
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfsteffenkarlsson2
 
SQL Injection Introduction and Prevention
SQL Injection Introduction and PreventionSQL Injection Introduction and Prevention
SQL Injection Introduction and PreventionMohammed Fazuluddin
 

Recently uploaded (20)

Crafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM IntegrationCrafting the Perfect Measurement Sheet with PLM Integration
Crafting the Perfect Measurement Sheet with PLM Integration
 
Workforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdfWorkforce Efficiency with Employee Time Tracking Software.pdf
Workforce Efficiency with Employee Time Tracking Software.pdf
 
5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand5 Reasons Driving Warehouse Management Systems Demand
5 Reasons Driving Warehouse Management Systems Demand
 
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
Entropy, Software Quality, and Innovation (presented at Princeton Plasma Phys...
 
How to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabberHow to install and activate eGrabber JobGrabber
How to install and activate eGrabber JobGrabber
 
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
A Python-based approach to data loading in TM1 - Using Airflow as an ETL for TM1
 
What need to be mastered as AI-Powered Java Developers
What need to be mastered as AI-Powered Java DevelopersWhat need to be mastered as AI-Powered Java Developers
What need to be mastered as AI-Powered Java Developers
 
AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024IT Software Development Resume, Vaibhav jha 2024
IT Software Development Resume, Vaibhav jha 2024
 
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
 
How to pick right visual testing tool.pdf
How to pick right visual testing tool.pdfHow to pick right visual testing tool.pdf
How to pick right visual testing tool.pdf
 
OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024OpenChain @ LF Japan Executive Briefing - May 2024
OpenChain @ LF Japan Executive Briefing - May 2024
 
CompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdfCompTIA Security+ (Study Notes) for cs.pdf
CompTIA Security+ (Study Notes) for cs.pdf
 
Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024Secure Software Ecosystem Teqnation 2024
Secure Software Ecosystem Teqnation 2024
 
Agnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in KrakówAgnieszka Andrzejewska - BIM School Course in Kraków
Agnieszka Andrzejewska - BIM School Course in Kraków
 
A Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data MigrationA Guideline to Zendesk to Re:amaze Data Migration
A Guideline to Zendesk to Re:amaze Data Migration
 
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdfA Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
A Comprehensive Appium Guide for Hybrid App Automation Testing.pdf
 
APVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purityAPVP,apvp apvp High quality supplier safe spot transport, 98% purity
APVP,apvp apvp High quality supplier safe spot transport, 98% purity
 
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdfStrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
StrimziCon 2024 - Transition to Apache Kafka on Kubernetes with Strimzi.pdf
 
SQL Injection Introduction and Prevention
SQL Injection Introduction and PreventionSQL Injection Introduction and Prevention
SQL Injection Introduction and Prevention
 

Trivento summercamp fast data 9/9/2016

  • 1. What is FastData & How the SMACK stack plays a major role by implementing a Fast Data strategy. Stavros Kontopoulos, Senior Software Engineer @ Lightbend, M.Sc. De oude Prodentfabriek, Trivento Summercamp 2016, Amersfoort.
  • 3. Agenda
      • A bit of history: Big Data Processing
      • What is Fast Data?
      • Batch Systems vs Streaming Systems & Streaming Evolution
      • Event Log, Message Queues
      • A Fast Data Architecture
      • Example Application
  • 5. Data Processing. Batch processing: processing done on a bounded dataset. Stream processing (streaming): processing done on an unbounded dataset. Data items are pushed or pulled. Two categories of systems: batch vs streaming systems.
  • 6. Big Data - The story. Internet-scale apps moved data size from gigabytes to petabytes. Once upon a time there were traditional RDBMSs like Oracle and data warehouses, but volume, velocity and variety changed the game.
  • 7. Big Data - The story. MapReduce was a major breakthrough (Google published the seminal paper in 2004). The Nutch project already had an implementation in 2005. In 2006 it became a subproject of Lucene under the name Hadoop. In 2008 Yahoo brought Hadoop to production with a 10K cluster; the same year it became a top-level Apache project. Hadoop is good for batch processing.
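  • 8. Big Data - The story. Word Count example - Inverted Index. [Diagram: input splits (Split 1 … Split N, holding doc1, doc2, … doc300, doc100) feed MAP, which emits pairs such as (w1,1) … (w20,1), (w41,1) … (w1,1); Shuffle groups the values per key, e.g. (w1, (1,1,1…)) ... (w41, (1,1,…)) ...; REDUCE emits the totals, e.g. (w1, 13) ... (w1, 3) ...]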
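The three stages in the word count diagram can be sketched with plain Scala collections. This is only an in-memory illustration of the map, shuffle and reduce steps, not Hadoop code; the object and function names (`WordCountSketch`, `mapPhase`, `shuffle`, `reducePhase`) are made up for the example.

```scala
object WordCountSketch {
  // Map phase: emit a (word, 1) pair for every word in one document.
  def mapPhase(doc: String): Seq[(String, Int)] =
    doc.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq.map(w => (w, 1))

  // Shuffle phase: group the emitted ones by word, e.g. (w1, (1, 1, 1)).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Reduce phase: sum the grouped ones per word, e.g. (w1, 3).
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  // The whole job: map every document, shuffle, then reduce.
  def wordCount(docs: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(docs.flatMap(mapPhase)))
}
```

In a real cluster each split would be mapped on a different machine and the shuffle would move data over the network; here everything runs in one process.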
  • 9. Big Data - The story. Giuseppe DeCandia et al., "Dynamo: Amazon's highly available key-value store" changed the database world in 2007. NoSQL databases, along with general systems like Hadoop, solve problems that cannot be solved with traditional RDBMSs. Technology facts: cheap memory, SSDs, HDDs are the new tape, more CPUs over more powerful CPUs.
  • 10. Big Data - The story. There is a major shift in the industry, as batch processing is not enough any more. Batch jobs usually take hours, if not days, to complete; in many applications that is not acceptable.
  • 11. Big Data - The story. The trend now is near-real-time computation, which implies streaming algorithms and needs new semantics. Fast Data (data in motion) & Big Data (data at rest) at the same time. The enterprise needs to get smarter: all major players across industries use ML on top of massive datasets to make better decisions. Images: https://www.tesla.com/sites/default/files/pictures/thumbs/model_s/red_models.jpg?201501121530 https://i.ytimg.com/vi/cj83dL72cvg/maxresdefault.jpg
  • 12. Big Data - The story. OpsClarity report:
      • 92% plan to increase their investment in stream processing applications in the next year
      • 79% plan to reduce or eliminate investment in batch processing
      • 32% use real time analysis to power core customer-facing applications
      • 44% agreed that it is tedious to correlate issues across the pipeline
      • 68% identified lack of experience and underlying complexity of new data frameworks as their barrier to adoption
    Source: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html
  • 13. Big Data - The story. Image: http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html
  • 14. Big Data - The story. In the OpsClarity report:
      • Apache Kafka is the most popular broker technology (ingestion queue).
      • HDFS is the most used data sink.
      • Apache Spark is the most popular data processing tool.
  • 15. Big Data Landscape. Image: http://mattturck.com/wp-content/uploads/2016/03/Big-Data-Landscape-2016-v18-FINAL.png
  • 16. Big Data System. A Big Data system must have at least the following components at its core:
      • A distributed file system (DFS) such as S3 or HDFS, or a distributed database system (DDS).
      • A distributed data processing tool such as Spark or MapReduce.
      • Tools and services to manage the previous systems.
  • 17. Batch Systems - The Hadoop Ecosystem. YARN (Yet Another Resource Negotiator) was deployed in production at Yahoo in March 2013. In the same year Cloudera, the dominant Hadoop vendor, embraced Spark as the next-generation replacement for MapReduce. Image: Lightbend Inc.
  • 18. Batch Systems - The Hadoop Ecosystem. Hadoop clusters: the gold standard for big data from ~2008 to the present (started back in 2005). Strengths:
      • Lowest CapEx system for Big Data.
      • Excellent for ingesting and integrating diverse datasets.
      • Flexible: from classic analytics (aggregations and data warehousing) to machine learning.
  • 19. Batch Systems - The Hadoop Ecosystem. Weaknesses:
      • Complex administration.
      • YARN can't manage all distributed services.
      • MapReduce has poor performance and a difficult programming model, and doesn't support stream processing.
  • 20. Analyzing Infinite Data Streams
      • What does it mean to run a SQL query on an unbounded dataset?
      • How should I deal with late data?
      • What kind of time measurement should I use: event time, processing time or ingestion time?
      • How accurate are computations on unbounded datasets compared to bounded datasets?
      • Which algorithms are suitable for streaming computations?
  • 21. Analyzing Infinite Data Streams. Two cases for processing:
      • Single event processing: event transformation, triggering an alarm on an error event.
      • Event aggregations: summary statistics, group-by, join and similar queries. For example, compute the average temperature for the last 5 minutes from a sensor data stream.
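The two processing cases can be contrasted in a few lines of plain Scala. This is a minimal sketch over an in-memory sequence rather than a real stream; `Reading`, `alarm` and `averageOverLast` are hypothetical names chosen for the illustration, and the alarm threshold is an assumed parameter.

```scala
object EventProcessingSketch {
  final case class Reading(sensorId: String, value: Double, timestampMs: Long)

  // Single event processing: look at one event in isolation and maybe raise an alarm.
  def alarm(r: Reading, threshold: Double): Option[String] =
    if (r.value > threshold) Some(s"ALARM sensor=${r.sensorId} value=${r.value}") else None

  // Event aggregation: average of the readings whose timestamp falls in (nowMs - windowMs, nowMs].
  def averageOverLast(readings: Seq[Reading], nowMs: Long, windowMs: Long): Option[Double] = {
    val recent = readings.filter(r => r.timestampMs > nowMs - windowMs && r.timestampMs <= nowMs)
    if (recent.isEmpty) None else Some(recent.map(_.value).sum / recent.size)
  }
}
```

Calling `averageOverLast(readings, now, 5 * 60 * 1000L)` gives the "last 5 minutes" average from the slide; a streaming engine would maintain this incrementally instead of rescanning the readings.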
  • 22. Analyzing Infinite Data Streams. Event aggregation introduces the concept of windowing with respect to the notion of time selected:
      • Event time (the time at which events happen): important for most use cases where context and correctness matter at the same time. Example: billing applications, anomaly detection.
      • Processing time (the time at which events are observed during processing): for use cases where I only care about what I process in a window. Example: accumulated clicks on a page per second.
      • System arrival or ingestion time (the time at which events arrive at the streaming system).
    Ideally event time = processing time. In reality there is skew.
  • 23. Analyzing Infinite Data Streams. Windows come in different flavors:
      • Tumbling windows discretize a stream into non-overlapping windows.
      • Sliding windows slide over the stream of data, so consecutive windows can overlap.
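How an element's timestamp maps to windows can be sketched as follows. This is an illustrative assignment function, not the API of any engine; it assumes non-negative timestamps, half-open windows `[start, end)`, and windows that start at time 0 or later. The names `WindowSketch`, `tumbling` and `sliding` are hypothetical.

```scala
object WindowSketch {
  // A window covers the half-open interval [start, end).
  final case class Window(start: Long, end: Long)

  // Tumbling: every timestamp belongs to exactly one window of length `size`.
  def tumbling(ts: Long, size: Long): Window = {
    val start = ts - (ts % size)
    Window(start, start + size)
  }

  // Sliding: a window of length `size` starts every `slide`;
  // a timestamp belongs to every window whose interval contains it.
  def sliding(ts: Long, size: Long, slide: Long): List[Window] = {
    val lastStart = ts - (ts % slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > ts - size && start >= 0)
      .map(start => Window(start, start + size))
      .toList
      .reverse
  }
}
```

With `size = 10` and `slide = 5`, timestamp 7 falls into both `[0, 10)` and `[5, 15)`; with `slide == size` the sliding assignment degenerates to the tumbling one.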
  • 24. Analyzing Infinite Data Streams. Watermarks: a watermark indicates that no elements with a timestamp older than or equal to the watermark timestamp should arrive for the specific window of data. Triggers: decide when the window is evaluated or purged.
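A toy event-time operator makes the interplay of watermarks and triggers concrete. The sketch below is an assumption-laden simplification, not how any particular engine is implemented: it uses tumbling windows, computes a max per window, fires a window once the watermark reaches the window's last timestamp, and simply counts and drops elements that arrive at or behind the current watermark. All names are hypothetical.

```scala
object WatermarkSketch {
  sealed trait Input
  final case class Element(timestamp: Long, value: Double) extends Input
  final case class Watermark(timestamp: Long) extends Input

  // Result of one evaluated window: its start time and the max value seen in it.
  final case class Fired(windowStart: Long, max: Double)

  // Returns the fired windows in firing order and the number of dropped late elements.
  def run(inputs: Seq[Input], windowSize: Long): (Vector[Fired], Int) = {
    var open = Map.empty[Long, Double] // window start -> running max
    var watermark = Long.MinValue
    var late = 0
    val fired = Vector.newBuilder[Fired]
    inputs.foreach {
      case Element(ts, _) if ts <= watermark =>
        late += 1 // arrived behind the watermark: treated as late and dropped
      case Element(ts, v) =>
        val start = ts - (ts % windowSize)
        open = open.updated(start, math.max(v, open.getOrElse(start, v)))
      case Watermark(wm) =>
        watermark = math.max(watermark, wm)
        // Trigger: evaluate and purge every window whose last timestamp is covered.
        val (ready, pending) = open.partition { case (start, _) => start + windowSize - 1 <= watermark }
        ready.toSeq.sortBy(_._1).foreach { case (start, max) => fired += Fired(start, max) }
        open = pending
    }
    (fired.result(), late)
  }
}
```

Real engines offer more choices here, for example keeping a window around for some allowed lateness and re-firing it, which is where the latency, cost and accuracy trade-off shows up.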
  • 25. Analyzing Infinite Data Streams. Given the advances in streaming we can:
      • Trade off latency against cost and accuracy.
      • In certain use cases, replace batch processing with streaming.
  • 26. Analyzing Infinite Data Streams. Recent advances in streaming are a result of pioneering work:
      • MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.
      • The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp. 1792-1803.
  • 27. Analyzing Infinite Data Streams. Apache Beam is the open source successor of Google's DataFlow. It is becoming the standard API for streaming. It provides the advanced semantics that current streaming applications need.
  • 28. Streaming Systems Architecture. The user provides a graph of computations through a high-level API, where data flows on the edges of this graph. Each vertex is an operator which executes a user operation or computation. For example: stream.map().keyBy()... Operators can run in multiple instances and preserve state (unlike batch processing, where we have immutable datasets). State can be persisted and restored in the presence of failures.
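The idea of an operator that keeps state which can be persisted and restored can be sketched without any streaming framework. The class below is a hypothetical keyed counter; `snapshot` and `restore` stand in for whatever checkpointing mechanism a real engine provides and are not taken from any library.

```scala
object StatefulOperatorSketch {
  // A keyed operator whose state is a running count per key.
  final class CountPerKey {
    private var counts = Map.empty[String, Long]

    // Process one element and emit the updated count for its key.
    def process(key: String): (String, Long) = {
      val next = counts.getOrElse(key, 0L) + 1
      counts = counts.updated(key, next)
      (key, next)
    }

    // Persist: hand out the current state (an immutable map, so safe to keep).
    def snapshot(): Map[String, Long] = counts

    // Restore: replace the state with a previously taken snapshot.
    def restore(saved: Map[String, Long]): Unit = counts = saved
  }
}
```

After a failure, a fresh instance that calls `restore` with the last snapshot continues counting from where the snapshot was taken, provided the input is replayed from the matching position.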
  • 29. Analyzing Infinite Data Streams - Flink Example
      sealed trait SensorType { def stype: String }
      case object TemperatureSensor extends SensorType { val stype = "TEMP" }
      case object HumiditySensor extends SensorType { val stype = "HUM" }

      case class SensorData(var sensorId: String, var value: Double, var sensorType: SensorType, timestamp: Long)
    https://github.com/skonto/trivento-summercamp-2016
  • 30. Analyzing Infinite Data Streams - Flink Example. The Source:
      import org.apache.flink.streaming.api.functions.source.SourceFunction
      import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext
      import org.apache.flink.streaming.api.watermark.Watermark
      import scala.util.Random

      class SensorDataSource(val sensorType: SensorType,
                             val numberOfSensors: Int,
                             val watermarkTag: Int,
                             val numberOfElements: Int = -1) extends SourceFunction[SensorData] {
        final val serialVersionUID = 1L
        @volatile var isRunning = true
        var counter = 1
        var timestamp = 0
        val randomGen = Random

        require(numberOfSensors > 0)
        require(numberOfElements >= -1)

        // Base value around which the simulated readings fluctuate.
        lazy val initialReading: Double = {
          sensorType match {
            case TemperatureSensor => 27.0
            case HumiditySensor => 0.75
          }
        }

        override def run(ctx: SourceContext[SensorData]): Unit = {
          // -1 means "run until cancelled"; otherwise stop after numberOfElements.
          val counterCondition = {
            if (numberOfElements == -1) {
              x: Int => isRunning
            } else {
              x: Int => isRunning && counter <= x
            }
          }

          while (counterCondition(numberOfElements)) {
            Thread.sleep(10) // send sensor data every 10 milliseconds
            val dataId = randomGen.nextInt(numberOfSensors) + 1
            val data = SensorData(dataId.toString, initialReading + Random.nextGaussian() / initialReading, sensorType, timestamp)
            ctx.collectWithTimestamp(data, timestamp) // time starts at 0 in milliseconds
            timestamp = timestamp + 1
            if (timestamp % watermarkTag == 0) { // emit a watermark every watermarkTag elements
              ctx.emitWatermark(new Watermark(timestamp)) // watermark in milliseconds
            }
            counter = counter + 1
          }
        }

        override def cancel(): Unit = {
          // No cleanup needed
          isRunning = false
        }
      }
    https://github.com/skonto/trivento-summercamp-2016
  • 31. Analyzing Infinite Data Streams - Flink Example. The Job:
      import org.apache.flink.streaming.api.TimeCharacteristic
      import org.apache.flink.streaming.api.scala._
      import org.apache.flink.streaming.api.scala.function.WindowFunction
      import org.apache.flink.streaming.api.windowing.time.Time
      import org.apache.flink.streaming.api.windowing.windows.TimeWindow
      import org.apache.flink.util.Collector

      object SensorSimple {
        def main(args: Array[String]): Unit = {
          val env = StreamExecutionEnvironment.getExecutionEnvironment
          // set default env parallelism for all operators
          env.setParallelism(2)
          env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

          val numberOfSensors = 2
          val watermarkTag = 10
          val numberOfElements = 1000

          val sensorDataStream =
            env.addSource(new SensorDataSource(TemperatureSensor, numberOfSensors, watermarkTag, numberOfElements))
          sensorDataStream.writeAsText("inputData.txt")

          // Key by sensor and group each sensor's readings into 10 ms event-time windows.
          val windowedKeyed = sensorDataStream
            .keyBy(data => data.sensorId)
            .timeWindow(Time.milliseconds(10))

          windowedKeyed.max("value")
            .writeAsText("outputMaxValue.txt")

          windowedKeyed.apply(new SensorAverage())
            .writeAsText("outputAverage.txt")

          env.execute("Sensor Data Simple Statistics")
        }
      }

      // Emits one record per key and window, carrying the average value of that window.
      class SensorAverage extends WindowFunction[SensorData, SensorData, String, TimeWindow] {
        def apply(key: String, window: TimeWindow, input: Iterable[SensorData], out: Collector[SensorData]): Unit = {
          if (input.nonEmpty) {
            val average = input.map(_.value).sum / input.size
            out.collect(input.head.copy(value = average))
          }
        }
      }
    https://github.com/skonto/trivento-summercamp-2016
  • 32. Analyzing Infinite Data Streams - Flink Example. Operators run the operations defined by the graph of the streaming computation. Example operators: KeyBy, Map, FlatMap etc. Two instances of the same operator with parallelism 2 (previous example). [Diagram: Operator 1 and Operator 2 receive elements along a time axis 0, 1, 2, ... 22...; Watermark 1 (10) through Watermark N (10*N) mark the boundaries of window 1, window 2, ...; outputs are labelled file1 and file2.]
  • 33. Streaming vs Batch Systems
      Metric | Batch | Streaming
      Data size per job | TB to PB | MB to TB (in flight)
      Time between data arrival and processing | Many minutes to hours | Microseconds to minutes
      Job execution times | Minutes to hours | Microseconds to minutes
  • 34. Event Log as the Core Abstraction. Logging is everywhere: for example, the write-ahead log (WAL) in databases, used for durability. The distributed log can be seen as the data structure which models the problem of consensus: it reduces the problem of making multiple machines all do the same thing to the problem of implementing a distributed consistent log. The log feeds the processes their input (state-machine replication model). Real-time streaming implementations use logs as a natural means of recording events as they are processed according to the computation graph. This assists in implementing consistency algorithms like ABS (Asynchronous Barrier Snapshotting) in Apache Flink, and other functionality in all major streaming engines like Google DataFlow, Apache Storm and Apache Samza.
  • 35. Event Log as the Core Abstraction. The event log enables architecture patterns such as:
      • Event Sourcing (ES)
      • Command Query Responsibility Segregation (CQRS)
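Event sourcing on top of a log boils down to deriving the current state by replaying the recorded events in order. The following minimal sketch uses a made-up bank account domain (`Deposited`, `Withdrawn`, `Account`) purely for illustration; none of it comes from the deck.

```scala
object EventSourcingSketch {
  // The log stores facts that happened, never the current state itself.
  sealed trait Event
  final case class Deposited(amount: Long) extends Event
  final case class Withdrawn(amount: Long) extends Event

  final case class Account(balance: Long)

  // Pure transition function: old state + one event = new state.
  def applyEvent(state: Account, event: Event): Account = event match {
    case Deposited(a) => state.copy(balance = state.balance + a)
    case Withdrawn(a) => state.copy(balance = state.balance - a)
  }

  // Replaying the same log always rebuilds the same state.
  def replay(log: Seq[Event]): Account =
    log.foldLeft(Account(0L))(applyEvent)
}
```

Because `applyEvent` is deterministic, every replica that reads the same log ends up in the same state, which is the state-machine replication idea from the previous slide. In a CQRS setup the read side would be a separate projection built from the same events.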
  • 36. Message Queues as the Integration Tool. FIFO data structures, the natural way to process logs. They organise user data in topics; each topic has its own queue. Benefits:
      • Decouples consumers from producers.
      • An arbitrary number of producers and consumers is supported.
      • Easy to use.
    Kafka is the most popular implementation for Big Data Systems.
  • 37. Message Queues - Kafka. Kafka is the most popular implementation for Big Data Systems. Kafka is a "...distributed, partitioned, replicated commit log service". No deletion of messages on read, which allows replay of data. Production tested at LinkedIn at scale.
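The "no deletion on read" property can be illustrated with a tiny in-memory append-only log where each consumer tracks its own offset. This is a conceptual sketch only and deliberately does not use or imitate the Kafka client API; `Log`, `append` and `read` are hypothetical names.

```scala
object CommitLogSketch {
  // Append-only log: entries are addressed by offset and never removed by readers.
  final class Log[A] {
    private var entries = Vector.empty[A]

    // Append one entry and return the offset it was stored at.
    def append(a: A): Long = {
      entries = entries :+ a
      entries.size - 1L
    }

    // Read up to `max` entries starting at `fromOffset`; the log is left unchanged.
    def read(fromOffset: Long, max: Int): Vector[A] =
      entries.slice(fromOffset.toInt, fromOffset.toInt + max)
  }
}
```

A consumer that remembers "my next offset is 2" can resume there after a restart, while a new consumer can start at offset 0 and replay everything. That independence between readers is what decouples consumers from producers.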
  • 38. Delivery/Processing Semantics. In distributed systems failure is part of the game. What semantics can I achieve for message delivery?
      • At-most-once delivery: each message sent is delivered zero or one times.
      • At-least-once delivery: for each message sent, potentially multiple attempts are made at delivering it, such that at least one succeeds; messages may be duplicated but not lost.
      • Exactly-once delivery: for each message sent, exactly one delivery is made to the recipient; the message can neither be lost nor duplicated.
    In theory it is impossible to have exactly-once delivery. In practice we might care more about exactly-once state changes combined with at-least-once delivery. Example: keeping state at some operator of the streaming graph.
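One common way to get exactly-once state changes on top of at-least-once delivery is to make the state update idempotent, for example by remembering which message ids were already applied. The sketch below shows that technique under the assumption that every message carries a unique id; it is an illustration, not how any specific engine implements its guarantees, and the names are hypothetical.

```scala
object DeliverySketch {
  // Assumption: every message carries a unique id assigned by the producer.
  final case class Message(id: Long, amount: Long)

  // Operator state: the running total plus the ids already applied.
  final case class State(total: Long, seen: Set[Long])

  // Idempotent update: a redelivered message leaves the state untouched.
  def applyOnce(state: State, m: Message): State =
    if (state.seen.contains(m.id)) state
    else State(state.total + m.amount, state.seen + m.id)

  // Fold a possibly duplicated delivery sequence into the final state.
  def process(delivered: Seq[Message]): State =
    delivered.foldLeft(State(0L, Set.empty[Long]))(applyOnce)
}
```

If message 1 is delivered twice, a naive sum would count it twice, whereas `process` counts it once. The price is the extra `seen` set, which a real system would have to bound or expire.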
  • 39. The SMACK Stack. Technologies which, combined, deliver high-performing streaming systems:
      • Spark
      • Mesos
      • Akka
      • Cassandra
      • Kafka
  • 40. A Fast Data Architecture using the SMACK stack. Image: Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
  • 42. Adding ML support to the IoT Application
      • Anomaly detection
      • Voice interface
      • Image classification
      • Recommendations
      • Automatic tuning of the IoT environment
  • 43. What about Lambda Architecture? Eventually we want to replace it; it is more of a traditional model. Problems:
      • Hard to maintain
      • Duplication of code & systems
      • Special systems for unifying views
    In certain cases we can replace it with streaming-based architectures.
  • 44. Streaming Implementations Status
      • Apache Spark: Structured Streaming in v2 starts the improvement of the streaming engine. It is still based on micro-batches, but event-time support was added.
      • Apache Flink: SQL API supported from v0.9 onwards. Important features are still on the roadmap: scaling streaming jobs, Mesos support, dynamic allocation.
  • 45. Picking the Right Tool for Streaming. Criteria to choose:
      • Processing semantics (strong consistency is needed for correctness)
      • Latency guarantees
      • Deployment / operation
      • Ecosystem built around it
      • Complex event processing (CEP)
      • Batch & streaming API support
      • Community & support
  • 46. Picking the Right Tool for Streaming. Some tips:
      • Pick Flink if you need sub-second latency and Beam support.
      • Pick Spark Streaming for its integration with the Spark ML libraries; its micro-batch mode is ideal for training models, and it has mature deployment capabilities.
      • Pick Gearpump for materializing Akka Streams in a distributed fashion.
      • Pick Kafka Streams for low-level, simple transformations of Kafka messages (it is a distributed solution out of the box).
    Check the Confluent Platform for many useful tools around Kafka.
  • 48. References
      • Watermarks: Time and progress in streaming dataflow and beyond: Big Data Conference - Strata + Hadoop World, May 31 - June 3, 2016, London, UK
      • The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
      • MillWheel: Fault-Tolerant Stream Processing at Internet Scale
      • Executive Summary: Data Growth, Business Opportunities, and the IT Imperatives | The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things
      • Asynchronous Distributed Snapshots for Distributed Dataflows | the morning paper
      • State machine replication - Wikipedia, the free encyclopedia
      • The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing | the morning paper
      • The Log: What every software engineer should know about real-time data's unifying abstraction | LinkedIn Engineering
      • Dean Wampler, "Fast Data Architectures for Streaming Applications", Lightbend and O'Reilly Media, September 2016
      • How Apache Flink™ enables new streaming applications – data Artisans