Advanced data science algorithms
applied to scalable stream processing
David Piris Valenzuela
Nacho García Fernández
Ignacio.g.Fernandez@treelogic.com
@0xNacho
david.piris@treelogic.com
@davidpiris
About Treelogic
 R&D-intensive company with the mission of adapting technological knowledge to
improve quality standards in our daily life
 8 ongoing H2020 projects (coordinating 3 of them)
 8 ongoing FP7 projects (coordinating 5 of them)
 Focused on providing Big Data Analytics worldwide
 Internal organization
Research lines
 Big Data
 Computer vision
 Data science
 Social Media Analysis
 Security
ICT solutions
 Security & Safety
 Justice
 Health
 Transport
 Financial Services
 ICT tailored solutions
CONTENTS
1. WHY WE NEED BIG DATA
2. BIG DATA: SOLUTIONS
3. BIG DATA: REAL-TIME PROCESSING
4. INCREMENTAL ALGORITHMS
5. WHAT WE WANT
6. WHAT WE NEED
1. A stream processing engine
2. Online incremental algorithms
3. A distributed data storage system
4. A use case
5. A visualization layer
Why we need Big Data
 Public and private sector companies store a huge amount of data
 Countries maintain huge databases storing data on
 Population
 Medical records
 Taxes
 Online transactions
 Mobile transactions
 Social Networks
In a single day, tweets generate 12 TB!!
2.5 Exabytes are produced every day!!!
 530 billion songs
 150 million iPhones
 5 million laptops
 90 years of HD Video
How can we manage all this data?
CONTENTS
1. WHY WE NEED BIG DATA
2. BIG DATA: SOLUTIONS
3. BIG DATA: REAL-TIME PROCESSING
4. INCREMENTAL ALGORITHMS
5. WHAT WE WANT
6. WHAT WE NEED
1. A stream processing engine
2. Online incremental algorithms
3. A distributed data storage system
4. A use case
5. A visualization layer
Big Data: Solutions
First, we can manage the whole historical repository and retrieve value from the
stored data:
 Batch architecture
 MapReduce
 Hadoop Ecosystem
Batch processing with Hadoop takes a long time; the need to process ingested
data and display results as quickly as possible brings new architectures and
tools:
 Lambda architecture
 Spark (memory vs disk)
Big data: real-time processing
 Faster results
 Accurate results
 Lower cost
 Satisfied consumers
As noted, we need to extract and visualize information in near real time…
 Flink as the processing engine
 Stream processing
 Windowing with event-time semantics
 Streaming and batch processing
Kappa architecture
 Batch layer removed
 Only one set of code needs to be maintained
 No need for a batch layer
 Avoid using disk in the processing engine (latency)
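The "only one set of code" idea behind Kappa can be sketched in a few lines: the same processing function serves both the replayed historical log and the live stream, so there is no separate batch code path to maintain. This is an illustrative Python sketch, not the project's actual implementation; all names here are made up.

```python
def process(events):
    """Single processing function: sum values per key.
    The same code path handles historical and live data."""
    counts = {}
    for key, value in events:
        counts[key] = counts.get(key, 0) + value
    return counts

# "Batch": replay the historical log through the function.
historical_log = [("a", 1), ("b", 2), ("a", 3)]
print(process(historical_log))  # {'a': 4, 'b': 2}

# "Streaming": feed live events through the very same function.
def live_events():
    yield ("a", 5)
    yield ("b", 1)

print(process(live_events()))  # {'a': 5, 'b': 1}
```

Reprocessing after a code change is just another replay of the log through the (updated) single code path.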
Big data: available tools
Incremental algorithms
 BI & BA people always want to perform common operations to extract value
and visualize data
 We have tools for these operations in a relational or batch environment
 How can we obtain the average of a data stream that changes every second,
minute or even millisecond?
 The classic average operation is meant for a historical repository: input data
that does not change once we start computing over it.
 Do we have tools to make this possible in a real-time deployment?
The answer is NO!
Flink gives us the chance to work with a new window-processing concept: we
can define and configure "small time pieces", and perform operations or
manipulate data within each time slice.
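The "small time pieces" idea can be sketched as a tumbling window: events are grouped by fixed time slices and one result is emitted per slice, akin to Flink's tumbling windows. This is a minimal Python sketch of the concept, not Flink's API; the function name and data are illustrative.

```python
from collections import defaultdict

def tumbling_window_average(events, window_ms):
    """Group (timestamp_ms, value) events into fixed time slices
    and compute one average per window."""
    windows = defaultdict(list)
    for ts, value in events:
        windows[ts // window_ms].append(value)  # assign event to its slice
    return {w: sum(vs) / len(vs) for w, vs in sorted(windows.items())}

events = [(100, 2.0), (900, 4.0), (1200, 6.0), (1800, 10.0)]
print(tumbling_window_average(events, window_ms=1000))
# {0: 3.0, 1: 8.0}
```

In Flink the same grouping happens continuously over an unbounded stream, with event-time semantics deciding which window each record belongs to.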
With Flink and windowing…
 These algorithms consume streams of data and can update their results in
parallel without needing to store the processed data
 Using checkpoints with windowing allows us to carry over the result of the
previous window
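A classic example of updating a result without keeping the processed data is Welford's online mean/variance: each element updates a small state triple in O(1), and that triple is exactly the kind of state one would checkpoint between windows. A Python sketch (illustrative, not the project's code):

```python
class IncrementalStats:
    """Welford's online algorithm: incremental mean and variance.
    Only (count, mean, m2) is kept; processed elements are discarded.
    The triple can be checkpointed and restored to resume a window."""

    def __init__(self, count=0, mean=0.0, m2=0.0):
        self.count, self.mean, self.m2 = count, mean, m2

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):  # population variance
        return self.m2 / self.count if self.count else 0.0

stats = IncrementalStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(x)
print(stats.mean, stats.variance)  # mean ≈ 5.0, variance ≈ 4.0
```

Because the state is three numbers regardless of how many elements have been seen, memory stays constant no matter how long the stream runs.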
Our analytics & visualization solution implemented in a real-time architecture
If you are a BI or BA professional...we care about you!
 Currently, we have implemented:
 Average
 Mode
 Variance
 Correlation
 Covariance
 Min
 Max
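Of the operations listed, covariance and correlation are the least obvious to do incrementally; one standard way is an online co-moment update over (x, y) pairs. The following Python sketch illustrates the technique only; it is not the project's implementation, and the class name is made up.

```python
import math

class IncrementalCovariance:
    """Online covariance/correlation of a stream of (x, y) pairs,
    one pair at a time, without buffering the stream."""

    def __init__(self):
        self.n = 0
        self.mean_x = self.mean_y = 0.0
        self.c_xy = self.m2_x = self.m2_y = 0.0

    def add(self, x, y):
        self.n += 1
        dx = x - self.mean_x
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.c_xy += dx * (y - self.mean_y)   # co-moment, uses updated mean_y
        self.m2_x += dx * (x - self.mean_x)   # Welford-style second moments
        self.m2_y += dy * (y - self.mean_y)

    @property
    def covariance(self):  # population covariance
        return self.c_xy / self.n if self.n else 0.0

    @property
    def correlation(self):
        denom = math.sqrt(self.m2_x * self.m2_y)
        return self.c_xy / denom if denom else 0.0

c = IncrementalCovariance()
for x, y in [(1, 2), (2, 4), (3, 6)]:   # perfectly correlated pairs
    c.add(x, y)
print(c.correlation)  # 1.0
```

Min, max, average and mode follow the same pattern with even simpler state (a single value, a running sum and count, or a frequency map).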
 Currently we are working on:
 Median
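The median is harder than the other operations because it depends on the whole distribution. One common online approach keeps two balanced heaps, giving O(log n) insertion and O(1) median queries; an exact streaming median still grows with the data, which is why sketches such as t-digest are used when memory must stay bounded. A hedged Python sketch of the two-heap technique (not the project's work-in-progress code):

```python
import heapq

class StreamingMedian:
    """Online median via two heaps: a max-heap of the lower half
    (simulated by negation) and a min-heap of the upper half."""

    def __init__(self):
        self.lo = []  # max-heap (negated): values <= median
        self.hi = []  # min-heap: values >= median

    def add(self, x):
        heapq.heappush(self.lo, -x)
        # Move the largest of the lower half up, then rebalance sizes.
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

m = StreamingMedian()
for x in [5, 1, 9, 3, 7]:
    m.add(x)
print(m.median())  # 5
```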
 On the roadmap…
 Standard deviation
 Order by
 Discretization
 Contains
 Split
 Validate range values
 Set default value to specific output
Apache Flink vs Apache Spark
Apache Flink
 Pure streaming for all workloads
 Job optimizer
 Low latency, high throughput
 Global, session, time and count based window criteria
 Automatic memory management
Apache Spark
 Micro-batches for all workloads
 No job optimizer
 Higher latency compared to Flink
 Time-based window criteria only
 Configurable memory management (Spark 1.6+ has moved towards automatic
memory management)
Incremental algorithms in Flink
 Default behavior in Apache Flink:
 With incremental algorithms:
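The contrast can be sketched in plain Python: a default-style window operator buffers every element and aggregates when the window fires, so its state grows with the window size, whereas an incremental (reduce-style) aggregation folds each element into constant-size state on arrival. Illustrative code, not Flink's internals:

```python
# Default-style: buffer all elements, aggregate when the window fires.
def window_average_buffered(window_elements):
    buffer = list(window_elements)        # O(window size) state
    return sum(buffer) / len(buffer)

# Incremental-style: fold each element into constant-size state on arrival.
def window_average_incremental(window_elements):
    total, count = 0.0, 0                 # O(1) state
    for x in window_elements:
        total, count = total + x, count + 1
    return total / count

data = [1.0, 2.0, 3.0, 4.0]
print(window_average_buffered(data), window_average_incremental(data))
# 2.5 2.5
```

Both produce the same answer; the incremental version simply never needs the whole window in memory at once.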
Apache Kudu
 Provides a combination of fast inserts / updates and efficient columnar
scans to enable real-time analytic workloads
 It is a new storage engine that complements HDFS and HBase
 Designed for use cases that require fast analytics on fast data
 Low query latency
 V1.0.1 was released on October 11, 2016
PROTEUS: a steel-making scenario
 The steel industry is a key sector for the European community
 PROTEUS was introduced last year at Big Data Spain by Treelogic *
 Hot strip mills sometimes produce steel with defects
 Predict coil parameters (thickness, width, flatness) using real-time and historical data
 Detecting defective coils at an early stage saves money: the production process can be
modified or stopped
 The proposed architecture is being validated in this project
 7,870 variables at a frequency of 500 ms: data-in-motion
 700,000 records for each variable; 500 GB of time series and flatness maps: data-at-rest
* https://www.youtube.com/watch?v=EIH7HLyqhfE
Websockets
 WebSocket is a communication protocol providing full-duplex communication
channels over a single TCP connection
 Considerably lower overhead and latency than plain HTTP for bidirectional messaging
 Its API is standardized by the W3C
Apache Flink & Websockets
 Data sinks consume DataSets and are used to store or return them.
 Flink comes with a variety of built-in output formats that are encapsulated behind
operations on the DataSet:
 writeAsText()
 writeAsFormattedText()
 writeAsCsv()
 print()
 write()
 We’ve developed a WebsocketSink enabling Flink to send outputs to a given
websocket endpoint.
 Based on the javax-websocket-client-api 1.1 spec.
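Conceptually, such a sink boils down to: hold a connection to the endpoint, serialize each record the engine hands over, and push it down the socket. The following is a language-agnostic sketch in Python, not the actual WebsocketSink (which is Java, built on javax.websocket); the transport is injected so the shape is visible without a running server, and all names are illustrative.

```python
import json

class WebsocketSinkSketch:
    """Illustrative sink: serializes each record and hands it to a
    'send' callable standing in for an open websocket connection."""

    def __init__(self, send):
        self.send = send  # e.g. an open websocket connection's send method

    def invoke(self, record):
        # The engine calls the sink once per record; serialize and forward.
        self.send(json.dumps(record))

# Demo: collect "sent" frames in a list instead of a real socket.
sent = []
sink = WebsocketSinkSketch(sent.append)
sink.invoke({"metric": "avg", "value": 4.2})
print(sent)  # ['{"metric": "avg", "value": 4.2}']
```

Injecting the transport also makes the sink trivially testable, which is harder when the connection is opened inside the sink itself.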
Incremental architecture: our approach
ProteicJS
https://github.com/proteus-h2020/proteic/
ProteicJS: Visualizations
ProteicJS: Research on visualization
 Currently researching new ways of visualizing data and ML models
ProteicJS & Apache Flink
How to get it all
https://github.com/proteus-h2020/proteus-docker