Data Streaming 
Raja Chiky – raja.chiky@isep.fr
About me 
¡ Associate professor in Computer Science – LISITE-RDI 
¡ Research interest: Data stream mining, scalability and resource optimization in distributed architectures 
(e.g. cloud architectures), recommender systems 
¡ Research field: Large scale data management 
1. Real-time and distributed processing of various data sources 
2. Use semantic technologies to add a semantic layer 
3. Recommender systems and collaborative data mining 
4. Optimizing resources in large scale systems 
[Figure: inputs are heterogeneous and static data, and heterogeneous and dynamic data streams (sensors)]
2
OUTLINE 
¡ Context: Big Data 
¡ What is a data stream ? 
¡ Data stream management systems 
¡ Basic approximate algorithms 
¡ Big Data: Distributed Systems 
¡ Semantic Data Streaming 
¡ Conclusion 
3
New era 
4
5 Big Data: Buzzword!
6 
08/12/2014 
Where is all this data coming 
from?
7 
More and More connected 
Things
8 So, what is Big Data? 
Volume of data created worldwide 
§ Dawn of time to 2003: 5 EB 
§ 2012: 2.7 ZB 
§ 2015: 10 ZB (estimated) 
Units 
§ 1 YB = 10^24 Bytes 
§ 1 ZB = 10^21 Bytes 
§ 1 EB = 10^18 Bytes 
§ 1 PB = 10^15 Bytes 
§ 1 TB = 10^12 Bytes 
§ 1 GB = 10^9 Bytes 
Variety of data 
§ Radio 
§ TV 
§ News 
§ E-mails 
§ Facebook posts 
§ Tweets 
§ Blogs 
§ Photos 
§ Videos (user and paid) 
§ RSS feeds 
§ Wikipedia 
§ GPS data 
§ RFID 
§ POS scanners 
§ … 
Velocity of data 
§ Walmart handles 1M transactions per hour 
§ Google processes 24 PB of data per day 
§ AT&T transfers 30 PB of data per day 
§ 90 trillion emails are sent per year 
§ World of Warcraft uses 1.3 PB of storage 
§ Facebook, when it had a user base of 900 M users, had 25 PB of compressed data 
§ 400M tweets per day in June ’12 
§ 72 hours of video are uploaded to YouTube every minute 
Big Data elements: Volume, Variety, Velocity + Veracity (IBM): information uncertainty 
Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla
[Figure: data processing chain from data sources (sensors, databases) through information gathering, storage and user interaction to output. Static data path: ETL, data warehouse, batch processing, ad-hoc queries, analytics. Stream (big) data path: semantic ETL, stream processing, continuous queries/business rules, load shedding, databases/triplestores (synopsis), knowledge enrichment, real-time visual analytics, retro-action.]
10 Big Data : Velocity 
¡ Website logs 
¡ Network monitoring 
¡ Financial services 
¡ eCommerce 
¡ Traffic control 
¡ Weather forecasting 
¡ Power consumption
What is a data stream? 
11 
¡ Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered 
(implicitly by arrival time or explicitly by timestamp) sequence of items. It is 
impossible to control the order in which items arrive, nor is it feasible to locally 
store a stream in its entirety.” 
¡ Massive volumes of data, items arrive at a high rate.
12 
Applications of data stream 
processing 
¡ Data stream processing 
¡ Process queries (compute statistics, activate alarms) 
¡ Apply data mining algorithms 
¡ Requirements 
¡ Real-time processing 
¡ One-pass processing 
¡ Bounded storage (no complete storage of streams) 
¡ Possibly consider several streams
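
The one-pass and bounded-storage requirements can be illustrated with a running mean and variance. This is a minimal sketch using Welford's method, which is not named in the slides; the class name `RunningStats` and the sample values are illustrative assumptions.

```python
class RunningStats:
    """One-pass mean and variance (Welford's method) in O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Each item is seen once and then forgotten.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of the items seen so far.
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for duration in [12, 16, 26, 18]:  # hypothetical measurements
    stats.update(duration)
print(stats.mean, stats.variance)
```

Only three numbers are kept whatever the length of the stream, which is the kind of summary a stream processor can afford.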
13 
Applications of data stream 
processing 
¡ Let’s go deeper into some examples 
¡ Network management 
¡ Stock monitoring
14 Network management 
¡ Supervision of a computer network 
¡ Improvement of network configuration (hardware, software, 
architecture) 
¡ Detection of attacks 
¡ Measurements made on routers 
[Figure: routers sending measurements to a network supervision center] 
¡ Huge volume of data 
¡ High rate of arrivals
15 Network management 
Network supervision 
center 
Timestamp Source Destination Duration Bytes Protocol 
… … … … … … 
12342 10.1.0.2 16.2.3.7 12 20K http 
12343 18.6.7.1 12.4.0.3 16 24K http 
12344 12.4.3.8 14.8.7.4 26 58K http 
12345 19.7.1.2 16.5.5.8 18 80K ftp 
… … … … … …
16 Network management 
Network supervision 
center 
Typical queries: 
- 100 most frequent (@S, @D) on router R1 … 
- How many different (@S, @D) seen on R1 but not R2 … 
- … during last month, last week, last day, last hour ?
17 Stock monitoring 
¡ Stream of price and sales volume of stocks over time 
¡ Technical analysis/charting for stock investors 
¡ Support trading decisions 
l Notify me when the price of IBM is above $83, and 
the first MSFT price afterwards is below $27. 
l Notify me when some stock goes up by at least 5% 
from one transaction to the next. 
l Notify me when the price of any stock increases 
monotonically for ≥30 min. 
l Notify me when the difference between the 
current price of a stock and its 10 day moving 
average is greater than some threshold value 
Source: Gehrke 07 and Cayuga application scenarios (Cornell University)
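
As an illustration, the second notification ("goes up by at least 5% from one transaction to the next") can be evaluated as a continuous query that keeps only the last price per stock. This is a sketch, not Cayuga syntax; the function name `price_jumps` and the sample ticks are assumptions.

```python
def price_jumps(ticks, threshold=0.05):
    """Yield (symbol, previous, current) whenever a price rises by at least
    `threshold` between two consecutive transactions of the same stock."""
    last_price = {}  # one value per symbol: bounded by the number of stocks
    for symbol, price in ticks:
        previous = last_price.get(symbol)
        if previous is not None and price >= previous * (1 + threshold):
            yield symbol, previous, price
        last_price[symbol] = price


# Hypothetical stream of (symbol, price) transactions.
ticks = [("IBM", 80.0), ("MSFT", 27.0), ("IBM", 85.0), ("MSFT", 27.5), ("IBM", 86.0)]
alerts = list(price_jumps(ticks))
print(alerts)
```

The query never looks back further than one transaction per stock, so its memory does not grow with the stream.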
18 Where is the problem? (1/2) 
¡ Example: 
[Figure: two bank withdrawals dated 12/05/2014, one of 50 € and one of 1000 $, labelled "bank fraud"]
¡ Join between several streams 
¡ Join between stream data and customer database 
¡ Generic tools for processing streams 
¡ Avoid the ‘Store’, ‘Compute’, ‘Delete’ approach 
¡ Solution: incremental computation and definition of temporal windows for joins
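
A temporal window makes a stream join incremental: only events still inside the window are kept, and expired events are dropped as new ones arrive. The sketch below pairs withdrawals made with the same card within a given window; the tuple layout, the name `windowed_self_join` and the data are assumptions, and events are assumed to arrive in timestamp order.

```python
from collections import deque


def windowed_self_join(events, window):
    """For each arriving (timestamp, card, amount) event, pair it with earlier
    events of the same card that are at most `window` time units old."""
    recent = deque()  # only events still inside the window are stored
    for ts, card, amount in events:
        while recent and recent[0][0] < ts - window:
            recent.popleft()  # expire old events incrementally
        for old_ts, old_card, old_amount in recent:
            if old_card == card:
                yield (old_ts, old_amount), (ts, amount), card
        recent.append((ts, card, amount))


# Hypothetical withdrawals: (timestamp, card, amount).
withdrawals = [(0, "A", 50), (5, "B", 20), (8, "A", 1000), (30, "A", 40)]
pairs = list(windowed_self_join(withdrawals, window=10))
print(pairs)
```

Nothing is stored, computed and then deleted in bulk: the state is updated on every arrival and its size is bounded by the number of events per window.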
19 Where is the problem? (2/2) 
¡ Example: 
¡ 100 most frequent @S IP addresses on a router 
¡ Maintain a table of IP addresses with 
frequencies ? 
¡ Sampling the stream ? 
¡ Face high (and varying) rate of arrivals 
¡ Exact versus approximate answers
20 Examples of queries 
[Figure: frequency histogram over elements 1 to 20] 
¡ Find elements with frequency > 0.1% 
¡ Top-k 
¡ Frequency of element 3 
¡ Total frequency of elements between 8 and 14 
¡ Number of elements having a non-zero frequency
21 
Data Stream Management 
Systems 
DBMS vs. DSMS 
Data model 
  DBMS: Permanent updatable relations 
  DSMS: Streams and permanent updatable relations 
Storage 
  DBMS: Data is stored on disk 
  DSMS: Permanent relations are stored on disk; streams are processed on the fly 
Query 
  DBMS: SQL language (creating structures, inserting/updating/deleting data, retrieving data with one-time queries) 
  DSMS: SQL-like query language (standard SQL on permanent relations, extended SQL on streams with windowing, continuous queries) 
Performance 
  DBMS: Large volumes of data 
  DSMS: Optimization of computer resources to deal with several streams and several queries; ability to face variations in arrival rates without crash
22 Generic DSMS architecture 
[Figure: generic DSMS architecture, after Golab & Özsu (2003). Components: input monitor, working storage, summary storage, static storage, query repository, query processor, output buffer. Inputs: streaming inputs, updates to static data, user queries. Outputs: streaming outputs.]
23 Existing DSMS
24 
How to deal with Big Data 
Streams 
Distribution 
DSMS 
Sampling/Load Shedding/Sketch/…
25 
Approximate answers to 
queries 
¡ When ? 
¡ Queries needing unbounded memory 
¡ Too many queries / too rapid streams / too strict response time 
requirements 
¡ CPU limit 
¡ Memory limit 
¡ Solution : approximate answers to queries 
¡ Sliding windows 
¡ Sampling and load shedding 
¡ Definition of synopsis
26 Approaches 
¡ Two approaches for handling such streams 
¡ Use a time window, and query the window as a static table 
¡ When you can’t store collected data, or to keep track of historical 
data 
¡ Sampling 
¡ Filtering 
¡ Counting
Windowing 
¡ Applying queries/mining tasks to the whole stream (from 
beginning to current time) 
¡ Applying queries/mining to a portion of the stream 
[Figure: time axis t from the beginning of the stream to the current date, with a window on the stream]
29 Windowing 
¡ Definition of windows of interest on streams 
¡ Fixed windows: September 2014 
¡ Sliding windows: last 3 hours 
¡ Landmark windows: from September 1st, 2014 
¡ Window specification 
¡ Physical : last 3 hours 
¡ Logical : last 1000 items 
¡ Refreshing rate 
¡ Rate of producing results (every item, every 10 items, every minute, 
…)
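
A logical sliding window with a refreshing rate can be sketched as follows. The function name `sliding_average` and its parameters are illustrative assumptions; the aggregate is an average, but any function of the window contents would do.

```python
from collections import deque


def sliding_average(stream, size, refresh):
    """Logical sliding window over the last `size` items; a result is
    produced every `refresh` items."""
    window = deque(maxlen=size)  # the oldest item is dropped automatically
    for i, value in enumerate(stream, start=1):
        window.append(value)
        if i % refresh == 0:
            yield sum(window) / len(window)


results = list(sliding_average(range(1, 11), size=4, refresh=2))
print(results)  # [1.5, 2.5, 4.5, 6.5, 8.5]
```

Choosing `refresh` equal to `size` makes consecutive windows non-overlapping, and a physical (time-based) window would expire items by timestamp instead of by count.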
Sliding window 
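[Figure: time axis t from the beginning of the stream; the window ends at the current time tc and, one refreshment time later, at t'c; results are produced at each refresh] 
Example: "Give me the last room where Axel has been in the last 10 minutes, updating results every minute" 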
30
Sliding window vs. Tumbling 
window 
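[Figure: time axis t from the beginning of the stream; successive windows ending at tc and t'c do not overlap because the refreshment time equals the window length; results are produced at the end of each window] 
Example: "Give me the last room where Axel has been in the last 10 minutes, updating results every 10 minutes" 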
31
32 Sampling from data stream 
¡ Inputs: 
¡ Sample size k 
¡ Window size n >> k (alternatively, time duration m) (optionally) 
¡ Stream of data elements that arrive online 
¡ Output: 
¡ k elements chosen uniformly at random from the last n elements 
(alternatively, from all elements that have arrived in the last m time 
units) 
¡ Goal: 
¡ maintain a data structure that can produce the desired output at 
any time upon request 
¡ Challenge: 
¡ don’t know how long stream is 
¡ So when/how often to sample?
A simple, Unsatisfying 
Approach 
¡ Choose a random subset X={x1, …,xk}, X⊂{0,1,…,n-1} 
¡ The sample always consists of the non-expired elements whose 
indexes are equal to x1, …,xk (modulo n) 
¡ Only uses O(k) memory 
¡ Technically produces a uniform random sample of each 
window, but unsatisfying because the sample is highly periodic 
¡ Unsuitable for many real applications, particularly those with 
periodicity in the data 
34
35 Reservoir sampling 
¡ Classic online algorithm due to Vitter (1985) 
¡ Maintains a fixed-size uniform random sample 
¡ Size of the data stream need not be known in advance 
¡ Data structure: “reservoir” of k data elements 
¡ As the ith data element arrives: 
¡ Add it to the reservoir with probability p = k/i, discarding a randomly 
chosen data element from the reservoir to make room
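
The reservoir update rule can be sketched in a few lines. The function name `reservoir_sample` and the `rng` parameter are assumptions added for the example; the first k elements are always kept because k/i ≥ 1 for them.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Uniform random sample of k elements from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)              # p = k/i >= 1: always kept
        elif rng.random() < k / i:
            reservoir[rng.randrange(k)] = item  # evict a random resident
    return reservoir


sample = reservoir_sample(range(1000), k=10, rng=random.Random(42))
print(sample)
```

The stream is read once and memory stays at k elements, whatever the final length of the stream.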
36 Chain Sampling 
v Chain Sampling method is for sequence based windows. 
v In this type of sampling when the ith element arrives it is chosen 
to become the sample with probability Min(i,n)/n. 
v If the ith element is chosen as the sample, the algorithm also 
selects the index of the element that will replace it when expires 
(assuming that it is still present in the sample when it expires). 
This index is picked uniformly at random from the range i+1…i 
+n, representing the range of indexes of the elements that will 
be active when the ith element expires. 
v When the element with the selected index arrives, the algorithm 
stores it in the memory and choses the index of the element 
that will replace it when it expires etc., building a chain of 
elements to use in case of the expiration of the current element 
in the sample.
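
A possible implementation of chain sampling for a sample of size 1 is sketched below. It assumes the inclusion probability 1/min(i, n) of the chain-sample algorithm as published by Babcock, Datar and Motwani; the class and attribute names are illustrative.

```python
import random
from collections import deque


class ChainSample:
    """Keeps one uniform sample from the last n elements of a stream."""

    def __init__(self, n, rng=None):
        self.n = n
        self.rng = rng or random.Random()
        self.i = 0              # 1-based index of the latest element
        self.chain = deque()    # (index, value) pairs; chain[0] is the sample
        self.next_index = None  # index of the element that will extend the chain

    def _pick_successor(self):
        # Replacement index chosen uniformly in i+1 .. i+n (both inclusive).
        self.next_index = self.rng.randint(self.i + 1, self.i + self.n)

    def add(self, value):
        self.i += 1
        if self.rng.random() < 1.0 / min(self.i, self.n):
            # The new element becomes the sample; the old chain is discarded.
            self.chain = deque([(self.i, value)])
            self._pick_successor()
        elif self.i == self.next_index:
            # This element was pre-selected to replace the tail of the chain.
            self.chain.append((self.i, value))
            self._pick_successor()
        # Drop the head once it has left the window of the last n elements.
        while self.chain and self.chain[0][0] <= self.i - self.n:
            self.chain.popleft()

    def sample(self):
        return self.chain[0][1] if self.chain else None


cs = ChainSample(n=5, rng=random.Random(1))
for x in range(100):
    cs.add(x)
print(cs.sample())  # one of the last five values
```

Because each replacement index lies at most n positions after its predecessor, the successor has always arrived by the time the head of the chain expires, so the sample is never empty after the first element.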
37 Chain Sampling 
[Figure: chain sampling illustrated at four successive steps on the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3]
Another Simple Approach: 
oversample 
¡ As each element arrives, remember it with probability 
p = (c·k/n)·log n; otherwise discard it 
¡ Discard elements when they expire 
¡ When asked to produce a sample, choose k elements at 
random from the set in memory 
¡ Expected memory usage of O(k log n) 
¡ The algorithm can fail if fewer than k elements from a window are 
remembered; however, with high probability this will not happen 
38
39 Min-wise Sampling 
¡ For each item, pick a random fraction between 0 and 1 
¡ Store item(s) with the smallest random tag (Nath et al. 2004) 
Example tags: 0.391 0.908 0.291 0.555 0.619 0.273 (the item tagged 0.273 is kept) 
¡ Each item has the same chance of receiving the smallest tag, so the sample is uniform 
¡ Can run on multiple streams separately, then merge
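
A sketch of min-wise sampling for a sample of size 1, including the merge of two independently sampled streams. The function names and the seeded generator are assumptions made for the example.

```python
import random


def minwise_sample(stream, rng):
    """Keep the item that receives the smallest random tag in [0, 1)."""
    best = None
    for item in stream:
        tag = rng.random()
        if best is None or tag < best[0]:
            best = (tag, item)
    return best  # (tag, item), or None for an empty stream


def merge(a, b):
    """Combine the samples of two streams: the smaller tag wins."""
    return min(a, b, key=lambda pair: pair[0])


rng = random.Random(7)
s1 = minwise_sample("abc", rng)
s2 = minwise_sample("defg", rng)
tag, item = merge(s1, s2)
print(item)
```

Since the tags are independent of the items, merging by smallest tag gives the same distribution as sampling the concatenated stream directly.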
40 Hash function - Revision 
¡ Hash function: Any algorithm that maps data of arbitrary length 
to data of a fixed length. The values returned by a hash function 
are called hash values, hash codes, hash sums, checksums or 
simply hashes. 
41 Stream filtering 
¡ Bloom filter: compact data structure, by B.H. Bloom, 1970: 
¡ A bit array of size m, 
¡ A family H of k hash functions (e.g. MD5, SHA256, Murmur), 
¡ A set S of n items 
¡ False positive probability f ≈ (1 - e^{-kn/m})^k 
¡ Example: m=10, hash=MD5, k=3
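
A minimal Bloom filter sketch follows. Deriving the k hash functions by salting MD5 with the function index is an assumption made for brevity, and the class and function names are illustrative.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # k hash functions simulated by salting MD5 with the index j.
        for j in range(self.k):
            digest = hashlib.md5(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[p] for p in self._positions(item))


def false_positive_rate(m, k, n):
    """Approximate false positive probability after inserting n items."""
    return (1 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=1000, k=3)
ips = ["10.1.0.2", "18.6.7.1", "12.4.3.8"]
for ip in ips:
    bf.add(ip)
print("18.6.7.1" in bf, false_positive_rate(m=10, k=3, n=1))
```

The filter never stores the items themselves, only m bits, which is why it suits filtering a stream against a large set S.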
42 Sketches 
¡ Sketch 
¡ Synopsis structure taking advantage of high volumes of data 
¡ Provides an approximate result with probabilistic bounds 
¡ Random projections on smaller spaces (hash functions) 
¡ Many sketch structures: usually dedicated to a specialized task 
¡ Examples of sketch structures 
¡ COUNT (Flajolet 85) 
¡ COUNT SKETCH (Charikar et al. 04)
Difference between Sampling 
and Sketching 
43 
¡ Sample sees only those items which were selected to be in the 
sample; whereas the sketch sees the entire input, but is restricted 
to retain only a small summary of it 
¡ Not every problem can be solved with sampling 
¡ Example: counting how many distinct items in the stream 
¡ If a large fraction of items aren’t sampled, don’t know if they are all 
same or all different
44 
Sketches: COUNT (Flajolet-Martin algorithm) 
¡ Goal 
¡ Number N of distinct values in a stream (for large N) 
¡ Example: number of distinct IP addresses going through a router 
¡ Sketch structure 
¡ SK: L bits initialized to 0 
0 0 0 0 0 0 0 0 
¡ H: hashing function transforming an element of the stream into L bits 
18.6.7.1 0 0 1 0 1 0 1 0 
¡ H distributes uniformly elements of the stream on the 2^L possibilities
45 Sketches 
¡ Method 
¡ Maintenance and update of SK 
¡ For each new element e 
¡ Compute H(e) 
¡ Select the position of the leftmost 1 in H(e) 
¡ Force to 1 this position in SK 
SK          0 1 0 0 1 0 0 1 
H(18.6.7.1) 0 0 1 0 1 0 1 0 
New SK      0 1 1 0 1 0 0 1
46 Sketches 
¡ Result 
¡ Select the position R (0…L-1) of the leftmost 0 in SK 
¡ E(R) = log2(φ·N) with φ = 0.77351… 
⇒ Estimate the count by 2^R/φ 
¡ σ(R) = 1.12 
Example: SK = 1 1 1 0 1 0 0 0 gives R = 3 
To improve accuracy of this approximation algorithm, use multiple 
hash functions and use the average R instead.
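
The maintenance and estimation steps of the COUNT sketch can be put together as follows. Using SHA-256 truncated to L bits as the hash function H is an assumption, as are the function names and the `salt` parameter.

```python
import hashlib

PHI = 0.77351


def fm_sketch(stream, L=32, salt=""):
    """Build the L-bit sketch SK (L <= 64) from a stream of elements."""
    sk = [0] * L
    for e in stream:
        digest = hashlib.sha256(f"{salt}{e}".encode()).digest()
        h = int.from_bytes(digest[:8], "big") >> (64 - L)  # L-bit hash value
        bits = format(h, f"0{L}b")
        pos = bits.find("1")  # leftmost 1; -1 if the hash is all zeros
        if pos >= 0:
            sk[pos] = 1
    return sk


def fm_estimate(sk):
    """Estimate the number of distinct values as 2^R / phi."""
    r = sk.index(0) if 0 in sk else len(sk)  # position of the leftmost 0
    return 2 ** r / PHI


print(fm_estimate([1, 1, 1, 0, 1, 0, 0, 0]))  # R = 3, about 10.3
```

Duplicates set the same bit again and so never change the sketch. Several sketches built with different `salt` values can be maintained in parallel to average R, as suggested above.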
47 Sketches 
¡ COUNT SKETCH ALGORITHM (Charikar et al. 2004) 
¡ Goal 
¡ k most frequent elements in a stream (for large number N of distinct 
values) 
¡ Ex. 100 most frequent IP addresses going through a router 
Input stream: 2, 0, 1, 3, 1, 2, 4, . . . 
Frequencies so far: f(0)=1, f(1)=2, f(2)=2, f(3)=1, f(4)=1
48 Sketches 
[Figure: COUNT SKETCH counter array, B counters for each of the t hash functions; an arriving element e adds +1 or -1 to one counter per hash function]
49 Sketches 
¡ Sketch structure 
¡ h : hash function from [0, … , N-1] to [1, … , B] 
¡ s : hash function from [0, … , N-1] to {+1, -1} 
¡ Array of B counters: C_1, …, C_B (with B << N) 
¡ Sketch maintenance 
when e arrives: C_{h(e)} += s(e) 
¡ Use of sketch 
¡ Estimation of the frequency of object e: n_e ≈ C_{h(e)} · s(e) 
¡ In practice, t hash functions h_j and t hash functions s_j are used: 
n_e ≈ median_{j∈[1…t]} ( C_{h_j(e)} · s_j(e) ) 
¡ Theoretical results on error depending on N, t and B.
50 Sketches 
¡ Algorithm 
¡ Maintenance of a list (e1, e2, …, ek) of the current k most frequent 
elements 
¡ For a new arriving element e 
¡ Add e to the sketch structure 
¡ Estimate frequency of e from the sketch structure 
¡ If f(e) > f(ek), remove ek and insert e into the list
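
The sketch structure and the top-k maintenance can be combined as below. Deriving h_j and s_j from SHA-256 is an assumption, counters are indexed from 0 rather than 1, and the class and function names are illustrative.

```python
import hashlib
from statistics import median


class CountSketch:
    def __init__(self, t=5, B=64):
        self.t, self.B = t, B
        self.C = [[0] * B for _ in range(t)]  # t rows of B counters

    def _hash(self, j, e):
        d = hashlib.sha256(f"{j}:{e}".encode()).digest()
        h = int.from_bytes(d[:4], "big") % self.B  # h_j(e): counter index
        s = 1 if d[4] & 1 else -1                  # s_j(e): +1 or -1
        return h, s

    def add(self, e):
        for j in range(self.t):
            h, s = self._hash(j, e)
            self.C[j][h] += s

    def estimate(self, e):
        # Median over the t rows of C[h_j(e)] * s_j(e).
        vals = []
        for j in range(self.t):
            h, s = self._hash(j, e)
            vals.append(self.C[j][h] * s)
        return median(vals)


def top_k(stream, k, sketch):
    """Maintain the k elements with the highest estimated frequency."""
    top = {}  # element -> estimated frequency, at most k entries
    for e in stream:
        sketch.add(e)
        f = sketch.estimate(e)
        if e in top or len(top) < k:
            top[e] = f
        else:
            weakest = min(top, key=top.get)
            if f > top[weakest]:
                del top[weakest]
                top[e] = f
    return sorted(top, key=top.get, reverse=True)


stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 5 + ["d"] * 2
print(top_k(stream, k=2, sketch=CountSketch()))
```

Memory is t × B counters plus k list entries, independent of the number N of distinct values; the estimates are approximate whenever elements collide in a counter.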
Distributed 
Systems 
Answering today's Big Data needs 
S4, Storm, SAMOA, … 
51
Yahoo S4 
¡ Simple Scalable Streaming System 
¡ Use a decentralized and symmetric architecture 
¡ All nodes are similar, no master node 
¡ Based on Processing Elements 
¡ Each PE has 4 components: 
1. Functionality defined by a PE class and associated configuration 
¡ input event handler processEvent() 
¡ output mechanism output() 
2. the types of events that it consumes 
3. the keyed attribute in those events 
4. the value of the keyed attribute in events which it consumes 
¡ Several PEs are available for standard tasks such as count, aggregate, 
join … 
¡ Custom PEs can easily be programmed 
¡ The PEs are automatically deployed on the Processing Nodes (machines): 
¡ Ensures load balancing of events, 
¡ Notifies appropriate PEs when an event comes in.
S4 Example: Word Count
Twitter Storm 
¡ Same principle as S4, with a simplified programming model 
¡ Storm provides realtime computation 
¡ Scalable 
¡ Guarantees no data loss 
¡ Extremely robust and fault-tolerant 
¡ Programming language agnostic 
¡ Concepts 
¡ Streams 
¡ Spouts 
¡ Bolts 
¡ Topologies
56 
Lambda architecture (By 
Nathan Marz) 
[Figure: the data flow feeds both the batch layer (batch processing, precomputed views) and the speed layer (real-time stream processing); the serving layer answers queries]
Source: Mathieu DESPRIEE (USI) 
Generic, scalable and fault-tolerant data processing architecture
Lambda architecture 
1. All data entering the system is dispatched to both the batch 
layer and the speed layer for processing. 
2. The batch layer has two functions: (i) managing the master 
dataset (an immutable, append-only set of raw data), and (ii) 
to pre-compute the batch views. 
3. The serving layer indexes the batch views so that they can be 
queried in an ad-hoc way. 
4. The speed layer compensates for the high latency of updates 
to the serving layer and deals with recent data only. 
5. Any incoming query can be answered by merging results from 
batch views and real-time views. 
http://lambda-architecture.net/
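
The five points above can be mimicked in a toy, single-process sketch that counts events per key. It only illustrates the data flow, not a distributed deployment; the class `LambdaCounts` and its method names are hypothetical.

```python
from collections import Counter


class LambdaCounts:
    def __init__(self):
        self.master = []                # immutable, append-only raw data
        self.batch_view = Counter()     # precomputed by the batch layer
        self.realtime_view = Counter()  # recent data not yet in a batch view

    def ingest(self, key):
        self.master.append(key)         # dispatched to the batch layer ...
        self.realtime_view[key] += 1    # ... and to the speed layer

    def run_batch(self):
        # Recompute the batch view from all raw data (high latency in practice).
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()      # recent data is now covered

    def query(self, key):
        # Merge results from the batch view and the real-time view.
        return self.batch_view[key] + self.realtime_view[key]


system = LambdaCounts()
for key in ["a", "b", "a"]:
    system.ingest(key)
system.run_batch()
system.ingest("a")
print(system.query("a"))  # 3
```

Queries stay correct between two batch runs because the real-time view holds exactly the events that the last batch view has not yet absorbed.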
58 Big Data Stream Mining 
Machine learning tools: 
¡ Distributed, batch: Hadoop, Mahout 
¡ Distributed, stream: S4, Storm, SAMOA 
¡ Non-distributed, batch: R, WEKA, … 
¡ Non-distributed, stream: MOA
59 
What is SAMOA? 
• NEW Software framework for mining distributed data streams 
• Big Data mining for evolving streams in REAL-TIME
SAMOA architecture 
[Figure: SAMOA offers clustering methods, classifier methods and frequent pattern mining on top of S4, Storm, …] 
¡ Use S4, Storm, or other distributed stream processing platform 
¡ Use MOA, or other streaming machine learning library 
¡ Easy to extend through PACKAGES
Semantic data 
streaming 
61
62 Too many data streams
63 
Type of data used in Big Data 
initiatives 
Internal data 
Traditional sources 
« New data » 
Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
64 BI: New Generation 
Decision 
Monitoring, Alerts, Statistics, fault 
detection, etc. 
Semantic filtering and Continuous queries 
Heterogeneous and 
dynamic data streams 
Heterogeneous and 
static data 
sensors 
Semantic data streams 
Interconnection 
Ontologies 
Retro-action
65 
Semantic Web technologies for 
data stream 
¡ Annotate stream data with semantic metadata 
¡ Apply Linked Data principles to publish streaming data 
¡ Interlink streaming data with existing datasets 
¡ Integrate data stream processing + reasoning 
¡ Objectives : interoperability, automation, enrichment
66 Existing prototypes 
CQELS 
SPARQL-Stream 
EP-SPARQL 
C-SPARQL
67 Approaches 
In RDF Stream 
models 
(timestamps, 
events, time 
intervals, triple-based, 
graph-based 
…)
68 Example: SRBench 
Q: Detect if a hurricane has been observed 
“A hurricane has a sustained wind (for more than 3 hours) of at 
least 33 metres per second or 74 miles per hour (119 km/h).” 
ASK 
WHERE { 
STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 10800s SLIDE 600s] 
{?observation om-owl:procedure ?sensor ; 
om-owl:observedProperty weather:WindSpeed ; 
om-owl:result [ om-owl:floatValue ?value ] .} 
} 
GROUP BY ?sensor 
HAVING ( AVG(?value) >= "74"^^xsd:float ) 
ASK 
FROM STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 3h STEP 10m] 
WHERE { 
?observation om-owl:procedure ?sensor ; 
om-owl:observedProperty weather:WindSpeed ; 
om-owl:result [ om-owl:floatValue ?value ] . } 
GROUP BY ?sensor 
HAVING ( AVG(?value) >= "74"^^xsd:float )
69 
How to deal with Big Data 
Streams 
Cloud Computing: [1] and [2] 
DSMS 
Sampling/Load Shedding: [3] 
[1] J. Hoeksema & S. Kotoulas: High-performance Distributed Stream Reasoning using S4 (ISWC 2011) 
[2] D. L. Phuoc et al.: Elastic and Scalable Processing of Linked Stream Data in the Cloud (ISWC 2013) 
[3] N. Jain et al.: Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues (DaEng 2013)
70 
Sampling Extensions for 
continuous SPARQL 
PREFIX vocab: <http://data-gov.tw.rpi.edu/vocab/p/8/> 
SELECT ?val 
WHERE { 
STREAM <C:/CQELS/streams/data-8.stream>[NOW][UNISAMPLING %80] 
{?rawData vocab:ozone_8hr_daily_max ?val } 
GRAPH <C:/CQELS/test.rdf> {?user sioc:account_of ?person} 
} 
Operators: [UNISAMPLING %{Sampling Percentage}] 
[RESSAMPLING %{Reservoir Size}] 
[CHNSAMPLING %{Window Size}] 
N. Jain et al.: Sampling Semantic Data Stream: Resolving Overload and 
Limited Storage Issues (DaEng 2013)
71 
Semantic Data Stream Load 
Shedding (1/2) 
[Figure: RDF graph describing two sensor observations, Observation_AirTemperature_ID (type TemperatureObservation, observedProperty AirTemperature) and Observation_RelativeHumidity_ID (type RelativeHumidityObservation, observedProperty RelativeHumidity), together with their System_ID, Instant_ID (2004-08-08T06:25:00) and MeasureData nodes (floatValue of type double, uom fahrenheit or percentage). Edge labels: type, observedProperty, procedure, result, samplingTime, inXSDDateTime, floatValue, uom. Nodes are labelled (a) to (d); two triples drawn as dotted lines are labelled (1) and (2).] 
RDF Triple approach: 
Effect of deleting the two triples marked (1) and (2) (dotted lines) in the graph: 
- Deleting the first RDF triple destroys the link connecting node (a) to node (b) 
- Deleting the second RDF triple destroys the link connecting node (c) to node (d) 
=> Nodes (b), (d) and all nodes connected to them become unreachable, 
even though their data is still in memory. This represents 6 RDF 
triples out of 18, i.e. 33% of unusable data
72 
Semantic Data Stream Load 
Shedding (2/2) 
RDF Graph approach : 
Effect of deleting sub-graphs such as those formed by nodes (b) or (d) and all nodes 
to which they are directly connected. 
=> Preserving the semantic level of the information 
=> Protecting the data consistency of the whole graph 
=> Enhancing the Semantic Data Stream systems 
73 
Conclusion: Big Data Stream 
challenges 
¡ Semantic Information aggregation 
¡ Information aggregation: “too much data to assimilate but not 
enough knowledge to act” 
¡ Distributed and real-time processing 
¡ Design of real-time and distributed algorithms for stream processing 
and information aggregation 
¡ Distribution and parallelization of data mining algorithms 
¡ Visual analytics and user modeling 
¡ Dynamic user model 
¡ Novel visualizations for very large datasets
74 
Thanks to 
Zakia Kazi Aoul, ISEP 
Marie-Aude Aufaure, ECP 
Fethi Belghaouti, ISEP-INT 
Georges Hébrail, EDF R&D 
Sylvain Lefebvre, ISEP 
Yousra Chabchoub, ISEP
Big Data 
Volume, Variety, Velocity, Veracity, … Value 
Integrate, aggregate, analyze, visualize large data sets, whatever their type, provenance, speed of their flow … 
Linked Data 
Web of data, Semantic Web 
- A set of principles and good practices allowing to link, publish and search for web data 
- Structure and semantically enrich RDF data, with a very high scalability -> Big Linked Data 
Big Linked Data / Linked Big Data 
Semantic Technologies Living Lab 
Linked & Big Data Academic Chair 
Our value proposition 
- Semantic aggregation from textual and non-textual streams 
- Manage semantic heterogeneity, real-time and distributed processing 
- Ensure data quality and veracity 
- Visual analytics

More Related Content

What's hot

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Dynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theoremDynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theorem
Grisha Weintraub
 

What's hot (20)

What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
Aerospike Today and Tomorrow Product Roadmap 2023_Lenley Hensarling.pdf
Aerospike Today and Tomorrow Product Roadmap 2023_Lenley Hensarling.pdfAerospike Today and Tomorrow Product Roadmap 2023_Lenley Hensarling.pdf
Aerospike Today and Tomorrow Product Roadmap 2023_Lenley Hensarling.pdf
 
Data streaming fundamentals
Data streaming fundamentalsData streaming fundamentals
Data streaming fundamentals
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Real time big data stream processing
Real time big data stream processing Real time big data stream processing
Real time big data stream processing
 
Unsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena SharovaUnsupervised Anomaly Detection with Isolation Forest - Elena Sharova
Unsupervised Anomaly Detection with Isolation Forest - Elena Sharova
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
 
Introduction to Amazon DynamoDB
Introduction to Amazon DynamoDBIntroduction to Amazon DynamoDB
Introduction to Amazon DynamoDB
 
Gfs vs hdfs
Gfs vs hdfsGfs vs hdfs
Gfs vs hdfs
 
Dynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theoremDynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theorem
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 

Viewers also liked

Viewers also liked (13)

Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
Computer Programming For Power Systems Analysts.
Computer Programming For Power Systems Analysts.Computer Programming For Power Systems Analysts.
Computer Programming For Power Systems Analysts.
 
Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)
 
Hash - A probabilistic approach for big data
Hash - A probabilistic approach for big dataHash - A probabilistic approach for big data
Hash - A probabilistic approach for big data
 
Streaming from the cloud
Streaming from the cloudStreaming from the cloud
Streaming from the cloud
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
 
Data Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and RData Stream Algorithms in Storm and R
Data Stream Algorithms in Storm and R
 
Introduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache KafkaIntroduction To Streaming Data and Stream Processing with Apache Kafka
Introduction To Streaming Data and Stream Processing with Apache Kafka
 
Cloud-based Data Stream Processing
Cloud-based Data Stream ProcessingCloud-based Data Stream Processing
Cloud-based Data Stream Processing
 
Cloud Migration: Moving to the Cloud
Cloud Migration: Moving to the CloudCloud Migration: Moving to the Cloud
Cloud Migration: Moving to the Cloud
 
Migrating to Cloud - A Step by Step
Migrating to Cloud - A Step by Step Migrating to Cloud - A Step by Step
Migrating to Cloud - A Step by Step
 
Slides That Rock
Slides That RockSlides That Rock
Slides That Rock
 
cloud computing ppt
cloud computing pptcloud computing ppt
cloud computing ppt
 

Similar to Introduction to Data streaming - 05/12/2014

Introduction to Data streaming - 05/12/2014

  • 1. Data Streaming Raja Chiky – raja.chiky@isep.fr
  • 2. About me: Associate professor in Computer Science, LISITE-RDI. Research interests: data stream mining, scalability and resource optimization in distributed architectures (e.g. cloud architectures), recommender systems. Research field: large-scale data management, covering (1) real-time and distributed processing of various data sources; (2) use of semantic technologies to add a semantic layer; (3) recommender systems and collaborative data mining; (4) optimizing resources in large-scale systems. Diagram labels: heterogeneous and static data; heterogeneous and dynamic data streams (sensors).
  • 3. Outline: (1) Context: Big Data; (2) What is a data stream?; (3) Data stream management systems; (4) Basic approximate algorithms; (5) Big Data: distributed systems; (6) Semantic data streaming; (7) Conclusion.
  • 5. Big Data: Buzzword!
  • 6. Where is all this data coming from?
  • 7. More and more connected things
  • 8. So, what is Big Data? Big Data elements: Volume, Variety, Velocity, plus Veracity (IBM): information uncertainty. Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla.
      • Volume of data created worldwide: 5 EB from the dawn of time to 2003; 2.7 ZB in 2012; 10 ZB in 2015 (E).
      • Units: 1 GB = 10^9 bytes; 1 TB = 10^12 bytes; 1 PB = 10^15 bytes; 1 EB = 10^18 bytes; 1 ZB = 10^21 bytes; 1 YB = 10^24 bytes.
      • Variety of data: radio, TV, news, e-mails, Facebook posts, tweets, blogs, photos, videos (user and paid), RSS feeds, Wikipedia, GPS data, RFID, POS scanners, …
      • Velocity of data: Walmart handles 1M transactions per hour; Google processes 24 PB of data per day; AT&T transfers 30 PB of data per day; 90 trillion emails are sent per year; World of Warcraft uses 1.3 PB of storage; Facebook, when it had a user base of 900M users, held 25 PB of compressed data; 400M tweets per day in June '12; 72 hours of video are uploaded to YouTube every minute.
  • 9. (Architecture diagram.) Stages: data sources, gathering information, store, output, user interaction. Data sources: sensors and databases, producing static data and stream (big) data. Processing: ETL and batch processing for static data; semantic ETL and stream processing with continuous queries/business rules for data streams; load shedding. Storage: data warehouse, databases/triplestores (synopsis). Outputs: ad-hoc queries, analytics, knowledge enrichment, real-time visual analytics, retro-action.
  • 10. Big Data: Velocity. Examples: website logs, network monitoring, financial services, eCommerce, traffic control, weather forecasting, power consumption.
  • 11. What is a data stream? Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.” In practice: massive volumes of data, with items arriving at a high rate.
  • 12. Applications of data stream processing. Data stream processing: process queries (compute statistics, activate alarms); apply data mining algorithms. Requirements: real-time processing; one-pass processing; bounded storage (no complete storage of streams); possibly several streams to consider.
  • 13. Applications of data stream processing. Let's go deeper into some examples: network management; stock monitoring.
  • 14. Network management: supervision of a computer network; improvement of network configuration (hardware, software, architecture); detection of attacks; measurements made on routers and sent to a network supervision center. Characteristics: huge volume of data; high rate of arrivals.
  • 15. Network management: records received by the network supervision center.
      Timestamp | Source | Destination | Duration | Bytes | Protocol
      … | … | … | … | … | …
      12342 | 10.1.0.2 | 16.2.3.7 | 12 | 20K | http
      12343 | 18.6.7.1 | 12.4.0.3 | 16 | 24K | http
      12344 | 12.4.3.8 | 14.8.7.4 | 26 | 58K | http
      12345 | 19.7.1.2 | 16.5.5.8 | 18 | 80K | ftp
      … | … | … | … | … | …
  • 16. Network management: typical queries at the network supervision center: the 100 most frequent (@S, @D) pairs on router R1; how many different (@S, @D) pairs are seen on R1 but not on R2; each of these during the last month, last week, last day, or last hour.
  • 17. Stock monitoring: a stream of prices and sales volumes of stocks over time, used for technical analysis/charting by stock investors and to support trading decisions. Example queries: notify me when the price of IBM is above $83 and the first MSFT price afterwards is below $27; notify me when some stock goes up by at least 5% from one transaction to the next; notify me when the price of any stock increases monotonically for ≥ 30 min; notify me when the difference between the current price of a stock and its 10-day moving average is greater than some threshold value. Source: Gehrke 07 and Cayuga application scenarios (Cornell University).
  • 18. Where is the problem? (1/2) Example: a bank withdrawal of 50 € dated 05/12/2014 and a bank withdrawal of $1000 dated 12/05/2014 point to bank fraud. Detecting this requires: joins between several streams; joins between stream data and the customer database; generic tools for processing streams; avoiding the 'Store', 'Compute', 'Delete' approach. Solution: incremental computation and definition of temporal windows for joins.
  • 19. Where is the problem? (2/2) Example: the 100 most frequent @S IP addresses on a router. Maintain a table of IP addresses with frequencies? Sample the stream? Issues: facing a high (and varying) rate of arrivals; exact versus approximate answers.
  • 20. Examples of queries (figure: frequency histogram over elements 1 to 20): find elements with frequency > 0.1%; top-k; frequency of element 3; total frequency of elements between 8 and 14; number of elements having a non-zero frequency.
  • 21. Data Stream Management Systems: DBMS vs. DSMS.
      • Data model. DBMS: permanent updatable relations. DSMS: streams and permanent updatable relations.
      • Storage. DBMS: data is stored on disk. DSMS: permanent relations are stored on disk; streams are processed on the fly.
      • Query. DBMS: SQL language (creating structures; inserting/updating/deleting data; retrieving data with one-time queries). DSMS: SQL-like query language (standard SQL on permanent relations; extended SQL on streams with windowing; continuous queries).
      • Performance. DBMS: large volumes of data. DSMS: optimization of computer resources to deal with several streams and several queries; ability to face variations in arrival rates without crashing.
  • 22. Generic DSMS architecture (Golab & Özsu 2003). Diagram components: streaming inputs, input monitor, working storage, summary storage, static storage, query repository, query processor, output buffer, streaming outputs; external inputs are updates to static data and user queries.
  • 24. How to deal with Big Data streams: distribution; DSMS; sampling / load shedding / sketch / …
  • 25. Approximate answers to queries. When? Queries needing unbounded memory; too many queries, streams that are too fast, or response-time requirements that are too strict; CPU limit; memory limit. Solution: approximate answers to queries, obtained through sliding windows, sampling and load shedding, and the definition of synopses.
  • 26. Approaches. Two approaches for handling such streams: (1) use a time window, and query the window as a static table; (2) when you cannot store the collected data, or need to keep track of historical data, use sampling, filtering, or counting.
  • 27. Windowing: apply queries/mining tasks either to the whole stream (from its beginning to the current time) or to a portion of the stream (a window on the stream, ending at the current date).
  • 28. Windowing. Definition of windows of interest on streams: fixed windows (e.g. September 2014); sliding windows (e.g. the last 3 hours); landmark windows (e.g. from September 1st, 2014). Window specification: physical (last 3 hours) or logical (last 1000 items). Refreshing rate: rate of producing results (every item, every 10 items, every minute, …).
  • 29. Sliding window. Example: give me the last room where Axel has been in the last 10 minutes, updating results every minute. (Figure: timeline from the beginning of the stream; the window ending at the current time tc moves forward to t'c at each refreshment time, and results are produced at each refresh.)
  • 30. Sliding window vs. tumbling window. Example: give me the last room where Axel has been in the last 10 minutes, updating results every 10 minutes. (Figure: same timeline; with a tumbling window the refresh period equals the window length, so consecutive windows do not overlap.)
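The two window examples above can be sketched with a single generator; this is only an illustration, not code from the slides. The function name `windowed_last_room`, the `(timestamp, room)` event format, and the choice to emit a result only once a later event shows that a refresh time has passed are all assumptions. A tumbling window is obtained by setting the slide equal to the range.

```python
from collections import deque

def windowed_last_room(events, range_s=600, slide_s=60):
    """Yield (refresh_time, last_room) for a time-based window.

    events: iterable of (timestamp_in_seconds, room), sorted by timestamp.
    range_s: window length; slide_s: refresh period.
    slide_s == range_s gives a tumbling window.
    """
    window = deque()      # (timestamp, room) pairs still inside the window
    next_refresh = None
    for ts, room in events:
        if next_refresh is None:
            next_refresh = ts + slide_s
        # Emit every refresh that became due before this event arrived.
        while ts >= next_refresh:
            # Expire items that fell out of the last range_s seconds.
            while window and window[0][0] <= next_refresh - range_s:
                window.popleft()
            yield next_refresh, (window[-1][1] if window else None)
            next_refresh += slide_s
        window.append((ts, room))

events = [(0, "A"), (30, "B"), (700, "C")]
sliding = list(windowed_last_room(events, range_s=600, slide_s=60))
tumbling = list(windowed_last_room(events, range_s=600, slide_s=600))
```

Only the items of the current window are kept in memory, which is the point of windowing: storage stays bounded by the window size rather than by the stream length.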
  • 31. Sampling from a data stream. Inputs: sample size k; (optionally) window size n >> k, or alternatively a time duration m; a stream of data elements that arrive online. Output: k elements chosen uniformly at random from the last n elements (alternatively, from all elements that have arrived in the last m time units). Goal: maintain a data structure that can produce the desired output at any time upon request. Challenge: the length of the stream is unknown, so when and how often should we sample?
  • 32. A simple, unsatisfying approach: choose a random subset X = {x1, …, xk}, X ⊂ {0, 1, …, n-1}; the sample always consists of the non-expired elements whose indexes are equal to x1, …, xk (modulo n). It uses only O(k) memory. Technically it produces a uniform random sample of each window, but it is unsatisfying because the sample is highly periodic, which makes it unsuitable for many real applications, particularly those with periodicity in the data.
  • 33. Reservoir sampling: classic online algorithm due to Vitter (1985). It maintains a fixed-size uniform random sample, and the size of the data stream need not be known in advance. Data structure: a “reservoir” of k data elements. As the i-th data element arrives, add it to the reservoir with probability p = k/i (so the first k elements are always kept), discarding a randomly chosen data element from the reservoir to make room.
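The reservoir update rule above can be sketched in a few lines. This is a minimal illustration, not the slides' own code; the function name `reservoir_sample` and the optional `rng` parameter are assumptions.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # p = k/i >= 1: the first k items always enter the reservoir.
            reservoir.append(item)
        elif rng.random() < k / i:
            # Keep the i-th item with probability k/i, evicting a random victim.
            reservoir[rng.randrange(k)] = item
    return reservoir

sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
```

The stream is read once and memory stays at k items, which matches the one-pass and bounded-storage requirements listed earlier.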
  • 34. Chain sampling: a method for sequence-based windows. When the i-th element arrives, it is chosen to become the sample with probability 1/min(i, n). If the i-th element is chosen as the sample, the algorithm also selects the index of the element that will replace it when it expires (assuming that it is still the sample at that time). This index is picked uniformly at random from the range i+1 … i+n, the indexes of the elements that will be active when the i-th element expires. When the element with the selected index arrives, the algorithm stores it in memory and chooses the index of the element that will replace it when it expires, and so on, building a chain of elements to use when the current sample expires.
  • 35. Chain sampling (figure: the example stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3, shown at four successive steps of the algorithm).
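A minimal sketch of chain sampling for a sample of size one over a window of the last n items, following the description above. The function name `chain_sample` and the generator interface (one sample emitted per arrival) are assumptions made for illustration.

```python
import random

def chain_sample(stream, n, rng=random):
    """After each arrival, yield one uniform sample from the last n items."""
    chain = []        # (index, value) pairs; chain[0] is the current sample
    next_idx = None   # index of the future item that will extend the chain
    for i, x in enumerate(stream, start=1):
        if rng.random() < 1.0 / min(i, n):
            # The new item becomes the sample; the old chain is dropped.
            chain = [(i, x)]
            next_idx = rng.randint(i + 1, i + n)
        elif i == next_idx:
            # The awaited successor arrived: store it and pick its own successor.
            chain.append((i, x))
            next_idx = rng.randint(i + 1, i + n)
        # The head expires once it leaves the window of the last n items;
        # its successor has index <= head index + n, so it is already stored.
        if chain[0][0] <= i - n:
            chain.pop(0)
        yield chain[0][1]

n = 10
samples = list(chain_sample(range(1, 201), n, random.Random(1)))
```

In this example the stream values equal their positions, so each emitted sample can be checked against the window bounds directly.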
  • 36. Another simple approach: oversample. As each element arrives, remember it with probability $p = (c\,k/n)\log n$; otherwise discard it. Discard elements when they expire. When asked to produce a sample, choose k elements at random from the set in memory. Expected memory usage is O(k log n). The algorithm can fail if fewer than k elements from a window are remembered; however, with high probability this will not happen.
  • 37. Min-wise sampling: for each item, pick a random fraction between 0 and 1, and store the item(s) with the smallest random tag (Nath et al. 2004). Example tags: 0.391, 0.908, 0.291, 0.555, 0.619, 0.273. Each item has the same chance of receiving the least tag, so the sample is uniform. The method can run on multiple streams separately, and the results can then be merged.
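As a rough sketch of min-wise sampling with a sample of size one, the code below tags each item and keeps the smallest tag; merging two streams keeps the smaller of the two tags. The names `minwise_sample` and `merge` are illustrative assumptions, and the streams are assumed to be non-empty.

```python
import random

def minwise_sample(stream, rng=random):
    """Return (tag, item) for the item that drew the smallest random tag."""
    best_tag, best_item = None, None
    for item in stream:
        tag = rng.random()            # random fraction in [0, 1)
        if best_tag is None or tag < best_tag:
            best_tag, best_item = tag, item
    return best_tag, best_item

def merge(a, b):
    """Combine the samples of two streams: the smaller tag wins."""
    return min(a, b, key=lambda pair: pair[0])

rng = random.Random(7)
a = minwise_sample(["a1", "a2", "a3"], rng)
b = minwise_sample(["b1", "b2"], rng)
merged = merge(a, b)
```

Because the merged sample is the item with the least tag over both streams, it is uniform over their union, which is why the per-stream samplers can run independently.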
  • 38. Hash function (revision): any algorithm that maps data of arbitrary length to data of a fixed length. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.
  • 39. Stream filtering. Bloom filter: a compact data structure by B. H. Bloom (1970), made of a bit array of size m and a family H of k hash functions (e.g. MD5, SHA256, Murmur), representing a set S of n items. False positive probability: $f = (1 - e^{-kn/m})^k$. Example: m = 10, hash = MD5, k = 3.
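A minimal Bloom filter sketch matching the parameters above (bit array of size m, k hash functions). Deriving the k hash functions from salted SHA-256 digests is an assumption made for illustration, as are the names `BloomFilter` and `false_positive_rate`.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Assumption: the j-th hash function is SHA-256 salted with j.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # True for every added item; may also be True for others (false positive).
        return all(self.bits[p] for p in self._positions(item))

def false_positive_rate(m, k, n):
    """f = (1 - e^(-kn/m))^k, the approximation quoted on the slide."""
    return (1 - math.exp(-k * n / m)) ** k

bf = BloomFilter(m=1000, k=3)
for ip in ["10.1.0.2", "18.6.7.1", "12.4.3.8"]:
    bf.add(ip)
rate = false_positive_rate(m=1000, k=3, n=100)
```

A membership test never misses an item that was added; the only error is a false positive, whose probability grows as the bit array fills up.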
  • 40. Sketches. A sketch is a synopsis structure that takes advantage of high volumes of data; it provides an approximate result with probabilistic bounds, using random projections onto smaller spaces (hash functions). There are many sketch structures, each usually dedicated to a specialized task. Examples: COUNT (Flajolet 85); COUNT SKETCH (Charikar et al. 04).
  • 41. Difference between sampling and sketching. A sample sees only those items which were selected to be in the sample, whereas a sketch sees the entire input but is restricted to retaining only a small summary of it. Not every problem can be solved with sampling. Example: counting how many distinct items occur in the stream; if a large fraction of the items is not sampled, we do not know whether they are all the same or all different.
  • 42. Sketches: COUNT (Flajolet-Martin algorithm). Goal: the number N of distinct values in a stream (for large N); example: the number of distinct IP addresses going through a router. Sketch structure: SK, L bits initialized to 0 (e.g. 0 0 0 0 0 0 0 0 for L = 8); H, a hashing function transforming an element of the stream into L bits (e.g. H(18.6.7.1) = 0 0 1 0 1 0 1 0). H distributes the elements of the stream uniformly over the 2^L possibilities.
  • 43. Sketches: COUNT, method. Maintenance and update of SK, for each new element e: compute H(e); select the position of the leftmost 1 in H(e); force this position to 1 in SK. Example: SK = 0 1 0 0 1 0 0 1 and H(18.6.7.1) = 0 0 1 0 1 0 1 0 give the new SK = 0 1 1 0 1 0 0 1.
  • 44. Sketches: COUNT, result. Select the position R (0 … L-1) of the leftmost 0 in SK. $E(R) = \log_2(\varphi N)$ with $\varphi = 0.77351\ldots$, so the count is estimated by $2^R/\varphi$; $\sigma(R) = 1.12$. Example: SK = 1 1 1 0 1 0 0 0 gives R = 3. To improve the accuracy of this approximation algorithm, use multiple hash functions and use the average R instead.
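The COUNT sketch described on the three slides above can be sketched as follows. This is an illustration under assumptions: the hash functions are derived from salted SHA-256 digests truncated to L = 32 bits, and the names `h`, `leftmost_one` and `fm_estimate` are hypothetical.

```python
import hashlib

PHI = 0.77351
L = 32  # number of bits per sketch

def h(item, seed=0):
    """Assumed hash: first 32 bits of SHA-256 over a seed-salted item."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def leftmost_one(x, nbits=L):
    """Position (0 = leftmost bit) of the leftmost 1 in an nbits-wide value."""
    return None if x == 0 else nbits - x.bit_length()

def fm_estimate(stream, num_hashes=64):
    """Estimate the number of distinct items, averaging R over several hashes."""
    sketches = [[0] * L for _ in range(num_hashes)]
    for e in stream:
        for s, sk in enumerate(sketches):
            pos = leftmost_one(h(e, s))
            if pos is not None:
                sk[pos] = 1                        # force this position to 1
    # R = position of the leftmost 0 in each sketch.
    rs = [sk.index(0) if 0 in sk else L for sk in sketches]
    return 2 ** (sum(rs) / len(rs)) / PHI

ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
estimate = fm_estimate(ips)
```

Repeating elements leaves the sketch unchanged, so the estimate depends only on the set of distinct values, and each sketch costs L bits regardless of the stream length.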
  • 45. Sketches: COUNT SKETCH algorithm (Charikar et al. 2004). Goal: the k most frequent elements in a stream (for a large number N of distinct values); example: the 100 most frequent IP addresses going through a router. (Figure: input stream 2, 0, 1, 3, 1, 2, 4, … and the histogram of the frequencies f(0) … f(4).)
  • 46. Sketches: COUNT SKETCH (figure: an array of signed counters with t rows and B columns; an arriving element e updates one counter per row by +1 or -1).
  • 47. Sketches: COUNT SKETCH, structure. h: hash function from [0, …, N-1] to [1, …, B]; s: hash function from [0, …, N-1] to {+1, -1}; an array of B counters $C_1, \ldots, C_B$ (with B << N). Sketch maintenance when e arrives: $C_{h(e)} \leftarrow C_{h(e)} + s(e)$. Use of the sketch: the frequency of object e is estimated by $n_e \approx C_{h(e)} \cdot s(e)$. In practice, t hash functions $h_j$ and t hash functions $s_j$ are used, each pair with its own row of B counters: $n_e \approx \operatorname{median}_{j \in [1 \ldots t]} \left( C_{j,h_j(e)} \cdot s_j(e) \right)$. Theoretical results bound the error as a function of N, t and B.
  • 48. Sketches: COUNT SKETCH, algorithm. Maintain a list (e1, e2, …, ek) of the current k most frequent elements, ek being the least frequent of them. For a new arriving element e: add e to the sketch structure; estimate the frequency f(e) of e from the sketch structure; if e is not in the list and f(e) > f(ek), remove ek and insert e into the list.
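The structure and the top-k maintenance described above can be sketched together. This is a hedged illustration: the hash functions $h_j$ and $s_j$ are derived from salted SHA-256 digests, buckets are numbered from 0 rather than 1, and the names `CountSketch` and `top_k` are assumptions.

```python
import hashlib
from statistics import median

class CountSketch:
    def __init__(self, t, B):
        self.t, self.B = t, B
        self.C = [[0] * B for _ in range(t)]   # t rows of B signed counters

    def _hash(self, j, e):
        # Assumption: h_j and s_j both come from one salted SHA-256 digest.
        d = hashlib.sha256(f"{j}:{e}".encode()).digest()
        bucket = int.from_bytes(d[:8], "big") % self.B   # h_j(e)
        sign = 1 if d[8] & 1 else -1                      # s_j(e)
        return bucket, sign

    def add(self, e):
        for j in range(self.t):
            bucket, sign = self._hash(j, e)
            self.C[j][bucket] += sign

    def estimate(self, e):
        values = []
        for j in range(self.t):
            bucket, sign = self._hash(j, e)
            values.append(self.C[j][bucket] * sign)
        return median(values)

def top_k(stream, k, t=5, B=256):
    """Track the k elements with the largest estimated frequencies."""
    sketch, top = CountSketch(t, B), {}
    for e in stream:
        sketch.add(e)
        est = sketch.estimate(e)
        if e in top or len(top) < k:
            top[e] = est
        else:
            weakest = min(top, key=top.get)
            if est > top[weakest]:
                del top[weakest]
                top[e] = est
    return sorted(top, key=top.get, reverse=True)

single = CountSketch(t=5, B=256)
for _ in range(7):
    single.add("18.6.7.1")
stream = ["a"] * 50 + ["b"] * 30 + [f"x{i}" for i in range(20)]
heavy = top_k(stream, k=2)
```

The random signs make collisions cancel out on average, and the median over the t rows limits the influence of any row where a heavy element happens to collide.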
  • 49. Distributed systems answering today's Big Data needs: S4, Storm, SAMOA, …
  • 50. Yahoo S4: Simple Scalable Streaming System. It uses a decentralized and symmetric architecture: all nodes are similar, and there is no master node. It is based on Processing Elements (PEs).
      • Each PE has 4 components: (1) its functionality, defined by a PE class and associated configuration, with an input event handler processEvent() and an output mechanism output(); (2) the types of events that it consumes; (3) the keyed attribute in those events; (4) the value of the keyed attribute in the events it consumes.
      • Several PEs are available for standard tasks such as count, aggregate, join, …; custom PEs can easily be programmed.
      • The PEs are automatically deployed on the Processing Nodes (machines), which ensures load balancing of events and notifies the appropriate PEs when an event comes in.
  • 52. Twitter Storm: same principle as S4, with a simplified programming model. Storm provides real-time computation; it is scalable, guarantees no data loss, is extremely robust and fault-tolerant, and is programming-language agnostic. Concepts: streams, spouts, bolts, topologies.
  • 53. Lambda architecture (by Nathan Marz): a generic, scalable and fault-tolerant data processing architecture. (Diagram labels: data flow; batch layer with batch processing and precomputed views; speed layer with real-time stream processing; serving layer; queries.) Source: Mathieu DESPRIEE (USI).
  • 54. Lambda architecture (http://lambda-architecture.net/):
      1. All data entering the system is dispatched to both the batch layer and the speed layer for processing.
      2. The batch layer has two functions: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) pre-computing the batch views.
      3. The serving layer indexes the batch views so that they can be queried in an ad-hoc way.
      4. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
      5. Any incoming query can be answered by merging results from batch views and real-time views.
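As a toy illustration of the five steps above, the sketch below keeps a per-key event count in a batch view and a real-time view and merges them at query time. It is a single-process simplification under assumptions: the class name `LambdaCounter` and its methods are hypothetical and do not correspond to any real framework API.

```python
from collections import Counter

class LambdaCounter:
    def __init__(self):
        self.master = []                 # immutable, append-only raw data
        self.batch_view = Counter()      # recomputed from the master dataset
        self.realtime_view = Counter()   # covers only data newer than the last batch

    def ingest(self, key):
        # Step 1: every event goes to both the batch layer and the speed layer.
        self.master.append(key)
        self.realtime_view[key] += 1

    def run_batch(self):
        # Step 2: pre-compute the batch view from the whole master dataset;
        # the speed layer then only needs to cover data arriving afterwards.
        self.batch_view = Counter(self.master)
        self.realtime_view.clear()

    def query(self, key):
        # Step 5: merge the results from the batch view and the real-time view.
        return self.batch_view[key] + self.realtime_view[key]

system = LambdaCounter()
for key in ["a", "a", "b"]:
    system.ingest(key)
before_batch = system.query("a")
system.run_batch()
system.ingest("a")
after_batch = system.query("a")
```

The query result is the same whether or not a batch run has happened; the batch run only moves counts from the low-latency view to the precomputed one.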
  • 55. Big Data stream mining: machine learning tools. Distributed, batch: Hadoop, Mahout. Distributed, stream: S4, Storm, SAMOA. Non-distributed, batch: R, WEKA, … Non-distributed, stream: MOA.
  • 56. What is SAMOA? A new software framework for mining distributed data streams; Big Data mining for evolving streams in real time.
  • 57. SAMOA architecture. Layers: clustering methods, classifier methods and frequent pattern mining on top of SAMOA, which runs on S4, Storm, … Use S4, Storm, or another distributed stream processing platform; use MOA, or another streaming machine learning library; easy to extend through packages.
  • 59. Too many data streams
  • 60. Types of data used in Big Data initiatives: internal data, traditional sources, “new data”. Source: Big Data opportunities survey, Unisphere / SAP, May 2013.
  • 61. BI: new generation. (Diagram labels: heterogeneous and static data; heterogeneous and dynamic data streams from sensors; semantic data streams; interconnection; ontologies; semantic filtering and continuous queries; decision: monitoring, alerts, statistics, fault detection, etc.; retro-action.)
  • 62. Semantic Web technologies for data streams: annotate stream data with semantic metadata; apply Linked Data principles to publish streaming data; interlink streaming data with existing datasets; integrate data stream processing and reasoning. Objectives: interoperability, automation, enrichment.
  • 63. Existing prototypes: CQELS, SPARQL-Stream, EP-SPARQL, C-SPARQL.
  • 64. Approaches: RDF stream models (timestamps, events, time intervals, triple-based, graph-based, …).
  • 65. Example: SRBench. Q: Detect if a hurricane has been observed. “A hurricane has a sustained wind (for more than 3 hours) of at least 33 metres per second or 74 miles per hour (119 km/h).” The slide shows the query in two continuous SPARQL dialects.
      First variant:
        ASK WHERE {
          STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 10800s SLIDE 600s] {
            ?observation om-owl:procedure ?sensor ;
                         om-owl:observedProperty weather:WindSpeed ;
                         om-owl:result [ om-owl:floatValue ?value ] .
          }
        }
        GROUP BY ?sensor
        HAVING ( AVG(?value) >= "74"^^xsd:float )
      Second variant:
        ASK FROM STREAM <http://www.cwi.nl/SRBench/observations> [RANGE 3h STEP 10m]
        WHERE {
          ?observation om-owl:procedure ?sensor ;
                       om-owl:observedProperty weather:WindSpeed ;
                       om-owl:result [ om-owl:floatValue ?value ] .
        }
        GROUP BY ?sensor
        HAVING ( AVG(?value) >= "74"^^xsd:float )
  • 66. How to deal with Big Data streams: cloud computing and DSMS ([1] and [2]); sampling / load shedding ([3]). [1] J. Hoeksema & S. Kotoulas: High-performance Distributed Stream Reasoning using S4 (ISWC 2011). [2] D. L. Phuoc et al.: Elastic and Scalable Processing of Linked Stream Data in the Cloud (ISWC 2013). [3] N. Jain et al.: Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues (DaEng 2013).
  • 67. Sampling extensions for continuous SPARQL. Operators: [UNISAMPLING %{Sampling Percentage}], [RESSAMPLING %{Reservoir Size}], [CHNSAMPLING %{Window Size}]. Source: N. Jain et al.: Sampling Semantic Data Stream: Resolving Overload and Limited Storage Issues (DaEng 2013). Example:
        PREFIX vocab: http://data-gov.tw.rpi.edu/vocab/p/8/
        SELECT ?val
        WHERE {
          STREAM <C:/CQELS/streams/data-8.stream> [NOW] [UNISAMPLING %80]
            { ?rawData vocab:ozone_8hr_daily_max ?val }
          GRAPH <C:/CQELS/test.rdf>
            { ?user sioc:account_of ?person }
        }
  • 68. Semantic data stream load shedding (1/2). (Figure: RDF graph of an air-temperature observation and a relative-humidity observation, each with type, observedProperty, procedure, result (a MeasureData node with floatValue and uom) and samplingTime (an Instant with inXSDDateTime 2004-08-08T06:25:00); four nodes are marked (a) to (d), and two triples, (1) and (2), are drawn as dotted lines.) RDF triple approach: effect of deleting triples (1) and (2) from the graph. Deleting the first RDF triple destroys the link connecting node (a) to node (b); deleting the second destroys the link connecting node (c) to node (d). As a result, nodes (b) and (d) and all the nodes connected to them become unreachable, even though their data is still in memory. This represents 6 RDF triples out of 18, i.e. 33% of unusable data.
  • 69. Semantic data stream load shedding (2/2). RDF graph approach: delete whole sub-graphs, such as those formed by node (b) or (d) and all the nodes to which they are directly connected. Benefits: preserving the semantic level of the information; protecting the data consistency of the whole graph; enhancing semantic data stream systems.
  • 70. Conclusion: Big Data stream challenges. Semantic information aggregation: “too much data to assimilate but not enough knowledge to act”. Distributed and real-time processing: design of real-time and distributed algorithms for stream processing and information aggregation; distribution and parallelization of data mining algorithms. Visual analytics and user modeling: dynamic user model; novel visualizations for very large datasets.
  • 71. Thanks to: Zakia Kazi Aoul, ISEP; Marie-Aude Aufaure, ECP; Fethi Belghaouti, ISEP-INT; Georges Hébrail, EDF R&D; Sylvain Lefebvre, ISEP; Yousra Chabchoub, ISEP.
  • 72. Semantic Technologies Living Lab / Linked & Big Data Academic Chair. Big Data (Volume, Variety, Velocity, Veracity, … Value): integrate, aggregate, analyze and visualize large data sets, whatever their type, their provenance, or the speed of their flow. Linked Data (Web of data, Semantic Web): a set of principles and good practices for linking, publishing and searching web data; structure and semantically enrich RDF data, with very high scalability -> Big Linked Data. Together: Big Linked Data, Linked Big Data. Our value proposition: semantic aggregation from textual and non-textual streams; managing semantic heterogeneity, with real-time and distributed processing; ensuring data quality and veracity; visual analytics.