internet of things data science
Albert Bifet
@abifet
Paris, 7 October 2015
albert.bifet@telecom-paristech.fr
internet of things data science architecture
1
real time analytics
2
real time analytics
3
introduction: data streams
Data Streams
• Sequence is potentially infinite
• High amount of data: sublinear space
• High speed of arrival: sublinear time per example
• Once an element from a data stream has been processed it
is discarded or archived
Example
Puzzle: Finding Missing Numbers
• Let π be a permutation of {1,...,n}.
• Let π−1 be π with one element missing.
• π−1[i] arrives in increasing order
Task: Determine the missing number
4
introduction: data streams
Data Streams
• Sequence is potentially infinite
• High amount of data: sublinear space
• High speed of arrival: sublinear time per example
• Once an element from a data stream has been processed it
is discarded or archived
Example
Puzzle: Finding Missing Numbers
• Let π be a permutation of {1,...,n}.
• Let π−1 be π with one element missing.
• π−1[i] arrives in increasing order
Task: Determine the missing number
Use an n-bit vector to memorize all the numbers (O(n) space)
4
introduction: data streams
Data Streams
• Sequence is potentially infinite
• High amount of data: sublinear space
• High speed of arrival: sublinear time per example
• Once an element from a data stream has been processed it
is discarded or archived
Example
Puzzle: Finding Missing Numbers
• Let π be a permutation of {1,...,n}.
• Let π−1 be π with one element missing.
• π−1[i] arrives in increasing order
Task: Determine the missing number
Data Streams:
O(log(n)) space.
4
introduction: data streams
Data Streams
• Sequence is potentially infinite
• High amount of data: sublinear space
• High speed of arrival: sublinear time per example
• Once an element from a data stream has been processed it
is discarded or archived
Example
Puzzle: Finding Missing Numbers
• Let π be a permutation of {1,...,n}.
• Let π−1 be π with one element missing.
• π−1[i] arrives in increasing order
Task: Determine the missing number
Data Streams:
O(log(n)) space.
Store n(n+1)/2 − ∑j≤i π−1[j].
4
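A minimal Python sketch (not part of the slides) of the O(log n) solution: keep only a running sum of the values seen and subtract it from n(n+1)/2. The function name and the example stream are illustrative.

```python
def find_missing(stream, n):
    """Return the missing number from a stream that contains a
    permutation of {1, ..., n} with one element removed.
    Only a running sum is stored: O(log n) bits of state."""
    running_sum = 0
    for value in stream:            # single pass; each element is seen once
        running_sum += value
    return n * (n + 1) // 2 - running_sum

# Example: a permutation of {1,...,5} with 3 removed
print(find_missing(iter([5, 1, 4, 2]), 5))   # -> 3
```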
data streams
Data Streams
• Sequence is potentially infinite
• High amount of data: sublinear space
• High speed of arrival: sublinear time per example
• Once an element from a data stream has been processed it
is discarded or archived
Tools:
• approximation
• randomization, sampling
• sketching
5
data streams
Data Streams
• Sequence is potentially infinite
• High amount of data: sublinear space
• High speed of arrival: sublinear time per example
• Once an element from a data stream has been processed it
is discarded or archived
Approximation algorithms
• Small error rate with high probability
• An algorithm (ε,δ)-approximates F if it outputs F̃ for which
Pr[|F̃ − F| > εF] < δ.
5
data streams approximation algorithms
1011000111 1010101
Sliding Window
We can maintain simple statistics over sliding windows, using
O((1/ε) log² N) space, where
• N is the length of the sliding window
• ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani.
Maintaining stream statistics over sliding windows. 2002
6
data streams approximation algorithms
10110001111 0101011
Sliding Window
We can maintain simple statistics over sliding windows, using
O((1/ε) log² N) space, where
• N is the length of the sliding window
• ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani.
Maintaining stream statistics over sliding windows. 2002
6
data streams approximation algorithms
101100011110 1010111
Sliding Window
We can maintain simple statistics over sliding windows, using
O((1/ε) log² N) space, where
• N is the length of the sliding window
• ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani.
Maintaining stream statistics over sliding windows. 2002
6
data streams approximation algorithms
1011000111101 0101110
Sliding Window
We can maintain simple statistics over sliding windows, using
O((1/ε) log² N) space, where
• N is the length of the sliding window
• ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani.
Maintaining stream statistics over sliding windows. 2002
6
data streams approximation algorithms
10110001111010 1011101
Sliding Window
We can maintain simple statistics over sliding windows, using
O((1/ε) log² N) space, where
• N is the length of the sliding window
• ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani.
Maintaining stream statistics over sliding windows. 2002
6
data streams approximation algorithms
101100011110101 0111010
Sliding Window
We can maintain simple statistics over sliding windows, using
O((1/ε) log² N) space, where
• N is the length of the sliding window
• ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani.
Maintaining stream statistics over sliding windows. 2002
6
Classification
7
classification
Definition
Given nC different classes, a classifier algorithm builds a model that predicts, for every unlabelled instance I, the class C to which it belongs with high accuracy.
Example
A spam filter
Example
Twitter Sentiment analysis: analyze tweets with positive or
negative feelings
8
data stream classification cycle
1 Process an example at a time,
and inspect it only once (at
most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
9
classification
Data set that
describes e-mail
features for
deciding if it is
spam.
Example
Contains Domain Has Time
“Money” type attach. received spam
yes com yes night yes
yes edu no night yes
no com yes night yes
no edu no day no
no com no day no
yes cat no day yes
Assume we have to classify the following new instance:
Contains Domain Has Time
“Money” type attach. received spam
yes edu yes day ?
10
bayes classifiers
Naïve Bayes
• Based on Bayes' Theorem:
P(c|d) = P(c) P(d|c) / P(d)
posterior = (prior × likelihood) / evidence
• Estimates the probability of observing attribute a and the prior probability P(c)
• Probability of class c given an instance d:
P(c|d) = P(c) ∏a∈d P(a|c) / P(d)
11
bayes classifiers
Multinomial Naïve Bayes
• Considers a document as a bag-of-words.
• Estimates the probability of observing word w and the prior
probability P(c)
• Probability of class c given a test document d:
P(c|d) = P(c) ∏w∈d P(w|c)^nwd / P(d)
12
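As an illustration of the formula above, here is a small Python sketch (not from the slides) that computes the unnormalized log-score of a class for a document; the toy counts, the Laplace smoothing constant alpha, and all names are hypothetical.

```python
import math
from collections import Counter

def log_score(doc_words, prior, word_counts, total_count, vocab_size, alpha=1.0):
    """log P(c) + sum over words of n_wd * log P(w|c), with Laplace smoothing.
    P(d) is omitted because it is the same for every class."""
    score = math.log(prior)
    for word, n_wd in Counter(doc_words).items():
        p_wc = (word_counts.get(word, 0) + alpha) / (total_count + alpha * vocab_size)
        score += n_wd * math.log(p_wc)
    return score

# Hypothetical word counts for a class "spam"
spam_counts = Counter({"money": 30, "free": 20, "meeting": 2})
print(log_score(["free", "money", "money"], prior=0.4,
                word_counts=spam_counts,
                total_count=sum(spam_counts.values()), vocab_size=1000))
```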
perceptron
[Diagram: perceptron with inputs Attribute 1–5, weights w1–w5, and output h⃗w(⃗xi)]
• Data stream: ⟨⃗xi,yi⟩
• Classical perceptron: h⃗w(⃗xi) = sgn(⃗wT⃗xi),
• Minimize the mean-square error: J(⃗w) = ½ ∑(yi − h⃗w(⃗xi))²
13
perceptron
[Diagram: perceptron with inputs Attribute 1–5, weights w1–w5, and output h⃗w(⃗xi)]
• We use the sigmoid function h⃗w = σ(⃗wT⃗x), where
σ(x) = 1/(1+e^(−x))
σ′(x) = σ(x)(1−σ(x))
13
perceptron
• Minimize the mean-square error: J(⃗w) = ½ ∑(yi − h⃗w(⃗xi))²
• Stochastic Gradient Descent: ⃗w = ⃗w − η ∇J⃗xi
• Gradient of the error function:
∇J = −∑i (yi − h⃗w(⃗xi)) ∇h⃗w(⃗xi)
∇h⃗w(⃗xi) = h⃗w(⃗xi)(1 − h⃗w(⃗xi))
• Weight update rule:
⃗w = ⃗w + η ∑i (yi − h⃗w(⃗xi)) h⃗w(⃗xi)(1 − h⃗w(⃗xi)) ⃗xi
13
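The update rule above translates directly into a few lines of Python; this is a didactic sketch (not MOA's implementation), with the learning rate η and the toy stream chosen arbitrarily.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_update(w, x, y, eta=0.1):
    """One stochastic gradient step: w <- w + eta * (y - h) * h * (1 - h) * x."""
    h = sigmoid(np.dot(w, x))
    return w + eta * (y - h) * h * (1.0 - h) * x

# Process a (toy) stream one example at a time
w = np.zeros(3)
stream = [(np.array([1.0, 0.5, -0.2]), 1), (np.array([1.0, -1.0, 0.3]), 0)]
for x, y in stream:
    w = perceptron_update(w, x, y)
print(w)
```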
perceptron
Perceptron Learning(Stream, η)
1 for each class
2 do Perceptron Learning(Stream, class, η)
Perceptron Learning(Stream, class, η)
1 ▷ Let w0 and ⃗w be randomly initialized
2 for each example (⃗x, y) in Stream
3 do if class = y
4 then δ = (1 − h⃗w(⃗x)) · h⃗w(⃗x) · (1 − h⃗w(⃗x))
5 else δ = (0 − h⃗w(⃗x)) · h⃗w(⃗x) · (1 − h⃗w(⃗x))
6 ⃗w = ⃗w + η · δ · ⃗x
Perceptron Prediction(⃗x)
1 return argmax_class h⃗w_class(⃗x)
14
classification
Data set that
describes e-mail
features for
deciding if it is
spam.
Example
Contains Domain Has Time
“Money” type attach. received spam
yes com yes night yes
yes edu no night yes
no com yes night yes
no edu no day no
no com no day no
yes cat no day yes
Assume we have to classify the following new instance:
Contains Domain Has Time
“Money” type attach. received spam
yes edu yes day ?
15
classification
• Assume we have to classify the following new instance:
Contains Domain Has Time
“Money” type attach. received spam
yes edu yes day ?
[Decision tree: root splits on Time; Night → YES; Day → split on Contains “Money”: Yes → YES, No → NO]
15
decision trees
Basic induction strategy:
• A ← the “best” decision attribute for next node
• Assign A as decision attribute for node
• For each value of A, create new descendant of node
• Sort training examples to leaf nodes
• If training examples perfectly classified, Then STOP, Else
iterate over new leaf nodes
16
hoeffding trees
Hoeffding Tree : VFDT
Pedro Domingos and Geoff Hulten.
Mining high-speed data streams. 2000
• With high probability, constructs a model identical to the one a traditional (greedy) method would learn
• With theoretical guarantees on the error rate
[Decision tree: root splits on Time; Night → YES; Day → split on Contains “Money”: Yes → YES, No → NO]
17
hoeffding bound inequality
Probability of deviation of its expected value.
18
hoeffding bound inequality
Let X = ∑i Xi where X1,...,Xn are independent and identically distributed in [0,1]. Then
1 Chernoff For each ε < 1
Pr[X > (1+ε)E[X]] ≤ exp(−(ε²/3) E[X])
2 Hoeffding For each t > 0
Pr[X > E[X]+t] ≤ exp(−2t²/n)
3 Bernstein Let σ² = ∑i σi² be the variance of X. If Xi − E[Xi] ≤ b for each i ∈ [n], then for each t > 0
Pr[X > E[X]+t] ≤ exp(−t² / (2σ² + (2/3)bt))
19
hoeffding tree or vfdt
HT(Stream, δ)
1 ▷ Let HT be a tree with a single leaf (root)
2 ▷ Init counts nijk at root
3 for each example (x, y) in Stream
4 do HTGrow((x, y), HT, δ)
HTGrow((x, y), HT, δ)
1 ▷ Sort (x, y) to leaf l using HT
2 ▷ Update counts nijk at leaf l
3 if examples seen so far at l are not all of the same class
4 then ▷ Compute G for each attribute
5 if G(Best Attr.) − G(2nd best) > √(R² ln(1/δ) / 2n)
6 then ▷ Split leaf on best attribute
7 for each branch
8 do ▷ Start new leaf and initialize counts
20
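The split test in line 5 can be sketched as follows, under the assumption that G is a gain heuristic with range R (e.g. R = 1 for binary-class information gain); the tie-breaking parameter τ from the next slide is included, and all names are illustrative.

```python
import math

def hoeffding_bound(R, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, R, delta, n, tau=0.05):
    """Split if the gain difference beats the Hoeffding bound, or if the
    bound has shrunk below tau (tie between the two best attributes)."""
    eps = hoeffding_bound(R, delta, n)
    return (g_best - g_second > eps) or (eps < tau)

print(should_split(g_best=0.30, g_second=0.20, R=1.0, delta=1e-7, n=1000))
```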
hoeffding trees
HT features
• With high probability, constructs a model identical to the one a traditional (greedy) method would learn
• Ties: when two attributes have similar G, split if
G(Best Attr.) − G(2nd best) < √(R² ln(1/δ) / 2n) < τ
• Compute G every nmin instances
• Memory: deactivate least promising nodes with lower pl ×el
• pl is the probability to reach leaf l
• el is the error in the node
21
hoeffding naive bayes tree
Hoeffding Tree
Majority Class learner at leaves
Hoeffding Naive Bayes Tree
G. Holmes, R. Kirkby, and B. Pfahringer.
Stress-testing Hoeffding trees, 2005.
• monitors accuracy of a Majority Class learner
• monitors accuracy of a Naive Bayes learner
• predicts using the most accurate method
22
bagging
Example
Dataset of 4 Instances : A, B, C, D
Classifier 1: B, A, C, B
Classifier 2: D, B, A, D
Classifier 3: B, A, C, B
Classifier 4: B, C, B, B
Classifier 5: D, C, A, C
Bagging builds a set of M base models, with a bootstrap
sample created by drawing random samples with
replacement.
23
bagging
Example
Dataset of 4 Instances : A, B, C, D
Classifier 1: A, B, B, C
Classifier 2: A, B, D, D
Classifier 3: A, B, B, C
Classifier 4: B, B, B, C
Classifier 5: A, C, C, D
Bagging builds a set of M base models, with a bootstrap
sample created by drawing random samples with
replacement.
23
bagging
Example
Dataset of 4 Instances : A, B, C, D
Classifier 1: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 2: A, B, D, D: A(1) B(1) C(0) D(2)
Classifier 3: A, B, B, C: A(1) B(2) C(1) D(0)
Classifier 4: B, B, B, C: A(0) B(3) C(1) D(0)
Classifier 5: A, C, C, D: A(1) B(0) C(2) D(1)
Each base model’s training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution.
23
bagging
Figure 1: Poisson(1) Distribution.
Each base model’s training set contains each of the original training examples K times, where P(K = k) follows a binomial distribution.
23
oza and russell’s online bagging for m models
1: Initialize base models hm for all m ∈ {1,2,...,M}
2: for all training examples do
3: for m = 1,2,...,M do
4: Set w = Poisson(1)
5: Update hm with the current example with weight w
6: anytime output:
7: return hypothesis: hfin(x) = argmax_{y∈Y} ∑_{t=1..T} I(ht(x) = y)
24
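A Python sketch that mirrors the listing above; the base-learner interface (learn with a weight, predict) is an assumption, not part of the slides.

```python
import numpy as np

class OnlineBagging:
    """Oza & Russell's online bagging: every example is given to each of
    the M base models with a Poisson(1)-distributed weight."""

    def __init__(self, base_models, seed=0):
        self.models = base_models            # assumed API: learn(x, y, weight), predict(x)
        self.rng = np.random.default_rng(seed)

    def learn(self, x, y):
        for model in self.models:
            w = self.rng.poisson(1.0)        # K ~ Poisson(1) approximates the bootstrap
            if w > 0:
                model.learn(x, y, weight=w)

    def predict(self, x):
        votes = [m.predict(x) for m in self.models]
        return max(set(votes), key=votes.count)   # majority vote over the ensemble
```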
Evolving Stream Classification
25
data mining algorithms with concept drift
[Diagram: No Concept Drift — input → DM Algorithm (with Counter1–Counter5) → output]
[Diagram: Concept Drift — input → DM Algorithm with a Static Model and a Change Detector → output]
26
data mining algorithms with concept drift
[Diagram: No Concept Drift — input → DM Algorithm (with Counter1–Counter5) → output]
[Diagram: Concept Drift — input → DM Algorithm (with Estimator1–Estimator5) → output]
26
optimal change detector and predictor
• High accuracy
• Low false positives and false negatives ratios
• Theoretical guarantees
• Fast detection of change
• Low computational cost: minimum space and time needed
• No parameters needed
27
algorithm adaptive sliding window
Example
W= 101010110111111
W0= 1
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Example
W= 101010110111111
W0= 1 W1 = 01010110111111
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Example
W= 101010110111111
W0= 10 W1 = 1010110111111
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Example
W= 101010110111111
W0= 101 W1 = 010110111111
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Example
W= 101010110111111
W0= 1010 W1 = 10110111111
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Example
W= 101010110111111
W0= 10101 W1 = 0110111111
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Example
W= 101010110111111
W0= 101010 W1 = 110111111
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Example
W= 101010110111111
W0= 1010101 W1 = 10111111
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Example
W= 101010110111111
W0= 10101011 W1 = 0111111
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Example
W= 101010110111111   |µ̂W0 − µ̂W1| ≥ εc : CHANGE DET.!
W0= 101010110 W1 = 111111
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Example
W= 101010110111111 Drop elements from the tail of W
W0= 101010110 W1 = 111111
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Example
W= 01010110111111 Drop elements from the tail of W
W0= 101010110 W1 = 111111
ADWIN: Adaptive Windowing Algorithm
1 Initialize Window W
2 for each t > 0
3 do W ← W ∪ {xt} (i.e., add xt to the head of W)
4 repeat Drop elements from the tail of W
5 until |µ̂W0 − µ̂W1| ≥ εc holds
6 for every split of W into W = W0 · W1
7 Output µ̂W
28
algorithm adaptive sliding window
Theorem
At every time step we have:
1 (False positive rate bound). If µt remains constant within W,
the probability that ADWIN shrinks the window at this step is at
most δ.
2 (False negative rate bound). Suppose that for some partition of W in two parts W0W1 (where W1 contains the most recent items) we have |µW0 − µW1| > 2εc. Then with probability 1−δ ADWIN shrinks W to W1, or shorter.
ADWIN tunes itself to the data stream at hand, with no need for
the user to hardwire or precompute parameters.
29
algorithm adaptive sliding window
ADWIN using a Data Stream Sliding Window Model,
• can provide the exact counts of 1’s in O(1) time per point.
• tries O(logW) cutpoints
• uses O((1/ε) log W) memory words
• the processing time per example is O(logW) (amortized and
worst-case).
Sliding Window Model
1010101 101 11 1 1
Content: 4 2 2 1 1
Capacity: 7 3 2 1 1
30
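For intuition, a naive Python sketch of the windowing idea: it stores the whole window and checks every split, so it is quadratic rather than the O(log W) bucket scheme above, and it uses a fixed cut threshold instead of ADWIN's δ-based εc. All constants are illustrative.

```python
from collections import deque

def adaptive_window_update(window, x, eps_cut=0.3, min_split=5):
    """Append x, then shrink from the tail while some split W = W0 . W1
    has |mean(W0) - mean(W1)| >= eps_cut.  Naive O(|W|^2) illustration."""
    window.append(x)
    shrunk = True
    while shrunk and len(window) > 2 * min_split:
        shrunk = False
        items = list(window)                       # items[0] is the oldest element
        for i in range(min_split, len(items) - min_split + 1):
            mu0 = sum(items[:i]) / i
            mu1 = sum(items[i:]) / (len(items) - i)
            if abs(mu0 - mu1) >= eps_cut:
                window.popleft()                   # drop one element from the tail
                shrunk = True
                break
    return sum(window) / len(window)               # estimate of the current mean

w = deque()
for bit in [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]:
    mean = adaptive_window_update(w, bit)
print(len(w), round(mean, 2))
```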
vfdt / cvfdt
Concept-adapting Very Fast Decision Trees: CVFDT
G. Hulten, L. Spencer, and P. Domingos.
Mining time-changing data streams. 2001
• It keeps its model consistent with a sliding window of
examples
• Construct “alternative branches” as preparation for changes
• If the alternative branch becomes more accurate, a switch of tree branches occurs
[Decision tree: root splits on Time; Night → YES; Day → split on Contains “Money” (Yes / No)]
31
decision trees: cvfdt
[Decision tree: root splits on Time; Night → YES; Day → split on Contains “Money”: Yes → YES, No → NO]
No theoretical guarantees on the error rate of CVFDT
CVFDT parameters :
1 W: is the example window size.
2 T0: number of examples used to check at each node if the
splitting attribute is still the best.
3 T1: number of examples used to build the alternate tree.
4 T2: number of examples used to test the accuracy of the alternate tree
32
decision trees: hoeffding adaptive tree
Hoeffding Adaptive Tree:
• replace frequency statistics counters by estimators
• no need for a window to store examples, since the statistics needed are maintained by the estimators
• change the way alternate subtrees are checked for substitution, using a change detector with theoretical guarantees
Advantages over CVFDT:
1 Theoretical guarantees
2 No Parameters
33
adwin bagging (kdd’09)
ADWIN
An adaptive sliding window whose size is recomputed online
according to the rate of change observed.
ADWIN has rigorous guarantees (theorems)
• On ratio of false positives and negatives
• On the relation of the size of the current window and change
rates
ADWIN Bagging
When a change is detected, the worst classifier is removed
and a new classifier is added.
34
Randomization as a powerful tool to increase accuracy and
diversity
There are three ways of using randomization:
• Manipulating the input data
• Manipulating the classifier algorithms
• Manipulating the output targets
35
leveraging bagging for evolving data streams
Leveraging Bagging
• Using Poisson(λ)
Leveraging Bagging MC
• Using Poisson(λ) and Random Output Codes
Fast Leveraging Bagging ME
• if an instance is misclassified: weight = 1
• if not: weight = eT/(1−eT),
36
empirical evaluation
Accuracy RAM-Hours
Hoeffding Tree 74.03 0.01
Online Bagging 77.15 2.98
ADWIN Bagging 79.24 1.48
Leveraging Bagging 85.54 20.17
Leveraging Bagging MC 85.37 22.04
Leveraging Bagging ME 80.77 0.87
Leveraging Bagging
• Leveraging Bagging
• Using Poisson(λ)
• Leveraging Bagging MC
• Using Poisson(λ) and Random Output Codes
• Leveraging Bagging ME
• Using weight 1 if misclassified, otherwise eT/(1−eT)
37
Clustering
38
clustering
Definition
Clustering is the distribution of a set of instances into previously unknown groups according to some common relations or affinities.
Example
Market segmentation of customers
Example
Social network communities
39
clustering
Definition
Given
• a set of instances I
• a number of clusters K
• an objective function cost(I)
a clustering algorithm computes an assignment of a cluster
for each instance
f : I → {1,...,K}
that minimizes the objective function cost(I)
40
clustering
Definition
Given
• a set of instances I
• a number of clusters K
• an objective function cost(C,I)
a clustering algorithm computes a set C of instances with
|C| = K that minimizes the objective function
cost(C,I) = ∑x∈I d²(x,C)
where
• d(x,c): distance function between x and c
• d²(x,C) = minc∈C d²(x,c): distance from x to the nearest point in C
41
k-means
• 1. Choose k initial centers C = {c1,...,ck}
• 2. while stopping criterion has not been met
• For i = 1,...,N
• find closest center ck ∈ C to each instance pi
• assign instance pi to cluster Ck
• For k = 1,...,K
• set ck to be the center of mass of all points in Ci
42
k-means++
• 1. Choose an initial center c1
• For k = 2,...,K
• select ck = p ∈ I with probability d2(p,C)/cost(C,I)
• 2. while stopping criterion has not been met
• For i = 1,...,N
• find closest center ck ∈ C to each instance pi
• assign instance pi to cluster Ck
• For k = 1,...,K
• set ck to be the center of mass of all points in Ci
43
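A compact sketch of the k-means++ seeding step described above; d² and cost follow the slides' definitions, while everything else (the toy points, the RNG seed) is illustrative.

```python
import random

def d2(p, c):
    """Squared Euclidean distance between point p and center c."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def kmeans_pp_seed(points, k, rng=random.Random(0)):
    """Choose k initial centers: the first uniformly at random, each next
    one with probability proportional to d^2(p, C) / cost(C, points)."""
    centers = [rng.choice(points)]
    for _ in range(1, k):
        dists = [min(d2(p, c) for c in centers) for p in points]
        total = sum(dists)                       # cost(C, points)
        r, acc = rng.uniform(0, total), 0.0
        for p, d in zip(points, dists):
            acc += d
            if acc >= r:
                centers.append(p)
                break
    return centers

pts = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)]
print(kmeans_pp_seed(pts, k=2))
```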
performance measures
Internal Measures
• Sum square distance
• Dunn index D = dmin / dmax
• C-Index C = (S − Smin) / (Smax − Smin)
External Measures
• Rand Measure
• F Measure
• Jaccard
• Purity
44
birch
Balanced Iterative Reducing and Clustering using
Hierarchies
• Clustering Features CF = (N,LS,SS)
• N: number of data points
• LS: linear sum of the N data points
• SS: square sum of the N data points
• Properties:
• Additivity: CF1 +CF2 = (N1 +N2,LS1 +LS2,SS1 +SS2)
• Easy to compute: average inter-cluster distance
and average intra-cluster distance
• Uses CF tree
• Height-balanced tree with two parameters
• B: branching factor
• T: radius leaf threshold
45
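The clustering feature and its additivity property can be sketched directly; this is a didactic sketch only (the CF tree with branching factor B and threshold T is not shown), and the radius formula sqrt(SS/N − ||LS/N||²) is the standard one derived from the stored sums.

```python
import numpy as np

class ClusteringFeature:
    """CF = (N, LS, SS): count, linear sum and square sum of the points."""

    def __init__(self, dim):
        self.N = 0
        self.LS = np.zeros(dim)
        self.SS = 0.0

    def add(self, x):
        self.N += 1
        self.LS += x
        self.SS += float(np.dot(x, x))

    def merge(self, other):
        # Additivity: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # average distance of the points from the centroid
        c = self.centroid()
        return float(np.sqrt(max(self.SS / self.N - np.dot(c, c), 0.0)))

cf = ClusteringFeature(dim=2)
for p in [np.array([1.0, 2.0]), np.array([2.0, 2.0]), np.array([1.5, 1.0])]:
    cf.add(p)
print(cf.N, cf.centroid(), round(cf.radius(), 3))
```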
birch
Balanced Iterative Reducing and Clustering using
Hierarchies
Phase 1: Scan all data and build an initial in-memory CF
tree
Phase 2: Condense into desirable range by building a
smaller CF tree (optional)
Phase 3: Global clustering
Phase 4: Cluster refining (optional and off line, as requires
more passes)
46
clu-stream
Clu-Stream
• Uses micro-clusters to store statistics on-line
• Clustering Features CF = (N,LS,SS,LT,ST)
• N: number of data points
• LS: linear sum of the N data points
• SS: square sum of the N data points
• LT: linear sum of the time stamps
• ST: square sum of the time stamps
• Uses pyramidal time frame
47
clu-stream
On-line Phase
• For each new point that arrives
• the point is absorbed by a micro-cluster
• the point starts a new micro-cluster of its own
• delete oldest micro-cluster
• merge two of the oldest micro-clusters
Off-line Phase
• Apply k-means using microclusters as points
48
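A rough sketch of the absorb-or-create decision in the on-line phase, under the simplifying assumptions that a point is absorbed when it lies within a fixed radius of the nearest micro-cluster centroid and that the micro-cluster with the oldest average time stamp is evicted when the budget is exceeded; the real CluStream uses a maximal boundary and the relevance stamp of the pyramidal time frame.

```python
import numpy as np

def online_update(microclusters, x, t, max_radius=1.0, max_clusters=10):
    """Absorb x into its nearest micro-cluster if close enough, otherwise
    start a new micro-cluster (N, LS, SS, LT, ST) and evict the oldest."""
    if microclusters:
        mc = min(microclusters, key=lambda m: np.linalg.norm(m["LS"] / m["N"] - x))
        if np.linalg.norm(mc["LS"] / mc["N"] - x) <= max_radius:
            mc["N"] += 1; mc["LS"] += x; mc["SS"] += float(x @ x)
            mc["LT"] += t; mc["ST"] += t * t
            return
    microclusters.append({"N": 1, "LS": x.astype(float).copy(), "SS": float(x @ x),
                          "LT": float(t), "ST": float(t * t)})
    if len(microclusters) > max_clusters:
        # evict the micro-cluster with the oldest average time stamp
        microclusters.remove(min(microclusters, key=lambda m: m["LT"] / m["N"]))

mcs = []
for t, x in enumerate([np.array([0.0, 0.0]), np.array([0.2, 0.1]), np.array([5.0, 5.0])]):
    online_update(mcs, x, t)
print(len(mcs))   # -> 2 micro-clusters
```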
streamkm++: coresets
Coreset of a set P with respect to some problem
Small subset that approximates the original set P.
• Solving the problem for the coreset provides an approximate
solution for the problem on P.
(k,ε)-coreset
A (k,ε)-coreset S of P is a subset of P such that, for each C of size k,
(1−ε)cost(P,C) ≤ costw(S,C) ≤ (1+ε)cost(P,C)
49
streamkm++: coresets
Coreset Tree
• Choose a leaf node l at random
• Choose a new sample point denoted by qt+1 from Pl
according to d2
• Based on ql and qt+1, split Pl into two subclusters and create
two child nodes
StreamKM++
• Maintain L = ⌈log2(n/m) + 2⌉ buckets B0, B1, ..., BL−1
50
Frequent Pattern Mining
51
frequent patterns
Suppose D is a dataset of patterns, t ∈ D, and min_sup is a
constant.
Definition
Support (t): number of
patterns in D that are
superpatterns of t.
Definition
Pattern t is frequent if
Support (t) ≥ min_sup.
Frequent Subpattern Problem
Given D and min_sup, find all frequent subpatterns of patterns
in D.
52
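A brute-force Python sketch of the frequent-subpattern problem for itemsets, using the toy dataset that appears on the following slides; it enumerates all candidate itemsets, so it is only practical for very small item alphabets.

```python
from itertools import combinations

D = [set("abce"), set("cde"), set("abce"), set("acde"), set("abcde"), set("bcd")]
min_sup = 3

def support(itemset, dataset):
    """Number of patterns in the dataset that are superpatterns of itemset."""
    return sum(1 for t in dataset if itemset <= t)

items = sorted(set().union(*D))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(set(combo), D)
        if s >= min_sup:
            frequent["".join(combo)] = s

print(frequent)   # e.g. 'c': 6, 'e': 5, 'ce': 5, 'a': 4, ...
```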
pattern mining
Dataset Example
Document Patterns
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
53
itemset mining
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
Support Frequent
d1,d2,d3,d4,d5,d6 c
d1,d2,d3,d4,d5 e,ce
d1,d3,d4,d5 a,ac,ae,ace
d1,d3,d5,d6 b,bc
d2,d4,d5,d6 d,cd
d1,d3,d5 ab,abc,abe
be,bce,abce
d2,d4,d5 de,cde
minimal support = 3
54
itemset mining
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
Support Frequent
6 c
5 e,ce
4 a,ac,ae,ace
4 b,bc
4 d,cd
3 ab,abc,abe
be,bce,abce
3 de,cde
55
itemset mining
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
Support Frequent Gen Closed
6 c c c
5 e,ce e ce
4 a,ac,ae,ace a ace
4 b,bc b bc
4 d,cd d cd
3 ab,abc,abe ab
be,bce,abce be abce
3 de,cde de cde
55
itemset mining
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
Support Frequent Gen Closed Max
6 c c c
5 e,ce e ce
4 a,ac,ae,ace a ace
4 b,bc b bc
4 d,cd d cd
3 ab,abc,abe ab
be,bce,abce be abce abce
3 de,cde de cde cde
55
itemset mining
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
Support Frequent Gen Closed Max
6 c c c
5 e,ce e ce
4 a,ac,ae,ace a ace
4 b,bc b bc
4 d,cd d cd
3 ab,abc,abe ab
be,bce,abce be abce abce
3 de,cde de cde cde
56
itemset mining
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
e → ce
Support Frequent Gen Closed Max
6 c c c
5 e,ce e ce
4 a,ac,ae,ace a ace
4 b,bc b bc
4 d,cd d cd
3 ab,abc,abe ab
be,bce,abce be abce abce
3 de,cde de cde cde
56
itemset mining
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
Support Frequent Gen Closed Max
6 c c c
5 e,ce e ce
4 a,ac,ae,ace a ace
4 b,bc b bc
4 d,cd d cd
3 ab,abc,abe ab
be,bce,abce be abce abce
3 de,cde de cde cde
57
itemset mining
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
a → ace
Support Frequent Gen Closed Max
6 c c c
5 e,ce e ce
4 a,ac,ae,ace a ace
4 b,bc b bc
4 d,cd d cd
3 ab,abc,abe ab
be,bce,abce be abce abce
3 de,cde de cde cde
57
itemset mining
d1 abce
d2 cde
d3 abce
d4 acde
d5 abcde
d6 bcd
Support Frequent Gen Closed Max
6 c c c
5 e,ce e ce
4 a,ac,ae,ace a ace
4 b,bc b bc
4 d,cd d cd
3 ab,abc,abe ab
be,bce,abce be abce abce
3 de,cde de cde cde
58
closed patterns
Usually, there are too many frequent patterns. We can
compute a smaller set, while keeping the same information.
Example
A set of 1000 items has 2^1000 ≈ 10^301 subsets, which is more than the number of atoms in the universe (≈ 10^79)
59
closed patterns
A priori property
If t′ is a subpattern of t, then Support (t′) ≥ Support (t).
Definition
A frequent pattern t is closed if none of its proper
superpatterns has the same support as it has.
Frequent subpatterns and their supports can be generated
from closed patterns.
59
maximal patterns
Definition
A frequent pattern t is maximal if none of its proper
superpatterns is frequent.
Frequent subpatterns can be generated from maximal
patterns, but not with their support.
All maximal patterns are closed, but not all closed patterns are
maximal.
60
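Given the frequent itemsets and their supports, closed and maximal patterns can be filtered with the two definitions above; this is a sketch assuming the itemsets are represented as frozensets, and the small `frequent` dict below is a hand-picked subset of the earlier example, for illustration only.

```python
def closed_and_maximal(frequent):
    """frequent: dict mapping a frozenset itemset to its support."""
    closed, maximal = [], []
    for itemset, sup in frequent.items():
        supersets = [t for t in frequent if itemset < t]   # proper frequent supersets
        # closed: no proper superpattern has the same support
        # (a superpattern with equal support would itself be frequent)
        if all(frequent[t] != sup for t in supersets):
            closed.append(itemset)
        # maximal: no proper superpattern is frequent at all
        if not supersets:
            maximal.append(itemset)
    return closed, maximal

frequent = {frozenset("c"): 6, frozenset("e"): 5, frozenset("ce"): 5,
            frozenset("ace"): 4, frozenset("abce"): 3}
closed, maximal = closed_and_maximal(frequent)
print(sorted("".join(sorted(s)) for s in closed))    # ['abce', 'ace', 'c', 'ce']
print(sorted("".join(sorted(s)) for s in maximal))   # ['abce']
```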
non streaming frequent itemset miners
Representation:
• Horizontal layout
T1: a, b, c
T2: b, c, e
T3: b, d, e
• Vertical layout
a: 1 0 0
b: 1 1 1
c: 1 1 0
Search:
• Breadth-first (levelwise): Apriori
• Depth-first: Eclat, FP-Growth
61
mining patterns over data streams
Requirements: fast, use small amount of memory and adaptive
• Type:
• Exact
• Approximate
• Per batch, per transaction
• Incremental, Sliding Window, Adaptive
• Frequent, Closed, Maximal patterns
62
moment
• Computes closed frequents itemsets in a sliding window
• Uses Closed Enumeration Tree
• Uses 4 type of Nodes:
• Closed Nodes
• Intermediate Nodes
• Unpromising Gateway Nodes
• Infrequent Gateway Nodes
• Adding transactions: closed itemsets remain closed
• Removing transactions: infrequent itemsets remain infrequent
63
fp-stream
• Mining Frequent Itemsets at Multiple Time Granularities
• Based on FP-Growth
• Maintains
• pattern tree
• tilted-time window
• Allows answering time-sensitive queries
• Places greater emphasis on recent data
• Drawback: time and memory complexity
64
tree and graph mining: dealing with time changes
• Keep a window on recent stream elements
• Actually, just its lattice of closed sets!
• Keep track of number of closed patterns in lattice, N
• Use some change detector on N
• When change is detected:
• Drop stale part of the window
• Update lattice to reflect this deletion, using deletion rule
Alternatively, sliding window of some fixed size
65
Summary
66
overview of big data science
Short Course Summary
1 Introduction to Big Data
2 Big Data Science
3 Real Time Big Data Management
4 Internet of Things Data Science
Open Source Software
1 MOA: http://moa.cms.waikato.ac.nz/
2 SAMOA: http://samoa-project.net/
67