Data Smashing

Zero-Knowledge Feature-free Anomaly Detection
Ishanu Chattopadhyay
Computation Institute
University of Chicago

a·nom·a·ly
noun
something that deviates from what is standard, normal, or expected.
“there are a number of anomalies in the present system”
synonyms:
oddity, peculiarity, abnormality, irregularity, inconsistency, incongruity,
aberration, quirk, rarity

Deterministic Behavior
Specifying “Normal” Might Be Easier

Stochastic Process vs. Deterministic + Noise
[Figure: two time series plotted against time, one from a stochastic process and one deterministic with added noise.]
Similar?

Brainwaves: Identical Stimuli
[Figure: four brainwave traces plotted against time.]
Are they from the same individual?
[Figure build: the traces are labeled Subject A and Subject B.]

State of Art
Requires Features
Feature 1
Feature 2
Feature 3
Similarity: distance between feature vectors

Data Smashing
[Figure: Signal 1 and Signal 2 are each quantized into a symbol stream over the alphabet {a, b, c}; one quantized stream is inverted. Is the resulting stream noise?]
Quantize: bcbbbbacabbcbbbbabbbbba· · ·
Quantize: abbababcbbbabbbabcbbbab· · ·
Invert: cacacccaaabcacbcacabacaaacacac· · ·
cababcbababcbbacbcbcbababacacabaccbcbccab· · ·
Noise?

Data Smashing
I. Chattopadhyay and H. Lipson, “Data Smashing: Uncovering Lurking Order In Data”,
Journal of the Royal Society Interface, 2014, vol. 11, no. 101, 20140826

Anti-streams
Intuitive Description
Stream s ⟶ Anti-stream s⁻¹

Anti-streams
[Figure: paired histograms of word frequencies for a stream and its anti-stream, for words of length 1, 2, and 3.]
Length 1 (words 0, 1): stream 0.7, 0.3; anti-stream 0.3, 0.7
Length 2 (words 00, 01, 10, 11): stream 0.48, 0.24, 0.16, 0.12; anti-stream 0.1, 0.2, 0.3, 0.4
Length 3 (words 000 to 111): histograms shown without readable bar values

Quantization
Mapping Continuous Data to Symbol Stream
Quantization Alphabet Σ = {0, 1, 2, 3}
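A minimal sketch of one way to perform this mapping, assuming equal-frequency (quantile) bins. The slides do not say how the partition is chosen, and `quantize` is a hypothetical helper name.

```python
import numpy as np

def quantize(signal, alphabet_size=4):
    """Map a real-valued series to symbols 0..alphabet_size-1 using quantile bins."""
    x = np.asarray(signal, dtype=float)
    # Interior cut points, e.g. the 25%, 50%, 75% quantiles for alphabet_size = 4.
    probs = np.arange(1, alphabet_size) / alphabet_size
    cuts = np.quantile(x, probs)
    # digitize returns the index of the bin each sample falls into.
    return np.digitize(x, cuts)

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=1000))   # toy continuous signal (assumed data)
symbols = quantize(series)                  # stream over the alphabet {0, 1, 2, 3}
print(symbols[:20])
```

Quantile bins make the four symbols roughly equally frequent, so no symbol is starved of data. Other partitions, such as equal-width bins, are equally reasonable starting points.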

The Black Box Approach
Dynamical System As A Symbol Generator
[Figure: a black box emits a symbol stream; the distribution of the next symbol depends on the observed history s.]
After one history s: Pr(0|s) = 0.8, Pr(1|s) = 0, Pr(2|s) = 0.2
After another history s: Pr(0|s) = 0.4, Pr(1|s) = 0.3, Pr(2|s) = 0.3
Ergodic Stationary Quantized Process ⟺ Probability Measure on Infinite Strings
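The conditional probabilities Pr(σ|s) on this slide can be estimated from an observed stream by counting which symbol follows each occurrence of a history s. The sketch below assumes a stream stored as a Python string; `next_symbol_distribution` is an illustrative name, not from the source.

```python
from collections import Counter

def next_symbol_distribution(stream, history, alphabet):
    """Estimate Pr(symbol | history) by counting what follows each occurrence of history."""
    h = len(history)
    counts = Counter(
        stream[i + h]
        for i in range(len(stream) - h)
        if stream[i:i + h] == history
    )
    total = sum(counts.values())
    if total == 0:
        return None  # history never observed with a following symbol
    return {a: counts[a] / total for a in alphabet}

stream = "00102" * 4   # toy stream over the alphabet {0, 1, 2}
dist_after_0 = next_symbol_distribution(stream, "0", "012")
dist_after_2 = next_symbol_distribution(stream, "2", "012")
print(dist_after_0)    # each symbol follows "0" equally often
print(dist_after_2)    # {'0': 1.0, '1': 0.0, '2': 0.0}
```

In this toy stream the history "2" is always followed by 0, while the history "0" leaves the next symbol undetermined, which is the sense in which the next-symbol distribution depends on s.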

System As A Symbol Generator
Probability Space $(\Sigma^\omega, B, \mu)$
$\Sigma^\omega$ : Set of strictly infinite strings on alphabet $\Sigma$
$B$ : smallest σ-algebra generated by the sets $\{x\Sigma^\omega : x \in \Sigma^\star\}$
$\mu$ : Probability measure on infinite strings:
$\mu(x\Sigma^\omega) \in [0, 1]$, $\sum_{x \in \Sigma} \mu(x\Sigma^\omega) = 1$

Probability Measure on Infinite Strings ⟹ Equivalence Relation on Finite Strings
$\forall x_1, x_2 \in \Sigma^\star,\ x_1 \sim x_2$ if $\forall x \in \Sigma^\star,\ \mu(x_1 x \Sigma^\omega) = \mu(x_2 x \Sigma^\omega)$
Equivalence Classes are causal states

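Probabilistic Finite State Automata
Models For Quantized Stationary Ergodic Stochastic Processes
[Figure: example PFSAs labeled G1, G2, G3 in the Space Of PFSA, linked to measures ℘1 and ℘2 in the Space Of Measures. Edge labels read symbol|probability.]
Two-state PFSA (q1, q2): σ1|0.7, σ0|0.3, σ1|0.9, σ0|0.1
Four-state PFSA (q1, q2, q3, q4): σ1|0.7, σ1|0.7, σ0|0.7, σ0|0.7, σ0|0.3, σ1|0.9, σ1|0.1, σ0|0.1
One-state PFSA (q1): σ0|0.5, σ1|0.5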
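To make the idea of a PFSA as a stream generator concrete, here is a small simulation sketch. The edge probabilities follow the two-state example on this slide, but the slide text does not preserve which state each edge leads to, so the transition targets below are assumptions, as is the helper name `generate`.

```python
import random

# state -> {symbol: (probability, next_state)}; next_state values are assumed.
PFSA = {
    "q1": {"0": (0.3, "q1"), "1": (0.7, "q2")},
    "q2": {"0": (0.1, "q1"), "1": (0.9, "q2")},
}

def generate(pfsa, start, n, seed=0):
    """Emit n symbols by walking the automaton and sampling one edge per step."""
    rng = random.Random(seed)
    state, out = start, []
    for _ in range(n):
        symbols = list(pfsa[state])
        weights = [pfsa[state][a][0] for a in symbols]
        a = rng.choices(symbols, weights=weights)[0]
        out.append(a)
        state = pfsa[state][a][1]   # follow the chosen edge
    return "".join(out)

sample = generate(PFSA, "q1", 5000)
print(sample[:40])
print(sample.count("1") / len(sample))   # fraction of 1s in the generated stream
```

Different PFSAs can generate the same process, which is why the slide distinguishes the space of PFSA from the space of measures.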

Adding Probability Measures
Consider two measures:
$\wp_1 : B \to [0, 1]$
$\wp_2 : B \to [0, 1]$
Define a binary operation:
$\wp_1 \oplus \wp_2 = \wp_3$
where
$\wp_3(x\Sigma^\omega) = \wp_1(x\Sigma^\omega)\,\wp_2(x\Sigma^\omega) \times \text{Constant}$, with the constant fixed by $\sum_{x \in \Sigma} \wp_3(x\Sigma^\omega) = 1$
Commutative
Closed
Unique Inverse
Unique Identity
Abelian Group
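For a single next-symbol distribution, the operation above reduces to an elementwise product followed by renormalization. The sketch below assumes that reduced setting (a memoryless process with strictly positive probabilities); `add`, `inverse`, and `identity` are illustrative names, not from the source.

```python
import numpy as np

def add(p, q):
    """p (+) q: elementwise product, renormalized to sum to 1."""
    r = np.asarray(p, dtype=float) * np.asarray(q, dtype=float)
    return r / r.sum()

def inverse(p):
    """Group inverse: elementwise reciprocal, renormalized (entries must be > 0)."""
    r = 1.0 / np.asarray(p, dtype=float)
    return r / r.sum()

def identity(k):
    """Identity element: the uniform distribution on k symbols."""
    return np.full(k, 1.0 / k)

p = np.array([0.3, 0.7])
q = np.array([0.8, 0.2])
print(add(p, q))                             # approx [0.632, 0.368]
print(inverse(p))                            # approx [0.7, 0.3]
print(add(p, inverse(p)))                    # approx [0.5, 0.5], the identity
print(inverse([0.48, 0.24, 0.16, 0.12]))     # approx [0.1, 0.2, 0.3, 0.4]
```

The last line reproduces the length-2 histogram values shown on the Anti-streams slides, where the anti-stream frequencies are the normalized reciprocals of the stream frequencies.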

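Lifting Group Structure To PFSAs
[Figure: PFSAs G1 and G2 in the Space Of PFSA correspond (map labeled H) to measures ℘1 and ℘2 in the Space Of Measures; the sum ℘1 ⊕ ℘2 = ℘3 corresponds to the PFSA G3. Edge labels read symbol|probability.]
Two-state PFSA (q1, q2): σ1|0.7, σ0|0.3; σ1|0.9, σ0|0.1
Three-state PFSA (q1, q2, q3): σ1|0.2, σ0|0.8; σ1|0.3, σ0|0.7; σ1|0.4, σ0|0.6
Six-state PFSA (q1 to q6): σ1|0.86, σ0|0.14; σ1|0.7, σ0|0.3; σ1|0.79, σ0|0.21; σ1|0.57, σ0|0.43; σ0|0.39, σ1|0.61; σ0|0.63, σ1|0.37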

Mathematical Structure Of Model Space
The Abelian Group
The Space of Models has the mathematical structure of an Abelian Group

The Abelian Group
1 + 2 = 3
2 − 2 = 0
3 + 0 = 3
We can “add and subtract” models:
G + H = J
G − H = K
G − G = ?

The Zero Machine
[Figure: a PFSA G added to its inverse −G gives W, the flat white noise model. Edge labels read symbol|probability.]
G (q1, q2): σ1|0.3, σ0|0.7; σ1|0.1, σ0|0.9
−G (q1, q2): σ1|0.7, σ0|0.3; σ1|0.9, σ0|0.1
W (q1): σ0|0.5, σ1|0.5
G + (−G) = W (Flat White Noise)
Zero PFSA for binary alphabet (q1): σ0|0.5, σ1|0.5
Zero PFSA for trinary alphabet (q1): σ0|1/3, σ1|1/3, σ2|1/3

The Zero Machine
Maximum entropy rate
History is useless for prediction
Encodes minimum information
Zero PFSA for binary alphabet (q1): σ0|0.5, σ1|0.5
For every history s: Pr(0|s) = 0.5, Pr(1|s) = 0.5
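A quick numerical check of the maximum-entropy claim, as a sketch. For a one-state machine the symbols are independent, so the entropy rate equals the Shannon entropy of the single next-symbol distribution; `entropy_bits` is an illustrative helper name.

```python
import math

def entropy_bits(dist):
    """Shannon entropy in bits of a finite probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

zero_binary = entropy_bits([0.5, 0.5])          # zero PFSA, binary alphabet
biased = entropy_bits([0.7, 0.3])               # any non-uniform binary choice
zero_trinary = entropy_bits([1 / 3, 1 / 3, 1 / 3])

print(zero_binary)    # 1.0 bit per symbol, the binary maximum
print(biased)         # strictly less than 1.0
print(zero_trinary)   # log2(3) bits per symbol, the trinary maximum
```

The uniform distribution attains the largest possible value, log2 of the alphabet size, which matches the statement that history is useless for predicting the next symbol.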

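Stream Inversion
Direct Generation of Anti-streams
[Figure: in the Space Of PFSA, model G is taken to −G by group-theoretic inversion using the PFSA model. At the level of data, sequence s1 is taken directly to its inverse sequence s⁻¹ by stream inversion via selective erasure of s1, with no model required.]
G (q1, q2): σ1|0.3, σ0|0.7; σ1|0.1, σ0|0.9
−G (q1, q2): σ1|0.7, σ0|0.3; σ1|0.9, σ0|0.1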

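Annihilation Identity
[Figure: sequence s1 (generated by G) and sequence s2 (generated by −G) combine into a sequence s that is flat white noise, mirroring G + (−G) = W in the Space Of PFSA.]
G (q1, q2): σ1|0.3, σ0|0.7; σ1|0.1, σ0|0.9
−G (q1, q2): σ1|0.7, σ0|0.3; σ1|0.9, σ0|0.1
W, the Flat White Noise Model (q1): σ0|0.5, σ1|0.5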

Example Of Stream, Anti-stream, & FWN
4 Letter Alphabet
Signal
Inverse Signal
Flat White Noise

EEG - Epileptic Pathology
[Figure: EEG traces plotted against time, labeled Eyes Open and Epileptic Pathology.]
[Figure: Pairwise Distance Matrix (axes to 300; color scale from 9.9 · 10^−4 to 1 · 10^−1) with blocks labeled Pathology, Eyes Closed (Normal), and Eyes Open (Normal).]
[Figure: 3D Euclidean Embedding (axes x, y, z) separating Anomaly, Normal (Eyes closed), and Normal (Eyes open), with a plane labeled PlaneA.]
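The slides do not say how the 3D embedding is computed from the pairwise distance matrix. One standard choice is classical multidimensional scaling (MDS), sketched below with NumPy as an assumption; `classical_mds` is a hypothetical helper, and the toy points stand in for real distances.

```python
import numpy as np

def classical_mds(D, dim=3):
    """Embed n items in `dim` dimensions from an n x n matrix of pairwise distances."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)               # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:dim]         # keep the `dim` largest
    vals = np.clip(vals[order], 0.0, None)       # guard against small negatives
    return vecs[:, order] * np.sqrt(vals)

# Toy check: distances between known 3D points are recovered by the embedding.
rng = np.random.default_rng(1)
points = rng.normal(size=(10, 3))
D = np.linalg.norm(points[:, None] - points[None], axis=-1)
X = classical_mds(D, dim=3)
D_back = np.linalg.norm(X[:, None] - X[None], axis=-1)
print(np.abs(D - D_back).max())                  # close to zero
```

Distances produced by stream comparison need not be exactly Euclidean, which is why negative eigenvalues are clipped here; in that case the embedding is an approximation.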

Cardiac Pathology
Disambiguate Normal Rhythm from Murmur
[Figure: four traces plotted against time, labeled Normal Rhythm and Murmur.]
[Figure: Distance Matrix (axes to 60; color scale from 9 · 10^−5 to 1 · 10^−2) and 3D Euclidean Embedding (axes x, y, z) separating Murmur from Healthy.]

EEG Based Biometric Authentication
122 Subjects, 100% Accuracy
[Figure: EEG electrode layout and a 3D embedding (axes x, y, z) for 3 random subjects.]
Electrodes shown: T8, FT8, F8, AF8, FP2, FPZ, FP1, AF7, F7, FT7, T7, TP7, P7, PO7, O1, OZ, O2, PO8, P8, TP8, CZ, FCZ, FZ, AFZ, CPZ, PZ, POZ, F1, F3, F5, F2, F4, F6, FC1, FC3, FC5, FC2, FC4, FC6, C1, C3, C5, C2, C4, C6, CP1, CP3, CP5, CP2, CP4, CP6, P1, P3, P5, P2, P4, P6, PO1, PO2, AF1, AF2

Classification of Variable Stars
Optical Gravitational Lensing Experiment (OGLE) database
[Figure: example light curves labeled Cepheid and RR Lyrae; Distance Matrix (axes to 10,000; color scale from 9.9 · 10^−4 to 1 · 10^−1); 3D embedding (axes x, y, z) separating RRL from Cepheids.]

Self Annihilation Error
Sequence $s$ + inverse $s^{-1}$ → Flat White Noise
Only if $|s|$ is large enough!
Self-annihilation error: $\epsilon_s = \Theta(s + s^{-1}, W)$, the deviation of the annihilated stream from flat white noise $W$
$\epsilon_s \to 0$ exponentially fast with $|s|$
Information content of stream
Auto-detect Data Sufficiency!
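The slides do not define Θ. As a rough illustration only, the sketch below scores how far a symbol stream is from flat white noise by comparing empirical next-symbol distributions with the uniform distribution over short histories. The weighting, the history length `max_len`, and the name `deviation_from_fwn` are all assumptions made for this example.

```python
import random
from collections import Counter
from itertools import product

def deviation_from_fwn(stream, alphabet, max_len=2):
    """Weighted sup-norm gap between next-symbol frequencies and uniform."""
    k = len(alphabet)
    score = 0.0
    for h in range(max_len + 1):
        for hist in product(alphabet, repeat=h):
            hist = list(hist)
            counts = Counter(
                stream[i + h]
                for i in range(len(stream) - h)
                if stream[i:i + h] == hist
            )
            total = sum(counts.values())
            if total == 0:
                continue  # history never seen; contributes nothing
            worst = max(abs(counts[a] / total - 1.0 / k) for a in alphabet)
            score += worst / k ** (2 * h)   # longer histories weigh less (assumed)
    return (k - 1) / k * score

alphabet = [0, 1, 2]
rng = random.Random(0)
noise = rng.choices(alphabet, k=20000)                          # flat white noise
biased = rng.choices(alphabet, weights=[0.8, 0.1, 0.1], k=20000)
noise_score = deviation_from_fwn(noise, alphabet)
biased_score = deviation_from_fwn(biased, alphabet)
print(noise_score, biased_score)
```

A score near zero for the stream obtained by summing s with its own inverse would indicate that enough data is available; a large score would indicate that the stream is too short to trust a comparison.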

Cognitive Fingerprinting
User Authentication From Keypress Dynamics
[Figure: time-series of keypress delays (horizontal axis 100 to 400) recorded while typing random text, for users JOHN, CHERYL, and JASON, with a pairwise distance matrix (axes 2 to 16) grouped by user.]

[Figure: three panels plotted against # of Key Press on log-scaled axes, each with a Threshold line. A Data Sufficiency Test comes first, followed by the Authentication Test.]
User Authorized: ACCEPT
User Rejected: REJECT
USER CHANGE: ACCEPT turns to REJECT; authorization revoked on unexpected user change detection

Physical Dexterity Quantification
Dynamic Regulation of Instabilities: Athletes vs Amateurs
Francisco Valero-Cuevas
Brain-body Dynamics Lab
Univ. of Southern California
[Figure: several time series (horizontal axis 0 to 5,000), followed by pairwise distance matrices (axes to 200) and 3D embeddings (axes x, y, z).]

Handwritten Digits
Recognition / Classification Problem

State of Art Approaches
Find Clever Representations
Feature Vector 1, Feature Vector 2, ... Feature Vector 16

Image To Symbol Stream
Random Walk
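The slide gives only the idea of turning an image into a symbol stream by a random walk. The sketch below is one possible reading, assuming a binarized image, a 4-neighbor walk that starts at the center and is clamped at the border, and one emitted symbol per step; none of these details come from the source, and `random_walk_stream` is a hypothetical name.

```python
import numpy as np

def random_walk_stream(image, n_steps, seed=0):
    """Walk randomly over the pixel grid and emit the pixel value at each step."""
    img = np.asarray(image)
    rng = np.random.default_rng(seed)
    rows, cols = img.shape
    r, c = rows // 2, cols // 2                       # start at the center
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right
    out = []
    for _ in range(n_steps):
        dr, dc = moves[rng.integers(4)]
        r = min(max(r + dr, 0), rows - 1)             # clamp at the border
        c = min(max(c + dc, 0), cols - 1)
        out.append(int(img[r, c]))
    return out

# Toy 8 x 8 binary "digit": a vertical bar two pixels wide.
image = np.zeros((8, 8), dtype=int)
image[:, 3:5] = 1
stream = random_walk_stream(image, n_steps=500)
print(stream[:30])
```

The resulting stream over {0, 1} can then be compared with streams from other images in the same way as any other quantized signal.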

Handwritten Digits
Recognition / Classification Problem
[Figure: distance matrix (axes to 300) and 3D embedding (axes x, y, z) separating the digits 4 and 9.]

Memory-less Stream Manipulation
Generating First Independent Copy for String s1:
[Figure build: s1 is read alongside a flat white noise (FWN) stream. A symbol is written to Copy 1 only where the two agree and is erased otherwise. First five columns as they appear in the builds:]
s1:     2 1 2 0 1
FWN:    0 1 2 2 2
Copy 1: · 1 2 · ·
Copy 1 in the final build: 1 2 2 0 0 0 1 0 0 1 0 2 1 0 2 2 2 1 2 2 1 2 2 1 2 1
Generating Second Independent Copy for String s1:
[Figure build: the same procedure with a fresh FWN stream.]
Copy 2 in the final build: 2 0 2 2 1 0 0 2 2 2 1 2 2 0 1 2 2 1 2 1 2
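The copy step illustrated on these slides can be sketched as follows: read the string next to a flat white noise stream and keep a symbol only where the two agree. The five-column example uses values read from the slide builds; the function names and the toy source are assumptions.

```python
import random

def copy_with_noise(stream, noise):
    """Keep each symbol of stream that equals the noise symbol in the same column."""
    return [a for a, w in zip(stream, noise) if a == w]

def independent_copy(stream, alphabet, seed=None):
    """Draw a fresh flat-white-noise stream and apply selective erasure."""
    rng = random.Random(seed)
    noise = [rng.choice(alphabet) for _ in stream]
    return copy_with_noise(stream, noise)

s1 = [2, 1, 2, 0, 1]
fwn = [0, 1, 2, 2, 2]
first_copy = copy_with_noise(s1, fwn)
print(first_copy)   # [1, 2]

# Each symbol survives with probability 1/|alphabet| whatever its value,
# so a copy is about one third as long as the original for 3 symbols.
long_stream = random.Random(1).choices([0, 1, 2], k=3000)
long_copy = independent_copy(long_stream, [0, 1, 2], seed=2)
print(len(long_copy))
```

Because the noise is uniform and independent of the data, erasure does not favor any symbol, and repeating the step with fresh noise yields a second copy.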

Generating Inverse of String s1:
[Figure build: Copy 1 and Copy 2 are read side by side. Where the two symbols differ, the one symbol of the 3-letter alphabet that appears in neither is written to the inverse; where they agree, nothing is written. First four columns as they appear in the builds:]
Copy 1:  0 1 2 1
Copy 2:  2 0 2 2
Inverse: 1 2 · 0
Inverse in the final build: 1 2 0 0 1 0 0 1 0 1 2 2 1 2 2 1 2 1
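Summing s2 with Inverse of s1
[Figure build: the inverse of string s1 is read alongside string s2; a symbol is written to the result only where the two agree.]
Result shown: 1 2 1 0 · · ·
Other symbols visible in the figure (column alignment lost): 2 2 1 2 1 0 2 2 2 1 0 0 2 1 2 2 and 0 1 2 1 0 2 0 0 0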
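The inversion and summation steps shown in these builds can be sketched as below. The four-column example uses values read from the slide builds. The end-to-end run assumes a memoryless toy source with made-up symbol weights, and all function names are illustrative.

```python
import random
from collections import Counter

def independent_copy(stream, alphabet, rng):
    """Keep symbols that match a fresh flat-white-noise draw (selective erasure)."""
    return [a for a in stream if a == rng.choice(alphabet)]

def invert_from_copies(copies, alphabet):
    """Given |alphabet| - 1 copies, emit the missing symbol where all copies differ."""
    out = []
    for column in zip(*copies):
        if len(set(column)) == len(alphabet) - 1:
            out.append((set(alphabet) - set(column)).pop())
    return out

def stream_sum(a, b):
    """Keep a symbol only where both streams agree in the same column."""
    return [x for x, y in zip(a, b) if x == y]

# Columns read from the slide builds.
slide_inverse = invert_from_copies([[0, 1, 2, 1], [2, 0, 2, 2]], [0, 1, 2])
print(slide_inverse)   # [1, 2, 0]

# Toy run: invert s1, then sum with s2 drawn from the same (assumed) source.
alphabet = [0, 1, 2]
rng = random.Random(0)
s1 = rng.choices(alphabet, weights=[0.6, 0.3, 0.1], k=90000)
s2 = rng.choices(alphabet, weights=[0.6, 0.3, 0.1], k=90000)
copies = [independent_copy(s1, alphabet, rng) for _ in range(len(alphabet) - 1)]
inv = invert_from_copies(copies, alphabet)
smashed = stream_sum(inv, s2)
freq = Counter(smashed)
print({a: round(freq[a] / len(smashed), 2) for a in alphabet})
```

In the toy run the rarest symbol of s1 becomes the most common symbol of the inverse, and the summed stream has nearly equal symbol frequencies, which is the flat-white-noise outcome expected when s1 and s2 come from the same source.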

Summing It Up!
Notion of Universal Similarity
Zero-knowledge Feature-free Anomaly Detection

Patents held by Cornell University
6259-01-US Stochastic Automata for Earthquake Prediction from Large
Scale Surveys
6259-02-PC Systems and Methods for Abductive Learning of Quantized
Stochastic Process
6024-03-PC System and Methods for Analysis of Data PCT/US13/62397
6998-01-US Causality Network Construction Algorithm (Application no.
62170063, EFS ID 2517508)
