Presented at PyData London 2015: http://london.pydata.org/schedule/presentation/43/
(Version with animations/transitions on Google Slides: https://docs.google.com/presentation/d/1fVMYTXcWD40aKo6_Z4iTpX2IKLZcAO13U7d5hcUN2EU/edit )
(An older version was presented on 12 May 2015 at The London Big-O Meetup: http://www.meetup.com/big-o-london/events/222028048/ )
3. Define Surprise!
surprise
[countable] an event, a piece of news, etc. that is unexpected or that happens suddenly
SYNONYMS: shock, … , eye-opener
[uncountable, countable] a feeling caused by something happening suddenly or unexpectedly
SYNONYMS: astonishment, ...
(Oxford Advanced Learner's Dictionary)
9. Quantify Complexity
Complexity can measure any content type. Note: complex is not random!
Measures of complexity:
1. Subjective rating
2. #Distinct elements
3. #Dimensions
4. #Control parameters
5. Minimal description
6. Information content
7. Minimal generator
8. Minimum energy
Abdallah, S., & Plumbley, M. (2009). Information dynamics: patterns of expectation and surprise in the perception of music. Connection Science, 21(2-3), 89-117.
10. Surprise Quants in academia
Neuro/Cognitive Science: How do we perceive information?
vs
Machine Learning: How to measure differences?
11. “... machine that constantly tells you what you already know is just irritating. So software alerts users only to surprises...”
Horvitz, E., Apacible, J., Sarin, R., & Liao, L. Prediction, Expectation, and Surprise: Methods, Designs, and Study of a Deployed Traffic Forecasting Service.
Friston, K. (2010). The free-energy principle: a unified brain theory?. Nature Reviews Neuroscience, 11(2), 127-138.
Surprise Quants in academia
Neuro/Cognitive Science: How do we perceive information?
Machine Learning: How to measure differences?
12. Surprise Quants in academia
Machine Learning, Neuro/Cognitive Science
Itti, L., & Baldi, P. F. (2005). Bayesian surprise attracts human attention. In Advances in neural information processing systems (pp. 547-554).
13. Surprise Quants in academia
Itti, L., & Baldi, P. F. (2005). Bayesian surprise attracts human attention. In Advances in neural information processing systems (pp. 547-554).
(figure: frames rated meh / wow / meh)
14. Typical ML applications
Unsupervised Learning:
1. Decision trees (inf. gain)
2. MaxEnt principle
3. ...
Specifically after ‘surprise’:
4. One-class classification
5. Anomaly detection
6. Novelty measure
Pimentel, M. A., Clifton, D. A., Clifton, L., & Tarassenko, L. (2014). A review of novelty detection. Signal Processing, 99, 215-249.
15. Model of a cat
(diagram) Data (stream) → Element (attention window) → Surprising? (interesting, new) → wow (act) / meh (ignore); Update → Data Model (expectations)
16. Model of a cat’s surprise
(diagram focus) Surprising? (interesting, new)
17. Quantify surprisal /self-information/
The surprise /information/ I(p) in observing the occurrence of an event having probability p.
Axioms:
1. I(p) ≥ 0
2. I(1) = 0
3. p1 ≤ p2 ⟹ I(p1) ≥ I(p2)
4. I(p1 ∗ p2) = I(p1) + I(p2) for independent events
Derive: additivity over independent events forces a logarithm.
Surprisal /self-information/:
I(p) = −log2(p), measured in bits (or wows)
Flipping a fair coin provides 1 bit of new information: I(1/2) = −log2(1/2) = 1 bit.
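In Python the formula is a one-liner (a throwaway illustration, not code from the deck):

import math

def surprisal(p):
    """Self-information of an event with probability p, in bits (or wows)."""
    return -math.log2(p)

surprisal(0.5)    # fair coin flip -> 1.0 bit
surprisal(0.25)   # 1-in-4 event  -> 2.0 bits
surprisal(1.0)    # certain event -> 0.0 bits: no surprise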
19. Model of a cat
(diagram) Data (stream) → Element (attention window) → Surprising? (interesting, new) → wow (act) / meh (ignore); Update → Data Model (expectations)
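Putting slides 15-19 together, a minimal Python sketch of this loop might look as follows (my illustration, not code from the deck; the crude smoothed-frequency model stands in for the cat's expectations):

import math
from collections import Counter

def cat(stream, threshold_bits=5.0):
    """Model of a cat: score each element against expectations,
    act on surprises, and update the model either way."""
    model, total = Counter(), 0                  # Data Model (expectations)
    for element in stream:                       # Element (attention window)
        p = (model[element] + 1) / (total + 2)   # smoothed probability estimate
        if -math.log2(p) > threshold_bits:       # Surprising?
            print("wow:", element)               # act
        # else: meh, ignore
        model[element] += 1                      # Update expectations
        total += 1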
20. Model of a cat’s knowledge
(diagram focus) Data Model (expectations)
21. Quantify ‘knowledge’ /entropy/
The Shannon entropy is the expected value of the self-information:
H(X) = −Σ p(x) log2 p(x)
Notes:
1. The maximum entropy distribution is the least informative (max: log2(n) for n equiprobable outcomes).
2. Statistical-mechanics entropy and information entropy are principally the same.
(figure) Entropy of a Bernoulli trial, X ∈ {0,1}
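A direct Python rendering of the definition (an illustrative sketch):

import math

def entropy(probs):
    """Shannon entropy in bits: the expected value of the self-information."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

entropy([0.5, 0.5])     # fair Bernoulli trial   -> 1.0 bit (the curve's maximum)
entropy([0.9, 0.1])     # biased coin            -> ~0.47 bits
entropy([0.25] * 4)     # 4 equiprobable outcomes -> 2.0 bits = log2(4)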
22. Entropy applications
(figure) Analysis of a binary of a GeoIP ISP database.
Analyzing unknown binary files using information entropy:
http://yurichev.com/blog/entropy/
23. Entropy applications
(figure) Visualizing the OSX ksh binary (see binvis.io), annotations 1, 2: cryptic signature.
Visualizing entropy in binary files:
http://corte.si/posts/visualisation/entropy/index.html
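Both analyses boil down to sliding-window byte entropy, roughly like this (a sketch under my own assumptions; the file path is just an example):

import math
from collections import Counter

def window_entropy(data, window=256, step=256):
    """Shannon entropy in bits/byte over sliding windows of a byte string.
    Text and padding score low; compressed or encrypted regions approach 8."""
    scores = []
    for i in range(0, max(len(data) - window, 0) + 1, step):
        counts = Counter(data[i:i + window]).values()
        n = sum(counts)
        if n:
            scores.append(-sum(c / n * math.log2(c / n) for c in counts))
    return scores

with open("/bin/ls", "rb") as f:    # any binary file will do
    print(window_entropy(f.read())[:8])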
24. Model of a cat’s discovery
(diagram) Data (stream) → Element (attention window) → Surprising? (interesting, new) → wow (act) / meh (ignore); Data Model (expectations): what has changed?
25. Quantify ‘discovery’ /information gain/
The Kullback–Leibler divergence /relative entropy, information gain/:
DKL(P‖Q) = Σ P(x) log2(P(x) / Q(x))
It is a measure of the information lost when Q is used to approximate P (the expected number of extra bits required to recode).
(figure: "KL-Gauss-Example", T. Nathan Mundhenk)
Not a true distance: asymmetric, DKL(P‖Q) ≠ DKL(Q‖P).
26. Quantify ‘discovery surprise’
Symmetric KL distances: all result in the same performance.
Pinto, D., Benedí, J. M., & Rosso, P. (2007). Clustering narrow-domain short texts by using the Kullback-Leibler distance. In Computational Linguistics and Intelligent Text Processing.
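In Python, the definition and one simple symmetrisation look like this (a sketch; the unsmoothed version assumes the distributions share support):

import math

def kl(p, q):
    """D_KL(P‖Q) in bits: information lost when Q approximates P."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """One common symmetric variant: D_KL(P‖Q) + D_KL(Q‖P)."""
    return kl(p, q) + kl(q, p)

p, q = [0.5, 0.5], [0.9, 0.1]
kl(p, q)             # ~0.74 bits
kl(q, p)             # ~0.53 bits -> asymmetric, hence not a true distance
symmetric_kl(p, q)   # ~1.27 bits, same in either order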
38. Simplistic topic modeling
- tweets are super short
+ important events are widely discussed
+ events change vocabulary
- timeslot aggregation favors the predominant event
Document is a timeslot.
Model:
- bag of words
- freq. threshold > 200 tweets
- term frequency (naive)
- tokenizer: https://github.com/jaredks/tweetokenize
+ a few touches
39. Simplistic topic modeling
Document is a time slot.
Model:
- bag of words
- freq. threshold > 200 tweets
- term frequency (naive)
- tokenizer: https://github.com/jaredks/tweetokenize
+ a few touches
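A bare-bones version of this model in Python (illustrative only; the deck's pipeline used the tweetokenize tokenizer plus "a few touches", so the whitespace split here is a stand-in):

from collections import Counter

def timeslot_vocabulary(tweets, min_tweets=200):
    """Document = one timeslot. Naive bag of words: keep terms
    that occur in more than `min_tweets` tweets in the slot."""
    doc_freq = Counter()
    for text in tweets:
        doc_freq.update(set(text.lower().split()))  # stand-in tokenizer
    return {term: n for term, n in doc_freq.items() if n > min_tweets}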
41. Test a domain-specific hack
Vocabulary: catastrophe
…
42. Vocabulary slots: KLD
How surpriseful the vocabulary of each hour is against the whole dataset.
Beware: on this scale individual hours are small, but events are plentiful.
Higher KLD on sparse data; lower KLD on dense data.
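One way to compute that per-hour score (a sketch; the add-alpha smoothing is my assumption, to keep terms unseen in one side finite, and it is also why sparse hours tend to score higher):

import math
from collections import Counter

def hour_kld(hour_counts, corpus_counts, alpha=1.0):
    """D_KL(hour ‖ whole dataset) over the combined vocabulary,
    with add-alpha smoothing for unseen terms."""
    vocab = set(hour_counts) | set(corpus_counts)
    hour_total = sum(hour_counts.values()) + alpha * len(vocab)
    corpus_total = sum(corpus_counts.values()) + alpha * len(vocab)
    kld = 0.0
    for term in vocab:
        p = (hour_counts[term] + alpha) / hour_total
        q = (corpus_counts[term] + alpha) / corpus_total
        kld += p * math.log2(p / q)
    return kld

hour = Counter({"flood": 50, "storm": 30, "lol": 20})
whole = Counter({"lol": 5000, "news": 3000, "flood": 100, "storm": 80})
print(hour_kld(hour, whole))   # high KLD -> a surpriseful hour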
61. To improve in the Tweets app
1. Benchmark: ‘hot’ events from media
2. Fight bots
a. spam (repetitions, bots)
b. ‘forced’ opinions
c. filter low quality
3. Topic model
a. not just Term Frequency
b. split topics (!)