Master Minds on Data Science - Arno Siebes
1. Big Data (and a bit of) Media
Prof. dr Arno Siebes
Algorithmic Data Analysis Group
Department of Information and Computing Sciences
Universiteit Utrecht
Masterclass Data Science Mediapark
November 12, 2015
Sponsored in part by the Dutch national COMMIT project
3. Big Data
Big data is like teenage sex: everyone talks about it,
nobody really knows how to do it, everyone thinks
everyone else is doing it, so everyone claims they are
doing it.
by prof. Dan Ariely
James B. Duke Professor of Psychology and Behavioral
Economics at Duke University
founding member of the Center for Advanced Hindsight
At the end you will at least know what the fuss is about and have
thought about its consequences.
4. Outline
I am going to talk about three things
1. Why Big Data?
why do we suddenly have big data?
and what does it actually mean?
2. Patterns and Profiles
“we” want big data because it allows us to find patterns; what
are these patterns and how do we find them?
and we’ll briefly discuss privacy in this context
3. Recommendations
how can we use these profiles to make recommendations?
Time permitting, I’ll tell you something about the forefront of
research
6. The Greatest Discovery Ever
Information is immaterial (incorporeal)
it may be embodied by some material configuration,
but it isn’t that matter.
there are many ways to represent the same information, e.g.
it is always representable by a 0/1 string (Shannon)
Every possible transformation of information (effective
computation) can be done by a computer
firstly because of the existence of Universal Turing Machines
(Turing)
secondly because of the Church - Turing Thesis (Church,
Turing, and Rosser)
Only one type of machine is necessary to process any type of
information in any way possible (how many for matter?)
This glorious insight gave rise to the age of computation.
7. Because Information is Immaterial,
it has no size!
Hence Moore’s Law:
CPU a 2.4 million-fold increase
Intel 4004 (1971): 2,300 transistors
Intel Xeon E5-2699 v3 (2014): 5.5 × 10^9
transistors (its L3 cache is twice the size of my
first hard disk; 45 MB vs 20 MB)
Hard Disk a million fold increase
IBM (1956): 5 MB (for $50,000)
Toshiba (2015): 5TB (for $200)
note: Bytes per dollar increased by a factor 2.5 × 10^8
Some quite smart technological breakthroughs may also have been
not unimportant for the miniaturization; but I’m a computer
scientist
8. Shrinking Means Growth
Moore’s Law has made:
computing ubiquitous
from PC to laptop to tablet to smartphone, to smartwatch ...
from dedicated hardware to cars, to fridges, to thermostats ...
and data big
from transactional DB’s to DW’s, to digital libraries, to clouds,
to social networks
from data entry to web interfaces to quantified self, to
quantified employees (HR analytics), to sensors everywhere
Or phrased more prosaically,
data is acquired, stored, exchanged, and processed, on
everything, everywhere, at any time, all the time.
If it isn’t digital,
it doesn’t exist.
it didn’t happen
it doesn’t matter
9. Hence, Big Data
This ubiquity is what causes Big Data, which is “defined” by
Volume, we are truly talking about massive amounts of
data
Velocity, the data comes from all sides at faster and
faster rates – often too fast to look at more than
once
Variety, the data comes in many sorts, shapes, and
sizes
data used to mean a (rectangular) table filled
with numbers
but that is no longer true
we have libraries of texts, databases with
molecules, music and video collections, and so
forth, and so forth.
10. Big is Really Big
Some (not so) random statistics:
in 2012 the world produced 1.8 Zettabyte (i.e.,
1.8 × 10^21 bytes = 1.8 × 10^9 TB)
the NSA is estimated to hoard 3 - 12 Exabyte (10^18 bytes; Forbes
2013)
in 2014 the world wide web was estimated at 4 Zettabyte
In 1 second, there are (Internet Live Stats, March 30, 2015)
1,841 Tumblr posts
1,918 Instagram photos uploaded
8,885 Tweets sent
48,187 Google searches
98,404 YouTube videos viewed
2,383,324 Emails sent
11. But, Why?
It is interesting that we generate so much data per second
but why is it all stored?
Clearly, storage space is cheap, but still
that doesn’t mean that every bit is sacred, does it?
The reason is (at least) twofold
You store everything about yourself
Facebook, YouTube, Twitter, Whatsapp, Google+, LinkedIn,
Instagram, Snapchat, Pinterest, foursquare, WeChat, ...
don’t ask me why you do that.
Companies love these hoards of data, it allows them to make
profiles
12. Because
Data is the New Oil
Clive Humby
Data is just like crude. It is valuable, but if unrefined it
cannot really be used. It has to be changed into gas,
plastic, chemicals, etc to create a valuable entity that
drives profitable activity; so must data be broken down,
analyzed for it to have value.
Companies, governments, healthcare institutions and who not(?)
make profiles
to predict what you want to buy next
to offer you better service
to predict what medication is best for you
for better and for worse
14. Patterns are Groups
A pattern is a set of characteristics shared by a group of
customers
patients
transactions
i.e., anything you have data of
a group that is for some reason deemed interesting
For example, a group of
patients that have the same disease
customers that spent at least 1000 euros in your shop last year
insurants who claim way above average (or way below)
15. Patterns are Descriptions
We only look for groups that are easily described, e.g.,
Sex = M and Age ≤ 25 for car insurance
BRCA1, Exon 2 deletion, exon 13 deletion, 2804delAA –
describes (some) Dutch women with a higher risk of early
onset breast cancer
The reason to look for such groups is
what social scientists call homophily
birds of a feather flock together
i.e., similar “things” act similarly
We’ll restrict ourselves mostly to very simple patterns
item sets such as {Diapers, Beer}
describing all transactions in which the customer bought both
Diapers and Beer.
17. When is a Pattern Interesting?
A data miner should define what makes a pattern interesting.
There are many interestingness measures, e.g.
heightened risk: a group with above average risk
differential: this group of patients shares a set of
characteristics, not shared by non-patients
frequency: these items have been bought together more than
θ times
These predefined measures come with an algorithm that allows us
to
find all interesting patterns in the database
relatively efficiently
18. Profiles are Interesting Patterns
Interesting patterns are used as profiles
profiles to recognize good customers
profiles to recognize patients with a certain disease
profiles of groups of articles one should (or should not)
discount together
That is, we get profiles if we have interesting patterns we can act
on.
Mind you, patterns can be interesting – and extremely useful –
even if one cannot act upon them. The most important reason to
mine for interesting patterns is that they provide insight into the
data – and thus into the real world.
19. Pattern Mining
Or, theory mining, as defined by Mannila and Toivonen in '97:
Given a database db, a language L to define subgroups of the data
and a selection predicate q that determines whether an element
φ ∈ L describes an interesting subgroup of db or not, the task is to
find:
T(L, db, q) = {φ ∈ L | q(db, φ) is true}
That is, the task is to find all interesting subgroups.
20. Transaction Databases
The first example:
Each entry in the database records the contents of a shopping
basket at the check-out
Just presence of items, no counts
More formally:
Given a set of items 𝓘,
a transaction t over 𝓘 is a subset of 𝓘, i.e., t ⊆ 𝓘
A transaction database db over 𝓘 is a bag of transactions over
𝓘
Note: each categorical database can be recoded as a transaction
database.
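That recoding can be sketched in a few lines (hypothetical toy data): every attribute=value pair of a categorical table becomes an item, and every row becomes a transaction.

```python
# Recoding a categorical database as a transaction database:
# each attribute=value pair becomes an item (hypothetical toy data).
rows = [
    {"sex": "M", "smokes": "yes"},
    {"sex": "F", "smokes": "no"},
]

# One transaction (a set of items) per row.
transactions = [{f"{attr}={val}" for attr, val in row.items()} for row in rows]

print(sorted(transactions[0]))  # ['sex=M', 'smokes=yes']
```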
21. Frequent Item Set Mining
An item set I is a set of items, I ⊆ 𝓘
L = P(𝓘)
An item set I occurs in a transaction t iff I ⊆ t
The support of item set I is:
supp_db(I) = |{t ∈ db | I ⊆ t}|
The “interestingness” predicate is a threshold on the support
of the item sets, the minimal support: min-sup.
Frequent item set mining task:
{I ∈ L | supp_db(I) ≥ min-sup}
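For tiny item sets, the task definition can be implemented literally by enumerating all of P(I) and checking the support threshold. This is a sketch to pin down the definitions, hopelessly inefficient for real data:

```python
from itertools import combinations

def support(itemset, db):
    """supp_db(I) = |{t in db | I is a subset of t}|."""
    return sum(1 for t in db if itemset <= t)

def frequent_itemsets(items, db, min_sup):
    """Naively enumerate every non-empty I subset of items
    with support >= min_sup."""
    return [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(sorted(items), k)
            if support(frozenset(c), db) >= min_sup]

db = [{"Diapers", "Beer"}, {"Diapers", "Beer", "Milk"}, {"Milk"}]
print(frequent_itemsets({"Diapers", "Beer", "Milk"}, db, min_sup=2))
```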
22. A Priori
Clearly, checking all item sets for frequency isn’t going to work
with n items you have to check 2^n − 1 sets
if n = 100, this means 1.3 × 10^30 sets to check
there have only been 14 × 10^9 years, which had 31.5 × 10^6
seconds each, i.e., there have been 4.4 × 10^17 seconds
which is way too short, even if you could compute the
frequency of 1 item set per clock tick, i.e., 5 × 10^9/s
Fortunately, the A Priori property holds:
I ⊆ J ⇒ supp_db(J) ≤ supp_db(I)
Hence, a simple level-wise search algorithm will find all frequent
item sets.
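A minimal sketch of such a level-wise search (a bare-bones Apriori, not any particular production implementation): level k is built only from candidates all of whose (k−1)-subsets were frequent, which is exactly where the A Priori property pays off.

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise frequent item set mining using the A Priori property:
    I subset of J implies supp(J) <= supp(I), so every superset of an
    infrequent set can be pruned without counting it."""
    def supp(I):
        return sum(1 for t in db if I <= t)

    items = {i for t in db for i in t}
    level = [frozenset([i]) for i in sorted(items)
             if supp(frozenset([i])) >= min_sup]
    frequent = list(level)
    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-sets into k-sets ...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ... prune those with an infrequent (k-1)-subset, then count.
        known = set(frequent)
        level = [c for c in candidates
                 if all(frozenset(s) in known for s in combinations(c, k - 1))
                 and supp(c) >= min_sup]
        frequent.extend(level)
        k += 1
    return frequent

db = [{"Diapers", "Beer"}, {"Diapers", "Beer", "Milk"}, {"Milk"}]
print(apriori(db, min_sup=2))
```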
23. Why Item Set Mining?
Frequent item sets are interesting in their own right (in fact, the
most interesting type of pattern IM(n)HO), but that is not why
they were invented.
Item sets are the basis for Association Rules
Let X, Y ⊆ 𝓘, X ∩ Y = ∅,
X → Y is an association rule iff
P(X ∪ Y) = supp_db(X ∪ Y) / |db| ≥ t1 and
P(Y|X) = supp_db(X ∪ Y) / supp_db(X) ≥ t2
Standard Example: Diapers → Beer
Computing all frequent item sets is the hard part of
computing all association rules
Interesting for shops, but also a very simple (not very good) way to
recommend
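The second, easy step can be sketched as follows (with hypothetical thresholds t1 on relative support and t2 on confidence): once the frequent item sets are known, each one is split in every way into X → Y and the two ratios are checked.

```python
from itertools import combinations

def support(I, db):
    return sum(1 for t in db if I <= t)

def association_rules(frequent, db, t1, t2):
    """Derive X -> Y from each frequent Z = X | Y when
    supp(Z)/|db| >= t1 and supp(Z)/supp(X) >= t2."""
    rules = []
    for Z in frequent:
        if support(Z, db) / len(db) < t1:
            continue
        for r in range(1, len(Z)):
            for X in map(frozenset, combinations(Z, r)):
                if support(Z, db) / support(X, db) >= t2:
                    rules.append((X, Z - X))
    return rules

db = [{"Diapers", "Beer"}, {"Diapers", "Beer", "Milk"}, {"Milk"}]
rules = association_rules([frozenset({"Diapers", "Beer"})], db, t1=0.5, t2=0.9)
print(rules)  # contains the standard example: Diapers -> Beer
```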
24. Why Frequent Item Sets?
People are not as unique as they think
they have many things in common with (many) others
In fact, that is why we can do things like recommending
“birds of a feather...” all over again
We briefly look at 1 example
Facebook likes.
But people also satisfy
very infrequent patterns
and that is a threat to privacy
We briefly look at two examples
Netflix
Credit Card purchases.
25. Example: Mining facebook likes
The pipeline (based on a sample of 58,466 volunteers from the United
States, obtained through the myPersonality Facebook application; on
average 170 Likes per person):
1. User–Like matrix: 58,466 users × 55,814 Likes (10M user–Like
pairs); an entry is 1 if the user Liked the item and 0 otherwise
2. Singular value decomposition (SVD) reduces this sparse matrix to
a user–component matrix with the k = 100 top components (k = 30
for sexual orientation, parents’ relationship status, and drug
consumption, for which fewer users were available)
3. Regression on the components (with 10-fold cross-validation)
predicts the variables, e.g. age = α + β1·C1 + … + β100·C100;
linear regression for numeric variables, logistic regression for
dichotomous ones
Predicted variables include:
Facebook profile: social network size and density
Profile picture: ethnicity
Survey / test results: BIG5 personality, substance use, parents
together?
M. Kosinski, D. Stillwell, T. Graepel: Private traits and attributes
are predictable from digital records of human behavior, PNAS,
March 11, 2013.
26. Example: Mining facebook likes
(excerpt from Kosinski et al., starting mid-sentence)
[...] 65% Liberal), religion (“Muslim”/“Christian”; n = 18,833; 90%
Christian), and the Facebook social network information [n =
17,601; median size, x̃ = 204; interquartile range (IQR), 206;
median density, x̃ = 0.03; IQR, 0.03] were obtained from users’
Facebook profiles. Users’ consumption of alcohol (n = 1,196;
50% drink), drugs (n = 856; 21% take drugs), and cigarettes (n =
1211; 30% smoke) and whether a user’s parents stayed together
until the user was 21 y old (n = 766; 56% stayed together) were
recorded using online surveys. Visual inspection of profile pic-
tures was used to assign ethnic origin to a randomly selected
subsample of users (n = 7,000; 73% Caucasian; 14% African
American; 13% others). Sexual orientation was assigned using the
Facebook profile “Interested in” field; users interested only in
others of the same sex were labeled as homosexual (4.3% males;
2.4% females), whereas those interested in users of the opposite
gender were labeled as heterosexual.
Results
Prediction of Dichotomous Variables. Fig. 2 shows the prediction
accuracy of dichotomous variables expressed in terms of the area
under the receiver-operating characteristic curve (AUC), which is
equivalent to the probability of correctly classifying two randomly
selected users one from each class (e.g., male and female). The
highest accuracy was achieved for ethnic origin and gender. African
Americans and Caucasian Americans were correctly classified in
95% of cases, and males and females were correctly classified in
93% of cases, suggesting that patterns of online behavior as
expressed by Likes significantly differ between those groups
allowing for nearly perfect classification.
Christians and Muslims were correctly classified in 82% of cases,
and similar results were achieved for Democrats and Republicans
(85%). Sexual orientation was easier to distinguish among males
(88%) than females (75%), which may suggest a wider behavioral
divide (as observed from online behavior) between hetero- and
homosexual males.
Good prediction accuracy was achieved for relationship status
and substance use (between 65% and 73%). The relatively lower
accuracy for relationship status may be explained by its temporal
variability compared with other dichotomous variables (e.g.,
gender or sexual orientation).
The model’s accuracy was lowest (60%) when inferring whether
users’ parents stayed together or separated before users were 21 y
old. Although it is known that parental divorce does have long-
term effects on young adults’ well-being (28), it is remarkable that
this is detectable through their Facebook Likes. Individuals
with parents who separated have a higher probability of liking
statements preoccupied with relationships, such as “If I’m with
you then I’m with you I don’t want anybody else” (Table S1).
Fig. 2. Prediction accuracy of classification for dichotomous/dichotomized
attributes expressed by the AUC.
AUC: probability of correctly classifying two randomly selected users,
one from each class (e.g. male and female). Random guessing:
AUC = 0.5.
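That probabilistic reading of the AUC can be computed directly (a sketch with made-up scores): rank every positive–negative pair and count how often the positive is scored higher, ties counting half.

```python
def auc(pos_scores, neg_scores):
    """AUC = P(random positive is ranked above random negative);
    ties count as half a correct classification."""
    pairs = len(pos_scores) * len(neg_scores)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / pairs

# Perfect separation (hypothetical scores) gives 1.0;
# a constant, useless scorer gives 0.5, matching random guessing.
print(auc([0.9, 0.8], [0.2, 0.1]))  # 1.0
print(auc([0.5, 0.5], [0.5, 0.5]))  # 0.5
```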
27. Example: Mining facebook likes
Best predictors of high intelligence include:
“Thunderstorms”
“Science”
“Curly Fries”
Best predictors of low intelligence include:
“I love being a mom”
“Harley Davidson”
“Lady Antebellum”
28. Netflix
Netflix offered big cash prizes for recommender systems that beat
their own recommender
recommending movies to users based on what they have
already seen and how they rated them
They released a data set containing records of the form
[user, movie, date of grade, grade]
in which both the user and the movie were replaced by integers
the same user was replaced by the same unique integer, of
course, the same was true for the movies.
Was this fail-safe anonymization? No:
by cross-referencing with IMDb ratings, users could be
identified
With 8 movie ratings (of which 2 may be completely wrong)
and dates that may have a 14-day error, 99% of records can
be uniquely identified in the dataset.
For 68%, two ratings and dates (with a 3-day error) are
sufficient
29. Credit Card Data
(Science, Vol 347, Issue 6221, 2015)
Assume you have anonymized credit card data
all personal data has been removed
credit card number is replaced by an arbitrary number
time and amount of purchase have been replaced by buckets
The problem is
how much do I need to know about your shopping habits, to
know them all?
The answer is
4 transactions
if I know 4 of your purchases – you get coffee at Starbucks
Central Station around 9 – I can identify your credit card trail
External data makes privacy by anonymization hard!
30. How To Discover This
It is easy to discover that you only need 4 on average
you simply do frequent item set mining with a threshold of 1
pick out all smallest item sets with a frequency of 1
and check the statistics
With these item sets, you can
count the minimum number of specific things you need to
know to identify a specific person
count how many arbitrary things you need to know to identify
an arbitrary person
You can do the same thing with mobile phone trails
and again 4 (approximate) known locations suffice.
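The counting step can be sketched as follows (toy data; the real studies run this over millions of trails): for one person's transaction, find the smallest subset of their items that no one else shares.

```python
from itertools import combinations

def min_unique_size(t, db):
    """Smallest number of t's items needed to single t out in db,
    i.e., the size of a smallest subset of t with support exactly 1
    (brute force; fine for toy data only)."""
    for k in range(1, len(t) + 1):
        for I in map(frozenset, combinations(sorted(t), k)):
            if sum(1 for s in db if I <= s) == 1:
                return k
    return None  # t is not uniquely identifiable at all

db = [{"Starbucks", "Bakery"}, {"Starbucks", "Cinema"}, {"Bakery", "Cinema"}]
print(min_unique_size({"Starbucks", "Bakery"}, db))  # 2: every single shop is shared
```

Averaging this number over all people in the database gives statistics like the "4 purchases on average" result above.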
31. The Paradox of Patterns
These examples illustrate what I like to call the paradox of patterns
Almost all people satisfy both (short) frequent and
infrequent patterns
This is both a
Curse privacy may be breached with dire consequences
and a
Blessing it allows you to turn data into actionable knowledge
with great power (data) comes great responsibility
If you plan to build and exploit big data you should heed this.
33. Recommender Systems Research
Recommendations are (perceived as) one of the key ways to turn
data into money, e.g., because of
selling more to your customers
keeping your customers happy and, thus, keeping your customers
And there are a plethora of possible applications, e.g.,
what results your search engine should show you
what books/movies/music/series/... you would probably like
Hence megabucks are invested in R & D
hence, far too many approaches to survey the field
We’ll keep it short and simple
but do point out some of the pitfalls
34. Collaborative Filtering
From Wikipedia:
a method of making automatic predictions (filtering)
about the interests of a user by collecting preferences or
taste information from many users (collaborating)
Birds of a feather ... flies again.
Two simple approaches are:
user centric
1. look for users who share the same rating patterns with the
active user
2. use the ratings from those like-minded users to calculate a
prediction for the active user
item centric
1. build an item-item matrix determining relationships between
pairs of items
2. infer the tastes of the current user by examining the matrix
and matching that user’s data
35. User Centric
When are users similar?
compute the cosine between two users:
cos(u1, u2) = Σ_i u1,i · u2,i / (|u1| |u2|)
the closer to 1, the more similar the users are
How do you predict?
pick the k closest users
find the item that the largest number of these k users have and
the active user does not
and recommend that item
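A sketch of these two steps (hypothetical ratings; users as dicts from item to rating): cosine similarity picks the k nearest neighbours, and the most common unseen item among them is recommended.

```python
from math import sqrt

def cosine(u, v):
    """cos(u, v) = sum_i u_i*v_i / (|u||v|); 0.0 for an all-zero user."""
    dot = sum(r * v.get(i, 0) for i, r in u.items())
    norm = (sqrt(sum(r * r for r in u.values()))
            * sqrt(sum(r * r for r in v.values())))
    return dot / norm if norm else 0.0

def recommend(active, others, k=2):
    """User-centric CF: take the k most similar users and return the
    item that most of them rated and the active user has not."""
    neighbours = sorted(others, key=lambda v: cosine(active, v), reverse=True)[:k]
    counts = {}
    for v in neighbours:
        for item in v:
            if item not in active:
                counts[item] = counts.get(item, 0) + 1
    return max(counts, key=counts.get) if counts else None

active = {"Matrix": 5, "Alien": 4}
others = [{"Matrix": 5, "Alien": 5, "Blade Runner": 4},
          {"Matrix": 4, "Blade Runner": 3},
          {"Titanic": 1}]
print(recommend(active, others))  # Blade Runner
```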
36. Item Centric
In the item-item matrix you put, e.g., the
support of the pair (I_i, I_j)
The customer is simply an item set
the set of all items he bought
This item set is a subset of at least one of the rows in this matrix
there is at least one row in which the support for all the items
that the customer bought is ≥ 1
(because the customer bought all these items)
For all the rows of which the customer is a subset
determine which items that the customer hasn’t bought yet
have a positive support
From these compute the probability that the customer may like
them
and recommend an item with the highest probability
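A minimal sketch of this item-centric variant (toy data; scoring by summed pair support is a crude stand-in for the probability estimate above):

```python
from itertools import combinations

def item_item_matrix(db):
    """M[(a, b)] = supp({a, b}): how often a and b were bought together."""
    M = {}
    for t in db:
        for a, b in combinations(sorted(t), 2):
            M[(a, b)] = M.get((a, b), 0) + 1
    return M

def recommend(M, bought):
    """Score every unbought item by its total co-occurrence with the
    customer's items and recommend the highest-scoring one."""
    scores = {}
    for (a, b), s in M.items():
        for x, y in ((a, b), (b, a)):
            if x in bought and y not in bought:
                scores[y] = scores.get(y, 0) + s
    return max(scores, key=scores.get) if scores else None

db = [{"Diapers", "Beer"}, {"Diapers", "Beer", "Milk"}, {"Beer", "Milk"}]
M = item_item_matrix(db)
print(recommend(M, {"Diapers"}))  # Beer
```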
37. Easy, no?
No, unfortunately not
Amazon has over 10 million books for sale
2014: a new book every 5 minutes...
The vectors (users) or matrices are going to be very sparse
The closest users may be rather dissimilar
Or, what about time?
is a book you bought years ago as relevant as the one you
pick today?
sometimes: yes (a series like Harry Potter)
sometimes: no (an abandoned hobby).
One account is seen as one user, but is it?
Netflix allows 4 concurrent streams
Couples will share an account
Some books are presents (perhaps even one-off) some are for
own consumption
38. It's All About What You Know
Big Data is not just a lot of data
but often also a lot of different types of data
The more you know about your customers that is different from
what they purchased from you
the better recommendations can become
If you know a customer's social network
you can use purchases from close friends to determine good
recommendations
remember: birds of a feather ...
There are some recommendations I like
usually when I buy something very specific
But many more that I hate
would you like to book the hotel you just booked?
would you like a hotel in X where (part of) your journey ends?
(no thanks, I have 1 hour to transfer – as you should
know!)
39. And How Well You Know Your Products
Products often come in categories
and it makes sense to use this information
It doesn’t make much sense to recommend “chick-lit”
to someone who up to now only bought SciFi
But, then again, some authors straddle genres
and then it may be the author and not the genre that
enthuses the customer
It is not just about data scientists
but just as much about domain experts
It is about teamwork, not about unicorns
It is about data
as much as about algorithms
41. Models
When one encounters a new type of data or a new type of problem
the first thing to decide is how do we model the data
Classical models you learned about in kindergarten often simply
don’t fit
linear regression on a collection of text documents?
Sometimes you can generalize
sometimes you have to invent something new
Sometimes you model all data collectively
sometimes you use patterns
I always search for the right pattern language
because the world is never homogeneous
42. Which Model to Choose
If you know how you want to model (describe) your data
you have to decide what makes a model good
Kindergarten statistics doesn’t scale
in big data everything is significant
Fortunately, there are many ways to solve this
I will spare you the details (for now ...)
For pattern miners it is easy
they should at least be frequent
and probably satisfy other constraints
43. Algorithms
When you can specify which models you prefer
there is only one thing left to do
You have to devise an algorithm that finds these models
Sometimes you are lucky
you can compute the optimal model
Sometimes you are not
it is infeasible or simply impossible to compute the optimal
model
there are things a computer cannot compute
Then you have to use heuristics
to find reasonably good models
In pattern mining life is good
we can usually find all good patterns
44. Done?
Unfortunately often: no! Especially for pattern miners, e.g.,
data usually has many, very many patterns
often more patterns than one has data!
You cannot use all of them
that would lead to very bad models (overfitting)
And you cannot look at them to pick out superior ones by eye
would you inspect 1012 patterns?
The Model Selection problem returns with a vengeance...
we have to choose a small set of patterns automatically.
45. The Problem of Induction
In big data we have lots and lots of data
but it is still a finite sample
And we want to say something about new data (the future, if you
want).
We want to induce a general “law” from our finite sample
and there are infinitely many functions that go through the
same finite set of data points
and there is no reasonable choice between these infinitely many
possibilities
this is what David Hume called the problem of induction
The well-known example (before the discovery of Australia)
how many swans should you have seen before you can
conclude that all swans are white?
46. Faith
There is no reasonable choice between models without further
assumptions
hence we have to make such assumptions
one has to believe something about the world
In Kindergarten Statistics
you believe that you know what the model looks like
and usually you also believe the data to have some desirable
statistical properties
and you compute the model that is optimal given your beliefs
These are very strong beliefs, I make a different assumption
I only believe that the “law” we are looking for is a
computable function
This is a very weak assumption
after all, what good is a law you cannot use?
47. MDL
More concretely we use the Minimum Description Length Principle
to find a small set of characteristic patterns:
Given a set of models H, the best model H ∈ H is the one that
minimizes
L(H) + L(D|H)
in which
L(H) is the length, in bits, of the description of H, and
L(D|H) is the length, in bits, of the description of the data
when encoded with H.
note: this is lossless compression
Our idea: find a small set of patterns that collectively describe the
data well (i.e., compress it well)
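A toy illustration of the two-part score (a hypothetical greedy cover encoder, not the actual Krimp algorithm): L(H) charges a fixed number of bits per item in the code table, and L(D|H) uses Shannon-optimal code lengths −log2(usage/total) for the patterns that cover the data. The pattern set that compresses the data best wins.

```python
from math import log2

def total_length(db, patterns, bits_per_item=8):
    """Toy L(H) + L(D|H). Assumes the patterns can cover every
    transaction (include the singletons if needed)."""
    # L(H): describing the code table itself (a made-up fixed cost).
    L_H = bits_per_item * sum(len(p) for p in patterns)
    # Cover each transaction greedily, longest pattern first.
    usage = {p: 0 for p in patterns}
    for t in db:
        rest = set(t)
        for p in sorted(patterns, key=len, reverse=True):
            if p <= rest:
                usage[p] += 1
                rest -= p
    # L(D|H): optimal code lengths -log2(usage/total) for used patterns.
    total = sum(usage.values())
    L_D = sum(-u * log2(u / total) for u in usage.values() if u)
    return L_H + L_D

db = [{"Diapers", "Beer"}] * 4 + [{"Milk"}] * 2
with_pattern = [frozenset({"Diapers", "Beer"}), frozenset({"Milk"})]
singletons = [frozenset({"Diapers"}), frozenset({"Beer"}), frozenset({"Milk"})]
# MDL prefers the code table that contains the recurring pattern:
print(total_length(db, with_pattern) < total_length(db, singletons))  # True
```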
48. Krimp
In view of time
and perhaps your appetite for some hefty math by now
I will spare you the details
It suffices to say that the resulting (heuristic!) algorithm, Krimp,
really reduces the number of patterns one has to consider. And the
resulting set of patterns is very characteristic.
51. Characteristic by Classification
Split the database (n classes) per class
Apply KRIMP to each class, giving a code table per class
Encode unseen transactions with every code table
The shortest code wins!
This yields pretty good classifiers, while we only try to describe the
data.
53. Encore, encore, encore ...
There is much more we can do with the results of Krimp
clustering
change detection
data imputation
....
In fact, we can do recommendation
tag recommendation for Flickr data
sorry, we haven’t looked at other recommendation problems
But again, in view of time we will not cover this today
except one further remark on privacy
54. Privacy Revisited
We can use the results from Krimp to generate data sets
each tuple (entry) in this data set is completely random
no relation with any person in the real world
but almost all statistical properties of the original data are
more or less preserved in the generated data
if I give you generated data instead of the original data, you’ll
find the same model
Data Mining with Guaranteed Privacy
Unfortunately this can only be verified experimentally in a limited
setting
I have been working on a general, provably correct, version off
and on for a few years now
The outline is done, but
there are still a few hairy details that need to be solved
nothing a couple of hundred k cannot solve.
56. Conclusions
Big Data is here to stay
and with Big Data many aspects of life can become much
nicer
Big Data can make your life better, provided
you have the relevant data
you have people that really understand your domain
you have people that play with data for fun
(and these groups are usually distinct)
And it can even be done without violating privacy