Big Data (and a bit of) Media
Prof. dr Arno Siebes1
Algorithmic Data Analysis Group
Department of Information and Computing Sciences
Universiteit Utrecht
Masterclass Data Science Mediapark
November 12, 2015
1 Sponsored in part by the Dutch national COMMIT project
Prologue
Big Data
Big data is like teenage sex: everyone talks about it,
nobody really knows how to do it, everyone thinks
everyone else is doing it, so everyone claims they are
doing it.
by prof. Dan Ariely
James B. Duke Professor of Psychology and Behavioral
Economics at Duke University
founding member of the Center for Advanced Hindsight
At the end you will at least know what the fuss is about and have
thought about its consequences.
Outline
I am going to talk about three things
1. Why Big Data?
why do we suddenly have big data?
and what does it actually mean?
2. Patterns and Profiles
“we” want big data because it allows us to find patterns; what
are these patterns and how do we find them?
and we’ll briefly discuss privacy in this context
3. Recommendations
how can we use these profiles to make recommendations?
Time permitting, I’ll tell you something about the forefront of
research
Why Big Data?
The Greatest Discovery Ever
Information is immaterial (incorporeal)
it may be embodied by some material configuration,
but it isn’t that matter.
there are many ways to represent the same information, e.g.
it is always representable by a 0/1 string (Shannon)
Every possible transformation of information (effective
computation) can be done by a computer
firstly because of the existence of Universal Turing Machines
(Turing)
secondly because of the Church - Turing Thesis (Church,
Turing, and Rosser)
Only one type of machine is necessary to process any type of
information in any way possible (how many for matter?)
This glorious insight gave rise to the age of computation.
Because Information is Immaterial,
it has no size!
Hence Moore’s Law:
CPU: a 2.5 million-fold increase
Intel 4004 (1971): 2,300 transistors
Intel Xeon E5-2699 v3 (2014): 5.5 × 10^9 transistors (its 45 MB L3 cache is more than twice the size of my first hard disk of 20 MB)
Hard Disk: a million-fold increase
IBM (1956): 5 MB (for $50,000)
Toshiba (2015): 5 TB (for $200)
note: bytes per dollar increased by 2.5 × 10^8
Some quite smart technological breakthroughs may also have been
not unimportant for the miniaturization; but I’m a computer
scientist
Shrinking Means Growth
Moore’s Law has made:
computing ubiquitous
from PC to laptop to tablet to smartphone, to smartwatch ...
from dedicated hardware to cars, to fridges, to thermostats ...
and data big
from transactional DB’s to DW’s, to digital libraries, to clouds,
to social networks
from data entry to web interfaces to quantified self, to
quantified employees (HR analytics), to sensors everywhere
Or phrased more prosaically,
data is acquired, stored, exchanged, and processed, on
everything, everywhere, at any time, all the time.
If it isn’t digital,
it doesn’t exist.
it didn’t happen
it doesn’t matter
Hence, Big Data
This ubiquity is what causes Big Data, which is “defined” by
Volume, we are truly talking about massive amounts of data
Velocity, the data comes from all sides at faster and faster rates – often too fast to look at more than once
Variety, the data comes in many sorts, shapes, and sizes
data used to mean a (rectangular) table filled
with numbers
but that is no longer true
we have libraries of texts, databases with
molecules, music and video collections, and so
forth, and so forth.
Big is Really Big
Some (not so) random statistics:
in 2012 the world produced 1.8 Zettabyte (i.e., 1.8 × 10^21 bytes = 1.8 × 10^9 TB)
the NSA is estimated to hoard 3 – 12 Exabyte (10^18 bytes; Forbes 2013)
in 2014 the world wide web was estimated at 4 Zettabyte
In 1 second, there are (Internet Live Stats, March 30, 2015)
1,841 Tumblr posts
1,918 Instagram photos uploaded
8,885 Tweets sent
48,187 Google searches
98,404 YouTube videos viewed
2,383,324 Emails sent
But, Why?
It is interesting that we generate so much data per second
but why is it all stored?
Clearly, storage space is cheap, but still
that doesn’t mean that every bit is sacred, does it?
The reason is (at least) twofold
You store everything about yourself
Facebook, YouTube, Twitter, Whatsapp, Google+, LinkedIn,
Instagram, Snapchat, Pinterest, foursquare, WeChat, ...
don’t ask me why you do that.
Companies love these hoards of data, it allows them to make
profiles
Because
Data is the New Oil
Clive Humby
Data is just like crude. It is valuable, but if unrefined it
cannot really be used. It has to be changed into gas,
plastic, chemicals, etc to create a valuable entity that
drives profitable activity; so must data be broken down,
analyzed for it to have value.
Companies, governments, healthcare institutions and who not(?)
make profiles to
to predict what you want to buy next
to offer you better service
to predict what medication is best for you
for better and for worse
Patterns and Profiles
Patterns are Groups
A pattern is a set of characteristics shared by a group of
customers
patients
transactions
i.e., anything you have data of
a group that is for some reason deemed interesting
For example, a group of
patients that have the same disease
customers that spent at least 1000 euros in your shop last year
insurants who claim way above average (or way below)
Patterns are Descriptions
We only look for groups that are easily described, e.g.,
Sex = M and Age ≤ 25 for car insurance
BRCA1, Exon 2 deletion, exon 13 deletion, 2804delAA –
describes (some) Dutch women with a higher risk of early
onset breast cancer
The reason to look for such groups is
what social scientists call homophily
birds of a feather flock together
i.e., similar “things” act similarly
We’ll restrict ourselves mostly to very simple patterns
item sets such as {Diapers, Beer}
describing all transactions in which the customer bought both
Diapers and Beer.
Diapers and Beer?
When is a Pattern Interesting?
A data miner should define what makes a pattern interesting.
There are many interestingness measures, e.g.
heightened risk: a group with above average risk
differential: this group of patients shares a set of
characteristics, not shared by non-patients
frequency: these items have been bought together more than
θ times
These predefined measures come with an algorithm that allows us
to
find all interesting patterns in the database
relatively efficiently
Profiles are Interesting Patterns
Interesting patterns are used as profiles
profiles to recognize good customers
profiles to recognize patients with a certain disease
profiles of groups of articles one should (or should not)
discount together
That is, we get profiles if we have interesting patterns we can act
on.
Mind you, patterns can be interesting – and extremely useful –
even if one cannot act upon them. The most important reason to
mine for interesting patterns is that they provide insight in the
data – and thus into the real world.
Pattern Mining
Or, theory mining, as defined by Mannila and Toivonen in '97:
Given a database db, a language L to define subgroups of the data
and a selection predicate q that determines whether an element
φ ∈ L describes an interesting subgroup of db or not, the task is to
find:
T (L, db, q) = {φ ∈ L | q(db, φ) is true }
That is, the task is to find all interesting subgroups.
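In code, theory mining is simply a filter over the pattern language. A minimal Python sketch (the language, toy database, and predicate below are illustrative assumptions, not part of the original formulation):

```python
from itertools import combinations

def theory(language, db, q):
    """T(L, db, q): all patterns phi in L for which q(db, phi) is true."""
    return {phi for phi in language if q(db, phi)}

# Toy instantiation: L is all non-empty item sets over I,
# q is "occurs in at least 2 transactions".
I = ["beer", "diapers", "milk"]
db = [{"beer", "diapers"}, {"beer", "diapers", "milk"}, {"milk"}]
L = [frozenset(c) for r in range(1, len(I) + 1)
     for c in combinations(I, r)]
q = lambda db, phi: sum(phi <= t for t in db) >= 2

print(sorted(sorted(p) for p in theory(L, db, q)))
```

The point is only the shape of the task: everything interesting lies in choosing L and q, and in not enumerating all of L naively.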
Transaction Databases
The first example:
Each entry in the database records the contents of a shopping
basket at the check-out
Just presence of items, no counts
More formal:
Given a set of items I,
a transaction t over I is a subset of I, i.e., t ⊆ I
A transaction database db over I is a bag of transactions over
I
Note: each categorical database can be recoded as a transaction
database.
Frequent Item Set Mining
An item set X is a set of items, X ⊆ I
L = P(I)
An item set X occurs in a transaction t iff X ⊆ t
The support of item set X is:
supp_db(X) = |{t ∈ db | X ⊆ t}|
The “interestingness” predicate is a threshold on the support
of the item sets, the minimal support: min-sup.
Frequent item set mining task:
{X ∈ L | supp_db(X) ≥ min-sup}
A Priori
Clearly, checking all item sets for frequency isn’t going to work
with n items you have to check 2^n − 1 sets
if n = 100, this means 1.3 × 10^30 sets to check
there have only been 14 × 10^9 years, which had 31.5 × 10^6
seconds each, i.e., there have been 4.4 × 10^17 seconds
which is way too short, even if you could compute the
frequency of 1 item set per clock tick, i.e., 5 × 10^9/s
Fortunately, the A Priori property holds:
X ⊆ Y ⇒ supp_db(Y) ≤ supp_db(X)
Hence, a simple level-wise search algorithm will find all frequent
item sets.
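A minimal level-wise sketch of that search in Python (a toy illustration, not an optimized miner; real implementations use clever data structures rather than rescanning the database per candidate):

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise search: a (k+1)-item set is a candidate only if
    all of its k-item subsets are frequent (the A Priori property)."""
    items = sorted({i for t in db for i in t})
    support = lambda I: sum(set(I) <= t for t in db)
    frequent = {}
    level = [(i,) for i in items]
    while level:
        level = [I for I in level if support(I) >= min_sup]
        frequent.update({I: support(I) for I in level})
        # extend each frequent k-set; prune candidates with an
        # infrequent k-subset before ever counting their support
        level = [I + (j,) for I in level for j in items if j > I[-1]
                 and all(tuple(sorted(set(I + (j,)) - {x})) in frequent
                         for x in I + (j,))]
    return frequent

db = [{"beer", "diapers"}, {"beer", "diapers", "chips"},
      {"diapers", "milk"}, {"beer", "chips"}]
print(apriori(db, min_sup=2))
```

With min-sup = 2 on this toy database, {milk} and every set containing it never even becomes a candidate beyond the first level.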
Why Item Set Mining?
Frequent item sets are interesting in their own right (in fact, the
most interesting type of pattern IM(n)HO), but that is not why
they were invented.
Item sets are the basis for Association Rules
Let X, Y ⊆ I, X ∩ Y = ∅,
X → Y is an association rule iff
P(X ∪ Y) = supp_db(X ∪ Y) / |db| ≥ t1
P(Y | X) = supp_db(X ∪ Y) / supp_db(X) ≥ t2
Standard Example: Diapers → Beer
Computing all frequent item sets is the hard part of
computing all association rules
Interesting for shops, but also a very simple (not very good) way to
recommend
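Once the frequent item sets and their supports are known, deriving the rules is the easy part. A sketch (t1 and t2 as in the definition above; item sets are keyed as sorted tuples, a representation choice of this sketch):

```python
from itertools import combinations

def association_rules(frequent, n_transactions, t1, t2):
    """All rules X -> Y with supp(X u Y)/|db| >= t1 and
    confidence supp(X u Y)/supp(X) >= t2.
    `frequent` maps sorted item tuples to their support counts."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2 or sup / n_transactions < t1:
            continue
        for r in range(1, len(itemset)):
            for X in combinations(itemset, r):
                conf = sup / frequent[X]  # supp(X) exists by A Priori
                if conf >= t2:
                    Y = tuple(i for i in itemset if i not in X)
                    rules.append((X, Y, conf))
    return rules

frequent = {("beer",): 3, ("diapers",): 3, ("beer", "diapers"): 2}
rules = association_rules(frequent, n_transactions=4, t1=0.5, t2=0.6)
for X, Y, conf in rules:
    print(X, "->", Y, round(conf, 2))
```

Note that the loop only divides supports that were already computed, which is why mining the frequent sets is the hard part.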
Why Frequent Item Sets?
People are not as unique as they think
they have many things in common with (many) others
In fact, that is why we can do things like recommending
“birds of a feather...” all over again
We briefly look at 1 example
Facebook likes.
But people also satisfy
very infrequent patterns
and that is a threat to privacy
We briefly look at two examples
Netflix
Credit Card purchases.
Example: Mining facebook likes
[Figure: the sparse user–Like matrix (58,466 users × 55,814 Likes; 10M user–Like pairs; an entry is 1 if the user Liked the item, 0 otherwise) is reduced via singular-value decomposition to a user–components matrix with the k = 100 top components. Numeric variables (e.g. age = α + β1·C1 + … + β100·C100) are predicted with linear regression on these components, dichotomous variables with logistic regression, both with 10-fold cross-validation. Predicted variables include Facebook profile data (social network size and density), ethnicity (from the profile picture), and survey/test results (BIG5 personality, substance use, parents together?). Sample: 58,466 US volunteers, obtained through the myPersonality Facebook application.]
M. Kosinski, D. Stillwell, T. Graepel: Private traits and attributes
are predictable from digital records of human behavior, PNAS,
March 11, 2013.
Example: Mining facebook likes
(excerpt from the paper:)
... 65% Liberal), religion (“Muslim”/“Christian”; n = 18,833; 90%
Christian), and the Facebook social network information [n =
17,601; median size, ~X = 204; interquartile range (IQR), 206;
median density, ~X = 0.03; IQR, 0.03] were obtained from users’
Facebook profiles. Users’ consumption of alcohol (n = 1,196;
50% drink), drugs (n = 856; 21% take drugs), and cigarettes (n =
1211; 30% smoke) and whether a user’s parents stayed together
until the user was 21 y old (n = 766; 56% stayed together) were
recorded using online surveys. Visual inspection of profile pic-
tures was used to assign ethnic origin to a randomly selected
subsample of users (n = 7,000; 73% Caucasian; 14% African
American; 13% others). Sexual orientation was assigned using the
Facebook profile “Interested in” field; users interested only in
others of the same sex were labeled as homosexual (4.3% males;
2.4% females), whereas those interested in users of the opposite
gender were labeled as heterosexual.
Results
Prediction of Dichotomous Variables. Fig. 2 shows the prediction
accuracy of dichotomous variables expressed in terms of the area
under the receiver-operating characteristic curve (AUC), which is
equivalent to the probability of correctly classifying two randomly
selected users one from each class (e.g., male and female). The
highest accuracy was achieved for ethnic origin and gender. African
Americans and Caucasian Americans were correctly classified in
95% of cases, and males and females were correctly classified in
93% of cases, suggesting that patterns of online behavior as
expressed by Likes significantly differ between those groups
allowing for nearly perfect classification.
Christians and Muslims were correctly classified in 82% of cases,
and similar results were achieved for Democrats and Republicans
(85%). Sexual orientation was easier to distinguish among males
(88%) than females (75%), which may suggest a wider behavioral
divide (as observed from online behavior) between hetero- and
homosexual males.
Good prediction accuracy was achieved for relationship status
and substance use (between 65% and 73%). The relatively lower
accuracy for relationship status may be explained by its temporal
variability compared with other dichotomous variables (e.g.,
gender or sexual orientation).
The model’s accuracy was lowest (60%) when inferring whether
users’ parents stayed together or separated before users were 21 y
old. Although it is known that parental divorce does have long-
term effects on young adults’ well-being (28), it is remarkable that
this is detectable through their Facebook Likes. Individuals
with parents who separated have a higher probability of liking
statements preoccupied with relationships, such as “If I’m with
you then I’m with you I don’t want anybody else” (Table S1).
a Like and 0 otherwise. The dimensionality of the user–Like matrix was reduced using singular-value decomposition (SVD) (24). Numeric variables such as age or
intelligence were predicted using a linear regression model, whereas dichotomous variables such as gender or sexual orientation were predicted using logistic
regression. In both cases, we applied 10-fold cross-validation and used the k = 100 top SVD components. For sexual orientation, parents’ relationship status, and drug
consumption only k = 30 top SVD components were used because of the smaller number of users for which this information was available.
Fig. 2. Prediction accuracy of classification for dichotomous/dichotomized
attributes expressed by the AUC.
AUC: the probability of correctly classifying two randomly selected
users, one from each class (e.g. male and female). Random guessing:
AUC = 0.5.
Example: Mining facebook likes
Best predictors of high intelligence include:
“Thunderstorms”
“Science”
“Curly Fries”
Best predictors of low intelligence include:
“I love being a mom”
“Harley Davidson”
“Lady Antebellum”
Netflix
Netflix offered big cash prizes for recommender systems that beat
their own recommender
recommending movies to users based on what they have
already seen and how they rated them
They released a data set containing records of the form
[user, movie, date of grade, grade]
in which both the user and the movie were replaced by integers
the same user was replaced by the same unique integer, of
course, the same was true for the movies.
Was this fail-safe anonymization? No:
by cross-referencing with IMDb ratings, users could be
identified
With 8 movie ratings (of which 2 may be completely wrong)
and dates that may have a 14-day error, 99% of records can
be uniquely identified in the dataset.
For 68%, two ratings and dates (with a 3-day error) are
sufficient
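A toy sketch of the linkage idea (not Narayanan and Shmatikov's actual algorithm; names and thresholds are illustrative): an anonymized record matches a public IMDb profile when enough (movie, grade, date) triples agree within the stated tolerances.

```python
from datetime import date

def links(anon, public, min_hits=8, grade_slack=1, day_slack=14):
    """Count triples of the anonymized record that have a
    close-enough counterpart in the public profile."""
    hits = sum(
        any(m2 == m and abs(g2 - g) <= grade_slack
            and abs((d2 - d).days) <= day_slack
            for m2, g2, d2 in public)
        for m, g, d in anon)
    return hits >= min_hits

anon = [("Movie A", 5, date(2005, 3, 1)), ("Movie B", 3, date(2005, 4, 2))]
imdb = [("Movie A", 5, date(2005, 3, 5)), ("Movie B", 4, date(2005, 4, 1))]
print(links(anon, imdb, min_hits=2))  # → True
```

Because so few approximate triples suffice to make a record unique, even this crude matching links most records to only one candidate.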
Credit Card Data
(Science, Vol 347, Issue 6221, 2015)
Assume you have anonymized credit card data
all personal data has been removed
credit card number is replaced by an arbitrary number
time and amount of purchase have been replaced by buckets
The problem is
how much do I need to know about your shopping habits, to
know them all?
The answer is
4 transactions
if I know 4 of your purchases – you get coffee at Starbucks
Central Station around 9 – I can identify your credit card trail
External data makes privacy by anonymization hard!
How To Discover This
It is easy to discover that you only need 4 on average
you simply do frequent item set mining with a threshold of 1
pick out all smallest item sets with a frequency of 1
and check the statistics
With these item sets, you can
count the minimum number of specific things you need to
know to identify a specific person
count how many arbitrary things you need to know to identify
an arbitrary person
You can do the same thing with mobile phone trails
and again 4 (approximate) known locations suffice.
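The unicity computation sketched above is simple to write down (illustrative code, not the studies' implementations): find the smallest number of observed points that singles the target's trail out of the whole collection.

```python
from itertools import combinations

def points_to_identify(trails, target, max_k=10):
    """Smallest k such that some k observations from the target's
    trail occur together in no other trail."""
    for k in range(1, max_k + 1):
        for obs in combinations(sorted(target), k):
            if not any(set(obs) <= other
                       for other in trails if other is not target):
                return k
    return None

trails = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}]
print(points_to_identify(trails, trails[0]))  # → 2
```

Averaging this number over all trails gives the "4 transactions suffice" style of statistic.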
The Paradox of Patterns
These examples illustrate what I like to call the paradox of patterns
Almost all people satisfy both (short) frequent and
infrequent patterns
This is both a
Curse privacy may be breached with dire consequences
and a
Blessing it allows you to turn data into actionable knowledge
with great power (data) comes great responsibility
If you plan to build and exploit big data you should heed this.
Recommendations
Recommender Systems Research
Recommendations are (perceived as) one of the key ways to turn
data into money, e.g., because of
selling more to your customers
keeping your customers happy and, thus, keeping them as customers
And there are a plethora of possible applications, e.g.,
what results your search engine should show you
what books/movies/music/series/... you would probably like
Hence megabucks are invested in R & D
hence, far too many approaches to survey the field
We’ll keep it short and simple
but do point out some of the pitfalls
Collaborative Filtering
From Wikipedia:
a method of making automatic predictions (filtering)
about the interests of a user by collecting preferences or
taste information from many users (collaborating)
Birds of a feather ... flies again.
Two simple approaches are:
user centric
1. look for users who share the same rating patterns with the
active user
2. use the ratings from those like-minded users to calculate a
prediction for the active user
item centric
1. build an item-item matrix determining relationships between
pairs of items
2. infer the tastes of the current user by examining the matrix
and matching that user’s data
User Centric
When are users similar?
compute the cosine between two users:
cos(u1, u2) = (Σi u1,i · u2,i) / (|u1| |u2|)
the closer to 1, the more similar users are
How do you predict?
pick the closest k users
recommend the item that most of these k users have and the
active user does not have yet
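A minimal user-centric sketch in Python (ratings as sparse dicts; the data is made up):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    num = sum(u[i] * v[i] for i in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(active, others, k=2):
    """Among the k most similar users, recommend the item that
    most of them have and the active user lacks."""
    nearest = sorted(others, key=lambda u: cosine(active, u),
                     reverse=True)[:k]
    counts = {}
    for u in nearest:
        for item in set(u) - set(active):
            counts[item] = counts.get(item, 0) + 1
    return max(counts, key=counts.get) if counts else None

active = {"A": 5, "B": 4}
others = [{"A": 5, "B": 5, "C": 4}, {"A": 4, "C": 5}, {"D": 2}]
print(recommend(active, others))  # → C
```

The sparsity problem discussed below is already visible here: the third user shares nothing with the active user, so their cosine is 0.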
Item Centric
In the item-item matrix you put, e.g., the
support of the pair (Ii, Ij)
The customer is simply an item set
the set of all items he bought
This item set is a subset of at least one of the rows in this matrix
there is at least one row in which the support for all the items
that the customer bought is ≥ 1
(because the customer bought all these items)
For all the rows of which the customer is a subset
determine which items that the customer hasn’t bought yet
have a positive support
From these compute the probability that the customer may like
them
and recommend an item with the highest probability
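The item-centric variant, sketched with a plain pair-support matrix (toy data; scoring unbought items by raw co-occurrence counts is a simplification of the probability computation described above):

```python
from collections import defaultdict
from itertools import combinations

def item_item_support(db):
    """Support of every item pair, i.e. the item-item matrix."""
    counts = defaultdict(int)
    for t in db:
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    return counts

def recommend(basket, db):
    """Score each unbought item by its co-occurrence with the
    customer's items; recommend the best-scoring one."""
    counts = item_item_support(db)
    candidates = {i for t in db for i in t} - basket
    scores = {i: sum(counts[tuple(sorted((i, b)))] for b in basket)
              for i in candidates}
    return max(scores, key=scores.get) if scores else None

db = [{"beer", "diapers"}, {"beer", "diapers", "chips"},
      {"diapers", "milk"}]
print(recommend({"beer"}, db))  # → diapers
```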
Easy, no?
No, unfortunately not
Amazon has over 10 million books for sale
2014: a new book every 5 minutes...
The vectors (users) or matrices are going to be very sparse
The closest users may be rather dissimilar
Or, what about time?
is a book you bought years ago as relevant as the one you
pick today?
sometimes: yes (a series like Harry Potter)
sometimes: no (an abandoned hobby).
One account is seen as one user, but is it?
Netflix allows 4 concurrent streams
Couples will share an account
Some books are presents (perhaps even one-off) some are for
own consumption
It’s All About What You Know
Big Data is not just a lot of data
but often also a lot of different types of data
The more you know about your customers that is different from
what they purchased from you
the better recommendations can become
If you know a customer’s social network
you can use purchases from close friends to determine good
recommendations
remember: birds of a feather ...
There are some recommendations I like
usually when I buy something very specific
But many more that I hate
would you like to book the hotel you just booked?
would you like a hotel in X where (part of) your journey ends?
(no thanks, I have 1 hour to transfer – as you should
know!)
And How Well You Know Your Products
Products often come in categories
and it makes sense to use this information
It doesn’t make much sense to recommend “chick-lit”
to someone who up to now only bought SciFi
But, then again, some authors straddle genres
and then it may be the author and not the genre that
enthuses the customer
It is not just about data scientists
but just as much about domain experts
It is about teamwork, not about unicorns
It is about data
as much as about algorithms
A Day in the Life
Models
When one encounters a new type of data or a new type of problem
the first thing to decide is how do we model the data
Classical models you learned about in kindergarten often simply
don’t fit
linear regression on a collection of text documents?
Sometimes you can generalize
sometimes you have to invent something new
Sometimes you model all data collectively
sometimes you use patterns
I always search for the right pattern language
because the world is never homogeneous
Which Model to Choose
If you know how you want to model (describe) your data
you have to decide what makes a model good
Kindergarten statistics doesn’t scale
in big data everything is significant
Fortunately, there are many ways to solve this
I will spare you the details (for now ...)
For pattern miners it is easy
they should at least be frequent
and probably satisfy other constraints
Algorithms
When you can specify which models you prefer
there is only one thing left to do
You have to devise an algorithm that finds these models
Sometimes you are lucky
you can compute the optimal model
Sometimes you are not
it is infeasible or simply impossible to compute the optimal
model
there are things a computer cannot compute
Then you have to use heuristics
to find reasonably good models
In pattern mining life is good
we can usually find all good patterns
Done?
Unfortunately often: no! Especially for pattern miners, e.g.,
data usually has many, very many patterns
often more than one has data!
You cannot use all of them
that would lead to very bad models (overfitting)
And you cannot look at them to pick out superior ones by eye
would you inspect 10^12 patterns?
The Model Selection problem returns with a vengeance...
we have to choose a small set of patterns automatically.
The Problem of Induction
In big data we have lots and lots of data
but it is still a finite sample
And we want to say something about new data (the future if you
want).
We want to induce a general “law” from our finite sample
and there are infinitely many functions that go through the
same finite set of data points
and there is no reasonable choice between these infinitely many
possibilities
this is what David Hume called: the problem of induction
The well-known example (before the discovery of Australia)
how many swans should you have seen before you can
conclude that all swans are white?
Faith
There is no reasonable choice between models without further
assumptions
hence we have to make such assumptions
one has to believe something about the world
In Kindergarten Statistics
you believe that you know what the model looks like
and usually you also believe the data to have some desirable
statistical properties
and you compute the model that is optimal given your beliefs
These are very strong beliefs, I make a different assumption
I only believe that the “law” we are looking for is a
computable function
This is a very weak assumption
after all, what good is a law you cannot use?
MDL
More concretely we use the Minimum Description Length Principle
to find a small set of characteristic patterns:
Given a set of models H, the best model H ∈ H is the one that
minimizes
L(H) + L(D|H)
in which
L(H) is the length, in bits, of the description of H, and
L(D|H) is the length, in bits, of the description of the data
when encoded with H.
note: this is lossless compression
Our idea: find a small set of patterns that collectively describe the
data well (i.e., compress it well)
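To make the principle concrete, here is a toy two-part score. This is emphatically not Krimp's actual encoding: the uniform code lengths and the greedy cover below are simplifying assumptions, purely to show the shape of L(H) + L(D|H).

```python
import math

def total_length(patterns, db):
    """L(H) + L(D|H) under a toy encoding: L(H) spells out each
    pattern's items; L(D|H) greedily covers every transaction with
    patterns, then with singleton items, one fixed-length code each."""
    items = {i for t in db for i in t}
    item_bits = math.log2(len(items)) if len(items) > 1 else 1.0
    L_H = sum(len(p) * item_bits for p in patterns)       # L(H)
    code_bits = math.log2(len(patterns) + len(items))
    L_D = 0.0                                             # L(D|H)
    for t in db:
        rest = set(t)
        for p in sorted(patterns, key=len, reverse=True):
            if p <= rest:
                rest -= p
                L_D += code_bits
        L_D += len(rest) * code_bits
    return L_H + L_D

db = [{"a", "b"}] * 10
# The pattern {a, b} compresses this database better than no pattern:
print(total_length([frozenset({"a", "b"})], db)
      < total_length([], db))  # → True
```

Even in this crude form the trade-off is visible: a pattern pays once in L(H) and earns on every transaction it covers, so only patterns that genuinely recur end up in a good model.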
Krimp
In view of time
and perhaps your appetite for some hefty math by now
I will spare you the details
It suffices to say that the resulting (heuristic!) algorithm, Krimp,
really reduces the number of patterns one has to consider. And the
resulting set of patterns is very characteristic.
Reduction Visualized
[Figure: the number of patterns (0–5000) against the minimum support threshold (100 down to 20) on the Wine dataset, for two series: all frequent item sets versus those picked by Krimp.]
More Results
[Figure: for 27 datasets (Accidents, Adult, Anneal, BMS-pos, BMS-wv1, BMS-wv2, Breast, Chess(k-k), Chess(kr-k), Connect-4, DNAamp, Heart, Ionosphere, Iris, Led7, Letter, Mammals, Mushroom, Nursery, Pageblocks, Pendigits, Pima, Pumsbstar, Retail, Tic-tac-toe, Waveform, Wine): the number of frequent item sets, the size of Krimp’s code table |CT|, and the run time in seconds, on log scales from 1 up to 10^9.]
Only 1 in 10^9 is chosen
Characteristic by Classification
[Figure: split the database (n classes) per class; apply KRIMP to obtain a code table per class; encode unseen transactions with each code table; the shortest code wins.]
This yields pretty good classifiers, while we only try to describe the
data.
Classification Example
[Figure: two transactions and their covers by the code tables CT1 and CT2 of two classes; each transaction is encoded by both tables and classified by the table that gives the shortest code.]
Encore, encore, encore ...
There is much more we can do with the results of Krimp
clustering
change detection
data imputation
....
In fact, we can do recommendation
tag recommendation for Flickr data
sorry, we haven’t looked at other recommendation problems
But again, in view of time we will not cover this today
except one further remark on privacy
Privacy Revisited
We can use the results from Krimp to generate data sets
each tuple (entry) in this data set is completely random
no relation with any person in the real world
but almost all statistical properties of the original data are
more or less preserved in the generated data
if I give you generated data instead of the original data, you’ll
find the same model
Data Mining with Guaranteed Privacy
Unfortunately this can only be verified experimentally in a limited
setting
I have been working on a general, provably correct, version off
and on for a few years now
the outline is done, but
there are still a few hairy details that need to be solved
nothing a couple of hundred k cannot solve.
Conclusions
Conclusions
Big Data is here to stay
and with Big Data many aspects of life can become much
nicer
Big Data can make your life better, provided
you have the relevant data
you have people that really understand your domain
you have people that play with data for fun
(and these groups are usually distinct)
And it can even be done without violating privacy

Gregory Renard
 
Hiding Sensitive Association Rules
Hiding Sensitive Association Rules Hiding Sensitive Association Rules
Hiding Sensitive Association Rules
Vinayreddy Polati
 

Similar to Master Minds on Data Science - Arno Siebes (20)

Data science training institute in hyderabad
Data science training institute in hyderabadData science training institute in hyderabad
Data science training institute in hyderabad
 
DevelopingDataScienceProfession
DevelopingDataScienceProfessionDevelopingDataScienceProfession
DevelopingDataScienceProfession
 
Deep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do ItDeep Learning Class #0 - You Can Do It
Deep Learning Class #0 - You Can Do It
 
DL Classe 0 - You can do it
DL Classe 0 - You can do itDL Classe 0 - You can do it
DL Classe 0 - You can do it
 
Introduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .pptIntroduction to Data Mining and technologies .ppt
Introduction to Data Mining and technologies .ppt
 
UNIT1-2.pptx
UNIT1-2.pptxUNIT1-2.pptx
UNIT1-2.pptx
 
Neural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for PhysicistsNeural Networks and Deep Learning for Physicists
Neural Networks and Deep Learning for Physicists
 
Cs501 dm intro
Cs501 dm introCs501 dm intro
Cs501 dm intro
 
DataScience_introduction.pdf
DataScience_introduction.pdfDataScience_introduction.pdf
DataScience_introduction.pdf
 
Data Science.pptx
Data Science.pptxData Science.pptx
Data Science.pptx
 
Data Science versus Artificial Intelligence: a useful distinction
Data Science versus Artificial Intelligence: a useful distinctionData Science versus Artificial Intelligence: a useful distinction
Data Science versus Artificial Intelligence: a useful distinction
 
Big data may 2012
Big data may 2012Big data may 2012
Big data may 2012
 
Better the devil you know
Better the devil you knowBetter the devil you know
Better the devil you know
 
Hiding Sensitive Association Rules
Hiding Sensitive Association Rules Hiding Sensitive Association Rules
Hiding Sensitive Association Rules
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
 
Data science in action
Data science in actionData science in action
Data science in action
 
Artificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of IntelligenceArtificial intelligence: Simulation of Intelligence
Artificial intelligence: Simulation of Intelligence
 
Opportunities in Data Science.ppt
Opportunities in Data Science.pptOpportunities in Data Science.ppt
Opportunities in Data Science.ppt
 
TCS Point of View Session - Analyze by Dr. Gautam Shroff, VP and Chief Scient...
TCS Point of View Session - Analyze by Dr. Gautam Shroff, VP and Chief Scient...TCS Point of View Session - Analyze by Dr. Gautam Shroff, VP and Chief Scient...
TCS Point of View Session - Analyze by Dr. Gautam Shroff, VP and Chief Scient...
 
D.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital PreservationD.3.1: State of the Art - Linked Data and Digital Preservation
D.3.1: State of the Art - Linked Data and Digital Preservation
 

More from Media Perspectives

More from Media Perspectives (20)

Presentatie Paul Rutten - Monitor Creatieve Industrie 2021
Presentatie Paul Rutten - Monitor Creatieve Industrie 2021Presentatie Paul Rutten - Monitor Creatieve Industrie 2021
Presentatie Paul Rutten - Monitor Creatieve Industrie 2021
 
Jeroen Broekema (Springcast) - Podcast hosting en analytics
Jeroen Broekema (Springcast) - Podcast hosting en analyticsJeroen Broekema (Springcast) - Podcast hosting en analytics
Jeroen Broekema (Springcast) - Podcast hosting en analytics
 
Liedewij Hentenaar (Audify) over de groei van audio
Liedewij Hentenaar (Audify) over de groei van audioLiedewij Hentenaar (Audify) over de groei van audio
Liedewij Hentenaar (Audify) over de groei van audio
 
Egon Verhagen (NPO) - Audio innovatie bij de publieke omroep
Egon Verhagen (NPO) - Audio innovatie bij de publieke omroepEgon Verhagen (NPO) - Audio innovatie bij de publieke omroep
Egon Verhagen (NPO) - Audio innovatie bij de publieke omroep
 
Willem Brom (EndemolShine) over non-scripted voor streamers
Willem Brom (EndemolShine) over non-scripted voor streamersWillem Brom (EndemolShine) over non-scripted voor streamers
Willem Brom (EndemolShine) over non-scripted voor streamers
 
Jordi van de Bovenkamp (MediaMonks) met vijf tips voor fit-for-format-content
Jordi van de Bovenkamp (MediaMonks) met vijf tips voor fit-for-format-contentJordi van de Bovenkamp (MediaMonks) met vijf tips voor fit-for-format-content
Jordi van de Bovenkamp (MediaMonks) met vijf tips voor fit-for-format-content
 
Laura Veenema (NewBe) over 'superserve the niche'
Laura Veenema (NewBe) over 'superserve the niche'Laura Veenema (NewBe) over 'superserve the niche'
Laura Veenema (NewBe) over 'superserve the niche'
 
Gerard de Kloet (NOS) over @NOS op Instagram
Gerard de Kloet (NOS) over @NOS op Instagram Gerard de Kloet (NOS) over @NOS op Instagram
Gerard de Kloet (NOS) over @NOS op Instagram
 
Paulo Lopes Escudeiro over nieuwe TikTok-gewoontes @ Cross Media Café - Nieuw...
Paulo Lopes Escudeiro over nieuwe TikTok-gewoontes @ Cross Media Café - Nieuw...Paulo Lopes Escudeiro over nieuwe TikTok-gewoontes @ Cross Media Café - Nieuw...
Paulo Lopes Escudeiro over nieuwe TikTok-gewoontes @ Cross Media Café - Nieuw...
 
Slides MediaTalk NOS-project '75 jaar bevrijding'
Slides MediaTalk NOS-project '75 jaar bevrijding'Slides MediaTalk NOS-project '75 jaar bevrijding'
Slides MediaTalk NOS-project '75 jaar bevrijding'
 
Paul Bojarski (Sceenic) over Watch Together @ CMC - Innovatie in coronatijden
Paul Bojarski (Sceenic) over Watch Together @ CMC - Innovatie in coronatijdenPaul Bojarski (Sceenic) over Watch Together @ CMC - Innovatie in coronatijden
Paul Bojarski (Sceenic) over Watch Together @ CMC - Innovatie in coronatijden
 
Tomas van den Spiegel (Flanders Classics) en Jorre Belpaire (Kiswe Mobile) ov...
Tomas van den Spiegel (Flanders Classics) en Jorre Belpaire (Kiswe Mobile) ov...Tomas van den Spiegel (Flanders Classics) en Jorre Belpaire (Kiswe Mobile) ov...
Tomas van den Spiegel (Flanders Classics) en Jorre Belpaire (Kiswe Mobile) ov...
 
Geraldine Macqueron (GAME OVER) over het initiatief Creators United @ CMC - I...
Geraldine Macqueron (GAME OVER) over het initiatief Creators United @ CMC - I...Geraldine Macqueron (GAME OVER) over het initiatief Creators United @ CMC - I...
Geraldine Macqueron (GAME OVER) over het initiatief Creators United @ CMC - I...
 
Arno Scharl (webLyzard technology) over online corona sentimenten weergeeft @...
Arno Scharl (webLyzard technology) over online corona sentimenten weergeeft @...Arno Scharl (webLyzard technology) over online corona sentimenten weergeeft @...
Arno Scharl (webLyzard technology) over online corona sentimenten weergeeft @...
 
William Linders (ODMedia) over de opkomst van SVOD en AVOD
William Linders (ODMedia) over de opkomst van SVOD en AVODWilliam Linders (ODMedia) over de opkomst van SVOD en AVOD
William Linders (ODMedia) over de opkomst van SVOD en AVOD
 
Suzan Hoogland (GfK) over hoe de Nederlander 'Video' consumeert
Suzan Hoogland (GfK) over hoe de Nederlander 'Video' consumeertSuzan Hoogland (GfK) over hoe de Nederlander 'Video' consumeert
Suzan Hoogland (GfK) over hoe de Nederlander 'Video' consumeert
 
Maarten Lens-FitzGerald (voice ondernemers) @ CMC Nieuwe Interfaces
Maarten Lens-FitzGerald (voice ondernemers) @ CMC Nieuwe Interfaces Maarten Lens-FitzGerald (voice ondernemers) @ CMC Nieuwe Interfaces
Maarten Lens-FitzGerald (voice ondernemers) @ CMC Nieuwe Interfaces
 
Jeroen de Bakker (Talpa Network) @ CMC Nieuwe Interfaces
Jeroen de Bakker (Talpa Network) @ CMC Nieuwe InterfacesJeroen de Bakker (Talpa Network) @ CMC Nieuwe Interfaces
Jeroen de Bakker (Talpa Network) @ CMC Nieuwe Interfaces
 
Vera Holland (KRO-NCRV) @ CMC Nieuwe Interfaces
Vera Holland (KRO-NCRV) @ CMC Nieuwe InterfacesVera Holland (KRO-NCRV) @ CMC Nieuwe Interfaces
Vera Holland (KRO-NCRV) @ CMC Nieuwe Interfaces
 
Joey Scheufler (Prappers Media) @ CMC Nieuwe Interfaces
Joey Scheufler (Prappers Media) @ CMC Nieuwe InterfacesJoey Scheufler (Prappers Media) @ CMC Nieuwe Interfaces
Joey Scheufler (Prappers Media) @ CMC Nieuwe Interfaces
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 

Master Minds on Data Science - Arno Siebes

  • 7. Because Information is Immaterial, it has no size!
    Hence Moore's Law:
    CPU: a 2.5-million-fold increase
      Intel 4004 (1973): 2,300 transistors
      Intel Xeon E5-2699 v3 (2014): 5.5 × 10^9 transistors
      (its L3 cache is twice the size of my first hard disk; 45 vs 20)
    Hard disk: a million-fold increase
      IBM (1956): 5 MB (for $50,000)
      Toshiba (2015): 5 TB (for $200)
      note: bytes per dollar increased by a factor of 2.5 × 10^8
    Some quite smart technological breakthroughs may also have been not
    unimportant for the miniaturization; but I'm a computer scientist
  • 8. Shrinking Means Growth
    Moore's Law has made:
    computing ubiquitous
      from PC to laptop, to tablet, to smartphone, to smartwatch, ...
      from dedicated hardware to cars, to fridges, to thermostats, ...
    and data big
      from transactional DBs to DWs, to digital libraries, to clouds, to
      social networks
      from data entry to web interfaces, to quantified self, to quantified
      employees (HR analytics), to sensors everywhere
    Or phrased more prosaically: data is acquired, stored, exchanged, and
    processed, on everything, everywhere, at any time, all the time.
    If it isn't digital
      it doesn't exist
      it didn't happen
      it doesn't matter
  • 9. Hence, Big Data
    This ubiquity is what causes Big Data, which is "defined" by
    Volume: we are truly talking about massive amounts of data
    Velocity: the data comes from all sides at faster and faster rates –
      often too fast to look at more than once
    Variety: the data comes in many sorts, shapes, and sizes
      data used to mean a (rectangular) table filled with numbers
      but that is no longer true
      we have libraries of texts, databases with molecules, music and video
      collections, and so forth, and so forth.
  • 10. Big is Really Big
    Some (not so) random statistics:
      in 2012 the world produced 1.8 Zettabyte (i.e., 1.8 × 10^21 bytes
      = 1.8 × 10^9 TB)
      the NSA is estimated to hoard 3–12 Exabyte (10^18 bytes; Forbes, 2013)
      in 2014 the world wide web was estimated at 4 Zettabyte
    In 1 second, there are (Internet Live Stats, March 30, 2015):
      1,841 Tumblr posts
      1,918 Instagram photos uploaded
      8,885 Tweets sent
      48,187 Google searches
      98,404 YouTube videos viewed
      2,383,324 Emails sent
  • 11. But, Why?
    It is interesting that we generate so much data per second, but why is
    it all stored? Clearly, storage space is cheap, but still, that doesn't
    mean that every bit is sacred, does it?
    The reason is (at least) twofold:
    You store everything about yourself
      Facebook, YouTube, Twitter, Whatsapp, Google+, LinkedIn, Instagram,
      Snapchat, Pinterest, foursquare, WeChat, ...
      don't ask me why you do that.
    Companies love these hoards of data: it allows them to make profiles
  • 12. Because Data is the New Oil (Clive Humby)
    Data is just like crude. It is valuable, but if unrefined it cannot
    really be used. It has to be changed into gas, plastic, chemicals, etc.
    to create a valuable entity that drives profitable activity; so must
    data be broken down and analyzed for it to have value.
    Companies, governments, healthcare institutions, and who not(?) make
    profiles
      to predict what you want to buy next
      to offer you better service
      to predict what medication is best for you
    for better and for worse
  • 14. Patterns are Groups
    A pattern is a set of characteristics shared by a group of
      customers
      patients
      transactions
      i.e., anything you have data on
    a group that is for some reason deemed interesting
    For example, a group of
      patients that have the same disease
      customers that spent at least 1000 euros in your shop last year
      insurants who claim way above average (or way below)
  • 15. Patterns are Descriptions
    We only look for groups that are easily described, e.g.,
      Sex = M and Age ≤ 25, for car insurance
      BRCA1, exon 2 deletion, exon 13 deletion, 2804delAA – describes
      (some) Dutch women with a higher risk of early-onset breast cancer
    The reason to look for such groups is what social scientists call
    homophily
      birds of a feather flock together
      i.e., similar "things" act similarly
    We'll restrict ourselves mostly to very simple patterns
      item sets such as {Diapers, Beer}, describing all transactions in
      which the customer bought both Diapers and Beer.
  • 17. When is a Pattern Interesting?
    A data miner should define what makes a pattern interesting. There are
    many interestingness measures, e.g.
      heightened risk: a group with above-average risk
      differential: this group of patients shares a set of characteristics
      not shared by non-patients
      frequency: these items have been bought together more than θ times
    These predefined measures come with an algorithm that allows us to find
    all interesting patterns in the database relatively efficiently
  • 18. Profiles are Interesting Patterns
    Interesting patterns are used as profiles
      profiles to recognize good customers
      profiles to recognize patients with a certain disease
      profiles of groups of articles one should (or should not) discount
      together
    That is, we get profiles if we have interesting patterns we can act on.
    Mind you, patterns can be interesting – and extremely useful – even if
    one cannot act upon them. The most important reason to mine for
    interesting patterns is that they provide insight into the data – and
    thus into the real world.
  • 19. Pattern Mining
    Or, theory mining, as defined by Mannila and Toivonen in 1997:
    Given a database db, a language L to define subgroups of the data, and
    a selection predicate q that determines whether an element φ ∈ L
    describes an interesting subgroup of db or not, the task is to find:
      T(L, db, q) = {φ ∈ L | q(db, φ) is true}
    That is, the task is to find all interesting subgroups.
  • 20. Transaction Databases
    The first example:
      each entry in the database records the contents of a shopping basket
      at the check-out
      just presence of items, no counts
    More formally:
      given a set of items I, a transaction t over I is a subset of I,
      i.e., t ⊆ I
      a transaction database db over I is a bag of transactions over I
    Note: each categorical database can be recoded as a transaction
    database.
  • 21. Frequent Item Set Mining
    An item set I is a set of items, I ⊆ I; the pattern language is
    L = P(I)
    An item set I occurs in a transaction t iff I ⊆ t
    The support of item set I is:
      supp_db(I) = |{t ∈ db | I ⊆ t}|
    The "interestingness" predicate is a threshold on the support of the
    item sets, the minimal support min-sup.
    Frequent item set mining task: find {I ∈ L | supp_db(I) ≥ min-sup}
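As a concrete illustration of these definitions (a sketch, not part of the slides; the toy database and item names are made up):

```python
# Support of an item set in a transaction database, following the
# definition above: supp_db(I) = |{t in db | I is a subset of t}|.

def support(db, itemset):
    """Number of transactions in db that contain every item of itemset."""
    return sum(1 for t in db if itemset <= t)

# A made-up toy transaction database (a bag of transactions).
db = [
    {"diapers", "beer", "bread"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "milk"},
]

# With min-sup = 2, an item set is frequent iff support(db, I) >= 2.
print(support(db, {"diapers", "beer"}))  # 2
```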
  • 22. A Priori
    Clearly, checking all item sets for frequency isn't going to work
      with n items you have to check 2^n − 1 sets
      if n = 100, this means 1.3 × 10^30 sets to check
      there have only been 14 × 10^9 years, which had 31.5 × 10^6 seconds
      each, i.e., there have been 4.4 × 10^17 seconds
      which is way too short, even if you could compute the frequency of
      1 item set per clock tick, i.e., 5 × 10^9/s
    Fortunately, the A Priori property holds:
      I ⊆ J ⇒ supp_db(J) ≤ supp_db(I)
    Hence, a simple level-wise search algorithm will find all frequent
    item sets.
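A minimal level-wise miner along these lines (a sketch of the idea, not the implementation used in practice; the toy database is made up):

```python
from itertools import combinations

def support(db, itemset):
    return sum(1 for t in db if itemset <= t)

def apriori(db, min_sup):
    """Level-wise search exploiting the A Priori property:
    I subset of J implies supp(J) <= supp(I)."""
    items = set().union(*db)
    # Level 1: frequent single items.
    freq = {frozenset([i]) for i in items
            if support(db, frozenset([i])) >= min_sup}
    result = set(freq)
    k = 2
    while freq:
        # Candidates of size k: unions of frequent (k-1)-sets ...
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # ... pruned (all (k-1)-subsets must be frequent) before counting.
        freq = {c for c in candidates
                if all(frozenset(s) in result for s in combinations(c, k - 1))
                and support(db, c) >= min_sup}
        result |= freq
        k += 1
    return result

db = [
    {"diapers", "beer", "bread"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "milk"},
]
```

With min_sup = 2 this returns the four frequent single items plus {diapers, beer}, instead of checking all 2^4 − 1 candidate sets blindly.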
  • 23. Why Item Set Mining?
    Frequent item sets are interesting in their own right (in fact, the
    most interesting type of pattern IM(n)HO), but that is not why they
    were invented. Item sets are the basis for association rules:
    Let X, Y ⊆ I with X ∩ Y = ∅; X → Y is an association rule iff
      P(X ∪ Y) = supp_db(X ∪ Y) / |db| ≥ t1
      P(Y | X) = supp_db(X ∪ Y) / supp_db(X) ≥ t2
    Standard example: Diapers → Beer
    Computing all frequent item sets is the hard part of computing all
    association rules
    Interesting for shops, but also a very simple (not very good) way to
    recommend
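Given a frequent item set, the rules it generates can be enumerated directly. A sketch under the definitions above (the database and the confidence threshold are made up):

```python
from itertools import combinations

def support(db, itemset):
    return sum(1 for t in db if itemset <= t)

def rules_from(db, itemset, min_conf):
    """All rules X -> Y with X ∪ Y = itemset, X ∩ Y = ∅, and
    confidence supp(X ∪ Y) / supp(X) >= min_conf."""
    s = support(db, itemset)
    rules = []
    for r in range(1, len(itemset)):
        for xs in combinations(sorted(itemset), r):
            x = frozenset(xs)
            conf = s / support(db, x)
            if conf >= min_conf:
                rules.append((x, frozenset(itemset) - x, conf))
    return rules

db = [
    {"diapers", "beer", "bread"},
    {"diapers", "beer"},
    {"bread", "milk"},
    {"diapers", "milk"},
]

# The standard example: {diapers} -> {beer} with confidence 2/3,
# and {beer} -> {diapers} with confidence 1.
for x, y, conf in rules_from(db, {"diapers", "beer"}, 0.5):
    print(set(x), "->", set(y), round(conf, 2))
```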
  • 24. Why Frequent Item Sets?
    People are not as unique as they think
      they have many things in common with (many) others
    In fact, that is why we can do things like recommending
      "birds of a feather..." all over again
    We briefly look at one example: Facebook Likes.
    But people also satisfy very infrequent patterns, and that is a threat
    to privacy
    We briefly look at two examples: Netflix and credit card purchases.
  • 25. Example: Mining Facebook Likes
    [Figure: the prediction pipeline of Kosinski et al.]
    User–Like matrix: 58,466 users × 55,814 Likes (10M user–Like pairs);
    an entry is 1 if the user Liked the item and 0 otherwise (n = 170
    Likes per person on average). The sample: volunteers from the United
    States, obtained through the myPersonality Facebook application.
    The dimensionality of the user–Like matrix is reduced using
    singular-value decomposition (SVD); the k = 100 top components are
    kept. Numeric variables are predicted with linear regression, e.g.
      age = α + β1·C1 + ... + β100·C100
    dichotomous variables (such as gender or sexual orientation) with
    logistic regression, using 10-fold cross-validation.
    Predicted variables include: Facebook profile (social network size and
    density), profile picture (ethnicity), and survey/test results (Big
    Five personality, substance use, whether the parents stayed together).
    M. Kosinski, D. Stillwell, T. Graepel: Private traits and attributes
    are predictable from digital records of human behavior, PNAS,
    March 11, 2013.
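The pipeline (binary user–Like matrix → SVD components → regression) can be sketched with plain numpy. Everything here is a made-up miniature: the matrix is tiny and random, the age target is random, and the study itself used k = 100 components, logistic regression for dichotomous variables, and 10-fold cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the user-Like matrix: entry (u, l) is 1 iff user u
# Liked item l. The real matrix is 58,466 users x 55,814 Likes, sparse.
likes = (rng.random((50, 12)) < 0.25).astype(float)

# Reduce each user to k SVD components.
k = 4
U, s, Vt = np.linalg.svd(likes, full_matrices=False)
components = U[:, :k] * s[:k]            # one k-dimensional vector per user

# Predict a numeric variable, age = alpha + beta_1 C_1 + ... + beta_k C_k,
# by least squares. The ages are random here, so this only shows mechanics.
age = rng.uniform(18, 65, size=50)
X = np.column_stack([np.ones(50), components])
beta, *_ = np.linalg.lstsq(X, age, rcond=None)
predicted = X @ beta
```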
  • 26. Example: Mining Facebook Likes
    [Excerpt from Kosinski et al. (PNAS, 2013), shown on the slide:]
    The attributes were taken from Facebook profiles and online surveys;
    ethnic origin was assigned by visual inspection of profile pictures,
    and sexual orientation from the profile's "Interested in" field.
    Prediction accuracy of dichotomous variables is expressed as the area
    under the receiver-operating characteristic curve (AUC): the
    probability of correctly classifying two randomly selected users, one
    from each class (e.g. male and female); random guessing gives
    AUC = 0.5.
    The highest accuracy was achieved for ethnic origin and gender:
    African Americans and Caucasian Americans were correctly classified in
    95% of cases, and males and females in 93%, suggesting that patterns
    of online behavior as expressed by Likes differ significantly between
    those groups. Christians and Muslims were correctly classified in 82%
    of cases, Democrats and Republicans in 85%. Sexual orientation was
    easier to distinguish among males (88%) than among females (75%),
    which may suggest a wider behavioral divide between hetero- and
    homosexual males. Good accuracy was achieved for relationship status
    and substance use (65%–73%); the relatively lower accuracy for
    relationship status may be explained by its temporal variability.
    The model's accuracy was lowest (60%) when inferring whether users'
    parents stayed together until the user was 21 years old. Although
    parental divorce is known to have long-term effects on young adults'
    well-being, it is remarkable that this is detectable through Facebook
    Likes: individuals with parents who separated have a higher
    probability of liking statements preoccupied with relationships, such
    as "If I'm with you then I'm with you I don't want anybody else".
  • 27. Example: Mining Facebook Likes
    Best predictors of high intelligence include:
      "Thunderstorms"
      "Science"
      "Curly Fries"
    Best predictors of low intelligence include:
      "I love being a mom"
      "Harley Davidson"
      "Lady Antebellum"
  • 28. Netflix
    Netflix offered big cash prizes for recommender systems that beat
    their own recommender
      recommending movies to users based on what they have already seen
      and how they rated them
    They released a data set containing records of the form
      [user, movie, date of grade, grade]
    in which both the user and the movie were replaced by integers
      the same user was replaced by the same unique integer, of course;
      the same was true for the movies.
    Was this fail-safe anonymization? No: by cross-referencing with IMDb
    ratings, users could be identified
      with 8 movie ratings (of which 2 may be completely wrong) and dates
      that may have a 14-day error, 99% of the records can be uniquely
      identified in the dataset
      for 68%, two ratings and dates (with a 3-day error) are sufficient
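The cross-referencing step can be sketched as follows. The tolerances mirror the numbers above (14-day date error, up to 2 wrong ratings), while the data, names, and records are invented for illustration:

```python
from datetime import date

def fits(record, observations, day_slack=14, wrong_allowed=2):
    """Could this anonymized rating record belong to the person for whom
    we observed `observations` = [(movie, grade, approximate date), ...]?"""
    misses = 0
    for movie, grade, when in observations:
        ok = any(m == movie and g == grade
                 and abs((d - when).days) <= day_slack
                 for m, g, d in record)
        if not ok:
            misses += 1
    return misses <= wrong_allowed

def candidates(dataset, observations, **kw):
    """All anonymized users whose record fits the observations."""
    return [uid for uid, rec in dataset.items()
            if fits(rec, observations, **kw)]

# Two made-up anonymized users; movies are anonymized as letters here.
dataset = {
    101: [("A", 5, date(2005, 3, 1)), ("B", 3, date(2005, 6, 10)),
          ("C", 1, date(2005, 7, 1))],
    102: [("A", 2, date(2005, 3, 1)), ("D", 4, date(2005, 8, 1))],
}
# Two ratings with approximate dates already single out user 101.
observed = [("B", 3, date(2005, 6, 20)), ("C", 1, date(2005, 7, 5))]
print(candidates(dataset, observed, wrong_allowed=0))  # [101]
```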
  • 29. Credit Card Data (Science, Vol 347, Issue 6221, 2015) Assume you have anonymized credit card data: all personal data has been removed, the credit card number is replaced by an arbitrary number, and time and amount of purchase have been replaced by buckets The problem is: how much do I need to know about your shopping habits to know them all? The answer is 4 transactions: if I know 4 of your purchases – you get coffee at Starbucks Central Station around 9 – I can identify your credit card trail External data makes privacy by anonymization hard!
  • 30. How To Discover This It is easy to discover that you only need 4 on average: you simply do frequent item set mining with a threshold of 1, pick out all smallest item sets with a frequency of 1, and check the statistics With these item sets, you can count the minimum number of specific things you need to know to identify a specific person, and count how many arbitrary things you need to know to identify an arbitrary person You can do the same thing with mobile phone trails, and again 4 (approximate) known locations suffice.
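To make the idea concrete, here is a minimal sketch (with a tiny hypothetical purchase database, not the data from the paper) of how you would compute, per person, the smallest number of transactions that uniquely identifies them:

```python
from itertools import combinations

# Hypothetical "anonymized" data: one set of purchases per person.
people = {
    "p1": {"coffee@9", "bookshop", "bakery", "cinema"},
    "p2": {"coffee@9", "bookshop", "gym", "cinema"},
    "p3": {"coffee@9", "bakery", "gym", "florist"},
}

def min_identifying_size(person, people):
    """Smallest number of this person's items that no other person shares."""
    items = sorted(people[person])
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = set(combo)
            # unique if no other person's trail contains all these items
            if not any(s <= t for p, t in people.items() if p != person):
                return k
    return len(items)

for p in people:
    print(p, min_identifying_size(p, people))
```

Averaging this number over all people gives exactly the "how many purchases do I need to know" statistic from the paper; on real credit card data the answer turned out to be about 4.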
  • 31. The Paradox of Patterns These examples illustrate what I like to call the paradox of patterns: almost all people satisfy both (short) frequent and infrequent patterns This is both a Curse: privacy may be breached, with dire consequences and a Blessing: it allows you to turn data into actionable knowledge With great power (data) comes great responsibility If you plan to build and exploit big data you should heed this.
  • 33. Recommender Systems Research Recommendations are (perceived as) one of the key ways to turn data into money, e.g., because of selling more to your customers keeping your customers happy and, thus, keeping your customers And there are a plethora of possible applications, e.g., what results your search engine should show you what books/movies/music/series/... you would probably like Hence megabucks are invested in R & D and, hence, there are far too many approaches to survey the field We’ll keep it short and simple but do point out some of the pitfalls
  • 34. Collaborative Filtering From Wikipedia: a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating) Birds of a feather ... flies again. Two simple approaches are: user centric 1. look for users who share the same rating patterns with the active user 2. use the ratings from those like-minded users to calculate a prediction for the active user item centric 1. build an item-item matrix determining relationships between pairs of items 2. infer the tastes of the current user by examining the matrix and matching that user’s data
  • 35. User Centric When are users similar? compute the cosine between two users: cos(u1, u2) = (Σᵢ u1,ᵢ · u2,ᵢ) / (|u1| |u2|) the closer to 1, the more similar the users are How do you predict? pick the closest k users and recommend the item that the highest number of these k users have and the customer does not yet have
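The user-centric recipe above can be sketched in a few lines; the rating vectors and the `recommend` helper below are hypothetical illustrations, not a production recommender:

```python
import math

# Hypothetical user-item ratings (missing item = not rated).
ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 4, "B": 3, "D": 5},
    "carol": {"C": 2, "D": 4, "E": 5},
}

def cosine(u, v):
    """cos(u, v) = sum_i u_i * v_i / (|u| |v|), over the shared items."""
    dot = sum(u[i] * v[i] for i in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(user, ratings, k=2):
    """Recommend the item most often held by the k most similar users."""
    others = sorted((u for u in ratings if u != user),
                    key=lambda u: cosine(ratings[user], ratings[u]),
                    reverse=True)
    counts = {}
    for u in others[:k]:
        for item in ratings[u]:
            if item not in ratings[user]:  # only items the user lacks
                counts[item] = counts.get(item, 0) + 1
    return max(counts, key=counts.get) if counts else None

print(recommend("alice", ratings))
```

Note that with real data the rating vectors are extremely sparse, which is exactly the problem the next slides discuss.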
  • 36. Item Centric In the item-item matrix, you, e.g., put the support of the pair (Ii , Ij ) The customer is simply an item set the set of all items he bought This item set is a subset of at least one of the rows in this matrix there is at least one row in which the support for all the items that the customer bought is ≥ 1 (because the customer bought all these items) For all the rows of which the customer is a subset determine which items that the customer hasn’t bought yet have a positive support From these compute the probability that the customer may like them and recommend an item with the highest probability
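A minimal sketch of the item-centric variant, using pair supports counted from a tiny hypothetical transaction database:

```python
from itertools import combinations
from collections import Counter

# Hypothetical transaction database: items bought together.
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "jam", "tea"},
    {"butter", "jam"},
]

# Item-item matrix: support of each item pair (i, j).
pair_support = Counter()
for t in transactions:
    for i, j in combinations(sorted(t), 2):
        pair_support[(i, j)] += 1

def recommend(basket, transactions, pair_support):
    """Score each unseen item by its co-occurrence with the basket."""
    all_items = set().union(*transactions)
    scores = {
        cand: sum(pair_support[tuple(sorted((cand, b)))] for b in basket)
        for cand in all_items - basket
    }
    return max(scores, key=scores.get) if scores else None

print(recommend({"bread", "butter"}, transactions, pair_support))
```

Turning the co-occurrence scores into proper probabilities (e.g., dividing by the basket's own support) is a straightforward refinement of the same idea.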
  • 37. Easy, no? No, unfortunately not Amazon has over 10 million books for sale 2014: a new book every 5 minutes... The vectors (users) or matrices are going to be very sparse The closest users may be rather dissimilar Or, what about time? is a book you bought years ago as relevant as the one you pick today? sometimes: yes (a series like Harry Potter) sometimes: no (an abandoned hobby). One account is seen as one user, but is it? Netflix allows 4 concurrent streams Couples will share an account Some books are presents (perhaps even one-off), some are for own consumption
  • 38. It’s All About What You Know Big Data is not just a lot of data but often also a lot of different types of data The more you know about your customers beyond what they purchased from you, the better recommendations can become If you know a customer’s social network you can use purchases from close friends to determine good recommendations remember: birds of a feather ... There are some recommendations I like usually when I buy something very specific But many more that I hate would you like to book the hotel you just booked? would you like a hotel in X where (part of) your journey ends? (no thanks, I have 1 hour to transfer – as you should know!)
  • 39. And How Well You Know Your Products Products often come in categories and it makes sense to use this information It doesn’t make much sense to recommend “chick-lit” to someone who up to now only bought SciFi But, then again, some authors straddle genres and then it may be the author and not the genre that enthuses the customer It is not just about data scientists but just as much about domain experts It is about teamwork, not about unicorns It is about data as much as about algorithms
  • 40. A Day in the Life
  • 41. Models When one encounters a new type of data or a new type of problem the first thing to decide is how do we model the data Classical models you learned about in kindergarten often simply don’t fit linear regression on a collection of text documents? Sometimes you can generalize sometimes you have to invent something new Sometimes you model all data collectively sometimes you use patterns I always search for the right pattern language because the world is never homogeneous
  • 42. Which Model to Choose If you know how you want to model (describe) your data you have to decide what makes a model good Kindergarten statistics doesn’t scale in big data everything is significant Fortunately, there are many ways to solve this I will spare you the details (for now ...) For pattern miners it is easy they should at least be frequent and probably satisfy other constraints
  • 43. Algorithms When you can specify which models you prefer there is only one thing left to do You have to devise an algorithm that finds these models Sometimes you are lucky you can compute the optimal model Sometimes you are not it is infeasible or simply impossible to compute the optimal model there are things a computer cannot compute Then you have to use heuristics to find reasonably good models In pattern mining life is good we can usually find all good patterns
  • 44. Done? Unfortunately often: no! Especially for pattern miners, e.g., data usually has many, very many patterns often more patterns than one has data! You cannot use all of them that would lead to very bad models (overfitting) And you cannot look at them to pick out superior ones by eye would you inspect 10^12 patterns? The Model Selection problem returns with a vengeance... we have to choose a small set of patterns automatically.
  • 45. The Problem of Induction In big data we have lots and lots of data but it is still a finite sample And we want to say something about new data (the future, if you want) We want to induce a general “law” from our finite sample and there are infinitely many functions that go through the same finite set of data points and there is no reasonable choice between these infinitely many possibilities this is what David Hume called: the problem of induction The well-known example (before the discovery of Australia) how many swans should you have seen before you can conclude that all swans are white?
  • 46. Faith There is no reasonable choice between models without further assumptions hence we have to make such assumptions: one has to believe something about the world In Kindergarten Statistics you believe that you know what the model looks like and usually you also believe the data to have some desirable statistical properties and you compute the model that is optimal given your beliefs These are very strong beliefs; I make a different assumption I only believe that the “law” we are looking for is a computable function This is a very weak assumption after all, what good is a law you cannot use?
  • 47. MDL More concretely we use the Minimum Description Length Principle to find a small set of characteristic patterns: Given a set of models H, the best model H ∈ H is the one that minimizes L(H) + L(D|H) in which L(H) is the length, in bits, of the description of H, and L(D|H) is the length, in bits, of the description of the data when encoded with H. note: this is lossless compression Our idea: find a small set of patterns that collectively describe the data well (i.e., compress it well)
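As a toy illustration of the MDL idea (a much simplified, hypothetical version of a Krimp-style encoding, not the actual algorithm), we can compare the total encoded size L(H) + L(D|H) of a small database with and without one extra pattern in the code table:

```python
import math

# Toy database of transactions (sets of items).
db = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]

def total_length(db, patterns):
    """Two-part MDL score L(H) + L(D|H) in bits (simplified).

    Each transaction is covered greedily with the given patterns plus
    singletons; code lengths follow from usage frequencies (Shannon)."""
    singletons = {frozenset({i}) for t in db for i in t}
    code_table = [frozenset(p) for p in patterns] + sorted(singletons, key=sorted)
    code_table.sort(key=len, reverse=True)  # prefer longer patterns

    usage = {c: 0 for c in code_table}
    for t in db:
        left = set(t)
        for c in code_table:
            if c <= left:
                usage[c] += 1
                left -= c
    used = {c: u for c, u in usage.items() if u > 0}
    total = sum(used.values())
    # L(D|H): Shannon code lengths derived from usage frequencies.
    l_data = sum(u * -math.log2(u / total) for u in used.values())
    # L(H): crude model cost, one bit per item per used code-table entry.
    l_model = sum(len(c) for c in used)
    return l_model + l_data

print(total_length(db, []))            # singletons only
print(total_length(db, [{"a", "b"}]))  # code table with pattern {a, b}
```

On this toy database the code table containing the pattern {a, b} yields a shorter total description, so MDL prefers it; Krimp searches for such pattern sets heuristically at scale.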
  • 48. Krimp In view of time and perhaps your appetite for some hefty math by now I will spare you the details It suffices to say that the resulting (heuristic!) algorithm, Krimp, really reduces the number of patterns one has to consider. And the resulting set of patterns is very characteristic.
  • 51. Characteristic by Classification Database (n classes) → Split per class → Apply KRIMP → Code table per class → Encode unseen transactions → Shortest code wins! This yields pretty good classifiers, while we only try to describe the data.
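A minimal sketch of the "shortest code wins" rule; the per-class code tables and their code lengths below are made up for illustration (in Krimp they result from compressing each class's data):

```python
import math

# Hypothetical per-class code tables: pattern -> code length in bits.
code_tables = {
    "spam": {frozenset({"buy", "now"}): 1.0, frozenset({"buy"}): 2.0,
             frozenset({"now"}): 2.0, frozenset({"hello"}): 4.0},
    "ham":  {frozenset({"hello"}): 1.0, frozenset({"buy"}): 4.0,
             frozenset({"now"}): 4.0, frozenset({"buy", "now"}): 5.0},
}

def encoded_length(transaction, table):
    """Greedily cover the transaction with the longest patterns first
    and sum their code lengths."""
    left, bits = set(transaction), 0.0
    for pat in sorted(table, key=len, reverse=True):
        if pat <= left:
            bits += table[pat]
            left -= pat
    return bits if not left else math.inf  # uncoverable -> infinite cost

def classify(transaction, code_tables):
    """Shortest code wins: pick the class whose table compresses best."""
    return min(code_tables,
               key=lambda c: encoded_length(transaction, code_tables[c]))

print(classify({"buy", "now"}, code_tables))
```

The remarkable point of the slide is that such classifiers work well even though the code tables were built only to describe, not to discriminate.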
  • 52. Classification Example (figure: two unseen transactions encoded with the per-class code tables CT1 and CT2; each transaction is assigned to the class whose code table gives it the shortest encoding)
  • 53. Encore, encore, encore ... There is much more we can do with the results of Krimp clustering change detection data imputation .... In fact, we can do recommendation tag recommendation for Flickr data sorry, we haven’t looked at other recommendation problems But again, in view of time we will not cover this today except one further remark on privacy
  • 54. Privacy Revisited We can use the results from Krimp to generate data sets each tuple (entry) in this data set is completely random no relation with any person in the real world but almost all statistical properties of the original data are more or less preserved in the generated data if I give you generated data instead of the original data, you’ll find the same model Data Mining with Guaranteed Privacy Unfortunately this can only be verified experimentally in a limited setting I have been working on a general, provably correct, version off and on for a few years now The outline is done, but there are still a few hairy details that need to be solved nothing a couple of hundred k cannot solve.
  • 56. Conclusions Big Data is here to stay and with Big Data many aspects of life can become much nicer Big Data can make your life better, provided you have the relevant data you have people that really understand your domain you have people that play with data for fun (and these groups are usually distinct) And it can even be done without violating privacy