The Magical Art of Extracting
Meaning From Data
Luis Rei
@lrei
luis.rei@gmail.com
http://luisrei.com
Data Mining For The Web
Outline
• Introduction
• Recommender Systems
• Classification
• Clustering
“The greatest problem of today is how to teach people to ignore the
irrelevant, how to refuse to know things, before they are suffocated. For
too many facts are as bad as none at all.”
(W.H.Auden)
“The key in business is to know something that nobody else knows.”
(Aristotle Onassis)
DATA
Luis Rei
25
<a href="http://luisrei.com/">
codebits
4
<a href="http://codebits.eu/">
MEANING
Luis Rei
25
NAME
PERSON
AGE
PHOTO
WEBSITE <a href="http://luisrei.com/">
Tools
• Python vs C or C++
• feedparser, Beautiful Soup (scrape web pages)
• NumPy, SciPy
• Weka
• R
• Libraries
http://mloss.org/software/
Down The Rabbit Hole
• In 2006, Google's search crawler used 850 TB of data. The total web history is around 3 PB
• Think of all the audio, photos & videos
• That’s a lot of data
• Open formats (HTML, RSS, PDF, ...)
• Everyone + their dog has an API
• facebook, twitter, flickr, last.fm,
delicious, digg, gowalla, ...
• Think about:
• news articles published every day
• status updates / day
Recommendations
The Netflix Prize
• In October 2006 Netflix launched an open competition for the best
collaborative filtering algorithm
• at least 10% improvement over Netflix’s own algorithm
• Predict user ratings for films based on previous ratings (by all users)
• US$1,000,000 prize won in Sep 2009
The Three Acts
I: The Pledge
The magician shows you something ordinary. But of course... it
probably isn't.
II: The Turn
The magician takes the ordinary something and makes it do
something extraordinary. Now you're looking for the secret...
III: The Prestige
But you wouldn't clap yet. Because making something disappear
isn't enough; you have to bring it back.
Collaborative Filtering
I. Collect Preferences
II. Find Similar Users
or Items
III. Recommend
I. Collecting Preferences
• yes/no votes
• Ratings in stars
• Purchase history
• Who you follow/who’s your
friend.
• The music you listen to or the
movies you watch
• Comments (“Bad”, “Great”, “Lousy”, ...)
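(Not on the slide: in Python, a natural way to store collected preferences is a nested mapping of user → item → rating. The sketches that follow assume this shape; all names and values here are illustrative.)

# preferences as {user: {item: rating}}
prefs = {
    'alice': {'Restaurant A': 4.5, 'Restaurant B': 2.0, 'Restaurant C': 5.0},
    'bob':   {'Restaurant A': 4.0, 'Restaurant B': 2.5},
    'carol': {'Restaurant B': 1.0, 'Restaurant C': 4.0},
}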
II. Similarity
• Euclidean Distance: d(a, b) = √(Σᵢ (aᵢ − bᵢ)²)
• Pearson Correlation
Olsen Twins - notice the similarity! Correlation > 0.0 (positive) but < 1.0 (not identical):
same eyes, nose, ...; different hair color, dress, earrings, ...
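A minimal sketch of both measures over the prefs mapping above (function names are mine, not from the talk):

from math import sqrt

def sim_euclidean(prefs, a, b):
    # similarity from Euclidean distance over items rated by both users
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0
    dist = sqrt(sum((prefs[a][it] - prefs[b][it]) ** 2 for it in shared))
    return 1.0 / (1.0 + dist)  # map distance to a 0..1 similarity

def sim_pearson(prefs, a, b):
    # Pearson correlation of the two users' ratings on shared items
    shared = [item for item in prefs[a] if item in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[a][it] for it in shared]
    ys = [prefs[b][it] for it in shared]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0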
III. Recommend
Users vs Items
• Find similar items instead of similar users!
• Same recommendation process:
• just switch users with items & vice versa (conceptually)
• Why?
• Works for new users
• Might be more accurate (might not)
• It can be useful to have both
Cross-Validation
• How good are the recommendations?
• Partitioning the data: Training set vs Test set
• Size of the sets? 95/5
• Variance
• Multiple rounds with different partitions
• How many rounds? 1? 2? 100?
• Measure of “goodness” (or rather, the error): Root Mean Square Error
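A hedged sketch of this evaluation loop; predict() stands in for whichever recommender is being tested (all names are placeholders):

import random
from math import sqrt

def rmse(pairs):
    # Root Mean Square Error over (predicted, actual) pairs
    return sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def cross_validate(ratings, predict, rounds=10, test_fraction=0.05):
    # ratings: list of (user, item, rating); predict(train, user, item) -> estimated rating
    scores = []
    for _ in range(rounds):
        shuffled = ratings[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * test_fraction)
        test, train = shuffled[:cut], shuffled[cut:]
        scores.append(rmse([(predict(train, u, i), r) for u, i, r in test]))
    return sum(scores) / len(scores)  # average error; per-round variance is worth checking too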
Case Study: Francesinhas.com
• Django project by 1 programmer
• Users give ratings to restaurants
• 0 to 5 stars (0-100 internally)
• Challenge: recommend users
restaurants they will probably like
User Similarity
normalize
Restaurant Similarity
Allows you to show similar restaurants on a restaurant's page
Recommend
(based on user similarity)
(based on restaurant similarity)
restaurant recommendations
can be based on user or restaurant similarity
(this one is based on restaurant similarity)
Case Study: Twitter Follow
•Recommend users to follow
•Users don’t have ratings
•implied rating:
“follow” (binary)
•Recommend users that the
people the target user
follows also follow (but that the
target user doesn’t)
this is what I presented @codebits in 2008,
before Twitter had follow recommendations
(the code has since been rewritten)
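A minimal sketch of that co-follow scoring; follows maps each user to the set of accounts they follow (all names are illustrative, not from the talk):

from collections import Counter

def follow_recommendations(follows, target, top_n=10):
    # recommend accounts followed by the people the target follows,
    # excluding the target and accounts already followed
    already = follows.get(target, set())
    scores = Counter()
    for friend in already:
        for candidate in follows.get(friend, set()):
            if candidate != target and candidate not in already:
                scores[candidate] += 1  # the implied binary rating: one vote per co-follow
    return scores.most_common(top_n)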
Similarity
Scoring
A KNN in 1 minute
• Calculate the nearest neighbors (similarity)
• e.g. the other users with the highest number of equal ratings
to the customer
• For the k nearest neighbors:
• neighbor base predictor (e.g. avg rating for neighbor)
• s += sim * (rating - nbp)
• d += sim
• prediction = cbp + s/d (cbp = customer base predictor, e.g. average customer rating)
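The same recipe as a Python sketch of the baseline-adjusted weighted average described above (function and argument names are mine):

def knn_predict(customer, item, candidates, sim, rating, base, k=20):
    # candidates: other users; sim(a, b): similarity; rating(u, item): rating or None;
    # base(u): base predictor, e.g. the user's average rating
    rated = [u for u in candidates if rating(u, item) is not None]
    nearest = sorted(rated, key=lambda u: sim(customer, u), reverse=True)[:k]
    s = d = 0.0
    for u in nearest:
        s += sim(customer, u) * (rating(u, item) - base(u))  # deviation from the neighbor's baseline
        d += sim(customer, u)
    if d == 0:
        return base(customer)  # no usable neighbors: fall back to the customer base predictor
    return base(customer) + s / d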
Classifying
•Assign an item into a category
•An email as spam (document classification)
•A set of symptoms to a particular disease
•A signature to an individual (biometric identification)
•An individual as credit worthy (credit scoring)
•An image as a particular letter (Optical Character Recognition)
Item → Category
Common Algorithms
• Supervised
• Neural Networks
• Support Vector Machines
• Genetic Algorithms
• Naive Bayes Classifier
• Unsupervised:
• Usually done via Clustering (clustering hypothesis)
• i.e. similar contents => similar classification
Naive Bayes Classifier
I. Train
II. Calculate Probabilities
III. Classify
Case Study: A Spam Filter
• The item (document) is an email message
• 2 Categories: Spam and Ham
• What do we need?
fc: {'python': {'spam': 0, 'ham': 6}, 'the': {'spam': 3, 'ham': 3}}
cc: {'ham': 6, 'spam': 6}
Feature Extraction
• Input data can be way too large
• Think every pixel of an image
• It can also be mostly useless
• A signature is the same regardless of color (B&W
will suffice)
• And incredibly redundant (lots of data, little info)
• The solution is to transform the input into a smaller representation - a feature vector!
• A feature is either present or not
Get Features
• Word Vector: Features are words (the basic approach for doc classification)
• An item (document) is an email message and can:
• contain a word (feature is present)
• not contain a word (feature is absent)
[‘date', 'don', 'mortgage', 'taint',‘you’,‘how’,‘delay’, ...]
Other ideas: use capitalization, stemming, tf-idf
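A minimal word-vector extractor along these lines (the regex and length limits are my own assumptions):

import re

def get_features(document):
    # lowercase words; each distinct word is a binary feature (present or absent)
    words = re.findall(r"[a-z']+", document.lower())
    return {w for w in words if 2 < len(w) < 20}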
I. Training
For every training example (item, category):
1. Extract the item's features
2. For each feature:
• Increment the count for this (feature, category) pair
3. Increment the category count (+1 example)
fc: {'feature': {'category': count, ...}}
cc: {'category': count, ...}
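A sketch of those counters, following the fc/cc shapes shown above (defaultdicts are a convenience choice, not from the slides):

from collections import defaultdict

fc = defaultdict(lambda: defaultdict(int))  # fc[feature][category] -> count
cc = defaultdict(int)                       # cc[category] -> number of training examples

def train(features, category):
    # features: the item's extracted features (e.g. its word set); category: its label
    for feature in features:
        fc[feature][category] += 1
    cc[category] += 1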
II. Probabilities
P(word | category) the probability that a word is in a particular category (classification)
P(w | c) = P(c ∩ w) / P(c)
Assumed Probability
Using only the information it has seen so far makes it incredibly sensitive to words that appear very rarely.
It would be much more realistic for the value to gradually change as a word is found in more and more documents with the same category.
A weight of 1 means the assumed probability is weighted the same as one word.
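A sketch of both estimates: the raw conditional frequency, and the weighted version that blends in an assumed probability (weight=1 and an assumed value of 0.5 are conventional defaults, not mandated by the slide). fc and cc are the counters from the training sketch:

def fprob(fc, cc, feature, category):
    # P(feature | category) straight from the training counts
    if cc[category] == 0:
        return 0.0
    return fc.get(feature, {}).get(category, 0) / cc[category]

def weighted_prob(fc, cc, feature, category, weight=1.0, assumed=0.5):
    # blend the raw estimate with an assumed probability so rarely-seen
    # words don't swing the estimate; weight=1 counts like one extra word
    basic = fprob(fc, cc, feature, category)
    total = sum(fc.get(feature, {}).get(c, 0) for c in cc)  # appearances of this feature overall
    return (weight * assumed + total * basic) / (weight + total)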
P(Document | Category): the probability of a given document within a particular category
= P(w1 | c) × P(w2 | c) × ... × P(wn | c) for every word in the document (assuming words are independent)
Yeah that’s nice... but what we want is
P(Category | Document)!
*note: Decimal vs float (multiplying many small probabilities can underflow a float; Decimal or log-probabilities avoid this)
III. Bayes’ Theorem
P(c | d) = P(d | c) × P(c) / P(d)
P(d | c) = P(w1 | c) × P(w2 | c) × ... × P(wn | c)
P(d) can be ignored (it is the same for every category)
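Putting it together, a sketch of the classify step: P(d) is dropped since it is the same for every category, and the per-word estimates come from weighted_prob above (function names are mine):

def doc_prob(fc, cc, features, category):
    # P(document | category): product of the per-word probabilities (naive independence)
    p = 1.0
    for feature in features:
        p *= weighted_prob(fc, cc, feature, category)
    return p

def classify(fc, cc, features):
    # pick the category maximising P(d | c) * P(c); P(d) is ignored
    total = sum(cc.values())
    best, best_score = None, 0.0
    for category in cc:
        score = doc_prob(fc, cc, features, category) * cc[category] / total
        if score > best_score:
            best, best_score = category, score
    return best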
• If you’re thinking of filtering spam, go with Akismet
• If you really want to do your own Bayesian spam filter, a good start is Wikipedia
• Training datasets are available online - for spam and
pretty much everything else
http://en.wikipedia.org/wiki/Bayesian_spam_filter
http://akismet.com/
http://spamassassin.apache.org/publiccorpus/
Clustering
• Find structure in datasets:
• Groups of things, people, concepts
• Unsupervised (i.e. there is no training)
• Common algorithms:
• Hierarchical clustering
• K-means
• Non Negative Matrix Approximation
Example: {A, B, C, D, F, G, I, J} clustered into {A, C}, {B, D, F, G}, {I, J}
Non Negative Matrix
Approximation (or Factorization)
I. Get the data
• in matrix form!
II. Factorize the matrix
III. Present the results
yeah the matrix is kind of magic
Case Study: News Clustering
I. The Data
[[ 7,  8,  1, 10, ...],
 [ 2,  0, 16,  1, ...],
 [22,  3,  0,  0, ...],
 [ 9, 12,  5,  4, ...],
 ...]
Rows are items (articles): [‘A’, ‘B’, ‘C’, ‘D’, ...]
Columns are properties (words): [‘sapo’, ‘codebits’, ‘haiti’, ‘iraq’, ...]
Each value is a word frequency per article, e.g. article D contains the word ‘iraq’ 4 times.
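A sketch of how such a matrix can be assembled from feeds with feedparser (feed URLs, the tokenizer, and the rare-word cut-off are placeholders):

import re
import feedparser

def build_matrix(feed_urls):
    # returns (titles, words, matrix) where matrix[i][j] counts words[j] in article i
    titles, counts, vocab = [], [], {}
    for url in feed_urls:
        for entry in feedparser.parse(url).entries:
            text = entry.get('title', '') + ' ' + entry.get('summary', '')
            article = {}
            for word in re.findall(r"[a-z']+", text.lower()):
                article[word] = article.get(word, 0) + 1
                vocab[word] = vocab.get(word, 0) + 1
            titles.append(entry.get('title', ''))
            counts.append(article)
    words = [w for w, n in vocab.items() if n > 3]  # drop very rare words (arbitrary threshold)
    matrix = [[article.get(w, 0) for w in words] for article in counts]
    return titles, words, matrix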
II. Factorize
data matrix = features matrix × weights matrix

[[23, 24],    [[7, 8],     [[1, 0],
 [ 2,  0]] =   [2, 0]]  ×   [2, 3]]

features matrix: each value is the importance of a word to a feature
weights matrix: each value is how much a feature applies to an article
http://public.procoders.net/nnma/py_nnma:
k - the number of features to find (i.e. number of clusters)
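The slide points at py_nnma; as an illustrative alternative (not the code from the talk), scikit-learn's NMF performs the same factorization:

import numpy as np
from sklearn.decomposition import NMF

def factorize(matrix, k=10):
    # factor the article x word matrix into weights (articles x features)
    # and features (features x words)
    V = np.array(matrix, dtype=float)
    model = NMF(n_components=k, init='random', random_state=0, max_iter=500)
    weights = model.fit_transform(V)   # how much each feature applies to each article
    features = model.components_       # importance of each word to each feature
    return weights, features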
III. The Results
• For every feature:
• Display the top X words (from the features matrix)
• Display the top Y articles for this feature (from the weights matrix)
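A sketch of that presentation step, assuming the weights/features matrices and the words/titles lists from the sketches above:

import numpy as np

def show_results(weights, features, words, titles, top_x=6, top_y=3):
    # for each feature, print its top words and the articles it applies to most
    for f in range(features.shape[0]):
        print([words[i] for i in np.argsort(features[f])[::-1][:top_x]])
        for a in np.argsort(weights[:, f])[::-1][:top_y]:
            print((weights[a, f], titles[a]))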
['adobe', 'flash', 'platform', 'acrobat', 'software', 'reader']
(0.0014202284481846406, u"Apple,Adobe, and Openness: Let's Get Real")
(0.00049914481067248734, u'Piggybacking on Adobe Acrobat and others')
(0.00047202214371591086, u'CVE-2010-3654 - New dangerous 0-day authplay library adobe products')
['macbook', 'hard', 'only', 'much', 'drive', 'screen']
(0.0017976618817123543, u'The new MacBook Air')
(0.00067015549607138966, u'Revisiting Solid State Hard Drives')
(0.00035732495413261966, u"The new MacBook Air's SSD performance")
['apps', 'mobile', 'business', 'other', 'good', 'application']
(0.0013598162030796167, u'Which mobile apps are making good money?')
(0.00054549656743046277, u'An open enhancement request to the Mobile Safari team for sane bookmarklet installation or alternatives')
(0.00040802131970223176, u'Google Apps highlights – 10/29/2010')
['quot', 'strike', 'operations', 'forces', 'some', 'afghan']
(0.002464522414843272, u'Kandahar diary: Watching conventional forces conduct a successful COIN')
(0.00027058999725999285, u'How universities can help in our wars - By Tom Ricks')
(0.00026940637538539202, u'This Weekend’s News: Afghanistan’s Long-Term Stability')
*note: this was created using an OPML file exported from my Google Reader (260 subscriptions)
Food for the Brain
Machine Learning
Tom Mitchell
Neural Networks:
A Comprehensive Foundation
Simon Haykin
Programming Collective Intelligence:
Building Smart Web 2.0 Applications
Toby Segaran
Data Mining: Practical Machine Learning Tools and Techniques
Ian H. Witten, Eibe Frank