The Magical Art of Extracting
Meaning From Data
Luis Rei
@lrei
luis.rei@gmail.com
http://luisrei.com
Data Mining For The Web
Outline
• Introduction
• Recommender Systems
• Classification
• Clustering
“The greatest problem of today is how to teach people to ignore the
irrelevant, how to refuse to know things, before they are suffocated. For
too many facts are as bad as none at all.”
(W.H.Auden)
“The key in business is to know something that nobody else knows.”
(Aristotle Onassis)
DATA
Luis Rei
25
<a href="http://luisrei.com/">
codebits
4
<a href="http://codebits.eu/">
MEANING
Luis Rei
25
NAME
PERSON
AGE
PHOTO
WEBSITE <a href="http://luisrei.com/">
Tools
• Python vs C or C++
• feedparser, Beautiful Soup (scrape web pages)
• NumPy, SciPy
• Weka
• R
• Libraries
http://mloss.org/software/
Down The Rabbit Hole
• In 2006, Google's search crawler used 850 TB of data. The total web history is around 3 PB
• Think of all the audio, photos & videos
• That’s a lot of data
• Open formats (HTML, RSS, PDF, ...)
• Everyone + their dog has an API
• facebook, twitter, flickr, last.fm,
delicious, digg, gowalla, ...
• Think about:
• news articles published every day
• status updates / day
Recommendations
The Netflix Prize
• In October 2006 Netflix launched an open competition for the best
collaborative filtering algorithm
• at least 10% improvement over Netflix’s own algorithm
• Predict user ratings for films based on previous ratings (by all users)
• US$1,000,000 prize won in Sep 2009
The Three Acts
I: The Pledge
The magician shows you something ordinary. But of course... it
probably isn't.
II: The Turn
The magician takes the ordinary something and makes it do
something extraordinary. Now you're looking for the secret...
III: The Prestige
But you wouldn't clap yet. Because making something disappear
isn't enough; you have to bring it back.
Collaborative Filtering
I. Collect Preferences
II. Find Similar Users
or Items
III. Recommend
I. Collecting Preferences
• yes/no votes
• Ratings in stars
• Purchase history
• Who you follow/who’s your
friend.
• The music you listen to or the
movies you watch
• Comments (“Bad”, “Great”, “Lousy”, ...)
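(Not on the slide: in Python, a natural way to store collected preferences is a nested mapping of user → item → rating. The sketches that follow assume this shape; all names and values here are illustrative.)

# preferences as {user: {item: rating}}
prefs = {
    'alice': {'Restaurant A': 4.5, 'Restaurant B': 2.0, 'Restaurant C': 5.0},
    'bob':   {'Restaurant A': 4.0, 'Restaurant B': 2.5},
    'carol': {'Restaurant B': 1.0, 'Restaurant C': 4.0},
}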
II. Similarity
• Euclidean Distance: d(a, b) = √(Σᵢ (aᵢ − bᵢ)²)
• Pearson Correlation
Olsen Twins - notice the similarity! Correlation > 0.0 (positive) but < 1.0 (not identical):
same eyes, nose, ...; different hair color, dress, earrings, ...
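A minimal sketch of both measures over the prefs mapping above (function names are mine, not from the talk):

from math import sqrt

def sim_euclidean(prefs, a, b):
    # similarity from Euclidean distance over items rated by both users
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0
    dist = sqrt(sum((prefs[a][it] - prefs[b][it]) ** 2 for it in shared))
    return 1.0 / (1.0 + dist)  # map distance to a 0..1 similarity

def sim_pearson(prefs, a, b):
    # Pearson correlation of the two users' ratings on shared items
    shared = [item for item in prefs[a] if item in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[a][it] for it in shared]
    ys = [prefs[b][it] for it in shared]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0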
III. Recommend
Users vs Items
• Find similar items instead of similar users!
• Same recommendation process:
• just switch users with items & vice versa (conceptually)
• Why?
• Works for new users
• Might be more accurate (might not)
• It can be useful to have both
Cross-Validation
• How good are the recommendations?
• Partitioning the data: Training set vs Test set
• Size of the sets? 95/5
• Variance
• Multiple rounds with different partitions
• How many rounds? 1? 2? 100?
• Measure of “goodness” (or rather, the error): Root Mean Square Error
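A hedged sketch of this evaluation loop; predict() stands in for whichever recommender is being tested (all names are placeholders):

import random
from math import sqrt

def rmse(pairs):
    # Root Mean Square Error over (predicted, actual) pairs
    return sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def cross_validate(ratings, predict, rounds=10, test_fraction=0.05):
    # ratings: list of (user, item, rating); predict(train, user, item) -> estimated rating
    scores = []
    for _ in range(rounds):
        shuffled = ratings[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * test_fraction)
        test, train = shuffled[:cut], shuffled[cut:]
        scores.append(rmse([(predict(train, u, i), r) for u, i, r in test]))
    return sum(scores) / len(scores)  # average error; per-round variance is worth checking too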
Case Study: Francesinhas.com
• Django project by 1 programmer
• Users give ratings to restaurants
• 0 to 5 stars (0-100 internally)
• Challenge: recommend users
restaurants they will probably like
User Similarity
normalize
Restaurant Similarity
Allows you to show similar restaurants on a restaurant's page
Recommend
(based on user similarity)
(based on restaurant similarity)
restaurant recommendations
can be based on user or restaurant similarity
(this one is based on restaurant similarity)
Case Study: Twitter Follow
•Recommend users to follow
•Users don’t have ratings
•implied rating:
“follow” (binary)
•Recommend users that the
people the target user
follows also follow (but that the
target user doesn’t)
this is what I presented @codebits in 2008,
before Twitter had follow recommendations
(the code has since been rewritten)
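A minimal sketch of that co-follow scoring; follows maps each user to the set of accounts they follow (all names are illustrative, not from the talk):

from collections import Counter

def follow_recommendations(follows, target, top_n=10):
    # recommend accounts followed by the people the target follows,
    # excluding the target and accounts already followed
    already = follows.get(target, set())
    scores = Counter()
    for friend in already:
        for candidate in follows.get(friend, set()):
            if candidate != target and candidate not in already:
                scores[candidate] += 1  # the implied binary rating: one vote per co-follow
    return scores.most_common(top_n)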
Similarity
Scoring
A KNN in 1 minute
• Calculate the nearest neighbors (similarity)
• e.g. the other users with the highest number of equal ratings
to the customer
• For the k nearest neighbors:
• neighbor base predictor (e.g. avg rating for neighbor)
• s += sim * (rating - nbp)
• d += sim
• prediction = cbp + s/d (cbp = customer base predictor, e.g. average customer rating)
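The same recipe as a Python sketch of the baseline-adjusted weighted average described above (function and argument names are mine):

def knn_predict(customer, item, candidates, sim, rating, base, k=20):
    # candidates: other users; sim(a, b): similarity; rating(u, item): rating or None;
    # base(u): base predictor, e.g. the user's average rating
    rated = [u for u in candidates if rating(u, item) is not None]
    nearest = sorted(rated, key=lambda u: sim(customer, u), reverse=True)[:k]
    s = d = 0.0
    for u in nearest:
        s += sim(customer, u) * (rating(u, item) - base(u))  # deviation from the neighbor's baseline
        d += sim(customer, u)
    if d == 0:
        return base(customer)  # no usable neighbors: fall back to the customer base predictor
    return base(customer) + s / d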
Classifying
•Assign an item into a category
•An email as spam (document classification)
•A set of symptoms to a particular disease
•A signature to an individual (biometric identification)
•An individual as credit worthy (credit scoring)
•An image as a particular letter (Optical Character Recognition)
Item → Category
Common Algorithms
• Supervised
• Neural Networks
• Support Vector Machines
• Genetic Algorithms
• Naive Bayes Classifier
• Unsupervised:
• Usually done via Clustering (clustering hypothesis)
• i.e. similar contents => similar classification
Naive Bayes Classifier
I. Train
II. Calculate Probabilities
III. Classify
Case Study: A Spam Filter
• The item (document) is an email message
• 2 Categories: Spam and Ham
• What do we need?
fc: {'python': {'spam': 0, 'ham': 6}, 'the': {'spam': 3, 'ham': 3}}
cc: {'ham': 6, 'spam': 6}
Feature Extraction
• Input data can be way too large
• Think every pixel of an image
• It can also be mostly useless
• A signature is the same regardless of color (B&W
will suffice)
• And incredibly redundant (lots of data, little info)
• The solution is to transform the input into a smaller representation - a feature vector!
• A feature is either present or not
Get Features
• Word Vector: Features are words (the basic approach for doc classification)
• An item (document) is an email message and can:
• contain a word (feature is present)
• not contain a word (feature is absent)
[‘date', 'don', 'mortgage', 'taint',‘you’,‘how’,‘delay’, ...]
Other ideas: use capitalization, stemming, tf-idf
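A minimal word-vector extractor along these lines (the regex and length limits are my own assumptions):

import re

def get_features(document):
    # lowercase words; each distinct word is a binary feature (present or absent)
    words = re.findall(r"[a-z']+", document.lower())
    return {w for w in words if 2 < len(w) < 20}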
I. Training
For every training example (item, category):
1. Extract the item's features
2. For each feature:
• Increment the count for this (feature, category) pair
3. Increment the category count (+1 example)
fc: {'feature': {'category': count, ...}}
cc: {'category': count, ...}
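A sketch of those counters, following the fc/cc shapes shown above (defaultdicts are a convenience choice, not from the slides):

from collections import defaultdict

fc = defaultdict(lambda: defaultdict(int))  # fc[feature][category] -> count
cc = defaultdict(int)                       # cc[category] -> number of training examples

def train(features, category):
    # features: the item's extracted features (e.g. its word set); category: its label
    for feature in features:
        fc[feature][category] += 1
    cc[category] += 1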
II. Probabilities
P(word | category) the probability that a word is in a particular category (classification)
P(w | c) = P(c ∩ w) / P(c)
Assumed Probability
Using only the information it has seen so far makes it incredibly sensitive to words that appear very rarely.
It would be much more realistic for the value to gradually change as a word is found in more and more documents with the same category.
A weight of 1 means the assumed probability is weighted the same as one word.
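A sketch of both estimates: the raw conditional frequency, and the weighted version that blends in an assumed probability (weight=1 and an assumed value of 0.5 are conventional defaults, not mandated by the slide). fc and cc are the counters from the training sketch:

def fprob(fc, cc, feature, category):
    # P(feature | category) straight from the training counts
    if cc[category] == 0:
        return 0.0
    return fc.get(feature, {}).get(category, 0) / cc[category]

def weighted_prob(fc, cc, feature, category, weight=1.0, assumed=0.5):
    # blend the raw estimate with an assumed probability so rarely-seen
    # words don't swing the estimate; weight=1 counts like one extra word
    basic = fprob(fc, cc, feature, category)
    total = sum(fc.get(feature, {}).get(c, 0) for c in cc)  # appearances of this feature overall
    return (weight * assumed + total * basic) / (weight + total)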
P(Document | Category): the probability of a given document within a particular category
= P(w1 | c) × P(w2 | c) × ... × P(wn | c) for every word in the document (assuming words are independent)
Yeah that’s nice... but what we want is
P(Category | Document)!
*note: Decimal vs float (multiplying many small probabilities can underflow a float; Decimal or log-probabilities avoid this)
III. Bayes’ Theorem
P(c | d) = P(d | c) × P(c) / P(d)
P(d | c) = P(w1 | c) × P(w2 | c) × ... × P(wn | c)
P(d) can be ignored (it is the same for every category)
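Putting it together, a sketch of the classify step: P(d) is dropped since it is the same for every category, and the per-word estimates come from weighted_prob above (function names are mine):

def doc_prob(fc, cc, features, category):
    # P(document | category): product of the per-word probabilities (naive independence)
    p = 1.0
    for feature in features:
        p *= weighted_prob(fc, cc, feature, category)
    return p

def classify(fc, cc, features):
    # pick the category maximising P(d | c) * P(c); P(d) is ignored
    total = sum(cc.values())
    best, best_score = None, 0.0
    for category in cc:
        score = doc_prob(fc, cc, features, category) * cc[category] / total
        if score > best_score:
            best, best_score = category, score
    return best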
• If you’re thinking of filtering spam, go with Akismet
• If you really want to do your own Bayesian spam filter, a good start is Wikipedia
• Training datasets are available online - for spam and
pretty much everything else
http://en.wikipedia.org/wiki/Bayesian_spam_filter
http://akismet.com/
http://spamassassin.apache.org/publiccorpus/
Clustering
• Find structure in datasets:
• Groups of things, people, concepts
• Unsupervised (i.e. there is no training)
• Common algorithms:
• Hierarchical clustering
• K-means
• Non Negative Matrix Approximation
Example: {A, B, C, D, F, G, I, J} clustered into {A, C}, {B, D, F, G}, {I, J}
Non Negative Matrix
Approximation (or Factorization)
I. Get the data
• in matrix form!
II. Factorize the matrix
III. Present the results
yeah the matrix is kind of magic
Case Study: News Clustering
I. The Data
[[ 7,  8,  1, 10, ...],
 [ 2,  0, 16,  1, ...],
 [22,  3,  0,  0, ...],
 [ 9, 12,  5,  4, ...],
 ...]
Rows are items (articles): [‘A’, ‘B’, ‘C’, ‘D’, ...]
Columns are properties (words): [‘sapo’, ‘codebits’, ‘haiti’, ‘iraq’, ...]
Each value is a word frequency per article, e.g. article D contains the word ‘iraq’ 4 times.
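A sketch of how such a matrix can be assembled from feeds with feedparser (feed URLs, the tokenizer, and the rare-word cut-off are placeholders):

import re
import feedparser

def build_matrix(feed_urls):
    # returns (titles, words, matrix) where matrix[i][j] counts words[j] in article i
    titles, counts, vocab = [], [], {}
    for url in feed_urls:
        for entry in feedparser.parse(url).entries:
            text = entry.get('title', '') + ' ' + entry.get('summary', '')
            article = {}
            for word in re.findall(r"[a-z']+", text.lower()):
                article[word] = article.get(word, 0) + 1
                vocab[word] = vocab.get(word, 0) + 1
            titles.append(entry.get('title', ''))
            counts.append(article)
    words = [w for w, n in vocab.items() if n > 3]  # drop very rare words (arbitrary threshold)
    matrix = [[article.get(w, 0) for w in words] for article in counts]
    return titles, words, matrix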
II. Factorize
data matrix = features matrix × weights matrix

[[23, 24],    [[7, 8],     [[1, 0],
 [ 2,  0]] =   [2, 0]]  ×   [2, 3]]

features matrix: each value is the importance of a word to a feature
weights matrix: each value is how much a feature applies to an article
http://public.procoders.net/nnma/py_nnma:
k - the number of features to find (i.e. number of clusters)
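The slide points at py_nnma; as an illustrative alternative (not the code from the talk), scikit-learn's NMF performs the same factorization:

import numpy as np
from sklearn.decomposition import NMF

def factorize(matrix, k=10):
    # factor the article x word matrix into weights (articles x features)
    # and features (features x words)
    V = np.array(matrix, dtype=float)
    model = NMF(n_components=k, init='random', random_state=0, max_iter=500)
    weights = model.fit_transform(V)   # how much each feature applies to each article
    features = model.components_       # importance of each word to each feature
    return weights, features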
III. The Results
• For every feature:
• Display the top X words (from the features matrix)
• Display the top Y articles for this feature (from the weights matrix)
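A sketch of that presentation step, assuming the weights/features matrices and the words/titles lists from the sketches above:

import numpy as np

def show_results(weights, features, words, titles, top_x=6, top_y=3):
    # for each feature, print its top words and the articles it applies to most
    for f in range(features.shape[0]):
        print([words[i] for i in np.argsort(features[f])[::-1][:top_x]])
        for a in np.argsort(weights[:, f])[::-1][:top_y]:
            print((weights[a, f], titles[a]))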
['adobe', 'flash', 'platform', 'acrobat', 'software', 'reader']
(0.0014202284481846406, u"Apple,Adobe, and Openness: Let's Get Real")
(0.00049914481067248734, u'Piggybacking on Adobe Acrobat and others')
(0.00047202214371591086, u'CVE-2010-3654 - New dangerous 0-day authplay library adobe products')
['macbook', 'hard', 'only', 'much', 'drive', 'screen']
(0.0017976618817123543, u'The new MacBook Air')
(0.00067015549607138966, u'Revisiting Solid State Hard Drives')
(0.00035732495413261966, u"The new MacBook Air's SSD performance")
['apps', 'mobile', 'business', 'other', 'good', 'application']
(0.0013598162030796167, u'Which mobile apps are making good money?')
(0.00054549656743046277, u'An open enhancement request to the Mobile Safari team for sane bookmarklet installation or alternatives')
(0.00040802131970223176, u'Google Apps highlights – 10/29/2010')
['quot', 'strike', 'operations', 'forces', 'some', 'afghan']
(0.002464522414843272, u'Kandahar diary: Watching conventional forces conduct a successful COIN')
(0.00027058999725999285, u'How universities can help in our wars - By Tom Ricks')
(0.00026940637538539202, u'This Weekend’s News: Afghanistan’s Long-Term Stability')
*note: this was created using an OPML file exported from my Google Reader (260 subscriptions)
Food for the Brain
Machine Learning
Tom Mitchell
Neural Networks:
A Comprehensive Foundation
Simon Haykin
Programming Collective Intelligence:
Building Smart Web 2.0 Applications
Toby Segaran
Data Mining: Practical Machine Learning Tools and Techniques
Ian H. Witten, Eibe Frank