Machine Learning for Web Data
Hilary Mason
Web Directions USA 2010
= new capacities
(superpowers)
Machine learning is a way of thinking about data.
http://www.meetup.com/NYC-Tech-Talks/calendar/12939544/?from=list&offset=0
http://bit.ly/9N7VB1
6
wicked hard problem
10s of millions of URLs/day
100s of millions of events/day
1000s of millions of
@hmason
[archive photo]
ELIZA
ML Today
Algorithms +
On-demand computing +
Ubiquitous data
Algorithms
New frames for modeling the world with data.
[moar data and new kinds of data]
Examples
[spam filters]
[netflix movie recommendations]
Language Identification
Face Identification
Machine Learning
Supervised Learning
Vs
Unsupervised Learning
Clustering
immunity
ultrasound
medical imaging
medical devices
thermoelectric devices
fault-tolerant circuits
low power devices
Entity disambiguation
This is important.
ME
UGLY HAG
Entity disambiguation
This is important.
Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
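One simple way to approach the company-name question is to map each mention to a canonical key before comparing. The following is a minimal sketch, not the speaker's method: the suffix list, the alias table, and the function name `canonical_name` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical suffix list and alias table, for illustration only.
CORPORATE_SUFFIXES = {"corporation", "corp", "inc", "incorporated", "ltd", "llc", "co"}
ALIASES = {"ms": "microsoft"}

def canonical_name(name):
    """Map a raw company mention to a canonical lowercase key."""
    # Lowercase and keep only alphanumeric tokens (drops punctuation).
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    # Drop common corporate suffixes such as "Corporation" or "Inc".
    tokens = [t for t in tokens if t not in CORPORATE_SUFFIXES]
    key = " ".join(tokens)
    # Resolve known abbreviations through the alias table.
    return ALIASES.get(key, key)

mentions = ["Microsoft", "Microsoft Corporation", "MS", "Microsoft Corp."]
print({m: canonical_name(m) for m in mentions})
```

A hand-written alias table does not scale, and an abbreviation like "MS" can refer to more than one entity, so a practical system would likely also need the surrounding context of each mention.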
Classification
[diagram: Training Data → Feature Extractor → Trained Classifier; Text → Feature Extractor → Trained Classifier → Cats / Dogs / Fire]
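The classification pipeline in the diagram (training data and new text both pass through a feature extractor, and a trained classifier assigns a label) can be sketched with a small naive Bayes classifier. This is an illustrative sketch using only the standard library; the bag-of-words feature extractor, the choice of naive Bayes, and the toy training sentences are assumptions, not taken from the talk.

```python
import math
from collections import Counter, defaultdict

def extract_features(text):
    """Feature extractor: a lowercase bag of words."""
    return text.lower().split()

def train(examples):
    """Count labels and per-label words from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in extract_features(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        # Add-one (Laplace) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in extract_features(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (made up for this example).
training_data = [
    ("my cat purrs and meows", "cats"),
    ("the kitten chased a mouse", "cats"),
    ("the dog barks and fetches", "dogs"),
    ("a puppy wagged its tail", "dogs"),
    ("smoke and flames in the building", "fire"),
    ("the fire alarm is ringing", "fire"),
]
model = train(training_data)
print(classify(model, "the puppy barks"))
print(classify(model, "flames and smoke"))
```

The same `extract_features` function is used for training and for classification, which mirrors the two Feature Extractor boxes in the diagram.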
<math>
Probability
P(A) is the probability that A is true.
Axioms of Probability
0 ≤ P(A) ≤ 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) – P(A and B)
P(A or B) = P(A) + P(B) – P(A and B)
P(A)
P(B)
P(A and B)
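The addition rule can be checked by counting outcomes directly. A minimal sketch, assuming one roll of a fair six-sided die as the sample space; the die and the events A and B are illustrative and not from the talk.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # event A: the roll is even
B = {4, 5, 6}  # event B: the roll is greater than 3

def prob(event):
    """P(event) as the fraction of equally likely outcomes in the event."""
    return Fraction(len(event & outcomes), len(outcomes))

lhs = prob(A | B)                      # P(A or B)
rhs = prob(A) + prob(B) - prob(A & B)  # P(A) + P(B) - P(A and B)
print(lhs, rhs)
```

Subtracting P(A and B) removes the outcomes (here 4 and 6) that would otherwise be counted twice, once in A and once in B.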
Bayes' Law
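For reference, the standard statement of the law for two events A and B with P(B) > 0, written in the same P(·) notation as the axioms:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```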
Example
There are 10,000 people.
1% have a rare disease.
Example
• Population of 10,000
• 1% have rare disease
• There’s a test that is 99% effective.
– 99% of sick patients test positive
– 99% of healthy patients test negative
Given a positive test result, what is the probability
that the patient is sick?
Disease Diagnosis
99 sick patients test positive, 99 healthy patients test positive
Given a positive test, there is a 50% probability that the patient is sick.
Bayesian Disease
Know the probability of testing sick given healthy, and of testing healthy given sick.
Use Bayes' theorem to invert probabilities.
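The arithmetic behind the 50% figure can be reproduced in a few lines. This sketch uses only the numbers from the example (population of 10,000, 1% prevalence, a test that is right 99% of the time for both sick and healthy patients); the variable names are illustrative.

```python
from fractions import Fraction

population = 10_000
p_sick = Fraction(1, 100)                # 1% have the rare disease
p_pos_given_sick = Fraction(99, 100)     # 99% of sick patients test positive
p_neg_given_healthy = Fraction(99, 100)  # 99% of healthy patients test negative

# Counting version: how many people end up in each group.
sick = population * p_sick                             # 100 people
healthy = population - sick                            # 9,900 people
true_positives = sick * p_pos_given_sick               # sick and test positive
false_positives = healthy * (1 - p_neg_given_healthy)  # healthy and test positive

# Bayes' theorem: P(sick | positive) = P(positive | sick) P(sick) / P(positive)
p_pos = p_pos_given_sick * p_sick + (1 - p_neg_given_healthy) * (1 - p_sick)
p_sick_given_pos = p_pos_given_sick * p_sick / p_pos

print(true_positives, false_positives, p_sick_given_pos)
```

Because the disease is rare, the 1% error rate on the large healthy group produces as many positives as the 99% hit rate on the small sick group.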
</math>
Obtain
Scrub
Explore
Model
iNterpret
1. Obtain Data
“pointing and clicking does not scale!”
http://www.delicious.com/pskomoroch/dataset
lynx -dump http://www.nytimes.com
Lynx: http://bit.ly/a6Pumm
2. Scrub
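`lynx -dump` prints a text-only rendering of a page. A rough Python analogue of that scrubbing step, using only the standard library, might look like the sketch below. The class name `TextExtractor`, the helper `html_to_text`, and the sample HTML are illustrative assumptions; the sketch does not reproduce lynx's layout or its numbered link list.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

# Made-up sample page, so the example runs without network access.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Headline</h1><p>First <b>story</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

To run it against a live page, the HTML string could be fetched first, for example with `urllib.request.urlopen(url).read().decode("utf-8", errors="replace")`.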
3. Explore
http://vis.stanford.edu/protovis/
4. Model
Google Prediction API
http://code.google.com/apis/predict/
4. Model
Python
• NLTK - http://www.nltk.org/
• Scikits Learn - http://scikit-learn.sourceforge.net/
4. Model
http://www.alchemyapi.com/
5. Interpret
Andrew Vande Moere – Visual Poetry 06
http://www.dataists.com
One Final Example
Twitter is full of noise.
Sports – down
Math – UP!
Narcissism - down
Code!
Filtering & Relevance Ordering
http://github.com/hmason/tc
What’s next?
Soon:
Natural Language Generation
Rich media classification
Contextual everything
Algorithms-As-A-Service
infer links in data
Filtering
Relevance
h@bit.ly @hmason
Thank you!
Presentation at Web Directions 2010, Atlanta, GA.

  • Sad puppy.
  • The netflix prize was $1 million for a 10% increase in accuracy. Just 10%!!
  • P(A) is the fraction of possible universes in which A is true.
  • Machine Learning for Web Data

    1. Machine Learning for Web Data Hilary Mason Web Directions USA 2010
    2. = new capacities (superpowers) Machine learning is a way of thinking about data.
    3. http://www.meetup.com/NYC-Tech-Talks/calendar/12939544/?from=list&offset=0 http://bit.ly/9N7VB1
    4. 6
    5. wicked hard problem 10s of millions of URLs /day 100s of millions of events / day 1000s of millions of
    6. @hmason
    7. [archive photo]
    8. ELIZA
    9. ML Today
    10. Algorithms + On-demand computing + Ubiquitous data
    11. Algorithms New frames for modeling the world with data.
    12. [moar data and new kinds of data]
    13. Examples
    14. [spam filters]
    15. [netflix movie recommendations]
    16. Language Identification
    17. Face Identification
    18. Machine Learning
    19. Supervised Learning Vs Unsupervised Learning
    20. Clustering immunity ultrasound medical imaging medical devices thermoelectric devices fault-tolerant circuits low power devices
    21. Entity disambiguation This is important.
    22. ME UGLY HAG
    23. Entity disambiguation This is important. Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
    24. Classification
    25. classification Text Feature Extractor Trained Classifier Cats Dogs Fire Training Data Feature Extractor
    26. <math>
    27. Probability P(A) is the probability that A is true.
    28. Axioms of Probability 0 ≤ P(A) ≤ 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) – P(A and B)
    29. P(A or B) = P(A) + P(B) – P(A and B) P(A) P(B) P(A and B)
    30. Bayes' Law
    31. Example There are 10,000 people. 1% have a rare disease.
    32. Example • Population of 10,000 • 1% have rare disease • There’s a test that is 99% effective. – 99% of sick patients test positive – 99% of healthy patients test negative
    33. Given a positive test result, what is the probability that the patient is sick?
    34. Disease Diagnosis 99 sick patients test positive, 99 healthy patients test positive Given a positive test, there is a 50% probability that the patient is sick.
    35. Bayesian Disease Know the probability of testing sick given healthy, and of testing healthy given sick. Use Bayes' theorem to invert probabilities.
    36. </math>
    37. Obtain Scrub Explore Model iNterpret
    38. 1. Obtain Data “pointing and clicking does not scale!” http://www.delicious.com/pskomoroch/dataset
    39. lynx -dump http://www.nytimes.com Lynx: http://bit.ly/a6Pumm 2. Scrub
    40. 3. Explore http://vis.stanford.edu/protovis/
    41. 4. Model Google Prediction API http://code.google.com/apis/predict/
    42. 4. Model Python • NLTK - http://www.nltk.org/ • Scikits Learn - http://scikit-learn.sourceforge.net/
    43. 4. Model http://www.alchemyapi.com/
    44. 5. Interpret Andrew Vande Moere – Visual Poetry 06
    45. http://www.dataists.com
    46. One Final Example Twitter is full of noise. Sports – down Math – UP! Narcissism - down
    47. Code!
    48. Filtering & Relevance Ordering http://github.com/hmason/tc
    49. What’s next?
    50. Soon: Natural Language Generation Rich media classification Contextual everything
    51. Algorithms-As-A-Service
    52. infer links in data
    53. Filtering
    54. Relevance
    55. h@bit.ly @hmason Thank you!
