Machine Learning for Web Data

Presentation at Web Directions 2010, Atlanta, GA.

  • Sad puppy.
  • The Netflix Prize was $1 million for a 10% increase in accuracy. Just 10%!!
  • P(A) is the fraction of possible universes in which A is true.

Machine Learning for Web Data: Presentation Transcript

  • Machine Learning for Web Data
    Hilary Mason
    Web Directions USA 2010
  • = new capacities
    (superpowers)
    Machine learning is a way of thinking about data.
  • http://www.meetup.com/NYC-Tech-Talks/calendar/12939544/?from=list&offset=0
    http://bit.ly/9N7VB1
  • 6
  • wicked hard problem
    10s of millions of URLs /day
    100s of millions of events / day
    1000s of millions of data points
  • ?
  • @hmason
  • [archive photo]
  • ELIZA
  • ML Today
  • Algorithms +
    On-demand computing +
    Ubiquitous data
  • Algorithms
    New frames for modeling the world with data.
  • [moar data and new kinds of data]
  • Examples
  • [spam filters]
  • [Netflix movie recommendations]
  • Language Identification
  • Face Identification
  • Machine Learning
  • Supervised Learning
    Vs
    Unsupervised Learning
  • Clustering
    immunity
    fault-tolerant circuits
    medical imaging
    ultrasound
    low power devices
    medical devices
    thermoelectric devices
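As a rough illustration of clustering, the sketch below groups the phrases from this slide without any labels. It assumes a deliberately crude similarity (two phrases are related if they share a word) and links them single-link style; the slide does not say which algorithm or features produced its clusters, so both choices here are assumptions.

```python
from itertools import combinations

terms = [
    "immunity",
    "fault-tolerant circuits",
    "medical imaging",
    "ultrasound",
    "low power devices",
    "medical devices",
    "thermoelectric devices",
]

def shares_word(a, b):
    """Crude similarity (an assumption): do the phrases have a word in common?"""
    return bool(set(a.split()) & set(b.split()))

def cluster(items):
    """Single-link clustering: items join a group if a chain of shared words connects them."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers up to the representative of x's group.
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if shares_word(a, b):
            parent[find(a)] = find(b)

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return list(groups.values())

clusters = cluster(terms)
for group in clusters:
    print(group)
```

With this similarity, the "medical" and "devices" phrases chain into one group while "ultrasound" stays alone, even though a person would file it next to "medical imaging". That gap is why real systems use richer features than word overlap.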
  • Entity disambiguation
    This is important.
  • UGLY HAG
    ME
  • Entity disambiguation
    This is important.
    Company disambiguation is a very common problem – Are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
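A minimal sketch of the company-name case, assuming a hand-written alias table and a list of legal suffixes to strip. Both tables are hypothetical and invented for this illustration; a production system would also need context, since "MS" can mean many things.

```python
import re

# Hypothetical lookup tables, invented for this example.
ALIASES = {"ms": "microsoft"}
SUFFIXES = {"corporation", "corp", "inc", "ltd", "llc", "co"}

def canonical_company(name):
    """Map a raw company mention to a rough canonical key."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    # Drop trailing legal suffixes such as "Corporation" or "Inc".
    while len(words) > 1 and words[-1] in SUFFIXES:
        words.pop()
    key = " ".join(words)
    # Fall back to the cleaned-up name when no alias is known.
    return ALIASES.get(key, key)

mentions = ["Microsoft", "Microsoft Corporation", "MS"]
keys = {canonical_company(m) for m in mentions}
print(keys)
```

All three mentions collapse to the same key here only because the alias table says so; deciding when that mapping is safe is the hard part of the problem.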
  • Classification
  • Classification
    [diagram: Training Data → Feature Extractor → Trained Classifier; Text → Feature Extractor → Trained Classifier → label (Cats, Dogs, Fire)]
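The pipeline on this slide (training data, a feature extractor, a trained classifier) can be sketched in a few lines. This version assumes a bag-of-words feature extractor and a naive Bayes classifier with add-one smoothing; the slide does not name a specific classifier, and the tiny training set is made up for illustration.

```python
import math
from collections import Counter, defaultdict

def extract_features(text):
    """Feature extractor: a bag of lowercase words."""
    return text.lower().split()

def train(labeled_texts):
    """Count words per label; these counts are the whole 'trained classifier'."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in labeled_texts:
        words = extract_features(text)
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in extract_features(text):
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data for the example.
training_data = [
    ("meow purr whiskers", "cats"),
    ("purr nap meow", "cats"),
    ("bark fetch wag", "dogs"),
    ("wag bark walk", "dogs"),
]
model = train(training_data)
print(classify(model, "meow meow nap"))
print(classify(model, "bark bark fetch"))
```

Note that the same `extract_features` function runs at training time and at prediction time, which mirrors the two "Feature Extractor" boxes in the diagram.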
  • <math>
  • Probability
    P(A) is the probability that A is true.
  • Axioms of Probability
    0 ≤ P(A) ≤ 1
    P(True) = 1
    P(False) = 0
    P(A or B) = P(A) + P(B) – P(A and B)
  • P(A or B) = P(A) + P(B) – P(A and B)
    P(A)
    P(A and B)
    P(B)
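The last axiom can be checked by counting. The sketch below assumes twelve equally likely outcomes and two made-up events, and treats the probability of an event as the fraction of outcomes in which it holds.

```python
from fractions import Fraction

outcomes = range(1, 13)  # twelve equally likely outcomes (an assumption)
A = {u for u in outcomes if u % 2 == 0}  # hypothetical event: outcome is even
B = {u for u in outcomes if u % 3 == 0}  # hypothetical event: outcome is divisible by 3

def P(event):
    """Probability as the fraction of outcomes in which the event is true."""
    return Fraction(len(event), len(outcomes))

lhs = P(A | B)                # P(A or B), counted directly
rhs = P(A) + P(B) - P(A & B)  # the axiom's right-hand side
print(lhs, rhs)
```

Subtracting P(A and B) removes the outcomes (6 and 12 here) that would otherwise be counted twice, once in A and once in B.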
  • Bayes’ Law
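The formula itself did not survive in the transcript. In its standard form, with the denominator expanded by the law of total probability, it reads:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad
P(B) = P(B \mid A)\, P(A) + P(B \mid \lnot A)\, P(\lnot A)
```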
  • Example
    There are
    10,000 people.
    1% have a rare
    disease.
  • Example
    Population of 10,000
    1% have rare disease
    There’s a test that is 99% effective.
    99% of sick patients test positive
    99% of healthy patients test negative
  • Given a positive test result, what is the probability that the patient is sick?
  • Disease Diagnosis
    99 sick patients test positive, 99 healthy patients test positive
    Given a positive test, there is a 50% probability that the patient is sick.
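The 50% figure follows from counting patients directly. A short sketch using only the numbers from the slides (the variable names are mine):

```python
population = 10_000
prevalence = 0.01    # 1% have the rare disease
sensitivity = 0.99   # 99% of sick patients test positive
specificity = 0.99   # 99% of healthy patients test negative

sick = round(population * prevalence)                 # 100
healthy = population - sick                           # 9900
true_positives = round(sick * sensitivity)            # 99
false_positives = round(healthy * (1 - specificity))  # 99

p_sick_given_positive = true_positives / (true_positives + false_positives)
print(p_sick_given_positive)
```

The healthy group is so much larger than the sick group that its 1% error rate produces as many positives as the 99% hit rate does among the sick.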
  • Bayesian Disease
    We know the probability of testing positive given sick, and of testing positive given healthy
    Use Bayes’ theorem to invert the probabilities
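The inversion can be written as a small function. This is a sketch with parameter names of my own choosing; the first call plugs in the numbers from the disease example, and the second uses a made-up 10% prior to show how strongly the answer depends on it.

```python
def posterior(prior, p_pos_given_sick, p_pos_given_healthy):
    """P(sick | positive) from P(positive | sick) and P(positive | healthy)."""
    # Total probability of a positive test across sick and healthy patients.
    p_pos = p_pos_given_sick * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_sick * prior / p_pos

rare = posterior(prior=0.01, p_pos_given_sick=0.99, p_pos_given_healthy=0.01)
common = posterior(prior=0.10, p_pos_given_sick=0.99, p_pos_given_healthy=0.01)
print(round(rare, 3), round(common, 3))
```

With a 1% prior the posterior is 50%; with the hypothetical 10% prior the same test gives roughly 92%.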
  • </math>
  • Obtain
    Scrub
    Explore
    Model
    iNterpret
  • 1. Obtain Data
    “pointing and clicking does not scale!”
    http://www.delicious.com/pskomoroch/dataset
  • 2. Scrub
    lynx -dump http://www.nytimes.com
    Lynx: http://bit.ly/a6Pumm
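A rough Python analogue of dumping a page to plain text, for cases where the scrubbing happens inside a script. This sketch uses only the standard library's `html.parser`; it is far less thorough than Lynx and runs on an inline sample rather than a live page.

```python
from html.parser import HTMLParser

class TextDumper(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def dump_text(html):
    parser = TextDumper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Made-up sample page for the example.
sample = (
    "<html><head><style>p{}</style></head><body>"
    "<h1>Headline</h1><p>Story <b>text</b>.</p>"
    "<script>var x;</script></body></html>"
)
text = dump_text(sample)
print(text)
```

To run it against a real page, the HTML could be fetched first with `urllib.request.urlopen(url).read()` and decoded to a string before calling `dump_text`.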
  • 3. Explore
    http://vis.stanford.edu/protovis/
  • 4. Model
    Google Prediction API
    http://code.google.com/apis/predict/
  • 4. Model
    Python
    • NLTK - http://www.nltk.org/
    • Scikits Learn - http://scikit-learn.sourceforge.net/
  • 4. Model
    http://www.alchemyapi.com/
  • 5. Interpret
    Andrew Vande Moere – Visual Poetry 06
  • http://www.dataists.com
  • One Final Example
    Twitter is full of noise.
    Sports – down
    Math – UP!
    Narcissism – down
  • Code!
  • Filtering & Relevance Ordering
    http://github.com/hmason/tc
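To make the idea of filtering and relevance ordering concrete, here is a toy sketch. It is not the code in the linked repository: it assumes a hypothetical hand-set keyword weight table (math up, sports and narcissism down, echoing the earlier slide) where a real system would learn its scores from data.

```python
# Hypothetical keyword weights, invented for this example.
TOPIC_WEIGHTS = {
    "math": 2.0, "theorem": 2.0, "probability": 2.0,
    "game": -1.0, "score": -1.0, "playoffs": -1.0,
    "me": -1.0, "my": -1.0, "i": -1.0,
}

def relevance(tweet):
    """Sum the weights of known words; unknown words count as zero."""
    words = tweet.lower().split()
    return sum(TOPIC_WEIGHTS.get(w.strip(".,!?#@"), 0.0) for w in words)

def rank(tweets, threshold=0.0):
    """Filter out tweets scoring below the threshold, then order the rest by score."""
    scored = [(relevance(t), t) for t in tweets]
    kept = [(s, t) for s, t in scored if s >= threshold]
    return [t for s, t in sorted(kept, key=lambda st: st[0], reverse=True)]

# Made-up tweets for the example.
tweets = [
    "Look at my new haircut, I love me",
    "Great game, what a score!",
    "A neat probability theorem for your Monday",
    "Lunch was fine",
]
ranked = rank(tweets)
for t in ranked:
    print(t)
```

Filtering and ordering are separate steps here: the threshold decides what is shown at all, and the sort decides what is shown first.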
  • What’s next?
  • Soon:
    Natural Language Generation
    Rich media classification
    Contextual everything
  • Algorithms-As-A-Service
  • infer links in data
  • Filtering
  • Relevance
  • Thank you!
    h@bit.ly @hmason