Oscon data-2011-ted-dunning

These are the slides for my half of the OSCON Mahout tutorial.

Transcript of "Oscon data-2011-ted-dunning"

1. Hands-on Classification
2. Preliminaries
   • Code is available from github:
     – git@github.com:tdunning/Chapter-16.git
   • EC2 instances available
   • Thumb drives also available
   • Email to ted.dunning@gmail.com
   • Twitter @ted_dunning
3. A Quick Review
   • What is classification?
     – goes-ins: predictors
     – goes-outs: target variable
   • What is classifiable data?
     – continuous, categorical, word-like, text-like
     – uniform schema
   • How do we convert from classifiable data to feature vectors?
4. Data Flow
   Not quite so simple
5. Classifiable Data
   • Continuous
     – A number that represents a quantity, not an id
     – Blood pressure, stock price, latitude, mass
   • Categorical
     – One of a known, small set (color, shape)
   • Word-like
     – One of a possibly unknown, possibly large set
   • Text-like
     – Many word-like things, usually unordered
6. But that isn’t quite there
   • Learning algorithms need feature vectors
     – Have to convert from data to vector
   • Can assign one location per feature
     – or category
     – or word
   • Can assign one or more locations with hashing
     – scary
     – but safe on average (see the sketch below)
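Slide 6 is the crux of the encoding story: instead of building a dictionary that assigns one fixed slot per feature, you can hash each feature’s name and value directly into a slot of a fixed-size vector. A minimal plain-Java sketch of the idea (class and method names here are hypothetical; Mahout ships real encoders in org.apache.mahout.vectorizer.encoders):

    import java.util.Arrays;

    public class HashedEncoder {
        private final double[] vector;

        public HashedEncoder(int size) {
            vector = new double[size];
        }

        // Hash a (feature-name, value) pair into one slot of the vector.
        public void add(String feature, String value, double weight) {
            int hash = (feature + ":" + value).hashCode();
            int slot = Math.floorMod(hash, vector.length);  // always non-negative
            vector[slot] += weight;
        }

        public double[] toArray() {
            return Arrays.copyOf(vector, vector.length);
        }

        public static void main(String[] args) {
            HashedEncoder enc = new HashedEncoder(10000);
            enc.add("color", "red", 1.0);     // categorical feature
            enc.add("word", "mahout", 1.0);   // word-like feature
            enc.add("latitude", "", 37.8);    // continuous value used as weight
            System.out.println(enc.toArray().length);
        }
    }

Because no dictionary has to be built, stored, or shipped with the model, this is what makes hashing “safe on average” even though collisions can occur.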
7. Data Flow
8. Classifiable Data → Vectors
9. Hashed Encoding
10. What about collisions?
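The deck answers the collision question live, but the standard mitigation, which Mahout’s hashed encoders also use, is to hash each feature into several independent slots (“probes”) so no single collision can erase a feature. A hypothetical overload for the HashedEncoder sketch above:

    // Multi-probe variant: each feature lands in several slots, so a
    // collision in any one slot is diluted by the others.
    public void add(String feature, String value, double weight, int probes) {
        for (int p = 0; p < probes; p++) {
            int hash = (feature + ":" + value + "#" + p).hashCode();
            vector[Math.floorMod(hash, vector.length)] += weight / probes;
        }
    }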
11. Let’s write some code (cue relaxing background music)
12. Generating new features
   • Sometimes the existing features are difficult to use
   • Restating the geometry using new reference points may help
   • Automatic reference points using k-means can be better than manual references (see the sketch below)
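One concrete reading of “automatic reference points using k-means”: cluster the training data, then describe each point by its distance to every centroid, handing the classifier k new continuous features. A hypothetical sketch:

    public class KMeansFeatures {
        // Encode a point as its Euclidean distances to the k cluster centers.
        public static double[] encode(double[] point, double[][] centroids) {
            double[] features = new double[centroids.length];
            for (int k = 0; k < centroids.length; k++) {
                double sum = 0;
                for (int d = 0; d < point.length; d++) {
                    double diff = point[d] - centroids[k][d];
                    sum += diff * diff;
                }
                features[k] = Math.sqrt(sum);  // distance to centroid k
            }
            return features;
        }
    }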
13. K-means using target
14. K-means features
15. More code! (cue relaxing background music)
16. Integration Issues
   • Feature extraction is ideal for map-reduce
     – Side data adds some complexity
   • Clustering works great with map-reduce
     – Cluster centroids to HDFS
   • Model training works better sequentially
     – Need centroids in normal files (see the copy sketch below)
   • Model deployment shouldn’t depend on HDFS
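The centroid hand-off in the last two bullets can be done with the standard Hadoop FileSystem API; a sketch with hypothetical paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FetchCentroids {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Pull the clustering output down to a normal local file so that
            // sequential training and deployment never touch HDFS.
            fs.copyToLocalFile(new Path("/clusters/part-r-00000"),
                               new Path("/tmp/centroids.seq"));
        }
    }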
17. Parallel Stochastic Gradient Descent
   [Diagram: input → train sub models → average models → model]
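The diagram’s pattern is easy to state in code: the input is split, a sub-model is trained on each split, and the sub-models are averaged into one model. A hypothetical sketch of the averaging step (sub-model training omitted):

    import java.util.List;

    public class ParallelSgd {
        // Average the weight vectors of independently trained sub-models.
        public static double[] average(List<double[]> subModels) {
            double[] avg = new double[subModels.get(0).length];
            for (double[] w : subModels) {
                for (int i = 0; i < avg.length; i++) {
                    avg[i] += w[i] / subModels.size();
                }
            }
            return avg;
        }
    }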
18. Variational Dirichlet Assignment
   [Diagram: input → gather sufficient statistics → update model → model]
19. Old tricks, new dogs
   • Mapper
     – Assign point to cluster
     – Emit cluster id, (1, point)
   • Combiner and reducer
     – Sum counts, weighted sum of points
     – Emit cluster id, (n, sum/n)
   • Output to HDFS
   [Diagram notes: centroids read from local disk, pulled from HDFS to local disk by the distributed cache; output written by map-reduce]
20. Old tricks, new dogs
   • Mapper
     – Assign point to cluster
     – Emit cluster id, (1, point)
   • Combiner and reducer
     – Sum counts, weighted sum of points
     – Emit cluster id, (n, sum/n)
   • Output to HDFS
   [Diagram notes: centroids read from NFS; output written by map-reduce to MapR FS]
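Slides 19 and 20 describe the same k-means iteration; only the source of the centroids changes (distributed cache vs. NFS). A hypothetical sketch of the mapper and combiner logic with the Hadoop plumbing stripped out:

    import java.util.List;

    public class KMeansStep {
        // Mapper: assign the point to the nearest centroid, then emit
        // (cluster id, (1, point)).
        public static int nearestCluster(double[] point, double[][] centroids) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int k = 0; k < centroids.length; k++) {
                double d = 0;
                for (int i = 0; i < point.length; i++) {
                    double diff = point[i] - centroids[k][i];
                    d += diff * diff;
                }
                if (d < bestDist) {
                    bestDist = d;
                    best = k;
                }
            }
            return best;
        }

        // Combiner/reducer: given the n points of one cluster, sum them and
        // emit (n, sum/n), i.e. the count and the mean.
        public static double[] mean(List<double[]> points) {
            double[] m = new double[points.get(0).length];
            for (double[] p : points) {
                for (int i = 0; i < m.length; i++) {
                    m[i] += p[i] / points.size();
                }
            }
            return m;
        }
    }

Emitting the count alongside the mean is what lets a reducer merge partial results from several combiners without losing the correct weighting.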
21. Modeling architecture
   [Diagram: input → feature extraction and down sampling (map-reduce) → data join → sequential SGD learning; side-data now supplied via NFS]