OSCON Data 2011: Ted Dunning


These are the slides for my half of the OSCON Mahout tutorial.



  • 1. Hands-on Classification
  • 2. Preliminaries
      • Code is available from github:
      • EC2 instances available
      • Thumb drives also available
      • Email to
      • Twitter @ted_dunning
  • 3. A Quick Review
      • What is classification?
        – goes-ins: predictors
        – goes-outs: target variable
      • What is classifiable data?
        – continuous, categorical, word-like, text-like
        – uniform schema
      • How do we convert from classifiable data to a feature vector?
  • 4. Data Flow
      Not quite so simple
  • 5. Classifiable Data
      • Continuous
        – A number that represents a quantity, not an id
        – Blood pressure, stock price, latitude, mass
      • Categorical
        – One of a known, small set (color, shape)
      • Word-like
        – One of a possibly unknown, possibly large set
      • Text-like
        – Many word-like things, usually unordered
  • 6. But that isn’t quite there
      • Learning algorithms need feature vectors
        – Have to convert from data to vector
      • Can assign one location per feature
        – or category
        – or word
      • Can assign one or more locations with hashing
        – scary
        – but safe on average
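      A minimal sketch of this conversion using Mahout's feature-vector encoders, the machinery the tutorial's github code builds on; the field names and the 100-element vector here are illustrative only:

      import org.apache.mahout.math.RandomAccessSparseVector;
      import org.apache.mahout.math.Vector;
      import org.apache.mahout.vectorizer.encoders.ContinuousValueEncoder;
      import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
      import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

      public class EncodeExample {
        public static void main(String[] args) {
          // One encoder per field; the field name salts the hash, so "red"
          // as a color and "red" as a word land in different locations.
          FeatureVectorEncoder color = new StaticWordValueEncoder("color"); // categorical
          FeatureVectorEncoder mass = new ContinuousValueEncoder("mass");   // continuous

          Vector v = new RandomAccessSparseVector(100);
          color.addToVector("red", v);              // hash the category into the vector
          mass.addToVector((String) null, 42.0, v); // store the value at the field's slot
          System.out.println(v);
        }
      }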
  • 7. Data Flow
  • 8. Classifiable Data Vectors
  • 9. Hashed Encoding
  • 10. What about collisions?
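      One standard answer, and the knob Mahout's encoders expose for it, is multiple probes: hash each feature into several locations so a single collision corrupts only part of a feature's weight. A hedged sketch, assuming the 0.5-era setProbes API and an illustrative vector size:

      import org.apache.mahout.math.RandomAccessSparseVector;
      import org.apache.mahout.math.Vector;
      import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
      import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

      public class ProbesExample {
        public static void main(String[] args) {
          FeatureVectorEncoder words = new StaticWordValueEncoder("word");
          words.setProbes(2);                    // two hashed locations per feature
          Vector v = new RandomAccessSparseVector(10000);
          words.addToVector("hadoop", v);        // now touches two slots
          System.out.println(v.getNumNondefaultElements());
        }
      }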
  • 11. Let’s write some code (cue relaxing background music)
  • 12. Generating new features
      • Sometimes the existing features are difficult to use
      • Restating the geometry using new reference points may help
      • Automatic reference points using k-means can be better than manual references
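      A sketch of the idea rather than the tutorial's actual code: append each point's distance to every k-means centroid as a new feature, so downstream learners see the geometry restated relative to the reference points. The helper name below is hypothetical:

      import java.util.List;
      import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
      import org.apache.mahout.math.DenseVector;
      import org.apache.mahout.math.Vector;

      public class CentroidFeatures {
        // Copy the original features, then add one distance-to-centroid
        // feature per cluster found by k-means.
        static Vector withCentroidFeatures(Vector point, List<Vector> centroids) {
          EuclideanDistanceMeasure measure = new EuclideanDistanceMeasure();
          Vector out = new DenseVector(point.size() + centroids.size());
          out.viewPart(0, point.size()).assign(point);
          for (int i = 0; i < centroids.size(); i++) {
            out.set(point.size() + i, measure.distance(point, centroids.get(i)));
          }
          return out;
        }
      }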
  • 13. K-means using target
  • 14. K-means features
  • 15. More code! (cue relaxing background music)
  • 16. Integration Issues
      • Feature extraction is ideal for map-reduce
        – Side data adds some complexity
      • Clustering works great with map-reduce
        – Cluster centroids to HDFS
      • Model training works better sequentially
        – Need centroids in normal files
      • Model deployment shouldn’t depend on HDFS
  • 17. Parallel Stochastic Gradient Descent
      Diagram: Input → Train sub models → Average models → Model
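      A hedged sketch of the "average models" step in the diagram: each worker trains a sub-model by SGD on its share of the input, and the sub-models' weight vectors are then averaged into a single model. Plain double[] arrays stand in for whatever model representation is actually used:

      import java.util.List;

      public class ModelAverager {
        // Average the weight vectors of independently trained sub-models.
        static double[] average(List<double[]> subModels) {
          double[] avg = new double[subModels.get(0).length];
          for (double[] w : subModels) {
            for (int i = 0; i < avg.length; i++) {
              avg[i] += w[i] / subModels.size();
            }
          }
          return avg;
        }
      }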
  • 18. Variational Dirichlet Assignment
      Diagram: Input → Gather sufficient statistics → Update model → Model
  • 19. Old tricks, new dogs
      • Mapper
        – Assign point to cluster
        – Emit cluster id, (1, point)
      • Combiner and reducer
        – Sum counts, weighted sum of points
        – Emit cluster id, (n, sum/n)
      • Output to HDFS
      Diagram annotations: read from local disk via the distributed cache; read from HDFS to local disk by the distributed cache; output written by map-reduce
  • 20. Old tricks, new dogs
      • Mapper
        – Assign point to cluster
        – Emit cluster id, 1, point
      • Combiner and reducer
        – Sum counts, weighted sum of points
        – Emit cluster id, n, sum/n
      • Output to HDFS
      Diagram annotations: read from NFS; written by map-reduce to MapR FS
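      Why a combiner is safe here: per-cluster partial results of the form (n, mean) merge associatively, so partial means can be folded together in any order before the reducer sees them. A sketch of the merge (the caller keeps the new count n1 + n2):

      public class MeanCombiner {
        // Merge two (count, mean) summaries of points assigned to the
        // same cluster into the mean of the combined set.
        static double[] combineMeans(long n1, double[] m1, long n2, double[] m2) {
          long n = n1 + n2;
          double[] mean = new double[m1.length];
          for (int i = 0; i < m1.length; i++) {
            mean[i] = (n1 * m1[i] + n2 * m2[i]) / n;
          }
          return mean;
        }
      }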
  • 21. Modeling architecture
      Diagram: Input data → Feature extraction, join, and down sampling (map-reduce) → Sequential SGD learning; side-data now via NFS