Oscon data-2011-ted-dunning

These are the slides for my half of the OSCON Mahout tutorial.

Transcript

  • 1. Hands-on Classification
  • 2. Preliminaries
      • Code is available from github:
        – git@github.com:tdunning/Chapter-16.git
      • EC2 instances available
      • Thumb drives also available
      • Email to ted.dunning@gmail.com
      • Twitter @ted_dunning
  • 3. A Quick Review
      • What is classification?
        – goes-ins: predictors
        – goes-outs: target variable
      • What is classifiable data?
        – continuous, categorical, word-like, text-like
        – uniform schema
      • How do we convert from classifiable data to a feature vector?
  • 4. Data Flow (not quite so simple)
  • 5. Classifiable Data
      • Continuous
        – A number that represents a quantity, not an id
        – Blood pressure, stock price, latitude, mass
      • Categorical
        – One of a known, small set (color, shape)
      • Word-like
        – One of a possibly unknown, possibly large set
      • Text-like
        – Many word-like things, usually unordered
  • 6. But that isn’t quite there
      • Learning algorithms need feature vectors
        – Have to convert from data to vector
      • Can assign one location per feature
        – or category
        – or word
      • Can assign one or more locations with hashing
        – scary
        – but safe on average
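A minimal sketch of the hashed-encoding idea from the slide above, in plain Java; the class and method names are illustrative stand-ins, not Mahout's encoder API or the tutorial's Chapter-16 code.

```java
/** Illustrative hashed encoder: each (field, value) pair is hashed into a
 *  fixed-size array instead of getting its own dedicated slot. */
public class HashedEncoder {
  private final int numSlots;                 // fixed vector size, e.g. 2^14

  HashedEncoder(int numSlots) { this.numSlots = numSlots; }

  /** Add a categorical or word-like feature at a hashed location. */
  void addWord(double[] vector, String field, String word) {
    int slot = Math.floorMod((field + ":" + word).hashCode(), numSlots);
    vector[slot] += 1;                        // colliding features simply add together
  }

  /** Add a continuous feature; its value becomes the weight at the hashed slot. */
  void addContinuous(double[] vector, String field, double value) {
    int slot = Math.floorMod(field.hashCode(), numSlots);
    vector[slot] += value;
  }

  public static void main(String[] args) {
    HashedEncoder enc = new HashedEncoder(1 << 14);
    double[] v = new double[1 << 14];
    enc.addContinuous(v, "blood-pressure", 120.0);   // continuous
    enc.addWord(v, "color", "red");                  // categorical / word-like
    for (String w : "many word like things".split(" ")) {
      enc.addWord(v, "text", w);                     // text-like: many word-like features
    }
  }
}
```

The fixed vector size is what makes this "safe on average": no dictionary has to be built ahead of time, and the cost of a collision is bounded by how crowded the vector is.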
  • 7. Data Flow
  • 8. Classifiable Data Vectors
  • 9. Hashed Encoding
  • 10. What about collisions?
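The slide poses the collision question without spelling out an answer in the transcript. The usual mitigation in hashed encodings, and the approach I believe Mahout's encoders take with a configurable number of probes, is to spread each feature over several hashed locations so that any single collision only perturbs part of that feature's weight. The sketch below is illustrative, not taken from the tutorial code.

```java
/** Illustrative multi-probe hashing: each feature is spread over several slots,
 *  so a collision in one slot only distorts a fraction of the feature's weight. */
public class MultiProbeEncoder {
  private final int numSlots;
  private final int probes;

  MultiProbeEncoder(int numSlots, int probes) {
    this.numSlots = numSlots;
    this.probes = probes;
  }

  void addWord(double[] vector, String field, String word) {
    for (int probe = 0; probe < probes; probe++) {
      // vary the hash per probe by folding the probe index into the key
      int slot = Math.floorMod((field + ":" + word + "#" + probe).hashCode(), numSlots);
      vector[slot] += 1.0 / probes;           // total weight per feature still sums to 1
    }
  }
}
```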
  • 11. Let’s write some code (cue relaxing background music)
  • 12. Generating new features
      • Sometimes the existing features are difficult to use
      • Restating the geometry using new reference points may help
      • Automatic reference points using k-means can be better than manual references
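A rough sketch, in plain Java, of what the k-means derived features from this slide could look like: the new features are the distances from a point to each learned centroid, restating the geometry relative to automatically chosen reference points. The centroid array is a stand-in for whatever the clustering step actually produced.

```java
/** Turn a raw feature vector into k new features: one distance per k-means centroid. */
public class KMeansFeatures {
  static double[] clusterDistanceFeatures(double[] point, double[][] centroids) {
    double[] features = new double[centroids.length];
    for (int k = 0; k < centroids.length; k++) {
      double sum = 0;
      for (int j = 0; j < point.length; j++) {
        double d = point[j] - centroids[k][j];
        sum += d * d;
      }
      features[k] = Math.sqrt(sum);           // Euclidean distance to centroid k
    }
    return features;
  }
}
```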
  • 13. K-means using target
  • 14. K-means features
  • 15. More code! (cue relaxing background music)
  • 16. Integration Issues
      • Feature extraction is ideal for map-reduce
        – Side data adds some complexity
      • Clustering works great with map-reduce
        – Cluster centroids to HDFS
      • Model training works better sequentially
        – Need centroids in normal files
      • Model deployment shouldn’t depend on HDFS
  • 17. Parallel Stochastic Gradient Descent
      (diagram: input → train sub models → average → model)
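The diagram describes training independent SGD sub-models on partitions of the input and averaging them into one model. A toy sketch of the averaging step, assuming each sub-model reduces to a flat weight vector of the same length:

```java
/** Average the weight vectors of independently trained SGD sub-models. */
public class ModelAverager {
  static double[] average(double[][] subModelWeights) {
    int n = subModelWeights.length;
    double[] avg = new double[subModelWeights[0].length];
    for (double[] w : subModelWeights) {
      for (int j = 0; j < w.length; j++) {
        avg[j] += w[j] / n;
      }
    }
    return avg;
  }
}
```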
  • 18. Variational Dirichlet Assignment
      (diagram: input → gather sufficient statistics → update model → model)
  • 19. Old tricks, new dogs
      • Mapper (reads from local disk, copied from HDFS by the distributed cache)
        – Assign point to cluster
        – Emit cluster id, (1, point)
      • Combiner and reducer
        – Sum counts, weighted sum of points
        – Emit cluster id, (n, sum/n)
      • Output to HDFS (written by map-reduce)
  • 20. Old tricks, new dogs
      • Mapper (reads from NFS)
        – Assign point to cluster
        – Emit cluster id, 1, point
      • Combiner and reducer
        – Sum counts, weighted sum of points
        – Emit cluster id, n, sum/n
      • Output to HDFS (MapR FS, written by map-reduce)
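Slides 19 and 20 describe the same k-means iteration as map-reduce, differing only in how the centroids reach the mappers: the mapper assigns each point to its nearest centroid and emits (cluster id, 1, point); the combiner and reducer sum the counts and the weighted points and emit (cluster id, n, sum/n) as the new centroid. Below is a schematic sketch of that logic without the Hadoop plumbing; the class and method names are placeholders, not the tutorial's code.

```java
/** Schematic k-means iteration as it would be split across map and reduce. */
public class KMeansStep {
  /** Map side: find the nearest centroid, then emit (clusterId, 1, point). */
  static int nearestCluster(double[] point, double[][] centroids) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int k = 0; k < centroids.length; k++) {
      double sum = 0;
      for (int j = 0; j < point.length; j++) {
        double d = point[j] - centroids[k][j];
        sum += d * d;
      }
      if (sum < bestDist) { bestDist = sum; best = k; }
    }
    return best;
  }

  /** Combine/reduce side: given the summed count n and summed points,
   *  the new centroid is sum / n, emitted as (clusterId, n, sum/n). */
  static double[] newCentroid(int n, double[] summedPoints) {
    double[] centroid = new double[summedPoints.length];
    for (int j = 0; j < summedPoints.length; j++) {
      centroid[j] = summedPoints[j] / n;
    }
    return centroid;
  }
}
```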
  • 21. Modeling architecture
      (diagram: input data → feature extraction, join and down-sampling (map-reduce) → sequential SGD learning; side-data now via NFS)