Oscon data-2011-ted-dunning

These are the slides for my half of the OSCON Mahout tutorial.


Published in: Technology, Education
Transcript

  • 1. Hands-on Classification
  • 2. Preliminaries
      • Code is available from github:
        – git@github.com:tdunning/Chapter-16.git
      • EC2 instances available
      • Thumb drives also available
      • Email to ted.dunning@gmail.com
      • Twitter @ted_dunning
  • 3. A Quick Review
      • What is classification?
        – goes-ins: predictors
        – goes-outs: target variable
      • What is classifiable data?
        – continuous, categorical, word-like, text-like
        – uniform schema
      • How do we convert from classifiable data to feature vectors?
  • 4. Data Flow (not quite so simple)
  • 5. Classifiable Data
      • Continuous
        – A number that represents a quantity, not an id
        – Blood pressure, stock price, latitude, mass
      • Categorical
        – One of a known, small set (color, shape)
      • Word-like
        – One of a possibly unknown, possibly large set
      • Text-like
        – Many word-like things, usually unordered
  • 6. But that isn’t quite there
      • Learning algorithms need feature vectors
        – Have to convert from data to vector
      • Can assign one location per feature
        – or category
        – or word
      • Can assign one or more locations with hashing
        – scary
        – but safe on average
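The first option on the slide above, one vector location per known category, can be sketched in a few lines. This is an illustrative Python sketch, not code from the tutorial’s github repository; the function name and vocabulary are made up for the example:

```python
def one_hot(value, vocabulary):
    """Assign one vector location per known category: the slot for
    `value` gets 1.0, every other slot stays 0.0."""
    vec = [0.0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1.0
    return vec

colors = ["red", "green", "blue"]
encoded = one_hot("green", colors)  # [0.0, 1.0, 0.0]
```

This only works when the set of categories is known and small, which is exactly why the slide moves on to hashing for word-like data.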
  • 7. Data Flow
  • 8. Classifiable Data Vectors
  • 9. Hashed Encoding
  • 10. What about collisions?
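Collisions are the usual worry with hashed encodings. One common mitigation, probing each feature into several locations so that any single collision only perturbs part of its weight, can be sketched as follows. This is an illustrative Python sketch, not the tutorial’s Java/Mahout code; the function name and the small dimension are made up for the example:

```python
import hashlib

def hashed_vector(features, dim=20, probes=2):
    """Hash word-like features into a fixed-size vector.

    Each feature is spread over `probes` pseudo-random locations, so a
    single collision corrupts only 1/probes of its weight; on average
    the damage washes out ("scary, but safe on average").
    """
    vec = [0.0] * dim
    for feat in features:
        for p in range(probes):
            # Derive a distinct location for each probe of each feature.
            h = int(hashlib.md5(f"{feat}:{p}".encode()).hexdigest(), 16)
            vec[h % dim] += 1.0 / probes
    return vec

v = hashed_vector(["red", "circle", "red"])
```

Note that the total weight in the vector always equals the number of input features, no matter how many collisions occur; collisions redistribute weight rather than destroy it.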
  • 11. Let’s write some code (cue relaxing background music)
  • 12. Generating new features
      • Sometimes the existing features are difficult to use
      • Restating the geometry using new reference points may help
      • Automatic reference points using k-means can be better than manual references
  • 13. K-means using target
  • 14. K-means features
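The idea of the preceding slides, using k-means centroids as automatic reference points and restating each point’s geometry as distances to them, can be sketched as follows. This is an illustrative pure-Python sketch (the tutorial’s actual code lives in the github repository above); the tiny dataset and k=2 are hypothetical:

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=10, seed=42):
    """Tiny Lloyd-style k-means; returns k centroids."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # recompute each centroid as its cluster mean
                centroids[i] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids

def distance_features(point, centroids):
    """Restate a point as its distances to the learned reference points."""
    return [dist(point, c) for c in centroids]

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centroids = kmeans(points, k=2)
feats = distance_features((0.05, 0.1), centroids)
```

The new feature vector has one entry per centroid; points in the same region get similar distance profiles even when the raw coordinates were awkward for a linear model.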
  • 15. More code! (cue relaxing background music)
  • 16. Integration Issues
      • Feature extraction is ideal for map-reduce
        – Side data adds some complexity
      • Clustering works great with map-reduce
        – Cluster centroids to HDFS
      • Model training works better sequentially
        – Need centroids in normal files
      • Model deployment shouldn’t depend on HDFS
  • 17. Parallel Stochastic Gradient Descent (diagram: Input → train sub models → average sub models → Model)
  • 18. Variational Dirichlet Assignment (diagram: Input → gather sufficient statistics → update model → Model)
  • 19. Old tricks, new dogs
      • Mapper (centroids read from HDFS to local disk by the distributed cache; mapper reads from local disk)
        – Assign point to cluster
        – Emit cluster id, (1, point)
      • Combiner and reducer
        – Sum counts, weighted sum of points
        – Emit cluster id, (n, sum/n)
      • Output to HDFS (written by map-reduce)
  • 20. Old tricks, new dogs
      • Mapper (reads from NFS)
        – Assign point to cluster
        – Emit cluster id, 1, point
      • Combiner and reducer
        – Sum counts, weighted sum of points
        – Emit cluster id, n, sum/n
      • Output to HDFS / MapR FS (written by map-reduce)
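The map and combine steps on the two slides above can be sketched outside Hadoop as plain functions. This is an illustrative Python sketch of one k-means iteration’s map and combine logic, not the tutorial’s Java code; the function names and the tiny dataset are made up for the example:

```python
from collections import defaultdict

def mapper(points, centroids):
    """Map step: assign each point to its nearest centroid and
    emit (cluster id, (1, point))."""
    for p in points:
        cid = min(range(len(centroids)),
                  key=lambda i: sum((a - b) ** 2
                                    for a, b in zip(p, centroids[i])))
        yield cid, (1, p)

def combine(pairs):
    """Combine/reduce step: sum counts and coordinate-wise sums,
    then emit (cluster id, (n, sum/n)) per cluster."""
    counts = defaultdict(int)
    sums = {}
    for cid, (n, p) in pairs:
        counts[cid] += n
        if cid not in sums:
            sums[cid] = list(p)
        else:
            sums[cid] = [s + x for s, x in zip(sums[cid], p)]
    return {cid: (n, tuple(s / n for s in sums[cid]))
            for cid, n in counts.items()}

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
summary = combine(mapper(points, centroids=[(0.0, 0.0), (5.0, 5.0)]))
```

Emitting (count, sum) rather than a mean is what makes the combiner associative, so partial results from any subset of mappers can be merged again at the reducer.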
  • 21. Modeling architecture (diagram: Input → feature extraction and down sampling (map-reduce) → data join → sequential SGD learning; side-data now via NFS)
