Hands-on Classification
Preliminaries
• Code is available from github:
  – git@github.com:tdunning/Chapter-16.git
• EC2 instances available
• Thumb drives also available
• Email to ted.dunning@gmail.com
• Twitter @ted_dunning
A Quick Review
• What is classification?
  – goes-ins: predictors
  – goes-outs: target variable
• What is classifiable data?
  – continuous, categorical, word-like, text-like
  – uniform schema
• How do we convert from classifiable data to a feature vector?
Data Flow
[diagram; caption: not quite so simple]
Classifiable Data
• Continuous
  – A number that represents a quantity, not an id
  – Blood pressure, stock price, latitude, mass
• Categorical
  – One of a known, small set (color, shape)
• Word-like
  – One of a possibly unknown, possibly large set
• Text-like
  – Many word-like things, usually unordered
But that isn’t quite there
• Learning algorithms need feature vectors
  – Have to convert from data to vector
• Can assign one location per feature
  – or category
  – or word
• Can assign one or more locations with hashing
  – scary
  – but safe on average (see the sketch after the next slides)
Data Flow
Classifiable Data Vectors
Hashed Encoding
What about collisions?
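
Collisions are the worry with hashed encoding: two different features can land on the same vector location. Probing several locations per feature keeps any single collision from destroying a feature's signal, which is why hashing is "scary but safe on average". Below is a minimal, self-contained Java sketch of the idea; the class and method names are hypothetical and this is not the Mahout FeatureVectorEncoder API that the talk's examples use.

    import java.util.Arrays;

    // Hypothetical sketch of hashed feature encoding; not the
    // Mahout encoder classes used in the Chapter-16 code.
    public class HashedEncoder {
        private final double[] vector;  // feature vector under construction
        private final int probes;       // hashed locations used per feature

        public HashedEncoder(int size, int probes) {
            this.vector = new double[size];
            this.probes = probes;
        }

        // Hash (name, value, probe) to an index and add the weight there.
        // With several probes, a collision at one index only perturbs
        // part of a feature's weight.
        public void add(String name, String value, double weight) {
            for (int probe = 0; probe < probes; probe++) {
                int h = (name + ":" + value + ":" + probe).hashCode();
                int index = Math.floorMod(h, vector.length);
                vector[index] += weight;
            }
        }

        public double[] vector() {
            return vector;
        }

        public static void main(String[] args) {
            HashedEncoder enc = new HashedEncoder(100, 2);
            enc.add("color", "red", 1);   // categorical value
            enc.add("mass", "", 70.5);    // continuous: weight carries the value
            enc.add("note", "chest", 1);  // word-like tokens from text
            enc.add("note", "pain", 1);
            System.out.println(Arrays.toString(enc.vector()));
        }
    }

The same encoder handles all four data types from the earlier slide: categorical and word-like values hash their string form, continuous values ride along as the weight, and text-like data is just many word-like additions.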
Let’s write some code
(cue relaxing background music)
Generating new features
• Sometimes the existing features are difficult to use
• Restating the geometry using new reference points may help
• Automatic reference points using k-means can be better than manual references (see the sketch below)
K-means using target
K-means features
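
A minimal sketch of what the k-means-features slides describe: use each point's distance to the learned centroids as a new block of features. The class and method names here are hypothetical; the talk's real code lives in the Chapter-16 repository.

    // Hypothetical sketch: new features from k-means reference points.
    public class KMeansFeatures {
        // Euclidean distance between a point and a centroid.
        static double distance(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                sum += d * d;
            }
            return Math.sqrt(sum);
        }

        // One new feature per centroid: the point's distance to it.
        // This restates the geometry around automatic reference points,
        // which can expose structure a linear model could not see before.
        static double[] features(double[] point, double[][] centroids) {
            double[] f = new double[centroids.length];
            for (int k = 0; k < centroids.length; k++) {
                f[k] = distance(point, centroids[k]);
            }
            return f;
        }
    }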
More code!
(cue relaxing background music)
Integration Issues
• Feature extraction is ideal for map-reduce
  – Side data adds some complexity
• Clustering works great with map-reduce
  – Cluster centroids to HDFS
• Model training works better sequentially
  – Need centroids in normal files
• Model deployment shouldn’t depend on HDFS
Parallel Stochastic Gradient Descent
[diagram: input → train sub-models in parallel → average models → model]

Variational Dirichlet Assignment
[diagram: input → gather sufficient statistics → update model → model]
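
The averaging step in the parallel SGD diagram above is simple enough to show inline. This is a minimal sketch assuming each sub-model is a plain weight array from a linear classifier; the method name and representation are hypothetical, not from the talk's code.

    // Hypothetical sketch: combine sub-models trained in parallel by
    // averaging their weight vectors, as in the parallel SGD diagram.
    static double[] averageModels(double[][] subModelWeights) {
        double[] avg = new double[subModelWeights[0].length];
        for (double[] w : subModelWeights) {
            for (int i = 0; i < avg.length; i++) {
                avg[i] += w[i] / subModelWeights.length;
            }
        }
        return avg;
    }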
Old tricks, new dogs
• Mapper
  – Assign point to cluster
  – Emit cluster id, (1, point)
• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, (n, sum/n)
• Output to HDFS
[diagram: centroids written by map-reduce to HDFS, copied to local disk by the distributed cache, then read from local disk]
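
In Hadoop terms, the mapper/combiner/reducer above might look like the following sketch. The class names and the textual "count,point" value encoding are hypothetical, not the talk's actual code; the combiner and reducer share one implementation because the count-weighted-mean update is associative.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical sketch of one k-means step; values travel as
    // Text in the form "count,x1 x2 x3 ...".
    public class KMeansStep {
        static double[] parse(String s) {
            String[] tok = s.trim().split("\\s+");
            double[] p = new double[tok.length];
            for (int i = 0; i < tok.length; i++) {
                p[i] = Double.parseDouble(tok[i]);
            }
            return p;
        }

        public static class AssignMapper
                extends Mapper<LongWritable, Text, IntWritable, Text> {
            private double[][] centroids;

            @Override
            protected void setup(Context context) {
                // Side data: the HDFS variant reads centroids from the
                // distributed cache here; stubbed for this sketch.
                centroids = new double[][] {{0, 0}, {1, 1}};
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                double[] point = parse(value.toString());
                // Assign the point to the nearest centroid.
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int k = 0; k < centroids.length; k++) {
                    double d = 0;
                    for (int i = 0; i < point.length; i++) {
                        double diff = point[i] - centroids[k][i];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = k; }
                }
                // Emit cluster id, (1, point).
                context.write(new IntWritable(best), new Text("1," + value));
            }
        }

        // Usable as both combiner and reducer: sums counts and
        // count-weighted points, then emits (n, sum/n).
        public static class MeanReducer
                extends Reducer<IntWritable, Text, IntWritable, Text> {
            @Override
            protected void reduce(IntWritable key, Iterable<Text> values,
                                  Context context)
                    throws IOException, InterruptedException {
                long n = 0;
                double[] sum = null;
                for (Text v : values) {
                    String[] parts = v.toString().split(",", 2);
                    long count = Long.parseLong(parts[0]);
                    double[] mean = parse(parts[1]);
                    if (sum == null) {
                        sum = new double[mean.length];
                    }
                    // Weight each incoming mean by its count so combiner
                    // output can be re-combined correctly.
                    for (int i = 0; i < mean.length; i++) {
                        sum[i] += count * mean[i];
                    }
                    n += count;
                }
                StringBuilder out = new StringBuilder(n + ",");
                for (double s : sum) {
                    out.append(s / n).append(' ');
                }
                context.write(key, new Text(out.toString().trim()));
            }
        }
    }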
Old tricks, new dogs
• Mapper
  – Assign point to cluster
  – Emit cluster id, 1, point
• Combiner and reducer
  – Sum counts, weighted sum of points
  – Emit cluster id, n, sum/n
• Output to HDFS
[diagram: MapR FS; centroids written by map-reduce, read back directly over NFS]
Modeling architecture
[diagram: input → feature extraction and down-sampling (map-reduce), joined with side-data → sequential SGD learning; the hand-off is now via NFS]