Data mining 2011 09
Talk given in September 2011 by Ted Dunning to the Bay Area data mining group. The basic idea is that integrating map-reduce programs with the real world is easier than ever.
Transcript of "Data mining 2011 09"

  1. Data-mining, Hadoop and the Single Node
  2. Map-Reduce
      [diagram: Input → Shuffle → Output]
  3. MapR's Streaming Performance
      [bar charts: read/write throughput in MB per sec for raw hardware, MapR, and Hadoop, on 11 x 7200rpm SATA and on 11 x 15Krpm SAS disks; higher is better]
      Tests: (i) 16 streams x 120GB, (ii) 2000 streams x 1GB
  4. Terasort on MapR
      [bar charts: elapsed time (mins) for MapR vs Hadoop on 1.0 TB and 3.5 TB sorts; lower is better]
      10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm
  5. Data Flow Expected Volumes
      [diagram: flows between Node and Storage]
      6 x 1Gb/s = 600 MB/s (network); 12 x 100MB/s = 900 MB/s (disk)
  6. MUCH faster for some operations
      [chart: file create rate vs # of files (millions), same 10 nodes]
  7. Universal export to self
      [diagram: each Cluster Node runs an NFS Server; Tasks on the node mount the cluster file system through it]
  8. Nodes are identical
      [diagram: three Cluster Nodes, each with its own NFS Server and Tasks]
  9. Sharded text indexing
      [diagram: Map assigns documents to shards → Reducer indexes text to local disk and then copies the index to clustered index storage → Search Engine copies the index back to local disk, typically required before it can be loaded]
  10. Conventional data flow
      [same diagram as slide 9]
      Failure of a reducer leaves garbage on the local disk. Failure of a search engine requires another download of the index from clustered storage.
  11. Simplified NFS data flows
      [diagram: Map → Reducer indexes to its task work directory via NFS → clustered index storage → Search Engine]
      Failure of a reducer is cleaned up by the map-reduce framework. The search engine reads the mirrored index directly.
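      The simplification in slide 11 is just a path change: because the cluster file system is NFS-mounted, the reducer can open its index directly on clustered storage instead of building it on local disk and copying it up. A minimal sketch, assuming a Lucene-based search engine and a hypothetical /mapr mount point (neither is specified in the talk):

        import java.io.File;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;

        public class NfsIndexing {
          public static void main(String[] args) throws Exception {
            // Hypothetical NFS view of clustered index storage; the conventional
            // flow would build the index on local disk and copy it up afterwards.
            File shardDir = new File("/mapr/my.cluster.com/indexes/shard-3");
            IndexWriter writer = new IndexWriter(
                FSDirectory.open(shardDir),
                new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
            // ... a reducer would add documents here ...
            writer.close();  // index is already on clustered storage; no copy step
          }
        }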
  12. K-means, the movie
      [diagram: Input → assign each point to the nearest centroid → aggregate new centroids → updated Centroids feed back into assignment]
  13. But …
  14. Parallel Stochastic Gradient Descent
      [diagram: Input → train sub-models in parallel → average models → Model]
  15. Variational Dirichlet Assignment
      [diagram: Input → gather sufficient statistics → update model → Model]
  16. Old tricks, new dogs
      • Mapper
        – Assign point to cluster
        – Emit cluster id, (1, point)
      • Combiner and reducer
        – Sum counts, weighted sum of points
        – Emit cluster id, (n, sum/n)
      • Output to HDFS
      [annotations: centroids are read from HDFS to local disk by the distributed cache and read from local disk by the mapper; output is written by map-reduce]
  17. Old tricks, new dogs
      • Mapper
        – Assign point to cluster
        – Emit cluster id, (1, point)
      • Combiner and reducer
        – Sum counts, weighted sum of points
        – Emit cluster id, (n, sum/n)
      • Output to HDFS
      [annotations: with MapR FS, centroids are read directly from NFS; output is written by map-reduce; see the sketch below]
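      A rough sketch of the mapper these two slides describe, assuming Hadoop's org.apache.hadoop.mapreduce API. The centroid loading, point parsing, and the (count, point) writable are hypothetical stand-ins; only the assign/emit shape comes from the slides.

        import java.io.IOException;
        import java.util.List;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, WeightedPointWritable> {
          private List<double[]> centroids;

          @Override
          protected void setup(Context ctx) {
            // Slide 16: centroids come from local disk via the distributed cache.
            // Slide 17: read them directly from MapR FS over NFS instead.
            centroids = loadCentroids(ctx.getConfiguration());  // hypothetical helper
          }

          @Override
          protected void map(LongWritable key, Text value, Context ctx)
              throws IOException, InterruptedException {
            double[] point = parsePoint(value.toString());      // hypothetical parser
            // Emit cluster id, (1, point); WeightedPointWritable is a hypothetical writable.
            ctx.write(new IntWritable(nearest(point)), new WeightedPointWritable(1, point));
          }

          private int nearest(double[] p) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < centroids.size(); i++) {
              double d = squaredDistance(p, centroids.get(i));  // hypothetical helper
              if (d < bestDist) { bestDist = d; best = i; }
            }
            return best;
          }
        }
        // The combiner and reducer sum the counts and the weighted sums of points
        // per cluster id, then emit (cluster id, (n, sum/n)).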
  18. Poor man's Pregel
      • Mapper (pseudocode below)
      • Lines in bold can use conventional I/O via NFS

        while not done:
            read and accumulate input models
            for each input:
                accumulate model
            write model
            synchronize
            reset input format
        emit summary
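      The same loop rendered in Java, as a sketch only; every helper below is hypothetical, and the NFS-backed calls stand in for the bold lines on the slide.

        // Iterative ("Pregel-ish") mapper; all helpers and types are hypothetical.
        public void run() throws java.io.IOException {
          Model model = new Model();                                   // hypothetical model type
          while (!done(model)) {
            for (Model m : readInputModels("/mapr/cluster/models")) {  // conventional I/O via NFS
              model.accumulate(m);                                     // merge models from other mappers
            }
            for (Record r : inputRecords()) {                          // one pass over this mapper's split
              model.accumulate(r);
            }
            writeModel(model, "/mapr/cluster/models");                 // conventional I/O via NFS
            synchronizeWithPeers();                                    // barrier: wait for the other mappers
            resetInputFormat();                                        // rewind input for the next iteration
          }
          emitSummary(model);                                          // ordinary map-reduce output
        }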
  19. Mahout
      • Scalable Data Mining for Everybody
  20. What is Mahout
      • Recommendations (people who x this also x that)
      • Clustering (segment data into groups of)
      • Classification (learn decision making from examples)
      • Stuff (LDA, SVD, frequent item-set, math)
  21. What is Mahout?
      • Recommendations (people who x this also x that)
      • Clustering (segment data into groups of)
      • Classification (learn decision making from examples)
      • Stuff (LDA, SVM, frequent item-set, math)
  22. Classification in Detail
      • Naive Bayes Family
        – Hadoop based training
      • Decision Forests
        – Hadoop based training
      • Logistic Regression (aka SGD)
        – fast on-line (sequential) training
  23. Classification in Detail
      • Naive Bayes Family
        – Hadoop based training
      • Decision Forests
        – Hadoop based training
      • Logistic Regression (aka SGD)
        – fast on-line (sequential) training
  24. So What?
      • Online training has low overhead for small and moderate size data-sets
      [chart annotation: "big starts here"]
  25. An Example
  26. And Another
      From: Dr. Paul Acquah
      Dear Sir,
      Re: Proposal for over-invoice Contract Benevolence
      Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor.
      ...
      Date: Thu, May 20, 2010 at 10:51 AM
      From: George <george@fumble-tech.com>
      Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
  27. Mahout's SGD
      • Learns on-line, per example
        – O(1) memory
        – O(1) time per training example
      • Sequential implementation
        – fast, but not parallel
  28. Special Features
      • Hashed feature encoding
      • Per-term annealing
        – learn the boring stuff once
      • Auto-magical learning knob turning
        – learns correct learning rate, learns correct learning rate for learning learning rate, ...
  29. Feature Encoding
  30. Hashed Encoding
  31. Feature Collisions
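      Slides 29-31 are diagrams, but the idea maps directly onto Mahout's encoder classes: each feature is hashed into a fixed-size vector, so no dictionary pass is needed, and multiple probes soften the effect of hash collisions. A minimal sketch; the vector size and probe count are illustrative choices, not values from the talk.

        import java.util.Arrays;
        import org.apache.mahout.math.RandomAccessSparseVector;
        import org.apache.mahout.math.Vector;
        import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
        import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

        public class EncodingSketch {
          public static Vector encode(Iterable<String> words) {
            FeatureVectorEncoder encoder = new StaticWordValueEncoder("words");
            encoder.setProbes(2);                            // hash each feature into 2 locations
            Vector v = new RandomAccessSparseVector(10000);  // fixed size, chosen up front
            for (String word : words) {
              encoder.addToVector(word, v);                  // no dictionary needed
            }
            return v;
          }

          public static void main(String[] args) {
            Vector v = encode(Arrays.asList("hadoop", "mahout", "sgd"));
            System.out.println(v.getNumNondefaultElements() + " non-zero weights");
          }
        }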
  32. Learning Rate Annealing
      [plot: LearningRate vs # training examples seen]
  33. Per-term Annealing
      [plot: LearningRate vs # training examples seen, with separate curves for a common feature and a rare feature]
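      The shape in the slide 33 plot can be written down directly, though this is not Mahout's exact schedule: each term's step size decays with that term's own count, so a common feature is learned once and then stops moving while a rare feature still gets large updates. An assumed 1/sqrt(n) decay, purely for illustration:

        public class PerTermAnnealing {
          // Assumed decay form, for illustration only.
          static double rate(double baseRate, long timesTermSeen) {
            return baseRate / Math.sqrt(1.0 + timesTermSeen);
          }

          public static void main(String[] args) {
            System.out.println(rate(0.5, 10));       // rare feature: still taking sizable steps
            System.out.println(rate(0.5, 1000000));  // common feature: effectively frozen
          }
        }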
  34. General Structure
      • OnlineLogisticRegression (sketched below)
        – Traditional logistic regression
        – Stochastic Gradient Descent
        – Per term annealing
        – Too fast (for the disk + encoder)
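      A minimal training loop against the class named on the slide; the feature count, prior, and hyper-parameters are illustrative, and Example is a hypothetical holder for one labeled, already-encoded instance.

        import org.apache.mahout.classifier.sgd.L1;
        import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
        import org.apache.mahout.math.Vector;

        public class OlrSketch {
          public static class Example {  // hypothetical holder for one training example
            final int label; final Vector features;
            Example(int label, Vector features) { this.label = label; this.features = features; }
          }

          public static OnlineLogisticRegression train(Iterable<Example> examples) {
            OnlineLogisticRegression learner =
                new OnlineLogisticRegression(2, 10000, new L1())  // 2 classes, 10k hashed features
                    .learningRate(1)
                    .lambda(1e-4);                                // illustrative settings
            for (Example e : examples) {
              learner.train(e.label, e.features);  // O(1) memory, O(1) time per example
            }
            return learner;
          }
        }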
  35. Next Level
      • CrossFoldLearner
        – contains multiple primitive learners
        – online cross validation
        – 5x more work
  36. And again
      • AdaptiveLogisticRegression (sketched below)
        – 20 x CrossFoldLearner
        – evolves good learning and regularization rates
        – 100 x more work than basic learner
        – still faster than disk + encoding
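      The corresponding sketch for the adaptive layer, under the same illustrative assumptions as above; getBest() exposes the winning CrossFoldLearner chosen by online cross-validation.

        import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
        import org.apache.mahout.classifier.sgd.CrossFoldLearner;
        import org.apache.mahout.classifier.sgd.L1;

        public class AlrSketch {
          public static CrossFoldLearner train(Iterable<OlrSketch.Example> examples) {
            AdaptiveLogisticRegression learner =
                new AdaptiveLogisticRegression(2, 10000, new L1());  // evolves rates internally
            for (OlrSketch.Example e : examples) {
              learner.train(e.label, e.features);
            }
            learner.close();  // finish any pending training
            return learner.getBest().getPayload().getLearner();  // best learner by online evaluation
          }
        }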
  37. A comparison
      • Traditional view
        – 400 x (read + OLR)
      • Revised Mahout view
        – 1 x (read + mu x 100 x OLR) x eta
        – mu = efficiency from killing losers early
        – eta = efficiency from stopping early
  38. Click modeling architecture
      [diagram: Input → feature extraction and down sampling (map-reduce) → data join with side-data → sequential SGD learning, now via NFS]
  39. Click modeling architecture
      [diagram: the same pipeline with several sequential SGD learners running side by side; map-reduce cooperates with NFS]
  40. Deployment
      • Training
        – ModelSerializer.writeBinary(..., model)
      • Deployment
        – m = ModelSerializer.readBinary(...)
        – r = m.classifyScalar(featureVector)
      (full round trip sketched below)
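      The full round trip behind those three lines, assuming the model is an OnlineLogisticRegression and using a hypothetical file name:

        import java.io.FileInputStream;
        import org.apache.mahout.classifier.sgd.ModelSerializer;
        import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
        import org.apache.mahout.math.Vector;

        public class DeploySketch {
          public static void save(OnlineLogisticRegression model) throws Exception {
            ModelSerializer.writeBinary("/tmp/click-model.bin", model);  // training side
          }

          public static double score(Vector featureVector) throws Exception {
            OnlineLogisticRegression m = ModelSerializer.readBinary(
                new FileInputStream("/tmp/click-model.bin"), OnlineLogisticRegression.class);
            return m.classifyScalar(featureVector);  // score for the positive class
          }
        }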
  41. The Upshot
      • One machine can go fast
        – SITM trains on 2 billion examples in 3 hours
      • Deployability pays off big
        – simple sample server farm