Data mining 2011 09

Talk given in September 2011 by Ted Dunning to the Bay Area Data Mining group. The basic idea is that integrating map-reduce programs with the real world is easier than ever.

  1. 1. Data-mining, Hadoop and the Single Node
  2. 2. Map-Reduce: diagram of the input, shuffle, and output stages.
  3. 3. MapR's Streaming Performance: chart of read and write throughput (MB per sec) for raw hardware, MapR, and Hadoop. Tests: (i) 16 streams x 120 GB, (ii) 2000 streams x 1 GB, on 11 x 7200 rpm SATA and 11 x 15K rpm SAS drives. Higher is better.
  4. 4. Terasort on MapR: chart of elapsed time (mins) for 1.0 TB and 3.5 TB sorts, MapR vs. Hadoop, on 10+1 nodes (8 core, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm each). Lower is better.
  5. 5. Data Flow Expected Volumes (diagram): node 6 x 1 Gb/s = 600 MB/s; storage 12 x 100 MB/s = 900 MB/s.
  6. 6. MUCH faster for some operations: chart of file create rate vs. # of files (millions), on the same 10 nodes …
  7. 7. Universal export to self: diagram of a cluster node running an NFS server that its own task mounts, repeated across the cluster nodes.
  8. 8. Nodes are identical: every cluster node runs the same NFS server and task stack.
  9. 9. Sharded text indexing: input documents go to a map step that assigns documents to shards; the reducer indexes text to local disk and then copies the index to clustered index storage; the search engine typically needs another copy to local disk before the index can be loaded.
  10. 10. Conventional data flow: the same map, reducer, local-disk, and clustered-index-storage pipeline; failure of a reducer causes garbage to accumulate on the local disk, and failure of a search engine requires another download of the index from clustered storage.
  11. 11. Simplified NFS data flows: the reducer indexes straight into its task work directory via NFS, so a failed reducer is cleaned up by the map-reduce framework, and the search engine reads the mirrored index directly from clustered index storage.
  12. 12. K-means, the movie: input points are assigned to the nearest centroid, new centroids are aggregated, and the cycle repeats.
  13. 13. But …
  14. 14. Parallel Stochastic Gradient Descent: each input shard trains a sub-model, and the sub-models are averaged into one model.
  15. 15. Variational Dirichlet Assignment: each input shard gathers sufficient statistics, which are then used to update the model.
  16. 16. Old tricks, new dogs • Mapper – Assign point to cluster – Emit cluster id, (1, point) • Combiner and reducer – Sum counts, weighted sum of points – Emit cluster id, (n, sum/n) • Output to HDFS. Centroids are read from HDFS to local disk by the distributed cache and read from local disk by the mapper; new centroids are written by map-reduce.
  17. 17. Old tricks, new dogs • Mapper – Assign point to cluster – Emit cluster id, (1, point) • Combiner and reducer – Sum counts, weighted sum of points – Emit cluster id, (n, sum/n) • Output to HDFS (here, MapR FS). Centroids are read directly via NFS; new centroids are written by map-reduce.
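The pattern in slides 16 and 17 maps onto Hadoop code roughly as follows. This is a minimal sketch, not code from the talk: it assumes points arrive as comma-separated text lines, that a driver passes the current centroids in as a "kmeans.centroids" configuration string (it could read them from the distributed cache or over NFS), and that the same class serves as both combiner and reducer.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KMeansIteration {

      // Parse a comma-separated list of doubles.
      static double[] parse(String csv) {
        String[] parts = csv.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
          v[i] = Double.parseDouble(parts[i].trim());
        }
        return v;
      }

      // Mapper: assign each point to its nearest centroid, emit (cluster id, "1<TAB>point").
      public static class AssignMapper
          extends Mapper<LongWritable, Text, IntWritable, Text> {
        private double[][] centroids;

        @Override
        protected void setup(Context ctx) {
          // Assumption: the driver stores the current centroids as "x,y;x,y;..."
          // in the job configuration.
          String[] rows = ctx.getConfiguration().get("kmeans.centroids").split(";");
          centroids = new double[rows.length][];
          for (int i = 0; i < rows.length; i++) {
            centroids[i] = parse(rows[i]);
          }
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          double[] point = parse(line.toString());
          int best = 0;
          double bestDist = Double.MAX_VALUE;
          for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < point.length; i++) {
              double diff = centroids[c][i] - point[i];
              d += diff * diff;
            }
            if (d < bestDist) {
              bestDist = d;
              best = c;
            }
          }
          ctx.write(new IntWritable(best), new Text("1\t" + line));
        }
      }

      // Combiner and reducer: sum counts and weighted sums of points,
      // emit (cluster id, "n<TAB>sum/n").
      public static class MeanReducer
          extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable cluster, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          long n = 0;
          double[] sum = null;
          for (Text value : values) {
            String[] parts = value.toString().split("\t");
            long count = Long.parseLong(parts[0]);
            double[] mean = parse(parts[1]);
            if (sum == null) {
              sum = new double[mean.length];
            }
            for (int i = 0; i < mean.length; i++) {
              sum[i] += count * mean[i];   // weighted sum of points
            }
            n += count;
          }
          StringBuilder centroid = new StringBuilder();
          for (int i = 0; i < sum.length; i++) {
            if (i > 0) {
              centroid.append(',');
            }
            centroid.append(sum[i] / n);   // the new centroid, sum/n
          }
          ctx.write(cluster, new Text(n + "\t" + centroid));
        }
      }
    }

Because the combiner already emits a count alongside the partial mean, the reducer can merge combiner output and raw mapper output with the same code.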
  18. 18. Poor man's Pregel • Mapper (the lines shown in bold on the slide can use conventional I/O via NFS):

        while not done:
            read and accumulate input models
            for each input:
                accumulate model
            write model
            synchronize
            reset input format
        emit summary
  19. 19. Mahout • Scalable Data Mining for Everybody
  20. 20. What is Mahout • Recommendations (people who x this also x that) • Clustering (segment data into groups of) • Classification (learn decision making from examples) • Stuff (LDA, SVD, frequent item-set, math)
  21. 21. What is Mahout? • Recommendations (people who x this also x that) • Clustering (segment data into groups of) • Classification (learn decision making from examples) • Stuff (LDA, SVD, frequent item-set, math)
  22. 22. Classification in Detail • Naive Bayes Family – Hadoop based training • Decision Forests – Hadoop based training • Logistic Regression (aka SGD) – fast on-line (sequential) training
  23. 23. Classification in Detail • Naive Bayes Family – Hadoop based training • Decision Forests – Hadoop based training • Logistic Regression (aka SGD) – fast on-line (sequential) training
  24. 24. So What? • Online training has low overhead for small and moderate size data-sets (chart annotation: "big starts here")
  25. 25. An Example
  26. 26. And Another From: Dr. Paul Acquah Dear Sir, Re: Proposal for over-invoice Contract Benevolence Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor. ... Date: Thu, May 20, 2010 at 10:51 AM From: George <george@fumble-tech.com> Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
  27. 27. Mahout’s SGD • Learns on-line per example – O(1) memory – O(1) time per training example • Sequential implementation – fast, but not parallel
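To make the on-line behavior concrete, here is a minimal training-loop sketch against Mahout's SGD learner as it stood around 2011 (0.5-era API). The synthetic data, the feature count, and the learningRate/lambda settings are arbitrary stand-ins, not values from the talk:

    import java.util.Random;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class OnlineTrainingSketch {
      public static void main(String[] args) {
        // Two categories, ten features, L1 prior for sparse coefficients.
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, 10, new L1())
                .learningRate(1)
                .lambda(1e-4);

        // One pass over the data: O(1) memory and O(1) work per example.
        // Here the label is simply whether the first feature is positive.
        Random rand = new Random(42);
        for (int i = 0; i < 100000; i++) {
          Vector v = new DenseVector(10);
          for (int j = 0; j < 10; j++) {
            v.set(j, rand.nextGaussian());
          }
          learner.train(v.get(0) > 0 ? 1 : 0, v);   // one example at a time
        }

        // classifyScalar gives the probability of category 1 for a new instance.
        Vector probe = new DenseVector(10);
        probe.set(0, 2.0);
        System.out.println("p(category 1) = " + learner.classifyScalar(probe));
      }
    }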
  28. 28. Special Features • Hashed feature encoding • Per-term annealing – learn the boring stuff once • Auto-magical learning knob turning – learns correct learning rate, learns correct learning rate for learning learning rate, ...
  29. 29. Feature Encoding
  30. 30. Hashed Encoding
  31. 31. Feature Collisions
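A small sketch of the hashed feature encoding idea from slides 28-31, using Mahout's vector encoders (class and package names from the 0.5-era API as best I recall them; the feature names, probe count, and values are invented). Terms are hashed straight into a fixed-width vector, so no dictionary has to be built, stored, or shipped, and extra probes soften the effect of collisions:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedEncodingSketch {
      public static void main(String[] args) {
        // Fixed-size feature space: every feature hashes into one of 1000 slots.
        Vector v = new RandomAccessSparseVector(1000);

        StaticWordValueEncoder wordEncoder = new StaticWordValueEncoder("word");
        ConstantValueEncoder biasEncoder = new ConstantValueEncoder("intercept");

        // Two probes spread each term over two slots, which reduces the damage
        // done when different terms collide.
        wordEncoder.setProbes(2);

        // Word features: the term itself determines which slots get incremented.
        wordEncoder.addToVector("hadoop", v);
        wordEncoder.addToVector("nfs", v);
        wordEncoder.addToVector("mapreduce", 2.0, v);   // a term can carry a weight

        // A constant bias feature keyed only by the encoder's name.
        biasEncoder.addToVector("", 1.0, v);

        System.out.println("non-zero entries: " + v.getNumNondefaultElements());
      }
    }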
  32. 32. Learning Rate Annealing: chart of learning rate vs. # of training examples seen.
  33. 33. Per-term Annealing: chart of learning rate vs. # of training examples seen, with separate curves for a common feature and a rare feature.
  34. 34. General Structure • OnlineLogisticRegression – Traditional logistic regression – Stochastic Gradient Descent – Per term annealing – Too fast (for the disk + encoder)
  35. 35. Next Level • CrossFoldLearner – contains multiple primitive learners – online cross validation – 5x more work
  36. 36. And again • AdaptiveLogisticRegression – 20 x CrossFoldLearner – evolves good learning and regularization rates – 100 x more work than basic learner – still faster than disk + encoding
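A sketch of driving AdaptiveLogisticRegression (again the 0.5-era API, with made-up synthetic data). The pool of CrossFoldLearners behind train() does the on-line cross validation and evolves the learning and regularization rates described above:

    import java.util.Random;
    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class AdaptiveSketch {
      public static void main(String[] args) {
        // Two categories, ten features; the L1 prior is an arbitrary choice.
        AdaptiveLogisticRegression learner =
            new AdaptiveLogisticRegression(2, 10, new L1());

        Random rand = new Random(1);
        for (int i = 0; i < 100000; i++) {
          Vector v = new DenseVector(10);
          for (int j = 0; j < 10; j++) {
            v.set(j, rand.nextGaussian());
          }
          // Each call feeds the whole pool of learners, which cross-validate
          // and evolve their hyper-parameters as training proceeds.
          learner.train(v.get(0) > 0 ? 1 : 0, v);
        }
        learner.close();

        // Pull out the best learner found and report its held-out AUC.
        CrossFoldLearner best = learner.getBest().getPayload().getLearner();
        System.out.println("AUC = " + best.auc());
      }
    }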
  37. 37. A comparison • Traditional view – 400 x (read + OLR) • Revised Mahout view – 1 x (read + mu x 100 x OLR) x eta – mu = efficiency from killing losers early – eta = efficiency from stopping early
  38. 38. Click modeling architecture: input flows through map-reduce feature extraction and down sampling, is joined with side-data, and feeds sequential SGD learning (now via NFS).
  39. 39. Click modeling architecture: the same pipeline with several sequential SGD learners running side by side; map-reduce cooperates with NFS.
  40. 40. Deployment • Training – ModelSerializer.writeBinary(..., model) • Deployment – m = ModelSerializer.readBinary(...) – r = m.classifyScalar(featureVector)
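Filling out the fragment above into a small sketch. The file path is purely illustrative, and ModelSerializer's exact signatures changed between Mahout releases (earlier versions serialized to JSON), so treat the calls as an approximation of the pattern rather than a definitive API reference:

    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.ModelSerializer;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class DeploymentSketch {
      public static void main(String[] args) throws IOException {
        // Training side: train a model (elided here) and write it out.
        OnlineLogisticRegression model = new OnlineLogisticRegression(2, 10, new L1());
        // ... model.train(target, features) over the training data ...
        ModelSerializer.writeBinary("/tmp/click-model.bin", model);

        // Serving side: read the model back and score feature vectors.
        OnlineLogisticRegression m = ModelSerializer.readBinary(
            new FileInputStream("/tmp/click-model.bin"),
            OnlineLogisticRegression.class);
        Vector featureVector = new DenseVector(10);
        double r = m.classifyScalar(featureVector);
        System.out.println("score = " + r);
      }
    }

Scoring is just a library call, which is what makes the simple server-farm deployment mentioned on the next slide possible.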
  41. 41. The Upshot • One machine can go fast – SITM trains on 2 billion examples in 3 hours • Deployability pays off big – simple sample server farm
