Data mining-2011-09

Talk given on September 20 to the Bay Area data mining group. The basic idea is that integrating map-reduce programs with the real world is easier than ever.

Transcript

  • 1. Data-mining, Hadoop and the Single Node
  • 2. Map-Reduce (diagram: input, shuffle, output).
  • 3. MapR's Streaming Performance (chart: MB per second, higher is better) on 11 x 7200 rpm SATA and 11 x 15K rpm SAS. Tests: (i) 16 streams x 120 GB, (ii) 2000 streams x 1 GB.
  • 4. Terasort on MapR (chart: elapsed time in minutes, lower is better). 10+1 nodes: 8 cores, 24 GB DRAM, 11 x 1 TB SATA at 7200 rpm.
  • 5. Data Flow Expected Volumes (diagram): per node, network at 6 x 1 Gb/s (600 MB/s) and storage at 12 x 100 MB/s (900 MB/s).
  • 6. MUCH faster for some operations (chart: file-create rate vs. number of files in millions, on the same 10 nodes).
  • 7. Universal export to self (diagram): each cluster node runs its tasks against an NFS server on the same node.
  • 8. (diagram) Every cluster node runs tasks and an NFS server; nodes are identical.
  • 9. Sharded text indexing (diagram): a map phase assigns input documents to shards; reducers index text to local disk and then copy the index to clustered index storage; a copy back to local disk is typically required before the search engine can load the index.
  • 10. Conventional data flow (diagram): failure of a reducer leaves garbage on its local disk, and failure of a search engine requires another download of the index from clustered storage.
  • 11. Simplified NFS data flows (diagram): reducers index straight into the task work directory via NFS; failure of a reducer is cleaned up by the map-reduce framework; the search engine reads the mirrored index directly.
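    The point of slides 9-11 is that when the cluster file system is exported over NFS, a reducer can write its index shard with ordinary file I/O instead of staging it on local disk and copying it afterwards. A minimal sketch in plain Java; the work-directory path is a hypothetical example of an NFS mount point, not a MapR-specific API:

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;

        public class ShardWriter {
            // Write one index segment directly into the task work directory.
            // That directory lives on the cluster file system and is reachable
            // through a plain NFS mount, so no copy-to-local step is needed.
            public static void writeSegment(String workDir, String name, byte[] segment)
                    throws IOException {
                Path dir = Paths.get(workDir);   // e.g. an NFS-mounted path such as /mapr/...
                Files.createDirectories(dir);
                Files.write(dir.resolve(name), segment);
            }
        }

    If the reducer dies, the framework discards its work directory, which is exactly the cleanup behavior slide 11 points out.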
  • 12. K-means, the movie (diagram): assign each input point to the nearest centroid, aggregate new centroids, repeat.
  • 13. But …
  • 14. Parallel Stochastic Gradient Descent (diagram): train a sub-model on each partition of the input, then average the sub-models into one model.
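    A minimal sketch of the "average models" step in slide 14, assuming each sub-model is just a weight vector (plain Java; the representation is hypothetical):

        import java.util.List;

        public class ModelAverager {
            // Average sub-model weight vectors trained on separate input partitions.
            public static double[] average(List<double[]> subModels) {
                double[] avg = new double[subModels.get(0).length];
                for (double[] weights : subModels) {
                    for (int i = 0; i < weights.length; i++) {
                        avg[i] += weights[i] / subModels.size();
                    }
                }
                return avg;
            }
        }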
  • 15. Variational Dirichlet Assignment (diagram): gather sufficient statistics over the input, then update the model.
  • 16. Old tricks, new dogs. Mapper: assign point to cluster; emit (cluster id, (1, point)). Combiner and reducer: sum counts and weighted sum of points; emit (cluster id, (n, sum/n)). Output goes to HDFS; centroids written by map-reduce are copied from HDFS to local disk by the distributed cache and read from there.
  • 17. Old tricks, new dogs (on MapR FS). Same mapper, combiner, and reducer; output goes to MapR FS, and centroids written by map-reduce are read back directly via NFS.
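    A single-node sketch of the k-means step described in slides 16-17: assign each point to its nearest centroid, accumulate a count and a vector sum per cluster (the combiner/reducer role), and emit the new means. Plain Java with dense arrays; the Hadoop plumbing and the HDFS/NFS I/O are deliberately left out:

        import java.util.List;

        public class KMeansStep {
            // One k-means iteration over an in-memory list of points.
            public static double[][] iterate(List<double[]> points, double[][] centroids) {
                int k = centroids.length, d = centroids[0].length;
                double[][] sums = new double[k][d];
                int[] counts = new int[k];
                for (double[] p : points) {              // mapper: emit (cluster id, (1, point))
                    int best = 0;
                    double bestDist = Double.MAX_VALUE;
                    for (int c = 0; c < k; c++) {
                        double dist = 0;
                        for (int i = 0; i < d; i++) {
                            double diff = p[i] - centroids[c][i];
                            dist += diff * diff;
                        }
                        if (dist < bestDist) { bestDist = dist; best = c; }
                    }
                    counts[best]++;
                    for (int i = 0; i < d; i++) sums[best][i] += p[i];
                }
                double[][] next = new double[k][d];      // reducer: emit (cluster id, (n, sum/n))
                for (int c = 0; c < k; c++) {
                    for (int i = 0; i < d; i++) {
                        next[c][i] = counts[c] == 0 ? centroids[c][i] : sums[c][i] / counts[c];
                    }
                }
                return next;
            }
        }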
  • 18. Poor man's Pregel. Mapper pseudocode (the bolded lines on the slide can use conventional I/O via NFS):
        while not done:
            read and accumulate input models
            for each input:
                accumulate model
            write model
            synchronize
            reset input format
        emit summary
  • 19.
  • 20. Mahout: Scalable Data Mining for Everybody
  • 21. What is Mahout: recommendations (people who x this also x that); clustering (segment data into groups); classification (learn decision making from examples); other stuff (LDA, SVD, frequent item-sets, math).
  • 22. What is Mahout? Recommendations (people who x this also x that); clustering (segment data into groups); classification (learn decision making from examples); other stuff (LDA, SVM, frequent item-sets, math).
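    For the recommendations bullet, a minimal sketch using Mahout's Taste collaborative-filtering API roughly as it looked in the 0.x releases; the ratings file and user id are made up, and exact package paths varied between versions:

        import java.io.File;
        import java.util.List;
        import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
        import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
        import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
        import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
        import org.apache.mahout.cf.taste.model.DataModel;
        import org.apache.mahout.cf.taste.recommender.RecommendedItem;

        public class RecommenderSketch {
            public static void main(String[] args) throws Exception {
                // "people who x this also x that": user-based collaborative filtering
                DataModel model = new FileDataModel(new File("ratings.csv")); // userID,itemID,value
                PearsonCorrelationSimilarity sim = new PearsonCorrelationSimilarity(model);
                NearestNUserNeighborhood hood = new NearestNUserNeighborhood(10, sim, model);
                GenericUserBasedRecommender rec =
                    new GenericUserBasedRecommender(model, hood, sim);
                List<RecommendedItem> top = rec.recommend(42L, 3);  // top 3 items for user 42
                for (RecommendedItem item : top) {
                    System.out.println(item.getItemID() + " " + item.getValue());
                }
            }
        }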
  • 23. Classification in Detail: Naive Bayes family (Hadoop-based training); decision forests (Hadoop-based training); logistic regression, aka SGD (fast on-line, sequential training).
  • 24. Classification in Detail: Naive Bayes family (Hadoop-based training); decision forests (Hadoop-based training); logistic regression, aka SGD (fast on-line, sequential training).
  • 25. So What? (chart, annotated "big starts here") Online training has low overhead for small and moderate size data-sets.
  • 26. An Example
  • 27. And Another
        From: Dr. Paul Acquah
        Dear Sir,
        Re: Proposal for over-invoice Contract Benevolence
        Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor.
        ...
        Date: Thu, May 20, 2010 at 10:51 AM
        From: George <george@fumble-tech.com>
        Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
  • 28. Mahout's SGD: learns on-line, per example; O(1) memory; O(1) time per training example; sequential implementation (fast, but not parallel).
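    A minimal sketch of the slide 28 learner using Mahout's sgd package as it looked in the 0.x releases; the toy features and labels are made up:

        import org.apache.mahout.classifier.sgd.L1;
        import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
        import org.apache.mahout.math.DenseVector;
        import org.apache.mahout.math.Vector;

        public class SgdSketch {
            public static void main(String[] args) {
                // 2 categories, 3 features, L1 prior; O(1) memory, O(1) time per example.
                OnlineLogisticRegression learner =
                    new OnlineLogisticRegression(2, 3, new L1()).lambda(1e-4).learningRate(0.1);

                double[][] x = {{1, 0.5, -1}, {1, -0.3, 2}};  // toy feature vectors
                int[] y = {0, 1};                             // toy labels
                for (int pass = 0; pass < 10; pass++) {
                    for (int i = 0; i < x.length; i++) {
                        learner.train(y[i], new DenseVector(x[i]));  // on-line, one example at a time
                    }
                }
                Vector probe = new DenseVector(new double[] {1, 0.4, -0.8});
                System.out.println(learner.classifyScalar(probe));   // estimated P(category 1)
            }
        }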
  • 29. Special Features: hashed feature encoding; per-term annealing (learn the boring stuff once); auto-magical learning-knob turning (learns the correct learning rate, learns the correct learning rate for learning the learning rate, ...).
  • 30. Feature Encoding
  • 31. Hashed Encoding
  • 32. Feature Collisions
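    Slides 30-32 are about feature hashing: every feature is hashed into a fixed-size vector, so no dictionary has to be built or shipped, at the cost of occasional collisions. A minimal sketch with Mahout's value encoders (class names from the 0.x code base; package locations shifted between releases, and the field names here are made up):

        import org.apache.mahout.math.RandomAccessSparseVector;
        import org.apache.mahout.math.Vector;
        import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
        import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

        public class HashedEncodingSketch {
            public static void main(String[] args) {
                Vector v = new RandomAccessSparseVector(1000);  // fixed-size hashed feature space

                ConstantValueEncoder bias = new ConstantValueEncoder("intercept");
                StaticWordValueEncoder words = new StaticWordValueEncoder("body");
                words.setProbes(2);   // hash each term to two slots to soften collisions

                bias.addToVector("", v);
                for (String token : "confidential business deal".split(" ")) {
                    words.addToVector(token, v);
                }
                System.out.println(v.getNumNondefaultElements() + " non-zero slots of " + v.size());
            }
        }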
  • 33. Learning Rate Annealing (chart: learning rate vs. number of training examples seen).
  • 34. Per-term Annealing (chart: learning rate vs. number of training examples seen, for a common feature and a rare feature).
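    Annealing here just means the step size shrinks as more training examples are seen, and per-term annealing keeps that count per feature, so rarely seen features retain a large step while common ones settle down. A generic illustration of the idea, not Mahout's exact schedule:

        public class AnnealingSketch {
            // Generic inverse-decay schedule: eta_t = eta0 / (1 + t / tau).
            static double rate(double eta0, double tau, long updatesSeen) {
                return eta0 / (1 + updatesSeen / tau);
            }

            public static void main(String[] args) {
                long commonFeatureUpdates = 100_000;  // updated on almost every example
                long rareFeatureUpdates = 50;         // hardly ever seen
                System.out.println("common: " + rate(0.5, 1000, commonFeatureUpdates));
                System.out.println("rare:   " + rate(0.5, 1000, rareFeatureUpdates));
            }
        }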
  • 35. General Structure: OnlineLogisticRegression is traditional logistic regression trained with stochastic gradient descent and per-term annealing; it is too fast for the disk + encoder to keep up.
  • 36. Next Level: CrossFoldLearner contains multiple primitive learners and does online cross validation; about 5x more work.
  • 37. And again: AdaptiveLogisticRegression runs 20 x CrossFoldLearner and evolves good learning and regularization rates; about 100x more work than the basic learner, but still faster than disk + encoding.
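    A minimal sketch of the slide 37 wrapper, assuming the AdaptiveLogisticRegression API from Mahout's 0.x sgd package; the training data is a meaningless placeholder, only there so the calls type-check:

        import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
        import org.apache.mahout.classifier.sgd.CrossFoldLearner;
        import org.apache.mahout.classifier.sgd.L1;
        import org.apache.mahout.math.DenseVector;

        public class AdaptiveSketch {
            public static void main(String[] args) {
                // Pool of CrossFoldLearners; evolves learning and regularization rates.
                AdaptiveLogisticRegression alr = new AdaptiveLogisticRegression(2, 3, new L1());

                double[][] x = {{1, 0.5, -1}, {1, -0.3, 2}};  // placeholder data
                int[] y = {0, 1};
                for (int pass = 0; pass < 2000; pass++) {     // enough examples for evaluation cycles
                    for (int i = 0; i < x.length; i++) {
                        alr.train(y[i], new DenseVector(x[i]));
                    }
                }
                alr.close();  // finish any pending training and evaluation

                // Best surviving learner, as judged by online cross validation.
                CrossFoldLearner best = alr.getBest().getPayload().getLearner();
                System.out.println(best.classifyScalar(new DenseVector(new double[] {1, 0.4, -0.8})));
            }
        }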
  • 38. A comparison. Traditional view: 400 x (read + OLR). Revised Mahout view: 1 x (read + mu x 100 x OLR) x eta, where mu = efficiency from killing losers early and eta = efficiency from stopping early.
  • 39. Click modeling architecture (diagram): a map-reduce job does feature extraction, down-sampling, and data join over the input; a sequential SGD learner trains on the result; side-data now arrives via NFS.
  • 40. Click modeling architecture (diagram): map-reduce does feature extraction, down-sampling, and data join; several sequential SGD learners train in parallel; map-reduce cooperates with NFS for the side-data.
  • 41. Deployment
        Training:
            ModelSerializer.writeBinary(..., model)
        Deployment:
            m = ModelSerializer.readBinary(...)
            r = m.classifyScalar(featureVector)
  • 42. The Upshot: one machine can go fast (SITM trains on 2 billion examples in 3 hours); deployability pays off big (simple sample server farm).
