Data mining 2011 09

Talk given in September 2011 by Ted Dunning to a Bay Area data mining group. The basic idea is that integrating map-reduce programs with the real world is easier than ever.

    Presentation Transcript

    • Data-mining, Hadoop and the Single Node
    • Map-Reduce. Diagram: Input, Shuffle, Output.
    • MapR's Streaming Performance. Bar charts: read and write throughput (MB per sec, 0 to 2250) for MapR vs. stock Hadoop on two hardware configurations (11 x 7200rpm SATA, 11 x 15Krpm SAS). Tests: i. 16 streams x 120GB; ii. 2000 streams x 1GB. Higher is better.
    • Terasort on MapR. Bar charts: elapsed time (mins) for MapR vs. Hadoop at 1.0 TB (0 to 60 min scale) and 3.5 TB (0 to 300 min scale). 10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm. Lower is better.
    • Data Flow Expected Volumes. Per node: network 6 x 1Gb/s = 600 MB/s; storage 12 x 100MB/s = 900 MB/s.
    • MUCH faster for some operations. Chart: file create rate vs. # of files (millions), same 10 nodes …
    • Diagram: a Cluster Node runs an NFS Server with universal export to self, which local Tasks use for I/O (one of many identical cluster nodes).
    • Diagram: every Cluster Node runs its own NFS Server and Tasks; the nodes are identical.
    • Sharded text indexing. Diagram: Input documents → Map (assign documents to shards) → Reducer (index text to local disk, then copy the index to the distributed file store) → Clustered index storage → Search Engine; a copy to local disk is typically required before the index can be loaded.
    • Conventional data flow. Same Map → Reducer → local disk → Clustered index storage → local disk → Search Engine pipeline. Failure of a reducer causes garbage to accumulate on the local disk; failure of a search engine requires another download of the index from clustered storage.
    • Simplified NFS data flows. Diagram: Input documents → Map → Reducer indexes directly to its task work directory via NFS → Clustered index storage; the Search Engine reads the mirrored index directly. Failure of a reducer is cleaned up by the map-reduce framework.
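
    To make the NFS data flow concrete, here is a minimal sketch (not from the talk) of a reducer-side step writing index data directly to an NFS-mounted cluster directory instead of staging it on local disk and copying; the mount point, file names, and index format are hypothetical:

        import java.io.IOException;
        import java.io.Writer;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;

        public class NfsIndexWriter {
            // Hypothetical NFS mount point; with MapR's NFS gateway the cluster file
            // system typically appears under /mapr/<cluster-name>/...
            private static final String INDEX_DIR = "/mapr/my.cluster.com/indexes/shard-0000";

            public static void main(String[] args) throws IOException {
                Path shardDir = Paths.get(INDEX_DIR);
                Files.createDirectories(shardDir);

                // Write index data straight into the distributed file system through the
                // NFS mount; no "write to local disk, then copy" step is needed.
                Path segment = shardDir.resolve("segment-0001.txt");
                try (Writer out = Files.newBufferedWriter(segment, StandardCharsets.UTF_8)) {
                    out.write("doc-42\tterm:hadoop\tterm:mahout\n");
                }
                // A search engine on another node can read the same path (or a mirror of
                // it) directly, because every node sees the cluster through NFS.
            }
        }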
    • K-means, the movie. Diagram: Input → Assign to nearest centroid → Aggregate new centroids → Centroids (fed back into the assignment step).
    • But …
    • Parallel Stochastic Gradient Descent. Diagram: Input → Train sub-model → Average models → Model.
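
    As a hedged illustration of the "average models" step (not code from the talk): each worker trains a sub-model on its input split, and the combined model is just the element-wise average of the sub-model weight vectors:

        import java.util.Arrays;
        import java.util.List;

        public class ModelAverager {
            // Average the weight vectors of independently trained sub-models.
            // The Hadoop plumbing (reading sub-models written by mappers) is omitted.
            static double[] average(List<double[]> subModels) {
                int dim = subModels.get(0).length;
                double[] avg = new double[dim];
                for (double[] w : subModels) {
                    for (int i = 0; i < dim; i++) {
                        avg[i] += w[i];
                    }
                }
                for (int i = 0; i < dim; i++) {
                    avg[i] /= subModels.size();
                }
                return avg;
            }

            public static void main(String[] args) {
                double[] combined = average(Arrays.asList(
                        new double[] {0.2, -1.0, 0.5},
                        new double[] {0.4, -0.8, 0.3}));
                System.out.println(Arrays.toString(combined)); // roughly [0.3, -0.9, 0.4]
            }
        }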
    • Variational Dirichlet Assignment. Diagram: Input → Gather sufficient statistics → Update model → Model (fed back).
    • Old tricks, new dogs • Mapper – Assign point to cluster – Emit cluster id, (1, point) • Combiner and reducer – Sum counts, weighted sum of points – Emit cluster id, (n, sum/n) • Output to HDFS. (Diagram annotations: read from HDFS to local disk by distributed cache; written by map-reduce; read from local disk from distributed cache.)
    • Old tricks, new dogs • Mapper – Assign point to cluster – Emit cluster id, (1, point) • Combiner and reducer – Sum counts, weighted sum of points – Emit cluster id, (n, sum/n) • Output to HDFS / MapR FS. (Diagram annotations: read from NFS; written by map-reduce.)
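
    A small, self-contained sketch (not code from the talk) of the per-iteration logic these two slides describe: assign each point to its nearest centroid, emit (cluster id, (1, point)), then sum counts and points and emit (cluster id, (n, sum/n)). The Hadoop mapper/combiner/reducer wiring and the HDFS-versus-NFS centroid I/O are deliberately omitted:

        import java.util.Arrays;
        import java.util.HashMap;
        import java.util.Map;

        public class KMeansStep {
            // Find the index of the nearest centroid by squared Euclidean distance.
            static int nearest(double[] point, double[][] centroids) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < centroids.length; c++) {
                    double d = 0;
                    for (int i = 0; i < point.length; i++) {
                        double diff = point[i] - centroids[c][i];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                return best;
            }

            public static void main(String[] args) {
                double[][] centroids = { {0, 0}, {5, 5} };
                double[][] points = { {0.5, 0.2}, {4.8, 5.1}, {5.2, 4.9} };

                // "Combiner/reducer" state: per cluster, a count and a running sum.
                Map<Integer, double[]> sums = new HashMap<>();
                Map<Integer, Integer> counts = new HashMap<>();

                // "Mapper": assign each point and emit (clusterId, (1, point)).
                for (double[] p : points) {
                    int c = nearest(p, centroids);
                    counts.merge(c, 1, Integer::sum);
                    double[] s = sums.computeIfAbsent(c, k -> new double[p.length]);
                    for (int i = 0; i < p.length; i++) s[i] += p[i];
                }

                // "Reducer": emit (clusterId, (n, sum/n)) -- the new centroid.
                for (Map.Entry<Integer, double[]> e : sums.entrySet()) {
                    int n = counts.get(e.getKey());
                    double[] mean = e.getValue().clone();
                    for (int i = 0; i < mean.length; i++) mean[i] /= n;
                    System.out.println(e.getKey() + " -> n=" + n + ", centroid=" + Arrays.toString(mean));
                }
            }
        }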
    • Poor man’s Pregel • Mapper (lines in bold can use conventional I/O via NFS):
      while not done:
          read and accumulate input models
          for each input:
              accumulate model
          write model
          synchronize
          reset input format
      emit summary
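
    One way that loop could look, with ordinary file I/O against a shared directory standing in for message passing between supersteps; the directory layout, file format, and barrier mechanism are illustrative assumptions, not the talk's code:

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;

        public class PoorMansPregel {
            // Placeholder shared directory; in the MapR setting described above this
            // would be an NFS-mounted cluster path such as /mapr/<cluster-name>/...
            static final Path SHARED = Paths.get("/tmp/pregel-models");
            static final int WORKERS = Integer.getInteger("workers", 1); // peers in the job

            public static void main(String[] args) throws IOException, InterruptedException {
                int me = args.length > 0 ? Integer.parseInt(args[0]) : 0; // this worker's id
                Files.createDirectories(SHARED);
                double model = 0.0;

                for (int step = 0; step < 5; step++) {                   // "while not done"
                    // read and accumulate the models written in the previous superstep
                    for (int w = 0; w < WORKERS; w++) {
                        Path in = SHARED.resolve("step-" + step + "-worker-" + w);
                        if (Files.exists(in)) {
                            model += Double.parseDouble(Files.readString(in).trim());
                        }
                    }
                    model = model / WORKERS + localUpdate();             // accumulate from local input
                    // write this worker's model for the next superstep
                    Files.writeString(SHARED.resolve("step-" + (step + 1) + "-worker-" + me),
                            Double.toString(model));
                    synchronize(step + 1);                               // barrier on peer files
                }
                System.out.println("summary: worker " + me + " final model = " + model);
            }

            static double localUpdate() { return 1.0; } // stand-in for real per-input work

            // Crude barrier: wait until every worker's file for this superstep exists.
            static void synchronize(int step) throws IOException, InterruptedException {
                while (true) {
                    try (var files = Files.list(SHARED)) {
                        long done = files.filter(p ->
                                p.getFileName().toString().startsWith("step-" + step + "-")).count();
                        if (done >= WORKERS) {
                            return;
                        }
                    }
                    Thread.sleep(200);
                }
            }
        }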
    • Mahout • Scalable Data Mining for Everybody
    • What is Mahout? • Recommendations (people who x this also x that) • Clustering (segment data into groups of similar items) • Classification (learn decision making from examples) • Stuff (LDA, SVD, frequent item-set, math)
    • Classification in Detail • Naive Bayes Family – Hadoop based training • Decision Forests – Hadoop based training • Logistic Regression (aka SGD) – fast on-line (sequential) training
    • So What? • Online training has low overhead for small and moderate size data-sets. (Chart annotation: "big" starts here.)
    • An Example
    • And Another From: Dr. Paul Acquah Dear Sir, Re: Proposal for over-invoice Contract Benevolence Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor. ... Date: Thu, May 20, 2010 at 10:51 AM From: George <george@fumble-tech.com> Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
    • Mahout’s SGD • Learns on-line per example – O(1) memory – O(1) time per training example • Sequential implementation – fast, but not parallel
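
    A minimal on-line training loop against the Mahout 0.x SGD API described above (OnlineLogisticRegression, with train() called once per example and classifyScalar() for scoring); the features and labels below are synthetic placeholders:

        import org.apache.mahout.classifier.sgd.L1;
        import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
        import org.apache.mahout.math.RandomAccessSparseVector;
        import org.apache.mahout.math.Vector;

        // On-line training, one example at a time: O(1) memory (just the weights)
        // and O(1) work per example. API usage follows Mahout 0.x examples.
        public class OnlineSgdExample {
            public static void main(String[] args) {
                int numCategories = 2;
                int numFeatures = 1000;
                OnlineLogisticRegression learner =
                        new OnlineLogisticRegression(numCategories, numFeatures, new L1());

                // Toy stream of (label, feature vector) pairs.
                for (int i = 0; i < 10_000; i++) {
                    Vector v = new RandomAccessSparseVector(numFeatures);
                    v.set(0, 1.0);                 // bias term
                    v.set(1 + (i % 5), 1.0);       // a fake categorical feature
                    int label = (i % 5 < 2) ? 1 : 0;
                    learner.train(label, v);       // sequential update, no Hadoop job needed
                }

                Vector probe = new RandomAccessSparseVector(numFeatures);
                probe.set(0, 1.0);
                probe.set(2, 1.0);
                System.out.println("P(class = 1) = " + learner.classifyScalar(probe));
            }
        }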
    • Special Features • Hashed feature encoding • Per-term annealing – learn the boring stuff once • Auto-magical learning knob turning – learns correct learning rate, learns correct learning rate for learning learning rate, ...
    • Feature Encoding
    • Hashed Encoding
    • Feature Collisions
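
    The three picture slides above (feature encoding, hashed encoding, feature collisions) correspond to Mahout's hashed feature encoders: feature names are hashed into a fixed-width vector rather than looked up in a dictionary, and several probes per feature soften collisions. A rough sketch, using the Mahout 0.x vectorizer encoder classes as commonly shown in its examples (treat the exact API as an approximation):

        import org.apache.mahout.math.RandomAccessSparseVector;
        import org.apache.mahout.math.Vector;
        import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
        import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

        // Hashed (feature-hashing) encoding: no pre-pass over the data is needed
        // because feature names are hashed into the vector on the fly.
        // Encoder names and methods follow Mahout 0.x usage; treat as approximate.
        public class HashedEncodingExample {
            public static void main(String[] args) {
                int cardinality = 1000;
                Vector v = new RandomAccessSparseVector(cardinality);

                StaticWordValueEncoder words = new StaticWordValueEncoder("words");
                words.setProbes(2);                   // spread each word over 2 hashed locations
                ConstantValueEncoder bias = new ConstantValueEncoder("bias");

                bias.addToVector("", 1.0, v);         // intercept term
                for (String token : "hadoop mahout sgd".split(" ")) {
                    words.addToVector(token, 1.0, v); // hash the token name into the vector
                }
                System.out.println("encoded vector: " + v);
            }
        }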
    • Learning Rate Annealing. Chart: learning rate vs. # training examples seen.
    • Per-term Annealing. Chart: learning rate vs. # training examples seen, shown separately for a common feature and a rare feature.
    • General Structure • OnlineLogisticRegression – Traditional logistic regression – Stochastic Gradient Descent – Per term annealing – Too fast (for the disk + encoder)
    • Next Level • CrossFoldLearner – contains multiple primitive learners – online cross validation – 5x more work
    • And again • AdaptiveLogisticRegression – 20 x CrossFoldLearner – evolves good learning and regularization rates – 100 x more work than basic learner – still faster than disk + encoding
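
    A sketch of the AdaptiveLogisticRegression usage implied here, following the Mahout 0.x API (a pool of CrossFoldLearners, with getBest() consulted after close()); the data and feature layout are synthetic:

        import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
        import org.apache.mahout.classifier.sgd.CrossFoldLearner;
        import org.apache.mahout.classifier.sgd.L1;
        import org.apache.mahout.math.RandomAccessSparseVector;
        import org.apache.mahout.math.Vector;

        // AdaptiveLogisticRegression evolves learning and regularization rates by
        // running several CrossFoldLearners and keeping the ones that cross-validate
        // best. Usage follows the Mahout 0.x examples; details here are illustrative.
        public class AdaptiveSgdExample {
            public static void main(String[] args) {
                int features = 1000;
                AdaptiveLogisticRegression learner =
                        new AdaptiveLogisticRegression(2, features, new L1());

                for (int i = 0; i < 10_000; i++) {
                    Vector v = new RandomAccessSparseVector(features);
                    v.set(0, 1.0);
                    v.set(1 + (i % 7), 1.0);
                    learner.train(i % 7 < 3 ? 1 : 0, v);  // evolution happens as it trains
                }
                learner.close();

                // Pull out the best-performing learner for evaluation or deployment.
                if (learner.getBest() != null) {
                    CrossFoldLearner best = learner.getBest().getPayload().getLearner();
                    System.out.println("held-out AUC estimate: " + best.auc());
                }
            }
        }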
    • A comparison • Traditional view – 400 x (read + OLR) • Revised Mahout view – 1 x (read + mu x 100 x OLR) x eta – mu = efficiency from killing losers early – eta = efficiency from stopping early
    • Click modeling architecture. Diagram: Input → Feature extraction and down-sampling (map-reduce) → Data join with side-data → Sequential SGD learning, now fed via NFS.
    • Click modeling architecture. Diagram: Input → Feature extraction and down-sampling (map-reduce) → Data join with side-data (map-reduce) → several Sequential SGD learners running in parallel; map-reduce cooperates with NFS.
    • Deployment • Training – ModelSerializer.writeBinary(..., model) • Deployment – m = ModelSerializer.readBinary(...) – r = m.classifyScalar(featureVector)
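
    Roughly how the training/deployment round trip fits together; the exact ModelSerializer overloads differ between Mahout 0.x releases, so the calls below are an approximation of the API the slide names rather than verified signatures:

        import java.io.FileInputStream;
        import java.io.IOException;

        import org.apache.mahout.classifier.sgd.L1;
        import org.apache.mahout.classifier.sgd.ModelSerializer;
        import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
        import org.apache.mahout.math.RandomAccessSparseVector;
        import org.apache.mahout.math.Vector;

        // Train, serialize, reload, score -- the round trip sketched on the slide.
        // ModelSerializer overloads are assumed from Mahout 0.x; treat as approximate.
        public class DeployExample {
            public static void main(String[] args) throws IOException {
                int features = 100;
                OnlineLogisticRegression model = new OnlineLogisticRegression(2, features, new L1());

                Vector example = new RandomAccessSparseVector(features);
                example.set(0, 1.0);
                model.train(1, example);

                // Training side: write the model to disk (or to the cluster via NFS).
                ModelSerializer.writeBinary("/tmp/model.bin", model);

                // Deployment side: read it back and score new feature vectors.
                OnlineLogisticRegression m =
                        ModelSerializer.readBinary(new FileInputStream("/tmp/model.bin"),
                                OnlineLogisticRegression.class);
                System.out.println("score = " + m.classifyScalar(example));
            }
        }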
    • The Upshot • One machine can go fast – SITM trains on 2 billion examples in 3 hours • Deployability pays off big – simple sample server farm