Data mining-2011-09
 

Talk given on September 20 to a Bay Area data mining group. The basic idea is that integrating map-reduce programs with the real world is easier than ever.


    Data mining-2011-09 Presentation Transcript

    • Data-mining, Hadoop and the Single Node
    • Map-Reduce
      (diagram: input, shuffle, output)
    • MapR's Streaming Performance
      (chart: throughput in MB per second, higher is better, for 11 x 7200 rpm SATA vs. 11 x 15K rpm SAS disks)
      Tests: i. 16 streams x 120 GB ii. 2000 streams x 1 GB
    • Terasort on MapR
      10+1 nodes: 8 core, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm
      (chart: elapsed time in minutes, lower is better)
    • Data Flow Expected Volumes
      (diagram: node network bandwidth 6 x 1 Gb/s = 600 MB/s; storage bandwidth 12 x 100 MB/s = 900 MB/s)
    • MUCH faster for some operations
      Same 10 nodes …
      (chart: file create rate vs. number of files in millions)
    • Universal export to self
      (diagram: each cluster node runs a task alongside an NFS server that exports the cluster to itself)
    • (diagram: several cluster nodes, each running a task and an NFS server)
      Nodes are identical
    • Sharded text indexing
      Index text to local disk and then copy the index to the distributed file store
      Assign documents to shards
      (diagram: input documents flow to map, then to reducers that index to local disk and copy the index to clustered index storage; a copy back to the search engine's local disk is typically required before the index can be loaded)
    • Conventional data flow
      (diagram: input documents, map, reducer with local disk, clustered index storage, search engine with local disk)
      Failure of a reducer causes garbage to accumulate on the local disk
      Failure of the search engine requires another download of the index from clustered storage
    • Simplified NFS data flows
      Index to the task work directory via NFS
      (diagram: input documents, map, reducer, clustered index storage, search engine)
      Failure of a reducer is cleaned up by the map-reduce framework
      The search engine reads the mirrored index directly
    • K-means, the movie
      (diagram: input points are assigned to the nearest centroid; new centroids are aggregated and fed back for the next iteration)
    • But …
    • Parallel Stochastic Gradient Descent
      (diagram: each input shard trains a sub-model; the sub-models are averaged into a single model)
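      A minimal sketch of the "average models" step this diagram implies, assuming each shard's SGD run yields a plain weight vector; the SubModel class and its layout are hypothetical, not code from the talk.

      import java.util.List;

      public class ModelAveraging {
          // Hypothetical container for the weights learned by SGD on one input shard.
          static class SubModel {
              final double[] weights;
              SubModel(double[] weights) { this.weights = weights; }
          }

          // Average the weight vectors of sub-models trained independently on the shards.
          static double[] average(List<SubModel> subModels) {
              int n = subModels.get(0).weights.length;
              double[] avg = new double[n];
              for (SubModel m : subModels) {
                  for (int i = 0; i < n; i++) {
                      avg[i] += m.weights[i] / subModels.size();
                  }
              }
              return avg;
          }
      }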
    • Variational Dirichlet Assignment
      (diagram: each input shard gathers sufficient statistics; the model is updated from the aggregated statistics)
    • Old tricks, new dogs
      Mapper: assign point to cluster; emit (cluster id, (1, point))
      Combiner and reducer: sum counts and the weighted sum of points; emit (cluster id, (n, sum/n))
      Output written to HDFS by map-reduce
      Centroids are read from local disk, copied there from HDFS by the distributed cache
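      A plain-Java sketch of the mapper and combiner/reducer arithmetic just listed, without the Hadoop job boilerplate; the double[] point representation and the method names are illustrative assumptions, not code from the talk.

      import java.util.List;

      public class KMeansStepSketch {
          // Mapper logic: assign a point to its nearest centroid so the mapper
          // can emit (cluster id, (1, point)).
          static int nearestCentroid(double[] point, List<double[]> centroids) {
              int best = 0;
              double bestDist = Double.MAX_VALUE;
              for (int c = 0; c < centroids.size(); c++) {
                  double d = 0;
                  for (int i = 0; i < point.length; i++) {
                      double diff = point[i] - centroids.get(c)[i];
                      d += diff * diff;
                  }
                  if (d < bestDist) {
                      bestDist = d;
                      best = c;
                  }
              }
              return best;
          }

          // Combiner/reducer logic for one cluster: sum counts and the weighted
          // sum of points, then emit (cluster id, (n, sum/n)), i.e. the new centroid.
          static double[] newCentroid(List<double[]> pointsInCluster) {
              int n = pointsInCluster.size();
              double[] sum = new double[pointsInCluster.get(0).length];
              for (double[] p : pointsInCluster) {
                  for (int i = 0; i < p.length; i++) {
                      sum[i] += p[i];
                  }
              }
              for (int i = 0; i < sum.length; i++) {
                  sum[i] /= n;
              }
              return sum;
          }
      }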
    • Old tricks, new dogs
      Mapper: assign point to cluster; emit (cluster id, (1, point))
      Combiner and reducer: sum counts and the weighted sum of points; emit (cluster id, (n, sum/n))
      Output written to HDFS (MapR FS) by map-reduce
      Centroids are read directly from MapR FS via NFS
    • Poor man’s Pregel
      Mapper
      Lines in bold can use conventional I/O via NFS
      while not done:
          read and accumulate input models
          for each input:
              accumulate model
          write model
          synchronize
          reset input format
      emit summary
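      A rough sketch of how a mapper might run this loop using ordinary file I/O over an NFS-mounted cluster volume, which is what the slide is pointing at; the paths, the Model class, and the single-task bookkeeping are assumptions, not the talk's code.

      import java.io.ObjectInputStream;
      import java.io.ObjectOutputStream;
      import java.io.Serializable;
      import java.nio.file.DirectoryStream;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.Paths;
      import java.util.List;

      public class PregelishLoopSketch {
          // Hypothetical model with merge and per-record update operations.
          static class Model implements Serializable {
              void accumulate(Model other) { /* merge another task's model */ }
              void update(String record)   { /* learn from one input record */ }
          }

          public static void main(String[] args) throws Exception {
              Path shared = Paths.get("/mapr/cluster/vol/models");        // NFS-visible directory (hypothetical)
              Path input  = Paths.get("/mapr/cluster/vol/input/part-0");  // this task's input split (hypothetical)
              Model model = new Model();
              int iteration = 0;
              boolean done = false;
              while (!done) {
                  // Read and accumulate the models written by all tasks in the previous iteration.
                  try (DirectoryStream<Path> peers =
                           Files.newDirectoryStream(shared, "*.iter" + iteration)) {
                      for (Path p : peers) {
                          try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(p))) {
                              model.accumulate((Model) in.readObject());
                          }
                      }
                  }
                  // For each input record, accumulate (update) the model.
                  List<String> records = Files.readAllLines(input);
                  for (String record : records) {
                      model.update(record);
                  }
                  // Write the model where the other tasks can read it over NFS.
                  iteration++;
                  try (ObjectOutputStream out = new ObjectOutputStream(
                           Files.newOutputStream(shared.resolve("task-0.iter" + iteration)))) {
                      out.writeObject(model);
                  }
                  // Synchronize with the other tasks and reset the input format before
                  // the next pass (details omitted; a fixed iteration count stands in here).
                  done = iteration >= 10;
              }
              // Emit a summary of the final model.
              System.out.println("finished after " + iteration + " iterations");
          }
      }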
    • Mahout
      Scalable Data Mining for Everybody
    • What is Mahout?
      Recommendations (people who x this also x that)
      Clustering (segment data into groups of similar items)
      Classification (learn decision making from examples)
      Stuff (LDA, SVD, frequent item-set mining, math)
    • Classification in Detail
      Naive Bayes family: Hadoop-based training
      Decision forests: Hadoop-based training
      Logistic regression (aka SGD): fast on-line (sequential) training
    • So What?
      (chart annotation: "big starts here")
      Online training has low overhead for small and moderate sized data-sets
    • An Example
    • And Another
      From:  Dr. Paul Acquah
      Dear Sir,
      Re: Proposal for over-invoice Contract Benevolence
      Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit.  I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor.
      ...
      Date: Thu, May 20, 2010 at 10:51 AM
      From: George <george@fumble-tech.com>
      Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
    • Mahout’s SGD
      Learns on-line per example
      O(1) memory
      O(1) time per training example
      Sequential implementation
      fast, but not parallel
    • Special Features
      Hashed feature encoding
      Per-term annealing
      learn the boring stuff once
      Auto-magical learning knob turning
      learns correct learning rate, learns correct learning rate for learning learning rate, ...
    • Feature Encoding
    • Hashed Encoding
    • Feature Collisions
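      A generic illustration of hashed feature encoding and why feature collisions arise (this is not Mahout's encoder API): each feature name is hashed into a slot of a fixed-size vector, so no dictionary has to be built, but two distinct features can occasionally land in the same slot. In practice the vector is made large enough that collisions are rare and the learner absorbs the rest.

      import java.util.Map;

      public class HashedEncodingSketch {
          static final int NUM_FEATURES = 1 << 20;   // fixed vector size, independent of vocabulary

          // Add one named feature into the vector at a position chosen by hashing.
          static void addFeature(double[] vector, String name, double value) {
              int slot = Math.floorMod(name.hashCode(), NUM_FEATURES);
              vector[slot] += value;                  // colliding features simply add together
          }

          // Encode a whole set of (feature name, value) pairs into one vector.
          static double[] encode(Map<String, Double> features) {
              double[] v = new double[NUM_FEATURES];
              for (Map.Entry<String, Double> e : features.entrySet()) {
                  addFeature(v, e.getKey(), e.getValue());
              }
              return v;
          }
      }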
    • Learning Rate Annealing
      (chart: learning rate vs. number of training examples seen)
    • Per-term Annealing
      (chart: learning rate vs. number of training examples seen, with separate curves for common and rare features)
    • General Structure
      OnlineLogisticRegression
      Traditional logistic regression
      Stochastic Gradient Descent
      Per-term annealing
      Too fast (for the disk + encoder)
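      A rough sketch of using OnlineLogisticRegression, the base learner named above; the class and method names follow Mahout's SGD package, but treat the exact signatures and constructor arguments as assumptions to check against your Mahout version.

      import org.apache.mahout.classifier.sgd.L1;
      import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
      import org.apache.mahout.math.DenseVector;
      import org.apache.mahout.math.Vector;

      public class OlrSketch {
          public static void main(String[] args) {
              int numCategories = 2;     // binary target
              int numFeatures = 1000;    // size of the (hashed) feature vector
              OnlineLogisticRegression learner =
                  new OnlineLogisticRegression(numCategories, numFeatures, new L1())
                      .learningRate(1.0)
                      .lambda(1.0e-4);

              // Train one example at a time: O(1) memory, O(1) time per example.
              Vector example = new DenseVector(numFeatures);
              example.set(0, 1.0);       // hypothetical encoded features
              learner.train(1, example); // 1 is the target category for this example

              // Score an instance; for a two-category model, classifyScalar
              // returns the probability of category 1.
              double score = learner.classifyScalar(example);
              System.out.println("p(category 1) = " + score);
          }
      }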
    • Next Level
      CrossFoldLearner
      contains multiple primitive learners
      online cross validation
      5x more work
    • And again
      AdaptiveLogisticRegression
      20 x CrossFoldLearner
      evolves good learning and regularization rates
      100 x more work than basic learner
      still faster than disk + encoding
    • A comparison
      Traditional view: 400 x (read + OLR)
      Revised Mahout view: 1 x (read + mu x 100 x OLR) x eta
      mu = efficiency from killing losers early
      eta = efficiency from stopping early
    • Click modeling architecture
      (diagram: map-reduce stages perform a data join, then feature extraction and down-sampling; the result feeds sequential SGD learning; side-data is now delivered via NFS)
    • Click modeling architecture
      (diagram: map-reduce stages perform the data join and feature extraction/down-sampling; multiple sequential SGD learners run in parallel on the output, with side-data shared via NFS; map-reduce cooperates with NFS)
    • Deployment
      Training
      ModelSerializer.writeBinary(..., model)
      Deployment
      m = ModelSerializer.readBinary(...)
      r = m.classifyScalar(featureVector)
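      A sketch expanding the slide's three calls into a save/load/score round trip; the file path is hypothetical, and the exact ModelSerializer argument lists used here (a path when writing, a stream plus class when reading) are assumptions to verify against your Mahout version's javadoc.

      import java.io.FileInputStream;
      import org.apache.mahout.classifier.sgd.L1;
      import org.apache.mahout.classifier.sgd.ModelSerializer;
      import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
      import org.apache.mahout.math.DenseVector;
      import org.apache.mahout.math.Vector;

      public class DeploymentSketch {
          public static void main(String[] args) throws Exception {
              // Training side: train the model (elided) and write it to a file.
              OnlineLogisticRegression model =
                  new OnlineLogisticRegression(2, 1000, new L1());
              ModelSerializer.writeBinary("/tmp/click-model.bin", model);

              // Deployment side: read the model back and score feature vectors.
              OnlineLogisticRegression m;
              try (FileInputStream in = new FileInputStream("/tmp/click-model.bin")) {
                  m = ModelSerializer.readBinary(in, OnlineLogisticRegression.class);
              }
              Vector featureVector = new DenseVector(1000);  // encoded features for one instance
              double r = m.classifyScalar(featureVector);    // probability of the positive class
              System.out.println("score = " + r);
          }
      }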
    • The Upshot
      One machine can go fast
      SITM trains on 2 billion examples in 3 hours
      Deployability pays off big
      simple sample server farm