Data mining-2011-09

Talk given on September 20 to the Bay Area data mining group. The basic idea is that integrating map-reduce programs with the real world is easier than ever.


Transcript

  • 1. Data-mining, Hadoop and the Single Node
  • 2. Map-Reduce
    Input → map → shuffle → reduce → output
  • 3. MapR's Streaming Performance
    11 x 7200rpm SATA vs. 11 x 15Krpm SAS
    MB per sec (higher is better)
    Tests: i. 16 streams x 120GB, ii. 2000 streams x 1GB
  • 4. Terasort on MapR
    10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm
    Elapsed time (mins), lower is better
  • 5. Data Flow Expected Volumes
    Node: 6 x 1 Gb/s = 600 MB/s
    Storage: 12 x 100 MB/s = 900 MB/s
  • 6. MUCH faster for some operations
    Same 10 nodes …
    Create rate vs. # of files (millions)
  • 7. Universal export to self
    Cluster nodes: each node runs a Task and an NFS server
  • 8. Cluster nodes, each running a Task and an NFS server
    Nodes are identical
  • 9. Sharded text indexing
    Map: assign documents to shards
    Reducer: index text to local disk, then copy the index to the clustered index storage
    Search engine: a copy to local disk is typically required before the index can be loaded
  • 10. Conventional data flow
    Input documents → map → reducer → local disk → clustered index storage → local disk → search engine
    Failure of the search engine requires another download of the index from clustered storage.
    Failure of a reducer causes garbage to accumulate on the local disk.
  • 11. Simplified NFS data flows
    Index to the task work directory via NFS
    Input documents → map → reducer → clustered index storage → search engine
    Failure of a reducer is cleaned up by the map-reduce framework.
    The search engine reads the mirrored index directly.
  • 12. K-means, the movie
    Input → assign each point to the nearest centroid → aggregate new centroids → centroids
  • 13. But …
  • 14. Parallel Stochastic Gradient Descent
    Input → train sub-models → average models → model (see the sketch below)
  • 15. Variational Dirichlet Assignment
    Input → gather sufficient statistics → update model → model
  • 16. Old tricks, new dogs
    Mapper: assign point to cluster; emit (cluster id, (1, point))
    Combiner and reducer: sum counts and weighted sum of points; emit (cluster id, (n, sum/n))
    Output to HDFS
    Centroids are read from local disk via the distributed cache (copied there from HDFS); results are written by map-reduce
  • 17. Old tricks, new dogs
    Mapper: assign point to cluster; emit (cluster id, (1, point))
    Combiner and reducer: sum counts and weighted sum of points; emit (cluster id, (n, sum/n))
    Output to HDFS
    Centroids are read directly via NFS from MapR FS; results are written by map-reduce
    (see the sketch below)
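    A plain-Java sketch of the k-means pass described on the two slides above, with the Hadoop job wiring left out; the class and method names are illustrative, but the record shapes follow the slides: map emits (cluster id, (1, point)) and the combiner/reducer emits (cluster id, (n, sum/n)).

    import java.util.AbstractMap.SimpleEntry;
    import java.util.List;
    import java.util.Map;

    public class KMeansPass {
      // map: assign the point to the nearest centroid, emit (clusterId, (count = 1, point))
      static Map.Entry<Integer, double[]> map(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
          double d = 0;
          for (int i = 0; i < point.length; i++) {
            double diff = point[i] - centroids[c][i];
            d += diff * diff;
          }
          if (d < bestDist) { bestDist = d; best = c; }
        }
        return new SimpleEntry<>(best, point);
      }

      // combine/reduce: per cluster, sum counts and points, then emit (n, sum/n)
      static double[][] reduce(List<Map.Entry<Integer, double[]>> emitted, int k, int dim) {
        long[] n = new long[k];
        double[][] sum = new double[k][dim];
        for (Map.Entry<Integer, double[]> e : emitted) {
          n[e.getKey()]++;
          for (int i = 0; i < dim; i++) {
            sum[e.getKey()][i] += e.getValue()[i];
          }
        }
        for (int c = 0; c < k; c++) {
          for (int i = 0; i < dim; i++) {
            sum[c][i] /= Math.max(n[c], 1);   // sum/n is the new centroid
          }
        }
        return sum;   // new centroids, written back to the file store by the job
      }
    }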
  • 18. Poor man’s Pregel
    Mapper (lines in bold can use conventional I/O via NFS):
    while not done:
        read and accumulate input models
        for each input:
            accumulate model
        write model
        synchronize
        reset input format
    emit summary
  • 19.
  • 20. Mahout
    Scalable Data Mining for Everybody
  • 21. What is Mahout
    Recommendations (people who x this also x that)
    Clustering (segment data into groups)
    Classification (learn decision making from examples)
    Stuff (LDA, SVD, frequent item-set, math)
  • 22. What is Mahout?
    Recommendations (people who x this also x that)
    Clustering (segment data into groups)
    Classification (learn decision making from examples)
    Stuff (LDA, SVM, frequent item-set, math)
  • 23. Classification in Detail
    Naive Bayes Family
    Hadoop based training
    Decision Forests
    Hadoop based training
    Logistic Regression (aka SGD)
    fast on-line (sequential) training
  • 24. Classification in Detail
    Naive Bayes Family
    Hadoop based training
    Decision Forests
    Hadoop based training
    Logistic Regression (aka SGD)
    fast on-line (sequential) training
  • 25. So What?
    big starts here
    Online training has low overhead for small and moderate-sized data sets
  • 26. An Example
  • 27. And Another
    From:  Dr. Paul Acquah
    Dear Sir,
    Re: Proposal for over-invoice Contract Benevolence
    Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit.  I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 EUR (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor.
    ...
    Date: Thu, May 20, 2010 at 10:51 AM
    From: George <george@fumble-tech.com>
    Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
  • 28. Mahout’s SGD
    Learns on-line per example
    O(1) memory
    O(1) time per training example
    Sequential implementation
    fast, but not parallel
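    A minimal sketch of this kind of sequential training, assuming Mahout's SGD classes (OnlineLogisticRegression with an L1 prior); the two-feature data set is a made-up toy.

    import java.util.Random;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class SgdSketch {
      public static void main(String[] args) {
        // two categories, two features (bias + x), L1 prior for sparse models
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, 2, new L1()).learningRate(0.5).lambda(1e-4);

        Random rand = new Random(42);
        for (int i = 0; i < 10000; i++) {
          double x = rand.nextDouble();
          int label = x > 0.5 ? 1 : 0;                       // toy target
          Vector v = new DenseVector(new double[] {1, x});   // bias term + feature
          learner.train(label, v);                           // O(1) memory, O(1) work per example
        }

        // classifyScalar returns the estimated probability of category 1
        System.out.println(learner.classifyScalar(new DenseVector(new double[] {1, 0.9})));
        System.out.println(learner.classifyScalar(new DenseVector(new double[] {1, 0.1})));
      }
    }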
  • 29. Special Features
    Hashed feature encoding
    Per-term annealing
    learn the boring stuff once
    Auto-magical learning knob turning
    learns correct learning rate, learns correct learning rate for learning learning rate, ...
  • 30. Feature Encoding
  • 31. Hashed Encoding
  • 32. Feature Collisions
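    A minimal sketch of hashed encoding, assuming Mahout's vectorizer encoders; the field names and the 1000-dimensional vector size are illustrative. Words are hashed into a fixed-width vector, so no dictionary is needed and occasional collisions are tolerated.

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedEncodingSketch {
      public static void main(String[] args) {
        StaticWordValueEncoder words = new StaticWordValueEncoder("body");
        ConstantValueEncoder intercept = new ConstantValueEncoder("intercept");

        // a fixed 1000-dimensional vector regardless of vocabulary size
        Vector v = new RandomAccessSparseVector(1000);
        intercept.addToVector("1", v);
        for (String word : "was a pleasure talking to you last night".split(" ")) {
          words.addToVector(word, v);   // hash each word into one or more positions
        }
        System.out.println(v);
      }
    }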
  • 33. Learning Rate Annealing
    Learning rate vs. # of training examples seen
  • 34. Per-term Annealing
    Learning rate vs. # of training examples seen, for a common feature and a rare feature
  • 35. General Structure
    OnlineLogisticRegression
    Traditional logistic regression
    Stochastic Gradient Descent
    Per-term annealing
    Too fast (for the disk + encoder)
  • 36. Next Level
    CrossFoldLearner
    contains multiple primitive learners
    online cross validation
    5x more work
  • 37. And again
    AdaptiveLogisticRegression
    20 x CrossFoldLearner
    evolves good learning and regularization rates
    100 x more work than basic learner
    still faster than disk + encoding
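    A minimal sketch of the adaptive layer, assuming Mahout's AdaptiveLogisticRegression API; the toy data and sizes are illustrative.

    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class AdaptiveSketch {
      public static void main(String[] args) {
        // two categories, two features; the adaptive wrapper evolves learning and regularization rates
        AdaptiveLogisticRegression learner = new AdaptiveLogisticRegression(2, 2, new L1());

        for (int i = 0; i < 10000; i++) {
          double x = Math.random();
          Vector v = new DenseVector(new double[] {1, x});
          learner.train(x > 0.5 ? 1 : 0, v);
        }
        learner.close();   // finish any pending training

        // after enough examples, pull out the best evolved learner for evaluation or deployment
        CrossFoldLearner best = learner.getBest().getPayload().getLearner();
        System.out.println(best.auc());
      }
    }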
  • 38. A comparison
    Traditional view
    400 x (read + OLR)
    Revised Mahout view
    1 x (read + mu x 100 x OLR) x eta
    mu = efficiency from killing losers early
    eta = efficiency from stopping early
  • 39. Click modeling architecture
    Input → feature extraction and down-sampling (map-reduce) → data join → sequential SGD learning
    Side-data now via NFS
  • 40. Click modeling architecture
    Input → feature extraction and down-sampling (map-reduce) → data join (map-reduce) → multiple sequential SGD learners in parallel
    Side-data: map-reduce cooperates with NFS
  • 41. Deployment
    Training
    ModelSerializer.writeBinary(..., model)
    Deployment
    m = ModelSerializer.readBinary(...)
    r = m.classifyScalar(featureVector)
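    A hedged sketch of the same train/deploy split; the ModelSerializer signatures are assumed from the Mahout SGD examples, and the file path is illustrative only.

    import java.io.FileInputStream;
    import org.apache.mahout.classifier.sgd.ModelSerializer;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    public class DeploySketch {
      private final OnlineLogisticRegression model;

      // deployment side: load the model once per server, then score requests cheaply
      DeploySketch(String path) throws Exception {
        model = ModelSerializer.readBinary(new FileInputStream(path), OnlineLogisticRegression.class);
      }

      double score(Vector featureVector) {
        return model.classifyScalar(featureVector);   // probability of the positive class
      }

      // training side: write the learned model out, e.g. to NFS-mounted cluster storage
      static void save(OnlineLogisticRegression trained, String path) throws Exception {
        ModelSerializer.writeBinary(path, trained);    // "/tmp/click.model" or similar, path is illustrative
      }
    }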
  • 42. The Upshot
    One machine can go fast
    SITM trains on 2 billion examples in 3 hours
    Deployability pays off big
    simple sample server farm