eHarmony in the Cloud

    eHarmony in the Cloud - Presentation Transcript

    1. eHarmony in the Cloud • Brian Ko
    2. eHarmony • Online subscription-based matchmaking service • Available in the United States, Canada, Australia and the United Kingdom • On average, 236 members in the US marry every day • More than 20 million registered users
    3. Why Cloud? • The problem exceeds the limits of the data center and data warehouse environment • Leverage EC2 and Hadoop to scale data processing
    4. Finding Matches • Model creation
    5. Finding Matches • Matching
    6. Finding Matches • Predictive model scores
    7. Requirements • All the matches, scores, and user information should be archived daily • Ready for 10X growth • Possibly an O(n²) problem • Need to support a set of models that keeps becoming more complex
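To see why 10X growth is a concern for a possibly O(n²) problem, here is a minimal sketch. It assumes the worst case in which every user is compared with every other user; the slides only say the problem is "possibly" quadratic, so real matching may prune most pairs. The function name `candidate_pairs` is hypothetical.

```python
def candidate_pairs(n_users: int) -> int:
    """Unordered pairs when every user is compared with every other user."""
    return n_users * (n_users - 1) // 2


# 20 million registered users is the figure quoted in the slides.
current_users = 20_000_000
grown_users = current_users * 10  # the "10X growth" requirement

# Assumption: naive all-pairs scoring, the worst case for this workload.
growth_in_pairs = candidate_pairs(grown_users) / candidate_pairs(current_users)
print(f"10X users -> about {growth_in_pairs:.0f}X candidate pairs")
```

Under that assumption, ten times the users means roughly a hundred times the pairs to score and archive.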
    8. Challenge • Current architecture is multi-tiered with a relational back-end • Scoring is DB join intensive • Data needs constant archiving – Matches, match scores, user attributes at time of match creation – Model validation is done at a later time across many days • Need a non-DB solution
    9. Solution • Hadoop: an open source Java implementation of Google’s MapReduce framework – Distributes work across vast amounts of data – Hadoop Distributed File System (HDFS) provides reliability through replication – Automatic re-execution on failure/distribution – Scales horizontally on commodity hardware
    10. S3 and EC2 • Simple Storage Service (S3) provides cheap, unlimited storage • Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand
    11. MapReduce • A large server farm can use MapReduce to process a huge dataset • Map step – Master node takes the input – Chops it up into smaller sub-problems – Distributes those to worker nodes • Reduce step – Master node takes the answers to all the sub-problems – Combines them in a way to get the output
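The map and reduce steps on this slide can be sketched in a single process. This is an illustration only, not Hadoop: a thread pool stands in for the worker nodes, and the names `split`, `worker` and `run` as well as the sum-of-squares task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Chop the input into at most n_chunks smaller sub-problems."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker(chunk):
    """Solve one sub-problem; here, the sum of squares of a chunk."""
    return sum(x * x for x in chunk)


def run(data, n_workers=4):
    # Map step: split the input and hand the pieces to the workers.
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))
    # Reduce step: combine the partial answers into the final output.
    return sum(partials)


total = run(list(range(10)))
print(total)
```

On a real cluster the chunks would be file blocks and the workers separate machines, but the split, distribute and combine shape is the same.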
    12. Why Hadoop • Mapper and Reducer are written by you • Hadoop provides – Parallelization – Shuffle and sort
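The division of labor on this slide can be sketched as follows. The `mapper` and `reducer` are the parts you would write; `run_job` is a toy, single-process stand-in for the shuffle and sort that Hadoop provides. The record layout (user id, match score) and the "best score per user" task are assumptions for illustration, not taken from the talk.

```python
from itertools import groupby
from operator import itemgetter


def mapper(record):
    """User code: emit (key, value) pairs for one input record."""
    user_id, score = record
    yield user_id, score


def reducer(key, values):
    """User code: fold all values that share a key into one result."""
    return key, max(values)


def run_job(records):
    """Stand-in for the framework: map, shuffle and sort by key, reduce."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))  # the shuffle-and-sort phase
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


best = run_job([("u1", 0.4), ("u2", 0.9), ("u1", 0.7)])
print(best)
```

Sorting before grouping matters: `itertools.groupby` only merges adjacent items, which mirrors why the framework sorts by key before calling the reducer.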
    13. Actual Process • Upload to S3 and start EC2 cluster
    14. Actual Process • Process and archive
    15. Amazon Elastic MapReduce • It is a web service • EC2 cluster is managed for you behind the scenes • Starts the Hadoop implementation of the MapReduce framework on Amazon EC2 • Each step can read and write data directly from and to S3 • Based on Hadoop 0.18.3
    16. Elastic MapReduce • No need to explicitly allocate, start and shut down EC2 instances • Individual jobs used to be managed by a remote script running on the master node (no longer required) • Jobs are arranged into a job flow, created with a single command • Status of a job flow and all its steps is accessible through a REST service
    17. Before Elastic MapReduce • Allocate/verify cluster • Push application to cluster • Run a control script on the master • Kick off each job step on the master • Create and detect a job completion token • Shut the cluster down
    18. After Elastic MapReduce • With Elastic MapReduce we can do all this with a single local command • Uses jar and conf files stored on S3 • Various monitoring tools for EC2 and S3 are provided
    19. Development Environment • Cheap to set up on Amazon • Quick setup: the number of servers is controlled by a config variable • Identical to production • Separate development account recommended
    20. Cost Comparison • Average EC2 and S3 cost – Each run is 2 to 3 hours – $1200/month for EC2 – $100/month for S3 • Projected in-house cost – $5000/month for a local cluster of 50 nodes running 24/7 – A new company would also need to add data center and operations personnel expenses
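The comparison on this slide works out as below. The dollar figures are the ones quoted in the slide; the variable names are illustrative, and the in-house figure excludes the data center and personnel costs the slide mentions, since no amounts are given for them.

```python
# Monthly figures quoted in the slide, in US dollars.
ec2_monthly = 1200
s3_monthly = 100
in_house_monthly = 5000  # local 50-node cluster running 24/7

cloud_monthly = ec2_monthly + s3_monthly
savings = in_house_monthly - cloud_monthly
savings_fraction = savings / in_house_monthly

print(f"Cloud: ${cloud_monthly}/month, in-house: ${in_house_monthly}/month")
print(f"Difference: ${savings}/month ({savings_fraction:.0%} lower)")
```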
    21. Summary • Dev tools are easy to work with and work right out of the box • Standard Hadoop AMI worked great • Easy to write unit tests for MapReduce • Hadoop community support is great • EC2/S3/EMR are cost effective
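As an illustration of why MapReduce code is easy to unit test: a mapper is just a function from an input record to key/value pairs, so it can be tested without a cluster. The sketch below is a Python stand-in (the talk's Hadoop jobs would have been Java), and the `score_mapper` function and its tab-separated input format are hypothetical.

```python
import unittest


def score_mapper(line):
    """Hypothetical mapper: parse 'user_id<TAB>score' into (key, value)."""
    user_id, score = line.rstrip("\n").split("\t")
    return user_id, float(score)


class ScoreMapperTest(unittest.TestCase):
    def test_parses_key_and_value(self):
        self.assertEqual(score_mapper("u1\t0.5\n"), ("u1", 0.5))

    def test_rejects_malformed_line(self):
        # A line without a tab cannot be unpacked into two fields.
        with self.assertRaises(ValueError):
            score_mapper("no-tab-here")


# Run the tests in-process, without calling sys.exit().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ScoreMapperTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The same idea carries over to reducers: feed a key and a list of values, then assert on the output.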
    22. The End • 5 minutes of question time starts now!
    This is a lightning presentation given by Brian Ko.

    More info about this document

    © All Rights Reserved

    Go to text version

    • Total Views 270
      • 262 on SlideShare
      • 8 from embeds
    • Comments 0
    • Favorites 0
    • Downloads 3
    Most viewed embeds
    • 8 views on http://craigsdickson.me

    more

    All embeds
    • 8 views on http://craigsdickson.me

    less

    Flagged as inappropriate Flag as inappropriate
    Flag as inappropriate

    Select your reason for flagging this presentation as inappropriate. If needed, use the feedback form to let us know more details.

    Cancel
    File a copyright complaint
    Having problems? Go to our helpdesk?

    Categories