AWS Customer Presentation - eHarmony


Ben Hardy, Senior Software Engineer at eHarmony, talks about the company's use of AWS to power its matching algorithms.

  • Here are some facts and figures on us: 2% of US marriages.
  • DB joins, etc. Models are CPU- and I/O-intensive and need to be tested. In an offline system we can take advantage of aggregate data without constraining our online system.
  • Getting our data to and from EC2 is definitely non-trivial. Steps 1, 2, 6, and 7 are outside the cloud.
  • EMR simplifies the process and scripting for us by consolidating the allocation, Hadoop configuration, and process control of the jobs.
  • Before EMR: lots of steps, no fun, and lots of possible points of failure. With EMR: no need to copy the job to the master, or even touch the master in any way. Uses Amazon's elastic-mapreduce.rb utility script.
  • Status of the job flow, and status of the steps in each job flow.
  • Design for failure

    1. Matchmaking in the Cloud: A study on Amazon EC2, Elastic MapReduce and Apache Hadoop at eHarmony. Ben Hardy, Sr. Software Engineer
    2. About eHarmony: Online subscription-based matchmaking service. Launched in 2000. Available in the United States, Canada, Australia and the United Kingdom. On average, 236 members in the US marry every day. More than 20 million registered users. Matching models are based on decades of research and clinical experience in psychology.
    3. Business use case: To combine match data, user feedback, and outcomes to help research improve the success of future matching models.
    4. Scorer: Each match is evaluated and given a score. The score serves as a predictor of match quality. The score is compared later with the match outcome to inform the research team.
    5. Challenges to meet: All possible matches = O(n^2) problem. Database is a bottleneck. Input: tens of GB of matches, scores and constantly changing user features are archived daily. Output: TB of data currently archived and growing. Desire scalability to 10x our current user base. More complex, data-heavy models being added.
    6. How Hadoop solved our problem: Our problem can be broken up into a series of MapReduce steps: (1) join match data and user A attributes into one line; (2) join the above with user B attributes and calculate the match score; (3) group the match scores by user.
    7. Architecture Overview [diagram; labels: Warehouse, Data unload, s3put, User and Match data, S3, EC2, Hadoop, Cluster control (start, verify, get job status, shutdown), s3get, Local store, Store Score Data]
    8. How AWS solved our problem: Quick and easy prototyping. Cost savings. Reduced headache. Easier resource management. Scaling is a no-brainer. S3 provides cheap permanent storage.
    9. AWS Elastic MapReduce (beta): Only need to think in terms of an Elastic MapReduce job flow. EC2 cluster is managed for you behind the scenes. Each job flow has one or more steps. Each step is a Hadoop MapReduce process. Each step can read and write data directly from and to S3. Based on Hadoop 0.18.3.
    10. Simplified Job Control: Before EMR, we had to explicitly: allocate cluster; verify cluster; push application to cluster; run a control script on the master; kick off each job step on the master; create and detect a job completion token; shut the cluster down. Over 150 lines of scripts just for management. After EMR, it was one single local command:
        #{ELASTIC_MR_UTIL} --create --name #{JOB_NAME} --num-instances #{NUM_INSTANCES} --instance-type #{INSTANCE_TYPE} --key_pair #{KEY_PAIR_NAME} --log-uri #{SCORER_LOG_BUCKET_URL} --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.join.JoinJob --arg -xconf --arg #{MASTER_CONF_DIR}/join-config.xml --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.scorer.ScorerJob --arg -xconf --arg #{MASTER_CONF_DIR}/scorer-config.xml --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.combiner.CombinerJob --arg -xconf --arg #{MASTER_CONF_DIR}/combiner-config-#{TARGET_ENV}.xml
        Jar and config are on S3, and job status is queried via a REST interface.
    11. AWS Management Console: Elastic MapReduce (beta)
    12. Points of caution, Elastic MapReduce (beta): Provisioning of the servers is not yet stable. Frequently failed in the first two weeks of the beta program. It blocks until provisioning is complete. Only billed for job execution time. Amazon is working on improving reliability. It's handy to terminate a hung job with the Amazon Web Services Management Console.
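
The three MapReduce steps described on the "How Hadoop solved our problem" slide can be pictured with a small in-memory sketch. This is not eHarmony's code: the record layout, the function names, and the toy scoring function are all assumptions made for illustration, and a real job would run each step as a Hadoop MapReduce process over files in S3 rather than over Python lists. Grouping by user A only is also an assumption.

```python
from collections import defaultdict

def join_user_a(matches, users):
    """Step 1: join each match with user A's attributes into one record."""
    return [{"user_a": a, "user_b": b, "a_attrs": users[a]} for a, b in matches]

def join_user_b_and_score(joined, users, score_fn):
    """Step 2: join with user B's attributes and calculate the match score."""
    return [
        (rec["user_a"], rec["user_b"], score_fn(rec["a_attrs"], users[rec["user_b"]]))
        for rec in joined
    ]

def group_by_user(scored):
    """Step 3: group the match scores by user (here: by user A, an assumption)."""
    grouped = defaultdict(list)
    for user_a, user_b, score in scored:
        grouped[user_a].append((user_b, score))
    return dict(grouped)

def toy_score(a_attrs, b_attrs):
    """Hypothetical stand-in for a matching model: count equal attribute values."""
    return sum(1 for key in a_attrs if a_attrs[key] == b_attrs.get(key))

# Made-up sample data, for illustration only.
users = {
    "u1": {"city": "LA", "smoker": False},
    "u2": {"city": "LA", "smoker": True},
    "u3": {"city": "NY", "smoker": False},
}
matches = [("u1", "u2"), ("u1", "u3"), ("u2", "u3")]

joined = join_user_a(matches, users)
scored = join_user_b_and_score(joined, users, toy_score)
result = group_by_user(scored)
print(result)
```

Splitting the work into two joins followed by a grouping step mirrors the slide: each stage only needs the output of the previous one, which is what lets the stages run as separate steps in one job flow.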
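
Since provisioning was unstable and a job could hang, one way to "design for failure" is a watchdog that polls the job flow's status and terminates it after a deadline. The sketch below is a hedged illustration only: `get_state` and `terminate` are hypothetical callables standing in for whatever status query and termination mechanism is used (the slides mention a REST interface and the Management Console), and the state strings are placeholders, not a confirmed list of EMR states.

```python
import time

# Placeholder terminal states (assumed names, not taken from the slides).
TERMINAL_STATES = ("COMPLETED", "FAILED", "TERMINATED")

def wait_for_job_flow(get_state, terminate, timeout_s, poll_s=30.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the job flow reaches a terminal state.

    If timeout_s elapses first, call terminate() and return "TIMED_OUT".
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            terminate()  # give up on a hung job flow
            return "TIMED_OUT"
        sleep(poll_s)

class FakeClock:
    """Deterministic clock for the demo: sleeping just advances the time."""
    def __init__(self):
        self.now = 0.0
    def __call__(self):
        return self.now
    def sleep(self, seconds):
        self.now += seconds

# Demo: a job flow that never leaves its starting state gets terminated.
demo_clock = FakeClock()
terminated = []
outcome = wait_for_job_flow(
    get_state=lambda: "STARTING",
    terminate=lambda: terminated.append(True),
    timeout_s=60, poll_s=30,
    sleep=demo_clock.sleep, clock=demo_clock,
)
print(outcome, terminated)
```

Injecting the clock and sleep functions is a design choice for testability; in real use the defaults (`time.sleep`, `time.monotonic`) would apply.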