
AWS Customer Presentation - eHarmony

Ben Hardy, Senior Software Engineer at eHarmony, talks about their use of AWS to power their matching algorithms.



  1. Matchmaking in the Cloud: A study on Amazon EC2, Elastic MapReduce and Apache Hadoop at eHarmony. Ben Hardy, Sr. Software Engineer.
  2. About eHarmony: An online subscription-based matchmaking service, launched in 2000 and available in the United States, Canada, Australia and the United Kingdom. On average, 236 US members marry every day. More than 20 million registered users. Matching models are based on decades of research and clinical experience in psychology.
  3. Business use case: Combine match data, user feedback, and outcomes to help the research team improve the success of future matching models.
  4. Scorer: Each match is evaluated and given a score. The score serves as a predictor of match quality, and is later compared with the match outcome to inform the research team.
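The scorer described above can be sketched as a function from pair features to a predicted quality score. This is a minimal illustration only; the feature names, weights, and logistic squash below are hypothetical stand-ins, not eHarmony's actual model.

```java
// Minimal sketch of a match scorer: combines per-pair compatibility
// features into a single score predicting match quality.
// Feature names and weights are hypothetical illustrations.
public class MatchScorer {
    // Weighted sum of compatibility features, squashed into (0, 1).
    public static double score(double valuesSimilarity,
                               double interestsOverlap,
                               double distancePenalty) {
        double raw = 2.0 * valuesSimilarity
                   + 1.0 * interestsOverlap
                   - 0.5 * distancePenalty;
        return 1.0 / (1.0 + Math.exp(-raw)); // logistic squash
    }

    public static void main(String[] args) {
        // A well-matched pair should score higher than a poorly matched one.
        System.out.printf("good=%.3f poor=%.3f%n",
                          score(0.9, 0.8, 0.1), score(0.2, 0.1, 0.9));
    }
}
```

Because the score is a single number per match, it can be archived cheaply and compared against the eventual outcome months later.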
  5. Challenges to meet: Scoring all possible matches is an O(n²) problem, and the database is a bottleneck. Input: tens of GB of matches, scores, and constantly changing user features are archived daily. Output: TBs of data archived and growing. We want to scale to 10x our current user base, and more complex, data-heavy models are being added.
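To make the O(n²) claim concrete: with n users there are n(n-1)/2 unordered candidate pairs, so 20 million registered users give roughly 2×10¹⁴ pairs, far beyond what a single database can evaluate. A quick arithmetic sketch:

```java
// Why all-pairs matching is O(n^2): the number of unordered
// candidate pairs grows as n * (n - 1) / 2.
public class PairCount {
    public static long pairs(long n) {
        return n * (n - 1) / 2;
    }

    public static void main(String[] args) {
        long users = 20_000_000L;             // ~20M registered users
        System.out.println(pairs(users));     // ~2 x 10^14 candidate pairs
        // Growing to 10x the user base multiplies the pair count ~100x,
        // which is why horizontal scaling matters.
        System.out.println(pairs(10 * users) / pairs(users));
    }
}
```

In practice only a filtered subset of pairs is scored, but the quadratic growth is what motivates moving the work off the database and onto a cluster.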
  6. How Hadoop solved our problem: The problem breaks down into a series of MapReduce steps: join the match data with user A's attributes into one record; join the result with user B's attributes and calculate the match score; then group the match scores by user.
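The three steps above can be simulated locally in plain Java to show the data flow. This is a sketch, not the actual Hadoop 0.18 job code: the sample records and the scoring rule (a simple product of two features) are hypothetical, and the real jobs read from and write to S3 on a cluster.

```java
import java.util.*;

// Local simulation of the three MapReduce steps: (1) join matches with
// user A attributes, (2) join with user B attributes and compute the
// match score, (3) group the scores by user.
public class MatchPipeline {
    // Returns each user's list of match scores.
    public static Map<String, List<Double>> run(List<String[]> matches,
                                                Map<String, Double> attrs) {
        Map<String, List<Double>> scoresByUser = new TreeMap<>();
        for (String[] m : matches) {
            // Steps 1+2: join both users' attributes and score the match
            // (here simply the product of the two feature values).
            double score = attrs.get(m[0]) * attrs.get(m[1]);
            // Step 3: group the score under both users of the pair.
            scoresByUser.computeIfAbsent(m[0], k -> new ArrayList<>()).add(score);
            scoresByUser.computeIfAbsent(m[1], k -> new ArrayList<>()).add(score);
        }
        return scoresByUser;
    }

    public static void main(String[] args) {
        List<String[]> matches = List.of(new String[]{"u1", "u2"},
                                         new String[]{"u1", "u3"});
        Map<String, Double> attrs = Map.of("u1", 0.9, "u2", 0.8, "u3", 0.3);
        System.out.println(run(matches, attrs));
    }
}
```

On Hadoop, each step becomes its own map/reduce job, with the join keys (user IDs) acting as the shuffle keys between the map and reduce phases.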
  7. Architecture overview (diagram): User and match data are unloaded from the warehouse and uploaded to S3 with s3put; a cluster-control script starts the EC2 Hadoop cluster, verifies it, polls job status, and shuts it down; results are pulled back with s3get and the score data is stored locally.
  8. How AWS solved our problem: Quick and easy prototyping. Cost savings. Reduced headaches. Easier resource management. Scaling is a no-brainer. S3 provides cheap permanent storage.
  9. AWS Elastic MapReduce (beta): You only need to think in terms of an Elastic MapReduce job flow; the EC2 cluster is managed for you behind the scenes. Each job flow has one or more steps, each step is a Hadoop MapReduce process, and each step can read and write data directly from and to S3. Based on Hadoop 0.18.3.
  10. Simplified job control: Before EMR, we had to explicitly allocate the cluster, verify the cluster, create and push the application to the cluster, run a control script on the master, kick off each job step on the master, detect a job completion token, and shut the cluster down: over 150 lines of scripts just for management. After EMR, it was one single local command:

     #{ELASTIC_MR_UTIL} --create --name #{JOB_NAME} \
       --num-instances #{NUM_INSTANCES} --instance-type #{INSTANCE_TYPE} \
       --key_pair #{KEY_PAIR_NAME} --log-uri #{SCORER_LOG_BUCKET_URL} \
       --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.join.JoinJob \
         --arg -xconf --arg #{MASTER_CONF_DIR}/join-config.xml \
       --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.scorer.ScorerJob \
         --arg -xconf --arg #{MASTER_CONF_DIR}/scorer-config.xml \
       --jar #{SCORER_JAR_S3_PATH} --main-class #{MR_JAVA_PACKAGE}.combiner.CombinerJob \
         --arg -xconf --arg #{MASTER_CONF_DIR}/combiner-config-#{TARGET_ENV}.xml

     The jar and config live on S3, and job status is queried via the REST interface.
  11. AWS Management Console for Elastic MapReduce (beta) (screenshot).
  12. Points of caution with Elastic MapReduce (beta): Provisioning of the servers is not yet stable; it frequently failed in our first two weeks of the beta program. The call blocks until provisioning is complete, but you are only billed for job execution time. Amazon is working on improving reliability. The AWS Management Console is handy for terminating a hung job.