HW09: Matchmaking in the Cloud



  1. Matchmaking in the Cloud: Amazon Web Services and Apache Hadoop at eHarmony <ul><li>Ben Hardy, Senior Software Engineer </li></ul>CONFIDENTIAL
  2. <ul><li>You’ll learn how eHarmony: </li></ul><ul><ul><li>Used EC2 and Hadoop to develop a scalable solution for our large, real-world data problem </li></ul></ul><ul><ul><li>Overcame the limitations of our existing infrastructure </li></ul></ul><ul><ul><li>Reaped significant cost savings with this choice </li></ul></ul><ul><li>Also find out about new opportunities and challenges </li></ul>Why You’re Here
  3. <ul><li>Online subscription-based matchmaking service </li></ul><ul><li>Launched in 2000 </li></ul><ul><li>Available in the United States, Canada, Australia and the United Kingdom </li></ul><ul><li>On average, 236 members in the US marry every day* </li></ul><ul><li>More than 20 million registered users </li></ul>About eHarmony * Based on a survey conducted by Harris Interactive in 2007.
  4. <ul><li>We match couples using detailed compatibility models </li></ul><ul><li>Models are based on decades of research and clinical experience in psychology </li></ul><ul><li>Variety of user attributes </li></ul><ul><ul><ul><li>Demographic </li></ul></ul></ul><ul><ul><ul><li>Psychographic </li></ul></ul></ul><ul><ul><ul><li>Behavioral </li></ul></ul></ul><ul><li>New models constantly being tested and developed </li></ul><ul><li>Model evaluation is the gorilla in the room </li></ul>The Science of Matching
  5. Computational Requirements <ul><li>Tens of GB of matches, scores and constantly changing user features are archived daily </li></ul><ul><li>TBs of data currently archived and growing </li></ul><ul><li>Want to support 10x our current user base </li></ul><ul><li>All possible matches = O(n²) problem </li></ul><ul><li>Support a growing set of models that may be </li></ul><ul><ul><li>arbitrarily complex </li></ul></ul><ul><ul><li>computationally and I/O expensive </li></ul></ul>
  6. <ul><li>Current architecture is multi-tiered with a relational back-end </li></ul><ul><li>Scoring is DB join intensive </li></ul><ul><li>Data needs constant archiving </li></ul><ul><ul><li>Matches, match scores, user attributes at time of match creation </li></ul></ul><ul><ul><li>Model validation is done at a later time across many days </li></ul></ul><ul><li>Need a non-DB solution better suited to big data crunching </li></ul>Scaling Challenges
  7. <ul><li>Good fit for our problem </li></ul><ul><ul><li>Need to process entire match pool (n²) </li></ul></ul><ul><ul><li>Data easily partitioned </li></ul></ul><ul><li>Hadoop provides </li></ul><ul><ul><li>Horizontally scalable parallel processing </li></ul></ul><ul><ul><li>Work distribution </li></ul></ul><ul><ul><li>Distributed storage </li></ul></ul><ul><ul><li>Fault tolerance </li></ul></ul><ul><ul><li>Job monitoring </li></ul></ul><ul><li>Hadoop is an Apache project </li></ul>Hadoop Addresses Scaling Needs
  8. Computing on AWS <ul><li>Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand </li></ul><ul><li>Elastic MapReduce </li></ul><ul><ul><li>Hosted Hadoop framework on top of EC2 and S3 </li></ul></ul><ul><ul><li>Simplifies end-to-end processing in the cloud </li></ul></ul><ul><ul><li>Pricing is in addition to EC2 </li></ul></ul><ul><li>Simple Storage Service (S3) </li></ul><ul><ul><li>Provides cheap unlimited storage </li></ul></ul><ul><ul><li>Highly configurable security using ACLs </li></ul></ul>
  9. AWS Pricing Model <ul><li>Pay-per-use elastic model </li></ul><ul><li>Choice of server type </li></ul><ul><li>Lets you get up and running quickly and cheaply </li></ul><ul><li>Highly cost-effective alternative to doing it in-house </li></ul><ul><li>Allows rapid horizontal scaling on demand </li></ul>
  10. Architecture (diagram): the data warehouse dumps user data and uploads it to S3; Elastic MapReduce Hadoop jobs in the Amazon cloud read their input from and write their output to S3; results are downloaded and update a result keystore back in the data warehouse.
  11. MapReduce Overview <ul><li>Applications are modeled as a series of maps and reductions </li></ul><ul><li>In the map phase, values are assigned to keys </li></ul><ul><li>Shuffle and sort </li></ul><ul><li>In the reduce phase, values are combined for each key </li></ul><ul><li>Example - Word Count </li></ul><ul><ul><li>Counts the frequency of words </li></ul></ul><ul><ul><li>Modeled as one Map and one Reduce </li></ul></ul><ul><ul><li>Data is modeled as key -> value pairs </li></ul></ul>
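The word-count example above can be sketched outside Hadoop; the following is a minimal Python illustration of the map, shuffle-and-sort, and reduce phases (the production jobs described in this deck were written against the Hadoop Java API, so this is purely for intuition):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def reducer(key, values):
    # Reduce phase: combine all values assigned to one key.
    return key, sum(values)

def word_count(lines):
    # Shuffle and sort: group the mapper's output by key, as Hadoop
    # does between the map and reduce phases.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(key, (count for _, count in group))
                for key, group in groupby(pairs, key=itemgetter(0)))

print(word_count(["the quick brown fox", "the lazy dog"]))
```

In Hadoop the same three stages run distributed across the cluster, with the framework performing the shuffle and sort.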
  12. Model Validation with MapReduce <ul><li>Complex application uses a series of 3 MapReduce jobs </li></ul><ul><li>Match Scoring procedure for pairs of users: </li></ul><ul><ul><li>Join match data with left-side User attributes into one line </li></ul></ul><ul><ul><li>Join above with right-side User attributes, calculate resulting match score </li></ul></ul><ul><ul><li>Group match scores by user </li></ul></ul><ul><li>Temporary files in HDFS hold results between jobs </li></ul>
  13. Data Flow (diagram): 3 MapReduce jobs: match info is joined with users (left side); the result is joined with users (right side) and scored; scores are grouped by user into the results, with temp files held between jobs.
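As a rough in-memory illustration of the three jobs above (field names, data shapes, and the score function are invented for this sketch; the real jobs run as distributed Hadoop MapReduce over data in HDFS and S3):

```python
# Job 1: join match pairs with the left-side user's attributes.
def join_left(matches, users):
    return [dict(m, left=users[m["left_id"]]) for m in matches]

# Job 2: join with the right-side user's attributes and score the pair.
def join_right_and_score(joined, users, score_fn):
    return [dict(m, right=users[m["right_id"]],
                 score=score_fn(m["left"], users[m["right_id"]]))
            for m in joined]

# Job 3: group match scores by (left-side) user.
def group_by_user(scored):
    grouped = {}
    for m in scored:
        grouped.setdefault(m["left_id"], []).append(m["score"])
    return grouped

# Toy data; the real scoring function is a compatibility model.
users = {1: {"age": 30}, 2: {"age": 31}, 3: {"age": 40}}
matches = [{"left_id": 1, "right_id": 2}, {"left_id": 1, "right_id": 3}]
score = lambda a, b: 1.0 / (1 + abs(a["age"] - b["age"]))
scored = join_right_and_score(join_left(matches, users), users, score)
print(group_by_user(scored))
```

In the distributed version each "join" is a reduce-side join keyed on the user id, and the intermediate lists correspond to the temp files held in HDFS between jobs.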
  14. AWS Elastic MapReduce <ul><li>Only need to think in terms of an Elastic MapReduce job flow </li></ul><ul><li>EC2 cluster is managed for you behind the scenes </li></ul><ul><li>Each job flow has one or more steps </li></ul><ul><li>Each step is a Hadoop MapReduce process </li></ul><ul><li>Each step can read and write data directly from and to S3 or HDFS </li></ul><ul><li>Based on Hadoop 0.18.3 </li></ul>
  15. Elastic MapReduce for eHarmony <ul><li>Vastly simplified our Hadoop processing </li></ul><ul><ul><li>No need to explicitly allocate, start and shut down EC2 instances </li></ul></ul><ul><ul><li>No need to explicitly manipulate the master node </li></ul></ul><ul><li>Status of a job flow and all its steps is accessible via a REST service </li></ul>
  16. Simple Job Control <ul><li>Cluster control and job management reduced to a single local command </li></ul><ul><li>Uses Amazon’s EMR Ruby script </li></ul><ul><li>Uses jar and conf files stored on S3 </li></ul>elastic_mapreduce.rb --create --name #{JOB_NAME} --num-instances #{NODES} --instance-type #{INST_TYPE} --key_pair #{KEY} --log-uri #{LOGDIR} --jar #{JAR} --main-class #{JOIN_CLASS} --arg -xconf --arg #{CONF}/join-config.xml --jar #{JAR} --main-class #{SCORER_CLASS} --arg -xconf --arg #{CONF}/scorer-config.xml --jar #{JAR} --main-class #{COMBINER_CLASS} --arg -xconf --arg #{CONF}/combiner-config.xml
  17. Development &amp; Test Environments <ul><li>Cheap to set up and experiment on Amazon </li></ul><ul><li>Quick setup </li></ul><ul><ul><li>Number of servers is controlled by a config variable </li></ul></ul><ul><li>Can test a setup identical to production </li></ul><ul><li>Performance testing is easy with a big cluster </li></ul><ul><li>Integration testing is easy with a small cluster and a subset of the input data </li></ul><ul><li>Separate development and test accounts recommended </li></ul>
  18. Performance by Instance Type (chart; execution time in minutes per instance type)
  19. Total Execution Time (chart)
  20. Administration Tools <ul><li>AWS Console </li></ul><ul><li>ElasticFox Firefox plugin for EC2 </li></ul><ul><li>Hadoop status web pages </li></ul><ul><li>aws-s3 Ruby gem with irb shell </li></ul><ul><li>Tim Kay’s AWS command line tool for S3 </li></ul><ul><li>S3Fox Firefox plugin for S3 </li></ul>
  21. AWS Management Console <ul><li>Useful for Elastic MapReduce </li></ul><ul><ul><li>Start or terminate a job flow </li></ul></ul><ul><ul><li>Track execution of jobs in a job flow </li></ul></ul><ul><li>Useful for vanilla EC2 as well </li></ul><ul><ul><li>Start and stop clusters, nodes </li></ul></ul><ul><ul><li>Get machine addresses to view Hadoop status </li></ul></ul>
  23. AWS Management Console (screenshot: EC2 Console Dashboard)
  26. Hadoop DFS: Monitor Disk Usage (screenshot)
  27. Challenges <ul><li>The overall process depends on the success of each stage </li></ul><ul><li>Assume every stage is unreliable </li></ul><ul><li>Need to build retry/abort logic to handle failures </li></ul>
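A minimal sketch of the kind of retry/abort wrapper this implies (stage names, retry counts, and the pipeline shape are illustrative assumptions, not eHarmony's actual code):

```python
import time

def run_with_retries(stage, fn, retries=2, delay=0.0):
    # Run one pipeline stage; retry on failure, abort once attempts
    # are exhausted.
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(
        f"stage {stage!r} failed after {retries + 1} attempts") from last_error

def run_pipeline(stages):
    # The overall process depends on every stage succeeding, so any
    # persistent failure aborts the whole run.
    for name, fn in stages:
        run_with_retries(name, fn)
```

Each stage (data dump, S3 upload, job flow, download, keystore update) would be wrapped this way so that transient failures are retried and persistent ones stop the pipeline early.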
  28. Challenges – Elastic MapReduce <ul><li>Hard to debug – produces hundreds of log files in an S3 bucket </li></ul><ul><li>A hung node can be stopped with the AWS Console </li></ul><ul><li>Probably better to debug using a normal EC2 cluster </li></ul>
  29. Challenges – S3 (Simple Storage Service) <ul><li>S3 web service calls can time out </li></ul><ul><li>Extra logic required to validate that a file is correctly uploaded to and downloaded from S3 </li></ul><ul><li>We retry once on failure </li></ul>
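One way to implement that validation, sketched against a hypothetical `client.upload` interface (S3 returns an MD5-based ETag for simple uploads, so the local checksum can be compared against it; the retry-once policy matches the slide):

```python
import hashlib

def md5_hex(data):
    return hashlib.md5(data).hexdigest()

def upload_with_check(client, bucket, key, data, retries=1):
    # Upload and compare the returned ETag against the local MD5;
    # retry (once, by default) on error or checksum mismatch.
    for attempt in range(retries + 1):
        try:
            etag = client.upload(bucket, key, data)  # hypothetical client API
            if etag.strip('"') == md5_hex(data):
                return True
        except Exception:
            pass  # timeout or transport error; fall through to retry
    raise IOError(f"upload of {key} could not be verified "
                  f"after {retries + 1} attempts")
```

Downloads can be validated symmetrically by fetching the object's ETag and comparing it with the MD5 of the bytes received.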
  30. Challenges – Data Shuffling <ul><li>We currently spend as much time moving data around as actually running Hadoop </li></ul><ul><li>Network bandwidth does not scale the way Hadoop and EC2 do </li></ul><ul><li>The new scaling challenge is to reduce data-shuffle time and improve error recovery </li></ul><ul><li>Try to do your processing near the data </li></ul>
  31. Future Directions: Hadoop Streaming <ul><li>Great for rapid prototyping </li></ul><ul><li>Develop using Unix text processing tools and pipes </li></ul><ul><li>Can use any language: Perl, Ruby, etc. </li></ul><ul><li>Recommended to wrap scripts in a container </li></ul><ul><li>Tests are easily run outside of Hadoop </li></ul><ul><li>Has hastened our internal adoption of Hadoop </li></ul>
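To make the "tests run outside of Hadoop" point concrete: a streaming job is just two programs connected by a sort, so it can be exercised with plain pipes (`cat input | mapper | sort | reducer`). Below is an illustrative Python mapper/reducer pair computing a mean score per user; the tab-separated `user_id<TAB>score` input format is an assumption for the sketch:

```python
from itertools import groupby

def mapper(lines):
    # Streaming map: read tab-separated "user_id<TAB>score" lines and
    # re-emit them keyed by user id (an identity map in this case).
    for line in lines:
        user, score = line.strip().split("\t")
        yield f"{user}\t{score}"

def reducer(lines):
    # Streaming reduce: input arrives sorted by key, so consecutive
    # lines for one user can be grouped; emit the mean score per user.
    keyed = (line.strip().split("\t") for line in lines)
    for user, group in groupby(keyed, key=lambda kv: kv[0]):
        scores = [float(s) for _, s in group]
        yield f"{user}\t{sum(scores) / len(scores):.3f}"

# Outside Hadoop, the whole job is the moral equivalent of
# `cat input | mapper.py | sort | reducer.py`:
for line in reducer(sorted(mapper(["u1\t4", "u2\t2", "u1\t2"]))):
    print(line)
```

In an actual streaming job each half runs as its own executable over stdin/stdout, and Hadoop performs the sort between them.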
  32. Future Directions: Data Analysis in the Cloud <ul><li>Daily reporting: use Hadoop instead of depending on the data warehouse </li></ul><ul><li>Statistical analyses: </li></ul><ul><ul><li>Big aggregations, stratifications, distribution discovery </li></ul></ul><ul><ul><li>Median/mean score per user </li></ul></ul><ul><ul><li>Analyze users by location </li></ul></ul><ul><ul><li>Preparing data for analysis in packages like R </li></ul></ul>
  33. Data Analysis with Hive <ul><li>Language very similar to SQL </li></ul><ul><li>Once set up by devs, analysts can quickly become proficient </li></ul><ul><li>Errors are rare, usually from bad input data </li></ul><ul><li>Flexible enough to handle complex tasks </li></ul><ul><ul><li>Loading data into key/value maps </li></ul></ul><ul><ul><li>User-defined functions usually not required </li></ul></ul><ul><li>Hive community is very active and supportive </li></ul><ul><li>Running on EC2 using Amazon-supported Hive </li></ul><ul><li>Elastic Hive can read and write data in S3 buckets </li></ul>
  34. Data Analysis with Pig <ul><li>Apache Hadoop subproject </li></ul><ul><li>High-level language on top of Hadoop </li></ul><ul><li>Procedural language for describing data flow and filtering </li></ul><ul><li>Extremely flexible </li></ul><ul><li>Faster to write than Java, but slower to run </li></ul><ul><li>Hard to debug </li></ul>
  35. Lessons Learned <ul><li>EC2/S3/EMR are cost effective </li></ul><ul><li>Easy to write unit tests for MapReduce </li></ul><ul><li>Hadoop community support is great </li></ul><ul><li>Easier to control the process using Ruby than Bash </li></ul><ul><li>Dev tools are really easy to work with and just work right out of the box </li></ul><ul><li>Ensuring end-to-end reliability poses the biggest challenge </li></ul>
  36. Any questions? <ul><li>Ask away </li></ul>
  37. Thank you <ul><li>Ben Hardy, Senior Software Engineer </li></ul><ul><li>[email_address] </li></ul>