
Hw09 Matchmaking In The Cloud




  • 1. Matchmaking in the Cloud: Amazon Web Services and Apache Hadoop at eHarmony
    • Ben Hardy, Senior Software Engineer
  • 2.
    • You’ll learn how eHarmony:
      • Used EC2 and Hadoop to develop a scalable solution for our large, real-world data problem
      • Overcame the limitations of our existing infrastructure
      • Reaped significant cost savings with this choice
    • Also find out about new opportunities and challenges
    Why You’re Here CONFIDENTIAL
  • 3.
    • Online subscription-based matchmaking service
    • Launched in 2000
    • Available in United States, Canada, Australia and United Kingdom
    • On average, 236 members in the US marry every day*
    • More than 20 million registered users
    About eHarmony * Based on a survey conducted by Harris Interactive in 2007.
  • 4.
    • We match couples using detailed compatibility models
    • Models are based on decades of research and clinical experience in psychology
    • Variety of user attributes
        • Demographic
        • Psychographic
        • Behavioral
    • New models constantly being tested and developed
    • Model evaluation is the gorilla in the room
    The Science of Matching
  • 5. Computational Requirements
    • Tens of GB of matches, scores and constantly changing user features are archived daily
    • TBs of data currently archived and growing
    • Want to support 10x our current user base
    • All possible matches = O(n²) problem
    • Support a growing set of models that may be
      • arbitrarily complex
      • computationally and I/O expensive
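As a toy illustration (not eHarmony code), the quadratic growth of the all-pairs match pool looks like this:

```python
# Toy illustration: the number of candidate match pairs among
# n users grows as O(n^2).
def candidate_pairs(n):
    """Unordered pairs among n users: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(n, candidate_pairs(n))
# A 10x increase in users means roughly 100x the pairs to score.
```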
  • 6.
    • Current architecture is multi-tiered with a relational back-end
    • Scoring is DB join intensive
    • Data needs constant archiving
      • Matches, match scores, user attributes at time of match creation
      • Model validation is done at a later time across many days
    • Need a non-DB solution better suited to big-data crunching
    Scaling Challenges
  • 7.
    • Good fit for our problem
      • Need to process entire match pool (n²)
      • Data easily partitioned
    • Hadoop provides
      • Horizontally scalable parallel processing
      • Work distribution
      • Distributed Storage
      • Fault tolerance
      • Job monitoring
    • Hadoop is an Apache project
    Hadoop Addresses Scaling Needs
  • 8. Computing on AWS
    • Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand
    • Elastic MapReduce
      • Hosted Hadoop framework on top of EC2 and S3
      • Simplifies end-to-end processing on cloud
      • Pricing is in addition to EC2
    • Simple Storage Service (S3)
      • provides cheap unlimited storage
      • Highly configurable security using ACLs
  • 9. AWS Pricing Model
    • Pay-per-use elastic model
    • Choice of server type
    • Lets you get up and running quickly and cheaply
    • Highly cost effective alternative to doing it in house
    • Allows rapid horizontal scaling on demand
  • 10. Architecture
    [diagram: user data is dumped from the Data Warehouse and uploaded to S3; Elastic MapReduce Hadoop jobs in the Amazon cloud read input from and write output to S3; results are downloaded to a result keystore, which updates the Data Warehouse]
  • 11. MapReduce Overview
    • Applications are modeled as a series of maps and reductions
    • In map phase, values are assigned to keys
    • Shuffle and sort
    • In reduce phase, values are combined for each key
    • Example - Word Count
      • Counts the frequency of words
      • Modeled as one Map and one Reduce
      • Data as key -> values
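The word-count example above can be sketched in plain Python; this is an illustrative in-memory simulation of the three phases, not real Hadoop code (real jobs implement Mapper/Reducer classes):

```python
# In-memory sketch of MapReduce word count: map, shuffle & sort, reduce.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle_sort(pairs):
    # Shuffle & sort: group all values by key, then sort by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # Reduce: combine the values for each key.
    return {key: sum(values) for key, values in groups}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_sort(map_phase(lines)))
print(counts["the"])  # 2
```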
  • 12. Model Validation with MapReduce
    • Complex application uses a series of 3 MapReduce jobs
    • Match Scoring procedure for pairs of users:
      • Join match data with left-side User attributes into one line
      • Join above with right-side User attributes, calculate resulting match score
      • Group match scores by user
    • Temporary files in HDFS hold results between jobs
  • 13. Data Flow
    [diagram: three MapReduce jobs (Join, Join & Score, Group by User) consume Match Info plus left-side and right-side User data, passing temp files between stages to produce Results]
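A rough in-memory sketch of the three-stage flow described above; the `score()` function is a hypothetical stand-in, since the real compatibility models are proprietary and far more complex:

```python
# Rough sketch of the three MapReduce jobs: join left-side attributes,
# join right-side attributes and score, then group scores by user.
from collections import defaultdict

matches = [("alice", "bob"), ("alice", "carl"), ("dana", "bob")]
users = {"alice": {"age": 30}, "bob": {"age": 32},
         "carl": {"age": 28}, "dana": {"age": 31}}

def score(left, right):
    # Hypothetical placeholder model: closer ages score higher.
    return 1.0 / (1 + abs(left["age"] - right["age"]))

# Job 1: join match pairs with left-side user attributes.
job1 = [(l, r, users[l]) for l, r in matches]

# Job 2: join with right-side attributes, compute the match score.
job2 = [(l, r, score(attrs, users[r])) for l, r, attrs in job1]

# Job 3: group match scores by (left-side) user.
job3 = defaultdict(list)
for l, r, s in job2:
    job3[l].append((r, s))

print(sorted(job3["alice"]))
```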
  • 14. AWS Elastic MapReduce
    • Only need to think in terms of Elastic MapReduce job flow
    • EC2 cluster is managed for you behind the scenes
    • Each job flow has one or more steps
    • Each step is a Hadoop MapReduce process
    • Each step can read and write data directly from and to S3 or HDFS
    • Based on Hadoop 0.18.3
  • 15. Elastic MapReduce for eHarmony
    • Vastly simplified our Hadoop processing
      • No need to explicitly allocate, start, and shut down EC2 instances
      • No need to explicitly manipulate master node
    • The status of a job flow and all its steps is accessible via a REST service
  • 16. Simple Job Control
    • Cluster control and job management reduced to a single local command
    • Uses Amazon’s EMR Ruby script
    • Uses jar and conf files stored on S3
    elastic_mapreduce.rb --create --name #{JOB_NAME} \
      --num-instances #{NODES} --instance-type #{INST_TYPE} \
      --key_pair #{KEY} --log-uri #{LOGDIR} \
      --jar #{JAR} --main-class #{JOIN_CLASS} --arg -xconf --arg #{CONF}/join-config.xml \
      --jar #{JAR} --main-class #{SCORER_CLASS} --arg -xconf --arg #{CONF}/scorer-config.xml \
      --jar #{JAR} --main-class #{COMBINER_CLASS} --arg -xconf --arg #{CONF}/combiner-config.xml
  • 17. Development & Test Environments
    • Cheap to set up and experiment on Amazon
    • Quick setup
      • Number of servers is controlled by a config variable
    • Can test identical setup to production
    • Performance testing easy with big cluster
    • Integration test easy with small cluster and input data subset.
    • Separate development and test accounts recommended
  • 18. Performance by Instance Type
    [chart: job execution time in minutes by EC2 instance type]
  • 19. Total Execution Time [chart]
  • 20. Administration Tools
    • AWS Console
    • ElasticFox for EC2 Firefox plugin
    • Hadoop status web pages
    • Aws/s3 Ruby gem with irb shell
    • Tim Kay’s AWS command line tool for S3
    • S3Fox for S3 Firefox plugin
  • 21. AWS Management Console
    • Useful for Elastic MapReduce
      • Start or Terminate job flow
      • Track execution of jobs in a job flow
    • Useful for vanilla EC2 as well
      • Start and stop clusters, nodes
      • Get machine addresses to view Hadoop status
  • 23. AWS Management Console
    [screenshot: EC2 Console Dashboard]
  • 26. Hadoop DFS – Monitor Disk Usage [screenshot]
  • 27. Challenges
    • The overall process depends on the success of each stage
    • Assume every stage is unreliable
    • Need to build retry/abort logic to handle failures
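The retry/abort control described above might be sketched as follows, assuming each stage is a callable that raises on failure (names are illustrative, not eHarmony's actual code):

```python
# Sketch of per-stage retry/abort control: retry a failed stage a
# bounded number of times, then abort the whole pipeline.
import time

def run_stage(stage, retries=1, delay=0.0):
    """Run one pipeline stage, retrying on failure; abort if all attempts fail."""
    for attempt in range(retries + 1):
        try:
            return stage()
        except Exception as err:
            last = err
            if attempt < retries and delay:
                time.sleep(delay)
    # All attempts failed: abort rather than let later stages run on bad data.
    raise RuntimeError(f"stage {stage.__name__} failed, aborting") from last

def pipeline(stages):
    # Every stage is assumed unreliable; each one gets the same treatment.
    for stage in stages:
        run_stage(stage)
```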
  • 28. Challenges – Elastic MapReduce
    • Hard to debug – produces hundreds of log files in an S3 bucket
    • A hung node can be stopped with the AWS Console
    • Probably better to debug using normal EC2 cluster
  • 29. Challenges – S3 (Simple Storage Service)
    • S3 web service calls can time out
    • Extra logic is required to validate that files are correctly uploaded to and downloaded from S3
    • We retry once on failure
  • 30. Challenges – Data Shuffling
    • We currently spend as much time moving data around as actually running Hadoop
    • Network bandwidth does not scale the way Hadoop and EC2 do
    • The new scaling challenge is reducing data-shuffle time and error-recovery overhead
    • Try to do your processing near the data
  • 31. Future Directions: Hadoop Streaming
    • Great for rapid prototyping
    • Develop using Unix text processing tools and pipes
    • Can use any language – Perl, Ruby etc
    • Recommended to wrap scripts in a container
    • Tests are easily run outside of Hadoop
    • Has hastened our internal adoption of Hadoop
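A minimal sketch of what a Streaming job looks like, assuming the classic word-count task: Hadoop feeds lines on stdin and expects tab-separated key/value pairs on stdout, with reducer input sorted by key. In practice the mapper and reducer would be two separate scripts; here they share one file for brevity.

```python
# Sketch of a Hadoop Streaming-style word-count mapper and reducer.
# Streaming's contract: lines in on stdin, "key\tvalue" lines out on
# stdout, reducer input pre-sorted by key.
import sys
from itertools import groupby

def mapper(lines):
    # Emit "word\t1" for each word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Input arrives sorted by key; sum the counts for each word.
    pairs = (line.split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(c) for _, c in group)}"

if __name__ == "__main__":
    stage = mapper if (len(sys.argv) < 2 or sys.argv[1] == "map") else reducer
    for out in stage(line.rstrip("\n") for line in sys.stdin):
        print(out)
```

Because the stages are plain functions over iterables, they are easy to test outside Hadoop, which is part of Streaming's appeal for prototyping.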
  • 32. Future Directions: Data Analysis in the Cloud
    • Daily reporting: use Hadoop instead of depending on data warehouse.
    • Statistical analyses:
      • Big aggregations, stratifications, distribution discovery
      • Median/Mean score per user
      • Analyze users by location
      • Preparing data for analysis in packages like R
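As a plain-Python stand-in for the kind of aggregation a Hadoop or Hive job would run at scale, the per-user mean/median computation mentioned above looks like this (user names and scores are made up):

```python
# Illustrative aggregation: mean and median match score per user.
from collections import defaultdict
from statistics import mean, median

scores = [("alice", 90), ("alice", 50), ("alice", 70), ("bob", 40)]

# Group scores by user (the "shuffle" a real MapReduce job would do).
per_user = defaultdict(list)
for user, s in scores:
    per_user[user].append(s)

stats = {u: (mean(v), median(v)) for u, v in per_user.items()}
print(stats["alice"])  # (70, 70)
```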
  • 33. Data Analysis with Hive
    • Language very similar to SQL
    • Once set up by devs, analysts can quickly become proficient
    • Errors rare, usually from bad input data
    • Flexible enough to handle complex tasks
      • Loading data into key/value maps
      • User defined functions usually not required
    • Hive community is very active and supportive
    • Running on EC2 using Amazon-supported Hive
    • Elastic Hive can read and write data in S3 buckets
  • 34. Data Analysis with Pig
    • Apache Hadoop subproject
    • High-level language on top of Hadoop
    • Procedural language for describing data flow and filtering
    • Extremely flexible
    • Faster to write than Java, but slower to run
    • Hard to debug
  • 35. Lessons Learned
    • EC2/S3/EMR are cost effective.
    • Easy to write unit tests for MapReduce.
    • Hadoop community support is great.
    • Easier to control the process with Ruby than with Bash
    • Dev tools are easy to work with and work right out of the box
    • Ensuring end-to-end reliability poses the biggest challenge
  • 36. Any questions?
    • Ask away
  • 37. Thank you
    • Ben Hardy, Senior Software Engineer
    • [email_address]