Hw09: Matchmaking in the Cloud





  • 1. Matchmaking in the Cloud: Amazon Web Services and Apache Hadoop at eHarmony
    • Ben Hardy, Senior Software Engineer
  • 2.
    • You’ll learn how eHarmony:
      • Used EC2 and Hadoop to develop a scalable solution for our large, real-world data problem
      • Overcame the limitations of our existing infrastructure
      • Reaped significant cost savings with this choice
    • Also find out about new opportunities and challenges
    Why You’re Here CONFIDENTIAL
  • 3.
    • Online subscription-based matchmaking service
    • Launched in 2000
    • Available in United States, Canada, Australia and United Kingdom
    • On average, 236 members in US marry every day*
    • More than 20 million registered users
    About eHarmony * Based on a survey conducted by Harris Interactive in 2007.
  • 4.
    • We match couples using detailed compatibility models
    • Models are based on decades of research and clinical experience in psychology
    • Variety of user attributes
        • Demographic
        • Psychographic
        • Behavioral
    • New models constantly being tested and developed
    • Model evaluation is the 800-pound gorilla in the room
    The Science of Matching
  • 5. Computational Requirements
    • Tens of GB of matches, scores and constantly changing user features are archived daily
    • TBs of data currently archived and growing
    • Want to support 10x our current user base
    • All possible matches = an O(n²) problem
    • Support a growing set of models that may be
      • arbitrarily complex
      • computationally and I/O expensive
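The O(n²) scale above is easy to make concrete. A back-of-envelope sketch (not eHarmony's actual sizing, just arithmetic on the "more than 20 million registered users" figure from the About slide):

```python
def candidate_pairs(n: int) -> int:
    """Number of unordered user pairs: n choose 2, the O(n^2) term."""
    return n * (n - 1) // 2

# At roughly 20 million users, the full match pool is on the order of
# 2 * 10^14 pairs -- far too many to score with per-pair database joins.
print(candidate_pairs(20_000_000))  # 199999990000000
```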
  • 6.
    • Current architecture is multi-tiered with a relational back-end
    • Scoring is DB join intensive
    • Data needs constant archiving
      • Matches, match scores, user attributes at time of match creation
      • Model validation is done at a later time across many days
    • Need a non-DB solution better suited towards big data crunching
    Scaling Challenges
  • 7.
    • Good fit for our problem
      • Need to process entire match pool (n²)
      • Data easily partitioned
    • Hadoop provides
      • Horizontally scalable parallel processing
      • Work distribution
      • Distributed Storage
      • Fault tolerance
      • Job monitoring
    • Hadoop is an Apache project
    Hadoop Addresses Scaling Needs
  • 8. Computing on AWS
    • Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand
    • Elastic MapReduce
      • Hosted Hadoop framework on top of EC2 and S3
      • Simplifies end-to-end processing on cloud
      • Pricing is in addition to EC2
    • Simple Storage Service (S3)
      • Provides cheap, effectively unlimited storage
      • Highly configurable security using ACLs
  • 9. AWS Pricing Model
    • Pay-per-use elastic model
    • Choice of server type
    • Lets you get up and running quickly and cheaply
    • Highly cost effective alternative to doing it in house
    • Allows rapid horizontal scaling on demand
  • 10. Architecture (diagram): the Data Warehouse produces a user data dump that is uploaded to S3; Elastic MapReduce Hadoop jobs read input from and write output to S3; results are downloaded and update the result keystore in the Data Warehouse.
  • 11. MapReduce Overview
    • Applications are modeled as a series of maps and reductions
    • In map phase, values are assigned to keys
    • Shuffle and sort
    • In reduce phase, values are combined for each key
    • Example - Word Count
      • Counts the frequency of words
      • Modeled as one Map and one Reduce
      • Data as key -> values
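The word-count example above can be sketched as plain Python that mimics Hadoop's three phases. This is a toy in-memory simulation of the semantics, not actual Hadoop API code:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as Hadoop does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine the values for each key -- here, sum the counts.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["to be or not to be"])))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```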
  • 12. Model Validation with MapReduce
    • Complex application uses a series of 3 MapReduce jobs
    • Match Scoring procedure for pairs of users:
      • Join match data with left-side User attributes into one line
      • Join above with right-side User attributes, calculate resulting match score
      • Group match scores by user
    • Temporary files in HDFS hold results between jobs
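The three-job pipeline can be sketched in miniature. The record layouts, field names, and score function below are hypothetical stand-ins (the real models are proprietary), and plain Python lists and dicts play the role of the HDFS temp files between jobs:

```python
# Hypothetical records; the real jobs read delimited files from HDFS/S3.
matches = [("u1", "u2"), ("u1", "u3")]            # (left_user, right_user)
user_attrs = {"u1": {"age": 34}, "u2": {"age": 31}, "u3": {"age": 29}}

def score(left, right):
    # Placeholder compatibility model, for illustration only.
    return 100 - abs(left["age"] - right["age"])

# Job 1: join match pairs with left-side user attributes.
job1 = [(l, r, user_attrs[l]) for l, r in matches]

# Job 2: join with right-side attributes and compute the match score.
job2 = [(l, r, score(la, user_attrs[r])) for l, r, la in job1]

# Job 3: group match scores by (left) user.
by_user = {}
for l, r, s in job2:
    by_user.setdefault(l, []).append((r, s))

print(by_user)  # {'u1': [('u2', 97), ('u3', 95)]}
```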
  • 13. Data Flow (diagram): 3 MapReduce jobs: Join (match info with left-side users), then Join & Score (with right-side users), then Group by User; temp files hold intermediate output between jobs, ending in the final results.
  • 14. AWS Elastic MapReduce
    • Only need to think in terms of Elastic MapReduce job flow
    • EC2 cluster is managed for you behind the scenes
    • Each job flow has one or more steps
    • Each step is a Hadoop MapReduce process
    • Each step can read and write data directly from and to S3 or HDFS
    • Based on Hadoop 0.18.3
  • 15. Elastic MapReduce for eHarmony
    • Vastly simplified our Hadoop processing
      • No need to explicitly allocate, start and shutdown EC2 instances
      • No need to explicitly manipulate master node
    • The status of a job flow and all its steps is accessible via a REST service
  • 16. Simple Job Control
    • Cluster control and job management reduced to a single local command
    • Uses Amazon’s EMR Ruby script
    • Uses jar and conf files stored on S3
    elastic_mapreduce.rb --create --name #{JOB_NAME} \
      --num-instances #{NODES} --instance-type #{INST_TYPE} \
      --key_pair #{KEY} --log-uri #{LOGDIR} \
      --jar #{JAR} --main-class #{JOIN_CLASS} --arg -xconf --arg #{CONF}/join-config.xml \
      --jar #{JAR} --main-class #{SCORER_CLASS} --arg -xconf --arg #{CONF}/scorer-config.xml \
      --jar #{JAR} --main-class #{COMBINER_CLASS} --arg -xconf --arg #{CONF}/combiner-config.xml
  • 17. Development & Test Environments
    • Cheap to set up and experiment on Amazon
    • Quick setup
      • Number of servers is controlled by a config variable
    • Can test identical setup to production
    • Performance testing easy with big cluster
    • Integration test easy with small cluster and input data subset.
    • Separate development and test accounts recommended
  • 18. Performance by Instance Type (chart; y-axis in minutes)
  • 19. Total Execution Time (chart)
  • 20. Administration Tools
    • AWS Console
    • ElasticFox for EC2 Firefox plugin
    • Hadoop status web pages
    • aws/s3 Ruby gem with irb shell
    • Tim Kay’s AWS command line tool for S3
    • S3Fox for S3 Firefox plugin
  • 21. AWS Management Console
    • Useful for Elastic MapReduce
      • Start or Terminate job flow
      • Track execution of jobs in a job flow
    • Useful for vanilla EC2 as well
      • Start and stop clusters, nodes
      • Get machine addresses to view Hadoop status
  • 23. AWS Management Console (screenshot: EC2 Console Dashboard)
  • 26. Hadoop DFS – Monitor Disk Usage (screenshot)
  • 27. Challenges
    • The overall process depends on the success of each stage
    • Assume every stage is unreliable
    • Need to build retry/abort logic to handle failures
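Retry/abort logic of the kind described might look like the following generic sketch; the stage names and the wrapped calls are hypothetical, not eHarmony's actual code:

```python
import time

def run_with_retry(stage, name, attempts=3, delay=1.0):
    """Run one pipeline stage, retrying on failure; abort the run if it
    never succeeds so downstream stages don't consume bad data."""
    for attempt in range(1, attempts + 1):
        try:
            return stage()
        except Exception as exc:
            if attempt == attempts:
                raise RuntimeError(f"aborting: stage {name!r} failed "
                                   f"after {attempts} attempts") from exc
            time.sleep(delay)  # back off before retrying

# Each stage (upload, job flow, download, keystore update) gets wrapped, e.g.:
# run_with_retry(lambda: upload_to_s3(dump_file), "s3-upload")
```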
  • 28. Challenges – Elastic MapReduce
    • Hard to debug – produces hundreds of log files in an S3 bucket
    • A hung node can be stopped with the AWS Console
    • Probably better to debug using normal EC2 cluster
  • 29. Challenges – S3 (Simple Storage Service)
    • S3 web service calls can time out
    • Extra logic required to validate file is correctly uploaded to and downloaded from S3
    • We retry once on failure
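One way to validate a transfer is to compare checksums and retry once, as described. This sketch assumes the transfer call can report the remote MD5 (for single-part S3 uploads, the returned ETag is the MD5 of the object); `do_transfer` is a hypothetical callable standing in for the real S3 client call:

```python
import hashlib

def transfer_validated(do_transfer, local_path, retries=1):
    """Run an S3 upload/download, then compare the local file's MD5 with
    the checksum the other side reports; retry once on mismatch/timeout."""
    for attempt in range(retries + 1):
        try:
            remote_md5 = do_transfer()  # hypothetical: returns the S3 ETag
            with open(local_path, "rb") as f:
                local_md5 = hashlib.md5(f.read()).hexdigest()
            if local_md5 == remote_md5:
                return True
        except TimeoutError:
            pass  # S3 web service calls can time out; fall through to retry
    raise IOError(f"transfer of {local_path} failed validation")
```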
  • 30. Challenges – Data Shuffling
    • We currently spend as much time moving data around as actually running Hadoop
    • Network bandwidth does not scale the way Hadoop and EC2 compute capacity do.
    • The new scaling challenge is reducing data-shuffle time and the cost of error recovery.
    • Try to do your processing near the data
  • 31. Future Directions: Hadoop Streaming
    • Great for rapid prototyping
    • Develop using Unix text processing tools and pipes
    • Can use any language – Perl, Ruby etc
    • Recommended to wrap scripts in a container
    • Tests are easily run outside of Hadoop
    • Has hastened our internal adoption of Hadoop
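A streaming word count in Python might look like the sketch below. Hadoop Streaming pipes records through stdin/stdout with a sort between mapper and reducer; here that is simulated with line iterators so, as the slide notes, the logic can be tested outside Hadoop:

```python
from itertools import groupby

def mapper(lines):
    # Streaming mapper: emit one "word\t1" record per word of input.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming reducer: input arrives sorted by key, so consecutive
    # records with the same word can be summed with groupby.
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# In a real job the scripts read sys.stdin and print each record;
# Hadoop Streaming runs the equivalent of:
#   cat input | mapper.py | sort | reducer.py
```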
  • 32. Future Directions: Data Analysis in the Cloud
    • Daily reporting: use Hadoop instead of depending on data warehouse.
    • Statistical analyses:
      • Big aggregations, stratifications, distribution discovery
      • Median/Mean score per user
      • Analyze users by location
      • Preparing data for analysis in packages like R
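An aggregation like median/mean score per user could be prototyped locally before moving it into a Hadoop job. The records below are hypothetical; the real input would be the grouped match-score output:

```python
from statistics import mean, median
from collections import defaultdict

# Hypothetical (user, score) records.
records = [("u1", 97), ("u1", 95), ("u1", 88), ("u2", 91)]

scores = defaultdict(list)
for user, score in records:
    scores[user].append(score)

# Per-user (mean, median) of match scores.
stats = {u: (mean(s), median(s)) for u, s in scores.items()}
print(stats)  # e.g. u1 has mean ~93.3 and median 95
```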
  • 33. Data Analysis with Hive
    • Language very similar to SQL
    • Once set up by devs, analysts can quickly become proficient
    • Errors rare, usually from bad input data
    • Flexible enough to handle complex tasks
      • Loading data into key/value maps
      • User defined functions usually not required
    • Hive community is very active and supportive
    • Runs on EC2 using the Amazon-supported Hive distribution
    • Elastic Hive can read and write data in S3 buckets
  • 34. Data Analysis with Pig
    • Apache Hadoop subproject
    • High-level language on top of Hadoop
    • Procedural language for describing data flow and filtering
    • Extremely flexible
    • Faster to write than Java, but slower to run
    • Hard to debug
  • 35. Lessons Learned
    • EC2/S3/EMR are cost effective.
    • Easy to write unit tests for MapReduce.
    • Hadoop community support is great.
    • Easier to control process using Ruby than Bash
    • Dev tools are easy to work with and work right out of the box
    • Ensuring end-to-end reliability poses biggest challenges
  • 36. Any questions?
    • Ask away
  • 37. Thank you
    • Ben Hardy, Senior Software Engineer
    • [email_address]