eHarmony in the Cloud

This is a lightning presentation given by Brian Ko, a member of my development team. It is a recap of a presentation he attended at JavaOne 2009.

  1. eHarmony in the Cloud (Brian Ko)
  2. eHarmony • Online subscription-based matchmaking service • Available in the United States, Canada, Australia, and the United Kingdom • On average, 236 members in the US marry every day • More than 20 million registered users
  3. Why Cloud? • The problem exceeds the limits of the data center and data warehouse environment • Leverage EC2 and Hadoop to scale data processing
  4. Finding a Match • Model creation
  5. Finding a Match • Matching
  6. Finding a Match • Predictive model scores
  7. Requirements • All matches, scores, and user information must be archived daily • Ready for 10x growth • Potentially an O(n²) problem • Need to support a set of models that is becoming more complex
  8. Challenge • Current architecture is multi-tiered with a relational back end • Scoring is DB-join intensive • Data needs constant archiving – matches, match scores, and user attributes at the time of match creation – model validation is done at a later time, across many days • Need a non-DB solution
  9. Solution: Hadoop • Open-source Java implementation of Google's MapReduce framework – distributes work across vast amounts of data – the Hadoop Distributed File System (HDFS) provides reliability through replication – automatic re-execution and redistribution of work on failure – scales horizontally on commodity hardware
  10. Amazon Web Services • Simple Storage Service (S3) provides cheap, practically unlimited storage • Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand
  11. MapReduce • A large server farm can use MapReduce to process a huge dataset • Map step – the master node takes the input, chops it up into smaller sub-problems, and distributes those to worker nodes • Reduce step – the master node takes the answers to all the sub-problems and combines them to produce the output (see the sketch below)
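The deck does not include any of eHarmony's actual job code, so as a stand-in here is the canonical word-count example, which shows the two steps concretely. This is a minimal sketch against the old org.apache.hadoop.mapred API (the deck later notes Hadoop 0.18.3); the class names are illustrative, not from the talk.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map step: each worker turns one line of input into (word, 1) pairs.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      output.collect(word, ONE);          // emit (word, 1)
    }
  }
}

// Reduce step: the framework groups values by key; the reducer combines them.
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text word, Iterator<IntWritable> counts,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    output.collect(word, new IntWritable(sum));  // emit (word, total)
  }
}
```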
  12. Why Hadoop? • You write only the Mapper and the Reducer (driver sketch below) • Hadoop provides – parallelization – shuffle and sort
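To make the division of labor concrete, a hypothetical driver for the mapper and reducer sketched above might look like this, again using the 0.18-era JobConf API. Splitting the input, scheduling tasks, shuffling and sorting the intermediate pairs, and retrying failed tasks never appear in application code.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountJob.class);
    conf.setJobName("word-count");

    // The only application code supplied is the Mapper and Reducer classes.
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Hadoop splits the input, schedules map tasks across the cluster,
    // shuffles and sorts the intermediate (key, value) pairs, and re-runs
    // failed tasks; none of that appears here.
    JobClient.runJob(conf);
  }
}
```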
  13. Actual Process • Upload the data to S3 and start an EC2 cluster (see the upload sketch below)
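The slides do not say which client performed the S3 upload. Purely as an illustration, a daily archive push could look like the following using the AWS SDK for Java (which post-dates this 2009 talk); the credentials, bucket name, and paths are placeholders.

```java
import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class ArchiveUploader {
  public static void main(String[] args) {
    // Credentials, bucket, and key names here are placeholders.
    AmazonS3 s3 = new AmazonS3Client(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

    // Push one day's match/score extract into S3 so the EC2 cluster can read it.
    s3.putObject("example-match-archive",
                 "daily/2009-06-01/matches.tsv",
                 new File("/data/export/matches.tsv"));
  }
}
```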
  14. Actual Process • Process and archive
  15. Amazon Elastic MapReduce • A web service • The EC2 cluster is managed for you behind the scenes • Starts the Hadoop implementation of the MapReduce framework on Amazon EC2 • Each step can read and write data directly from and to S3 • Based on Hadoop 0.18.3
  16. Elastic MapReduce • No need to explicitly allocate, start, and shut down EC2 instances • Individual jobs used to be managed by a remote script running on the master node (no longer required) • Jobs are arranged into a job flow, created with a single command • The status of a job flow and all its steps is accessible through a REST service
  17. Before Elastic MapReduce • Allocate/verify the cluster • Push the application to the cluster • Run a control script on the master • Kick off each job step on the master • Create and detect a job-completion token • Shut the cluster down
  18. After Elastic MapReduce • With Elastic MapReduce we can do all of this with a single local command (see the sketch below) • Uses jar and conf files stored on S3 • Various monitoring tools for EC2 and S3 are provided
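The single local command in the talk is a CLI invocation; as an illustrative alternative, the same job flow can also be created programmatically through the Elastic MapReduce API. The sketch below uses the AWS SDK for Java, which is not what the presentation used, and the bucket names, jar path, and instance settings are invented.

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class StartScoringJobFlow {
  public static void main(String[] args) {
    AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

    // One step per MapReduce job; the jar and its arguments live on S3.
    StepConfig scoringStep = new StepConfig()
        .withName("score-matches")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://example-bucket/jars/scoring.jar")
            .withArgs("s3://example-bucket/daily/2009-06-01/",
                      "s3://example-bucket/output/2009-06-01/"));

    // The service allocates, configures, and later shuts down the EC2 cluster.
    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("daily-scoring")
        .withSteps(scoringStep)
        .withLogUri("s3://example-bucket/logs/")
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(10)
            .withMasterInstanceType("m1.large")
            .withSlaveInstanceType("m1.large"));

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}
```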
  19. Development Environment • Cheap to set up on Amazon • Quick setup – the number of servers is controlled by a config variable • Identical to production • A separate development account is recommended
  20. Cost Comparison • Average EC2 and S3 cost – each run takes 2 to 3 hours – about $1,200/month for EC2 – about $100/month for S3 • Projected in-house cost – about $5,000/month for a local cluster of 50 nodes running 24/7 – a new company would also need to add data center and operations personnel expenses • Net: roughly $1,300/month in the cloud versus $5,000/month (plus staffing) in house
  21. Summary • The dev tools are really easy to work with and just work right out of the box • The standard Hadoop AMI worked great • Easy to write unit tests for MapReduce (see the test sketch below) • Hadoop community support is great • EC2/S3/EMR are cost-effective
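On the "easy to write unit tests" point: because an old-API Mapper is just a class with a map method, it can be exercised from a plain JUnit test with a small in-memory OutputCollector and no running cluster. The test below targets the hypothetical word-count mapper sketched earlier, not any code from the talk.

```java
import static org.junit.Assert.assertEquals;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.junit.Test;

public class WordCountMapperTest {

  // Minimal in-memory OutputCollector so the mapper can run outside Hadoop.
  static class CollectingOutput implements OutputCollector<Text, IntWritable> {
    final List<String> pairs = new ArrayList<String>();
    public void collect(Text key, IntWritable value) throws IOException {
      pairs.add(key.toString() + "=" + value.get());
    }
  }

  @Test
  public void mapperEmitsOnePairPerToken() throws Exception {
    CollectingOutput output = new CollectingOutput();

    new WordCountMapper().map(new LongWritable(0), new Text("a b a"),
                              output, Reporter.NULL);

    assertEquals(3, output.pairs.size());
    assertEquals("a=1", output.pairs.get(0));
  }
}
```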
  22. The End • 5 minutes of question time starts now!
  23. Questions • 4 minutes left!
  24. Questions • 3 minutes left!
  25. Questions • 2 minutes left!
  26. Questions • 1 minute left!
  27. Questions • 30 seconds left!
  28. Questions • TIME IS UP!
