eHarmony in the Cloud

This is a lightning presentation given by Brian Ko, a member of my development team. It is a recap of a presentation he attended at JavaOne 2009.

Published in: Technology, Education, Business

Transcript of "eHarmony in the Cloud"

  1. eHarmony in Cloud
     Brian Ko
  2. eHarmony
     • Online subscription-based matchmaking service
     • Available in the United States, Canada, Australia, and the United Kingdom
     • On average, 236 members in the US marry every day
     • More than 20 million registered users
  3. Why Cloud?
     • The problem exceeds the limits of the data center and data warehouse environment
     • Leverage EC2 and Hadoop to scale data
  4. Finding Matches
     • Model creation
  5. Finding Matches
     • Matching
  6. Finding Matches
     • Predictive model scores
  7. Requirements
     • All matches, scores, and user information should be archived daily
     • Ready for 10X growth
     • Possibly an O(n²) problem
     • Need to support a set of models that keeps becoming more complex
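To see why the "10X growth" and "O(n²)" requirements interact, here is a minimal sketch. It assumes the worst case in which every user is scored against every other user; the talk only says the problem is "possibly" O(n²), so real matching presumably prunes many pairs. The function names and the user count are illustrative, not from the presentation.

```python
def candidate_pairs(users: int) -> int:
    """Unordered user pairs if every user is scored against every other user."""
    return users * (users - 1) // 2


def growth_factor(users: int, multiplier: int) -> float:
    """Factor by which the pair count grows when the user base grows by `multiplier`."""
    return candidate_pairs(users * multiplier) / candidate_pairs(users)


if __name__ == "__main__":
    base_users = 1_000_000  # illustrative number, not a figure from the talk
    print(candidate_pairs(base_users))
    print(growth_factor(base_users, 10))
```

Under this assumption, a 10X increase in users means roughly a 100X increase in pairs to score, which is the kind of growth a fixed-size relational back end struggles to absorb.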
  8. Challenge
     • Current architecture is multi-tiered with a relational back end
     • Scoring is DB-join intensive
     • Data needs constant archiving
       – Matches, match scores, and user attributes at time of match creation
       – Model validation is done at a later time across many days
     • Need a non-DB solution
  9. Solution
     • Open-source Java implementation of Google’s MapReduce framework
       – Distributes work across vast amounts of data
       – Hadoop Distributed File System (HDFS) provides reliability through replication
       – Automatic re-execution on failure/distribution
       – Scales horizontally on commodity hardware
  10. Slide 9
      • Simple Storage Service (S3) provides cheap, unlimited storage
      • Elastic Compute Cloud (EC2) enables horizontal scaling by adding servers on demand
  11. MapReduce
      • A large server farm can use MapReduce to process a huge dataset
      • Map step
        – Master node takes the input
        – Chops it up into smaller sub-problems
        – Distributes those to worker nodes
      • Reduce step
        – Master node takes the answers to all the sub-problems
        – Combines them to get the output
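The Map and Reduce steps on the MapReduce slide can be sketched in a single process. This is only an illustration of the idea, not Hadoop code: the "worker nodes" are plain function calls, and the helper names (`chop`, `map_step`, `reduce_step`) and the sample numbers are assumptions made for the example.

```python
from functools import reduce
from typing import Callable, List


def chop(items: List[int], size: int) -> List[List[int]]:
    """Master node: chop the input into smaller sub-problems of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_step(chunks: List[List[int]], worker: Callable[[List[int]], int]) -> List[int]:
    """Hand each sub-problem to a worker (here just a function call) and collect the answers."""
    return [worker(chunk) for chunk in chunks]


def reduce_step(answers: List[int], combine: Callable[[int, int], int], initial: int) -> int:
    """Master node: combine the answers to all sub-problems into one output."""
    return reduce(combine, answers, initial)


scores = [3, 1, 4, 1, 5, 9, 2, 6]              # made-up input data
partial_sums = map_step(chop(scores, 3), sum)  # one partial sum per sub-problem
total = reduce_step(partial_sums, lambda a, b: a + b, 0)
```

With the sample input, the three sub-problems produce partial sums of 8, 15, and 8, which the reduce step combines into 31.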
  12. Why Hadoop
      • Mapper and Reducer are written by you
      • Hadoop provides
        – Parallelization
        – Shuffle and sort
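The division of labor on this slide can be sketched as follows. This is not the Hadoop API: `run_job` is a hypothetical stand-in for the framework, using a thread pool in place of a cluster, while `match_mapper` and `count_reducer` play the role of the code "written by you". The sample match data is invented.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain, groupby
from operator import itemgetter


def run_job(records, mapper, reducer, workers=4):
    """Hypothetical mini-framework: it supplies parallelization plus shuffle and sort."""
    # Parallelization: apply the user's mapper to the records concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(lambda record: list(mapper(record)), records))
    # Shuffle and sort: order all emitted (key, value) pairs so equal keys are adjacent.
    pairs = sorted(chain.from_iterable(mapped), key=itemgetter(0))
    # Call the user's reducer once per key with all of that key's values.
    return {
        key: reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    }


def match_mapper(pair):
    """User code: emit (user, 1) for each member of a matched pair."""
    first, second = pair
    yield first, 1
    yield second, 1


def count_reducer(user, ones):
    """User code: total number of matches for one user."""
    return sum(ones)


matches = [("ann", "bob"), ("ann", "cal"), ("bob", "dee")]  # invented sample data
matches_per_user = run_job(matches, match_mapper, count_reducer)
```

The mapper and reducer contain no threading, sorting, or grouping logic, which is the point of the slide: the framework owns those concerns.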
  13. Actual Process
      • Upload to S3 and start EC2 cluster
  14. Actual Process
      • Process and archive
  15. Amazon Elastic MapReduce
      • It is a web service
      • EC2 cluster is managed for you behind the scenes
      • Starts the Hadoop implementation of the MapReduce framework on Amazon EC2
      • Each step can read and write data directly from and to S3
      • Based on Hadoop 0.18.3
  16. Elastic MapReduce
      • No need to explicitly allocate, start, and shut down EC2 instances
      • Individual jobs were managed by a remote script running on the master node (no longer required)
      • Jobs are arranged into a job flow, created with a single command
      • Status of a job flow and all its steps is accessible through a REST service
  17. Before Elastic MapReduce
      • Allocate/verify cluster
      • Push application to cluster
      • Run a control script on the master
      • Kick off each job step on the master
      • Create and detect a job completion token
      • Shut the cluster down
  18. After Elastic MapReduce
      • With Elastic MapReduce we can do all this with a single local command
      • Uses jar and conf files stored on S3
      • Various monitoring tools for EC2 and S3 are provided
  19. Development Environment
      • Cheap to set up on Amazon
      • Quick setup: number of servers is controlled by a config variable
      • Identical to production
      • Separate development account recommended
  20. Cost Comparison
      • Average EC2 and S3 cost
        – Each run is 2 to 3 hours
        – $1200/month for EC2
        – $100/month for S3
      • Projected in-house cost
        – $5000/month for a local cluster of 50 nodes running 24/7
        – A new company would also need to add data center and operations personnel expenses
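The arithmetic behind the cost comparison can be worked through directly. The dollar figures come from the slide; the variable and function names are illustrative, and the in-house figure excludes the data center and personnel expenses the slide mentions, so the real gap may be larger.

```python
EC2_MONTHLY = 1200       # USD per month, from the slide
S3_MONTHLY = 100         # USD per month, from the slide
IN_HOUSE_MONTHLY = 5000  # USD per month for a 50-node local cluster, from the slide


def cloud_monthly_cost() -> int:
    """Combined monthly EC2 and S3 cost."""
    return EC2_MONTHLY + S3_MONTHLY


def monthly_savings() -> int:
    """Difference between the projected in-house cost and the cloud cost."""
    return IN_HOUSE_MONTHLY - cloud_monthly_cost()


def savings_fraction() -> float:
    """Savings as a fraction of the projected in-house cost."""
    return monthly_savings() / IN_HOUSE_MONTHLY


if __name__ == "__main__":
    print(cloud_monthly_cost(), monthly_savings(), savings_fraction())
```

On these numbers the cloud setup costs $1300 per month against $5000 in-house, a saving of $3700 per month, or 74% of the in-house figure.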
  21. Summary
      • Dev tools are really easy to work with and just work right out of the box
      • Standard Hadoop AMI worked great
      • Easy to write unit tests for MapReduce
      • Hadoop community support is great
      • EC2/S3/EMR are cost-effective
  22. The End
      5 minutes of question time starts now!
  23. Questions
      4 minutes left!
  24. Questions
      3 minutes left!
  25. Questions
      2 minutes left!
  26. Questions
      1 minute left!
  27. Questions
      30 seconds left!
  28. Questions
      TIME IS UP!