Seravia in the Cloud


Presented at the MongoDB Beijing Meetup on May 7, 2011.

Published in: Technology

  1. Seravia in the Cloud. Danny Yang, May 7, 2011
  2. Outline
       • Search on www.seravia.com
           • Deployed on Amazon Web Services
           • DB and text search used: Mongo and Sphinx
       • Data cleaning and analysis
           • Use Amazon's cloud to do large computations
           • Tools used: Pentaho and Hive
       • Data crawling (Robin's talk)
  3. Sample of Data 2.0 Companies
  4. Our data sets
  5. Mongo
       • Perfect for our document store
           • Document-oriented database
           • Schema-free
           • Fast exact-match queries
       • But what about search?
           • Full-text search not well supported
           • Tried an indexed array of keywords, e.g.
               • ["senior", "software", "engineer"]
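
The keyword-array approach mentioned on this slide can be sketched as follows. This is a minimal illustration, not the speaker's code: the `title` and `keywords` field names, the tokenizer, and the sample documents are assumptions. The filter uses MongoDB's `$all` operator, and `matches` is only an in-memory stand-in for how a server would evaluate it, so the sketch runs without a database.

```python
import re

def to_keywords(text):
    """Lowercase the text and split it into a list of word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_query(search_text):
    """Build a MongoDB-style filter requiring every search term to be present."""
    return {"keywords": {"$all": to_keywords(search_text)}}

def matches(doc, query):
    """In-memory stand-in for evaluating the $all filter against one document."""
    required = query["keywords"]["$all"]
    return all(term in doc["keywords"] for term in required)

# Hypothetical documents; the field names are assumptions for illustration.
docs = [
    {"_id": 1, "title": "Senior Software Engineer"},
    {"_id": 2, "title": "Software Sales Manager"},
]
for doc in docs:
    # Store the token array alongside the document so it can be indexed.
    doc["keywords"] = to_keywords(doc["title"])

query = keyword_query("software engineer")
hits = [doc["_id"] for doc in docs if matches(doc, query)]
print(hits)
```

A filter like this only matches exact tokens and returns no relevance score, which is consistent with the slide's point that full-text search needed a separate tool.
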
  6. Sphinx
       • Fast full-text search
       • Supports a query language
       • Supports search relevance ranking
       • Simpler and faster than Lucene
       • Open source, C++
       • Users: Meetup, Craigslist, Slashdot
       • sphinxsearch.com
  7. Combining Mongo and Sphinx
       • Mongo is used for the document store
       • Sphinx is used for text search
       • The 64-bit _id in Mongo is used by Sphinx
       • A Sphinx search returns an array of _ids, which are then retrieved from Mongo
       • Many different Sphinx/Mongo collections
       • Search cache stored in a Mongo collection
       • Should we open-source our Ruby library?
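
The flow on this slide (text index returns ids, documents come from the store, results are cached) might look roughly like the sketch below. Everything here is a stand-in chosen so the example runs on its own: `DOCS` plays the role of a Mongo collection, `text_search` plays the role of the Sphinx call with a toy scoring rule, and `SEARCH_CACHE` plays the role of the cache collection. None of these names come from the talk or from either library's API.

```python
# Stand-in for a Mongo collection: documents keyed by integer _id.
DOCS = {
    101: {"_id": 101, "name": "Acme Software"},
    102: {"_id": 102, "name": "Beijing Software Labs"},
    103: {"_id": 103, "name": "Acme Hardware"},
}

SEARCH_CACHE = {}  # stand-in for the search-cache collection
INDEX_CALLS = []   # records which queries actually reached the text index

def text_search(query):
    """Stand-in for the text-search call: returns matching _ids, best first."""
    INDEX_CALLS.append(query)
    terms = query.lower().split()
    scored = []
    for _id, doc in DOCS.items():
        words = doc["name"].lower().split()
        score = sum(words.count(term) for term in terms)
        if score:
            scored.append((-score, _id))
    return [_id for _, _id in sorted(scored)]

def fetch_by_ids(ids):
    """Stand-in for an {"_id": {"$in": ids}} lookup, re-ordered to match ids."""
    found = {i: DOCS[i] for i in ids if i in DOCS}
    return [found[i] for i in ids if i in found]

def search(query):
    """Look up ranked ids (cached per normalized query), then load documents."""
    key = " ".join(query.lower().split())
    if key not in SEARCH_CACHE:
        SEARCH_CACHE[key] = text_search(key)
    return fetch_by_ids(SEARCH_CACHE[key])

results = search("acme software")
repeat = search("Acme   Software")  # same normalized key, served from the cache
print([doc["_id"] for doc in results], len(INDEX_CALLS))
```

The re-ordering step in `fetch_by_ids` matters in practice because a database lookup by a set of ids is not guaranteed to return documents in the ranked order the search engine produced.
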
  8. Why cloud?
       • “The cloud lets its users focus on delivering differentiating business value instead of wasting valuable resources on the undifferentiated heavy lifting that makes up most of IT infrastructure.”
         Werner Vogels, Amazon CTO
       • “We want to use clouds, we don't have time to build them.”
         Adrian Cockcroft, Architect at Netflix
  9. Amazon Cloud Advantages
       • Scalability
           • Small scale: simple and low cost, pay per hour
           • Large scale: launch thousands of servers
       • Mature API
       • Developer community
           • Many open source tools
           • Support staff available
       • No need for an IT department
       • Cheaper
           • Pay for what you use
           • No upfront cost, maintenance cost, or depreciation
  10. Amazon Cloud users
       • Zynga, Quora, Reddit, Yelp, Netflix, Foursquare, 37signals
       • Case studies:
           • http://aws.amazon.com/solutions/case-studies/
  11. Amazon Web Services (AWS)
       • Elastic Compute Cloud (EC2)
           • Virtual machine (instance)
           • Different machine types available
           • Temporary instances: run and kill
       • Simple Storage Service (S3)
           • “Bucket” file store
           • HTTP access
       • Elastic MapReduce (EMR)
           • Hadoop/Hive cluster for MapReduce jobs
       • Elastic Load Balancer (ELB)
  12. Seravia on AWS (diagram)
       • WWW: ELB, EC2 (rails, mongo, mysql, sphinx), S3
       • Data: EC2 (parsing, pentaho, ETL), S3, EMR (hadoop, hive, BI)
       • Crawlware: EC2, S3
  13. WWW architecture (diagram)
       • ELB
       • EC2 webservers (rails)
       • EC2 mongo
       • EC2 sphinx
       • EC2 mysql
       • S3
  14. Data Architecture (diagram)
       • Components: EC2 (parsing, ETL), S3, EMR (Hadoop, Hive, BI), EC2 (post-processing)
       • Data stages:
           1. Raw data: html, xml, text files
           2. Pre-processed: unrelated tsv files
           3. Analyzed: related tsv files and reports
           4. Post-processed: json documents
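
The last step of this pipeline, turning related tsv files into json documents, can be illustrated with a small sketch. The talk does not show its file layouts, so the two TSV inputs, their column names, and the shape of the output document are all hypothetical and exist only to make the example self-contained.

```python
import csv
import io
import json

# Hypothetical "related" TSV inputs: two files linked by company_id.
COMPANIES_TSV = "company_id\tname\n1\tAcme\n2\tGlobex\n"
PEOPLE_TSV = (
    "company_id\tperson\ttitle\n"
    "1\tAlice\tCEO\n"
    "1\tBob\tCTO\n"
    "2\tCarol\tCEO\n"
)

def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def build_documents(companies, people):
    """Nest each company's people rows inside one document per company."""
    docs = {}
    for row in companies:
        cid = int(row["company_id"])
        docs[cid] = {"_id": cid, "name": row["name"], "people": []}
    for row in people:
        cid = int(row["company_id"])
        if cid in docs:  # skip rows that reference an unknown company
            docs[cid]["people"].append(
                {"name": row["person"], "title": row["title"]}
            )
    return [docs[cid] for cid in sorted(docs)]

documents = build_documents(read_tsv(COMPANIES_TSV), read_tsv(PEOPLE_TSV))
for doc in documents:
    # One JSON document per line, ready to load into a document store.
    print(json.dumps(doc, sort_keys=True))
```

Denormalizing at this stage fits the deck's use of a schema-free document store: each output line is a complete record that can be loaded without further joins.
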
  15. Crawlware Architecture (diagram)
       • EC2 controllers
       • EC2 crawlers
       • S3
  16. Summary
       • Many good tools and services
           • Mongo - document db
           • Sphinx - text search engine
           • AWS (EC2, S3, EMR)
           • HDFS/Hadoop/Hive - data analysis
           • Pentaho - ETL, data cleaning
       • Do not need to build them yourself
       • Use them for what they are designed to do
       • Good slides:
           • Big data: http://www.slideshare.net/FactualTeam/factual-web-2-10-presentation
           • Netflix in the cloud: http://www.slideshare.net/adrianco/netflix-in-the-cloud-2011
  17. Q & A
