Lesson from Building a Search Engine using the cloud

Chirayu Patel shares his experience in building SnappyFingers.com

Published in: Technology, Business

  1. Lessons from building a search engine with Amazon Web Services
       Chirayu Patel
       chirayu@snappyfingers.com
  2. Chirayu Patel
       Developer
       SnappyFingers
       Question and Answer Search Engine
       100% in the cloud
       5-Apr-09 CloudCamp - Bangalore
  3. My experiences
       • What worked?
       • What didn’t?
       • What could be done better?
       • What did I miss?
  4. Recap - AWS?
       • EC2 – Elastic Compute Cloud
       • S3 – Simple Storage Service
       • SQS – Simple Queue Service
       • SDB – SimpleDB
  5. Why AWS?
       • Computing power requirements unknown
       • Cheap
       • Availability of multiple services
       • Easy to implement the SnappyFingers architecture using AWS services
  6. SnappyFingers
       • Information Retrieval System (IRS)
       • FrontEnd – Nothing unique here
  7. Three motivations (behind my decisions)
       • Reluctance to learn
       • Cost conscious
       • I write buggy code
  8. Architectural Requirements
       • Loosely coupled
       • Scalable
       • Fault tolerant
       • Budget dependent
  9. IRS Architecture (diagram)
       • Pipeline (SQS): Crawler pipe, Parser pipe, Indexer pipe
       • EC2: Crawler, Parser, Indexer
       • Data Store: EC2 + S3
       • Errors: SDB
  10. Pipes and Pipelines
  11. Pipes and Pipelines
       • Pipes contain jobs
       • A pipeline is a group of pipes
       • Easy to create pipelines and add pipes
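
The relationship between jobs, pipes and pipelines can be sketched in a few lines of Python. This is an in-memory illustration only, not the SnappyFingers code: the Pipe and Pipeline classes and their methods are hypothetical names, and a deque stands in for the SQS queue that backs each pipe in the talk.

```python
from collections import deque


class Pipe:
    """A named queue of jobs (in-memory stand-in for one SQS queue)."""

    def __init__(self, name):
        self.name = name
        self._jobs = deque()

    def put(self, job):
        self._jobs.append(job)

    def get(self):
        # Return the oldest job, or None when the pipe is empty.
        return self._jobs.popleft() if self._jobs else None

    def __len__(self):
        return len(self._jobs)


class Pipeline:
    """A named group of pipes."""

    def __init__(self, name):
        self.name = name
        self.pipes = {}

    def add_pipe(self, pipe_name):
        # Adding a stage only requires registering one more pipe.
        pipe = Pipe(pipe_name)
        self.pipes[pipe_name] = pipe
        return pipe


# Hypothetical pipeline with one pipe per stage named in the talk.
irs = Pipeline("irs")
for stage in ("crawler", "parser", "indexer"):
    irs.add_pipe(stage)
irs.pipes["crawler"].put({"action": "fetch", "url": "http://example.com/"})
```

Each stage reads from its own pipe and writes to the next one, which is what keeps the stages loosely coupled.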
  12. Job ORM

       class CrawlerJob (JobBase):
           class SDBInterfaceConfig:    # SDB API
               domain_name = settings.CRAWLER_JOB_DOMAIN
           class SQSInterfaceConfig:    # SQS API
               queue_name = settings.CRAWLER_JOB_QUEUE
               timeout = settings.CRAWLER_JOB_TIMEOUT
           class AWSMetaData:
               action = CharField (...)
               url = CharField (...)
               ...
           ...

       Default attributes of each Job:
       • Pipeline Name
       • Status
       • Start Time
       • End Time
       • Id
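
To make the default job attributes concrete, here is a minimal standalone sketch using dataclasses. It does not reproduce the JobBase ORM from the talk: the class names, the status values and the start/finish methods are assumptions made for illustration, and nothing here talks to SQS or SimpleDB.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class JobSketch:
    """Hypothetical base job carrying the five default attributes."""

    pipeline_name: str
    status: str = "pending"  # assumed status values: pending/running/done/error
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def start(self):
        self.status = "running"
        self.start_time = datetime.now(timezone.utc)

    def finish(self, ok=True):
        self.status = "done" if ok else "error"
        self.end_time = datetime.now(timezone.utc)


@dataclass
class CrawlerJobSketch(JobSketch):
    """Job-specific fields, mirroring action and url from the slide."""

    action: str = "fetch"
    url: str = ""


job = CrawlerJobSketch(pipeline_name="irs", url="http://example.com/")
job.start()
job.finish()
```

Keeping the bookkeeping fields on the base class means every new job type gets status and timing for free, which is the time saving the ORM approach is after.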
  13. Job Processing

       for i in range (num_of_jobs):
           try:
               job = cls.jobclass.sqs_get()
               # process job
               ...
           except Exception as e:
               job.job_processing_complete(…)
               fsdebug.mail_admins (..)
               end_transaction(rollback = True)
               job.sdb_save() # save in error store
           finally:
               job.sqs_del() # delete the job
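
The same try/except/finally pattern can be run locally with stand-ins for the AWS pieces. In this sketch a deque plays the role of the SQS queue and a plain list plays the role of the SimpleDB error store; process_jobs, handler and the sample URLs are hypothetical names chosen for the example.

```python
from collections import deque


def process_jobs(queue, handler, error_store, max_jobs):
    """Process up to max_jobs jobs; failed jobs are recorded in error_store."""
    processed = 0
    for _ in range(max_jobs):
        if not queue:
            break
        job = queue[0]  # peek: the job stays queued while it is being worked on
        try:
            handler(job)
            processed += 1
        except Exception as exc:
            # Keep the failed job and the reason so it can be inspected or retried.
            error_store.append({"job": job, "error": repr(exc)})
        finally:
            queue.popleft()  # delete the job whether or not it succeeded
    return processed


def handler(job):
    # Toy handler: any URL starting with "bad" simulates a processing failure.
    if job["url"].startswith("bad"):
        raise ValueError("cannot fetch " + job["url"])


queue = deque({"url": u} for u in ("http://a.example/", "bad://b", "http://c.example/"))
errors = []
done = process_jobs(queue, handler, errors, max_jobs=10)
```

Because a failing job is always deleted from the queue, one bad job cannot block the pipe; the error store is the only place it survives.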
  14. The Good
       • Architecture easy to extend
       • ORM approach is a big time saver
       • Simple to add new services
  15. The Bad
       • Messages may be lost
           – Service failure
           – SQS deletes messages after 4 days
       Important: the system should be able to recreate jobs
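
One way to make jobs recreatable is to keep a durable record of what should have been processed and periodically re-enqueue whatever has no completion record. The sketch below assumes such records exist; recreate_missing_jobs and its parameters are hypothetical, and the talk does not say how SnappyFingers actually did this.

```python
def recreate_missing_jobs(expected_urls, completed_urls, enqueue):
    """Re-enqueue a job for every expected URL that has no completed record."""
    missing = sorted(set(expected_urls) - set(completed_urls))
    for url in missing:
        enqueue({"action": "fetch", "url": url})
    return missing


# A plain list stands in for the queue; enqueue is just list.append here.
queue = []
recreated = recreate_missing_jobs(
    expected_urls=["http://a.example/", "http://b.example/", "http://c.example/"],
    completed_urls=["http://b.example/"],
    enqueue=queue.append,
)
```

A job that is still in flight looks the same as a lost one in this scheme, so it may be enqueued twice; handlers therefore need to tolerate duplicates.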
  16. Storage
  17. What do we store?
       • Crawler Data – Web Pages
       • Extracted Content – Questions/Answers
       • Backups
  18. Storage Structure (diagram)
       • Meta Data: Postgres
       • Key + Value: S3
  19. ORM
       • Extended Django ORM to support S3

       class S3WebPage (S3Model):
           _allowed_attrs = ["url", "content", ..]
           _name = "S3WebPage"
           ...
       ...
  20. The Good
       • Extremely scalable
       • Possible to store Python objects in S3
       • Latency issues can be solved by using a caching layer
       • No need to back up S3 data
       • Storage is cheap
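
Two of these points, storing Python objects and hiding latency behind a cache, can be illustrated together. The sketch below is an assumption-laden stand-in rather than the S3Model implementation: FakeBucket is an in-memory substitute for an S3 bucket, CachedObjectStore is a hypothetical name, and pickle is just one possible way to serialize objects.

```python
import pickle
from collections import OrderedDict


class FakeBucket:
    """In-memory stand-in for an S3 bucket; counts GETs to show cache hits."""

    def __init__(self):
        self._blobs = {}
        self.get_count = 0

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        self.get_count += 1
        return self._blobs[key]


class CachedObjectStore:
    """Pickles objects into a bucket and keeps recent reads in an LRU cache."""

    def __init__(self, bucket, max_entries=128):
        self.bucket = bucket
        self.max_entries = max_entries
        self._cache = OrderedDict()

    def save(self, key, obj):
        self.bucket.put(key, pickle.dumps(obj))
        self._cache.pop(key, None)  # drop any stale cached copy

    def load(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        obj = pickle.loads(self.bucket.get(key))
        self._cache[key] = obj
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return obj


bucket = FakeBucket()
store = CachedObjectStore(bucket, max_entries=2)
store.save("page/1", {"url": "http://example.com/", "content": "<html></html>"})
first = store.load("page/1")
second = store.load("page/1")  # served from the cache, no second GET
```

A cache like this also reduces the number of billable GET requests. Note that unpickling is only safe for data the application wrote itself.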
  21. The Bad
       • Postgres + S3 is not an elegant solution
           – Periodic syncing of Postgres and S3 required
       • High transaction costs
           – $0.01 per 1,000 PUT, COPY, POST or LIST requests
           – $0.01 per 10,000 GET requests
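
The request prices quoted on this slide make it easy to estimate what a given workload costs. The sketch below uses those two prices as stated in 2009; the request volumes are invented for the example, and pro-rating partial blocks of requests is a simplifying assumption about billing.

```python
from decimal import Decimal

PUT_PRICE_PER_1000 = Decimal("0.01")   # PUT, COPY, POST or LIST, as quoted on the slide
GET_PRICE_PER_10000 = Decimal("0.01")  # GET, as quoted on the slide


def request_cost(put_like_requests, get_requests):
    """Return the request charge in dollars, pro-rated per request."""
    put_cost = PUT_PRICE_PER_1000 * put_like_requests / 1000
    get_cost = GET_PRICE_PER_10000 * get_requests / 10000
    return put_cost + get_cost


# Hypothetical month: 1 million writes and 5 million reads.
cost = request_cost(put_like_requests=1_000_000, get_requests=5_000_000)
```

For these made-up volumes the writes cost $10 and the reads $5, so a write-heavy crawler pays ten times more per request than a read-heavy front end.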
  22. Computing
  23. EC2 – The Good
       • Computing needs are not constant
       • Data transfer to other AWS services is free
       • AMIs per node type
  24. The Bad
       • Missed having a nerve center
           – Budget
           – Job load
           – CPU load
       • Low-cost 64-bit servers are not available
  25. Thank You
