Lesson from Building a Search Engine using the cloud

Chirayu Patel shares his experience in building SnappyFingers.com.

Transcript of "Lesson from Building a Search Engine using the cloud"

  1. Lessons from building a search engine with Amazon Web Services
     Chirayu Patel
     chirayu@snappyfingers.com
  2. Chirayu Patel
     Developer, SnappyFingers
     Question and Answer Search Engine
     100% in the cloud
     5-Apr-09 CloudCamp - Bangalore
  3. My experiences
     What worked?
     What didn’t?
     What could be done better?
     What did I miss?
  4. Recap - AWS?
     • EC2 – Elastic Compute Cloud
     • S3 – Simple Storage Service
     • SQS – Simple Queue Service
     • SDB – SimpleDB
  5. Why AWS?
     • Computing Power requirements unknown
     • Cheap
     • Availability of multiple services
     • Easy to implement SnappyFingers architecture using AWS services
  6. SnappyFingers
     • Information Retrieval System (IRS)
     • FrontEnd
       – Nothing unique here
  7. Three motivations (behind my decisions)
     • Reluctance to learn
     • Cost Conscious
     • I write buggy code
  8. Architectural Requirements
     • Loosely Coupled
     • Scalable
     • Fault Tolerant
     • Budget dependent
  9. IRS Architecture (diagram)
     • Pipeline (SQS): Crawler pipe, Parser pipe, Indexer pipe
     • Workers (EC2): Crawler, Parser, Indexer
     • Data Store: EC2 + S3
     • Errors: SDB
  10. Pipes and Pipelines
  11. Pipes and Pipelines
      • Pipes contain jobs
      • A pipeline is a group of pipes
      • Easy to create pipelines and add pipes
  12. Job ORM
          class CrawlerJob (JobBase):
              class SDBInterfaceConfig:    # SDB API
                  domain_name = settings.CRAWLER_JOB_DOMAIN
              class SQSInterfaceConfig:    # SQS API
                  queue_name = settings.CRAWLER_JOB_QUEUE
                  timeout = settings.CRAWLER_JOB_TIMEOUT
              class AWSMetaData:
                  action = CharField (...)
                  url = CharField (...)
                  ...
                  ...
      Default attributes of each Job:
      • Pipeline Name
      • Status
      • Start Time
      • End Time
      • Id
  13. Job Processing
          for i in range (num_of_jobs):
              try:
                  job = cls.jobclass.sqs_get() # process job
                  ...
              except Exception as e:
                  job.job_processing_complete(…)
                  fsdebug.mail_admins (..)
                  end_transaction(rollback = True)
                  job.sdb_save() # save in error store
              finally:
                  job.sqs_del() # delete the job
  14. The Good
      • Architecture easy to extend
      • ORM approach is a big time saver
      • Simple to add new services
  15. The Bad
      • Messages may be lost
        – Service Failure
        – SQS deletes messages after 4 days.
      Important: the system should be able to recreate jobs
  16. Storage
  17. What do we store?
      • Crawler Data – Web Pages
      • Extracted Content – Questions/Answers
      • Backups
  18. Storage Structure
      • Meta Data: Postgres
      • Key + Value: S3
  19. ORM
      • Extended Django ORM to support S3
          class S3WebPage (S3Model):
              _allowed_attrs = ["url", "content", ..]
              _name = "S3WebPage"
              ...
              ...
  20. The Good
      • Extremely scalable
      • Possible to store Python objects in S3
      • Latency issues can be solved by using a caching layer
      • No need to back up S3 data
      • Storage is cheap
  21. The Bad
      • Postgres + S3 is not an elegant solution
        – Periodic syncing of Postgres and S3 required
      • High transaction costs
        – $0.01 per 1,000 PUT, COPY, POST, or LIST requests
        – $0.01 per 10,000 GET requests
  22. Computing
  23. EC2 – The Good
      • Computing needs are not constant
      • Data transfer to other AWS services is free
      • AMIs per node type
  24. The Bad
      • Missed having a nerve center
        – Budget
        – Job Load
        – CPU load
      • Low-cost 64-bit servers are not available
  25. Thank You
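
The job-processing loop on slide 13 can be sketched as a small runnable example. This is only an illustration under stated assumptions, not the SnappyFingers code: `InMemoryPipe` stands in for an SQS queue, a plain dict stands in for the SDB error store, and the names `Job`, `InMemoryPipe`, `process_jobs`, and `crawl` are all hypothetical. One deliberate difference from the slide is that the job is fetched before the `try` block, so `job` is always bound when the `except` and `finally` clauses run.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; the slides list an id and a status among the defaults."""
    id: int
    url: str
    status: str = "pending"


class InMemoryPipe:
    """Stand-in for an SQS queue so the sketch runs without AWS credentials."""

    def __init__(self, jobs):
        self._visible = deque(jobs)
        self._in_flight = {}

    def get(self):
        """Hand out the next job, or None when the pipe is empty."""
        if not self._visible:
            return None
        job = self._visible.popleft()
        self._in_flight[job.id] = job
        return job

    def delete(self, job):
        """Acknowledge a job so it is not delivered again."""
        self._in_flight.pop(job.id, None)

    def pending(self):
        """Number of jobs still waiting or handed out but not yet deleted."""
        return len(self._visible) + len(self._in_flight)


def process_jobs(pipe, handler, error_store, num_of_jobs):
    """Process up to num_of_jobs jobs; failed jobs are copied to error_store."""
    processed = 0
    for _ in range(num_of_jobs):
        job = pipe.get()  # fetched outside try so `job` is always bound below
        if job is None:
            break
        try:
            handler(job)
            job.status = "done"
        except Exception as exc:
            job.status = "error"
            error_store[job.id] = (job, str(exc))  # plays the role of the error store
        finally:
            pipe.delete(job)  # delete the job whether it succeeded or failed
        processed += 1
    return processed


def crawl(job):
    """Toy handler: rejects anything that does not look like an HTTP URL."""
    if not job.url.startswith("http"):
        raise ValueError("unsupported URL: " + job.url)


pipe = InMemoryPipe([Job(1, "http://example.com/a"), Job(2, "ftp://example.com/b")])
errors = {}
count = process_jobs(pipe, crawl, errors, num_of_jobs=10)
print(count, sorted(errors), pipe.pending())
```

Because a failed job is deleted from the pipe after being copied to the error store, the error store becomes the only record of it, which is one way to read the slide's point that the system should be able to recreate jobs.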