Lesson from Building a Search Engine using the cloud
Chirayu Patel shares his experience in building SnappyFingers.com

    Lesson from Building a Search Engine using the cloud: Presentation Transcript

    • Lessons from building a search engine with Amazon Web Services
      Chirayu Patel, chirayu@snappyfingers.com
    • Chirayu Patel, Developer, SnappyFingers
      • Question and Answer Search Engine
      • 100% in the cloud
      5-Apr-09 CloudCamp - Bangalore
    • My experiences
      • What worked?
      • What didn’t?
      • What could be done better?
      • What did I miss?
    • Recap - AWS?
      • EC2: Elastic Compute Cloud
      • S3: Simple Storage Service
      • SQS: Simple Queue Service
      • SDB: SimpleDB
    • Why AWS?
      • Computing power requirements unknown
      • Cheap
      • Availability of multiple services
      • Easy to implement the SnappyFingers architecture using AWS services
    • SnappyFingers
      • Information Retrieval System (IRS)
      • FrontEnd: nothing unique here
    • Three motivations (behind my decisions)
      • Reluctance to learn
      • Cost conscious
      • I write buggy code
    • Architectural Requirements
      • Loosely coupled
      • Scalable
      • Fault tolerant
      • Budget dependent
    • IRS Architecture (diagram)
      • Pipeline (SQS): Pipe Crawler, Pipe Parser, Pipe Indexer
      • Workers (EC2): Crawler, Parser, Indexer
      • Data Store: EC2 + S3
      • Errors: SDB
    • Pipes and Pipelines
    • Pipes and Pipelines
      • Pipes contain jobs
      • A pipeline is a group of pipes
      • Easy to create pipelines and add pipes
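The pipe and pipeline relationship on this slide can be sketched roughly as follows. This is a minimal illustration that uses in-memory queues as a stand-in for SQS; the `Pipe` and `Pipeline` classes, their method names, and the example URL are hypothetical and are not taken from the SnappyFingers code.

```python
from collections import deque


class Pipe:
    """A named queue of jobs (in-memory stand-in for one SQS queue)."""

    def __init__(self, name):
        self.name = name
        self._jobs = deque()

    def put(self, job):
        self._jobs.append(job)

    def get(self):
        # Return the oldest job, or None when the pipe is empty.
        return self._jobs.popleft() if self._jobs else None

    def __len__(self):
        return len(self._jobs)


class Pipeline:
    """A named group of pipes, kept in the order they were added."""

    def __init__(self, name):
        self.name = name
        self.pipes = {}

    def add_pipe(self, pipe_name):
        # Adding a pipe is a single call, which keeps pipelines easy to extend.
        pipe = Pipe(pipe_name)
        self.pipes[pipe_name] = pipe
        return pipe


irs = Pipeline("irs")
for stage in ("crawler", "parser", "indexer"):
    irs.add_pipe(stage)

irs.pipes["crawler"].put({"action": "crawl", "url": "http://example.com/faq"})
job = irs.pipes["crawler"].get()
irs.pipes["parser"].put(job)  # hand the job to the next stage
```

Keeping one pipe per stage means a stage only needs to know the name of the pipe it reads from and the one it writes to, which is one way to get the loose coupling listed in the requirements.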
    • Job ORM
        class CrawlerJob(JobBase):
            class SDBInterfaceConfig:  # SDB API
                domain_name = settings.CRAWLER_JOB_DOMAIN
            class SQSInterfaceConfig:  # SQS API
                queue_name = settings.CRAWLER_JOB_QUEUE
                timeout = settings.CRAWLER_JOB_TIMEOUT
            class AWSMetaData:
                action = CharField(...)
                url = CharField(...)
                ...
                ...
      Default attributes of each Job:
      • Pipeline Name
      • Status
      • Start Time
      • End Time
      • Id
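One way a base class could pick up the nested configuration classes shown on this slide is sketched below. The slide does not show how `JobBase` is implemented, so everything here is an assumption: the classmethods, the default attribute values, and the literal queue and domain names (which stand in for the `settings.*` values) are hypothetical.

```python
import time
import uuid


class JobBase:
    """Hypothetical base class that reads a subclass's nested config classes."""

    class SQSInterfaceConfig:
        queue_name = None
        timeout = 30

    class SDBInterfaceConfig:
        domain_name = None

    def __init__(self, pipeline_name, **attrs):
        # Default attributes listed on the slide.
        self.id = uuid.uuid4().hex
        self.pipeline_name = pipeline_name
        self.status = "new"
        self.start_time = time.time()
        self.end_time = None
        # Job-specific attributes such as action and url.
        self.attrs = attrs

    @classmethod
    def queue_name(cls):
        # Resolves to the subclass's nested config when it overrides it.
        return cls.SQSInterfaceConfig.queue_name

    @classmethod
    def domain_name(cls):
        return cls.SDBInterfaceConfig.domain_name


class CrawlerJob(JobBase):
    class SQSInterfaceConfig:
        queue_name = "crawler-jobs"  # placeholder for settings.CRAWLER_JOB_QUEUE
        timeout = 120                # placeholder for settings.CRAWLER_JOB_TIMEOUT

    class SDBInterfaceConfig:
        domain_name = "crawler-jobs-domain"  # placeholder for settings.CRAWLER_JOB_DOMAIN


job = CrawlerJob("irs", action="crawl", url="http://example.com/faq")
```

The appeal of the declarative style is that a new job type only states where it lives (queue and domain) and what it carries, while the shared queue and storage plumbing stays in the base class.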
    • Job Processing
        for i in range(num_of_jobs):
            try:
                job = cls.jobclass.sqs_get()
                # process job
                ...
            except Exception as e:
                job.job_processing_complete(…)
                fsdebug.mail_admins(..)
                end_transaction(rollback=True)
                job.sdb_save()  # save in error store
            finally:
                job.sqs_del()  # delete the job
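The loop on this slide can be tried out with in-memory stand-ins, as in the sketch below. The `queue`, `error_store`, and `process` names are hypothetical, and the sketch makes one deliberate change: the job is fetched before the `try` block, so that `job` is always bound when the `except` and `finally` clauses run.

```python
from collections import deque

# In-memory stand-ins (assumptions): a deque for the SQS queue and a list
# for the SimpleDB error store.
queue = deque([{"url": "http://example.com/a"}, {"url": None}])
error_store = []
processed = []


def process(job):
    # Hypothetical job handler: fails when the job has no URL.
    if not job["url"]:
        raise ValueError("job has no url")
    processed.append(job["url"])


def run(num_of_jobs):
    for _ in range(num_of_jobs):
        if not queue:
            break  # nothing left to fetch
        job = queue[0]  # peek: the job stays queued until it has been handled
        try:
            process(job)
        except Exception as exc:
            # Record the failure so the job can be inspected or recreated later.
            job["status"] = "error"
            job["error"] = str(exc)
            error_store.append(job)
        finally:
            queue.popleft()  # always delete the job, as on the slide


run(5)
```

Deleting in `finally` means a failing job is never retried from the queue, so the error store is the only place it survives. That fits the "I write buggy code" motivation: a bad job cannot loop forever.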
    • The Good
      • Architecture easy to extend
      • ORM approach is a big time saver
      • Simple to add new services
    • The Bad
      • Messages may be lost
        – Service failure
        – SQS deletes messages after 4 days
      • Important: the system should be able to recreate jobs
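A rough sketch of what "recreate jobs" could mean in practice: keep a durable record of every job outside the queue, and periodically look for unfinished jobs that are older than the retention period. The record layout, the status values, and the function name are assumptions; only the 4-day figure comes from the slide.

```python
RETENTION_SECONDS = 4 * 24 * 3600  # retention period quoted on the slide


def jobs_to_recreate(records, now):
    """Return ids of unfinished jobs whose queue message has probably expired.

    `records` maps a job id to a dict with `status` and `enqueued_at`
    (seconds since the epoch). This layout is an assumption.
    """
    return sorted(
        job_id
        for job_id, rec in records.items()
        if rec["status"] != "done" and now - rec["enqueued_at"] > RETENTION_SECONDS
    )


now = 1_000_000
records = {
    "a": {"status": "done", "enqueued_at": now - 5 * 24 * 3600},
    "b": {"status": "queued", "enqueued_at": now - 5 * 24 * 3600},
    "c": {"status": "queued", "enqueued_at": now - 3600},
}
lost = jobs_to_recreate(records, now)
```

Here only job `b` is reported: it is unfinished and older than the retention period, while `a` is already done and `c` is still within the window.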
    • Storage
    • What do we store?
      • Crawler data: web pages
      • Extracted content: questions/answers
      • Backups
    • Storage Structure
      • Meta data: Postgres
      • Key + value: S3
    • ORM
      • Extended Django ORM to support S3
        class S3WebPage(S3Model):
            _allowed_attrs = ["url", "content", ..]
            _name = "S3WebPage"
            ...
            ...
    • The Good
      • Extremely scalable
      • Possible to store Python objects in S3
      • Latency issues can be solved by using a caching layer
      • No need to back up S3 data
      • Storage is cheap
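Two of these points, storing Python objects and hiding latency behind a cache, can be illustrated together. The sketch below is an assumption about how such a layer might look, not the S3 ORM from the talk: `FakeBucket` is an in-memory stand-in for an S3 bucket, and `CachedObjectStore` is a hypothetical read-through cache that serializes objects with `pickle`.

```python
import pickle


class FakeBucket:
    """In-memory stand-in for an S3 bucket; counts GETs to show cache hits."""

    def __init__(self):
        self._blobs = {}
        self.get_count = 0

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        self.get_count += 1
        return self._blobs[key]


class CachedObjectStore:
    """Pickles Python objects into the bucket and caches reads locally."""

    def __init__(self, bucket):
        self.bucket = bucket
        self._cache = {}

    def save(self, key, obj):
        self.bucket.put(key, pickle.dumps(obj))
        self._cache[key] = obj

    def load(self, key):
        if key not in self._cache:
            # Cache miss: pay the storage round trip once.
            self._cache[key] = pickle.loads(self.bucket.get(key))
        return self._cache[key]


bucket = FakeBucket()
page = {"url": "http://example.com/faq", "content": "<html></html>"}
bucket.put("page/1", pickle.dumps(page))  # object written by some earlier run

store = CachedObjectStore(bucket)
first = store.load("page/1")   # goes to the bucket
second = store.load("page/1")  # served from the cache
```

A cache like this also cuts the number of billable GET requests. Note that `pickle` should only be used on data the system wrote itself, since unpickling untrusted bytes can execute arbitrary code.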
    • The Bad
      • Postgres + S3 is not an elegant solution
        – Periodic syncing of Postgres and S3 required
      • High transaction costs
        – $0.01 per 1,000 PUT, COPY, POST or LIST requests
        – $0.01 per 10,000 GET requests
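To see how the request prices on this slide add up, here is a small calculator. The prices are the ones quoted on the slide; the workload numbers are made up for illustration, and the sketch assumes charges scale linearly with the request count.

```python
from decimal import Decimal

PUT_PRICE_PER_1000 = Decimal("0.01")   # PUT, COPY, POST or LIST, from the slide
GET_PRICE_PER_10000 = Decimal("0.01")  # GET, from the slide


def request_cost(puts, gets):
    """Return the request charge in dollars for the given request counts."""
    return puts * PUT_PRICE_PER_1000 / 1000 + gets * GET_PRICE_PER_10000 / 10000


# Hypothetical workload: store 1,000,000 pages and read each one 3 times.
cost = request_cost(puts=1_000_000, gets=3_000_000)
```

For this workload the writes cost $10 and the reads cost $3, so `cost` is $13. A write is ten times as expensive as a read, which is why a crawler that stores every fetched page feels the PUT price first.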
    • Computing
    • EC2: The Good
      • Computing needs are not constant
      • Data transfer to other AWS services is free
      • AMIs per node type
    • The Bad
      • Missed having a nerve center
        – Budget
        – Job load
        – CPU load
      • Low-cost 64-bit servers are not available
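The talk does not describe what the missing nerve center would have looked like. As a minimal sketch under that caveat, it could start as a single check over the three metrics named on the slide; the `Snapshot` fields, the thresholds, and the `alerts` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    spend_usd: float   # money spent so far this month
    queued_jobs: int   # jobs waiting across all pipes
    cpu_load: float    # average CPU utilisation, 0.0 to 1.0


def alerts(snapshot, budget_usd, max_queued_jobs, max_cpu_load):
    """Return the names of the metrics that exceed their limits."""
    found = []
    if snapshot.spend_usd > budget_usd:
        found.append("budget")
    if snapshot.queued_jobs > max_queued_jobs:
        found.append("job load")
    if snapshot.cpu_load > max_cpu_load:
        found.append("cpu load")
    return found


result = alerts(
    Snapshot(spend_usd=120.0, queued_jobs=50, cpu_load=0.95),
    budget_usd=100.0,
    max_queued_jobs=1000,
    max_cpu_load=0.9,
)
```

With these example numbers the check flags the budget and the CPU load but not the job load. The same result could drive decisions such as starting or stopping EC2 nodes.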
    • Thank You