Build Your Own Search Engine
Jeff Barr, Web Services Evangelist, Amazon Web Services
NGW044

Agenda
  • Amazon Web Services Overview
  • Looking Back
  • Build Your Own Search Engine
  • Q&A

Introduction And Background
  • Software development background
  • Veteran of several startups
  • Visual Studio team at Microsoft (DHTML, XML, Web Services)
  • 3.5 years with Amazon
  • Amazon Web Services Evangelist

What Is Amazon?
  • Online Retailer
      • Over 55 million active customer accounts
      • Seven countries: US, UK, Germany, Japan, France, Canada, China
  • Technology Consumer
      • Multi-national web sites
      • Vast data warehouse: 25 TB
      • World-class logistics: 21 fulfillment centers; 9 million ft²
  • Technology Provider
      • Hundreds of thousands of Amazon Associates
      • Over 1,050,000 active seller accounts
      • Over 150,000 software developers registered to use Amazon Web Services

What Is Alexa?
  • Amazon subsidiary since 1999
  • Alexa Toolbar
  • Web metrics
  • Traffic rankings
  • Web crawling

What Is Amazon Web Services?
  • APIs that give developers programmatic access to Amazon’s data and technology
  • Building-block web services
  • Web-scale infrastructure
  • E-commerce capability
  • Content, data, and information
  • New business models
  • Customer-created content

AWS Product Family
  • Amazon E-Commerce Service: complete access to Amazon’s product catalog. Free, plus Associates commissions paid.
  • Amazon Historical Pricing: data warehouse access for product pricing. Monthly fee.
  • Amazon Mechanical Turk: "Artificial Artificial Intelligence"; a paid workforce. 10% commission.
  • Amazon Simple Queue Service: IT building block. In beta.
  • Amazon S3: storage for the internet. Charged by storage and bandwidth usage.
  • Alexa Web Information Service: data warehouse access for web crawl data. 10K calls per month free, then 15 cents per 1000 calls.
  • Alexa Top Sites: top sites by Alexa traffic rank. Charged by URL.
  • Alexa Web Search Platform: roll your own search engine. Pay for time, storage, and bandwidth.

Amazon S3: Simple Storage Service
  • Storage for the internet: a web service to read and write data
  • 15 cents per gigabyte-month to store data
  • 20 cents per gigabyte to access data
  • Private and public storage
  • Scalable, reliable, cost-effective, and simple!

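As a rough illustration of the two S3 rates quoted above, the sketch below estimates one month's charge. Only the 15-cent and 20-cent rates come from the slide; the function name and the sample workload are hypothetical, and the sketch assumes charges are simply prorated by gigabyte.

```python
from decimal import Decimal

# Rates quoted on the slide; everything else here is illustrative.
STORAGE_USD_PER_GB_MONTH = Decimal("0.15")
TRANSFER_USD_PER_GB = Decimal("0.20")


def s3_monthly_cost(stored_gb, transferred_gb):
    """Estimate one month's S3 charge in US dollars (hypothetical helper)."""
    storage = Decimal(stored_gb) * STORAGE_USD_PER_GB_MONTH
    transfer = Decimal(transferred_gb) * TRANSFER_USD_PER_GB
    return storage + transfer


if __name__ == "__main__":
    # Hypothetical workload: 100 GB stored, 50 GB read during the month.
    print(s3_monthly_cost(100, 50))
```

Decimal arithmetic is used so that the cent amounts are not distorted by binary floating point.
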
Looking Back
Getting Online: History Lesson
  • 1996 vs. 2006
  • A lot has changed
  • Let’s take a look

Going Online Then and Now
What does it take to bring a simple web site online?
  • Domain registration
  • DNS support
  • Network connection
  • Server hardware
  • Development tools
  • Publicity vehicle
  • Monetization system

Then And Now: Domain Registration
  • Then: expensive ($70/year); single vendor; multi-step, multi-day process
  • Now: cheap ($10 or less per year); dozens of vendors; single-step, 10-minute process

Then And Now: DNS Support
  • Then: leech off of a friend or university; long propagation times; complicated; days to understand and set up
  • Now: free services (e.g. ZoneEdit); very short propagation time; minutes to understand and set up

Then Versus Now: Network Connection
  • Then: 9600 baud modem; ISDN; T1; expensive
  • Now: DSL; dedicated hosting; cheap

Then Versus Now: Server Hardware
  • Then: start with a dedicated PC; upgrade to expensive Sun hardware
  • Now: build your own PC; hosting providers (EV1, BocaCom, Server Beach); expensive Sun hardware

 
Then And Now: Development Tools
  • Then: text editor; shell window
  • Now: Visual Web Developer; HTML Kit; FrontPage

Then Versus Now: Publicity Vehicle
  • Then: Yahoo What’s New; Usenet; press release; Wired Magazine
  • Now: blogs / RSS / pings; link sites; word of mouth

Then Versus Now: Monetization System
  • Then: "Money? We are purists and we are doing this for fun!"; banner ads; ad sales people; large sites only
  • Now: pay per click; self serve; monetize page views

Then: Building a Search Engine
  • Lots of servers
  • Lots of bandwidth
  • Lots of software
  • Lots of money
  • Lots of intellectual capital
  • Lots of time

Now: Building a Search Engine
  • Use our infrastructure
  • Leverage Alexa’s crawl
  • Alexa Web Search Platform
  • 300 TB archive
  • 10 billion web pages
  • Pay as you go

AWSP: Alexa Web Search Platform
Build your own search engine!
  • Process
      • Specify pages to access within the 300 TB archive
      • Write a parallelizable application to process pages
      • Publish results as an XML feed or as a web service
  • Pricing: everything costs $1
      • 50 GB of data processing
      • 1 CPU hour
      • 1 GB of data downloaded
      • 4000 web service requests

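The "everything costs $1" pricing can be read as a table of how much of each resource one dollar buys. The sketch below turns that table into a cost estimate. The dictionary keys, the function name, and the assumption that partial units are prorated are all hypothetical; only the four quantities come from the slide.

```python
from decimal import Decimal

# What one US dollar buys, per the slide. Key names are illustrative.
UNITS_PER_DOLLAR = {
    "gb_processed": Decimal(50),
    "cpu_hours": Decimal(1),
    "gb_downloaded": Decimal(1),
    "web_service_requests": Decimal(4000),
}


def awsp_cost(usage):
    """Estimate the charge in US dollars for a dict of resource usage.

    Assumption: usage is prorated; the slide does not say how
    partial units are billed.
    """
    total = Decimal(0)
    for resource, amount in usage.items():
        total += Decimal(amount) / UNITS_PER_DOLLAR[resource]
    return total


if __name__ == "__main__":
    # Hypothetical job: 500 GB scanned, 20 CPU hours,
    # 2 GB downloaded, 8000 web service requests.
    job = {
        "gb_processed": 500,
        "cpu_hours": 20,
        "gb_downloaded": 2,
        "web_service_requests": 8000,
    }
    print(awsp_cost(job))
```
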
AWSP Concepts
  • Interactive Node: development
  • User Store: 12 TB of storage
  • Compute Node: processing
  • Data Store
      • 4 billion documents per crawl
      • 3 crawls @ 100 TB: In Process, Current, Previous
      • All document types (HTML, media, XML)
      • Document header data

AWSP Design Process
  1. Great Idea
  2. Write Code
  3. Test Code
  4. Identify Pages
  5. Schedule Job
  6. Run Job
  7. Check Results
  8. Publish Results

Great Ideas
  • Vertical search engine
  • Search engine optimization (SEO)
  • Search engine marketing (SEM)
  • Research
  • < your idea here >

Write Code
  • Run on Interactive Node
  • Linux command line
  • Interactive application development
  • Use Collection API for data retrieval
  • Use any language
  • Libraries for C, Java, Perl
  • Execution framework
  • Application processes one document

Write Code
Code can:
  • Examine document
  • Examine headers
  • Write to a collection
  • Write to <stdout>
  • Store data to Amazon S3

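To make the "application processes one document" model concrete, here is a minimal sketch of a per-document function that examines the headers, examines the document, and writes a record to standard output. It does not use the real Collection API or the AWSP libraries; the header keys (`URL`, `Content-Type`), the function name, and the sample document are assumptions for illustration only.

```python
import sys
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def process_document(headers, body):
    """Handle exactly one document; return a record, or None to skip it."""
    # Examine headers: skip anything that is not HTML.
    if not headers.get("Content-Type", "").startswith("text/html"):
        return None
    # Examine the document itself and pull out its title.
    parser = TitleParser()
    parser.feed(body)
    title = " ".join(parser.title.split())
    return f"{headers.get('URL', '')}\t{title}"


if __name__ == "__main__":
    # Stand-in for a document the platform would hand to the application.
    sample_headers = {"URL": "http://example.com/", "Content-Type": "text/html"}
    sample_body = "<html><head><title> Example  Page </title></head></html>"
    record = process_document(sample_headers, sample_body)
    if record is not None:
        sys.stdout.write(record + "\n")
```

Because the function touches only the one document it is given, many copies can run side by side, which is what makes the application parallelizable.
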
Test Code
  • Run small test on Interactive Node
  • Use predefined document collection
  • Ensure proper functioning
  • Measure document processing time

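Measuring document processing time on a small sample could look like the sketch below. The helper names and the toy word-counting job are hypothetical; on the real platform the sample would come from the predefined document collection rather than an in-memory list.

```python
import time


def measure_processing_time(process, documents):
    """Run process() over a small sample; return mean seconds per document."""
    if not documents:
        raise ValueError("need at least one document")
    start = time.perf_counter()
    for doc in documents:
        process(doc)
    elapsed = time.perf_counter() - start
    return elapsed / len(documents)


def count_words(doc):
    """Toy per-document job used only for this illustration."""
    return len(doc.split())


if __name__ == "__main__":
    sample = ["alpha beta gamma", "delta epsilon"] * 1000
    per_doc = measure_processing_time(count_words, sample)
    print(f"{per_doc * 1e6:.2f} microseconds per document")
```

The per-document figure is what feeds the later decision about how much time and how many processors to request.
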
Identify Pages
  • Choose a crawl
  • Choose pages within the crawl by:
      • URL
      • Linkage
      • Alexa Traffic Rank (Top N)
      • Redirection status
      • Content
  • Define a Collection

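The selection criteria above amount to a filter over page metadata. The sketch below shows that idea with three of the criteria (URL, traffic rank, redirection status). It is not the platform's collection-definition syntax; the `PageRecord` fields and the `select_pages` function are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class PageRecord:
    """Hypothetical per-page metadata; field names are illustrative."""
    url: str
    traffic_rank: int
    is_redirect: bool


def select_pages(pages, domain, top_n):
    """Keep non-redirect pages on `domain` whose traffic rank is within top_n."""
    selected = []
    for page in pages:
        host = urlparse(page.url).hostname or ""
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and page.traffic_rank <= top_n and not page.is_redirect:
            selected.append(page)
    return selected


if __name__ == "__main__":
    crawl = [
        PageRecord("http://www.example.com/a", 50, False),
        PageRecord("http://www.example.com/b", 50, True),
        PageRecord("http://other.org/", 10, False),
        PageRecord("http://blog.example.com/", 5000, False),
    ]
    for page in select_pages(crawl, "example.com", 100):
        print(page.url)
```
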
Schedule Job
  • Allocate compute cluster resources
      • Time
      • Processors (1-10)
  • Each processor
      • 3.6 GHz CPU
      • 4 GB of RAM
      • 500 GB of local disk storage
  • Charged at $1 per CPU hour

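Choosing how much time and how many processors to request is simple arithmetic once the per-document time is known. The sketch below is a back-of-the-envelope estimate that assumes perfect linear scaling across processors and prorated billing, neither of which the slide states; the function name and sample figures are hypothetical.

```python
USD_PER_CPU_HOUR = 1.0  # rate quoted on the slide


def estimate_job(num_documents, seconds_per_document, processors):
    """Return (wall_clock_hours, cpu_hours, cost_usd) for a job.

    Assumptions: work divides evenly across processors and
    CPU time is billed pro rata.
    """
    if not 1 <= processors <= 10:
        raise ValueError("the slide lists 1-10 processors")
    cpu_hours = num_documents * seconds_per_document / 3600
    wall_clock_hours = cpu_hours / processors
    cost_usd = cpu_hours * USD_PER_CPU_HOUR
    return wall_clock_hours, cpu_hours, cost_usd


if __name__ == "__main__":
    # Hypothetical job: 7.2 million documents at 0.5 s each on 10 processors.
    wall, cpu, cost = estimate_job(7_200_000, 0.5, 10)
    print(f"{wall} wall-clock hours, {cpu} CPU hours, ${cost}")
```

Under these assumptions adding processors shortens the wall-clock time but leaves the CPU-hour charge unchanged.
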
Run Job
  • Job runs at specified time
  • Code instances created on each node
  • Job output combined automatically
  • [Diagram: Collection -> Compute Node #1 ... Compute Node N -> Combine Results]

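The run-time flow (split the collection, run a code instance per node, combine the output) can be imitated locally. The sketch below uses threads as stand-ins for compute nodes and a word count as the per-document job; the function names and the round-robin split are assumptions, not a description of how the platform actually distributes work.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(collection, num_nodes):
    """Deal documents round-robin into num_nodes shards."""
    return [collection[i::num_nodes] for i in range(num_nodes)]


def run_on_node(shard):
    """Stand-in for one compute node: process each document independently."""
    return [len(doc.split()) for doc in shard]


def run_job(collection, num_nodes):
    """Process every shard in parallel, then combine the per-node outputs."""
    shards = partition(collection, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        per_node_outputs = list(pool.map(run_on_node, shards))
    combined = []
    for output in per_node_outputs:
        combined.extend(output)
    return combined


if __name__ == "__main__":
    docs = ["a b", "c", "d e f", "g"]
    print(run_job(docs, 2))
```

The combined output is grouped by node rather than kept in the original document order, so a real job would normally carry a key such as the URL in each record.
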
Check Results
  • Monitor progress using portal
  • Final status email
  • Log files
  • Output

Publishing Results
  • Store data to S3
  • Create a new index for AWIS use
  • Publish data for access via web search

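The platform overview mentions publishing results as an XML feed. As one possible shape for such a feed, the sketch below serializes (URL, title) pairs with Python's standard library. The element and attribute names are invented for illustration; the deck does not specify a feed schema.

```python
import xml.etree.ElementTree as ET


def results_to_xml(results):
    """Serialize (url, title) pairs as a small XML document.

    The <results>/<result> element names are hypothetical.
    """
    root = ET.Element("results")
    for url, title in results:
        item = ET.SubElement(root, "result", attrib={"url": url})
        item.text = title
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(results_to_xml([("http://example.com/", "Example Page")]))
```

The resulting string could then be stored to S3 or served from a web service, as the list above suggests.
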
Q & A
 
For More Information
  • AWSP: websearch.alexa.com
  • Alexa Blog: awis.blogspot.com
  • AWS Blog: aws.typepad.com
  • Amazon Web Services: aws.amazon.com

© 2006 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
