Build Your Own Search Engine
Jeff Barr, Web Services Evangelist, Amazon Web Services
NGW044

Amazon subsidiary Alexa.com is leveling the search playing field. For the first time, developers looking to build the next "big thing" in search, or a highly customized search engine, have access to the 300 terabytes of Alexa crawl data, along with the utilities to search, process, and publish their own custom subset of the data, all at a reasonable price.

    1. Build Your Own Search Engine
       Jeff Barr, Web Services Evangelist, Amazon Web Services, NGW044
    2. Agenda
         • Amazon Web Services Overview
         • Looking Back
         • Build Your Own Search Engine
         • Q&A
    3. Introduction And Background
         • Software development background
         • Veteran of several startups
         • Visual Studio team at Microsoft (DHTML, XML, Web Services)
         • 3.5 years with Amazon
         • Amazon Web Services Evangelist
    4. What Is Amazon?
         • Online Retailer
             • Over 55 million active customer accounts
             • Seven countries: US, UK, Germany, Japan, France, Canada, China
         • Technology Consumer
             • Multi-National Web Sites
             • Vast Data Warehouse – 25 TB
             • World-Class Logistics – 21 fulfillment centers; 9 million ft²
         • Technology Provider
             • Hundreds of thousands of Amazon Associates
             • Over 1,050,000 active seller accounts
             • Over 150,000 software developers registered to use Amazon Web Services
    5. What Is Alexa?
         • Amazon subsidiary since 1999
         • Alexa Toolbar
         • Web metrics
         • Traffic rankings
         • Web crawling
    6. What Is Amazon Web Services?
         • APIs that give developers programmatic access to Amazon’s data and technology
             • Building-block web services
             • Web-scale infrastructure
             • E-commerce capability
             • Content, data, and information
             • New business models
             • Customer-created content
    7. AWS Product Family
         • Amazon E-Commerce Service
             • Complete access to Amazon’s product catalog
             • Free + Associates commissions paid
         • Amazon Historical Pricing
             • Data warehouse access for product pricing
             • Monthly fee
         • Amazon Mechanical Turk
             • Artificial Artificial Intelligence
             • 10% commission
             • Paid workforce
         • Amazon Simple Queue Service
             • IT building block
             • In beta
         • Amazon S3
             • Storage for the internet
             • Charge by storage/bandwidth usage
         • Alexa Web Information Service
             • Data warehouse access for web crawl data
             • 10K calls per month free, then 15 cents per 1000 calls
         • Alexa Top Sites
             • Top sites by Alexa traffic rank
             • Charges by URL
         • Alexa Web Search Platform
             • Roll your own search engine
             • Pay for time, storage, bandwidth
    8. Amazon S3: Simple Storage Service
         • Storage for the internet: a web service to read and write data
         • 15 cents per gigabyte-month to store data
         • 20 cents per gigabyte to access data
         • Private and public storage
         • Scalable, reliable, cost-effective, and simple!
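The two S3 prices quoted on this slide can be turned into a quick monthly estimate. The sketch below is only an illustration of that arithmetic: the function name and its parameters are invented here, and it assumes the two charges simply add with no minimums or tiers, which the slide does not state.

```python
# Prices as quoted on the slide (2006 figures).
STORAGE_USD_PER_GB_MONTH = 0.15
TRANSFER_USD_PER_GB = 0.20


def s3_monthly_cost(stored_gb: float, transferred_gb: float) -> float:
    """Estimate one month of S3 charges in US dollars.

    Assumption: storage and transfer charges add linearly.
    """
    if stored_gb < 0 or transferred_gb < 0:
        raise ValueError("usage figures must be non-negative")
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    transfer = transferred_gb * TRANSFER_USD_PER_GB
    return round(storage + transfer, 2)
```

For example, keeping 100 GB stored and serving 50 GB in a month comes to 15 + 10 = 25 dollars under these assumptions.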
    9. Looking Back
    10. Getting Online
         • History lesson
         • 1996 vs. 2006
         • A lot has changed
         • Let’s take a look
    11. Going Online Then and Now
         • What does it take to bring a simple web site online?
             • Domain registration
             • DNS support
             • Network connection
             • Server hardware
             • Development tools
             • Publicity vehicle
             • Monetization system
    12. Then And Now: Domain Registration
         • Then
             • Expensive ($70/year)
             • Single vendor
             • Multi-step, multi-day process
         • Now
             • Cheap ($10 or less per year)
             • Dozens of vendors
             • Single-step, 10-minute process
    13. Then And Now: DNS Support
         • Then
             • Leech off of friend or university
             • Long propagation times
             • Complicated
             • Days to understand and set up
         • Now
             • Free services (e.g. ZoneEdit)
             • Very short propagation time
             • Minutes to understand and set up
    14. Then Versus Now: Network Connection
         • Then
             • 9600 baud modem
             • ISDN
             • T1
             • Expensive
         • Now
             • DSL
             • Dedicated hosting
             • Cheap
    15. Then Versus Now: Server Hardware
         • Then
             • Start with dedicated PC
             • Upgrade to expensive Sun hardware
         • Now
             • Build your own PC
             • Hosting providers (EV1, BocaCom, Server Beach)
             • Expensive Sun hardware
    17. Then And Now: Development Tools
         • Then
             • Text editor
             • Shell window
         • Now
             • Visual Web Developer
             • HTML Kit
             • Front Page
    18. Then Versus Now: Publicity Vehicle
         • Then
             • Yahoo What’s New
             • Usenet
             • Press release
             • Wired Magazine
         • Now
             • Blogs / RSS / Pings
             • Link sites
             • Word of mouth
    19. Then Versus Now: Monetization System
         • Then
             • Money? We are purists and we are doing this for fun!
             • Banner ads
             • Ad sales people
             • Large sites only
         • Now
             • Pay per click
             • Self serve
             • Monetize page views
    20. Then: Building a Search Engine
         • Lots of servers
         • Lots of bandwidth
         • Lots of software
         • Lots of money
         • Lots of intellectual capital
         • Lots of time
    21. Now: Building a Search Engine
         • Use our infrastructure
         • Leverage Alexa’s crawl
         • Alexa Web Search Platform
         • 300 TB archive
         • 10 billion web pages
         • Pay as you go
    22. AWSP: Alexa Web Search Platform
         • Build your own search engine!
         • Process
             • Specify pages to access within the 300 TB archive
             • Write parallelizable application to process pages
             • Publish results as XML feed or as web service
         • Pricing: each of the following costs $1
             • 50 GB of data processing
             • 1 CPU hour
             • 1 GB of data downloaded
             • 4000 web service requests
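Because every AWSP unit on this slide costs one dollar, a job's bill can be estimated by dividing each usage figure by its unit size and summing. The following is a rough sketch of that calculation; `JobUsage` and `awsp_cost_usd` are names made up for this example, and it assumes fractional units are billed proportionally, which the slide does not say.

```python
from dataclasses import dataclass

# Unit sizes from the slide: each of these amounts costs $1.
GB_PROCESSED_PER_USD = 50
CPU_HOURS_PER_USD = 1
GB_DOWNLOADED_PER_USD = 1
REQUESTS_PER_USD = 4000


@dataclass
class JobUsage:
    """Hypothetical usage summary for one AWSP job."""
    processed_gb: float
    cpu_hours: float
    downloaded_gb: float
    requests: int


def awsp_cost_usd(usage: JobUsage) -> float:
    """Sum the four charges, assuming proportional billing of partial units."""
    return (
        usage.processed_gb / GB_PROCESSED_PER_USD
        + usage.cpu_hours / CPU_HOURS_PER_USD
        + usage.downloaded_gb / GB_DOWNLOADED_PER_USD
        + usage.requests / REQUESTS_PER_USD
    )
```

A job that processes 500 GB, uses 20 CPU hours, downloads 2 GB, and serves 8000 requests would come to 10 + 20 + 2 + 2 = 34 dollars under these assumptions.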
    23. AWSP Concepts
         • Interactive Node – development
         • User Store – 12 TB of storage
         • Compute Node – processing
         • Data Store
             • 4 billion documents per crawl
             • 3 crawls @ 100 TB
                 • In process
                 • Current
                 • Previous
             • All document types (HTML, media, XML)
             • Document header data
    24. AWSP Design Process
       Great Idea → Write Code → Test Code → Identify Pages → Schedule Job → Run Job → Check Results → Publish Results
    25. Great Ideas
         • Vertical search engine
         • Search engine optimization (SEO)
         • Search engine marketing (SEM)
         • Research
         • < your idea here >
    27. Write Code
         • Run on Interactive Node
             • Linux command line
             • Interactive application development
         • Use Collection API for data retrieval
         • Use any language
         • Libraries for C, Java, Perl
         • Execution framework
         • Application processes one document
    28. Write Code
         • Code can
             • Examine document
             • Examine headers
             • Write to a collection
             • Write to stdout
             • Store data to Amazon S3
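The "application processes one document" model can be pictured as a function that receives a single page's headers and body and emits one line of output. The sketch below is illustrative only: it does not use the real Collection API or execution framework (their interfaces are not shown in the deck), and `process_document` with its `url`, `headers`, and `body` parameters is a hypothetical stand-in for whatever the framework actually passes in.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def process_document(url, headers, body):
    """Examine one document and return a tab-separated output line.

    Returns None for non-HTML documents so the caller can skip them.
    The (url, headers, body) shape is an assumption for illustration.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    if not normalized.get("content-type", "").startswith("text/html"):
        return None
    parser = TitleParser()
    parser.feed(body)
    parser.close()
    return f"{url}\t{parser.title.strip()}"
```

In a real job the returned line would be written to stdout (or to a collection or S3, as the slide lists); keeping the per-document logic in a pure function makes it easy to exercise on the Interactive Node first.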
    30. Test Code
         • Run small test on Interactive Node
         • Use predefined document collection
         • Ensure proper functioning
         • Measure document processing time
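Measuring document processing time on a small sample is what makes the later scheduling and cost estimates possible. One simple way to do it, sketched here with a hypothetical helper name and a plain Python list standing in for the predefined document collection, is to time a full pass and divide by the number of documents.

```python
import time


def mean_seconds_per_document(process, documents):
    """Run `process` over every sample document and return the mean time.

    `process` is any callable taking one document; `documents` is a
    non-empty sequence used as a stand-in for a test collection.
    """
    if not documents:
        raise ValueError("need at least one sample document")
    start = time.perf_counter()
    for document in documents:
        process(document)
    elapsed = time.perf_counter() - start
    return elapsed / len(documents)
```

The result is only as representative as the sample, so the sample should resemble the pages the real job will see.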
    32. Identify Pages
         • Choose a crawl
         • Choose pages within the crawl by
             • URL
             • Linkage
             • Alexa Traffic Rank (Top N)
             • Redirection status
             • Content
         • Define a Collection
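The selection criteria on this slide amount to filtering crawl records by a few attributes. The deck does not show how AWSP expresses such a query, so the following is purely a sketch of the idea in ordinary Python: `PageRecord`, its fields, and `select_pages` are all hypothetical, and only three of the listed criteria (URL, traffic rank, redirection status) are modeled.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical crawl record; field names are illustrative."""
    url: str
    traffic_rank: int   # 1 is the most visited site
    is_redirect: bool


def select_pages(pages, domain, top_n):
    """Keep non-redirect pages on `domain` (or a subdomain) ranked in the top N."""
    selected = []
    for page in pages:
        host = urlparse(page.url).hostname or ""
        on_domain = host == domain or host.endswith("." + domain)
        if on_domain and not page.is_redirect and page.traffic_rank <= top_n:
            selected.append(page)
    return selected
```

The list returned here plays the role of the "Collection" that the job would later run against.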
    41. Schedule Job
         • Allocate compute cluster resources
             • Time
             • Processors (1-10)
         • Each processor
             • 3.6 GHz CPU
             • 4 GB of RAM
             • 500 GB of local disk storage
         • Charged at $1 per CPU hour
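Given a measured per-document processing time, the processor count and the $1 per CPU hour rate from this slide are enough for a back-of-the-envelope job estimate. The sketch below assumes perfect parallel speedup and proportional billing, neither of which the slide promises, and `estimate_job` is a name invented for this example.

```python
CPU_HOUR_USD = 1.0     # rate quoted on the slide
MAX_PROCESSORS = 10    # the slide allows 1-10 processors


def estimate_job(documents, seconds_per_document, processors):
    """Estimate CPU hours, wall-clock hours, and compute cost for a job.

    Assumes the work divides evenly across processors.
    """
    if not 1 <= processors <= MAX_PROCESSORS:
        raise ValueError("processors must be between 1 and 10")
    cpu_hours = documents * seconds_per_document / 3600
    return {
        "cpu_hours": cpu_hours,
        "wall_clock_hours": cpu_hours / processors,
        "cost_usd": cpu_hours * CPU_HOUR_USD,
    }
```

For instance, 3.6 million documents at 10 ms each need about 10 CPU hours: roughly one hour of wall-clock time on 10 processors and about 10 dollars of compute, before data-processing and download charges.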
    45. Run Job
         • Job runs at specified time
         • Code instances created on each node
         • Job output combined automatically
       Diagram: Collection → Compute Node #1 ... Compute Node N → Combine Results
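The run step follows a familiar split, process, and combine pattern: the collection is divided among nodes, each node runs the same code on its share, and the partial outputs are merged. The toy sketch below imitates that shape on one machine, with threads standing in for compute nodes and a word count standing in for the user's code; none of the names come from AWSP.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(documents, nodes):
    """Deal documents round-robin into one share per node."""
    return [documents[i::nodes] for i in range(nodes)]


def count_words(documents):
    """Per-node work: count lower-cased, whitespace-separated words."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def run_job(documents, nodes):
    """Run count_words on each share, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(count_words, partition(documents, nodes)))
    combined = Counter()
    for partial in partials:
        combined.update(partial)
    return combined
```

The combine step works here because word counts merge by addition; a job whose per-node outputs cannot be merged independently would not fit the "parallelizable application" requirement mentioned earlier in the deck.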
    47. Check Results
         • Monitor progress using portal
         • Final status email
         • Log files
         • Output
    49. Publishing Results
         • Store data to S3
         • Create a new index for AWIS use
         • Publish data for access via web search
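The deck mentions publishing results as an XML feed but does not show the feed's format. As a hedged illustration of what serializing job output might look like, the sketch below builds a small XML document with Python's standard library; the element names (`results`, `result`, `url`, `title`) are invented for this example and are not an AWSP or AWIS schema.

```python
import xml.etree.ElementTree as ET


def results_to_xml(results):
    """Serialize (url, title) pairs into an XML string.

    The element names are placeholders, not a published schema.
    """
    root = ET.Element("results")
    for url, title in results:
        item = ET.SubElement(root, "result")
        ET.SubElement(item, "url").text = url
        ET.SubElement(item, "title").text = title
    return ET.tostring(root, encoding="unicode")
```

Building the tree with ElementTree rather than string concatenation means characters such as `&` and `<` in titles are escaped automatically. The resulting string could then be stored to S3 or served from a web service.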
    50. Q & A
    52. For More Information
         • AWSP: websearch.alexa.com
         • Alexa Blog: awis.blogspot.com
         • AWS Blog: aws.typepad.com
         • Amazon Web Services: aws.amazon.com
       © 2006 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.