Building Better Search For Wikipedia: How We Did It Using Amazon CloudSearch - Webinar

In this webinar Paul Nelson, CTO and search guru at Search Technologies, covers how he implemented improved search capabilities for Wikipedia using Amazon CloudSearch, a fully-managed search service in the AWS cloud. See how Wikipedia search can now deliver a richer experience that includes faceted navigation, better and more relevant results, and an improved user interface. Topics include data acquisition and clean-up, indexing, handling queries, relevance ranking, and building the search user interface. For more information please see: http://aws.amazon.com/cloudsearch/

Published in: Technology, Business

Transcript of "Building Better Search For Wikipedia: How We Did It Using Amazon CloudSearch - Webinar"

  1. Building Better Search for Wikipedia: How We Did It Using Amazon CloudSearch
     July 26, 2012
     © 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. Speakers
     • Paul Nelson, CTO, Search Technologies
     • Michael Bohlig, Marketing Manager, Amazon CloudSearch
     • Jon Handler, Solutions Architect, Amazon CloudSearch
  3. Housekeeping items
     • Polling questions
     • Q&A will be at the end
     • Recording and slides will be distributed and posted (Slideshare & YouTube)
  4. Agenda
     • Amazon CloudSearch Overview
     • Data Acquisition – Getting the Files from Wikipedia
     • Data Processing – Clean-up and Preparation
     • Indexing
     • Queries and Relevancy Ranking
     • Building the UI
     • Final Results & Recommendations
     • Q&A
  5. Amazon CloudSearch
     • Fully-managed, full-featured search service
     • Automatically scales for data & traffic
     • Handles both structured and unstructured data
     • Near real-time indexing
     • Up and running in less than 1 hour
  6. Polling Question #1: What Are You Using For Search Today?
  7. Introduction: SEARCHING WIKIPEDIA
  8. Why Wikipedia?
     • It’s awesome
     • Default Wikipedia search is pretty bad & everyone knows it
     • It’s publicly available data
     • It’s awesome
  9. Why CloudSearch for Wikipedia?
     • It’s awesome
     • A great choice for a public search engine – it lives in the internet
     • First version up & running quickly
     • Automatically scales to required query volume
     • Rank expressions work great for Wikipedia relevancy
     • Easy Search Domain Creation = Easy system iteration
  10. Let’s try it!
      http://wikipedia.searchtechnologies.com
  11. Getting the Files from Wikipedia: DATA ACQUISITION
  12. Wikipedia Dump Files
      http://dumps.wikimedia.org/enwiki/latest/
      • Desired files have the pattern: enwiki-latest-pages-articles#.xml-*.bz2
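A minimal sketch of how the file pattern on this slide might be applied to a directory listing. It assumes that "#" stands for a part number and "*" for any suffix; the regular expression, the function name, and the sample file names are illustrative and do not come from the webinar.

```python
import re

# Assumed reading of "enwiki-latest-pages-articles#.xml-*.bz2":
# "#" is a part number, "*" is any non-space suffix.
DUMP_FILE_RE = re.compile(r"^enwiki-latest-pages-articles(\d+)\.xml-\S+\.bz2$")


def select_dump_files(names):
    """Return the matching file names, ordered by part number."""
    matched = []
    for name in names:
        m = DUMP_FILE_RE.match(name)
        if m:
            matched.append((int(m.group(1)), name))
    return [name for _, name in sorted(matched)]


# Hypothetical listing entries, for illustration only.
listing = [
    "enwiki-latest-pages-articles2.xml-p000010001p000025000.bz2",
    "enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2",
    "enwiki-latest-pages-articles.xml.bz2",
    "enwiki-latest-abstract.xml",
]
selected = select_dump_files(listing)
```

Only the two numbered part files are selected; the single combined dump and the unrelated file are skipped.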
  13. Our Solution
      [Diagram] Wikipedia dump files → Content Processing Framework (Fetch Files Listing → Identify Files to Fetch → Open File Stream → Article Processing → Send to CloudSearch) → Amazon CloudSearch
  14. Content Processing Framework Advantages
      • Process multiple files simultaneously
      • Fully Streaming
          - Files are never downloaded to local disk
          - From Wikipedia → Streaming Processor → CloudSearch
      • Very Fast (450 documents per second, end-to-end)
      • Integrated Connectors / Web Crawlers
          - SharePoint, Documentum, Web Sites, RDBMS, RightNow, Confluence, Salesforce.com, etc.
      • Text extraction (from PDF, Office Docs, etc.)
          - Using Apache Tika
      • Entity Extraction
          - Names, places, companies, dates, phone numbers, zip codes, etc.
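The "fully streaming" idea can be sketched with Python's standard bz2 module: compressed bytes arrive in chunks (for example from an HTTP response) and are decompressed incrementally, so nothing is written to local disk. This is an illustration of the technique only, not the framework used in the webinar; the function name is hypothetical.

```python
import bz2


def stream_decompress(chunks):
    """Yield decompressed bytes from an iterable of compressed byte chunks."""
    decompressor = bz2.BZ2Decompressor()
    for chunk in chunks:
        while chunk:
            data = decompressor.decompress(chunk)
            if data:
                yield data
            if decompressor.eof:
                # A new bz2 stream may follow; continue with the leftover bytes.
                chunk = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
            else:
                chunk = b""


# Simulate a network stream by slicing compressed data into small chunks.
original = b"<page><title>Germany</title></page>\n" * 200
compressed = bz2.compress(original)
chunks = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
restored = b"".join(stream_decompress(chunks))
```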
  15. Polling Question #2: Where is your data stored?
  16. Preparing the Data for Search: DATA PROCESSING
  17. What Do Wikipedia Files Look Like?
      [Figure: Sample Wikipedia Data]
  18. Data Processing: Basic Requirements
      • Decompression: BZip2 → UTF-8
      • Process each page as a separate CloudSearch document
          - Multiple pages specified in a single XML file
      • Skip #REDIRECT pages
      • Compute document statistics
          - Necessary for relevancy ranking
          - Includes: Content size, title size, number of outbound links
          - (FUTURE: Number of inbound links)
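The requirements above can be sketched as follows: read pages one at a time from a dump-style XML stream, skip #REDIRECT pages, and compute per-page statistics. The sample XML is a simplified stand-in for a real dump, and measuring sizes in characters of raw wikitext and counting [[...]] links are assumptions; the slides do not say how the statistics were computed.

```python
import io
import re
import xml.etree.ElementTree as ET

LINK_RE = re.compile(r"\[\[[^\]]+\]\]")  # assumed definition of an outbound link


def local(tag):
    """Strip any XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_stream):
    """Yield one statistics dict per non-redirect <page> element."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        elem.clear()  # release the page's children once processed
        if text.lstrip().upper().startswith("#REDIRECT"):
            continue
        yield {
            "title": title,
            "title_size": len(title),
            "content_size": len(text),
            "outbound_links": len(LINK_RE.findall(text)),
        }


SAMPLE = b"""<mediawiki>
  <page><title>Germany</title><revision>
    <text>Germany is a country in [[Europe]]. See [[Berlin]].</text>
  </revision></page>
  <page><title>Deutschland</title><revision>
    <text>#REDIRECT [[Germany]]</text>
  </revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```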
  19. Data Processing: Advanced Feature Support
      • Extract Categories
      • Extract Author (IP address or author name)
      • Extract Update Date
      • Extract Document Type
          - Wikipedia “name space” based on title prefix
      • Determine Disambiguation Pages
          - Based on certain Wikipedia {{templates}}
          - Template whitelist and blacklist
      • Produce Static Teaser
      [Figure: teaser before and after]
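Two of these extraction steps can be sketched briefly: pulling categories out of wikitext and deriving a document type from the title prefix. The regular expression, the list of known namespaces, and the "Article" default are assumptions made for illustration, not the rules used in the webinar.

```python
import re

# Assumed form of a category link: [[Category:Name]] or [[Category:Name|sort key]]
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)")

# Hypothetical subset of Wikipedia namespaces.
KNOWN_NAMESPACES = ("Wikipedia", "Category", "Template", "File", "Portal", "Help")


def extract_categories(wikitext):
    """Return category names found in the wikitext, in order of appearance."""
    return [name.strip() for name in CATEGORY_RE.findall(wikitext)]


def document_type(title):
    """Return the namespace prefix of the title, or "Article" if there is none."""
    prefix, sep, _rest = title.partition(":")
    if sep and prefix in KNOWN_NAMESPACES:
        return prefix
    return "Article"


text = "Some text. [[Category:Science fiction book series]] [[Category:Books|TTA]]"
categories = extract_categories(text)
```

Checking the prefix against a known list keeps ordinary titles that contain a colon, such as "2001: A Space Odyssey", classified as articles.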
  20. Sending Documents to CloudSearch: INDEXING
  21. CloudSearch: Document ID
      • @id = uniquely identifies every document in the index
          - Must be made up of letters and digits (no spaces or punctuation)

      <batch>
        <add lang="en" version="5438086" id="wikipedia930503">
          . . . FIELDS GO HERE . . .
        </add>
      </batch>
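A small sketch of generating and checking an ID of the shape shown above. Building the ID from a "wikipedia" prefix plus a numeric page ID is an assumption based on the sample value, and restricting letters to lowercase is a conservative choice rather than something stated on the slide.

```python
import re

# Conservative assumption: lowercase letters and digits only.
DOC_ID_RE = re.compile(r"[a-z0-9]+")


def make_doc_id(page_id, prefix="wikipedia"):
    """Build a document ID such as "wikipedia930503" and validate it."""
    doc_id = f"{prefix}{int(page_id)}"
    if not DOC_ID_RE.fullmatch(doc_id):
        raise ValueError(f"invalid document id: {doc_id!r}")
    return doc_id


example_id = make_doc_id(930503)
```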
  22. CloudSearch: Document Version
      • @version = identifies most recent document
          - Integer number, must always increase
          - Updates or deletes to same doc ID must have larger @version
          - My Formula: (System.currentTimeMillis() - 1325394000000)/1000
      • Why does it exist?
          - So that multiple processes can submit updates simultaneously
          - Updates processed quickly are not overwritten by older updates processed slowly

      <batch>
        <add lang="en" version="5438086" id="wikipedia930503">
          . . . FIELDS GO HERE . . .
        </add>
      </batch>
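The version formula on this slide translates directly to Python. The constant 1325394000000 is the one from the slide; it corresponds to 2012-01-01 05:00:00 UTC, so the version is the number of whole seconds elapsed since then. The function name is illustrative.

```python
import time

VERSION_EPOCH_MS = 1325394000000  # constant from the slide (2012-01-01 05:00:00 UTC)


def doc_version(now_ms=None):
    """Return whole seconds elapsed since VERSION_EPOCH_MS."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - VERSION_EPOCH_MS) // 1000


# A timestamp 5,438,086 seconds after the epoch yields the sample version.
sample_version = doc_version(VERSION_EPOCH_MS + 5438086 * 1000)
```

Because the value has one-second resolution, two updates to the same document within the same second would receive the same version.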
  23. CloudSearch Indexing Details
      • Form Fields into CloudSearch SDF
      • Submit in batches to CloudSearch
      • Multiple open connections to CloudSearch
      • Co-locate indexer on EC2 instance in same zone as CloudSearch
          - Several times better performance
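Batching and parallel submission might look like the sketch below. The limits (max_docs, max_bytes) are placeholder values, and send is a hypothetical callable standing in for whatever posts one batch to the document endpoint; neither comes from the webinar.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_documents(docs, max_docs=500, max_bytes=5_000_000):
    """Group serialized documents into batches bounded by count and size."""
    batch, size = [], 0
    for doc in docs:
        doc_size = len(doc.encode("utf-8"))
        if batch and (len(batch) >= max_docs or size + doc_size > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch


def submit_all(batches, send, connections=4):
    """Call send(batch) for every batch, using several worker threads."""
    with ThreadPoolExecutor(max_workers=connections) as pool:
        return list(pool.map(send, batches))


docs = ["x" * 10] * 5
batches = list(batch_documents(docs, max_docs=2))
# len() stands in for a real sender here, so the result is the batch sizes.
results = submit_all(batches, send=len)
```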
  24. CloudSearch SDF for Indexing

      <batch>
        <add lang="en" version="5438086" id="wikipedia930503">
          <field name="title">Terran Trade Authority</field>
          <field name="title_size">22</field>
          <field name="content">The Terran Trade Authority is a science-fiction setting originally presented in a collection of four large illustrated science… </field>
          <field name="content_size">893</field>
          <field name="teaser"> The Terran Trade Authority is a science-fiction setting originally presented in a collection of four large illustrated science fiction books published between 1978 and… </field>
          <field name="url">http://en.wikipedia.org/wiki/Terran_Trade_Authority</field>
          <field name="type">Article</field>
          <field name="f_type">Article</field>
          <field name="year">2012</field>
          <field name="f_year">2012</field>
          <field name="year_month">2012/01</field>
          <field name="f_year_month">2012/01</field>
          <field name="categories">Science fiction book series</field>
          <field name="f_categories">Science fiction book series</field>
          <field name="author">76.173.50.22</field>
          <field name="f_author">76.173.50.22</field>
        </add>
      </batch>
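A batch of this shape can be generated with Python's standard ElementTree module, which also takes care of XML escaping. The input structure (a dict with id, version, and a list of name/value pairs) is a hypothetical choice for this sketch; the element and attribute names follow the sample above.

```python
import xml.etree.ElementTree as ET


def build_sdf_batch(docs):
    """Serialize documents into a <batch> of <add> elements."""
    batch = ET.Element("batch")
    for doc in docs:
        add = ET.SubElement(
            batch,
            "add",
            {"lang": "en", "version": str(doc["version"]), "id": doc["id"]},
        )
        # A list of pairs allows a field name to repeat, e.g. several categories.
        for name, value in doc["fields"]:
            field = ET.SubElement(add, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(batch, encoding="unicode")


sdf = build_sdf_batch([
    {
        "id": "wikipedia930503",
        "version": 5438086,
        "fields": [
            ("title", "Terran Trade Authority"),
            ("title_size", 22),
            ("type", "Article"),
            ("categories", "Science fiction book series"),
        ],
    }
])
```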
  25. End-to-End Indexing Dataflow
      [Diagram]
      • Process Latest Listing Pipeline: Start → Fetch XHTML page (dumps.wikimedia.org/enwiki/latest/) → Extract URLs (Groovy Script) → 27 URLs to 27 dump files
      • Process File Pipeline: Open Stream URL (compressed data stream) → BZip2 Decompress (decompressed stream) → XML Sub Job Extractor → single <page> XML plus metadata
      • Process Page Pipeline: Extract Metadata and Cleanse Content (Groovy Script) → cleansed metadata → XSL Transform → Post XML → Amazon CloudSearch
  26. Providing good search results: QUERIES AND RELEVANCY
  27. Recommendation: Debug Interface
      • Useful tool for testing CloudSearch query behavior
      [Figure: Sample Debug Interface]
  28. Queries for Wikipedia
      • Uses simple “q” parameter for user query string
      • Selecting facets uses “bq” parameter
          - For filtering a facet value: bq=(field name 'value')
          - For excluding a facet value: bq=(not name: 'value')
          - Can handle AND & OR
          - Don’t forget to escape single-quotes
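A sketch of assembling the "q" and "bq" parameters described above. The filter and exclusion forms follow the slide; escaping single quotes (and backslashes) with a backslash and combining several clauses with "(and ...)" are assumptions about the query syntax and should be checked against the CloudSearch documentation. All function names are illustrative.

```python
from urllib.parse import parse_qs, urlencode


def quote_value(value):
    """Single-quote a value, backslash-escaping quotes inside it (assumed rule)."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def facet_filter(field, value):
    """Clause that keeps documents with this facet value."""
    return f"(field {field} {quote_value(value)})"


def facet_exclude(field, value):
    """Clause that drops documents with this facet value."""
    return f"(not {field}:{quote_value(value)})"


def build_query_string(user_query, clauses=()):
    """Return a URL-encoded query string with q and, if needed, bq."""
    params = {"q": user_query}
    if len(clauses) == 1:
        params["bq"] = clauses[0]
    elif clauses:
        params["bq"] = "(and " + " ".join(clauses) + ")"
    return urlencode(params)


query_string = build_query_string(
    "germany",
    [facet_filter("f_type", "Article"), facet_exclude("f_author", "76.173.50.22")],
)
decoded = parse_qs(query_string)
```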
  29. Relevancy Ranking
      • In CloudSearch, this is done with Rank Expressions
          - Affect relevancy using document-quality data, such as:
              - Document Statistics
              - Ratings
              - Link Counting
              - Editorial Comments
              - Popularity
      • Expressions are very flexible
          - All types of mathematical functions available
  30. Relevancy Ranking for Wikipedia

      Title                                  Content size  clog   cboost  Title size  tlog   tboost   Text relevance  FINAL
      Germany                                65253         4.815  192.58  7           0.845  -12.676  572             751.90
      Outline of Germany                     14238         4.153  166.14  18          1.255  -18.829  601             748.30
      History of Germany                     74750         4.874  194.94  30          1.477  -22.157  574             746.78
      British Army Germany rugby union team  2201          3.343  133.70  37          1.568  -23.523  589             699.18
      New Germany                            337           2.528  101.11  11          1.041  -15.621  598             683.48
      Embassy of Germany in Moscow           516           2.713  108.51  28          1.447  -21.707  596             682.79

      RANK_EXPRESSION = text_relevance + log10(content_size)*40.0 - log10(title_size)*15.0
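The rank expression can be checked against the table by evaluating it in Python. The input values below are copied from the first three rows of the slide; the function name is illustrative.

```python
import math


def rank(text_relevance, content_size, title_size):
    """Evaluate the slide's rank expression for one document."""
    content_boost = math.log10(content_size) * 40.0
    title_penalty = math.log10(title_size) * 15.0
    return text_relevance + content_boost - title_penalty


# (title, content size, title size, text relevance) from the slide's table
rows = [
    ("Germany", 65253, 7, 572),
    ("Outline of Germany", 14238, 18, 601),
    ("History of Germany", 74750, 30, 574),
]
scores = {title: rank(rel, csize, tsize) for title, csize, tsize, rel in rows}
ordered = sorted(scores, key=scores.get, reverse=True)
```

The content term rewards longer articles and the title term penalizes longer titles, which is why "Germany" outranks documents with a higher raw text relevance.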
  31. Relevancy Ranking for Wikipedia: De-Weighting “Wikipedia:” Types
      • “Wikipedia:” docs not of general interest
          - About the running and managing of Wikipedia
      • Often very large
          - Skews the statistics

      RANK_EXPRESSION (adjusted) = text_relevance + log10(content_size) * ( doc_boost == 1 ? 25.0 : 40.0 ) - log10(title_size)*15.0
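The effect of the adjusted expression can be illustrated with made-up numbers. This sketch assumes that doc_boost == 1 marks a "Wikipedia:" document, which the slide implies but does not state; the sample values are hypothetical.

```python
import math


def adjusted_rank(text_relevance, content_size, title_size, doc_boost):
    """Evaluate the adjusted rank expression from the slide."""
    content_weight = 25.0 if doc_boost == 1 else 40.0
    return (
        text_relevance
        + math.log10(content_size) * content_weight
        - math.log10(title_size) * 15.0
    )


# The same hypothetical document scored as a regular and as a de-weighted page.
regular = adjusted_rank(500, 100000, 10, doc_boost=0)
deweighted = adjusted_rank(500, 100000, 10, doc_boost=1)
```

With a content size of 100,000 the lower weight removes 75 points, so a very large "Wikipedia:" page gains less from its size.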
  32. Adding the Sizzle: BUILDING THE USER INTERFACE
  33. Wikipedia Search UI Architecture
      [Diagram] Tomcat → Twigkit → CloudSearch Platform → CloudSearch Java API → CloudSearch
  34. UI Architecture
      • Tomcat
          - Java application container
      • Twigkit
          - Graphical user interface templates
          - Handles navigators, controller events, presentation
      • CloudSearch Platform
          - API Translation Interface between Twigkit and CloudSearch API
      • CloudSearch Java API
          - Manages all communications to/from CloudSearch
          - Parameter construction / results parsing
  35. Let’s Wrap It Up! SUMMARY
  36. Summary – Problems & Solutions
      • Problem: Data Acquisition
          - Solution: Content Processing Framework (Aspire)
      • Problem: Data Processing
          - Solution: Content Processing Framework (Aspire)
      • Problem: Indexing
          - Solution: CloudSearch SDF – Very easy to work with
      • Problem: Query
          - Solution: CloudSearch Query Parameters & Rank Expressions
      • Problem: User Interface
          - Solution: New CloudSearch Platform for Twigkit
  37. Q&A
      Enter questions on your screen
  38. Thank You
      For More Information:
      http://aws.amazon.com/cloudsearch/
      http://www.searchtechnologies.com/wikipedia-cloudsearch.html
