Building Better Search For Wikipedia: How We Did It Using Amazon CloudSearch - Webinar
In this webinar Paul Nelson, CTO and search guru at Search Technologies, covers how he implemented improved search capabilities for Wikipedia using Amazon CloudSearch, a fully-managed search service in the AWS cloud. See how Wikipedia search can now deliver a richer experience that includes faceted navigation, better and more relevant results, and an improved user interface. Topics include data acquisition and clean-up, indexing, handling queries, relevance ranking, and building the search user interface. For more information please see: http://aws.amazon.com/cloudsearch/

Presentation Transcript

  • Building Better Search for Wikipedia: How We Did It Using Amazon CloudSearch. July 26, 2012. © 2012 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified or distributed in whole or in part without the express consent of Amazon.com, Inc.
  • Speakers
    • Paul Nelson, CTO, Search Technologies
    • Michael Bohlig, Marketing Manager, Amazon CloudSearch
    • Jon Handler, Solutions Architect, Amazon CloudSearch
  • Housekeeping items
    • Polling questions
    • Q&A will be at the end
    • Recording and slides will be distributed and posted (SlideShare & YouTube)
  • Agenda
    • Amazon CloudSearch Overview
    • Data Acquisition: Getting the Files from Wikipedia
    • Data Processing: Clean-up and Preparation
    • Indexing
    • Queries and Relevancy Ranking
    • Building the UI
    • Final Results & Recommendations
    • Q&A
  • Amazon CloudSearch
    • Fully managed, full-featured search service
    • Automatically scales for data & traffic
    • Handles both structured and unstructured data
    • Near real-time indexing
    • Up and running in less than 1 hour
  • Polling Question #1: What Are You Using For Search Today?
  • SEARCHING WIKIPEDIA: Introduction
  • Why Wikipedia?
    • It’s awesome
    • Default Wikipedia search is pretty bad & everyone knows it
    • It’s publicly available data
    • It’s awesome
  • Why CloudSearch for Wikipedia?
    • It’s awesome
    • A great choice for a public search engine: it lives on the Internet
    • First version up & running quickly
    • Automatically scales to required query volume
    • Rank expressions work great for Wikipedia relevancy
    • Easy Search Domain Creation = Easy system iteration
  • Let’s try it! http://wikipedia.searchtechnologies.com
  • DATA ACQUISITION: Getting the Files from Wikipedia
  • Wikipedia Dump Files
    • http://dumps.wikimedia.org/enwiki/latest/
    • Desired files have the pattern: enwiki-latest-pages-articles#.xml-*.bz2
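The file-name pattern on the slide above can be turned into a filter over a directory listing. This is a minimal sketch, not the presenters' code: it assumes that "#" stands for a chunk number and "*" for any suffix, and the file names in the sample listing are made up for illustration.

```python
import re

# Assumption: "#" in the slide's pattern is one or more digits (a chunk number)
# and "*" is an arbitrary suffix such as a page-ID range.
DUMP_FILE_RE = re.compile(r"^enwiki-latest-pages-articles\d+\.xml-.*\.bz2$")


def select_dump_files(names):
    """Return only the file names that match the article-chunk pattern."""
    return [name for name in names if DUMP_FILE_RE.match(name)]


# Illustrative listing; these names are invented for the example.
listing = [
    "enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2",
    "enwiki-latest-pages-articles.xml.bz2",
    "enwiki-latest-abstract.xml",
]
selected = select_dump_files(listing)
```

Only the first name is selected; the other two lack either the chunk number or the .bz2 chunk suffix.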
  • Our Solution (diagram): Wikipedia dump files → Content Processing Framework (Fetch Files Listing → Identify Files to Fetch → Open File Stream → Article Processing → Send to CloudSearch) → Amazon CloudSearch
  • Content Processing Framework Advantages
    • Process multiple files simultaneously
    • Fully Streaming
      • Files are never downloaded to local disk
      • From Wikipedia → Streaming Processor → CloudSearch
    • Very Fast (450 documents per second, end-to-end)
    • Integrated Connectors / Web Crawlers
      • SharePoint, Documentum, Web Sites, RDBMS, RightNow, Confluence, Salesforce.com, etc.
    • Text extraction (from PDF, Office Docs, etc.)
      • Using Apache Tika
    • Entity Extraction
      • Names, places, companies, dates, phone numbers, zip codes, etc.
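The "fully streaming" idea, where compressed data is decompressed as it arrives and never written to local disk, can be sketched with Python's standard bz2 module. This only illustrates the technique and is not the Content Processing Framework itself. The helper names are hypothetical, and an in-memory buffer stands in for the network response.

```python
import bz2
import io


def read_in_chunks(fileobj, size=64 * 1024):
    """Read a file-like object (for example an HTTP response) chunk by chunk."""
    while True:
        chunk = fileobj.read(size)
        if not chunk:
            break
        yield chunk


def stream_decompress(chunks):
    """Yield decompressed bytes from an iterable of BZip2-compressed chunks."""
    decompressor = bz2.BZ2Decompressor()
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data


# Stand-in for a remote dump file: compress some sample XML in memory.
original = b"<page><title>Germany</title></page>" * 1000
source = io.BytesIO(bz2.compress(original))

# Nothing is written to disk; each chunk is decompressed as it is read.
output = b"".join(stream_decompress(read_in_chunks(source, size=128)))
```

In a real pipeline the decompressed chunks would be fed to an incremental XML parser instead of being joined into one byte string.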
  • Polling Question #2: Where is your data stored?
  • DATA PROCESSING: Preparing the Data for Search
  • What Do Wikipedia Files Look Like? (figure: sample Wikipedia data)
  • Data Processing: Basic Requirements
    • Decompression: BZip2 → UTF-8
    • Process each page as a separate CloudSearch document
      • Multiple pages specified in a single XML file
    • Skip #REDIRECT pages
    • Compute document statistics
      • Necessary for relevancy ranking
      • Includes: Content size, title size, number of outbound links
      • (FUTURE: Number of inbound links)
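The basic requirements above (one document per page, skip #REDIRECT pages, compute statistics) can be sketched as follows. Several things here are assumptions rather than details from the talk: sizes are counted in characters, outbound links are counted as [[...]] wiki links, and the sample XML is a simplified stand-in for a real dump.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumption: an outbound link is any [[...]] wiki link in the page text.
LINK_RE = re.compile(r"\[\[[^\]]+\]\]")


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_documents(xml_stream):
    """Yield one dict per <page>, skipping redirect pages."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {local(child.tag): (child.text or "") for child in elem.iter()}
        elem.clear()  # release the page's children once they have been read
        title, text = fields.get("title", ""), fields.get("text", "")
        if text.lstrip().upper().startswith("#REDIRECT"):
            continue
        yield {
            "title": title,
            "content": text,
            "title_size": len(title),
            "content_size": len(text),
            "outbound_links": len(LINK_RE.findall(text)),
        }


# Simplified sample; real dumps carry more elements per page.
SAMPLE = b"""<mediawiki>
  <page><title>Germany</title><revision><text>Germany is a country in [[Europe]]. See [[Berlin]].</text></revision></page>
  <page><title>Deutschland</title><revision><text>#REDIRECT [[Germany]]</text></revision></page>
</mediawiki>"""

docs = list(iter_documents(io.BytesIO(SAMPLE)))
```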
  • Data Processing: Advanced Feature Support
    • Extract Categories
    • Extract Author (IP address or author name)
    • Extract Update Date
    • Extract Document Type
      • Wikipedia “name space” based on title prefix
    • Determine Disambiguation Pages
      • Based on certain Wikipedia {{templates}}
      • Template whitelist and blacklist
    • Produce Static Teaser (figure: before and after)
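Two of the features above, category extraction and a document type derived from the title prefix, can be sketched with plain string handling. The talk does not give the actual rules, so the regular expression and the prefix list below are assumptions. The value "Article" for ordinary pages follows the sample document shown later in the deck.

```python
import re

# Assumption: categories appear in the wiki text as [[Category:Name]] or
# [[Category:Name|sort key]].
CATEGORY_RE = re.compile(r"\[\[Category:([^\]|]+)")

# Hypothetical subset of namespace prefixes; the real list is not in the talk.
NAMESPACE_PREFIXES = ("Wikipedia", "Category", "Template", "Portal", "Help", "File")


def extract_categories(wikitext):
    """Return the category names referenced in a page's wiki text."""
    return [name.strip() for name in CATEGORY_RE.findall(wikitext)]


def document_type(title):
    """Return the namespace prefix of a title, or 'Article' if there is none."""
    prefix, sep, _rest = title.partition(":")
    if sep and prefix in NAMESPACE_PREFIXES:
        return prefix
    return "Article"


sample_text = (
    "The Terran Trade Authority is a science-fiction setting. "
    "[[Category:Science fiction book series]] [[Category:Art books|Terran]]"
)
categories = extract_categories(sample_text)
```

Checking the prefix against a known list keeps titles that merely contain a colon, such as "2001: A Space Odyssey", classified as articles.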
  • INDEXING: Sending Documents to CloudSearch
  • CloudSearch: Document ID
    • @id = uniquely identifies every document in the index
      • Must be made up of letters and digits (no spaces or punctuation)
        <batch>
          <add lang="en" version="5438086" id="wikipedia930503">
            . . . FIELDS GO HERE . . .
          </add>
        </batch>
  • CloudSearch: Document Version
    • @version = identifies most recent document
      • Integer number, must always increase
      • Updates or deletes to same doc ID must have larger @version
      • My formula: (System.currentTimeMillis() - 1325394000000)/1000
    • Why does it exist?
      • So that multiple processes can submit updates simultaneously
      • Updates processed quickly are not overwritten by older updates processed slowly
        <batch>
          <add lang="en" version="5438086" id="wikipedia930503">
            . . . FIELDS GO HERE . . .
          </add>
        </batch>
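The version formula on the slide is a Java expression. The sketch below restates it in Python and adds a check for the document ID rule from the previous slide. The function names and the "wikipedia" prefix argument are illustrative, and the ID check encodes only what the slide states (letters and digits).

```python
import re
import time

# Constant from the slide, in milliseconds since the Unix epoch.
# It corresponds to 2012-01-01 05:00:00 UTC.
VERSION_EPOCH_MS = 1325394000000


def document_version(now_ms=None):
    """Seconds elapsed since the fixed starting point; grows with time."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - VERSION_EPOCH_MS) // 1000


def document_id(prefix, page_id):
    """Build an ID and reject anything other than letters and digits."""
    doc_id = f"{prefix}{page_id}"
    if not re.fullmatch(r"[A-Za-z0-9]+", doc_id):
        raise ValueError(f"invalid document id: {doc_id!r}")
    return doc_id


# Reproduce the version number used in the slide's sample batch.
sample_version = document_version(VERSION_EPOCH_MS + 5438086 * 1000)
sample_id = document_id("wikipedia", 930503)
```

Because the version is derived from the clock, an update submitted later always carries a larger number, as long as the clocks of the submitting processes agree to within a second.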
  • CloudSearch Indexing Details
    • Form fields into CloudSearch SDF
    • Submit in batches to CloudSearch
    • Multiple open connections to CloudSearch
    • Co-locate indexer on EC2 instance in same zone as CloudSearch
      • Several times better performance
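Batching and parallel submission, as listed above, could be organized roughly like this. The size limit and the send callable are placeholders: the talk does not state a batch size, and the actual upload call is left out, so this shows only the grouping and fan-out logic.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(documents, max_bytes):
    """Group serialized documents so each batch stays within max_bytes."""
    batch, size = [], 0
    for doc in documents:
        doc_size = len(doc.encode("utf-8"))
        if batch and size + doc_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_size
    if batch:
        yield batch


def submit_all(batches, send, connections=4):
    """Call send(batch) for every batch using several worker threads.

    `send` is a hypothetical callable that would post one batch.
    """
    with ThreadPoolExecutor(max_workers=connections) as pool:
        return list(pool.map(send, batches))


# Five 40-byte documents with a 100-byte limit give batches of 2, 2 and 1.
documents = ["a" * 40] * 5
batches = list(make_batches(documents, max_bytes=100))
results = submit_all(batches, send=len)  # len stands in for a real upload
```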
  • CloudSearch SDF for Indexing
        <batch>
          <add lang="en" version="5438086" id="wikipedia930503">
            <field name="title">Terran Trade Authority</field>
            <field name="title_size">22</field>
            <field name="content">The Terran Trade Authority is a science-fiction setting originally presented in a collection of four large illustrated science… </field>
            <field name="content_size">893</field>
            <field name="teaser"> The Terran Trade Authority is a science-fiction setting originally presented in a collection of four large illustrated science fiction books published between 1978 and… </field>
            <field name="url">http://en.wikipedia.org/wiki/Terran_Trade_Authority</field>
            <field name="type">Article</field>
            <field name="f_type">Article</field>
            <field name="year">2012</field>
            <field name="f_year">2012</field>
            <field name="year_month">2012/01</field>
            <field name="f_year_month">2012/01</field>
            <field name="categories">Science fiction book series</field>
            <field name="f_categories">Science fiction book series</field>
            <field name="author">76.173.50.22</field>
            <field name="f_author">76.173.50.22</field>
          </add>
        </batch>
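A batch in the shape shown above can be generated with a standard XML library, which also takes care of escaping. The sketch below mirrors the element and attribute names from the slide. The input dictionary layout is an assumption, as is writing a multi-valued field as repeated field elements.

```python
import xml.etree.ElementTree as ET


def build_sdf_batch(docs):
    """Serialize documents into a <batch> of <add> elements, as on the slide."""
    batch = ET.Element("batch")
    for doc in docs:
        add = ET.SubElement(
            batch,
            "add",
            {"lang": "en", "version": str(doc["version"]), "id": doc["id"]},
        )
        for name, value in doc["fields"].items():
            # Assumption: a list value becomes one <field> element per item.
            values = value if isinstance(value, list) else [value]
            for item in values:
                field = ET.SubElement(add, "field", {"name": name})
                field.text = str(item)
    return ET.tostring(batch, encoding="unicode")


sdf = build_sdf_batch([
    {
        "id": "wikipedia930503",
        "version": 5438086,
        "fields": {
            "title": "Terran Trade Authority",
            "title_size": 22,
            "type": "Article",
            "f_type": "Article",
            "categories": ["Science fiction book series"],
        },
    }
])
```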
  • End-to-End Indexing Dataflow (diagram)
    • Process Latest Listing Pipeline: Start → Fetch dumps.wikimedia.org/enwiki/latest/ (XHTML page) → Extract URLs (Groovy script) → 27 URLs to 27 dump files
    • Process File Pipeline: Open URL stream (compressed data stream) → BZip2 decompress (decompressed stream) → XML sub-job extractor → single <page> XML plus metadata
    • Process Page Pipeline: Extract metadata and cleanse content (Groovy script) → cleansed metadata → XSL transform → Post XML → Amazon CloudSearch
  • QUERIES AND RELEVANCY: Providing good search results
  • Recommendation: Debug Interface
    • Useful tool for testing CloudSearch query behavior (figure: sample debug interface)
  • Queries for Wikipedia
    • Uses simple “q” parameter for user query string
    • Selecting facets uses “bq” parameter
      • For filtering a facet value: bq=(field name 'value')
      • For excluding a facet value: bq=(not name: 'value')
      • Can handle AND & OR
      • Don’t forget to escape single-quotes
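The "q" and "bq" parameters described above can be assembled into a query string as sketched below. The filter and exclusion forms follow the slide. The backslash escaping of single quotes and the prefix-style (and ...) grouping are assumptions about the query syntax. No endpoint is shown because the talk does not give one.

```python
from urllib.parse import parse_qs, urlencode


def escape_value(value):
    """Escape a facet value (assumed backslash escaping for quotes)."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def facet_filter(field, value, exclude=False):
    """Build one bq clause in the forms given on the slide."""
    escaped = escape_value(value)
    if exclude:
        return f"(not {field}:'{escaped}')"
    return f"(field {field} '{escaped}')"


def build_query(user_query, clauses=()):
    """Return the URL query string for a user query plus facet clauses."""
    params = {"q": user_query}
    if len(clauses) == 1:
        params["bq"] = clauses[0]
    elif len(clauses) > 1:
        # Assumption: several clauses are combined with a prefix "and".
        params["bq"] = "(and " + " ".join(clauses) + ")"
    return urlencode(params)


query_string = build_query(
    "germany",
    [facet_filter("f_type", "Article"), facet_filter("f_year", "2012", exclude=True)],
)
```

urlencode takes care of percent-encoding the parentheses, quotes and spaces, so the clauses can be written in readable form.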
  • Relevancy Ranking
    • In CloudSearch, this is done with Rank Expressions
      • Affect relevancy using document-quality data, such as:
        • Document Statistics
        • Ratings
        • Link Counting
        • Editorial Comments
        • Popularity
    • Expressions are very flexible
      • All types of mathematical functions available
  • Relevancy Ranking for Wikipedia
      title                                   content size   clog    cboost   title size   tlog    tboost    text relevance   FINAL
      Germany                                 65253          4.815   192.58   7            0.845   -12.676   572              751.90
      Outline of Germany                      14238          4.153   166.14   18           1.255   -18.829   601              748.30
      History of Germany                      74750          4.874   194.94   30           1.477   -22.157   574              746.78
      British Army Germany rugby union team   2201           3.343   133.70   37           1.568   -23.523   589              699.18
      New Germany                             337            2.528   101.11   11           1.041   -15.621   598              683.48
      Embassy of Germany in Moscow            516            2.713   108.51   28           1.447   -21.707   596              682.79
    • Column key: clog = log10(content size), cboost = clog × 40; tlog = log10(title size), tboost = -tlog × 15; FINAL = text relevance + cboost + tboost
    • RANK_EXPRESSION = text_relevance + log10(content_size)*40.0 - log10(title_size)*15.0
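The rank expression above can be checked against the table by evaluating it directly. This sketch re-implements the arithmetic outside CloudSearch; the input values are taken from the slide's table.

```python
import math


def rank(text_relevance, content_size, title_size):
    """Evaluate the slide's rank expression for one document."""
    return (
        text_relevance
        + math.log10(content_size) * 40.0
        - math.log10(title_size) * 15.0
    )


# (title, content size, title size, text relevance) as listed in the table.
rows = [
    ("Germany", 65253, 7, 572),
    ("Outline of Germany", 14238, 18, 601),
    ("History of Germany", 74750, 30, 574),
    ("British Army Germany rugby union team", 2201, 37, 589),
    ("New Germany", 337, 11, 598),
    ("Embassy of Germany in Moscow", 516, 28, 596),
]

scores = [(title, rank(rel, csize, tsize)) for title, csize, tsize, rel in rows]
ranked_titles = [title for title, _ in sorted(scores, key=lambda s: s[1], reverse=True)]
```

The long, short-titled "Germany" article comes out on top even though its text relevance is the lowest of the six, which is the effect the content and title terms are meant to produce.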
  • Relevancy Ranking for Wikipedia: De-Weighting “Wikipedia:” Types
    • “Wikipedia:” docs not of general interest
      • About the running and managing of Wikipedia
    • Often very large
      • Skews the statistics
    • RANK_EXPRESSION (adjusted) = text_relevance + log10(content_size) * ( doc_boost == 1 ? 25.0 : 40.0 ) - log10(title_size)*15.0
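The adjusted expression lowers the content-size weight from 40 to 25 when doc_boost equals 1. The sketch below evaluates it for the same document with and without the flag. It assumes doc_boost is set to 1 for "Wikipedia:" pages, which the slide implies but does not spell out, and the sample numbers are illustrative.

```python
import math


def adjusted_rank(text_relevance, content_size, title_size, doc_boost):
    """Evaluate the adjusted rank expression from the slide."""
    content_weight = 25.0 if doc_boost == 1 else 40.0
    return (
        text_relevance
        + math.log10(content_size) * content_weight
        - math.log10(title_size) * 15.0
    )


# Same statistics, once as a regular page and once flagged with doc_boost == 1.
regular = adjusted_rank(572, 65253, 7, doc_boost=0)
deweighted = adjusted_rank(572, 65253, 7, doc_boost=1)
penalty = regular - deweighted  # equals 15 * log10(content_size)
```

The penalty grows with document size, so very large "Wikipedia:" pages lose the most.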
  • BUILDING THE USER INTERFACE: Adding the Sizzle
  • Wikipedia Search UI Architecture (diagram): Tomcat hosting Twigkit → CloudSearch Platform → CloudSearch Java API → CloudSearch
  • UI Architecture
    • Tomcat
      • Java application container
    • Twigkit
      • Graphical user interface templates
      • Handles navigators, controller events, presentation
    • CloudSearch Platform
      • API translation interface between Twigkit and the CloudSearch API
    • CloudSearch Java API
      • Manages all communications to/from CloudSearch
      • Parameter construction / results parsing
  • SUMMARY: Let’s Wrap It Up!
  • Summary: Problems & Solutions
    • Problem: Data Acquisition
      • Solution: Content Processing Framework (Aspire)
    • Problem: Data Processing
      • Solution: Content Processing Framework (Aspire)
    • Problem: Indexing
      • Solution: CloudSearch SDF (very easy to work with)
    • Problem: Query
      • Solution: CloudSearch Query Parameters & Rank Expressions
    • Problem: User Interface
      • Solution: New CloudSearch Platform for Twigkit
  • Q&A: Enter questions on your screen
  • Thank You. For More Information:
    • http://aws.amazon.com/cloudsearch/
    • http://www.searchtechnologies.com/wikipedia-cloudsearch.html