1
Searching Wikipedia with Amazon CloudSearch
2
Agenda
• Project Background
• High-level Architecture
• Summary & Observations
3
Project Background
• Amazon contracted with Search Technologies to help with beta-testing prior to the launch of Amazon CloudSearch
• We chose Wikipedia as a convenient data set for testing purposes
4
High-level Architecture
5
Indexing
• Wikipedia provides content in a series of large XML files
• Amazon CloudSearch ingests XML in a specified form
• Various content processing tasks to perform:
  • Splitting into individual documents
  • Date normalization
  • Metadata extraction & mapping
  • Cleanup, etc.
• We used Aspire for these tasks
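
The project used Aspire for the tasks listed above; the following Python sketch is not Aspire and only illustrates the same idea on a small scale. It streams a MediaWiki-style dump, splits it into one document per `<page>` element, and normalizes the revision timestamp to integer epoch seconds. The function names and the choice of epoch seconds as the target date form are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Elements copied into each output document (an illustrative choice).
WANTED = ("title", "id", "timestamp", "text")


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def to_epoch_seconds(timestamp):
    """Normalize a dump timestamp such as 2012-01-01T00:00:00Z to epoch seconds."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def iter_documents(stream):
    """Yield one dict per <page>, parsing incrementally instead of loading the file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        doc = {}
        for node in elem.iter():
            name = local_name(node.tag)
            # Keep the first occurrence only, so the page id wins over the revision id.
            if name in WANTED and name not in doc:
                doc[name] = (node.text or "").strip()
        if "timestamp" in doc:
            doc["timestamp"] = to_epoch_seconds(doc["timestamp"])
        yield doc
        elem.clear()  # release the page's children once the document is emitted
```

A caller would pass an open dump file (or a decompressing stream) to `iter_documents` and forward each dict to the next processing step.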
6
Aspire in Brief
• Based on Apache Felix / OSGi
• Thread-safe, multi-threaded, distributable
• Any number of pipelines, conditional branching
• Plug-in components individually testable & upgradable
• In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch, GSA.
• Tested with Elasticsearch and SP 2013
7
XML Input
8
Indexing
• Streaming Wikipedia dump files directly into CloudSearch
• 500 docs/second achieved without much effort
  • Using 4 x XL instances of CloudSearch
  • 1 x XL EC2 instance for Aspire
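
Streaming documents at this rate generally means grouping them into upload batches rather than sending them one at a time. The sketch below is an assumption-laden illustration, not the project's code: the `<batch>`/`<add>`/`<field>` layout follows the general shape of an XML document batch, while the exact attribute set, the `MAX_BATCH_BYTES` limit, and the function names are hypothetical and should be checked against the CloudSearch documentation.

```python
import xml.etree.ElementTree as ET

MAX_BATCH_BYTES = 5 * 1024 * 1024  # assumed upper bound per upload, for illustration
WRAPPER_BYTES = len("<batch></batch>")


def make_add(doc, version=1):
    """Serialize one document as an <add> element (assumed layout)."""
    attrs = {"id": doc["id"], "version": str(version), "lang": "en"}
    add = ET.Element("add", attrs)
    for name, value in doc["fields"].items():
        field = ET.SubElement(add, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def batches(docs, max_bytes=MAX_BATCH_BYTES):
    """Group documents into XML batches that stay under max_bytes where possible."""
    current, size = [], WRAPPER_BYTES
    for doc in docs:
        piece = make_add(doc)
        piece_size = len(piece.encode("utf-8"))
        # Close the current batch before it would exceed the limit.
        if current and size + piece_size > max_bytes:
            yield "<batch>" + "".join(current) + "</batch>"
            current, size = [], WRAPPER_BYTES
        current.append(piece)
        size += piece_size
    if current:
        yield "<batch>" + "".join(current) + "</batch>"
```

Because `batches` is a generator, it can sit directly behind a streaming dump reader, so no more than one batch is held in memory at a time.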
9
Searching
• Amazon CloudSearch provides a RESTful/XML interface for search purposes
• For the Wikipedia project, we needed a UI
• Chose to use Twigkit
• Wrote a Java API for CloudSearch
• The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
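
Since the search interface is RESTful, a query is in essence an HTTP GET with the search terms encoded in the URL. The sketch below shows only that URL-building step in Python; it is not the Java API mentioned above. The endpoint, the `/search` path, and the parameter names (`q`, `size`, `start`, `return-fields`) are assumptions for illustration and should be checked against the CloudSearch documentation.

```python
from urllib.parse import urlencode


def build_search_url(endpoint, query, size=10, start=0, return_fields=("title",)):
    """Build a search request URL; all parameter names here are assumed."""
    params = {
        "q": query,
        "size": size,
        "start": start,
        "return-fields": ",".join(return_fields),
    }
    # urlencode escapes spaces, commas and other reserved characters.
    return endpoint.rstrip("/") + "/search?" + urlencode(params)


# Hypothetical endpoint, for illustration only.
url = build_search_url("http://search-example.invalid", "amazon cloudsearch", size=25)
print(url)
```

The response would then be fetched with any HTTP client and parsed as XML.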
10
Searching
• Supports navigators and relevancy customization
  • E.g. a “PageRank” style link analysis was performed
• Limits are set high, e.g. retrieve 500,000 results in a single list, delivered in just a few seconds
  • Very useful for analysis applications
• So, what does it look like?
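
As an aside on the “PageRank” style link analysis mentioned above: the slides do not say how it was computed, so the following is only a generic power-iteration sketch over a toy link graph. The damping factor, iteration count, and function name are conventional or hypothetical choices, not details from the project.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy graph: A and B both link to C, and C links back to A.
scores = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
print(sorted(scores, key=scores.get, reverse=True))
```

A score computed this way could then be stored as a numeric field on each document and used in a ranking expression.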
11
wikipedia.searchtechnologies.com
12
wikipedia.searchtechnologies.com
13
Summary & Observations
• A capable and scalable “raw” engine
  • XML in, RESTful/XML out
• Easy to set up, much the same as an EC2 instance
• Elastic scalability
14
Summary & Observations
• Cost effective
  • From $75 per month, including management / maintenance
• Extremely convenient
  • Switch on / off at leisure
  • Promotes experimentation & agility
15

Wikipedia Cloud Search Webinar


Editor's Notes

  • #7 For further information about Aspire, see http://www.searchtechnologies.com/aspire.html
  • #10 The Java API for Amazon CloudSearch can be downloaded from http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html