Wikipedia Cloud Search Webinar
 


View this webinar, presented by Search Technologies' Chief Architect Paul Nelson, on cloud search and a Wikipedia use case. The webinar was given in conjunction with Amazon CloudSearch. Search Technologies provides implementation and consulting services for Amazon CloudSearch. For further information, see http://www.searchtechnologies.com/amazon-cloudsearch-services.html

http://www.searchtechnologies.com/

Usage Rights

© All Rights Reserved

  • The Wikipedia search experience can be found at http://wikipedia.searchtechnologies.com
  • For further information about Aspire, see http://www.searchtechnologies.com/aspire.html
  • The Java API for Amazon CloudSearch can be downloaded from http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html

Wikipedia Cloud Search Webinar Presentation Transcript

  • 1 Searching Wikipedia with Amazon CloudSearch
  • 2 Agenda • Project Background • High-level Architecture • Summary & Observations
  • 3 Project Background • Amazon contracted with Search Technologies to help with beta-testing, prior to the launch of Amazon CloudSearch • Decision to use Wikipedia as a convenient data set for testing purposes
  • 4 High-level Architecture
  • 5 Indexing • Wikipedia provides content in a series of large XML files • Amazon CloudSearch ingests XML in a specified form • Various content processing tasks to perform • Splitting into individual documents • Date normalization • Metadata extraction & mapping • Cleanup, etc. • We used Aspire for these tasks
  • 6 Aspire in Brief • Based on Apache Felix / OSGi • Thread-safe, multi-threaded, distributable • Any number of pipelines, conditional branching • Plug-in components individually testable & upgradable • In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch, GSA • Tested with Elasticsearch and SP 2013
  • 7 XML Input
  • 8 Indexing • Streaming Wikipedia dump files directly into CloudSearch • 500 docs/second achieved without much effort • Using 4 x XL instances of CloudSearch • 1 x XL EC2 instance for Aspire
  • 9 Searching • Amazon CloudSearch provides a RESTful/XML interface for search purposes • For the Wikipedia project, we needed a UI • Chose to use Twigkit • Wrote a Java API for CloudSearch • The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
  • 10 Searching • Supports navigators and relevancy customization • E.g. a “PageRank” style link analysis was performed • Limits set high: e.g. retrieve 500,000 results in a single list, delivered in just a few seconds • Very useful for analysis applications • So, what does it look like?
  • 11 wikipedia.searchtechnologies.com
  • 12 wikipedia.searchtechnologies.com
  • 13 Summary & Observations • A capable and scalable “raw” engine • XML in, RESTful/XML out • Easy to set up – much the same as an EC2 instance • Elastic scalability
  • 14 Summary & Observations • Cost effective • From $75 per month, including management / maintenance • Extremely convenient • Switch on / off at leisure • Promotes experimentation & agility
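
The content processing tasks listed on the Indexing slide (splitting a large XML file into individual documents, date normalization, metadata mapping, cleanup) can be illustrated with a minimal Python sketch. This is not the Aspire pipeline used in the project, which the slides do not show in detail. The sample XML is made up, modeled on the general shape of a Wikipedia dump, and the output field names (title, id, last_modified, body) are hypothetical choices for illustration, not the schema used in the webinar.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def local_name(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def normalize_date(raw):
    """Convert a timestamp such as 2012-03-01T12:34:56Z to ISO 8601 with a UTC offset."""
    parsed = datetime.strptime(raw.strip(), "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).isoformat()


def split_pages(stream):
    """Yield one flat document dict per <page>, parsing the stream incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        doc = {}
        for node in elem.iter():  # document order: page id comes before revision id
            name = local_name(node.tag)
            text = (node.text or "").strip()
            if name == "title":
                doc["title"] = text
            elif name == "id" and "id" not in doc:
                doc["id"] = text  # keep only the first <id>, i.e. the page id
            elif name == "timestamp":
                doc["last_modified"] = normalize_date(text)
            elif name == "text":
                doc["body"] = " ".join(text.split())  # cleanup: collapse whitespace
        yield doc
        elem.clear()  # free the page's children once it has been emitted


# Made-up sample input; a real dump is far larger and would be read from a file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alpha</title>
    <id>1</id>
    <revision>
      <id>100</id>
      <timestamp>2012-03-01T12:34:56Z</timestamp>
      <text>First   article
      text.</text>
    </revision>
  </page>
  <page>
    <title>Beta</title>
    <id>2</id>
    <revision>
      <id>200</id>
      <timestamp>2012-04-02T01:02:03Z</timestamp>
      <text>Second article.</text>
    </revision>
  </page>
</mediawiki>"""

docs = list(split_pages(io.BytesIO(SAMPLE)))
for doc in docs:
    print(doc)
```

Because the generator yields one document at a time, its output could be fed straight into whatever step converts documents to the form the search engine ingests, in the spirit of the streaming approach the slides describe.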