Wikipedia Cloud Search Webinar

A webinar presented by Paul Nelson, Chief Architect at Search Technologies, on cloud search and a Wikipedia use case. The webinar was given in conjunction with Amazon CloudSearch. Search Technologies provides implementation and consulting services for Amazon CloudSearch. For further information, see http://www.searchtechnologies.com/amazon-cloudsearch-services.html

http://www.searchtechnologies.com/

Published in: Technology, Education
  • For further information about Aspire, see http://www.searchtechnologies.com/aspire.html
  • The Java API for Amazon CloudSearch can be downloaded from http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
  • Wikipedia Cloud Search Webinar

    1. Searching Wikipedia with Amazon CloudSearch
    2. Agenda
       • Project Background
       • High-level Architecture
       • Summary & Observations
    3. Project Background
       • Amazon contracted with Search Technologies to help with beta-testing, prior to the launch of Amazon CloudSearch
       • Decision to use Wikipedia as a convenient data set for testing purposes
    4. High-level Architecture
    5. Indexing
       • Wikipedia provides content in a series of large XML files
       • Amazon CloudSearch ingests XML in a specified form
       • Various content processing tasks to perform:
         • Splitting into individual documents
         • Date normalization
         • Metadata extraction & mapping
         • Cleanup, etc.
       • We used Aspire for these tasks
    6. Aspire in Brief
       • Based on Apache Felix / OSGi
       • Thread-safe, multi-threaded, distributable
       • Any number of pipelines, conditional branching
       • Plug-in components individually testable & upgradable
       • In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch, GSA
       • Tested with Elasticsearch and SP 2013
    7. XML Input
    8. Indexing
       • Streaming Wikipedia Dump Files directly into CloudSearch
       • 500 docs/second achieved without much effort
       • Using 4 x XL instances of CloudSearch
       • 1 x XL EC2 instance for Aspire
    9. Searching
       • Amazon CloudSearch provides a RESTful/XML interface for search purposes
       • For the Wikipedia project, we needed a UI
       • Chose to use Twigkit
       • Wrote a Java API for CloudSearch
       • The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
    10. Searching
       • Supports navigators and relevancy customization
       • E.g. a “PageRank” style link analysis was performed
       • Limits set high: E.g. retrieve 500,000 results in a single list, delivered in just a few seconds
       • Very useful for analysis applications
       • So, what does it look like?
    11. wikipedia.searchtechnologies.com
    12. wikipedia.searchtechnologies.com
    13. Summary & Observations
       • A capable and scalable “raw” engine
       • XML in, RESTful/XML out
       • Easy to set up – much the same as an EC2 instance
       • Elastic scalability
    14. Summary & Observations
       • Cost effective
       • From $75 per month, including management / maintenance
       • Extremely convenient
       • Switch on / off at leisure
       • Promotes experimentation & agility
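
The indexing slides describe splitting the Wikipedia dump into individual documents, normalizing dates, and producing XML for CloudSearch. The project used Aspire for this; the Python below is only a minimal sketch of the same steps, not the Aspire pipeline. The `<page>`, `<revision>`, and `<timestamp>` elements follow the public MediaWiki export format. The output layout (`<batch>`, `<add>`, `<field>`), the field names `title`, `updated`, and `text`, and the choice of epoch seconds for dates are assumptions made for illustration, as are all function names.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def local_name(tag):
    """Return an element's tag without its '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def normalize_date(timestamp):
    """Convert a dump timestamp such as '2001-09-09T01:46:40Z' to epoch seconds (UTC)."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def iter_pages(stream):
    """Stream a dump and yield one dict per <page>, without loading the whole file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        doc = {"id": "", "title": "", "updated": 0, "text": ""}
        for child in elem:  # direct children only, so the revision's own <id> is ignored
            name = local_name(child.tag)
            if name in ("id", "title"):
                doc[name] = child.text or ""
            elif name == "revision":
                for item in child:
                    item_name = local_name(item.tag)
                    if item_name == "timestamp":
                        doc["updated"] = normalize_date(item.text)
                    elif item_name == "text":
                        doc["text"] = item.text or ""
        elem.clear()  # release the page's subtree before moving on
        yield doc


def to_batch(docs, version=1):
    """Serialize documents into one XML batch (element names are assumed, see above)."""
    batch = ET.Element("batch")
    for doc in docs:
        add = ET.SubElement(batch, "add", id=doc["id"], version=str(version), lang="en")
        for name in ("title", "updated", "text"):
            field = ET.SubElement(add, "field", name=name)
            field.text = str(doc[name])
    return ET.tostring(batch, encoding="unicode")


# A tiny stand-in for a real dump file, used only to exercise the sketch.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Alpha</title><ns>0</ns><id>12</id>
<revision><id>99</id><timestamp>2001-09-09T01:46:40Z</timestamp><text>Hello &amp; bye</text></revision></page>
<page><title>Beta</title><ns>0</ns><id>13</id>
<revision><id>100</id><timestamp>1970-01-02T00:00:00Z</timestamp><text>Second</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    print(to_batch(iter_pages(io.BytesIO(SAMPLE))))
```

Streaming with `iterparse` keeps memory use flat regardless of dump size, which suits the "streaming dump files directly into CloudSearch" approach on slide 8. A real pipeline would also cap the size of each batch and post it to the domain's document endpoint.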

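
The searching slides mention a RESTful/XML search interface, navigators, and very large result lists. As a rough illustration of what a request against such an interface might look like, the sketch below builds a search URL in Python. It is not the Java API mentioned on slide 9. The endpoint host is a placeholder, and the path and parameter names (`q`, `size`, `start`, `results-type`, `return-fields`, `facet`) are assumptions modeled on the CloudSearch 2011-02-01 search API; check them against the CloudSearch documentation before relying on them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(endpoint, query, facets=(), fields=("title",), size=10, start=0):
    """Build a search URL; the path and parameter names are assumptions (see lead-in)."""
    params = [
        ("q", query),
        ("size", str(size)),    # number of hits to return in one list
        ("start", str(start)),  # offset of the first hit
        ("results-type", "xml"),
    ]
    if fields:
        params.append(("return-fields", ",".join(fields)))
    if facets:
        params.append(("facet", ",".join(facets)))  # "navigators" in the slides
    return "http://{}/2011-02-01/search?{}".format(endpoint, urlencode(params))


if __name__ == "__main__":
    # Hypothetical endpoint and field names, for illustration only.
    print(build_search_url(
        "search-example.invalid",
        "star wars",
        facets=["category"],
        fields=["title", "updated"],
        size=500000,
    ))
```

The `size=500000` call mirrors the "retrieve 500,000 results in a single list" point on slide 10; whether a given domain accepts a limit that high depends on its configuration.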