Hydra - Content Processing Framework for Search Driven Solutions

Presented at Lucene Revolution, 7-8 May in Boston and Berlin Buzzwords 4-5 June, 2012.

When working with free-text search, the quality of the data in the index is a key factor in the quality of the results delivered, and it has a major impact on the information consumption experience. Hydra is designed to give the search solution the tools it needs to modify the data to be indexed in an efficient and flexible way. It does this by providing a scalable and efficient pipeline that documents pass through before being indexed into the search engine.

Published in: Technology, Education

  • Findwise is a consulting company, focused on delivering Findability solutions to our customers.
  • That’s where we come from, now let’s talk about what […] Let’s start off with a view of how Apache Solr in the middle, Manifold CF on the far left
  • In examples: “Post the XML file to the DataImportHandler, and now you’ve indexed Wikipedia!” In reality: “Here’s a jumbled heap of files, and in these documents, when it says ‘product id: XYZ’, that’s the most important thing to index.”
  • Today we use OpenPipeline. One stage might extract references from the text and store them in separate fields. One might extract entities from the text. One might be parsing the PDF itself.
  • Also, forking is a problem, etc.
  • Core is the framework itself, e.g. “java -jar hydra-jar-with-dependencies.jar” to launch Hydra.
  • Light blue: all documents prior to being exported to HBase. Purple: all documents. Red: all documents that have been reinserted from HBase.
  • Give special thanks to Johannes
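The “product id: XYZ” case in the notes above is a typical job for a regular-expression extraction stage. The following is a minimal sketch of that idea in Python. It does not use Hydra's actual stage API; the function name `extract_product_ids`, the field names `text` and `product_id`, and the pattern itself are assumptions made for illustration.

```python
import re

# Hypothetical pattern for the "product id: XYZ" convention mentioned in the notes.
# The id is assumed to consist of letters, digits, underscores and hyphens.
PRODUCT_ID = re.compile(r"product id:\s*([\w-]+)", re.IGNORECASE)


def extract_product_ids(doc):
    """Return a copy of doc with any product ids copied into their own field."""
    ids = PRODUCT_ID.findall(doc.get("text", ""))
    enriched = dict(doc)  # leave the input document untouched
    if ids:
        enriched["product_id"] = ids
    return enriched


doc = {"id": "file-17", "text": "Spare part manual. Product ID: XYZ fits model B."}
print(extract_product_ids(doc)["product_id"])  # prints ['XYZ']
```

Storing the match in a separate field means the search engine can boost or filter on it, instead of treating it as ordinary body text.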

    1. Introducing Hydra: An Open Source Document Processing Framework. Joel Westberg. © FINDWISE 2012
    2. About Findwise
       • Founded in 2005
       • Offices in Sweden, Denmark, Norway and Poland
       • 80 employees (April 2012)
       • Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value
    3. Technology independent. Creating search-driven Findability solutions based on market-leading commercial and open source search technology platforms:
       • Autonomy IDOL
       • Microsoft (SharePoint and FAST Search products)
       • Google GSA
       • IBM ICA/OmniFind
       • LucidWorks
       • Apache Lucene/Solr
       • Elastic Search
       • and more…
    4. Generic Search Architecture
    5. Connecting source to search. Garbage in, garbage out. But what about unstructured data in?
       • Flat data is richer than it appears
       • Don’t discard information too soon!
       The unstructured structured data paradox. Example: news articles. Plain text that contains invaluable metadata for search, such as:
       • Title
       • Author byline
       • Lead paragraph
    6. Enrichment and structuring possibilities
       • Enrich your documents with metadata, to power your search
       • Language detection
       • Sentiment analysis
       • Headline extraction
       • Regular expression matching and extraction
       • Filter out unwanted documents
       • Collect statistics
       • Export to staging environments
    7. Classic Pipeline
    8. Classic Architecture
    9. The Hydra Architecture
    10. Main Design Objectives
       Scalability
       • Horizontally scalable central repository
       • Independent processing nodes
       Failure tolerant
       • Failure of a stage affects only a single document
       • Failure of a node affects at most n documents
       • Failures can be automatically detected
       Robustness
       • Independent stages
       Development ease
       • Debug stages from IDE against actual data
       • Allow test-driven pipeline development
    11. The Hydra Architecture
    12. Hadoop/Big Data integration
       Use cases for document enrichment
       • Pagerank
       • Analytics
       Hadoop & Map/Reduce advantages
       • Huge scalability
       • Ability to work on entire document set at once
       Hadoop & Map/Reduce drawbacks
       • Batch processing
       • Time-to-index
    13. Hadoop/Big Data integration
       • Blue: first round of indexing only
       • Red: second round of indexing
       • Purple: all documents
    14. Future Configuration UI
    15. Open Source initiative
       • Other committers
       • The role of Findwise
       For more information:
       • http://www.findwise.com/hydra
       • http://findwise.github.com/Hydra
       • Email: joel.westberg@findwise.com
    16. Questions?
    17. Thank you! Joel Westberg, joel.westberg@findwise.com
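
The failure-tolerance objective on slide 10 (“failure of a stage affects only a single document”) can be illustrated with a small in-process sketch. This is not Hydra's implementation or API: Hydra runs stages as independent units around a central repository, whereas the code below only models the isolation property in a single Python process. The names `run_pipeline`, `headline_stage` and `word_count_stage`, and the document fields, are hypothetical.

```python
def headline_stage(doc):
    """Naive headline extraction: assume the first non-empty line is the title."""
    lines = [line.strip() for line in doc["text"].splitlines() if line.strip()]
    return {**doc, "title": lines[0]}  # raises IndexError for empty text


def word_count_stage(doc):
    """Collect a simple statistic: the number of whitespace-separated words."""
    return {**doc, "word_count": len(doc["text"].split())}


def run_pipeline(documents, stages):
    """Run every stage on every document, isolating failures per document."""
    processed, failed = [], []
    for doc in documents:
        current = dict(doc)
        for stage in stages:
            try:
                current = stage(current)
            except Exception as exc:
                # Record the failure and move on; other documents are unaffected.
                failed.append((doc["id"], stage.__name__, repr(exc)))
                break
        else:
            # Reached only when no stage failed for this document.
            processed.append(current)
    return processed, failed


docs = [
    {"id": 1, "text": "Hydra released\nFindwise presents a framework."},
    {"id": 2, "text": ""},
    {"id": 3, "text": "Second story\nMore text here."},
]
processed, failed = run_pipeline(docs, [headline_stage, word_count_stage])
print([d["id"] for d in processed])           # prints [1, 3]
print([(i, name) for i, name, _ in failed])   # prints [(2, 'headline_stage')]
```

Because each stage is a plain function that takes and returns a document, it can be called directly from a unit test or a debugger, which is the same property the “Development ease” bullets describe.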
