Introducing Hydra – An Open Source Document Processing Framework

Presented by Joel Westberg, Findwise AB - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

This presentation describes Hydra, the document-processing framework developed by Findwise, and the problem it aims to solve. We first discuss the need for scalable document processing and argue that the open source chain lacks a link that bridges the gap between the source system and the search engine. We then describe Hydra's design goals and how it has been implemented to meet the demands on flexibility, robustness and ease of use. The session ends by discussing some of the possibilities this new pipeline framework offers, such as seamlessly scaling up the solution during peak loads, metadata enrichment, and a proposed integration with Hadoop for Map/Reduce tasks such as PageRank calculations.

Published in: Technology, Education

  1. Introducing Hydra: An Open Source Document Processing Framework
     Joel Westberg
     © FINDWISE 2012
  2. About Findwise
     • Founded in 2005
     • Offices in Sweden, Denmark, Norway and Poland
     • 80 employees (April 2012)
     • Our objective is to be a leading provider of Findability solutions utilising the full potential of search technology to create customer business value
  3. [Image slide: the image could not be displayed in the source.]
  4. Technology independent
     Creating search-driven Findability solutions based on market-leading commercial and open source search technology platforms:
     ✓ Autonomy IDOL
     ✓ Microsoft (SharePoint and FAST Search products)
     ✓ Google GSA
     ✓ IBM ICA/OmniFind
     ✓ LucidWorks
     ✓ Apache Lucene/Solr
     ✓ Elastic Search
     ✓ and more…
  5. Generic Search Architecture
  6. Connecting source to search
     Garbage in, garbage out. But what about unstructured data in?
     • Flat data is richer than it appears
     • Don't discard information too soon!
     The unstructured structured data paradox
     Example: News articles
     Plain text that contains invaluable metadata for search, such as:
     • Title
     • Author byline
     • Lead paragraph
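
The news article example on slide 6 can be sketched in a few lines of Java. This is only an illustration, not Hydra code: the class name `NewsArticleFields`, the field names, and the assumed layout (blocks separated by blank lines, title first, byline starting with "By ") are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: pulls title, author and lead paragraph out of a plain-text article. */
final class NewsArticleFields {

    /**
     * Assumes, for illustration only, that blocks are separated by blank lines,
     * that the first block is the title, and that the byline block starts with "By ".
     */
    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        // Split the text into blocks separated by one or more blank lines.
        String[] blocks = text.trim().split("\\R\\s*\\R");
        for (String raw : blocks) {
            String block = raw.trim();
            if (block.isEmpty()) {
                continue;
            }
            if (!fields.containsKey("title")) {
                fields.put("title", block);
            } else if (!fields.containsKey("author") && block.startsWith("By ")) {
                fields.put("author", block.substring(3).trim());
            } else if (!fields.containsKey("lead")) {
                fields.put("lead", block);
            }
            // Any further blocks are body text and are left alone in this sketch.
        }
        return fields;
    }

    public static void main(String[] args) {
        String article = "Example headline\n\nBy Jane Doe\n\nFirst paragraph.\n\nSecond paragraph.";
        System.out.println(extract(article));
    }
}
```

Real articles vary far more than this layout allows, which is the point of the slide: the structure is there in the flat text, but it has to be recovered before the search engine can use it.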
  7. Enrichment and structuring possibilities
     • Enrich your documents with metadata, to power your search
       • Language detection
       • Sentiment analysis
       • Headline extraction
       • Regular expression matching and extraction
     • Filter out unwanted documents
     • Collect statistics
     • Export to Staging environments
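
Two of the possibilities on slide 7, regular expression extraction and filtering, are easy to illustrate. The sketch below models a document as a plain `Map`, which is an assumption made for brevity and not Hydra's document type; the class name `EnrichmentSketch`, the method names and the deliberately simple e-mail pattern are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical enrichment steps operating on a document modelled as a Map. */
final class EnrichmentSketch {

    // Deliberately simple e-mail pattern, for illustration only.
    private static final Pattern EMAIL =
            Pattern.compile("[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+");

    /** Regular expression extraction: stores every match found in sourceField as a list in targetField. */
    static void extractEmails(Map<String, Object> doc, String sourceField, String targetField) {
        Object text = doc.get(sourceField);
        if (!(text instanceof String)) {
            return; // nothing to extract from
        }
        List<String> matches = new ArrayList<>();
        Matcher m = EMAIL.matcher((String) text);
        while (m.find()) {
            matches.add(m.group());
        }
        doc.put(targetField, matches);
    }

    /** Filtering: returns true if the document has at least minLength characters of text in the field. */
    static boolean keep(Map<String, Object> doc, String field, int minLength) {
        Object text = doc.get(field);
        return text instanceof String && ((String) text).trim().length() >= minLength;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("text", "For more information, email joel.westberg@findwise.com");
        extractEmails(doc, "text", "emails");
        System.out.println(doc.get("emails") + " keep=" + keep(doc, "text", 10));
    }
}
```

Keeping each enrichment as a small step with one job is what lets steps be combined, reordered or dropped per source.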
  8. Classic Pipeline
  9. Classic Architecture
  10. The Hydra Architecture
  11. Main Design Objectives
      Scalability
      • Horizontally scalable central repository
      • Independent processing nodes
      Failure tolerant
      • Failure of a stage affects only a single document
      • Failure of a node affects at most n documents
      • Failures can be automatically detected
      Robustness
      • Independent stages
      Development ease
      • Debug stages from IDE against actual data
      • Allow test driven pipeline development
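
The objective "failure of a stage affects only a single document" can be illustrated with a minimal sketch. It is not how Hydra itself is implemented (the slides do not show that); it only assumes that a stage can be modelled as a `Consumer` over a `Map`-based document, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical sketch of per-document failure isolation. */
final class FailureIsolationSketch {

    /** Applies one stage to every document and returns the documents on which the stage failed. */
    static List<Map<String, Object>> runStage(Consumer<Map<String, Object>> stage,
                                              List<Map<String, Object>> docs) {
        List<Map<String, Object>> failed = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            try {
                stage.accept(doc);
            } catch (RuntimeException e) {
                // Only this document is recorded as failed; the loop carries on with the rest.
                failed.add(doc);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Map<String, Object> good = new HashMap<>();
        good.put("title", "hydra");
        Map<String, Object> bad = new HashMap<>(); // has no title, so the stage below throws

        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(good);
        docs.add(bad);

        List<Map<String, Object>> failed = runStage(
                doc -> doc.put("title", ((String) doc.get("title")).toUpperCase()), docs);
        System.out.println("failed documents: " + failed.size());
    }
}
```

Because the failed documents are collected rather than thrown away, they can be inspected or retried later, which also fits the objective that failures should be detectable automatically.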
  12. The Hydra Architecture
  13. Writing a Stage - Example

          @Stage(description="This is a Simple Writer")
          public class SimpleWriter extends AbstractProcessStage {
              @Parameter(description="Name of field to write value to")
              private String field;
              @Parameter(description="Value to write")
              private Object value;

              // Writes the configured value into the configured field of the document.
              @Override
              public void process(LocalDocument doc) throws ProcessException {
                  doc.putContentField(field, value);
              }

              // Fails fast if the required "field" parameter was not configured.
              @Override
              public void init() throws RequiredArgumentMissingException {
                  if (field == null) throw new RequiredArgumentMissingException("field is missing");
              }
          }
  14. Hadoop/Big Data integration
      Use cases for document enrichment
      • PageRank
      • Analytics
      Hadoop & Map/Reduce advantages
      • Huge scalability
      • Ability to work on entire document set at once
      Hadoop & Map/Reduce drawbacks
      • Batch processing
      • Time-to-index
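
PageRank shows why the "entire document set at once" point matters: a page's score depends on the scores of every page linking to it, so it cannot be computed one document at a time in a pipeline stage. The sketch below is a plain in-memory power iteration in Java, not Hadoop or Map/Reduce code. The class name is hypothetical, the damping factor is left to the caller (0.85 is a conventional choice), and it assumes every link target also appears as a key in the link map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory PageRank sketch (power iteration). */
final class PageRankSketch {

    /**
     * links maps each page to the pages it links to.
     * Assumes every link target is itself a key of the map.
     */
    static Map<String, Double> rank(Map<String, List<String>> links, int iterations, double damping) {
        int n = links.size();
        Map<String, Double> rank = new HashMap<>();
        for (String page : links.keySet()) {
            rank.put(page, 1.0 / n); // start from a uniform distribution
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : links.keySet()) {
                next.put(page, (1.0 - damping) / n); // share every page always receives
            }
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                List<String> out = e.getValue();
                if (out.isEmpty()) {
                    continue; // pages without outgoing links pass nothing on in this sketch
                }
                double share = damping * rank.get(e.getKey()) / out.size();
                for (String target : out) {
                    next.merge(target, share, Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("c"));
        links.put("c", Arrays.asList("a"));
        System.out.println(rank(links, 20, 0.85));
    }
}
```

Each iteration reads the whole link graph, which maps naturally onto a batch job and explains the drawbacks listed on the slide: the scores only become available after the batch has run, so they arrive later than the first indexing of a document.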
  15. Hadoop/Big Data integration
      Diagram legend:
      • Blue – First round of indexing only
      • Red – Second round of indexing
      • Purple – All documents
  16. Future Configuration UI
  17. Open Source initiative
      • Other committers
      • The role of Findwise
      For more information:
      • http://www.findwise.com/hydra
      • http://findwise.github.com/Hydra
      • Email: joel.westberg@findwise.com
  18. Questions?
  19. Thank you!
      Joel Westberg
      joel.westberg@findwise.com
