Hydra brings structure What is unstructured data? • A linguistic excuse? News articles Plain text that contains invaluable metadata for search, such as: • Title • Author byline • Lead paragraph
Hydra is about your data • Enrich your documents with metadata, to power your search • Language detec+on • Sen+ment analysis • Headline extrac+on • Regular expression matching and extrac+on • Filter out unwanted documents • Collect statistics • Export to Staging environments
Before Hydra
Before Hydra
Hydra scales
Hydra Design Objectives Scalability • Possible to connect any number of processing machines Fault tolerance • Failiure of a stage affects only a single document • Failiures can be automaticly detected Robustness • Stages and nodes are completely independent (no domino- effect) Development ease • Allow test driven pipeline development
What about Hadoop and Big Data? Usecases for document enrichment • Pagerank • Analy+cs Hadoop & Map/Reduce advantages • Huge scalability • Ability to work on en+re document set at once Hadoop & Map/Reduce drawbacks • Batch processing • Time-‐to-‐index
Hydra integrated with Hadoop Blue – First round of indexing only Red – Second round of indexing Purple – All documents
Hydra in summary Hydra • can chew through almost anything • has many heads • regenerates • scales
Hydra is Open Source • Other committers • The role of Findwise For more information: • http://www.findwise.com/hydra • http://findwise.github.com/Hydra • Email: joel.westberg@findwise.com
Joel Westberg joel.westberg@findwise.com @joelwes
Let LinkedIn power your SlideShare experience
+
Let LinkedIn power your SlideShare experience
Customize SlideShare content based on your interests
We will import your LinkedIn profile and you will be visible on SlideShare.
Keep up to date when your LinkedIn contacts post on SlideShare