What is Hydra?


Hydra is an open source technology that brings structure to unstructured data such as news articles and text documents. It enriches documents with metadata through language detection, sentiment analysis, and other techniques, so that the data can be searched and filtered more effectively. Hydra is designed to scale to large amounts of data through a fault-tolerant and robust pipeline architecture. It can integrate with Hadoop to process entire document sets at once for applications such as PageRank and analytics.


Hydra brings structure

What is unstructured data?
•  A linguistic excuse?

News articles
Plain text that contains invaluable metadata for search, such as:
•  Title
•  Author byline
•  Lead paragraph

Hydra is about your data

•  Enrich your documents with metadata, to power your search
   •  Language detection
   •  Sentiment analysis
   •  Headline extraction
   •  Regular expression matching and extraction

•  Filter out unwanted documents
•  Collect statistics
•  Export to Staging environments
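
The enrichment, filtering, and statistics steps listed above can be sketched in a few lines. This is only an illustration in Python, not Hydra's actual API: the function names (`extract_headline`, `extract_author`, `keep`, `enrich`), the dict-based document model, the byline pattern, and the 40-character threshold are all assumptions made for the example.

```python
import re

# Hypothetical byline pattern: a line such as "By Jane Doe".
BYLINE = re.compile(
    r"^By (?P<author>[A-Z][\w.\-]*(?: [A-Z][\w.\-]*)+)\s*$", re.MULTILINE
)


def extract_headline(doc):
    """Headline extraction: use the first non-empty line as the title."""
    for line in doc["text"].splitlines():
        if line.strip():
            doc["title"] = line.strip()
            break
    return doc


def extract_author(doc):
    """Regular expression matching and extraction of the author byline."""
    match = BYLINE.search(doc["text"])
    if match:
        doc["author"] = match.group("author")
    return doc


def keep(doc, min_chars=40):
    """Filter: drop documents that are too short to be worth indexing."""
    return len(doc["text"].strip()) >= min_chars


def enrich(docs):
    """Enrich the kept documents and collect simple statistics."""
    stats = {"seen": 0, "kept": 0}
    enriched = []
    for doc in docs:
        stats["seen"] += 1
        if not keep(doc):
            continue
        # Work on a copy so the input documents are left untouched.
        enriched.append(extract_author(extract_headline(dict(doc))))
        stats["kept"] += 1
    return enriched, stats
```

Calling `enrich` on a list of `{"id": ..., "text": ...}` dicts returns the enriched documents together with counts of how many were seen and kept.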

Hydra Design Objectives

Scalability
•  Possible to connect any number of processing machines
Fault tolerance
•  Failure of a stage affects only a single document
•  Failures can be automatically detected
Robustness
•  Stages and nodes are completely independent (no domino effect)
Development ease
•  Allow test driven pipeline development
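
As a rough illustration of the fault tolerance and development ease objectives, the sketch below runs documents through a chain of stages in a single Python process. It is not Hydra's implementation: `run_pipeline`, `lowercase_title`, and `count_words` are hypothetical names, and a real deployment would run stages as independent processes on separate nodes. The sketch only shows that a failing stage costs one document, that the failure is recorded where it can be detected, and that a stage written as a plain function is easy to test on its own.

```python
def run_pipeline(docs, stages):
    """Run each document through all stages, isolating failures per document."""
    processed, failed = [], []
    for doc in docs:
        current = dict(doc)  # copy, so a failed run leaves the input intact
        try:
            for stage in stages:
                current = stage(current)
        except Exception as exc:
            # Record the failure for detection and move on to the next document.
            failed.append(
                {"id": doc.get("id"), "stage": stage.__name__, "error": repr(exc)}
            )
            continue
        processed.append(current)
    return processed, failed


def lowercase_title(doc):
    """Example stage: raises KeyError when the document has no title."""
    doc["title"] = doc["title"].lower()
    return doc


def count_words(doc):
    """Example stage: add a word count computed from the text field."""
    doc["word_count"] = len(doc.get("text", "").split())
    return doc
```

A stage can be unit tested by calling it directly on a small dict, without starting the rest of the pipeline.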

What about Hadoop and Big Data?

Use cases for document enrichment
•  PageRank
•  Analytics

Hadoop & Map/Reduce advantages
•  Huge scalability
•  Ability to work on entire document set at once

Hadoop & Map/Reduce drawbacks
•  Batch processing
•  Time-to-index
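
To see why PageRank needs the entire document set at once, consider a minimal single-machine sketch of the standard power iteration. It is plain Python, not Hadoop or Hydra code; the `pagerank` function, the link-graph format (a dict mapping each document to the documents it links to), and the damping and iteration values are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration over a complete link graph."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every document starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this document's rank among the documents it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling document: spread its rank evenly over all documents.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

Each round reads the rank of every document, so the score of one document cannot be computed while it passes through a pipeline alone. That is the kind of batch job suited to Map/Reduce, at the cost of a longer time-to-index.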

Hydra integrated with Hadoop

Blue – First round of indexing only
Red – Second round of indexing
Purple – All documents

Hydra in summary

Hydra
•  can chew through almost anything
•  has many heads
•  regenerates
•  scales

Hydra is Open Source

•  Other committers
•  The role of Findwise

For more information:
•  http://www.findwise.com/hydra
•  http://findwise.github.com/Hydra

•  Email: joel.westberg@findwise.com

Joel Westberg
joel.westberg@findwise.com

@joelwes

Presented at Lucene Revolution, 7-8 May, in Boston and at Berlin Buzzwords, 4-5 June, 2012. In free-text search, the quality of the data in the index is a key factor in the quality of the results delivered, and it has a major impact on the information consumption experience. Hydra is designed to give a search solution the tools it needs to modify the data to be indexed in an efficient and flexible way. It does this by providing a scalable and efficient pipeline that documents pass through before they are indexed into the search engine.

