Adding Search to the Hadoop Ecosystem


Transcript

  • 1. Finding a needle in a stack of needles - adding Search to the Hadoop Ecosystem • Patrick Hunt (@phunt) • Bay Area Search Meetup, April 2014
  • 2. Agenda • Big Data and Search – setting the stage • Cloudera Search’s Architecture • Apache Lucene/Solr • Apache Flume • Apache HBase • Apache MapReduce • Apache Sentry • Near Real Time and Batch Use Cases • Conclusion and Q&A
  • 3. Why Search? • An Integrated Framework on Apache Hadoop • One pool of data • One security framework • One set of system resources • One management interface
  • 4. Search Simplifies Interaction • User Goals • Explore • Navigate • Correlate • Experts know MapReduce • Savvy people know SQL • Everyone knows Search!
  • 5. What is Cloudera Search? • Full-text, interactive search and faceted navigation • Batch, near real-time, and on-demand indexing • Apache Solr integrated with CDH • Established, mature search with vibrant community • Separate runtime like MapReduce, Impala • Incorporated as part of the Apache Hadoop ecosystem • Open Source • 100% Apache, 100% Solr • Standard Solr APIs
  • 6. Challenges • Scalable/Reliable Index Storage • Near Real Time (NRT) indexing • Scalable Batch Indexing • Usability
  • 7. Apache Lucene/Solr • Lucene - full text search library • Solr – search service on Lucene • SolrCloud – distributed search • We are using version 4 (4.4 currently)
  • 8. Integrate Solr/Lucene with HDFS • Lucene Directory Abstraction • Implemented HDFSDirectory using HDFS client library • Read/Write index files directly to HDFS • Solr DirectoryFactory Abstraction • HDFSDirectoryFactory plugs HDFSDirectory into Solr • Configuration – Solr and HDFS
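The directory abstraction on this slide can be sketched in a few lines. This is an illustration of the idea only, not Lucene's or Solr's actual API: the `Directory`, `LocalDirectory`, `FakeRemoteDirectory`, `directory_factory`, and `write_segment` names are all hypothetical, and the "remote" backend is an in-memory stand-in rather than a real HDFS client.

```python
import os
from abc import ABC, abstractmethod


class Directory(ABC):
    """Hypothetical stand-in for a storage abstraction over named index files."""

    @abstractmethod
    def write_file(self, name, data):
        """Store bytes under the given file name."""

    @abstractmethod
    def read_file(self, name):
        """Return the bytes stored under the given file name."""

    @abstractmethod
    def list_files(self):
        """Return the sorted names of all stored files."""


class LocalDirectory(Directory):
    """Keeps index files on the local file system."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)

    def write_file(self, name, data):
        with open(os.path.join(self.path, name), "wb") as handle:
            handle.write(data)

    def read_file(self, name):
        with open(os.path.join(self.path, name), "rb") as handle:
            return handle.read()

    def list_files(self):
        return sorted(os.listdir(self.path))


class FakeRemoteDirectory(Directory):
    """In-memory placeholder for a remote store such as HDFS (assumption)."""

    def __init__(self):
        self._files = {}

    def write_file(self, name, data):
        self._files[name] = bytes(data)

    def read_file(self, name):
        return self._files[name]

    def list_files(self):
        return sorted(self._files)


def directory_factory(kind, **kwargs):
    """Pick a storage backend by name, as a factory plugged in via config might."""
    if kind == "local":
        return LocalDirectory(kwargs["path"])
    if kind == "remote":
        return FakeRemoteDirectory()
    raise ValueError(f"unknown directory kind: {kind}")


def write_segment(directory, segment_name, postings):
    """Index-writing code that depends only on the Directory interface."""
    lines = [
        f"{term}\t{','.join(str(doc_id) for doc_id in doc_ids)}"
        for term, doc_ids in sorted(postings.items())
    ]
    directory.write_file(segment_name, "\n".join(lines).encode("utf-8"))
```

Because `write_segment` never touches the file system directly, switching the backend is a one-line change in the factory call, which is the property the slide attributes to the Directory/DirectoryFactory split.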
  • 9. Cloudera Upstream Contributions - Solr • SOLR-3911 - Directory/DirectoryFactory now first class • Solr Replication now uses Directory abstraction • Solr Admin UI no longer assumes local directory access • SOLR-4916 - support for reading/writing Solr index files and transaction log files to/from HDFS • HDFSDirectoryFactory/HDFSDirectory implementation • SOLR-4655 - The Overseer should assign node names by default • SOLR-3706 - Ship setup to log with log4j • SOLR-4494 - Clean up and polish Collections API • SOLR-4718 - Improvements to configurability • Configuration now entirely through ZooKeeper (optional) • Many more improvements/cleanup/hardening/…
  • 10. Distributed Search on Hadoop • [Diagram labels: Hue UI, Custom UI, and Custom App send queries to SolrCloud (several Solr servers); Flume, MR, and HBase feed the index; HDFS holds the index inside the Hadoop cluster]
  • 11. Near Real Time Indexing with Flume • Solr and Flume • Data ingest at scale • Flexible extraction and mapping • Indexing at data ingest • [Diagram labels: Log File, Other, Flume Agent, Indexer, HDFS]
  • 12. Apache Flume - MorphlineSolrSink • Flume – reliable/scalable log collection • Created a Flume “Sink” for indexing events to Solr • Integrates Cloudera Morphlines (ETL framework)
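The extract-and-map step that a sink of this kind performs before indexing can be pictured as a chain of small commands applied to each event. The sketch below is a plain-Python analogy, not the Morphlines or Flume API: the log format, the command functions, and the Solr field names (`timestamp_dt`, `level_s`, `text`) are all assumptions made for illustration.

```python
import re
from datetime import datetime, timezone

# Assumed log format: "<ISO timestamp> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(?P<ts>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def parse_line(record):
    """Split the raw event body into fields; drop records that do not match."""
    match = LOG_PATTERN.match(record["body"])
    if match is None:
        return None
    record.update(match.groupdict())
    return record


def normalize_timestamp(record):
    """Convert the timestamp to an explicit UTC ISO-8601 string."""
    parsed = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%S")
    record["ts"] = parsed.replace(tzinfo=timezone.utc).isoformat()
    return record


def to_solr_document(record):
    """Map the extracted fields onto (hypothetical) Solr schema field names."""
    return {
        "id": record["id"],
        "timestamp_dt": record["ts"],
        "level_s": record["level"],
        "text": record["message"],
    }


def run_pipeline(event, commands):
    """Apply each command in order; a command returning None drops the event."""
    record = dict(event)
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


PIPELINE = [parse_line, normalize_timestamp, to_solr_document]
```

Calling `run_pipeline({"id": "e1", "body": "2014-04-01T12:00:00 ERROR disk full"}, PIPELINE)` yields a flat dictionary ready to be sent to an indexer, while an event whose body does not match the pattern is dropped.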
  • 13. Near Real Time indexing of HBase • HBase: planet-sized tabular data, immediate access & updates • Search: fast & flexible information discovery • HBase + Search = big data management • [Diagram labels: HDFS, HBase, interactive load, Indexer(s), triggers on updates, Solr servers]
  • 14. Lily HBase Indexer • Collaboration between NGData & Cloudera • NGData are the creators of the Lily data management platform • Lily HBase Indexer • Service which acts as an HBase replication listener • Replication updates trigger indexing of the updated rows • Integrates the Cloudera Morphlines library for ETL of rows • Apache License 2.0, hosted on GitHub: https://github.com/ngdata
  • 15. Scalable Batch Indexing • Solr and MapReduce • Flexible, scalable batch indexing • Start serving new indices with no downtime • On-demand indexing, cost-efficient re-indexing • [Diagram labels: Files (HDFS/HBase), Indexer, Index shard, Solr server]
  • 16. Scalable Batch Indexing • Mappers (three shown): parse input into indexable documents • Arbitrary reducing steps of indexing and merging • End-Reducer (shard 1): index documents into index shard 1 • End-Reducer (shard 2): index documents into index shard 2
  • 17. MapReduce Indexer • MapReduce Job with two parts 1) Scan HDFS for files (or HBase for records) to be indexed 2) Mapper/Reducer indexing step • Mapper extracts content via Cloudera Morphlines • Reducer uses Lucene to index documents directly to HDFS • “golive” • Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing • Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster • No downtime for users • No NRT expense • Linear scale out to the size of your MR cluster
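The mapper/reducer split and the "golive" merge described above can be mimicked in a single process. This is a toy model under stated assumptions, not the actual MapReduce indexer: the tab-separated input format, the CRC32-based sharding, and the dictionary-based inverted index are all invented for illustration, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def map_parse(raw_line):
    """Mapper step: parse one input line ("<id>\\t<text>") into a document."""
    doc_id, _, text = raw_line.partition("\t")
    return {"id": doc_id, "text": text}


def shard_for(doc_id, num_shards):
    """Route a document to a shard with a hash that is stable across runs."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def reduce_index(docs):
    """Reducer step: build a tiny inverted index (term -> sorted doc ids)."""
    index = defaultdict(set)
    for doc in docs:
        for term in doc["text"].lower().split():
            index[term].add(doc["id"])
    return {term: sorted(ids) for term, ids in index.items()}


def batch_index(raw_lines, num_shards):
    """Run the map phase, partition by shard, then index each shard separately."""
    buckets = defaultdict(list)
    for line in raw_lines:
        doc = map_parse(line)
        buckets[shard_for(doc["id"], num_shards)].append(doc)
    return {shard: reduce_index(buckets[shard]) for shard in range(num_shards)}


def go_live(live_shard, new_shard):
    """Merge a freshly built shard into the shard that is currently serving."""
    merged = {term: set(ids) for term, ids in live_shard.items()}
    for term, ids in new_shard.items():
        merged.setdefault(term, set()).update(ids)
    return {term: sorted(ids) for term, ids in merged.items()}
```

Each shard is built independently of the others, so in this model adding reducers adds indexing capacity, and the serving index is only touched in the final merge step.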
  • 18. Simple, Customizable Search Interface • Hue • Simple UI • Navigated, faceted drill down • Customizable display • Full text search, standard Solr API and query language
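Since the interface sits on the standard Solr API, a faceted full-text query is just an HTTP request to a collection's `select` handler. The sketch below only builds the URL and sends nothing; `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters, while the host, collection name, and field names in the usage note are assumptions.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, facet_fields=(), rows=10):
    """Build a Solr select URL with optional field faceting."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field to drill down on.
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"
```

For example, `build_select_url("http://localhost:8983/solr", "logs", "text:error", ["level_s"])` produces a URL that asks for matching documents plus per-level counts, which is the kind of request a faceted drill-down view would issue.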
  • 19. Conclusion and Q&A • Try it now with Cloudera Live! • Cloudera Search • Free Download • Extensive documentation • Send your questions and feedback to Cloudera Search Forum • Take the Search online training • Cloudera Express (i.e. the free version) • Simple management of Search • Free Download