Adding Search to the Hadoop Ecosystem

Published in: Software, Technology

Transcript

  • 1. Finding a needle in a stack of needles - adding Search to the Hadoop Ecosystem • Patrick Hunt (@phunt) • Bay Area Search Meetup, April 2014
  • 2. Agenda • Big Data and Search – setting the stage • Cloudera Search’s Architecture • Apache Lucene/Solr • Apache Flume • Apache HBase • Apache MapReduce • Apache Sentry • Near Real Time and Batch Use Cases • Conclusion and Q&A
  • 3. Why Search? • An Integrated Framework on Apache Hadoop • One pool of data • One security framework • One set of system resources • One management interface
  • 4. Search Simplifies Interaction • User Goals • Explore • Navigate • Correlate • Experts know MapReduce • Savvy people know SQL • Everyone knows Search!
  • 5. What is Cloudera Search? • Full-text, interactive search and faceted navigation • Batch, near real-time, and on-demand indexing • Apache Solr integrated with CDH • Established, mature search with vibrant community • Separate runtime like MapReduce, Impala • Incorporated as part of the Apache Hadoop ecosystem • Open Source • 100% Apache, 100% Solr • Standard Solr APIs
  • 6. Challenges • Scalable/Reliable Index Storage • Near Real Time (NRT) indexing • Scalable Batch Indexing • Usability
  • 7. Apache Lucene/Solr • Lucene - full text search library • Solr – search service on Lucene • SolrCloud – distributed search • We are using version 4 (4.4 currently)
  • 8. Integrate Solr/Lucene with HDFS • Lucene Directory Abstraction • Implemented HDFSDirectory using HDFS client library • Read/Write index files directly to HDFS • Solr DirectoryFactory Abstraction • HDFSDirectoryFactory plugs HDFSDirectory into Solr • Configuration – Solr and HDFS
  • 9. Cloudera Upstream Contributions - Solr • SOLR-3911 - Directory/DirectoryFactory now first class • Solr Replication now uses Directory abstraction • Solr Admin UI no longer assumes local directory access • SOLR-4916 - support for reading/writing Solr index files and transaction log files to/from HDFS • HDFSDirectoryFactory/HDFSDirectory implementation • SOLR-4655 - The Overseer should assign node names by default. • SOLR-3706 - Ship setup to log with log4j • SOLR-4494 - Clean up and polish Collections API • SOLR-4718 - Improvements to configurability • Configuration now entirely through ZooKeeper (optional) • Many more improvements/cleanup/hardening/…
  • 10. Distributed Search on Hadoop • [Diagram labels: Flume; Hue UI, Custom UI, Custom App (query); Solr x3 (SolrCloud); Hadoop Cluster: MR, HDFS, HBase (index)]
  • 11. Near Real Time Indexing with Flume • Solr and Flume • Data ingest at scale • Flexible extraction and mapping • Indexing at data ingest • [Diagram labels: Log File, Other Log File, Flume Agent, Indexer, HDFS]
  • 12. Apache Flume - MorphlineSolrSink • Flume – reliable/scalable log collection • Created a Flume “Sink” for indexing events to Solr • Integrates Cloudera Morphlines (ETL framework)
  • 13. Near Real Time indexing of HBase • [Diagram labels: HDFS, HBase, interactive load, Indexer(s), Triggers on updates, Solr server x5] • [Diagram text: Search + =; planet-sized tabular data; immediate access & updates; fast & flexible information discovery; BIG DATA; DATA MANAGEMENT]
  • 14. Lily HBase Indexer • Collaboration between NGData & Cloudera • NGData are creators of the Lily data management platform • Lily HBase Indexer • Service which acts as an HBase replication listener • Replication updates trigger indexing of updates (rows) • Integrates Cloudera Morphlines library for ETL of rows • AL2 licensed on GitHub https://github.com/ngdata
  • 15. Scalable Batch Indexing • Solr and MapReduce • Flexible, scalable batch indexing • Start serving new indices with no downtime • On-demand indexing, cost-efficient re-indexing • [Diagram labels: HDFS/HBase, Files, Indexer, Index shard, Solr server]
  • 16. Scalable Batch Indexing • Mapper: Parse input into indexable document (x3) • Arbitrary reducing steps of indexing and merging • End-Reducer (shard 1): Index document • Index shard 1 • End-Reducer (shard 2): Index document • Index shard 2
  • 17. MapReduce Indexer • MapReduce Job with two parts 1) Scan HDFS for files (or HBase for records) to be indexed 2) Mapper/Reducer indexing step • Mapper extracts content via Cloudera Morphlines • Reducer uses Lucene to index documents directly to HDFS • “golive” • Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing • Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster • No downtime for users • No NRT expense • Linear scale out to the size of your MR cluster
  • 18. Simple, Customizable Search Interface Hue • Simple UI • Navigated, faceted drill down • Customizable display • Full text search, standard Solr API and query language
  • 19. Conclusion and Q&A • Try it now with Cloudera Live! • Cloudera Search • Free Download • Extensive documentation • Send your questions and feedback to Cloudera Search Forum • Take the Search online training • Cloudera Express (i.e. the free version) • Simple management of Search • Free Download
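
The batch indexing flow on slides 16 and 17 (mappers parse input into documents, end-reducers build one index per shard) can be illustrated with a small in-process sketch. This is not the Cloudera MapReduce indexer: there is no Hadoop, Morphlines, or Lucene here. The "id<TAB>text" input format, the CRC32 partitioner, the dictionary-based inverted index, and all function names are assumptions made only for illustration.

```python
from collections import defaultdict
import zlib

NUM_SHARDS = 2  # assumption: two shards, as drawn on slide 16


def map_parse(raw_line):
    """Mapper: parse one raw input line into an indexable document.

    The "id<TAB>text" layout is a made-up input format for this sketch.
    """
    doc_id, text = raw_line.split("\t", 1)
    return {"id": doc_id, "text": text}


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Partitioner: choose a shard from a stable hash of the document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def reduce_index(docs):
    """End-reducer: build one shard's inverted index (term -> sorted doc ids)."""
    postings = defaultdict(set)
    for doc in docs:
        for term in doc["text"].lower().split():
            postings[term].add(doc["id"])
    return {term: sorted(ids) for term, ids in postings.items()}


def batch_index(raw_lines, num_shards=NUM_SHARDS):
    """Run map, shuffle and reduce in-process; return one index per shard."""
    shuffled = defaultdict(list)
    for line in raw_lines:
        doc = map_parse(line)
        shuffled[shard_for(doc["id"], num_shards)].append(doc)
    return {shard: reduce_index(shuffled[shard]) for shard in range(num_shards)}


def search(shard_indexes, term):
    """Query every shard and merge the hits, as a distributed search would."""
    hits = []
    for index in shard_indexes.values():
        hits.extend(index.get(term.lower(), []))
    return sorted(hits)


if __name__ == "__main__":
    lines = ["1\tHadoop stores data", "2\tSolr searches data", "3\tFlume moves logs"]
    shards = batch_index(lines)
    print(search(shards, "data"))  # prints ['1', '2']
```

Because each reducer owns exactly one shard, the shards can be built independently and in parallel, which is the property the slides rely on for linear scale-out.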

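Slides 5 and 18 mention full-text search and faceted navigation through the standard Solr API. As a rough sketch, a faceted query is an HTTP request to a collection's /select handler using the standard Solr parameters q, rows, wt, facet, and facet.field. The host, the collection name "logs", the field names, and the helper build_select_url below are hypothetical; only the URL is built, and no request is sent.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, facet_fields=(), rows=10):
    """Build a Solr /select URL for a query with optional field facets."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        # Faceting is switched on once, then one facet.field per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return "{}/{}/select?{}".format(
        base_url.rstrip("/"), collection, urlencode(params)
    )


if __name__ == "__main__":
    # Hypothetical host, collection, and field names.
    print(build_select_url("http://localhost:8983/solr", "logs",
                           "text:hadoop", facet_fields=["host", "level"]))
```

The parameters are kept as a list of pairs rather than a dictionary because facet.field may legitimately appear more than once in the same request.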