Adding Search to the
Hadoop Ecosystem
Gregory Chanan (gchanan AT cloudera.com)
SF HUG August 2013
Transcript of "Search onhadoopsfhug081413"

  1. Adding Search to the Hadoop Ecosystem
     Gregory Chanan (gchanan AT cloudera.com)
     SF HUG August 2013
  2. Agenda
     • Big Data and Search – setting the stage
     • Cloudera Search Architecture
     • Component deep dive
     • Security
     • Conclusion
  3. Why Search?
     • Hadoop for everyone
     • Typical case:
       • Ingest data to storage engine (HDFS, HBase, etc.)
       • Process data (MapReduce, Hive, Impala)
     • Experts know MapReduce
     • Savvy people know SQL
     • Everyone knows Search!
  4. Why Search?
     An Integrated Part of the Hadoop System
     One pool of data
     One security framework
     One set of system resources
     One management interface
  5. Benefits of Search
     • Improved Big Data ROI
       • An interactive experience without technical knowledge
       • Single data set for multiple computing frameworks
     • Faster time to insight
       • Exploratory analysis, esp. unstructured data
       • Broad range of indexing options to accommodate needs
     • Cost efficiency
       • Single scalable platform; no incremental investment
       • No need for separate systems, storage
  6. What is Cloudera Search?
     • Full-text, interactive search with faceted navigation
     • Batch, near real-time, and on-demand indexing
     • Apache Solr integrated with CDH
       • Established, mature search with vibrant community
       • In production environments for years
     • Open Source
       • 100% Apache, 100% Solr
       • Standard Solr APIs
     • In public beta (version 0.9.3)
  7. Cloudera Search Components
     • HDFS/MR/Lucene/Solr/SolrCloud
     • Indexing
       • Near Real Time (NRT) indexing
       • Batch
     • ETL – Cloudera Morphlines
     • Querying
  8. Apache Hadoop
     • Apache HDFS
       • Distributed file system
       • High reliability
       • High throughput
     • Apache MapReduce
       • Parallel, distributed programming model
       • Allows processing of large datasets
       • Fault tolerant
  9. Apache Lucene
     • Full text search
       • Indexing
       • Query
     • Traditional inverted index
     • Batch and Incremental indexing
     • We are using version 4.3 in current release
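As a rough illustration of the "traditional inverted index" bullet, the sketch below maps each term to the documents that contain it, then supports both a batch build and an incremental add. It is a toy in plain Python, not Lucene's data structures or API; the function names (`build_inverted_index`, `add_document`, `search`) are made up for this example.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Batch indexing: map each term to the sorted ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def add_document(index, doc_id, text):
    """Incremental indexing: fold one new document into an existing index."""
    for term in set(tokenize(text)):
        ids = index.setdefault(term, [])
        if doc_id not in ids:
            ids.append(doc_id)
            ids.sort()


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)
```

A query only touches the posting lists of its own terms, which is why lookups stay fast as the document set grows.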
  10. Apache Solr
      • Search service built using Lucene
      • Ships with Lucene (same TLP at Apache)
      • Provides XML/HTTP/JSON/Python/Ruby/… APIs
        • Indexing
        • Query
      • Administrative interface
        • Also rich web admin GUI via HTTP
  11. Apache SolrCloud
      • Provides distributed Search capability
      • Part of Solr (not a separate library/codebase)
      • Shards – provide scalability
        • partition index for size
        • replicate for query performance
      • Uses ZooKeeper for coordination
        • No split-brain issues
        • Simplifies operations
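The two roles of shards on this slide (partitioning for size, replicating for query performance) can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the MD5-based `shard_for` is a stand-in hash and not Solr's actual document router, and the cluster is a nested list of dictionaries rather than real Solr cores.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Pick a shard by hashing the document id (stand-in hash, not Solr's router)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def build_cluster(num_shards, replicas_per_shard):
    """Each shard is a list of replicas; each replica maps doc id to text."""
    return [[{} for _ in range(replicas_per_shard)] for _ in range(num_shards)]


def index_document(cluster, doc_id, text):
    """Partitioning: write the document to every replica of the one shard that owns it."""
    shard = cluster[shard_for(doc_id, len(cluster))]
    for replica in shard:
        replica[doc_id] = text


def query(cluster, term, replica_choice=0):
    """Replication: fan out to one replica per shard and merge the hits."""
    hits = []
    for shard in cluster:
        replica = shard[replica_choice % len(shard)]
        hits.extend(d for d, text in replica.items() if term in text.lower())
    return sorted(hits)
```

Because every replica of a shard holds the same documents, any `replica_choice` returns the same hits, so query load can be spread across replicas.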
  12. Distributed Search on Hadoop
      [Architecture diagram. Labels: Flume, Hue UI, Custom UI, Custom App; Solr, Solr, Solr (SolrCloud); query, index; Hadoop Cluster: MR, HDFS, HBase, ZK]
  13. Indexing
      • Near Real Time (NRT)
        • Flume
        • HBase Indexer
      • Batch (MR)
  14. Indexing
      • Near Real Time (NRT)
        • Flume
        • HBase Indexer
      • Batch (MR)
  15. Near Real Time Indexing with Flume
      Solr and Flume
      • Data ingest at scale
      • Flexible extraction and mapping
      • Indexing at data ingest
      [Diagram. Labels: Log File, Other Log File, Flume Agent, Indexer, HDFS]
  16. Apache Flume - MorphlineSolrSink
      • A Flume Source…
        • Receives/gathers events
      • A Flume Channel…
        • Carries the event – MemoryChannel or reliable FileChannel
      • A Flume Sink…
        • Sends the events on to the next location
      • Flume MorphlineSolrSink
        • Integrates Cloudera Morphlines library
          • ETL, more on that in a bit
        • Does batching
        • Results sent to Solr for indexing
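The source, channel, sink flow with batching can be modeled as a small Python sketch. It is only a toy: the `MemoryChannel` class below borrows the name from the slide but is a plain in-memory queue, and `run_sink`, `transform`, and `load` are hypothetical names standing in for the sink loop, the morphline ETL step, and the Solr update call. None of this is Flume's Java API.

```python
from collections import deque


class MemoryChannel:
    """Toy in-memory channel: a FIFO buffer between a source and a sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        """Called by the source to hand an event to the channel."""
        self._events.append(event)

    def take_batch(self, max_size):
        """Remove and return up to max_size events, oldest first."""
        batch = []
        while self._events and len(batch) < max_size:
            batch.append(self._events.popleft())
        return batch


def run_sink(channel, transform, load, batch_size=2):
    """Drain the channel in batches, transform each event, and load each batch.

    Returns the number of batches that were loaded.
    """
    batches = 0
    while True:
        batch = channel.take_batch(batch_size)
        if not batch:
            return batches
        load([transform(event) for event in batch])
        batches += 1
```

Batching means the downstream indexer receives a few larger requests instead of one request per event.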
  17. Indexing
      • Near Real Time (NRT)
        • Flume
        • HBase Indexer
      • Batch (MR)
  18. Near Real Time Indexing of Apache HBase
      [Diagram. Labels: HDFS, HBase, interactive load, HBase Indexer(s), Trigger, Solr server (x5), Search]
      [Captions: BIG DATA; DATA MANAGEMENT; planet-sized tabular data; immediate access & updates; fast & flexible information discovery]
  19. Lily HBase Indexer
      • Collaboration between NGData & Cloudera
        • NGData are creators of the Lily data management platform
      • Lily HBase Indexer
        • Service which acts as an HBase replication listener
        • HBase replication features, such as filtering, supported
        • Replication updates trigger indexing of updates (rows)
        • Integrates Cloudera Morphlines library for ETL of rows
        • AL2 licensed on GitHub https://github.com/ngdata
  20. Indexing
      • Near Real Time (NRT)
        • Flume
        • HBase Indexer
      • Batch (MR)
  21. Scalable Batch Indexing
      Solr and MapReduce
      • Flexible, scalable batch indexing
      • Start serving new indices with no downtime
      • On-demand indexing, cost-efficient re-indexing
      [Diagram. Labels: Files, Indexer, Index shard, Solr server, HDFS]
  22. MapReduce Indexer
      MapReduce Job with two parts
      1) Scan HDFS for files to be indexed
         • Much like Unix “find” – see HADOOP-8989
         • Output is NLineInputFormat’ed file
      2) Mapper/Reducer indexing step
         • Mapper extracts content via Cloudera Morphlines
         • Reducer indexes documents via embedded Solr server
      • Originally based on SOLR-1301
      • Many modifications to enable linear scalability
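The two-part job can be approximated locally with the Python sketch below. It assumes a local directory in place of HDFS and plain dictionaries in place of Solr index shards; `find_files`, `split_lines`, `map_extract`, and `reduce_index` are illustrative names, and `map_extract` merely reads the file where the real mapper would run a morphline.

```python
import fnmatch
import os


def find_files(root, pattern):
    """Part 1: walk the tree like Unix find and return the matching paths."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


def split_lines(paths, lines_per_split):
    """Group the path list N lines at a time, one group per mapper."""
    return [paths[i:i + lines_per_split]
            for i in range(0, len(paths), lines_per_split)]


def map_extract(path):
    """Part 2 mapper: turn one file into a document (stand-in for a morphline)."""
    with open(path, encoding="utf-8") as handle:
        return {"id": path, "text": handle.read()}


def reduce_index(shard_index, documents):
    """Part 2 reducer: add documents to one shard's index (a plain dict here)."""
    for doc in documents:
        shard_index[doc["id"]] = doc["text"]
    return shard_index
```

Writing the file list one path per line is what lets the work be divided evenly: each mapper receives a fixed number of paths regardless of how large the directory tree is.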
  23. MapReduce Indexer “golive”
      • Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
      • Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
        • No downtime for users
        • No NRT expense
        • Linear scale out to the size of your MR cluster
  24. Cloudera Morphlines
      • Open Source framework for simple ETL
      • Ships as part of the Cloudera Developer Kit (CDK)
        • It’s a Java library
        • AL2 licensed on GitHub https://github.com/cloudera/cdk
      • Simplify ETL
        • Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages)
        • Configuration over coding
      • Standardize ETL
  25. Cloudera Morphlines Architecture
      [Diagram, in order of data flow:]
      • Logs, tweets, social media, html, images, pdf, text…. Anything you want to index
      • Flume, MR Indexer, HBase indexer, etc... Or your application!
      • Morphline Library
      • Solr, Solr, Solr (SolrCloud)
      Morphlines can be embedded in any application…
  26. Extraction and Mapping
      • Modeled after Unix pipelines
      • Simple and flexible data transformation
      • Reusable across multiple index workloads
      • Over time, extend and re-use across platform workloads
      [Diagram. Labels: syslog, Flume Agent, Solr sink, Morphline Library (Command: readLine, Command: grok, Command: loadSolr), Solr; data passed between stages: Event, Record, Record, Record, Document]
  27. Morphline Example – syslog with grok

      morphlines : [
        {
          id : morphline1
          importCommands : ["com.cloudera.**", "org.apache.solr.**"]
          commands : [
            { readLine {} }
            {
              grok {
                dictionaryFiles : [/tmp/grok-dictionaries]
                expressions : {
                  message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
                }
              }
            }
            { loadSolr {} }
          ]
        }
      ]

      Example Input
        <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.

      Output Record
        syslog_pri:164
        syslog_timestamp:Feb 4 10:46:14
        syslog_hostname:syslog
        syslog_program:sshd
        syslog_pid:607
        syslog_message:listening on 0.0.0.0 port 22.
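To make the record flow of this morphline concrete, here is a Python analogue of the readLine, grok, loadSolr chain. It is a sketch, not the Morphlines library: Python named groups approximate the grok dictionary patterns (the real POSINT, SYSLOGTIMESTAMP, and other definitions are more permissive), `make_loader` appends to a list instead of calling Solr, and dropping a record on a failed match is an assumption of this sketch.

```python
import re

# Approximations of the grok patterns used on the slide, as Python named groups.
SYSLOG_PATTERN = re.compile(
    r"<(?P<syslog_pri>\d+)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>[^\[:]+?)"
    r"(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)"
)


def read_line(record):
    """Decode the raw event body into a 'message' field."""
    return {"message": record["body"].decode("utf-8").rstrip("\n")}


def grok_syslog(record):
    """Extract named fields from 'message'; return None if it does not match."""
    match = SYSLOG_PATTERN.match(record["message"])
    if match is None:
        return None
    fields = {k: v for k, v in match.groupdict().items() if v is not None}
    return {**record, **fields}


def make_loader(sink):
    """Build a final command that appends records to 'sink' (stand-in for Solr)."""
    def load(record):
        sink.append(record)
        return record
    return load


def run_pipeline(commands, record):
    """Pass the record through each command in order; stop if one returns None."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


if __name__ == "__main__":
    indexed = []
    event = {"body": b"<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22."}
    print(run_pipeline([read_line, grok_syslog, make_loader(indexed)], event))
```

Each command takes a record and hands a record to the next one, which mirrors the Unix-pipeline model: commands can be reordered or reused without knowing about each other.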
  28. Current Command Library
      • Integrate with and load into Apache Solr
      • Flexible log file analysis
      • Single-line record, multi-line records, CSV files
      • Regex based pattern matching and extraction
      • Integration with Avro
      • Integration with Apache Hadoop Sequence Files
      • Integration with SolrCell and all Apache Tika parsers
      • Auto-detection of MIME types from binary data using Apache Tika
  29. Current Command Library (cont.)
      • Scripting support for dynamic Java code
      • Operations on fields for assignment and comparison
      • Operations on fields with list and set semantics
      • if-then-else conditionals
      • A small rules engine (tryRules)
      • String and timestamp conversions
      • slf4j logging
      • Yammer metrics and counters
      • Decompression and unpacking of arbitrarily nested container file formats
      • Etc…
  30. Querying
      • Built-in Solr web UI
      • Write your own
      • Hue
  31. Simple, Customizable Search Interface
      Hue
      • Simple UI
      • Navigated, faceted drill down
      • Customizable display
      • Full text search, standard Solr API and query language
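Since any client can use the standard Solr API, a "write your own" UI mostly comes down to building a request against Solr's `select` handler. The sketch below only constructs the URL for a faceted query and sends nothing over the network. The parameters `q`, `wt`, `rows`, `facet`, and `facet.field` are standard Solr query parameters; the host, collection name, and field names in the usage example are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(base_url, collection, text, facet_fields, rows=10):
    """Build a Solr select URL for a query with optional field facets."""
    params = [("q", text), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, collection, and fields, reusing the syslog field names.
    url = build_select_url(
        "http://localhost:8983/solr",
        "collection1",
        "syslog_program:sshd",
        ["syslog_hostname"],
    )
    print(url)
    print(parse_qs(urlsplit(url).query))
```

The facet counts returned for each `facet.field` are what a UI would render as the drill-down links.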
  32. Security
      • Upstream Solr doesn’t really deal with security
      • Goal: use Kerberos, like other CDH components
      • Current release: Support for Kerberos authentication
      • Actively working on index-level authorization
      • Future: more granular authorization
  33. Conclusion
      • Cloudera Search now in public beta
        • Free Download
        • Extensive documentation
        • Send your questions and feedback to search-user@cloudera.org
        • Take the Search online training
      • Cloudera Manager Standard (i.e. the free version)
        • Simple management of Search
        • Free Download
      • QuickStart VM also available!