Adding Search to the
Hadoop Ecosystem
Gregory Chanan (gchanan AT cloudera.com)
Frontier Meetup Dec 2013

Search On Hadoop Frontier Meetup

  1. Adding Search to the Hadoop Ecosystem
     Gregory Chanan (gchanan AT cloudera.com)
     Frontier Meetup, Dec 2013
  2. Agenda
     • Big Data and Search – setting the stage
     • Cloudera Search Architecture
     • Component deep dive
     • Security
     • Conclusion
  3. Why Search? Hadoop for everyone
     • Typical case:
       • Ingest data to storage engine (HDFS, HBase, etc)
       • Process data (MapReduce, Hive, Impala)
     • Experts know MapReduce
     • Savvy people know SQL
     • Everyone knows Search!
  4. Why Search? An Integrated Part of the Hadoop System
     • One pool of data
     • One security framework
     • One set of system resources
     • One management interface
  5. Benefits of Search
     • Improved Big Data ROI
       • An interactive experience without technical knowledge
       • Single data set for multiple computing frameworks
     • Faster time to insight
       • Exploratory analysis, esp. unstructured data
       • Broad range of indexing options to accommodate needs
     • Cost efficiency
       • Single scalable platform; no incremental investment
       • No need for separate systems, storage
  6. What is Cloudera Search?
     • Full-text, interactive search with faceted navigation
     • Apache Solr integrated with CDH
       • Established, mature search with vibrant community
       • In production environments for years
     • Open Source
       • 100% Apache, 100% Solr
       • Standard Solr APIs
     • Batch, near real-time, and on-demand indexing
     • Generally Available; released 1.1 last month
  7. Cloudera Search Components
     • HDFS/MR/Lucene/Solr/SolrCloud
     • Indexing
       • Near Real Time (NRT) indexing
       • Batch
     • ETL – Cloudera Morphlines
     • Querying
  8. Apache Hadoop
     • Apache HDFS
       • Distributed file system
       • High reliability
       • High throughput
     • Apache MapReduce
       • Parallel, distributed programming model
       • Allows processing of large datasets
       • Fault tolerant
  9. Apache Lucene
     • Full text search
       • Indexing
       • Query
     • Traditional inverted index
     • Batch and Incremental indexing
     • We are using version 4.4 in current release
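
To make the "traditional inverted index" and the batch/incremental distinction from the Lucene slide concrete, here is a minimal Python sketch. It is not Lucene's implementation: the class `InvertedIndex` and its methods are hypothetical names, tokenization is plain whitespace splitting, and there is no ranking or on-disk format.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Incremental indexing: add one document to the existing index.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def add_batch(self, docs):
        # Batch indexing: index a whole mapping of doc_id -> text in one call.
        for doc_id, text in docs.items():
            self.add(doc_id, text)

    def search(self, query):
        # AND query: return ids of documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

A query is answered by intersecting posting sets rather than scanning documents, which is what makes interactive full-text search feasible on large collections.
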
  10. Apache Solr
      • Search service built using Lucene
        • Ships with Lucene (same TLP at Apache)
      • Provides XML/HTTP/JSON/Python/Ruby/… APIs
        • Indexing
        • Query
        • Administrative interface
      • Also rich web admin GUI via HTTP
  11. Apache SolrCloud
      • Provides distributed Search capability
      • Part of Solr (not a separate library/codebase)
      • Shards – provide scalability
        • partition index for size
        • replicate for query performance
      • Uses ZooKeeper for coordination
        • No split-brain issues
        • Simplifies operations
  12. SolrCloud Architecture
      • Updates automatically sent to the correct shard
      • Replicas handle queries, forward updates to the leader
      • Leader indexes the document for the shard, and forwards the index notation to itself and any replicas.
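
The routing behaviour on the SolrCloud Architecture slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Solr code: `Shard` and `Cluster` are hypothetical classes, replicas are plain dictionaries, and CRC32 modulo the shard count stands in for Solr's real document router.

```python
import zlib


class Shard:
    """One index partition; replica 0 plays the leader in this sketch."""

    def __init__(self, name, replica_count):
        self.name = name
        self.replicas = [dict() for _ in range(replica_count)]

    def leader_index(self, doc_id, doc):
        # The leader indexes the document and forwards it to every replica.
        for replica in self.replicas:
            replica[doc_id] = doc


class Cluster:
    def __init__(self, num_shards, replicas_per_shard):
        self.shards = [
            Shard(f"shard{i + 1}", replicas_per_shard) for i in range(num_shards)
        ]

    def shard_for(self, doc_id):
        # Assumption: a stand-in hash; Solr's actual router is not reproduced here.
        return self.shards[zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)]

    def update(self, doc_id, doc):
        # Updates are sent to the shard that owns the id, via its leader.
        self.shard_for(doc_id).leader_index(doc_id, doc)

    def get(self, doc_id):
        # Any replica of the owning shard can answer a read.
        return self.shard_for(doc_id).replicas[-1].get(doc_id)
```

Because the shard is derived from the document id alone, a client can send an update to any node and still have it land on the same shard every time.
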
  13. SolrCloud Architecture
      Visual representation via admin UI
  14. Distributed Search on Hadoop
      [Diagram. Labels: ZK, Flume, HBase, MR, HDFS, SolrCloud (three Solr servers), Hue UI, Custom UI, Custom App, Hadoop Cluster; arrows labeled "index" and "query".]
  15. Indexing
      • Near Real Time (NRT)
        • Flume
        • HBase Indexer
      • Batch (MR)
  16. Indexing
      • Near Real Time (NRT)
        • Flume
        • HBase Indexer
      • Batch (MR)
  17. Near Real Time Indexing with Flume
      [Diagram. Labels: Log File, Log File, Other, Flume Agent (two, each with an Indexer), HDFS.]
      Solr and Flume
      • Data ingest at scale
      • Flexible extraction and mapping
      • Indexing at data ingest
  18. Apache Flume - MorphlineSolrSink
      • A Flume Source…
        • Receives/gathers events
      • A Flume Channel…
        • Carries the event – MemoryChannel or reliable FileChannel
      • A Flume Sink…
        • Sends the events on to the next location
      • Flume MorphlineSolrSink
        • Integrates Cloudera Morphlines library
          • ETL, more on that in a bit
        • Does batching
        • Results sent to Solr for indexing
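
The source, channel, and sink roles described on the Flume slide can be modelled with a short Python sketch. It only mimics the data flow: `MemoryChannel`, `source`, and `BatchingSink` are hypothetical stand-ins written for this example, not Flume classes, and the `transform` and `index` callables stand in for the morphline and for Solr.

```python
from collections import deque


class MemoryChannel:
    """Buffers events between source and sink (in memory, so not durable)."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self, max_events):
        # Hand out up to max_events events in arrival order.
        batch = []
        while self._events and len(batch) < max_events:
            batch.append(self._events.popleft())
        return batch


def source(lines, channel):
    # The source receives raw input and places one event per line on the channel.
    for line in lines:
        channel.put({"body": line})


class BatchingSink:
    """Drains the channel in batches, transforms each event, then indexes the batch."""

    def __init__(self, channel, transform, index, batch_size):
        self.channel = channel
        self.transform = transform  # stand-in for the morphline (ETL step)
        self.index = index          # stand-in for sending documents to Solr
        self.batch_size = batch_size

    def process(self):
        batch = self.channel.take(self.batch_size)
        if batch:
            self.index([self.transform(event) for event in batch])
        return len(batch)
```

Batching in the sink amortizes the cost of each indexing request over several events, which is the reason the slide calls it out.
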
  19. Indexing
      • Near Real Time (NRT)
        • Flume
        • HBase Indexer
      • Batch (MR)
  20. Near Real Time Indexing of Apache HBase
      HBase + Search = Big Data Data Management
      • planet-sized tabular data
      • immediate access & updates
      • fast & flexible information discovery
      [Diagram. Labels: interactive load, HBase, HDFS, Replication, HBase Indexer(s), Solr server (five).]
  21. Lily HBase Indexer
      • Collaboration between NGData & Cloudera
        • NGData are creators of the Lily data management platform
      • Lily HBase Indexer
        • Service which acts as an HBase replication listener
          • HBase replication features, such as filtering, supported
        • Replication updates trigger indexing of updates (rows)
        • Integrates Cloudera Morphlines library for ETL of rows
        • AL2 licensed on GitHub https://github.com/ngdata
  22. Indexing
      • Near Real Time (NRT)
        • Flume
        • HBase Indexer
      • Batch (MR)
  23. Scalable Batch Indexing
      [Diagram. Labels: Files (two, on HDFS), Indexer (two), Index shard (two), Solr server (two).]
      Solr and MapReduce
      • Flexible, scalable batch indexing
      • Start serving new indices with no downtime
      • On-demand indexing, cost-efficient re-indexing
  24. MapReduce Indexer
      MapReduce Job with two parts
      1) Scan HDFS for files to be indexed
         • Much like Unix “find” – see HADOOP-8989
         • Output is NLineInputFormat’ed file
      2) Mapper/Reducer indexing step
         • Mapper extracts content via Cloudera Morphlines
         • Reducer indexes documents via embedded Solr server
         • Originally based on SOLR-1301
           • Many modifications to enable linear scalability
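
The two parts of the MapReduce Indexer job can be imitated in a single process with the Python sketch below. Everything here is a simplified assumption: a local directory stands in for HDFS, `mapper` stands in for a morphline, each reducer's dictionary stands in for an index shard built by an embedded Solr server, and CRC32 is an arbitrary partitioning hash. The function names are hypothetical.

```python
import os
import tempfile
import zlib
from collections import defaultdict


def scan(root):
    """Part 1: list every file under root, much like Unix find."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def mapper(path):
    """Part 2, map side: extract one document from one file."""
    with open(path, encoding="utf-8") as handle:
        return {"id": path, "text": handle.read().strip()}


def run_job(root, num_reducers):
    """Part 2, reduce side: route each document to a reducer that builds one shard."""
    shards = defaultdict(dict)
    for path in scan(root):
        doc = mapper(path)
        reducer = zlib.crc32(doc["id"].encode("utf-8")) % num_reducers
        shards[reducer][doc["id"]] = doc
    return dict(shards)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_root:
        with open(os.path.join(demo_root, "a.txt"), "w", encoding="utf-8") as f:
            f.write("hello search")
        print(run_job(demo_root, num_reducers=2))
```

Since every reducer works on its own partition, adding reducers adds indexing capacity, which is the scaling property the slide refers to.
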
  25. MapReduce Indexer “golive”
      • Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
      • Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
        • No downtime for users
        • No NRT expense
        • Linear scale out to the size of your MR cluster
  26. HBase + MapReduce
      • New in Search 1.1: run MapReduce job over HBase tables
        • Same architecture as running over HDFS
        • Similar to HBase’s CopyTable
  27. Cloudera Morphlines
      Open Source framework for simple ETL
      • Simplify ETL
        • Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages)
        • Configuration over coding
        • Standardize ETL
      • Ships as part of Kite SDK, formerly Cloudera Developer Kit (CDK)
        • It’s a Java library
        • AL2 licensed on GitHub https://github.com/kite-sdk
  28. Cloudera Morphlines Architecture
      Morphlines can be embedded in any application…
      [Diagram. Input: logs, tweets, social media, html, images, pdf, text…. Anything you want to index. Host: Flume, MR Indexer, HBase indexer, etc... Or your application! (containing the Morphline Library). Output: SolrCloud (three Solr servers).]
  29. Extraction and Mapping
      [Diagram. syslog, Flume Agent, Event, Solr sink with Morphline Library: Record, Command: readLine, Record, Command: grok, Record, Command: loadSolr, Document, Solr.]
      • Modeled after Unix pipelines
      • Simple and flexible data transformation
      • Reusable across multiple index workloads
      • Over time, extend and reuse across platform workloads
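
The "modeled after Unix pipelines" idea can be shown with a small Python sketch in which each command takes one record and emits zero or more records for the next command. This is an analogy only: morphlines are configured in their own file format and run as a Java library, and the commands below (`read_line`, `parse_key_value`, `make_loader`) are hypothetical examples rather than morphline commands.

```python
def read_line(record):
    # Split the raw event body into one record per line.
    for line in record["body"].splitlines():
        yield {"message": line}


def parse_key_value(record):
    # Hypothetical extraction step: split "key=value" into two fields.
    key, _, value = record["message"].partition("=")
    yield {**record, "key": key.strip(), "value": value.strip()}


def make_loader(store):
    # Final step: append each record to a list that stands in for the search index.
    def load(record):
        store.append(record)
        yield record
    return load


def run_pipeline(commands, record):
    # Feed the output records of each command into the next one, like a Unix pipe.
    records = [record]
    for command in commands:
        records = [out for rec in records for out in command(rec)]
    return records
```

Because every command has the same record-in, records-out shape, the same steps can be reused unchanged whether the host is a Flume sink, a MapReduce job, or another application.
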
  30. Morphline Example – syslog with grok
      morphlines : [
        {
          id : morphline1
          importCommands : ["com.cloudera.**", "org.apache.solr.**"]
          commands : [
            { readLine {} }
            {
              grok {
                dictionaryFiles : [/tmp/grok-dictionaries]
                expressions : {
                  message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
                }
              }
            }
            { loadSolr {} }
          ]
        }
      ]
      Example Input
        <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
      Output Record
        syslog_pri:164
        syslog_timestamp:Feb 4 10:46:14
        syslog_hostname:syslog
        syslog_program:sshd
        syslog_pid:607
        syslog_message:listening on 0.0.0.0 port 22
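
For readers unfamiliar with grok, the expression on this slide is roughly a regular expression with named groups. The Python sketch below approximates it with the standard `re` module. The sub-patterns are simplified stand-ins chosen for this example and are not the definitions from the grok dictionaries, so it will accept or reject some lines differently from the real patterns.

```python
import re

# Each named group mirrors one %{PATTERN:field} from the slide.
# The sub-patterns are simplified assumptions, not the grok dictionary definitions.
SYSLOG_LINE = re.compile(
    r"<(?P<syslog_pri>[1-9][0-9]*)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>[1-9][0-9]*)\])?: "
    r"(?P<syslog_message>.*)"
)


def parse_syslog(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = SYSLOG_LINE.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_syslog("<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22"))
```

Run on the slide's example input, the function yields the same six fields as the output record above; when the optional `[pid]` part is absent, `syslog_pid` is `None`.
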
  31. Current Command Library
      • Integrate with and load into Apache Solr
      • Flexible log file analysis
      • Single-line record, multi-line records, CSV files
      • Regex based pattern matching and extraction
      • Integration with Avro
      • Integration with Apache Hadoop Sequence Files
      • Integration with SolrCell and all Apache Tika parsers
      • Auto-detection of MIME types from binary data using Apache Tika
  32. Current Command Library (cont)
      • Scripting support for dynamic Java code
      • Operations on fields for assignment and comparison
      • Operations on fields with list and set semantics
      • if-then-else conditionals
      • A small rules engine (tryRules)
      • String and timestamp conversions
      • slf4j logging
      • Yammer metrics and counters
      • Decompression and unpacking of arbitrarily nested container file formats
      • Etc…
  33. Querying
      • Built-in Solr web UI
      • Write your own
      • Hue
  34. Simple, Customizable Search Interface
      Hue
      • Simple UI
      • Navigated, faceted drill down
      • Customizable display
      • Full text search, standard Solr API and query language
  35. Security
      • Upstream Solr doesn’t deal with security
      • Search 1.0 supports Kerberos authentication
        • Similar to Oozie / WebHDFS
      • Search 1.1 supports index-level authorization via Apache Sentry (incubating)
  36. Index-Level Authorization
      • Sentry works via “policy files” stored in HDFS
      • Can grant roles administrative-only, query-only, update-only access
      • Example:
          [groups]
          # Assigns each Hadoop group to its set of roles
          dev_ops = engineer_role, ops_role
          [roles]
          engineer_role = collection = source_code->action=*
          ops_role = collection = hbase_logs->action=Query
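
To show how a policy file like the one on this slide maps groups to roles and roles to collection privileges, here is a small Python sketch that parses the example and answers an access question. It is not Sentry's parser or its authorization logic: `parse_policy` and `is_allowed` are hypothetical helpers, matching is exact and case-sensitive, and only the two sections shown on the slide are handled.

```python
EXAMPLE_POLICY = """\
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
engineer_role = collection = source_code->action=*
ops_role = collection = hbase_logs->action=Query
"""


def parse_policy(text):
    """Parse the [groups] and [roles] sections into plain dictionaries."""
    policy = {"groups": {}, "roles": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if section == "groups":
            policy["groups"][name] = [role.strip() for role in value.split(",")]
        elif section == "roles":
            privileges = []
            for privilege in value.split(","):
                resource, _, action = privilege.partition("->")
                collection = resource.partition("=")[2].strip()
                privileges.append((collection, action.partition("=")[2].strip()))
            policy["roles"][name] = privileges
    return policy


def is_allowed(policy, group, collection, action):
    """True if any role of the group grants the action on the collection."""
    for role in policy["groups"].get(group, []):
        for allowed_collection, allowed_action in policy["roles"].get(role, []):
            if allowed_collection == collection and allowed_action in ("*", action):
                return True
    return False
```

With the slide's policy, members of `dev_ops` get every action on `source_code` through `engineer_role`, but only `Query` on `hbase_logs` through `ops_role`.
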
  37. Index-Level Authorization 2
      • Works by hooking into Solr RequestHandlers:
          <requestHandler name="/update" class="solr.UpdateRequestHandler">
            <lst name="defaults">
              <str name="update.chain">updateIndexAuthorization</str>
            </lst>
          </requestHandler>
      • Also includes secure impersonation support
      • Unauthorized attempts get a 401 response and are written to the Solr log
      • Future work: more fine-grained authorization
  38. Conclusion
      • Cloudera Search now Generally Available (1.1)
        • Free Download
        • Extensive documentation
        • Send your questions and feedback to searchuser@cloudera.org
        • Take the Search online training
      • Cloudera Manager Standard (i.e. the free version)
        • Simple management of Search
        • Free Download
      • QuickStart VM also available!
