Adding Search to the
Hadoop Ecosystem
Gregory Chanan (gchanan AT cloudera.com)
Frontier Meetup Dec 2013

Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component deep dive
• Security
• Conclusion
Why Search?
• Hadoop for everyone
• Typical case:
  • Ingest data to storage engine (HDFS, HBase, etc.)
  • Process data (MapReduce, Hive, Impala)
• Experts know MapReduce
• Savvy people know SQL
• Everyone knows Search!
Why Search?
An Integrated Part of the Hadoop System
• One pool of data
• One security framework
• One set of system resources
• One management interface
Benefits of Search
• Improved Big Data ROI
  • An interactive experience without technical knowledge
  • Single data set for multiple computing frameworks
• Faster time to insight
  • Exploratory analysis, esp. unstructured data
  • Broad range of indexing options to accommodate needs
• Cost efficiency
  • Single scalable platform; no incremental investment
  • No need for separate systems, storage
What is Cloudera Search?
• Full-text, interactive search with faceted navigation
• Apache Solr integrated with CDH
  • Established, mature search with vibrant community
  • In production environments for years
• Open Source
  • 100% Apache, 100% Solr
  • Standard Solr APIs
• Batch, near real-time, and on-demand indexing
• Generally Available; released 1.1 last month
Cloudera Search Components
• HDFS/MR/Lucene/Solr/SolrCloud
• Indexing
  • Near Real Time (NRT) indexing
  • Batch
• ETL – Cloudera Morphlines
• Querying
Apache Hadoop
• Apache HDFS
  • Distributed file system
  • High reliability
  • High throughput
• Apache MapReduce
  • Parallel, distributed programming model
  • Allows processing of large datasets
  • Fault tolerant
Apache Lucene
• Full text search
  • Indexing
  • Query
• Traditional inverted index
• Batch and Incremental indexing
• We are using version 4.4 in current release
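As a rough illustration of the inverted index mentioned above, the following Python sketch maps each term to the set of document ids that contain it. It is a toy model and not Lucene's implementation; the tokenizer, the function names, and the sample documents are assumptions made for this example.

```python
from collections import defaultdict


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    # Batch indexing: map each term to the set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def add_document(index, doc_id, text):
    # Incremental indexing: fold one more document into an existing index.
    for term in tokenize(text):
        index[term].add(doc_id)


def search(index, query):
    # AND query: return the documents that contain every query term.
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Hadoop stores big data",
    2: "Solr searches big data",
    3: "Lucene powers Solr",
}
index = build_index(docs)
add_document(index, 4, "Solr runs on Hadoop")
```

A query then becomes a set intersection over the postings of its terms, which is why lookups stay fast as the number of documents grows.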
Apache Solr
• Search service built using Lucene
  • Ships with Lucene (same TLP at Apache)
• Provides XML/HTTP/JSON/Python/Ruby/… APIs
  • Indexing
  • Query
  • Administrative interface
• Also rich web admin GUI via HTTP
Apache SolrCloud
• Provides distributed Search capability
• Part of Solr (not a separate library/codebase)
• Shards – provide scalability
  • partition index for size
  • replicate for query performance
• Uses ZooKeeper for coordination
  • No split-brain issues
  • Simplifies operations
SolrCloud Architecture
• Updates automatically sent to the correct shard
• Replicas handle queries, forward updates to the leader
• Leader indexes the document for the shard, and forwards the index notation to itself and any replicas.
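The update path described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the shard is chosen by an MD5 hash of the document id modulo the shard count, and the leader is always replica 0. SolrCloud's actual router and leader election work differently, and the class and function names here are hypothetical.

```python
import hashlib


def shard_for(doc_id, shard_count):
    # Stable hash of the document id picks the shard (simplified routing).
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


class Shard:
    def __init__(self, replica_count):
        # Each replica is modeled as a dict of doc_id -> document.
        self.replicas = [{} for _ in range(replica_count)]
        self.leader = 0  # fixed leader for simplicity (assumption)

    def update(self, via_replica, doc_id, doc):
        # A non-leader replica forwards the update to the leader.
        forwarded = via_replica != self.leader
        # The leader indexes the document and distributes it to every replica.
        for store in self.replicas:
            store[doc_id] = doc
        return forwarded


class Cluster:
    def __init__(self, shard_count, replica_count):
        self.shards = [Shard(replica_count) for _ in range(shard_count)]

    def index(self, doc_id, doc, via_replica=0):
        # Route the update to the shard that owns this document id.
        shard = self.shards[shard_for(doc_id, len(self.shards))]
        return shard.update(via_replica, doc_id, doc)


cluster = Cluster(shard_count=2, replica_count=3)
forwarded = cluster.index("doc-42", {"title": "Search on Hadoop"}, via_replica=1)
```

Because the hash is stable, the same document id always lands on the same shard, so a later update or delete reaches the copy that was indexed earlier.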
SolrCloud Architecture

Visual representation via admin UI
Distributed Search on Hadoop
[Diagram: on the Hadoop cluster, Flume, HBase, and MapReduce (MR, over HDFS) index into the Solr servers of a SolrCloud coordinated by ZooKeeper (ZK); the Hue UI, a custom UI, and a custom app send queries to SolrCloud.]
Indexing
• Near Real Time (NRT)
  • Flume
  • HBase Indexer
• Batch (MR)
Near Real Time Indexing with Flume
[Diagram: log files and other sources, Flume agents with indexers, HDFS.]
Solr and Flume
• Data ingest at scale
• Flexible extraction and mapping
• Indexing at data ingest
Apache Flume - MorphlineSolrSink
• A Flume Source…
  • Receives/gathers events
• A Flume Channel…
  • Carries the event – MemoryChannel or reliable FileChannel
• A Flume Sink…
  • Sends the events on to the next location
• Flume MorphlineSolrSink
  • Integrates Cloudera Morphlines library
    • ETL, more on that in a bit
  • Does batching
  • Results sent to Solr for indexing
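The source, channel, and sink roles listed above can be modeled with a short Python sketch. It is not Flume's API: the class and function names are hypothetical, the channel is a plain in-memory queue, and the `transform` and `send_batch` callables stand in for the morphline and for the call that sends results to Solr.

```python
from collections import deque


class MemoryChannel:
    # In-memory queue that carries events from the source to the sink.
    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self, max_events):
        # Hand out at most max_events events, oldest first.
        batch = []
        while self._events and len(batch) < max_events:
            batch.append(self._events.popleft())
        return batch


def source(lines, channel):
    # The source receives raw input and puts one event per line on the channel.
    for line in lines:
        channel.put({"body": line})


def sink(channel, transform, send_batch, batch_size):
    # The sink drains the channel in batches, transforms each event,
    # and hands every batch to the next hop. Returns the event count.
    sent = 0
    while True:
        batch = channel.take(batch_size)
        if not batch:
            return sent
        send_batch([transform(event) for event in batch])
        sent += len(batch)


channel = MemoryChannel()
source(["a", "b", "c"], channel)
batches = []
total = sink(channel, lambda e: {"text": e["body"].upper()}, batches.append, batch_size=2)
```

Batching trades a little latency for fewer round trips to the indexing service, which is the usual reason a sink collects events before sending them.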
Near Real Time Indexing of Apache HBase
HBase + Search =
• planet-sized tabular data
• immediate access & updates
• fast & flexible information discovery
[Diagram: Big Data Management; an interactive load hits HBase (on HDFS), HBase replication feeds the HBase Indexer(s), which update the Solr servers.]
Lily HBase Indexer
• Collaboration between NGData & Cloudera
  • NGData are creators of the Lily data management platform
• Lily HBase Indexer
  • Service which acts as an HBase replication listener
    • HBase replication features, such as filtering, supported
  • Replication updates trigger indexing of updates (rows)
  • Integrates Cloudera Morphlines library for ETL of rows
  • AL2 licensed on GitHub https://github.com/ngdata
Scalable Batch Indexing
[Diagram: indexers read files from HDFS and build index shards for the Solr servers.]
Solr and MapReduce
• Flexible, scalable batch indexing
• Start serving new indices with no downtime
• On-demand indexing, cost-efficient re-indexing
MapReduce Indexer
MapReduce Job with two parts
1) Scan HDFS for files to be indexed
  • Much like Unix “find” – see HADOOP-8989
  • Output is NLineInputFormat’ed file
2) Mapper/Reducer indexing step
  • Mapper extracts content via Cloudera Morphlines
  • Reducer indexes documents via embedded Solr server
  • Originally based on SOLR-1301
    • Many modifications to enable linear scalability
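The two-part job above can be approximated in plain Python to show the data flow. This is a sketch under stated assumptions, not the MapReduce indexer itself: the file system is a dict of path to content, the mapper only strips whitespace instead of running a morphline, documents are partitioned by a CRC32 hash, and each "reducer" builds a dict instead of a Solr index shard. All names are hypothetical.

```python
import zlib
from collections import defaultdict


def scan(files, suffix):
    # Part 1: list the paths to index (stand-in for scanning HDFS).
    return sorted(path for path in files if path.endswith(suffix))


def mapper(path, content):
    # Part 2a: extract a document from the raw file content.
    return {"id": path, "text": content.strip()}


def partition(doc_id, num_shards):
    # Stable hash so a document always lands in the same shard.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def run_job(files, suffix, num_shards):
    shuffled = defaultdict(list)
    for path in scan(files, suffix):
        doc = mapper(path, files[path])
        shuffled[partition(doc["id"], num_shards)].append(doc)
    # Part 2b: one "reducer" per shard builds that shard's index.
    return {
        shard: {doc["id"]: doc["text"] for doc in docs}
        for shard, docs in shuffled.items()
    }


files = {"/logs/a.log": " one \n", "/logs/b.log": "two", "/logs/c.txt": "skip"}
shards = run_job(files, ".log", num_shards=2)
```

Since every reducer works on its own shard independently, adding reducers adds indexing capacity, which matches the linear scalability goal stated on the slide.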
MapReduce Indexer “golive”
• Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
• Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
  • No downtime for users
  • No NRT expense
  • Linear scale out to the size of your MR cluster
HBase + MapReduce
• New in Search 1.1: run MapReduce job over HBase tables
  • Same architecture as running over HDFS
  • Similar to HBase’s CopyTable
Cloudera Morphlines
• Open Source framework for simple ETL
• Simplify ETL
  • Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages)
  • Configuration over coding
• Standardize ETL
• Ships as part of Kite SDK, formerly Cloudera Developer Kit (CDK)
  • It’s a Java library
  • AL2 licensed on GitHub https://github.com/kite-sdk
Cloudera Morphlines Architecture
Morphlines can be embedded in any application…
[Diagram: anything you want to index (logs, tweets, social media, html, images, pdf, text…) passes through the Morphline Library, embedded in Flume, the MR Indexer, the HBase indexer, etc., or your own application, and is loaded into the Solr servers of SolrCloud.]
Extraction and Mapping
[Diagram: syslog → Flume Agent → Event → Solr sink, where the Morphline Library passes a Record through the commands readLine, grok, and loadSolr, producing a Document that is sent to Solr.]
• Modeled after Unix pipelines
• Simple and flexible data transformation
• Reusable across multiple index workloads
• Over time, extend and reuse across platform workloads
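The pipeline idea above, where a record flows through a chain of commands, can be sketched in Python as a list of functions. This is only an analogy for the model and not the Morphlines Java API: the function names are hypothetical, and the key=value splitter stands in for a real extraction command such as grok.

```python
def read_line(record):
    # Turn the raw event body into a "message" field.
    return {"message": record["body"].rstrip("\n")}


def split_key_value(record):
    # Stand-in for an extraction command (assumption): parses "key=value".
    key, sep, value = record["message"].partition("=")
    if not sep:
        return None  # command failed, so the record is dropped
    return {**record, key.strip(): value.strip()}


def make_loader(destination):
    # Final command: append the record to a destination list, which plays
    # the role of the search index in this sketch.
    def load(record):
        destination.append(record)
        return record
    return load


def run_pipeline(commands, record):
    # Pass the record through each command in order; stop on failure.
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


indexed = []
pipeline = [read_line, split_key_value, make_loader(indexed)]
ok = run_pipeline(pipeline, {"body": "host = web01\n"})
dropped = run_pipeline(pipeline, {"body": "no separator here"})
```

Keeping each command small and order-independent of the others is what makes the same chain reusable across Flume, MapReduce, and HBase indexing workloads.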
Morphline Example – syslog with grok
morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]
Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
Output Record
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22
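To make the extraction in this example concrete, here is a Python sketch that uses a plain regular expression with named groups in place of the grok expression. The sub-patterns are simplified approximations of the grok dictionary entries (POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA, GREEDYDATA) and are assumptions of this sketch, not the definitions shipped with Morphlines.

```python
import re

# Each named group mirrors one field of the grok expression.
SYSLOG_LINE = re.compile(
    r"<(?P<syslog_pri>[1-9][0-9]*)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>[1-9][0-9]*)\])?: "
    r"(?P<syslog_message>.*)"
)


def parse_syslog(line):
    # Return a dict of extracted fields, or None when the line does not match.
    match = SYSLOG_LINE.match(line)
    return match.groupdict() if match else None


record = parse_syslog(
    "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22"
)
```

The escaped brackets around the pid group matter: unescaped, they would form a character class instead of matching the literal "[607]".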
Current Command Library
• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line record, multi-line records, CSV files
• Regex based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika
Current Command Library (cont)
• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc…
Querying
• Built-in Solr web UI
• Write your own
• Hue
Simple, Customizable Search Interface
Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language
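For readers who want to write their own client against the standard Solr API, the following Python sketch builds a faceted query URL for Solr's select handler. The host, port, collection name, and field names are assumptions chosen to match the examples in this deck; the sketch only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, facet_field=None, rows=10):
    # q, rows, wt, facet, and facet.field are standard Solr query parameters.
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host and collection, used only for illustration.
url = build_select_url(
    "http://localhost:8983/solr",
    "hbase_logs",
    "syslog_program:sshd",
    facet_field="syslog_hostname",
)
```

The facet parameters ask Solr to return value counts for the chosen field alongside the hits, which is what drives the drill-down navigation in a UI such as Hue.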
Security
• Upstream Solr doesn’t deal with security
• Search 1.0 supports Kerberos authentication
  • Similar to Oozie / WebHDFS
• Search 1.1 supports index-level authorization via Apache Sentry (incubating)
Index-Level Authorization
• Sentry works via “policy files” stored in HDFS
• Can grant roles administrative-only, query-only, update-only access
• Example:
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
engineer_role = collection = source_code->action=*
ops_role = collection = hbase_logs->action=Query
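To show how a policy file like the one above resolves to an access decision, here is a Python sketch that parses the groups and roles sections and checks a request. It is a simplified model, not Sentry's implementation: it assumes one privilege per role, exact-match action names, and "*" as a wildcard for any action, and every function name is hypothetical.

```python
POLICY = """
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
engineer_role = collection = source_code->action=*
ops_role = collection = hbase_logs->action=Query
"""


def parse_policy(text):
    # Collect "name = value" entries under their [section] header.
    sections = {"groups": {}, "roles": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            continue
        name, _, value = line.partition("=")
        sections.setdefault(current, {})[name.strip()] = value.strip()
    return sections


def privileges(policy, group):
    # Resolve group -> roles -> (collection, action) pairs.
    grants = []
    role_list = policy["groups"].get(group, "")
    for role in (r.strip() for r in role_list.split(",") if r.strip()):
        rule = policy["roles"].get(role, "")
        if not rule:
            continue
        target, _, action = rule.partition("->")
        collection = target.partition("=")[2].strip()
        grants.append((collection, action.partition("=")[2].strip()))
    return grants


def is_allowed(policy, group, collection, action):
    # A grant matches when the collection is equal and the action is
    # either identical or the "*" wildcard.
    return any(
        c == collection and a in ("*", action)
        for c, a in privileges(policy, group)
    )


policy = parse_policy(POLICY)
```

With this policy, members of dev_ops may perform any action on source_code but may only query hbase_logs.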
Index-Level Authorization 2
• Works by hooking into Solr RequestHandlers:
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">updateIndexAuthorization</str>
  </lst>
</requestHandler>
• Also includes secure impersonation support
• Unauthorized attempts get a 401 response and are written to the Solr log
• Future work: more fine-grained authorization
Conclusion
• Cloudera Search now Generally Available (1.1)
  • Free Download
  • Extensive documentation
  • Send your questions and feedback to searchuser@cloudera.org
  • Take the Search online training
• Cloudera Manager Standard (i.e. the free version)
  • Simple management of Search
  • Free Download
• QuickStart VM also available!

Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 

Search On Hadoop Frontier Meetup

  • 1. Adding Search to the Hadoop Ecosystem
      Gregory Chanan (gchanan AT cloudera.com), Frontier Meetup, Dec 2013
  • 2. Agenda
      • Big Data and Search – setting the stage
      • Cloudera Search Architecture
      • Component deep dive
      • Security
      • Conclusion
  • 3. Why Search? Hadoop for everyone
      • Typical case:
          • Ingest data to storage engine (HDFS, HBase, etc.)
          • Process data (MapReduce, Hive, Impala)
      • Experts know MapReduce
      • Savvy people know SQL
      • Everyone knows Search!
  • 4. Why Search? An Integrated Part of the Hadoop System
      • One pool of data
      • One security framework
      • One set of system resources
      • One management interface
  • 5. Benefits of Search
      • Improved Big Data ROI
          • An interactive experience without technical knowledge
          • Single data set for multiple computing frameworks
      • Faster time to insight
          • Exploratory analysis, esp. unstructured data
          • Broad range of indexing options to accommodate needs
      • Cost efficiency
          • Single scalable platform; no incremental investment
          • No need for separate systems, storage
  • 6. What is Cloudera Search?
      • Full-text, interactive search with faceted navigation
      • Apache Solr integrated with CDH
          • Established, mature search with vibrant community
          • In production environments for years
      • Open Source
          • 100% Apache, 100% Solr
          • Standard Solr APIs
      • Batch, near real-time, and on-demand indexing
      • Generally Available; released 1.1 last month
  • 7. Cloudera Search Components
      • HDFS/MR/Lucene/Solr/SolrCloud
      • Indexing
          • Near Real Time (NRT) indexing
          • Batch
      • ETL – Cloudera Morphlines
      • Querying
  • 8. Apache Hadoop
      • Apache HDFS
          • Distributed file system
          • High reliability
          • High throughput
      • Apache MapReduce
          • Parallel, distributed programming model
          • Allows processing of large datasets
          • Fault tolerant
  • 9. Apache Lucene
      • Full text search
          • Indexing
          • Query
      • Traditional inverted index
      • Batch and Incremental indexing
      • We are using version 4.4 in current release
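To make the "traditional inverted index" on the Lucene slide concrete, here is a minimal Python sketch of the idea: each term maps to the ids of the documents that contain it, and a multi-term query intersects those lists. This is an illustration only, not Lucene's data structures or API; the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "sshd listening on port 22",
    2: "httpd listening on port 80",
    3: "sshd accepted connection",
}
index = build_inverted_index(docs)
print(search(index, "sshd listening"))
```

Lookups touch only the posting lists of the query terms, which is why query cost does not grow with the total amount of text in the way a full scan would.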
  • 10. Apache Solr
      • Search service built using Lucene
          • Ships with Lucene (same TLP at Apache)
      • Provides XML/HTTP/JSON/Python/Ruby/… APIs
          • Indexing
          • Query
          • Administrative interface
      • Also rich web admin GUI via HTTP
  • 11. Apache SolrCloud
      • Provides distributed Search capability
      • Part of Solr (not a separate library/codebase)
      • Shards – provide scalability
          • partition index for size
          • replicate for query performance
      • Uses ZooKeeper for coordination
          • No split-brain issues
          • Simplifies operations
  • 12. SolrCloud Architecture
      • Updates automatically sent to the correct shard
      • Replicas handle queries, forward updates to the leader
      • Leader indexes the document for the shard, and forwards the index notation to itself and any replicas.
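The update and query flow on the two SolrCloud slides can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Solr code: the `Shard` and `Cluster` classes are hypothetical, and CRC32 modulo the shard count stands in for SolrCloud's real document routing, which is not reproduced here.

```python
import zlib


class Shard:
    """One index partition: a leader copy plus zero or more replica copies."""

    def __init__(self, name, replica_count):
        self.name = name
        self.leader = {}
        self.replicas = [{} for _ in range(replica_count)]

    def index(self, doc_id, doc):
        # The leader indexes the document, then forwards it to every replica.
        self.leader[doc_id] = doc
        for replica in self.replicas:
            replica[doc_id] = doc


class Cluster:
    """A set of shards with deterministic routing of updates by document id."""

    def __init__(self, shard_count, replica_count):
        self.shards = [
            Shard(f"shard{i + 1}", replica_count) for i in range(shard_count)
        ]

    def shard_for(self, doc_id):
        # Stand-in hash (assumption): the same id always lands on the same shard.
        bucket = zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)
        return self.shards[bucket]

    def update(self, doc_id, doc):
        shard = self.shard_for(doc_id)
        shard.index(doc_id, doc)
        return shard.name

    def query(self, predicate):
        # Fan out to one copy of every shard (a replica if any) and merge hits.
        hits = []
        for shard in self.shards:
            copy = shard.replicas[0] if shard.replicas else shard.leader
            hits.extend(doc for doc in copy.values() if predicate(doc))
        return hits


cluster = Cluster(shard_count=3, replica_count=2)
for i in range(50):
    cluster.update(f"doc-{i}", {"id": f"doc-{i}", "text": f"event {i}"})
print(len(cluster.query(lambda doc: True)))
```

Partitioning bounds the size of each index, while the extra copies of each shard give queries more places to run, which matches the "partition for size, replicate for query performance" split on the slide.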
  • 14. Distributed Search on Hadoop
      [Architecture diagram. Components shown: ZK, Flume, HBase, MR, HDFS, Solr servers (SolrCloud), Hadoop Cluster, Hue UI, Custom UI, Custom App. Arrows are labeled "index" and "query".]
  • 15. Indexing
      • Near Real Time (NRT)
          • Flume
          • HBase Indexer
      • Batch (MR)
  • 17. Near Real Time Indexing with Flume
      [Diagram labels: Log File, Other, Flume Agent, Indexer, HDFS]
      Solr and Flume
      • Data ingest at scale
      • Flexible extraction and mapping
      • Indexing at data ingest
  • 18. Apache Flume - MorphlineSolrSink
      • A Flume Source…
          • Receives/gathers events
      • A Flume Channel…
          • Carries the event – MemoryChannel or reliable FileChannel
      • A Flume Sink…
          • Sends the events on to the next location
      • Flume MorphlineSolrSink
          • Integrates Cloudera Morphlines library
              • ETL, more on that in a bit
          • Does batching
          • Results sent to Solr for indexing
  • 19. Indexing
      • Near Real Time (NRT)
          • Flume
          • HBase Indexer
      • Batch (MR)
  • 20. Near Real Time Indexing of Apache HBase
      [Diagram: HBase on HDFS (interactive load) → HBase Replication → HBase Indexer(s) → Solr servers]
      • HBase: planet-sized tabular data, immediate access & updates
      • + Search: fast & flexible information discovery
      • = Big Data, Data Management
  • 21. Lily HBase Indexer
      • Collaboration between NGData & Cloudera
          • NGData are creators of the Lily data management platform
      • Lily HBase Indexer
          • Service which acts as an HBase replication listener
              • HBase replication features, such as filtering, supported
          • Replication updates trigger indexing of updates (rows)
          • Integrates Cloudera Morphlines library for ETL of rows
          • AL2 licensed on GitHub: https://github.com/ngdata
  • 22. Indexing
      • Near Real Time (NRT)
          • Flume
          • HBase Indexer
      • Batch (MR)
  • 23. Scalable Batch Indexing
      [Diagram labels: Files, HDFS, Indexer, Index shard, Solr server]
      Solr and MapReduce
      • Flexible, scalable batch indexing
      • Start serving new indices with no downtime
      • On-demand indexing, cost-efficient re-indexing
  • 24. MapReduce Indexer
      MapReduce Job with two parts
      1) Scan HDFS for files to be indexed
          • Much like Unix “find” – see HADOOP-8989
          • Output is NLineInputFormat’ed file
      2) Mapper/Reducer indexing step
          • Mapper extracts content via Cloudera Morphlines
          • Reducer indexes documents via embedded Solr server
      • Originally based on SOLR-1301
          • Many modifications to enable linear scalability
  • 25. MapReduce Indexer “golive”
      • Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
      • Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
          • No downtime for users
          • No NRT expense
          • Linear scale out to the size of your MR cluster
  • 26. HBase + MapReduce
      • New in Search 1.1: run MapReduce job over HBase tables
          • Same architecture as running over HDFS
          • Similar to HBase’s CopyTable
  • 27. Cloudera Morphlines
      • Open Source framework for simple ETL
      • Simplify ETL
          • Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages)
          • Configuration over coding
      • Standardize ETL
      • Ships as part of Kite SDK, formerly Cloudera Developer Kit (CDK)
          • It’s a Java library
          • AL2 licensed on GitHub: https://github.com/kite-sdk
  • 28. Cloudera Morphlines Architecture
      Morphlines can be embedded in any application…
      [Diagram: logs, tweets, social media, html, images, pdf, text… (anything you want to index) → Morphline Library embedded in Flume, MR Indexer, HBase indexer, etc. (or your application!) → Solr servers in SolrCloud]
  • 29. Extraction and Mapping
      [Diagram: syslog → Flume Agent → Event → Solr sink (Morphline Library: Record → Command: readLine → Record → Command: grok → Record → Command: loadSolr) → Document → Solr]
      • Modeled after Unix pipelines
      • Simple and flexible data transformation
      • Reusable across multiple index workloads
      • Over time, extend and reuse across platform workloads
  • 30. Morphline Example – syslog with grok
      morphlines : [
        {
          id : morphline1
          importCommands : ["com.cloudera.**", "org.apache.solr.**"]
          commands : [
            { readLine {} }
            {
              grok {
                dictionaryFiles : [/tmp/grok-dictionaries]
                expressions : {
                  message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
                }
              }
            }
            { loadSolr {} }
          ]
        }
      ]
      • Example Input
          <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.
      • Output Record
          syslog_pri:164
          syslog_timestamp:Feb 4 10:46:14
          syslog_hostname:syslog
          syslog_program:sshd
          syslog_pid:607
          syslog_message:listening on 0.0.0.0 port 22.
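The pipeline model from the two Morphlines slides (a record passes through a chain of commands, each of which may transform or drop it) can be mimicked in plain Python. The sketch below is not the Morphlines API, which is a Java library: the function names are hypothetical, the regular expression only approximates the grok dictionary patterns (`POSINT`, `SYSLOGTIMESTAMP`, and so on), and a Python list stands in for Solr.

```python
import re

# Simplified approximations (assumption) of the grok patterns named on the slide.
SYSLOG = re.compile(
    r"<(?P<syslog_pri>\d+)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)"
)


def read_line(record):
    """Emit one record per line of the raw event body."""
    for line in record["body"].splitlines():
        yield {"message": line}


def grok(record):
    """Add the named captures as fields; drop records that do not match."""
    match = SYSLOG.match(record["message"])
    if match is None:
        return
    fields = {k: v for k, v in match.groupdict().items() if v is not None}
    yield {**record, **fields}


def make_loader(sink):
    """Return a command that appends each record to `sink` (a stand-in for Solr)."""
    def load(record):
        sink.append(record)
        yield record
    return load


def run_pipeline(commands, record):
    """Feed the output records of each command into the next, pipeline style."""
    records = [record]
    for command in commands:
        records = [out for rec in records for out in command(rec)]
    return records


indexed = []
event = {"body": "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22."}
run_pipeline([read_line, grok, make_loader(indexed)], event)
for name, value in indexed[0].items():
    print(f"{name}:{value}")
```

Because every command has the same shape (one record in, zero or more records out), commands can be reordered or reused across workloads, which is the property the "modeled after Unix pipelines" bullet points at.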
  • 31. Current Command Library
      • Integrate with and load into Apache Solr
      • Flexible log file analysis
      • Single-line record, multi-line records, CSV files
      • Regex based pattern matching and extraction
      • Integration with Avro
      • Integration with Apache Hadoop Sequence Files
      • Integration with SolrCell and all Apache Tika parsers
      • Auto-detection of MIME types from binary data using Apache Tika
  • 32. Current Command Library (cont)
      • Scripting support for dynamic Java code
      • Operations on fields for assignment and comparison
      • Operations on fields with list and set semantics
      • if-then-else conditionals
      • A small rules engine (tryRules)
      • String and timestamp conversions
      • slf4j logging
      • Yammer metrics and counters
      • Decompression and unpacking of arbitrarily nested container file formats
      • Etc…
  • 33. Querying
      • Built-in Solr web UI
      • Write your own
      • Hue
  • 34. Simple, Customizable Search Interface
      Hue
      • Simple UI
      • Navigated, faceted drill down
      • Customizable display
      • Full text search, standard Solr API and query language
  • 35. Security
      • Upstream Solr doesn’t deal with security
      • Search 1.0 supports Kerberos authentication
          • Similar to Oozie / WebHDFS
      • Search 1.1 supports index-level authorization via Apache Sentry (incubating)
  • 36. Index-Level Authorization
      • Sentry works via “policy files” stored in HDFS
      • Can grant roles administrative-only, query-only, update-only access
      • Example:
          [groups]
          # Assigns each Hadoop group to its set of roles
          dev_ops = engineer_role, ops_role
          [roles]
          engineer_role = collection = source_code->action=*
          ops_role = collection = hbase_logs->action=Query
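To show how the group → role → privilege chain in the policy file on the slide resolves, here is a small Python sketch that parses the slide's example and answers "may this group perform this action on this collection?". It is an illustration under stated assumptions, not Sentry's implementation: the parser and the `is_allowed` helper are hypothetical, action names are compared case-insensitively as a guess, and only the syntax visible on the slide is handled.

```python
POLICY = """\
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role

[roles]
engineer_role = collection = source_code->action=*
ops_role = collection = hbase_logs->action=Query
"""


def parse_policy(text):
    """Parse sections of `name = value, value` lines into nested dicts of lists."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        if current is None:
            raise ValueError(f"entry outside of a section: {line!r}")
        # Split on the first '=' only; privileges contain further '=' signs.
        name, _, value = line.partition("=")
        current[name.strip()] = [item.strip() for item in value.split(",")]
    return sections


def parse_privilege(text):
    """Turn 'collection = X->action=Y' into the pair (X, Y)."""
    parts = {}
    for part in text.split("->"):
        key, _, value = part.partition("=")
        parts[key.strip()] = value.strip()
    return parts["collection"], parts["action"]


def is_allowed(policy, group, collection, action):
    """Check whether any role of `group` grants `action` on `collection`."""
    for role in policy["groups"].get(group, []):
        for privilege in policy["roles"].get(role, []):
            granted_collection, granted_action = parse_privilege(privilege)
            if granted_collection != collection:
                continue
            if granted_action == "*" or granted_action.lower() == action.lower():
                return True
    return False


policy = parse_policy(POLICY)
print(is_allowed(policy, "dev_ops", "source_code", "Update"))
print(is_allowed(policy, "dev_ops", "hbase_logs", "Update"))
```

With the slide's policy, `dev_ops` gets every action on `source_code` through `engineer_role` but only `Query` on `hbase_logs` through `ops_role`, so the first check succeeds and the second does not.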
  • 37. Index-Level Authorization 2
      • Works by hooking into Solr RequestHandlers:
          <requestHandler name="/update" class="solr.UpdateRequestHandler">
            <lst name="defaults">
              <str name="update.chain">updateIndexAuthorization</str>
            </lst>
          </requestHandler>
      • Also includes secure impersonation support
      • Unauthorized attempts get a 401 response and are written to the Solr log
      • Future work: more fine-grained authorization
  • 38. Conclusion
      • Cloudera Search now Generally Available (1.1)
          • Free Download
          • Extensive documentation
          • Send your questions and feedback to searchuser@cloudera.org
          • Take the Search online training
      • Cloudera Manager Standard (i.e. the free version)
          • Simple management of Search
          • Free Download
      • QuickStart VM also available!