Search On Hadoop

Finding a needle in a stack of needles - adding Search to the Hadoop Ecosystem

Patrick Hunt (@phunt)

Big Data Gurus Meetup July 2013

Agenda

•  Big Data and Search – setting the stage
•  Cloudera Search’s Architecture
•  Component deep dive
•  Early performance insights
•  What’s next?

Feel free to ask questions as we go!

Why Search?

An Integrated Part of the Hadoop System

One pool of data
One security framework
One set of system resources
One management interface

Search Simplifies Interaction

•  User Goals
    •  Explore
    •  Navigate
    •  Correlate
•  Experts know MapReduce
•  Savvy people know SQL
•  Everyone knows Search!

Benefits of Search

•  Improved Big Data ROI
    •  An interactive experience without technical knowledge
    •  Single data set for multiple computing frameworks
•  Faster time to insight
    •  Exploratory analysis, esp. unstructured data
    •  Broad range of indexing options to accommodate needs
•  Cost efficiency
    •  Single scalable platform; no incremental investment
    •  No need for separate systems, storage
•  Solid foundations and reliability
    •  Apache Solr in production environments for years
    •  Hadoop-powered reliability and scalability

What is Cloudera Search?

•  Full-text, interactive search and faceted navigation
•  Batch, near real-time, and on-demand indexing
•  Apache Solr integrated with CDH
    •  Established, mature search with vibrant community
    •  Separate runtime like MapReduce, Impala
    •  Incorporated as part of the Hadoop ecosystem
•  Open Source
    •  100% Apache, 100% Solr
    •  Standard Solr APIs

Cloudera Search Components

•  Refresher – HDFS/MR/Lucene/Solr/SolrCloud
•  HDFSDirectoryFactory/HDFSDirectory
•  BlockDirectory/BlockDirectoryCache
•  Near Real Time (NRT) indexing
    •  Apache Flume MorphlineSolrSink
    •  Lily HBase Indexer
•  Batch – MapReduce Indexer
•  ETL – Cloudera Morphlines
•  Hue Search Application

Apache Hadoop

•  Apache HDFS
    •  Distributed file system
    •  High reliability
    •  High throughput
•  Apache MapReduce
    •  Parallel, distributed programming model
    •  Allows processing of large datasets
    •  Fault tolerant

Apache Lucene

•  Full text search
    •  Indexing
    •  Query
•  Traditional inverted index
•  Batch and Incremental indexing
•  We are using version 4 (4.3 currently)
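
To make "inverted index" concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class name TinyInvertedIndex and the naive lower-case/split tokenizer are assumptions made for illustration. It shows the core idea of mapping each term to the set of documents that contain it, with documents added incrementally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: maps each term to the sorted set of document ids containing it. */
final class TinyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Indexes one document incrementally and returns its assigned id. */
    int add(String text) {
        int docId = nextDocId++;
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Returns the ids of documents containing every query term (AND semantics). */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokenize(query)) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    /** Naive analyzer (an assumption, not Lucene's): lower-case, split on non-alphanumerics. */
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Hadoop stores data in HDFS");      // doc 0
        index.add("Solr builds on Lucene");           // doc 1
        index.add("Search data in Hadoop with Solr"); // doc 2
        System.out.println(index.search("hadoop data")); // prints [0, 2]
        System.out.println(index.search("solr"));        // prints [1, 2]
    }
}
```

A real engine adds term positions, scoring, and on-disk segment files; the sketch keeps only the term-to-documents mapping.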

Apache Solr

•  Search service built using Lucene
•  Ships with Lucene (same TLP at Apache)
•  Provides XML/HTTP/JSON/Python/Ruby/… APIs
    •  Indexing
    •  Query
    •  Administrative interface
•  Also rich web admin GUI via HTTP

Apache SolrCloud

•  Provides distributed Search capability
•  Part of Solr (not a separate library/codebase)
•  Shards - both vertically and horizontally scalable
    •  Horizontally – partition index for size
    •  Vertically – replicate for query performance
•  Uses ZooKeeper for coordination
    •  No split-brain issues
    •  Simplifies operations
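
The two scaling directions can be sketched in a few lines of plain Java. This is an illustration only: the class ShardRouterSketch, the CRC32-modulo shard choice, and the round-robin replica choice are assumptions, and SolrCloud's own routing is more involved. Horizontal scaling picks one shard per document; vertical scaling spreads queries over the replicas of a shard.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sketch of shard selection (horizontal) and replica selection (vertical). */
final class ShardRouterSketch {
    private final int numShards;
    private final int replicasPerShard;

    ShardRouterSketch(int numShards, int replicasPerShard) {
        if (numShards < 1 || replicasPerShard < 1) {
            throw new IllegalArgumentException("counts must be >= 1");
        }
        this.numShards = numShards;
        this.replicasPerShard = replicasPerShard;
    }

    /** Horizontal: every document id maps deterministically to exactly one shard. */
    int shardFor(String docId) {
        CRC32 crc = new CRC32();
        crc.update(docId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numShards); // getValue() is never negative
    }

    /** Vertical: successive queries rotate over the replicas of a shard. */
    int replicaFor(long queryNumber) {
        return (int) Math.floorMod(queryNumber, (long) replicasPerShard);
    }

    public static void main(String[] args) {
        ShardRouterSketch router = new ShardRouterSketch(4, 3);
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard " + router.shardFor(id));
        }
        for (long q = 0; q < 4; q++) {
            System.out.println("query " + q + " -> replica " + router.replicaFor(q));
        }
    }
}
```

Adding shards lets the index grow beyond one node; adding replicas leaves index size per node unchanged but raises query capacity.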

Distributed Search on Hadoop

[Diagram. Labels: Flume; Hue UI; Custom UI; Custom App; SolrCloud (Solr, Solr, Solr); query (x3); index (x3); Hadoop Cluster (MR, HDFS, HBase)]

High Level View

[Diagram. Labels: HDFS; Lucene; Solr; ZooKeeper; SolrCloud; Querying API; Indexing API]

Solr on HDFS

•  Scalable, cost-efficient index storage
•  Higher availability
•  Search and process data in one platform

Cloudera Upstream Contributions

•  SOLR-3911 - Directory/DirectoryFactory now first class
    •  Solr Replication now uses Directory abstraction
    •  Solr Admin UI no longer assumes local directory access
•  SOLR-4916 – support for reading/writing Solr index files and transaction log files to/from HDFS
    •  HDFSDirectoryFactory/HDFSDirectory implementation
•  SOLR-4655 - The Overseer should assign node names by default.
•  SOLR-3706 - Ship setup to log with log4j
•  SOLR-4494 - Clean up and polish Collections API
•  SOLR-4718 - Improvements to configurability
    •  Configuration now entirely through ZooKeeper (optional)
•  Many more improvements/cleanup/hardening/…

Lucene Directory abstraction

•  It’s how Lucene interacts with index files
•  Solr uses it too, but spotty prior to 4.x

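// Simplified view of org.apache.lucene.store.Directory (Lucene 4.x)
public abstract class Directory implements Closeable {
  public abstract String[] listAll() throws IOException;
  public abstract IndexOutput createOutput(String name, IOContext context) throws IOException;
  public abstract IndexInput openInput(String name, IOContext context) throws IOException;
  public abstract void deleteFile(String name) throws IOException;
  public abstract Lock makeLock(String name);
  public abstract void clearLock(String name) throws IOException;
  …
}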
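
Why the abstraction matters can be shown with a much-reduced stand-in written in plain Java. The interface FileStore and the classes below are hypothetical and are not Lucene's API. Index-side code talks only to the interface, so a local, in-memory, or HDFS-backed implementation can be swapped in without changing callers.

```java
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

/** Hypothetical, much-reduced stand-in for a Directory-style abstraction. */
interface FileStore {
    String[] listAll();
    void write(String name, byte[] content);
    byte[] read(String name);
    void deleteFile(String name);
}

/** One backend: keeps files on the heap. Another backend could delegate to HDFS. */
final class InMemoryFileStore implements FileStore {
    private final Map<String, byte[]> files = new TreeMap<>();

    @Override public String[] listAll() {
        return files.keySet().toArray(new String[0]);
    }

    @Override public void write(String name, byte[] content) {
        files.put(name, content.clone());
    }

    @Override public byte[] read(String name) {
        byte[] content = files.get(name);
        if (content == null) {
            throw new NoSuchElementException(name);
        }
        return content.clone();
    }

    @Override public void deleteFile(String name) {
        if (files.remove(name) == null) {
            throw new NoSuchElementException(name);
        }
    }
}

final class FileStoreDemo {
    /** Caller code depends only on the interface, never on where the bytes live. */
    static long totalBytes(FileStore store) {
        long total = 0;
        for (String name : store.listAll()) {
            total += store.read(name).length;
        }
        return total;
    }

    public static void main(String[] args) {
        FileStore store = new InMemoryFileStore();
        store.write("segments_1", new byte[] {1, 2, 3});
        store.write("_0.cfs", new byte[] {4, 5});
        System.out.println(totalBytes(store)); // prints 5
    }
}
```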

HDFSDirectory

•  Originally implemented against Lucene 3 by Blur
•  Cloudera ported to Lucene 4 and now upstream
    •  Solr trunk and version 4.4 (upcoming)
•  Uses the HDFS Client API

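import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

@Override
public IndexInput openInput(String name, IOContext context) throws IOException {
  …
  // Open the index file through the HDFS client API rather than the local file system
  _inputStream = fileSystem.open(path, bufferSize);
  …
}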

HDFSDirectoryFactory

•  Enables plugin of HDFSDirectory into Solr
•  Configurable through solrconfig.xml
•  Also handles
    •  Directory configuration
    •  Compositing of Directory(s)
        •  NRTCachingDirectory
        •  BlockDirectory/BlockDirectoryCache

BlockDirectory/BlockDirectoryCache

•  In memory cache of index file blocks
•  Caches on read, in some cases on write
•  Compensate for less effective file system cache
•  Uses DirectByteBuffer, not JVM heap (default)
•  Size configurable by user
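
The idea of an off-heap block cache can be sketched in plain Java as follows. This is not Solr's BlockDirectoryCache: the class BlockCacheSketch, the 8 KB block size, the LRU eviction policy, and the per-block allocation are all assumptions chosen for brevity. Blocks are keyed by file name and block number and stored in direct buffers outside the JVM heap.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of an off-heap LRU cache of index file blocks (hypothetical). */
final class BlockCacheSketch {
    static final int BLOCK_SIZE = 8192; // assumed block size

    private final LinkedHashMap<String, ByteBuffer> blocks;

    BlockCacheSketch(final int maxBlocks) {
        // accessOrder=true keeps entries ordered from least to most recently used
        this.blocks = new LinkedHashMap<String, ByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, ByteBuffer> eldest) {
                return size() > maxBlocks; // evict once the configured size is exceeded
            }
        };
    }

    private static String key(String file, long blockId) {
        return file + "#" + blockId;
    }

    /** Stores a copy of one block in a direct (off-heap) buffer. */
    void put(String file, long blockId, byte[] data) {
        if (data.length > BLOCK_SIZE) {
            throw new IllegalArgumentException("block too large");
        }
        ByteBuffer buf = ByteBuffer.allocateDirect(data.length);
        buf.put(data);
        buf.flip();
        blocks.put(key(file, blockId), buf);
    }

    /** Returns the cached block, or null on a miss (the caller would then read from HDFS). */
    byte[] get(String file, long blockId) {
        ByteBuffer buf = blocks.get(key(file, blockId));
        if (buf == null) {
            return null;
        }
        byte[] out = new byte[buf.remaining()];
        buf.duplicate().get(out); // duplicate() leaves the cached buffer's position untouched
        return out;
    }

    int size() {
        return blocks.size();
    }

    public static void main(String[] args) {
        BlockCacheSketch cache = new BlockCacheSketch(2);
        cache.put("_0.fdt", 0, new byte[] {1, 2, 3});
        cache.put("_0.fdt", 1, new byte[] {4});
        cache.get("_0.fdt", 0);                 // touch block 0, so block 1 is now eldest
        cache.put("_0.tim", 0, new byte[] {9}); // evicts block 1
        System.out.println(cache.get("_0.fdt", 1) == null); // prints true
    }
}
```

A production cache would typically preallocate a few large buffers and slice them rather than allocate one direct buffer per block.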

Near Real Time Indexing with Flume

Solr and Flume

•  Data ingest at scale
•  Flexible extraction and mapping
•  Indexing at data ingest

[Diagram. Labels: Log File; Flume Agent; Indexer; Other Log File; Flume Agent; Indexer; HDFS]

Apache Flume - MorphlineSolrSink

•  A Flume Source…
    •  Receives/gathers events
•  A Flume Channel…
    •  Carries the event – MemoryChannel or reliable FileChannel
•  A Flume Sink…
    •  Sends the events on to the next location
•  Flume MorphlineSolrSink
    •  Integrates Cloudera Morphlines library
        •  ETL, more on that in a bit
    •  Does batching
    •  Results sent to Solr for indexing
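
The source, channel, and sink roles, plus sink-side batching, can be sketched in plain Java. This does not use the Flume API: the class FlumePipelineSketch is hypothetical, a bounded in-memory queue stands in for the channel, and a list of batches stands in for Solr.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Sketch of source -> channel -> batching sink (plain Java, not the Flume API). */
final class FlumePipelineSketch {
    private final BlockingQueue<String> channel; // plays the role of a memory channel
    private final int batchSize;
    private final List<List<String>> delivered = new ArrayList<>(); // stand-in for Solr

    FlumePipelineSketch(int channelCapacity, int batchSize) {
        this.channel = new ArrayBlockingQueue<>(channelCapacity);
        this.batchSize = batchSize;
    }

    /** Source side: returns false when the channel is full so the caller can retry. */
    boolean offerEvent(String event) {
        return channel.offer(event);
    }

    /** Sink side: takes up to batchSize events and delivers them as one batch. */
    int drainOnce() {
        List<String> batch = new ArrayList<>(batchSize);
        channel.drainTo(batch, batchSize);
        if (!batch.isEmpty()) {
            delivered.add(batch);
        }
        return batch.size();
    }

    List<List<String>> delivered() {
        return delivered;
    }

    public static void main(String[] args) {
        FlumePipelineSketch pipeline = new FlumePipelineSketch(10, 3);
        for (int i = 0; i < 7; i++) {
            pipeline.offerEvent("event-" + i);
        }
        while (pipeline.drainOnce() > 0) {
            // keep draining until the channel is empty
        }
        System.out.println(pipeline.delivered().size()); // prints 3 (batches of 3, 3 and 1)
    }
}
```

Batching trades a little latency for fewer, larger update requests to the index.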

Near Real Time indexing of Apache HBase

[Diagram. Labels: HDFS; HBase; interactive load; Indexer(s); Triggers on updates; Solr server (x5); Search]

Caption text: planet-sized tabular data, immediate access & updates + fast & flexible information discovery = BIG DATA DATA MANAGEMENT

Lily HBase Indexer

•  Collaboration between NGData & Cloudera
    •  NGData are creators of the Lily data management platform
•  Lily HBase Indexer
    •  Service which acts as an HBase replication listener
        •  HBase replication features, such as filtering, supported
    •  Replication updates trigger indexing of updates (rows)
    •  Integrates Cloudera Morphlines library for ETL of rows
    •  AL2 licensed on github https://github.com/ngdata

Scalable Batch Indexing

Solr and MapReduce

•  Flexible, scalable batch indexing
•  Start serving new indices with no downtime
•  On-demand indexing, cost-efficient re-indexing

[Diagram. Labels: HDFS; Files (x2); Indexer (x2); Index shard (x2); Solr server (x2)]

Scalable Batch Indexing

[Diagram. Three Mappers (parse input into indexable document); arbitrary reducing steps of indexing and merging; End-Reducer (shard 1): index document, producing Index shard 1; End-Reducer (shard 2): index document, producing Index shard 2]

MapReduce Indexer

MapReduce Job with two parts

1) Scan HDFS for files to be indexed
    •  Much like Unix “find” – see HADOOP-8989
    •  Output is NLineInputFormat’ed file
2) Mapper/Reducer indexing step
    •  Mapper extracts content via Cloudera Morphlines
    •  Reducer indexes documents via embedded Solr server
    •  Originally based on SOLR-1301
        •  Many modifications to enable linear scalability

MapReduce Indexer “golive”

•  Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing
•  Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster
    •  No downtime for users
    •  No NRT expense
    •  Linear scale out to the size of your MR cluster

Cloudera Morphlines

•  Open Source framework for simple ETL
•  Ships as part of Cloudera Developer Kit (CDK)
    •  It’s a Java library
    •  AL2 licensed on github https://github.com/cloudera/cdk
•  Similar to Unix pipelines
    •  Configuration over coding
•  Supports common Hadoop formats
    •  Avro
    •  Sequence file
    •  Text
    •  Etc…
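
The "similar to Unix pipelines" idea can be sketched as a chain of small commands, each taking a record and passing a transformed record to the next. This is plain Java and not the Morphlines API: the class CommandChainSketch and the three commands (trimMessage, dropEmpty, tagLevel) are hypothetical. A real morphline declares its chain in a configuration file rather than in code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Sketch of a record-transforming command chain (hypothetical, not the Morphlines API). */
final class CommandChainSketch {
    /** Hypothetical command: trims the "message" field. */
    static Map<String, String> trimMessage(Map<String, String> record) {
        Map<String, String> out = new LinkedHashMap<>(record);
        out.put("message", record.getOrDefault("message", "").trim());
        return out;
    }

    /** Hypothetical command: drops records with an empty message by returning null. */
    static Map<String, String> dropEmpty(Map<String, String> record) {
        return record.get("message").isEmpty() ? null : record;
    }

    /** Hypothetical command: derives a "level" field from the message prefix. */
    static Map<String, String> tagLevel(Map<String, String> record) {
        Map<String, String> out = new LinkedHashMap<>(record);
        out.put("level", record.get("message").startsWith("ERROR") ? "error" : "info");
        return out;
    }

    /** The chain itself is just an ordered list, so commands are easy to reorder or reuse. */
    static List<UnaryOperator<Map<String, String>>> chain() {
        List<UnaryOperator<Map<String, String>>> commands = new ArrayList<>();
        commands.add(CommandChainSketch::trimMessage);
        commands.add(CommandChainSketch::dropEmpty);
        commands.add(CommandChainSketch::tagLevel);
        return commands;
    }

    /** Wraps a raw line in a record and runs it through every command in order. */
    static Map<String, String> run(String line) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("message", line);
        for (UnaryOperator<Map<String, String>> command : chain()) {
            record = command.apply(record);
            if (record == null) {
                return null; // a command dropped the record
            }
        }
        return record;
    }

    public static void main(String[] args) {
        System.out.println(run("  ERROR disk full ")); // prints {message=ERROR disk full, level=error}
        System.out.println(run("   "));                // prints null
    }
}
```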

Cloudera Morphlines Architecture

[Diagram. Inputs: logs, tweets, social media, html, images, pdf, text…. Anything you want to index. Hosts for the Morphline Library: Flume, MR Indexer, HBase indexer, etc... Or your application! Output: SolrCloud (Solr, Solr, Solr)]

Morphlines can be embedded in any application…

Extraction and Mapping

•  Simple and flexible data transformation
•  Reusable across multiple index workloads
•  Over time, extend and re-use across platform workloads

[Diagram. syslog; Flume Agent; Solr sink containing the Morphline Library with Command: readLine, Command: grok, Command: loadSolr; Solr. Data labels: Event, Record, Record, Record, Document]

Current Command Library

•  Integrate with and load into Apache Solr
•  Flexible log file analysis
•  Single-line record, multi-line records, CSV files
•  Regex based pattern matching and extraction
•  Integration with Avro
•  Integration with Apache Hadoop Sequence Files
•  Integration with SolrCell and all Apache Tika parsers
•  Auto-detection of MIME types from binary data using Apache Tika

Current Command Library (cont)

•  Scripting support for dynamic Java code
•  Operations on fields for assignment and comparison
•  Operations on fields with list and set semantics
•  if-then-else conditionals
•  A small rules engine (tryRules)
•  String and timestamp conversions
•  slf4j logging
•  Yammer metrics and counters
•  Decompression and unpacking of arbitrarily nested container file formats
•  Etc…

Morphline Example – syslog with grok

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input

<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.

Output Record

syslog_pri:164
syslog_timestamp:Feb  4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22.
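
For readers without a grok dictionary at hand, the same extraction can be approximated with a plain java.util.regex pattern. This is a sketch under assumptions: the class GrokSyslogSketch is hypothetical, and the sub-patterns are hand-written approximations of POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA and GREEDYDATA rather than the dictionary definitions. Java group names cannot contain underscores, so short group names are mapped to the syslog_* field names afterwards.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: parse one syslog line into the fields shown in the Output Record. */
final class GrokSyslogSketch {
    // Hand-written approximations of the grok patterns used in the morphline.
    private static final Pattern SYSLOG = Pattern.compile(
            "<(?<pri>[1-9][0-9]*)>"
            + "(?<timestamp>[A-Z][a-z]{2} +[0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2}) "
            + "(?<hostname>[\\w.-]+) "
            + "(?<program>.*?)(?:\\[(?<pid>[1-9][0-9]*)\\])?: "
            + "(?<message>.*)");

    private static final String[] GROUPS =
            {"pri", "timestamp", "hostname", "program", "pid", "message"};

    /** Returns the extracted fields in pattern order, or null if the line does not match. */
    static Map<String, String> parse(String line) {
        Matcher m = SYSLOG.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (String group : GROUPS) {
            String value = m.group(group);
            if (value != null) { // the pid group is optional
                record.put("syslog_" + group, value);
            }
        }
        return record;
    }

    public static void main(String[] args) {
        String line = "<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.";
        for (Map.Entry<String, String> field : parse(line).entrySet()) {
            System.out.println(field.getKey() + ":" + field.getValue());
        }
    }
}
```

Run against the example input, the sketch prints the same six fields as the Output Record; a line without a bracketed pid simply yields no syslog_pid field.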

Simple, Customizable Search Interface

Hue

•  Simple UI
•  Navigated, faceted drill down
•  Customizable display
•  Full text search, standard Solr API and query language

Performance

•  Cloudera internal testing results
•  Cisco WebEx results from Hadoop Summit 2013

Cloudera Internal Testing

•  We’ve looked at
    •  NRT and Batch indexing
    •  Query performance
•  Performance has been similar to Solr on local disk
    •  Indexing/query operations are typically CPU bound
    •  Caching obviously plays a big factor for queries
•  Limited use cases explored – public beta helping here!

Details shared by WebEx at 2013 Summit

•  Cisco presented on their use of Flume, Cloudera Search, and Cloudera Morphlines
    •  Indexing log events in Near Real Time via Flume
•  Cisco UCS C240 M3 servers
    •  2 quad cores @ 2.3 GHz
    •  16 GB RAM
    •  12 x 3 TB storage
•  Ingest rate
    •  70k events/sec, 1.2 TB/day inbound

What’s next

•  Usability – “solrctl”
•  Security
    •  Index, Document and (eventually) Field level security
•  Lots of scalability/performance work to be done
    •  What are the best Solr/Lucene settings for HDFS?
    •  Investigate short circuit HDFS reads
    •  BlockDirectoryCache tuning
    •  HDFS block affinity
•  More sophisticated index management
    •  Take advantage of collection alias support (SOLR-4497)

Conclusion

•  Cloudera Search now in public beta
    •  Free Download
    •  Extensive documentation
    •  Send your questions and feedback to search-user@cloudera.org
    •  Take the Search online training
•  Cloudera Manager Standard (i.e. the free version)
    •  Simple management of Search
    •  Free Download
•  QuickStart VM also available!
