From idea to production in a day – Leveraging Azure ML and Streamlit to build...
Search On Hadoop
1. 1
Finding
a
needle
in
a
stack
of
needles
-‐
adding
Search
to
the
Hadoop
Ecosystem
Patrick
Hunt
(@phunt)
Big
Data
Gurus
Meetup
July
2013
2. Agenda
• Big
Data
and
Search
–
seIng
the
stage
• Cloudera
Search’s
Architecture
• Component
deep
dive
• Early
performance
insights
• What’s
next?
Feel
free
to
ask
quesQons
as
we
go!
3. Why
Search?
An
Integrated
Part
of
the
Hadoop
System
One
pool
of
data
One
security
framework
One
set
of
system
resources
One
management
interface
5. Benefits
of
Search
• Improved
Big
Data
ROI
• An
interacQve
experience
without
technical
knowledge
• Single
data
set
for
mulQple
compuQng
frameworks
• Faster
Qme
to
insight
• Exploratory
analysis,
esp.
unstructured
data
• Broad
range
of
indexing
opQons
to
accommodate
needs
• Cost
efficiency
• Single
scalable
pla`orm;
no
incremental
investment
• No
need
for
separate
systems,
storage
• Solid
foundaQons
and
reliability
• Apache
Solr
in
producQon
environments
for
years
• Hadoop-‐powered
reliability
and
scalability
6. What
is
Cloudera
Search?
• Full-‐text,
interacQve
search
and
faceted
navigaQon
• Batch,
near
real-‐Qme,
and
on-‐demand
indexing
• Apache
Solr
integrated
with
CDH
• Established,
mature
search
with
vibrant
community
• Separate
runQme
like
MapReduce,
Impala
• Incorporated
as
part
of
the
Hadoop
ecosystem
• Open
Source
• 100%
Apache,
100%
Solr
• Standard
Solr
APIs
8. Apache
Hadoop
• Apache
HDFS
• Distributed
file
system
• High
reliability
• High
throughput
• Apache
MapReduce
• Parallel,
distributed
programming
model
• Allows
processing
of
large
datasets
• Fault
tolerant
9. Apache
Lucene
• Full
text
search
• Indexing
• Query
• TradiQonal
inverted
index
• Batch
and
Incremental
indexing
• We
are
using
version
4
(4.3
currently)
10. Apache
Solr
• Search
service
built
using
Lucene
• Ships
with
Lucene
(same
TLP
at
Apache)
• Provides
XML/HTTP/JSON/Python/Ruby/…
APIs
• Indexing
• Query
• AdministraQve
interface
• Also
rich
web
admin
GUI
via
HTTP
11. Apache
SolrCloud
• Provides
distributed
Search
capability
• Part
of
Solr
(not
a
separate
library/codebase)
• Shards
-‐
both
verQcally
and
horizontally
scaleable
• Horizontally
–
parQQon
index
for
size
• VerQcally
–
replicate
for
query
performance
• Uses
ZooKeeper
for
coordinaQon
• No
split-‐brain
issues
• Simplifies
operaQons
12. Distributed
Search
on
Hadoop
Flume
Hue
UI
Custom
UI
Custom
App
Solr
Solr
Solr
SolrCloud
query
query
query
index
Hadoop
Cluster
MR
HDFS
index
HBase
index
13. High
Level
View
13
HDFS
Lucene
Solr
ZooKeeper
SolrCloud
Querying
API
Indexing
API
Solr
on
HDFS
• Scalable,
cost-‐efficient
index
storage
• Higher
availability
• Search
and
process
data
in
one
pla`orm
14. Cloudera
Upstream
ContribuQons
• SOLR-‐3911
-‐
Directory/DirectoryFactory
now
first
class
• Solr
ReplicaQon
now
uses
Directory
abstracQon
• Solr
Admin
UI
no
longer
assumes
local
directory
access
• SOLR-‐4916
–
support
for
reading/wriQng
Solr
index
files
and
transacQon
log
files
to/from
HDFS
• HDFSDirectoryFactory/HDFSDirectory
implementaQon
• SOLR-‐4655
-‐
The
Overseer
should
assign
node
names
by
default.
• SOLR-‐3706
-‐
Ship
setup
to
log
with
log4j
• SOLR-‐4494
-‐
Clean
up
and
polish
CollecQons
API
• SOLR-‐4718
-‐Improvements
to
configurability
• ConfiguraQon
now
enQrely
through
ZooKeeper
(opQonal)
• Many
more
improvements/cleanup/hardening/…
15. Lucene
Directory
abstracQon
• It’s
how
Lucene
interacts
with
index
files
• Solr
uses
it
too,
but
spory
prior
to
4.x
Class Directory {
listAll();
createOutput(file, context);
openInput(file, context);
deleteFile(file);
makeLock(file);
clearLock(file);
…
}
16. HDFSDirectory
• Originally
implemented
against
Lucene
3
by
Blur
• Cloudera
ported
to
Lucene
4
and
now
upstream
• Solr
trunk
and
version
4.4
(upcoming)
• Uses
the
HDFS
Client
API
import org.apache.hadoop.fs.FileSystem;
public IndexInput openInput(file, context){
…
_inputStream = fileSystem.open(path, bufferSize);
…
}
17. HDFSDirectoryFactory
• Enables
plugin
of
HDFSDirectory
into
Solr
• Configurable
through
solrconfig.xml
• Also
handles
• Directory
configuraQon
• ComposiQng
of
Directory(s)
• NRTCachingDirectory
• BlockDirectory/BlockDirectoryCache
18. BlockDirectory/BlockDirectoryCache
• In
memory
cache
of
index
file
blocks
• Caches
on
read,
in
some
cases
on
write
• Compensate
for
less
effecQve
file
system
cache
• Uses
DirectByteBuffer,
not
JVM
heap
(default)
• Size
configurable
by
user
19. Near
Real
Time
Indexing
with
Flume
Log
File
Solr
and
Flume
• Data
ingest
at
scale
• Flexible
extracQon
and
mapping
• Indexing
at
data
ingest
HDFS
Flume
Agent
Indexer
Other
Log
File
Flume
Agent
Indexer
19
20. Apache
Flume
-‐
MorphlineSolrSink
• A
Flume
Source…
• Receives/gathers
events
• A
Flume
Channel…
• Carries
the
event
–
MemoryChannel
or
reliable
FileChannel
• A
Flume
Sink…
• Sends
the
events
on
to
the
next
locaQon
• Flume
MorphlineSolrSink
• Integrates
Cloudera
Morphlines
library
• ETL,
more
on
that
in
a
bit
• Does
batching
• Results
sent
to
Solr
for
indexing
21. Near
Real
Time
indexing
of
Apache
HBase
HDFS
HBase
interacQve
load
Indexer(s)
Triggers
on
updates
Solr
server
Solr
server
Solr
server
Solr
server
Solr
server
Search
+
=
planet-‐sized
tabular
data
immediate
access
&
updates
fast
&
flexible
informaFon
discovery
BIG
DATA
DATAMANAGEMENT
22. Lily
HBase
Indexer
• CollaboraQon
between
NGData
&
Cloudera
• NGData
are
creators
of
the
Lily
data
management
pla`orm
• Lily
HBase
Indexer
• Service
which
acts
as
a
HBase
replicaQon
listener
• HBase
replicaQon
features,
such
as
filtering,
supported
• ReplicaQon
updates
trigger
indexing
of
updates
(rows)
• Integrates
Cloudera
Morphlines
library
for
ETL
of
rows
• AL2
licensed
on
github
hrps://github.com/ngdata
23. Scalable
Batch
Indexing
Index
shard
Files
Index
shard
Indexer
Files
Solr
server
Indexer
Solr
server
23
HDFS
Solr
and
MapReduce
• Flexible,
scalable
batch
indexing
• Start
serving
new
indices
with
no
downQme
• On-‐demand
indexing,
cost-‐
efficient
re-‐indexing
24. Scalable
Batch
Indexing
24
Mapper:
Parse
input
into
indexable
document
Mapper:
Parse
input
into
indexable
document
Mapper:
Parse
input
into
indexable
document
Index
shard
1
Index
shard
2
Arbitrary
reducing
steps
of
indexing
and
merging
End-‐Reducer
(shard
1):
Index
document
End-‐Reducer
(shard
2):
Index
document
25. MapReduce
Indexer
MapReduce
Job
with
two
parts
1)
Scan
HDFS
for
files
to
be
indexed
• Much
like
Unix
“find”
–
see
HADOOP-‐8989
• Output
is
NLineInputFormat’ed
file
2)
Mapper/Reducer
indexing
step
• Mapper
extracts
content
via
Cloudera
Morphlines
• Reducer
indexes
documents
via
embedded
Solr
server
• Originally
based
on
SOLR-‐1301
• Many
modificaQons
to
enable
linear
scalability
26. MapReduce
Indexer
“golive”
• Cloudera
created
this
to
bridge
the
gap
between
NRT
(low
latency,
expensive)
and
Batch
(high
latency,
cheap
at
scale)
indexing
• Results
of
MR
indexing
operaQon
are
immediately
merged
into
a
live
SolrCloud
serving
cluster
• No
downQme
for
users
• No
NRT
expense
• Linear
scale
out
to
the
size
of
your
MR
cluster
27. Cloudera
Morphlines
• Open
Source
framework
for
simple
ETL
• Ships
as
part
Cloudera
Developer
Kit
(CDK)
• It’s
a
Java
library
• AL2
licensed
on
github
hrps://github.com/cloudera/cdk
• Similar
to
Unix
pipelines
• ConfiguraQon
over
coding
• Supports
common
Hadoop
formats
• Avro
• Sequence
file
• Text
• Etc…
28. Cloudera
Morphlines
Architecture
Solr
Solr
Solr
SolrCloud
Logs,
tweets,
social
media,
html,
images,
pdf,
text….
Anything
you
want
to
index
Flume,
MR
Indexer,
HBase
indexer,
etc...
Or
your
applicaQon!
Morphline
Library
Morphlines
can
be
embedded
in
any
applicaQon…
29. ExtracQon
and
Mapping
• Simple
and
flexible
data
transformaQon
• Reusable
across
mulQple
index
workloads
• Over
Qme,
extend
and
re-‐
use
across
pla`orm
workloads
syslog
Flume
Agent
Solr
sink
Command:
readLine
Command:
grok
Command:
loadSolr
Solr
Event
Record
Record
Record
Document
Morphline
Library
30. Current
Command
Library
• Integrate
with
and
load
into
Apache
Solr
• Flexible
log
file
analysis
• Single-‐line
record,
mulQ-‐line
records,
CSV
files
• Regex
based
parern
matching
and
extracQon
• IntegraQon
with
Avro
• IntegraQon
with
Apache
Hadoop
Sequence
Files
• IntegraQon
with
SolrCell
and
all
Apache
Tika
parsers
• Auto-‐detecQon
of
MIME
types
from
binary
data
using
Apache
Tika
31. Current
Command
Library
(cont)
• ScripQng
support
for
dynamic
java
code
• OperaQons
on
fields
for
assignment
and
comparison
• OperaQons
on
fields
with
list
and
set
semanQcs
• if-‐then-‐else
condiQonals
• A
small
rules
engine
(tryRules)
• String
and
Qmestamp
conversions
• slf4j
logging
• Yammer
metrics
and
counters
• Decompression
and
unpacking
of
arbitrarily
nested
container
file
formats
• Etc…
32. Morphline
Example
–
syslog
with
grok
morphlines
:
[
{
id
:
morphline1
importCommands
:
["com.cloudera.**",
"org.apache.solr.**"]
commands
:
[
{
readLine
{}
}
{
grok
{
dicQonaryFiles
:
[/tmp/grok-‐dicQonaries]
expressions
:
{
message
:
"""<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_Qmestamp}
%
{SYSLOGHOST:syslog_hostname}
%{DATA:syslog_program}(?:[%{POSINT:syslog_pid}])?:
%
{GREEDYDATA:syslog_message}"""
}
}
}
{
loadSolr
{}
}
]
}
]
Example
Input
<164>Feb
4
10:46:14
syslog
sshd[607]:
listening
on
0.0.0.0
port
22
Output
Record
syslog_pri:164
syslog_Qmestamp:Feb
4
10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening
on
0.0.0.0
port
22.
33. Simple,
Customizable
Search
Interface
Hue
• Simple
UI
• Navigated,
faceted
drill
down
• Customizable
display
• Full
text
search,
standard
Solr
API
and
query
language
35. Cloudera
Internal
TesQng
• We’ve
looked
at
• NRT
and
Batch
indexing
• Query
performance
• Performance
has
been
similar
to
Solr
on
local
disk
• Indexing/query
operaQons
are
typically
CPU
bound
• Caching
obviously
plays
a
big
factor
for
queries
• Limited
use
cases
explored
–
public
beta
helping
here!
36. Details
shared
by
WebEx
at
2013
Summit
• Cisco
presented
on
their
use
of
Flume,
Cloudera
Search,
and
Cloudera
Morphlines
• Indexing
log
events
in
Near
Real
Time
via
Flume
• Cisco
UCS
C240
M3
servers
• 2
quad
cores
@2.3ghz
• 16gb
RAM
• 12
x
3TB
storage
• Ingest
rate
• 70k
events/sec,
1.2
TB/day
inbound
37. What’s
next
• Usability
–
“solrctl”
• Security
• Index,
Document
and
(eventually)
Field
level
security
• Lots
of
scalability/performance
work
to
be
done
• What
are
the
best
Solr/Lucene
seIngs
for
HDFS?
• InvesQgate
short
circuit
HDFS
reads
• BlockDirectoryCache
tuning
• HDFS
block
affinity
• More
sophisQcated
index
management
• Take
advantage
of
collecQon
alias
support
(SOLR-‐4497)
38. Conclusion
• Cloudera
Search
now
in
public
beta
• Free
Download
• Extensive
documentaQon
• Send
your
quesQons
and
feedback
to
search-‐user@cloudera.org
• Take
the
Search
online
training
• Cloudera
Manager
Standard
(i.e.
the
free
version)
• Simple
management
of
Search
• Free
Download
• QuickStart
VM
also
available!