This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined and used to solve specific and sometimes very complex problems. Drawing upon case studies from the field, Mr. Moundalexis demonstrates that one-size, rigid traditional systems don’t fit all, but that combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.
4. Disclaimer
• Technologies, not products
• Cloudera builds software
  • most donated to Apache
  • some closed-source
• I will likely mention “Cloudera Something”
• Cloudera “products” I reference are open source
  • Apache licensed
  • Source code is on GitHub
  • https://github.com/cloudera
5. What This Talk Isn’t About
• Deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• Sizing & Tuning
  • Depends heavily on data and workload
• Coding
• Algorithms
6. “The answer to most Hadoop questions is it depends.”
8. Why “Ecosystem?”
• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow
9. Partial Ecosystem
[Diagram: data flows into Hadoop from external systems via API access, from an RDBMS/DWH via DB table import, and from web server and device logs via log collection. Inside Hadoop: batch processing, machine learning, Search, and SQL. Data flows out via DB table export to an RDBMS/DWH, via API access to external systems, and to users through a BI tool over JDBC/ODBC.]
10. HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
  • http://research.google.com/archive/gfs.html
12. MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
  • http://research.google.com/archive/mapreduce.html
14. You specify map() and reduce() functions. The framework does the rest.
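That division of labor can be sketched in plain Python, with no Hadoop involved: a toy stand-in for the framework shuffles mapper output by key and hands each group to the reducer. The word-count example is the classic illustration; the `map_reduce` helper here is purely illustrative, not Hadoop's API.

```python
from collections import defaultdict

# User-supplied functions: emit (key, value) pairs, then fold values per key.
def map_fn(line):
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    yield (key, sum(values))

# Toy stand-in for the framework: run maps, shuffle by key, run reduces.
def map_reduce(lines, map_fn, reduce_fn):
    shuffled = defaultdict(list)
    for line in lines:                                # map phase
        for key, value in map_fn(line):
            shuffled[key].append(value)
    results = {}
    for key, values in sorted(shuffled.items()):      # reduce phase
        for k, v in reduce_fn(key, values):
            results[k] = v
    return results

counts = map_reduce(["the cat sat", "the cat ran"], map_fn, reduce_fn)
```

The real framework also handles partitioning, retries, and data locality across a cluster; the sketch keeps only the programming contract.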
15. Apache HBase
• Random, realtime read/write access
• Key/value columnar store
  • (b|tr)illions of rows/columns
• Based on Google BigTable
  • http://research.google.com/archive/bigtable.html
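The "key/value columnar store" data shape can be illustrated with a toy in-memory model: a sparse map of (row key, column, timestamp) to value, where reads return the newest version. This is only the data model; real HBase adds regions, write-ahead logging, compactions, and a cluster, none of which is modeled here.

```python
import time

# Toy model of the BigTable/HBase data shape: a sparse map of
# (row key, column) -> {timestamp: value}. Purely illustrative.
class ToyTable:
    def __init__(self):
        self.cells = {}

    def put(self, row, column, value, ts=None):
        self.cells.setdefault((row, column), {})[ts or time.time()] = value

    def get(self, row, column):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        return versions[max(versions)]   # newest version wins

t = ToyTable()
t.put("user#42", "info:name", "Ada", ts=1)
t.put("user#42", "info:name", "Ada L.", ts=2)
```

Columns live inside column families (the "info:" prefix above), and unset cells simply don't exist, which is what makes billions of mostly-empty columns practical.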
16. Apache Accumulo
• Random, realtime read/write access
• Key/value columnar store
  • (b|tr)illions of rows/columns
• Based on Google BigTable
  • http://research.google.com/archive/bigtable.html
• Adds cell-level security
  • Implemented by National Security Agency
  • Donated to ASF
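The distinguishing feature here is cell-level security: every cell carries a visibility marking, and a scan only returns cells the reader is authorized to see. A minimal sketch, with a deliberately simplified label model (real Accumulo visibility expressions support AND/OR and parentheses, not just a flat label set):

```python
# Toy sketch of cell-level security: each cell carries required labels,
# and a scan returns only cells whose labels the reader all holds.
cells = [
    ("doc1", "body", "public text",  set()),
    ("doc1", "note", "analyst note", {"analyst"}),
    ("doc2", "body", "secret text",  {"secret"}),
]

def scan(cells, authorizations):
    # A cell is visible iff all of its labels are in the reader's set.
    return [(row, col, val) for row, col, val, labels in cells
            if labels <= authorizations]

visible = scan(cells, {"analyst"})
```

Because the check happens per cell rather than per table, differently-cleared users can share one table and each see a different slice of it.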
17. Apache Hive & Pig
• Abstraction of Hadoop’s Java API
• Hive is SQL-based
• Pig is more data-flow oriented
• Eases analysis using MapReduce
18. Cloudera Impala
• SQL-based, but interactive response
• Backed by HDFS or HBase
• Allows for fast iteration/discovery
• Not as fault-tolerant as MapReduce
19. Apache Sqoop & Flume
• Get your data in and out of HDFS
• Sqoop focuses on relational databases
• Flume focuses on log files
20. Cloudera Hue
• Hadoop User Experience
• Hadoop is largely command line
• Hue provides a UI for end-users
• SDK to build your own apps on top
21. Apache Mahout
• Machine learning algorithms that run on MapReduce
  • Clustering
  • Classification
  • Filtering
• I didn’t study these algorithms in school
  • Data science people are excited
  • Math people are excited
  • I’m excited for them
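To make "clustering" concrete, here is a tiny one-dimensional k-means in plain Python. Mahout's contribution is running this kind of algorithm as MapReduce jobs over data too large for one machine; this sketch keeps only the algorithm.

```python
# Minimal 1-D k-means, to illustrate what "clustering" means here.
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        groups = {c: [] for c in centers}
        for p in points:                              # assignment step
            nearest = min(centers, key=lambda c: abs(c - p))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) if g else c        # update step
                   for c, g in groups.items()]
    return sorted(centers)

# Two obvious clusters around 1 and 9; start centers far apart.
centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centers=[0.0, 10.0])
```

The assignment and update steps map naturally onto map and reduce phases, which is why this family of algorithms fits the platform at all.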
22. Apache Tika
• Content analysis toolkit
• Simply put, a lot of parsers
• Detect/extract metadata/text from documents
  • HTML, XML, Office, PDF, mbox, more…
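The detect-then-extract pattern can be sketched in a few lines: sniff a type from the leading bytes, then dispatch to a matching parser. The signature table and parsers below are toys for illustration; Tika automates this across hundreds of real formats with far more robust detection.

```python
import re

# Toy "content analysis": detect a type from leading bytes, then extract.
SIGNATURES = [
    (b"%pdf-", "application/pdf"),
    (b"<?xml", "application/xml"),
    (b"<html", "text/html"),
]

def detect(data: bytes) -> str:
    head = data.lstrip().lower()
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return "text/plain"

def extract_text(data: bytes) -> str:
    if detect(data) == "text/html":
        stripped = re.sub(rb"<[^>]+>", b" ", data)   # crude tag removal
        return " ".join(stripped.decode().split())
    return data.decode(errors="replace")

mime = detect(b"%PDF-1.4 ...")
text = extract_text(b"<html><body>Hello world</body></html>")
```

In a Hadoop pipeline this matters because ingested documents arrive in mixed formats, and detection decides which parser produces the text that gets indexed.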
23. Apache ZooKeeper
• Distributed systems are HARD
• Everyone was trying to implement the same subsystems
  • Bugs lead to race conditions, other bad things
• ZK: Highly reliable distributed coordination services
  • Configuration
  • Naming
  • Synchronization
  • Group Services
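One of the "same subsystems" everyone kept reimplementing is the distributed lock. ZooKeeper's standard lock recipe has each client create a sequential node under a lock path; whoever holds the lowest sequence number owns the lock. The sketch below simulates only that idea in memory, with no server, sessions, or watches modeled.

```python
# In-memory simulation of ZooKeeper's sequential-node lock recipe.
class ToyZooKeeper:
    def __init__(self):
        self.counter = 0
        self.znodes = {}            # path -> owner

    def create_sequential(self, prefix, owner):
        path = f"{prefix}{self.counter:010d}"   # e.g. /lock/n-0000000000
        self.counter += 1
        self.znodes[path] = owner
        return path

    def lock_holder(self, prefix):
        live = [p for p in self.znodes if p.startswith(prefix)]
        return self.znodes[min(live)] if live else None

    def delete(self, path):         # release, or session expiry in real ZK
        self.znodes.pop(path, None)

zk = ToyZooKeeper()
a = zk.create_sequential("/lock/n-", "client-a")
b = zk.create_sequential("/lock/n-", "client-b")
first = zk.lock_holder("/lock/n-")   # client-a holds the lock
zk.delete(a)
second = zk.lock_holder("/lock/n-")  # lock passes to client-b
```

In the real system the nodes are ephemeral, so a crashed client's lock disappears with its session, which is exactly the failure handling that is so hard to get right by hand.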
24. Apache Oozie
• Workflow scheduling for Hadoop
• Like cron, but in directed-graph fashion
• Out-of-box hooks:
  • MR, Pig, Hive, Sqoop, Impala
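"Directed-graph fashion" means an action runs only after everything it depends on has finished. A minimal sketch of that scheduling idea, using Python's standard topological sorter; the action names are made up, and a real Oozie workflow is defined in XML rather than a dict.

```python
from graphlib import TopologicalSorter

# Sketch of directed-graph scheduling: each action lists the actions it
# depends on; any run order must respect those edges.
workflow = {
    "sqoop-import":   [],
    "pig-cleanup":    ["sqoop-import"],
    "hive-aggregate": ["pig-cleanup"],
    "mr-index":       ["pig-cleanup"],
    "impala-refresh": ["hive-aggregate", "mr-index"],
}

order = list(TopologicalSorter(workflow).static_order())
```

Unlike cron, which only knows about wall-clock times, this lets independent branches (the Hive and MR actions above) run in parallel while the final action waits for both.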
25. Sentry (incubating)
• Role-based access control for Hive/Impala/Solr
• Regulatory/compliance assurance
26. Cloudera Morphlines
• In-memory transformations
  • Load, parse, transform, process
• Records as name-value pairs w/ optional blob/POJO objects
• Java library, embedded in your codebase
• Used to ETL data from Flume and MR into Solr
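The load/parse/transform/process chain over name-value records can be sketched as a list of small commands applied in order. The command names below are invented for illustration; real morphlines are HOCON-configured command chains running inside Flume or MapReduce, not Python.

```python
# Morphline-style pipeline sketch: records are dicts of name -> value,
# and a chain of small commands transforms each record in order.
def read_line(record):
    record["message"] = record.pop("raw").strip()
    return record

def split_fields(record):
    level, text = record["message"].split(" ", 1)
    record.update(level=level, text=text)
    return record

def lowercase_level(record):
    record["level"] = record["level"].lower()
    return record

def run_pipeline(record, commands):
    for command in commands:
        record = command(record)
    return record

out = run_pipeline({"raw": "  WARN disk almost full\n"},
                   [read_line, split_fields, lowercase_level])
```

Keeping each command tiny and composable is the point: the same chain can sit behind a Flume sink or a MapReduce job and feed clean records into Solr.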
27. Apache Lucene
• Java-based index and search
  • Spellchecking
  • Hit highlighting
  • Tokenization
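At its core, "index and search" means tokenizing documents into an inverted index mapping each term to the documents containing it. A minimal sketch, including naive hit highlighting; nothing here uses Lucene's actual API, and real Lucene adds analyzers, relevance scoring, and spellchecking on top.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Inverted index: term -> set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: documents containing every query term.
    sets = [index.get(t, set()) for t in tokenize(query)]
    return sorted(set.intersection(*sets)) if sets else []

def highlight(text, term):
    return re.sub(f"(?i)({re.escape(term)})", r"<em>\1</em>", text)

docs = {1: "Hadoop distributed filesystem", 2: "Distributed search with Solr"}
index = build_index(docs)
hits = search(index, "distributed")
```

Because lookups go term-first rather than document-first, queries stay fast even as the document collection grows, which is what makes the Hadoop-scale search discussed later feasible.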
34. Search Design Strategy
An integrated part of the Hadoop system:
• One pool of data
• One security framework
• One set of system resources
• One management interface
[Diagram: engines — Batch Processing (MapReduce, Hive, Pig, …), Interactive SQL (Cloudera Impala), Interactive Search (Cloudera Search), Machine Learning (Mahout), Math & Statistics (SAS, R) — sharing storage (HDFS, HBase; text, RCFile, Parquet, Avro, etc. records), resource management, and metadata.]
35. Benefits of Search Integration
Improved Big Data ROI
§ An interactive experience without technical knowledge
§ Single data set for multiple computing frameworks
Faster Time to Insight
§ Exploratory analysis, esp. unstructured data
§ Broad range of indexing options to accommodate needs
Cost Efficiency
§ Single scalable platform; no incremental investment
§ No need for separate systems, storage
Solid Foundations & Reliability
§ Solr in production environments for years
§ Hadoop-powered reliability and scalability
37. That’s a Lot of Software
• 21 packages, depending on how you count
• And there’s plenty more…
• How to decide what to use?
38. “The answer to most Hadoop questions is it depends.”
39. Some of the Big Issues
• Response time
• User interfaces
• Programming paradigm
• Input/output formats
• Use cases
40. Response Time
• MapReduce is batch oriented
  • Resilient to hardware failures
  • Robust scheduling options
• Impala is near-realtime
• HBase is realtime
  • Key/values are cached in memory
• Search can be (near-)realtime
• Hybrid systems are common!
41. User Interfaces
• Java
  • MapReduce, HBase
• SQL
  • Hive, Impala
• Shell
  • Pig
• Natural Language / Free Text
  • Search
42. Data Constraints
• MapReduce
  • Paradigm takes some getting used to
  • Processing must accommodate format
• HBase
  • Columnar key/value store
  • Hue makes this easier
• Search
  • Indexing and display
  • Hue makes this easier
43. Input/Output Formats
• Knowing what they are is… optional.
• Don’t know? That’s okay.
  • Schema on read.
  • Be able to extract what you need
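"Schema on read" means raw data lands untouched and structure is imposed only at query time, so the same bytes can serve schemas nobody anticipated at load time. A sketch with made-up log lines and field names: two readers extract different structures from identical raw data.

```python
import re

# Schema on read: store raw lines as-is, impose structure when reading.
raw_lines = [
    "2014-03-01 GET /index.html 200",
    "2014-03-01 POST /login 401",
]

def read_as_status(lines):
    # One "schema": just the trailing status codes, as integers.
    return [int(line.rsplit(" ", 1)[1]) for line in lines]

def read_as_requests(lines):
    # Another "schema": the (method, path) pairs.
    pattern = re.compile(r"^\S+ (\S+) (\S+)")
    return [pattern.match(line).groups() for line in lines]

statuses = read_as_status(raw_lines)
requests = read_as_requests(raw_lines)
```

Contrast this with schema-on-write systems, where fields not captured at load time are simply gone.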
44. Lack of Use Case
• “Big Data” and Hadoop
  • They ENABLE you to solve problems
  • Won’t solve problems for you
  • Doesn’t know about your business logic
• “Big” is bigger than you’re accustomed to…
• Have a plan
  • Bring your use cases
  • Bring your business questions
46. eBay – Cassini Project
• June 2012
  • 2B page views/day
  • 250M searches/day
  • 9 PB online
• Custom search indexes
  • Limited by field or time period
47. eBay – Cassini Project
• MapReduce to generate indexes
  • Customer history
  • Item fields: name, price, descriptions, etc.
• Bulk import indexes into HBase, served
  • 15 TB in HBase, 1.2 TB daily import into HBase
• Ranking algorithms can take into account:
  • More history
  • More fields
  • More customer-specific details
49. Search Use Cases
Powerful, proven search capabilities that let organizations:
• Offer easy access to non-technical resources
• Explore data prior to processing and modeling
• Gain immediate access and find correlations in mission-critical data
50. Monsanto
Scalable, efficient image search for analysis and research
• Track plant characteristics throughout their lifecycle
• Before: Manual attribute extraction and search queries within database
• Now: Parse and index images at acquisition and on demand; index archived images in batch
54. Cloudera – Internal Field Portal
• Varied fetchers/observers for web/API content
• Content is retrieved via Flume, Sqoop
• Search indexes and replicates into HBase
• Each collection has collection-specific filters/fields
  • Provides title, content snippet, link to original
• Morphlines extracts books and papers using Tika
• Impala for analytics
• Future: Use MapReduce to ingest logs
62. Patterns & Predictions – Durkheim Project
• Phase 1
  • 3 cohorts: non-psychiatric, psychiatric, suicide-positive
  • 100 clinical profiles per cohort
  • 65% accurate in predicting suicide risk in control group
• Phase 2
  • Text analytics of clinical records, opt-in social media
  • Goal of 100,000 veteran participants
  • Represents a huge increase of data
  • Traditional enterprise search couldn’t scale
63. Patterns & Predictions – Durkheim Project
• Technologies
  • Hadoop
  • Search
    • Indexing of machine learning output, backed by HBase for performance
    • Hue interface for non-technical users
    • Discovery of terms, keywords, risk factors in numerous facets
  • Impala
    • Deep SQL queries if/when interesting deviations are found
    • e.g. if the word “Molly” appeared in top 10 facets
    • Write some SQL to dig in, perhaps revise indexing scheme
64. Patterns & Predictions – Durkheim Project
• Currently
  • Monitoring
  • Analysis
• Future
  • Interventional study
  • Back our hopes with data…
• More detailed case study
  • http://goo.gl/3ZJMwS
  • http://durkheimproject.org/
67. Summary
• With Hadoop, it depends.
• The tools are out there.
  • Open source software
  • Many interconnected pieces
  • Many unexplored opportunities
  • A thriving community awaits you…
• Data can make a difference.
  • Search allows everyone to interact with data.
  • This is a Big Deal.