This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined and used to solve specific and sometimes very complex problems. Drawing upon case studies from the field, Mr. Moundalexis demonstrates that one-size, rigid traditional systems don’t fit all, but that combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.
4. Disclaimer
• Technologies, not products
• Cloudera builds software
  • most donated to Apache
  • some closed-source
• I will likely mention “Cloudera Something”
• Cloudera “products” I reference are open source
  • Apache licensed
  • Source code is on GitHub
  • https://github.com/cloudera
5. What This Talk Isn’t About
• Deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• Sizing & Tuning
  • Depends heavily on data and workload
• Coding
• Algorithms
6. “The answer to most Hadoop questions is it depends.”
8. Why “Ecosystem?”
• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow
9. Partial Ecosystem
[Diagram: data flows into Hadoop from external systems via API access, from an RDBMS/DWH via DB table import, and from web server and device logs via log collection. Inside Hadoop: batch processing, machine learning, Search, and SQL. Data flows out via DB table export to an RDBMS/DWH, via API access to external systems, and to users through a BI tool over JDBC/ODBC.]
10. HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
  • http://research.google.com/archive/gfs.html
12. MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
  • http://research.google.com/archive/mapreduce.html
14. You specify map() and reduce() functions. The framework does the rest.
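That division of labor can be sketched in plain Python, with no Hadoop involved: a toy stand-in for the framework shuffles mapper output by key and hands each group to the reducer. The word-count example is the classic illustration; the `map_reduce` helper here is purely illustrative, not Hadoop's API.

```python
from collections import defaultdict

# User-supplied functions: emit (key, value) pairs, then fold values per key.
def map_fn(line):
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    yield (key, sum(values))

# Toy stand-in for the framework: run maps, shuffle by key, run reduces.
def map_reduce(lines, map_fn, reduce_fn):
    shuffled = defaultdict(list)
    for line in lines:                                # map phase
        for key, value in map_fn(line):
            shuffled[key].append(value)
    results = {}
    for key, values in sorted(shuffled.items()):      # reduce phase
        for k, v in reduce_fn(key, values):
            results[k] = v
    return results

counts = map_reduce(["the cat sat", "the cat ran"], map_fn, reduce_fn)
```

The real framework also handles partitioning, retries, and data locality across a cluster; the sketch keeps only the programming contract.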
15. Apache HBase
• Random, realtime read/write access
• Key/value columnar store
  • (b|tr)illions of rows/columns
• Based on Google BigTable
  • http://research.google.com/archive/bigtable.html
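The "key/value columnar store" data shape can be illustrated with a toy in-memory model: a sparse map of (row key, column, timestamp) to value, where reads return the newest version. This is only the data model; real HBase adds regions, write-ahead logging, compactions, and a cluster, none of which is modeled here.

```python
import time

# Toy model of the BigTable/HBase data shape: a sparse map of
# (row key, column) -> {timestamp: value}. Purely illustrative.
class ToyTable:
    def __init__(self):
        self.cells = {}

    def put(self, row, column, value, ts=None):
        self.cells.setdefault((row, column), {})[ts or time.time()] = value

    def get(self, row, column):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        return versions[max(versions)]   # newest version wins

t = ToyTable()
t.put("user#42", "info:name", "Ada", ts=1)
t.put("user#42", "info:name", "Ada L.", ts=2)
```

Columns live inside column families (the "info:" prefix above), and unset cells simply don't exist, which is what makes billions of mostly-empty columns practical.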
16. Apache Accumulo
• Random, realtime read/write access
• Key/value columnar store
  • (b|tr)illions of rows/columns
• Based on Google BigTable
  • http://research.google.com/archive/bigtable.html
• Adds cell-level security
  • Implemented by National Security Agency
  • Donated to ASF
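The distinguishing feature here is cell-level security: every cell carries a visibility marking, and a scan only returns cells the reader is authorized to see. A minimal sketch, with a deliberately simplified label model (real Accumulo visibility expressions support AND/OR and parentheses, not just a flat label set):

```python
# Toy sketch of cell-level security: each cell carries required labels,
# and a scan returns only cells whose labels the reader all holds.
cells = [
    ("doc1", "body", "public text",  set()),
    ("doc1", "note", "analyst note", {"analyst"}),
    ("doc2", "body", "secret text",  {"secret"}),
]

def scan(cells, authorizations):
    # A cell is visible iff all of its labels are in the reader's set.
    return [(row, col, val) for row, col, val, labels in cells
            if labels <= authorizations]

visible = scan(cells, {"analyst"})
```

Because the check happens per cell rather than per table, differently-cleared users can share one table and each see a different slice of it.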
17. Apache Hive & Pig
• Abstraction of Hadoop’s Java API
• Hive is SQL-based
• Pig is more data-flow oriented
• Eases analysis using MapReduce
18. Cloudera Impala
• SQL-based, but interactive response
• Backed by HDFS or HBase
• Allows for fast iteration/discovery
• Not as fault-tolerant as MapReduce
19. Apache Sqoop & Flume
• Get your data in and out of HDFS
• Sqoop focuses on relational databases
• Flume focuses on log files
20. Cloudera Hue
• Hadoop User Experience
• Hadoop is largely command line
• Hue provides a UI for end-users
• SDK to build your own apps on top
21. Apache Mahout
• Machine learning algorithms that run on MapReduce
  • Clustering
  • Classification
  • Filtering
• I didn’t study these algorithms in school
  • Data science people are excited
  • Math people are excited
  • I’m excited for them
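To make "clustering" concrete, here is a tiny one-dimensional k-means in plain Python. Mahout's contribution is running this kind of algorithm as MapReduce jobs over data too large for one machine; this sketch keeps only the algorithm.

```python
# Minimal 1-D k-means, to illustrate what "clustering" means here.
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        groups = {c: [] for c in centers}
        for p in points:                              # assignment step
            nearest = min(centers, key=lambda c: abs(c - p))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) if g else c        # update step
                   for c, g in groups.items()]
    return sorted(centers)

# Two obvious clusters around 1 and 9; start centers far apart.
centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centers=[0.0, 10.0])
```

The assignment and update steps map naturally onto map and reduce phases, which is why this family of algorithms fits the platform at all.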
22. Apache Tika
• Content analysis toolkit
• Simply put, a lot of parsers
• Detect/extract metadata/text from documents
  • HTML, XML, Office, PDF, mbox, more…
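The detect-then-extract pattern can be sketched in a few lines: sniff a type from the leading bytes, then dispatch to a matching parser. The signature table and parsers below are toys for illustration; Tika automates this across hundreds of real formats with far more robust detection.

```python
import re

# Toy "content analysis": detect a type from leading bytes, then extract.
SIGNATURES = [
    (b"%pdf-", "application/pdf"),
    (b"<?xml", "application/xml"),
    (b"<html", "text/html"),
]

def detect(data: bytes) -> str:
    head = data.lstrip().lower()
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return "text/plain"

def extract_text(data: bytes) -> str:
    if detect(data) == "text/html":
        stripped = re.sub(rb"<[^>]+>", b" ", data)   # crude tag removal
        return " ".join(stripped.decode().split())
    return data.decode(errors="replace")

mime = detect(b"%PDF-1.4 ...")
text = extract_text(b"<html><body>Hello world</body></html>")
```

In a Hadoop pipeline this matters because ingested documents arrive in mixed formats, and detection decides which parser produces the text that gets indexed.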
23. Apache ZooKeeper
• Distributed systems are HARD
• Everyone was trying to implement the same subsystems
  • Bugs lead to race conditions, other bad things
• ZK: Highly reliable distributed coordination services
  • Configuration
  • Naming
  • Synchronization
  • Group Services
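One of the "same subsystems" everyone kept reimplementing is the distributed lock. ZooKeeper's standard lock recipe has each client create a sequential node under a lock path; whoever holds the lowest sequence number owns the lock. The sketch below simulates only that idea in memory, with no server, sessions, or watches modeled.

```python
# In-memory simulation of ZooKeeper's sequential-node lock recipe.
class ToyZooKeeper:
    def __init__(self):
        self.counter = 0
        self.znodes = {}            # path -> owner

    def create_sequential(self, prefix, owner):
        path = f"{prefix}{self.counter:010d}"   # e.g. /lock/n-0000000000
        self.counter += 1
        self.znodes[path] = owner
        return path

    def lock_holder(self, prefix):
        live = [p for p in self.znodes if p.startswith(prefix)]
        return self.znodes[min(live)] if live else None

    def delete(self, path):         # release, or session expiry in real ZK
        self.znodes.pop(path, None)

zk = ToyZooKeeper()
a = zk.create_sequential("/lock/n-", "client-a")
b = zk.create_sequential("/lock/n-", "client-b")
first = zk.lock_holder("/lock/n-")   # client-a holds the lock
zk.delete(a)
second = zk.lock_holder("/lock/n-")  # lock passes to client-b
```

In the real system the nodes are ephemeral, so a crashed client's lock disappears with its session, which is exactly the failure handling that is so hard to get right by hand.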
24. Apache Oozie
• Workflow scheduling for Hadoop
• Like cron, but in directed-graph fashion
• Out-of-box hooks:
  • MR, Pig, Hive, Sqoop, Impala
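"Directed-graph fashion" means an action runs only after everything it depends on has finished. A minimal sketch of that scheduling idea, using Python's standard topological sorter; the action names are made up, and a real Oozie workflow is defined in XML rather than a dict.

```python
from graphlib import TopologicalSorter

# Sketch of directed-graph scheduling: each action lists the actions it
# depends on; any run order must respect those edges.
workflow = {
    "sqoop-import":   [],
    "pig-cleanup":    ["sqoop-import"],
    "hive-aggregate": ["pig-cleanup"],
    "mr-index":       ["pig-cleanup"],
    "impala-refresh": ["hive-aggregate", "mr-index"],
}

order = list(TopologicalSorter(workflow).static_order())
```

Unlike cron, which only knows about wall-clock times, this lets independent branches (the Hive and MR actions above) run in parallel while the final action waits for both.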
25. Sentry (incubating)
• Role-based access control for Hive/Impala/Solr
• Regulatory/compliance assurance
26. Cloudera Morphlines
• In-memory transformations
  • Load, parse, transform, process
• Records as name-value pairs w/ optional blob/POJO objects
• Java library, embedded in your codebase
• Used to ETL data from Flume and MR into Solr
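The load/parse/transform/process chain over name-value records can be sketched as a list of small commands applied in order. The command names below are invented for illustration; real morphlines are HOCON-configured command chains running inside Flume or MapReduce, not Python.

```python
# Morphline-style pipeline sketch: records are dicts of name -> value,
# and a chain of small commands transforms each record in order.
def read_line(record):
    record["message"] = record.pop("raw").strip()
    return record

def split_fields(record):
    level, text = record["message"].split(" ", 1)
    record.update(level=level, text=text)
    return record

def lowercase_level(record):
    record["level"] = record["level"].lower()
    return record

def run_pipeline(record, commands):
    for command in commands:
        record = command(record)
    return record

out = run_pipeline({"raw": "  WARN disk almost full\n"},
                   [read_line, split_fields, lowercase_level])
```

Keeping each command tiny and composable is the point: the same chain can sit behind a Flume sink or a MapReduce job and feed clean records into Solr.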
27. Apache Lucene
• Java-based index and search
  • Spellchecking
  • Hit highlighting
  • Tokenization
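At its core, "index and search" means tokenizing documents into an inverted index mapping each term to the documents containing it. A minimal sketch, including naive hit highlighting; nothing here uses Lucene's actual API, and real Lucene adds analyzers, relevance scoring, and spellchecking on top.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Inverted index: term -> set of document ids containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: documents containing every query term.
    sets = [index.get(t, set()) for t in tokenize(query)]
    return sorted(set.intersection(*sets)) if sets else []

def highlight(text, term):
    return re.sub(f"(?i)({re.escape(term)})", r"<em>\1</em>", text)

docs = {1: "Hadoop distributed filesystem", 2: "Distributed search with Solr"}
index = build_index(docs)
hits = search(index, "distributed")
```

Because lookups go term-first rather than document-first, queries stay fast even as the document collection grows, which is what makes the Hadoop-scale search discussed later feasible.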
34. Search Design Strategy
An integrated part of the Hadoop system:
• One pool of data
• One security framework
• One set of system resources
• One management interface
[Diagram: engines — Batch Processing (MapReduce, Hive, Pig, …), Interactive SQL (Cloudera Impala), Interactive Search (Cloudera Search), Machine Learning (Mahout), Math & Statistics (SAS, R) — sharing storage (HDFS, HBase; text, RCFile, Parquet, Avro, etc. records), resource management, and metadata.]
35. Benefits of Search Integration
Improved Big Data ROI
§ An interactive experience without technical knowledge
§ Single data set for multiple computing frameworks
Faster Time to Insight
§ Exploratory analysis, esp. unstructured data
§ Broad range of indexing options to accommodate needs
Cost Efficiency
§ Single scalable platform; no incremental investment
§ No need for separate systems, storage
Solid Foundations & Reliability
§ Solr in production environments for years
§ Hadoop-powered reliability and scalability
37. That’s a Lot of Software
• 21 packages, depending on how you count
• And there’s plenty more…
• How to decide what to use?
38. “The answer to most Hadoop questions is it depends.”
39. Some of the Big Issues
• Response time
• User interfaces
• Programming paradigm
• Input/output formats
• Use cases
40. Response Time
• MapReduce is batch oriented
  • Resilient to hardware failures
  • Robust scheduling options
• Impala is near-realtime
• HBase is realtime
  • Key/values are cached in memory
• Search can be (near-)realtime
• Hybrid systems are common!
41. User Interfaces
• Java
  • MapReduce, HBase
• SQL
  • Hive, Impala
• Shell
  • Pig
• Natural Language / Free Text
  • Search
42. Data Constraints
• MapReduce
  • Paradigm takes some getting used to
  • Processing must accommodate format
• HBase
  • Columnar key/value store
  • Hue makes this easier
• Search
  • Indexing and display
  • Hue makes this easier
43. Input/Output Formats
• Knowing what they are is… optional.
• Don’t know? That’s okay.
  • Schema on read.
  • Be able to extract what you need
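"Schema on read" means raw data lands untouched and structure is imposed only at query time, so the same bytes can serve schemas nobody anticipated at load time. A sketch with made-up log lines and field names: two readers extract different structures from identical raw data.

```python
import re

# Schema on read: store raw lines as-is, impose structure when reading.
raw_lines = [
    "2014-03-01 GET /index.html 200",
    "2014-03-01 POST /login 401",
]

def read_as_status(lines):
    # One "schema": just the trailing status codes, as integers.
    return [int(line.rsplit(" ", 1)[1]) for line in lines]

def read_as_requests(lines):
    # Another "schema": the (method, path) pairs.
    pattern = re.compile(r"^\S+ (\S+) (\S+)")
    return [pattern.match(line).groups() for line in lines]

statuses = read_as_status(raw_lines)
requests = read_as_requests(raw_lines)
```

Contrast this with schema-on-write systems, where fields not captured at load time are simply gone.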
44. Lack of Use Case
• “Big Data” and Hadoop
  • They ENABLE you to solve problems
  • Won’t solve problems for you
  • Doesn’t know about your business logic
• “Big” is bigger than you’re accustomed to…
• Have a plan
  • Bring your use cases
  • Bring your business questions
46. eBay – Cassini Project
• June 2012
  • 2B page views/day
  • 250M searches/day
  • 9 PB online
• Custom search indexes
  • Limited by field or time period
47. eBay – Cassini Project
• MapReduce to generate indexes
  • Customer history
  • Item fields: name, price, descriptions, etc.
• Bulk import indexes into HBase, served
  • 15 TB in HBase, 1.2 TB daily import into HBase
• Ranking algorithms can take into account:
  • More history
  • More fields
  • More customer-specific details
49. Search Use Cases
Powerful, proven search capabilities that let organizations:
• Offer easy access to non-technical resources
• Explore data prior to processing and modeling
• Gain immediate access and find correlations in mission-critical data
50. Monsanto
Scalable, efficient image search for analysis and research
• Track plant characteristics throughout their lifecycle
• Before: Manual attribute extraction and search queries within database
• Now: Parse and index images at acquisition and on demand; index archived images in batch
54. Cloudera – Internal Field Portal
• Varied fetchers/observers for web/API content
• Content is retrieved via Flume, Sqoop
• Search indexes and replicates into HBase
• Each collection has collection-specific filters/fields
  • Provides title, content snippet, link to original
• Morphlines extracts books and papers using Tika
• Impala for analytics
• Future: Use MapReduce to ingest logs
62. Patterns & Predictions – Durkheim Project
• Phase 1
  • 3 cohorts: non-psychiatric, psychiatric, suicide-positive
  • 100 clinical profiles per cohort
  • 65% accurate in predicting suicide risk in control group
• Phase 2
  • Text analytics of clinical records, opt-in social media
  • Goal of 100,000 veteran participants
  • Represents a huge increase of data
  • Traditional enterprise search couldn’t scale
63. Patterns & Predictions – Durkheim Project
• Technologies
  • Hadoop
  • Search
    • Indexing of machine learning output, backed by HBase for performance
    • Hue interface for non-technical users
    • Discovery of terms, keywords, risk factors in numerous facets
  • Impala
    • Deep SQL queries if/when interesting deviations are found
    • e.g. if the word “Molly” appeared in top 10 facets
    • Write some SQL to dig in, perhaps revise indexing scheme
64. Patterns & Predictions – Durkheim Project
• Currently
  • Monitoring
  • Analysis
• Future
  • Interventional study
  • Back our hopes with data…
• More detailed case study
  • http://goo.gl/3ZJMwS
  • http://durkheimproject.org/
67. Summary
• With Hadoop, it depends.
• The tools are out there.
  • Open source software
  • Many interconnected pieces
  • Many unexplored opportunities
  • A thriving community awaits you…
• Data can make a difference.
  • Search allows everyone to interact with data.
  • This is a Big Deal.