2. Big Data App Server
A new application framework for the 4 V’s:
• Volume of raw data (Petabytes)
• Velocity at which it is being generated/ingested
• Variety of data sources and schemas
• Advanced data sciences and analytics that can be applied to extract Value
4. Big Data App Server Use Cases
• Log/Machine Analytics
• Security/Fraud Detection
• Sensor Data Analytics
• Financial Analytics
• Retail Analytics
• Ad Targeting
• Recommendation (e.g. Netflix, Amazon)
8. Storage and Compute
Motivation
Google needed to capture the web and process it efficiently
• Calculate importance of pages, words, domains against each other
• The more cost-effective they could make it, the more they could process, index, understand
10. Storage/Compute: Sharding
• Sharding is splitting the problem into isolated chunks
• Sharding scales, but fails when you need to look across the data
• E.g. how to calculate term weights or top pages across shards?
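The cross-shard problem can be made concrete with a small sketch. The page names and link counts below are hypothetical, and the two functions are illustrative helpers rather than part of any real system: each shard picks its own top page in isolation, and neither local winner matches the page that ranks first once the counts are combined.

```python
from collections import Counter

# Hypothetical link counts, split across two isolated shards.
shards = [
    {"a.com": 5, "b.com": 4, "c.com": 1},
    {"b.com": 4, "c.com": 5, "d.com": 1},
]

def top_page_per_shard(shards):
    """Each shard answers alone, without looking across the data."""
    return [max(shard, key=shard.get) for shard in shards]

def top_page_global(shards):
    """Combine the partial counts from every shard before ranking."""
    totals = Counter()
    for shard in shards:
        totals.update(shard)  # adds counts for pages seen in several shards
    return totals.most_common(1)[0][0]

local_winners = top_page_per_shard(shards)
global_winner = top_page_global(shards)
```

The global answer needs a step that gathers partial results for the same key from all shards, which is the gap the isolated-chunk approach leaves open.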
11. DFS, MapReduce
• Used a new programming model to distribute computation AND data (NOT sharding)
• Runs on commodity hardware
• Failure resilience using software control
• Easy to calculate across corpus
• Two parts of a complete solution:
  • Distributed File System (DFS)
  • MapReduce
13. MapReduce
• Process where the data resides (data and compute are local to each other)
• Map (read the data, emit a key and a value)
• Reduce (group all values per key, perform another operation)
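The map and reduce steps above can be sketched in a single process with a word count. This is a minimal illustration of the programming model only, not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical, and a real cluster would run the map step on the machines that hold each block of data.

```python
from collections import defaultdict

def map_phase(line):
    """Map: read one record and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: perform another operation on the values of one key."""
    return key, sum(values)

def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = run_job(["the web is big", "the web grows"])
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be spread over many machines without changing the logic.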
14. Hadoop
• Open source implementation of Google’s DFS and MapReduce whitepapers
• Huge ecosystem
• Used by: Yahoo, Facebook, Twitter, LinkedIn, Sears, Apple, The New York Times, Telefonica, +1000’s more!
16. Data Ingestion
Motivation
• Data originating from a variety of sources
• Some data more valuable than others:
  • Time-to-live (TTL)
  • Guarantees on delivery
17. Data Ingestion: Apache Flume
• A scalable, fault-tolerant data ingestion pipeline with a configurable topology that works hand in hand with the Hadoop ecosystem
• Configurable delivery guarantees: routing, replication, failover
• Extensible sources and sinks allow for pluggable data sources
• Scales out horizontally to 100k’s of messages/sec
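The difference between failover and replication as delivery guarantees can be sketched as follows. This is not Flume's API or configuration format; the FlakySink class and both delivery functions are hypothetical stand-ins used only to show the two behaviours.

```python
class FlakySink:
    """Hypothetical sink that can be marked unhealthy to simulate an outage."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.delivered = []

    def write(self, event):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        self.delivered.append(event)

def deliver_with_failover(event, sinks):
    """Try sinks in priority order; return the name of the one that accepted."""
    for sink in sinks:
        try:
            sink.write(event)
            return sink.name
        except ConnectionError:
            continue  # fall through to the next sink in the list
    raise RuntimeError("no sink accepted the event")

def deliver_replicated(event, sinks):
    """Copy the same event to every sink."""
    for sink in sinks:
        sink.write(event)

primary = FlakySink("primary", healthy=False)
backup = FlakySink("backup")
accepted_by = deliver_with_failover({"body": "log line"}, [primary, backup])

archive, search = FlakySink("archive"), FlakySink("search")
deliver_replicated({"body": "log line"}, [archive, search])
```

Failover trades duplication for availability: one copy lands on the first healthy destination, while replication writes a copy everywhere at a higher storage cost.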
19. Workflow: Oozie
A workflow engine that understands the dependency graph of work and can schedule, replay, and report on the steps
• Jobs triggered by time (frequency) and data availability
• Integrated with the rest of the Hadoop stack
• Scalable, reliable and extensible system
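What it means to understand a dependency graph can be sketched with Python's standard graphlib module. This does not use Oozie's workflow definition format or API; the step names and the steps_to_replay helper are hypothetical, chosen only to show scheduling order and replay after a failure.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "index": {"clean"},
    "report": {"aggregate", "index"},
}

def execution_order(graph):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())

def steps_to_replay(graph, failed_step):
    """Return the failed step plus everything downstream of it, in run order."""
    order = execution_order(graph)
    affected = {failed_step}
    for step in order:
        if graph[step] & affected:  # depends on something that must rerun
            affected.add(step)
    return [step for step in order if step in affected]

order = execution_order(workflow)
replay = steps_to_replay(workflow, "clean")
```

In this sketch a failure in "clean" forces a rerun of everything except "ingest", whose output is still valid.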
20. Schema Management
Motivation
As data sources explode, the need to understand the data schemas becomes a principal concern
21. Schema: HCatalog
• A table and storage management layer for Hadoop
• Enables users with different data processing tools (Pig, MapReduce, and Hive) to more easily read and write data on the grid.
22. Schema: Avro
• A data serialization system
• When Avro data is stored in a file, its schema is stored with it
• Correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved.
• Most technologies in the Hadoop stack understand Avro, which gives interoperability and easy data passing
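The field resolution described above can be sketched in plain Python. This does not use the Avro library; the resolve function, the REQUIRED marker, and the field names are hypothetical, and the reader schema is simplified to a mapping from field name to default value.

```python
REQUIRED = object()  # marker for reader fields that declare no default

def resolve(record, reader_schema):
    """Project a written record onto the fields the reader expects."""
    resolved = {}
    for name, default in reader_schema.items():
        if name in record:
            resolved[name] = record[name]   # same-named field: copy the value
        elif default is not REQUIRED:
            resolved[name] = default        # missing field: use reader default
        else:
            raise ValueError(f"missing field with no default: {name}")
    return resolved  # extra fields known only to the writer are dropped

# The writer produced an extra field; the reader expects a newer one.
written = {"user": "ana", "clicks": 3, "legacy_flag": True}
reader_schema = {"user": REQUIRED, "clicks": REQUIRED, "country": "unknown"}
resolved = resolve(written, reader_schema)
```

Storing the writer's schema with the data is what makes this matching by name possible, since the reader can always tell which fields were actually written.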
24. Data Access
Motivation
Various data access patterns require data stores beyond just the DFS files. An example is a key value store that needs random access to data.
Solution(s)
There are a number of solutions depending on the use case.
• Google’s BigTable whitepaper
• SQL has been adapted to Hadoop
25. Data Access: HBase
• The Hadoop database: a distributed, scalable, big data store (sorted map), modeled on Google’s BigTable and backed by Hadoop DFS
• Linear and modular scalability.
• Automatic and configurable sharding of tables
• Automatic failover support
• Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
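The sorted-map idea can be sketched with a small in-memory class. This is not the HBase client API; SortedMap and the row key layout (user id, "#", date) are hypothetical, and the sketch shows only why keeping keys sorted makes both random access and range scans cheap.

```python
import bisect

class SortedMap:
    """In-memory stand-in for a table whose rows are kept sorted by key."""

    def __init__(self):
        self._keys = []      # row keys, always kept in sorted order
        self._values = {}    # row key -> row contents

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Random access by exact row key."""
        return self._values.get(key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

table = SortedMap()
for row_key in ["user2#2013-05-02", "user1#2013-05-01",
                "user1#2013-05-03", "user3#2013-05-01"]:
    table.put(row_key, {"clicks": 1})

# "$" is the character right after "#", so this range covers every user1 row.
user1_rows = [key for key, _ in table.scan("user1#", "user1$")]
```

Contiguous key ranges are also a natural unit for splitting a table across servers, which is how sorted order and sharding of tables fit together.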
26. Data Access: SQL – Hive, Impala
• SQL querying of raw data on the distributed file system
• Impala: query files on HDFS, including SELECT, JOIN, and aggregate functions, in real time
• Hive: provides easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems
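The kind of query these engines accept can be illustrated with a SELECT that joins and aggregates. The sketch below runs against Python's built-in sqlite3 module purely as a stand-in, since neither Hive nor Impala is assumed to be available; the table names and rows are hypothetical, and dialect details differ between engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (user_id INTEGER, url TEXT);
    CREATE TABLE users (user_id INTEGER, country TEXT);
    INSERT INTO page_views VALUES (1, '/home'), (1, '/cart'), (2, '/home');
    INSERT INTO users VALUES (1, 'US'), (2, 'DE');
""")

# SELECT + JOIN + aggregate: page views per country, busiest first.
rows = conn.execute("""
    SELECT u.country, COUNT(*) AS views
    FROM page_views v
    JOIN users u ON u.user_id = v.user_id
    GROUP BY u.country
    ORDER BY views DESC
""").fetchall()
conn.close()
```

On Hadoop the same statement would be planned over files in the distributed file system instead of local tables.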
28. Data Analytics
Motivation
• Discover the latent value of the data. The core motivation behind Big Data!
• Clustering, Machine Learning, Correlations, Modeling: the guts of Data Science, often with extremely diverse use cases.
Solution(s)
A pluggable architecture that can share schemas, but allow for a suite of tools appropriate for the use case
29. Data Analytics: Example Frameworks
• Mahout
  • Machine learning, clustering
• Pattern - Machine Learning DSL for Hadoop from Cascading
• 0xData
  • Open source math and prediction engine for big data
• Sample Algorithms
  • Random Forest algorithm
  • K-Means Clustering
  • Hierarchical Clustering
  • Linear Regression
  • Logistic Regression
  • Support Vector Machines
  • Artificial Neural Networks
  • Association Rule Learning
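As one example from the algorithm list, K-Means clustering can be sketched in a few lines for one-dimensional data. This is a teaching sketch, not the implementation used by any of the frameworks above; the function name, the sample points, and the fixed starting centroids are all hypothetical.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old position if the cluster ended up empty.
        centroids = [sum(c) / len(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
```

The assignment step is independent per point and the update step is a per-cluster average, which is why the algorithm maps well onto distributed frameworks.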
31. Serving
Motivation
• Powering applications for end users
• Search/browse and recommendation engines allow real-time access to data
32. Serving: Search – Solr Cloud
• Builds indexes on top of Hadoop
• Horizontally scalable, fault tolerant
• Incredible flexibility in indexing options
  • Tokenization
  • Field types
  • Data storage
• Search options just as flexible
  • AND, OR, NOT, wildcard
  • Facets (counts from a derived ontology)
  • Extensive algorithm and weighting plug-ability
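The indexing and search options above rest on an inverted index, which can be sketched with sets. This is not Solr's API or query syntax; the documents, the whitespace tokenizer, and the facet_counts helper are hypothetical and kept deliberately small.

```python
from collections import Counter, defaultdict

# Hypothetical documents with one text field and one facet field.
docs = {
    1: {"text": "hadoop scales storage", "type": "blog"},
    2: {"text": "hadoop search index", "type": "paper"},
    3: {"text": "search engines rank pages", "type": "blog"},
}

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in doc["text"].lower().split():  # whitespace tokenization
            index[token].add(doc_id)
    return index

def facet_counts(doc_ids, field):
    """Count matching documents per value of a field."""
    return Counter(docs[doc_id][field] for doc_id in doc_ids)

index = build_index(docs)
all_ids = set(docs)

hadoop_and_search = index["hadoop"] & index["search"]   # AND
hadoop_or_search = index["hadoop"] | index["search"]    # OR
not_hadoop = all_ids - index["hadoop"]                  # NOT
facets = facet_counts(hadoop_or_search, "type")
```

Boolean operators reduce to set operations on posting lists, and facets are counts over the result set, so both can be computed per index shard and merged.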
33. Serving: Manas – Matching Engine
• The Hive’s massively scalable matching engine
• Handles 100’s of millions to billions of documents efficiently while matching against 100’s to 1000’s of features
• Nothing exists today in the Open Source community that has these capabilities