Big Data App Server by Lance Riedel, CTO, The Hive for The Hive India event

Big Data App Server

Lance Riedel

Big Data App Server
A new application framework for the 4 V’s:

•  Volume of raw data (petabytes)
•  Velocity at which it is being generated/ingested
•  Variety of data sources and schemas
•  Advanced data sciences and analytics that can be applied to extract Value

Big Data App Server Use Cases
•  Log/Machine Analytics
•  Security/Fraud Detection
•  Sensor Data Analytics
•  Financial Analytics
•  Retail Analytics
•  Ad Targeting
•  Recommendation (e.g. Netflix, Amazon)

Storage and Compute (Big Data Platform)

Storage and Compute
Motivation

Google needed to capture the web and process it efficiently

•  Calculate importance of pages, words, domains against each other
•  The more cost-effective they could make it, the more they could process, index, understand

Storage/Compute: Centralized
•  Centralized doesn’t scale!
•  Move a lot of data: bottleneck

Storage/Compute: Sharding
•  Sharding is splitting the problem into isolated chunks
•  Sharding scales, but fails when you need to look across the data
•  E.g. how to calculate term weights or top pages across shards?
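
To make the cross-shard problem concrete, here is a minimal sketch in Python. The shard contents and page names are made up for illustration: each shard holds part of the inbound-link counts, and the page that wins globally is not the winner in either shard, so the per-shard answers cannot simply be concatenated.

```python
from collections import Counter

# Hypothetical inbound-link counts, split across two isolated shards.
shard_a = Counter({"a.com": 5, "b.com": 4})
shard_b = Counter({"b.com": 4, "c.com": 6})


def local_top(shard):
    """Top page as seen by a single shard in isolation."""
    return shard.most_common(1)[0][0]


def global_top(shards):
    """Top page once the counts from every shard are combined."""
    total = Counter()
    for shard in shards:
        total.update(shard)  # adds counts key by key
    return total.most_common(1)[0][0]


print(local_top(shard_a), local_top(shard_b), global_top([shard_a, shard_b]))
```

Getting the global answer requires moving and merging data from every shard, which is the step that isolated shards do not provide on their own.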


DFS, MapReduce
•  Used a new programming model to distribute computation AND data (NOT sharding)
•  Runs on commodity hardware
•  Failure resilience using software control
•  Easy to calculate across corpus
•  Two parts of a complete solution:
   •  Distributed File System (DFS)
   •  MapReduce

MapReduce
•  Process where the data resides (data and compute are local to each other)
•  Map (read the data, emit a key and a value)
•  Reduce (group all values per key, perform another operation)
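
The map and reduce steps above can be sketched as a single-process word count in Python. This only illustrates the programming model and is not Hadoop's API: the function names are assumptions, and the `shuffle` step stands in for the grouping that a real framework performs across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: read one record and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["big data", "big app server"]))
```

Because each call to `map_phase` only sees one record and each call to `reduce_phase` only sees one key, both phases can run in parallel on the machines that already hold the data.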

Hadoop
•  Open source implementation of Google’s DFS and MapReduce whitepapers
•  Huge ecosystem
•  Used by: Yahoo, Facebook, Twitter, LinkedIn, Sears, Apple, The New York Times, Telefonica, and 1000s more!

Data Ingestion
Motivation

•  Data originating from a variety of sources
•  Some data more valuable than others:
   •  Time-to-live (TTL)
   •  Guarantees on delivery

Data Ingestion: Apache Flume
•  A scalable, fault-tolerant data ingestion pipeline with a configurable topology that works hand in hand with the Hadoop ecosystem
•  Configurable delivery guarantees: routing, replication, failover
•  Extensible sources and sinks allow for pluggable data sources
•  Scales out horizontally: 100k’s messages/sec
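
As a rough illustration of the failover and replication guarantees listed above, the Python sketch below models sinks as plain callables. It is not Flume's API or configuration format; `deliver_with_failover`, `deliver_replicated`, and `broken_sink` are hypothetical names used only to show the two delivery strategies.

```python
def deliver_with_failover(event, sinks):
    """Try sinks in priority order; return the index of the sink that accepted the event."""
    for index, sink in enumerate(sinks):
        try:
            sink(event)
            return index
        except ConnectionError:
            continue  # this sink is down, fall through to the next one
    raise RuntimeError("all sinks failed")


def deliver_replicated(event, sinks):
    """Send a copy of the event to every sink."""
    for sink in sinks:
        sink(event)


def broken_sink(event):
    """Hypothetical sink that is always unavailable."""
    raise ConnectionError("sink unavailable")


backup_store = []
used = deliver_with_failover("log line 1", [broken_sink, backup_store.append])
print(used, backup_store)
```

Failover trades duplication for availability (one copy lands somewhere), while replication deliberately writes every event to all destinations.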

Workflow
Motivation

Transforming, storing, and joining data can take a lot of steps that need to be repeatable and traceable: the programming model for data

Workflow: Oozie
A workflow engine that understands the dependency graph of work and can schedule, replay, and report on the steps

•  Jobs triggered by time (frequency) and data availability
•  Integrated with the rest of the Hadoop stack
•  Scalable, reliable, and extensible system
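
The idea of a dependency graph of work can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not Oozie's workflow definition format; the step names and the `runnable` helper are assumptions chosen to show how an engine can derive a valid run order and decide which steps are ready once their inputs are available.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "ingest_logs": set(),
    "ingest_users": set(),
    "join": {"ingest_logs", "ingest_users"},
    "report": {"join"},
}


def run_order(graph):
    """One valid execution order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def runnable(graph, done):
    """Steps not yet run whose dependencies are all complete."""
    return sorted(s for s, deps in graph.items() if s not in done and deps <= done)


print(run_order(workflow))
print(runnable(workflow, {"ingest_logs"}))
```

Replaying a failed run amounts to calling `runnable` again with the set of steps that already succeeded, so completed work is not repeated.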

Schema Management
Motivation

As data sources explode, the need to understand the data schemas becomes a principal concern

Schema: HCatalog
•  A table and storage management layer for Hadoop
•  Enables users with different data processing tools (Pig, MapReduce, and Hive) to more easily read and write data on the grid

Schema: Avro

•  A data serialization system
•  When Avro data is stored in a file, its schema is stored with it
•  Correspondence between same-named fields, missing fields, extra fields, etc. can all be easily resolved
•  Most technologies in the Hadoop stack understand Avro, which helps interoperability and data passing
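
The field-resolution bullet above can be illustrated with a small Python sketch. It does not use the Avro library; it only mimics the general rule that fields are matched by name, extra fields from the writer are dropped, and fields missing from the written record fall back to the reader's default. The `resolve` function, the `REQUIRED` marker, and the example fields are hypothetical.

```python
REQUIRED = object()  # marker for reader fields that have no default


def resolve(record, reader_fields):
    """Project a written record onto the reader's view of the schema.

    reader_fields is a list of (name, default) pairs.
    """
    resolved = {}
    for name, default in reader_fields:
        if name in record:
            resolved[name] = record[name]       # same-named field: copy it
        elif default is not REQUIRED:
            resolved[name] = default            # missing field: use the default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved                             # extra writer fields are ignored


written = {"id": 1, "name": "alice", "legacy_flag": True}
reader = [("id", REQUIRED), ("name", REQUIRED), ("country", "unknown")]
print(resolve(written, reader))
```

Keeping the writer's schema alongside the data is what lets a newer reader apply this kind of resolution to files written long ago.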

Data Access, Querying (Big Data Platform)

Data Access
Motivation

Various data access patterns require data stores beyond just the DFS files. An example is a key-value store that needs random access to data.

Solution(s)

There are a number of solutions depending on the use case.

•  Google’s BigTable whitepaper
•  SQL has been adapted to Hadoop

Data Access: HBase
•  The Hadoop database: a distributed, scalable, big data store (sorted map), modeled on Google’s BigTable and backed by Hadoop DFS
•  Linear and modular scalability
•  Automatic and configurable sharding of tables
•  Automatic failover support
•  Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
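
The "sorted map" idea can be shown with a small in-memory Python sketch. It is not the HBase client API; the `SortedTable` class and the row keys are assumptions, used only to show why keeping row keys sorted gives both random access by key and cheap range scans.

```python
import bisect


class SortedTable:
    """Toy sorted map: row keys are kept in order to support range scans."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        """Random access by row key."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """All rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = SortedTable()
for key in ["user3", "video1", "user1", "user2"]:
    table.put(key, {"seen": True})
print([k for k, _ in table.scan("user1", "user3")])
```

Contiguous key ranges are also a natural unit for splitting a table across servers, which is how the automatic sharding bullet relates to the sorted layout.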

Data Access: SQL – Hive, Impala
•  SQL querying of raw data on the distributed file system
•  Impala: query files on HDFS, including SELECT, JOIN, and aggregate functions, in real time
•  Hive: provides easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems

Data Analytics
Motivation

•  Discover the latent value of the data. The core motivation behind Big Data!
•  Clustering, machine learning, correlations, modeling (the guts of data science), often with extremely diverse use cases.

Solution(s)

A pluggable architecture that can share schemas, but allow for a suite of tools appropriate for the use case

Data Analytics: Example Frameworks
•  Mahout
   •  Machine learning, clustering
•  Pattern: Machine Learning DSL for Hadoop from Cascading
•  0xData
   •  Open source math and prediction engine for big data
•  Sample Algorithms
   •  Random Forest algorithm
   •  K-Means Clustering
   •  Hierarchical Clustering
   •  Linear Regression
   •  Logistic Regression
   •  Support Vector Machines
   •  Artificial Neural Networks
   •  Association Rule Learning

Serving
Motivation

•  Powering applications for end users
•  Search/browse and recommendation engines allow real-time access to data

Serving: Search – Solr Cloud
•  Builds indexes on top of Hadoop
•  Horizontally scalable, fault tolerant
•  Incredible flexibility in indexing options
   •  Tokenization
   •  Field types
   •  Data storage
•  Search options just as flexible
   •  AND, OR, NOT, wildcard
   •  Facets (counts from a derived ontology)
   •  Extensive algorithm and weighting pluggability
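
Boolean search and facet counts can be illustrated with a tiny in-memory Python sketch. It does not use Solr or its query syntax; the documents, the `search` and `facet` functions, and the whitespace tokenization are assumptions chosen to show what AND/NOT matching and per-category counts mean.

```python
from collections import Counter

# Hypothetical documents with a text field and a category field.
docs = [
    {"id": 1, "text": "hadoop cluster setup", "category": "infrastructure"},
    {"id": 2, "text": "hadoop mapreduce tutorial", "category": "tutorial"},
    {"id": 3, "text": "solr search tutorial", "category": "tutorial"},
    {"id": 4, "text": "hadoop search indexing", "category": "search"},
]


def search(documents, must=(), must_not=()):
    """Return docs containing every term in `must` (AND) and none in `must_not` (NOT)."""
    hits = []
    for doc in documents:
        terms = set(doc["text"].split())  # naive whitespace tokenization
        if all(t in terms for t in must) and not any(t in terms for t in must_not):
            hits.append(doc)
    return hits


def facet(hits, field):
    """Count how many matching docs fall under each value of `field`."""
    return Counter(doc[field] for doc in hits)


hits = search(docs, must=["hadoop"], must_not=["search"])
print([d["id"] for d in hits], dict(facet(hits, "category")))
```

A real search engine answers the same questions from an inverted index rather than by scanning every document, which is what makes it usable at scale.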

Serving: Manas – Matching Engine
•  The Hive’s massively scalable matching engine
•  Handles 100s of millions to billions of documents efficiently while matching against 100s to 1000s of features
•  Nothing exists today in the open source community that has these capabilities

EXAMPLE APP USE-CASE
