gobblin-meetup-yarn
1. A Preview of Gobblin on Yarn
Yinan Li
Data Analytics Infrastructure @ LinkedIn
2. Agenda
• Motivations
• Architecture Overview
• Implementation Notes
– The Role of Apache Helix
– Log Compaction
– Security and Token Management
• Deployment @ LinkedIn
• Future Work
3. Why Gobblin on Yarn
• Better resource utilization
– Sharing of containers
– Better control over container provisioning
– Better container life cycle management
• Supports Gobblin as a continuous long-running service
• Better fit for streaming ingestion
5. The Role of Apache Helix
• Distributed task execution framework
– Automatic task assignment and rebalancing
• Coordination between the AM and containers
– Through ZooKeeper
• Messaging between components
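The idea of automatic task assignment and rebalancing can be illustrated with a minimal sketch. This is not Helix's API or its actual rebalancing algorithm; it is a plain round-robin stand-in, and the names `assign_tasks` and `rebalance` as well as the task and container IDs are hypothetical. It only shows what it means for tasks to be redistributed when the set of live containers changes.

```python
from collections import defaultdict


def assign_tasks(tasks, containers):
    """Spread tasks over live containers in round-robin order.

    Hypothetical stand-in for a rebalancer; Helix itself is not used here.
    """
    if not containers:
        raise ValueError("at least one live container is required")
    live = sorted(containers)
    assignment = defaultdict(list)
    for i, task in enumerate(sorted(tasks)):
        assignment[live[i % len(live)]].append(task)
    return dict(assignment)


def rebalance(assignment, live_containers):
    """Recompute the assignment after containers join or leave."""
    tasks = [task for assigned in assignment.values() for task in assigned]
    return assign_tasks(tasks, live_containers)


# Two containers share four tasks; then one container is lost.
initial = assign_tasks(["t1", "t2", "t3", "t4"], ["c1", "c2"])
after_loss = rebalance(initial, ["c1"])
```

In this sketch every membership change triggers a full reassignment. A real coordinator would typically try to minimize task movement, which is omitted here for brevity.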
6. Log Aggregation
• Containers are log sources
• Logs get streamed to HDFS and further to the driver
[Diagram: Client/Driver, ApplicationMaster, Container, Container, HDFS]
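A rough sketch of the aggregation step, assuming each container produces log entries already ordered by timestamp and that the driver wants a single time-ordered view. The function name `aggregate_logs`, the `(timestamp, message)` entry shape, and the container IDs are assumptions for illustration; the HDFS hop described on the slide is not modeled, and this is not Gobblin's implementation.

```python
import heapq


def aggregate_logs(container_logs):
    """Merge per-container log streams into one time-ordered list.

    container_logs maps a container ID to a list of (timestamp, message)
    tuples, each list already sorted by timestamp (an assumption).
    """
    streams = (
        [(ts, cid, msg) for ts, msg in entries]
        for cid, entries in container_logs.items()
    )
    # heapq.merge lazily merges already-sorted inputs.
    return [f"{ts} [{cid}] {msg}" for ts, cid, msg in heapq.merge(*streams)]


logs = {
    "container_1": [(1, "start"), (4, "done")],
    "container_2": [(2, "start"), (3, "commit")],
}
merged = aggregate_logs(logs)
```

Tagging each line with its container ID keeps the source identifiable after the streams are interleaved.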
8. Deployment @ LinkedIn
• Dark launch for a few data sources
– Running side by side with production instances running on MR
• Planned to migrate more data sources in Q1 2016
9. Future Work
• AM and container restart handling
• Log retention management
• Monitoring and reporting
• Runtime cluster resizing
10. Thank You
• https://github.com/linkedin/gobblin/wiki/Gobblin-on-Yarn
• https://groups.google.com/forum/#!forum/gobblin-users