Azkaban	in	my	use	case
2017/03/09
@wyukawa
Workflow	Engines	Meetup	#1
#wfemeetup
https://connpass.com/event/50900/
Azkaban
• Implemented at LinkedIn to solve the problem of Hadoop job dependencies
• Written in Java
– Not modern Java (raw servlets, Velocity, ...)
Azkaban features
• Simple job management tool
– Define job dependencies
– Retry
– Scheduling
– Web UI
• See dependencies / execution times / logs
• Stores logs in the DB as blobs
– SPOF
– No holiday calendar support
– Not triggered by file-creation events
• Mail notification only
– HTTP Job Callback
• No binary releases
– Need to build from source
• Development is not very active
• The mailing list doesn't function very well
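The HTTP Job Callback mentioned above is configured per job through notification properties. A sketch of what such properties might look like, assuming the `job.notification.<status>.<sequence>.<field>` naming pattern (the URLs and body are placeholders; check the exact property names against the Azkaban documentation):

```properties
# foo.job (callback properties, illustrative values only)
job.notification.success.1.url=http://example.com/azkaban/success
job.notification.failure.1.url=http://example.com/azkaban/failure
job.notification.failure.1.method=POST
job.notification.failure.1.body={"job":"foo","status":"failed"}
```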
Job	File
# foo.job
type=command
command=echo foo
retries=1
retry.backoff=300000
# bar.job
type=command
dependencies=foo
command=echo bar
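Job files like these are packaged into a zip and uploaded to an Azkaban project, either through the Web UI or the AJAX API (the speaker's eboshi client wraps this kind of call). A minimal sketch in Python that only builds the zip archive and the `ajax=upload` form fields, without performing the HTTP call; the project name and session id are placeholder values:

```python
import io
import zipfile

def build_upload_form(project, session_id):
    """Form fields for Azkaban's ajax=upload endpoint on /manager."""
    return {
        "session.id": session_id,  # obtained from ajax=login beforehand
        "ajax": "upload",
        "project": project,
    }

def pack_flow(job_files):
    """Zip a {filename: contents} dict in memory, as Azkaban expects."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in job_files.items():
            zf.writestr(name, text)
    return buf.getvalue()

jobs = {
    "foo.job": "type=command\ncommand=echo foo\nretries=1\nretry.backoff=300000\n",
    "bar.job": "type=command\ndependencies=foo\ncommand=echo bar\n",
}
archive = pack_flow(jobs)
form = build_upload_form("myproject", "placeholder-session-id")  # hypothetical values
```

The real request would be a multipart POST of the archive plus these fields to `http://<azkaban-host>/manager`.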
Job	History
Scheduling
Failure	Options
• Finish Current Running
– Finishes only the currently running jobs; no new jobs are started.
• Cancel All
– Immediately kills all jobs and fails the flow.
• Finish All Possible
– Keeps executing jobs as long as their dependencies are met.
Difference when ccc fails
Why isn't "Finish All Possible" the default?
Re-running a failed flow
• Users can re-execute only the failed jobs by pushing the "Prepare Execution" button. It's convenient!
Concurrent Execution Options
• Skip Execution
– Do not run the flow if it is already running.
• Run Concurrently
– Run the flow anyway; the previous execution is unaffected.
• Pipeline
– Block jobs in the new execution until the previous execution has progressed past them.
SLA Notification
• If the duration threshold is exceeded, an alert email can be sent or the flow can be killed automatically.
Flow parameters
• Parameters (for example, a date) can be set when Azkaban executes a flow
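Such parameters can also be passed when triggering a flow over the API: Azkaban's `ajax=executeFlow` endpoint on `/executor` accepts `flowOverride[key]=value` entries. A minimal sketch that only builds the query parameters (the session id, project, and flow names are hypothetical):

```python
def execute_flow_params(session_id, project, flow, **overrides):
    """Query parameters for GET /executor?ajax=executeFlow.

    Flow parameters such as a target date are passed as
    flowOverride[<name>]=<value> entries.
    """
    params = {
        "session.id": session_id,
        "ajax": "executeFlow",
        "project": project,
        "flow": flow,
    }
    for key, value in overrides.items():
        params["flowOverride[%s]" % key] = value
    return params

params = execute_flow_params("sess-1234", "myproject", "bar", date="20170309")
```

A real client would send these with something like `requests.get("http://<azkaban-host>/executor", params=params)`.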
Q/A
My use case
• Use Azkaban to manage Hadoop jobs
– Batches are written in Python
• Use the Azkaban API
– I created a client: https://github.com/wyukawa/eboshi
– Scheduling information is committed to GHE
• Writing job files is painful
– I created a generation tool: https://github.com/wyukawa/ayd
– Generates one flow from one YAML file
Python batch example
def validate_before(self):
    hive.exists("access_log", "yyyymmdd='%s'" % (...))

def process(self):
    insert_query = """INSERT OVERWRITE TABLE aggregate PARTITION(yyyymmdd='%s')
        SELECT ... FROM access_log WHERE ... GROUP BY ...""" % (...)
    hiveCli.query(insert_query)

def validate_after(self):
    hive.exists("aggregate", "yyyymmdd='%s'" % (...))
YAML example
foo:
  type: command
  command: echo "foo"
  retries: 1
  retry.backoff: 300000
bar:
  type: command
  command: echo "bar"
  dependencies: foo
  retries: 1
  retry.backoff: 300000
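A generator like ayd can be sketched as a small transformation from the loaded YAML mapping to one .job file per job. This is an illustration, not ayd's actual implementation; in practice the mapping below would come from `yaml.safe_load` on the flow file:

```python
def to_job_file(name, props):
    """Render one job's mapping in Azkaban's key=value .job format."""
    lines = ["# %s.job" % name]
    lines += ["%s=%s" % (key, value) for key, value in props.items()]
    return "\n".join(lines) + "\n"

# Mirrors the YAML example above
# (in practice: flow = yaml.safe_load(open("flow.yaml")))
flow = {
    "foo": {"type": "command", "command": 'echo "foo"',
            "retries": 1, "retry.backoff": 300000},
    "bar": {"type": "command", "command": 'echo "bar"',
            "dependencies": "foo", "retries": 1, "retry.backoff": 300000},
}
job_files = {name: to_job_file(name, props) for name, props in flow.items()}
```

Each rendered string would then be written out as `<name>.job` and zipped for upload.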
Job	Management	Overview
[Diagram: git push, push button, git pull, generate job file, upload job, register schedule]
Log	Analysis	Platform
[Architecture diagram: Hadoop and Hive from HDP 2.5.3, Azkaban 3.15.0-1-g77411d7, Presto 0.166, Cognos, Prestogres, Netezza, DBDB, ETL with Python 2.7.13, InfiniDB, Pentaho, Saiku]
My usage situation
• More than 120 Azkaban flows
• Many daily batches; a few hourly, weekly, and monthly batches
• Most flows are related to Hive
• Azkaban runs on the batch server
• I prepare template Azkaban flows to re-aggregate past data
– Job name and date are set as parameters
– "Run Concurrently" is set
• I don't use SLA, but I may in the future
– https://github.com/azkaban/azkaban/pull/911
• I don't use HTTP Job Callback
– I use HipChat notifications in the Python ETL
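The HipChat notification from the Python ETL can be sketched as a plain HTTP POST to HipChat's v2 room-notification endpoint. The room id and token below are placeholders, and this only builds the request pieces; real code would send them with e.g. `requests.post`:

```python
import json

def hipchat_notification(room_id, token, message, color="red"):
    """URL, headers, and JSON body for HipChat's v2
    "send room notification" API."""
    url = "https://api.hipchat.com/v2/room/%s/notification" % room_id
    headers = {
        "Authorization": "Bearer %s" % token,
        "Content-Type": "application/json",
    }
    body = json.dumps({"message": message, "color": color,
                       "message_format": "text"})
    return url, headers, body

url, headers, body = hipchat_notification(12345, "secret-token",
                                          "batch bar failed")
```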
My impressions
• Simple
• Easy to use
• The Web UI is convenient
• The API is useful
• There is no reason to replace Azkaban
• I hope development becomes more active
Podcast
• https://itunes.apple.com/jp/podcast/wyukawas-podcast/id1152456701
• http://wyukawa.tumblr.com/
