Azkaban	in	my	use	case
2017/03/09
@wyukawa
Workflow	Engines	Meetup	#1
#wfemeetup
https://connpass.com/event/50900/
Azkaban
• Implemented at LinkedIn to solve the problem of Hadoop job dependencies
• Written in Java
– Not modern Java (raw servlets, Velocity, ...)
Azkaban features
• Simple job management tool
– Define job dependencies
– Retry
– Scheduling
– Web UI
• See dependencies / execution times / logs
• Stores logs in the DB as blobs
– SPOF
– No holiday calendar support
– Not triggered by file-creation events
• Mail notification only
– HTTP Job Callback
• No binary releases
– Need to build from source
• Development is not very active
• The mailing list doesn't function very well
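The HTTP Job Callback mentioned above is configured per job through notification properties. A sketch of what such properties might look like, assuming the `job.notification.<status>.<sequence>.<field>` naming pattern (the URLs and body are placeholders; check the exact property names against the Azkaban documentation):

```properties
# foo.job (callback properties, illustrative values only)
job.notification.success.1.url=http://example.com/azkaban/success
job.notification.failure.1.url=http://example.com/azkaban/failure
job.notification.failure.1.method=POST
job.notification.failure.1.body={"job":"foo","status":"failed"}
```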
Job	File
# foo.job
type=command
command=echo foo
retries=1
retry.backoff=300000
# bar.job
type=command
dependencies=foo
command=echo bar
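Job files like these are packaged into a zip and uploaded to an Azkaban project, either through the Web UI or the AJAX API (the speaker's eboshi client wraps this kind of call). A minimal sketch in Python that only builds the zip archive and the `ajax=upload` form fields, without performing the HTTP call; the project name and session id are placeholder values:

```python
import io
import zipfile

def build_upload_form(project, session_id):
    """Form fields for Azkaban's ajax=upload endpoint on /manager."""
    return {
        "session.id": session_id,  # obtained from ajax=login beforehand
        "ajax": "upload",
        "project": project,
    }

def pack_flow(job_files):
    """Zip a {filename: contents} dict in memory, as Azkaban expects."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in job_files.items():
            zf.writestr(name, text)
    return buf.getvalue()

jobs = {
    "foo.job": "type=command\ncommand=echo foo\nretries=1\nretry.backoff=300000\n",
    "bar.job": "type=command\ndependencies=foo\ncommand=echo bar\n",
}
archive = pack_flow(jobs)
form = build_upload_form("myproject", "placeholder-session-id")  # hypothetical values
```

The real request would be a multipart POST of the archive plus these fields to `http://<azkaban-host>/manager`.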
Job	History
Scheduling
Failure	Options
• Finish Current Running
– Finishes only the currently running jobs; no new jobs are started.
• Cancel All
– Immediately kills all jobs and fails the flow.
• Finish All Possible
– Keeps executing jobs as long as their dependencies are met.
Difference when ccc fails
Why isn't "Finish All Possible" the default?
Re-running a failed flow
• Users can re-execute only the failed jobs by pushing the "Prepare Execution" button. It's convenient!
Concurrent Execution Options
• Skip Execution
– Do not run the flow if it is already running.
• Run Concurrently
– Run the flow anyway; the previous execution is unaffected.
• Pipeline
– Block jobs in the new execution until the previous execution has progressed past them.
SLA Notification
• If the duration threshold is exceeded, an alert email can be sent or the flow can be killed automatically.
Flow parameters
• Parameters (for example, a date) can be set when Azkaban executes a flow
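Such parameters can also be passed when triggering a flow over the API: Azkaban's `ajax=executeFlow` endpoint on `/executor` accepts `flowOverride[key]=value` entries. A minimal sketch that only builds the query parameters (the session id, project, and flow names are hypothetical):

```python
def execute_flow_params(session_id, project, flow, **overrides):
    """Query parameters for GET /executor?ajax=executeFlow.

    Flow parameters such as a target date are passed as
    flowOverride[<name>]=<value> entries.
    """
    params = {
        "session.id": session_id,
        "ajax": "executeFlow",
        "project": project,
        "flow": flow,
    }
    for key, value in overrides.items():
        params["flowOverride[%s]" % key] = value
    return params

params = execute_flow_params("sess-1234", "myproject", "bar", date="20170309")
```

A real client would send these with something like `requests.get("http://<azkaban-host>/executor", params=params)`.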
Q/A
My use case
• Use Azkaban to manage Hadoop jobs
– Batches are written in Python
• Use the Azkaban API
– I created a client: https://github.com/wyukawa/eboshi
– Scheduling information is committed to GHE
• Writing job files is painful
– I created a generation tool: https://github.com/wyukawa/ayd
– Generates one flow from one YAML file
Python batch example
def validate_before(self):
    hive.exists("access_log", "yyyymmdd='%s'" % (...))

def process(self):
    insert_query = """INSERT OVERWRITE TABLE aggregate PARTITION(yyyymmdd='%s')
        SELECT ... FROM access_log WHERE ... GROUP BY ...""" % (...)
    hiveCli.query(insert_query)

def validate_after(self):
    hive.exists("aggregate", "yyyymmdd='%s'" % (...))
YAML example
foo:
  type: command
  command: echo "foo"
  retries: 1
  retry.backoff: 300000
bar:
  type: command
  command: echo "bar"
  dependencies: foo
  retries: 1
  retry.backoff: 300000
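A generator like ayd can be sketched as a small transformation from the loaded YAML mapping to one .job file per job. This is an illustration, not ayd's actual implementation; in practice the mapping below would come from `yaml.safe_load` on the flow file:

```python
def to_job_file(name, props):
    """Render one job's mapping in Azkaban's key=value .job format."""
    lines = ["# %s.job" % name]
    lines += ["%s=%s" % (key, value) for key, value in props.items()]
    return "\n".join(lines) + "\n"

# Mirrors the YAML example above
# (in practice: flow = yaml.safe_load(open("flow.yaml")))
flow = {
    "foo": {"type": "command", "command": 'echo "foo"',
            "retries": 1, "retry.backoff": 300000},
    "bar": {"type": "command", "command": 'echo "bar"',
            "dependencies": "foo", "retries": 1, "retry.backoff": 300000},
}
job_files = {name: to_job_file(name, props) for name, props in flow.items()}
```

Each rendered string would then be written out as `<name>.job` and zipped for upload.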
Job	Management	Overview
[Diagram: git push, push button, git pull, generate job file, upload job, register schedule]
Log	Analysis	Platform
[Architecture diagram: Hadoop and Hive from HDP 2.5.3, Azkaban 3.15.0-1-g77411d7, Presto 0.166, Cognos, Prestogres, Netezza, DBDB, ETL with Python 2.7.13, InfiniDB, Pentaho, Saiku]
My usage situation
• More than 120 Azkaban flows
• Many daily batches; a few hourly, weekly, and monthly batches
• Most flows are related to Hive
• Azkaban runs on the batch server
• I prepare template Azkaban flows to re-aggregate past data
– Job name and date are set as parameters
– "Run Concurrently" is set
• I don't use SLA, but I may in the future
– https://github.com/azkaban/azkaban/pull/911
• I don't use HTTP Job Callback
– I use HipChat notifications in the Python ETL
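The HipChat notification from the Python ETL can be sketched as a plain HTTP POST to HipChat's v2 room-notification endpoint. The room id and token below are placeholders, and this only builds the request pieces; real code would send them with e.g. `requests.post`:

```python
import json

def hipchat_notification(room_id, token, message, color="red"):
    """URL, headers, and JSON body for HipChat's v2
    "send room notification" API."""
    url = "https://api.hipchat.com/v2/room/%s/notification" % room_id
    headers = {
        "Authorization": "Bearer %s" % token,
        "Content-Type": "application/json",
    }
    body = json.dumps({"message": message, "color": color,
                       "message_format": "text"})
    return url, headers, body

url, headers, body = hipchat_notification(12345, "secret-token",
                                          "batch bar failed")
```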
My impressions
• Simple
• Easy to use
• The Web UI is convenient
• The API is useful
• There is no reason to replace Azkaban
• I hope development becomes more active
Podcast
• https://itunes.apple.com/jp/podcast/wyukawas-podcast/id1152456701
• http://wyukawa.tumblr.com/
