LINE’s	log	analysis	platform
2017/12/07
Wataru Yukawa(@wyukawa)
#StrataData
Who	am	I?
• Data	engineer	at	LINE
• First	time	to	Singapore!
• Maintain	an	on-premises	log	analysis	platform	on	top	of	
Hadoop/Hive/Presto/Azkaban
• LINE has many Hadoop clusters, which have different roles
• Today I would like to share a use case from one of them
• I will talk about 3 years of history around one Hadoop cluster
LINE
• LINE	makes	a	messaging	application	of	the	same	name,	in	
addition	to	other	related	services
• It	is	the	most	popular	messaging	platform	in	Japan
LINE
about	3	years	ago
• We needed a new analytics department
• many LINE Family services
– LINE Fortune
– LINE Manga
– etc.
• demand for LINE Family services analytics increased
• we created a new analytics department in 2014/05
analytics	department
• data	engineer
– implement	batch
– maintain	hadoop
• data	planner
– communicate	with	service	planner
– design	KPI
– create	report
– execute ad hoc queries to extract data, for example for campaigns
Agenda
• log	analysis	platform	overview	in	2014
• Prestogres
• Presto/Hive web UI (yanagishima)
• upgrade Hadoop cluster
• log	analysis	platform	overview	in	2017
Log Analysis Platform (2014)
• Hadoop/Hive (HDP 2.1)
• Azkaban 2.6
• Presto 0.75
• Cognos 10
• MySQL 5.5
• Python 2.7.7 batch (sqoop, hive, etc.)
• Shib
batch
• written in Python
• executes sqoop, hive, etc.
• mostly uses the Hive CLI because HiveServer2 is not so stable
• we created a thin Python batch framework
– python bin/main.py -d 20171207 hoge
• supports dry run
– prints the Hive query instead of executing it
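A minimal sketch of such a thin batch runner with a dry-run flag. The query, the table name, and the CLI shape are hypothetical illustrations, not the actual framework:

```python
import argparse
import subprocess

def build_query(date):
    # Hypothetical daily aggregation query; the real batches run
    # sqoop, hive, etc.
    return f"SELECT count(*) FROM access_log WHERE dt = '{date}'"

def main(argv=None):
    parser = argparse.ArgumentParser(description="thin Hive batch runner")
    parser.add_argument("-d", "--date", required=True,
                        help="target date, e.g. 20171207")
    parser.add_argument("--dry-run", action="store_true",
                        help="print the Hive query instead of executing it")
    args = parser.parse_args(argv)

    query = build_query(args.date)
    if args.dry_run:
        print(query)  # dry run: show the query only, do not execute
        return query
    # Invoke the Hive CLI (chosen over HiveServer2 for stability).
    subprocess.run(["hive", "-e", query], check=True)
    return query

if __name__ == "__main__":
    main()
```

Keeping the query construction separate from execution is what makes the dry-run mode trivial to support.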
Azkaban	use	case
• Use Azkaban to manage jobs
• Use the Azkaban API
– I created a client: https://github.com/wyukawa/eboshi
– commit scheduling information to GHE
• Writing job files by hand is painful because there are so many
– I created a generation tool: https://github.com/wyukawa/ayd
– it generates 1 flow (many jobs) from 1 YAML file
Azkaban	Job	File
#	foo.job
type=command
command=echo	foo
retries=1
retry.backoff=300000
#	bar.job
type=command
dependencies=foo
command=echo	bar
Azkaban	flow
YAML example
foo:
  type: command
  command: echo "foo"
  retries: 1
  retry.backoff: 300000
bar:
  type: command
  command: echo "bar"
  dependencies: foo
  retries: 1
  retry.backoff: 300000
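The generation idea can be sketched like this: turn each YAML job entry into one `key=value` `.job` file. This is a toy version under the same input shape, not ayd's actual implementation, and it assumes PyYAML is available:

```python
import yaml  # PyYAML

def generate_job_files(yaml_text):
    """Return {filename: contents} with one Azkaban .job file per YAML job."""
    flow = yaml.safe_load(yaml_text)
    files = {}
    for job_name, props in flow.items():
        # Each YAML key/value becomes one "key=value" line in the job file.
        lines = [f"{key}={value}" for key, value in props.items()]
        files[f"{job_name}.job"] = "\n".join(lines) + "\n"
    return files

example = """
foo:
  type: command
  command: echo "foo"
bar:
  type: command
  command: echo "bar"
  dependencies: foo
"""

for name, body in generate_job_files(example).items():
    print(f"# {name}")
    print(body)
```

Because Azkaban infers the flow from the `dependencies` lines, one YAML file is enough to describe one flow of many jobs.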
Job Management Overview
• git push → push button → git pull → generate job files → upload job → register schedule → execute job
Azkaban	usage	situation
• More	than	150	Azkaban	flows
• Many	daily	batches,	some	hourly,	weekly,	
monthly	batches
• Most	flows	are	related	to	hive
• I prepared template Azkaban flows to re-aggregate past data, because there is no backfill feature
Cognos
• Commercial	BI	tool	by	IBM
• rich	authorization	management
• flexible	reporting
Cognos sample	report
Presto
• distributed SQL query engine
• fast, with many useful UDFs
Upgrade Presto
• easy to upgrade thanks to its stateless architecture
• but we sometimes needed to roll back
– 0.101	https://github.com/prestodb/presto/pull/2834
– 0.108	https://github.com/prestodb/presto/pull/3212
• query	stuck
• revert	commit
– 0.113	https://github.com/prestodb/presto/pull/3400
– 0.148	https://github.com/prestodb/presto/pull/5612
• memory	error
– 0.189	https://github.com/prestodb/presto/issues/9354
• empty	ORC	file	is	not	supported
How	do	we use	Presto?
• batches use Hive for its fault tolerance
• Presto is fast, but currently has limited fault-tolerance capabilities
• ad hoc Presto queries are executed via Shib
– Shib is a web UI for Presto/Hive
– https://github.com/tagomoris/shib
Shib
Agenda
• log	analysis	platform	overview	in	2014
• Prestogres
• Presto/Hive web UI (yanagishima)
• upgrade Hadoop cluster
• log	analysis	platform	overview	in	2017
Log Analysis Platform (2014)
• Hadoop/Hive (HDP 2.1)
• Azkaban 2.6
• Presto 0.75
• Cognos 10
• MySQL 5.5
• Python 2.7.7 batch (sqoop, hive, etc.)
• Shib
MySQL
• aggregate data into MySQL
• easy to connect to Cognos
• but MySQL is not a good fit for analytics
• no window functions
• MySQL becomes a bottleneck because it doesn't scale out
• Presto has many useful UDFs and window functions
• we wanted to reduce the maintenance cost of multiple storage systems
• so we wanted Cognos to connect to Presto
• but it was hard to connect Cognos to Presto in 2014 because the Presto JDBC driver was immature
What is Prestogres?
• a gateway that lets BI tools speak the PostgreSQL protocol to Presto
• BI tool → patched pgpool-II → PostgreSQL, where PL/Python functions forward queries to Presto
Prestogres use case
• Cognos 10 → PostgreSQL JDBC driver → Prestogres → Presto
Log Analysis Platform (2015)
• Hadoop/Hive (HDP 2.1)
• Azkaban 2.6
• Presto 0.89
• Cognos 10
• Prestogres
• ETL with Python 2.7.7
• Shib
Presto view	
• about 400 Presto views
• Presto views don't need ETL
• data planners create Presto views
• Cognos refers to Presto views
• we have 2 Presto view check systems
– execute select … from … limit 1 on all Presto views every day
• makes it easy to find problems when we upgrade Presto
– compare the DDL in GitHub to the existing Presto views
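The first check system can be sketched as a daily smoke test: run a `LIMIT 1` query against every view and collect the failures. Here `run_query` is a stand-in for whatever client actually executes the query (for example the Presto CLI or REST API); view names are hypothetical:

```python
def check_views(views, run_query):
    """Run 'SELECT * FROM <view> LIMIT 1' on every view and return
    the (view, error) pairs that failed the smoke test."""
    failures = []
    for view in views:
        try:
            run_query(f"SELECT * FROM {view} LIMIT 1")
        except Exception as exc:  # a broken view surfaces here
            failures.append((view, str(exc)))
    return failures
```

Run right after a Presto upgrade, a non-empty failure list immediately points at the views that the new version broke.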
presto	tool
• https://github.com/wyukawa/presto-woothee
– a UDF to parse user agents
• https://github.com/wyukawa/presto-fluentd
– sends Presto query events to Fluentd
• we use presto-fluentd to send query logs to Hadoop
• so we can analyze the query log with Presto itself
Prestogres current status
• Prestogres was the best choice for us 3 years ago
• currently, Prestogres is obsolete
• so we plan to upgrade Cognos so that it connects to Presto directly, without Prestogres
Agenda
• log	analysis	platform	overview	in	2014
• Prestogres
• Presto/Hive web UI (yanagishima)
• upgrade Hadoop cluster
• log	analysis	platform	overview	in	2017
yanagishima
• Presto/Hive	web	UI
• started in 2015 so that data planners can execute ad hoc queries more easily
• a UI expert joined in 2017
• easy to use/install
• share queries via permanent links
• chart
• handle	multiple	clusters
• https://github.com/yanagishima/yanagishima
yanagishima demo	movie
yanagishima use	case
• check	data
• ad	hoc	query
• share	query
• create	presto	view
• about	100	DAU	in	LINE
new	yanagishima feature
• timeline tab
• users can comment on queries and share them in the timeline tab
• a social feature
• will be available in the next version
Agenda
• log	analysis	platform	overview	in	2014
• Prestogres
• Presto/Hive web UI (yanagishima)
• upgrade Hadoop cluster
• log	analysis	platform	overview	in	2017
Log Analysis Platform (2016)
• Hadoop/Hive (HDP 2.1)
• Azkaban 3.0
• Presto 0.147
• Cognos 10
• Prestogres
• Python 2.7.11 batch (sqoop, hive, etc.)
• yanagishima
2016
• 2 years had passed since we created the log analysis platform
• Presto and Azkaban were easy to upgrade
• but our Hadoop version had become old
• we used HDP 2.1 (Hadoop 2.4)
• the latest version at that time was HDP 2.5 (Hadoop 2.7)
• the machines' warranty period was to expire in 2017/06
• we needed to upgrade Hadoop on new machines
new	Machine	spec,	Hadoop	version
• Machines
– 40 servers (same as the old Hadoop cluster)
– CPU: 40 processors (24 in old)
– Memory: 256GB (64GB in old)
– HDD: 6.1TB x 12 (3.6TB x 12 in old)
– Network: 10Gbps (1Gbps in old)
• HDP2.5.3(Ambari 2.4.2)
– Hadoop	2.7.3
• NameNode HA
• ResourceManager HA
– Hive	1.2.1
• MapReduce
• Tez
How to upgrade Hadoop
• Set up a new Hadoop cluster on the new machines
• Blue-green deployment, switching all at once
• Migrate data with distcp (-m 20 -bandwidth 125)
– copied 500TB (the first copy took about 3 days)
• don't execute batches on both Hadoop clusters in parallel
distcp with	HDFS	Snapshot
• HDFS Snapshot is a useful feature because batches keep adding data during distcp
• the -update -diff options don't support webhdfs://orig/...
– edit hdfs-site.xml on the destination Hadoop cluster and use hdfs://orig/...
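The incremental round boils down to a distcp invocation between two source snapshots. A sketch of how that command is assembled, with the flags we used (-m 20 -bandwidth 125); the snapshot names and paths below are placeholders:

```python
def build_distcp_cmd(src_uri, dst_path, from_snap, to_snap):
    """Build a distcp command that copies only the delta between two
    snapshots of the source directory.
    Note: -update -diff requires an hdfs:// source URI;
    webhdfs:// is not supported for this mode."""
    return ["hadoop", "distcp",
            "-update", "-diff", from_snap, to_snap,
            "-m", "20", "-bandwidth", "125",
            src_uri, dst_path]

# e.g. one incremental round after the initial full copy:
print(" ".join(build_distcp_cmd(
    "hdfs://orig/data", "/data", "s20171206", "s20171207")))
```

Each round takes a fresh snapshot on the source, then copies only what changed since the previous snapshot, so the data added by running batches is picked up without re-copying 500TB.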
Migrate	Hive	schema
• Use the SHOW CREATE TABLE command
• Use the MSCK REPAIR command to add partitions
– but it didn't work on tables with too many (for example, 4000) partitions
• Use webhdfs://... in external tables
– hdfs://… can't be used
– but Presto returns empty results when you select from them
– you need to add jersey-bundle-1.19.3.jar due to a NoClassDefFoundError
– https://groups.google.com/forum/#!topic/presto-users/HXMW4XtmYf8
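When MSCK REPAIR gives up on a table with thousands of partitions, one workaround is to add the partitions explicitly in batches of ALTER TABLE statements. A sketch of that idea; the table name and partition key (dt) are hypothetical:

```python
def add_partition_statements(table, dates, batch_size=100):
    """Yield ALTER TABLE ... ADD PARTITION statements, adding at most
    batch_size partitions per statement, as a fallback for MSCK REPAIR."""
    for i in range(0, len(dates), batch_size):
        parts = " ".join(
            f"PARTITION (dt='{d}')" for d in dates[i:i + batch_size])
        yield f"ALTER TABLE {table} ADD IF NOT EXISTS {parts}"
```

ADD IF NOT EXISTS makes the statements safe to re-run, so a half-finished migration can simply be restarted.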
HDFS/YARN/Hive/Sqoop settings
• disable hdfs-audit.log due to many ad hoc queries
• dfs.datanode.failed.volumes.tolerated=1
• fs.trash.interval=4320
• Namenode heap	64GB
• yarn.nodemanager.resource.memory-mb 100GB
• yarn.scheduler.maximum-allocation-mb 100GB
• Use	DominantResourceCalculator
• hive.server2.authentication=NOSASL
• hive.server2.enable.doAs=false
• hive.auto.convert.join=false
• hive.support.sql11.reserved.keywords=false
• org.apache.sqoop.splitter.allow_text_splitter=true
• Sometimes	use	Tez
monitoring
• Ambari Metrics
• Prometheus
– monitor machine metrics (HDD/memory/CPU, slab, TIME_WAIT, entropy, etc.) with node_exporter
• Alertmanager
• Grafana
• Promgen
– Promgen is	a	configuration	file	generator	for	Prometheus
– https://github.com/line/promgen
My feeling about upgrading Hadoop
• If you upgrade Hadoop with many batches (for example, more than 100 Azkaban flows), many errors will occur the next day
• You can't confirm the results on the new Hadoop immediately, because batches run on a schedule
– I highly recommend upgrading in the first half of the week. We upgraded on a Tuesday.
– If you upgrade on a Friday, you will work on Saturday.
– share jobs with your colleagues to address batch errors
• If you do this kind of job alone, you will be overwhelmed
Log Analysis Platform (2017)
• Hadoop/Hive (HDP 2.5.3)
• Azkaban 3.37.0
• Presto 0.188
• Cognos 10
• Prestogres
• Python 2.7.13 batch (sqoop, hive, etc.)
• yanagishima
recap
• shared the 3-year journey of LINE's log analysis platform
• batch
• Cognos
• yanagishima
• upgrading the Hadoop cluster
• we really appreciate OSS products and communities
Any	questions?
