
Strata2017 sg


https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/62948


  1. 1. LINE’s log analysis platform 2017/12/07 Wataru Yukawa (@wyukawa) #StrataData
  2. 2. Who am I? • Data engineer at LINE • First time in Singapore! • I maintain an on-premises log analysis platform on top of Hadoop/Hive/Presto/Azkaban • LINE has many Hadoop clusters with different roles • Today I would like to share a use case from one of them • I will talk about three years of history with this Hadoop cluster
  3. 3. LINE • LINE makes a messaging application of the same name, in addition to other related services • It is the most popular messaging platform in Japan
  4. 4. LINE
  5. 5. about 3 years ago • We needed a new analytics department • LINE has many Family services – LINE Fortune – LINE Manga – etc. • Demand for analytics of the LINE Family services was increasing • We created a new analytics department in May 2014
  6. 6. analytics department • data engineers – implement batches – maintain Hadoop • data planners – communicate with service planners – design KPIs – create reports – execute ad hoc queries to extract data, for example for campaigns
  7. 7. Agenda • log analysis platform overview in 2014 • Prestogres • Presto/Hive web UI (yanagishima) • upgrading the Hadoop cluster • log analysis platform overview in 2017
  8. 8. Log Analysis Platform (2014) – diagram: Hadoop/Hive on HDP 2.1, Azkaban 2.6, Presto 0.75, Cognos 10, MySQL 5.5 (multiple DBs), Python 2.7.7, Shib, batch (Sqoop, Hive, etc.)
  9. 9. batch • written in Python • executes Sqoop, Hive, etc. • we mostly use the Hive CLI because HiveServer2 is not very stable • we created a thin Python batch framework – python bin/main.py -d 20171207 hoge • it supports dry runs – print the Hive query instead of executing it
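The framework itself is not shown in the talk, so here is a minimal sketch of the dry-run idea only: the -d option, the job argument, and the Hive CLI usage come from the slide, while the --dry-run flag and the query body are placeholders.

    # main.py - minimal sketch of a thin batch framework with dry run
    # (not LINE's actual implementation)
    import argparse
    import subprocess

    def run_hive(query, dry_run):
        if dry_run:
            print(query)  # dry run: print the Hive query, do not execute
        else:
            subprocess.call(["hive", "-e", query])  # use the Hive CLI

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("-d", dest="date", required=True)  # e.g. 20171207
        parser.add_argument("job")                             # e.g. hoge
        parser.add_argument("--dry-run", action="store_true")
        args = parser.parse_args()
        # placeholder query; real jobs would load their own SQL
        query = "select count(*) from %s where dt = '%s'" % (args.job, args.date)
        run_hive(query, args.dry_run)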
  10. 10. Azkaban use case • We use Azkaban to manage jobs • We use the Azkaban API – I created a client: https://github.com/wyukawa/eboshi – we commit scheduling information to GHE • Writing job files by hand is painful because there are so many – I created a generation tool: https://github.com/wyukawa/ayd – it generates 1 flow (many jobs) from 1 YAML file
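A client like eboshi wraps the Azkaban AJAX API; for illustration, here is a sketch of two raw calls (login and fetchprojectflows, as described in the Azkaban documentation) using the requests library. The host, credentials, and project name are placeholders.

    # Sketch of calling the Azkaban AJAX API directly (not the eboshi client)
    import requests

    AZKABAN_URL = "http://azkaban.example.com:8081"  # placeholder host

    # log in and obtain a session id
    login = requests.post(AZKABAN_URL, data={
        "action": "login",
        "username": "azkaban",
        "password": "secret",
    })
    session_id = login.json()["session.id"]

    # fetch the flows of a project
    flows = requests.get(AZKABAN_URL + "/manager", params={
        "ajax": "fetchprojectflows",
        "project": "my-project",
        "session.id": session_id,
    })
    print(flows.json())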
  11. 11. Azkaban Job File

    # foo.job
    type=command
    command=echo foo
    retries=1
    retry.backoff=300000

    # bar.job
    type=command
    dependencies=foo
    command=echo bar

  (Azkaban flow diagram: foo → bar)
  12. 12. Yaml example

    foo:
      type: command
      command: echo "foo"
      retries: 1
      retry.backoff: 300000

    bar:
      type: command
      command: echo "bar"
      dependencies: foo
      retries: 1
      retry.backoff: 300000
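The generation step is simple to picture: each top-level YAML key becomes one .job file. This is a minimal sketch of that idea, not the actual ayd implementation.

    # generate_jobs.py - sketch of generating Azkaban .job files
    # from one YAML flow definition (the idea behind ayd)
    import sys
    import yaml  # PyYAML

    def generate_job_files(yaml_path, out_dir="."):
        with open(yaml_path) as f:
            flow = yaml.safe_load(f)
        # one .job file per top-level YAML key
        for job_name, props in flow.items():
            with open("%s/%s.job" % (out_dir, job_name), "w") as job_file:
                for key, value in props.items():
                    job_file.write("%s=%s\n" % (key, value))

    if __name__ == "__main__":
        generate_job_files(sys.argv[1])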
  13. 13. Job Management Overview (flow diagram: git push → git pull → generate job files → upload jobs → register schedule / push button → execute jobs)
  14. 14. Azkaban usage situation • More than 150 Azkaban flows • Many daily batches, plus some hourly, weekly, and monthly batches • Most flows are related to Hive • I prepared template Azkaban flows to re-aggregate past data because Azkaban has no backfill (a sketch of the idea follows)
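Since Azkaban has no backfill, re-aggregation means re-running the daily batch once per past date. A minimal sketch, reusing the bin/main.py invocation from slide 9; the date range and job name are hypothetical.

    # backfill.py - sketch of re-aggregating past data by date
    import subprocess
    from datetime import date, timedelta

    start, end = date(2017, 11, 1), date(2017, 11, 30)  # placeholder range
    d = start
    while d <= end:
        # run the daily batch for one past date
        subprocess.check_call(
            ["python", "bin/main.py", "-d", d.strftime("%Y%m%d"), "hoge"]
        )
        d += timedelta(days=1)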
  15. 15. Cognos • Commercial BI tool by IBM • rich authorization management • flexible reporting
  16. 16. Cognos sample report
  17. 17. Presto • distributed SQL query engine • fast, with many useful UDFs
  18. 18. Upgrade Presto • easy to upgrade thanks to its stateless architecture • but we sometimes needed to roll back – 0.101 https://github.com/prestodb/presto/pull/2834 – 0.108 https://github.com/prestodb/presto/pull/3212 • queries got stuck • we reverted the commit – 0.113 https://github.com/prestodb/presto/pull/3400 – 0.148 https://github.com/prestodb/presto/pull/5612 • memory errors – 0.189 https://github.com/prestodb/presto/issues/9354 • empty ORC files were not supported
  19. 19. How do we use Presto? • batch jobs use Hive because of its fault tolerance • Presto is fast but currently has limited fault-tolerance capabilities • ad hoc Presto queries are executed through Shib – Shib is a web UI for Presto/Hive – https://github.com/tagomoris/shib
  20. 20. Shib
  21. 21. Agenda • log analysis platform overview in 2014 • Prestogres • Presto/Hive web UI (yanagishima) • upgrading the Hadoop cluster • log analysis platform overview in 2017
  22. 22. Log Analysis Platform (2014), revisited – diagram: Hadoop/Hive on HDP 2.1, Azkaban 2.6, Presto 0.75, Cognos 10, MySQL 5.5 (multiple DBs), Python 2.7.7, Shib, batch (Sqoop, Hive, etc.)
  23. 23. MySQL • we aggregated data into MySQL • easy to connect to Cognos • but MySQL doesn't fit analytics • no window functions • MySQL becomes a bottleneck because it doesn't scale out • Presto has many useful UDFs and window functions • we wanted to reduce the maintenance cost of multiple storage systems • we wanted Cognos to connect to Presto • but connecting Cognos to Presto was hard in 2014 because the Presto JDBC driver was immature
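To make the window-function gap concrete, here is the kind of query Presto supports but MySQL 5.5 could not run; the table and columns are hypothetical.

    -- nth visit per user: a window function, unavailable in MySQL 5.5
    select dt,
           user_id,
           row_number() over (partition by user_id order by dt) as nth_visit
    from access_log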
  24. 24. What is Prestogres? (diagram: a BI tool speaks the PostgreSQL protocol to a patched pgpool-II backed by PostgreSQL, which forwards queries to Presto via PL/Python)
  25. 25. Prestogres use case (diagram: Cognos 10 connects through the PostgreSQL JDBC driver to Prestogres, which queries Presto)
  26. 26. Log Analysis Platform (2015) – diagram: Hadoop/Hive on HDP 2.1, Azkaban 2.6, Presto 0.89, Cognos 10, Prestogres, ETL with Python 2.7.7, Shib, multiple DBs
  27. 27. Presto view • about 400 Presto views • Presto views don't need ETL • data planners create the Presto views • Cognos refers to the Presto views • we have 2 Presto view check systems – one executes select … from … limit 1 on all Presto views every day • this makes it easy to find problems when we upgrade Presto – the other compares the DDL in GitHub with the existing Presto views
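A minimal sketch of the daily view check, assuming the presto-python-client package (import prestodb); the slides don't say which client LINE uses, and the host, user, catalog, and schema here are placeholders.

    # check_views.py - run "select ... limit 1" against every Presto view
    import prestodb  # presto-python-client

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.com",
        port=8080,
        user="view-checker",
        catalog="hive",
        schema="default",
    )

    cur = conn.cursor()
    cur.execute("select table_name from information_schema.views")
    for (view_name,) in cur.fetchall():
        check = conn.cursor()
        try:
            check.execute("select * from %s limit 1" % view_name)
            check.fetchall()
        except Exception as e:
            # a failure here usually means a Presto upgrade broke the view
            print("view check failed: %s: %s" % (view_name, e))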
  28. 28. presto tool • https://github.com/wyukawa/presto-woothee – a UDF to parse user agents • https://github.com/wyukawa/presto-fluentd – sends Presto query logs to Fluentd • we use presto-fluentd to send the query log to Hadoop • so we can inspect the query log with Presto itself
  29. 29. prestogres current status • Prestogres was the best choice for us 3 years ago • but Prestogres is now obsolete • so we plan to upgrade Cognos so that it connects to Presto without Prestogres
  30. 30. Agenda • log analysis platform overview in 2014 • Prestogres • Presto/Hive web UI (yanagishima) • upgrading the Hadoop cluster • log analysis platform overview in 2017
  31. 31. yanagishima • Presto/Hive web UI • started in 2015 so that data planners can execute ad hoc queries more easily • a UI expert joined in 2017 • easy to use and install • share queries with permanent links • charts • handles multiple clusters • https://github.com/yanagishima/yanagishima
  32. 32. yanagishima demo movie
  33. 33. yanagishima use case • checking data • ad hoc queries • sharing queries • creating Presto views • about 100 DAU within LINE
  34. 34. new yanagishima feature • timeline tab • users can comment on queries and share them in the timeline tab • a social feature • will be available in the next version
  35. 35. Agenda • log analysis platform overview in 2014 • Prestogres • Presto/Hive web UI (yanagishima) • upgrading the Hadoop cluster • log analysis platform overview in 2017
  36. 36. Log Analysis Platform (2016) – diagram: Hadoop/Hive on HDP 2.1, Azkaban 3.0, Presto 0.147, Cognos 10, Prestogres, Python 2.7.11, yanagishima, batch (Sqoop, Hive, etc.), multiple DBs
  37. 37. 2016 • 2 years had passed since we created the log analysis platform • Presto and Azkaban were easy to keep upgrading • but our Hadoop version had become old • we were still on HDP 2.1 (Hadoop 2.4) • the latest version at that time was HDP 2.5 (Hadoop 2.7) • the warranty period of our machines was to expire in June 2017 • so we needed to upgrade Hadoop on new machines
  38. 38. new machine spec, Hadoop version • Machines – 40 servers (same as the old Hadoop cluster) – CPU: 40 processors (24 in old) – Memory: 256GB (64GB in old) – HDD: 6.1TB x 12 (3.6TB x 12 in old) – Network: 10Gbps (1Gbps in old) • HDP 2.5.3 (Ambari 2.4.2) – Hadoop 2.7.3 • NameNode HA • ResourceManager HA – Hive 1.2.1 • MapReduce • Tez
  39. 39. How to upgrade Hadoop • Set up a new Hadoop cluster on the new machines • Blue-green deployment, switching over all at once • Migrate data with distcp (-m 20 -bandwidth 125) – copied 500TB (the first copy took about 3 days) • don't execute batches in parallel on both Hadoop clusters
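The distcp flags above are from the slide (20 mappers, 125 MB/s per mapper); assembling them into a full command, with placeholder NameNode hostnames and data path:

    hadoop distcp -m 20 -bandwidth 125 \
        hdfs://old-nn.example.com:8020/data \
        hdfs://new-nn.example.com:8020/data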
  40. 40. distcp with HDFS Snapshot • HDFS snapshots are a useful feature because batches keep adding data during distcp • the -update -diff option doesn't support webhdfs://orig/... – edit hdfs-site.xml on the destination Hadoop cluster and use hdfs://orig/... instead
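A sketch of the incremental copy with snapshot diff. It assumes snapshots s1 and s2 exist on the source path (and that the destination still matches s1, which distcp -diff requires); hostnames and the data path are placeholders, and the source uses hdfs:// rather than webhdfs:// as the slide notes.

    # on the source cluster: make the path snapshottable and take snapshots
    hdfs dfsadmin -allowSnapshot /data
    hdfs dfs -createSnapshot /data s1    # before the first full copy
    hdfs dfs -createSnapshot /data s2    # after batches have added data

    # copy only the changes between s1 and s2
    hadoop distcp -update -diff s1 s2 \
        hdfs://orig-nn.example.com:8020/data \
        hdfs://new-nn.example.com:8020/data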
  41. 41. Migrate Hive schema • Use the show create table command • Use the msck repair command to add partitions – but it didn't work on tables with too many partitions (for example, 4000) • Use webhdfs://... in external tables – hdfs://… can't be used – but Presto returns empty results when you select from them – we needed to add jersey-bundle-1.19.3.jar due to a NoClassDefFoundError – https://groups.google.com/forum/#!topic/presto-users/HXMW4XtmYf8
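The Hive commands named above, spelled out on a hypothetical table; the per-partition fallback for the msck failure is one plausible workaround, not necessarily what LINE did.

    -- on the old cluster: dump the DDL for each table
    SHOW CREATE TABLE access_log;

    -- on the new cluster: re-create the table, then register its partitions
    MSCK REPAIR TABLE access_log;

    -- msck failed on tables with thousands of partitions,
    -- so partitions can be added explicitly instead
    ALTER TABLE access_log ADD IF NOT EXISTS PARTITION (dt='20171207');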
  42. 42. HDFS/YARN/Hive/Sqoop settings • disable hdfs-audit.log because of the many ad hoc queries • dfs.datanode.failed.volumes.tolerated=1 • fs.trash.interval=4320 • NameNode heap 64GB • yarn.nodemanager.resource.memory-mb 100GB • yarn.scheduler.maximum-allocation-mb 100GB • use DominantResourceCalculator • hive.server2.authentication=NOSASL • hive.server2.enable.doAs=false • hive.auto.convert.join=false • hive.support.sql11.reserved.keywords=false • org.apache.sqoop.splitter.allow_text_splitter=true • we sometimes use Tez
  43. 43. monitoring • Ambari Metrics • Prometheus – monitors machine metrics (HDD/memory/CPU, slab, TIME_WAIT, entropy, etc.) with node_exporter • Alertmanager • Grafana • Promgen – Promgen is a configuration file generator for Prometheus – https://github.com/line/promgen
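For illustration, a minimal prometheus.yml fragment that scrapes node_exporter on cluster machines (the kind of configuration Promgen generates); the job name and hostnames are placeholders, 9100 is node_exporter's default port.

    scrape_configs:
      - job_name: 'hadoop-nodes'
        static_configs:
          - targets:
              - 'datanode01.example.com:9100'
              - 'datanode02.example.com:9100'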
  44. 44. My feelings about upgrading Hadoop • If you upgrade Hadoop with many batches (for example, more than 100 Azkaban flows), many errors will occur the next day • You can't confirm the results on the new Hadoop immediately because batches run on a schedule – I highly recommend upgrading in the first half of the week; we upgraded on a Tuesday – if you upgrade on a Friday, you will work on Saturday – share the work with your colleagues to address batch errors • if you handle this kind of job alone, you will be overwhelmed
  45. 45. Log Analysis Platform (2017) – diagram: Hadoop/Hive on HDP 2.5.3, Azkaban 3.37.0, Presto 0.188, Cognos 10, Prestogres, Python 2.7.13, yanagishima, batch (Sqoop, Hive, etc.), multiple DBs
  46. 46. recap • shared the roughly 3-year journey of LINE's log analysis platform • batch • Cognos • yanagishima • upgrading the Hadoop cluster • we really appreciate the OSS products and communities
  47. 47. Any questions?
