This document summarizes Wataru Yukawa's presentation about LINE's log analysis platform over the past 3 years. It discusses the platform in 2014 when it was first created using Hadoop, Hive, Presto, Azkaban, and Cognos. It describes the addition of Prestogres to allow Cognos to connect to Presto and the development of the yanagishima UI. It outlines the upgrade of the Hadoop cluster in 2016 to a new version with more resources and the process for migrating the data and applications. Finally, it provides an overview of the current platform in 2017.
2. Who am I?
• Data engineer at LINE
• First time to Singapore!
• Maintain an on-premises log analysis platform on top of
Hadoop/Hive/Presto/Azkaban
• LINE has many Hadoop clusters, each with a different role
• Today I would like to share a use case from one of them
• I will talk about 3 years of history of this Hadoop cluster
9. batch
• written in Python
• executes Sqoop, Hive, etc.
• mostly uses the Hive CLI because HiveServer2 is not so stable
• we created a thin Python batch framework (see the sketch after this list)
– python bin/main.py -d 20171207 hoge
• supports dry run
– prints the Hive query instead of executing it
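A minimal sketch of such a thin batch framework with a dry-run mode, assuming the CLI shape "python bin/main.py -d 20171207 hoge" shown above; the job table and query are illustrative, not the actual LINE framework.

```python
# Thin batch runner sketch: look up a Hive query by job name, fill in the
# date, and either print it (dry run) or run it via the Hive CLI
# (HiveServer2 was not stable enough for us).
import argparse
import subprocess

# Hypothetical job table; "hoge" is just a placeholder job name.
QUERIES = {
    "hoge": "SELECT count(*) FROM access_log WHERE dt = '{date}'",
}

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--date", required=True)   # e.g. 20171207
    parser.add_argument("job", choices=QUERIES)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    query = QUERIES[args.job].format(date=args.date)
    if args.dry_run:
        print(query)   # dry run: print the Hive query, don't execute it
    else:
        subprocess.check_call(["hive", "-e", query])

if __name__ == "__main__":
    main()
```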
10. Azkaban use case
• Use Azkaban to manage jobs
• Use the Azkaban API
– I created a client https://github.com/wyukawa/eboshi
– Commit scheduling information to GHE (GitHub Enterprise)
• Writing job files by hand is painful because there are so many
– I created a generation tool (see the sketch after this list)
https://github.com/wyukawa/ayd
– generates 1 flow (many jobs) from 1 YAML file
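A minimal sketch of that kind of generation: turn one YAML flow definition into one Azkaban .job file per job. The YAML layout and names are hypothetical examples, not the actual format used by wyukawa/ayd.

```python
# Generate Azkaban .job files (flow 1.0 format: type/command/dependencies)
# from a single YAML flow definition.
import os
import yaml  # pip install pyyaml

def generate_flow(yaml_path, out_dir):
    with open(yaml_path) as f:
        flow = yaml.safe_load(f)
    os.makedirs(out_dir, exist_ok=True)
    for job in flow["jobs"]:
        lines = ["type=command", "command=" + job["command"]]
        if job.get("depends"):
            lines.append("dependencies=" + ",".join(job["depends"]))
        with open(os.path.join(out_dir, job["name"] + ".job"), "w") as out:
            out.write("\n".join(lines) + "\n")

# Example input (hypothetical):
# jobs:
#   - name: import_logs
#     command: python bin/main.py -d 20171207 import_logs
#   - name: daily_report
#     command: python bin/main.py -d 20171207 daily_report
#     depends: [import_logs]
```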
18. Upgrade Presto
• easy to upgrade due to Presto's stateless architecture
• but we sometimes needed to roll back
– 0.101 https://github.com/prestodb/presto/pull/2834
– 0.108 https://github.com/prestodb/presto/pull/3212
• queries got stuck
• reverted the commit
– 0.113 https://github.com/prestodb/presto/pull/3400
– 0.148 https://github.com/prestodb/presto/pull/5612
• memory error
– 0.189 https://github.com/prestodb/presto/issues/9354
• empty ORC file is not supported
19. How do we use Presto?
• batch jobs run on Hive because of its fault tolerance
• Presto is fast, but currently has limited fault-tolerance
capabilities
• ad hoc Presto queries are executed via shib
– shib is a web UI for Presto/Hive
– https://github.com/tagomoris/shib
23. MySQL
• we aggregate data into MySQL
• easy to connect to Cognos
• but MySQL doesn't fit analytics
• no window functions (in MySQL at the time)
• MySQL becomes a bottleneck because it doesn't scale out
• Presto has many useful UDFs and window functions (example below)
• using Presto directly reduces the maintenance cost of multiple storages
• so we wanted Cognos to connect to Presto
• but it was hard to connect Cognos to Presto in 2014 due to the
immature Presto JDBC driver
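As an illustration of what was missing, here is a hedged sketch of a ranking query using a Presto window function, run from Python; the pyhive client, host, table, and columns are assumptions for the example, not part of the original talk.

```python
# Rank each user's busiest days with a window function: straightforward
# in Presto, not expressible in the MySQL versions available at the time.
# Assumes the pyhive client (pip install 'pyhive[presto]'); host, table,
# and columns are hypothetical.
from pyhive import presto

QUERY = """
SELECT user_id, dt, cnt,
       rank() OVER (PARTITION BY user_id ORDER BY cnt DESC) AS rnk
FROM daily_counts
"""

conn = presto.connect(host="presto-coordinator.example.com", port=8080)
cur = conn.cursor()
cur.execute(QUERY)
for row in cur.fetchall():
    print(row)
```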
27. Presto view
• about 400 Presto views
• Presto views don't need ETL
• data planners create the Presto views
• Cognos refers to the Presto views
• we have 2 Presto view check systems (see the sketch after this list)
– execute select … from … limit 1 on all Presto views every day
• makes it easy to find problems when we upgrade Presto
– compare the DDL in GitHub with the existing Presto views
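A minimal sketch of the first check, under the same pyhive assumption as above; the catalog and schema names are hypothetical.

```python
# Daily view check: run a LIMIT 1 select against every Presto view and
# report the ones that fail (a failure usually surfaces an upgrade problem).
from pyhive import presto

conn = presto.connect(host="presto-coordinator.example.com", port=8080)
cur = conn.cursor()
cur.execute("SHOW TABLES FROM hive.default")  # views are listed here too
views = [row[0] for row in cur.fetchall()]

broken = []
for view in views:
    try:
        check = conn.cursor()
        check.execute("SELECT * FROM hive.default.%s LIMIT 1" % view)
        check.fetchall()
    except Exception as exc:
        broken.append((view, str(exc)))

for view, err in broken:
    print(view, err)
```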
29. prestogres current status
• prestogres was the best choice for us 3 years ago
• currently, prestogres is obsolete
• so we plan to upgrade Cognos so that it connects to
Presto without prestogres
39. new machine spec, Hadoop version
• Machines
– 40 servers (same as the old Hadoop cluster)
– CPU: 40 processors (24 in old)
– Memory: 256GB (64GB in old)
– HDD: 6.1TB x 12 (3.6TB x 12 in old)
– Network: 10Gbps (1Gbps in old)
• HDP 2.5.3 (Ambari 2.4.2)
– Hadoop 2.7.3
• NameNode HA
• ResourceManager HA
– Hive 1.2.1
• MapReduce
• Tez
42. Migrate Hive schema
• Use the SHOW CREATE TABLE command (see the sketch after this list)
• Use the MSCK REPAIR TABLE command to add partitions
– But it didn't work on tables with too many partitions (for example, 4000)
• Use webhdfs://... locations in external tables
– hdfs://… can't be used
– but Presto returns empty results when you select from such tables
– need to add jersey-bundle-1.19.3.jar due to a NoClassDefFoundError
– https://groups.google.com/forum/#!topic/presto-users/HXMW4XtmYf8
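A minimal sketch of the schema-copy step under the assumptions above: dump the DDL from the old cluster, rewrite the location to webhdfs://, replay it on the new cluster, then register partitions. The config paths, hosts, and table list are hypothetical.

```python
# Copy Hive table definitions from the old cluster to the new one via the
# Hive CLI (the batch framework already shells out to it).
import subprocess

OLD_HIVE = ["hive", "--config", "/etc/hive/conf.old"]  # old-cluster client conf
NEW_HIVE = ["hive", "--config", "/etc/hive/conf.new"]  # new-cluster client conf
TABLES = ["access_log", "daily_report"]                # hypothetical list

for table in TABLES:
    ddl = subprocess.check_output(
        OLD_HIVE + ["-e", "SHOW CREATE TABLE %s;" % table]
    ).decode()
    # External tables point at the old cluster over WebHDFS;
    # plain hdfs:// locations did not work for us.
    ddl = ddl.replace("hdfs://old-nn", "webhdfs://old-nn:50070")
    subprocess.check_call(NEW_HIVE + ["-e", ddl])
    # Register existing partitions (this failed for us on tables with
    # very many partitions, e.g. 4000).
    subprocess.check_call(NEW_HIVE + ["-e", "MSCK REPAIR TABLE %s;" % table])
```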
43. HDFS/YARN/Hive/Sqoop settings
• disable hdfs-audit.log due to the many ad hoc queries
• dfs.datanode.failed.volumes.tolerated=1
• fs.trash.interval=4320
• Namenode heap 64GB
• yarn.nodemanager.resource.memory-mb 100GB
• yarn.scheduler.maximum-allocation-mb 100GB
• Use DominantResourceCalculator
• hive.server2.authentication=NOSASL
• hive.server2.enable.doAs=false
• hive.auto.convert.join=false
• hive.support.sql11.reserved.keywords=false
• org.apache.sqoop.splitter.allow_text_splitter=true
• Sometimes use Tez
44. monitoring
• Ambari Metrics
• Prometheus
– monitor machine metrics (HDD/memory/CPU, slab, TIME_WAIT,
entropy, etc.) with node_exporter
• Alertmanager
• Grafana
• Promgen
– Promgen is a configuration file generator for Prometheus
– https://github.com/line/promgen
45. My feelings about upgrading Hadoop
• If you upgrade Hadoop with many batches (for example, more than 100
Azkaban flows), many errors will occur the next day
• We can't confirm the results on the new Hadoop immediately because
batches run on a schedule
– I highly recommend upgrading in the first half of the week; we upgraded
on a Tuesday
– If you upgrade on Friday, you will work on Saturday.
– share the jobs with your colleagues to address batch errors
• If you handle such jobs alone, you will be overwhelmed