2. Who am I?
• Data engineer at LINE
• First time to Singapore!
• Maintain an on-premises log analysis platform on top of
Hadoop/Hive/Presto/Azkaban
• LINE has many Hadoop clusters, each with a different role
• Today I would like to share a use case from one of them
• I will talk about 3 years of history related to this Hadoop cluster
9. batch
• written in Python
• executes Sqoop, Hive, etc.
• mostly use the Hive CLI because HiveServer2 is not so stable
• we created a thin Python batch framework
– python bin/main.py -d 20171207 hoge
• enables dry run
– prints the Hive query without executing it
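The dry-run idea above can be sketched as a thin wrapper. This is a minimal sketch, not the actual framework: `run_hive`, the `access_log` table, and the flag names are hypothetical, but the CLI shape matches `python bin/main.py -d 20171207 hoge`.

```python
import argparse
import subprocess

def build_parser():
    """CLI shaped like: python bin/main.py -d 20171207 hoge [--dry-run]"""
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--date", required=True)   # target date, e.g. 20171207
    parser.add_argument("job")                           # job name, e.g. hoge
    parser.add_argument("--dry-run", action="store_true")
    return parser

def run_hive(query, dry_run=False):
    """Print the Hive query; skip the Hive CLI call on a dry run."""
    print(query)
    if dry_run:
        return None
    return subprocess.run(["hive", "-e", query], check=True)

# Dry run: the query is printed, not executed.
args = build_parser().parse_args(["-d", "20171207", "hoge", "--dry-run"])
query = "SELECT count(*) FROM access_log WHERE dt = '{}'".format(args.date)
run_hive(query, dry_run=args.dry_run)
```

Printing the exact query before execution makes it easy to verify date substitution without touching the cluster.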
10. Azkaban use case
• Use Azkaban to manage jobs
• Use the Azkaban API
– I created a client https://github.com/wyukawa/eboshi
– Commit scheduling information to GHE
• Painful to write job files by hand because there are so many of them
– I created a generation tool
https://github.com/wyukawa/ayd
– generates 1 flow (many jobs) from 1 YAML file
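The generation idea can be sketched as follows. Azkaban jobs really are `key=value` property files with `type`, `command`, and `dependencies` keys, but the flow definition below is a YAML-like stand-in: ayd's actual schema and job names may differ.

```python
# One flow definition (dict standing in for a parsed YAML file) -> many .job files.
flow = {
    "daily_report": {"command": "python bin/main.py -d 20171207 daily_report"},
    "export_mysql": {
        "command": "python bin/main.py -d 20171207 export_mysql",
        "dependencies": ["daily_report"],
    },
}

def to_job_file(name, spec):
    """Render one Azkaban job as a .job property file (key=value lines)."""
    lines = ["type=command", "command={}".format(spec["command"])]
    if spec.get("dependencies"):
        lines.append("dependencies={}".format(",".join(spec["dependencies"])))
    return "\n".join(lines) + "\n"

for name, spec in flow.items():
    print("--- {}.job ---".format(name))
    print(to_job_file(name, spec))
```

Generating all job files from one source keeps the flow's dependency graph in a single reviewable file.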
18. Upgrade Presto
• easy to upgrade due to its stateless architecture
• but sometimes we needed to roll back
– 0.101 https://github.com/prestodb/presto/pull/2834
– 0.108 https://github.com/prestodb/presto/pull/3212
• queries got stuck
• reverted the commit
– 0.113 https://github.com/prestodb/presto/pull/3400
– 0.148 https://github.com/prestodb/presto/pull/5612
• memory error
– 0.189 https://github.com/prestodb/presto/issues/9354
• empty ORC files are not supported
19. How do we use Presto?
• run batch jobs with Hive due to its fault tolerance
• Presto is fast, but currently has limited fault-tolerance capabilities
• execute ad-hoc Presto queries via shib
– shib is a web UI for Presto/Hive
– https://github.com/tagomoris/shib
23. MySQL
• aggregate data into MySQL
• easy to connect to Cognos
• but MySQL doesn’t fit analytics
• no window functions
• MySQL becomes a bottleneck because it doesn’t scale out
• Presto has many useful UDFs and window functions
• reduces the maintenance cost of multiple storages
• we wanted Cognos to connect to Presto
• it was hard to connect Cognos to Presto in 2014 due to the immature Presto
JDBC driver
27. Presto view
• about 400 Presto views
• Presto views don’t need ETL
• data planners create Presto views
• Cognos refers to Presto views
• we have 2 Presto view check systems
– execute select … from … limit 1 on all Presto views every day
• makes it easy to find problems when we upgrade Presto
– compare the DDL in GitHub to the existing Presto views
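The daily check above can be sketched as a loop that probes every view and collects the failures. This is a minimal sketch: `run_query` is a placeholder for whatever Presto client is used, and the view names are hypothetical.

```python
def check_views(views, run_query):
    """Return (view, error) pairs whose 'select ... limit 1' probe failed."""
    failures = []
    for view in views:
        try:
            run_query("SELECT * FROM {} LIMIT 1".format(view))
        except Exception as e:  # any client error means the view is broken
            failures.append((view, str(e)))
    return failures

def fake_run_query(sql):
    # Stand-in client: pretend one view references a dropped table.
    if "broken_view" in sql:
        raise RuntimeError("Table hive.default.dropped_table does not exist")

failures = check_views(["daily_summary_view", "broken_view"], fake_run_query)
print(failures)
```

Running this right after a Presto upgrade surfaces views broken by behavior changes before users hit them.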
29. prestogres current status
• Prestogres was the best choice for us 3 years ago
• Currently, Prestogres is obsolete
• So we plan to upgrade Cognos so that it connects to
Presto without Prestogres
39. new Machine spec, Hadoop version
• Machines
– 40 servers (same as the old Hadoop cluster)
– CPU: 40 processors (24 in old)
– Memory: 256GB (64GB in old)
– HDD: 6.1TB x 12 (3.6TB x 12 in old)
– Network: 10Gbps (1Gbps in old)
• HDP 2.5.3 (Ambari 2.4.2)
– Hadoop 2.7.3
• NameNode HA
• ResourceManager HA
– Hive 1.2.1
• MapReduce
• Tez
42. Migrate Hive schema
• Use the show create table command
• Use the msck repair command to add partitions
– But it didn’t work on tables with too many (for example, 4000) partitions
• Use webhdfs://... in external tables
– can’t use hdfs://…
– but empty results are returned when you select via Presto
– needed to add jersey-bundle-1.19.3.jar due to a NoClassDefFoundError
– https://groups.google.com/forum/#!topic/presto-users/HXMW4XtmYf8
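When msck repair fails on tables with thousands of partitions, one workaround is to add partitions explicitly. A minimal sketch of generating the `ALTER TABLE ... ADD PARTITION` DDL (the table name, dates, and location prefix are hypothetical examples):

```python
def add_partition_statements(table, dates, location_prefix):
    """Build explicit ADD PARTITION DDL as a fallback for msck repair."""
    stmts = []
    for dt in dates:
        stmts.append(
            "ALTER TABLE {t} ADD IF NOT EXISTS PARTITION (dt='{d}') "
            "LOCATION '{p}/dt={d}'".format(t=table, d=dt, p=location_prefix)
        )
    return stmts

stmts = add_partition_statements(
    "access_log",
    ["20171205", "20171206", "20171207"],
    "webhdfs://namenode:50070/warehouse/access_log",
)
for s in stmts:
    print(s)
```

Batching the statements into chunks and feeding them to the Hive CLI keeps each session small enough to avoid the failure mode above.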
43. HDFS/YARN/Hive/Sqoop setting
• disable hdfs-audit.log due to many ad-hoc queries
• dfs.datanode.failed.volumes.tolerated=1
• fs.trash.interval=4320
• NameNode heap 64GB
• yarn.nodemanager.resource.memory-mb 100GB
• yarn.scheduler.maximum-allocation-mb 100GB
• Use DominantResourceCalculator
• hive.server2.authentication=NOSASL
• hive.server2.enable.doAs=false
• hive.auto.convert.join=false
• hive.support.sql11.reserved.keywords=false
• org.apache.sqoop.splitter.allow_text_splitter=true
• Sometimes use Tez
44. monitoring
• Ambari Metrics
• Prometheus
– monitor machine metrics (HDD/memory/CPU, slab, TIME_WAIT,
entropy, etc.) with node_exporter
• Alertmanager
• Grafana
• Promgen
– Promgen is a configuration file generator for Prometheus
– https://github.com/line/promgen
45. My feelings about upgrading Hadoop
• If you upgrade Hadoop with many batches (for example, more than 100
Azkaban flows), many errors will occur the next day
• We can’t confirm the results on the new Hadoop cluster immediately because
batches run on a schedule
– I highly recommend upgrading in the first half of the week. We upgraded
on a Tuesday.
– If you upgrade on Friday, you will be working on Saturday.
– share jobs with your colleagues to address batch errors
• If you do this kind of work alone, you will be overwhelmed