
Strata2017 sg


https://conferences.oreilly.com/strata/strata-sg/public/schedule/detail/62948


  1. 1. LINE’s log analysis platform 2017/12/07 Wataru Yukawa (@wyukawa) #StrataData
  2. 2. Who am I? • Data engineer at LINE • First time in Singapore! • I maintain an on-premises log analysis platform on top of Hadoop/Hive/Presto/Azkaban • LINE has many Hadoop clusters with different roles • Today I would like to share a use case from one of them • I will talk about three years of history with this Hadoop cluster
  3. 3. LINE • LINE makes a messaging application of the same name, in addition to other related services • It is the most popular messaging platform in Japan
  4. 4. LINE
  5. 5. about 3 years ago • We needed a new analytics department • LINE has many Family services – LINE Fortune – LINE Manga – etc. • Demand for analytics of the LINE Family services was increasing • We created a new analytics department in May 2014
  6. 6. analytics department • data engineers – implement batches – maintain Hadoop • data planners – communicate with service planners – design KPIs – create reports – execute ad hoc queries to extract data, for example for campaigns
  7. 7. Agenda • log analysis platform overview in 2014 • Prestogres • Presto/Hive web UI (yanagishima) • upgrading the Hadoop cluster • log analysis platform overview in 2017
  8. 8. Log Analysis Platform (2014) – diagram: Hadoop/Hive on HDP 2.1, Azkaban 2.6, Presto 0.75, Cognos 10, MySQL 5.5 (multiple DBs), Python 2.7.7, Shib, batch (Sqoop, Hive, etc.)
  9. 9. batch • written in Python • executes Sqoop, Hive, etc. • we mostly use the Hive CLI because HiveServer2 is not very stable • we created a thin Python batch framework – python bin/main.py -d 20171207 hoge • it supports dry runs – print the Hive query instead of executing it
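The framework itself is not shown in the talk, so here is a minimal sketch of the dry-run idea only: the -d option, the job argument, and the Hive CLI usage come from the slide, while the --dry-run flag and the query body are placeholders.

    # main.py - minimal sketch of a thin batch framework with dry run
    # (not LINE's actual implementation)
    import argparse
    import subprocess

    def run_hive(query, dry_run):
        if dry_run:
            print(query)  # dry run: print the Hive query, do not execute
        else:
            subprocess.call(["hive", "-e", query])  # use the Hive CLI

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("-d", dest="date", required=True)  # e.g. 20171207
        parser.add_argument("job")                             # e.g. hoge
        parser.add_argument("--dry-run", action="store_true")
        args = parser.parse_args()
        # placeholder query; real jobs would load their own SQL
        query = "select count(*) from %s where dt = '%s'" % (args.job, args.date)
        run_hive(query, args.dry_run)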
  10. 10. Azkaban use case • We use Azkaban to manage jobs • We use the Azkaban API – I created a client: https://github.com/wyukawa/eboshi – we commit scheduling information to GHE • Writing job files by hand is painful because there are so many – I created a generation tool: https://github.com/wyukawa/ayd – it generates 1 flow (many jobs) from 1 YAML file
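A client like eboshi wraps the Azkaban AJAX API; for illustration, here is a sketch of two raw calls (login and fetchprojectflows, as described in the Azkaban documentation) using the requests library. The host, credentials, and project name are placeholders.

    # Sketch of calling the Azkaban AJAX API directly (not the eboshi client)
    import requests

    AZKABAN_URL = "http://azkaban.example.com:8081"  # placeholder host

    # log in and obtain a session id
    login = requests.post(AZKABAN_URL, data={
        "action": "login",
        "username": "azkaban",
        "password": "secret",
    })
    session_id = login.json()["session.id"]

    # fetch the flows of a project
    flows = requests.get(AZKABAN_URL + "/manager", params={
        "ajax": "fetchprojectflows",
        "project": "my-project",
        "session.id": session_id,
    })
    print(flows.json())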
  11. 11. Azkaban Job File

    # foo.job
    type=command
    command=echo foo
    retries=1
    retry.backoff=300000

    # bar.job
    type=command
    dependencies=foo
    command=echo bar

  (Azkaban flow diagram: foo → bar)
  12. 12. Yaml example

    foo:
      type: command
      command: echo "foo"
      retries: 1
      retry.backoff: 300000

    bar:
      type: command
      command: echo "bar"
      dependencies: foo
      retries: 1
      retry.backoff: 300000
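The generation step is simple to picture: each top-level YAML key becomes one .job file. This is a minimal sketch of that idea, not the actual ayd implementation.

    # generate_jobs.py - sketch of generating Azkaban .job files
    # from one YAML flow definition (the idea behind ayd)
    import sys
    import yaml  # PyYAML

    def generate_job_files(yaml_path, out_dir="."):
        with open(yaml_path) as f:
            flow = yaml.safe_load(f)
        # one .job file per top-level YAML key
        for job_name, props in flow.items():
            with open("%s/%s.job" % (out_dir, job_name), "w") as job_file:
                for key, value in props.items():
                    job_file.write("%s=%s\n" % (key, value))

    if __name__ == "__main__":
        generate_job_files(sys.argv[1])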
  13. 13. Job Management Overview (flow diagram: git push → git pull → generate job files → upload jobs → register schedule / push button → execute jobs)
  14. 14. Azkaban usage situation • More than 150 Azkaban flows • Many daily batches, plus some hourly, weekly, and monthly batches • Most flows are related to Hive • I prepared template Azkaban flows to re-aggregate past data because Azkaban has no backfill (a sketch of the idea follows)
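Since Azkaban has no backfill, re-aggregation means re-running the daily batch once per past date. A minimal sketch, reusing the bin/main.py invocation from slide 9; the date range and job name are hypothetical.

    # backfill.py - sketch of re-aggregating past data by date
    import subprocess
    from datetime import date, timedelta

    start, end = date(2017, 11, 1), date(2017, 11, 30)  # placeholder range
    d = start
    while d <= end:
        # run the daily batch for one past date
        subprocess.check_call(
            ["python", "bin/main.py", "-d", d.strftime("%Y%m%d"), "hoge"]
        )
        d += timedelta(days=1)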
  15. 15. Cognos • Commercial BI tool by IBM • rich authorization management • flexible reporting
  16. 16. Cognos sample report
  17. 17. Presto • distributed SQL query engine • fast, with many useful UDFs
  18. 18. Upgrade Presto • easy to upgrade thanks to its stateless architecture • but we sometimes needed to roll back – 0.101 https://github.com/prestodb/presto/pull/2834 – 0.108 https://github.com/prestodb/presto/pull/3212 • queries got stuck • we reverted the commit – 0.113 https://github.com/prestodb/presto/pull/3400 – 0.148 https://github.com/prestodb/presto/pull/5612 • memory errors – 0.189 https://github.com/prestodb/presto/issues/9354 • empty ORC files were not supported
  19. 19. How do we use Presto? • batch jobs use Hive because of its fault tolerance • Presto is fast but currently has limited fault-tolerance capabilities • ad hoc Presto queries are executed through Shib – Shib is a web UI for Presto/Hive – https://github.com/tagomoris/shib
  20. 20. Shib
  21. 21. Agenda • log analysis platform overview in 2014 • Prestogres • Presto/Hive web UI (yanagishima) • upgrading the Hadoop cluster • log analysis platform overview in 2017
  22. 22. Log Analysis Platform (2014), revisited – diagram: Hadoop/Hive on HDP 2.1, Azkaban 2.6, Presto 0.75, Cognos 10, MySQL 5.5 (multiple DBs), Python 2.7.7, Shib, batch (Sqoop, Hive, etc.)
  23. 23. MySQL • we aggregated data into MySQL • easy to connect to Cognos • but MySQL doesn't fit analytics • no window functions • MySQL becomes a bottleneck because it doesn't scale out • Presto has many useful UDFs and window functions • we wanted to reduce the maintenance cost of multiple storage systems • we wanted Cognos to connect to Presto • but connecting Cognos to Presto was hard in 2014 because the Presto JDBC driver was immature
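To make the window-function gap concrete, here is the kind of query Presto supports but MySQL 5.5 could not run; the table and columns are hypothetical.

    -- nth visit per user: a window function, unavailable in MySQL 5.5
    select dt,
           user_id,
           row_number() over (partition by user_id order by dt) as nth_visit
    from access_log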
  24. 24. What is Prestogres? (diagram: a BI tool speaks the PostgreSQL protocol to a patched pgpool-II backed by PostgreSQL, which forwards queries to Presto via PL/Python)
  25. 25. Prestogres use case (diagram: Cognos 10 connects through the PostgreSQL JDBC driver to Prestogres, which queries Presto)
  26. 26. Log Analysis Platform (2015) – diagram: Hadoop/Hive on HDP 2.1, Azkaban 2.6, Presto 0.89, Cognos 10, Prestogres, ETL with Python 2.7.7, Shib, multiple DBs
  27. 27. Presto view • about 400 Presto views • Presto views don't need ETL • data planners create the Presto views • Cognos refers to the Presto views • we have 2 Presto view check systems – one executes select … from … limit 1 on all Presto views every day • this makes it easy to find problems when we upgrade Presto – the other compares the DDL in GitHub with the existing Presto views
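A minimal sketch of the daily view check, assuming the presto-python-client package (import prestodb); the slides don't say which client LINE uses, and the host, user, catalog, and schema here are placeholders.

    # check_views.py - run "select ... limit 1" against every Presto view
    import prestodb  # presto-python-client

    conn = prestodb.dbapi.connect(
        host="presto-coordinator.example.com",
        port=8080,
        user="view-checker",
        catalog="hive",
        schema="default",
    )

    cur = conn.cursor()
    cur.execute("select table_name from information_schema.views")
    for (view_name,) in cur.fetchall():
        check = conn.cursor()
        try:
            check.execute("select * from %s limit 1" % view_name)
            check.fetchall()
        except Exception as e:
            # a failure here usually means a Presto upgrade broke the view
            print("view check failed: %s: %s" % (view_name, e))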
  28. 28. presto tool • https://github.com/wyukawa/presto-woothee – a UDF to parse user agents • https://github.com/wyukawa/presto-fluentd – sends Presto query logs to Fluentd • we use presto-fluentd to send the query log to Hadoop • so we can inspect the query log with Presto itself
  29. 29. prestogres current status • Prestogres was the best choice for us 3 years ago • but Prestogres is now obsolete • so we plan to upgrade Cognos so that it connects to Presto without Prestogres
  30. 30. Agenda • log analysis platform overview in 2014 • Prestogres • Presto/Hive web UI (yanagishima) • upgrading the Hadoop cluster • log analysis platform overview in 2017
  31. 31. yanagishima • Presto/Hive web UI • started in 2015 so that data planners can execute ad hoc queries more easily • a UI expert joined in 2017 • easy to use and install • share queries with permanent links • charts • handles multiple clusters • https://github.com/yanagishima/yanagishima
  32. 32. yanagishima demo movie
  33. 33. yanagishima use case • checking data • ad hoc queries • sharing queries • creating Presto views • about 100 DAU within LINE
  34. 34. new yanagishima feature • timeline tab • users can comment on queries and share them in the timeline tab • a social feature • will be available in the next version
  35. 35. Agenda • log analysis platform overview in 2014 • Prestogres • Presto/Hive web UI (yanagishima) • upgrading the Hadoop cluster • log analysis platform overview in 2017
  36. 36. Log Analysis Platform (2016) – diagram: Hadoop/Hive on HDP 2.1, Azkaban 3.0, Presto 0.147, Cognos 10, Prestogres, Python 2.7.11, yanagishima, batch (Sqoop, Hive, etc.), multiple DBs
  37. 37. 2016 • 2 years had passed since we created the log analysis platform • Presto and Azkaban were easy to keep upgrading • but our Hadoop version had become old • we were still on HDP 2.1 (Hadoop 2.4) • the latest version at that time was HDP 2.5 (Hadoop 2.7) • the warranty period of our machines was to expire in June 2017 • so we needed to upgrade Hadoop on new machines
  38. 38. new machine spec, Hadoop version • Machines – 40 servers (same as the old Hadoop cluster) – CPU: 40 processors (24 in old) – Memory: 256GB (64GB in old) – HDD: 6.1TB x 12 (3.6TB x 12 in old) – Network: 10Gbps (1Gbps in old) • HDP 2.5.3 (Ambari 2.4.2) – Hadoop 2.7.3 • NameNode HA • ResourceManager HA – Hive 1.2.1 • MapReduce • Tez
  39. 39. How to upgrade Hadoop • Set up a new Hadoop cluster on the new machines • Blue-green deployment, switching over all at once • Migrate data with distcp (-m 20 -bandwidth 125) – copied 500TB (the first copy took about 3 days) • don't execute batches in parallel on both Hadoop clusters
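The distcp flags above are from the slide (20 mappers, 125 MB/s per mapper); assembling them into a full command, with placeholder NameNode hostnames and data path:

    hadoop distcp -m 20 -bandwidth 125 \
        hdfs://old-nn.example.com:8020/data \
        hdfs://new-nn.example.com:8020/data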
  40. 40. distcp with HDFS Snapshot • HDFS snapshots are a useful feature because batches keep adding data during distcp • the -update -diff option doesn't support webhdfs://orig/... – edit hdfs-site.xml on the destination Hadoop cluster and use hdfs://orig/... instead
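A sketch of the incremental copy with snapshot diff. It assumes snapshots s1 and s2 exist on the source path (and that the destination still matches s1, which distcp -diff requires); hostnames and the data path are placeholders, and the source uses hdfs:// rather than webhdfs:// as the slide notes.

    # on the source cluster: make the path snapshottable and take snapshots
    hdfs dfsadmin -allowSnapshot /data
    hdfs dfs -createSnapshot /data s1    # before the first full copy
    hdfs dfs -createSnapshot /data s2    # after batches have added data

    # copy only the changes between s1 and s2
    hadoop distcp -update -diff s1 s2 \
        hdfs://orig-nn.example.com:8020/data \
        hdfs://new-nn.example.com:8020/data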
  41. 41. Migrate Hive schema • Use the show create table command • Use the msck repair command to add partitions – but it didn't work on tables with too many partitions (for example, 4000) • Use webhdfs://... in external tables – hdfs://… can't be used – but Presto returns empty results when you select from them – we needed to add jersey-bundle-1.19.3.jar due to a NoClassDefFoundError – https://groups.google.com/forum/#!topic/presto-users/HXMW4XtmYf8
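The Hive commands named above, spelled out on a hypothetical table; the per-partition fallback for the msck failure is one plausible workaround, not necessarily what LINE did.

    -- on the old cluster: dump the DDL for each table
    SHOW CREATE TABLE access_log;

    -- on the new cluster: re-create the table, then register its partitions
    MSCK REPAIR TABLE access_log;

    -- msck failed on tables with thousands of partitions,
    -- so partitions can be added explicitly instead
    ALTER TABLE access_log ADD IF NOT EXISTS PARTITION (dt='20171207');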
  42. 42. HDFS/YARN/Hive/Sqoop settings • disable hdfs-audit.log because of the many ad hoc queries • dfs.datanode.failed.volumes.tolerated=1 • fs.trash.interval=4320 • NameNode heap 64GB • yarn.nodemanager.resource.memory-mb 100GB • yarn.scheduler.maximum-allocation-mb 100GB • use DominantResourceCalculator • hive.server2.authentication=NOSASL • hive.server2.enable.doAs=false • hive.auto.convert.join=false • hive.support.sql11.reserved.keywords=false • org.apache.sqoop.splitter.allow_text_splitter=true • we sometimes use Tez
  43. 43. monitoring • Ambari Metrics • Prometheus – monitors machine metrics (HDD/memory/CPU, slab, TIME_WAIT, entropy, etc.) with node_exporter • Alertmanager • Grafana • Promgen – Promgen is a configuration file generator for Prometheus – https://github.com/line/promgen
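For illustration, a minimal prometheus.yml fragment that scrapes node_exporter on cluster machines (the kind of configuration Promgen generates); the job name and hostnames are placeholders, 9100 is node_exporter's default port.

    scrape_configs:
      - job_name: 'hadoop-nodes'
        static_configs:
          - targets:
              - 'datanode01.example.com:9100'
              - 'datanode02.example.com:9100'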
  44. 44. My feelings about upgrading Hadoop • If you upgrade Hadoop with many batches (for example, more than 100 Azkaban flows), many errors will occur the next day • You can't confirm the results on the new Hadoop immediately because batches run on a schedule – I highly recommend upgrading in the first half of the week; we upgraded on a Tuesday – if you upgrade on a Friday, you will work on Saturday – share the work with your colleagues to address batch errors • if you handle this kind of job alone, you will be overwhelmed
  45. 45. Log Analysis Platform (2017) – diagram: Hadoop/Hive on HDP 2.5.3, Azkaban 3.37.0, Presto 0.188, Cognos 10, Prestogres, Python 2.7.13, yanagishima, batch (Sqoop, Hive, etc.), multiple DBs
  46. 46. recap • shared the roughly 3-year journey of LINE's log analysis platform • batch • Cognos • yanagishima • upgrading the Hadoop cluster • we really appreciate the OSS products and communities
  47. 47. Any questions?
