
Efficient And Invincible Big Data Platform In LINE


Neil Tu
LINE / Data Labs

LINE's services are fast-growing and continuously generate a wide variety of data and logs. In today's world, data scientists are put to the test on how quickly they can extract valuable information, the modern asset, from massive data sets. Moreover, providing a stable, safe, and efficient mega-platform is imperative. This session discusses the foundation that enables LINE's Data Labs team to continuously produce data.


  1. Efficient and Invincible Big Data Platform in LINE
  2. About Me: Neil Tu (杜佐民) ● Data architect and engineer ● Expert in the Hadoop distributed system and its ecosystem ● 5+ years of experience in image processing, computer vision, and pattern recognition
  3. Agenda • Data Platforms Within LINE • Pipeline Platform • Analysis Platform • Ecosystem
  4. Data Platforms Within LINE
  5. Data Platforms
  6. Data Platforms (diagram labels): Big Data, Data Analysis, Mathematical Modeling, Pipeline, Machine Learning, Deep Learning, etc.; Protocolized Model, System Integrated, Streaming
  7. Data Platforms Within LINE: Tracking Service Platform, Pipeline Platform, Analysis Platform
  8. Pipeline Platform
  9. 30 PB, 6.5M msg/sec, 652 types
  10. Service system, ETL: protocol definition, data flow definition (diagram labels)
  11. Protocolized Data Model
      message ApiAccessLog {
        string request_id = 1;
        string method = 2;
        string path = 3;
        string request_ip = 4 [(EsMapping.type) = "ip"];
        int32 status = 5;
        string contents = 6 [(EsMapping.index) = false];
        string result = 7;
        int64 event_time = 8 [(use_as_timestamp) = true];
        int64 injest_time = 9;
      }
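The bracketed annotations above are protobuf custom field options. The deck does not show how they are declared; the sketch below is one plausible declaration, with the option names taken from the slide and the extension field numbers chosen arbitrarily for illustration.

      // Hypothetical declaration of the custom options used by ApiAccessLog.
      // The field numbers 50001-50003 are placeholders in the internal-use range.
      syntax = "proto3";
      import "google/protobuf/descriptor.proto";

      message EsMapping {
        extend google.protobuf.FieldOptions {
          string type = 50001;   // e.g. "ip": how the field is mapped in Elasticsearch
          bool index = 50002;    // false: store the field without indexing it
        }
      }

      extend google.protobuf.FieldOptions {
        bool use_as_timestamp = 50003;  // marks the field used as the event timestamp
      }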
  12. Analysis Platform
  13. Analysis Platform
  14. 25 PB, 1668 tables, 550 users
  15. Data Infrastructure (diagram): event log, RDBMS dump, and other storages flow into the data hub; a BI tool sits on top
  16. Data Flow (diagram): RDBMS, service data, and other storages go through ETL
  17. Real-time Query: 180,000 records/sec
  18. NiFi ● UI ● Security ● Local backup
  19. Ecosystem
  20. Oasis, etc.
  21. Yanagishima: https://github.com/yanagishima/yanagishima
  22. LINE Analytics: Reporting Tool
  23. Oasis: Interactive Data Analytics Tool
  24. Aquarium: Data Catalog Tool
  25. Aquarium: Data Catalog Tool
  26. Aquarium: Data Catalog Tool
  27. Security
  28. Office authentication; private authentication and authorization; gateway server; client server (diagram labels)
  29. Sign-up web UI → registration WF → HDFS user home directory
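On the HDFS side, the registration workflow presumably amounts to creating and handing over a home directory; a minimal sketch, with the user name and group as placeholders:

      # Create the new user's home directory and give them ownership of it.
      hdfs dfs -mkdir -p /user/alice
      hdfs dfs -chown alice:alice /user/alice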
  30. Monitoring
  31. Basic Monitoring ● JVM ● Net traffic ● Disk capacity ● etc.
  32. Cluster Monitoring ● Small files ● Cluster usage per user ● Disk usage ● Blocks ● Empty files ● etc.
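The deck does not show how these checks are implemented. One common approach, sketched below under that assumption, reads per-user usage from the CLI and scans an fsimage dump with the Offline Image Viewer; the fsimage file name and the 128 MB threshold are placeholders.

      # Cluster usage per user: size of each home directory.
      hdfs dfs -du -h /user
      # Small and empty files: dump the fsimage to a tab-separated listing and filter it.
      hdfs oiv -p Delimited -i fsimage_0000000000012345678 -o fsimage.tsv
      # In the Delimited output, files have Replication (column 2) > 0 and FileSize in column 7.
      awk -F'\t' 'NR > 1 && $2 > 0 && $7 == 0' fsimage.tsv | wc -l          # empty files
      awk -F'\t' 'NR > 1 && $2 > 0 && $7 < 134217728' fsimage.tsv | wc -l   # files smaller than one 128 MB block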
  33. Third Namenode: NN1, NN2, NN3 with JN1, JN2, JN3; always on standby with real-time metadata
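A third NameNode corresponds to Hadoop 3's support for more than one standby NameNode. A minimal hdfs-site.xml sketch in the same key=value style as the tuning slides follows; the nameservice ID, host names, and ports are placeholders rather than values from the deck.

      ● dfs.nameservices=datalabs
      ● dfs.ha.namenodes.datalabs=nn1,nn2,nn3
      ● dfs.namenode.rpc-address.datalabs.nn1=nn1.example.com:8020
      ● dfs.namenode.rpc-address.datalabs.nn2=nn2.example.com:8020
      ● dfs.namenode.rpc-address.datalabs.nn3=nn3.example.com:8020
      ● dfs.namenode.shared.edits.dir=qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/datalabs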
  34. Tuning
  35. Basic Tuning
      YARN
      ● yarn.log-aggregation.retain-check-interval-seconds=86400
      ● yarn.log-aggregation.retain-seconds=172800
      Spark
      ● spark.history.fs.cleaner.enabled=true
      ● spark.history.fs.cleaner.interval=1d
      ● spark.history.fs.cleaner.maxAge=2d
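For context (an interpretation, not stated on the slide): the YARN properties belong in yarn-site.xml and the Spark ones in the history server's spark-defaults.conf, and together they cap retained logs and histories at roughly two days.

      # yarn-site.xml: delete aggregated application logs after 2 days, checking once a day
      yarn.log-aggregation.retain-seconds=172800
      yarn.log-aggregation.retain-check-interval-seconds=86400
      # spark-defaults.conf: remove Spark history entries older than 2 days, checking once a day
      spark.history.fs.cleaner.enabled=true
      spark.history.fs.cleaner.interval=1d
      spark.history.fs.cleaner.maxAge=2d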
  36. Basic Tuning
      Hive
      ● hive.merge.mapredfiles=true
      ● hive.merge.smallfiles.avgsize=128000000
      ● mapreduce.input.fileinputformat.split.maxsize=2147483648
      ● mapreduce.input.fileinputformat.split.minsize=134217728
      ● mapreduce.input.fileinputformat.split.minsize.per.node=134217728
      ● mapreduce.input.fileinputformat.split.minsize.per.rack=134217728
      hive> ALTER TABLE xxx PARTITION (dt='19840312') CONCATENATE;
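As a usage sketch (assumed, not from the deck): the same merge settings can be applied per session before rewriting a partition so that its output lands in fewer, larger files. The table and column names below are placeholders loosely based on the ApiAccessLog example.

      hive> SET hive.merge.mapredfiles=true;
      hive> SET hive.merge.smallfiles.avgsize=128000000;
      hive> INSERT OVERWRITE TABLE api_access_log PARTITION (dt='19840312')
          > SELECT request_id, method, path, status FROM api_access_log_staging WHERE dt='19840312';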
  37. Conclusion
  38. Running a Platform: What is required? ● Be patient How to achieve results? ● Trial and error ● Never give up
  39. THANK YOU
