
OASIS - Data Analysis Platform for Multi-tenant Hadoop Cluster


Keiji Yoshida
LINE / Data Labs

This year, we created a web-based data analysis platform called "OASIS". All LINE employees can analyze their services' data as they like by writing Apache Spark application code on OASIS and submitting it to LINE's multi-tenant Hadoop cluster. Analysis results can be shared within a team or department. Currently, about 500 employees and 40 teams/departments use OASIS.

This session will cover 1) why we created OASIS instead of using existing data analysis software, 2) its features and system architecture, and 3) how LINE employees use it.


  1. OASIS: DATA ANALYSIS PLATFORM FOR MULTI-TENANT HADOOP CLUSTER Keiji Yoshida - Data Engineer, Data Labs
  2. OASIS • Web-based data analysis platform • Enables employees to analyze their service's data
  3. Agenda 1. Motivation 2. Features 3. Use Cases
  4. Agenda 1. Motivation 2. Features 3. Use Cases
  5. DATA PLATFORM • Service data (LINE Ads Platform, LINE Creators Market, LINE NEWS, LINE Pay, LINE LIVE, LINE MOBILE) is loaded into the Hadoop Cluster (Data Lake) via ETL and consumed for Analysis and BI / Reporting
  6. DATA DEMOCRATIZATION • Make the Hadoop cluster public within LINE • Enable employees to analyze their service's data as they like • Speed up the data analysis process and decision making
  7. REQUIREMENTS 1. Security 2. Stability 3. Features
  8. 1. SECURITY • Strict access control • Allow employees to access only their service's data
  9. 1. SECURITY • Kerberos authentication • Apache Ranger for authorization
  10. 2. STABILITY • Isolation of applications • Resource control
  11. 2. STABILITY • Apache Spark on YARN • Utilize YARN's resource control mechanism
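A common way to realize YARN-level resource control in a multi-tenant cluster is the CapacityScheduler's per-tenant queues. The property names below are standard capacity-scheduler.xml settings (shown in shorthand), but the queue names and percentages are purely illustrative; the slides do not disclose LINE's actual configuration.

```
yarn.scheduler.capacity.root.queues = ads,pay,news
yarn.scheduler.capacity.root.ads.capacity = 30
yarn.scheduler.capacity.root.pay.capacity = 40
yarn.scheduler.capacity.root.news.capacity = 30
yarn.scheduler.capacity.root.ads.maximum-capacity = 60
```

Each queue is guaranteed its capacity share and can borrow idle resources only up to its maximum-capacity, which keeps one tenant's heavy job from starving the others.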
  12. 3. FEATURES
      Role           | SQL | Programming | Data Science | Required Features
      Manager        |  X  |      X      |      X       | Result Sharing
      Planner        |  O  |      X      |      X       | Query Result Visualization
      Engineer       |  O  |      O      |      X       | ETL
      Data Scientist |  O  |      O      |      O       | Ad Hoc Data Analysis
      (O = has the skill, X = does not)
  13. 3. FEATURES • Query execution • Query result visualization • Code execution (Scala, Python, R) • Scheduled execution • Result sharing • Result access control
  14. 3. FEATURES
      Apache Zeppelin • Has security and stability issues
      Jupyter • Does not support query result visualization • Does not support scheduled execution
      Redash • Does not support Spark application code execution • Does not support user impersonation
      Apache Superset • Does not support Spark application code execution • Does not support scheduled execution
      Apache Hue • Relies on Apache Livy • Does not support concurrent Spark SQL execution • Does not support Spark application sharing
  15. APACHE ZEPPELIN 0.7.3: SECURITY • Configurable execution user
  16. APACHE ZEPPELIN 0.7.3: SECURITY • A user can launch a Spark application under another user's account • This circumvents Apache Ranger's access control on HDFS
  17. APACHE ZEPPELIN 0.7.3: STABILITY • Runs only on a single server • Does not support "yarn-cluster" mode • Prone to freezing when many driver programs run on the single server
  18. OASIS
  19. Agenda 1. Motivation 2. Features 3. Use Cases
  20. OVERVIEW • Spark application submission • Query result visualization • Notebook sharing • Notebook scheduling • Multiple servers
  21. SYSTEM ARCHITECTURE • End users reach OASIS through the Frontend / API • OASIS components: Frontend / API, Spark Interpreter, Job Scheduler, MySQL, Redis • Spark applications run on the Hadoop YARN cluster against HDFS, with Apache Ranger for authorization
  22. NOTEBOOK CREATION
  23. SPARK APPLICATION • Launched per notebook session • Uses the notebook author's account for accessing HDFS • Supports Spark, Spark SQL, PySpark, and SparkR
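Running the per-notebook Spark application under the notebook author's account is, with Spark on YARN, typically achieved via spark-submit's --proxy-user flag. The sketch below only assembles and prints such a command; the jar name and user are hypothetical, not taken from the slides.

```shell
# Hypothetical launch of a notebook session's interpreter.
# --proxy-user makes HDFS/YARN treat the application as the notebook
# author, so Apache Ranger policies apply to that account (the OASIS
# service account must be configured as an allowed proxy user).
NOTEBOOK_AUTHOR="alice"
CMD="spark-submit --master yarn --deploy-mode cluster \
  --proxy-user ${NOTEBOOK_AUTHOR} \
  oasis-spark-interpreter.jar"
echo "$CMD"
```

This contrasts with Apache Zeppelin 0.7.3's configurable execution user (slide 16), where impersonation is a client-side setting rather than an enforced identity.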
  24. SPARK APPLICATION SHARING
  25. NOTEBOOK SHARING • Notebooks can be shared within a "space" • A "space" is the root directory of notebooks for each LINE service • Per space, users are granted either "read write" or "read only" access
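A minimal sketch of the two access rights, assuming a space simply keeps two member sets; the actual OASIS data model is not shown in the slides, so the class and names below are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Space:
    """A 'space': root directory of notebooks for one LINE service."""
    name: str
    read_write: set = field(default_factory=set)  # may edit and view notebooks
    read_only: set = field(default_factory=set)   # may only view notebooks

    def can_read(self, user: str) -> bool:
        return user in self.read_write or user in self.read_only

    def can_write(self, user: str) -> bool:
        return user in self.read_write


space = Space("line-pay", read_write={"alice"}, read_only={"bob"})
print(space.can_write("alice"), space.can_write("bob"), space.can_read("carol"))
# → True False False
```

Keeping the rights at the space level, rather than per notebook, matches the slide's model of one space per LINE service.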
  26. PARAMETERS • Parameters can be injected into a notebook • Read-only users can redraw a notebook while changing its parameters
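The slides do not show OASIS's placeholder syntax; assuming a `${name}` style, parameter injection can be sketched with Python's string.Template. The table name and parameter below are hypothetical.

```python
from string import Template


def inject_parameters(notebook_sql: str, params: dict) -> str:
    """Substitute ${name} placeholders in a notebook's SQL with user-supplied values."""
    return Template(notebook_sql).substitute(params)


# Hypothetical parameterized notebook query.
sql = "SELECT dt, COUNT(*) FROM pay.payments WHERE dt = '${target_date}' GROUP BY dt"
print(inject_parameters(sql, {"target_date": "2018-11-01"}))
# → SELECT dt, COUNT(*) FROM pay.payments WHERE dt = '2018-11-01' GROUP BY dt
```

Because only the parameter values change, a read-only user can re-execute and redraw the notebook without edit rights to its code.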
  27. SCHEDULING • Automatically execute notebooks • Keep notebook contents up to date • Periodically run ETL processing
  28. SMALL FILES PROBLEM • Small files consume a lot of the NameNode's memory • They degrade file lookup performance • The default value of spark.sql.shuffle.partitions is 200, so a single insert can produce 200 small files
  29. DATA INSERTION API • oasis.insertOverwrite(query, table) • Replaces spark.sql(query).write.mode("overwrite").insertInto(table) • The number of output files is optimized automatically
  30. OASIS.INSERTOVERWRITE(QUERY, TABLE) 1. Create a temporary table: spark.sql("create table …") 2. Insert the query result into the temporary table: spark.sql(query).write.insertInto(tmpTable)
  31. OASIS.INSERTOVERWRITE(QUERY, TABLE) 3. Calculate the optimal number of files: filesNum = total file size / block size 4. Recreate the temporary table's data with that number of files: spark.sql(query).repartition(filesNum).write.mode("overwrite").insertInto(tmpTable)
  32. OASIS.INSERTOVERWRITE(QUERY, TABLE) 5. Drop Hive partitions from the target table: spark.sql("alter table … drop partition …") 6. Move the temporary table's files to the target table: FileSystem.get(…).rename(tmpPath, targetPath)
  33. OASIS.INSERTOVERWRITE(QUERY, TABLE) 7. Add Hive partitions to the target table: spark.sql("alter table … add partition …") 8. Drop the temporary table: spark.sql("drop table …")
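Step 3's file-count calculation can be illustrated in isolation. A sketch assuming a 128 MB HDFS block size and rounding up so at least one file is always written (the slides give only the formula, not these details):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size of 128 MB


def optimal_file_count(total_file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Step 3 of insertOverwrite: aim for one output file per HDFS block."""
    return max(1, math.ceil(total_file_size / block_size))


# A 10 GB query result is repartitioned into 80 files instead of the
# 200 files the default spark.sql.shuffle.partitions would produce.
print(optimal_file_count(10 * 1024 ** 3))  # → 80
```

Writing block-sized files keeps the NameNode's per-file memory cost proportional to the data volume rather than to the shuffle-partition count.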
  34. MULTIPLE SERVERS • Scalable • Highly available • All components (Frontend / API, Spark Interpreter, Job Scheduler) run on multiple servers, sharing MySQL and Redis
  35. SPARK INTERPRETER ROUTING • Route information is managed in Redis • Code from the same notebook session is always sent to the same Spark Interpreter • New sessions are assigned to interpreters in round-robin fashion
  36. MULTIPLE JOB SCHEDULERS • Make the OASIS Job Scheduler highly available • Utilize Quartz's clustering feature • Job Schedulers coordinate through MySQL
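Quartz's clustering feature works through its JDBC job store: every Job Scheduler instance points at the same database tables, and the store decides which instance fires each trigger. A hedged quartz.properties sketch (the property names are Quartz's own; the data source name is illustrative):

```
org.quartz.scheduler.instanceId = AUTO
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
org.quartz.jobStore.dataSource = oasisDS
```

If one scheduler instance misses its periodic check-in, another instance recovers its jobs, which is what makes the OASIS Job Scheduler highly available.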
  37. HADOOP CLUSTER (DATA LAKE) • 500 DataNodes / NodeManagers • HDFS usage: 25 PB • 150+ Hive databases • 1,500+ Hive tables
  38. Agenda 1. Motivation 2. Features 3. Use Cases
  39. STATS • 500+ Users • 40+ Spaces • 1,600+ Notebooks
  40. USE CASES 1. Report 2. Interactive dashboard 3. ETL 4. Monitoring 5. Ad hoc analysis
  41. RECAP: OASIS • Data analysis platform for a multi-tenant Hadoop cluster • Data can be extracted, processed, visualized, and shared • Used for reporting, data monitoring, ad hoc analysis, etc. at LINE
  42. THANK YOU
