OASIS is a web-based data analysis platform that enables employees at LINE to analyze data from their services in a multi-tenant Hadoop cluster. It provides query execution, query result visualization, code execution in Scala/Python/R, scheduled execution, and result sharing. OASIS addresses security, stability, and feature requirements with technologies such as Apache Spark on YARN, Kerberos authentication, and Apache Ranger authorization. It also mitigates the HDFS "small files" problem through an insertOverwrite API and scales out across multiple servers. OASIS currently serves over 1,600 notebooks, 40+ services, and 500+ users at LINE.
Keiji Yoshida – Data Engineer, LINE Corporation
DataEngConf Barcelona: https://www.dataengconf.com/speakers-bcn18
Spark + AI Summit Europe: https://databricks.com/sparkaisummit/europe
5. DATA PLATFORM
• Services: LINE Ads Platform, LINE Creators Market, LINE NEWS, LINE Pay, LINE LIVE, LINE MOBILE
• Their data flows into a Hadoop cluster (data lake)
• The data lake is used for ETL, analysis, and BI / reporting
6. DATA DEMOCRATIZATION
• Make the Hadoop cluster public within LINE
• Enable employees to analyze their service's data as they like
• Speed up the data analysis process and decision making
(Diagram: services such as LINE Ads Platform and LINE Creators Market sharing one multi-tenant Hadoop cluster)
8. 1. SECURITY
• Strict access control
• Allow employees to access only their service's data
(Diagram: per-service access boundaries within the multi-tenant Hadoop cluster)
13. 3. FEATURES
• Query execution
• Query result visualization
• Code execution (Scala, Python, R)
• Scheduled execution
• Result sharing
• Result access control
14. 3. FEATURES
Existing tools were evaluated against these requirements:
• Apache Zeppelin: has security and stability issues
• Jupyter: does not support query result visualization or scheduled execution
• Redash: does not support Spark application code execution or user impersonation
• Apache Superset: does not support Spark application code execution or scheduled execution
• Apache Hue: relies on Apache Livy; does not support concurrent Spark SQL execution or Spark application sharing
16. APACHE ZEPPELIN 0.7.3 : SECURITY
• A user can launch a Spark application under another user's account
• This cheats Apache Ranger's authorization on HDFS
(Diagram: User A launching a Spark application as User B against HDFS / Apache Ranger)
17. APACHE ZEPPELIN 0.7.3 : STABILITY
• Runs only on a single server
• Does not support "yarn-cluster" mode, so every driver program runs on the Zeppelin server itself
• Freezes easily
(Diagram: one Apache Zeppelin server hosting driver programs 1 through 5)
23. SPARK APPLICATION
• Launched per notebook session
• Uses the notebook author's account for accessing HDFS
• Supports Spark, Spark SQL, PySpark, and SparkR
25. NOTEBOOK SHARING
• Notebooks can be shared within a "space"
• "space": the root directory of notebooks for each LINE service
• Access rights: "read write", "read only"
(Diagram: each space holds its notebooks together with its read-write and read-only users)
26. PARAMETERS
• Parameters can be injected into a notebook
• Read-only users can re-render a notebook with different parameter values
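The slides do not show OASIS's parameter syntax; as a minimal sketch, parameter injection can be modeled as placeholder substitution into a query template. The `${...}` syntax and the function name here are assumptions, not OASIS's actual API:

```python
import re

def inject_params(template: str, params: dict) -> str:
    """Replace ${name} placeholders in a notebook paragraph with
    user-supplied values (a sketch; the syntax is an assumption)."""
    def repl(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing parameter: {name}")
        return str(params[name])
    return re.sub(r"\$\{(\w+)\}", repl, template)
```

A read-only user re-rendering a notebook would amount to re-running the substituted query, e.g. `inject_params("SELECT * FROM logs WHERE dt = '${dt}'", {"dt": "2018-10-01"})`.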
28. SMALL FILES PROBLEM
• Small files consume a lot of the NameNode's memory
• They degrade query performance
• The default value of spark.sql.shuffle.partitions is 200, so a shuffled insert writes up to 200 output files regardless of result size
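To see why the default matters: an INSERT that goes through a shuffle writes one file per shuffle partition. A rough illustration (the 100 MB figure is made up, not from the talk):

```python
def avg_file_size_mb(result_mb: float, shuffle_partitions: int = 200) -> float:
    """Average output file size when a shuffled INSERT writes one file
    per shuffle partition (200 is Spark SQL's default)."""
    return result_mb / shuffle_partitions

# A 100 MB result fragments into 200 files of ~0.5 MB each, far below a
# typical 128 MB HDFS block; repeated scheduled inserts multiply the count.
```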
29. DATA INSERTION API
• oasis.insertOverwrite(query, table)
• Replaces spark.sql(query).write.mode("overwrite").insertInto(table)
• The number of output files is optimized automatically
31. OASIS.INSERTOVERWRITE(QUERY, TABLE)
3. Calculate the optimal number of files: filesNum = total file size / block size
4. Recreate the temporary table's data with that number of files: spark.sql(query).repartition(filesNum).write.mode("overwrite").insertInto(tmpTable)
32. OASIS.INSERTOVERWRITE(QUERY, TABLE)
5. Drop Hive partitions from the target table: spark.sql("alter table … drop partition …")
6. Move the temporary table's files to the target table: FileSystem.get(…).rename(tmpPath, targetPath)
33. OASIS.INSERTOVERWRITE(QUERY, TABLE)
7. Add Hive partitions to the target table: spark.sql("alter table … add partition …")
8. Drop the temporary table: spark.sql("drop table …")
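Putting slides 31 to 33 together, the whole flow can be sketched in PySpark roughly as below. This is a reconstruction from the slide snippets, not OASIS's actual code: the helper names, parameters, the 128 MB block size, and the assumption that a Hadoop FileSystem handle (`fs`) is available are all mine; steps 1 and 2 (creating and populating the temporary table) are not shown in the slides and are elided here.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size

def optimal_file_num(total_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    # Step 3: one file per HDFS block's worth of data, at least one file.
    return max(1, math.ceil(total_bytes / block_size))

def insert_overwrite(spark, fs, query, table, tmp_table,
                     tmp_path, target_path, partition_spec):
    """Sketch of oasis.insertOverwrite(query, table), steps 3 to 8."""
    # Step 3: derive the file count from the temporary table's total size
    # (getContentSummary/getLength are Hadoop FileSystem APIs).
    total_bytes = fs.getContentSummary(tmp_path).getLength()
    files_num = optimal_file_num(total_bytes)

    # Step 4: rewrite the temporary table with the optimal file count.
    spark.sql(query).repartition(files_num) \
        .write.mode("overwrite").insertInto(tmp_table)

    # Step 5: drop the partitions being replaced from the target table.
    spark.sql(f"alter table {table} drop partition ({partition_spec})")

    # Step 6: move the files with a cheap HDFS rename, not a copy.
    fs.rename(tmp_path, target_path)

    # Step 7: register the moved files as partitions of the target table.
    spark.sql(f"alter table {table} add partition ({partition_spec})")

    # Step 8: clean up the temporary table.
    spark.sql(f"drop table {tmp_table}")
```

The rename in step 6 is what makes the swap cheap: the rewritten files change owners atomically at the metadata level instead of being copied block by block.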
35. SPARK INTERPRETER ROUTING
• Route information is managed in Redis
• Code from the same notebook session goes to the same interpreter
• A load balancer distributes end users across frontend / API servers round-robin
(Diagram: end users reach frontend / API servers 1 and 2 through the load balancer; the frontends update and search route information in Redis and forward Spark application code to Spark interpreters 1 and 2)
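A minimal sketch of the sticky-routing logic, with a plain dict standing in for Redis (the key scheme and the round-robin assignment policy for new sessions are assumptions):

```python
import itertools

class InterpreterRouter:
    """Sticky routing: code from the same notebook session always goes to
    the same Spark interpreter. In OASIS the mapping lives in Redis
    (e.g. GET/SET on a per-session key); a dict stands in for it here."""

    def __init__(self, interpreters):
        self._routes = {}  # session id -> interpreter
        self._next = itertools.cycle(interpreters)

    def route(self, session_id: str) -> str:
        # Reuse the stored route if one exists; otherwise assign the
        # next interpreter round-robin and remember the choice.
        if session_id not in self._routes:
            self._routes[session_id] = next(self._next)
        return self._routes[session_id]
```

Storing the mapping in shared Redis rather than in each frontend's memory is what lets any frontend / API server behind the load balancer forward a session's code to the right interpreter.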
36. MULTIPLE JOB SCHEDULERS
• Make the OASIS job scheduler highly available
• Utilize Quartz's clustering feature: multiple scheduler instances coordinate through a shared MySQL database
(Diagram: job schedulers 1 through 3 clustered via Quartz over MySQL)
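Quartz clustering works through its JDBC job store; a typical quartz.properties for a setup like the one above might look as follows (the values and data source name are illustrative, not OASIS's actual configuration):

```properties
# JDBC job store shared by all scheduler instances (MySQL)
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.dataSource = oasisDS

# Clustering: instances coordinate through the database and take over
# each other's jobs on failure
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000

# Instances share a scheduler name but each gets a unique id
org.quartz.scheduler.instanceName = OasisJobScheduler
org.quartz.scheduler.instanceId = AUTO
```

With this in place, any of the three scheduler instances can fire a due job, and a crashed instance's jobs are recovered by the survivors.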
46. RECAP : OASIS
• Data analysis platform for a multi-tenant Hadoop cluster
• Data can be extracted, processed, visualized, and shared
• Used for reporting, data monitoring, ad hoc analysis, etc. at LINE