DATA ORCHESTRATION SUMMIT 2020
SuperDB
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Sandipan Chakraborty | Director Engineering
Topics
• Brief About Rakuten
• SuperDB Journey
• Our Data Landscape
• Challenges
• Approach
• Journey with Alluxio
Japan’s Largest e-commerce company
70+ Services
• Internet Services
• Fintech Services
• Communications & Contents
SuperDB: Centralized Data Platform for Rakuten Ecosystem
• 43 Services*1
• 700+ TeraBytes Normalized Data sets
• 6,500+ Users*2 from 40+ Businesses
• 70+ Services in Ecosystem
• PetaBytes of Data
*1 Excluding small services and common services
*2 # of weekly active users
Our Journey
• 2007 – 2012: Traditional EDW. Teradata, On-Premise. BI Reporting, Ad-Hoc Analysis.
• 2013 – 2018: Teradata + Hadoop Big Data Stack. Presto, Mesos DC/OS, On-Premise & GCP, Hadoop Cluster, GCS. Click Stream Data, Recommendation, Personalization, Support ML.
• 2019 – 2020: Multi-Cloud (GCP + Azure), Presto + Alluxio (POC), Mesos DC/OS, Kubernetes, Starburst Presto, Alluxio (Prod.), Cloud Storage, Hybrid Compute, Hadoop Clusters, Object Storage. Optimize Analytics, Optimize AI / ML. Teradata + Hadoop Big Data Stack.
Counts shown along the timeline: 1, 5, 8, 25+, 30+ services (2019), 40+ services (2020).
Our Data Landscape
Figure: SuperDB data landscape, from data producers to data consumers. Labels on the diagram:
• Business Generated (Data Producers): Web / RAT, User Transaction, IoT / Device, Apps
• Real-Time & Batch (Containers), Common Schema
• Business DWH / DL: AWS, Azure, GCP, Click Stream, On-premise; Transaction / aggregated
• SuperDB (Enterprise Repository): SINGLE VERSION OF TRUTH! Normalized
• SuperDB on-premise (JP) and SuperDB Cloud (US, Japan), with Auto-Sync
• On-Demand Scale (Cloud + Containers), Cloud Bursting (Containers), Common Schema
• Query Layer
• Insights & Data Science (Data Consumers): Data Projections and Feature Sets, Virtual Data Mart (On-Prem / GCP Cloud)
• Cloud Native & Hybrid Architecture: Granular Access Control, Data Encryption (End to End), Multi-Factor Authentication
• Faster Business Insight, Faster Time to Analysis, Quick Experimentation
Data challenges
• Diverse data from diverse sources, growing rapidly
Easier Data Management
• Based on Personas, Gives Transparency & Better Cost Control, Standardized and Automated
Faster & Better Insight & AI
• Start Analysis by ability to connect and Run anywhere
Support for Different Personas
Business Analysts
• Ad hoc Query Capacity
• Discover, Fast and Easy Access, OLAP & Low Latency
• BI Support and Reporting
Data Scientists
• Ad hoc Query Capacity (OLAP, low latency)
• Run workloads on large-scale computing
• Data Science Platform and tools for ML / AI workloads
Applications
• Ability to integrate with APIs
• Support for Data Sync to different clusters
Data Engineers
• Query, Data Ingestion and Transformation
• Scalable processing, long-running jobs
• Real-time and Batch Support
• Data QC Support
Governance, Audit & Security
• Secured Access Layer
• Ability to create Audit Reports
• Data Lineage and traceability
System Admin & Operators
• Maintaining the data system infrastructure
• Workload Tuning
• Data Pipeline maintenance
• Data QC
Sand-Box Users
• Creates, Joins, Ad-hoc Reports, KPIs
• Experiments & Quick Analysis
• Support for various Marketing activities
Our Challenges
System Scalability
• Compute elasticity for experiments.
• Adding capacity was a time-consuming process.
Data Availability
• Unable to address / optimize for different Personas.
• Legacy code and limited processing power resulted in job delays.
Data Freshness
• Too many data copy pipelines needed to be built, delaying access to data.
• Managing data copy pipelines to different clusters became an operational overhead.
Analytics Agility
• Data movement is needed before any analysis can be done. Not all data is present in the DWH for analysis.
• Quick analysis cannot be done across different businesses' data silos.
Our Approach
System Scalability
• Compute elasticity for experiments
• Adding capacity was a time-consuming process
Data Availability
• Unable to address / optimize for different Personas
• Legacy code and limited processing power resulting in job delays
Data Freshness
• Data sync cannot be done between different clusters in DCs
• Too many data copies
Analytics Agility
• Cannot join between Transaction & behavior data
• Needs a lot of data movement
• Quick analysis cannot be done
Hybrid & Cloud-native architecture
• On-demand compute with Public Cloud
• Separate storage and processing
• Containerization and Cloud Native
Data Sync & Orchestration (Alluxio)
• Data sync across DCs and Cloud
• Data processing cache layer
Query Layer (Starburst Presto)
• Start analytics by connecting to different stores on multi-cloud and on-prem before any data movement
• Common security layer with Ranger
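To illustrate the query-layer idea of joining transaction and behavior data in place, here is a minimal sketch that builds a Presto-style federated query using fully qualified `catalog.schema.table` names. The catalog names (`hive_onprem`, `hive_gcs`), schemas, tables, and columns are hypothetical placeholders, not names from the SuperDB deployment.

```python
def build_federated_join(onprem_catalog: str, cloud_catalog: str) -> str:
    """Return a SQL string that joins tables living in two different catalogs.

    Presto addresses tables as catalog.schema.table, so one query can
    read from an on-prem store and a cloud store without copying data first.
    All schema, table, and column names here are illustrative assumptions.
    """
    return (
        "SELECT t.user_id, COUNT(*) AS clicks\n"
        f"FROM {onprem_catalog}.sales.transactions AS t\n"
        f"JOIN {cloud_catalog}.web.click_events AS c\n"
        "  ON t.user_id = c.user_id\n"
        "GROUP BY t.user_id"
    )


# Hypothetical catalog names: one backed by on-prem HDFS, one by GCS.
query = build_federated_join("hive_onprem", "hive_gcs")
print(query)
```

The resulting string could be submitted through any Presto client; the point of the sketch is only that the join spans two catalogs in a single statement.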
One Major Challenge
Figure: data flow from Data Sources and Teradata into Legacy HDFS, New HDFS, and GCS. Labels on the diagram: PwC, ODIN, Python, and Spark Pipeline (marked Legacy or New), copy steps between the stores, and downstream Pipeline X, Pipeline Y, Pipeline Z.
❖ ODIN is a homebrewed data ingestion system.
❖ Legacy HDFS and New HDFS are in different data centers, so downstream migration is not straightforward due to computing resource constraints.
Data Sync
Figure: Source Data enters the Alluxio Ingest cluster and is persisted to HDFS Clusters in Rakuten DC1 and Rakuten DC2 and to GCS in GCP. Labels on the diagram: Alluxio Ingest, Alluxio X, Alluxio Y, Alluxio Z.
❖ Alluxio Ingest Cluster: data persists to multiple destinations via Under Store Replication.
❖ Consumption tools cache data from different DCs to improve performance and enable DR.
Released in Production
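The multi-destination persist described above can be pictured with a small conceptual model. This is a toy sketch in plain Python, not the Alluxio API or its configuration: the class names and store names (`hdfs-dc1`, `hdfs-dc2`, `gcs`) are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class UnderStore:
    """Toy stand-in for one storage destination (an HDFS cluster or a bucket)."""
    name: str
    files: dict = field(default_factory=dict)

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data


@dataclass
class IngestCluster:
    """Toy ingest layer: one logical write fans out to every under store."""
    under_stores: list

    def persist(self, path: str, data: bytes) -> list:
        written = []
        for store in self.under_stores:
            store.write(path, data)
            written.append(store.name)
        return written


# Hypothetical destinations mirroring the slide: two data centers plus GCS.
dc1, dc2, gcs = UnderStore("hdfs-dc1"), UnderStore("hdfs-dc2"), UnderStore("gcs")
ingest = IngestCluster([dc1, dc2, gcs])
destinations = ingest.persist("/landing/orders/part-0000", b"order rows")
print(destinations)
```

The producer writes once and every destination ends up with the same file, which is what removes the separate copy pipelines between clusters.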
Data Caching for Consumption
GKE (GCP) & AKS (Azure), 2020 Production
• Alluxio master and three Alluxio workers
• Presto Coordinator and three Presto workers
• Mem Cache (shown four times)
On-Prem Bare Metal (POC)
• Three physical boxes
• HDFS: DC local, HDFS: DC remote 1, HDFS: DC remote 2
• Alluxio master and three Alluxio workers
• Presto Coordinator and three Presto workers
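The caching pattern in these diagrams, where a memory cache sits between the query workers and remote HDFS, can be sketched as a read-through cache. This is a conceptual toy, not Alluxio code; the LRU eviction policy, the capacity, and the paths are assumptions chosen for the example.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy memory cache in front of a slower remote store."""

    def __init__(self, remote_store: dict, capacity: int):
        self.remote_store = remote_store   # stands in for remote HDFS
        self.capacity = capacity           # max number of cached paths
        self.cache = OrderedDict()         # insertion order tracks recency
        self.remote_reads = 0              # counts trips to the remote store

    def read(self, path: str) -> bytes:
        if path in self.cache:
            # Cache hit: mark as most recently used and skip the remote read.
            self.cache.move_to_end(path)
            return self.cache[path]
        # Cache miss: fetch from the remote store, then keep a local copy.
        data = self.remote_store[path]
        self.remote_reads += 1
        self.cache[path] = data
        if len(self.cache) > self.capacity:
            # Evict the least recently used entry (assumed policy).
            self.cache.popitem(last=False)
        return data


remote_hdfs = {"/superdb/orders": b"orders", "/superdb/clicks": b"clicks"}
cache = ReadThroughCache(remote_hdfs, capacity=1)
cache.read("/superdb/orders")   # miss, goes to the remote store
cache.read("/superdb/orders")   # hit, served from memory
cache.read("/superdb/clicks")   # miss, evicts the orders entry
print(cache.remote_reads, list(cache.cache))
```

Repeated reads of the same path touch the remote data center only once, which is the effect the cache layer is meant to have on query latency.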
Consumption in Production Today
• On-Prem Bare Metal (2019 - Early 2020)
• Bare Metal (K8 Cluster): Present Production
Figure labels: three physical boxes; HDFS DC1, HDFS DC2; Alluxio master and three Alluxio workers; Presto Coordinator and three Presto workers.
Distributed Cache (Presently under POC)
Figure labels, top to bottom:
• TensorFlow / Caffe, Spark Compute (Transformation), Spark Compute (Aggregations)
• Distributed Cache: Libfuse, AlluxioFUSE, Alluxio JVM
• Kubernetes, Kubeflow, Linux
• Rakuten OneCloud, Bare Metal, GPU, CPU
• HDFS, Object Store, NAS
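Because a FUSE layer exposes cached data as an ordinary file system, ML and pipeline jobs can read it with standard file APIs. The sketch below shows that access pattern only; it uses a temporary directory as a stand-in for the mount point, and the function name, directory layout, and file names are hypothetical.

```python
import tempfile
from pathlib import Path


def iter_training_files(mount_point: Path, suffix: str = ".csv"):
    """Yield (path, bytes) for every matching file under a POSIX mount point."""
    for path in sorted(mount_point.rglob(f"*{suffix}")):
        yield path, path.read_bytes()


# A temporary directory stands in for the FUSE mount in this example.
with tempfile.TemporaryDirectory() as tmp:
    mount = Path(tmp)
    (mount / "features").mkdir()
    (mount / "features" / "day1.csv").write_bytes(b"user_id,clicks\n1,3\n")
    (mount / "features" / "notes.txt").write_bytes(b"ignored")
    loaded = [(p.name, len(data)) for p, data in iter_training_files(mount)]

print(loaded)
```

The training code needs no storage-specific client: pointing `mount_point` at a real mount would be the only change, under the assumption that the mount exposes the same directory layout.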
Our Journey with Alluxio
• 2017: Started using Presto Open source (On-Prem)
• 2018: Started using Presto Open source (GCP)
• 2019: POC with Presto + Alluxio (GCP)
• 2020: Presto + Alluxio (GCP, Azure); POC: Distributed Cache with Alluxio for ML & Data Pipeline Jobs; Data Sync with Alluxio (On-Prem)
• 2021: Planned: Distributed Cache with Alluxio for ML & Data Pipeline Jobs
Overview: Wrap-up
Figure labels:
• RDB, NoSQL, Files, events
• Pipeline Service, Hadoop, Cloud
• Discovery Service, Consumption Service
• Transformations: Landing zone, Common Schema mapping, Common Marts
• Changelogs
• Data Orchestration Layer
• Presto, Spark, BI tools, AI / ML, Data Exploring, Downstream pipelines
• Schema management, Data ACL, Classification, Auditing