Challenge And Evolution Of Data Orchestration at Rakuten Data System

Data Orchestration Summit 2019
www.alluxio.io/data-orchestration-summit-2019
Nov 7, 2019

Speaker: Lei Ai, Rakuten

  1. Challenge and Evolution of Data Orchestration at Rakuten Data System. Nov 7th, 2019. Lei Ai, Rakuten Big Data Architect
  2. Highlights: Introduction; Challenges and Problems; Solutions; Q & A
  4. Legend. V1: A decade ago, a Teradata-based warehouse (diagram: Data Sources, Teradata, Applications). V2: In 2013, Rakuten started to expand the enterprise data warehouse into a data lake on HDFS (diagram: Data Sources, Teradata, Applications, HDFS). V2: More complexity coming (diagram: Data Sources, Teradata, Applications, HDFS, New HDFS, more HDFS coming, GCS).
  5. Legacy. Diagram labels: Data Sources, Teradata, Legacy HDFS, New HDFS, PwC, ODIN, copy (x3), Pipeline X, Pipeline Y, Pipeline Z. Notes: ODIN is a homebrewed data ingestion system. Legacy HDFS and New HDFS are in different data centers, so downstream migration is not straightforward due to computing resource constraints.
  6. Legacy. Challenges: data replication (too many storages and tasks); consumption (combining different storages); performance (Hadoop cluster fully loaded); data freshness; downstream applications fully coupled with the source. Considerations: unify data pipeline jobs; a total solution for data replication; a common layer for consumption; consumption performance (cache); CDC-based data pipeline; hybrid cloud computing; decoupling of downstream applications.
  7. Orchestration: Overview. V3: Next generation (diagram: Data Sources, Teradata, Applications, HDFS, New HDFS, more HDFS coming, GCS).
  8. Orchestration: Overview. Diagram labels: RDB, NoSQL, Files, events, Pipeline Service, Hadoop, Discovery Service, Consumption Service, Transformations, Landing zone, Common Schema mapping, Common Marts, Data Orchestration Layer, Presto, Notebooks, BI tools, Report tools, Data Exploring, Downstream pipelines, Spark, Schema management, Data ACL, Classification, Auditing, Changelogs (twice), Cloud.
  9. Orchestration. Case 1: Consumption: Presto + Alluxio in GCP prod env. Panels: Deployment Overview; Architecture. Diagram labels: Alluxio master, Alluxio worker (x3), Presto Coordinator, Presto worker (x3), Mem Cache (x4), Metastore for Alluxio, Hive Metastore, Infrastructure, Platform, Application.
  10. Orchestration. Data Size: Row Counts 15 million; Data Size 10G. Improvements: Query1 28.6%, Query2 30.8%, Query3 25.0%, Query4 18.0%, Query5 36.5%. Charts: "Presto Aggregation Query Times" and "Presto Aggregation Query Source Stage Times", each comparing GCS and Alluxio for Query1 (app_id), Query2 (app_ver), Query3 (device_type), Query4 (source IP), and Query5 (user ID).
  11. Orchestration. Case 2: Consumption (POC): Presto + Alluxio to multiple Hadoop clusters. Diagram labels: Physical box (x3), HDFS: DC local, HDFS: DC remote 1, HDFS: DC remote 2, Alluxio master, Alluxio worker (x3), Presto Coordinator, Presto worker (x3), Metastore for Alluxio, Hive Metastore.
  12. Orchestration. Case 3: Data Pipeline (POC): Data Orchestration. Diagram labels: Source Data, Alluxio Ingest, Alluxio X, HDFS Cluster (x2), GCS, Alluxio Y, Alluxio Z, Rakuten DC1, Rakuten DC2, GCP. Notes: In the Alluxio Ingest Cluster, data persists to multiple destinations via Under Store Replication. Consumption tools cache data from different DCs to improve performance and enable DR.
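
The pattern described in Case 3 (one ingest write persisted to several under stores, with consumers reading through a cache) can be illustrated with a toy model. This is only a minimal sketch under stated assumptions: it does not use the Alluxio API, and the class names `UnderStore`, `IngestLayer`, and `CachingReader`, the store names, and the path are all hypothetical, introduced purely for illustration.

```python
class UnderStore:
    """Hypothetical stand-in for a backing store (e.g. an HDFS cluster or GCS)."""

    def __init__(self, name):
        self.name = name
        self.objects = {}   # path -> bytes
        self.reads = 0      # number of reads that reached this store

    def put(self, path, data):
        self.objects[path] = data

    def get(self, path):
        self.reads += 1
        return self.objects[path]


class IngestLayer:
    """Accepts one write and persists it to every configured under store."""

    def __init__(self, under_stores):
        self.under_stores = list(under_stores)

    def write(self, path, data):
        for store in self.under_stores:
            store.put(path, data)


class CachingReader:
    """Read-through cache in front of a single under store."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def read(self, path):
        # Only the first read of a path goes to the under store.
        if path not in self.cache:
            self.cache[path] = self.store.get(path)
        return self.cache[path]


# Hypothetical stores, one per location named on the slide.
dc1 = UnderStore("hdfs-dc1")
dc2 = UnderStore("hdfs-dc2")
gcs = UnderStore("gcs")

ingest = IngestLayer([dc1, dc2, gcs])
ingest.write("/events/part-0", b"payload")

reader = CachingReader(gcs)
first = reader.read("/events/part-0")
second = reader.read("/events/part-0")  # served from the cache
```

In this sketch the second read never touches the under store, which is the effect the caching layer is meant to provide; because every store holds a copy, a reader could also be pointed at a different store if one location were unavailable.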