The Data Platform Administration
Handling the 100 PB
May 19th, 2022
Yongduck Lee
Cloud Platform Department
Rakuten Group, Inc.
About me
Lecture History
- Colloquium Lecturer at KAIST
Program Committee
- BigComp2017/2019
- EDB 2016
Certification
- Certified Scrum Master (CSM)
- Certified Project Management Professional (PMP #1255421)
… ETC
Lee Yongduck Daniel
Vice Section Manager and Senior Architect in the Data Storage and Processing Section at
Rakuten Group, Inc.
Started as a Recommendation Engine Developer and now focuses on researching and verifying
new Big Data technologies and on supporting users who want to use the Big Data system.
B.Sc. from Korea University in 2001.
21 years in Japan, working for organizations and companies such as NHK, NTTD, and
Rakuten Group, Inc.
CONTENTS
1. Global Internet & Data Explosion
2. Data in Rakuten
3. Data platform & Big Data Administrator in Rakuten
4. What Advantages as Engineer in Rakuten
Internet & Globalization
The Internet is the global system of interconnected computer networks that use the Internet protocol
suite (TCP/IP) to link devices worldwide. It is a network of networks that consists of private, public, academic,
business, and government networks of local to global scope, linked by a broad array of electronic, wireless,
and optical networking technologies
The origins of the Internet date back to research commissioned by the federal government of the
United States in the 1960s to build robust, fault-tolerant communication with computer networks.
https://en.wikipedia.org/wiki/Internet#World_Wide_Web
Globalization
Chances
Vast Information
- Structure: Unstructured 80%, Structured 20%
- Volume: 35.2 ZB in 2020
* From IDC white paper & EMC
Internet Users
Internet users are defined as persons who accessed the Internet in the last 12 months from any device,
including mobile phones.
https://en.wikipedia.org/wiki/List_of_countries_by_number_of_Internet_users#cite_note-UN_WPP-14
In Japan, 92.3% of the population was using the Internet as of 2018
(population 127,202,192; Internet users 117,400,000).
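As a quick check, the 92.3% figure follows directly from the two counts quoted above. The variable names in this small sketch are illustrative only.

```python
# Figures quoted on the slide (Japan, 2018).
population = 127_202_192
internet_users = 117_400_000

# Share of the population that used the Internet, as a percentage.
penetration_pct = internet_users / population * 100

print(f"{penetration_pct:.1f}%")  # prints 92.3%
```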
The Big Data in Rakuten
Diversity and Synergy
There is huge potential value and possibility in the diversity of services and users, not only
from Japan but also globally. It is a very interesting and ideal environment for Data
Scientists and Data Analysts.
Combining data from various services increases the synergy effect on personalization,
clustering, segmentation, etc.
Scale
A large volume of data arrives every day, every month, and every year from services and users.
Storing this data and making it easy for data users to utilize is a big challenge for System
Infrastructure Engineers and Data Engineers.
Rakuten Hadoop and Kafka
Supports near-real-time and streaming processing in each region.
Handles around 1.3 million messages/sec in total (10 GB/sec in/out) at peak time on a
normal day.
During the 2021 Super Sale, we handled more than 2.5 times that volume of messages
and traffic.
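To put the Super Sale figure in concrete terms, the sketch below scales the normal-day peak by the quoted factor. It assumes the factor applies equally to message rate and traffic; because the slide says "more than 2.5 times", the results are lower bounds. The function and variable names are illustrative.

```python
# Figures quoted on the slide for peak time on a normal day.
normal_peak_msgs_per_sec = 1_300_000
normal_peak_gb_per_sec = 10  # combined in/out traffic

# "More than 2.5 times" on the slide, so scaled values are lower bounds.
SUPER_SALE_FACTOR = 2.5


def scale_peak(value, factor=SUPER_SALE_FACTOR):
    """Return the lower-bound Super Sale load for a normal-day peak value."""
    return value * factor


super_sale_msgs_per_sec = scale_peak(normal_peak_msgs_per_sec)
super_sale_gb_per_sec = scale_peak(normal_peak_gb_per_sec)

print(f"messages/sec: more than {super_sale_msgs_per_sec:,.0f}")
print(f"traffic: more than {super_sale_gb_per_sec:g} GB/sec")
```

That is, more than 3.25 million messages/sec and more than 25 GB/sec under this assumption.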
Supports Data Lake, Data Mart, and Data Analysis for Rakuten services in each region.
Data scientists mine a great deal of value from this big data and contribute it back to
Rakuten services.
Kafka: 800 cores, 20 TB memory, 4,728 topics
Hadoop: 80K cores, 600 TB memory, 160K TB disk
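The cluster sizes above are easier to compare after a unit conversion. The sketch below assumes decimal units (1 PB = 1000 TB, 1 TB = 1000 GB) and reads "80K" and "160K" as 80,000 and 160,000. The dictionary layout and function name are illustrative.

```python
TB_PER_PB = 1000  # decimal units assumed
GB_PER_TB = 1000

# Capacity figures quoted on the slide.
clusters = {
    "Kafka": {"cores": 800, "mem_tb": 20},
    "Hadoop": {"cores": 80_000, "mem_tb": 600, "disk_tb": 160_000},
}


def mem_gb_per_core(spec):
    """Return memory per core in GB for one cluster spec."""
    return spec["mem_tb"] * GB_PER_TB / spec["cores"]


hadoop_disk_pb = clusters["Hadoop"]["disk_tb"] / TB_PER_PB

for name, spec in clusters.items():
    print(f"{name}: {mem_gb_per_core(spec):.1f} GB memory per core")
print(f"Hadoop disk: {hadoop_disk_pb:.0f} PB")
```

Under these assumptions, Kafka has 25 GB of memory per core, Hadoop has 7.5 GB per core, and the Hadoop disk capacity is 160 PB.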
The Challenge on Administration
The Big Data in Rakuten
- Platform/Middleware Administrator
- Users
- Project/Product Manager
- Big Data Platform Administrator
- Infra/Server Administrator
- Network Administrator
- Software/System Architect
- Software Developer
Administration Use Case (HBase)
A user reported performance issues on HBase, but users of other Hadoop components reported
no issues.
Confirm the way the user gets/puts data on HBase
• HBase
- Configuration
- Architecture, work/data flow
- Application/GC logs
• Dependency component (*HDFS)
- Read/write performance logs
- Application/GC logs
• Disk/memory/CPU load
• Kernel log
• Network connection
Date & time matching across the sources above
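Date and time matching across several log sources can be sketched as a simple window filter. This is a minimal illustration, not the tooling used at Rakuten: the event records, the incident time, and the five-minute window are all made up, and real events would be parsed from each component's logs.

```python
from datetime import datetime, timedelta


def events_near(events, incident_time, window=timedelta(minutes=5)):
    """Return events whose timestamp is within `window` of the incident, oldest first."""
    hits = [e for e in events if abs(e["time"] - incident_time) <= window]
    return sorted(hits, key=lambda e: e["time"])


# Hypothetical, hand-written events standing in for parsed log lines.
events = [
    {"source": "hbase-gc", "time": datetime(2021, 1, 1, 10, 2, 10), "msg": "long GC pause"},
    {"source": "hdfs-datanode", "time": datetime(2021, 1, 1, 10, 1, 30), "msg": "slow block read"},
    {"source": "kernel", "time": datetime(2021, 1, 1, 7, 45, 0), "msg": "disk I/O error"},
    {"source": "network", "time": datetime(2021, 1, 1, 10, 3, 5), "msg": "connection reset"},
]

# Time at which the user noticed the slowdown (made up for this example).
incident = datetime(2021, 1, 1, 10, 2, 0)

for e in events_near(events, incident):
    print(e["time"].isoformat(), e["source"], e["msg"])
```

Events from different layers that cluster around the same moment are the first candidates to investigate together; the kernel event here falls outside the window and is dropped.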
Data hot spotting
Data or configuration caching
HDFS
- JVM configuration change
- Increasing handlers
- Increasing scanner interval
HW improvement
- Master node replacement
- Reduced RegionServers
- Move HDD to NVMe
- Dedicated RegionServers
OS configuration
- Increasing root nproc and nofile limits on dedicated RegionServers
HBase
- TCPNoDelay, parallel seeking, master table locality
- Write / short-read / long-read queues
- DEADLINE scheduler, hedged reads, short-circuit reads
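The slide does not say how the hot spotting was addressed. One common general technique is to salt row keys so that sequential keys spread over several regions. The sketch below illustrates that idea in plain Python, without any HBase client. The function name, bucket count, and key format are assumptions for illustration.

```python
import hashlib


def salted_row_key(row_key: str, buckets: int = 16) -> str:
    """Prefix the key with a stable bucket id derived from a hash of the key."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{row_key}"


# Sequential keys (for example, timestamps) would otherwise sort next to
# each other and be written to the same region.
keys = [f"20210101{n:06d}" for n in range(1000)]
prefixes = {salted_row_key(k).split("-", 1)[0] for k in keys}

print(f"{len(prefixes)} distinct prefixes for {len(keys)} sequential keys")
```

The trade-off is on the read side: a range scan over the original key order has to fan out across all buckets, so salting suits write-heavy workloads better than scan-heavy ones.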
What Advantages in Rakuten as Data Engineer
You can work across all the domains a Big Data Platform requires and gain rich experience as a
Big Data Platform Administrator. Rakuten has experts with rich knowledge and experience in each
technical and management domain.
You can also work with stakeholders from various service domains, from the viewpoint of data utilization.
DB
Services
Event
INFRA
…