SKT Hadoop DW
SK telecom
Corporate R&D Center

Yousun Jeong
Copyright@ 2015 by SK Telecom All rights reserved.
Table of Contents
1. Big Data in SKT
2. What is Hadoop DW?
3. SQL on Hadoop - TAJO
4. Hadoop DW Commercialization Cases
1. Big Data in SKT

[ High TCO for Data Management ]
• 250 TB/day (91.25 PB/year)
• 4 Hadoop clusters with various commercial MPP databases for analytics
• Hadoop + Hive, MPP DBMS
• High TCO for data management (too much data is loaded into the MPP DBMS)
[Diagram: Operational Systems (Marketing, Sales, ERP, SCM), Integration Layer, Data Warehouse and Marts (Mart A, Mart B, Mart C, Mart D), with ODS and Staging Area stages]

[ One Unified Solution ]
• 30 PB+ (compressed) on 1000+ nodes
• 10+ Hadoop clusters with Tajo & Spark for all purposes
• Hadoop + Tajo + Spark
• Affordable & faster (unified framework for Big Data)
[Diagram: the same layers, from Operational Systems through Marts, served by the unified platform]
1. Big Data in SKT - SKT Hadoop Clusters

✓ Optimized configuration of a large-scale cluster
✓ Operation know-how of managing 1000+ nodes
✓ Fault-tolerant and effective resource management system

[Diagram: T-Hadoop clusters. Components: Data Collector (data collection & pre-processing), Main Cluster (analysis), R&D Cluster, Service Cluster, Service Logic Repository, App. 1 … App. N. Labels: ~250 TB/day; node counts of 700+, 200+, 100+ and 150+; "Data Feeding" between clusters; "Develop." and "Commercialize".]
2. What is Hadoop DW?

【 Hadoop DW Infrastructure 】
“Hadoop S/W and Commodity H/W Based Cost-effective IT Infrastructure System”
• Data: Structured/Un-structured Data; Scale-out Structure (Petabyte, Exabyte)
• Cost: Low price ($200 ~ $1,000 / TB)
• H/W: Commodity H/W (x86 Server)
• S/W: Hadoop Architecture (SQL on Hadoop; Transaction/Batch Processing (SQL); Hadoop File System)

【 Legacy IT Infrastructure 】
“High-price, High-performance Proprietary IT Infrastructure System”
• Data: Structured Data; Scale-up Structure (Terabyte)
• Cost: High price ($5,000 ~ $50,000 / TB)
• H/W: High Performance H/W (MPP, Fabric Switch, etc.)
• S/W: Proprietary S/W (RDBMS, etc.)

Solution: SKT Hadoop DW
The Hadoop DW provides a Hadoop-architecture-based data warehouse for enterprise environments, so that users can accommodate massive and growing volumes of data at a low cost.

※ MPP: Massively Parallel Processing, SAN: Storage Area Network, NAS: Network Attached Storage, RDBMS: Relational DB Management System, SQL: Structured Query Language
3. SQL on Hadoop - TAJO

[ Legacy Approach (MR) ]
Hive on MapReduce, HDFS, Hadoop Cluster
- Partially Distributed
- Sequential process

[ Tajo Approach ]
Tajo, HDFS, Hadoop Cluster + Tajo
- Fully Distributed
- Vector process

Processing Speed: process more data on the same clusters with improved processing speed
Response Speed: try more queries for analysis with improved response speed
[Chart labels: query on Hadoop Cluster: min; query on Hadoop Cluster + Tajo: few sec~min; "Up to 10x"]

High-speed SQL-on-Hadoop processing engine
• 3~5x faster processing than Hive on the TPC-H benchmark
• 80~100% of Impala's response speed, with no data size limit
• Full ANSI-SQL support for easy RDBMS migration
3. SQL on Hadoop - TAJO
SQL Support
▪ ANSI SQL support
▪ Partition Type
▪ Meta Store
Service Stability
▪ High Availability
▪ Resource Manager
▪ Fair Scheduler
Performance
▪ High-speed processing
▪ Shuffling
▪ Dynamic Query Optimizer
▪ Query Rewriting
System Integration
▪ BI Connector
▪ Proxy Support
▪ Tajo-R
Function Support
▪ Analytic Function
▪ Hive Function
[ Tajo Features ]
[ Performance Comparison ]
[ Apache Top-Level Project ]
3.1 Tajo Architecture

[Diagram: Tajo architecture]
• Clients: Tajo CLI, JDBC Driver, ODBC Driver
• Tajo Master: Client Service Handler, SQL Parser, Query Rewriter, Logical Planner, Logical Optimizer, Query Manager, Resource Manager
• Catalog: Tajo Catalog, HCatalog
• Persistent Storage: Derby Store, MySQL Store, PostgreSQL Store
• Worker:
  1. Query Master (Global Planner, Client Service Handler)
  2. TaskRunner (Local Query Engine, Storage Manager)
• Storage: Local, HDFS/HBase, S3 / Swift
3.1 Technical Characteristic - Logical Flow Data Processing

[ Tajo Master ]
• SQL Parser: query parsing
• Logical/Global Planner: decomposition of the query into work units
• Resource Manager: delivery of work units to the servers

[ Tajo Worker ] (one per server; several shown)
• Physical Planner: decomposition of a work unit into task operation units
• Query Engine: execution of unit operations
• Storage Manager: disk data I/O control
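
The master/worker split above can be illustrated with a toy sketch. This is not Tajo code: `parse`, `plan_work_units`, `dispatch` and `run_unit` are hypothetical names, the only supported "query" is a sum over one column, and the workers are plain function calls rather than remote processes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WorkUnit:
    """One slice of the input that a single worker can process alone."""
    unit_id: int
    rows: List[int]


def parse(sql: str) -> str:
    # Master, step 1 (query parsing): this toy accepts exactly one query shape.
    normalized = " ".join(sql.lower().split())
    if normalized != "select sum(x) from t":
        raise ValueError("toy parser only understands: SELECT SUM(x) FROM t")
    return "sum"


def plan_work_units(table: List[int], unit_size: int) -> List[WorkUnit]:
    # Master, step 2 (decomposition): split the table into fixed-size units.
    return [WorkUnit(start // unit_size, table[start:start + unit_size])
            for start in range(0, len(table), unit_size)]


def dispatch(units: List[WorkUnit], n_workers: int) -> Dict[int, List[WorkUnit]]:
    # Master, step 3 (delivery): assign units to workers round-robin.
    assignment: Dict[int, List[WorkUnit]] = {w: [] for w in range(n_workers)}
    for unit in units:
        assignment[unit.unit_id % n_workers].append(unit)
    return assignment


def run_unit(op: str, unit: WorkUnit) -> int:
    # Worker: execute the operation on its own slice of the data.
    if op != "sum":
        raise ValueError(f"unsupported operation: {op}")
    return sum(unit.rows)


def execute(sql: str, table: List[int], n_workers: int = 4, unit_size: int = 3) -> int:
    op = parse(sql)
    assignment = dispatch(plan_work_units(table, unit_size), n_workers)
    partials = [run_unit(op, unit)
                for worker_units in assignment.values()
                for unit in worker_units]
    return sum(partials)  # the master merges the partial results


print(execute("SELECT SUM(x) FROM t", list(range(10))))
```

Round-robin assignment is chosen only for brevity; the slides do not describe the policy the Resource Manager actually uses.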
3.1 Technical Characteristic - JIT Query Engine

Behavior depends on the operand characteristic:
- 1 + 2 = 3
- “a” + “b” = “ab”
- {1,2} + {3,4} = {4,6}
- 1 + {1,2} = {2,3}

<Existing methods>
A precompiled binary has to consider all possible cases
-> performance degradation (call, if, switch below 50%)
  switch(operand)
    Case numeric : add numeric
    Case string : add string

<JIT methods>
Real-time code generation based on operand type; the combined operation can be processed by compiler optimization.
Result = A x (1-B) + (1+C)
Four operations (+ twice, - once, x once) combined into a single operation
[Diagram: expression tree for Result, with + at the root and x, -, + nodes below it]
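
The contrast between the two methods can be sketched in a few lines. This is an illustration only: Python's built-in `compile`/`eval` stands in for native code generation, operand-type specialization is not shown, and `interpret`, `to_source` and `jit_compile` are hypothetical names, not Tajo APIs.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Result = A x (1 - B) + (1 + C), as a tree of (operator, left, right) tuples.
EXPR = ("+", ("*", "A", ("-", 1, "B")), ("+", 1, "C"))


def interpret(node, row):
    """Existing method: walk the tree and branch on node kind for every row."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](interpret(left, row), interpret(right, row))
    if isinstance(node, str):      # column reference
        return row[node]
    return node                    # numeric literal


def to_source(node):
    """Turn the tree into one Python expression string."""
    if isinstance(node, tuple):
        op, left, right = node
        if op not in OPS:
            raise ValueError(f"unsupported operator: {op}")
        return f"({to_source(left)} {op} {to_source(right)})"
    if isinstance(node, str):
        if not node.isidentifier():
            raise ValueError(f"bad column name: {node}")
        return node
    return repr(node)


def jit_compile(tree, columns):
    """JIT method: generate code once, then reuse one fused function per row."""
    source = f"lambda {', '.join(columns)}: {to_source(tree)}"
    return eval(compile(source, "<generated>", "eval"))


fused = jit_compile(EXPR, ["A", "B", "C"])
rows = [{"A": 2, "B": 3, "C": 4}, {"A": 5, "B": 0, "C": 1}]
for row in rows:
    # Both paths compute the same value; only the per-row work differs.
    assert interpret(EXPR, row) == fused(row["A"], row["B"], row["C"])
```

The generated function evaluates all four operations in one call, whereas the interpreter makes a recursive call and a type check per tree node for every row.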
3.1 Technical Characteristic - Vectorized Query Engine

<Tuple at a time>
- DB
- 1 operation/record
[Diagram: a1 + b1, a2 + b2, ..., a6 + b6, each computed as a separate operation]

<Vectorized engine>
- Vectorized data
- 1 operation/vector
A[] = {a1, a2, a3, a4, a5, a6}
B[] = {b1, b2, b3, b4, b5, b6}
C[] = A[] + B[]
[Diagram: the whole vector {a1 ... a6} added to the whole vector {b1 ... b6} in one operation]
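
A minimal sketch of the two execution models, counting how often the operator is invoked. Pure Python cannot show the hardware benefit (tight loops over typed arrays that a compiler can turn into SIMD instructions); it only shows the call structure. The function names and the `vector_size` parameter are assumptions made for the example.

```python
from typing import List, Tuple


def tuple_at_a_time(a: List[float], b: List[float]) -> Tuple[List[float], int]:
    """Invoke the add operator once per record."""
    calls = 0

    def add_record(x: float, y: float) -> float:
        nonlocal calls
        calls += 1
        return x + y

    result = [add_record(x, y) for x, y in zip(a, b)]
    return result, calls


def vectorized(a: List[float], b: List[float],
               vector_size: int = 1024) -> Tuple[List[float], int]:
    """Invoke the add operator once per vector of up to vector_size records."""
    calls = 0

    def add_vector(xs: List[float], ys: List[float]) -> List[float]:
        nonlocal calls
        calls += 1
        return [x + y for x, y in zip(xs, ys)]  # tight inner loop

    result: List[float] = []
    for start in range(0, len(a), vector_size):
        result.extend(add_vector(a[start:start + vector_size],
                                 b[start:start + vector_size]))
    return result, calls


A = [1, 2, 3, 4, 5, 6]
B = [10, 20, 30, 40, 50, 60]
row_result, row_calls = tuple_at_a_time(A, B)
vec_result, vec_calls = vectorized(A, B, vector_size=6)
assert row_result == vec_result
print(row_calls, vec_calls)
```

With six records and a vector size of six, the per-record path makes six operator calls and the vectorized path makes one, which is the "1 operation/record" versus "1 operation/vector" distinction on the slide.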
3.1 Technical Characteristic - Storage Manager

[Diagram: Tajo Workers (scan) send file requests to the Storage Manager. Components inside the Storage Manager: Scan Scheduler, one Request queue per Disk Scanner, Disk Scanners, Pre-fetching Buffer. Labels: "Fine granularity", "Bulk Read".]
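
One way to read the diagram is that fine-grained scan requests are queued per disk and then merged into larger bulk reads. The sketch below assumes exactly that; the slide names the components but not the policy, so the coalescing rule, the `ScanRequest` fields and the `ScanScheduler` methods are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Deque, Dict, List


@dataclass(frozen=True)
class ScanRequest:
    disk: str     # which disk scanner should serve the request
    path: str     # file to read
    offset: int   # starting byte
    length: int   # number of bytes


class ScanScheduler:
    """Routes requests to one queue per disk (assumed design, not Tajo's)."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[ScanRequest]] = defaultdict(deque)

    def submit(self, request: ScanRequest) -> None:
        self.queues[request.disk].append(request)

    def drain(self, disk: str) -> List[ScanRequest]:
        """Take all pending requests for a disk and merge contiguous ranges
        of the same file into single bulk reads."""
        pending = sorted(self.queues.pop(disk, deque()),
                         key=lambda r: (r.path, r.offset))
        bulk: List[ScanRequest] = []
        for req in pending:
            last = bulk[-1] if bulk else None
            if (last is not None and last.path == req.path
                    and last.offset + last.length == req.offset):
                # Extend the previous read instead of issuing a new one.
                bulk[-1] = ScanRequest(disk, req.path, last.offset,
                                       last.length + req.length)
            else:
                bulk.append(req)
        return bulk


scheduler = ScanScheduler()
for offset in (128, 0, 64):  # three fine-grained requests, out of order
    scheduler.submit(ScanRequest("disk0", "/t/part-0", offset, 64))
scheduler.submit(ScanRequest("disk1", "/t/part-1", 0, 64))
print(scheduler.drain("disk0"))
```

Here three 64-byte requests against the same file collapse into one 192-byte read on `disk0`, while the request for `disk1` stays in its own queue.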
4. Hadoop DW Commercialization Cases - Telco

[ SK Telecom ]

Business Challenge
• Explosion of log data with LTE service
• Increase in the types of data to be analyzed
• Insufficient DW capacity due to high cost

How SKT Hadoop DW Helped
✓ 3x storage expansion at the same price, or 80% reduction in unit price
✓ Enabled ad-hoc analysis of daily unstructured text data sets
✓ Decreased content-based analysis time from a few hours to 20 minutes max.

Category          MPP DBMS             Hadoop DW
Raw Data Size     0.5 TB/day           4 TB/day
Total ETL Time    Average of 3 hours   Average of 6 hours
DW Creation       30 minutes           40 minutes
Mart Creation     1 hour               1 hour 40 minutes
Report Creation   1 hour 30 minutes    2 hours 4 minutes
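
Because the two columns process different daily volumes, the times are easier to compare as throughput. The arithmetic below uses only the figures in the table and assumes that "Total ETL Time" covers the full daily raw volume; the variable names are illustrative.

```python
raw_tb_per_day = {"MPP DBMS": 0.5, "Hadoop DW": 4.0}
total_etl_hours = {"MPP DBMS": 3.0, "Hadoop DW": 6.0}

# Per-stage times from the table, in minutes.
stage_minutes = {
    "MPP DBMS": {"DW Creation": 30, "Mart Creation": 60, "Report Creation": 90},
    "Hadoop DW": {"DW Creation": 40, "Mart Creation": 100, "Report Creation": 124},
}


def throughput_tb_per_hour(system: str) -> float:
    """Raw data processed per hour of total ETL time."""
    return raw_tb_per_day[system] / total_etl_hours[system]


for system in raw_tb_per_day:
    listed = sum(stage_minutes[system].values())
    print(f"{system}: {throughput_tb_per_hour(system):.2f} TB/hour, "
          f"listed stages sum to {listed} minutes")

ratio = throughput_tb_per_hour("Hadoop DW") / throughput_tb_per_hour("MPP DBMS")
print(f"Throughput ratio (Hadoop DW / MPP DBMS): {ratio:.2f}")
```

On these figures the Hadoop DW handles 8x the raw data in 2x the time, so roughly 4x the throughput. The listed stages sum to the stated 3 hours for the MPP DBMS but to less than the stated 6 hours for the Hadoop DW; the slide does not say what else the 6-hour average includes.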
4. Hadoop DW Commercialization Cases - Manufacturer

[ Global Top-5 Semiconductor Player ]

Business Challenge
• Collects an immense amount of unstructured measurement data during manufacturing
• RDBMS & BI cannot handle this type of data
• Even data loading can take up to 20 min

How SKT Hadoop DW Helped
✓ Support for unstructured data through a variable column schema
✓ 100x increase in data processing capacity
✓ Decreased data loading time by 10x (to 2 min)
✓ Minimized user action for pivot/unpivot
Thank you.

IEEE International Conference on Data Engineering 2015
