Introduction to Apache Tajo:
Future of Data Warehouse
Jihoon Son / Gruter Inc.
I am
● Jihoon Son (@jihoonson)
○ Ph.D at Korea Univ.
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
○ LinkedIn
■ https://www.linkedin.com/in/jihoonson
Today's Topic: Tajo
● What is Tajo?
○ Tajo / tάːzo / 타조
○ Ostrich in Korean
■ Fastest two-legged animal in the world
Today's Topic: Tajo
● What is Apache Tajo?
○ Our Ostrich can do SQL processing on big data!
■ SQL-on-Hadoop system
■ Apache Top-level project
Maybe You Think ...
SQL-on-Hadoop? Boring..
This Ostrich is Different!
SQL-on-Hadoop Systems
● Long-running ETL jobs
● Low-latency interactive analysis
SQL-on-Hadoop Systems
● Requirements for long-running ETL jobs
○ Stable query execution
■ Fault-tolerance
● Can avoid query resubmission
○ Adaptation to dynamic environment
■ Available resources, unpredictable delays, ...
SQL-on-Hadoop Systems
● Requirements for low-latency interactive analysis
○ Fast query execution
■ Several query execution techniques
■ In-memory processing
Tajo is designed for Both Workloads
● Long-running ETL jobs
● Low-latency interactive analysis
Who is using Tajo?
Use Cases: SK Telecom
● Data warehousing & analysis
○ 1st telco in South Korea
■ 40 TB/day compressed data (2014)
SK Telecom: Before Tajo
[Figure: data warehouse architecture diagram. Labels: Operational Systems (Marketing, Sales, ERP, SCM); ETL; Integration Layer; ODS; Staging Area; Data Vault; Data Warehouse; Data Marts; Strategic Marts; Hadoop; MPP DBMS]
SK Telecom: After Tajo
[Figure: data warehouse architecture diagram. Labels: Operational Systems (Marketing, Sales, ERP, SCM); ETL; Integration Layer; ODS; Staging Area; Data Vault; Data Warehouse; Data Marts; Strategic Marts]
● Long-running ETL jobs
● Ad-hoc analysis
Use Cases: SK Telecom
● Significantly reduced ETL & analysis time
○ Daily analysis becomes possible
○ More exploratory analysis is now possible with the freed-up resources
Use Cases: Bluehole Studio
● Game log analysis
○ Finding principal causes of service-quality deficiencies
● Tajo on EMR
● Their first log analysis system
○ Easy and rapid deployment of Tajo
○ Gentle learning curve thanks to standard SQL
● Immediate action becomes possible for user complaints and hidden bugs
Use Cases: Melon
● Data discovery
○ Music streaming service (26 million users)
○ Analysis of purchase history for targeted marketing
● Significantly reduced analysis time
○ Faster analysis by replacing Hive with Tajo
○ More analysis becomes possible
So, Why should you use Tajo?
● Easy to use
○ ANSI-SQL standard compliance (2003)
■ CTAS, Window functions, ...
○ Mature SQL features
■ Most existing queries can be executed without modification
○ Various data format support
■ Text, JSON, ORC, Parquet, …
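To make the two named SQL features concrete, here is a minimal sketch of CTAS (CREATE TABLE AS SELECT) combined with a window function. It runs on SQLite through Python's built-in sqlite3 module as a stand-in engine, not on Tajo, and assumes an SQLite build with window function support (3.25 or later). The table and column names are made up for illustration.

```python
import sqlite3

# Stand-in engine: SQLite, NOT Tajo. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (user_id INTEGER, item TEXT, amount REAL);
    INSERT INTO purchases VALUES
        (1, 'song_a', 1.5), (1, 'song_b', 3.0),
        (2, 'song_a', 1.5), (2, 'song_c', 2.0), (2, 'song_d', 4.0);
""")

# CTAS: materialize a query result as a new table.
# The window function ranks each user's purchases by amount.
conn.execute("""
    CREATE TABLE ranked_purchases AS
    SELECT user_id, item, amount,
           RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS amount_rank
    FROM purchases
""")

# Read back the most expensive purchase of each user.
top_per_user = conn.execute(
    "SELECT user_id, item FROM ranked_purchases "
    "WHERE amount_rank = 1 ORDER BY user_id"
).fetchall()
print(top_per_user)
```

The same statements are plain standard SQL, which is the point of the slide: queries written this way should need little or no rewriting when moved between compliant engines.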
So, Why should you use Tajo?
● Optimized performance
○ Optimized code
■ Optimized I/O performance
● Nearly max I/O performance (~120MB/s) per disk
■ Off-heap data processing
● Mitigating GC overhead
○ Cost-based query plan optimization
■ Join ordering
■ Best algorithm selection
● According to input size
■ Progressive optimization
● Further optimize the query plan during query execution
● Especially excellent for long running queries
■ => Efficient star schema processing
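The idea of choosing a join algorithm according to input size and ordering joins for a star schema can be sketched as follows. This is a simplified illustration, not Tajo's optimizer: the size threshold, the strategy names, the smallest-dimension-first heuristic, and the table sizes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical threshold: a table at or below this size is cheap to ship to every worker.
BROADCAST_LIMIT_BYTES = 64 * 1024 * 1024


@dataclass(frozen=True)
class Table:
    name: str
    size_bytes: int  # estimated input size


def choose_join_algorithm(left: Table, right: Table) -> str:
    """Pick a join strategy from the estimated input sizes."""
    if min(left.size_bytes, right.size_bytes) <= BROADCAST_LIMIT_BYTES:
        return "broadcast"    # copy the small side to all workers
    return "repartition"      # shuffle both sides on the join key


def order_joins(fact: Table, dimensions: List[Table]) -> List[str]:
    """Greedy ordering for a star schema: join the smallest dimension first."""
    plan = []
    for dim in sorted(dimensions, key=lambda t: t.size_bytes):
        algorithm = choose_join_algorithm(fact, dim)
        plan.append(f"{fact.name} JOIN {dim.name} [{algorithm}]")
    return plan


# Illustrative sizes only.
fact = Table("store_sales", 400 * 1024**3)
dims = [
    Table("customer", 2 * 1024**3),
    Table("date_dim", 10 * 1024**2),
    Table("item", 50 * 1024**2),
]
plan = order_joins(fact, dims)
for step in plan:
    print(step)
```

A progressive optimizer would go one step further and re-run a decision like `choose_join_algorithm` with actual intermediate sizes observed during execution rather than with up-front estimates.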
So, Why should you use Tajo?
● Various storage type support
Logical Data Warehouse with Tajo
[Figure labels: Global view; Application; DBMS; NoSQL; Cloud storage; On-premise storage]
● Fast delivery
● Easy maintenance
● Simple data flow
How fast is Tajo?
Evaluation on Cloud Environment
● Google Cloud Platform
○ Instance type: n1-standard-8
■ 8 cores, 30 GB RAM
Target Systems
● Hive (0.12)
○ Baseline performance
○ Default configuration provided by GCP
■ Uses all CPU and memory
● Tajo (0.11.0)
○ Default configuration provided by GCP
■ Uses all CPU and memory
● Spark-SQL (1.5.0)
○ Default configuration provided by GCP
■ Uses all CPU and memory
■ Tungsten enabled by default
○ spark.sql.shuffle.partitions is adjusted for better performance
TPC-DS
● Data
○ 24 tables
■ Plain text format
■ Stored on Google Cloud Storage
● Queries
○ Only queries that can be executed on every system without modification
■ Exception: Hive 0.12 doesn't support implicit joins, so every query had to be rewritten for Hive
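For readers unfamiliar with the term, an implicit join lists tables separated by commas in FROM and puts the join condition in WHERE; the rewrite for an engine without that syntax uses an explicit JOIN ... ON. The sketch below shows the two equivalent forms on SQLite (used only as a convenient stand-in; the tables and data are made up and are not TPC-DS).

```python
import sqlite3

# Stand-in engine: SQLite. Tables and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
    INSERT INTO customers VALUES (1, 'kim'), (2, 'lee');
""")

# Implicit join: comma-separated tables, join condition in WHERE.
implicit_join = """
    SELECT o.order_id, c.name
    FROM orders o, customers c
    WHERE o.customer_id = c.customer_id
    ORDER BY o.order_id
"""

# Explicit join: the same query rewritten with JOIN ... ON.
explicit_join = """
    SELECT o.order_id, c.name
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
"""

implicit_rows = conn.execute(implicit_join).fetchall()
explicit_rows = conn.execute(explicit_join).fetchall()
print(implicit_rows == explicit_rows)
```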
SF 1000, 50 instances
[Figure: benchmark result charts (not recoverable); one chart is annotated "Cannot be run on 1TB"]
SF 10000, 50 instances
[Figure: benchmark result charts (not recoverable)]
Demo
Simple Demo on EMR
● Using the TPC-H data set, but
○ Lineitem table is stored on HDFS
○ Orders table is stored on PostgreSQL
○ Other tables are stored on S3
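The shape of such a federated query, one SQL statement joining tables that live in different storage systems, can be imitated on a single machine. In the sketch below, two separate SQLite databases stand in for HDFS and PostgreSQL; nothing here talks to Tajo, HDFS, PostgreSQL, or S3, and the schema alias `pg` and the sample rows are assumptions for illustration. Only the column names follow TPC-H.

```python
import sqlite3

# Stand-in only: each SQLite database plays the role of one storage system.
conn = sqlite3.connect(":memory:")                 # plays the "HDFS" store
conn.execute("ATTACH DATABASE ':memory:' AS pg")   # plays the "PostgreSQL" store

conn.executescript("""
    CREATE TABLE main.lineitem (l_orderkey INTEGER, l_quantity INTEGER);
    CREATE TABLE pg.orders (o_orderkey INTEGER, o_orderstatus TEXT);
    INSERT INTO main.lineitem VALUES (1, 5), (1, 7), (2, 3);
    INSERT INTO pg.orders VALUES (1, 'F'), (2, 'O');
""")

# One query over both stores: total quantity per order status.
rows = conn.execute("""
    SELECT o.o_orderstatus, SUM(l.l_quantity) AS total_quantity
    FROM main.lineitem AS l
    JOIN pg.orders AS o ON l.l_orderkey = o.o_orderkey
    GROUP BY o.o_orderstatus
    ORDER BY o.o_orderstatus
""").fetchall()
print(rows)
```

The query text does not care where each table physically lives; only the table registration differs per store, which is what makes a single global view over several systems possible.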
Apache Tajo
● Is excellent for both long-running ETL jobs and exploratory ad-hoc analysis
● Is very fast
● Supports query federation on diverse data sources
Get Involved!
● We are recruiting contributors!
● General
○ http://tajo.apache.org/
● Getting Started
○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
○ http://tajo.apache.org/downloads.html
● Issue tracker
○ http://issues.apache.org/jira/browse/TAJO
● Join the mailing list
○ dev-subscribe@tajo.apache.org
○ issues-subscribe@tajo.apache.org
Useful Links
● EMR bootstrap
○ https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo
● How to set up Tajo on EMR
○ http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/
Q & A
