Introduction to Apache Tajo: Future of Data Warehouse

Jihoon Son / Gruter Inc.

I am
● Jihoon Son (@jihoonson)
○ Ph.D. at Korea Univ.
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
○ LinkedIn
■ https://www.linkedin.com/in/jihoonson

Today's Topic: Tajo
● What is Tajo?
○ Tajo / tάːzo / 타조
○ Ostrich in Korean
■ Fastest two-legged animal in the world

Today's Topic: Tajo
● What is Apache Tajo?
○ Our ostrich can do SQL processing on big data!
■ SQL-on-Hadoop system
■ Apache Top-Level Project

Maybe You Think ...
SQL-on-Hadoop?
Boring...

SQL-on-Hadoop Systems
● Long-running ETL jobs
● Low-latency interactive analysis

Long-running ETL jobs
● Requirements
○ Stable query execution
■ Fault-tolerance
● Can avoid query resubmission
○ Adaptation to dynamic environment
■ Available resources, unpredictable delays, ...
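
The fault-tolerance requirement above can be illustrated with a minimal sketch. It assumes a query is split into independent tasks and that only a failed task is retried, so the whole query never has to be resubmitted. The names `run_with_task_retry` and `make_flaky_task` are hypothetical and the failures are simulated; this is not Tajo's actual scheduler.

```python
def make_flaky_task(value, failures):
    """Return a task that fails `failures` times before returning `value`."""
    state = {"remaining": failures}

    def task():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated worker failure")
        return value

    return task


def run_with_task_retry(tasks, max_attempts=3):
    """Run every task; retry only the task that failed, not the whole query."""
    results = {}
    attempts = {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = task()
                break  # this task is done, move on to the next one
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up only after the last allowed attempt
    return results, attempts


# Hypothetical query made of three scan tasks; one of them fails twice.
tasks = {
    "scan-0": make_flaky_task(10, failures=0),
    "scan-1": make_flaky_task(20, failures=2),
    "scan-2": make_flaky_task(30, failures=0),
}
results, attempts = run_with_task_retry(tasks)
print(results)
print(attempts)
```

Only `scan-1` is re-executed here; the results of the other tasks are kept, which is what makes long-running jobs tolerable when individual workers fail.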

Low-latency interactive analysis
● Requirements
○ Fast query execution
■ Several query execution techniques
■ In-memory processing

Tajo Is Designed for Both Workloads
● Long-running ETL jobs
● Low-latency interactive analysis

Use Cases: SK Telecom
● Data warehousing & analysis
○ 1st telco in South Korea
■ 40 TB/day compressed data (2014)

SK Telecom: Before Tajo
[Diagram: data moves through three ETL stages across Operational Systems, Integration Layer, and Data Warehouse. Labeled components: Marketing, Sales, ERP, SCM; ODS, Staging Area; Data Vault; Data Marts, Strategic Marts. Platforms: Hadoop, MPP DBMS]

SK Telecom: After Tajo
[Diagram: data moves through three ETL stages across Operational Systems, Integration Layer, and Data Warehouse. Labeled components: Marketing, Sales, ERP, SCM; ODS, Staging Area; Data Vault; Data Marts, Strategic Marts. The Hadoop and MPP DBMS labels no longer appear]
● Long-running ETL jobs
● Ad-hoc analysis

Use Cases: SK Telecom
● Significantly reduced ETL & analysis time
○ Daily analysis becomes possible
○ More exploratory analysis is now possible with the freed-up resources

Use Cases: Bluehole Studio
● Game log analysis
○ Finding principal causes of service-quality deficiencies

● Tajo on EMR

● Their first log analysis system
○ Easy and rapid deployment of Tajo
○ Gentle learning curve thanks to standard SQL
● Immediate action on user complaints and hidden bugs becomes possible

Use Cases: Melon
● Data discovery
○ Music streaming service (26 million users)
○ Analysis of purchase history for targeted marketing
● Significantly reduced analysis time
○ Faster analysis by replacing Hive with Tajo
○ More analysis becomes possible

So, Why Should You Use Tajo?

● Easy to use
○ ANSI SQL standard compliance (2003)
■ CTAS, window functions, ...
○ Mature SQL features
■ Most existing queries can be executed without modification
○ Various data format support
■ Text, JSON, ORC, Parquet, …

● Optimized performance
○ Optimized code
■ Optimized I/O performance
● Nearly maximum I/O throughput (~120 MB/s) per disk
■ Off-heap data processing
● Mitigates GC overhead

○ Cost-based query plan optimization
■ Join ordering
■ Best algorithm selection
● According to input size
■ Progressive optimization
● Further optimizes the query plan during query execution
● Especially effective for long-running queries
■ => Efficient star schema processing
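
To make "best algorithm selection according to input size" and "join ordering" concrete, here is a minimal sketch of one common heuristic. It assumes a broadcast join when the smaller input fits under a size cutoff, a repartition join otherwise, and joins the smallest dimension tables of a star schema first. The threshold, the function names, and the table sizes are all hypothetical; Tajo's real optimizer is not shown here.

```python
MB = 1024 * 1024
GB = 1024 * MB

# Hypothetical cutoff below which a table is cheap to ship to every worker.
BROADCAST_THRESHOLD_BYTES = 64 * MB


def choose_join_algorithm(left_bytes, right_bytes,
                          threshold=BROADCAST_THRESHOLD_BYTES):
    """Pick a join algorithm from the input sizes alone."""
    if min(left_bytes, right_bytes) <= threshold:
        return "broadcast"    # ship the small side, avoid shuffling the big one
    return "repartition"      # shuffle both sides on the join key


def plan_star_join(fact_bytes, dimensions):
    """Join the fact table with dimensions, smallest dimension first."""
    plan = []
    for name in sorted(dimensions, key=dimensions.get):
        algorithm = choose_join_algorithm(fact_bytes, dimensions[name])
        plan.append((name, algorithm))
    return plan


# Hypothetical star schema: one large fact table and three dimensions.
dimensions = {"customer": 2 * GB, "date_dim": 10 * MB, "item": 50 * MB}
plan = plan_star_join(500 * GB, dimensions)
for name, algorithm in plan:
    print(f"join fact with {name}: {algorithm}")
```

In a star schema most dimension tables are small, so a size-based rule like this lets most joins skip the expensive shuffle of the fact table.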

● Various storage type support

Logical Data Warehouse with Tajo
[Diagram: a global view over Application, DBMS, NoSQL, cloud storage, and on-premise storage]
● Fast delivery
● Easy maintenance
● Simple data flow

Evaluation on Cloud Environment
● Google Cloud Platform
○ Instance type: n1-standard-8
■ 8 cores, 30 GB RAM

Target Systems
● Hive (0.12)
○ Baseline performance
○ Default configuration provided by GCP
■ Uses all available CPU and memory
● Tajo (0.11.0)
● Spark SQL (1.5.0)
○ Tungsten enabled by default
○ spark.sql.shuffle.partitions adjusted for better performance

TPC-DS
● Data
○ 24 tables
■ Plain text format
■ Stored on Google Cloud Storage
● Queries
○ Only queries that can be executed on every system without modification
■ Exception: Hive 0.12 doesn't support implicit joins, so every query had to be rewritten for Hive
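
The kind of rewrite needed for Hive can be sketched as follows: an implicit join lists tables in FROM and puts the join condition in WHERE, while the explicit form uses JOIN ... ON. The example uses Python's built-in sqlite3 purely as a stand-in engine, and the tables and rows are hypothetical, not taken from TPC-DS.

```python
import sqlite3

# In-memory SQLite database as a stand-in SQL engine (not one of the
# benchmarked systems); tables and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (o_orderkey INTEGER, o_custkey INTEGER);
    CREATE TABLE customer (c_custkey INTEGER, c_name TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 20), (3, 10);
    INSERT INTO customer VALUES (10, 'alice'), (20, 'bob');
""")

# Implicit join: comma-separated tables, join condition in WHERE.
implicit_sql = """
    SELECT o_orderkey, c_name
    FROM orders, customer
    WHERE o_custkey = c_custkey
    ORDER BY o_orderkey
"""

# Explicit join: the same query rewritten with JOIN ... ON.
explicit_sql = """
    SELECT o_orderkey, c_name
    FROM orders JOIN customer ON o_custkey = c_custkey
    ORDER BY o_orderkey
"""

implicit_rows = conn.execute(implicit_sql).fetchall()
explicit_rows = conn.execute(explicit_sql).fetchall()
print(implicit_rows == explicit_rows)
```

Both forms return the same rows; the rewrite changes only the syntax, not the result.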

SF 1000, 50 instances
[Chart: results at scale factor 1000 on 50 instances; chart annotation: "Cannot be run on 1TB"]

Simple Demo on EMR
● Uses the TPC-H data set, with tables split across storage systems
○ Lineitem table is stored on HDFS
○ Orders table is stored on PostgreSQL
○ Other tables are stored on S3
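
The idea behind this demo, one query over tables that live in different storage systems, can be sketched as below. It assumes a catalog that maps each table to a storage URI and a reader chosen by URI scheme. The URIs, the reader functions, and the rows are all hypothetical stand-ins (no real HDFS, PostgreSQL, or S3 access happens), and this is not Tajo's API.

```python
from urllib.parse import urlparse

# Hypothetical catalog: each table points at a different storage system.
CATALOG = {
    "lineitem": "hdfs://namenode:8020/tpch/lineitem",
    "orders": "postgresql://dbhost:5432/tpch",
    "customer": "s3://example-bucket/tpch/customer",
}


# Stand-in readers returning in-memory rows instead of touching real storage.
def read_hdfs(uri):
    return [
        {"l_orderkey": 1, "l_quantity": 5},
        {"l_orderkey": 1, "l_quantity": 3},
        {"l_orderkey": 2, "l_quantity": 7},
    ]


def read_postgresql(uri):
    return [
        {"o_orderkey": 1, "o_custkey": 100},
        {"o_orderkey": 2, "o_custkey": 200},
    ]


def read_s3(uri):
    return [
        {"c_custkey": 100, "c_name": "alice"},
        {"c_custkey": 200, "c_name": "bob"},
    ]


READERS = {"hdfs": read_hdfs, "postgresql": read_postgresql, "s3": read_s3}


def scan(table):
    """Look up the table's location and dispatch on the URI scheme."""
    uri = CATALOG[table]
    return READERS[urlparse(uri).scheme](uri)


def quantity_per_customer():
    """Join lineitem, orders, and customer regardless of where each lives."""
    custkey_by_order = {o["o_orderkey"]: o["o_custkey"] for o in scan("orders")}
    name_by_custkey = {c["c_custkey"]: c["c_name"] for c in scan("customer")}
    totals = {}
    for item in scan("lineitem"):
        name = name_by_custkey[custkey_by_order[item["l_orderkey"]]]
        totals[name] = totals.get(name, 0) + item["l_quantity"]
    return totals


totals = quantity_per_customer()
print(totals)
```

The join logic never mentions a storage system; only the catalog and the readers do, which is the property that lets one query span all three sources.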

Apache Tajo
● Is excellent for both long-running ETL jobs and exploratory ad-hoc analysis
● Is very fast
● Supports query federation on diverse data sources

Get Involved!
● We are recruiting contributors!
● General
○ http://tajo.apache.org/
● Getting Started
○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
○ http://tajo.apache.org/downloads.html
● Issue tracker
○ http://issues.apache.org/jira/browse/TAJO
● Join the mailing list
○ dev-subscribe@tajo.apache.org
○ issues-subscribe@tajo.apache.org

Useful Links
● EMR bootstrap
○ https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo
● How to set up Tajo on EMR
○ http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/
