SlideShare a Scribd company logo
Introduction to Apache Tajo:
Future of Data Warehouse
Jihoon Son / Gruter Inc.
I am
● Jihoon Son (@jihoonson)
○ Ph.D at Korea Univ.
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
○ Linkedin
■ https://www.linkedin.com/in/jihoonson
2
Today's Topic: Tajo
● What is Tajo?
○ Tajo / tάːzo / 타조
○ Ostrich in Korean
■ Fastest two-legged animal in
the world
3
Today's Topic: Tajo
● What is Apache Tajo?
○ Our Ostrich can do SQL
processing on big data!
■ SQL-on-Hadoop system
■ Apache Top-level project
4
Maybe You Think ...
5
SQL-on-Hadoop?
Boring..
This Ostrich is Different!
6
SQL-on-Hadoop Systems
7
SQL-on-Hadoop Systems
8
SQL-on-Hadoop Systems
9
Long-running
ETL jobs
Low-latency
interactive analysis
SQL-on-Hadoop Systems
10
● Requirements
○ Stable query execution
■ Fault-tolerance
● Can avoid query
resubmission
○ Adaptation to dynamic
environment
■ Available resources,
unpredictable delays, ...
Long-running
ETL jobs
SQL-on-Hadoop Systems
11
● Requirements
○ Fast query execution
■ Several query execution
techniques
■ In-memory processing Low-latency
interactive analysis
Tajo is designed for Both Workloads
12
Long-running
ETL jobs
Low-latency
interactive analysis
Who are using Tajo?
13
Use Cases: SK Telecom
● Data warehousing & analysis
○ 1st
telco in South Korea
■ 40 TB/day compressed data (2014)
14
ETLETLETL
Integration
Layer
Data
Warehouse
Operational
Systems
SK Telecom: Before Tajo
15
Marketing
Sales
ERP
SCM
ODS
Staging
Area
Data
Vault
Data Marts
Strategic
Marts
Hadoop MPP DBMS
ETLETLETL
Integration
Layer
Data
Warehouse
Operational
Systems
SK Telecom: After Tajo
16
Marketing
Sales
ERP
SCM
ODS
Staging
Area
Data
Vault
Data Marts
Strategic
Marts
ETLETLETL
Integration
Layer
Data
Warehouse
Operational
Systems
SK Telecom: After Tajo
17
Marketing
Sales
ERP
SCM
ODS
Staging
Area
Data
Vault
Data Marts
Strategic
Marts
● Long-running ETL jobs
● Ad-hoc analysis
Use Cases: SK Telecom
● Significantly reduced ETL & analysis time
○ Daily analysis becomes possible
○ More exploratory analysis is newly available
with remaining resources
18
Use Cases: Bluehole Studio
● Game log analysis
○ Finding principal
causes of service-
quality deficiencies
19
Use Cases: Bluehole Studio
● Tajo on EMR
20
Use Cases: Bluehole Studio
● Their first log analysis system
○ Easy and rapid deployment of Tajo
○ Low learning curve with SQL standard
● Immediate action becomes possible for
user complaints and hidden bugs
21
Use Cases: Melon
● Data discovery
○ Music streaming service (26 million users)
○ Analysis of purchase history for target
marketing
● Significantly reduced analysis time
○ Faster analysis by replacing Hive with Tajo
○ More analysis becomes possible
22
So, Why should you use Tajo?
23
So, Why should you use Tajo?
● Easy to use
24
So, Why should you use Tajo?
● Easy to use
○ ANSI-SQL standard compliance (2003)
■ CTAS, Window functions, ...
25
So, Why should you use Tajo?
● Easy to use
○ ANSI-SQL standard compliance (2003)
■ CTAS, Window functions, ...
○ Mature SQL features
■ Most existing queries can be executed without
modification
26
So, Why should you use Tajo?
● Easy to use
○ ANSI-SQL standard compliance (2003)
■ CTAS, Window functions, ...
○ Mature SQL features
■ Most existing queries can be executed without
modification
○ Various data format support
■ Text, JSON, Orc, Parquet, …
27
So, Why should you use Tajo?
● Optimized performance
28
So, Why should you use Tajo?
● Optimized performance
○ Optimized code
■ Optimized I/O performance
● Nearly max I/O performance (~120MB/s) per disk
■ Off-heap data processing
● Mitigating GC overhead
29
So, Why should you use Tajo?
● Optimized performance
○ Cost-based query plan optimization
■ Join ordering
■ Best algorithm selection
● According to input size
■ Progressive optimization
● Further optimize the query plan during query execution
● Especially excellent for long running queries
■ => Efficient start schema processing
30
So, Why should you use Tajo?
● Various storage type support
31
So, Why should you use Tajo?
● Various storage type support
32
Logical Data Warehouse with Tajo
33
Global view
Application DBMS NoSQL
Cloud
storage
On-premise
storage
Logical Data Warehouse with Tajo
34
Global view
Application DBMS NoSQL
Cloud
storage
On-premise
storage
● Fast delivery
● Easy maintenance
● Simple data flow
How fast is Tajo?
35
Evaluation on Cloud Environment
● Google Cloud Platform
○ Instance type: n1-standard-8
■ 8 core, 30GB RAM
36
Target Systems
● Hive (0.12)
○ Baseline performance
○ Default configuration provided by GCP
■ Use the whole cpu and memory
● Tajo (0.11.0)
○ Default configuration provided by GCP
■ Use the whole cpu and memory
37
Target Systems
● Spark-SQL (1.5.0)
○ Default configuration provided by GCP
■ Use the whole cpu and memory
■ Tungsten enabled by default
○ spark.sql.shuffle.partitions is
adjusted for better performance
38
TPC-DS
● Data
○ 24 tables
■ Plain text format
■ Stored on Google Cloud Storage
● Query
○ Which can be executed on every system
without modifications
■ For Hive, 0.12 doesn't support implicit join, so
every query had to be changed
39
SF 1000, 50 instances
40
SF 1000, 50 instances
41
SF 1000, 50 instances
42
Cannot be run
on 1TB
SF 10000, 50 instances
43
SF 10000, 50 instances
44
Demo
45
Simple Demo on EMR
46
● Using TPC-H data set, but
○ Lineitem table is stored on HDFS
○ Orders table is stored on PostgreSQL
○ Other tables are stored on S3
Apache Tajo
● Is excellent for both long-running ETL jobs
and exploratory ad-hoc analysis
● Is very fast
● Supports query federation on diverse data
sources
47
Get Involved!
● We are recruiting contributors!
● General
○ http://tajo.apache.org/
● Getting Started
○ http://tajo.apache.org/docs/current/getting_started.html
● Downloads
○ http://tajo.apache.org/downloads.html
● Issue tracker
○ http://issues.apache.org/jira/browse/TAJO
● Join the mailing list
○ dev-subscribe@tajo.apache.org
○ issues-subscribe@tajo.apache.org
48
Useful Links
49
● EMR bootstrap
○ https://github.com/awslabs/emr-bootstrap-
actions/tree/master/tajo
● How to setup Tajo on EMR
○ http://www.gruter.com/blog/setting-up-a-
tajo-cluster-on-amazon-emr/
Q & A
50

More Related Content

What's hot

Data Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLData Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQL
EDB
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
Ryan Blue
 
Enabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioEnabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with Alluxio
Alluxio, Inc.
 
IOT with PostgreSQL
IOT with PostgreSQLIOT with PostgreSQL
IOT with PostgreSQL
EDB
 
PostgreSQL and Sphinx pgcon 2013
PostgreSQL and Sphinx   pgcon 2013PostgreSQL and Sphinx   pgcon 2013
PostgreSQL and Sphinx pgcon 2013
Emanuel Calvo
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
Denis Shestakov
 
RubiX
RubiXRubiX
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
Athens Big Data
 
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
Amazon Web Services
 
Performance Tuning Cheat Sheet for MongoDB
Performance Tuning Cheat Sheet for MongoDBPerformance Tuning Cheat Sheet for MongoDB
Performance Tuning Cheat Sheet for MongoDB
Severalnines
 
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group MeetupPresto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Wojciech Biela
 
Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016
Markus Höfer
 
Performance Tuning and Optimization
Performance Tuning and OptimizationPerformance Tuning and Optimization
Performance Tuning and Optimization
MongoDB
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
HBaseCon
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloud
Avinash Ramineni
 
Advanced MySql Data-at-Rest Encryption in Percona Server
Advanced MySql Data-at-Rest Encryption in Percona ServerAdvanced MySql Data-at-Rest Encryption in Percona Server
Advanced MySql Data-at-Rest Encryption in Percona Server
Severalnines
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
Pramit Choudhary
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management framework
Julian Hyde
 
Caching in
Caching inCaching in
Caching in
RichardWarburton
 

What's hot (20)

Data Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQLData Analysis with TensorFlow in PostgreSQL
Data Analysis with TensorFlow in PostgreSQL
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
Enabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioEnabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with Alluxio
 
IOT with PostgreSQL
IOT with PostgreSQLIOT with PostgreSQL
IOT with PostgreSQL
 
PostgreSQL and Sphinx pgcon 2013
PostgreSQL and Sphinx   pgcon 2013PostgreSQL and Sphinx   pgcon 2013
PostgreSQL and Sphinx pgcon 2013
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
 
RubiX
RubiXRubiX
RubiX
 
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
 
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
23 October 2013 - AWS 201 - A Walk through the AWS Cloud: Introduction to Ama...
 
Performance Tuning Cheat Sheet for MongoDB
Performance Tuning Cheat Sheet for MongoDBPerformance Tuning Cheat Sheet for MongoDB
Performance Tuning Cheat Sheet for MongoDB
 
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group MeetupPresto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
Presto: SQL-on-Anything. Netherlands Hadoop User Group Meetup
 
Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016
 
Performance Tuning and Optimization
Performance Tuning and OptimizationPerformance Tuning and Optimization
Performance Tuning and Optimization
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloud
 
Advanced MySql Data-at-Rest Encryption in Percona Server
Advanced MySql Data-at-Rest Encryption in Percona ServerAdvanced MySql Data-at-Rest Encryption in Percona Server
Advanced MySql Data-at-Rest Encryption in Percona Server
 
Need for Time series Database
Need for Time series DatabaseNeed for Time series Database
Need for Time series Database
 
Optiq: A dynamic data management framework
Optiq: A dynamic data management frameworkOptiq: A dynamic data management framework
Optiq: A dynamic data management framework
 
Caching in
Caching inCaching in
Caching in
 

Similar to Introduction to Apache Tajo: Future of Data Warehouse

Procella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeProcella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at Youtube
DataWorks Summit
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
Hakan Ilter
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...
Omid Vahdaty
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
Databricks
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
Mukesh Singh
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
felixbarny
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Jonathan Singer
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
PingCAP
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
Zhenxiao Luo
 
Logs @ OVHcloud
Logs @ OVHcloudLogs @ OVHcloud
Logs @ OVHcloud
OVHcloud
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
Itai Yaffe
 
Presto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraPresto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@Myntra
Shubham Tagra
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheap
Marc Cluet
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
C4Media
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Omid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
High performance json- postgre sql vs. mongodb
High performance json- postgre sql vs. mongodbHigh performance json- postgre sql vs. mongodb
High performance json- postgre sql vs. mongodb
Wei Shan Ang
 

Similar to Introduction to Apache Tajo: Future of Data Warehouse (20)

Procella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeProcella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at Youtube
 
TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
 
Logs @ OVHcloud
Logs @ OVHcloudLogs @ OVHcloud
Logs @ OVHcloud
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
 
Presto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraPresto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@Myntra
 
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json  postgre-sql vs. mongodbPGConf APAC 2018 - High performance json  postgre-sql vs. mongodb
PGConf APAC 2018 - High performance json postgre-sql vs. mongodb
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
Scalable, good, cheap
Scalable, good, cheapScalable, good, cheap
Scalable, good, cheap
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
High performance json- postgre sql vs. mongodb
High performance json- postgre sql vs. mongodbHigh performance json- postgre sql vs. mongodb
High performance json- postgre sql vs. mongodb
 

Recently uploaded

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 

Recently uploaded (20)

JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 

Introduction to Apache Tajo: Future of Data Warehouse

  • 1. Introduction to Apache Tajo: Future of Data Warehouse Jihoon Son / Gruter Inc.
  • 2. I am ● Jihoon Son (@jihoonson) ○ Ph.D at Korea Univ. ○ Tajo project co-founder ○ Committer and PMC member of Apache Tajo ○ Research engineer at Gruter ○ Linkedin ■ https://www.linkedin.com/in/jihoonson 2
  • 3. Today's Topic: Tajo ● What is Tajo? ○ Tajo / tάːzo / 타조 ○ Ostrich in Korean ■ Fastest two-legged animal in the world 3
  • 4. Today's Topic: Tajo ● What is Apache Tajo? ○ Our Ostrich can do SQL processing on big data! ■ SQL-on-Hadoop system ■ Apache Top-level project 4
  • 5. Maybe You Think ... 5 SQL-on-Hadoop? Boring..
  • 6. This Ostrich is Different! 6
  • 10. SQL-on-Hadoop Systems 10 ● Requirements ○ Stable query execution ■ Fault-tolerance ● Can avoid query resubmission ○ Adaptation to dynamic environment ■ Available resources, unpredictable delays, ... Long-running ETL jobs
  • 11. SQL-on-Hadoop Systems 11 ● Requirements ○ Fast query execution ■ Several query execution techniques ■ In-memory processing Low-latency interactive analysis
  • 12. Tajo is designed for Both Workloads 12 Long-running ETL jobs Low-latency interactive analysis
  • 13. Who are using Tajo? 13
  • 14. Use Cases: SK Telecom ● Data warehousing & analysis ○ 1st telco in South Korea ■ 40 TB/day compressed data (2014) 14
  • 15. ETLETLETL Integration Layer Data Warehouse Operational Systems SK Telecom: Before Tajo 15 Marketing Sales ERP SCM ODS Staging Area Data Vault Data Marts Strategic Marts Hadoop MPP DBMS
  • 16. ETLETLETL Integration Layer Data Warehouse Operational Systems SK Telecom: After Tajo 16 Marketing Sales ERP SCM ODS Staging Area Data Vault Data Marts Strategic Marts
  • 17. ETLETLETL Integration Layer Data Warehouse Operational Systems SK Telecom: After Tajo 17 Marketing Sales ERP SCM ODS Staging Area Data Vault Data Marts Strategic Marts ● Long-running ETL jobs ● Ad-hoc analysis
  • 18. Use Cases: SK Telecom ● Significantly reduced ETL & analysis time ○ Daily analysis becomes possible ○ More exploratory analysis is newly available with remaining resources 18
  • 19. Use Cases: Bluehole Studio ● Game log analysis ○ Finding principal causes of service- quality deficiencies 19
  • 20. Use Cases: Bluehole Studio ● Tajo on EMR 20
  • 21. Use Cases: Bluehole Studio ● Their first log analysis system ○ Easy and rapid deployment of Tajo ○ Low learning curve with SQL standard ● Immediate action becomes possible for user complaints and hidden bugs 21
  • 22. Use Cases: Melon ● Data discovery ○ Music streaming service (26 million users) ○ Analysis of purchase history for target marketing ● Significantly reduced analysis time ○ Faster analysis by replacing Hive with Tajo ○ More analysis becomes possible 22
  • 23. So, Why should you use Tajo? 23
  • 24. So, Why should you use Tajo? ● Easy to use 24
  • 25. So, Why should you use Tajo? ● Easy to use ○ ANSI-SQL standard compliance (2003) ■ CTAS, Window functions, ... 25
  • 26. So, Why should you use Tajo? ● Easy to use ○ ANSI-SQL standard compliance (2003) ■ CTAS, Window functions, ... ○ Mature SQL features ■ Most existing queries can be executed without modification 26
  • 27. So, Why should you use Tajo? ● Easy to use ○ ANSI-SQL standard compliance (2003) ■ CTAS, Window functions, ... ○ Mature SQL features ■ Most existing queries can be executed without modification ○ Various data format support ■ Text, JSON, Orc, Parquet, … 27
  • 28. So, Why should you use Tajo? ● Optimized performance 28
  • 29. So, Why should you use Tajo? ● Optimized performance ○ Optimized code ■ Optimized I/O performance ● Nearly max I/O performance (~120MB/s) per disk ■ Off-heap data processing ● Mitigating GC overhead 29
  • 30. So, Why should you use Tajo? ● Optimized performance ○ Cost-based query plan optimization ■ Join ordering ■ Best algorithm selection ● According to input size ■ Progressive optimization ● Further optimize the query plan during query execution ● Especially excellent for long running queries ■ => Efficient start schema processing 30
  • 31. So, Why should you use Tajo? ● Various storage type support 31
  • 32. So, Why should you use Tajo? ● Various storage type support 32
  • 33. Logical Data Warehouse with Tajo 33 Global view Application DBMS NoSQL Cloud storage On-premise storage
  • 34. Logical Data Warehouse with Tajo 34 Global view Application DBMS NoSQL Cloud storage On-premise storage ● Fast delivery ● Easy maintenance ● Simple data flow
  • 35. How fast is Tajo? 35
  • 36. Evaluation on Cloud Environment ● Google Cloud Platform ○ Instance type: n1-standard-8 ■ 8 core, 30GB RAM 36
  • 37. Target Systems ● Hive (0.12) ○ Baseline performance ○ Default configuration provided by GCP ■ Use the whole cpu and memory ● Tajo (0.11.0) ○ Default configuration provided by GCP ■ Use the whole cpu and memory 37
  • 38. Target Systems ● Spark-SQL (1.5.0) ○ Default configuration provided by GCP ■ Use the whole cpu and memory ■ Tungsten enabled by default ○ spark.sql.shuffle.partitions is adjusted for better performance 38
  • 39. TPC-DS ● Data ○ 24 tables ■ Plain text format ■ Stored on Google Cloud Storage ● Query ○ Which can be executed on every system without modifications ■ For Hive, 0.12 doesn't support implicit join, so every query had to be changed 39
  • 40. SF 1000, 50 instances 40
  • 41. SF 1000, 50 instances 41
  • 42. SF 1000, 50 instances 42 Cannot be run on 1TB
  • 43. SF 10000, 50 instances 43
  • 44. SF 10000, 50 instances 44
  • 46. Simple Demo on EMR 46 ● Using TPC-H data set, but ○ Lineitem table is stored on HDFS ○ Orders table is stored on PostgreSQL ○ Other tables are stored on S3
  • 47. Apache Tajo ● Is excellent for both long-running ETL jobs and exploratory ad-hoc analysis ● Is very fast ● Supports query federation on diverse data sources 47
  • 48. Get Involved! ● We are recruiting contributors! ● General ○ http://tajo.apache.org/ ● Getting Started ○ http://tajo.apache.org/docs/current/getting_started.html ● Downloads ○ http://tajo.apache.org/downloads.html ● Issue tracker ○ http://issues.apache.org/jira/browse/TAJO ● Join the mailing list ○ dev-subscribe@tajo.apache.org ○ issues-subscribe@tajo.apache.org 48
  • 49. Useful Links 49 ● EMR bootstrap ○ https://github.com/awslabs/emr-bootstrap- actions/tree/master/tajo ● How to setup Tajo on EMR ○ http://www.gruter.com/blog/setting-up-a- tajo-cluster-on-amazon-emr/