SlideShare a Scribd company logo
1 of 46
Download to read offline
Introduction to Apache Tajo:
Data Warehouse for Big Data
Jihoon Son / Gruter inc.
About Me
● Jihoon Son (@jihoonson)
○ Tajo project co-founder
○ Committer and PMC member of Apache Tajo
○ Research engineer at Gruter
2
Outline
● About Tajo
● Features of the Recent Release
● Demo
● Roadmap
3
What is Tajo?
● Tajo / tάːzo / 타조
○ An ostrich in Korean
○ The world's fastest two-legged animal
4
What is Tajo?
● Apache Top-level Project
○ Big data warehouse system
■ ANSI-SQL compliant
■ Mature SQL features
● Various types of join, window functions
○ Rapid query execution with own distributed DAG engine
■ Low latency, and long running batch queries with a single
system
■ Fault-tolerance
○ Beyond SQL-on-Hadoop
■ Support various types of storage
5
Tajo Master
Catalog Server
Tajo Master
Catalog Server
Architecture Overview
DBMS
HCatalog
Tajo Master
Catalog Server
Tajo Worker
Query Master
Query Executor
Storage Service
Tajo Worker
Query Master
Query Executor
Storage Service
Tajo Worker
Query Master
Query Executor
Storage Service
JDBC client
TSQLWebUI
REST API
Storage
Submit
a query
Manage
metadataAllocate
a query
Send tasks
& monitor
Send tasks
& monitor
6
Who are Using Tajo?
● Use cases: replacement of commercial DW
○ 1st
telco in South Korea
■ Replacement of long-running ETL workloads on several
TB datasets
■ Lots of daily reports about user behavior
■ Ad­‐hoc analysis on TB datasets
○ Benefits
■ Simplified architecture for data analysis
● An unified system for DW ETL, OLAP, and Hadoop ETL
■ Much less cost, more data analysis within same SLA
● Saved license fee of commercial DW
7
Who are Using Tajo?
● Use cases: data discovery
○ Music streaming service (26 million users)
■ Analysis of purchase history for target marketing
○ Benefits
■ Interactive query on large datasets
■ Data analysis with familiar BI tools
8
Recent Release: 0.11
● Feature highlights
○ Query federation
○ JDBC-based storage support
○ Self-describing data formats support
○ Multi-query support
○ More stable and efficient join execution
○ Index support
○ Python UDF/UDAF support
9
Recent Release: 0.11
● Today's topic
○ Query federation
○ JDBC-based storage support
○ Self-describing data formats support
○ Multi-query support
○ More stable and efficient join execution
○ Index support
○ Python UDF/UDAF support
10
Query Federation with Tajo
11
● Your data might be spread on multiple heterogeneous
sites
○ Cloud, DBMS, Hadoop, NoSQL, …
Your Data
DBMS
Application
Cloud storage
On-premise
storage
NoSQL
12
● Even in a single site, your data might be stored in
different data formats
Your Data
JSONCSV Parquet ORC Log
...
13
Your Data
● How to analyze distributed data?
○ Traditionally ...
DBMSApplication
Cloud storage
On-premise
storage
NoSQL
Global view
ETL transform
● Long delivery
● Complex data flow
● Human-intensive
14
● Query federation
Your Data with Tajo
DBMSApplication
Cloud storage
On-premise
storage
NoSQL
Global view
● Fast delivery
● Easy maintenance
● Simple data flow
15
Storage and Data Format Support
Data
formats
Storage
types
16
> CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH
('text.delimiter'='|') LOCATION 'hdfs://localhost:8020/archive1';
> CREATE EXTERNAL TABLE user (user_id BIGINT, ...) USING orc WITH
('orc.compression.kind'='snappy') LOCATION 's3://user';
> CREATE EXTERNAL TABLE table1 (key TEXT, ...) USING hbase LOCATION
'hbase:zk://localhost:2181/uptodate';
> ...
Create Table
Data
formatStorage
URI
17
Create Table
> CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH ('text.
delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compress.
SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:
8020/tajo/warehouse/archive1';
> CREATE EXTERNAL TABLE archive2 (id BIGINT, ...) USING text WITH ('text.
delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compress.
SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:
8020/tajo/warehouse/archive2';
> CREATE EXTERNAL TABLE archive3 (id BIGINT, ...) USING text WITH ('text.
delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compress.
SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost:
8020/tajo/warehouse/archive3';
> ...
18
Too tedious!
Introduction to Tablespace
● Tablespace
○ Registered storage space
○ A tablespace is identified by an unique URI
○ Configurations and policies are shared by all tables in a
tablespace
■ Storage type
■ Default data format and supported data formats
○ It allows users to reuse registered storage
configurations and policies
19
Tablespaces, Databases, and Tables
Namespace
Storage1
Storage2
...
...
...
Tablespace1
Tablespace2
Tablespace3
Physical space
Table1
Table2
Table3
Database1
Database1
...
20
{
"spaces" : {
"warehouse" : {
"uri" : "hdfs://localhost:8020/tajo/warehouse",
"configs" : [
{'text.delimiter'='|'},
{'text.null'='N'},
{'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'},
{'timezone'='UTC+9'},
{'text.skip.headerlines'='2'}
]
},
"hbase1" : {
"uri" : "hbase:zk://localhost:2181/table1"
}
}
}
Tablespace Configuration
Tablespace name
Tablespace URI
21
Create Table
> CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse;
Tablespace
name
Data format is omitted. Default data format is TEXT.
"warehouse" : {
"uri" : "hdfs://localhost:8020/tajo/warehouse",
"configs" : [
{'text.delimiter'='|'},
{'text.null'='N'},
{'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'},
{'timezone'='UTC+9'},
{'text.skip.headerlines'='2'}
]
},
22
Create Table
> CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE archive2 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE archive3 (id BIGINT, ...) TABLESPACE warehouse;
> CREATE TABLE user (user_id BIGINT, ...) TABLESPACE aws USING orc
WITH ('orc.compression.kind'='snappy');
> CREATE TABLE table1 (key TEXT, ...) TABLESPACE hbase1;
> ...
23
HDFS HBase
Tajo Worker
Query
Engine
Storage
ServiceHDFS
handler
Tajo Worker
Query
Engine
Storage
ServiceHDFS
handler
Tajo Worker
Query
Engine
Storage
ServiceHBase
handler
Querying on Different Data Silos
● How does a worker access different data sources?
○ Storage service
■ Return a proper handler for underlying storage
> SELECT ... FROM hdfs_table, hbase_table, ...
24
JDBC-based Storage Support
25
jdbc_db1 tajo_db1
JDBC-based Storage
● Storage providing the JDBC interface
○ PostgreSQL, MySQL, MariaDB, ...
● Databases of JDBC-based storage are mapped to Tajo
databases
Table1
Table2
Table3
Table1
Table2
Table3
tajo_db2
Table1
Table2
Table3
…
jdbc_db2
Table1
Table2
Table3
…
JDBC-based storage Tajo
26
Tablespace Configuration
{
"spaces": {
"pgsql_db1": {
"uri": "jdbc:postgresql://hostname:port/db1"
"configs": {
"mapped_database": "tajo_db1"
"connection_properties": {
"user": "tajo",
"password": "xxxx"
}
}
}
}
}
PostgreSQL
database name
Tajo
database name
Tablespace name
27
Return to Query Federation
● How to correlate data on JDBC-based storage and
others?
○ Need to have a global view of metadata across different
storage types
■ Tajo also has its own metadata for its data
■ Each JDBC-based storage has own metadata for its data
■ Each NoSQL storage has metadata for its data
■ …
28
● Federating metadata of underlying storage
Metadata Federation
DBMS metadata provider NoSQL metadata provider
Linked Metadata Manager
DBMS HCatalog
Tajo catalog metadata provider
Catalog Interface
● Tablespace
● Database
● Tables
● Schema names
...
29
Querying on JDBC-based Storage
● A plan is converted into a SQL string
● Query generation
○ Diverse SQL syntax of different types of storage
○ Different SQL builder for each storage type
Tajo Master Tajo Worker JDBC-based
storage
SELECT ...
Query plan
SELECT ...
30
Operation Push Down
● Tajo can exploit the processing capability of underlying
storage
○ DBMSs, MongoDB, HBase, …
● Operations are pushed down into underlying storage
○ Leveraging the advanced features provided by
underlying storage
■ Ex) DBMSs' query optimization, index, ...
31
Example 1
SELECT
count(*)
FROM
account ac, archive ar
WHERE
ac.key = ar.id and
ac.name = 'tajo'
account
DBMS
archive
HDFS
scan archive
scan account
ac.name = 'tajo'
join
ac.key = ar.id
group by
count(*)
group by
count(*)
Full scan Result only
Push
operation
32
Example 2
SELECT
ac.name, count(*)
FROM
account ac
GROUP BY
ac.name
account
DBMS
scan account
group by
count(*)
Result only
Push
operation
33
Self-describing Data Formats
Support
34
Self-describing Data Formats
● Some data formats include schema information as well
as data
○ JSON, ORC, Parquet, …
● Tajo 0.11 natively supports self-describing data
formats
○ Since they already have schema information, Tajo
doesn't need to store it aside
○ Instead, Tajo can infer the schema at query execution
time
35
Create Table with Nested Data Format
{ "title" : "Hand of the King", "name" : { "first_name": "Eddard", "last_name": "Stark"}}
{ "title" : "Assassin", "name" : { "first_name": "Arya", "last_name": "Stark"}}
{ "title" : "Dancing Master", "name" : { "first_name": "Syrio", "last_name": "Forel"}}
> CREATE EXTERNAL TABLE schemaful_table (
title TEXT,
name RECORD (
first_name TEXT,
last_name TEXT
)
) USING json LOCATION 'hdfs:///json_table';
Nested type
36
How about This Data?
{"id":"2937257761","type":"ForkEvent","actor":{"id":1088854,"login":"CAOakleyII","gravatar_id":"","url":"https://api.github.com/users/CAOakleyII","avatar_url":"https://avatars.githubusercontent.
com/u/1088854?"},"repo":{"id":11909954,"name":"skycocker/chromebrew","url":"https://api.github.com/repos/skycocker/chromebrew"},"payload":{"forkee":{"id":38339291,"name":"chromebrew","
full_name":"CAOakleyII/chromebrew","owner":{"login":"CAOakleyII","id":1088854,"avatar_url":"https://avatars.githubusercontent.com/u/1088854?v=3","gravatar_id":"","url":"https://api.github.
com/users/CAOakleyII","html_url":"https://github.com/CAOakleyII","followers_url":"https://api.github.com/users/CAOakleyII/followers","following_url":"https://api.github.
com/users/CAOakleyII/following{/other_user}","gists_url":"https://api.github.com/users/CAOakleyII/gists{/gist_id}","starred_url":"https://api.github.com/users/CAOakleyII/starred{/owner}{/repo}","
subscriptions_url":"https://api.github.com/users/CAOakleyII/subscriptions","organizations_url":"https://api.github.com/users/CAOakleyII/orgs","repos_url":"https://api.github.
com/users/CAOakleyII/repos","events_url":"https://api.github.com/users/CAOakleyII/events{/privacy}","received_events_url":"https://api.github.com/users/CAOakleyII/received_events","type":"
User","site_admin":false},"private":false,"html_url":"https://github.com/CAOakleyII/chromebrew","description":"Package manager for Chrome OS","fork":true,"url":"https://api.github.
com/repos/CAOakleyII/chromebrew","forks_url":"https://api.github.com/repos/CAOakleyII/chromebrew/forks","keys_url":"https://api.github.com/repos/CAOakleyII/chromebrew/keys{/key_id}","
collaborators_url":"https://api.github.com/repos/CAOakleyII/chromebrew/collaborators{/collaborator}","teams_url":"https://api.github.com/repos/CAOakleyII/chromebrew/teams","hooks_url":"https:
//api.github.com/repos/CAOakleyII/chromebrew/hooks","issue_events_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues/events{/number}","events_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/events","assignees_url":"https://api.github.com/repos/CAOakleyII/chromebrew/assignees{/user}","branches_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/branches{/branch}","tags_url":"https://api.github.com/repos/CAOakleyII/chromebrew/tags","blobs_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/git/blobs{/sha}","git_tags_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/tags{/sha}","git_refs_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/git/refs{/sha}","trees_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/trees{/sha}","statuses_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/statuses/{sha}","languages_url":"https://api.github.com/repos/CAOakleyII/chromebrew/languages","stargazers_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/stargazers","contributors_url":"https://api.github.com/repos/CAOakleyII/chromebrew/contributors","subscribers_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/subscribers","subscription_url":"https://api.github.com/repos/CAOakleyII/chromebrew/subscription","commits_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/commits{/sha}","git_commits_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/commits{/sha}","comments_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/comments{/number}","issue_comment_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues/comments{/number}","contents_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/contents/{+path}","compare_url":"https://api.github.com/repos/CAOakleyII/chromebrew/compare/{base}...{head}","merges_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/merges","archive_url":"https://api.github.com/repos/CAOakleyII/chromebrew/{archive_format}{/ref}","downloads_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/downloads","issues_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues{/number}","pulls_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/pulls{/number}","milestones_url":"https://api.github.com/repos/CAOakleyII/chromebrew/milestones{/number}","notifications_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/notifications{?since,all,participating}","labels_url":"https://api.github.com/repos/CAOakleyII/chromebrew/labels{/name}","releases_url":"https://api.github.
com/repos/CAOakleyII/chromebrew/releases{/id}","created_at":"2015-07-01T00:00:00Z","updated_at":"2015-06-28T10:11:09Z","pushed_at":"2015-06-09T07:46:57Z","git_url":"git://github.
com/CAOakleyII/chromebrew.git","ssh_url":"git@github.com:CAOakleyII/chromebrew.git","clone_url":"https://github.com/CAOakleyII/chromebrew.git","svn_url":"https://github.
com/CAOakleyII/chromebrew","homepage":"http://skycocker.github.io/chromebrew/","size":846,"stargazers_count":0,"watchers_count":0,"language":null,"has_issues":false,"has_downloads":true,"
has_wiki":true,"has_pages":false,"forks_count":0,"mirror_url":null,"open_issues_count":0,"forks":0,"open_issues":0,"watchers":0,"default_branch":"master","public":true}},"public":true,"created_at":"
2015-07-01T00:00:01Z"}
...
37
Create Schemaless Table
> CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION
'hdfs:///json_table';
That's all!
Allow any schema
38
Schema-free Query Execution
> CREATE EXTERNAL TABLE schemaful_table (id BIGINT, name TEXT, ...)
USING text LOCATION 'hdfs:///csv_table;
> CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION
'hdfs:///json_table';
> SELECT name.first_name, name.last_name from schemaless_table;
> SELECT title, count(*) FROM schemaful_table, schemaless_table WHERE
name = name.last_name GROUP BY title;
39
Schema Inference
● Table schema is inferred at query time
● Example
SELECT
a, b.b1, b.b2.c1
FROM
t;
(
a text,
b record (
b1 text,
b2 record (
c1 text
)
)
)
Inferred schemaQuery
40
Demo
41
Demo with Command line
42
Roadmap
43
Roadmap
● 0.12
○ Improved Yarn integration
○ Authentication support
○ JavaScript stored procedure support
○ Scalar subquery support
○ Hive UDF support
44
Roadmap
● Next generation (beyond 0.12)
○ Exploiting modern hardware
○ Approximate query processing
○ Genetic query optimization
○ And more …
45
tajo> select question from you;
46

More Related Content

What's hot

Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...Ontico
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloudAvinash Ramineni
 
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...Athens Big Data
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseJimmy Angelakos
 
Tachyon meetup slides.
Tachyon meetup slides.Tachyon meetup slides.
Tachyon meetup slides.David Groozman
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Alluxio, Inc.
 
HBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at BoxHBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at BoxCloudera, Inc.
 
Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016Markus Höfer
 
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien RouhaudPGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien RouhaudEqunix Business Solutions
 
Performance Tuning and Optimization
Performance Tuning and OptimizationPerformance Tuning and Optimization
Performance Tuning and OptimizationMongoDB
 
An introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAn introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAmir Sedighi
 
Intro to Apache Hadoop
Intro to Apache HadoopIntro to Apache Hadoop
Intro to Apache HadoopSufi Nawaz
 
ScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedJ On The Beach
 
Life as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan RossiLife as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan RossiGluster.org
 
Optimizing columnar stores
Optimizing columnar storesOptimizing columnar stores
Optimizing columnar storesIstvan Szukacs
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit
 
Monitoring MySQL with OpenTSDB
Monitoring MySQL with OpenTSDBMonitoring MySQL with OpenTSDB
Monitoring MySQL with OpenTSDBGeoffrey Anderson
 

What's hot (19)

Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...Understanding and tuning WiredTiger, the new high performance database engine...
Understanding and tuning WiredTiger, the new high performance database engine...
 
Effectively deploying hadoop to the cloud
Effectively  deploying hadoop to the cloudEffectively  deploying hadoop to the cloud
Effectively deploying hadoop to the cloud
 
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
Tachyon meetup slides.
Tachyon meetup slides.Tachyon meetup slides.
Tachyon meetup slides.
 
Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...Scalable and High available Distributed File System Metadata Service Using gR...
Scalable and High available Distributed File System Metadata Service Using gR...
 
HBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at BoxHBaseCon 2013: OpenTSDB at Box
HBaseCon 2013: OpenTSDB at Box
 
Caching in
Caching inCaching in
Caching in
 
Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016Bucket your partitions wisely - Cassandra summit 2016
Bucket your partitions wisely - Cassandra summit 2016
 
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien RouhaudPGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
PGConf.ASIA 2019 Bali - Performance Analysis at Full Power - Julien Rouhaud
 
Performance Tuning and Optimization
Performance Tuning and OptimizationPerformance Tuning and Optimization
Performance Tuning and Optimization
 
An introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAn introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoop
 
Intro to Apache Hadoop
Intro to Apache HadoopIntro to Apache Hadoop
Intro to Apache Hadoop
 
ScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous SpeedScyllaDB: NoSQL at Ludicrous Speed
ScyllaDB: NoSQL at Ludicrous Speed
 
Life as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan RossiLife as a GlusterFS Consultant with Ivan Rossi
Life as a GlusterFS Consultant with Ivan Rossi
 
Optimizing columnar stores
Optimizing columnar storesOptimizing columnar stores
Optimizing columnar stores
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
Glusterfs and Hadoop
Glusterfs and HadoopGlusterfs and Hadoop
Glusterfs and Hadoop
 
Monitoring MySQL with OpenTSDB
Monitoring MySQL with OpenTSDBMonitoring MySQL with OpenTSDB
Monitoring MySQL with OpenTSDB
 

Viewers also liked

SQL-on-Hadoop with Apache Tajo, and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo,  and application case of SK TelecomSQL-on-Hadoop with Apache Tajo,  and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo, and application case of SK TelecomGruter
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks
 
Capital One's Next Generation Decision in less than 2 ms
Capital One's Next Generation Decision in less than 2 msCapital One's Next Generation Decision in less than 2 ms
Capital One's Next Generation Decision in less than 2 msApache Apex
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiManish Gupta
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesIsheeta Sanghi
 
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks
 

Viewers also liked (6)

SQL-on-Hadoop with Apache Tajo, and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo,  and application case of SK TelecomSQL-on-Hadoop with Apache Tajo,  and application case of SK Telecom
SQL-on-Hadoop with Apache Tajo, and application case of SK Telecom
 
Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1Hortonworks Data in Motion Webinar Series - Part 1
Hortonworks Data in Motion Webinar Series - Part 1
 
Capital One's Next Generation Decision in less than 2 ms
Capital One's Next Generation Decision in less than 2 msCapital One's Next Generation Decision in less than 2 ms
Capital One's Next Generation Decision in less than 2 ms
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup Slides
 
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
 

Similar to Introduction to Apache Tajo: Data Warehouse for Big Data

Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataGruter
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyGuillaume Lefranc
 
Restlet: Building a multi-tenant API PaaS with DataStax Enterprise Search
Restlet: Building a multi-tenant API PaaS with DataStax Enterprise SearchRestlet: Building a multi-tenant API PaaS with DataStax Enterprise Search
Restlet: Building a multi-tenant API PaaS with DataStax Enterprise SearchDataStax Academy
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
Cassandra Summit 2015 - Building a multi-tenant API PaaS with DataStax Enterp...
Cassandra Summit 2015 - Building a multi-tenant API PaaS with DataStax Enterp...Cassandra Summit 2015 - Building a multi-tenant API PaaS with DataStax Enterp...
Cassandra Summit 2015 - Building a multi-tenant API PaaS with DataStax Enterp...Restlet
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Introducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLIntroducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLMariaDB plc
 
Build an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsBuild an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsShawn Zhu
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseGruter
 
Tips to drive maria db cluster performance for nextcloud
Tips to drive maria db cluster performance for nextcloudTips to drive maria db cluster performance for nextcloud
Tips to drive maria db cluster performance for nextcloudSeveralnines
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simpleDori Waldman
 
Scaling up and accelerating Drupal 8 with NoSQL
Scaling up and accelerating Drupal 8 with NoSQLScaling up and accelerating Drupal 8 with NoSQL
Scaling up and accelerating Drupal 8 with NoSQLOSInet
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudZhenxiao Luo
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerMichael Spector
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django applicationbangaloredjangousergroup
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Junping Du
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Updating materialized views and caches using kafka
Updating materialized views and caches using kafkaUpdating materialized views and caches using kafka
Updating materialized views and caches using kafkaZach Cox
 

Similar to Introduction to Apache Tajo: Data Warehouse for Big Data (20)

Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Introducing Datawave
Introducing DatawaveIntroducing Datawave
Introducing Datawave
 
NoSQL Solutions - a comparative study
NoSQL Solutions - a comparative studyNoSQL Solutions - a comparative study
NoSQL Solutions - a comparative study
 
Restlet: Building a multi-tenant API PaaS with DataStax Enterprise Search
Restlet: Building a multi-tenant API PaaS with DataStax Enterprise SearchRestlet: Building a multi-tenant API PaaS with DataStax Enterprise Search
Restlet: Building a multi-tenant API PaaS with DataStax Enterprise Search
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Cassandra Summit 2015 - Building a multi-tenant API PaaS with DataStax Enterp...
Cassandra Summit 2015 - Building a multi-tenant API PaaS with DataStax Enterp...Cassandra Summit 2015 - Building a multi-tenant API PaaS with DataStax Enterp...
Cassandra Summit 2015 - Building a multi-tenant API PaaS with DataStax Enterp...
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Introducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQLIntroducing the ultimate MariaDB cloud, SkySQL
Introducing the ultimate MariaDB cloud, SkySQL
 
Build an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data ScientistsBuild an Open Source Data Lake For Data Scientists
Build an Open Source Data Lake For Data Scientists
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Tips to drive maria db cluster performance for nextcloud
Tips to drive maria db cluster performance for nextcloudTips to drive maria db cluster performance for nextcloud
Tips to drive maria db cluster performance for nextcloud
 
Big data should be simple
Big data should be simpleBig data should be simple
Big data should be simple
 
Scaling up and accelerating Drupal 8 with NoSQL
Scaling up and accelerating Drupal 8 with NoSQLScaling up and accelerating Drupal 8 with NoSQL
Scaling up and accelerating Drupal 8 with NoSQL
 
Netflix running Presto in the AWS Cloud
Netflix running Presto in the AWS CloudNetflix running Presto in the AWS Cloud
Netflix running Presto in the AWS Cloud
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Journey through high performance django application
Journey through high performance django applicationJourney through high performance django application
Journey through high performance django application
 
Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Updating materialized views and caches using kafka
Updating materialized views and caches using kafkaUpdating materialized views and caches using kafka
Updating materialized views and caches using kafka
 

Recently uploaded

Searching and Sorting Algorithms
Searching and Sorting AlgorithmsSearching and Sorting Algorithms
Searching and Sorting AlgorithmsAshutosh Satapathy
 
Final PPT.ppt about human detection and counting
Final PPT.ppt  about human detection and countingFinal PPT.ppt  about human detection and counting
Final PPT.ppt about human detection and countingArbazAhmad25
 
introduction to python, fundamentals and basics
introduction to python, fundamentals and basicsintroduction to python, fundamentals and basics
introduction to python, fundamentals and basicsKNaveenKumarECE
 
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024California Asphalt Pavement Association
 
Support nodes for large-span coal storage structures
Support nodes for large-span coal storage structuresSupport nodes for large-span coal storage structures
Support nodes for large-span coal storage structureswendy cai
 
pulse modulation technique (Pulse code modulation).pptx
pulse modulation technique (Pulse code modulation).pptxpulse modulation technique (Pulse code modulation).pptx
pulse modulation technique (Pulse code modulation).pptxNishanth Asmi
 
Tekom Netherlands | The evolving landscape of Simplified Technical English b...
Tekom Netherlands | The evolving landscape of Simplified Technical English  b...Tekom Netherlands | The evolving landscape of Simplified Technical English  b...
Tekom Netherlands | The evolving landscape of Simplified Technical English b...Shumin Chen
 
The Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on KubernetesThe Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on KubernetesJacopo Nardiello
 
Governors ppt.pdf .
Governors ppt.pdf                              .Governors ppt.pdf                              .
Governors ppt.pdf .happycocoman
 
This chapter gives an outline of the security.
This chapter gives an outline of the security.This chapter gives an outline of the security.
This chapter gives an outline of the security.RoshniIsrani1
 
Conventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptxConventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptxSAQIB KHURSHEED WANI
 
عناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineeringعناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineeringmennamohamed200y
 
NIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdfNIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdfMohonDas
 
presentation by faizan[1] [Read-Only].pptx
presentation by faizan[1] [Read-Only].pptxpresentation by faizan[1] [Read-Only].pptx
presentation by faizan[1] [Read-Only].pptxkhfaizan534
 
zomato data mining datasets for quality prefernece and conntrol.pptx
zomato data mining  datasets for quality prefernece and conntrol.pptxzomato data mining  datasets for quality prefernece and conntrol.pptx
zomato data mining datasets for quality prefernece and conntrol.pptxPratikMhatre39
 
First Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slideFirst Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slideMonika860882
 
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...marijomiljkovic1
 

Recently uploaded (20)

Update on the latest research with regard to RAP
Update on the latest research with regard to RAPUpdate on the latest research with regard to RAP
Update on the latest research with regard to RAP
 
Searching and Sorting Algorithms
Searching and Sorting AlgorithmsSearching and Sorting Algorithms
Searching and Sorting Algorithms
 
Final PPT.ppt about human detection and counting
Final PPT.ppt  about human detection and countingFinal PPT.ppt  about human detection and counting
Final PPT.ppt about human detection and counting
 
introduction to python, fundamentals and basics
introduction to python, fundamentals and basicsintroduction to python, fundamentals and basics
introduction to python, fundamentals and basics
 
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
 
Support nodes for large-span coal storage structures
Support nodes for large-span coal storage structuresSupport nodes for large-span coal storage structures
Support nodes for large-span coal storage structures
 
pulse modulation technique (Pulse code modulation).pptx
pulse modulation technique (Pulse code modulation).pptxpulse modulation technique (Pulse code modulation).pptx
pulse modulation technique (Pulse code modulation).pptx
 
Tekom Netherlands | The evolving landscape of Simplified Technical English b...
Tekom Netherlands | The evolving landscape of Simplified Technical English  b...Tekom Netherlands | The evolving landscape of Simplified Technical English  b...
Tekom Netherlands | The evolving landscape of Simplified Technical English b...
 
The Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on KubernetesThe Art of Cloud Native Defense on Kubernetes
The Art of Cloud Native Defense on Kubernetes
 
Governors ppt.pdf .
Governors ppt.pdf                              .Governors ppt.pdf                              .
Governors ppt.pdf .
 
Caltrans view on recycling of in-place asphalt pavements
Caltrans view on recycling of in-place asphalt pavementsCaltrans view on recycling of in-place asphalt pavements
Caltrans view on recycling of in-place asphalt pavements
 
This chapter gives an outline of the security.
This chapter gives an outline of the security.This chapter gives an outline of the security.
This chapter gives an outline of the security.
 
Conventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptxConventional vs Modern method (Philosophies) of Tunneling-re.pptx
Conventional vs Modern method (Philosophies) of Tunneling-re.pptx
 
عناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineeringعناصر نباتية PDF.pdf architecture engineering
عناصر نباتية PDF.pdf architecture engineering
 
NIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdfNIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdf
 
presentation by faizan[1] [Read-Only].pptx
presentation by faizan[1] [Read-Only].pptxpresentation by faizan[1] [Read-Only].pptx
presentation by faizan[1] [Read-Only].pptx
 
Hydraulic Loading System - Neometrix Defence
Hydraulic Loading System - Neometrix DefenceHydraulic Loading System - Neometrix Defence
Hydraulic Loading System - Neometrix Defence
 
zomato data mining datasets for quality prefernece and conntrol.pptx
zomato data mining  datasets for quality prefernece and conntrol.pptxzomato data mining  datasets for quality prefernece and conntrol.pptx
zomato data mining datasets for quality prefernece and conntrol.pptx
 
First Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slideFirst Review Group 1 PPT.pptx with slide
First Review Group 1 PPT.pptx with slide
 
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
 

Introduction to Apache Tajo: Data Warehouse for Big Data

  • 1. Introduction to Apache Tajo: Data Warehouse for Big Data Jihoon Son / Gruter inc.
  • 2. About Me ● Jihoon Son (@jihoonson) ○ Tajo project co-founder ○ Committer and PMC member of Apache Tajo ○ Research engineer at Gruter 2
  • 3. Outline ● About Tajo ● Features of the Recent Release ● Demo ● Roadmap 3
  • 4. What is Tajo? ● Tajo / tάːzo / 타조 ○ An ostrich in Korean ○ The world's fastest two-legged animal 4
  • 5. What is Tajo? ● Apache Top-level Project ○ Big data warehouse system ■ ANSI-SQL compliant ■ Mature SQL features ● Various types of join, window functions ○ Rapid query execution with own distributed DAG engine ■ Low latency, and long running batch queries with a single system ■ Fault-tolerance ○ Beyond SQL-on-Hadoop ■ Support various types of storage 5
  • 6. Tajo Master Catalog Server Tajo Master Catalog Server Architecture Overview DBMS HCatalog Tajo Master Catalog Server Tajo Worker Query Master Query Executor Storage Service Tajo Worker Query Master Query Executor Storage Service Tajo Worker Query Master Query Executor Storage Service JDBC client TSQLWebUI REST API Storage Submit a query Manage metadataAllocate a query Send tasks & monitor Send tasks & monitor 6
  • 7. Who are Using Tajo? ● Use cases: replacement of commercial DW ○ 1st telco in South Korea ■ Replacement of long-running ETL workloads on several TB datasets ■ Lots of daily reports about user behavior ■ Ad­‐hoc analysis on TB datasets ○ Benefits ■ Simplified architecture for data analysis ● An unified system for DW ETL, OLAP, and Hadoop ETL ■ Much less cost, more data analysis within same SLA ● Saved license fee of commercial DW 7
  • 8. Who are Using Tajo? ● Use cases: data discovery ○ Music streaming service (26 million users) ■ Analysis of purchase history for target marketing ○ Benefits ■ Interactive query on large datasets ■ Data analysis with familiar BI tools 8
  • 9. Recent Release: 0.11 ● Feature highlights ○ Query federation ○ JDBC-based storage support ○ Self-describing data formats support ○ Multi-query support ○ More stable and efficient join execution ○ Index support ○ Python UDF/UDAF support 9
  • 10. Recent Release: 0.11 ● Today's topic ○ Query federation ○ JDBC-based storage support ○ Self-describing data formats support ○ Multi-query support ○ More stable and efficient join execution ○ Index support ○ Python UDF/UDAF support 10
  • 12. ● Your data might be spread on multiple heterogeneous sites ○ Cloud, DBMS, Hadoop, NoSQL, … Your Data DBMS Application Cloud storage On-premise storage NoSQL 12
  • 13. ● Even in a single site, your data might be stored in different data formats Your Data JSONCSV Parquet ORC Log ... 13
  • 14. Your Data ● How to analyze distributed data? ○ Traditionally ... DBMSApplication Cloud storage On-premise storage NoSQL Global view ETL transform ● Long delivery ● Complex data flow ● Human-intensive 14
  • 15. ● Query federation Your Data with Tajo DBMSApplication Cloud storage On-premise storage NoSQL Global view ● Fast delivery ● Easy maintenance ● Simple data flow 15
  • 16. Storage and Data Format Support Data formats Storage types 16
  • 17. > CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH ('text.delimiter'='|') LOCATION 'hdfs://localhost:8020/archive1'; > CREATE EXTERNAL TABLE user (user_id BIGINT, ...) USING orc WITH ('orc.compression.kind'='snappy') LOCATION 's3://user'; > CREATE EXTERNAL TABLE table1 (key TEXT, ...) USING hbase LOCATION 'hbase:zk://localhost:2181/uptodate'; > ... Create Table Data formatStorage URI 17
  • 18. Create Table > CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text WITH ('text. delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compress. SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost: 8020/tajo/warehouse/archive1'; > CREATE EXTERNAL TABLE archive2 (id BIGINT, ...) USING text WITH ('text. delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compress. SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost: 8020/tajo/warehouse/archive2'; > CREATE EXTERNAL TABLE archive3 (id BIGINT, ...) USING text WITH ('text. delimiter'='|','text.null'='N','compression.codec'='org.apache.hadoop.io.compress. SnappyCodec','timezone'='UTC+9','text.skip.headerlines'='2') LOCATION 'hdfs://localhost: 8020/tajo/warehouse/archive3'; > ... 18 Too tedious!
  • 19. Introduction to Tablespace ● Tablespace ○ Registered storage space ○ A tablespace is identified by an unique URI ○ Configurations and policies are shared by all tables in a tablespace ■ Storage type ■ Default data format and supported data formats ○ It allows users to reuse registered storage configurations and policies 19
  • 20. Tablespaces, Databases, and Tables Namespace Storage1 Storage2 ... ... ... Tablespace1 Tablespace2 Tablespace3 Physical space Table1 Table2 Table3 Database1 Database1 ... 20
  • 21. { "spaces" : { "warehouse" : { "uri" : "hdfs://localhost:8020/tajo/warehouse", "configs" : [ {'text.delimiter'='|'}, {'text.null'='N'}, {'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'}, {'timezone'='UTC+9'}, {'text.skip.headerlines'='2'} ] }, "hbase1" : { "uri" : "hbase:zk://localhost:2181/table1" } } } Tablespace Configuration Tablespace name Tablespace URI 21
  • 22. Create Table > CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse; Tablespace name Data format is omitted. Default data format is TEXT. "warehouse" : { "uri" : "hdfs://localhost:8020/tajo/warehouse", "configs" : [ {'text.delimiter'='|'}, {'text.null'='N'}, {'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'}, {'timezone'='UTC+9'}, {'text.skip.headerlines'='2'} ] }, 22
  • 23. Create Table > CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse; > CREATE TABLE archive2 (id BIGINT, ...) TABLESPACE warehouse; > CREATE TABLE archive3 (id BIGINT, ...) TABLESPACE warehouse; > CREATE TABLE user (user_id BIGINT, ...) TABLESPACE aws USING orc WITH ('orc.compression.kind'='snappy'); > CREATE TABLE table1 (key TEXT, ...) TABLESPACE hbase1; > ... 23
  • 24. HDFS HBase Tajo Worker Query Engine Storage ServiceHDFS handler Tajo Worker Query Engine Storage ServiceHDFS handler Tajo Worker Query Engine Storage ServiceHBase handler Querying on Different Data Silos ● How does a worker access different data sources? ○ Storage service ■ Return a proper handler for underlying storage > SELECT ... FROM hdfs_table, hbase_table, ... 24
  • 26. jdbc_db1 tajo_db1 JDBC-based Storage ● Storage providing the JDBC interface ○ PostgreSQL, MySQL, MariaDB, ... ● Databases of JDBC-based storage are mapped to Tajo databases Table1 Table2 Table3 Table1 Table2 Table3 tajo_db2 Table1 Table2 Table3 … jdbc_db2 Table1 Table2 Table3 … JDBC-based storage Tajo 26
  • 27. Tablespace Configuration { "spaces": { "pgsql_db1": { "uri": "jdbc:postgresql://hostname:port/db1" "configs": { "mapped_database": "tajo_db1" "connection_properties": { "user": "tajo", "password": "xxxx" } } } } } PostgreSQL database name Tajo database name Tablespace name 27
  • 28. Return to Query Federation ● How to correlate data on JDBC-based storage and others? ○ Need to have a global view of metadata across different storage types ■ Tajo also has its own metadata for its data ■ Each JDBC-based storage has own metadata for its data ■ Each NoSQL storage has metadata for its data ■ … 28
  • 29. ● Federating metadata of underlying storage Metadata Federation DBMS metadata provider NoSQL metadata provider Linked Metadata Manager DBMS HCatalog Tajo catalog metadata provider Catalog Interface ● Tablespace ● Database ● Tables ● Schema names ... 29
  • 30. Querying on JDBC-based Storage ● A plan is converted into a SQL string ● Query generation ○ Diverse SQL syntax of different types of storage ○ Different SQL builder for each storage type Tajo Master Tajo Worker JDBC-based storage SELECT ... Query plan SELECT ... 30
  • 31. Operation Push Down ● Tajo can exploit the processing capability of underlying storage ○ DBMSs, MongoDB, HBase, … ● Operations are pushed down into underlying storage ○ Leveraging the advanced features provided by underlying storage ■ Ex) DBMSs' query optimization, index, ... 31
  • 32. Example 1 SELECT count(*) FROM account ac, archive ar WHERE ac.key = ar.id and ac.name = 'tajo' account DBMS archive HDFS scan archive scan account ac.name = 'tajo' join ac.key = ar.id group by count(*) group by count(*) Full scan Result only Push operation 32
  • 33. Example 2 SELECT ac.name, count(*) FROM account ac GROUP BY ac.name account DBMS scan account group by count(*) Result only Push operation 33
  • 35. Self-describing Data Formats ● Some data formats include schema information as well as data ○ JSON, ORC, Parquet, … ● Tajo 0.11 natively supports self-describing data formats ○ Since they already have schema information, Tajo doesn't need to store it aside ○ Instead, Tajo can infer the schema at query execution time 35
  • 36. Create Table with Nested Data Format { "title" : "Hand of the King", "name" : { "first_name": "Eddard", "last_name": "Stark"}} { "title" : "Assassin", "name" : { "first_name": "Arya", "last_name": "Stark"}} { "title" : "Dancing Master", "name" : { "first_name": "Syrio", "last_name": "Forel"}} > CREATE EXTERNAL TABLE schemaful_table ( title TEXT, name RECORD ( first_name TEXT, last_name TEXT ) ) USING json LOCATION 'hdfs:///json_table'; Nested type 36
  • 37. How about This Data? {"id":"2937257761","type":"ForkEvent","actor":{"id":1088854,"login":"CAOakleyII","gravatar_id":"","url":"https://api.github.com/users/CAOakleyII","avatar_url":"https://avatars.githubusercontent. com/u/1088854?"},"repo":{"id":11909954,"name":"skycocker/chromebrew","url":"https://api.github.com/repos/skycocker/chromebrew"},"payload":{"forkee":{"id":38339291,"name":"chromebrew"," full_name":"CAOakleyII/chromebrew","owner":{"login":"CAOakleyII","id":1088854,"avatar_url":"https://avatars.githubusercontent.com/u/1088854?v=3","gravatar_id":"","url":"https://api.github. com/users/CAOakleyII","html_url":"https://github.com/CAOakleyII","followers_url":"https://api.github.com/users/CAOakleyII/followers","following_url":"https://api.github. com/users/CAOakleyII/following{/other_user}","gists_url":"https://api.github.com/users/CAOakleyII/gists{/gist_id}","starred_url":"https://api.github.com/users/CAOakleyII/starred{/owner}{/repo}"," subscriptions_url":"https://api.github.com/users/CAOakleyII/subscriptions","organizations_url":"https://api.github.com/users/CAOakleyII/orgs","repos_url":"https://api.github. com/users/CAOakleyII/repos","events_url":"https://api.github.com/users/CAOakleyII/events{/privacy}","received_events_url":"https://api.github.com/users/CAOakleyII/received_events","type":" User","site_admin":false},"private":false,"html_url":"https://github.com/CAOakleyII/chromebrew","description":"Package manager for Chrome OS","fork":true,"url":"https://api.github. com/repos/CAOakleyII/chromebrew","forks_url":"https://api.github.com/repos/CAOakleyII/chromebrew/forks","keys_url":"https://api.github.com/repos/CAOakleyII/chromebrew/keys{/key_id}"," collaborators_url":"https://api.github.com/repos/CAOakleyII/chromebrew/collaborators{/collaborator}","teams_url":"https://api.github.com/repos/CAOakleyII/chromebrew/teams","hooks_url":"https: //api.github.com/repos/CAOakleyII/chromebrew/hooks","issue_events_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues/events{/number}","events_url":"https://api.github. com/repos/CAOakleyII/chromebrew/events","assignees_url":"https://api.github.com/repos/CAOakleyII/chromebrew/assignees{/user}","branches_url":"https://api.github. com/repos/CAOakleyII/chromebrew/branches{/branch}","tags_url":"https://api.github.com/repos/CAOakleyII/chromebrew/tags","blobs_url":"https://api.github. com/repos/CAOakleyII/chromebrew/git/blobs{/sha}","git_tags_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/tags{/sha}","git_refs_url":"https://api.github. com/repos/CAOakleyII/chromebrew/git/refs{/sha}","trees_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/trees{/sha}","statuses_url":"https://api.github. com/repos/CAOakleyII/chromebrew/statuses/{sha}","languages_url":"https://api.github.com/repos/CAOakleyII/chromebrew/languages","stargazers_url":"https://api.github. com/repos/CAOakleyII/chromebrew/stargazers","contributors_url":"https://api.github.com/repos/CAOakleyII/chromebrew/contributors","subscribers_url":"https://api.github. com/repos/CAOakleyII/chromebrew/subscribers","subscription_url":"https://api.github.com/repos/CAOakleyII/chromebrew/subscription","commits_url":"https://api.github. com/repos/CAOakleyII/chromebrew/commits{/sha}","git_commits_url":"https://api.github.com/repos/CAOakleyII/chromebrew/git/commits{/sha}","comments_url":"https://api.github. com/repos/CAOakleyII/chromebrew/comments{/number}","issue_comment_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues/comments{/number}","contents_url":"https://api.github. com/repos/CAOakleyII/chromebrew/contents/{+path}","compare_url":"https://api.github.com/repos/CAOakleyII/chromebrew/compare/{base}...{head}","merges_url":"https://api.github. com/repos/CAOakleyII/chromebrew/merges","archive_url":"https://api.github.com/repos/CAOakleyII/chromebrew/{archive_format}{/ref}","downloads_url":"https://api.github. com/repos/CAOakleyII/chromebrew/downloads","issues_url":"https://api.github.com/repos/CAOakleyII/chromebrew/issues{/number}","pulls_url":"https://api.github. com/repos/CAOakleyII/chromebrew/pulls{/number}","milestones_url":"https://api.github.com/repos/CAOakleyII/chromebrew/milestones{/number}","notifications_url":"https://api.github. com/repos/CAOakleyII/chromebrew/notifications{?since,all,participating}","labels_url":"https://api.github.com/repos/CAOakleyII/chromebrew/labels{/name}","releases_url":"https://api.github. com/repos/CAOakleyII/chromebrew/releases{/id}","created_at":"2015-07-01T00:00:00Z","updated_at":"2015-06-28T10:11:09Z","pushed_at":"2015-06-09T07:46:57Z","git_url":"git://github. com/CAOakleyII/chromebrew.git","ssh_url":"git@github.com:CAOakleyII/chromebrew.git","clone_url":"https://github.com/CAOakleyII/chromebrew.git","svn_url":"https://github. com/CAOakleyII/chromebrew","homepage":"http://skycocker.github.io/chromebrew/","size":846,"stargazers_count":0,"watchers_count":0,"language":null,"has_issues":false,"has_downloads":true," has_wiki":true,"has_pages":false,"forks_count":0,"mirror_url":null,"open_issues_count":0,"forks":0,"open_issues":0,"watchers":0,"default_branch":"master","public":true}},"public":true,"created_at":" 2015-07-01T00:00:01Z"} ... 37
  • 38. Create Schemaless Table > CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION 'hdfs:///json_table'; That's all! Allow any schema 38
  • 39. Schema-free Query Execution > CREATE EXTERNAL TABLE schemaful_table (id BIGINT, name TEXT, ...) USING text LOCATION 'hdfs:///csv_table; > CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION 'hdfs:///json_table'; > SELECT name.first_name, name.last_name from schemaless_table; > SELECT title, count(*) FROM schemaful_table, schemaless_table WHERE name = name.last_name GROUP BY title; 39
  • 40. Schema Inference ● Table schema is inferred at query time ● Example SELECT a, b.b1, b.b2.c1 FROM t; ( a text, b record ( b1 text, b2 record ( c1 text ) ) ) Inferred schemaQuery 40
  • 42. Demo with Command line 42
  • 44. Roadmap ● 0.12 ○ Improved Yarn integration ○ Authentication support ○ JavaScript stored procedure support ○ Scalar subquery support ○ Hive UDF support 44
  • 45. Roadmap ● Next generation (beyond 0.12) ○ Exploiting modern hardware ○ Approximate query processing ○ Genetic query optimization ○ And more … 45
  • 46. tajo> select question from you; 46