Apache TAJO

By Asis Mohanty,
CBIP, CDMP
asismohanty@gmail.com

Tajo: A Data Warehouse System
• Apache Top-Level Project
• Distributed and scalable data warehouse system on Hadoop
• Low-latency interactive queries and long-running batch queries in a single system
• Features:
  ◦ ANSI/ISO SQL compliance
  ◦ Mature SQL features: joins, GROUP BY, ORDER BY, aggregation and window functions
  ◦ Partitioned tables
  ◦ User-defined functions (UDFs) in Java and Python
  ◦ JDBC driver and Java-based asynchronous API
  ◦ Reads and writes CSV, JSON, RCFile, SequenceFile, Parquet and ORC

Where to use Tajo
• Extraction, Transformation, Loading (ETL)
• Interactive BI/analytics on web-scale big data
• Data discovery and exploratory analysis with R and existing SQL tools
• Query federation
• A unified system for batch and interactive queries on Hadoop, Amazon S3 or HBase
• Mixed use of a Hadoop-based DW and an RDBMS-based DW, or replacement of an RDBMS DW
• Use of existing SQL tools on a Hadoop DW

HBase Storage Support
• Tajo supports HBase as a storage backend, so HBase tables can be accessed with SQL.
• Supported statements: CREATE [EXTERNAL] TABLE, DROP TABLE, INSERT [OVERWRITE] INTO
• Example:
CREATE TABLE hbase_t1 (key TEXT, col1 TEXT, col2 INT) USING hbase WITH (
'table' = 't1',
'columns' = ':key,cf1:col1,cf2:col2',
'hbase.zookeeper.quorum' = 'host1:2181,host2:2181');
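Once the mapping exists, the HBase-backed table can be queried like any other Tajo table. The following is a minimal sketch, assuming the hbase_t1 definition from this slide; the source table src_t1 and the row key value 'row1' are hypothetical and only illustrate the shape of the statements.

```sql
-- Read the mapped columns; "key" is mapped to the HBase row key (':key').
SELECT key, col1, col2
FROM hbase_t1
WHERE key = 'row1';

-- Append rows from another Tajo table.
-- src_t1 is a hypothetical table with the same column types as hbase_t1.
INSERT INTO hbase_t1
SELECT key, col1, col2 FROM src_t1;
```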

Tajo Shell (TSQL)
Tajo provides a shell utility named Tsql. It is a command-line interface
(CLI) in which users can create or drop tables, inspect schemas,
query tables, and so on.
• Meta commands
• Executing HDFS commands
• Session variables
• Administration commands
• Introduction to TSQL
• Executing a single command
• Executing queries from files
• Executing as a background process
Reference: http://tajo.apache.org/docs/current/index.html

Tajo SQL Language (DDL)
CREATE DATABASE
CREATE DATABASE [IF NOT EXISTS] <database_name>
DROP DATABASE
DROP DATABASE [IF EXISTS] <database_name>
CREATE TABLE
CREATE TABLE [IF NOT EXISTS] <table_name> [(<column_name> <data_type>, ... )]
[USING <storage_type> [WITH (<key> = <value>, ...)]] [AS <select_statement>]
CREATE EXTERNAL TABLE [IF NOT EXISTS] <table_name> (<column_name> <data_type>, ... )
USING <storage_type> [WITH (<key> = <value>, ...)] LOCATION '<path>'
Compression
CREATE EXTERNAL TABLE lineitem (
L_ORDERKEY bigint,
L_PARTKEY bigint,
...
L_COMMENT text)
USING TEXT WITH ('text.delimiter'='|','compression.codec'='org.apache.hadoop.io.compress.DeflateCodec')
LOCATION 'hdfs://localhost:9010/tajo/warehouse/lineitem_100_snappy';
DROP TABLE
DROP TABLE [IF EXISTS] <table_name> [PURGE]
CREATE INDEX
CREATE INDEX [ name ] ON table_name [ USING method ]
( { column_name | ( expression ) } [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
[ WHERE predicate ]
DROP INDEX
DROP INDEX name
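The syntax templates above can be filled in as a short session. This is a sketch only: the table orders_summary and its columns are hypothetical, the index name l_orderkey_idx is illustrative, and the index statements assume a lineitem table with an l_orderkey column, as used elsewhere in this deck.

```sql
-- Managed table with an explicit schema, stored as delimited text.
-- orders_summary and its columns are hypothetical.
CREATE TABLE IF NOT EXISTS orders_summary (
  order_key BIGINT,
  total_qty FLOAT8
) USING TEXT WITH ('text.delimiter'='|');

-- Index on a single column of an existing table.
CREATE INDEX l_orderkey_idx ON lineitem (l_orderkey);

-- Clean up in reverse order.
DROP INDEX l_orderkey_idx;
DROP TABLE IF EXISTS orders_summary;
```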

INSERT (OVERWRITE) INTO
The INSERT OVERWRITE statement overwrites the data of an existing table or the data in a
given directory. Tajo’s INSERT OVERWRITE follows the INSERT INTO ... SELECT form of
SQL. Examples:
create table t1 (col1 int8, col2 int4, col3 float8);
-- when the target table schema and the output schema are equivalent
INSERT OVERWRITE INTO t1 SELECT l_orderkey, l_partkey, l_quantity FROM
lineitem;
-- or
INSERT OVERWRITE INTO t1 SELECT * FROM lineitem;
-- when the output schema has fewer columns than the target table schema
INSERT OVERWRITE INTO t1 SELECT l_orderkey FROM lineitem;
-- when you want to specify certain target columns
INSERT OVERWRITE INTO t1 (col1, col3) SELECT l_orderkey, l_quantity FROM
lineitem;
In addition, INSERT OVERWRITE can write query results to a specific directory instead of
a table:
INSERT OVERWRITE INTO LOCATION '/dir/subdir' SELECT l_orderkey, l_quantity FROM lineitem;

Tajo Queries
Sample Query
SELECT [distinct [all]] * | <expression> [[AS] <alias>] [, ...]
[FROM <table reference> [[AS] <table alias name>] [, ...]]
[WHERE <condition>]
[GROUP BY <expression> [, ...]]
[HAVING <condition>]
[ORDER BY <expression> [ASC|DESC] [NULL FIRST|NULL LAST] [, ...]]
Table and Table Aliases
A temporary name can be given to a table or to a complex table reference, and is then used to
refer to that table in the rest of the query. This is called a table alias.
FROM table_reference AS alias
or
FROM table_reference alias
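Both alias forms can be seen in a short sketch. It assumes the lineitem table used elsewhere in this deck; the orders table and its o_orderkey and o_orderdate columns are assumptions (TPC-H style names), and the threshold 100 is arbitrary.

```sql
-- "l" uses the AS form, "o" uses the form without AS.
-- orders, o_orderkey and o_orderdate are assumed to exist.
SELECT l.l_orderkey, o.o_orderdate
FROM lineitem AS l, orders o
WHERE l.l_orderkey = o.o_orderkey;

-- An alias is what lets the outer query refer to a derived table.
SELECT t.l_orderkey, t.total_qty
FROM (SELECT l_orderkey, sum(l_quantity) AS total_qty
      FROM lineitem
      GROUP BY l_orderkey) AS t
WHERE t.total_qty > 100;
```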
Window Functions
A window function performs a calculation across the set of table rows that belong to a window
frame.
SELECT ..., func(param) OVER ([PARTITION BY partition-expr [, ...]] [ORDER BY sort-expr [, ...]]), ...
FROM <table reference>
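As an illustration of the OVER clause, the sketch below computes a per-order total and a rank within each order. It assumes the lineitem table with l_orderkey and l_quantity columns used in the INSERT examples; the output column names order_qty and qty_rank are illustrative.

```sql
SELECT l_orderkey,
       l_quantity,
       -- total quantity over all rows that share the same order key
       sum(l_quantity) OVER (PARTITION BY l_orderkey) AS order_qty,
       -- position of each row within its order, largest quantity first
       rank() OVER (PARTITION BY l_orderkey ORDER BY l_quantity DESC) AS qty_rank
FROM lineitem;
```

Unlike GROUP BY, the window form keeps one output row per input row, so the per-order total appears beside each line item.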

Better SQL support via thin JDBC
[Diagram: ETL tools, BI tools and reporting tools connect through Tajo JDBC to the Tajo cluster, which accesses data in HDFS, HBase, S3 and Swift]

Dataset Stored in various Formats/Storage

Which one to choose
Impala Vs Presto Vs Drill Vs Tajo
