SlideShare a Scribd company logo
Efficient In-situ Processing of
Various Storage Types on Apache
Tajo
Hadoop Summit 2015 San Jose
Hyunsik Choi, Gruter Inc.
Agenda
• Tajo Overview
• Various Storage Support
• Motivation
• Design Consideration
• What we did/are doing
An overview of Apache Tajo
Tajo: A Data Warehouse System
• Data Warehouse System
• Apache Top-level project
• Low latency, and long running batch queries in a single system
• ~100 ms up to several hours
• Fault tolerance
• Features
• ANSI SQL compliance
• Mature SQL features: Joins, Group by, Sort, Multiple distinct aggregations and Window function
• Partitioned table support
• Java/Python UDF support
• JDBC driver and Java-based asynchronous API
• SQL data type and Nested type support
• Direct JSON support
Master Server
TajoMaster
Slave Server
TajoWorker
QueryMaster
Local Query Engine
StorageManager
Local
FileSystem
HDFS
Client
JDBC SQL Shell Web UI
Slave Server
TajoWorker
QueryMaster
Local Query Engine
StorageManager
Local
FileSystem
HDFS
Slave Server
TajoWorker
QueryMaster
Local Query Engine
StorageManager
Local
FileSystem
HDFS
CatalogStore
DBMS (MySQL, ..)
Hive Meta StoreSubmit a query
Manage metadata
Allocate a query
send tasks
& monitor
send tasks
& monitor
Tajo Overall Architecture
Background: Query Optimization Phases
BLK 1
BLK 5
BLK 3
BLK 4
BLK 2
BLK 6
Task Assigning with Locality
Worker Worker Worker
HDFS
Cluster
Node1 Node2 Node3
Tajo
Cluster…
…
…
• Each task is assigned to a node according to its locality.
Background: Task Execution
• Physical operators are assembled into a tree and their execution
pipelined in the same machine.
• Leaf operators must be scanners.
• Tajo provides abstraction scanner, allowing to read
different physical tables.
Background: Local Execution
Various Storage Support
Motivation
• Unified Interface
• Data Integration
• In-situ Processing
HDFS NoSQL S3
Openstack
Swift
Apache Tajo
Sequence File
RCFile
Protocol Buffer
Datasets stored in Various Formats/Storages
Design Considerations
• More Storage Properties
• Splittable, compressible (codecs), indexable, seekable, projectable,
aggregatable, …
• Query Optimization
• Pluggable Storage and Data Format
• More operation pushdown
Sequence File
RCFile
Protocol Buffer
Separation between Storage and Format
Data
Formats
Storage
Types
Relationships between Storage and Format
Storages Data Format
Text
RCFile
Parquet
Avro
Hbase Serialization
Protobuf
Local File System
Swift
…..
JSON
.
.
.
.
Tablespace
• Tablespace
• Each table space is identified by a URI.
• Hdfs://host:port/warehouse, hbase:zk://quorum1:2171, quorum2:2171, …
• All tables in the same tablespace shares the same configuration.
• URI scheme indicates storage type.
• Hdfs, hbase, jdbc, …
• Multiple tablespaces is possible in single storage namespace.
• HDFS-2832: Enable support for heterogeneous in HDFS.
• e.g.,
• /warehouse/ (disk)
• /today/ (ssd)
Storage Configuration
Storage Type Name and URI scheme
Storage Handler Class
Tablespace Configuration
Tablespace name
Tablespace URI
Format Configuration
Format names
The relationship between
formats and storages
CREATE Table using Tablespace
CREATE TABLE uptodate (key TEXT, …) TABLESPACE hbase1;
CREATE TABLE archive (l_orderkey bigint, …) TABLESPACE warehouse
USING text WITH (‘text.delimiter’ = ‘|’);
Tablespace Name
Format name
Storage Layer Access over Query Lifecycle
Query
Planning
Query
Executor
Running
Task
Completed
Task
Failed
Task
Query
Master
- guessTableVolume()
- validateSchema()
- getStorageProperty()
- getFormatProperty()
- …
- createTable()
- purgeTable()
- prepareTable()
- getScanner()
- getAppender()
…..
- getSplits()
- commitTable() - rollbackTable()
- getScanner()
- getAppender()
Query Rewrite for Specific Storages
CREATE TABLE hbase_table (key TEXT, …)
INSERT INTO hbase_table SELECT id, name, …
SCAN
Table Write
SCAN
Table Write
Sort
Logical Plan
Rewrite
HFileHandler
Operation Push Down
SELECT
X,
SUM(Y)
FROM
table1
WHERE
x > 100
GROUP BY
x
Underlying
Storage
Filter and Projection can be pushed down into
Underlying storages (like RDBMS, Hbase,
Elasticsearch, …)
Current Status
• Storages:
• HDFS support
• Amazon S3 and Openstack Swift
• Hbase Scanner and Writer - Hfile and Put Mode
• JDBC-based Scanner and Writer (Working)
• Kafka Scanner (Patch Available)
• Elastic Search (Patch Available)
• Data Formats
• Text, JSON, RCFile, SequenceFile, Avro, Parquet, and ORC (Patch Available)
Get Involved!
• We are recruiting contributors!
• General
• http://tajo.apache.org
• Getting Started
• http://tajo.apache.org/docs/0.10.0/getting_started.html
• Downloads
• http://tajo.apache.org/downloads.html
• Jira – Issue Tracker
• https://issues.apache.org/jira/browse/TAJO
• Join the mailing list
• dev-subscribe@tajo.apache.org
• issues-subscribe@tajo.apache.org
Q&A

More Related Content

What's hot

The Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDB
The Hive
 
Draft slide of Demystifying DHT in GlusterFS
Draft slide of Demystifying DHT in GlusterFSDraft slide of Demystifying DHT in GlusterFS
Draft slide of Demystifying DHT in GlusterFS
Ankit Raj
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
Cloudera, Inc.
 
[Altibase] 2-2 Altibase storage
[Altibase] 2-2 Altibase storage[Altibase] 2-2 Altibase storage
[Altibase] 2-2 Altibase storage
altistory
 
Redis Modules - Redis India Tour - 2017
Redis Modules - Redis India Tour - 2017Redis Modules - Redis India Tour - 2017
Redis Modules - Redis India Tour - 2017
HashedIn Technologies
 
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFS
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFSHadoop Meetup Jan 2019 - Mounting Remote Stores in HDFS
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFS
Erik Krogen
 
Redis database
Redis databaseRedis database
Redis database
Ñáwrás Ñzár
 
MongoDB SF Python
MongoDB SF PythonMongoDB SF Python
MongoDB SF Python
Mike Dirolf
 
Development of the irods rados plugin @ iRODS User group meeting 2014
Development of the irods rados plugin @ iRODS User group meeting 2014Development of the irods rados plugin @ iRODS User group meeting 2014
Development of the irods rados plugin @ iRODS User group meeting 2014mgrawinkel
 
MySQL Space Management
MySQL Space ManagementMySQL Space Management
MySQL Space Management
MIJIN AN
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
Gruter
 
Introduction to redis
Introduction to redisIntroduction to redis
Introduction to redis
NexThoughts Technologies
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
Amit Kumar Gupta
 
EECI 2013 - ExpressionEngine Performance & Optimization - Laying a Solid Foun...
EECI 2013 - ExpressionEngine Performance & Optimization - Laying a Solid Foun...EECI 2013 - ExpressionEngine Performance & Optimization - Laying a Solid Foun...
EECI 2013 - ExpressionEngine Performance & Optimization - Laying a Solid Foun...
Nexcess.net LLC
 
Some key value stores using log-structure
Some key value stores using log-structureSome key value stores using log-structure
Some key value stores using log-structure
Zhichao Liang
 
Introduction to mongo db
Introduction to mongo dbIntroduction to mongo db
Introduction to mongo db
NexThoughts Technologies
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
MIJIN AN
 
HDFS Basics
HDFS BasicsHDFS Basics
HDFS Basics
NIVASH RAMAJAYAM
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Maarten Smeets
 

What's hot (20)

The Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive Think Tank: Rocking the Database World with RocksDB
The Hive Think Tank: Rocking the Database World with RocksDB
 
Draft slide of Demystifying DHT in GlusterFS
Draft slide of Demystifying DHT in GlusterFSDraft slide of Demystifying DHT in GlusterFS
Draft slide of Demystifying DHT in GlusterFS
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
 
[Altibase] 2-2 Altibase storage
[Altibase] 2-2 Altibase storage[Altibase] 2-2 Altibase storage
[Altibase] 2-2 Altibase storage
 
Redis Modules - Redis India Tour - 2017
Redis Modules - Redis India Tour - 2017Redis Modules - Redis India Tour - 2017
Redis Modules - Redis India Tour - 2017
 
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFS
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFSHadoop Meetup Jan 2019 - Mounting Remote Stores in HDFS
Hadoop Meetup Jan 2019 - Mounting Remote Stores in HDFS
 
Redis database
Redis databaseRedis database
Redis database
 
MongoDB SF Python
MongoDB SF PythonMongoDB SF Python
MongoDB SF Python
 
Intro ch 06_b
Intro ch 06_bIntro ch 06_b
Intro ch 06_b
 
Development of the irods rados plugin @ iRODS User group meeting 2014
Development of the irods rados plugin @ iRODS User group meeting 2014Development of the irods rados plugin @ iRODS User group meeting 2014
Development of the irods rados plugin @ iRODS User group meeting 2014
 
MySQL Space Management
MySQL Space ManagementMySQL Space Management
MySQL Space Management
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
Introduction to redis
Introduction to redisIntroduction to redis
Introduction to redis
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
 
EECI 2013 - ExpressionEngine Performance & Optimization - Laying a Solid Foun...
EECI 2013 - ExpressionEngine Performance & Optimization - Laying a Solid Foun...EECI 2013 - ExpressionEngine Performance & Optimization - Laying a Solid Foun...
EECI 2013 - ExpressionEngine Performance & Optimization - Laying a Solid Foun...
 
Some key value stores using log-structure
Some key value stores using log-structureSome key value stores using log-structure
Some key value stores using log-structure
 
Introduction to mongo db
Introduction to mongo dbIntroduction to mongo db
Introduction to mongo db
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
HDFS Basics
HDFS BasicsHDFS Basics
HDFS Basics
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 

Similar to Efficient In-situ Processing of Various Storage Types on Apache Tajo

Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Hyunsik Choi
 
What's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondWhat's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its Beyond
Gruter
 
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of GruterBig Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Data Con LA
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
gvernik
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
gvernik
 
Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketing
earnwithme2522
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
vishwasgarade1
 
Apache Hive
Apache HiveApache Hive
Apache Hive
Amit Khandelwal
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
chrislusf
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
datastack
 
Apache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouseApache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouse
hadoopsphere
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Esther Kundin
 
Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars George
JAX London
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
Prashant Gupta
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Esther Kundin
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
Shravan (Sean) Pabba
 
מיכאל
מיכאלמיכאל
מיכאל
sqlserver.co.il
 
03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
Subhas Kumar Ghosh
 
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
NoSQLmatters
 

Similar to Efficient In-situ Processing of Various Storage Types on Apache Tajo (20)

Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
Tajo Seoul Meetup July 2015 - What's New Tajo 0.11
 
What's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its BeyondWhat's New Tajo 0.10 and Its Beyond
What's New Tajo 0.10 and Its Beyond
 
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of GruterBig Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
Big Data Day LA 2015 - What's New Tajo 0.10 and Beyond by Hyunsik Choi of Gruter
 
Hadoop and object stores can we do it better
Hadoop and object stores  can we do it betterHadoop and object stores  can we do it better
Hadoop and object stores can we do it better
 
Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?Hadoop and object stores: Can we do it better?
Hadoop and object stores: Can we do it better?
 
Apache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketingApache Hive, data segmentation and bucketing
Apache Hive, data segmentation and bucketing
 
hive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptxhive_slides_Webinar_Session_1.pptx
hive_slides_Webinar_Session_1.pptx
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
SeaweedFS introduction
SeaweedFS introductionSeaweedFS introduction
SeaweedFS introduction
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Apache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouseApache Tajo - An open source big data warehouse
Apache Tajo - An open source big data warehouse
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
Intro to HBase - Lars George
Intro to HBase - Lars GeorgeIntro to HBase - Lars George
Intro to HBase - Lars George
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
Impala for PhillyDB Meetup
Impala for PhillyDB MeetupImpala for PhillyDB Meetup
Impala for PhillyDB Meetup
 
מיכאל
מיכאלמיכאל
מיכאל
 
03 hive query language (hql)
03 hive query language (hql)03 hive query language (hql)
03 hive query language (hql)
 
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
Bruno Guedes - Hadoop real time for dummies - NoSQL matters Paris 2015
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 

Recently uploaded (20)

Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 

Efficient In-situ Processing of Various Storage Types on Apache Tajo

  • 1. Efficient In-situ Processing of Various Storage Types on Apache Tajo Hadoop Summit 2015 San Jose Hyunsik Choi, Gruter Inc.
  • 2. Agenda • Tajo Overview • Various Storage Support • Motivation • Design Consideration • What we did/are doing
  • 3. An overview of Apache Tajo
  • 4. Tajo: A Data Warehouse System • Data Warehouse System • Apache Top-level project • Low latency, and long running batch queries in a single system • ~100 ms up to several hours • Fault tolerance • Features • ANSI SQL compliance • Mature SQL features: Joins, Group by, Sort, Multiple distinct aggregations and Window function • Partitioned table support • Java/Python UDF support • JDBC driver and Java-based asynchronous API • SQL data type and Nested type support • Direct JSON support
  • 5. Master Server TajoMaster Slave Server TajoWorker QueryMaster Local Query Engine StorageManager Local FileSystem HDFS Client JDBC SQL Shell Web UI Slave Server TajoWorker QueryMaster Local Query Engine StorageManager Local FileSystem HDFS Slave Server TajoWorker QueryMaster Local Query Engine StorageManager Local FileSystem HDFS CatalogStore DBMS (MySQL, ..) Hive Meta StoreSubmit a query Manage metadata Allocate a query send tasks & monitor send tasks & monitor Tajo Overall Architecture
  • 7. BLK 1 BLK 5 BLK 3 BLK 4 BLK 2 BLK 6 Task Assigning with Locality Worker Worker Worker HDFS Cluster Node1 Node2 Node3 Tajo Cluster… … … • Each task is assigned to a node according to its locality. Background: Task Execution
  • 8. • Physical operators are assembled into a tree and their execution pipelined in the same machine. • Leaf operators must be scanners. • Tajo provides abstraction scanner, allowing to read different physical tables. Background: Local Execution
  • 10. Motivation • Unified Interface • Data Integration • In-situ Processing HDFS NoSQL S3 Openstack Swift Apache Tajo
  • 11. Sequence File RCFile Protocol Buffer Datasets stored in Various Formats/Storages
  • 12. Design Considerations • More Storage Properties • Splittable, compressible (codecs), indexable, seekable, projectable, aggregatable, … • Query Optimization • Pluggable Storage and Data Format • More operation pushdown
  • 13. Sequence File RCFile Protocol Buffer Separation between Storage and Format Data Formats Storage Types
  • 14. Relationships between Storage and Format Storages Data Format Text RCFile Parquet Avro Hbase Serialization Protobuf Local File System Swift ….. JSON . . . .
  • 15. Tablespace • Tablespace • Each table space is identified by a URI. • Hdfs://host:port/warehouse, hbase:zk://quorum1:2171, quorum2:2171, … • All tables in the same tablespace shares the same configuration. • URI scheme indicates storage type. • Hdfs, hbase, jdbc, … • Multiple tablespaces is possible in single storage namespace. • HDFS-2832: Enable support for heterogeneous in HDFS. • e.g., • /warehouse/ (disk) • /today/ (ssd)
  • 16. Storage Configuration Storage Type Name and URI scheme Storage Handler Class
  • 18. Format Configuration Format names The relationship between formats and storages
  • 19. CREATE Table using Tablespace CREATE TABLE uptodate (key TEXT, …) TABLESPACE hbase1; CREATE TABLE archive (l_orderkey bigint, …) TABLESPACE warehouse USING text WITH (‘text.delimiter’ = ‘|’); Tablespace Name Format name
  • 20. Storage Layer Access over Query Lifecycle Query Planning Query Executor Running Task Completed Task Failed Task Query Master - guessTableVolume() - validateSchema() - getStorageProperty() - getFormatProperty() - … - createTable() - purgeTable() - prepareTable() - getScanner() - getAppender() ….. - getSplits() - commitTable() - rollbackTable() - getScanner() - getAppender()
  • 21. Query Rewrite for Specific Storages CREATE TABLE hbase_table (key TEXT, …) INSERT INTO hbase_table SELECT id, name, … SCAN Table Write SCAN Table Write Sort Logical Plan Rewrite HFileHandler
  • 22. Operation Push Down SELECT X, SUM(Y) FROM table1 WHERE x > 100 GROUP BY x Underlying Storage Filter and Projection can be pushed down into Underlying storages (like RDBMS, Hbase, Elasticsearch, …)
  • 23. Current Status • Storages: • HDFS support • Amazon S3 and Openstack Swift • Hbase Scanner and Writer - Hfile and Put Mode • JDBC-based Scanner and Writer (Working) • Kafka Scanner (Patch Available) • Elastic Search (Patch Available) • Data Formats • Text, JSON, RCFile, SequenceFile, Avro, Parquet, and ORC (Patch Available)
  • 24. Get Involved! • We are recruiting contributors! • General • http://tajo.apache.org • Getting Started • http://tajo.apache.org/docs/0.10.0/getting_started.html • Downloads • http://tajo.apache.org/downloads.html • Jira – Issue Tracker • https://issues.apache.org/jira/browse/TAJO • Join the mailing list • dev-subscribe@tajo.apache.org • issues-subscribe@tajo.apache.org
  • 25. Q&A

Editor's Notes

  1. Hi, I’m a pmc member of Apache Tajo project and I’m also research director of Gruter Inc. Today, I’m going to present The various strorage support in Tajo project. As you know, Tajo is a SQL on hadoop system. Especially, its exact position is data warehouse system. One of the key feature is to integrate different data sources and warehouse them. For them, different storage type support is necessary. The early Tajo also supports various storage types. But while we expanding storage types, we faced some limitation. In this presentation, I’ll present how Tajo solve this challenge and how Tajo support different storage types.
  2. This is an agenda of Today talk. Firstly, I’ll give an overview of Tajo project. We’ll talk about its overall architecture, how query works, how query is planned. Next, we will talk about how Tajo support various storage support.
  3. One of the distinguished features is that you can execute a low latency and long-running batch queries in a single system. Internally, It’s smart optimizater makes good use of various query optimization opportunities for various workloads. And also, it has ANSI SQL compliance, Mature SQL feature, Java/Python UDF support, JDBC, and nested data type. JSON support. Tajo is already a production-ready system.
  4. This figure shows an overall architecture of Tajo. One Tajo cluster consists of one master and multiple Tajo workers. Tajo master is responsible for coordinating Tajo workers and planning. When a query is issued, TajoMAster determines if this query needs a distributed execution. Some queries like DDL can be executed instantly in TajoMaster. Otherwise, a query is forwarded to free TajoWorker. Then, the query master is launched in the worker and it makes a number tasks from the query. Then tasks are disseminated into a number of nodes and executed on them. Tajo also has its own catalog store. This catalog store supports pluggable driver for persistent storage. Tajo can use Hive meta store as a persistent storage. In this case, Tajo can see the same table to Hive. Users can directly process Hive tables with Tajo.
  5. This figure shows how query is optimized. Tajo uses multiple well-defined optimization phases. Each phase tries to find the best answer with given information.
  6. Similar to other Hadoop-based processing systems, each leaf task which scans a table fragment is assigned to nodes according to the task locality.
  7. For local execution, a physical operators is assembled into a tree. Leaf operator scans a table fragment.
  8. Why? What challenges? How we Tajo sees various storages as table.
  9. Why we should support various storages in analytical SQL system? I think that there are three reasons. I also believe many users also think so. The first reason is an unified interface. SQL is easy and very popular interface. In particular, SQL has already various tools like BI tools and visualization tools. If you can use SQL on storages, you don’t need to pay additional cost for learning and tools. The next reason is data integration. Recently, most companies use various storages. For updatable data and point queries, NoSQL or RDBMS is widely used. For archiving data, HDFS or S3 is used. Users want to see a global view of data sets. In this case, analytical processing requires data integration among data sources, that is, joining multiple data sources. BTW, Analytical SQL systems are the most proper systems for join. The third reason is in-situ processing. it is common for users to perform ETL workload, causing heavy reading and processing costs. By the way, many storage systems have some query processing capability like filter and projection. In particular, RDBMS system can do more complex processing. This capability is much more efficient than what external system can do. So, if we use those capabilities, data archiving or ETL would be more efficient.
  10. In Hadoop eco system and legacy systems, there are many de-factor standard data formats. Our goal is to support them seemlessly.
  11. It’s just not reading and writing problems. In order to fully exploit storages, we need to consider more than just reading and writing. In particular, Tajo is a SQL system. One of the key feature of SQL system is query optimization. For them, storage layer should expose more physical properties like splittable, compressible, indexable, projectable, and sorted keys. Then, query optimizer should exploit physical properties. Also, we need pluggable storage and data format system, allowing users to add them without any code modification. Also, we need to enable storage layer to accept operation pushdown for in-situ processing.
  12. Firstly, we explicitly separate the storage concept into storage and data format. This separation allows us to easily describe more properties of specific storages and data formats. It also makes modularization much easier. For example, the scan throughput of RCFile is not bad in HDFS or Local File system. But, in S3, RCFile is too slow due to expensive random access cost. Another example is Hfile of Hbase. Tajo directly HFile for hbase loading. Even though Hfile is stored in HDFS, it requires sorted data set. We wanted these kinds of properties can be described.
  13. Then, we improve Tajo storage layer to recognize the relationships between storages and data formats. Some data formats like Text, Parquet are shared among file-based storage types.
  14. We also adopt Tablespace concept. HDFS-2832 introduced heterogeneous storage in HDFS. It allows different paths in a single namespace to use different storage media. For example, with this feature, we can bind warehouse directory to disk and today directory to ssd disk. Tablespace allows Tajo to distinguish directories in the same namespace as different spaces.
  15. For new concept, we need new configuration system. This configuration defines storages. This is a storage type name and it is also used for URI scheme. This class will handle all storage instances with URIs using hdfs scheme.
  16. This configuration describes tablespaces. This is the tablespace name. This table uri starts with hdfs. So, HDFS storage handler as I mentioned before will handle this tablespace.
  17. This configuraiton describes file formats. This is a format name. It means that Avro format can be used in hdfs, file, and s3 storages.
  18. In order to allow users to easily use tablespace concept, we also added tablespace clause. With tablespace clause, users can easily specify a tablespace for each table.
  19. Then, we also investigated the query lifecycle and access points to storage layer. Planning requires storage support. In Tajo, query optimizer uses storage layer to guess input table volume or get table properties. Especially, the table volume is commonly used for join optimization when there is no statistic information. QueryExecutor in TajoMaster determines if each query is executed in instantly executed in TajoMaster or executed in a distributed manner. DDL queries like CREATE TABLE, and DROP TABLE also need storage layer support. They require storage to make or purge storage space for a table. When a query is executed across a number of cluster nodes, TajoMaster forwards a query to QueryMaster, which will be launched instantly in a worker. QueryMaster makes splits from input data sources. Making splits also need storage layer support because it can be different according to storage type and file formats. For example, Splitting Text file in HDFS and S3 can be different. Then, tasks are disseminated across a number of workers. Each task scans input data and writes output data on specific storage. In this point, they also requires storage support. Tajo supports fault tolerance, The task failure is handled. So, only completed tasks are committed into the final output directory. We derived storage access pattern from this lifecycle and clearly defined storage layer APIs.
  20. While we are investigating various storages, we also realized that more APIs are requires for them. An example is the rewrite rule injection. As I mentioned ealier, Tajo supports direct Hfile write, requiring sorted table write. Hbase requires a written data sorted in an ascending order of row keys. In this case, sort operation will be injected before table writing.
  21. We are also expecting that aggregation also can be pushed down into underlying storage like RDBMS. This work is still working.