What’s New in Tajo 0.11
Tajo Seoul Meetup 2015. 07
Hyunsik Choi, Gruter Inc.
Agenda
• Tajo Overview
• Milestones and 0.10 Features
• What’s New in 0.11
Tajo: A Big Data Warehouse System
• Apache Top-level project
• Distributed and scalable data warehouse system on various data
sources (e.g., HDFS, S3, HBase, …)
• Low-latency and long-running batch queries in a single system
• Features
• ANSI SQL compliance
• Mature SQL features
• Partitioned table support
• Java/Python UDF support
• JDBC driver and Java-based asynchronous API
• Read/Write support of CSV, JSON, RCFile, SequenceFile, Parquet, ORC
Tajo Overall Architecture
[Figure: architecture diagram]
• Client (JDBC, TSql, Web UI): submits a query to the Master Server
• Master Server (TajoMaster): manages metadata through the CatalogStore (DBMS, HCatalog) and allocates each query to a QueryMaster
• Slave Servers (TajoWorker): each contains a QueryMaster, a Local Query Engine, and a StorageManager on top of HDFS/HBase; the QueryMaster sends tasks to the workers and monitors them
Common Scenarios
• Extraction, Transformation, Loading (ETL)
• Interactive BI/analytics on web-scale big data
• Data discovery/Exploratory analysis with R and
existing SQL tools
Use Cases: Replacement of Commercial DW
• Example: The largest telecom company in South Korea
• Goal:
• Replacement of slow ETL workloads on datasets of several TB
• Generation of many daily reports on user behavior
• Ad-hoc analysis on terabyte-scale data sets
• Key Benefits of Tajo:
• Simplification of DW ETL, OLAP, and Hadoop ETL into a
unified system
• Savings on commercial DW license fees
• Much lower cost and more data analysis within the same SLA
Use Cases: Data Discovery
• Example: Music streaming service
(26 million users)
• Goal:
• Analysis on purchase history for target marketing
• Benefits:
• Query interactivity on large data sets
• Ability to use existing BI visualization tools
When is Tajo the right choice?
• You want a unified system for batch and
interactive queries on Hadoop, Amazon S3, or
HBase.
• You want a mixed use of Hadoop-based DW and
RDBMS-based DW, or want to replace an existing
RDBMS DW.
• You want to use existing SQL tools on a Hadoop DW.
Milestones
• 0.8: More features & SQL compatibility
• 0.9: Stability & analytical functions
• 0.10: Ecosystem expansion
• 0.11: More features (Python UDF, nested schema, tablespace support, basic query federation, better query scheduler)
Selected Features in 0.10
HBase Storage Support
• You can use SQL to access HBase tables.
• Tajo supports HBase storage
• CREATE (EXTERNAL)/DROP/INSERT (OVERWRITE)/SELECT
• Bulk insertion through direct HFile writing
CREATE TABLE hbase_t1 (key TEXT, col1 TEXT, col2 INT) USING hbase
WITH (
  'table' = 't1',
  'columns' = ':key,cf1:col1,cf2:col2',
  'hbase.zookeeper.quorum' = 'host1:2181,host2:2181'
)
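To make the 'columns' option above concrete, here is a minimal Python sketch of how such a mapping string lines up with the table's columns. It is an illustration only, not Tajo's parser: the function name parse_hbase_columns is hypothetical, and ':key' is taken to denote the HBase row key, as in the example above.

```python
def parse_hbase_columns(tajo_columns, mapping):
    """Pair Tajo column names with HBase (family, qualifier) entries.

    Illustrative only: mirrors the 'columns' option shown above,
    not Tajo's internal implementation.
    """
    entries = mapping.split(",")
    if len(entries) != len(tajo_columns):
        raise ValueError("mapping must have one entry per table column")
    result = {}
    for name, entry in zip(tajo_columns, entries):
        family, _, qualifier = entry.partition(":")
        if family == "" and qualifier == "key":
            # ':key' marks the column that holds the HBase row key.
            result[name] = ("<row key>", None)
        else:
            result[name] = (family, qualifier)
    return result


mapped = parse_hbase_columns(["key", "col1", "col2"], ":key,cf1:col1,cf2:col2")
```

Each table column receives exactly one mapping entry, in declaration order, which is why the sketch rejects a mapping of a different length.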
Better AWS support
• Optimized for S3 and EMR environments
• Fixed many bugs related to S3
• EMR bootstrap supported in AWS Labs GitHub repo
• A quick guide for Tajo on EMR
• http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/
• EMR bootstrap for Tajo on EMR
• https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo
Better SQL tool support via thin JDBC
[Figure: ETL tools, BI tools, and reporting tools connect through Tajo JDBC to a Tajo cluster, which reads from HDFS, HBase, S3, and Swift]
Zeppelin Integration
Improved Performance and Stability
• Off-heap sort operator for ORDER BY (TAJO-907)
• Hash shuffle IO improvement (TAJO-374, TAJO-987)
• Skewness handling of hash shuffle
• Automatic parallel degree choice during runtime
• Lots of query optimizer improvements
• Add Master HA (TAJO-704)
• More error-tolerant shuffle fetch (TAJO-789, TAJO-953)
What’s New in Tajo 0.11
Nested data and JSON support
• Nested data is becoming common
• JSON, BSON, XML, Protocol Buffer, Avro, Parquet, …
• Many web applications commonly use JSON.
• MongoDB uses JSON documents by default.
• Many HBase users also store JSON documents in a cell.
• Flattening causes a lot of data and computation
overhead.
• Tajo 0.11 natively supports nested data types.
How to create a nested schema table
Use the RECORD keyword to define a complex data type
Loose schema for self-describing formats
You can handle schema evolution with ALTER TABLE … ADD COLUMN!
How to retrieve nested fields
Input Data
Table Definition
SQL
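The examples on this slide were images and are not available in this transcript. As a stand-in, the Python sketch below shows the general idea of addressing a nested field by a dotted path without flattening the record first. The sample record, its field names, and the helper get_nested are assumptions for illustration, not taken from the talk.

```python
import json


def get_nested(record, path):
    """Return the value at a dotted path such as 'name.first_name', or None."""
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


# One line of hypothetical JSON input with a nested 'name' record.
line = '{"title": "Hand of the King", "name": {"first_name": "Eddard", "last_name": "Stark"}}'
record = json.loads(line)

first = get_nested(record, "name.first_name")
missing = get_nested(record, "name.middle_name")
```

A missing field resolves to None rather than raising, which fits the loose-schema handling of self-describing formats mentioned above.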
Query federation and Tablespace support
• Query support across multiple data sources
• You can perform join or union among tables on different systems.
• Benefits:
• Data offload from RDBMS to Hadoop and vice versa
• A mixed use of existing RDBMS and Hadoop
• Access to NoSQL and various storages through SQL
• A unified interface for SQL tools
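As a rough picture of what a join across different systems involves, the sketch below combines rows from a relational database with line-delimited JSON records. Everything here is a stand-in: an in-memory SQLite database plays the RDBMS, a Python list plays a file in HDFS or S3, and the table, field names, and federated_totals helper are hypothetical. Tajo does this inside its query engine; the sketch only shows the shape of the operation.

```python
import json
import sqlite3

# Source 1: an RDBMS, stood in for by an in-memory SQLite database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# Source 2: line-delimited JSON, as might sit in HDFS or S3.
order_lines = [
    '{"customer_id": 1, "amount": 30}',
    '{"customer_id": 1, "amount": 12}',
    '{"customer_id": 2, "amount": 5}',
]


def federated_totals(connection, json_lines):
    """Join both sources on customer id and total the amounts per name."""
    names = dict(connection.execute("SELECT id, name FROM customers"))
    totals = {}
    for line in json_lines:
        order = json.loads(line)
        name = names.get(order["customer_id"])
        if name is not None:
            totals[name] = totals.get(name, 0) + order["amount"]
    return totals


totals = federated_totals(db, order_lines)
```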
Datasets stored in Various Formats/Storages
[Figure: Apache Tajo on top of multiple storage types (HDFS, NoSQL, S3, Swift) and data formats (SequenceFile, RCFile, Protocol Buffer, ORC)]
Tablespace
• Tablespace
• Registered storage space
• A tablespace is identified by a unique URI
• Configuration and policy are shared by all tables in the same
tablespace
• It allows users to reuse registered storages and their
configuration.
Tablespace Configuration
Tablespace name
Tablespace URI
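The configuration example on this slide was an image and is not available in this transcript. The Python sketch below assumes a JSON layout that maps each tablespace name to a URI, modeled on the two callouts above. The key names ("spaces", "uri"), the host names, and the helper functions are assumptions for illustration; check the Tajo documentation for the actual file format.

```python
import json

# Hypothetical configuration: tablespace name -> tablespace URI.
CONFIG_TEXT = """
{
  "spaces": {
    "warehouse": {"uri": "hdfs://namenode.example:8020/tajo/warehouse"},
    "hbase1": {"uri": "hbase:zk://zk1.example:2181,zk2.example:2181/"}
  }
}
"""


def load_tablespaces(text):
    """Map each tablespace name to its URI and reject duplicate URIs."""
    spaces = json.loads(text)["spaces"]
    by_name = {name: spec["uri"] for name, spec in spaces.items()}
    if len(set(by_name.values())) != len(by_name):
        raise ValueError("each tablespace must have a unique URI")
    return by_name


def storage_scheme(uri):
    """Return the part of the URI before the first colon, e.g. 'hdfs'."""
    return uri.split(":", 1)[0]


tablespaces = load_tablespaces(CONFIG_TEXT)
```

Once registered, a tablespace is referred to by name only, so tables need not repeat the storage URI or its settings.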
Create Table on a specified Tablespace
> CREATE TABLE uptodate (key TEXT, …) TABLESPACE hbase1;
> CREATE TABLE archive (l_orderkey bigint, …) TABLESPACE warehouse
USING text WITH ('text.delimiter' = '|');
• hbase1 and warehouse are tablespace names
• text is the format name
Operation Push Down
SELECT x, SUM(y)
FROM table1
WHERE x > 100
GROUP BY x
• Filter, projection, or group-by operations can be pushed down into the
underlying storage (e.g., RDBMS, HBase,
Elasticsearch, …)
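The effect of push down can be seen in a small experiment. In the Python sketch below, an in-memory SQLite database stands in for an underlying storage that can evaluate SQL itself; the rows are made up and the two helper functions are hypothetical. One function fetches every row and filters and aggregates in the engine; the other hands the whole query to the storage and fetches only the result rows.

```python
import sqlite3

# Stand-in for an underlying storage system that can evaluate SQL itself.
storage = sqlite3.connect(":memory:")
storage.execute("CREATE TABLE table1 (x INTEGER, y INTEGER)")
storage.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(50, 1), (150, 2), (150, 3), (200, 4), (90, 5)],
)


def without_pushdown(conn):
    """Fetch every row, then filter and aggregate in the query engine."""
    rows = conn.execute("SELECT x, y FROM table1").fetchall()
    sums = {}
    for x, y in rows:
        if x > 100:
            sums[x] = sums.get(x, 0) + y
    return sums, len(rows)


def with_pushdown(conn):
    """Let the storage run the filter and aggregation; fetch only results."""
    rows = conn.execute(
        "SELECT x, SUM(y) FROM table1 WHERE x > 100 GROUP BY x"
    ).fetchall()
    return dict(rows), len(rows)


plain_result, plain_rows = without_pushdown(storage)
pushed_result, pushed_rows = with_pushdown(storage)
```

Both paths return the same sums, but the pushed-down version moves two rows out of the storage instead of five. On real data sets that difference is what saves network and scan cost.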
Current Status of Storages
• Storages:
• HDFS support
• Amazon S3 and OpenStack Swift
• HBase Scanner and Writer - HFile and Put Mode
• JDBC-based Scanner and Writer (work in progress)
• Automatic metadata registration (work in progress)
• Kafka, Elasticsearch (Patch Available)
• Data Formats
• Text, JSON, RCFile, SequenceFile, Avro, Parquet, and ORC
(Patch Available)
Python UDF
• Python UDF and UDAF are supported in Tajo
• http://tajo.apache.org/docs/devel/functions/python.html
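# output_type comes from Tajo's Python helper module (see the documentation linked above)
from tajo_util import output_type

# Returns the constant 1 as an INT4
@output_type('int4')
def return_one():
    return 1

# Returns a fixed greeting as TEXT
@output_type('text')
def helloworld():
    return 'Hello, World'

# Adds two values and returns the sum as an INT4
@output_type('int4')
def sum_py(a, b):
    return a + b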
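Outside a Tajo installation, the output_type decorator used above may not be importable. To try such functions locally, a stand-in decorator can be defined as in the sketch below. This stand-in is an assumption for illustration and is not Tajo's implementation: it only records the declared type on the function and performs no type conversion.

```python
def output_type(type_name):
    """Local stand-in: attach the declared output type to the function."""
    def decorate(func):
        func.output_type = type_name
        return func
    return decorate


@output_type('int4')
def sum_py(a, b):
    return a + b


@output_type('text')
def helloworld():
    return 'Hello, World'


# The functions stay plain Python callables, so they can be tested directly.
result = sum_py(2, 3)
declared = sum_py.output_type
```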
Improved Standalone Scheduler
• Standalone FIFO scheduler
• Before
• only one running query at a time was allowed
• After
• multiple running queries are allowed at a time
• resizable resource allocation of running queries
• Future work after 0.11
• Multiple queue support
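The before/after behavior can be pictured with a toy model. The Python sketch below is not Tajo's scheduler: the class, the notion of fixed "slots" per query, and the numbers are all assumptions for illustration. It shows only the FIFO admission idea, where several queries run at once as long as resources remain and later queries wait their turn. Resizing the allocation of a running query is not modeled.

```python
from collections import deque


class FifoScheduler:
    """Toy FIFO scheduler: admits queued queries in order while slots remain."""

    def __init__(self, total_slots):
        self.free_slots = total_slots
        self.queue = deque()  # waiting (query_id, slots) pairs, oldest first
        self.running = {}     # query_id -> slots currently held

    def submit(self, query_id, slots):
        self.queue.append((query_id, slots))
        self._admit()

    def finish(self, query_id):
        self.free_slots += self.running.pop(query_id)
        self._admit()

    def _admit(self):
        # Strict FIFO: stop at the first queued query that does not fit.
        while self.queue and self.queue[0][1] <= self.free_slots:
            query_id, slots = self.queue.popleft()
            self.running[query_id] = slots
            self.free_slots -= slots


scheduler = FifoScheduler(total_slots=4)
scheduler.submit("q1", 2)
scheduler.submit("q2", 2)
scheduler.submit("q3", 1)  # no free slots left, so q3 waits
running_before = sorted(scheduler.running)
scheduler.finish("q1")     # frees two slots, so q3 is admitted
running_after = sorted(scheduler.running)
```

With total_slots set so that only one query fits at a time, the same class reproduces the "before" behavior of a single running query.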
Get Involved!
• We are recruiting contributors!
• General
• http://tajo.apache.org
• Getting Started
• http://tajo.apache.org/docs/0.10.0/getting_started.html
• Downloads
• http://tajo.apache.org/downloads.html
• Jira – Issue Tracker
• https://issues.apache.org/jira/browse/TAJO
• Join the mailing list
• dev-subscribe@tajo.apache.org
• issues-subscribe@tajo.apache.org
Q&A

Tajo Seoul Meetup July 2015 - What's New Tajo 0.11

  • 1. What’s New Tajo 0.11 Tajo Seoul Meetup 2015. 07 Hyunsik Choi, Gruter Inc.
  • 2. Agenda • Tajo Overview • Milestones and 0.10 Features • What’s New in 0.11.
  • 3. Tajo: A Big Data Warehouse System • Apache Top-level project • Distributed and scalable data warehouse system on various data sources (e.g., HDFS, S3, HBase, …) • Low-latency and long-running batch queries in a single system • Features • ANSI SQL compliance • Mature SQL features • Partitioned table support • Java/Python UDF support • JDBC driver and Java-based asynchronous API • Read/Write support of CSV, JSON, RCFile, SequenceFile, Parquet, ORC
  • 4. Tajo Overall Architecture [diagram] • Client: JDBC, TSql, Web UI • Master Server: TajoMaster • CatalogStore: DBMS, HCatalog • Slave Servers (three shown): TajoWorker with QueryMaster, Local Query Engine, and StorageManager, each on HDFS/HBase • Arrow labels: Submit a query; Manage metadata; Allocate a query; send tasks & monitor
  • 5. Common Scenarios • Extraction, Transformation, Loading (ETL) • Interactive BI/analytics on web-scale big data • Data discovery/Exploratory analysis with R and existing SQL tools
  • 6. Use Cases: Replacement of Commercial DW • Example: Biggest Telco Company in South Korea • Goal: • Replacement of slow ETL workloads on several-TB datasets • Generation of many daily reports on users’ behaviors • Ad-hoc analysis on terabyte-scale data sets • Key Benefits of Tajo: • Simplification of DW ETL, OLAP, and Hadoop ETL into a unified system • Saved license costs compared with a commercial DW • Much lower cost, more data analysis within the same SLA
  • 7. Use Cases: Data Discovery • Example: Music streaming service (26 million users) • Goal: • Analysis on purchase history for target marketing • Benefits: • Query interactivity on large data sets • Ability to use existing BI visualization tools
  • 8. When is Tajo the right choice? • You want a unified system for batch and interactive queries on Hadoop, Amazon S3, or HBase. • You want a mixed use of Hadoop-based DW and RDBMS-based DW, or want to replace an existing RDBMS DW. • You want to use existing SQL tools on a Hadoop DW
  • 9. Milestones [timeline] • 0.8: More features & SQL compatibility • 0.9: Stability & analytical functions • 0.10: Eco-system expansion • 0.11: More features • Python UDF • Nested Schema • Tablespace support • Basic Query federation • Better query scheduler
  • 11. HBase Storage Support • You can use SQL to access HBase tables. • Tajo supports HBase storage • CREATE (EXTERNAL)/DROP/INSERT (OVERWRITE)/SELECT • Bulk insertion through direct HFile writing CREATE TABLE hbase_t1 (key TEXT, col1 TEXT, col2 INT) USING hbase WITH ('table' = 't1', 'columns' = ':key,cf1:col1,cf2:col2', 'hbase.zookeeper.quorum' = 'host1:2181,host2:2181')
  • 12. Better AWS support • Optimized for S3 and EMR environments • Fixed many bugs related to S3 • EMR bootstrap supported in AWS Labs GitHub repo • A quick guide for Tajo on EMR • http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/ • EMR bootstrap for Tajo on EMR • https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo
  • 13. Better SQL tool support via thin JDBC [diagram] • ETL tools, BI tools, reporting tools • Tajo JDBC • Tajo Cluster • Storages: HDFS, HBase, S3, Swift
  • 15. Improved Performance and Stability • Offheap sort operator for ORDER BY (TAJO-907) • Hash shuffle IO improvement (TAJO-374, TAJO-987) • Skewness handling of hash shuffle • Automatic parallel degree choice during runtime • Lots of query optimizer improvements • Add Master HA (TAJO-704) • More error-tolerant shuffle fetch (TAJO-789, TAJO-953)
  • 16. What’s New in Tajo 0.11
  • 17. Nested data and JSON support • Nested data is becoming common • JSON, BSON, XML, Protocol Buffer, Avro, Parquet, … • Many web applications commonly use JSON. • MongoDB uses JSON documents by default • Many HBase users also store JSON documents in a cell. • Flattening causes lots of data/computation overhead. • Tajo 0.11 natively supports nested data types.
  • 18. How to create a nested schema table Use the ‘RECORD’ keyword to define a complex data type
  • 19. Loose schema for self-describing formats You can handle schema evolution with ALTER ADD COLUMN!
  • 20. How to retrieve nested fields Input Data Table Definition SQL
  • 21. Query federation and Tablespace support • Query support across multiple data sources • You can perform join or union among tables on different systems. • Benefits: • Data offload from RDBMS to Hadoop and vice versa • A mixed use of existing RDBMS and Hadoop. • Access to NoSQL and various storages through SQL • A unified interface for SQL tools • [diagram: Apache Tajo over HDFS, NoSQL, S3, Swift]
  • 23. Tablespace • Registered storage space • A tablespace is identified by a unique URI • Configuration and policy are shared by all tables in the same tablespace • It allows users to reuse registered storages and their configuration.
  • 25. Create Table on a specified Tablespace > CREATE TABLE uptodate (key TEXT, …) TABLESPACE hbase1; > CREATE TABLE archive (l_orderkey bigint, …) TABLESPACE warehouse USING text WITH ('text.delimiter' = '|'); (callouts: Tablespace name, Format name)
  • 26. Operation Push Down SELECT x, SUM(y) FROM table1 WHERE x > 100 GROUP BY x • Filter, projection, or GROUP BY can be pushed down into the underlying storage (such as RDBMS, HBase, Elasticsearch, …)
  • 27. Current Status of Storages • Storages: • HDFS support • Amazon S3 and OpenStack Swift • HBase Scanner and Writer - HFile and Put Mode • JDBC-based Scanner and Writer (in progress) • Automatic metadata registration (in progress) • Kafka, Elasticsearch (Patch Available) • Data Formats • Text, JSON, RCFile, SequenceFile, Avro, Parquet, and ORC (Patch Available)
  • 28. Python UDF • Python UDF and UDAF are supported in Tajo • http://tajo.apache.org/docs/devel/functions/python.html @output_type('int4') def return_one(): return 1 @output_type('text') def helloworld(): return 'Hello, World' @output_type('int4') def sum_py(a,b): return a+b
  • 29. Improved Standalone Scheduler • Standalone FIFO scheduler • Before • only one running query was allowed at a time • After • multiple running queries are allowed at a time • resizable resource allocation for running queries • Future work after 0.11 • Multiple queue support
  • 30. Get Involved! • We are recruiting contributors! • General • http://tajo.apache.org • Getting Started • http://tajo.apache.org/docs/0.10.0/getting_started.html • Downloads • http://tajo.apache.org/downloads.html • Jira – Issue Tracker • https://issues.apache.org/jira/browse/TAJO • Join the mailing list • dev-subscribe@tajo.apache.org • issues-subscribe@tajo.apache.org
  • 31. Q&A
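
The Python UDF snippet on slide 28 lost its line breaks in extraction. Laid out as a file, it might look like the sketch below. The `output_type` decorator defined here is a local stand-in written for this example so the file runs outside Tajo, and the attribute name `output_type_name` is hypothetical; in a real deployment the decorator comes from Tajo's Python UDF support (see the documentation URL on slide 28).

```python
# Stand-in for Tajo's decorator so this file runs on its own.
# Assumption: the real decorator is supplied by Tajo's Python UDF support.
def output_type(type_name):
    def decorate(func):
        # Record the declared SQL return type (hypothetical attribute name).
        func.output_type_name = type_name
        return func
    return decorate


@output_type('int4')
def return_one():
    return 1


@output_type('text')
def helloworld():
    return 'Hello, World'


@output_type('int4')
def sum_py(a, b):
    return a + b


if __name__ == '__main__':
    # Quick local check of the three functions from the slide.
    print(return_one(), helloworld(), sum_py(1, 2))
```

Each function is plain Python, so it can be checked locally before it is registered with a cluster; the decorator only declares the SQL type of the return value.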

Editor's Notes

  1. This is the agenda for today's talk. First, I’ll give an overview of the Tajo project. Then, I’ll talk about the milestones and the new features in 0.10. Finally, I’ll discuss the upcoming release.
  2. We did a lot of work for the 0.10 release, much of it related to ecosystem expansion.
  3. In 0.10, we integrated HBase storage into Tajo. Users can use SQL to access HBase tables. In Tajo, you can run CREATE TABLE, INSERT, and SELECT queries, including joins and aggregations. In particular, Tajo supports bulk loading through direct HFile writing.
  4. One of the main improvements in 0.10 is better AWS support. We extensively tested Tajo on EMR. Basically, Tajo accesses S3 through the S3 implementation of HDFS. Because S3 is different from HDFS, we had to optimize Tajo’s S3 support, and we fixed many S3-related bugs. For example, S3 does not support a move operation, so we had to find a different way to handle the temporary staging data used when writing tables. While improving S3 support, we also wrote an EMR bootstrap, a script that launches Tajo on the EMR service. This work was committed to the AWS Labs repository. You can easily launch a Tajo cluster on EMR by using this script.
  5. We also refactored the Tajo JDBC driver to be thin. Unlike other systems, the Tajo thin JDBC driver does not require extra classpath entries, and its compatibility is very high. You can use many JDBC-based SQL tools to access Tajo. We tested the driver with Spotfire, Burst, and Pentaho.
  6. We also integrated Tajo with Zeppelin. Zeppelin is the most promising open-source data science tool in the Hadoop ecosystem. It lets users access execution engines and visualize the results in a single platform without context switching. The Tajo team also submitted the integration patch to the Zeppelin community, so you can simply use Zeppelin to access Tajo.
  7. We also improved query performance and stability. We introduced an off-heap sort to avoid GC overhead during large sorts. We also improved shuffle performance through several issues. Shuffle is an essential distributed operation for joins and aggregations. We also did a lot of work on high availability.
  8. As you can see, Tajo provides the RECORD keyword to describe complex nested data types. You can use complex types for nested file formats like Parquet and Avro. This is an example CREATE TABLE statement for such JSON data. Here you can see an issue with schemas for self-describing formats. Basically, Tajo currently needs a schema definition for each table, so you still need to define some schema even for a self-describing format like JSON. But a strict schema definition does not make sense for JSON, because being schemaless is one of the main reasons we use JSON.
  9. So, we introduced loose schemas for self-describing formats. With loose schema support, you only need to define the columns you want to project. Because many file formats like Parquet, Avro, and ORC are self-describing, this feature is very important. As the example shows, you can use various schema definitions like these against the same data set. If there is no value corresponding to a column definition, a null value is returned. Later, we plan to support schema-on-read, which is a way to infer or recognize the schema from a self-describing file. After that, you will be able to omit the schema definition.
  10. How do we retrieve nested fields? You can use dot notation to access them. For example, this column expression lets Tajo access the nested field ‘first name’ under ‘name’. Nested schema support is still evolving in the Tajo project, and we will add more features for it.
  11. One of the main features of 0.11 is query federation and tablespace support. You can run a single query across multiple data sources. This feature has various benefits. You can offload data stored in an RDBMS to Hadoop. It is also very helpful to have a unified SQL interface for accessing various storages.
  12. In 0.11, you can use these data formats and storage types.
  13. This example shows a tablespace configuration. This configuration sets up two tablespaces named warehouse and hbase1. After you have made such a configuration,
  14. Of course, Tajo pushes down filters and projections into the underlying storage. We also expect that aggregation can be pushed down into underlying storages such as an RDBMS. This work is still in progress.
  15. We also support writing to HBase tables. Tajo supports bulk loading for HBase. This approach writes an HFile directly and lets HBase load it. Tajo also supports put mode. With put mode, you can instantly insert rows into HBase by using an INSERT statement. Patches for Kafka and Elasticsearch are already available. A patch for the ORC scanner is also available. We plan to add ORC scanning and writing to 0.11.
  16. Many data scientists have asked us to support Python UDFs, so we added this feature in 0.11. Tajo supports UDFs as well as UDAFs.
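
Notes 9 and 10 describe two behaviors: a loose schema projects only the declared columns and returns null when a record has no matching value, and dot notation reaches into nested fields. The sketch below imitates those semantics in plain Python. It is not Tajo's implementation; the function name `project`, the sample record, and its field names are made up for illustration.

```python
import json


def project(record, columns):
    """Return one row holding only the requested columns.

    Each column is a dot-separated path such as 'name.first_name'.
    A path with no matching value yields None, standing in for SQL NULL.
    """
    row = {}
    for path in columns:
        value = record
        for key in path.split('.'):
            if isinstance(value, dict) and key in value:
                value = value[key]
            else:
                # Missing field: a loose schema reads it as null.
                value = None
                break
        row[path] = value
    return row


# Hypothetical self-describing input line (JSON).
line = '{"title": "example", "name": {"first_name": "Ada", "last_name": "Lovelace"}}'
record = json.loads(line)

# Declare only the columns of interest; 'age' does not exist in the data.
print(project(record, ['title', 'name.first_name', 'age']))
```

The caller names only the columns it cares about, which is the point of the loose schema: fields that are present but undeclared are ignored, and declared fields that are absent come back as null rather than causing an error.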