
Apache Tajo - BWC 2014

Apache Tajo: A Big Data Warehouse System on Hadoop

Presented by Jae-hwa Jeong, Apache Tajo committer and senior research engineer at Gruter, at Bigdata World Convention 2014 on Oct. 23 in Busan, Korea

Published in: Data & Analytics

  1. Apache Tajo: A Big Data Warehouse System on Hadoop
     Jaehwa Jung, Apache Tajo PMC & Research Director at Gruter
     BWC 2014
  2. Agenda
     • What is Apache Tajo?
     • What can you do with Tajo?
     • Why should you use Tajo?
     • Use Cases
     ©2014 Gruter. All rights reserved.
  3. What is Apache Tajo?
  4. Apache Tajo Overview
     • A big data warehouse system on Hadoop
     • Apache Top-level project since March 2014
     • Supports SQL standards
     • Features
       – Powerful distributed processing architecture (not MapReduce)
       – Advanced query optimization algorithms and techniques
       – Long-running queries: for many hours
       – Interactive analysis queries: from 100 milliseconds
     • Recent 0.9.0 release
  5. Tajo Architecture (diagram)
     • Client: JDBC, TSql, Web UI
     • Master Server: TajoMaster
     • CatalogStore: DBMS, HCatalog
     • Slave Server (three shown), each with:
       – TajoWorker: QueryMaster, Local Query Engine, StorageManager
       – Local FileSystem
       – HDFS
     • Diagram labels: Submit a query; Manage metadata; Allocate a query; Run & monitor a query
  6. What Can You Do with Tajo?
  7. Commercial Data Warehouse (diagram)
     • Source Data: OLTP, CRM, ERP, ecommerce, Other
     • ODS (Operational Data Store)
     • Data Warehouse: Data Warehouse, Data Mart
     • Front-End Analytics: Reports, OLAP, Visualization, Data Mining
     • ETL steps connect the stages
  8. Hadoop-based Data Warehouse with Tajo (diagram)
     • We can do ETL and Interactive Analytics!
     • Source Data: OLTP, CRM, ERP, ecommerce, Other
     • ODS
     • Data Warehouse: Data Warehouse, Data Mart
     • Front-End Analytics: Reports, OLAP, Visualization, Data Mining
     • ETL steps connect the stages
  9. Why Should You Use Tajo?
  10. Mature SQL Feature Set
     • Fully distributed query execution
       – Inner join, and left/right/full outer join
       – Group by, sort, multiple distinct aggregation
       – Window functions
     • SQL data types
       – CHAR, BOOL, INT, DOUBLE, TEXT, DATE, etc.
     • Various file formats
       – Text file (CSV), SequenceFile, RCFile, Parquet, Avro
     • SQL standards
       – Non-standard features: PgSQL and Oracle
  11. Performance and Speed
     • Faster than Hive 0.10 (1.5 – 10 times)
       – http://slidesha.re/1yTBTaa
     • Tajo vs Hive on Tez?
     • Tajo vs Impala?
  12. Simple Operation and Software Stack
     • Simple installation and operation
       – http://tajo.apache.org/docs/current/getting_started.html
     • Simple software stack requirement
       – No MapReduce and no Tez
       – YARN is supported but not mandatory
       – Tajo + Linux system for a single-node cluster
       – Tajo + HDFS for a distributed cluster
  13. Simple Integration
     • Integration with the Hadoop ecosystem
       – Hadoop 2.2.0 – 2.5.1 support
       – Can connect to Hive Metastore
       – Directly processes tables managed by Hive
     • YARN support (backport)
       – Enables Tajo to deploy and run on a YARN cluster
       – Allows users to add/remove cluster nodes to/from a Tajo cluster at runtime
  14. Active Open Source Community
     • Fully community-driven open source
     • Stable development team
       – 17 committers + many contributors
  15. Use Cases
  16. Replace Commercial Data Warehouse (SKT)
     • ETL Processing: 120+ queries, ~4TB read/day
     • OLAP Processing: 500+ queries
     • Diagram layers: Operational Systems, Integration Layer, Data Warehouse, Data Mart
     • Diagram components: Marketing, Sales, ERP, SCM, ODS, Staging Area, Strategic Marts, Data Vault
  17. Tajo-as-a-Service on AWS
  18. About Gruter, Inc
     • Big Data platform company since 2006.2
     • Hadoop platforms, Hadoop ecosystem consulting, Big data analytics
     • Major sponsor of open-source Tajo, including PMC chair and 5 full-time committers
     • Bringing Tajo to Enterprise on Cloud and on Premises
     • Based in Palo Alto, USA and Seoul, Korea
  19. GRUTER: YOUR PARTNER IN THE BIG DATA REVOLUTION
     • Phone: +82-70-8129-2950
     • Fax: +82-70-8129-2952
     • E-mail: contact@gruter.com
     • Web: www.gruter.com
     • Phone: +1-415-841-3345
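
The architecture slide (slide 5) labels four steps: submit a query, manage metadata, allocate a query, and run & monitor a query. The toy model below is only a sketch of that flow in plain Python. It is not Tajo code: the class names `Master` and `Worker`, their methods, and the rule of placing the QueryMaster on the first worker are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Toy stand-in for a slave server running a TajoWorker (hypothetical)."""
    name: str
    log: list = field(default_factory=list)

    def run_query_master(self, sql, workers):
        # QueryMaster role: run and monitor the query across the workers.
        self.log.append(f"QueryMaster on {self.name}: run & monitor {sql!r}")
        return [w.execute(sql) for w in workers]

    def execute(self, sql):
        # Local query engine role: process this worker's share of the data.
        return f"{self.name}: done"


@dataclass
class Master:
    """Toy stand-in for the TajoMaster (hypothetical)."""
    workers: list
    catalog: dict = field(default_factory=dict)  # stand-in for the CatalogStore

    def submit(self, sql):
        # "Allocate a query": pick a worker to host the QueryMaster.
        # Choosing the first worker is an arbitrary placeholder policy.
        qm_host = self.workers[0]
        return qm_host.run_query_master(sql, self.workers)


master = Master([Worker("w1"), Worker("w2")])
result = master.submit("SELECT 1")
```

Here the client role is played by the final two lines: the query goes to the master, which hands it to a worker-hosted QueryMaster rather than executing it itself.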

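Slide 10 lists outer joins, group by, and window functions among the supported SQL features. The sketch below shows what such queries look like, using Python's built-in `sqlite3` module purely as a local stand-in engine (window functions need SQLite 3.25 or newer). The tables `customers` and `orders` are invented for the example, and Tajo's own dialect and client interfaces may differ.

```python
import sqlite3

# In-memory database used only as a stand-in; this does not connect to Tajo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'kim'), (2, 'lee'), (3, 'park');
    INSERT INTO orders VALUES (1, 10.0), (1, 30.0), (2, 5.0);
""")

# Left outer join + GROUP BY: customers without orders still appear.
totals = conn.execute("""
    SELECT c.name, COALESCE(SUM(o.amount), 0) AS total
    FROM customers c
    LEFT OUTER JOIN orders o ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# Window function: rank each order by amount within its customer.
ranked = conn.execute("""
    SELECT customer_id, amount,
           RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rnk
    FROM orders
    ORDER BY customer_id, rnk
""").fetchall()

conn.close()
```

The left outer join keeps 'park' in the result with a total of 0 even though that customer has no orders, which is the behavior that distinguishes it from an inner join.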