Apache Tajo:
A Big Data Warehouse System
on Hadoop
Hyunsik Choi
Director of Research, Gruter
Big Data Camp LA 2014

Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop


Apache Tajo is an open-source big data warehouse system on Hadoop. These slides were presented at Big Data Camp LA 2014. They introduce Apache Tajo and describe the current status of the project, including cost-based optimization and the currently supported SQL feature set.

Published in: Software, Technology


  1. Apache Tajo: A Big Data Warehouse System on Hadoop
     Hyunsik Choi
     Director of Research, Gruter
     Big Data Camp LA 2014
  2. Talk Outline
     • Introduction to Apache Tajo
     • What you can do with Tajo
     • Why you should use Tajo
     • Current Status of Tajo Project
     • Demonstration
  3. About Me
     • Hyunsik Choi (pronounced “Hyeon-shick Cheh”)
     • PhD (Computer Science & Engineering, 2013), Korea Univ.
     • Director of Research, Gruter Corp
     • Open-source Involvement
       – Full-time contributor to Apache Tajo (2013.6 ~ )
       – Apache Tajo PMC member and committer (2013.3 ~ )
       – Apache Giraph PMC member and committer (2011.8 ~ )
     • Contact Info
       – Email: hyunsik@apache.org
       – LinkedIn: http://linkedin.com/in/hyunsikchoi/
  4. Apache Tajo
     • Open-source “SQL-on-H” “Big DW” system
     • Apache Top-level project since March 2014
     • Supports SQL standards
     • Low-latency queries and long-running batch queries
     • Features
       – Supports joins (inner and all outer), group-by, and sort
       – Window function
       – Most SQL data types supported (except for Decimal)
     • Recent 0.8.0 release
       – https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0
  5. Overall Architecture
  6. What You Can Do with Tajo
     • Batch queries
       – Long-running queries (~ hours)
         • Dynamic Scheduling
         • Fault Tolerance
       – ETL workloads
     • Interactive Ad-hoc Queries
       – Very low latency (100 ms ~)
       – A few seconds on a dataset of several TB if your cluster capacity is sufficient
  7. Why You Should Use Tajo
     • SQL Standards
       – Non-standard features: PgSQL and Oracle
     • Simple Installation and Operation
       – http://tajo.apache.org/docs/0.8.0/getting_started.html
     • Simple Software Stack Requirement
       – No MapReduce and no Tez
       – Yarn support, but not mandatory
       – Tajo + Linux system for a single-node cluster
       – Tajo + HDFS for a distributed cluster
  8. Why You Should Use Tajo
     • Mature SQL Feature Set
       – Fully distributed query execution
         • Inner join, and left/right/full outer join
         • Group-by, sort, multiple distinct aggregation, window function
       – SQL data types
         • CHAR, BOOL, INT, BIGINT, REAL, DOUBLE, and TEXT
         • TIMESTAMP, DATE, TIME, and INTERVAL
         • DECIMAL (in progress)
       – Various file formats
         • Text file (CSV), RCFile, Parquet (flat schema), and Avro (flat schema)
  9. Why You Should Use Tajo
     • Fully community-driven open source
     • Stable development team
       – 5 full-time contributors + many other contributors
     • Performance and speed
       – Faster than Hive 0.10 (1.5 – 10 times)
       – Tajo vs. Hive 0.13?
       – Tajo vs. Impala?
  10. Why You Should Use Tajo
      • Integration with Hadoop Ecosystem
        – Hadoop 2.2.0 – 2.4.0 support
        – Can connect to Hive Metastore
        – Directly processes tables managed by Hive
        – Yarn support (backport)
          • Enables Tajo to deploy and run on a Yarn cluster
          • Allows users to add/remove cluster nodes to/from a Tajo cluster at runtime
          • Contributed by Min Zhou (committer), LinkedIn engineer
          • https://github.com/coderplay/tajo-yarn
  11. Current Status – Overall
      • In beta stage: the majority of key features are getting ready
      • Most SQL features implemented
      • Working on hundreds of clusters for production
        – Collaboration with the biggest telco in S. Korea
      • We have just started work on low-level optimization.
        – Runtime byte code generation (v0.9)
        – Unsafe-based hash table for hash aggregation/join
        – Vectorized execution engine
  12. Current Status – Logical Plan Optimizer
      • Basic Rewrite Rule
        – Common subexpression elimination
        – Constant folding (CF) and null propagation
      • Projection Push Down (PPD)
        – Push expressions to operators as low as possible
        – Narrow the columns to be read
        – Remove duplicated expressions
          • when several expressions share a common subexpression
      • Filter Push Down (FPD)
        – Reduce the rows to be processed as early as possible
      • Extensible Rewrite Rule
        – Allows developers to write their own rewrite rules
  13. Current Status – Logical Plan Optimizer
      Original:
        SELECT
          item_id,
          order_id,
          sum_price * (1.2 * 0.3) AS total
        FROM (
          SELECT item_id, order_id, sum(price) AS sum_price
          FROM ITEMS
          GROUP BY item_id, order_id
        ) a
        WHERE item_id = 17234
      Rewritten:
        SELECT item_id, order_id, sum(price) * 0.36 AS total
        FROM ITEMS
        WHERE item_id = 17234
        GROUP BY item_id, order_id
      • CF + PPD: the constant 1.2 * 0.3 is folded to 0.36 and the expression is pushed down to the aggregation
      • FPD: the filter on item_id is applied before the aggregation
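The constant folding step in this example can be sketched in a few lines. This is an illustrative toy and not Tajo's optimizer: the tuple-based expression tree and the `fold` function are assumptions made up for the sketch.

```python
import operator

# Hypothetical expression tree: a number, a column name (str),
# or a tuple (op, left, right). Not Tajo's internal plan representation.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(expr):
    """Recursively replace constant-only subtrees with their value."""
    if not isinstance(expr, tuple):
        return expr  # literal or column reference: nothing to fold
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)  # both sides constant: evaluate at plan time
    return (op, left, right)

# sum_price * (1.2 * 0.3): the inner product is computed once, not per row
folded = fold(("*", "sum_price", ("*", 1.2, 0.3)))
```

After folding, the tree holds a single multiplication of `sum_price` by a constant, so the per-row work drops from two multiplications to one.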
  14. Current Status – Logical Plan Optimizer
      • Cost-based Join Order (since v0.2)
        – No need to guess the right join order anymore
        – Greedy heuristic algorithm
          • Results in a bushy join tree instead of a left-deep join tree
      [Figure: left-deep join tree vs. bushy join tree]
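A greedy join-ordering heuristic of this general kind can be sketched as follows. The slides do not describe Tajo's cost model, so the row counts, the selectivities, and the "smallest estimated result first" rule are all assumptions chosen for illustration.

```python
from itertools import combinations

def greedy_join_order(rows, selectivity):
    """rows: {table: estimated row count};
    selectivity: {frozenset({a, b}): selectivity of the a-b join predicate}.
    Returns a nested tuple describing the chosen join tree."""
    # Each plan is (tree, set of base tables, estimated rows)
    plans = [(t, frozenset([t]), n) for t, n in rows.items()]

    def estimate(p, q):
        # Multiply the selectivities of all predicates linking p and q;
        # return None when no predicate connects them (avoid cross joins).
        sel, connected = 1.0, False
        for a in p[1]:
            for b in q[1]:
                s = selectivity.get(frozenset((a, b)))
                if s is not None:
                    sel *= s
                    connected = True
        return p[2] * q[2] * sel if connected else None

    while len(plans) > 1:
        best = None
        for p, q in combinations(plans, 2):
            cost = estimate(p, q)
            if cost is not None and (best is None or cost < best[0]):
                best = (cost, p, q)
        if best is None:
            raise ValueError("join graph is disconnected")
        cost, p, q = best
        # Replace the two chosen sub-plans with their join
        plans = [x for x in plans if x is not p and x is not q]
        plans.append(((p[0], q[0]), p[1] | q[1], cost))
    return plans[0][0]

tree = greedy_join_order(
    {"a": 1000, "b": 10, "c": 2000, "d": 10},
    {frozenset("ab"): 0.01, frozenset("cd"): 0.01, frozenset("bc"): 0.1},
)
```

Because any two sub-plans may be merged, not only the current result with one more base table, the outcome here is the bushy tree `(("a", "b"), ("c", "d"))` rather than a left-deep chain.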
  15. Current Status – Window Function
      • OVER clause
        – row_number() and rank()
        – Aggregation function support
        – PARTITION BY and ORDER BY clauses
      SELECT depname, empno, salary, enroll_date
      FROM (
        SELECT depname, empno, salary, enroll_date,
               rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno) AS pos
        FROM empsalary
      ) AS ss
      WHERE pos < 3;
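To make the semantics of the query concrete, the sketch below emulates `rank() OVER (PARTITION BY ... ORDER BY ...)` in plain Python. The `rank_over` helper and the sample `empsalary` rows are invented for illustration, and `enroll_date` is left out for brevity.

```python
from itertools import groupby

def rank_over(rows, partition_key, order_key):
    """Emulate rank() OVER (PARTITION BY ... ORDER BY ...).
    Rows with equal order keys share a rank; the next rank skips ahead."""
    out = []
    for _, part in groupby(sorted(rows, key=partition_key), key=partition_key):
        rank, prev = 0, object()  # sentinel that never equals a real key
        for i, row in enumerate(sorted(part, key=order_key), start=1):
            k = order_key(row)
            if k != prev:
                rank, prev = i, k
            out.append({**row, "pos": rank})
    return out

# Made-up sample data
empsalary = [
    {"depname": "develop", "empno": 7, "salary": 4200},
    {"depname": "develop", "empno": 8, "salary": 6000},
    {"depname": "develop", "empno": 9, "salary": 4500},
    {"depname": "develop", "empno": 10, "salary": 5200},
    {"depname": "sales", "empno": 1, "salary": 5000},
    {"depname": "sales", "empno": 3, "salary": 4800},
    {"depname": "sales", "empno": 4, "salary": 4800},
]

ranked = rank_over(
    empsalary,
    partition_key=lambda r: r["depname"],
    order_key=lambda r: (-r["salary"], r["empno"]),  # salary DESC, empno ASC
)
# Equivalent of the outer query's WHERE pos < 3
top = [(r["depname"], r["empno"]) for r in ranked if r["pos"] < 3]
```

The outer filter keeps the two best-paid employees per department, with `empno` breaking salary ties.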
  16. Current Status – Join
      • Join
        – NATURAL, INNER, OUTER (LEFT, RIGHT, FULL)
        – SEMI, ANTI Join (planned for v0.9)
      • Join Predicates
        – WHERE and ON predicates
        – De facto standard outer join behavior with both predicates
      SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num WHERE t2.value = 'xxx';
      SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num AND t2.value = 'xxx';
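The difference between placing a predicate in WHERE and placing it in ON for an outer join can be checked with a small experiment. The sketch uses SQLite through Python's standard `sqlite3` module as a stand-in, on the assumption that it follows the same de facto standard behavior; it does not run against Tajo, and the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (num INTEGER);
    CREATE TABLE t2 (num INTEGER, value TEXT);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1, 'xxx'), (2, 'yyy');
""")

# Predicate in WHERE: applied after the join, so NULL-padded rows are dropped
where_rows = conn.execute("""
    SELECT t1.num, t2.value
    FROM t1 LEFT JOIN t2 ON t1.num = t2.num
    WHERE t2.value = 'xxx'
    ORDER BY t1.num
""").fetchall()

# Predicate in ON: only restricts matches, every t1 row is preserved
on_rows = conn.execute("""
    SELECT t1.num, t2.value
    FROM t1 LEFT JOIN t2 ON t1.num = t2.num AND t2.value = 'xxx'
    ORDER BY t1.num
""").fetchall()
conn.close()
```

With this data, the WHERE form returns only the row for `num = 1`, while the ON form returns all three `t1` rows, with `None` in `value` for the rows that have no qualifying match.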
  17. Current Status – Table Partitions
      • Column Value Partition
        – Hive Compatible Partition
      • Range Partition (planned for 1.0)
        – Table will be partitioned by disjoint ranges.
        – Will remove the partition granularity problem of Hive Partition
      CREATE TABLE T1 (C1 INT, C2 TEXT) USING PARQUET
      WITH ('parquet.compression' = 'SNAPPY')
      PARTITION BY COLUMN (C3 INT, C4 TEXT);
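Hive-style column value partitioning stores each distinct combination of partition values in its own `column=value` directory. The sketch below shows that layout for a table partitioned by C3 and C4; the helper names, the `/warehouse/t1` path, and the sample rows are hypothetical, and nothing here touches a real Tajo or Hive installation.

```python
from collections import defaultdict
from pathlib import PurePosixPath

def partition_path(table_dir, partition_cols, row):
    """Build a Hive-style directory path such as /warehouse/t1/c3=1/c4=a."""
    path = PurePosixPath(table_dir)
    for col in partition_cols:
        path /= f"{col}={row[col]}"
    return str(path)

def bucket_rows(table_dir, partition_cols, rows):
    """Group rows by partition directory. Partition column values live in
    the path, so they are dropped from the stored record."""
    buckets = defaultdict(list)
    for row in rows:
        data = {k: v for k, v in row.items() if k not in partition_cols}
        buckets[partition_path(table_dir, partition_cols, row)].append(data)
    return dict(buckets)

layout = bucket_rows(
    "/warehouse/t1",           # hypothetical table location
    ["c3", "c4"],
    [
        {"c1": 10, "c2": "x", "c3": 1, "c4": "a"},
        {"c1": 11, "c2": "y", "c3": 1, "c4": "a"},
        {"c1": 12, "c2": "z", "c3": 2, "c4": "b"},
    ],
)
```

A query that filters on `c3` or `c4` only needs to read the matching directories. The granularity problem mentioned on the slide follows from the same layout: a high-cardinality partition column produces one directory per distinct value.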
  18. Future Work
      • Multi-tenant Scheduler (v0.9)
        – Support multiple users and multiple queries
      • Runtime byte code generation for expressions (v0.9)
        – Eliminate the interpretation overhead of expression evaluation
      • Authentication and SQL Standard Access Control
      • JIT-based Vectorized Processing Engine
        – Refer to the Hadoop Summit 2014 slides (http://goo.gl/jWghhp)
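The idea behind runtime code generation for expressions can be illustrated with a rough Python analogy. Tajo targets JVM byte code, so this is only a sketch of the principle; the tuple expression format and the `interpret`, `to_source`, and `generate` functions are assumptions made up for the example.

```python
def interpret(expr, row):
    """Tree-walking evaluation: re-dispatches on node type for every row."""
    if isinstance(expr, tuple):
        op, left, right = expr
        a, b = interpret(left, row), interpret(right, row)
        return a * b if op == "*" else a + b
    return row[expr] if isinstance(expr, str) else expr

def to_source(expr):
    """Render the tree as a Python expression over a `row` dict."""
    if isinstance(expr, tuple):
        op, left, right = expr
        if op not in ("*", "+"):
            raise ValueError(f"unsupported operator: {op}")
        return f"({to_source(left)} {op} {to_source(right)})"
    return f"row[{expr!r}]" if isinstance(expr, str) else repr(expr)

def generate(expr):
    """Compile the tree once; the result runs per row without a tree walk."""
    code = compile(f"lambda row: {to_source(expr)}", "<generated>", "eval")
    return eval(code)

expr = ("+", ("*", "price", 2), "tax")
evaluate = generate(expr)  # generated once at plan time
result = evaluate({"price": 10, "tax": 3})
```

Both paths produce the same value for a row; the generated function pays the cost of walking the tree once, at generation time, instead of once per row.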
  19. Get Involved!
      • We are recruiting contributors!
      • General
        – http://tajo.apache.org
      • Getting Started
        – http://tajo.apache.org/docs/0.8.0/getting_started.html
      • Downloads
        – http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html
      • Jira – Issue Tracker
        – https://issues.apache.org/jira/browse/TAJO
      • Join the mailing list
        – dev-subscribe@tajo.apache.org
        – issues-subscribe@tajo.apache.org
  20. Demonstration
