Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop

Apache Tajo is an open source big data warehouse system on Hadoop. These slides were presented at Big Data Camp LA 2014. They introduce Apache Tajo and describe the current status of the project, including cost-based optimization and the currently supported SQL feature set.

Published in: Software, Technology

  1. 1. Apache Tajo: A Big Data Warehouse System on Hadoop
     Hyunsik Choi, Director of Research, Gruter
     Big Data Camp LA 2014
  2. 2. Talk Outline
     • Introduction to Apache Tajo
     • What you can do with Tajo
     • Why you should use Tajo
     • Current Status of Tajo Project
     • Demonstration
  3. 3. About Me
     • Hyunsik Choi (pronounced “Hyeon-shick Cheh”)
     • PhD (Computer Science & Engineering, 2013), Korea Univ.
     • Director of Research, Gruter Corp
     • Open-source Involvement
       – Full-time contributor to Apache Tajo (2013.6 ~ )
       – Apache Tajo PMC member and committer (2013.3 ~ )
       – Apache Giraph PMC member and committer (2011.8 ~ )
     • Contact Info
       – Email: hyunsik@apache.org
       – LinkedIn: http://linkedin.com/in/hyunsikchoi/
  4. 4. Apache Tajo
     • Open-source “SQL-on-H” “Big DW” system
     • Apache Top-level project since March 2014
     • Supports SQL standards
     • Handles both low-latency queries and long-running batch queries
     • Features
       – Supports joins (inner and all outer), GROUP BY, and sort
       – Window functions
       – Most SQL data types supported (except for DECIMAL)
     • Recent 0.8.0 release
       – https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0
  5. 5. Overall Architecture
  6. 6. What You Can Do with Tajo
     • Batch queries
       – Long-running queries (~ hours)
         • Dynamic scheduling
         • Fault tolerance
       – ETL workloads
     • Interactive ad-hoc queries
       – Very low latency (100 ms ~)
       – A few seconds on a dataset of several TB, if your cluster has enough capacity
  7. 7. Why You Should Use Tajo
     • SQL Standards
       – Non-standard features
       – PgSQL and Oracle
     • Simple Installation and Operation
       – http://tajo.apache.org/docs/0.8.0/getting_started.html
     • Simple Software Stack Requirement
       – No MapReduce and no Tez
       – YARN is supported but not mandatory
       – Tajo + Linux system for a single-node cluster
       – Tajo + HDFS for a distributed cluster
  8. 8. Why You Should Use Tajo
     • Mature SQL Feature Set
       – Fully distributed query execution
         • Inner join, and left/right/full outer join
         • GROUP BY, sort, multiple distinct aggregation, window functions
       – SQL data types
         • CHAR, BOOL, INT, BIGINT, REAL, DOUBLE, and TEXT
         • TIMESTAMP, DATE, TIME, and INTERVAL
         • DECIMAL (in progress)
       – Various file formats
         • Text file (CSV), RCFile, Parquet (flat schema), and Avro (flat schema)
  9. 9. Why You Should Use Tajo
     • Fully community-driven open source
     • Stable development team
       – 5 full-time contributors + many other contributors
     • Performance and speed
       – 1.5 to 10 times faster than Hive 0.10
       – Tajo vs. Hive 0.13?
       – Tajo vs. Impala?
  10. 10. Why You Should Use Tajo
      • Integration with Hadoop Ecosystem
        – Hadoop 2.2.0 – 2.4.0 support
        – Can connect to Hive Metastore
        – Directly processes tables managed by Hive
        – YARN support (backport)
          • Enables Tajo to deploy and run on a YARN cluster
          • Allows users to add/remove cluster nodes to/from a Tajo cluster at runtime
          • Contributed by Min Zhou (committer), LinkedIn engineer
          • https://github.com/coderplay/tajo-yarn
  11. 11. Current Status – Overall
      • In beta stage: the majority of key features are getting ready
      • Most SQL features implemented
      • Running on hundreds of clusters in production
        – Collaboration with the biggest telco in S. Korea
      • We’ve just started work on low-level optimization.
        – Runtime byte code generation (v0.9)
        – Unsafe-based hash table for hash aggregation/join
        – Vectorized execution engine
  12. 12. Current Status – Logical Plan Optimizer
      • Basic Rewrite Rule
        – Common subexpression elimination
        – Constant folding (CF) and null propagation
      • Projection Push Down (PPD)
        – Push expressions down to the lowest possible operators
        – Narrow the set of columns read
        – Remove duplicated expressions
          • when several expressions share a common subexpression
      • Filter Push Down (FPD)
        – Reduce the rows to be processed as early as possible
      • Extensible Rewrite Rule
        – Allow developers to write their own rewrite rules
  13. 13. Current Status – Logical Plan Optimizer
      Original:
        SELECT item_id, order_id, sum_price * (1.2 * 0.3) AS total
        FROM (
          SELECT item_id, order_id, sum(price) AS sum_price
          FROM ITEMS
          GROUP BY item_id, order_id
        ) a
        WHERE item_id = 17234
      Rewritten (after CF + PPD and FPD):
        SELECT item_id, order_id, sum(price) * (0.36) AS total
        FROM ITEMS
        WHERE item_id = 17234
        GROUP BY item_id, order_id
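The constant folding step in the rewrite above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Tajo's optimizer code: the tuple-based expression encoding and the name `fold_constants` are invented for this example.

```python
import operator

# Hypothetical expression encoding, invented for this sketch:
#   ("const", value) | ("col", name) | (op, left, right) with op in OPS
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(expr):
    """Return an equivalent expression with constant-only subtrees evaluated."""
    kind = expr[0]
    if kind in ("const", "col"):
        return expr  # leaves are already as simple as they can be
    left = fold_constants(expr[1])
    right = fold_constants(expr[2])
    if left[0] == "const" and right[0] == "const":
        # Both operands are known at planning time, so compute the value once.
        return ("const", OPS[kind](left[1], right[1]))
    return (kind, left, right)


# sum_price * (1.2 * 0.3), the select-list expression from the slide
total = ("*", ("col", "sum_price"), ("*", ("const", 1.2), ("const", 0.3)))
folded = fold_constants(total)
print(folded)
```

After folding, the multiplication of the two literals is gone and only one multiplication per row remains, which is the point of doing this work in the planner rather than in the executor.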
  14. 14. Current Status – Logical Plan Optimizer
      • Cost-based Join Order (since v0.2)
        – No need to guess the right join order anymore
        – Greedy heuristic algorithm
          • Results in a bushy join tree instead of a left-deep join tree
      (Figure: left-deep join tree vs. bushy join tree)
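To make the idea of a greedy, cost-based join order concrete, here is a minimal Python sketch. It is not Tajo's algorithm or cost model: the function `greedy_join_order`, the row counts, and the selectivities are all assumptions made up for illustration. The heuristic repeatedly joins the two sub-plans with the smallest estimated result, which can pair up independent sub-plans and so produce a bushy tree.

```python
from itertools import combinations


def greedy_join_order(cardinalities, selectivities):
    """Pick a join tree greedily.

    cardinalities: {table: estimated row count}
    selectivities: {frozenset({a, b}): selectivity of the predicate joining a and b}
    Returns a nested tuple such as (("a", "b"), ("c", "d")).
    """
    # Each plan is (tree, tables covered, estimated rows).
    plans = [(name, frozenset([name]), rows) for name, rows in cardinalities.items()]

    def estimate(p, q):
        # Start from the cross product, then apply every predicate linking the sides.
        rows = p[2] * q[2]
        connected = False
        for pair, sel in selectivities.items():
            a, b = tuple(pair)
            if (a in p[1] and b in q[1]) or (a in q[1] and b in p[1]):
                rows *= sel
                connected = True
        return rows, connected

    while len(plans) > 1:
        best = None
        for p, q in combinations(plans, 2):
            rows, connected = estimate(p, q)
            # Prefer pairs with a join predicate, then the smallest estimated result.
            key = (not connected, rows)
            if best is None or key < best[0]:
                best = (key, p, q, rows)
        _, p, q, rows = best
        plans.remove(p)
        plans.remove(q)
        plans.append(((p[0], q[0]), p[1] | q[1], rows))
    return plans[0][0]


# Made-up statistics for a chain of joins a-b, b-c, c-d.
tree = greedy_join_order(
    {"a": 1000, "b": 100, "c": 1000, "d": 100},
    {
        frozenset({"a", "b"}): 0.01,
        frozenset({"b", "c"}): 0.1,
        frozenset({"c", "d"}): 0.02,
    },
)
print(tree)
```

With these statistics the heuristic joins `a` with `b` and `c` with `d` first and only then combines the two intermediate results, so both inputs of the final join are themselves joins. A left-deep plan would always have a base table on one side.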
  15. 15. Current Status – Window Function
      • OVER clause
        – row_number() and rank()
        – Aggregation function support
        – PARTITION BY and ORDER BY clauses
      SELECT depname, empno, salary, enroll_date
      FROM (
        SELECT depname, empno, salary, enroll_date,
               rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno) AS pos
        FROM empsalary
      ) AS ss
      WHERE pos < 3;
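For readers new to window functions, the following Python sketch mimics what the query above computes: a rank within each department, ordered by salary descending and then by employee number, keeping only ranks below 3. The helper `rank_over` and the sample rows are invented for this example (the `enroll_date` column is left out for brevity), and this says nothing about how Tajo executes the query.

```python
from itertools import groupby


def rank_over(rows, partition_key, order_key):
    """Yield (row, rank) pairs like rank() OVER (PARTITION BY ... ORDER BY ...)."""
    ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    for _, group in groupby(ordered, key=partition_key):
        prev_key, rank = None, 0
        for position, row in enumerate(group, start=1):
            key = order_key(row)
            if key != prev_key:
                # A new sort key gets its position as rank; ties share the earlier rank.
                rank = position
                prev_key = key
            yield row, rank


# Made-up sample data standing in for the empsalary table.
empsalary = [
    {"depname": "develop", "empno": 11, "salary": 5200},
    {"depname": "sales", "empno": 3, "salary": 4800},
    {"depname": "develop", "empno": 8, "salary": 6000},
    {"depname": "sales", "empno": 1, "salary": 5000},
    {"depname": "develop", "empno": 10, "salary": 5200},
    {"depname": "sales", "empno": 4, "salary": 4800},
]

ranked = rank_over(
    empsalary,
    partition_key=lambda r: r["depname"],
    order_key=lambda r: (-r["salary"], r["empno"]),  # salary DESC, empno ASC
)
top_two = [(row["depname"], row["empno"]) for row, pos in ranked if pos < 3]
print(top_two)
```

Because `empno` is part of the ordering, no two rows in a department tie, so `rank()` behaves like `row_number()` here and the filter keeps exactly two rows per department.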
  16. 16. Current Status – Join
      • Join
        – NATURAL, INNER, OUTER (LEFT, RIGHT, FULL)
        – SEMI, ANTI Join (planned for v0.9)
      • Join Predicates
        – WHERE and ON predicates
        – De facto standard outer join behavior with both predicates
      SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num WHERE t2.value = 'xxx';
      SELECT * FROM t1 LEFT JOIN t2 WHERE t1.num = t2.num AND t2.value = 'xxx';
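Why the placement of a predicate matters for outer joins can be shown with a small experiment. The sketch below uses SQLite through Python's standard `sqlite3` module purely as a stand-in SQL engine, with made-up rows in `t1` and `t2`; it illustrates the widely implemented outer join semantics and is not a test of Tajo itself. The second query places the filter in the ON clause, which is a variation chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (num INTEGER);
    CREATE TABLE t2 (num INTEGER, value TEXT);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1, 'xxx'), (2, 'yyy');
""")

# Filter in WHERE: applied after the join, so null-extended rows are discarded.
where_rows = conn.execute("""
    SELECT t1.num, t2.value
    FROM t1 LEFT JOIN t2 ON t1.num = t2.num
    WHERE t2.value = 'xxx'
    ORDER BY t1.num
""").fetchall()

# Filter in ON: applied while matching, so every t1 row is still preserved.
on_rows = conn.execute("""
    SELECT t1.num, t2.value
    FROM t1 LEFT JOIN t2 ON t1.num = t2.num AND t2.value = 'xxx'
    ORDER BY t1.num
""").fetchall()

print(where_rows)
print(on_rows)
conn.close()
```

The WHERE version returns only the row for `num = 1`, while the ON version returns all three rows of `t1`, with NULL in `value` where no matching `t2` row passed the filter.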
  17. 17. Current Status – Table Partitions
      • Column Value Partition
        – Hive-compatible partition
      • Range Partition (planned for 1.0)
        – Table will be partitioned by disjoint ranges.
        – Will remove the partition granularity problem of Hive partitions
      CREATE TABLE T1 (C1 INT, C2 TEXT) USING PARQUET
      WITH ('parquet.compression' = 'SNAPPY')
      PARTITION BY COLUMN (C3 INT, C4 TEXT);
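As a rough picture of what column value partitioning means, the Python sketch below groups rows by their partition column values and names each group with a Hive-style `column=value` directory path. The function `partition_paths`, the table directory `t1`, and the sample rows are assumptions for illustration; the exact on-disk layout Tajo produces is not described in the slides.

```python
from collections import defaultdict


def partition_paths(rows, partition_columns, table_dir="t1"):
    """Group rows by partition column values, keyed by a Hive-style directory path."""
    partitions = defaultdict(list)
    for row in rows:
        subdir = "/".join(f"{col}={row[col]}" for col in partition_columns)
        # Partition column values live in the path, so they are dropped from the data.
        data = {k: v for k, v in row.items() if k not in partition_columns}
        partitions[f"{table_dir}/{subdir}"].append(data)
    return dict(partitions)


# Made-up rows for the table T1 (C1 INT, C2 TEXT) partitioned by (C3 INT, C4 TEXT).
rows = [
    {"c1": 1, "c2": "a", "c3": 2014, "c4": "us"},
    {"c1": 2, "c2": "b", "c3": 2014, "c4": "kr"},
    {"c1": 3, "c2": "c", "c3": 2014, "c4": "us"},
]
layout = partition_paths(rows, ["c3", "c4"])
for path, data in sorted(layout.items()):
    print(path, data)
```

One directory is created per distinct combination of partition values, which is why a high-cardinality partition column leads to many small partitions; that is the granularity problem that range partitioning is meant to address.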
  18. 18. Future Work
      • Multi-tenant Scheduler (v0.9)
        – Support multiple users and multiple queries
      • Runtime byte code generation for expressions (v0.9)
        – Eliminate the interpretation overhead of expression evaluation
      • Authentication and SQL Standard Access Control
      • JIT-based Vectorized Processing Engine
        – Refer to the Hadoop Summit 2014 slides (http://goo.gl/jWghhp)
  19. 19. Get Involved!
      • We are recruiting contributors!
      • General
        – http://tajo.apache.org
      • Getting Started
        – http://tajo.apache.org/docs/0.8.0/getting_started.html
      • Downloads
        – http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html
      • Jira – Issue Tracker
        – https://issues.apache.org/jira/browse/TAJO
      • Join the mailing list
        – dev-subscribe@tajo.apache.org
        – issues-subscribe@tajo.apache.org
  20. 20. Demonstration
