Introduction to Apache Tajo: Future of Data Warehouse

Introduction to Apache Tajo: Future of Data Warehouse
- by Jihoon Son, presented at the San Francisco Hadoop meetup

Published in: Data & Analytics

  1. Introduction to Apache Tajo: Future of Data Warehouse
     Jihoon Son / Gruter Inc.
  2. I am
     ● Jihoon Son (@jihoonson)
       ○ Ph.D. at Korea Univ.
       ○ Tajo project co-founder
       ○ Committer and PMC member of Apache Tajo
       ○ Research engineer at Gruter
       ○ LinkedIn
         ■ https://www.linkedin.com/in/jihoonson
  3. Today's Topic: Tajo
     ● What is Tajo?
       ○ Tajo / tάːzo / 타조
       ○ Ostrich in Korean
         ■ Fastest two-legged animal in the world
  4. Today's Topic: Tajo
     ● What is Apache Tajo?
       ○ Our Ostrich can do SQL processing on big data!
         ■ SQL-on-Hadoop system
         ■ Apache Top-level project
  5. Maybe You Think ...
     SQL-on-Hadoop? Boring..
  6. This Ostrich is Different!
  7. SQL-on-Hadoop Systems
  8. SQL-on-Hadoop Systems
  9. SQL-on-Hadoop Systems
     Long-running ETL jobs
     Low-latency interactive analysis
  10. SQL-on-Hadoop Systems
      Long-running ETL jobs
      ● Requirements
        ○ Stable query execution
          ■ Fault-tolerance
            ● Can avoid query resubmission
        ○ Adaptation to dynamic environment
          ■ Available resources, unpredictable delays, ...
  11. SQL-on-Hadoop Systems
      Low-latency interactive analysis
      ● Requirements
        ○ Fast query execution
          ■ Several query execution techniques
          ■ In-memory processing
  12. Tajo is designed for Both Workloads
      Long-running ETL jobs
      Low-latency interactive analysis
  13. Who is using Tajo?
  14. Use Cases: SK Telecom
      ● Data warehousing & analysis
        ○ 1st telco in South Korea
          ■ 40 TB/day compressed data (2014)
  15. SK Telecom: Before Tajo
      [Diagram labels: ETL (×3), Operational Systems, Integration Layer, Data Warehouse, Marketing, Sales, ERP, SCM, ODS, Staging Area, Data Vault, Data Marts, Strategic Marts, Hadoop, MPP DBMS]
  16. SK Telecom: After Tajo
      [Diagram labels: ETL (×3), Operational Systems, Integration Layer, Data Warehouse, Marketing, Sales, ERP, SCM, ODS, Staging Area, Data Vault, Data Marts, Strategic Marts]
  17. SK Telecom: After Tajo
      [Diagram labels: ETL (×3), Operational Systems, Integration Layer, Data Warehouse, Marketing, Sales, ERP, SCM, ODS, Staging Area, Data Vault, Data Marts, Strategic Marts]
      ● Long-running ETL jobs
      ● Ad-hoc analysis
  18. Use Cases: SK Telecom
      ● Significantly reduced ETL & analysis time
        ○ Daily analysis becomes possible
        ○ The resources freed up allow more exploratory analysis
  19. Use Cases: Bluehole Studio
      ● Game log analysis
        ○ Finding principal causes of service-quality deficiencies
  20. Use Cases: Bluehole Studio
      ● Tajo on EMR
  21. Use Cases: Bluehole Studio
      ● Their first log analysis system
        ○ Easy and rapid deployment of Tajo
        ○ Low learning curve thanks to standard SQL
      ● Immediate action becomes possible for user complaints and hidden bugs
  22. Use Cases: Melon
      ● Data discovery
        ○ Music streaming service (26 million users)
        ○ Analysis of purchase history for target marketing
      ● Significantly reduced analysis time
        ○ Faster analysis by replacing Hive with Tajo
        ○ More analysis becomes possible
  23. So, Why should you use Tajo?
  24. So, Why should you use Tajo?
      ● Easy to use
  25. So, Why should you use Tajo?
      ● Easy to use
        ○ ANSI-SQL standard compliance (2003)
          ■ CTAS, Window functions, ...
  26. So, Why should you use Tajo?
      ● Easy to use
        ○ ANSI-SQL standard compliance (2003)
          ■ CTAS, Window functions, ...
        ○ Mature SQL features
          ■ Most existing queries can be executed without modification
  27. So, Why should you use Tajo?
      ● Easy to use
        ○ ANSI-SQL standard compliance (2003)
          ■ CTAS, Window functions, ...
        ○ Mature SQL features
          ■ Most existing queries can be executed without modification
        ○ Various data format support
          ■ Text, JSON, ORC, Parquet, …
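The two SQL features named on the slide above, CTAS (CREATE TABLE ... AS SELECT) and window functions, can be tried in miniature. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in engine rather than Tajo, the `purchases` and `user_totals` tables and their columns are hypothetical, and it assumes the bundled SQLite is version 3.25 or newer (needed for window functions).

```python
import sqlite3

# SQLite is only a stand-in engine so the example runs anywhere; it is not Tajo.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(1, "song_a", 1.0), (1, "song_b", 3.0), (2, "song_c", 2.0)],
)

# CTAS: materialize the result of a query as a new table.
conn.execute(
    "CREATE TABLE user_totals AS "
    "SELECT user_id, SUM(amount) AS total FROM purchases GROUP BY user_id"
)
totals = conn.execute(
    "SELECT user_id, total FROM user_totals ORDER BY user_id"
).fetchall()

# Window function: rank each purchase within its user by amount.
ranked = conn.execute(
    "SELECT user_id, item, "
    "RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rnk "
    "FROM purchases ORDER BY user_id, rnk"
).fetchall()
```

Both statements use standard SQL forms, which is the point the slide makes about running existing queries with little or no modification.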
  28. So, Why should you use Tajo?
      ● Optimized performance
  29. So, Why should you use Tajo?
      ● Optimized performance
        ○ Optimized code
          ■ Optimized I/O performance
            ● Nearly max I/O performance (~120MB/s) per disk
          ■ Off-heap data processing
            ● Mitigating GC overhead
  30. So, Why should you use Tajo?
      ● Optimized performance
        ○ Cost-based query plan optimization
          ■ Join ordering
          ■ Best algorithm selection
            ● According to input size
          ■ Progressive optimization
            ● Further optimize the query plan during query execution
            ● Especially effective for long-running queries
          ■ => Efficient star schema processing
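To make "best algorithm selection according to input size" and "join ordering" concrete, here is a minimal sketch of size-driven planning decisions. It is not Tajo's optimizer: the 64 MB broadcast threshold, the strategy names, and the table sizes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical threshold: an input at or below this size is cheap to broadcast.
BROADCAST_LIMIT_BYTES = 64 * 1024 * 1024


@dataclass(frozen=True)
class Table:
    name: str
    size_bytes: int


def choose_join_algorithm(left: Table, right: Table) -> str:
    """Pick a join strategy from the input sizes (illustrative rule only)."""
    if min(left.size_bytes, right.size_bytes) <= BROADCAST_LIMIT_BYTES:
        # Ship the small side to every worker and avoid shuffling the big side.
        return "broadcast"
    # Both sides are large: repartition both on the join key.
    return "repartition-hash"


def greedy_join_order(tables: Sequence[Table]) -> List[str]:
    """Order inputs smallest first, a simple greedy heuristic."""
    return [t.name for t in sorted(tables, key=lambda t: t.size_bytes)]


# Hypothetical star-schema inputs: one large fact table, one small dimension.
fact = Table("store_sales", 500 * 2**30)
dim = Table("date_dim", 10 * 2**20)
big = Table("catalog_sales", 200 * 2**30)
```

In a star schema the dimension tables are typically small, so a rule like this sends them to the workers instead of shuffling the fact table, which is one way to read the slide's "efficient star schema processing".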
  31. So, Why should you use Tajo?
      ● Various storage type support
  32. So, Why should you use Tajo?
      ● Various storage type support
  33. Logical Data Warehouse with Tajo
      [Diagram labels: Global view, Application, DBMS, NoSQL, Cloud storage, On-premise storage]
  34. Logical Data Warehouse with Tajo
      [Diagram labels: Global view, Application, DBMS, NoSQL, Cloud storage, On-premise storage]
      ● Fast delivery
      ● Easy maintenance
      ● Simple data flow
  35. How fast is Tajo?
  36. Evaluation on Cloud Environment
      ● Google Cloud Platform
        ○ Instance type: n1-standard-8
          ■ 8 cores, 30GB RAM
  37. Target Systems
      ● Hive (0.12)
        ○ Baseline performance
        ○ Default configuration provided by GCP
          ■ Use the whole CPU and memory
      ● Tajo (0.11.0)
        ○ Default configuration provided by GCP
          ■ Use the whole CPU and memory
  38. Target Systems
      ● Spark-SQL (1.5.0)
        ○ Default configuration provided by GCP
          ■ Use the whole CPU and memory
          ■ Tungsten enabled by default
        ○ spark.sql.shuffle.partitions is adjusted for better performance
  39. TPC-DS
      ● Data
        ○ 24 tables
          ■ Plain text format
          ■ Stored on Google Cloud Storage
      ● Queries
        ○ Only queries that can be executed on every system without modification
          ■ Exception: Hive 0.12 doesn't support implicit joins, so every query had to be rewritten for Hive
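The Hive rewrite mentioned on the slide above amounts to moving join conditions from the WHERE clause into explicit JOIN ... ON clauses. A minimal sketch of the two equivalent forms follows; it runs on Python's built-in sqlite3 module as a stand-in engine, and the `orders` and `customer` tables are hypothetical, not the TPC-DS schema or the benchmark's actual queries.

```python
import sqlite3

# SQLite is a stand-in engine; the tables and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (o_id INTEGER, c_id INTEGER);
    CREATE TABLE customer (c_id INTEGER, name TEXT);
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
    INSERT INTO customer VALUES (1, 'kim'), (2, 'lee');
    """
)

# Implicit join: tables listed in FROM, join condition in WHERE.
implicit = conn.execute(
    "SELECT o.o_id, c.name FROM orders o, customer c "
    "WHERE o.c_id = c.c_id ORDER BY o.o_id"
).fetchall()

# Explicit join: the same condition moved into JOIN ... ON.
explicit = conn.execute(
    "SELECT o.o_id, c.name FROM orders o JOIN customer c "
    "ON o.c_id = c.c_id ORDER BY o.o_id"
).fetchall()
```

Both queries return the same rows, so a rewrite of this kind changes the query text without changing what is being measured.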
  40. SF 1000, 50 instances
  41. SF 1000, 50 instances
  42. SF 1000, 50 instances
      Cannot be run on 1TB
  43. SF 10000, 50 instances
  44. SF 10000, 50 instances
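For a sense of scale, the cluster figures from the evaluation slides (50 n1-standard-8 instances, 8 cores and 30GB RAM each) can be combined with the data sizes. This is back-of-the-envelope arithmetic only; it assumes a TPC-DS scale factor of n corresponds to roughly n GB of raw data and an even split of data across instances.

```python
# Figures taken from the evaluation slides.
INSTANCES = 50
CORES_PER_INSTANCE = 8
RAM_GB_PER_INSTANCE = 30

total_cores = INSTANCES * CORES_PER_INSTANCE
total_ram_gb = INSTANCES * RAM_GB_PER_INSTANCE


def data_gb_per_instance(scale_factor: int) -> float:
    """Raw data per instance, assuming SF n is about n GB split evenly."""
    return scale_factor / INSTANCES


sf1000_share = data_gb_per_instance(1000)
sf10000_share = data_gb_per_instance(10000)
```

Under these assumptions each instance handles about 20 GB at SF 1000, below its 30GB of RAM, and about 200 GB at SF 10000, well above it, so the larger run cannot hold its input in memory.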
  45. Demo
  46. Simple Demo on EMR
      ● Using TPC-H data set, but
        ○ Lineitem table is stored on HDFS
        ○ Orders table is stored on PostgreSQL
        ○ Other tables are stored on S3
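The demo layout above can be pictured as a catalog that maps each table to the storage system holding it. The sketch below is a toy model of that idea, not Tajo's catalog or API: `TABLE_STORAGE`, `DEFAULT_STORAGE`, and `storages_for` are hypothetical names introduced for illustration.

```python
# Hypothetical catalog following the demo slide: table name -> storage system.
TABLE_STORAGE = {"lineitem": "HDFS", "orders": "PostgreSQL"}
DEFAULT_STORAGE = "S3"  # "Other tables are stored on S3"


def storages_for(tables):
    """Return the set of storage systems a query over `tables` would touch."""
    return {TABLE_STORAGE.get(name, DEFAULT_STORAGE) for name in tables}


# A query joining these three TPC-H tables spans all three storage systems.
touched = storages_for(["lineitem", "orders", "customer"])
```

A single query that joins lineitem, orders, and customer therefore reads from three different systems, which is the federation scenario the demo is meant to show.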
  47. Apache Tajo
      ● Is excellent for both long-running ETL jobs and exploratory ad-hoc analysis
      ● Is very fast
      ● Supports query federation on diverse data sources
  48. Get Involved!
      ● We are recruiting contributors!
      ● General
        ○ http://tajo.apache.org/
      ● Getting Started
        ○ http://tajo.apache.org/docs/current/getting_started.html
      ● Downloads
        ○ http://tajo.apache.org/downloads.html
      ● Issue tracker
        ○ http://issues.apache.org/jira/browse/TAJO
      ● Join the mailing list
        ○ dev-subscribe@tajo.apache.org
        ○ issues-subscribe@tajo.apache.org
  49. Useful Links
      ● EMR bootstrap
        ○ https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo
      ● How to set up Tajo on EMR
        ○ http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/
  50. Q & A
