Introduction to Apache Tajo: Data Warehouse for Big Data

by Jihoon Son, presented at ApacheCon Big Data Europe 2015



  1. Introduction to Apache Tajo: Data Warehouse for Big Data (Jihoon Son, Gruter Inc.)
  2. About Me: Jihoon Son (@jihoonson). Tajo project co-founder, committer and PMC member of Apache Tajo, and research engineer at Gruter.
  3. Outline: about Tajo; features of the recent release; demo; roadmap.
  4. What is Tajo? Tajo /tάːzo/ (타조) is Korean for "ostrich", the world's fastest two-legged animal.
  5. What is Tajo? An Apache top-level project: a big data warehouse system. It is ANSI-SQL compliant, with mature SQL features such as various types of join and window functions. Rapid query execution with its own distributed DAG engine handles both low-latency and long-running batch queries in a single system, with fault tolerance. Beyond SQL-on-Hadoop, it supports various types of storage.
  6. Architecture Overview (diagram): the Tajo Master, with its Catalog Server (backed by a DBMS or HCatalog), accepts queries submitted from JDBC clients, TSQL, the Web UI, and the REST API. It allocates each query to a Query Master on a Tajo Worker, which sends tasks to Query Executors on the workers and monitors them; every worker accesses the underlying storage through its Storage Service.
  7. Who is Using Tajo? Use case: replacing a commercial DW at the #1 telco in South Korea. Long-running ETL workloads on multi-TB datasets were replaced: many daily reports on user behavior, plus ad-hoc analysis on TB-scale datasets. Benefits: a simplified architecture for data analysis, one unified system for DW ETL, OLAP, and Hadoop ETL, much lower cost with more data analysis within the same SLA, and savings on the commercial DW license fee.
  8. Who is Using Tajo? Use case: data discovery at a music streaming service with 26 million users, analyzing purchase history for target marketing. Benefits: interactive queries on large datasets, and data analysis with familiar BI tools.
  9. Recent Release: 0.11. Feature highlights: query federation, JDBC-based storage support, self-describing data format support, multi-query support, more stable and efficient join execution, index support, and Python UDF/UDAF support.
  10. Recent Release: 0.11. Today's topics from that list: query federation, JDBC-based storage support, and self-describing data format support.
  11. Query Federation with Tajo
  12. Your Data: your data might be spread across multiple heterogeneous sites: cloud storage, DBMSs, on-premise storage, Hadoop, NoSQL, and so on.
  13. Your Data: even within a single site, your data might be stored in different formats: JSON, CSV, Parquet, ORC, logs, and so on.
  14. Your Data: how do you analyze distributed data? Traditionally, ETL jobs transform and copy everything into one global view, which means long delivery times, a complex data flow, and human-intensive work.
  15. Your Data with Tajo: query federation provides the global view directly over the DBMS, cloud storage, on-premise storage, and NoSQL, with fast delivery, easy maintenance, and a simple data flow.
  16. Storage and Data Format Support (diagram of the supported storage types and data formats).
  17. Create Table: each table declares its data format (USING ...) and its storage URI (LOCATION ...).
      > CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text
        WITH ('text.delimiter'='|') LOCATION 'hdfs://localhost:8020/archive1';
      > CREATE EXTERNAL TABLE user (user_id BIGINT, ...) USING orc
        WITH ('orc.compression.kind'='snappy') LOCATION 's3://user';
      > CREATE EXTERNAL TABLE table1 (key TEXT, ...) USING hbase
        LOCATION 'hbase:zk://localhost:2181/uptodate';
      > ...
  18. Create Table: repeating the same options for every table is too tedious!
      > CREATE EXTERNAL TABLE archive1 (id BIGINT, ...) USING text
        WITH ('text.delimiter'='|', 'text.null'='N',
              'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec',
              'timezone'='UTC+9', 'text.skip.headerlines'='2')
        LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive1';
      > CREATE EXTERNAL TABLE archive2 (id BIGINT, ...) USING text
        WITH (the same five options)
        LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive2';
      > CREATE EXTERNAL TABLE archive3 (id BIGINT, ...) USING text
        WITH (the same five options)
        LOCATION 'hdfs://localhost:8020/tajo/warehouse/archive3';
      > ...
  19. Introduction to Tablespaces: a tablespace is a registered storage space, identified by a unique URI. Its configuration and policies, namely the storage type plus the default and supported data formats, are shared by all tables in the tablespace. This lets users reuse registered storage configurations and policies.
  20. Tablespaces, Databases, and Tables (diagram): databases and tables form the namespace, while tablespaces map them onto physical storage spaces; the tables of one database may be spread over several tablespaces.
  21. Tablespace Configuration ("warehouse" and "hbase1" are tablespace names; "uri" is the tablespace URI):
      {
        "spaces" : {
          "warehouse" : {
            "uri" : "hdfs://localhost:8020/tajo/warehouse",
            "configs" : [
              {'text.delimiter'='|'},
              {'text.null'='N'},
              {'compression.codec'='org.apache.hadoop.io.compress.SnappyCodec'},
              {'timezone'='UTC+9'},
              {'text.skip.headerlines'='2'}
            ]
          },
          "hbase1" : {
            "uri" : "hbase:zk://localhost:2181/table1"
          }
        }
      }
  22. Create Table: the data format is omitted, so the default (TEXT) and the registered options of the "warehouse" tablespace from the previous slide apply.
      > CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse;
  23. Create Table:
      > CREATE TABLE archive1 (id BIGINT, ...) TABLESPACE warehouse;
      > CREATE TABLE archive2 (id BIGINT, ...) TABLESPACE warehouse;
      > CREATE TABLE archive3 (id BIGINT, ...) TABLESPACE warehouse;
      > CREATE TABLE user (user_id BIGINT, ...) TABLESPACE aws
        USING orc WITH ('orc.compression.kind'='snappy');
      > CREATE TABLE table1 (key TEXT, ...) TABLESPACE hbase1;
      > ...
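The effect of TABLESPACE on these statements can be pictured as table options defaulting to the tablespace's registered configuration, with per-table USING/WITH clauses layered on top. A minimal sketch of that idea (illustrative Python, not Tajo's implementation; all names are invented for the example):

```python
# Sketch: how tablespace-level configuration can be reused by tables.
# Illustration of the idea only, not Tajo's actual code.

TABLESPACES = {
    "warehouse": {
        "uri": "hdfs://localhost:8020/tajo/warehouse",
        "default_format": "text",
        "configs": {
            "text.delimiter": "|",
            "text.null": "N",
            "timezone": "UTC+9",
        },
    },
}

def resolve_table(name, tablespace, fmt=None, with_opts=None):
    """Combine tablespace defaults with per-table overrides."""
    space = TABLESPACES[tablespace]
    opts = dict(space["configs"])     # start from the shared configs
    opts.update(with_opts or {})      # per-table WITH options win
    return {
        "location": space["uri"] + "/" + name,
        "format": fmt or space["default_format"],
        "options": opts,
    }

t = resolve_table("archive1", "warehouse")
print(t["format"])    # text (tablespace default)
print(t["location"])  # hdfs://localhost:8020/tajo/warehouse/archive1
```

With this layering, `CREATE TABLE archive1 ... TABLESPACE warehouse;` needs no WITH clause at all, which is exactly the tedium the tablespace feature removes.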
  24. Querying on Different Data Silos: how does a worker access different data sources? Each worker's storage service returns the proper handler (e.g. an HDFS handler or an HBase handler) for the underlying storage, so a single query can read from both: > SELECT ... FROM hdfs_table, hbase_table, ...
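The storage-service dispatch described on this slide, choosing a handler by the storage scheme of a table's location, can be sketched as follows (illustrative Python; the class and function names are invented, not Tajo's actual API):

```python
# Sketch of a storage service that returns a handler based on the
# URI scheme of a table's location. Illustrative only.

class HdfsHandler:
    def scan(self, table):
        return f"scanning {table} via HDFS"

class HBaseHandler:
    def scan(self, table):
        return f"scanning {table} via HBase"

HANDLERS = {"hdfs": HdfsHandler(), "hbase": HBaseHandler()}

def handler_for(location_uri):
    """Pick the storage handler from the URI scheme."""
    scheme = location_uri.split(":", 1)[0]
    try:
        return HANDLERS[scheme]
    except KeyError:
        raise ValueError(f"unsupported storage: {scheme}")

# A query touching two silos uses a different handler per table:
print(handler_for("hdfs://localhost:8020/archive1").scan("hdfs_table"))
print(handler_for("hbase:zk://localhost:2181/t1").scan("hbase_table"))
```

The point of the indirection is that the query engine above it stays storage-agnostic: adding a new storage type only adds a new handler.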
  25. JDBC-based Storage Support
  26. JDBC-based Storage: storage that provides the JDBC interface, such as PostgreSQL, MySQL, and MariaDB. Each database of a JDBC-based storage is mapped to a Tajo database (e.g. jdbc_db1 to tajo_db1, jdbc_db2 to tajo_db2), together with its tables.
  27. Tablespace Configuration ("pgsql_db1" is the tablespace name, "db1" the PostgreSQL database name, and "tajo_db1" the mapped Tajo database name):
      {
        "spaces": {
          "pgsql_db1": {
            "uri": "jdbc:postgresql://hostname:port/db1",
            "configs": {
              "mapped_database": "tajo_db1",
              "connection_properties": {
                "user": "tajo",
                "password": "xxxx"
              }
            }
          }
        }
      }
  28. Return to Query Federation: how do we correlate data in JDBC-based storage with everything else? We need a global view of metadata across the different storage types: Tajo keeps its own metadata for its data, each JDBC-based storage keeps its own, each NoSQL storage keeps its own, and so on.
  29. Metadata Federation (diagram): the Linked Metadata Manager federates the metadata of the underlying storage behind the catalog interface. Metadata providers (a DBMS metadata provider, a NoSQL metadata provider, and Tajo's own catalog metadata provider backed by a DBMS or HCatalog) expose tablespaces, databases, tables, schema names, and so on.
  30. Querying on JDBC-based Storage: the query plan is converted into a SQL string. Query generation must handle the diverse SQL syntax of the different storage types, so there is a different SQL builder for each storage type (diagram: the Tajo Master plans the query, and a Tajo Worker sends the generated SQL to the JDBC-based storage).
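The per-storage SQL builders mentioned here can be sketched like this (a toy example: it renders only a trivial scan-with-limit plan, whereas Tajo's real builders cover whole plans and each engine's dialect; all names are invented):

```python
# Toy per-dialect SQL builders: the same simple plan is rendered
# differently per target storage. Illustrative only.

def build_postgresql(table, columns, limit=None):
    # PostgreSQL quotes identifiers with double quotes.
    sql = f'SELECT {", ".join(columns)} FROM "{table}"'
    if limit is not None:
        sql += f" LIMIT {limit}"
    return sql

def build_mysql(table, columns, limit=None):
    # MySQL quotes identifiers with backticks.
    sql = f"SELECT {', '.join(columns)} FROM `{table}`"
    if limit is not None:
        sql += f" LIMIT {limit}"
    return sql

BUILDERS = {"postgresql": build_postgresql, "mysql": build_mysql}

def generate_sql(storage_type, table, columns, limit=None):
    """Dispatch to the builder registered for the storage type."""
    return BUILDERS[storage_type](table, columns, limit)

print(generate_sql("postgresql", "account", ["name", "key"], limit=10))
# SELECT name, key FROM "account" LIMIT 10
```

Even in this caricature, the dialect differences (identifier quoting here; in practice also functions, types, and pagination syntax) are why one builder per storage type is needed.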
  31. Operation Push Down: Tajo can exploit the processing capability of the underlying storage (DBMSs, MongoDB, HBase, ...). Operations are pushed down into the underlying storage, leveraging its advanced features, e.g. a DBMS's query optimization and indexes.
  32. Example 1:
      SELECT count(*) FROM account ac, archive ar
      WHERE ac.key = ar.id AND ac.name = 'tajo'
      With account in a DBMS and archive in HDFS, the scan of account with the filter ac.name = 'tajo' is pushed down into the DBMS, which returns only its result; archive still needs a full scan, and Tajo executes the join on ac.key = ar.id and the count(*) aggregation itself.
  33. Example 2:
      SELECT ac.name, count(*) FROM account ac GROUP BY ac.name
      With account in a DBMS, the whole query (scan, GROUP BY, and count(*)) is pushed down, and the DBMS returns only the result.
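The difference between the two examples can be caricatured as walking the plan bottom-up and pushing operators into the storage only while it supports them (a sketch with invented capability names, not Tajo's optimizer):

```python
# Toy pushdown split: given a bottom-up list of plan operators and the
# set of operators a storage engine can execute, split the plan into a
# pushed-down prefix and a part the query engine runs itself.

def split_plan(operators, storage_capabilities):
    pushed, remaining = [], []
    for op in operators:
        if not remaining and op in storage_capabilities:
            pushed.append(op)      # still inside the pushable prefix
        else:
            remaining.append(op)   # first unsupported op stops pushdown
    return pushed, remaining

DBMS_CAPS = {"scan", "filter", "group_by", "aggregate"}

# Example 1: the join with an HDFS table cannot run inside the DBMS.
print(split_plan(["scan", "filter", "join", "aggregate"], DBMS_CAPS))
# (['scan', 'filter'], ['join', 'aggregate'])

# Example 2: the whole aggregation query runs inside the DBMS.
print(split_plan(["scan", "group_by", "aggregate"], DBMS_CAPS))
# (['scan', 'group_by', 'aggregate'], [])
```

The pushed prefix is what gets converted to a SQL string for the storage; the remainder executes on Tajo workers.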
  34. Self-describing Data Format Support
  35. Self-describing Data Formats: some data formats, such as JSON, ORC, and Parquet, carry schema information along with the data. Tajo 0.11 supports them natively: since the schema is already in the data, Tajo doesn't need to store it separately; instead, it can infer the schema at query execution time.
  36. Create Table with a Nested Data Format (RECORD is the nested type):
      { "title" : "Hand of the King", "name" : { "first_name": "Eddard", "last_name": "Stark" } }
      { "title" : "Assassin", "name" : { "first_name": "Arya", "last_name": "Stark" } }
      { "title" : "Dancing Master", "name" : { "first_name": "Syrio", "last_name": "Forel" } }
      > CREATE EXTERNAL TABLE schemaful_table (
          title TEXT,
          name RECORD (
            first_name TEXT,
            last_name TEXT
          )
        ) USING json LOCATION 'hdfs:///json_table';
  37. How about This Data? (the slide shows a single GitHub ForkEvent record: one deeply nested JSON object with dozens of fields and URLs, far too large and irregular to declare a schema for by hand)
  38. Create a Schemaless Table: that's all! The (*) column list allows any schema.
      > CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION 'hdfs:///json_table';
  39. Schema-free Query Execution:
      > CREATE EXTERNAL TABLE schemaful_table (id BIGINT, name TEXT, ...) USING text LOCATION 'hdfs:///csv_table';
      > CREATE EXTERNAL TABLE schemaless_table (*) USING json LOCATION 'hdfs:///json_table';
      > SELECT name.first_name, name.last_name FROM schemaless_table;
      > SELECT title, count(*) FROM schemaful_table, schemaless_table WHERE name = name.last_name GROUP BY title;
  40. Schema Inference: the table schema is inferred at query time. For example, the query
      SELECT a, b.b1, b.b2.c1 FROM t;
      yields the inferred schema
      ( a text, b record ( b1 text, b2 record ( c1 text ) ) )
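Inferring a nested schema like the one on this slide can be sketched by recursing over a sample record (a simplified illustration: everything non-nested is treated as text, whereas Tajo's inference also handles other types):

```python
# Toy schema inference over one JSON record: nested objects become
# record types; every other value is treated as "text" for simplicity.
import json

def infer_schema(value):
    """Recursively map a parsed JSON value to a toy schema."""
    if isinstance(value, dict):
        return {k: infer_schema(v) for k, v in value.items()}
    return "text"

record = json.loads('{"a": "x", "b": {"b1": "y", "b2": {"c1": "z"}}}')
print(infer_schema(record))
# {'a': 'text', 'b': {'b1': 'text', 'b2': {'c1': 'text'}}}
```

The nested dict mirrors the slide's inferred schema: `a text, b record (b1 text, b2 record (c1 text))`.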
  41. Demo
  42. Demo with the Command Line
  43. Roadmap
  44. Roadmap: 0.12 will bring improved YARN integration, authentication support, JavaScript stored procedure support, scalar subquery support, and Hive UDF support.
  45. Roadmap: for the next generation (beyond 0.12): exploiting modern hardware, approximate query processing, genetic query optimization, and more.
  46. tajo> select question from you;
