SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive

Telecom companies today store their data in databases or data warehouses, process it through ETL, and run statistics and analysis with OLAP tools or data mining engines. However, with the data explosion that has accompanied the spread of smartphones, traditional stores such as a DB or DW can no longer cope with this “Big Data”. As an alternative, storing data in Hadoop and running ETL and ad hoc queries with Hive is being introduced, with China Mobile cited as the most representative example. So far this approach has been adopted mainly by new projects, which face low barriers to applying the new Hive data model and HQL. Replacing an existing database with Hadoop and Hive is far harder when a large number of tables and SQL queries already exist. NexR is migrating a telecom company’s data from Oracle DB to Hadoop and converting many existing Oracle SQL queries to Hive HQL. Although HQL’s syntax resembles ANSI SQL, it lacks a large portion of the basic functions and hardly supports Oracle analytic functions such as rank(), which are used mainly in statistical analysis. Differences in data types, such as the treatment of null values, are a further obstacle. In this presentation we share our experience converting Oracle SQL to Hive HQL and developing additional functions with MapReduce. We also introduce several ideas and trials for improving Hive performance.

http://sdec.kr/

  1. 1. SDEC 2011 Seoul Data Engineering Camp June 27-28 Seoul, South Korea
  2. 2. Replacing Legacy Telco DB/DW to Hadoop and Hive JunHo Cho NexR
  8. 8. Agenda • Motivation for Hive and Hadoop • Hive Internal • Oracle Migration UseCase • Hive Optimization • Future Work
  9. 9. Telco Data
  17. 17. Divide & Conquer
  18. 18. OpenSource
  19. 19. OpenSource Storage & Computing
  20. 20. OpenSource
  21. 21. OpenSource Collection
  22. 22. OpenSource
  23. 23. OpenSource Search
  24. 24. OpenSource
  25. 25. OpenSource Analysis
  26. 26. OpenSource
  27. 27. OpenSource Coordination
  28. 28. OpenSource
  31. 31. What is HIVE ? • A system for managing and querying structured data built on top of Hadoop • Map-Reduce for execution • HDFS for storage • Metadata in an RDBMS • Key Building Principles • SQL is a familiar language • Extensibility - Types, Functions, Formats, Scripts • Performance
  32. 32. Why Hive ?
  33. 33. Count call-record per phone ?
  39. 39. Mapper, Reducer and Driver (each public class goes in its own source file):
      import java.io.IOException;
      import java.util.Iterator;
      import java.util.StringTokenizer;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;

      // Mapper: emit (first token of the record, 1) for every call record.
      public class CallCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line.toLowerCase());
          word.set(itr.nextToken());
          output.collect(word, one);
        }
      }

      // Reducer: sum the counts collected for each key.
      public class CallCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      // Driver: configure and submit the job.
      public class CallCount {
        public static void main(String[] args) {
          JobClient client = new JobClient();
          JobConf conf = new JobConf(CallCount.class);
          // specify output types
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          // specify input and output dirs
          FileInputFormat.addInputPath(conf, new Path("input"));
          FileOutputFormat.setOutputPath(conf, new Path("output"));
          // specify a mapper, a reducer and a combiner
          conf.setMapperClass(CallCountMapper.class);
          conf.setReducerClass(CallCountReducer.class);
          conf.setCombinerClass(CallCountReducer.class);
          client.setConf(conf);
          try {
            JobClient.runJob(conf);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      }
  40. 40. SELECT pnum, count(pnum) FROM cdr GROUP BY pnum;
  43. 43. History of Hive • Hive development cycle is fast and the developer community is growing rapidly • Product release cycle is accelerating
      Project started: 03/08
      0.3.0: 04/09
      0.4.0: 12/09
      0.5.0: 02/10
      0.6.0: 10/10
      0.7.0: 03/11
      0.7.1: 06/11
  44. 44. Who uses Hive? http://wiki.apache.org/hadoop/Hive/PoweredBy
  52. 52. UseCase in Hive? • Report and ad hoc query • Log Analysis • Social Graph Analysis • Data mining and analysis • Machine Learning • Dataset cleaning • Data Warehouse
  58. 58. Hive Architecture (diagram). Labels: UI, Driver, DDL, HQL, Works, Execution Engine, MetaStore, Compiler, ORM, Hadoop, Result. Example query shown: select col1 from tab1 where ... Example result rows shown: a 123344, b 121211, c 342434.
  60. 60. Hive Internal (diagram)
      Clients: Web UI, Hive CLI, JDBC (Browse, Query, DDL)
      MetaStore (Thrift API)
      Hive QL: Parser, Plan, Optimizer, Task
      Map Reduce: ExecMapper/ExecReducer; TSOperator, SELOperator, FSOperator; User Script; UDF/UDAF (substr, sum, average)
      SerDe, Input/OutputFormat, StorageHandler
      Storage: HDFS, RCFile, DB ... HBase
  67. 67. Parser: Select col1,col2 From tab1 Where col3 > 5 (QB annotations shown on the tree: tab1, insclause-0, col1, col2)
      TOK_QUERY
        TOK_FROM
          TOK_TABNAME tab1
        TOK_INSERT
          TOK_DESTINATION
            TOK_DIR
              TOK_TMP_FILE
          TOK_SELECT
            TOK_SELEXPR
              TOK_TABLE_OR_COL col1
            TOK_SELEXPR
              TOK_TABLE_OR_COL col2
          TOK_WHERE
            >
              TOK_TABLE_OR_COL col3
              5
  78. 78. Plan: Select col1,col2 From tab1 Where col3 > 5. Each part of the QB is mapped to an operator:
      TOK_FROM -> TableScanOperator
      TOK_WHERE -> FilterOperator
      TOK_SELECT -> SelectOperator
      TOK_DESTINATION -> FileSinkOperator
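The operator chain on the Plan slides can be pictured as a row pipeline. The following is a minimal Java sketch under stated assumptions, not Hive's operator classes: `PlanSketch`, `run`, and the in-memory rows are hypothetical, and each step only mimics the role of the operator named in its comment for `Select col1,col2 From tab1 Where col3 > 5`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of the plan TableScan -> Filter -> Select -> FileSink.
// Not Hive code: a row is an int array {col1, col2, col3}.
final class PlanSketch {

    static List<int[]> run(List<int[]> tab1) {
        List<int[]> sink = new ArrayList<>();          // stands in for the FileSinkOperator output
        for (int[] row : tab1) {                        // TableScanOperator: read every row
            if (row[2] > 5) {                           // FilterOperator: col3 > 5
                sink.add(new int[] {row[0], row[1]});   // SelectOperator: keep col1, col2
            }
        }
        return sink;
    }

    public static void main(String[] args) {
        List<int[]> tab1 = Arrays.asList(
            new int[] {1, 10, 3},
            new int[] {2, 20, 7},
            new int[] {3, 30, 9});
        for (int[] out : run(tab1)) {
            System.out.println(out[0] + "\t" + out[1]);
        }
    }
}
```

With the sample rows, only the second and third rows pass the filter, and only their first two columns reach the sink.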
  93. 93. Optimizer: Select col1,col2 From tab1 Where col3 > 5, with tab1 {col1, col2, col3, col4, col5, col6, col7}. Operator tree: TableScanOperator (TS), FilterOperator (FIL), SelectOperator (SEL), FileSinkOperator. The ColumnPruner visits the operators with a Context: at SEL the needed columns are col1, col2; at FIL and at TS they are col1, col2, col3. The last slide shows a second FilterOperator in the tree.
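The column sets shown on the Optimizer slides can be reproduced with a few lines of Java. This is a sketch of the idea only: `ColumnPrunerSketch` and `neededAtScan` are hypothetical names, and Hive's real ColumnPruner works on its operator tree rather than on plain lists.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ColumnPrunerSketch {

    // Columns the table scan has to read: what SELECT outputs plus what the filter tests.
    static Set<String> neededAtScan(List<String> selectCols, List<String> filterCols) {
        Set<String> needed = new LinkedHashSet<>(selectCols); // SEL needs col1, col2
        needed.addAll(filterCols);                            // FIL additionally needs col3
        return needed;                                        // TS can skip every other column
    }

    public static void main(String[] args) {
        Set<String> cols = neededAtScan(Arrays.asList("col1", "col2"), Arrays.asList("col3"));
        System.out.println(cols); // prints [col1, col2, col3]
    }
}
```

Of the seven columns of tab1, the scan then only has to materialize three.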
  106. 106. Task: Select col1,col2 From tab1 Where col3 > 5 (diagram). Labels: TS - GenMRTableScan1, FS - GenMRFileSink1, TaskFactory, QB, MapRedTask, FetchTask. Operator tree shown: TableScanOperator, FilterOperator, FilterOperator, SelectOperator, FileSinkOperator.
  108. 108. Hive Internal (diagram). The Map Reduce side shows TSOperator, FILOperator, FILOperator, SELOperator and FSOperator, with User Script and UDF. Other labels: Web UI, Hive CLI, JDBC (Browse, Query, DDL), MetaStore (Thrift API), Hive QL (Parser, Plan, Optimizer, Task), ExecMapper/ExecReducer, SerDe, Input/OutputFormat, StorageHandler, HDFS, RCFile, DB ... HBase.
  109. 109. Oracle Migration to Hive
  110. 110. l l l l l
  111. 111. l l l l l l l l l l
  112. 112. l l l l l l l l l l
  113. 113. Oracle SQL
  122. 122. Data Model
      Hive Entity    | Sample    | HDFS LOC
      Table          | Log       | /hive/Log
      Partition      | time=hour | /hive/Log/time=1h
      Bucket         | phone-num | /wh/Log/time=1h/part-$hash(phone-num)
      External Table | customer  | /app/meta/dir (arbitrary location)
  123. 123. Data Model (diagram). MetaStore DB holds the table data location, partitioning info and bucketing info. HDFS holds the data: Table /hive/Log, Partition /hive/Log/time=1h, Bucket /hive/Log/time=1h/part-0001.
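How a row ends up under a partition directory and a bucket file can be sketched as follows. Everything here is illustrative: `HiveLayoutSketch` and its methods are hypothetical, the `part-NNNN` file name follows the slides rather than the file names a real Hive installation writes, and Java's `String.hashCode` is only an assumed stand-in for the hash function actually used.

```java
final class HiveLayoutSketch {

    // Directory of one partition: <tableDir>/<partitionColumn>=<value>
    static String partitionDir(String tableDir, String partCol, String partValue) {
        return tableDir + "/" + partCol + "=" + partValue;
    }

    // Bucket index for a key. ASSUMPTION: non-negative hash modulo the bucket count.
    static int bucketOf(String key, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // File that holds the bucket; the naming scheme is taken from the slide.
    static String bucketFile(String partDir, String key, int numBuckets) {
        return String.format("%s/part-%04d", partDir, bucketOf(key, numBuckets));
    }

    public static void main(String[] args) {
        String dir = partitionDir("/hive/Log", "time", "1h");
        System.out.println(dir);
        System.out.println(bucketFile(dir, "010-1234-5678", 32));
    }
}
```

The point of the layout is that a query restricted to one partition value, or to one phone number, only has to read the matching directory or bucket file.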
  127. 127. Column Data Types • Primitive Types • int type : tinyint, smallint, int, bigint • boolean, float, double, string • Nest-able Collections • array : value(any-type) • map : key(primitive) and value(any-type) • User-defined types • structures with attributes
  136. 136. DataType Convert
      Oracle      | Hive
      NUMBER(n)   | TINYINT, INT/BIGINT
      NUMBER(n,m) | FLOAT/DOUBLE
      VARCHAR2    | STRING
      DATE        | STRING in "yyyy-MM-dd HH:mm:ss" format
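For the DATE row, the conversion amounts to formatting each date with the pattern from the slide. A minimal sketch using `java.time`; the class and method names are hypothetical, and how the dates are read from Oracle is left out.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class OracleDateToHiveString {

    // The string layout named on the slide.
    private static final DateTimeFormatter HIVE_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static String toHiveString(LocalDateTime oracleDate) {
        return oracleDate.format(HIVE_FORMAT);
    }

    static LocalDateTime fromHiveString(String s) {
        return LocalDateTime.parse(s, HIVE_FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(toHiveString(LocalDateTime.of(2011, 6, 27, 9, 5, 0)));
        // prints 2011-06-27 09:05:00
    }
}
```

Because every field is fixed-width and ordered from year down to second, comparing two such strings as text gives the same result as comparing the dates, so range predicates on the STRING column keep working.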
  137. 137. Oracle DML • Hive supports an ANSI-SQL-like syntax • Sub-queries in the FROM clause • Join queries: equi-join/inner join, outer join
  142. 142. Range Operator: BETWEEN ~ AND ~
      Oracle: SELECT * FROM Employee WHERE salary BETWEEN 100 AND 500;
      Hive: SELECT * FROM Employee WHERE salary >= 100 AND salary <= 500;
      Hive: SELECT * FROM Employee WHERE BETWEEN(salary,100,500);
  147. 147. IN / EXISTS Clause: IN / EXISTS SubQuery
      Oracle: SELECT * FROM Employee e WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d)
      Oracle: SELECT * FROM Employee e WHERE EXISTS (SELECT 1 FROM Dept d WHERE e.DeptNo = d.DeptNo)
      Hive: SELECT * FROM Employee e LEFT SEMI JOIN Dept d ON (e.DeptNo = d.DeptNo)
  151. 151. NOT IN Clause: NOT IN SubQuery
      Oracle: SELECT * FROM Employee e WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d)
      Hive: SELECT e.* FROM Employee e LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo) WHERE d.DeptNo IS NULL
  155. 155. NOT EXISTS Clause: NOT EXISTS SubQuery
      Oracle: SELECT * FROM Employee e WHERE NOT EXISTS (SELECT 1 FROM Dept d WHERE e.DeptNo = d.DeptNo)
      Hive: SELECT e.* FROM Employee e LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo) WHERE d.DeptNo IS NULL
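The join rewrites for IN / EXISTS and for NOT EXISTS keep, respectively, the employees that have a matching Dept row and those that have none. The sketch below models that on in-memory data; the class, the method names and the sample rows are hypothetical, and the data deliberately contains no NULL department numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SubqueryRewriteSketch {

    // LEFT SEMI JOIN: employees whose DeptNo has a match in Dept.
    static List<String> semiJoin(Map<String, Integer> employeeDept, Set<Integer> deptNos) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Integer> e : employeeDept.entrySet()) {
            if (deptNos.contains(e.getValue())) names.add(e.getKey());
        }
        return names;
    }

    // LEFT OUTER JOIN ... WHERE d.DeptNo IS NULL: employees whose DeptNo has no match in Dept.
    static List<String> antiJoin(Map<String, Integer> employeeDept, Set<Integer> deptNos) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Integer> e : employeeDept.entrySet()) {
            if (!deptNos.contains(e.getValue())) names.add(e.getKey());
        }
        return names;
    }

    // Hypothetical sample data: employee name -> DeptNo.
    static Map<String, Integer> sampleEmployees() {
        Map<String, Integer> m = new LinkedHashMap<>();
        m.put("kim", 10);
        m.put("lee", 20);
        m.put("park", 30);
        return m;
    }

    public static void main(String[] args) {
        Set<Integer> depts = new HashSet<>(Arrays.asList(10, 20));
        System.out.println(semiJoin(sampleEmployees(), depts)); // prints [kim, lee]
        System.out.println(antiJoin(sampleEmployees(), depts)); // prints [park]
    }
}
```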
  162. 162. LIKE Clause: LIKE / NOT LIKE
      Oracle: SELECT * FROM Employee e WHERE name LIKE '%steve'
      Oracle: SELECT * FROM Employee e WHERE name NOT LIKE '%steve'
      Hive: SELECT e.* FROM Employee e WHERE name LIKE '%steve'
      Hive: SELECT e.* FROM Employee e WHERE NOT name LIKE '%steve'
  166. 166. JOIN Operator (1/4): SELF JOIN
      Oracle: SELECT * FROM Employee e1, Employee e2 WHERE e1.ID = e2.ID
      Hive: SELECT * FROM Employee e1 JOIN Employee e2 ON (e1.ID = e2.ID)
  170. 170. JOIN Operator (2/4): CROSS JOIN (Cartesian Product)
      Oracle: SELECT emp.Name, dept.Name FROM Employee emp, Dept dept
      Hive: SELECT emp.Name, dept.Name FROM Employee emp JOIN Dept dept
  174. 174. JOIN Operator (3/4): LEFT OUTER JOIN
      Oracle: SELECT * FROM Emp, Dept WHERE Emp.deptNo = Dept.deptNo(+)
      Hive: SELECT * FROM Emp LEFT OUTER JOIN Dept ON (Emp.deptNo = Dept.deptNo)
  178. 178. JOIN Operator (4/4): RIGHT OUTER JOIN
      Oracle: SELECT * FROM Emp, Dept WHERE Emp.deptNo(+) = Dept.deptNo
      Hive: SELECT * FROM Emp RIGHT OUTER JOIN Dept ON (Emp.deptNo = Dept.deptNo)
  179. 179. Oracle Function
  183. 183. Condition Function: CASE
      Oracle: CASE expr WHEN cond1 THEN r1 [WHEN cond2 THEN r2]* [ELSE r] END
      Hive: CASE expr WHEN cond1 THEN r1 [WHEN cond2 THEN r2]* [ELSE r] END
  196. 196. Math Function
      Oracle  | Hive
      ROUND   | ROUND
      CEIL    | CEIL/CEILING
      MOD     | PMOD
      POWER   | POW/POWER
      SQRT    | SQRT
      SIN/COS | SIN/COS
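One detail worth checking when applying the MOD to PMOD mapping is the sign of the result for negative inputs. The sketch below contrasts the two for integers; the method names are hypothetical, `oracleMod` mirrors the documented Oracle behaviour (result takes the sign of the dividend, and a zero divisor returns the dividend), and `hivePmod` is written as a positive modulus, which is what the function name suggests.

```java
final class ModSketch {

    // Oracle-style MOD for integers: same sign as the dividend, like Java's % operator.
    static int oracleMod(int n, int m) {
        return m == 0 ? n : n % m;
    }

    // Positive modulus: non-negative result whenever m is positive.
    static int hivePmod(int n, int m) {
        return ((n % m) + m) % m;
    }

    public static void main(String[] args) {
        System.out.println(oracleMod(11, 4));  // prints 3
        System.out.println(hivePmod(11, 4));   // prints 3
        System.out.println(oracleMod(-11, 4)); // prints -3
        System.out.println(hivePmod(-11, 4));  // prints 1
    }
}
```

For non-negative operands the two agree, so the mapping is only a concern for columns that can hold negative values.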
  207. 207. Character Function
      Oracle      | Hive
      SUBSTR      | SUBSTR
      TRIM        | TRIM
      LPAD/RPAD   | LPAD/RPAD
      LTRIM/RTRIM | LTRIM/RTRIM
      REPLACE     | REGEXP_REPLACE
  214. 214. NULL Function (Oracle -> Hive): COALESCE -> COALESCE; NVL -> Custom UDF; NVL2 -> Custom UDF
  215. 215. Custom UDF Function • Condition Function: DECODE • Null Comparison Function: NVL / NVL2 • Type Conversion: TO_NUMBER, TO_CHAR, TO_DATE
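The slides do not show how the custom UDFs were implemented. As a rough guide to what such UDFs have to reproduce, here is a minimal Python sketch of the Oracle semantics of NVL, NVL2 and DECODE, with None standing in for NULL. The function names are illustrative only and this is not the NexR code.

```python
from typing import Any


def nvl(value: Any, default: Any) -> Any:
    """NVL(a, b): b when a is NULL, otherwise a."""
    return default if value is None else value


def nvl2(value: Any, if_not_null: Any, if_null: Any) -> Any:
    """NVL2(a, b, c): b when a is not NULL, otherwise c."""
    return if_null if value is None else if_not_null


def decode(expr: Any, *args: Any) -> Any:
    """DECODE(expr, search1, result1, ..., [default]).

    Returns the result paired with the first matching search value,
    else the optional trailing default, else None (NULL).
    """
    pairs, default = args, None
    if len(args) % 2 == 1:
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        # None == None is True in Python, which mirrors DECODE
        # treating two NULLs as a match.
        if expr == search:
            return result
    return default


if __name__ == "__main__":
    print(nvl(None, 0), nvl2(None, "Y", "N"), decode("B", "A", 1, "B", 2, 0))
```

DECODE matching NULL against NULL is the detail most easily lost when it is rewritten as a plain CASE expression, since CASE comparisons with NULL do not match.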
  216. 216. Oracle Analytic Function
  220. 220. Analytic Function, order of evaluation: (1) joins and the WHERE and GROUP BY clauses are performed; (2) the analytic functions are performed on the result set; (3) the ORDER BY clause is processed
  221. 221. Analytic Function: Rank salary in dept
       name  dept        salary
       ----  ----------  ------
       a     Research    100
       b     Research    100
       c     Sales       200
       d     Sales       300
       e     Research    50
       f     Accounting  200
       g     Accounting  300
       h     Accounting  400
       i     Research    10
  226. 226. Analytic Function DISTRIBUTE BY dept: the nine rows are read by three Map tasks, and each row is then routed to one of two Reduce tasks according to its dept value
  227. 227. Analytic Function DISTRIBUTE BY dept: after the shuffle, one Reduce task holds the Sales and Accounting rows (c, d, f, g, h) and the other holds the Research rows (a, b, e, i)
  228. 228. Analytic Function SORT BY dept, salary: each Reduce task sorts its own rows. Reduce 1: f Accounting 200, g Accounting 300, h Accounting 400, c Sales 200, d Sales 300. Reduce 2: i Research 10, e Research 50, a Research 100, b Research 100
  230. 230. Analytic Function RANK(dept, salary): the rank restarts at 1 whenever dept changes. Reduce 1: f Accounting 200 -> 1, g Accounting 300 -> 2, h Accounting 400 -> 3, c Sales 200 -> 1, d Sales 300 -> 2. Reduce 2: i Research 10 -> 1, e Research 50 -> 2, a Research 100 -> 3, b Research 100 -> 3
  235. 235. Analytic Function RANK
       Oracle: SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM emp
       Hive, with RANK(arg1, arg2) as a custom UDF: SELECT e.name, e.dept, e.salary, RANK(e.dept, e.salary) FROM (SELECT name, dept, salary FROM emp DISTRIBUTE BY dept SORT BY dept, salary DESC) e
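The Hive rewrite works because DISTRIBUTE BY places every row of a department on the same reducer and SORT BY orders them there, so a stateful function can assign ranks in a single pass. The Python sketch below simulates that pipeline on the example table. It is an illustration under assumptions, not the NexR UDF: the toy partitioner, the function names, and the tie handling (equal salaries share a rank, as Oracle RANK does) are all assumed.

```python
from collections import defaultdict

# The emp table from the slides: (name, dept, salary).
EMP = [
    ("a", "Research", 100), ("b", "Research", 100), ("c", "Sales", 200),
    ("d", "Sales", 300), ("e", "Research", 50), ("f", "Accounting", 200),
    ("g", "Accounting", 300), ("h", "Accounting", 400), ("i", "Research", 10),
]


def distribute_by_dept(rows, num_reducers=2):
    """DISTRIBUTE BY dept: all rows of one dept land on the same reducer."""
    buckets = defaultdict(list)
    for row in rows:
        # Toy deterministic partitioner; Hadoop hashes the key instead.
        buckets[sum(map(ord, row[1])) % num_reducers].append(row)
    return list(buckets.values())


def rank_rows(sorted_rows):
    """Stateful RANK(dept, salary): restart per dept, share a rank on ties."""
    out = []
    prev_dept, prev_salary, rank, seen = None, None, 0, 0
    for name, dept, salary in sorted_rows:
        if dept != prev_dept:
            seen, rank = 1, 1          # new partition: restart at 1
        else:
            seen += 1
            if salary != prev_salary:  # ties keep the previous rank
                rank = seen
        out.append((name, dept, salary, rank))
        prev_dept, prev_salary = dept, salary
    return out


def rank_by_dept(rows):
    result = []
    for bucket in distribute_by_dept(rows):
        # SORT BY dept, salary DESC inside each reducer.
        bucket.sort(key=lambda r: (r[1], -r[2]))
        result.extend(rank_rows(bucket))
    return result


if __name__ == "__main__":
    for row in rank_by_dept(EMP):
        print(row)
```

The ranking step only compares each row with the previous one, which is why the subquery must sort by dept before salary: without that ordering, rows of the same department would not be adjacent on the reducer.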
  236. 236. Hive Optimization & Future Work
  244. 244. Tuning Parameter • Hadoop Tuning • mapred.job.reuse.jvm.num.tasks • mapred.child.java.opts • mapred.min.split.size / mapred.max.split.size • dfs.block.size • Hive Tuning • hive.input.format = CombineHiveInputFormat
  245. 245. UDF/UDAF • Develop UDFs to reduce the number of MR jobs • Extend GenericUDF to avoid Java reflection • Avoid creating new objects in a UDF
  249. 249. Future Work • HiveQL SQL Compliance • HIVE-282 - IN statement for WHERE clauses • HIVE-192 - Add TIMESTAMP column type • HIVE-1269 - Support Date/Datetime/Time/Timestamp Primitive Types • Analytic Function • HIVE-896 - Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive • HIVE-952 - Support analytic NTILE function • Optimization • HIVE-1694 - Accelerate GROUP BY execution using indexes • HIVE-482 - Optimize Group By + Order By with the same keys
  252. 252. Hive: a system for managing and querying structured data built on top of Hadoop • Oracle 2 Hive: data model, ANSI-SQL, built-in function / custom UDF, analytic function
  253. 253. Question ?