SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive

Telecom companies currently store their data in databases or data warehouses, process it through ETL, and run statistics and analysis with OLAP tools or data mining engines. However, with the data explosion that has accompanied the spread of smartphones, traditional data stores such as DBs and DWs can no longer cope with this “Big Data”. As an alternative, storing data in Hadoop and running ETL and ad hoc queries with Hive is being introduced, and China Mobile is cited as the most representative example. So far, however, this approach has been adopted mainly by new projects, which face low barriers to applying the new Hive data model and HQL. By contrast, replacing an existing database with the combination of Hadoop and Hive is extremely difficult when a large number of tables and SQL queries already exist. NexR is migrating a telecom company’s data from Oracle DB to Hadoop and converting a large number of existing Oracle SQL queries to Hive HQL. Although HQL supports a syntax similar to ANSI SQL, it lacks a large portion of the basic functions and hardly supports Oracle analytic functions such as rank(), which are used mainly in statistical analysis. Differences in data types, such as the handling of null values, are a further obstacle. In this presentation, we share our experience converting Oracle SQL to Hive HQL and developing additional functions with MapReduce. We also introduce several ideas and trials for improving Hive performance.

http://sdec.kr/

SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive: Presentation Transcript

  • SDEC 2011 Seoul Data Engineering Camp June 27-28 Seoul, South Korea
  • Replacing Legacy Telco DB/DW to Hadoop and Hive JunHo Cho NexR
  • Agenda • Motivation for Hive and Hadoop • Hive Internal • Oracle Migration UseCase • Hive Optimization • Future Work
  • Telco Data
  • Divide & Conquer
  • OpenSource • Storage & Computing • Collection • Search • Analysis • Coordination
  • What is HIVE ? • A system for managing and querying structured data built on top of Hadoop • Map-Reduce for execution • HDFS for storage • Metadata in an RDBMS • Key Building Principles • SQL is a familiar language • Extensibility - Types, Functions, Formats, Scripts • Performance
  • Why Hive ?
  • Count call-record per phone ?
  • MapReduce version: Mapper, Reducer and Driver (old org.apache.hadoop.mapred API; each public class goes in its own file)

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    // Mapper: emits (key, 1) for every call record
    public class CallCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final IntWritable one = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        if (itr.hasMoreTokens()) {       // skip empty lines
          word.set(itr.nextToken());     // first token of the record is used as the key
          output.collect(word, one);
        }
      }
    }

    // Reducer: sums the counts collected for each key
    public class CallCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();    // process value
        }
        output.collect(key, new IntWritable(sum));
      }
    }

    // Driver: configures and submits the job
    public class CallCount {
      public static void main(String[] args) {
        JobConf conf = new JobConf(CallCount.class);
        // specify output types
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // specify input and output dirs
        FileInputFormat.addInputPath(conf, new Path("input"));
        FileOutputFormat.setOutputPath(conf, new Path("output"));
        // specify a mapper, a reducer and a combiner
        conf.setMapperClass(CallCountMapper.class);
        conf.setReducerClass(CallCountReducer.class);
        conf.setCombinerClass(CallCountReducer.class);
        try {
          JobClient.runJob(conf);
        } catch (Exception e) {
          e.printStackTrace();
        }
      }
    }
  • SELECT pnum, count(pnum) FROM cdr GROUP BY pnum;
  • History of Hive • The Hive development cycle is fast and the developer community is growing rapidly • The product release cycle is accelerating • Timeline: project started 03/08 • 0.3.0 04/09 • 0.4.0 12/09 • 0.5.0 02/10 • 0.6.0 10/10 • 0.7.0 03/11 • 0.7.1 06/11
  • Who uses Hive? http://wiki.apache.org/hadoop/Hive/PoweredBy
  • UseCase in Hive? • Report and ad hoc query • Log Analysis • Social Graph Analysis • Data mining and analysis • Machine Learning • Dataset cleaning • Data Warehouse
  • Hive Architecture (diagram) • Components: UI, Driver, Compiler, MetaStore, Execution Engine, Hadoop • Other labels: DDL, HQL, ORM, Works, Result • Example query: select col1 from tab1 where ... • Example result rows: a 123344, b 121211, c 342434
  • Hive Internal (diagram) • Clients: Web UI, Hive CLI, JDBC (Browse, Query, DDL) • MetaStore, Thrift API • Hive QL: Parser, Plan, Optimizer, Task • Map Reduce: ExecMapper/ExecReducer, TSOperator, SELOperator, FSOperator, User Script, UDF/UDAF (substr, sum, average) • SerDe, Input/OutputFormat, StorageHandler • HDFS, RCFile, DB, ..., HBase
  • Parser • Query: Select col1,col2 From tab1 Where col3 > 5 • AST: TOK_QUERY( TOK_FROM( TOK_TABNAME tab1 ), TOK_INSERT( TOK_DESTINATION( TOK_DIR TOK_TMP_FILE ), TOK_SELECT( TOK_SELEXPR( TOK_TABLE_OR_COL col1 ), TOK_SELEXPR( TOK_TABLE_OR_COL col2 ) ), TOK_WHERE( >( TOK_TABLE_OR_COL, 5 ) ) ) ) • The QB is filled in while the AST is walked: tab1, insclause-0, col1, col2
  • Plan • Query: Select col1,col2 From tab1 Where col3 > 5 • The QB is turned into an operator tree clause by clause: TOK_FROM → TableScanOperator • TOK_WHERE → FilterOperator • TOK_SELECT → SelectOperator • TOK_DESTINATION → FileSinkOperator
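
The clause-to-operator mapping on the Plan slide can be pictured as a chain of row-processing steps. The following is a minimal, self-contained Java sketch and not Hive's actual operator classes: the class name `OperatorChainSketch`, the methods `row` and `run`, and the map-based row layout are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the plan for: SELECT col1, col2 FROM tab1 WHERE col3 > 5
// (hypothetical names; Hive's real operators are far more general).
final class OperatorChainSketch {

    // Builds one row of tab1 with three integer columns.
    static Map<String, Integer> row(int col1, int col2, int col3) {
        Map<String, Integer> r = new LinkedHashMap<>();
        r.put("col1", col1);
        r.put("col2", col2);
        r.put("col3", col3);
        return r;
    }

    // Runs scan -> filter -> select -> sink over an in-memory table.
    static List<List<Integer>> run(List<Map<String, Integer>> tab1) {
        List<List<Integer>> sink = new ArrayList<>();      // stands in for FileSinkOperator
        for (Map<String, Integer> r : tab1) {              // TableScanOperator: emit every row
            if (r.get("col3") > 5) {                       // FilterOperator: WHERE col3 > 5
                // SelectOperator: keep only col1 and col2
                sink.add(Arrays.asList(r.get("col1"), r.get("col2")));
            }
        }
        return sink;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> tab1 =
            Arrays.asList(row(1, 10, 3), row(2, 20, 7), row(3, 30, 9));
        System.out.println(run(tab1)); // prints [[2, 20], [3, 30]]
    }
}
```

In this sketch each operator is just one line of the loop body; the point is only the order in which a row passes through scan, filter, select and sink.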
  • Optimizer • Query: Select col1,col2 From tab1 Where col3 > 5 • tab1 {col1, col2, col3, col4, col5, col6, col7} • Operator tree: TableScanOperator (TS), FilterOperator (FIL), SelectOperator (SEL), FileSinkOperator • ColumnPruner walks the operators with a Context and records the columns each one needs: SEL: col1, col2 • FIL: col1, col2, col3 • TS: col1, col2, col3
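
The column-pruning walk shown on the Optimizer slides can be sketched as follows. This is an illustrative reconstruction, not Hive's ColumnPruner code: `ColumnPrunerSketch` and `neededColumns` are hypothetical names, and the sketch assumes a simple linear TS, FIL, SEL chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of column pruning for a linear TS -> FIL -> SEL chain.
final class ColumnPrunerSketch {

    // Walks from the select side back to the table scan and returns the
    // columns needed at SEL, FIL and TS, in that order.
    static List<Set<String>> neededColumns(List<String> selectCols, List<String> filterCols) {
        Set<String> atSelect = new LinkedHashSet<>(selectCols); // SEL: the columns it outputs
        Set<String> atFilter = new LinkedHashSet<>(atSelect);   // FIL: everything SEL needs...
        atFilter.addAll(filterCols);                            // ...plus the predicate columns
        Set<String> atScan = new LinkedHashSet<>(atFilter);     // TS: read only these columns
        List<Set<String>> result = new ArrayList<>();
        result.add(atSelect);
        result.add(atFilter);
        result.add(atScan);
        return result;
    }

    public static void main(String[] args) {
        // SELECT col1, col2 FROM tab1 WHERE col3 > 5
        System.out.println(neededColumns(Arrays.asList("col1", "col2"), Arrays.asList("col3")));
        // prints [[col1, col2], [col1, col2, col3], [col1, col2, col3]]
    }
}
```

With tab1 holding seven columns, the scan in this example only has to read three of them, which is the saving the slides illustrate.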
  • Task • Query: Select col1,col2 From tab1 Where col3 > 5 • Rules: TS - GenMRTableScan1, FS - GenMRFileSink1 • TaskFactory, QB • Tasks shown: MapRedTask, FetchTask • Operators shown: TableScanOperator, FilterOperator, FilterOperator, SelectOperator, FileSinkOperator
  • Hive Internal (diagram, repeated with the example query's operators in the Map Reduce box: TSOperator, two FILOperators, SELOperator, FSOperator)
  • Oracle Migration to Hive
  • l l l l l
  • l l l l l l l l l l
  • l l l l l l l l l l
  • Oracle SQL
  • Data Model (Hive Entity: Sample, HDFS LOC) • Table: Log, /hive/Log • Partition: time=hour, /hive/Log/time=1h • Bucket: phone-num, /wh/Log/time=1h/part-$hash(phone-num) • External Table: customer, /app/meta/dir (arbitrary location)
  • Data Model (diagram) • MetaStore DB: Table (Data Location, Partitioning Info, Bucketing Info) • HDFS: Table /hive/Log • Partition /hive/Log/time=1h • Bucket /hive/Log/time=1h/part-0001
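
The bucket location `part-$hash(phone-num)` suggests that a row's bucket file is derived from a hash of the bucketing column. A rough Java sketch of that idea follows. It is an assumption-laden illustration: `BucketPathSketch`, `bucketFor` and `bucketPath` are hypothetical names, Java's `String.hashCode` stands in for whatever hash Hive applies, and the four-digit file suffix is only modeled on the `part-0001` label above.

```java
// Hypothetical sketch: mapping a bucketing key to a bucket file under a partition.
final class BucketPathSketch {

    // Bucket index: non-negative hash of the key, modulo the number of buckets.
    // String.hashCode() is a stand-in for the hash function Hive really uses.
    static int bucketFor(String phoneNum, int numBuckets) {
        return (phoneNum.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    // File naming assumed for illustration, modeled on /hive/Log/time=1h/part-0001.
    static String bucketPath(String partitionDir, String phoneNum, int numBuckets) {
        return String.format("%s/part-%04d", partitionDir, bucketFor(phoneNum, numBuckets));
    }

    public static void main(String[] args) {
        // The same phone number always lands in the same bucket file.
        System.out.println(bucketPath("/hive/Log/time=1h", "010-1234-5678", 32));
    }
}
```

Because the bucket depends only on the key, all records for one phone number within a partition end up in one file, which is what makes bucketing useful for sampling and joins on that column.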
  • Column Data Types • Primitive Types • int type : tinyint, smallint, int, bigint • boolean, float, double, string • Nest-able Collections • array : value(any-type) • map : key(primitive) and value(any-type) • User-defined types • structures with attributes
  • DataType Convert (Oracle → Hive) • NUMBER(n) → TINYINT / INT / BIGINT • NUMBER(n,m) → FLOAT / DOUBLE • VARCHAR2 → STRING • DATE → STRING in “yyyy-MM-dd HH:mm:ss” format
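
Storing DATE values as strings means every exported value has to be rendered in the “yyyy-MM-dd HH:mm:ss” pattern named above. A small Java sketch of that conversion follows; `DateToHiveString` and its methods are hypothetical helper names, and representing the Oracle value as a `LocalDateTime` is an assumption about the export path.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for the DATE -> STRING convention used in the migration.
final class DateToHiveString {

    // The pattern from the slide: date and time down to the second.
    private static final DateTimeFormatter HIVE_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Renders a timestamp as the string stored in the Hive STRING column.
    static String toHiveString(LocalDateTime value) {
        return value.format(HIVE_FORMAT);
    }

    // Parses a stored string back into a timestamp.
    static LocalDateTime fromHiveString(String stored) {
        return LocalDateTime.parse(stored, HIVE_FORMAT);
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.of(2011, 6, 27, 9, 5, 0);
        System.out.println(toHiveString(t)); // prints 2011-06-27 09:05:00
    }
}
```

A fixed-width, zero-padded pattern like this one has the useful property that string comparison orders values the same way as date comparison.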
  • Oracle DML • Hive supports a syntax similar to ANSI SQL • Sub-queries in the FROM clause • Join queries: equi-join/inner join, outer join
  • Range Operator: BETWEEN ~ AND ~ • Oracle: SELECT * FROM Employee WHERE salary BETWEEN 100 AND 500; • Hive: SELECT * FROM Employee WHERE salary >= 100 AND salary <= 500; • Hive alternative: SELECT * FROM Employee WHERE BETWEEN(salary,100,500);
  • IN / EXISTS Clause: IN / EXISTS SubQuery • Oracle: SELECT * FROM Employee e WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d) • Oracle: SELECT * FROM Employee e WHERE EXISTS (SELECT 1 FROM Dept d WHERE e.DeptNo=d.DeptNo) • Hive: SELECT * FROM Employee e LEFT SEMI JOIN Dept d ON (e.DeptNo=d.DeptNo)
  • NOT IN Clause: NOT IN SubQuery • Oracle: SELECT * FROM Employee e WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d) • Hive: SELECT e.* FROM Employee e LEFT OUTER JOIN Dept d ON (e.DeptNo=d.DeptNo) WHERE d.DeptNo IS NULL
  • NOT EXISTS Clause: NOT EXISTS SubQuery • Oracle: SELECT * FROM Employee e WHERE NOT EXISTS (SELECT 1 FROM Dept d WHERE e.DeptNo=d.DeptNo) • Hive: SELECT e.* FROM Employee e LEFT OUTER JOIN Dept d ON (e.DeptNo=d.DeptNo) WHERE d.DeptNo IS NULL
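
The LEFT OUTER JOIN ... IS NULL rewrite works because an employee row with no matching Dept row comes out of the outer join with d.DeptNo set to NULL. The Java sketch below mimics that anti-join on in-memory data; `AntiJoinSketch`, `Employee` and `withoutDept` are hypothetical names. It assumes no NULL DeptNo values on either side: with NULLs present, Oracle's NOT IN can return different rows than the join rewrite, so that case would need separate handling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the anti-join behind NOT IN / NOT EXISTS.
final class AntiJoinSketch {

    static final class Employee {
        final String name;
        final int deptNo;
        Employee(String name, int deptNo) {
            this.name = name;
            this.deptNo = deptNo;
        }
    }

    // Names of employees whose DeptNo has no match in Dept, i.e. the rows
    // for which the outer join would leave d.DeptNo NULL.
    static List<String> withoutDept(List<Employee> employees, List<Integer> deptNos) {
        Set<Integer> known = new HashSet<>(deptNos);
        List<String> result = new ArrayList<>();
        for (Employee e : employees) {
            if (!known.contains(e.deptNo)) {
                result.add(e.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Employee> employees = Arrays.asList(
            new Employee("alice", 10), new Employee("bob", 20), new Employee("carol", 99));
        System.out.println(withoutDept(employees, Arrays.asList(10, 20, 30))); // prints [carol]
    }
}
```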
  • LIKE Clause: LIKE / NOT LIKE • Oracle: SELECT * FROM Employee e WHERE name LIKE '%steve' • Oracle: SELECT * FROM Employee e WHERE name NOT LIKE '%steve' • Hive: SELECT e.* FROM Employee e WHERE name LIKE '%steve' • Hive: SELECT e.* FROM Employee e WHERE NOT name LIKE '%steve'
  • JOIN Operator (1/4): SELF JOIN • Oracle: SELECT * FROM Employee e1, Employee e2 WHERE e1.ID = e2.Id • Hive: SELECT * FROM Employee e1 JOIN Employee e2 ON (e1.ID = e2.Id)
  • JOIN Operator (2/4): CROSS JOIN (Cartesian Product) • Oracle: SELECT emp.Name, dept.Name FROM Employee emp, Dept dept • Hive: SELECT emp.Name, dept.Name FROM Employee emp JOIN Dept dept
  • JOIN Operator (3/4): LEFT OUTER JOIN • Oracle: SELECT * FROM Emp, Dept WHERE Emp.deptNo = Dept.deptNo(+) • Hive: SELECT * FROM Emp LEFT OUTER JOIN Dept ON Emp.deptNo = Dept.deptNo
  • JOIN Operator (4/4): RIGHT OUTER JOIN • Oracle: SELECT * FROM Emp, Dept WHERE Emp.deptNo(+) = Dept.deptNo • Hive: SELECT * FROM Emp RIGHT OUTER JOIN Dept ON Emp.deptNo = Dept.deptNo
  • Oracle Function
  • Condition Function: CASE • Oracle: CASE expr WHEN cond1 THEN r1 [WHEN cond2 THEN r2]* [ELSE r] END • Hive: CASE expr WHEN cond1 THEN r1 [WHEN cond2 THEN r2]* [ELSE r] END
  • Math Function (Oracle → Hive) • ROUND → ROUND • CEIL → CEIL/CEILING • MOD → PMOD • POWER → POW/POWER • SQRT → SQRT • SIN/COS → SIN/COS
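
The MOD to PMOD mapping deserves a check when operands can be negative: a truncating remainder takes the sign of the dividend (Oracle documents MOD(-11,4) as -3), while a positive modulus never returns a negative value for a positive divisor. The Java sketch below contrasts the two; `ModVersusPmod` is a hypothetical name, Java's `%` operator is used as a stand-in for the truncating behavior, and the `pmod` formula is the usual ((a % b) + b) % b construction rather than code taken from Hive.

```java
// Hypothetical sketch contrasting a truncating remainder with a positive modulus.
final class ModVersusPmod {

    // Truncating remainder: the result takes the sign of the dividend.
    static int mod(int a, int b) {
        return a % b;
    }

    // Positive modulus: shifts a negative remainder into the range [0, b) for b > 0.
    static int pmod(int a, int b) {
        return ((a % b) + b) % b;
    }

    public static void main(String[] args) {
        System.out.println(mod(7, 3) + " " + pmod(7, 3));   // prints 1 1
        System.out.println(mod(-7, 3) + " " + pmod(-7, 3)); // prints -1 2
    }
}
```

For non-negative operands the two agree, so the substitution is safe for typical telco counters and identifiers; queries that apply MOD to values that may be negative would need a closer look.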
  • Character Function (Oracle → Hive) • SUBSTR → SUBSTR • TRIM → TRIM • LPAD/RPAD → LPAD/RPAD • LTRIM/RTRIM → LTRIM/RTRIM • REPLACE → REGEXP_REPLACE
  • NULL Function (Oracle → Hive) • COALESCE → COALESCE • NVL → Custom UDF • NVL2 → Custom UDF
  • Custom UDF Function • Condition Function • DECODE • Null Comparison Function • NVL / NVL2 • Type Conversion • TO_NUMBER • TO_CHAR • TO_DATE
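
To make the custom functions listed above concrete, here is a plain-Java sketch of the semantics of NVL, NVL2 and DECODE. It is not the presenters' implementation: `OracleCompatFunctions` is a hypothetical name, and the Hive wiring (packaging each function as a UDF class and registering it) is deliberately left out so the logic stays self-contained.

```java
import java.util.Objects;

// Hypothetical sketch of the semantics of three Oracle functions.
final class OracleCompatFunctions {

    // NVL(value, fallback): fallback when value is NULL.
    static <T> T nvl(T value, T fallback) {
        return value != null ? value : fallback;
    }

    // NVL2(value, ifNotNull, ifNull): picks a result based on NULL-ness only.
    static <T> T nvl2(Object value, T ifNotNull, T ifNull) {
        return value != null ? ifNotNull : ifNull;
    }

    // DECODE(expr, search1, result1, ..., [default]): first matching search wins.
    static Object decode(Object expr, Object... args) {
        int i = 0;
        for (; i + 1 < args.length; i += 2) {
            // Objects.equals lets NULL match NULL, as Oracle's DECODE does.
            if (Objects.equals(expr, args[i])) {
                return args[i + 1];
            }
        }
        // A trailing unpaired argument is the default; otherwise the result is NULL.
        return i < args.length ? args[i] : null;
    }

    public static void main(String[] args) {
        System.out.println(nvl(null, "n/a"));                  // prints n/a
        System.out.println(nvl2("x", "set", "unset"));         // prints set
        System.out.println(decode("B", "A", 1, "B", 2, 0));    // prints 2
    }
}
```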
  • Oracle Analytic Function
  • Analytic Function • Processing order: (1) joins, WHERE and GROUP BY clauses are performed • (2) the analytic functions are applied to the result set • (3) the ORDER BY clause is processed
  • Analytic Function Rank salary in dept name dept salary --------------------- a Research 100 b Research 100 c Sales 200 d Sales 300 e Research 50 f Accounting 200 g Accounting 300 h Accounting 400 i Research 10
  • Analytic Function: RANK on MapReduce (diagram)
      • Map tasks read the input rows
      • DISTRIBUTE BY dept: all rows of one dept are sent to the same Reduce task
      • SORT BY dept, salary: each Reduce task sorts the rows it receives
      • RANK(dept, salary): the rank restarts at 1 for each dept, e.g. Sales c 200 → 1, d 300 → 2; Accounting f 200 → 1, g 300 → 2, h 400 → 3
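The distribute and sort stages can be simulated in a few lines of Python using the emp rows from the slides. This is only a sketch: the CRC32-based partitioner, the reducer count of 2 and the helper names `distribute_by` and `sort_by` are assumptions made for illustration, not Hive's actual partitioning code.

```python
import zlib
from collections import defaultdict

# (name, dept, salary) rows from the example table in the slides.
EMP = [
    ("a", "Research", 100), ("b", "Research", 100), ("c", "Sales", 200),
    ("d", "Sales", 300), ("e", "Research", 50), ("f", "Accounting", 200),
    ("g", "Accounting", 300), ("h", "Accounting", 400), ("i", "Research", 10),
]


def distribute_by(rows, key, num_reducers):
    """Send every row with the same key to the same reducer bucket."""
    buckets = defaultdict(list)
    for row in rows:
        reducer = zlib.crc32(key(row).encode("utf-8")) % num_reducers
        buckets[reducer].append(row)
    return buckets


def sort_by(buckets, key):
    """Sort rows inside each reducer; there is no global order across reducers."""
    return {reducer: sorted(rows, key=key) for reducer, rows in buckets.items()}


reducers = sort_by(
    distribute_by(EMP, key=lambda row: row[1], num_reducers=2),
    key=lambda row: (row[1], row[2]),
)

if __name__ == "__main__":
    for reducer, rows in sorted(reducers.items()):
        print(reducer, rows)
```

Because each dept lands on exactly one reducer and arrives sorted, a function that only looks at the current and previous row can assign ranks in a single pass.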
  • Analytic Function: RANK
      Oracle:
        SELECT name, dept, salary,
               RANK() OVER (PARTITION BY dept ORDER BY salary DESC)
        FROM emp
      Hive, with RANK(arg1, arg2) as a custom UDF:
        SELECT e.name, e.dept, e.salary, RANK(e.dept, e.salary)
        FROM (SELECT name, dept, salary FROM emp
              DISTRIBUTE BY dept SORT BY dept, salary DESC) e
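A custom RANK(arg1, arg2) of this kind has to keep state between calls, since it sees one row at a time. The Python sketch below shows one way such a function could work on rows that are already grouped by dept and sorted by salary descending. The slides do not show the UDF's source, so the class and its tie handling (ties share a rank, as in Oracle's RANK()) are assumptions.

```python
class Rank:
    """Stateful stand-in for RANK(dept, salary); expects rows pre-sorted by key."""

    def __init__(self):
        self.last_key = None
        self.last_value = None
        self.seen = 0   # rows seen so far in the current partition
        self.rank = 0   # rank handed out to the previous row

    def evaluate(self, key, value):
        if key != self.last_key:
            # New partition: restart the counters.
            self.last_key = key
            self.last_value = None
            self.seen = 0
        self.seen += 1
        if self.seen == 1 or value != self.last_value:
            # A new value takes its row position as rank; a tie keeps the old rank.
            self.rank = self.seen
            self.last_value = value
        return self.rank


EMP = [
    ("a", "Research", 100), ("b", "Research", 100), ("c", "Sales", 200),
    ("d", "Sales", 300), ("e", "Research", 50), ("f", "Accounting", 200),
    ("g", "Accounting", 300), ("h", "Accounting", 400), ("i", "Research", 10),
]

# Equivalent of DISTRIBUTE BY dept SORT BY dept, salary DESC on a single reducer.
sorted_rows = sorted(EMP, key=lambda row: (row[1], -row[2]))
rank = Rank()
ranked = [(n, d, s, rank.evaluate(d, s)) for n, d, s in sorted_rows]

if __name__ == "__main__":
    for row in ranked:
        print(row)
```

The result is only correct if the rows really arrive partitioned and sorted, which is why the Hive query wraps the table in a DISTRIBUTE BY / SORT BY subquery.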
  • Hive Optimization & Future Work
  • Tuning Parameter
      • Hadoop Tuning
          • mapred.job.reuse.jvm.num.tasks
          • mapred.child.java.opts
          • mapred.min.split.size / mapred.max.split.size
          • dfs.block.size
      • Hive Tuning
          • hive.input.format = CombineHiveInputFormat
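The split-size and block-size parameters matter because together they decide roughly how many map tasks a job gets. The Python sketch below works through the arithmetic using the rule split = max(min split, min(max split, block size)), which is how I understand Hadoop's newer FileInputFormat to compute it. It ignores the small slack Hadoop allows on the last split and the effect of CombineHiveInputFormat, and the function names and file sizes are made up for the example.

```python
MB = 1024 * 1024
NO_LIMIT = 2**63 - 1  # stand-in for an unset maximum split size


def split_size(block_size: int, min_split: int, max_split: int) -> int:
    """Split size derived from dfs.block.size and the min/max split settings."""
    return max(min_split, min(max_split, block_size))


def approx_num_splits(file_size: int, size: int) -> int:
    """Ceiling division: roughly one map task per split."""
    return -(-file_size // size)


file_size = 1024 * MB  # a hypothetical 1 GB input file

default_splits = approx_num_splits(file_size, split_size(64 * MB, 1, NO_LIMIT))
fewer_maps = approx_num_splits(file_size, split_size(64 * MB, 256 * MB, NO_LIMIT))
more_maps = approx_num_splits(file_size, split_size(64 * MB, 1, 32 * MB))

if __name__ == "__main__":
    print(default_splits, fewer_maps, more_maps)
```

Raising the minimum split size gives fewer, larger map tasks, while lowering the maximum gives more, smaller ones.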
  • UDF/UDAF
      • Develop UDFs to optimize the number of MR jobs
      • Extend GenericUDF to avoid Java reflection
      • Avoid creating new objects in UDFs
  • Future Work
      • HiveQL SQL Compliance
          • HIVE-282 - IN statement for WHERE clauses
          • HIVE-192 - Add TIMESTAMP column type
          • HIVE-1269 - Support Date/Datetime/Time/Timestamp Primitive Types
      • Analytic Function
          • HIVE-896 - Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive
          • HIVE-952 - Support analytic NTILE function
      • Optimization
          • HIVE-1694 - Accelerate GROUP BY execution using indexes
          • HIVE-482 - Optimize Group By + Order By with the same keys
  • Hive: a system for managing and querying structured data built on top of Hadoop
      • Oracle 2 Hive: data model, ANSI-SQL, built-in function / custom UDF, analytic function
  • Questions?