SDEC 2011               Seoul Data Engineering Camp                                             June 27-28                ...
Replacing Legacy Telco DB/DW                    to Hadoop and Hive                        JunHo Cho                       ...
Agenda           • Motivation for Hive and Hadoop           • Hive Internal           • Oracle Migration UseCase          ...
Telco Data
Divide & Conquer
OpenSource     Storage & Computing, Collection, Search, Analysis, Coordination
What is HIVE ?          •    A system for managing and querying structured data               built on top of Hadoop      ...
Why Hive ?
Count call-record per phone ?
public class CallCountMapper extends MapReduceBase                            public class CallCount {    implements Mappe...
SELECT pnum, count(pnum)               FROM cdr               GROUP BY pnum;
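To make the contrast concrete, here is a minimal plain-Java sketch (no Hadoop dependencies) of what both the MapReduce job and the one-line Hive query compute: the number of call records per phone number. The class `CallCountSketch`, the method `countCalls`, and the sample records are hypothetical. The sketch assumes, as the mapper on the preceding slides does, that the phone number is the first whitespace-separated token of each record.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CallCountSketch {
    // Same result, in spirit, as: SELECT pnum, count(pnum) FROM cdr GROUP BY pnum;
    static Map<String, Integer> countCalls(List<String> records) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String record : records) {
            String trimmed = record.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank records
            }
            // "Map" step: take the first token as the grouping key (assumed layout).
            String pnum = trimmed.split("\\s+")[0];
            // "Reduce" step: add 1 to the running count for this key.
            counts.merge(pnum, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical call-detail records: phone number, date, duration.
        List<String> cdr = Arrays.asList(
                "0101111 20110627 35",
                "0102222 20110627 12",
                "0101111 20110628 80");
        countCalls(cdr).forEach((pnum, n) -> System.out.println(pnum + "\t" + n));
    }
}
```

On a cluster the grouping is spread across mappers and reducers, which is what the Java classes on the slides spell out by hand and what Hive generates from the query.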
History of Hive          •     Hive development cycle is fast and the developer                community is growing rapidl...
Who uses Hive?               http://wiki.apache.org/hadoop/Hive/PoweredBy
UseCase in Hive?               • Report and ad hoc query               • Log Analysis               • Social Graph Analysi...
Hive Architecture               UI         Driver               DDL           HQL                                         ...
Hive Architecture               UI         Driver   select col1 from tab1 where ...               DDL           HQL       ...
Hive Architecture                            a 123344                            b 121211                            c 342...
Hive Internal                                                      Map Reduce           Web UI     Hive CLI      JDBC     ...
Parser                 Parser                                             Select col1,col2 From tab1 Where col3 > 5       ...
Hive Internal                                                      Map Reduce           Web UI     Hive CLI      JDBC     ...
Plan                 Plan                        Select col1,col2 From tab1 Where col3 > 5                 QB     TOK_FROM...
Hive Internal                                                      Map Reduce           Web UI     Hive CLI      JDBC     ...
Optimizer             Optimizer   Select col1,col2 From tab1 Where col3 > 5                         TableScanOperator     ...
Optimizer               Optimizer     Select col1,col2 From tab1 Where col3 > 5                             tab1 {col1, co...
Hive Internal                                                      Map Reduce           Web UI     Hive CLI      JDBC     ...
Task                 Task   Select col1,col2 From tab1 Where col3 > 5                                               TS - G...
Task                 Task   Select col1,col2 From tab1 Where col3 > 5                              TaskFactory            ...
Hive Internal                                                          Map Reduce           Web UI     Hive CLI      JDBC ...
Oracle Migration                    to Hive
Oracle SQL
Data Model     Hive Entity        Sample         HDFS LOC               Table      Log              /hive/Log          Par...
Data Model                  MetaStore                             HDFS                     Table               Data Locati...
Column Data Types               • Primitive Types                • int type : tinyint, smallint, int, bigint              ...
DataType Convert          NUMBER(n): TINYINT, INT/BIGINT          NUMBER(n,m): FLOAT/DOUB...
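One way to read the mapping above is as a rule driven by the Oracle column's precision and scale. The sketch below is an illustration under stated assumptions, not the rule used in the migration: the class and method names are hypothetical, and the digit thresholds are my own, chosen so that every value of the given precision fits the Hive integer type (TINYINT up to 2 digits, SMALLINT up to 4, INT up to 9, BIGINT up to 18). Columns with a fractional part are mapped to DOUBLE here, and anything wider than BIGINT falls back to STRING.

```java
final class OracleNumberToHive {
    // Picks a Hive column type for an Oracle NUMBER(precision, scale) column.
    // Thresholds are assumptions of this sketch, not taken from the slides.
    static String hiveType(int precision, int scale) {
        if (precision <= 0) {
            throw new IllegalArgumentException("precision must be positive");
        }
        if (scale > 0) {
            return "DOUBLE";   // fractional values: approximate floating point
        }
        if (precision <= 2) {
            return "TINYINT";  // max 99 fits in 1 byte (-128..127)
        }
        if (precision <= 4) {
            return "SMALLINT"; // max 9,999 fits in 2 bytes
        }
        if (precision <= 9) {
            return "INT";      // max 999,999,999 fits in 4 bytes
        }
        if (precision <= 18) {
            return "BIGINT";   // 18 digits fit in 8 bytes
        }
        return "STRING";       // too wide for BIGINT: keep the digits as text
    }

    public static void main(String[] args) {
        System.out.println(hiveType(1, 0));
        System.out.println(hiveType(12, 0));
        System.out.println(hiveType(10, 2));
    }
}
```

Floating-point types are approximate, so columns that need exact decimal arithmetic (currency, for instance) may call for different handling than this sketch provides.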
Oracle DML               • HIVE supports ANSI-SQL               • Sub-Queries in FROM clause               • Join query : ...
Range Operator     BETWEEN ~ AND ~          SELECT * from Employee WHERE          salary BETWEEN 100 AND 500;          SEL...
IN / EXISTS Clause     IN / EXISTS SubQuery          SELECT * from Employee e WHERE e.DeptNo          IN(SELECT d.DeptNo F...
NOT IN Clause      NOT IN SubQuery          SELECT * from Employee e WHERE e.DeptNo          NOT IN(SELECT                ...
NOT EXISTS Clause     NOT EXISTS SubQuery          SELECT * from Employee e WHERE          NOT EXISTS(SELECT            1 FR...
LIKE Clause     LIKE / NOT LIKE          SELECT * from Employee e WHERE name LIKE '%steve'          SELECT * from E...
JOIN Operator (1/4)          SELF JOIN          SELECT *          FROM       Employee e1, Employee e2   WHERE   e1.ID = e2...
JOIN Operator (2/4)      CROSS JOIN (Cartesian Product)          SELECT emp.Name, dept.Name   FROM   Employee emp, Dept de...
JOIN Operator (3/4)      LEFT OUTER JOIN          SELECT * FROM Emp, Dept WHERE Emp.deptNo = Dept.de...
JOIN Operator (4/4)      RIGHT OUTER JOIN          SELECT * FROM Emp, Dept WHERE Emp.deptNo(+) = D...
Oracle Function
Condition Function      CASE          CASE expr WHEN cond1 THEN r1 ...
Math Function               ROUND              ROUND                CEIL           CEIL/CEILING                MOD        ...
Character Function               SUBSTR          SUBSTR                TRIM            TRIM          LPAD/RPAD          LP...
NULL Function          COALESCE            COALESCE               NVL           Custom UDF               NVL2          Cus...
Custom UDF Function               • Condition Function                • DECODE               • Null Comparison Function   ...
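As an illustration of what such custom UDFs have to reproduce, the following plain-Java sketch shows the Oracle semantics of NVL, NVL2, and DECODE. It deliberately does not use Hive's UDF API; in a real UDF the same logic would sit inside the function's evaluation method. The class and method names are hypothetical.

```java
import java.util.Objects;

final class OracleNullFunctions {
    // NVL(value, fallback): value if it is not null, otherwise fallback.
    static <T> T nvl(T value, T fallback) {
        return value != null ? value : fallback;
    }

    // NVL2(value, ifNotNull, ifNull): picks a result based on whether value is null.
    static <T> T nvl2(Object value, T ifNotNull, T ifNull) {
        return value != null ? ifNotNull : ifNull;
    }

    // DECODE(expr, search1, result1, ..., [default]).
    // Oracle's DECODE treats two nulls as equal, which Objects.equals mirrors.
    static Object decode(Object expr, Object... args) {
        int i = 0;
        for (; i + 1 < args.length; i += 2) {
            if (Objects.equals(expr, args[i])) {
                return args[i + 1];
            }
        }
        // A trailing unpaired argument is the default; otherwise the result is null.
        return i < args.length ? args[i] : null;
    }

    public static void main(String[] args) {
        System.out.println(nvl(null, "unknown"));
        System.out.println(nvl2("x", "set", "missing"));
        System.out.println(decode("M", "M", "Male", "F", "Female", "Other"));
    }
}
```

The null handling is the part that differs most from ordinary equality tests, which is why these functions are grouped with the null comparison functions.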
Oracle Analytic                  Function
Analytic Function          Joins, WHERE, GROUP BY clauses are performed               the analytic functions are performed...
Analytic Function     [Figure: ranking salary within each dept on MapReduce. Input rows (name, dept, salary) pass through Map, DISTRIBUTE BY dept, and SORT BY dept, salary; RANK(dept, salary) is then applied to the ordered rows.]
Analytic Function     RANK     SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM emp ...
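The ranking step described on these slides can be sketched in plain Java. The sketch assumes rows arrive grouped by dept and ordered by salary descending (the effect of distributing by dept and sorting within each group), then assigns Oracle-style RANK in a single pass: tied salaries share a rank, and the rank after a tie skips ahead. `RankSketch`, `Row`, and the sample rows are hypothetical, and the in-memory sort stands in for the MapReduce shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RankSketch {
    static final class Row {
        final String name;
        final String dept;
        final int salary;
        int rank; // filled in by rankByDept

        Row(String name, String dept, int salary) {
            this.name = name;
            this.dept = dept;
            this.salary = salary;
        }
    }

    // Emulates RANK() OVER (PARTITION BY dept ORDER BY salary DESC).
    static List<Row> rankByDept(List<Row> input) {
        List<Row> rows = new ArrayList<>(input);
        // Stand-in for the shuffle: group by dept, highest salary first.
        rows.sort((a, b) -> {
            int byDept = a.dept.compareTo(b.dept);
            return byDept != 0 ? byDept : Integer.compare(b.salary, a.salary);
        });
        String currentDept = null;
        int position = 0;   // 1-based position inside the current dept
        int rank = 0;
        int previousSalary = 0;
        for (Row row : rows) {
            if (!row.dept.equals(currentDept)) {
                currentDept = row.dept; // new partition: restart counting
                position = 0;
            }
            position++;
            if (position == 1 || row.salary != previousSalary) {
                rank = position; // no tie: rank jumps to the row's position
            }
            row.rank = rank;     // tie: keep the previous rank
            previousSalary = row.salary;
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Row> emp = Arrays.asList(
                new Row("a", "Research", 100), new Row("b", "Research", 100),
                new Row("c", "Sales", 200), new Row("d", "Sales", 300));
        for (Row r : rankByDept(emp)) {
            System.out.println(r.name + "\t" + r.dept + "\t" + r.salary + "\t" + r.rank);
        }
    }
}
```

Because the pass only needs the previous row's dept and salary, the same logic can run as a streaming function over sorted reducer input without buffering a whole partition.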
Hive Optimization                & Future Work
Tuning Parameter          • Hadoop Tuning               •   mapred.job.reuse.jvm.num.tasks               •   mapred.child....
UDF/UDAF          • Develop UDF to optimize number of MR jobs          • Extend GenericUDF to avoid java reflection        ...
Future Work      • HiveQL SQL Compliance               •   HIVE-282 - IN statement for WHERE clauses               •   HIV...
Hive               A system for managing and querying               structured data built on top of Hadoop          Oracle...
Question ?
SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive

Telecom companies currently store their data in databases or data warehouses, process it through ETL, and run statistics and analysis with OLAP tools or data mining engines. However, with the data explosion that followed the spread of smartphones, traditional storage such as DBs and DWs can no longer cope with this "Big Data". As an alternative, storing data in Hadoop and running ETL and ad hoc queries with Hive is being introduced, with China Mobile often cited as the most representative example. So far, this approach has been adopted mainly by new projects, which face low barriers to applying the Hive data model and HQL. Replacing an existing database with Hadoop and Hive is extremely difficult when a large number of tables and SQL queries already exist. NexR is migrating a telecom company's data from Oracle DB to Hadoop and converting many existing Oracle SQL queries to Hive HQL. Although HQL's syntax resembles ANSI SQL, it lacks many basic functions and barely supports Oracle analytic functions such as rank(), which are used mainly in statistical analysis. Differences in data types, such as how null values are handled, are a further obstacle to adoption. In this presentation, we share our experience converting Oracle SQL to Hive HQL and developing additional functions with MapReduce. We also introduce several ideas and experiments for improving Hive performance.

http://sdec.kr/

Transcript of "SDEC2011 Replacing legacy Telco DB/DW to Hadoop and Hive"

  1. SDEC 2011 Seoul Data Engineering Camp June 27-28 Seoul, South Korea
  2. Replacing Legacy Telco DB/DW to Hadoop and Hive JunHo Cho NexR
  8. Agenda • Motivation for Hive and Hadoop • Hive Internal • Oracle Migration UseCase • Hive Optimization • Future Work
  9. Telco Data
  17. Divide & Conquer
  18. OpenSource: Storage & Computing, Collection, Search, Analysis, Coordination
  31. What is Hive? • A system for managing and querying structured data built on top of Hadoop • Map-Reduce for execution • HDFS for storage • Metadata in an RDBMS • Key Building Principles • SQL is a familiar language • Extensibility - Types, Functions, Formats, Scripts • Performance
  32. Why Hive?
  33. Count call records per phone?
  39. Mapper, Reducer, and Driver for the call-count job (each public class in its own file):

      import java.io.IOException;
      import java.util.Iterator;
      import java.util.StringTokenizer;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;

      // Mapper: emits (first token of the record, 1) for each call record.
      public class CallCountMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line.toLowerCase());
          if (itr.hasMoreTokens()) { // skip blank lines
            word.set(itr.nextToken());
            output.collect(word, one);
          }
        }
      }

      // Reducer: sums the counts collected for each key.
      public class CallCountReducer extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      // Driver: configures and submits the job.
      public class CallCount {
        public static void main(String[] args) {
          JobConf conf = new JobConf(CallCount.class);
          // specify output types
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          // specify input and output dirs
          FileInputFormat.addInputPath(conf, new Path("input"));
          FileOutputFormat.setOutputPath(conf, new Path("output"));
          // specify mapper, reducer, and combiner
          conf.setMapperClass(CallCountMapper.class);
          conf.setReducerClass(CallCountReducer.class);
          conf.setCombinerClass(CallCountReducer.class);
          try {
            JobClient.runJob(conf);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      }
  40. 40. SELECT pnum, count(pnum) FROM cdr GROUP BY pnum;
  43. 43. History of Hive
        • Hive development cycle is fast and the developer community is growing rapidly
        • Product release cycle is accelerating
        Release timeline:
          Project started   03/08
          0.3.0             04/09
          0.4.0             12/09
          0.5.0             02/10
          0.6.0             10/10
          0.7.0             03/11
          0.7.1             06/11
  44. 44. Who uses Hive? http://wiki.apache.org/hadoop/Hive/PoweredBy
  52. 52. Use Cases for Hive
        • Report and ad hoc query
        • Log Analysis
        • Social Graph Analysis
        • Data mining and analysis
        • Machine Learning
        • Dataset cleaning
        • Data Warehouse
  58. 58. Hive Architecture
        [Diagram. Components: UI, Driver, Compiler, MetaStore (ORM), Execution Engine, Hadoop. Flow labels: DDL, HQL, Works, Result.]
        Example query entered at the UI: select col1 from tab1 where ...
        Example result rows: a 123344, b 121211, c 342434
  60. 60. Hive Internal
        [Diagram. Boxes and labels:]
        • Clients: Web UI, Hive CLI, JDBC (Browse, Query, DDL)
        • MetaStore (Thrift API)
        • Hive QL: Parser, Plan, Optimizer, Task
        • Map Reduce: TSOperator, SELOperator, FSOperator, User Script, UDF/UDAF (substr, sum, average), ExecMapper/ExecReducer, SerDe, Input/OutputFormat
        • Storage: HDFS, StorageHandler, RCFile, DB, ..., HBase
  67. 67. Parser
        Query: Select col1,col2 From tab1 Where col3 > 5
        AST:
          TOK_QUERY
            TOK_FROM
              TOK_TABNAME tab1
            TOK_INSERT
              TOK_DESTINATION
                TOK_DIR
                  TOK_TMP_FILE
              TOK_SELECT
                TOK_SELEXPR
                  TOK_TABLE_OR_COL col1
                TOK_SELEXPR
                  TOK_TABLE_OR_COL col2
              TOK_WHERE
                >
                  TOK_TABLE_OR_COL col3
                  5
        The build steps move a QB (query block) marker through the tree, annotating tab1, insclause-0, col1 and col2.
  78. 78. Plan
        Query: Select col1,col2 From tab1 Where col3 > 5
        Each clause of the QB is turned into an operator:
          TOK_FROM          TableScanOperator
          TOK_WHERE         FilterOperator
          TOK_SELECT        SelectOperator
          TOK_DESTINATION   FileSinkOperator
  93. 93. Optimizer
        Query: Select col1,col2 From tab1 Where col3 > 5
        Table: tab1 {col1, col2, col3, col4, col5, col6, col7}
        Operator tree: TableScanOperator (TS), FilterOperator (FIL), SelectOperator (SEL), FileSinkOperator
        ColumnPruner walks the operator tree with a Context and records the columns each operator needs:
          SEL   col1, col2
          FIL   col1, col2, col3
          TS    col1, col2, col3
        As a result the table scan reads only col1, col2 and col3 out of the seven columns of tab1.
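The column-pruning walk above can be sketched in plain Java. This is a toy model under stated assumptions, not Hive's ColumnPruner implementation: the class `ColumnPruningSketch`, the `Op` type and the method `requiredColumns` are hypothetical names, and the plan is simplified to a linear chain of operators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of column pruning over a linear operator chain (hypothetical, not Hive code). */
class ColumnPruningSketch {

    /** One operator: its short name and the columns it references itself. */
    static final class Op {
        final String name;
        final Set<String> referenced;

        Op(String name, String... cols) {
            this.name = name;
            this.referenced = new LinkedHashSet<>(Arrays.asList(cols));
        }
    }

    /**
     * Walks the plan from the last operator back to the table scan.
     * Returns, in plan order, the set of columns each operator needs:
     * its own references plus everything needed by the operators after it.
     */
    static List<Set<String>> requiredColumns(List<Op> plan) {
        List<Set<String>> required = new ArrayList<>();
        Set<String> needed = new LinkedHashSet<>();
        for (int i = plan.size() - 1; i >= 0; i--) {
            needed.addAll(plan.get(i).referenced);
            required.add(0, new LinkedHashSet<>(needed));
        }
        return required;
    }

    public static void main(String[] args) {
        // Select col1,col2 From tab1 Where col3 > 5
        List<Op> plan = List.of(
            new Op("TS"),
            new Op("FIL", "col3"),
            new Op("SEL", "col1", "col2"),
            new Op("FS"));
        List<Set<String>> required = requiredColumns(plan);
        for (int i = 0; i < plan.size(); i++) {
            System.out.println(plan.get(i).name + " needs " + required.get(i));
        }
    }
}
```

For the example query the table scan ends up needing only col1, col2 and col3, which matches the slide.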
  106. 106. Task
        Query: Select col1,col2 From tab1 Where col3 > 5
        [Diagram. Task generation starts from the QB and walks the operator tree with one rule per operator type:]
          TS   GenMRTableScan1
          FS   GenMRFileSink1
        TaskFactory creates the tasks shown on the slide: MapRedTask and FetchTask.
        Operators shown: TableScanOperator, FilterOperator, FilterOperator, SelectOperator, FileSinkOperator
  108. 108. Hive Internal
        [The same component diagram; the Map Reduce box now lists the operators of the example query: TSOperator, FILOperator, FILOperator, SELOperator, FSOperator, together with User Script and UDF.]
  109. 109. Oracle Migration to Hive
  110. 110. l l l l l
  111. 111. l l l l l l l l l l
  112. 112. l l l l l l l l l l
  113. 113. Oracle SQL
  122. 122. Data Model
        Hive Entity      Sample       HDFS LOC
        Table            Log          /hive/Log
        Partition        time=hour    /hive/Log/time=1h
        Bucket           phone-num    /wh/Log/time=1h/part-$hash(phone-num)
        External Table   customer     /app/meta/dir (arbitrary location)
  123. 123. Data Model
        [Diagram. MetaStore DB on one side, HDFS on the other.]
        MetaStore DB holds: table data location, partitioning info, bucketing info
        HDFS holds:
          Table       /hive/Log
          Partition   /hive/Log/time=1h
          Bucket      /hive/Log/time=1h/part-0001
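The table / partition / bucket layout can be illustrated with a small path calculation. This is a sketch under assumptions: `HiveLayoutSketch` and its methods are hypothetical, the bucket hash is Java's `String.hashCode` (Hive's own hashing may differ), and the `part-NNNN` file name follows the slide rather than any particular Hive version.

```java
import java.util.Locale;

/** Hypothetical helper that mirrors the slide's table/partition/bucket paths. */
class HiveLayoutSketch {

    /** A table is a directory under the warehouse root, e.g. /hive/Log. */
    static String tablePath(String warehouse, String table) {
        return warehouse + "/" + table;
    }

    /** A partition is a key=value subdirectory of the table, e.g. /hive/Log/time=1h. */
    static String partitionPath(String warehouse, String table, String key, String value) {
        return tablePath(warehouse, table) + "/" + key + "=" + value;
    }

    /** Assumed bucketing rule: non-negative hash of the bucket column modulo the bucket count. */
    static int bucketFor(String bucketColumnValue, int numBuckets) {
        return Math.floorMod(bucketColumnValue.hashCode(), numBuckets);
    }

    /** A bucket is a file inside the partition directory, named after the bucket number. */
    static String bucketPath(String warehouse, String table, String key, String value,
                             String bucketColumnValue, int numBuckets) {
        int bucket = bucketFor(bucketColumnValue, numBuckets);
        return partitionPath(warehouse, table, key, value)
            + "/" + String.format(Locale.ROOT, "part-%04d", bucket);
    }

    public static void main(String[] args) {
        System.out.println(tablePath("/hive", "Log"));
        System.out.println(partitionPath("/hive", "Log", "time", "1h"));
        // The phone number below is made-up sample data.
        System.out.println(bucketPath("/hive", "Log", "time", "1h", "010-1234-5678", 32));
    }
}
```

Rows with the same phone number always land in the same bucket file, which is what makes bucketing useful for sampling and joins on that column.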
  127. 127. Column Data Types
        • Primitive Types
            • int type : tinyint, smallint, int, bigint
            • boolean, float, double, string
        • Nest-able Collections
            • array : value(any-type)
            • map : key(primitive) and value(any-type)
        • User-defined types
            • structures with attributes
  136. 136. DataType Convert
        Oracle         Hive
        NUMBER(n)      TINYINT / INT / BIGINT
        NUMBER(n,m)    FLOAT / DOUBLE
        VARCHAR2       STRING
        DATE           STRING ("yyyy-MM-dd HH:mm:ss" format)
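The conversion table can be sketched as a small mapping function. The slide does not say where the boundaries between TINYINT, INT and BIGINT or between FLOAT and DOUBLE lie, so the precision thresholds below are assumptions chosen so that every value of the Oracle type fits the Hive type; `OracleToHiveTypeSketch` and its methods are hypothetical names.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical Oracle-to-Hive column type mapper following the slide's table. */
class OracleToHiveTypeSketch {

    private static final Pattern NUMBER = Pattern.compile("NUMBER\\((\\d+)(?:,(\\d+))?\\)");

    /** The string format the slide uses for Oracle DATE values stored as Hive STRING. */
    private static final DateTimeFormatter DATE_FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    static String toHive(String oracleType) {
        String t = oracleType.trim().toUpperCase(Locale.ROOT).replace(" ", "");
        if (t.startsWith("VARCHAR2") || t.equals("DATE")) {
            return "STRING";
        }
        Matcher m = NUMBER.matcher(t);
        if (!m.matches()) {
            throw new IllegalArgumentException("unsupported type: " + oracleType);
        }
        int precision = Integer.parseInt(m.group(1));
        int scale = m.group(2) == null ? 0 : Integer.parseInt(m.group(2));
        if (scale > 0) {
            // Assumed threshold: FLOAT keeps about 6 decimal digits; DOUBLE is approximate beyond ~15.
            return precision <= 6 ? "FLOAT" : "DOUBLE";
        }
        // Assumed thresholds: largest precision whose full value range fits the Hive type.
        if (precision <= 2) return "TINYINT";
        if (precision <= 9) return "INT";
        if (precision <= 18) return "BIGINT";
        throw new IllegalArgumentException("precision too large for BIGINT: " + oracleType);
    }

    /** Formats a date value the way it would be stored in a Hive STRING column. */
    static String formatDate(LocalDateTime value) {
        return value.format(DATE_FORMAT);
    }

    public static void main(String[] args) {
        for (String t : new String[] {"NUMBER(2)", "NUMBER(9)", "NUMBER(18)", "NUMBER(10,2)", "VARCHAR2(30)", "DATE"}) {
            System.out.println(t + " maps to " + toHive(t));
        }
        System.out.println(formatDate(LocalDateTime.of(2011, 6, 27, 9, 5, 0)));
    }
}
```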
  137. 137. Oracle DML
        • HIVE supports a subset of ANSI SQL, including:
            • Sub-queries in the FROM clause
            • Join queries: equi-join / inner join, outer join
  142. 142. Range Operator: BETWEEN ~ AND ~
        Oracle:
          SELECT * FROM Employee WHERE salary BETWEEN 100 AND 500;
        Hive:
          SELECT * FROM Employee WHERE salary >= 100 AND salary <= 500;
          SELECT * FROM Employee WHERE BETWEEN(salary,100,500);
  147. 147. IN / EXISTS Clause: IN / EXISTS SubQuery
        Oracle:
          SELECT * FROM Employee e WHERE e.DeptNo IN (SELECT d.DeptNo FROM Dept d)
          SELECT * FROM Employee e WHERE EXISTS (SELECT 1 FROM Dept d WHERE e.DeptNo = d.DeptNo)
        Hive:
          SELECT * FROM Employee e LEFT SEMI JOIN Dept d ON (e.DeptNo = d.DeptNo)
  151. 151. NOT IN Clause: NOT IN SubQuery
        Oracle:
          SELECT * FROM Employee e WHERE e.DeptNo NOT IN (SELECT d.DeptNo FROM Dept d)
        Hive:
          SELECT e.* FROM Employee e LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo) WHERE d.DeptNo IS NULL
  155. 155. NOT EXISTS Clause: NOT EXISTS SubQuery
        Oracle:
          SELECT * FROM Employee e WHERE NOT EXISTS (SELECT 1 FROM Dept d WHERE e.DeptNo = d.DeptNo)
        Hive:
          SELECT e.* FROM Employee e LEFT OUTER JOIN Dept d ON (e.DeptNo = d.DeptNo) WHERE d.DeptNo IS NULL
  162. 162. LIKE Clause: LIKE / NOT LIKE
        Oracle:
          SELECT * FROM Employee e WHERE name LIKE '%steve'
          SELECT * FROM Employee e WHERE name NOT LIKE '%steve'
        Hive:
          SELECT e.* FROM Employee e WHERE name LIKE '%steve'
          SELECT e.* FROM Employee e WHERE NOT name LIKE '%steve'
  166. 166. JOIN Operator (1/4): SELF JOIN
        Oracle:
          SELECT * FROM Employee e1, Employee e2 WHERE e1.ID = e2.ID
        Hive:
          SELECT * FROM Employee e1 JOIN Employee e2 ON (e1.ID = e2.ID)
  170. 170. JOIN Operator (2/4): CROSS JOIN (Cartesian Product)
        Oracle:
          SELECT emp.Name, dep.Name FROM Employee emp, Dept dep
        Hive:
          SELECT emp.Name, dep.Name FROM Employee emp JOIN Dept dep
  174. 174. JOIN Operator (3/4): LEFT OUTER JOIN
        Oracle:
          SELECT * FROM Emp, Dept WHERE Emp.deptNo = Dept.deptNo(+)
        Hive:
          SELECT * FROM Emp LEFT OUTER JOIN Dept ON Emp.deptNo = Dept.deptNo
  178. 178. JOIN Operator (4/4): RIGHT OUTER JOIN
        Oracle:
          SELECT * FROM Emp, Dept WHERE Emp.deptNo(+) = Dept.deptNo
        Hive:
          SELECT * FROM Emp RIGHT OUTER JOIN Dept ON Emp.deptNo = Dept.deptNo
  179. 179. Oracle Function
  183. 183. Condition Function: CASE
        Oracle:
          CASE expr WHEN cond1 THEN r1 [WHEN cond2 THEN r2]* [ELSE r] END
        Hive:
          CASE expr WHEN cond1 THEN r1 [WHEN cond2 THEN r2]* [ELSE r] END
  196. 196. Math Function
        Oracle     Hive
        ROUND      ROUND
        CEIL       CEIL/CEILING
        MOD        PMOD
        POWER      POW/POWER
        SQRT       SQRT
        SIN/COS    SIN/COS
  207. 207. Character Function
        Oracle         Hive
        SUBSTR         SUBSTR
        TRIM           TRIM
        LPAD/RPAD      LPAD/RPAD
        LTRIM/RTRIM    LTRIM/RTRIM
        REPLACE        REGEXP_REPLACE
  214. 214. NULL Function
        Oracle      Hive
        COALESCE    COALESCE
        NVL         Custom UDF
        NVL2        Custom UDF
  215. 215. Custom UDF Function
        • Condition Function
            • DECODE
        • Null Comparison Function
            • NVL / NVL2
        • Type Conversion
            • TO_NUMBER
            • TO_CHAR
            • TO_DATE
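The logic behind the NVL, NVL2 and DECODE replacements is small enough to sketch in plain Java. This is only the evaluation logic, assuming Oracle's documented semantics; the class `OracleCompatFunctionsSketch` is a hypothetical name and the wrapping into an actual Hive UDF class (and its registration) is left out.

```java
import java.util.Objects;

/** Hypothetical evaluation logic for Oracle-style NVL, NVL2 and DECODE. */
class OracleCompatFunctionsSketch {

    /** NVL(value, default): the value itself unless it is NULL. */
    static <T> T nvl(T value, T defaultValue) {
        return value != null ? value : defaultValue;
    }

    /** NVL2(value, ifNotNull, ifNull): chooses a result depending on whether value is NULL. */
    static <T> T nvl2(Object value, T ifNotNull, T ifNull) {
        return value != null ? ifNotNull : ifNull;
    }

    /**
     * DECODE(expr, search1, result1, ..., [default]).
     * Returns the result paired with the first search value equal to expr,
     * else the trailing default if one is given, else NULL.
     * As in Oracle's DECODE, NULL is treated as equal to NULL.
     */
    static Object decode(Object expr, Object... searchResultPairs) {
        int i = 0;
        for (; i + 1 < searchResultPairs.length; i += 2) {
            if (Objects.equals(expr, searchResultPairs[i])) {
                return searchResultPairs[i + 1];
            }
        }
        return i < searchResultPairs.length ? searchResultPairs[i] : null;
    }

    public static void main(String[] args) {
        System.out.println(nvl(null, "n/a"));
        System.out.println(nvl2("x", "has value", "is null"));
        System.out.println(decode("F", "M", "Male", "F", "Female", "Unknown"));
    }
}
```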
  216. 216. Oracle Analytic Function
  220. 220. Analytic Function: order of processing
        1. Joins, WHERE and GROUP BY clauses are performed
        2. The analytic functions are performed on the result set
        3. The ORDER BY clause is processed
  230. 230. Analytic Function: rank salary within dept (MapReduce walk-through)
        Input:
          name  dept        salary
          ---------------------------
          a     Research    100
          b     Research    100
          c     Sales       200
          d     Sales       300
          e     Research    50
          f     Accounting  200
          g     Accounting  300
          h     Accounting  400
          i     Research    10
        Steps:
          1. Map tasks read the input rows.
          2. DISTRIBUTE BY dept: all rows of one dept are sent to the same Reduce task.
          3. SORT BY dept, salary: each Reduce task receives its rows sorted by dept and salary.
          4. RANK(dept,salary): each Reduce task numbers the rows within a dept.
        Result:
          Sales:       c 200 (1), d 300 (2)
          Accounting:  f 200 (1), g 300 (2), h 400 (3)
          Research:    i 10 (1), e 50 (2), then a and b, which tie at 100
  235. 235. Analytic Function: RANK
        Oracle:
          SELECT name, dept, salary, RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM emp
        Hive, with RANK(arg1,arg2) as a Custom UDF:
          SELECT e.name, e.dept, e.salary, RANK(e.dept, e.salary)
          FROM (SELECT name, dept, salary FROM emp DISTRIBUTE BY dept SORT BY dept, salary DESC) e
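A minimal sketch of what such a RANK(dept, salary) function could do internally, written as plain Java without the Hive UDF wrapper. The class name `RankSketch` is hypothetical, and the tie handling (equal salaries share a rank and the next rank skips ahead, as Oracle's RANK does) is an assumption; the slides do not show the UDF's source.

```java
/**
 * Hypothetical stateful rank function. It relies on being called once per row,
 * with all rows of a dept arriving together and already sorted by salary,
 * which is what DISTRIBUTE BY dept SORT BY dept, salary DESC provides.
 */
class RankSketch {
    private String lastDept;   // dept of the previous row, null before the first call
    private int lastSalary;    // salary of the previous row
    private int rowsSeen;      // rows seen so far in the current dept
    private int rank;          // rank assigned to the previous row

    int evaluate(String dept, int salary) {
        if (!dept.equals(lastDept)) {
            // A new dept starts: restart the numbering.
            rowsSeen = 0;
            rank = 0;
        }
        rowsSeen++;
        if (rowsSeen == 1 || salary != lastSalary) {
            // A new salary value takes the row position as its rank; ties keep the old rank.
            rank = rowsSeen;
        }
        lastDept = dept;
        lastSalary = salary;
        return rank;
    }

    public static void main(String[] args) {
        // Rows as a single reducer would see them: grouped by dept, salary descending.
        Object[][] rows = {
            {"h", "Accounting", 400}, {"g", "Accounting", 300}, {"f", "Accounting", 200},
            {"a", "Research", 100}, {"b", "Research", 100}, {"e", "Research", 50}, {"i", "Research", 10},
        };
        RankSketch rankFn = new RankSketch();
        for (Object[] row : rows) {
            int r = rankFn.evaluate((String) row[1], (Integer) row[2]);
            System.out.println(row[0] + " " + row[1] + " " + row[2] + " rank=" + r);
        }
    }
}
```

Because the function keeps state between calls, it only gives correct ranks when the inner query guarantees the distribution and sort order; calling it on unsorted rows would restart or repeat ranks arbitrarily.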
  236. 236. Hive Optimization & Future Work
  244. 244. Tuning Parameter
        • Hadoop Tuning
            • mapred.job.reuse.jvm.num.tasks
            • mapred.child.java.opts
            • mapred.min.split.size / mapred.max.split.size
            • dfs.block.size
        • Hive Tuning
            • hive.input.format = CombineHiveInputFormat
  245. 245. UDF/UDAF
        • Develop UDFs to reduce the number of MR jobs
        • Extend GenericUDF to avoid Java reflection
        • Avoid creating new objects inside a UDF
  249. 249. Future Work
        • HiveQL SQL Compliance
            • HIVE-282 - IN statement for WHERE clauses
            • HIVE-192 - Add TIMESTAMP column type
            • HIVE-1269 - Support Date/Datetime/Time/Timestamp Primitive Types
        • Analytic Function
            • HIVE-896 - Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive
            • HIVE-952 - Support analytic NTILE function
        • Optimization
            • HIVE-1694 - Accelerate GROUP BY execution using indexes
            • HIVE-482 - Optimize Group By + Order By with the same keys
  252. 252. Hive: a system for managing and querying structured data built on top of Hadoop
        Oracle 2 Hive:
          • data model
          • ANSI-SQL
          • built-in function / custom UDF
          • analytic function
  253. 253. Questions?