SQL in Hadoop

This presentation describes the Query Compiler of Hive for MapReduce. The architecture of the Hive Query Compiler is explained. Additionally, the compilation of a SQL-query to a MapReduce-Job is shown.

This presentation was created with the help of a presentation by Takeshi Nakano.

Usage Rights

CC Attribution License

SQL in Hadoop Presentation Transcript

  • 1. SQL in Hadoop Munich, 21 January 2014 Sven Bayer
  • 2. QAware GmbH
    ■ 45 employees
    ■ Software Engineering
    ■ Quality
    ■ Agility
    ■ Projects
    ■ Software diagnosis
    ■ Individual software solutions
    ■ Customers
    ■ Automotive, Energy, Retail, Telecommunications, and others
    21 January 2014 QAware
  • 3. Agenda
    1. Motivation
    2. Big Data
    3. MapReduce
    4. Hadoop
    5. Hive
    6. Hive Query Compiler
    7. Discussion
  • 4. Motivation
    Master's thesis: software metrics + time tracking
    ■ Hadoop processes huge data sets on clusters
    ■ Hive provides SQL for Hadoop
    ■ Hive generates complex MapReduce jobs from SQL
    ■ How does Hive convert SQL to MapReduce?
  • 5. Big Data
    ■ Defined by the 4 V's
      ■ Volume
      ■ Velocity
      ■ Variety
      ■ Veracity
  • 6. MapReduce
    ■ Published by Google in 2004
    ■ MapReduce is highly scalable on clusters
    ■ Big Data can be processed with MapReduce
    ■ Consists mainly of a Map function and a Reduce function
  • 7. MapReduce Example: Word-Count Algorithm
    Input (four lines):
      Hadoop uses MapReduce.
      There is a Map phase.
      There is a Reduce phase, a Map phase.
      There is a Map phase.
    Map (one output list per input line):
      {Hadoop,1}, {uses,1}, {MapReduce,1}
      {There,1}, {is,1}, {a,1}, {Map,1}, {phase,1}
      {There,1}, {is,1}, {a,1}, {Reduce,1}, {phase,1}, {a,1}, {Map,1}, {phase,1}
      {There,1}, {is,1}, {a,1}, {Map,1}, {phase,1}
    Sort, Shuffle (pairs grouped by key):
      {Hadoop,1} {uses,1} {MapReduce,1}
      {There,1}, {There,1}, {There,1} {is,1}, {is,1}, {is,1} {a,1}, {a,1}, {a,1}, {a,1}
      {Map,1}, {Map,1}, {Map,1} {phase,1}, {phase,1}, {phase,1}, {phase,1} {Reduce,1}
    Reduce (input per key):
      {Hadoop,[1]}, {uses,[1]}, {MapReduce,[1]}
      {There,[1,1,1]}, {is,[1,1,1]}, {a,[1,1,1,1]}
      {Map,[1,1,1]}, {phase,[1,1,1,1]}, {Reduce,[1]}
    Output:
      Hadoop 1, uses 1, MapReduce 1
      There 3, is 3, a 4
      Map 3, phase 4, Reduce 1
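The word-count example can be traced in a few lines of plain Python. This is only a single-process sketch of the three phases, not Hadoop code; the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and chosen for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    words = line.replace(",", " ").replace(".", " ").split()
    return [(word, 1) for word in words]

def shuffle(pairs):
    """Group all emitted values by key, as the sort/shuffle step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Sum the list of ones collected for a single word."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

lines = [
    "Hadoop uses MapReduce.",
    "There is a Map phase.",
    "There is a Reduce phase, a Map phase.",
    "There is a Map phase.",
]
counts = word_count(lines)
print(counts)
```

On a real cluster the map calls run in parallel on different nodes, and the shuffle moves each key to the reducer responsible for it; the sketch only mirrors the data flow.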
  • 8. MapReduce – In practice
    ■ Get the users with the products that they watched
    ■ Get these products with their numbers, makers, and prices, and filter the products on "audi"
    Input:
      access_log: id1 user1 pNo1 | id2 user2 pNo1 | id3 user3 pNo2
      product: pNo1 audi 30€ | pNo2 audi 50€ | pNo3 bmw 60€
    Map:
      {pNo1,user1}, {pNo1,user2}, {pNo2,user3}
      {pNo1,{audi,30€}}, {pNo2,{audi,50€}}, {pNo3,{bmw,60€}}
    Sort, Shuffle:
      {pNo1,user1}, {pNo1,user2}, {pNo1,{audi,30€}}
      {pNo2,user3}, {pNo2,{audi,50€}}
    Reduce (join on product_no + filtering on "audi"):
      {pNo1,[user1,user2,{audi,30€}]}, {pNo2,[user3,{audi,50€}]}
    Output:
      pNo1 user1,audi,30€
      pNo1 user2,audi,30€
      pNo2 user3,audi,50€
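The join on this slide is a reduce-side join: both tables are mapped to pairs keyed by the product number, so matching rows meet in the same reducer. The following Python sketch replays it on the slide's data. All function names are hypothetical, prices are stored as plain integers, and applying the "audi" filter inside the reduce step is an assumption made for simplicity.

```python
from collections import defaultdict

ACCESS_LOG = [("id1", "user1", "pNo1"), ("id2", "user2", "pNo1"), ("id3", "user3", "pNo2")]
PRODUCT = [("pNo1", "audi", 30), ("pNo2", "audi", 50), ("pNo3", "bmw", 60)]

def map_access_log(row):
    """Key each access by product number and tag the value with its origin."""
    _id, user, product_no = row
    return product_no, ("user", user)

def map_product(row):
    """Key each product by product number and tag the value with its origin."""
    product_no, maker, price = row
    return product_no, ("product", (maker, price))

def shuffle(pairs):
    """Group the tagged values of both tables by product number."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_join(product_no, values, maker_filter):
    """Combine every user with every matching product for one key."""
    users = [v for tag, v in values if tag == "user"]
    products = [v for tag, v in values if tag == "product" and v[0] == maker_filter]
    return [(product_no, user, maker, price)
            for user in users for maker, price in products]

pairs = [map_access_log(r) for r in ACCESS_LOG] + [map_product(r) for r in PRODUCT]
result = [row for key, values in shuffle(pairs)
          for row in reduce_join(key, values, "audi")]
print(result)
```

The origin tag ("user" or "product") is what lets the reducer tell the two inputs apart after they have been mixed under one key.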
  • 9. Hadoop
    ■ Initiated by Yahoo in 2006
    ■ Hadoop cluster
    ■ Highly scalable for Big Data
    ■ Hadoop architecture
      ■ YARN (MapReduce)
      ■ HDFS
      ■ Hadoop Common
  • 10. Hive
    ■ Built on top of Hadoop
      ■ MapReduce
      ■ HDFS
    ■ Provides HiveQL queries for Hadoop
    ■ Compiles HiveQL to MapReduce
  • 11. Hive: Hive architecture (diagram)
    ■ Interfaces: JDBC/ODBC, CLI, Web-UI
    ■ Thrift Server
    ■ Driver
    ■ Query Compiler: Parser, Semantic Analyzer, Logical Plan Generator, Logical Optimizer, Physical Plan Generator, Physical Optimizer
    ■ Metastore
    ■ Execution Engine
    Legend: Framework, Component, Call of a component
  • 12. Hive: Hive Query Compiler (flow diagram)
    Start → HiveQL → Parser → AST → Semantic Analyzer → QB → Logical Plan Generator → QB Tree → Logical Optimizer → QB Tree → Physical Plan Generator → Phys. Plan → Physical Optimizer → Phys. Plan → Execution Engine → End
  • 13. Hive Query Compiler: Parser
    HiveQL → Parser → AST
    HiveQL query:
      SELECT a.user, a.product_no, p.maker, p.price
      FROM access_log a JOIN product p ON (a.product_no = p.product_no)
      WHERE p.maker = 'audi';
    Tables:
      access_log: id1 user1 pNo1 | id2 user2 pNo1 | id3 user3 pNo2
      product: pNo1 audi 30€ | pNo2 audi 50€ | pNo3 bmw 60€
  • 14. Hive Query Compiler: Parser
    HiveQL → Parser → AST
    ■ … WHERE p.maker = 'audi';
  • 15. Hive Query Compiler: Semantic Analyzer
    AST → Semantic Analyzer → QB
    Query Block (for the FROM clause):
      ■ MetaData
      ■ ParseInfo
      ■ Alias to Table Info: "a" = Table Info("access_log"), "p" = Table Info("product")
      ■ AST of the join expression
  • 16. Hive Query Compiler: Logical Plan Generator
    QB → Logical Plan Generator → QB Tree
    Operators in the generated tree:
      ■ TableScanOperator TS_0
      ■ TableScanOperator TS_1
      ■ ReduceSinkOperator RS_2
      ■ ReduceSinkOperator RS_3
      ■ JoinOperator JOIN_4
      ■ FilterOperator FIL_5 (maker = 'audi')
      ■ SelectOperator SEL_6
      ■ FileSinkOperator FS_7
  • 17. Hive Query Compiler: Logical Optimizer
    QB Tree → Logical Optimizer → QB Tree
    Operators in the optimized tree:
      ■ TableScanOperator TS_0
      ■ TableScanOperator TS_1
      ■ FilterOperator FIL_8 (maker = 'audi')
      ■ ReduceSinkOperator RS_2
      ■ ReduceSinkOperator RS_3
      ■ JoinOperator JOIN_4
      ■ SelectOperator SEL_6
      ■ FileSinkOperator FS_7
    Compared with the generated tree, the filter after the join (FIL_5) is gone and a new filter (FIL_8) sits before the ReduceSinkOperators, so rows are filtered before they are shuffled to the join.
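The rewrite on this slide is commonly called predicate pushdown. The sketch below models it on a toy operator tree in Python. It is not Hive's optimizer: the `Op` class, the alias-based matching, and the bracket notation are all hypothetical simplifications, and the operator IDs from the slide are left out.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    kind: str                                     # "TS", "RS", "JOIN", "FIL", "SEL" or "FS"
    alias: str = ""                               # table alias a scan or filter refers to
    children: list = field(default_factory=list)  # input operators

def produces_alias(op, alias):
    """True if the subtree rooted at op scans the table with this alias."""
    if op.kind == "TS":
        return op.alias == alias
    return any(produces_alias(child, alias) for child in op.children)

def push_filter_down(op):
    """Move a filter sitting directly above a join below the matching ReduceSink."""
    op.children = [push_filter_down(child) for child in op.children]
    if op.kind == "FIL" and op.children[0].kind == "JOIN":
        join = op.children[0]
        for reduce_sink in join.children:
            if produces_alias(reduce_sink, op.alias):
                # A new filter is placed between the scan and its ReduceSink.
                reduce_sink.children = [Op("FIL", op.alias, reduce_sink.children)]
                return join  # the old filter above the join is dropped
    return op

def render(op):
    """Print the tree as nested calls, outermost operator first."""
    inner = ", ".join(render(child) for child in op.children)
    name = f"{op.kind}[{op.alias}]" if op.alias else op.kind
    return f"{name}({inner})" if inner else name

join = Op("JOIN", children=[Op("RS", children=[Op("TS", "a")]),
                            Op("RS", children=[Op("TS", "p")])])
plan = Op("FS", children=[Op("SEL", children=[Op("FIL", "p", [join])])])

before = render(plan)
after = render(push_filter_down(plan))
print(before)
print(after)
```

The payoff is that the filtered table sends fewer rows through the shuffle, which is usually the expensive part of a MapReduce job.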
  • 18. Hive Query Compiler: Physical Plan Generator
    QB Tree → Physical Plan Generator → Phys. Plan
    MapRedTask (Stage-1/root):
      ■ Mappers: TableScanOperator TS_0, TableScanOperator TS_1, FilterOperator FIL_8 (maker = 'audi'), ReduceSinkOperator RS_2, ReduceSinkOperator RS_3
      ■ Reducer: JoinOperator JOIN_4, SelectOperator SEL_6, FileSinkOperator FS_7
    MoveTask (Stage-0)
    StatsTask (Stage-2)
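One way to picture this step: the operator tree is cut at the ReduceSinkOperators, with everything up to and including a ReduceSink going to a mapper and everything after it going to the reducer. The Python sketch below applies that cut to a toy data-flow graph. The `NEXT` table, the alias-based operator names, and `split_stage` are hypothetical, and which table scan feeds the filter is an assumption.

```python
# Toy data flow: each operator name maps to the operator that consumes its output.
NEXT = {
    "TS[a]": "RS[a]",
    "TS[p]": "FIL[p]",
    "FIL[p]": "RS[p]",
    "RS[a]": "JOIN",
    "RS[p]": "JOIN",
    "JOIN": "SEL",
    "SEL": "FS",
}

def split_stage(next_op, scans):
    """Cut the operator graph at the ReduceSinks into map work and reduce work."""
    map_work = {}
    reduce_start = None
    for scan in scans:
        chain = [scan]
        # Follow the flow until the ReduceSink, which ends the map side.
        while not chain[-1].startswith("RS"):
            chain.append(next_op[chain[-1]])
        map_work[scan] = chain
        reduce_start = next_op[chain[-1]]
    # Everything after the ReduceSinks runs in the reducer.
    reduce_work = [reduce_start]
    while reduce_work[-1] in next_op:
        reduce_work.append(next_op[reduce_work[-1]])
    return map_work, reduce_work

map_work, reduce_work = split_stage(NEXT, ["TS[a]", "TS[p]"])
print(map_work)
print(reduce_work)
```

The sketch handles a single MapReduce stage only; a plan with several shuffles would need one such cut per stage.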
  • 19. HiveQL Processing: Physical Optimizer
    Phys. Plan → Physical Optimizer → Phys. Plan
    ■ Optimizes the physical plan
    ■ Transforms a plan with joins into multiple MapReduce jobs
    ■ Converts tasks that include a join into a MapJoin
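A MapJoin avoids the shuffle altogether: the smaller table is loaded into an in-memory hash table, and each mapper joins its rows of the larger table against it. The Python sketch below shows the idea on the example tables from the earlier slides. The function `map_join` is hypothetical, prices are plain integers, and treating `product` as the small table is an assumption.

```python
ACCESS_LOG = [("id1", "user1", "pNo1"), ("id2", "user2", "pNo1"), ("id3", "user3", "pNo2")]
PRODUCT = [("pNo1", "audi", 30), ("pNo2", "audi", 50), ("pNo3", "bmw", 60)]

def map_join(access_log_rows, product_rows, maker):
    """Join in the map phase by probing a hash table built from the small table."""
    # Build side: the small table, already filtered, keyed by product number.
    small = {product_no: (mk, price)
             for product_no, mk, price in product_rows if mk == maker}
    joined = []
    # Probe side: stream the large table; no sort, shuffle, or reducer is involved.
    for _id, user, product_no in access_log_rows:
        if product_no in small:
            mk, price = small[product_no]
            joined.append((user, product_no, mk, price))
    return joined

rows = map_join(ACCESS_LOG, PRODUCT, "audi")
print(rows)
```

This only pays off when the build-side table fits in memory on every mapper; otherwise the reduce-side join remains the fallback.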
  • 20. HiveQL Processing: Execution Engine
    Phys. Plan → Execution Engine
    ■ MapReduce job is serialized as plan.xml
    ■ Returns the result
    ■ Temporary place
    ■ Table
  • 21. Discussion
    ■ Hive brings SQL to Hadoop
    ■ Advantages of Hive
      ■ Reduces developer workload
      ■ No need for manual coding of MapReduce jobs
      ■ Easy migration for systems interacting with SQL
    ■ Disadvantages of Hive
      ■ High latency
    ■ Outlook for Hive
      ■ Apache Tez with container reuse, Mapper reduction in DAG
    ■ Alternatives to Hive
      ■ Impala, Shark, Presto, Lingual