SQL on Hadoop - 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ETH Zurich

Presentation given at the 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ETH Zurich

Published in: Data & Analytics, Technology

  1. SQL on Hadoop - 12th Swiss Big Data User Group Meeting, 3rd of July, 2014, ETH Zurich
     Romeo Kienzler
     IBM Center of Excellence for Data Science, Cognitive Systems and BigData (a joint venture between IBM Research Zurich and IBM Innovation Center DACH)
     © 2013 IBM Corporation
     Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg
  2. DataScience at present
     ● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
       ● SQL (42%)
       ● R (33%)
       ● Python (26%)
       ● Excel (25%)
       ● Java, Ruby, C++ (17%)
       ● SPSS, SAS (9%)
     ● Limitations (Single Node usage)
       ● Main Memory
       ● CPU <> Main Memory Bandwidth
       ● CPU
       ● Storage <> Main Memory Bandwidth (either Single node or SAN)
  3. Data Science on Hadoop
     SQL (42%), R (33%), Python (26%), Excel (25%), Java, Ruby, C++ (17%), SPSS, SAS (9%)
     [Diagram labels: Data Science, Hadoop]
  4. SQL on Hadoop
     ● IBM BigSQL (ANSI 2011 compliant, part of IBM BigInsights)
     ● HIVE, Presto
     ● Cloudera Impala
     ● Lingual
     ● Shark
     ● ...
     [Diagram labels: SQL, Hadoop]
  5. Two types of SQL Engines
     ● Type I
       ● Compiler and Optimizer SQL->MapReduce
     ● Type II
       ● Brings own distributed execution engine on Data Nodes
       ● Brings own Task Scheduler
     ● The Hadoop SQL Ecosystem is evolving very fast
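To make the Type I idea more concrete, here is a minimal Python sketch of how a query such as `select count(hour), hour from trace group by hour order by hour` (used later in the demo) could be broken into a map, a shuffle, and a reduce phase. The function names and the sample rows are illustrative assumptions; this is not how Hive or Big SQL is actually implemented.

```python
from collections import defaultdict

# Hypothetical sample rows (hour, employeeid); invented for illustration only.
ROWS = [(9, 101), (9, 102), (10, 101), (11, 103), (10, 104), (9, 105)]

def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for hour, _employeeid in rows:
        yield hour, 1

def shuffle(pairs):
    """Group values by key; in MapReduce this step involves disk and network."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts per key and return them ordered by key (ORDER BY hour)."""
    return [(sum(values), hour) for hour, values in sorted(groups.items())]

def count_by_hour(rows):
    """Run the three phases in sequence, like a single MapReduce job."""
    return reduce_phase(shuffle(map_phase(rows)))

print(count_by_hour(ROWS))
```

A Type II engine would run the same logical plan on its own execution engine instead of going through separate map and reduce jobs.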
  6. Hive
     ● Runs on top of MapReduce
     ● → Type I
     Source: http://cdn.venublog.com/wp-content/uploads/2013/07/hive-1.jpg
  7. Lingual
     ● ANSI SQL Layer on top of Cascading
     ● Cascading
       ● Java API to express DAGs
       ● Runs on top of MapReduce
     ● → Type I
  8. Limits of MapReduce
     ● Disk writes between Map and Reduce
     ● Slow for computations which depend on previously computed values
     ● JOINs are very slow and difficult to implement
     ● Only sequential data access
     ● Only tuple-wise data access
     ● Map-Side joins have sort and size constraints
     ● Reduce-Side joins require secondary sorting of values
     ● ...
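The point about reduce-side joins and secondary sorting can be sketched in plain Python. The example below is a simplified, single-process illustration under assumed inputs (the `LEFT` and `RIGHT` records and all function names are hypothetical): each record is tagged with its source, the "shuffle" sorts by join key and then by tag, and the "reducer" buffers the left side of each key before streaming the right side past it.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical inputs, invented for illustration: (hour, payload) records.
LEFT = [(9, "a"), (10, "b"), (9, "c")]
RIGHT = [(9, "x"), (11, "y"), (9, "z")]

def map_tagged(records, tag):
    """Map step: emit a composite key (join key, tag) to mark the source side."""
    for key, payload in records:
        yield (key, tag), payload

def reduce_side_join(left, right):
    """Inner join on the first field, mimicking a reduce-side join."""
    pairs = list(map_tagged(left, 0)) + list(map_tagged(right, 1))
    # Shuffle/sort: ordering by (key, tag) is the secondary sort, so the
    # left rows of each key reach the reducer before the right rows.
    pairs.sort(key=itemgetter(0))
    joined = []
    for key, group in groupby(pairs, key=lambda kv: kv[0][0]):
        left_buffer = []
        for (_key, tag), payload in group:
            if tag == 0:
                left_buffer.append(payload)  # buffer the left side
            else:
                joined.extend((key, lp, payload) for lp in left_buffer)
    return joined

print(reduce_side_join(LEFT, RIGHT))
```

Without the secondary sort, the reducer would have to buffer both sides of every key, which is one reason such joins are awkward to implement directly on MapReduce.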
  9. Impala (Type II)
     Source: http://blog.cloudera.com/blog/wp-content/uploads/2012/10/impala.png
  10. Presto (Type II)
      Source: https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
  11. Spark / Shark (Type II)
      Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png
  12. BigSQL V3.0 (Type II)
      As in Spark, MapReduce has been kicked out :)
      (No JobTracker, no TaskTracker, but HDFS/GPFS remains)
  13. BigSQL V3.0 – Architecture
      Putting the story together…
      Big SQL shares a common SQL dialect with DB2
      Big SQL shares the same client drivers with DB2
  14. BigSQL V3.0 – Performance
      Query rewrites
        Exhaustive query rewrite capabilities
        Leverages additional metadata such as constraints and nullability
      Optimization
        Statistics and heuristic driven query optimization
        Query optimizer based upon decades of IBM RDBMS experience
      Tools and metrics
        Highly detailed explain plans and query diagnostic tools
        Extensive number of available performance metrics
      Example query:
        SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST)
        FROM PERIOD, DAILY_SALES, PRODUCT, STORE
        WHERE PERIOD.PERKEY = DAILY_SALES.PERKEY
          AND PRODUCT.PRODKEY = DAILY_SALES.PRODKEY
          AND STORE.STOREKEY = DAILY_SALES.STOREKEY
          AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012'
          AND STORE_NUMBER = '03'
          AND CATEGORY = 72
        GROUP BY ITEM_DESC
      [Figure: query transformation (dozens of query transformations) followed by access plan generation (hundreds or thousands of access plan options); the candidate plans join Store, Product, Daily Sales and Period using NLJOIN, HSJOIN and ZZJOIN operators]
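A rough feel for why even a four-table query yields "hundreds or thousands of access plan options" can be had from a simple counting model. The sketch below is an assumption-laden simplification, not a description of the Big SQL optimizer: it only counts left-deep join orders over the four tables of the example query, with one of the three join operator labels from the figure chosen for each join, and ignores access paths, bushy plans, and pruning.

```python
from itertools import permutations, product

TABLES = ["PERIOD", "DAILY_SALES", "PRODUCT", "STORE"]  # from the example query
JOIN_METHODS = ["NLJOIN", "HSJOIN", "ZZJOIN"]           # labels seen in the figure

def left_deep_plans(tables, methods):
    """Yield every (join order, join methods) pair for left-deep plans."""
    for order in permutations(tables):
        # A left-deep plan over n tables contains n - 1 joins.
        for chosen in product(methods, repeat=len(tables) - 1):
            yield order, chosen

plan_count = sum(1 for _ in left_deep_plans(TABLES, JOIN_METHODS))
print(plan_count)
```

Under these assumptions the count is 4! join orders times 3^3 method choices; a cost-based optimizer uses statistics to avoid evaluating most of them.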
  15. BigSQL V3.0 – Performance
      You are substantially faster if you don't use MapReduce
      IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification.
      Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql
  16. BigSQL V3.0 – Query Federation
      [Diagram: one Head Node running Big SQL, and four Compute Nodes, each running Task Tracker, Data Node and Big SQL]
  17. BigSQL V1.0 – Demo (small)
      ● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
      ● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
      ● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
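The three row counts on this slide are consistent with scaling the small data set linearly, assuming decimal units (3 TB taken as 3,000 GB and 0.7 PB as 700,000 GB). The short Python check below reproduces the figures; the function name and the derived average row width are illustrative additions, not values stated in the talk.

```python
# Figures taken from the slide: 32 GB of data holds about 650,000,000 rows.
BASE_SIZE_GB = 32
BASE_ROWS = 650_000_000

def estimated_rows(size_gb):
    """Scale the row count linearly with the data size (decimal units assumed)."""
    return size_gb * BASE_ROWS // BASE_SIZE_GB

medium_rows = estimated_rows(3_000)    # 3 TB taken as 3,000 GB
large_rows = estimated_rows(700_000)   # 0.7 PB taken as 700,000 GB

# Rough average row width in bytes, derived from the small data set.
bytes_per_row = BASE_SIZE_GB * 10**9 / BASE_ROWS

print(medium_rows, large_rows, round(bytes_per_row, 1))
```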
  18. BigSQL V1.0 – Demo (small)
      CREATE EXTERNAL TABLE trace (
        hour integer, employeeid integer,
        departmentid integer, clientid integer,
        date string, timestamp string)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/user/biadmin/32Gtest';
  19. BigSQL V1.0 – Demo (small)
  20. BigSQL V1.0 – Demo (small)
  21. BigSQL V1.0 – Demo (small)
      [bivm.ibm.com][biadmin] 1> select count(*) from trace1;
      +----------+
      |          |
      +----------+
      | 11416740 |
      +----------+
      1 row in results(first row: 39.78s; total: 39.78s)
  22. BigSQL V1.0 – Demo (small)
      select count(hour), hour from trace group by hour order by hour
      30 rows in results(first row: 37.98s; total: 37.99s)
  23. BigSQL V1.0 – Demo (small)
      [bivm.ibm.com][biadmin] 1> select count(*) from trace1 t3 inner join trace2 t4 on t3.hour=t4.hour;
      +--------+
      |        |
      +--------+
      | 477340 |
      +--------+
      1 row in results(first row: 32.24s; total: 32.25s)
  24. BigSQL V3.0 – Demo (small)
      CREATE HADOOP TABLE trace3 (
        hour int, employeeid int,
        departmentid int, clientid int,
        date varchar(30), timestamp varchar(30)
      )
      row format delimited fields terminated by '|'
      stored as textfile;
  25. BigSQL V3.0 – Demo (small)
      [bivm.ibm.com][biadmin] 1> select count(*) from trace3;
      +----------+
      | 1        |
      +----------+
      | 12014733 |
      +----------+
      1 row in results(first row: 2.94s; total: 2.95s)
  26. BigSQL V3.0 – Demo (small)
      [bivm.ibm.com][biadmin] 1> select count(*) from trace3 t3 inner join trace4 t4 on t3.hour=t4.hour;
      +--------+
      | 1      |
      +--------+
      | 504360 |
      +--------+
      1 row in results(first row: 0.79s; total: 0.80s)
  27. BigSQL V3.0 – Demo (small)
      [bivm.ibm.com][biadmin] 1> select count(hour), hour from trace3 group by hour order by hour;
      29 rows in results(first row: 1.88s; total: 1.89s)
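The total times reported on the demo slides can be turned into approximate speedup factors. The Python sketch below only divides the V1.0 total time by the V3.0 total time for each query; note that the two runs used different tables (for example 11,416,740 rows in trace1 versus 12,014,733 rows in trace3), so the ratios are indicative rather than a controlled benchmark. The dictionary keys and the function name are illustrative.

```python
# Total wall-clock times in seconds, copied from the demo slides as
# (BigSQL V1.0, BigSQL V3.0). The runs used different tables, so treat
# the resulting ratios as rough indications only.
TIMINGS = {
    "count(*)": (39.78, 2.95),
    "inner join on hour": (32.25, 0.80),
    "group by hour": (37.99, 1.89),
}

def speedups(timings):
    """Return the V1.0 / V3.0 time ratio per query, rounded to one decimal."""
    return {query: round(v1 / v3, 1) for query, (v1, v3) in timings.items()}

for query, factor in speedups(TIMINGS).items():
    print(f"{query}: about {factor}x faster in V3.0")
```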
  28. Questions?
      http://www.ibm.com/software/data/bigdata/
      BigInsights free VM and Installer for non-commercial use: ibm.co/quickstart
      Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps
