Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich


  1. © 2013 IBM Corporation. The Data Scientist's Workplace of the Future - Data Science Connect, 22nd of July 2014. Romeo Kienzler, IBM Center of Excellence for Data Science, Cognitive Systems and BigData (a joint venture between IBM Research Zurich and IBM Innovation Center DACH). Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg
  2. What is Data Science? Source: Statoo.com http://slidesha.re/1kmNiX0
  3. Data Science at present ● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html): SQL (42%), R (33%), Python (26%), Excel (25%), Java/Ruby/C++ (17%), SPSS/SAS (9%) ● Limitations (single-node usage): main memory, CPU, CPU <> main memory bandwidth, storage <> main memory bandwidth (either single node or SAN)
  4. What is BIG data?
  5. What is BIG data?
  6. What is BIG data? [Venn diagram: Big Data and Hadoop]
  7. What is BIG data? [Venn diagram: Business Intelligence and Data Warehouse]
  8. BigData == Hadoop? [Venn diagram: Hadoop as a proper subset of BigData]
  9. What is beyond “Data Warehouse”? [diagram: Data Lake vs. Data Warehouse]
  10. First “BigData” use case? ● Google Index ● 40 × 10^9 = 40.000.000.000 => 40 billion pages indexed ● Will break the 100 PB barrier soon ● Derived from MapReduce ● Now “Caffeine”, based on “Percolator” ● Incremental vs. batch ● In-memory vs. disk
  11. Map-Reduce → Hadoop → BigInsights
  12. BigData Analytics – Predictive Analytics. "Sometimes it's not who has the best algorithm that wins; it's who has the most data." (C) Google Inc., The Unreasonable Effectiveness of Data¹ ¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf No sampling => work with the full dataset => no p-values/z-scores anymore
  13. Aggregated bandwidth between CPU, main memory and hard drive: scanning 1 TB (at 10 GByte/s per node) takes 100 sec on 1 node, 10 sec on 10 nodes, 1 sec on 100 nodes, 100 msec on 1000 nodes
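Slide 13's back-of-the-envelope numbers follow directly from the deck's assumption of 10 GByte/s aggregate bandwidth per node; a minimal sketch of the arithmetic:

```python
# Time to scan 1 TB when each node contributes 10 GByte/s of aggregate
# CPU <-> memory <-> disk bandwidth (the slide's assumption).
DATA_BYTES = 1 * 10**12          # 1 TB
NODE_BANDWIDTH = 10 * 10**9      # 10 GByte/s per node

def scan_seconds(nodes: int) -> float:
    """Seconds to stream the full dataset across `nodes` machines."""
    return DATA_BYTES / (nodes * NODE_BANDWIDTH)

for n in (1, 10, 100, 1000):
    print(n, scan_seconds(n))    # 100 s, 10 s, 1 s, 0.1 s (= 100 msec)
```

The point of the slide is that scan time shrinks linearly with the node count because bandwidth, not CPU, is the bottleneck.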
  14. Fault tolerance / commodity hardware: AMD Turion II Neo N40L (2× 1,5 GHz / 2 MB / 15 W), 8 GB RAM, 3 TB SEAGATE Barracuda 7200.14, < CHF 500. For 100 K => 200 × (2, 4, 3) => 400 cores, 1,6 TB RAM, 200 TB HD. MTBF ~365 d per node => a failure in the cluster roughly every 1,5 d. Source: http://www.cloudcomputingpatterns.org/Watchdog
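The slide's "~365 d => ~1,5 d" jump is the standard estimate that, assuming independent failures, a cluster's mean time between failures is roughly the per-node MTBF divided by the node count; a sketch under that assumption:

```python
# With ~200 commodity nodes, each with a mean time between failures of
# about one year, the cluster as a whole sees a node failure roughly
# every node_mtbf / nodes days (assuming independent failures) --
# which is why fault tolerance must be handled in software.
def cluster_mtbf_days(node_mtbf_days: float, nodes: int) -> float:
    return node_mtbf_days / nodes

print(cluster_mtbf_days(365, 200))  # ~1.8 days, i.e. a failure every other day
```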
  15. “Elastic” Scale-Out Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload
  16. “Elastic” Scale-Out of (build-up slide)
  17. “Elastic” Scale-Out of CPU cores
  18. “Elastic” Scale-Out of CPU cores, storage
  19. “Elastic” Scale-Out of CPU cores, storage, memory
  20. “Elastic” Scale-Out is linear Source: http://www.cloudcomputingpatterns.org/Elastic_Platform
  21. How do databases Scale-Out? Shared Disk architectures
  22. How do databases Scale-Out? Shared Nothing architectures
  23. Hadoop? Shared Nothing architecture? Shared Disk architecture? http://bluemix.net/: 6-node Hadoop cluster for free
  24. Data Science on Hadoop: SQL (42%), R (33%), Python (26%), Excel (25%), Java/Ruby/C++ (17%), SPSS/SAS (9%)
  25. SQL on Hadoop ● IBM BigSQL (ANSI 92 compliant) ● Hive, Presto ● Cloudera Impala ● Lingual ● Shark ● ...
  26. Two types of SQL engines ● Type I: a compiler and optimizer translating SQL → MapReduce ● Type II: brings its own distributed execution engine on the data nodes and its own task scheduler ● The Hadoop SQL ecosystem is evolving very fast
  27. Hive ● Runs on top of MapReduce → Type I. Source: http://cdn.venublog.com/wp-content/uploads/2013/07/hive-1.jpg
  28. Lingual ● ANSI SQL layer on top of Cascading ● Cascading: a Java API to express DAGs, runs on top of MapReduce → Type I
  29. Limits of MapReduce ● Disk writes between map and reduce ● Slow for computations which depend on previously computed values ● JOINs are very slow and difficult to implement ● Only sequential data access ● Only tuple-wise data access ● Map-side joins have sort and size constraints ● Reduce-side joins require secondary sorting of values ● ...
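The join limitation called out above becomes concrete in a reduce-side join: each record is tagged with its source table, the shuffle groups by join key, and the reducer must buffer (or secondarily sort) all values per key before pairing them. A minimal single-process sketch of the pattern (illustrative only, not the Hadoop API; the table names and records are made up):

```python
from collections import defaultdict

# Reduce-side equi-join sketch: map tags records with their source,
# shuffle groups by join key, reduce pairs records from both sides.
# Buffering both sides per key is exactly what makes this slow.
def map_phase(table_tag, rows, key_index):
    for row in rows:
        yield row[key_index], (table_tag, row)

def reduce_phase(grouped):
    for key, tagged in grouped.items():
        left = [r for t, r in tagged if t == "L"]
        right = [r for t, r in tagged if t == "R"]
        for l in left:
            for r in right:
                yield key, l, r

orders = [(1, "order-a"), (2, "order-b")]   # hypothetical left table
users = [(1, "alice"), (3, "bob")]          # hypothetical right table

groups = defaultdict(list)                  # stands in for the shuffle
for k, v in list(map_phase("L", orders, 0)) + list(map_phase("R", users, 0)):
    groups[k].append(v)

print(list(reduce_phase(groups)))           # only key 1 appears on both sides
```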
  30. Impala (Type II) http://blog.cloudera.com/blog/wp-content/uploads/2012/10/impala.png
  31. Presto (Type II) https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
  32. Spark / Shark (Type II) Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png
  33. BigSQL V3.0 (Type II): like in Spark, MapReduce has been kicked out :) (no JobTracker, no TaskTracker, but HDFS/GPFS remains)
  34. BigSQL V3.0 – Architecture. Putting the story together: Big SQL shares a common SQL dialect with DB2 and the same client drivers as DB2
  35. BigSQL V3.0 – Performance ● Query rewrites: exhaustive query rewrite capabilities; leverages additional metadata such as constraints and nullability ● Optimization: statistics- and heuristics-driven query optimization; query optimizer based upon decades of IBM RDBMS experience ● Tools and metrics: highly detailed explain plans and query diagnostic tools; extensive number of available performance metrics ● Example query: SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST) FROM PERIOD, DAILY_SALES, PRODUCT, STORE WHERE PERIOD.PERKEY=DAILY_SALES.PERKEY AND PRODUCT.PRODKEY=DAILY_SALES.PRODKEY AND STORE.STOREKEY=DAILY_SALES.STOREKEY AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012' AND STORE_NUMBER='03' AND CATEGORY=72 GROUP BY ITEM_DESC ● Query transformation: dozens of query transformations ● Access plan generation: hundreds or thousands of access plan options [access plan diagrams: NLJOIN/HSJOIN/ZZJOIN alternatives over Period, Daily Sales, Product and Store]
  36. BigSQL V3.0 – Performance: you are substantially faster if you don't use MapReduce. IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification. Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql
  37. BigSQL V3.0 – Query Federation [architecture diagram: a Big SQL head node in front of four compute nodes, each running Big SQL, a Task Tracker and a Data Node]
  38. BigSQL V1.0 – Demo (small) ● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich) ● 3 TB data, ~60.937.500.000 rows (middle, Innovation Center Zurich) ● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
  39. BigSQL V1.0 – Demo (small): CREATE EXTERNAL TABLE trace ( hour integer, employeeid integer, departmentid integer, clientid integer, date string, timestamp string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/biadmin/32Gtest';
  40. BigSQL V1.0 – Demo (small)
  41. BigSQL V1.0 – Demo (small)
  42. BigSQL V1.0 – Demo (small): [bivm.ibm.com][biadmin] 1> select count(*) from trace1; → 11416740 (1 row in results; first row: 39.78s; total: 39.78s)
  43. BigSQL V1.0 – Demo (small): select count(hour), hour from trace group by hour order by hour; → 30 rows in results (first row: 37.98s; total: 37.99s)
  44. BigSQL V1.0 – Demo (small): [bivm.ibm.com][biadmin] 1> select count(*) from trace1 t3 inner join trace2 t4 on t3.hour=t4.hour; → 477340 (1 row in results; first row: 32.24s; total: 32.25s)
  45. BigSQL V3.0 – Demo (small): CREATE HADOOP TABLE trace3 ( hour int, employeeid int, departmentid int, clientid int, date varchar(30), timestamp varchar(30) ) row format delimited fields terminated by '|' stored as textfile;
  46. BigSQL V3.0 – Demo (small): [bivm.ibm.com][biadmin] 1> select count(*) from trace3; → 12014733 (1 row in results; first row: 2.94s; total: 2.95s)
  47. BigSQL V3.0 – Demo (small): [bivm.ibm.com][biadmin] 1> select count(*) from trace3 t3 inner join trace4 t4 on t3.hour=t4.hour; → 504360 (1 row in results; first row: 0.79s; total: 0.80s)
  48. BigSQL V3.0 – Demo (small): [bivm.ibm.com][biadmin] 1> select count(hour), hour from trace3 group by hour order by hour; → 29 rows in results (first row: 1.88s; total: 1.89s)
  49. R on Hadoop ● IBM BigR (based on the SystemML Almaden Research project) ● RHadoop ● RHIPE ● ...
  50. [image slide]
  51. Goal: find the column mean. Problem: the column vector cannot fit into memory → you have to partition and parallelize
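The partition-and-parallelize idea for the column mean can be sketched in a few lines: each partition contributes only a (sum, count) pair, and the partials are combined exactly. Note the combine step divides the global sum by the global count; averaging per-partition means would be wrong for uneven splits.

```python
# Column mean over partitions that never all fit in RAM at once:
# keep only (sum, count) per partition, then combine the partials.
def partial(chunk):
    return sum(chunk), len(chunk)

def combine(parts):
    parts = list(parts)
    total = sum(s for s, _ in parts)
    count = sum(c for _, c in parts)
    return total / count

column = list(range(1, 101))   # stand-in for a column too big for memory
# Chunks of 7 leave an uneven last partition, so averaging the
# per-partition means would give the wrong answer here.
partitions = [column[i:i + 7] for i in range(0, len(column), 7)]

print(combine(partial(p) for p in partitions))  # 50.5, same as the global mean
```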
  52. ● Sampling: full dataset > RAM; example: use 1% vs. 100% of the dataset; precision loss from skewed/sparse data ● Numerical stability: limitations from finite precision in computing; algorithms must be carefully implemented; instability causes errors to cascade throughout your analysis. Catastrophic cancellation example: 6.375 − 5.625 has true value 0.75, but if both operands are first rounded to 6.0 the computed result is 0 (relative error 1.0)
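The slide's cancellation example, and the same effect in IEEE double precision, can be reproduced directly (the 1e16 case is my addition, not from the slides):

```python
# 1) The slide's example: round both operands to the nearest integer
#    (both become 6) before subtracting; the true difference 0.75 is
#    lost entirely -- relative error 1.0.
a, b = 6.375, 5.625
assert round(a) - round(b) == 0          # true value: 0.75

# 2) The same effect in IEEE double precision: at magnitude 1e16 the
#    spacing between adjacent doubles is 2, so adding 1 is absorbed
#    and the subsequent subtraction cancels to zero.
assert (1e16 + 1) - 1e16 == 0.0          # true value: 1
```

This is why the slide insists that distributed algorithms must be carefully implemented: each operator works at finite precision, and errors like these compound across a pipeline.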
  53. Data in Hadoop [diagram: the R user's workstation versus data in distributed memory]
  54. Data in Hadoop: can run R on a single node [diagram: the R user pulling distributed data onto one machine]
  55. BigR (based on SystemML). SystemML compiles hybrid runtime plans ranging from in-memory, single-machine (CP) to large-scale, cluster (MR) compute ● Challenge: guaranteed hard memory constraints (budget of the JVM size) for arbitrarily complex ML programs ● Key technical innovations: CP & MR runtime (single-machine & MR operations, integrated runtime); caching (reuse and eviction of in-memory objects); cost model (accurate time and worst-case memory estimates); optimizer (cost-based runtime plan generation); dynamic recompiler (re-optimization for initial unknowns) ● Hybrid plans gradually exploit MR parallelism as data size grows: high-performance computing for small data sizes, scalable computing for large data sizes
  56. BigR Architecture: R clients → SystemML statistics engine → data sources, with IBM R packages on both sides. Either (1) pull data (summaries) to the R client, (2) use embedded R execution, or (3) push R functions right onto the data
  57. Big R data structures: a proxy to the entire dataset. data <- bigr.frame(…) appears and acts like all of the data is on your laptop
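The "proxy to the entire dataset" idea can be illustrated with a toy lazy frame: the client object holds no data, streams partitions on demand, and only small aggregates ever travel back. This is a hypothetical sketch of the pattern, not the BigR API:

```python
# Toy illustration of the bigr.frame idea (hypothetical, not BigR):
# the proxy streams partitions and returns only scalar aggregates.
class LazyFrame:
    def __init__(self, fetch_partitions):
        # fetch_partitions stands in for reading one HDFS block at a time
        self._fetch = fetch_partitions

    def column_mean(self, col):
        total, count = 0.0, 0
        for partition in self._fetch():          # streamed, never all in RAM
            total += sum(row[col] for row in partition)
            count += len(partition)
        return total / count                     # only this scalar is returned

def fake_partitions():                           # made-up stand-in data source
    yield [{"v": 1.0}, {"v": 2.0}]
    yield [{"v": 3.0}]

frame = LazyFrame(fake_partitions)
print(frame.column_mean("v"))  # 2.0 -- computed without materializing the frame
```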
  58. BigR Demo (small) ● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich) ● 3 TB data, ~60.937.500.000 rows (middle, Innovation Center Zurich) ● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
  59. BigR Demo (small): library(bigr); bigr.connect(host="bigdata", port=7052, database="default", user="biadmin", password="xxx"); is.bigr.connected(); tbr <- bigr.frame(dataSource="DEL", coltypes=c("numeric","numeric","numeric","numeric","character","character"), dataPath="/user/biadmin/32Gtest", delimiter=",", header=F, useMapReduce=T); h <- bigr.histogram.stats(tbr$V1, nbins=24)
  60. BigR Demo (small): class bins counts centroids — ALL 0 18289280 1.583333; ALL 1 15360 2.750000; ALL 2 55040 3.916667; ALL 3 189440 5.083333; ALL 4 579840 6.250000; ALL 5 5292160 7.416667; ALL 6 8074880 8.583333; ALL 7 15653120 9.750000; ...
  61. BigR Demo (small)
  62. BigR Demo (small): jpeg('hist.jpg'); bigr.histogram(tbr$V1, nbins=24) # this command runs on 32 GB / ~650.000.000 rows in HDFS; dev.off()
  63. SPSS on Hadoop
  64. SPSS on Hadoop
  65. BigSheets Demo (small) ● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich) ● 3 TB data, ~60.937.500.000 rows (middle, Innovation Center Zurich) ● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
  66. BigSheets Demo (small)
  67. BigSheets Demo (small): this command runs on 32 GB / ~650.000.000 rows in HDFS
  68. BigSheets Demo (small)
  69. Text Extraction (SystemT, AQL)
  70. Text Extraction (SystemT, AQL)
  71. If this is not enough → the BigData AppStore
  72. BigData AppStore, Eclipse tooling ● Write your apps in Java (MapReduce), Pig Latin, Jaql, or BigSQL/Hive/BigR ● Deploy them to BigInsights via Eclipse ● Automatically schedule and update HDFS files, BigSQL tables and BigSheets collections
  73. Questions? http://www.ibm.com/software/data/bigdata/ Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps
