The Data Scientist's Workplace of the Future, IBM developerDays 2014, Vienna, by Romeo Kienzler

Slide 1: The Data Scientist's Workplace of the Future – Workshop
SwissRE, 11.6.14
Romeo Kienzler, IBM Center of Excellence for Data Science, Cognitive Systems and BigData
(a joint venture between IBM Research Zurich and the IBM Innovation Center DACH)
Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg

Slide 2: The Data Scientist's Workplace of the Future – Credits
Romeo Kienzler, IBM Innovation Center
Parts of these slides have been copied from and/or revised by:
● Dr. Anand Ranganathan, IBM Watson Research Lab
● Dr. Stefan Mück, IBM BigData Leader Europe
● Dr. Berthold Rheinwald, IBM Almaden Research Lab
● Dr. Diego Kuonen, Statoo Consulting
● Dr. Abdel Labbi, IBM Zurich Research Lab
● Brandon MacKenzie, IBM Software Group

Slide 3: What is Data Science?
Source: Statoo.com http://slidesha.re/1kmNiX0

Slide 4: What is Data Science?
Source: Statoo.com http://slidesha.re/1kmNiX0

Slide 5: Data Science at Present
● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
– SQL (42%)
– R (33%)
– Python (26%)
– Excel (25%)
– Java, Ruby, C++ (17%)
– SPSS, SAS (9%)
● Limitations (single-node usage)
– main memory
– CPU <-> main-memory bandwidth
– CPU
– storage <-> main-memory bandwidth (either single node or SAN)

Slide 6: Data Science at Present – Demo
● Assume a 1 TB file on the hard drive
● Split it into 16 chunks (split -d names them x00 ... x15)
    split -d -n 16 output.json
● Distribute the chunks over 4 nodes
    for i in $(seq 0 15); do scp x$(printf '%02d' $i) id@node$((i % 4 + 1)):~/; done
● Perform the calculation in parallel (each ssh is backgrounded with &)
    for n in 1 2 3 4; do ssh id@node$n "cat x* | awk -F: '{print \$6}' | grep -i samsung | grep breathtaking | wc -l" & done > result; wait
● Merge the results (awk totals the per-node counts; the Unix sum command computes checksums, not totals)
    cat result | awk '{s += $1} END {print s}'
Source: http://sergeytihon.wordpress.com/2013/03/20/the-data-science-venn-diagram/

Slide 7: What is BIG data?

Slide 8: What is BIG data?

Slide 9: What is BIG data?
(diagram labels: Big Data, Hadoop)

Slide 10: What is BIG data?
(diagram labels: Business Intelligence, Data Warehouse)

Slide 11: BigData == Hadoop?
(diagram labels: Hadoop, BigData)

Slide 12: What is beyond the "Data Warehouse"?
(diagram labels: Data Lake, Data Warehouse)

Slide 13: First "BigData" Use Case?
● Google index
– 40 × 10^9 = 40.000.000.000 => 40 billion pages indexed
– will break the 100 PB barrier soon
● Derived from MapReduce
– now "Caffeine", based on "Percolator"
– incremental vs. batch
– in-memory vs. disk

Slide 14: Map-Reduce → Hadoop → BigInsights

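To make the programming model behind this lineage concrete, here is a minimal, framework-free Python sketch of the classic word count: the same map, shuffle, and reduce phases a Hadoop job expresses through Mapper and Reducer classes. The function names are illustrative only, not part of any Hadoop API.

    from collections import defaultdict

    def map_phase(documents):
        # Mapper: emit a (word, 1) pair for every word in every input record.
        for doc in documents:
            for word in doc.split():
                yield word.lower(), 1

    def shuffle(pairs):
        # Shuffle: group all emitted values by key (the framework does this
        # between the map and reduce phases, moving data across the network).
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(grouped):
        # Reducer: aggregate the list of values for each key.
        for key, values in grouped:
            yield key, sum(values)

    docs = ["big data needs data parallelism", "hadoop brings data parallelism"]
    print(dict(reduce_phase(shuffle(map_phase(docs)))))
    # {'big': 1, 'data': 3, 'needs': 1, 'parallelism': 2, 'hadoop': 1, 'brings': 1}
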
Slide 15: BigData Use Cases
● CERN LHC
– 25 petabytes per year
● Facebook
– Hive data warehouse
– 300 PB, growing by 600 TB/day
– > 100k servers
● Genomics
● Enterprises
– data center analytics (logfiles, OS/NW monitors, ...)
– predictive maintenance, cybersecurity
– social media analytics
– DWH offload
– Call Detail Record (CDR) data preservation: http://www.balthasar-glaettli.ch/vorratsdaten/

Slide 16: Why is Big Data important?

Slide 17: BigData Analytics
Source: http://www.strategy-at-risk.com/2008/01/01/what-we-do/

Slide 18: BigData Analytics – Predictive Analytics
"Sometimes it's not who has the best algorithm that wins; it's who has the most data." (c) Google Inc.
The Unreasonable Effectiveness of Data¹
No sampling => work with the full dataset => no more p-values/z-scores
¹ http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf

Slide 19: We Need Data Parallelism

Slide 20: Aggregated Bandwidth between CPU, Main Memory and Hard Drive
Scanning 1 TB (at 10 GByte/s per node):
– 1 node: 100 s
– 10 nodes: 10 s
– 100 nodes: 1 s
– 1000 nodes: 100 ms

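The slide's numbers follow directly from dividing the scan volume by the aggregate bandwidth. A quick sketch to reproduce them, assuming perfectly linear scaling and no coordination overhead:

    TERABYTE = 1e12   # bytes to scan
    NODE_BW = 10e9    # bytes/s of I/O bandwidth per node

    for nodes in (1, 10, 100, 1000):
        # Each node scans 1/nodes of the data; bandwidth aggregates linearly.
        seconds = TERABYTE / (NODE_BW * nodes)
        print(f"{nodes:5d} nodes: {seconds:8.1f} s")
    # 1 node: 100 s, 10 nodes: 10 s, 100 nodes: 1 s, 1000 nodes: 0.1 s
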
Slide 21: Fault Tolerance / Commodity Hardware
AMD Turion II Neo N40L (2x 1.5 GHz / 2 MB / 15 W), 8 GB RAM, 3 TB SEAGATE Barracuda 7200.14: < CHF 500
● CHF 100k => 200 such nodes => 400 cores, 1.6 TB RAM, 600 TB disk
● A node MTBF of ~365 d => the 200-node cluster sees a failure roughly every other day
Source: http://www.cloudcomputingpatterns.org/Watchdog

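The fault-tolerance argument is the same arithmetic in reverse: a sketch dividing the per-node MTBF by the cluster size, a rough approximation that ignores correlated failures:

    NODE_MTBF_DAYS = 365   # a single commodity node

    for nodes in (1, 200, 1000):
        # Expected time between failures somewhere in the cluster.
        cluster_mtbf = NODE_MTBF_DAYS / nodes
        print(f"{nodes:5d} nodes: a failure every {cluster_mtbf:6.2f} days")
    # At 200 nodes a failure every ~1.8 days: the software, not the
    # hardware, has to provide reliability (watchdog pattern).
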
Slide 22: NoSQL Databases
● Column store
– Hadoop / HBase
– Cassandra
– Amazon SimpleDB
● JSON / document store
– MongoDB
– CouchDB
● Key/value store
– Amazon DynamoDB
– Voldemort
● Graph DBs
– DB2 SPARQL extension
– Neo4J
● MPP RDBMS
– DB2 DPF, DB2 pureScale, PureData for Operational Analytics
– Oracle RAC
– Greenplum
● http://nosql-database.org/ lists > 150 of them

Slide 23: CAP Theorem / Brewer's Theorem¹
● It is impossible for a distributed computer system to simultaneously guarantee all 3 properties:
– Consistency (all nodes see the same data at the same time)
– Availability (every request learns whether it succeeded or failed)
– Partition tolerance (the system continues to operate despite failure of part of the system)
● What about ACID?
– Atomicity
– Consistency
– Isolation
– Durability
● BASE, the new ACID
– Basically Available
– Soft state
– Eventual consistency
• Monotonic Read Consistency
• Monotonic Write Consistency
• Read Your Own Writes

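To make "eventual consistency" tangible, here is a toy Python simulation of two replicas with asynchronous replication; it is a deliberately simplified sketch (real systems use quorums, vector clocks, or anti-entropy protocols):

    # Two replicas of a key-value store; writes are acknowledged by
    # replica A immediately, replication to replica B happens later.
    replica_a, replica_b = {}, {}
    replication_log = []

    def write(key, value):
        replica_a[key] = value                 # available: ack without waiting for B
        replication_log.append((key, value))   # shipped to B asynchronously

    def replicate():
        while replication_log:
            key, value = replication_log.pop(0)
            replica_b[key] = value

    write("user42", "Zurich")
    print(replica_b.get("user42"))   # None: a read on B is stale (no strong consistency)
    replicate()
    print(replica_b.get("user42"))   # 'Zurich': the replicas converged (eventual consistency)
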
Slide 24: What role is the cloud playing here?

Slide 25: "Elastic" Scale-Out
Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload

Slide 26: "Elastic" Scale-Out of ...

Slide 27: "Elastic" Scale-Out of CPU Cores

Slide 28: "Elastic" Scale-Out of CPU Cores, Storage

Slide 29: "Elastic" Scale-Out of CPU Cores, Storage, Memory

Slide 30: "Elastic" Scale-Out – linear
Source: http://www.cloudcomputingpatterns.org/Elastic_Platform

Slide 31: How Do Databases Scale Out? Shared-Disk Architectures

Slide 32: How Do Databases Scale Out? Shared-Nothing Architectures

Slide 33: Hadoop? Shared-Nothing Architecture? Shared-Disk Architecture?

Slide 34: Data Science on Hadoop
SQL (42%), R (33%), Python (26%), Excel (25%), Java/Ruby/C++ (17%), SPSS/SAS (9%)
(diagram labels: Data Science, Hadoop)

Slide 35: Large-Scale Data Ingestion
● Traditionally
– crawl to the local file system (e.g. wget http://www.heise.de/newsticker/)
– export RDBMS data to CSV (local file system)
– batched FTP server uploads
– then: copy to HDFS
● BigInsights
– use one of the built-in importers
– imports directly into HDFS
– use the Eclipse tooling to deploy custom importers easily

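The traditional path, as a minimal Python sketch; it assumes wget and a configured Hadoop client on the machine, and the file name and HDFS path are placeholders:

    import subprocess

    # 1. Crawl to the local file system.
    subprocess.run(["wget", "-q", "-O", "newsticker.html",
                    "http://www.heise.de/newsticker/"], check=True)

    # 2. Then copy the result into HDFS with the standard client command.
    subprocess.run(["hdfs", "dfs", "-put", "newsticker.html",
                    "/user/biadmin/crawl/"], check=True)
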
Slide 36: Large-Scale Data Ingestion (ETL on M/R)
● Modern ETL (Extract, Transform, Load) tools support Hadoop as
– source and sink (HDFS)
– engine (MapReduce)
● Example: InfoSphere DataStage

Slide 37: Real-Time / In-Memory Data Ingestion
● If the volume can be reduced dramatically during the first processing steps:
– feature extraction from video, audio, semi-structured text (e.g. logfiles) and structured text
– filtering
– compression
● Recommendation: use a streaming engine
– IBM InfoSphere Streams
– Twitter Storm (now in the Apache incubator)
– Apache Spark Streaming

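The volume-reduction idea in a minimal Python sketch: process records as they stream by and land only the small aggregate, not the raw data. The log format here is invented for illustration:

    from collections import Counter

    def log_stream():
        # Stand-in for an unbounded source (socket, message queue, ...).
        yield from [
            "2014-06-11 10:00:01 GET /index.html 200",
            "2014-06-11 10:00:02 GET /missing 404",
            "2014-06-11 10:00:03 GET /index.html 200",
        ]

    # Extract one feature (the status code) per record and aggregate in
    # memory; only this Counter, not the raw log volume, is persisted.
    status_counts = Counter(line.rsplit(" ", 1)[1] for line in log_stream())
    print(status_counts)   # Counter({'200': 2, '404': 1})
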
Slide 38: Real-Time / In-Memory Data Ingestion
● If the volume can be reduced dramatically during the first processing steps:
– feature extraction from video, audio, semi-structured text (e.g. logfiles) and structured text
– filtering
– compression

Slide 39: SQL on Hadoop
● IBM BigSQL (ANSI SQL-92 compliant)
● Hive (SQL dialect)
● Cloudera Impala
● Lingual
● ...
(diagram labels: SQL, Hadoop)

Slide 40: BigSQL V3.0 – ANSI SQL-92 compliant
IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification.
Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql

Slide 41: BigSQL V3.0 – Architecture

Slide 42: BigSQL V3.0 – Demo (small)
● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB data, ~60.937.500.000 rows (medium, Innovation Center Zurich)
● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)

Slide 43: BigSQL V3.0 – Demo (small)
CREATE EXTERNAL TABLE trace (
  hour integer,
  employeeid integer,
  departmentid integer,
  clientid integer,
  date string,
  timestamp string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/biadmin/32Gtest';

SELECT COUNT(hour), hour
FROM trace
GROUP BY hour
ORDER BY hour;
-- this query runs on 32 GB / ~650.000.000 rows in HDFS

Slide 44: BigSQL V3.0 – Demo (small)

Slide 45: BigSQL V3.0 – Demo (small)

Slide 46: R on Hadoop
● IBM BigR (based on SystemML, an IBM Almaden Research project)
● RHadoop
● RHIPE
● ...
(diagram labels: "R", Hadoop)

Slide 47: BigR (based on SystemML)
Example: Gaussian Non-negative Matrix Factorization (GNMF)
● Java implementation: >1500 lines of code. The slide shows excerpts of the hand-written MapReduce classes (MatrixGNMF plus UpdateWHStep1 through UpdateWHStep5), each implementing one matrix operation of the algorithm as its own map/reduce job.
● Equivalent SystemML implementation: 10 lines of code, which also makes experimenting with multiple variants easy:
    W = W*max(V%*%t(H) - alphaW*JW, 0)/(W%*%H%*%t(H))
    H = H*max(t(W)%*%V - alphaH*JH, 0)/(t(W)%*%W%*%H)

    W = W*((S*V)%*%t(H))/((S*(W%*%H))%*%t(H))
    H = H*(t(W)%*%(S*V))/(t(W)%*%(S*(W%*%H)))

    W = W*(V/(W%*%H) %*% t(H))/(E%*%t(H))
    H = H*(t(W)%*%(V/(W%*%H)))/(t(W)%*%E)

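For intuition, the first update-rule variant (with alphaW = alphaH = 0, i.e. the classic Lee-Seung multiplicative updates) is easy to express with NumPy. A small single-machine sketch, illustrative only and not the SystemML or BigR API:

    import numpy as np

    rng = np.random.default_rng(0)
    V = rng.random((100, 80))   # data matrix to factorize: V ≈ W @ H
    k = 5                       # rank of the factorization
    W = rng.random((100, k))
    H = rng.random((k, 80))

    eps = 1e-9                  # guards against division by zero
    for _ in range(100):
        # Multiplicative GNMF updates; SystemML compiles the same
        # expressions into distributed matrix operations on the cluster.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)

    print("reconstruction error:", np.linalg.norm(V - W @ H))
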
Slide 48: BigR (based on SystemML)
SystemML compiles hybrid runtime plans ranging from in-memory, single-machine (CP) to large-scale, cluster (MR) compute.
● Challenge
– guaranteed hard memory constraints (budget of the JVM size)
– for arbitrarily complex ML programs
● Key technical innovations
– CP & MR runtime: single-machine & MR operations, integrated runtime
– caching: reuse and eviction of in-memory objects
– cost model: accurate time and worst-case memory estimates
– optimizer: cost-based runtime plan generation
– dynamic recompiler: re-optimization for initial unknowns
● As data size grows, the plans gradually shift from CP to hybrid CP/MR to pure MR: high-performance computing for small data sizes, scalable computing for large data sizes.

Slide 49: BigR Architecture
(diagram: R clients and IBM R packages on top of the SystemML statistics engine, embedded R execution, and the data sources)
1. Pull data (summaries) to the R client
2. Or, push R functions right onto the data

Slide 50: BigR Demo (small)
● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB data, ~60.937.500.000 rows (medium, Innovation Center Zurich)
● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)

Slide 51: BigR Demo (small)
library(bigr)

bigr.connect(host="bigdata", port=7052, database="default",
             user="biadmin", password="xxx")
is.bigr.connected()

tbr <- bigr.frame(dataSource="DEL",
                  coltypes=c("numeric", "numeric", "numeric", "numeric",
                             "character", "character"),
                  dataPath="/user/biadmin/32Gtest",
                  delimiter=",", header=F, useMapReduce=T)

h <- bigr.histogram.stats(tbr$V1, nbins=24)

Slide 52: BigR Demo (small)
  class bins   counts centroids
1   ALL    0 18289280  1.583333
2   ALL    1    15360  2.750000
3   ALL    2    55040  3.916667
4   ALL    3   189440  5.083333
5   ALL    4   579840  6.250000
6   ALL    5  5292160  7.416667
7   ALL    6  8074880  8.583333
8   ALL    7 15653120  9.750000
...

Slide 53: BigR Demo (small)

Slide 54: BigR Demo (small)
jpeg('hist.jpg')
bigr.histogram(tbr$V1, nbins=24)  # this command runs on 32 GB / ~650.000.000 rows in HDFS
dev.off()

Slide 55: BigR Demo (small)
Sampling, resampling, bootstrapping vs. whole-dataset processing: what is your experience?

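One way to frame that discussion is to compare a statistic computed on the full dataset with bootstrap estimates from a small sample. A pure-Python sketch on synthetic data, so the numbers are only illustrative:

    import random
    import statistics

    random.seed(42)
    population = [random.gauss(100, 15) for _ in range(1_000_000)]   # stand-in for the full dataset

    full_mean = statistics.mean(population)       # whole-dataset processing
    sample = random.sample(population, 10_000)    # a 1% sample

    # Bootstrap: resample the sample to estimate the sample mean's uncertainty.
    boot_means = sorted(statistics.mean(random.choices(sample, k=len(sample)))
                        for _ in range(200))
    lo, hi = boot_means[4], boot_means[194]       # ~95% interval

    print(f"full mean     : {full_mean:.3f}")
    print(f"sample mean   : {statistics.mean(sample):.3f}")
    print(f"bootstrap 95% : [{lo:.3f}, {hi:.3f}]")
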
Slide 56: Python on Hadoop
(diagram labels: Python, Hadoop)

Slide 57: SPSS on Hadoop

Slide 58: SPSS on Hadoop

Slide 59: BigSheets Demo (small)
● 32 GB data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB data, ~60.937.500.000 rows (medium, Innovation Center Zurich)
● 0.7 PB data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)

Slide 60: BigSheets Demo (small)

Slide 61: BigSheets Demo (small)
This command runs on 32 GB / ~650.000.000 rows in HDFS.

Slide 62: BigSheets Demo (small)

Slide 63: Text Extraction (SystemT, AQL)

Slide 64: Text Extraction (SystemT, AQL)

Slide 65: If this is not enough? → BigData AppStore

Slide 66: BigData AppStore, Eclipse Tooling
● Write your apps in
– Java (MapReduce)
– Pig Latin, Jaql
– BigSQL / Hive / BigR
● Deploy them to BigInsights via Eclipse
● Automatically schedule and update
– HDFS files
– BigSQL tables
– BigSheets collections

Slide 67: Questions?
http://www.ibm.com/software/data/bigdata/
Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps

Slide 68: DFT / Audio Analytics (as promised)
library(tuneR)

a <- readWave("whitenoisesine.wav")
f <- fft(a@left)
jpeg('rplot_wnsine.jpg')
plot(Re(f)^2)   # note: Mod(f)^2 would give the usual power spectrum (Re^2 + Im^2)
dev.off()

a <- readWave("whitenoise.wav")
f <- fft(a@left)
jpeg('rplot_wn.jpg')
plot(Re(f)^2)
dev.off()

a <- readWave("whitenoisesine.wav")
brv <- as.bigr.vector(a@left)
al <- as.list(a@left)

Slide 69: Backup Slides

Slide 84: Map-Reduce
Source: http://www.cloudcomputingpatterns.org/Map_Reduce