Information Retrieval, Applied Statistics and Mathematics on BigData - German Society of Physicists Conference 13

  1. Information Retrieval, Applied Statistics and Mathematics on BigData. Romeo Kienzler, Data Scientist and Architect, IBM Innovation Center Zurich. © 2012 IBM Corporation
  2. Fault Tolerance / Commodity Hardware. AMD Turion II Neo N40L (2x 1.5 GHz / 2 MB / 15 W), 8 GB RAM, 3 TB Seagate Barracuda 7200.14: < 500 EURO. 100 K => 200 x (2, 4, 3) => 400 cores, 1.6 TB RAM, 200 TB HD. MTBF ~ 365 d > 1.5 d
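The totals on this slide can be checked with a short sketch. It rests on two assumptions that the slide does not state: that the 200 TB figure is the raw 600 TB divided by a replication factor of 3 (the HDFS default), and that the 365-day MTBF applies to a single node whose failures are independent of the others. Under that simple model the cluster sees a node failure roughly every 1.8 days, which is in the same range as the 1.5 d shown on the slide.

```python
# Back-of-the-envelope cluster sizing. Hardware figures are read off the slide;
# the replication factor and the failure model are assumptions.
NODES = 200
CORES_PER_NODE = 2
RAM_GB_PER_NODE = 8
DISK_TB_PER_NODE = 3
REPLICATION = 3        # assumed: HDFS default replication factor
NODE_MTBF_DAYS = 365   # assumed: the slide's MTBF refers to one node


def cluster_totals(nodes=NODES):
    """Return aggregate resources and the expected days between node failures."""
    return {
        "cores": nodes * CORES_PER_NODE,
        "ram_tb": nodes * RAM_GB_PER_NODE / 1000,
        "usable_disk_tb": nodes * DISK_TB_PER_NODE / REPLICATION,
        # With independent failures at a constant rate, the failure rates add
        # up, so the expected time between failures shrinks by the node count.
        "days_between_failures": NODE_MTBF_DAYS / nodes,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value:g}")
```

The point of the calculation is that a failure every day or two is normal operation at this scale, so fault tolerance has to live in the software layer rather than in the hardware.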
  3. Supercomputer in a Rack. Supercomputer before: Weather, Atom Bombs, Science, Crash Tests. Supercomputer in a Rack: 18 TB main memory, 1008 CPU cores, 113 TFLOPS (1st in TOP500, 2013: 17590 TFLOPS; 2004: 71 TFLOPS)
  4. Hadoop / BigInsights
  5. Hadoop Distributed File System
  6. Hadoop Job Scheduling
  7. Aggregated Bandwidth between CPU, Main Memory and Hard Drive. 1 TB (at 10 GByte/s): 1 node - 100 sec; 10 nodes - 10 sec; 100 nodes - 1 sec; 1000 nodes - 100 msec
  8. Watson. 1 TB (at 45.5 GByte/s): 1 core - 22 sec; 10 cores - 2.2 sec; 100 cores - 220 msec; 1000 cores - 22 msec; 10000 cores - 2.2 msec
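The scan times on the two slides above follow from dividing the data size by the aggregated bandwidth. A minimal sketch, assuming perfectly linear scaling (every node or core reads an equal share, with no coordination overhead) and a decimal terabyte of 1000 GB; the function name is illustrative only:

```python
def scan_seconds(data_gb, gb_per_sec_per_unit, units):
    """Time to read data_gb when `units` workers each read an equal share in parallel."""
    return data_gb / (gb_per_sec_per_unit * units)


if __name__ == "__main__":
    ONE_TB_IN_GB = 1000  # assumed: decimal terabyte, matching the slides' round numbers

    for units in (1, 10, 100, 1000):
        seconds = scan_seconds(ONE_TB_IN_GB, 10, units)
        print(f"{units:>5} nodes at 10 GB/s each: {seconds:g} s")

    for units in (1, 10, 100, 1000, 10000):
        seconds = scan_seconds(ONE_TB_IN_GB, 45.5, units)
        print(f"{units:>5} cores at 45.5 GB/s each: {seconds:.3g} s")
```

Real clusters fall short of this ideal because of scheduling, network transfer and stragglers, so the figures are best read as lower bounds.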
  9. Data Streaming. [Diagram labels: X86 Box, X86 Blade, Cell Blade, FPGA Blade; Operating System; Transport; System S Data Fabric; Processing Element Container]
  10. Massive Parallel DataWarehousing
  11. Why do we need to process so much data?
  12. Data Growth. Data AVAILABLE to an organization; data an organization can PROCESS; missed opportunity. 100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net, 80 % of them spam and viruses. => Filtering is more and more important. As much data has been produced between 2003 and now as in all the time up to 2003.
  13. Separate the Signal From the Noise¹. ¹ http://www.ibmsystemsmag.com/power/businessstrategy/BI-and-Analytics/signal_noise/
  14. The Unreasonable Effectiveness of Data¹: "sometimes it's not who has the best algorithm that wins; it's who has the most data." (C) Google Inc. ¹ http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
  15. Statistical Modeling of Physical Systems
  16. From Unstructured Data to Structured Data: Feature Extraction. Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately¹. ¹ Wikipedia
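One common way to turn unstructured text into structured data is a bag-of-words representation, where each document becomes a fixed-length vector of term counts. The slide does not name a specific technique, so the following is only an illustrative sketch; the tokenizer and the function names are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Sorted list of all distinct tokens that occur in the documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def to_feature_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    docs = ["Big data needs big clusters", "Clusters process data"]
    vocab = build_vocabulary(docs)
    print(vocab)
    for doc in docs:
        print(to_feature_vector(doc, vocab))
```

Each document is now a row of numbers with the same columns, which is the form that statistical methods such as regression or clustering expect.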
  17. Dimension Reduction: Principal Component Analysis / Singular Value Decomposition; Linear Discriminant Analysis. Source: coursera.org
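As a rough illustration of Principal Component Analysis, the sketch below centers the data, builds the covariance matrix and finds its dominant eigenvector by power iteration. Power iteration is swapped in here for brevity and is not taken from the slides; production code would normally use an SVD routine from a numerical library. All function names are illustrative.

```python
import math


def center(rows):
    """Subtract the per-dimension mean from every row."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]


def covariance(centered):
    """Covariance matrix (1/N normalization) of already centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(dims)]
            for i in range(dims)]


def first_principal_component(rows, iterations=100):
    """Approximate the direction of largest variance by power iteration."""
    cov = covariance(center(rows))
    dims = len(cov)
    # Fixed start vector; if it happens to be orthogonal to the dominant
    # eigenvector, or the data has zero variance, the loop stops early.
    v = [1.0 / (j + 1) for j in range(dims)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(rows, direction):
    """Reduce each centered row to its coordinate along `direction`."""
    return [sum(x * d for x, d in zip(r, direction)) for r in center(rows)]


if __name__ == "__main__":
    data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # points on the line y = x
    pc = first_principal_component(data)
    print(pc)
    print(project(data, pc))
```

For points that lie on a line, a single coordinate along the first component describes them completely, which is the reduction from two dimensions to one.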
  18. Data Parallelism
  19. Data Parallelism: calculate the empirical mean along each dimension m = 1, ..., M (a step in Principal Component Analysis); N-gram models (NLP); ordinary least-squares parameter estimator for linear regression
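The three examples on this slide are data parallel because each one reduces to sums that can be computed per data block and then merged. A minimal single-process sketch of that pattern follows; the chunks stand in for blocks stored on different nodes, `map` stands in for the distributed step, and all function names are illustrative. The regression part is restricted to one input variable to keep the arithmetic short.

```python
from collections import Counter
from functools import reduce


def split(items, parts):
    """Cut a list into contiguous chunks, standing in for blocks on different nodes."""
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


# Empirical mean along each dimension: per-chunk sums plus a row count.
def partial_sums(rows):
    dims = len(rows[0])
    return [sum(r[m] for r in rows) for m in range(dims)], len(rows)


def merge_sums(a, b):
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]


def parallel_mean(chunks):
    sums, n = reduce(merge_sums, map(partial_sums, chunks))
    return [s / n for s in sums]


# N-gram model (here: bigram counts). Chunks hold whole sentences so that
# no bigram is cut in half at a chunk boundary.
def bigram_counts(sentences):
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts


def parallel_bigrams(chunks):
    return reduce(lambda a, b: a + b, map(bigram_counts, chunks))


# Ordinary least squares for y = slope * x + intercept via sufficient statistics.
def ols_stats(points):
    return (len(points),
            sum(x for x, _ in points), sum(y for _, y in points),
            sum(x * y for x, y in points), sum(x * x for x, _ in points))


def merge_stats(a, b):
    return tuple(p + q for p, q in zip(a, b))


def parallel_ols(chunks):
    n, sx, sy, sxy, sxx = reduce(merge_stats, map(ols_stats, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return slope, (sy - slope * sx) / n


if __name__ == "__main__":
    rows = [[1, 10], [2, 20], [3, 30], [4, 40]]
    print(parallel_mean(split(rows, 2)))
    points = [(x, 2 * x + 1) for x in range(6)]
    print(parallel_ols(split(points, 3)))
    sentences = [["big", "data"], ["big", "data", "tools"]]
    print(parallel_bigrams(split(sentences, 2)))
```

Only the small partial results cross the network, never the raw rows, which is why these computations scale with the number of nodes.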
  20. BUT: Do I want to care about algorithm parallelization?
  21. High-Level Languages. Source: Hadoopsphere.com
  22. High-Level Languages (IBM SystemML). Extensible library: Classification (Linear SVMs, Logistic Reg), Clustering (K-means), Regression (Linear Regression), Matrix Factorizations (SGD solver, NMF), Ranking (PageRank, HITS). Stack: DML Scripts, Parser, High-Level Ops, Low-Level Ops, Runtime Ops, Optimizations, Hadoop. Open source variant: Apache Mahout (fewer algorithms, no optimizer)
  23. High-Level Languages (RHadoop). Source: http://www.revolutionanalytics.com
  24. High-Level Languages (R on IBM PureData). Source: http://www.revolutionanalytics.com
  25. Push Back: Application Algorithm Compile Engine Execution Language Engine
  26. Push Back
  27. Push Back
  37. Linear Discriminant Analysis. Source: coursera.org
  39. Outlook. Theory: With BigData the machines are thinking for us. Reality: Existing algorithms are now beginning to be applied on a large scale. Present: Every company thinks it urgently has to participate in BigData, but doesn't know how. Future: Every company will have access to BigData technologies and will use them. Hype: The whole world is doing BigData. Vision: BigData analytics is usable by everybody, at their fingertips.
  40. Questions?
  41. Links: www.ibm.com/developerworks www.ibm.com/ibm/university/academic romeo.kienzler@ch.ibm.com rkie@ch.ibm.com U6K8qm_HFas Jqq66INlQ0U