Information Retrieval, Applied Statistics and Mathematics on BigData - German Society of Physicists Conference 13
 

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • © 2012 IBM Corporation. Slide 1: Information Retrieval, Applied Statistics and Mathematics on BigData. Romeo Kienzler, Data Scientist and Architect, IBM Innovation Center Zurich
    • Slide 2: Fault Tolerance / Commodity Hardware. AMD Turion II Neo N40L (2x 1.5 GHz / 2 MB / 15 W), 8 GB RAM, 3 TB Seagate Barracuda 7200.14: < 500 EUR. 100 K => 200 x (2, 4, 3) => 400 cores, 1.6 TB RAM, 200 TB HD. MTBF ~ 365 d > 1.5 d
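
The sizing on this slide can be retraced with a few lines of arithmetic. The following is only a sketch under assumptions the slide does not state: that "100 K" is a budget of 100,000 EUR, that each node costs the full 500 EUR, that the 200 TB figure is the raw disk capacity divided by a replication factor of 3, and that nodes fail independently. With these assumptions the simple model gives roughly 1.8 days between node failures, the same order of magnitude as the 1.5 d quoted on the slide.

```python
# Back-of-the-envelope sizing for the commodity cluster on the slide.
# Assumptions (not stated on the slide): 100,000 EUR budget, 500 EUR per node,
# 3-fold replication of stored data, independent node failures.

BUDGET_EUR = 100_000
NODE_PRICE_EUR = 500
CORES_PER_NODE = 2
RAM_GB_PER_NODE = 8
DISK_TB_PER_NODE = 3
REPLICATION = 3        # assumed, chosen because 600 TB raw / 3 = 200 TB
NODE_MTBF_DAYS = 365


def size_cluster(budget_eur=BUDGET_EUR):
    """Return the aggregate resources a given budget buys."""
    nodes = budget_eur // NODE_PRICE_EUR
    return {
        "nodes": nodes,
        "cores": nodes * CORES_PER_NODE,
        "ram_tb": nodes * RAM_GB_PER_NODE / 1000,
        "raw_disk_tb": nodes * DISK_TB_PER_NODE,
        "usable_disk_tb": nodes * DISK_TB_PER_NODE / REPLICATION,
        # With independent failures, the cluster as a whole sees a node
        # failure roughly every (per-node MTBF / number of nodes) days.
        "days_between_failures": NODE_MTBF_DAYS / nodes,
    }


cluster = size_cluster()
for key, value in cluster.items():
    print(f"{key}: {value}")
```

The point of the calculation is the last entry: at this scale a failure somewhere in the cluster is a routine event, so fault tolerance has to be handled in software.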
    • Slide 3: Supercomputer in a Rack. Supercomputer before: ➔ Weather ➔ Atom Bombs ➔ Science ➔ Crash Tests. Supercomputer in a Rack: ➔ 18 TB Main Memory, 1008 CPU Cores, 113 TFLOPS (1st in TOP500 2013: 17590 TFLOPS; 2004: 71 TFLOPS)
    • Slide 4: Hadoop / BigInsights
    • Slide 5: Hadoop Distributed File System
    • Slide 6: Hadoop Job Scheduling
    • Slide 7: Aggregated Bandwidth between CPU, Main Memory and Hard Drive. 1 TB (at 10 GByte/s): 1 node: 100 sec; 10 nodes: 10 sec; 100 nodes: 1 sec; 1000 nodes: 100 msec
    • Slide 8: Watson. 1 TB (at 45.5 GByte/s): 1 core: 22 sec; 10 cores: 2.2 sec; 100 cores: 220 msec; 1000 cores: 22 msec; 10000 cores: 2.2 msec
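
The scan times on the two slides above follow from dividing the data volume by the aggregated bandwidth. A minimal sketch, assuming perfectly linear scaling (no coordination or network overhead), decimal units (1 TB = 1000 GB), and that the quoted bandwidth applies per node or per core; the function name `scan_seconds` is made up for this illustration:

```python
def scan_seconds(data_gb, gb_per_sec_per_worker, workers):
    """Time to read data_gb once when `workers` units each read in parallel.

    Assumes ideal linear scaling: aggregated bandwidth = per-worker
    bandwidth times the number of workers.
    """
    return data_gb / (gb_per_sec_per_worker * workers)


ONE_TB_IN_GB = 1000  # decimal units, as an assumption

print("Nodes at 10 GByte/s each:")
for nodes in (1, 10, 100, 1000):
    print(f"  {nodes:>5} nodes: {scan_seconds(ONE_TB_IN_GB, 10, nodes):g} s")

print("Cores at 45.5 GByte/s each:")
for cores in (1, 10, 100, 1000, 10000):
    print(f"  {cores:>5} cores: {scan_seconds(ONE_TB_IN_GB, 45.5, cores):.4g} s")
```

Real clusters fall short of this ideal, but the calculation shows why adding nodes, rather than faster disks, is the main lever for scanning large data sets.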
    • Slide 9: Data Streaming. [Diagram: Processing Element Containers on top of the System S Data Fabric, Transport, and Operating System layers, spanning an x86 box and x86, Cell, and FPGA blades]
    • Slide 10: Massive Parallel Data Warehousing
    • Slide 11: Why do we need to process so much data?
    • Slide 12: Data Growth. [Chart: data AVAILABLE to an organization vs. data an organization can PROCESS; the gap between the two is the missed opportunity.] 100 million tweets are posted every day, 35 hours of video are uploaded every minute, 6.1 x 10^12 text messages were sent in 2011, and 247 x 10^9 e-mails passed through the net, 80 % of them spam and viruses. => Filtering is more and more important. As much data has been produced between 2003 and now as in all the time up to 2003.
    • Slide 13: Separate the Signal From the Noise¹ (¹ http://www.ibmsystemsmag.com/power/businessstrategy/BI-and-Analytics/signal_noise/)
    • Slide 14: The Unreasonable Effectiveness of Data¹. "sometimes it's not who has the best algorithm that wins; it's who has the most data." (C) Google Inc. (¹ http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf)
    • Slide 15: Statistical Modeling of Physical Systems
    • Slide 16: From Unstructured Data to Structured Data - Feature Extraction. "Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately"¹ (¹ Wikipedia)
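
One common way to turn unstructured text into structured data is a bag-of-words representation, where each document becomes a fixed-length vector of term counts. The slide does not name a specific technique, so the following is only an illustrative sketch; the function names and the sample sentences are invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct term in the corpus to a column index."""
    terms = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}


def to_feature_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    vector = [0] * len(vocabulary)
    for term, count in Counter(tokenize(text)).items():
        index = vocabulary.get(term)
        if index is not None:  # terms outside the vocabulary are ignored
            vector[index] = count
    return vector


docs = ["Big data needs big clusters", "Data beats algorithms"]
vocab = build_vocabulary(docs)
vectors = [to_feature_vector(doc, vocab) for doc in docs]

print(sorted(vocab, key=vocab.get))
for vec in vectors:
    print(vec)
```

Each free-text document is now a row in a numeric table, which is the form that the statistical methods on the following slides expect.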
    • Slide 17: Dimension Reduction. Principal Component Analysis / Singular Value Decomposition; Linear Discriminant Analysis. Source: coursera.org
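
For reference, the standard textbook formulation of Principal Component Analysis and its link to the Singular Value Decomposition can be written as follows. The notation is an assumption of this note, not taken from the slide: $x_n \in \mathbb{R}^M$ are the $N$ observations, $X_c$ is the $N \times M$ matrix of centered observations, and $V_k$ holds the first $k$ columns of $V$.

```latex
\begin{align}
  \mu &= \frac{1}{N}\sum_{n=1}^{N} x_n
      && \text{(empirical mean)} \\
  C &= \frac{1}{N-1}\sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^{\top}
     = \frac{1}{N-1}\, X_c^{\top} X_c
      && \text{(covariance matrix)} \\
  X_c &= U \Sigma V^{\top}
      \;\Longrightarrow\;
      C = V \,\frac{\Sigma^{2}}{N-1}\, V^{\top}
      && \text{(SVD yields the eigenvectors of } C\text{)} \\
  z_n &= V_k^{\top}(x_n - \mu)
      && \text{(projection onto the first } k \text{ components)}
\end{align}
```

The columns of $V$ are the principal directions and the diagonal entries of $\Sigma^{2}/(N-1)$ are the variances along them, so keeping only the $k$ largest reduces the dimension from $M$ to $k$.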
    • Slide 18: Data Parallelism
    • Slide 19: Data Parallelism. Calculate the empirical mean along each dimension m = 1, ..., M (a step in Principal Component Analysis); N-gram models (NLP); ordinary least-squares parameter estimator for linear regression
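
All three examples on this slide are data parallel because each can be computed from partial results that are produced independently per data partition and then merged. The sketch below imitates this in a single process: `partition` stands in for the distribution of data across nodes, and all function names are hypothetical. For brevity the least-squares part handles only one feature; the general case merges the sums behind X^T X and X^T y in the same way.

```python
from collections import Counter
from functools import reduce


def partition(items, parts):
    """Split a list into `parts` chunks that stand in for cluster nodes."""
    return [items[i::parts] for i in range(parts)]


# 1) Empirical mean along each dimension: merge (sums, count) pairs.
def partial_sums(rows, dims):
    sums = [0.0] * dims
    for row in rows:
        for j, value in enumerate(row):
            sums[j] += value
    return sums, len(rows)


def merge_sums(a, b):
    return [x + y for x, y in zip(a[0], b[0])], a[1] + b[1]


def parallel_mean(rows, dims, parts=4):
    partials = [partial_sums(chunk, dims) for chunk in partition(rows, parts)]
    sums, count = reduce(merge_sums, partials)
    return [s / count for s in sums]


# 2) N-gram counts: merge per-partition counters.
def partial_ngrams(sentences, n):
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


def parallel_ngrams(sentences, n=2, parts=4):
    partials = (partial_ngrams(chunk, n) for chunk in partition(sentences, parts))
    return reduce(lambda a, b: a + b, partials, Counter())


# 3) Ordinary least squares for y = intercept + slope * x:
#    merge the sufficient statistics (n, sum x, sum y, sum x^2, sum x*y).
def partial_ols_stats(points):
    n = sx = sy = sxx = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return n, sx, sy, sxx, sxy


def parallel_ols(points, parts=4):
    stats = [partial_ols_stats(chunk) for chunk in partition(points, parts)]
    n, sx, sy, sxx, sxy = (sum(column) for column in zip(*stats))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


print(parallel_mean([[1, 2], [3, 4], [5, 6], [7, 8]], dims=2))
print(parallel_ngrams(["big data is big", "big data wins"]).most_common(1))
print(parallel_ols([(x, 1 + 2 * x) for x in range(10)]))
```

In each case only small summaries travel between partitions, never the raw data, which is what makes these computations a natural fit for a framework such as Hadoop.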
    • Slide 20: BUT: Do I want to care about algorithm parallelization?
    • Slide 21: High-Level Languages. Source: Hadoopsphere.com
    • Slide 22: High-Level Languages (IBM SystemML). Extensible library: Classification (Linear SVMs, Logistic Reg), Clustering (K-means), Regression (Linear Regression), Matrix Factorizations (SGD solver, NMF), Ranking (PageRank, HITS). [Diagram labels: DML Scripts, Parser, High-Level Ops, Low-Level Ops, Runtime Ops, Optimizations, Hadoop.] Open source variant: Apache Mahout (fewer algorithms, no optimizer)
    • Slide 23: High-Level Languages (RHadoop). Source: http://www.revolutionanalytics.com
    • Slide 24: High-Level Languages (R on IBM PureData). Source: http://www.revolutionanalytics.com
    • Slide 25: Push Back. [Diagram labels: Application, Algorithm, Compile Engine, Execution, Language, Engine]
    • Slide 26: Push Back
    • Slide 27: Push Back
    • Slide 37: Linear Discriminant Analysis. Source: coursera.org
    • Slide 39: Outlook. Theory: With BigData the machines are thinking for us. Reality: Existing algorithms are now beginning to be applied on a large scale. Present: Every company thinks it urgently has to participate in BigData but doesn't know how. Future: Every company will have access to BigData technologies and will use them. Hype: The whole world is doing BigData. Vision: BigData analytics is usable by everybody, at their fingertips.
    • Slide 40: Questions?
    • Slide 41: Links. www.ibm.com/developerworks; www.ibm.com/ibm/university/academic; romeo.kienzler@ch.ibm.com; rkie@ch.ibm.com; U6K8qm_HFas; Jqq66INlQ0U