Solving Challenges with 'Huge Data'
Solutions & client cases
Dr. Axel Koester - axel.koester@de.ibm.com
Chief Technologist EMEA Storage Competence Center
Chairman of TEC think tank D/A/CH
2
Three ways IT uses data … today
Procedural (if…then)
Image: Business over Broadway
Statistical (big data)
Machine learning
Image: opendatascience.com
3
… and in 10 years
Procedural (if…then)
Image: Business over Broadway
Statistical (big data)
Machine learning
Image: opendatascience.com
4
Current examples
Image: Business over Broadway Image: opendatascience.com
Procedural (manual modelling): business as usual, classic / legacy IT
Statistical (accumulation of examples): shopping, profiling, fraud detection …
Machine learning (automatic modelling): autonomous driving, image classification, chatbots, gaming …
5
[Image: inspected samples, one labelled OK and eight labelled defect]
Example of trained (rather than programmed) quality inspection
6
Train-on-the-job by reviewing low-confidence cases
MUCH CHEAPER THAN RE-CODING AT EVERY PROD CHANGE
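The review loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the inspection system shown in the talk: the `Prediction` record, the 0.9 threshold, and the part IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str         # "OK" or "defect", as on the inspection slide
    confidence: float  # model confidence in `label`, between 0 and 1


def split_for_review(predictions, threshold=0.9):
    """Separate confident predictions from those a human should check.

    The threshold value is an assumption chosen for illustration.
    """
    accepted, review_queue = [], []
    for p in predictions:
        (accepted if p.confidence >= threshold else review_queue).append(p)
    return accepted, review_queue


def add_reviewed_examples(training_set, review_queue, human_labels):
    """Append the human-verified label of each reviewed item to the training set."""
    for p in review_queue:
        if p.item_id in human_labels:
            training_set.append((p.item_id, human_labels[p.item_id]))
    return training_set


# Hypothetical model output for three inspected parts.
predictions = [
    Prediction("part-001", "OK", 0.99),
    Prediction("part-002", "defect", 0.97),
    Prediction("part-003", "OK", 0.62),
]
accepted, review_queue = split_for_review(predictions)

# A reviewer corrects the one low-confidence case; it becomes new training data.
training_set = add_reviewed_examples([], review_queue, {"part-003": "defect"})
print([p.item_id for p in review_queue])
print(training_set)
```

Only the uncertain cases reach a person, and each correction feeds the next training round. That is the sense in which the model is trained on the job rather than re-coded.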
7
How is data stored?
Procedural: Archive test cases for auditing
Statistical: Parallel processing of many stored samples
Machine Learning: Train on sample data, then archive or trade data
Image: Business over Broadway Image: opendatascience.com
[Diagram labels: if…then…else, GB/s, parallel search; panels numbered 1 to 3]
10
Imperatives for data storage:
implement workflows
avoid "data tourism"
scale without effort
11
DESY: Example of a solved "data tourism" problem
12
DESY data: Synchrotron X-ray imaging
13
Data tourism
Lambda: 60 Gb/s @ 2000 Hz
Eiger: 30 Gb/s @ 2000 Hz
2000 files/s/cam
Web portal access
IBM Spectrum Scale + Workflow rules
3D reconstruction, research calculus
2000 files/s/cam
ØMQ
cluster lifecycle
cluster
14
[Next-gen storage] Prototype wrote 50k files/s in one folder*
-- started at 02/28/2017 12:13:13 --
mdtest-1.9.3 was launched with 14 total task(s) on 14 node(s)
Command line used: /ghome/oehmes/mpi/bin/mdtest-pcmpi9131-existingdir -d /gpfs/fs2-1m-me1/shared/mdtest-ec -i 1 -n 35000 -F -w 0 -Z -p 8
Path: /gpfs/fs2-1m-me1/shared
FS: 17.1 TiB Used FS: 0.1% Inodes: 476.8 Mi Used Inodes: 0.1%
14 tasks, 490000 files
SUMMARY: (of 1 iterations)
Operation                 Max             Min            Mean   Std Dev
---------                 ---             ---            ----   -------
File creation :     50032.690       50032.690       50032.690     0.000
File stat     :   3937604.341     3937604.341     3937604.341     0.000
File read     :    941193.073      941193.073      941193.073     0.000
File removal  :    143095.519      143095.519      143095.519     0.000
Tree creation :     77672.296       77672.296       77672.296     0.000
Tree removal  :         0.239           0.239           0.239     0.000
-- finished at 02/28/2017 12:13:39 --
(*) In independent folders, the test cluster could write 2.6 million 32k files/s
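The figures in the mdtest output can be cross-checked with a little arithmetic. The sketch below uses only values printed in the log; the variable names are illustrative.

```python
# Values taken from the mdtest output above.
tasks = 14                  # "14 total task(s) on 14 node(s)"
files_per_task = 35_000     # "-n 35000" on the command line
creation_rate = 50_032.690  # "File creation" row, operations per second

total_files = tasks * files_per_task
creation_seconds = total_files / creation_rate

print(total_files)                 # 490000, matching "14 tasks, 490000 files"
print(round(creation_seconds, 1))  # 9.8
```

Creating all 490000 files in one folder therefore took just under ten seconds of the 26 seconds between the start and finish timestamps. The rest was spent on the stat, read, and removal phases.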
15
More Workflow Examples
16
Workflow Automation: Preserving crime evidence data
Newly acquired evidence data:
- Automatic generation of an immutable copy before the investigation
- Life cycle management adjusted to investigation requirements
- Life cycle management of the immutable copy fully automated (according to law)
Workflow included + Immutability included
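The "immutable copy before the investigation" step can be illustrated at a very small scale. This sketch is an assumption-laden stand-in: it uses ordinary file permissions and a SHA-256 digest, whereas the slide refers to immutability enforced by the storage system itself. The function names and the vault directory are hypothetical.

```python
import hashlib
import os
import shutil
import stat
import tempfile


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def preserve_evidence(source_path, vault_dir):
    """Copy a file into vault_dir, mark the copy read-only, return (path, digest).

    Read-only permissions are only a stand-in here; they are not real immutability.
    """
    os.makedirs(vault_dir, exist_ok=True)
    copy_path = os.path.join(vault_dir, os.path.basename(source_path))
    shutil.copy2(source_path, copy_path)
    os.chmod(copy_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return copy_path, sha256_of(copy_path)


def is_unchanged(copy_path, digest):
    """Check that the preserved copy still matches the digest recorded at intake."""
    return sha256_of(copy_path) == digest


# Demonstration with a throwaway file in a temporary directory.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "evidence.bin")
with open(source, "wb") as f:
    f.write(b"example evidence payload")

copy_path, digest = preserve_evidence(source, os.path.join(workdir, "vault"))
print(is_unchanged(copy_path, digest))
```

Investigators would then work on the original or on further copies, while the preserved copy and its recorded digest serve as the reference for later audits.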
17
Workflow Automation: Handling connected documents
IBM AREMA (Archive and Essence Manager)
Heavily used in broadcasting, but also for:
- CCTV (highlighting, automatic archiving & deletion)
- Medical tomography scans
- Fingerprint processing (association, feature extraction, distribution)
- Legal rich media document processing
[Logos: used by … and many more]
18
The mother of all data projects
Square Kilometre Array (SKA)
19
Radio Interferometry data capture: Square Kilometre Array (SKA)
…will be the world’s largest radio telescope
- 900 stations
- 300 antennas / station
- construction planned to begin in 2018
Substantial technological challenges
- 160 terabytes of raw data collected per second
- 1 petabyte of data stored per day
- 1000 petaflops of processing power
IBM's R&D involvement since 2012
- Research collaboration with ASTRON (Netherlands Institute for Radio Astronomy)
- Storage aspects
- ExaPlan: planning tool for multi-tiered exascale storage
- Tape library modeling and simulation
- Predictive caching
Image: Artist’s rendering of the SKA
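The gap between collected and stored data on this slide is easiest to see as arithmetic. The sketch below uses only the slide's figures and assumes decimal units (1 PB = 1000 TB).

```python
# Figures from the slide; decimal units are an assumption.
stations = 900
antennas_per_station = 300
raw_tb_per_second = 160
stored_pb_per_day = 1

seconds_per_day = 24 * 60 * 60
tb_per_pb = 1000

total_antennas = stations * antennas_per_station
raw_pb_per_day = raw_tb_per_second * seconds_per_day / tb_per_pb
reduction_factor = raw_pb_per_day / stored_pb_per_day

print(total_antennas)    # 270000
print(raw_pb_per_day)    # 13824.0
print(reduction_factor)  # 13824.0
```

On these numbers, roughly one part in 14,000 of the raw stream ends up stored, so nearly all data reduction has to happen in flight.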
20
For everyone else:
Build your private cloud foundation
21
S3-compatible Private Cloud as "everybody's offload storage"
driven by public cloud pricing, reducing cost by enhancing storage footprint efficiency
Organization-wide S3-compatible repository
IBM Cloud Object Storage
x86 image (contains OS)
Offload snapshots
Offload stale volumes
IBM Spectrum Virtualize IBM Spectrum Scale
Multi-vendor
block storage
IBM file clusters (NAS)
SMB/CIFS NFS
POSIX HDFS
Disk Tape Flash
Offload old files
Offload snapshots
Cloud backup
IBM Spectrum Protect
IBM backup
Cloud backup
Cloud-2-Cloud migration
Systems
VMs
Users
Archive
SEC-legal retention mode + deletion hold per object
available as appliances
22
All-or-Nothing-Transform (AONT) for safety, reliability and security
5 nines write availability, 6 nines read availability,
15+ nines reliability against data loss (3 sites)
IBM Cloud Object Storage
x86 image (contains OS)
Geographical Information Dispersal Algorithm
E.g. "encode data in 12 slices, needs 7 slices for decoding"
[Diagram labels: JBOD, undecipherable]
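The "12 slices, 7 needed" example implies a storage overhead and a fault tolerance that can be worked out directly. In the sketch below, the per-slice availability of 0.99 and the assumption that slices fail independently are hypothetical. They are not IBM's figures and are not meant to reproduce the "nines" quoted on the slide.

```python
from math import comb


def dispersal_stats(total_slices, needed_slices, slice_availability):
    """Return (expansion factor, tolerated slice losses, read availability).

    Read availability is the probability that at least `needed_slices` of the
    `total_slices` are reachable, assuming slices fail independently.
    """
    expansion = total_slices / needed_slices
    tolerated_losses = total_slices - needed_slices
    readable = sum(
        comb(total_slices, k)
        * slice_availability ** k
        * (1 - slice_availability) ** (total_slices - k)
        for k in range(needed_slices, total_slices + 1)
    )
    return expansion, tolerated_losses, readable


# 12 and 7 come from the slide; 0.99 is an illustrative assumption.
expansion, tolerated, readable = dispersal_stats(12, 7, 0.99)
print(round(expansion, 2))  # 1.71 units of raw capacity per unit of data
print(tolerated)            # 5 slices may be unavailable at once
print(readable)
```

For comparison, keeping three full copies costs 3.0 units of raw capacity per unit of data and survives the loss of two copies.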
23
How Sky avoids bottlenecks, service outages and hacking
- Object access is lightweight & secure, resulting in low CPU footprint & cost
browser obtains object ID (movie)
24
Bonus
Artificial Intelligence Research for
Storage Management
25
AI learns to predict ideal storage based on meta-information
G. Cherubini, J. Jelitto, V. Venkatesan, “Cognitive Storage for Big Data”, Computer, April 2016
26
Data Life Cycle Prediction based on experience
Life cycles of different data types
Prediction Quality
10% Training: 95% Success
worst case (data class with low predictability)
27
Data Prioritization Prediction after Blackout recovery
- Recovery relevance (Synchronous? Consistent? Expendable?)
Prediction Quality
[Chart labels: important transactions, no loss tolerated; Temp Data; axes t and R]
ibm.biz/AxelKoester
29
Quantum Computer:
Nobody needs one at home
Ken Olsen, founder of Digital Equipment Corporation, 1977
30
31
IBM Quantum Computing Scientists Hanhee Paik (left) and Sarah Sheldon (right)
32
33
34
37
38
January 2018: 50 qubits
39
Quantum Computer:
Nobody needs one at home
Search for "IBM Quantum Experience"
https://quantumexperience.ng.bluemix.net/qx
ibm.biz/AxelKoester
