This session covers IBM's various storage solutions for Artificial Intelligence and Big Data Analytics workloads. Presented at IBM TechU in Johannesburg, South Africa, in September 2019.
In this presentation, we:
1. Look at the challenges and opportunities of the data era
2. Look at key challenges of legacy data warehouses, such as data diversity, complexity, cost, scalability, performance, management, ...
3. Look at how modern cloud data warehouses not only overcome most of these challenges but also bring additional technical innovations and capabilities, such as pay-as-you-go cloud services, decoupling of storage and compute, scaling up or down, effortless management, native support for semi-structured data, ...
4. Show how the capabilities of modern cloud data warehouses help businesses, new or existing, through the phases of their lifecycle: launch, growth, maturity, and renewal/decline.
5. Share a near-real-time data warehousing use case built on Snowflake and give a live demo showcasing ease of use, fast provisioning, continuous data ingestion, support for JSON data, ...
Amazon Elastic Fabric Adapter: Anatomy, Capabilities, and the Road Ahead (inside-BigData.com)
In this deck from the 2019 OpenFabrics Workshop in Austin, Raghu Raja from Amazon presents: Amazon Elastic Fabric Adapter: Anatomy, Capabilities, and the Road Ahead.
Elastic Fabric Adapter (EFA) is the recently announced HPC networking offering from Amazon for EC2 instances. It allows applications such as MPI to communicate using the Scalable Reliable Datagram (SRD) protocol, which provides connectionless and unordered messaging services directly in userspace, bypassing both the operating system kernel and the virtual machine hypervisor. This talk presents the design, capabilities, and an early performance characterization of the userspace and kernel components of the EFA software stack. This includes the open-source EFA libfabric provider, the generic RDM-over-RDM (RxR) utility provider that extends the capabilities of EFA, and the device driver itself. The talk also discusses some of Amazon's recent contributions to the libfabric core and future plans.
Watch the video: https://wp.me/p3RLHQ-k2I
Learn more: https://www.openfabrics.org/2019-workshop-agenda-and-abstracts/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
High Performance Object Storage in 30 Minutes with Supermicro and MinIO (Rebekah Rodriguez)
The Supermicro Cloud DC combines performance, reliability, craftsmanship, and flexibility for deploying MinIO object storage. MinIO on the Cloud DC platform outperforms equivalently sized hardware from other manufacturers and is more cost-effective. We recently benchmarked a cluster of four Cloud DC servers with NVMe drives and measured an impressive 42.57 GB/s average read (GET) throughput and 24.69 GB/s average write (PUT) throughput. This first-class performance demonstrates that MinIO on Supermicro Cloud DC is a compelling solution for object-storage-intensive workloads such as advanced analytics, AI/ML, and other modern, cloud-native applications.
In this webinar, you will learn:
Best use cases and deployment considerations for MinIO object storage
How to design and size a MinIO object storage cluster on Supermicro Cloud DC
How to deploy a distributed MinIO cluster onto a Cloud DC server cluster
Watch the Webinar: https://www.brighttalk.com/webcast/17278/519401
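To make the access model concrete, here is a minimal client-side sketch using the MinIO Python SDK; the endpoint, credentials, bucket, and object names are placeholders, not values from the webinar:

# Minimal MinIO client sketch (pip install minio); endpoint and credentials are hypothetical.
from minio import Minio

client = Minio(
    "minio.example.com:9000",   # load-balanced endpoint of the distributed cluster (placeholder)
    access_key="ACCESS_KEY",
    secret_key="SECRET_KEY",
    secure=True,
)

bucket = "analytics-data"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

# Upload (PUT) and download (GET) an object
client.fput_object(bucket, "raw/events-2024-01.parquet", "events-2024-01.parquet")
client.fget_object(bucket, "raw/events-2024-01.parquet", "events-copy.parquet")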
Data Mesh in Practice - How Europe's Leading Online Platform for Fashion Goes... (Dr. Arif Wider)
A talk presented by Max Schultze from Zalando and Arif Wider from ThoughtWorks at NDC Oslo 2020.
Abstract:
The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
At Zalando, Europe's biggest online fashion retailer, we realised that accessibility and availability at scale can only be guaranteed when moving more responsibilities to those who pick up the data and have the respective domain knowledge, the data owners, while keeping only data governance and metadata information central. Such a decentralized, domain-focused approach has recently been coined a Data Mesh.
The Data Mesh paradigm promotes the concept of Data Products which go beyond sharing of files and towards guarantees of quality and acknowledgement of data ownership.
This talk will take you on a journey of how we went from a centralized Data Lake to a distributed Data Mesh architecture, and will outline the ongoing efforts to make the creation of data products as simple as applying a template.
IBM DS8880 and IBM Z - Integrated by Design (Stefan Lein)
This presentation shows the strengths of the IBM DS8880 enterprise storage platform, with special emphasis on its IBM Z integration capabilities. December 2017
The data lake has become extremely popular, but there is still confusion on how it should be used. In this presentation I will cover common big data architectures that use the data lake, the characteristics and benefits of a data lake, and how it works in conjunction with a relational data warehouse. Then I’ll go into details on using Azure Data Lake Store Gen2 as your data lake, and various typical use cases of the data lake. As a bonus I’ll talk about how to organize a data lake and discuss the various products that can be used in a modern data warehouse.
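As a rough illustration of how such a lake might be organized, here is a sketch using the azure-storage-file-datalake Python SDK; the account, credential, and the raw/cleansed/curated zone names are common conventions chosen for illustration, not the presenter's prescribed layout:

# Sketch of a zoned layout in Azure Data Lake Storage Gen2 (pip install azure-storage-file-datalake).
# Account URL, credential, and directory names are illustrative placeholders.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",  # hypothetical storage account
    credential="ACCOUNT_KEY",
)
filesystem = service.get_file_system_client(file_system="datalake")

# One common convention: raw ingested data, cleansed/conformed data, curated data for serving
for zone in ["raw", "cleansed", "curated"]:
    filesystem.create_directory(f"{zone}/sales/2024")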
When combined with DOCSIS 3.0, IPDR creates a powerful tool for cable service providers. It is the most effective way to observe and manage networks, subscribers, and traffic in an application-agnostic manner. Providers can apply enhanced visibility to address new use cases in capacity management, service assurance, and subscriber usage control. Further, IPDR enables broadband business intelligence, allowing new metrics and insights into business performance and overall subscriber experience.
Presented in this whitepaper is an overview of the enhanced DOCSIS 3.0 management capabilities introduced by IPDR. This includes an overview of IPDR's advanced Service Definitions and protocol modes along with a description of new use-cases in service and network management.
An investigation of how service providers can leverage Pipeline's unique capabilities to fully benefit from the rich intelligence data embedded in their DOCSIS 3.0 CMTS devices is also included.
Data Lakehouse, Data Mesh, and Data Fabric (r1) (James Serra)
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
A Comparison of EDB Postgres to Self-Supported PostgreSQL (EDB)
Like all database management systems, PostgreSQL requires additional enterprise tools and capabilities to ensure high availability at scale. These include additional tools for backup, disaster recovery, replication, monitoring and data migration.
To meet these needs, EnterpriseDB created the EDB Postgres Platform.
This document explains the key differences between PostgreSQL using the EDB Postgres Platform compared to self-supported PostgreSQL alone.
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap with one another. In this talk I will cover the use cases of many of the Microsoft products you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use which products and the pros/cons of each.
Dell Technologies is a unique family of companies that provides organizations with the infrastructure they need to build their digital future, drive IT transformation, and protect their most important asset: information.
Specifically for the higher-education sector, Dell EMC has designed a catalogue of solutions in areas such as:
Converged infrastructure
Data storage and protection
Digital teaching services
In this webinar series we will present the most advanced Dell EMC solutions, currently under evaluation by the Fondazione CRUI for a possible framework agreement.
Gartner named customer data platforms (CDPs) one of the key technologies that will demand marketers’ attention in 2018. Michael Katz, Cofounder and CEO of mParticle, explains why CDPs are not just another acronym and how consumer brands ranging from Airbnb to NBCUniversal to Zappos are using them to optimize omnichannel customer experiences and marketing outcomes, in all the moments that matter.
Originally presented at AdExchanger Industry Preview 2018 by Michael Katz, Cofounder and CEO, mParticle.
Data Lakehouse, Data Mesh, and Data Fabric (r2) (James Serra)
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
Master the Multi-Clustered Data Warehouse - Snowflake (Matillion)
Snowflake is one of the most powerful, efficient data warehouses on the market today—and we joined forces with the Snowflake team to show you how it works!
In this webinar:
- Learn how to optimize Snowflake
- Hear insider tips and tricks on how to improve performance
- Get expert insights from Craig Collier, Technical Architect from Snowflake, and Kalyan Arangam, Solution Architect from Matillion
- Find out how leading brands like Converse, Duo Security, and Pets at Home use Snowflake and Matillion ETL to make data-driven decisions
- Discover how Matillion ETL and Snowflake work together to modernize your data world
- Learn how to utilize the impressive scalability of Snowflake and Matillion
Wonder what this data mesh stuff is all about? What are the principles of data mesh? Can you, or should you, consider data mesh as the approach for your analytics platform? And most importantly: how can Snowflake help?
Given in Montreal on 14-Dec-2021
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
Organizations are struggling to make sense of their data within antiquated data platforms. Snowflake, the data warehouse built for the cloud, can help.
NoSQL databases get a lot of press coverage, but there seems to be a lot of confusion surrounding them, as in which situations they work better than a Relational Database, and how to choose one over another. This talk will give an overview of the NoSQL landscape and a classification for the different architectural categories, clarifying the base concepts and the terminology, and will provide a comparison of the features, the strengths and the drawbacks of the most popular projects (CouchDB, MongoDB, Riak, Redis, Membase, Neo4j, Cassandra, HBase, Hypertable).
Object Storage 1: The Fundamentals of Objects and Object Storage (Hitachi Vantara)
In part 1 of 3, objects and object storage are defined, their key attributes are identified, and the most common use cases for object storage are described. Join Jeff Lundberg, senior product marketing manager at Hitachi Data Systems, to learn the fundamentals of object storage and get answers to your questions. View this WebTech to learn what makes an object, the difference between block, file, and object storage, and the key attributes and uses of object store solutions. For more information on object storage, please view our white paper: http://www.hds.com/assets/pdf/hitachi-white-paper-introduction-to-object-storage-and-hcp.pdf
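To make the definition concrete, here is a toy sketch (not Hitachi-specific code) of what an object bundles together and how a flat, ID-addressed namespace differs from block addresses or file paths:

# Toy model: an object couples data with descriptive metadata and is addressed by a
# unique ID in a flat namespace, not by a block number or a directory path. Illustrative only.
import uuid
from dataclasses import dataclass, field

@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)   # e.g. content type, retention policy, custom tags
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))

store = {}                                         # flat namespace keyed by object ID, no hierarchy

obj = StoredObject(b"sensor readings...", {"content-type": "text/csv", "retention-days": "365"})
store[obj.object_id] = obj                         # later retrieval is by ID, not by path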
PostgreSQL + Kafka: The Delight of Change Data Capture (Jeff Klukas)
PostgreSQL is an open source relational database. Kafka is an open source log-based messaging system. Because both systems are powerful and flexible, they’re devouring whole categories of infrastructure. And they’re even better together.
In this talk, you’ll learn about commit logs and how that fundamental data structure underlies both PostgreSQL and Kafka. We’ll use that basis to understand what Kafka is, what advantages it has over traditional messaging systems, and why it’s perfect for modeling database tables as streams. From there, we’ll introduce the concept of change data capture (CDC) and run a live demo of Bottled Water, an open source CDC pipeline, watching INSERT, UPDATE, and DELETE operations in PostgreSQL stream into Kafka. We’ll wrap up with a discussion of use cases for this pipeline: messaging between systems with transactional guarantees, transmitting database changes to a data warehouse, and stream processing.
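For a sense of the consuming side of such a pipeline, here is a minimal sketch using the kafka-python client; the topic name, brokers, and JSON message shape are assumptions for illustration (Bottled Water itself publishes Avro-encoded records per table):

# Minimal consumer sketch for change events arriving in Kafka (pip install kafka-python).
# Topic, brokers, and message format are assumed for illustration.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "users",                                   # hypothetical per-table change topic
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    if message.value is None:                  # a null value (tombstone) can represent a DELETE
        print("DELETE", message.key)
    else:                                      # otherwise the row state after an INSERT or UPDATE
        print("UPSERT", message.value)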
Deep Dive: Spark Data Frames, SQL and Catalyst Optimizer (Sachin Aggarwal)
RDD recap
Spark SQL library
Architecture of Spark SQL
Comparison with Pig and Hive Pipeline
DataFrames
Definition of a DataFrames API
DataFrames Operations
DataFrames features
Data cleansing
Diagram for logical plan container
Plan Optimization & Execution
Catalyst Analyzer
Catalyst Optimizer
Generating Physical Plan
Code Generation
Extensions
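A small PySpark sketch tying several of these topics together: DataFrame operations, a simple data-cleansing step, and inspecting the logical and physical plans that Catalyst produces. The data and column names are made up for illustration:

# Minimal PySpark sketch (pip install pyspark); data and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", None), ("carol", 29)],
    ["name", "age"],
)

cleaned = (
    df.dropna(subset=["age"])        # data cleansing: drop rows with missing age
      .filter(F.col("age") > 30)     # a predicate Catalyst can optimize
      .select("name")
)

# Show the parsed, analyzed, and optimized logical plans plus the generated physical plan
cleaned.explain(extended=True)

spark.stop()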
As cloud computing continues to gather speed, organizations with years’ worth of data stored on legacy on-premises technologies are facing issues with scale, speed, and complexity. Your customers and business partners are likely eager to get data from you, especially if you can make the process easy and secure.
Challenges with performance are not uncommon, and ongoing interventions are required just to “keep the lights on”.
Discover how Snowflake empowers you to meet your analytics needs by unlocking the potential of your data.
Webinar agenda:
- Understand Snowflake and its architecture
- Quickly load data into Snowflake
- Leverage the latest in Snowflake’s unlimited performance and scale to make the data ready for analytics
- Deliver secure and governed access to all data – no more silos
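For a concrete feel of the "quickly load data" step, here is a minimal sketch with the snowflake-connector-python package; the account identifier, credentials, warehouse, stage, and table names are placeholders, not values from the webinar:

# Minimal Snowflake load-and-query sketch (pip install snowflake-connector-python).
# All identifiers and credentials below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345",              # hypothetical account identifier
    user="ANALYTICS_USER",
    password="********",
    warehouse="LOAD_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Bulk-load staged JSON files into a (hypothetical) table, then query it
cur.execute("COPY INTO raw_events FROM @events_stage FILE_FORMAT = (TYPE = 'JSON')")
cur.execute("SELECT COUNT(*) FROM raw_events")
print(cur.fetchone())

cur.close()
conn.close()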
2019 Top IT Trends - Understanding the fundamentals of the next generation ... (Tony Pearson)
This session covers six major IT trends for 2019: Internet of Things (IoT), Big Data Analytics, Artificial Intelligence (AI), Containers and Orchestration, Blockchain, and Hybrid Multicloud. Presented at IBM TechU in Johannesburg, South Africa September 2019
Data analytics, Spark, Hadoop and AI have become fundamental tools to drive digital transformation. A critical challenge is moving from isolated experiments to an organizational or enterprise production infrastructure. In this talk, we break apart the modern data analytics workflow to focus on the data challenges across different phases of the analytics and AI life cycle. By taking a unified approach to data storage for AI and analytics, organizations can reduce costs, modernize their data strategy, and build a sustainable enterprise data lake. By anticipating how Hadoop, Spark, TensorFlow, Caffe, and traditional analytics such as SAS and HPC can share data, IT departments and data science practitioners can not only co-exist but also speed time to insight. We will present the tangible benefits of a reference architecture using real-world installations that span proprietary and open-source frameworks. Using intelligent software-defined shared storage, users are able to eliminate silos, reduce multiple data copies, and improve time to insight. PALLAVI GALGALI, Offering Manager, IBM, and DOUGLAS O'FLAHERTY, Portfolio Product Manager, IBM
Clarisse Hedglin from IBM presented this as part of a three-day international summit. She shared the scenarios AI can solve for today using IBM AI infrastructure.
If you're like most of the world, you're in an aggressive race to implement machine learning applications and on a path to get to deep learning. If you can give better service at a lower cost, you will be the winner in 2030. But infrastructure is a key challenge to getting there. What does the technology infrastructure look like over the next decade as you move from petabytes to exabytes? How are you budgeting for colossal data growth over the next decade? How do your data scientists share data today, and will it scale for 5-10 years? Do you have the appropriate security, governance, backup and archiving processes in place? This session will address these issues and discuss strategies for customers as they ramp up their AI journey with a long-term view.
IBM Storage at the Incisive Media, IT Leaders Forum with Computing.co.uk (Matt Fordham)
Presentation I gave at the IT Leaders Forum, covering Cognitive, Hybrid Cloud and Storage as the foundation for data solutions. http://www.computing.co.uk/ctg/news/3007404/storage-still-waiting-for-its-apple-moment
IBM Cloud Object Storage: How it works and typical use cases (Tony Pearson)
This session covers the general concepts of object storage and in particular the IBM Cloud Object Storage offerings. Presented at IBM TechU in Johannesburg, South Africa September 2019
Open source Apache Hadoop is a great framework for distributed processing of large data sets. But there’s a difference between “playing” with big data versus solving real problems. The reality is that Hadoop alone is not enough. In fact, almost every organization that plans to use Hadoop for production use quickly discovers that it lacks the required features for enterprise use. And, fewer still have the Hadoop specialists on hand to navigate through the complexity to build reliable, robust applications. As a result, many Hadoop projects never make it to production as executives say, “we just don’t have the skills.” In this session, we will discuss these enterprise capabilities and why they’re important: analytics, visualization, security, enterprise integration, developer/admin tools, and more. Additionally, we will share several real-world client examples who have found it necessary to use an enterprise-grade Hadoop platform to tackle some of the most interesting and challenging business problems.
Deploying Massive Scale Graphs for Realtime Insights (Neo4j)
Graph databases have been at the forefront of helping organizations manage and generate insights from data relationships, and applying those insights in real-time to drive competitive advantage. As organizations gain value in deploying graph databases, the data volumes managed are growing exponentially pushing the limits of large-scale in-memory graph processing. Neo4j and IBM Power Systems combined forces to deliver a market leading scalable graph database platform capable of affordably storing and processing graphs of extremely large size and offering real-time insights, using flash and FPGA accelerators. In this session we will cover the use cases driving the need for this extremely scalable platform and how this platform offers an easy to deploy model for extreme scale graph databases.
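As a small, generic illustration of the kind of relationship query such deployments serve (not code from the session), here is a sketch using the Neo4j Python driver; the URI, credentials, labels, and Cypher query are hypothetical:

# Minimal Neo4j driver sketch (pip install neo4j); connection details and the query are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def shared_accounts(tx, person_name):
    # Example relationship traversal of the kind used in real-time fraud detection
    result = tx.run(
        "MATCH (p:Person {name: $name})-[:OWNS]->(:Account)<-[:OWNS]-(other:Person) "
        "RETURN other.name AS suspect",
        name=person_name,
    )
    return [record["suspect"] for record in result]

with driver.session() as session:
    print(session.execute_read(shared_accounts, "Alice"))

driver.close()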
Frank kramer ibm-data_management-for-adas-scale-usergroup-sin-032018 (Snowy Chen)
IBM is uniquely positioned to address today's challenges in the automotive industry for development and testing, bringing together technology, assets, and know-how from: data transmission, compression and encryption; systems and software engineering in the automotive industry; and cognitive and AI computing.
The Future of Data Warehousing, Data Science and Machine Learning (ModusOptimum)
Watch the on-demand recording here:
https://event.on24.com/wcc/r/1632072/803744C924E8BFD688BD117C6B4B949B
Evolution of Big Data and the Role of Analytics | Hybrid Data Management
IBM: Driving the future hybrid data warehouse with the IBM Integrated Analytics System.
Talk and presentation: Andreas Tsagkaris, VP & Chief Technology Officer, Performance Technologies
Presentation title: “Big Data on Linux on Power Systems”
Introduction to MariaDB. Covers the history of Structured Query Language (SQL), MySQL and MariaDB, shows how to install MariaDB on a Windows, Mac or Linux desktop, and walks through practical examples.
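In the same spirit as the presentation's practical examples, here is a minimal sketch using MariaDB Connector/Python; the host, credentials, and example table are placeholders of my own, not taken from the slides:

# Minimal MariaDB connection sketch (pip install mariadb); connection details and table are placeholders.
import mariadb

conn = mariadb.connect(
    host="localhost",
    port=3306,
    user="demo_user",
    password="demo_pass",
    database="demo_db",
)
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS books (id INT PRIMARY KEY AUTO_INCREMENT, title VARCHAR(200))")
cur.execute("INSERT INTO books (title) VALUES (?)", ("Getting Started with MariaDB",))
conn.commit()

cur.execute("SELECT id, title FROM books")
for book_id, title in cur:
    print(book_id, title)

conn.close()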
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
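To give a flavour of the Python binding mentioned above, here is a minimal pypowsybl sketch; it uses the bundled IEEE 14-bus test network rather than a real grid model, purely for illustration:

# Minimal pypowsybl sketch (pip install pypowsybl); uses a built-in example network for illustration.
import pypowsybl.network as pn
import pypowsybl.loadflow as lf

network = pn.create_ieee14()            # small IEEE 14-bus test network shipped with pypowsybl
results = lf.run_ac(network)            # run an AC power flow on it
print(results[0].status)                # convergence status of the main connected component

print(network.get_buses().head())       # bus data exposed as a pandas DataFrame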
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf (Peter Spielvogel)
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We also ran a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
1. IBM Storage for AI and Big Data
Tony Pearson
IBM Master Inventor, Senior IT Management Consultant, TechU Content Manager
2019 IBM Systems Technical University
10-12 Sep 2019 | Johannesburg, SA
38. My Social Media Presence
Blog*: ibm.co/Pearson
LinkedIn: https://www.linkedin.com/in/az990tony
Books: www.lulu.com/spotlight/990_tony
IBM Expert Network on Slideshare: www.slideshare.net/az990tony
Twitter: twitter.com/az990tony
Facebook: www.facebook.com/tony.pearson.16121
Instagram: www.instagram.com/az990tony/
Email: tpearson@us.ibm.com
* Not a typo. This is a short URL for https://www.ibm.com/developerworks/mydeveloperworks/blogs/InsideSystemStorage/