Powering Up
AI in Healthcare
Clarisse Taaffe-Hedglin
Executive IT Architect
IBM Garage
IBM Systems
clarisse@us.ibm.com
Agenda
Healthcare Use cases
The AI Ladder and Lifecycle
AI at Scale Themes
“AI is the
fastest-growing
workload”*
*Forrester Research Inc. “AI Deep Learning Workloads Demand a New Approach to Infrastructure”, by
Mike Gualtieri, Christopher Voce, Srividya Sridharan, Michele Goetz, Renee Taylor, May 4, 2018.
3 IBM IT Infrastructure / © 2021 IBM Corporation
Machine Learning Context
REINFORCEMENT
LEARNING
TRANSFER
LEARNING
“AI is the automation of automation” – Jensen Huang, GTC 2020
Analytics Modernization: From Data to Actions
Prescriptive
What should
we do ?
Descriptive
What Has
Happened?
Cognitive
Learn
Dynamically
Predictive
What Will
Happen?
ACTION
DATA
HUMAN INPUTS
delivering faster insights with greater efficiency to impact more lives
A framework for designing, deploying, growing and optimizing infrastructure for HPC, AI and Cloud, created in
collaboration with the world’s leading healthcare and life sciences institutions, and using Red Hat OpenShift, IBM
Power Systems, IBM Storage and open API endpoints.
From Data to Insight with an Optimal Reference Architecture
DATAHUB
High Performance Data Fabric & Catalog
Capable of Handling Exabytes of Data
and Trillions of Objects
ORCHESTRATION
High Performance Computing & AI
Platform Capable of Orchestrating
Thousands of Servers and GPUs
APPS & MODELS
Large-scale and high-throughput
workloads such as HPC, AI and Cloud
computing
MEDICAL TASKS
Genomics, molecular simulation,
structural analysis, diagnostics, data
fusion, manufacturing quality inspection.
Three broad categories of AI Use Cases
“Structured” Data Use Cases
- Big Data (Rows and Columns)
- Available AI Software, More Accuracy!
Computer Vision Use Cases
- This is sort of “Magic”
- A deep learning Model is trained to detect and classify objects
Natural Language Processing Use Cases
- A Model learns to read, hear and “understand” language
§ BIG, COMPLEX SYSTEMS
§ PERSONALIZATION
§ AUTOMATION
§ SIMULATING RELATIONSHIPS
§ VISUAL RECOGNITION
§ PATTERN DETECTION
§ CHATBOTS
§ DESIGN OF EXPERIMENTS
§ OPTIMIZATION
The scenarios AI can solve for today
Addressable Markets And Fields For AI
RETAIL
Recommendation
engines, Precision
marketing
AGRICULTURE
Crop yield, Plant
disease, Remote
sensing
LIFE SCIENCES
Sequence
Analysis,
Radiology
UTILITIES
Smart Meter analysis,
Capacity planning
$
FINANCIAL SERVICES
Risk analysis
Fraud detection
CUSTOMER SERVICE
Chatbots, Helpdesk,
Automated
Expenses
LAW & DEFENSE
Threat analysis -
social media
monitoring
RESEARCH
Physics Modeling
Simulation
optimization
TRANSPORTATION
Optimal traffic
flows, Route
planning
CONSUMER GOODS
Sentiment
analysis
HEALTH CARE
Patient sensors,
monitoring, EHRs
MEDIA/ENTERTAINMENT
Advertising
effectiveness
OIL & GAS
Exploration,
Sensor analysis
AUTOMOTIVE
ADAS,
Maintenance
MANUFACTURING
Line inspection,
Defect analysis
AI and Autonomous Machine Learning will help
revolutionize every single industry, making us
more productive and efficient and letting us do
things that are impossible today.
Smart loves problems, and there has never been a bigger
problem facing our world.
Biomolecular Structure
Molecular Simulation
Genomics Medical Diagnostics AI
Data Fusion and AI
Bio-Informatics
Artificial intelligence and high-performance computing have already begun to attack the
virus, assisting in molecular drug discovery, genomics and medical image processing.
Data
Overload
Oceans of data
arise from rapid
digitization and
instrumentation
of healthcare.
App Chaos
Thousands of
applications,
workflows and
models are not
all following the
same rules.
Adoption
Vertically
integrated
toolsets with
heavy
customization
and vendor lock-
in create work
silos.
Performance
When scaling up
or out, most
institutions
cannot diagnose
or analyze the
performance
problems they
face.
Cost
Demanding
workloads
require well-
orchestrated
infrastructure to
manage, monitor
and control
costs.
Five key challenges to progress remain despite advances
Data
Insight
HPC Analysis &
Simulation
AI Inference &
Automation
Sensors
The Convergence of HPC and AI
Optimizing Medical Imaging
Enhance image identification with deep learning
to assist physicians and benefit patients
A model trained on 1,300 MRI images with IBM Power
Systems and IBM Storage in just two hours,
compared with forty hours on traditional
architectures
97% Accuracy for Melanoma Detection for Dermoscopic Images
Melanoma vs. Atypical & Benign
Human*
Deep
Features
Ensembles CNN DRN
Doctor/
Expert
ImageNet + Sparse
Coding
+ Low-level + Auto-
Encoder
Deep
Learning
Deep
Residual
Learning
0.84 0.91 0.92 0.93 0.94 0.95 0.97
- 0.73 0.73 0.74 0.77 - -
* Estimated human expert performance
Use Case
Automatic skin lesion image analysis for
melanoma detection with Memorial Sloan
Kettering (MSK-CC)
Visual modeling techniques:
§ Deep Residual Networks
§ Conv. Neural Networks
§ Ensemble Models
Top Performance
= 97% Accuracy!!!
Melanoma vs. Atypical
Best
Think 2020 / DOC ID / Month XX, 2020 / © 2020 IBM Corporation
Advances in instrument
design, sample preprocessing
and mathematical methods
have enabled high volume
throughput imaging at atomic
scale.
Cryogenic electron
microscopes generate an
average of 5 TB of image data
per day
BIOMOLECULAR STRUCTURE
Massive Data Sets Require Massive Processing Capability
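The 5 TB per day figure can be turned into a sustained ingest rate with a little arithmetic. The sketch below is illustrative only: it assumes decimal terabytes and a microscope that produces data evenly over 24 hours, neither of which the slide states.

```python
TB = 10**12                      # assumption: decimal terabyte, in bytes
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_per_s(tb_per_day: float) -> float:
    """Average ingest rate in MB/s needed to keep up with a daily data volume."""
    return tb_per_day * TB / SECONDS_PER_DAY / 10**6


rate = sustained_rate_mb_per_s(5)
print(f"{rate:.1f} MB/s sustained per microscope")
```

Real acquisition is bursty, so peak rates would be higher than this average; the number is a floor for storage sizing, not a target.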
Accelerating Cryo-EM Imaging Analysis
Reduced time-to-completion for high resolution image
analysis jobs while increasing resource utilization
On the Satori cluster of IBM AC922 systems, more than
100 high-resolution cryo-EM image analysis jobs
run in parallel
BIOMOLECULAR STRUCTURE
Simulation of millions of atoms requiring large computational
resources
Large scale simulation includes millions of
atoms
• Virus molecules
• Ribosomes
• Bioenergy system and complex
Solution
• High performance computing CPU and
GPUs accelerating performance
• Optimal memory and network bandwidths
scaling performance to hundreds of nodes
• Techniques to reduce number of simulations
Receptor
ligand
Virus molecule simulation Receptor-ligand fit
Cryptic binding site prediction Binding energy prediction
MOLECULAR SIMULATION
Molecular Dynamics Simulation Computational Intensity
A) Using NAMD to simulate influenza
virus (left) and Covid-19 (right)
B) Drug discovery:
protein receptor
C) In silico prediction of protein cryptic binding site D) Predicting protein receptor
ligand binding energy
Bayesian optimization
accelerated workflow
uses 1/3 of the
calculations to achieve 4
orders of magnitude
resolution increase
Optimizing Molecular Modeling
Achieves human level
performance in days
instead of months.
Accelerated Force Field Tuning Intelligent Phase Diagram Exploration
Faster
Better Cheaper
BOA accelerates
time to insight, time
to value, and time to
design by factors
Example:
IBM EDA ->100x faster
than brute force
BOA can find new and
unknown optima in a
design space because of
its lack of bias and
exploration algorithm
Example:
Infineon – 3x faster than
other methods and
4 orders of magnitude
better resolution
Nothing is cheaper than a
simulation which is never
run. BOA prevents
unnecessary work which
reduces all kinds of costs
Example:
GlaxoSmithKline –
reduced their screen
workload from 20k
experiments to 200
IBM
BOA
Bayesian Optimization Value
0 200 400 600 800
BOA
Greedy
Similarity
Diversity
count
Search Method Comparison
Drug Discovery Case - Single
Objective
All Data / Ties removed
Conclusion: >80% of the
time IBM BOA is the best
method with the least regret
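To make the idea of "fewer calculations" concrete, here is a minimal, generic Bayesian optimization loop: a Gaussian-process surrogate with an expected-improvement rule picks each next evaluation. It is a textbook sketch, not IBM BOA's algorithm; the objective function, kernel length scale, grid and iteration count are all hypothetical choices made for illustration.

```python
import math
import numpy as np


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean, unit-variance GP."""
    k = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_s = rbf(x_obs, x_new)
    mean = k_s.T @ np.linalg.solve(k, y_obs)
    var = 1.0 - np.sum(k_s * np.linalg.solve(k, k_s), axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def expected_improvement(mean, std, best):
    """Expected improvement over the best value seen so far (minimization)."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def objective(x):
    """Hypothetical stand-in for an expensive simulation; minimum at x = 0.3."""
    return (x - 0.3) ** 2


def bayes_opt(n_iter=10):
    grid = np.linspace(0.0, 1.0, 201)        # candidate designs
    x_obs = np.array([0.0, 0.5, 1.0])        # three initial evaluations
    y_obs = objective(x_obs)
    for _ in range(n_iter):
        y_centered = y_obs - y_obs.mean()    # match the zero-mean GP prior
        mean, std = gp_posterior(x_obs, y_centered, grid)
        ei = expected_improvement(mean, std, y_centered.min())
        x_next = grid[np.argmax(ei)]         # most promising candidate
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    best = np.argmin(y_obs)
    return x_obs[best], y_obs[best], len(x_obs)


x_best, y_best, n_evals = bayes_opt()
print(f"best x = {x_best:.3f} after {n_evals} evaluations (grid has 201 candidates)")
```

The point of the sketch is the evaluation count: the loop calls the objective 13 times instead of sweeping all 201 grid candidates, which is the same kind of saving the slide attributes to skipping simulations that never need to run.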
Optimizing Precision Genomics
Reduced time-to-completion for long-running
jobs while increasing resource utilization
Using IBM, Sidra has completed hundreds of
thousands of computing tasks comprising
millions of files and directories, without
experiencing system downtime.
COLLECT - Make data simple and accessible
ORGANIZE - Create a trusted analytics foundation
ANALYZE - Scale AI everywhere with trust & transparency
Data of every type, regardless
of where it lives
MODERNIZE
your data estate for an AI
and hybrid multicloud
world
INFUSE – Operationalize AI across business processes
The AI Ladder
A prescriptive approach to accelerating the journey to AI
AI
AI-optimized systems
infrastructure
The Data: Biological Data Analytics
Biological
Data Analysis
Biomarker
Identification
Biodata
modeling and
Statistical
Analysis
Biodata
Visualization
Medical Images
Data analysis
Structural
Bioinformatics
Genomics
Sequence data
analysis
Biological Data Analytics
• Genomic Sequence Data: an explosive growth of biodata
• Sequence alignment
• Variant discovery and characterization
• Genomic profiling and pattern discovery
• Biomarker Identification: gene expression profile, RNA-seq, ChIP-seq,
microarray identification and validation, etc.
• Structural Bioinformatics: identify and predict 3D biomolecule
structures, such as Cryo-EM data refinement, molecular dynamics
simulation, NMR, X-ray crystallographic data, etc.
• Biodata Modeling & Statistical Analysis: biological pathways
analysis, Gene, clinical data cohorts study, data extraction, etc.
• Medical Image Processing: image segmentation, registration,
statistical modeling.
• Biodata Visualization: 3D molecule structures, genomics sequences
visualization, etc.
Ruzhu Chen @ 2019
High performance
and high throughput
storage hierarchy
required for data
loading, extraction
and computation.
Tertiary storage
required for archive
and store. Storage
tools for data
indexing, discovery
and governance.
Computation
High performance
and efficiency of
software tools and
applications for
genomic variants and
biomarkers analysis,
drug discovery,
medical image
processing and
molecule structure
modeling, data
visualization.
High throughput
and optimized
workload pipelines
to accelerate
biodata analysis
with highly
optimal and
parallel I/O,
memory, CPU and
GPU
computations.
Solutions
Large volume
and variety of
data around
genomic sequences,
gene expression,
images, structural
biomolecules,
clinical and
healthcare
information,
personalized medicine
data
Data Storage
The Challenges: Analyzing Explosive biological data
Data Explosion
Ruzhu Chen @ 2019
Data Pipeline for AI in Healthcare
Insights Out
Trained Models,
simulations
Inference
Data In
Transient Storage
SDS/Cloud
Global Ingest
Throughput-oriented,
globally accessible
Cloud
ETL
High throughput, Random
I/O,
SSD/Hybrid
Archive
High scalability, large/sequential I/O
HDD Cloud Tape
Hadoop / Spark
Data Lakes
Throughput-oriented
Hybrid/HDD
ML / DL
Prep ⇨ Training ⇨ Inference
High throughput, low
latency,
Random I/O
SSD/NVMe
Classification &
Metadata Tagging
High volume, index &
auto-tagging zone
Fast Ingest /
Real-time Analytics
High throughput
SSD
Throughput-oriented,
software defined
temporary landing zone
capacity tier
performance tier performance &
capacity Tier
performance &
capacity Tier
performance tier
capacity tier
Fits Traditional and New Use Cases
EDGE COLLECT ORGANIZE ANALYZE INSIGHTS
INFUSE
IBM Spectrum Scale / Storage for AI / © 2020 IBM Corporation
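The stage-to-storage mapping on this slide can be written down as plain data, which makes it easy to query when planning capacity. This is a sketch of one possible encoding: the `Stage` class and `stages_on` helper are hypothetical names, and the capacity/performance tier labels are left out because the slide text does not make their assignment clear.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    io_profile: str
    media: tuple  # storage media named on the slide; empty if none is listed


PIPELINE = (
    Stage("Transient storage", "throughput-oriented, software-defined landing zone", ("SDS", "Cloud")),
    Stage("Global ingest", "throughput-oriented, globally accessible", ("Cloud",)),
    Stage("Fast ingest / real-time analytics", "high throughput", ("SSD",)),
    Stage("ETL", "high throughput, random I/O", ("SSD", "Hybrid")),
    Stage("Classification & metadata tagging", "high volume, index and auto-tagging", ()),
    Stage("Hadoop / Spark data lakes", "throughput-oriented", ("Hybrid", "HDD")),
    Stage("ML / DL", "high throughput, low latency, random I/O", ("SSD", "NVMe")),
    Stage("Archive", "high scalability, large sequential I/O", ("HDD", "Cloud", "Tape")),
)


def stages_on(medium):
    """Names of the pipeline stages that the slide places on a given medium."""
    return [stage.name for stage in PIPELINE if medium in stage.media]


print(stages_on("SSD"))
```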
Public data
Anything the system can pull
from the outside world for free
through web connections,
databases, IoT and sensors
Proprietary data
What private data from the
outside world could the system be
given permission to use?
Purchased data
What pre-trained data could the
system buy or subscribe to?
IBM Skills Academy / © Copyright 2018 IBM Corporation
Ground truth
Data used to define what the system
knows from day one
Domain knowledge
Data resources that can be used to
teach the system to understand and
be an expert in a particular field
Private data
Unique data the creator owns and
only shares internally
Personal public data
What unique data does the creator
share with the outside world?
Transaction and
application data
Machine,
sensor data
Enterprise
content
Image, geospatial,
video
Social data
Third-party data
Available Data Sources
2 June 2021/ © 2018 IBM Corporation
• Metadata is the structured data about the unstructured object
• Who, what, when, where, and why of account, container, object, stream, dir, file
• Perfect for indexing and searching
• Metadata may be separate from the data, stored with the data, or derived from the data
• Posix inode plus extended attributes
• Standard document headers (doc, ppt, mp3, dicom, pdf, jpeg, GeoTIFF)
• Custom metadata tags
• AI derived metadata
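As a small illustration of the "POSIX inode" layer of metadata, the sketch below walks a directory with Python's standard library and builds a searchable index of system-level fields. The function names and the sample file are hypothetical; extended attributes, document headers and AI-derived tags would be merged into the same record but are not read here.

```python
import os
import stat
import tempfile
from datetime import datetime, timezone


def system_metadata(path):
    """Collect inode-level metadata for one file via os.stat."""
    st = os.stat(path)
    return {
        "location": os.path.abspath(path),
        "size_bytes": st.st_size,
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "permissions": stat.filemode(st.st_mode),
        "last_modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
    }


def build_index(root, custom_tags=None):
    """Index every file under root by absolute path, adding optional custom tags."""
    index = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            record = system_metadata(os.path.join(dirpath, name))
            record["tags"] = dict(custom_tags or {})
            index[record["location"]] = record
    return index


# Demo on a throwaway directory with one hypothetical image file.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "scan.dcm"), "wb") as handle:
        handle.write(b"12345")
    demo_index = build_index(root, custom_tags={"modality": "MRI"})
    for record in demo_index.values():
        print(record["location"], record["size_bytes"], record["permissions"])
```

Keeping the index separate from the data is what lets a catalog answer "where is the data?" without opening the files themselves.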
Age, Biomarkers, Developmental Stage, Cell
Surface, Markers, Cell Type/Cell Line,
Disease State, Extract Molecule, Genetic
Characteristics, Immunoprecipitation,
antibody, Organism
Biomedical
Natural Language
Processing
Image
Location
Size
Owner
Group
Permissions
Last-Modified
...
System
Metadata
Where is the data?
Metadata-Fueled Data Analysis
Large Scale Data Ingest
• Scan records at high speed
• Live event notifications
• Capture system-level tags
• Automatic indexing
Business-Oriented
Data Mapping
• Custom data tagging
• Content-inspection via APIs
• Policy-driven workflows
Data Activation
• Data movement via APIs
• Extensible architecture
• Solution Blueprints
Data Visualization
• Query billions of records
in seconds
• Multi-faceted search
• Drilldown dashboard
• Customizable reports
Common AI Data Considerations
Data Compute
Legacy Data
Stores
IoT, Mobile
& Sensors
Collaboration
Partners
New Data
Ingest Inference
Training
Preparation
Iterative Model training to improve accuracy
Champion
Challenger
- “Data Center”
- At Edge
Trained
Model
§ Easy to Massively Scale
§ High Performance
§ Tiered / Archive
§ Secure
§ High Performance
§ Metadata Tagging
§ Single Name Space
Low Latency
Dev & Inference Stack
- Open Source
- Stable and Supported
- Auditable
Productivity
Performance
Robustness
Considerations
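The champion/challenger step in the iterative training loop can be sketched as a simple promotion rule: a newly trained model replaces the deployed one only if it is clearly better on held-out data. The models, data and `min_gain` threshold below are hypothetical and only illustrate the pattern.

```python
def accuracy(model, examples):
    """Fraction of (input, label) pairs the model classifies correctly."""
    correct = sum(1 for x, label in examples if model(x) == label)
    return correct / len(examples)


def select_champion(champion, challenger, holdout, min_gain=0.01):
    """Promote the challenger only if it beats the champion by at least min_gain."""
    champion_acc = accuracy(champion, holdout)
    challenger_acc = accuracy(challenger, holdout)
    if challenger_acc >= champion_acc + min_gain:
        return challenger, challenger_acc
    return champion, champion_acc


# Toy holdout set: the label is True when the reading is 5 or more.
holdout = [(x, x >= 5) for x in range(10)]


def current_model(x):      # deployed champion, threshold set too high
    return x >= 7


def retrained_model(x):    # challenger from the latest training round
    return x >= 5


winner, winner_acc = select_champion(current_model, retrained_model, holdout)
print(winner.__name__, winner_acc)
```

In practice the comparison would use a metric suited to the clinical task and a holdout set that neither model has seen.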
• Ease of use
• Optimize resources
• Scale workload
AI Frameworks /
Open-Source Libraries
AI Tools and
Applications
AI Software Landscape
AI
Infrastructure
Anaconda Environment for Applications
• Use anaconda enterprise network
(AEN) to manage cryo-EM software
repository on server.
• Easy to use and update software
Anaconda Architecture for Cryo-EM Analysis
Computation
Web Interface
Repo Install
Software
Control
Authentication
Anaconda Server
Compute Nodes
Database Users
OpenPOWER is a technical community
dedicated to expanding the IBM Power architecture ecosystem
https://github.com/open-ce
Open-CE
Minimize time to value for
foundational ML/DL packages
Provide a flexible source-to-image
solution to provide a complete and
customizable AI environment.
Fairness Explainability Adversarial
Robustness
Transparency
Is it fair?
Is it easy to
understand?
Is it secure? Is it accountable?
Pillars of Trusted AI
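One way to make "Is it fair?" measurable is to compare how often a model gives a favourable outcome to different groups. The sketch below computes two common group-fairness metrics on made-up predictions; the data, group labels and function names are hypothetical, and the 0.8 threshold in the comment is the widely used four-fifths heuristic, not a rule from this deck.

```python
def selection_rate(predictions, groups, group):
    """Share of members of one group that received the favourable outcome (1)."""
    picked = [p for p, g in zip(predictions, groups) if g == group]
    return sum(picked) / len(picked)


def statistical_parity_difference(predictions, groups, unprivileged, privileged):
    """Selection-rate gap; 0 means both groups are selected equally often."""
    return (selection_rate(predictions, groups, unprivileged)
            - selection_rate(predictions, groups, privileged))


def disparate_impact(predictions, groups, unprivileged, privileged):
    """Selection-rate ratio; values below about 0.8 are often flagged for review."""
    return (selection_rate(predictions, groups, unprivileged)
            / selection_rate(predictions, groups, privileged))


# Made-up model outputs for eight patients in two groups.
predictions = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

spd = statistical_parity_difference(predictions, groups, "b", "a")
di = disparate_impact(predictions, groups, "b", "a")
print(f"parity difference = {spd:.2f}, disparate impact = {di:.2f}")
```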
Thank You

AI in healthcare - Use Cases
