Next Gen Sequencing and the
associated Big Data problem
SUBHENDU DEY
subhendu.dey@in.ibm.com
Executive Architect / Associate Partner, Cognitive Solutions
IBM Global Business Services
Reality in Healthcare
It’s humanly impossible to keep up with the knowledge and the data…
In medicine, there’s a gap between what we know and what we do…
This rising tide of information contains insights critical to your success
• 24 months: frequency at which healthcare data doubles [2]
• 80% of medical data is invisible because it’s unstructured [1]
• >1M GB: the amount of health-related data a person generates in their lifetime [3]
• 45% of medicine is not evidence based [4]
• 17 years: time it takes to translate science to practice [5]
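To make the 24-month doubling figure concrete, the sketch below computes the growth factor it implies over a few time horizons. It assumes idealized exponential growth with a constant doubling period; the function name and the chosen horizons are illustrative and not from the talk.

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Return how many times larger a quantity becomes after `years`,
    assuming it doubles every `doubling_period_years` (idealized model)."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Horizons chosen for illustration: one doubling, a decade, and the
    # 17-year science-to-practice lag quoted on the slide.
    for years in (2, 10, 17):
        print(f"after {years:>2} years: x{growth_factor(years):,.0f}")
```

Under this assumption the volume of data grows 32-fold in ten years, and roughly 360-fold over the 17 years it takes to translate science into practice.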
Even the leaders in this space confront challenges every day to improve lives and transform health GLOBALLY
• 12,000 radiation oncologists needed by 84 low- to middle-income countries by 2020
• 29 hours of reading per day required for a physician to review new literature that is published each day, making it impossible to keep up with the latest professional insights
• 14 million new individuals impacted by cancer each year, leading to a 42% increase in the demand for cancer care over the next 10 years
• 9600 minutes: a benchmark metric of the time and labor spent on interpretation of whole genome sequencing (according to a recent study by the New York Genome Center)
• 75% of cancers will not respond to a particular drug, requiring an alternative treatment
SOURCES:
1. ASCO Releases Its First-Ever Report on the State of Cancer Care in America. See link in notes.
2. Global Challenges in Radiation Oncology. See link in notes.
3. Clinical Trends in Molecular Medicine. See link in notes.
4. How much effort is needed to keep up with the literature relevant for primary care? See link in notes.
5. Comparing sequencing assays and human-machine analyses in actionable genomics for glioblastoma. See link in notes.
NGS and Big Data Analysis | CSIR-IICB, Kolkata | March 21, 2018
In the area of Genomics the problem is intensified
• Think of the 46 chromosomes as a set of 46 books in a library
• These 46 books have 25,000 stories (genes)
• Each story has a specific ending (protein)
• Sometimes a story gets interrupted (mutation) and its ending changes, or one may never get to the end
• Changes can be at the chromosome (book) level
  • Down syndrome: 47 chromosomes instead of 46 (an extra book at #21)
• Changes can be at the DNA (story) level
  • A spelling change in a story (a mutation)
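The "spelling change" analogy can be illustrated with a minimal sketch that translates a short DNA sequence into its protein and then applies single-letter substitutions. The sequence is made up for illustration, the helper names are hypothetical, and the codon table is deliberately partial (only the standard-genetic-code codons used here).

```python
# Partial codon table (standard genetic code), limited to this example.
CODON_TABLE = {
    "ATG": "M",  # methionine (start)
    "GAG": "E",  # glutamate
    "GTG": "V",  # valine
    "AAG": "K",  # lysine
    "TAG": "*",  # stop
    "TAA": "*",  # stop
}


def translate(dna: str) -> str:
    """Translate DNA codon by codon until a stop codon is reached."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break  # the story ends here
        protein.append(amino_acid)
    return "".join(protein)


def substitute(dna: str, position: int, base: str) -> str:
    """Return a copy of `dna` with one base replaced (0-based position)."""
    return dna[:position] + base + dna[position + 1:]


reference = "ATGGAGAAGTAA"                # made-up story: M-E-K, then stop
missense = substitute(reference, 4, "T")  # GAG -> GTG: the ending changes
nonsense = substitute(reference, 6, "T")  # AAG -> TAG: the story stops early

if __name__ == "__main__":
    for label, seq in [("reference", reference),
                       ("missense", missense),
                       ("nonsense", nonsense)]:
        print(f"{label:>9}: {seq} -> {translate(seq)}")
```

The first substitution changes one amino acid (the ending changes); the second introduces a premature stop codon, so the reader never gets to the end of the story.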
In Genomics, a potent mix of influences is furthering an evolution that crosses industries
Biology and Computational Technology
• Highly specific gene expression panels at much lower cost
• Decreasing cost and increasing capability of high-performance computing (HPC); storage is now a commodity as well
• A new era of computing: cognitive computing
• For the pharmaceutical industry, targeted medicine is a reality with complex pathway analysis
Market Demand and Consumerism
• There is a growing commercialization of genomics, seen in the increased uptake of direct-to-consumer genomic testing
• Genomics is also converging with social media, as evidenced by the sharing of health experiences and questions online (e.g. https://www.patientslikeme.com)
• The use of wireless sensors and the plethora of accessible digital health information
HPC, pathway analysis and cognitive computing need a skillset that goes beyond the world of biology
Correlation based analysis
Digital Data integration, Internet of things
OCR, Image recognition
Natural Language Processing, knowledge graph
Machine (and/or Deep) learning
Artificial reasoning, Evidence based decision
Predictive Analytics
Reference architecture for dealing with Big Data
[Diagram: three platform stacks (Genomics, Translational, Personalized Medicine) laid out across four layers: Data Source, Data Service, Analytics, Access]
• Data sources: PACS, LIMS, NGS, Ref DB, Publications, Clinical Notes, PubMed, Ontologies, EMR
• Genomics analytics: Alignment & Assembly, Variant Analysis, Annotation, Bioinformatics
• Translational analytics: NLP, ETL / ELT, Annotation, Analytics
• Personalized Medicine analytics: NLP, MDM, Annotation, Analytics
• Platform components: Orchestrator (Resource, Workload, Workflow, Provenance), Data Hub (I/O, Life Cycle, Sharing, Metadata), App Center (Catalog, Monitoring), Database, Knowledge Base
Reference architecture for dealing with Big Data
[Diagram: the same architecture as on the previous slide, annotated with infrastructure options]
• SSD / Flash, FC attached, low-cost storage, HA/DR storage, cloud storage
• HPC cluster, Big Data, Spark cluster, OpenStack, Docker, ….
• Visualization, application and workflow, system log
Watson for Genomics - introduction
Identifying alterations driving the patient’s cancer and matching them with molecular targeted therapies using multiple data sources is extremely complex and labor-intensive.
Right now it can take from days to weeks to perform a comprehensive manual analysis of the genetic alterations for one patient.
[Diagram labels]
• Case sequenced
• VCF / MAF, Log2, Dge, Fusion
• Encryption
• Molecular profile analysis, pathway analysis, drug analysis
• 20+ content sources, including medical articles, drug information, clinical trial information, genomic information, OncoKB by MSKCC
• Annotation, tagging, classification, summarization, clustering, similarity analysis, scoring
• Ranked information retrieval, targeted outcome
• Integration of variety of content
• Information distillation
• Insight generation through NLP pipeline
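Since VCF is listed among the input formats, the following minimal sketch shows how the eight fixed, tab-separated VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) can be read into a small record. It is not how Watson for Genomics ingests files; the `Variant` class, the function names and the sample record are hypothetical, and a real pipeline would normally rely on a dedicated VCF library.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    info: Dict[str, str]


def parse_vcf_line(line: str) -> Variant:
    """Parse one VCF data line using its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_dict: Dict[str, str] = {}
    if info != ".":  # "." marks a missing value in VCF
        for entry in info.split(";"):
            key, _, value = entry.partition("=")  # flags have no "=value"
            info_dict[key] = value
    return Variant(chrom, int(pos), ref, alt.split(","), info_dict)


def read_variants(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield variants, skipping meta-information and header lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


# Made-up record for illustration only (not real patient data).
sample = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tT\t.\tPASS\tDP=120;SOMATIC",
]

if __name__ == "__main__":
    for variant in read_variants(sample):
        print(variant)
```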
Watson for Genomics – functional highlights
• “Born in the cloud” multi-user and multi-tenant solution with a single code base
• No customization, configuration or integration required for initial use
• Security-rich environment managed by IBM in line with industry standards
• Patient data uploaded is de-identified (de-identified mutated DNA)
• Accepted input data includes somatic mutations, copy number variations, gene expression and fusion
• Supports gene panels, whole exome and whole genome sequenced files
• Natural Language Processing (NLP) used to extract information from extensive medical literature (over 23 million articles)
• 20+ structured and unstructured data sources ingested
• Analytics engine to identify relevant alterations, drugs and clinical trials for any cancer type
• Report and interactive visualizations of the molecular profile, drugs and pathways
• Summary report shows targeted therapeutic options in three categories: FDA approved for the patient’s cancer type, investigational, and FDA approved for other cancer types
• Evidence presented via hyperlinks to sources for easy drill-down
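The three report categories described above can be sketched as a simple bucketing rule. This is only an illustration of the categorization, assuming each candidate drug comes with the set of cancer types it is approved for; the `Drug` class, the function names and the placeholder drugs are hypothetical and do not reflect the product's internal logic.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List

APPROVED_SAME = "FDA approved for the patient's cancer type"
INVESTIGATIONAL = "Investigational"
APPROVED_OTHER = "FDA approved for other cancer types"


@dataclass(frozen=True)
class Drug:
    name: str
    approved_for: FrozenSet[str]  # cancer types with an approval (assumed input)


def categorize(drug: Drug, patient_cancer_type: str) -> str:
    """Assign a drug to one of the three report categories."""
    if patient_cancer_type in drug.approved_for:
        return APPROVED_SAME
    if drug.approved_for:
        return APPROVED_OTHER
    return INVESTIGATIONAL  # no approval recorded for any cancer type


def build_summary(drugs: Iterable[Drug],
                  patient_cancer_type: str) -> Dict[str, List[str]]:
    """Group drug names by category, keeping the slide's category order."""
    summary: Dict[str, List[str]] = {
        APPROVED_SAME: [], INVESTIGATIONAL: [], APPROVED_OTHER: [],
    }
    for drug in drugs:
        summary[categorize(drug, patient_cancer_type)].append(drug.name)
    return summary


# Placeholder drugs and cancer types, purely illustrative.
candidates = [
    Drug("drug_A", frozenset({"melanoma"})),
    Drug("drug_B", frozenset({"lung"})),
    Drug("drug_C", frozenset()),
]

if __name__ == "__main__":
    for category, names in build_summary(candidates, "melanoma").items():
        print(f"{category}: {names}")
```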
Watson for Genomics - demonstration
Q & A

Next Gen Sequencing and Associated Big Data / AI problem


Editor's Notes

  • #3 Healthcare disruption is underway in oncology. Watson Health chose to take on cancer with cognitive because we believe the challenges facing cancer patients and caregivers today – rising incidence rates, shortages of oncologists where they are needed most, increasing costs, evolving models of care and reimbursement, the growing importance of genomics and the acceleration of publications for oncologists to keep up with – can be tackled through solutions that understand and derive insights from data.
    SOURCES
    1. ASCO Releases Its First-Ever Report on the State of Cancer Care in America. Available at: http://www.ascopost.com/issues/april-15-2014/asco-releases-its-first-ever-report-on-the-state-of-cancer-care-in-america/. Accessed November 15, 2016.
    2. Marconi, Katherine and Lehmann, Harold. Big Data and Health Analytics. CRC Press, 2014. Available at: http://bit.ly/1UjEtLL. Accessed June 3, 2016.
    3. Managed Care, January 2011 and HealthAffairs Blog, July 2011.
    4. Elizabeth A. McGlynn, Ph.D., Steven M. Asch, M.D., M.P.H., John Adams, Ph.D., Joan Keesey, B.A., Jennifer Hicks, M.P.H., Ph.D., Alison DeCristofaro, M.P.H., and Eve A. Kerr, M.D., M.P.H. N Engl J Med 2003; 348:2635-2645, June 26, 2003. DOI: 10.1056/NEJMsa022615. Available at: http://www.nejm.org/doi/full/10.1056/NEJMsa022615#t=abstract
    5. Slote Morris, Zoë & Wooding, Steven & Grant, Jonathan. (2011). The answer is 17 years, what is the question: Understanding time lags in translational research. Journal of the Royal Society of Medicine. 104. 510-20. 10.1258/jrsm.2011.110180. Available at: https://www.researchgate.net/publication/51897868_The_answer_is_17_years_what_is_the_question_Understanding_time_lags_in_translational_research
  • #4 The statistics on this slide are things you know well. The numbers are not getting easier to manage. The health professional struggles to have all the knowledge necessary to optimize care. There are too many patients per doctor, especially where the cancer’s profile becomes more unique each day. There are continuing struggles to recruit into trials, and therapies are delivering less than life-changing benefit. The thousand-dollar genome test solved one problem but created another. Now the cost and time burden isn’t on physically doing the test; it is on the interpretation. Molecular profiling has led to a new library of literature on detection and treatment of molecular subtypes of cancers. These are the challenges we know you face. You have the vision and the tenacity to take them on; ours is to help with the tools.
    SOURCES
    1. ASCO Releases Its First-Ever Report on the State of Cancer Care in America. Available at: http://www.ascopost.com/issues/april-15-2014/asco-releases-its-first-ever-report-on-the-state-of-cancer-care-in-america/. Accessed November 15, 2016.
    2. Daniel Grant Petereit, C. Norman Coleman. Global Challenges in Radiation Oncology. Front Oncol. 2015; 5: 103. Published online 2015 May 15. Accessed at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4432796/
    3. Brian B. Spear, Margo Heath-Chiozzi, Jeffrey Huff. Clinical Trends in Molecular Medicine, Volume 7, Issue 5, 1 May 2001, pages 201-204. Accessed at: http://www.personalizedmedicinecoalition.org/Education/The_Basics
    4. How much effort is needed to keep up with the literature relevant for primary care? J Med Libr Assoc. 2004 Oct; 92(4): 429–437. Available at: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC521514/. Accessed June 3, 2016.
    5. Comparing sequencing assays and human-machine analyses in actionable genomics for glioblastoma. Neurology Genetics Aug 2017, 3 (4) e164; DOI: 10.1212/NXG.0000000000000164. Accessed at: http://ng.neurology.org/content/3/4/e164.
  • #6 So what’s causing this increase in widespread use of the genomic health record?
    • The decreased costs of sequencing the human genome
    • The proliferation and availability of genome-based tests in the past five years
    • The rising adoption of electronic medical records
    • The increased use of genome data to recommend targeted treatments using companion molecular diagnostics
    • A growing willingness of payors to reimburse payments on some genetic tests today