BigData in Life Sciences, Genomics and
Systems Biology
Harsha Rajasimha
9th September 2015
What is BigData?
Life sciences, genomics and systems biology
BigData in life sciences – where is it coming from?
Genomics and Systems Biology – BigData challenges.
Making sense of BigData
Future of BigData in genomics/SB
What is BigData?
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Life sciences concern the study of living organisms, including biology, botany, zoology, microbiology, physiology, biochemistry, and related subjects.
Application areas: medicine, agriculture, food safety, forensics, epidemiology
Gen“omics”
Before 2000: One Gene at a time based on prior knowledge
Now: All ~25,000 genes at once – no prior knowledge necessary
• Genomics is a discipline in genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the function and structure of genomes (the complete set of DNA within a single cell of an organism).
• OMICS Characteristics
– Comprehensiveness
– Scale
– High-throughput and low-cost technology development
– Rapid data release
– Social and ethical implications
Central Dogma of Molecular Biology
[Diagram: DNA → RNA → PROTEIN → FUNCTION. Labels: Transcription; Reverse Transcription; RNAi (gene silencing).]
Molecular biology is a branch of science concerning biological
activity at the molecular level. The field of molecular biology
overlaps with biology and chemistry and in particular, genetics and
biochemistry.
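The DNA, RNA, protein flow of the central dogma can be illustrated with a minimal Python sketch. The function names and the tiny codon table (only the four codons used in the example) are illustrative assumptions, not part of the original slides; real work would use a complete codon table from an established bioinformatics library.

```python
# Minimal sketch of the central dogma. All names here are illustrative.

# Partial codon table: only the codons needed for the example below.
CODON_TABLE = {"AUG": "M", "GCU": "A", "UGG": "W", "UAA": "*"}


def transcribe(coding_dna: str) -> str:
    """DNA -> RNA: on the coding strand, thymine (T) becomes uracil (U)."""
    return coding_dna.upper().replace("T", "U")


def reverse_transcribe(rna: str) -> str:
    """RNA -> DNA: the reverse mapping, uracil (U) back to thymine (T)."""
    return rna.upper().replace("U", "T")


def translate(rna: str) -> str:
    """RNA -> protein: read codons (triplets) until a stop codon ('*')."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]  # KeyError for codons not in the partial table
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGGCTTGGTAA")
print(rna)                      # messenger RNA
print(translate(rna))           # one-letter amino acid sequence
print(reverse_transcribe(rna))  # back to the DNA coding strand
```

RNAi (gene silencing) is not modeled here; it acts on the RNA step by preventing the transcript from being translated.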
Systems biology is the study of systems of biological components,
which may be molecules, cells, organisms or entire species.
Living systems are dynamic and complex, and their behavior may
be hard to predict from the properties of individual parts.
Life Sciences BigData Examples
Measuring Instruments: LIMS, ELNs
Imaging: Molecular and cellular, pathology
Genomics: personal genomes, aggregate databases, gene expression
Electronic Health Records: variety of information, phenotypes
Literature evidence: PubMed, ISI Web of Science, clinical trials, WWW
Curated content: biochemical pathways, drug response/resistance
Precision medicine is an emerging approach for disease
treatment and prevention that takes into account individual
variability in genes, environment, and lifestyle for each person.
• Biological Mechanism Example: Metabolic Pathways
• Systems Biology Graphical Notation
~4800
BigData Use Case 1: Disease Genes Discovery
Obama’s Precision Medicine Initiative (PMI): Need for Large Cohorts
Genome / DNA Sequencing
• Game Changer 1: First human genome sequenced (2001)
• Game Changer 2: Human genome costs <$1K (2014)
Cost is decreasing at the square of Moore’s law: Flatley’s Law
The ability to digitize humans through genomics and genotyping will overturn the practice of medicine.
Only a small fraction of the 700,000 medical practitioners in the US are up to speed with genomics...
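One way to read "the square of Moore's law" is that, over any period in which a Moore's-law-style cost falls by a factor f, sequencing cost falls by f squared. The sketch below works through that arithmetic. The two-year halving period and the starting cost are assumptions chosen for illustration, not figures from the slides.

```python
def cost_after(years, start_cost, halving_period_years=2.0, exponent=1):
    """Projected cost after `years`.

    exponent=1: cost halves every `halving_period_years` (Moore's-law-style decline).
    exponent=2: the same decline factor squared (the reading of Flatley's Law used here).
    """
    moore_factor = 0.5 ** (years / halving_period_years)
    return start_cost * moore_factor ** exponent


start = 1_000_000.0  # hypothetical starting cost in dollars
for years in (2, 4, 6):
    moore = cost_after(years, start)
    squared = cost_after(years, start, exponent=2)
    print(f"after {years} years: Moore-style {moore:,.0f}, squared {squared:,.0f}")
```

Under these assumptions the squared curve is equivalent to halving the cost every year instead of every two years.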
Personal Genomes, Cancer Genomes, and Other Genomes
Other BigData Use Cases
• Insurance: Cost-benefit analysis of tests
• Health record-guided drug development
• Patient Stratification – drug response based on DNA
• Measuring Instruments
• FDA Office of Regulatory Affairs: 14 labs, 1000+ instruments, data
• 1000 Genomes, 100K Genomes (UK), PMI million cohort
– genomes + phenomes
• Biochemical pathways: Reactome, KEGG, etc.
Lots of BigData – not enough analysis
http://searchhealthit.techtarget.com/tip/Big-data-in-health-care-Lots-of-data-but-not-enough-analysis
Solutions to BigData
Data Storage
Data Organization
Data Analytics
Data Movement
Data Exchange
Data Visualization
BD2K: BigData is worthless
Data Dissemination: Open data, Free data, Open Govt
Data Storage
http://blog.zoolz.com
Cloud, Enterprise storage, Planning for tomorrow
Data Management, Retrieval
• Relational databases
• NoSQL databases
• Data use cases
http://www.tomsitpro.com/articles/rdbms-sql-cassandra-dba-developer,2-547-2.html
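The difference between the relational and NoSQL styles above can be sketched with the same small dataset stored both ways. This is only an illustration: the table name, column names, and records are hypothetical, and the plain Python dictionary stands in for a document store rather than using any real NoSQL client.

```python
import json
import sqlite3

# Hypothetical records: (sample_id, gene, variant)
rows = [
    ("S1", "GENE_A", "var_1"),
    ("S2", "GENE_A", "var_2"),
    ("S2", "GENE_B", "var_3"),
]

# Relational style: fixed schema, one row per observation, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variants (sample_id TEXT, gene TEXT, variant TEXT)")
conn.executemany("INSERT INTO variants VALUES (?, ?, ?)", rows)
per_gene = conn.execute(
    "SELECT gene, COUNT(*) FROM variants GROUP BY gene ORDER BY gene"
).fetchall()
conn.close()
print(per_gene)

# Document style: one nested, schema-free record per sample, looked up by key.
documents = {}
for sample_id, gene, variant in rows:
    doc = documents.setdefault(sample_id, {"sample_id": sample_id, "variants": []})
    doc["variants"].append({"gene": gene, "variant": variant})
print(json.dumps(documents["S2"], indent=2))
```

Which layout fits better depends on the data use case: aggregate queries across all samples favor the relational table, while fetching everything known about one sample favors the document.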
Data Organization and DBs
http://www.enaxisconsulting.com/images/userfiles/images/MDM-Chart640Final(1).jpg
Business Cases, Continuity, Infrastructure, Governance
E.g., NIH public data repositories
Data Analytics
Data Movement / Transfer
• How is the data expected to move within and outside the infrastructure?
• Bring data to analysis tools or tools to data?
• From archives to compute storage, from local to cloud
• Network bandwidth considerations
• DAS, NAS, SAN, Tapes, RAM, Cache
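The bandwidth question above often decides whether data moves to the tools or the tools move to the data. A rough back-of-the-envelope estimate can be sketched as follows; the function name, the dataset sizes, the link speeds, and the efficiency factor are illustrative assumptions.

```python
def transfer_hours(size_terabytes, bandwidth_gbps, efficiency=1.0):
    """Estimated hours to move a dataset over a network link.

    size_terabytes: dataset size in decimal terabytes (1 TB = 1e12 bytes).
    bandwidth_gbps: nominal link speed in gigabits per second.
    efficiency: fraction of nominal bandwidth actually achieved (assumed).
    """
    size_bits = size_terabytes * 1e12 * 8
    effective_bits_per_second = bandwidth_gbps * 1e9 * efficiency
    return size_bits / effective_bits_per_second / 3600


# Hypothetical scenarios: (dataset size in TB, link speed in Gbps)
for size_tb, gbps in [(1, 1), (100, 1), (100, 10)]:
    hours = transfer_hours(size_tb, gbps, efficiency=0.8)
    print(f"{size_tb} TB over {gbps} Gbps at 80% efficiency: {hours:.1f} hours")
```

When the estimate runs to days or weeks, running the analysis where the data already sits is usually the more practical option.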
Data Integration and Exchange
• APIs: Application programming interfaces for on-demand access
• XML: SBML
• EMRs
• RDF/OWL: BioPAX
• FASTQ
• DICOM
• Commons: genomics, cancer, etc.
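Of the exchange formats listed above, FASTQ is simple enough to read in a few lines. The sketch below assumes the common four-line record layout (header, sequence, "+" separator, quality string), no blank lines between records, and Phred+33 quality encoding. The function names and the two example reads are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:  # end of input (assumes no blank lines between records)
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Average Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


# Two made-up reads standing in for a real file.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
records = list(read_fastq(example))
for read_id, sequence, quality in records:
    print(read_id, sequence, mean_phred(quality))
```

For production pipelines an established parser is a safer choice, since real FASTQ files can be compressed or use other quality encodings.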
Data Visualization
Circos plot
Health InfoScape: 7+ million EMRs, SENSEable City Lab at MIT and GE HealthyMagination. Frequency of co-occurrence of medical conditions.
Alignment of 8 Yersinia whole bacterial genomes
BD2K: BigData is worthless
Data Dissemination
Discussion!
Let's do it with BigData!
harshakarur@gmail.com
