Big Data Infrastructure for Translational
Research
Christopher G. Wilson, Ph.D.
Associate Professor, Physiology and Pediatrics
Center for Perinatal Biology
Translational Medicine, April 18th, 2015
Disclosures
The work reported here was supported, in part,
by NIH grants:
1R01HL081622-01 (NHLBI)
1R03HD064830-01 (NICHD)
Outline
• Defining “Big Data”
• Big data is of multiple modes/types
• Scaling data acquisition to build Big Data sets
•  Patient bed
•  Unit
•  Institution-wide
• Continuing challenges
What is “Big Data”?
• Big data is a blanket term for any collection of data sets so
large and complex that it becomes difficult to process using the
typical data management tools and data processing
applications.
• Big data usually includes data sets so large that commonly
used software (like Microsoft Office) cannot be used to
capture, curate, manage, and process the data quickly and
efficiently.
• Big data set sizes are a constantly moving target, ranging
from hundreds of gigabytes (1 GB = 10^9 bytes) to terabytes (10^12 bytes)
and even petabytes (10^15 bytes) of data in a single data set.
A feast of data!
• The world’s technological per-capita capacity to store
information has roughly doubled every 40 months since the
1980s
• Global Internet traffic has reached almost 1000 exabytes
(1 exabyte = 10^18 bytes) annually and continues to grow*
• The challenge for both business and research science is
coming up with the tools to extract usable information from this
data
*Cisco Systems estimate
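As a rough illustration of what "doubling every 40 months" implies, the sketch below converts a doubling time into an annual growth factor and projects a quantity forward. The function names and the 1 exabyte starting value are illustrative assumptions, not figures from the talk.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Growth multiplier per year implied by a given doubling time."""
    return 2.0 ** (12.0 / doubling_months)


def project(start: float, years: float, doubling_months: float = 40.0) -> float:
    """Project a quantity forward assuming steady exponential growth."""
    return start * 2.0 ** (years * 12.0 / doubling_months)


# Doubling every 40 months works out to roughly 23% growth per year.
factor = annual_growth_factor(40)

# Hypothetical starting point of 1 exabyte, projected 10 years ahead
# (10 years = 120 months = exactly three doublings).
capacity_eb = project(1.0, years=10)
print(f"annual factor: {factor:.3f}, after 10 years: {capacity_eb:.1f} EB")
```

Steady exponential growth is itself an assumption; real storage capacity does not grow perfectly smoothly.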
Where does so much data come from?
Data sets grow to vast size because they are increasingly
being gathered by:
• Ubiquitous information-sensing mobile devices (phones,
Fitbits, Jawbones, etc.)
• Surveillance technologies (remote sensing devices like
drones or traffic cameras)
• Software logs from your internet activity (Hello—Facebook!)
• Radio-frequency identification (RFID) tags
• Wireless sensor networks (once again, the kind of thing your
phone “wants” to attach to when you are out and about)
• And scientific instruments, clinical monitors, patient
samples…
Personal fitness trackers
Work-flow of “Big Data” analysis
Or…
• Obtain data
• Scrub data
• Explore data
• Model the data
• Interpret the data
• Present the data
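The six steps above can be sketched as a chain of small functions. This is a minimal, standard-library-only illustration: the heart-rate values are synthetic, and the plausibility range, the trend tolerance, and all function names are assumptions chosen for the example rather than anything prescribed in the talk.

```python
from statistics import mean, median


def obtain():
    """Obtain: stand-in for reading a monitor export (synthetic values)."""
    return [72, 75, None, 71, 300, 74, 78, -1, 76, 80]


def scrub(samples, low=20, high=250):
    """Scrub: drop missing values and readings outside a plausible range."""
    return [s for s in samples if s is not None and low <= s <= high]


def explore(samples):
    """Explore: basic summary statistics."""
    return {"n": len(samples), "mean": mean(samples), "median": median(samples)}


def model(samples):
    """Model: least-squares slope of value against sample index."""
    n = len(samples)
    x_bar = (n - 1) / 2
    y_bar = mean(samples)
    sxx = sum((i - x_bar) ** 2 for i in range(n))
    sxy = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(samples))
    return sxy / sxx


def interpret(slope, tolerance=0.5):
    """Interpret: turn the fitted slope into a plain-language trend."""
    if slope > tolerance:
        return "rising"
    if slope < -tolerance:
        return "falling"
    return "stable"


def present(summary, trend):
    """Present: format a one-line report."""
    return f"n={summary['n']} mean={summary['mean']:.1f} trend={trend}"


clean = scrub(obtain())
report = present(explore(clean), interpret(model(clean)))
print(report)
```

Keeping each stage a separate function makes it easy to swap in a real data source or a more serious model without touching the other steps.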
Data analytics is a team sport!
•  Project manager—responsible for setting clear project objectives and deliverables.
The project manager should be someone with more experience in data analysis
and a more comprehensive background than the other team members.
•  Statistician—should have a strong mathematics/statistics background and will be
responsible for reporting and developing the statistics workflow for the project.
•  Visualization specialist—responsible for the design/development of data
visualization (figures/animation) for the project.
•  Database specialist—develops ontology/meta-tags to represent the data and
incorporate this information in the team's chosen database schema.
•  Content Expert—has the strongest background in the focus area of the project
(Physiologist, systems biologist, molecular biologist, biochemist, clinician, etc.) and
is responsible for providing background material relevant to the project's focus.
•  Web developer/integrator—responsible for web-content related to the project,
including the final report formatting (for web/hardcopy display).
•  Data analyst/programmer—the most junior member of the team will take on
general responsibilities to assist the other team members. This is a learning
opportunity for a team member who is new to data analysis and needs time to
develop the skills necessary to fully participate in the workflow.
Data analytics is a team sport!
Project manager/
content expert
(physician/scientist)
Database/web
developer
Statistician/
Data viz
Programmer
Team members can have multiple roles….
What tools are typically used?
• A 64-bit computing environment is typical (Big RAM and Big
storage, massively parallel software running on clusters/cloud
servers)
• Data is acquired and stored in a database (SQL for some, but
NoSQL systems like MongoDB, CouchDB, and Clusterpoint, or the
Hadoop ecosystem, are often “better”)
• Data screening & cleaning using “scripting” languages (Perl
or Python typically) and processing using tools like
MapReduce
• “Industrial strength” statistical packages (typically R, SAS, or
SPSS)
• Visualization (D3/IDL/MATLAB/Python/Plot.ly, etc.)
• Metadata tagging (XML and variants)
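To make the MapReduce idea in the list above concrete, here is a single-process sketch of the map, shuffle, and reduce phases in plain Python. It only mimics the programming model; it is not Hadoop and does not distribute work. The record layout (bed, signal, value) and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce

# Synthetic bedside readings: (bed, signal, value).
records = [
    ("bed-1", "HR", 72),
    ("bed-1", "HR", 76),
    ("bed-2", "HR", 90),
    ("bed-1", "SpO2", 97),
    ("bed-1", "SpO2", 99),
]


def map_phase(record):
    """Map: emit a (key, (sum, count)) pair for one record."""
    bed, signal, value = record
    return (bed, signal), (value, 1)


def shuffle(pairs):
    """Shuffle: group the mapped values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: combine partial (sum, count) pairs into one pair."""
    return reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)


def mean_by_key(rows):
    """Run the three phases and return the mean value per (bed, signal)."""
    grouped = shuffle(map(map_phase, rows))
    result = {}
    for key, values in grouped.items():
        total, count = reduce_phase(values)
        result[key] = total / count
    return result


means = mean_by_key(records)
print(means)
```

Emitting (sum, count) pairs rather than raw values keeps the reduce step associative, which is what lets a real cluster combine partial results in any order.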
How can we meet the challenge
of Big Data collection/integration
in a translational setting?
What are the challenges for clinicians/researchers?
The amount of biomedical data that is increasingly available
provides both opportunity and challenge for the translational
investigator.
• Molecular biology has provided tools to allow understanding of
genomics and proteomics.
• There is growing data on the connectomics of signaling pathways
• Patient demographic data and other EHR/EMR metrics are a resource
that is only now being widely deployed and interrogated.
• Patient physiology (bedside monitors) can be used to provide
fundamental information about patient health and adaptation to
pathophysiologies.
• Health Insurance Portability and Accountability Act of 1996 (HIPAA) is
a necessary challenge for data handling.
Courtesy Michael De Georgia & J. Michael Schmidt
Big Data to Decisions!
» Technology challenges for “Data to Decisions”
~  Transforming data from multiple sources into meaningful information (evidence-context dependent)
~  Association of data from diverse heterogeneous, asynchronous sources
~  Merging/fusion of information for alerts and decision support
~  Human guided processing and analysis
Multi-source Analysis for Pattern Discovery: extract & synthesize information from diverse data (multiple sources).
» Source-to-Evidence: Information Processing & Extraction
~  Text Analytics
~  Image Analysis
~  Signal Processing
~  Data Association
» Data Fusion: Alerting & Decision Support
~  Combine Information
~  Weigh Evidence
~  Real-time Alerting
» User Interface: Display & Analysis
~  Visualization
~  Queries
~  Data Provenance
~  Sensitivity
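One common way to associate asynchronous sources is to attach, to each slow event (for example a lab result), the most recent value from a faster stream (for example heart rate). The sketch below does this with a binary search. The function name, the `max_age` staleness cutoff, and the sample data are illustrative assumptions, and the timestamps are assumed to be sorted.

```python
from bisect import bisect_right


def align_latest(timestamps, values, query_times, max_age=None):
    """For each query time, return the most recent value at or before it.

    Returns None when no earlier sample exists or when the latest
    sample is older than max_age (same units as the timestamps).
    """
    aligned = []
    for t in query_times:
        i = bisect_right(timestamps, t) - 1  # index of last sample <= t
        if i < 0 or (max_age is not None and t - timestamps[i] > max_age):
            aligned.append(None)
        else:
            aligned.append(values[i])
    return aligned


# Synthetic example: heart rate every 10 s, lab events at irregular times.
hr_times = [0, 10, 20, 30]
hr_values = [70, 72, 75, 71]
lab_times = [5, 20, 95]

matched = align_latest(hr_times, hr_values, lab_times, max_age=30)
print(matched)  # [70, 75, None]
```

Returning `None` for stale or missing matches, rather than silently carrying an old value forward, keeps gaps in the data visible to later fusion steps.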
Real-time Decision Support
Providing useful information to the clinician
» Real-time decision support to clinicians at the point of care
~  Codify best practice protocols
~  Enable efficient treatment decisions
~  Reduce needless procedures
~  Optimize coordination among care givers
~  Reduce the probability of mistakes being made
» Key features that affect decision support
~  Methods to retrieve, merge, and present data and information
~  Algorithms to extract information from complex, heterogeneous data
~  Visualization/graphical feedback to better understand patient conditions
» Automated alerting for conditions of concern
~  Combining information across data streams
~  Accumulation of weak evidence from multiple sources
~  Enhanced retrieval and visualization of information
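"Accumulation of weak evidence from multiple sources" can be illustrated with a naive Bayesian update, in which each source contributes a likelihood ratio and the log-odds are summed. This is a sketch under a strong assumption (the sources are conditionally independent), and the prior, the likelihood ratios, and the alert threshold below are made-up numbers, not clinical values.

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Combine a prior with independent likelihood ratios via log-odds."""
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


def should_alert(prior, likelihood_ratios, threshold=0.5):
    """Raise an alert when the combined posterior reaches the threshold."""
    return posterior_probability(prior, likelihood_ratios) >= threshold


# One weak signal (likelihood ratio 3) barely moves a 5% prior...
single = posterior_probability(0.05, [3])
# ...but three such signals together cross a 50% threshold.
combined = posterior_probability(0.05, [3, 3, 3])
print(f"single: {single:.3f}, combined: {combined:.3f}")
```

If the data streams are correlated, as physiological signals often are, this approach overstates the evidence, so the independence assumption needs checking before any real use.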
Challenges inherent in Big Data Analytics
• Capture
• Curation
• Storage
• Search
• Sharing
• Transfer
• Analysis
• Visualization
Data is multi-modal
Unified data set, combining:
• Physiology waveforms (ECG, EEG, SaO2, BP)
• Radiology (X-Ray, MRI, CAT, etc.)
• EMR/EHR
• “-omics” data
Bedside Patient Data Acquisition
Scaling to a hospital-wide data center
Ken Loparo
Michael De Georgia
Frank Jacono
Farhad Kaffashi
CWRU IMEDS™ Proof of Concept Demonstration
Why is IMEDS™ Different?
The Approach
~  “Bottom-up” development with clinicians and engineers working
side-by-side
~  Open source architecture design
~  Total integrated, “plug-and-play” system solution
~  Unbiased approach
~  Unified effort, rather than stove-piped, “one-off” solutions to small
pieces of the problem
~  Non-profit nation-wide consortium
~  Builds on existing infrastructures
~  Leverages best available technology, regardless of source
Courtesy Michael De Georgia & J. Michael Schmidt
Courtesy of Susanna-Assunta Sansone, PhD
IPython interface (http://ipython.org)
•  Reproducible
•  Version controlled (git)
•  Interactive analysis
Worldwide movement for FAIR data
Barend Mons and Susanna-Assunta Sansone
http://bd2k.nih.gov/workshops.html#ADDS
Launched on May 27th, 2014
A new online-only publication for descriptions of scientifically valuable datasets in
the life, environmental and biomedical sciences, but not limited to these
• Credit for sharing your data
• Focused on reuse and reproducibility
• Peer reviewed, curated
• Promoting Community Data Repositories
• Open Access
Courtesy of Susanna-Assunta Sansone, PhD
Data Analysis Methods
• Data Processing
• Decision Tree Analysis
• Artificial Neural Network
• Mechanistic Approaches
• Graphical Approaches
• Bayesian Network
• Hierarchical Clustering
• Probabilistic Approaches
• Classical Statistical Inference
• Bayesian Statistical Inference
• Complex Systems Analysis
• Time Domain
• Frequency Domain
• Scale Invariant (Fractal) Analysis
• Approximate Entropy
• Integrated Patient Database
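Of the methods named here, approximate entropy is compact enough to sketch directly. The version below follows the standard definition (embedding dimension m, tolerance r, self-matches included) in plain Python with quadratic running time. The default r of 0.2 times the standard deviation is a common convention assumed here, not something stated in the talk.

```python
import math
from statistics import pstdev


def approximate_entropy(series, m=2, r=None):
    """Approximate entropy: lower values mean a more regular series."""
    n = len(series)
    if r is None:
        r = 0.2 * pstdev(series)  # assumed default tolerance

    def phi(length):
        # All overlapping windows ("templates") of the given length.
        templates = [series[i:i + length] for i in range(n - length + 1)]
        total = 0.0
        for a in templates:
            # Count templates within tolerance r (Chebyshev distance);
            # the template always matches itself, so the count is >= 1.
            matches = sum(
                1
                for b in templates
                if max(abs(x - y) for x, y in zip(a, b)) <= r
            )
            total += math.log(matches / len(templates))
        return total / len(templates)

    return phi(m) - phi(m + 1)


# A strictly alternating series is highly regular, so its value is near zero.
regular = [0, 1] * 50
print(f"ApEn of alternating series: {approximate_entropy(regular):.4f}")
```

For long physiological recordings the quadratic cost matters, and a vectorized or library implementation would be the practical choice.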
Python as a data analytics environment
Advantages to using a Big Data approach
• Speed of data reduction and analysis
• Visualization of complex data sets can be done relatively
quickly
• Capacity for storage and processing of vast data sets is
inherent in the tool stack
• Scalability of cloud/cluster storage
• Potential for “Big Impact” on research and clinical care
Disadvantages to a Big Data approach
• Often not hypothesis driven (a fishing expedition?)
• Requires expensive computing technology depending upon
data processing and storage needs
• Requires significant programming skill to develop and use the
tool stack
• Typically requires “team based” data analysis and
management (programmer, database manager, design/
visualization person, etc.)
• Having lots of data doesn’t mean you have
an obvious or easy way to extract the information!
Summary
• We live in a data-rich era.
• The data available to us is multi-modal and requires
integration.
• Data collection and integration can occur at many scales
(bedside to institution) but the data must be converted into
usable information.
• Team-based science depends upon a wide range of data
analytics skills.
• Curation, reproducibility, and shared access to data are
ongoing challenges.
Where do you find your data
analytics team members?
Syllabus Overview (10 week course)
Foundations 1: Using text editors, using the IPython notebook for data exploration, using
version control software (git), using the class wiki.
Foundations 2: Using IPython/NumPy/SciPy, importing and manipulating data with Pandas,
data visualization in IPython.
Analysis Methods: Basic signal theory overview, time-series data, plotting (lines, histograms,
bars, etc.), dynamical systems analyses of data variability, information theory measures
(entropy) of complexity, frequency domain/spectral measures (FFT, time-varying spectrum),
wavelets.
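As a small illustration of the frequency-domain measures in this unit, the sketch below finds the dominant frequency of a sampled signal with a naive discrete Fourier transform. It is written in plain Python for clarity and runs in quadratic time; in practice a library FFT (for example NumPy's) would be used. The function name and the synthetic test signal are assumptions made for the example.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC component."""
    n = len(samples)
    mean_value = sum(samples) / n
    centered = [s - mean_value for s in samples]  # remove the DC offset

    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):  # positive frequencies up to Nyquist
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k * sample_rate_hz / n


# Synthetic signal: strong 5 Hz component plus a weaker 12 Hz component,
# sampled at 100 Hz for one second.
rate = 100
signal = [
    math.sin(2 * math.pi * 5 * t / rate) + 0.3 * math.sin(2 * math.pi * 12 * t / rate)
    for t in range(rate)
]
peak_hz = dominant_frequency(signal, rate)
print(f"dominant frequency: {peak_hz:.1f} Hz")
```

The frequency resolution is the sample rate divided by the number of samples, so a one-second window resolves frequencies in 1 Hz steps.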
Handling Sequence data: Using R/Bioconductor, differences between mRNA-Seq, gene-array,
proteomics, and deep-sequencing data, visualizing data from gene/RNA arrays.
Data set storage and retrieval: Basics of relational databases, SQL vs. NoSQL, cloud
storage/NAS/computing clusters, interfacing with Hadoop/MapReduce, metadata and ontology
for biomedical/patient data (XML), using secure databases (REDCap).
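For the relational database basics in this unit, a minimal sketch using Python's built-in `sqlite3` module shows two linked tables and a join with aggregation. The schema, the table and column names, and the sample rows are all hypothetical, chosen only to show the pattern.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patient (
            id INTEGER PRIMARY KEY,
            study_code TEXT NOT NULL
        );
        CREATE TABLE measurement (
            patient_id INTEGER NOT NULL REFERENCES patient(id),
            signal TEXT NOT NULL,
            value REAL NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO patient VALUES (?, ?)",
        [(1, "P-001"), (2, "P-002")],
    )
    conn.executemany(
        "INSERT INTO measurement VALUES (?, ?, ?)",
        [(1, "HR", 72.0), (1, "HR", 76.0), (2, "HR", 90.0)],
    )
    return conn


def mean_by_patient(conn, signal):
    """Join the two tables and average one signal per patient."""
    cursor = conn.execute(
        """
        SELECT p.study_code, AVG(m.value)
        FROM patient AS p
        JOIN measurement AS m ON m.patient_id = p.id
        WHERE m.signal = ?
        GROUP BY p.study_code
        ORDER BY p.study_code
        """,
        (signal,),
    )
    return cursor.fetchall()


conn = build_demo_db()
print(mean_by_patient(conn, "HR"))
```

The `?` placeholders pass values as parameters instead of building SQL strings by hand, which avoids injection problems when queries include user input.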
Data integrity and security: The Health Insurance Portability and Accountability Act (HIPAA)
and what it means for data management, de-identifying patient data (handling PHI), data
security best practices, making data available to the public—implications for data transparency
and large-scale data mining.
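The de-identification topic can be illustrated with a very small sketch: drop direct identifiers and replace the record number with a keyed pseudonym, so that records from the same patient can still be linked. This is not a claim of HIPAA compliance. The field names, the identifier list, and the key handling are assumptions, and real de-identification must cover many more identifier types (dates, geographic detail, and so on).

```python
import hashlib
import hmac

# Illustrative list only; it is far from exhaustive.
DIRECT_IDENTIFIERS = {"name", "mrn", "address", "phone"}


def pseudonym(mrn: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym from a record number with a keyed hash."""
    digest = hmac.new(secret_key, mrn.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(record: dict, secret_key: bytes) -> dict:
    """Remove direct identifiers and add a pseudonymous subject ID."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["subject_id"] = pseudonym(record["mrn"], secret_key)
    return cleaned


# Hypothetical record and key; a real key must be stored securely
# and kept separate from the de-identified data.
record = {"name": "Jane Doe", "mrn": "12345", "phone": "555-0100", "heart_rate": 72}
safe = deidentify(record, secret_key=b"demo-key-not-for-real-use")
print(sorted(safe))  # ['heart_rate', 'subject_id']
```

A keyed hash (HMAC) is used rather than a plain hash because record numbers come from a small, guessable space and an unkeyed hash could be reversed by brute force.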
Coalition Institutions
The coding Queen and her Court…
Abby Dobyns
Princesses of Python
Rhaya Johnson
Regie Felix and Adaeze Anyanwu
And a Princeling….
Jamie Tillett
Acknowledgements
Loma Linda
• Andy Hopper
• Traci Marin
• Charles Wang
• Wilson Aruni
• Valery Filippov
CWRU
•  Michael De Georgia
•  Kenneth Loparo
•  Frank Jacono
•  Farhad Kaffashi
UC Riverside
• Thomas Girke (Bioinformatics)
La Sierra University
•  Marvin Payne
CSU San Bernardino
•  Art Concepcion (Bioinformatics)
UC Irvine
•  Alex Nicolau (Comp Sci/Bioinf)
My laboratory’s git repository: https://github.com/drcgw/bass
Questions?!
Further reading
• Doing Data Science by Cathy O’Neil and Rachel Schutt
• Data Analysis with Open-Source Tools by Philipp Janert
• The Art of R Programming by Norman Matloff
• R for Everyone by Jared P. Lander
• Python for Data Analysis by Wes McKinney
• Think Python by Allen B. Downey
• Think Stats by Allen B. Downey
• Think Complexity by Allen B. Downey
• Every one of Edward Tufte’s books (The Visual Display
of Quantitative Information, Visual Explanations,
Envisioning Information, Beautiful Evidence)
Example: Patient physiology waveforms + EMR
Example: Interrogating sequence data

More Related Content

What's hot

Data Governance in two different data archives: When is a federal data reposi...
Data Governance in two different data archives: When is a federal data reposi...Data Governance in two different data archives: When is a federal data reposi...
Data Governance in two different data archives: When is a federal data reposi...
Carolyn Ten Holter
 
Sharing Confidential Data in ICPSR
Sharing Confidential Data in ICPSRSharing Confidential Data in ICPSR
Sharing Confidential Data in ICPSR
ARDC
 
Enhancing Our Capacity for Large Health Dataset Analysis
Enhancing Our Capacity for Large Health Dataset AnalysisEnhancing Our Capacity for Large Health Dataset Analysis
Enhancing Our Capacity for Large Health Dataset Analysis
CTSI at UCSF
 
Survey of research data management practices up2010digschol2011
Survey of research data management practices up2010digschol2011Survey of research data management practices up2010digschol2011
Survey of research data management practices up2010digschol2011heila1
 
development of information technology
development of information technologydevelopment of information technology
development of information technologyBiqie1995
 
DataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management PlanningDataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management Planning
DataONE
 
Open interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBIOpen interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBI
Pistoia Alliance
 
RDAP 16 Poster: Measuring adoption of Electronic Lab Notebooks and their impa...
RDAP 16 Poster: Measuring adoption of Electronic Lab Notebooks and their impa...RDAP 16 Poster: Measuring adoption of Electronic Lab Notebooks and their impa...
RDAP 16 Poster: Measuring adoption of Electronic Lab Notebooks and their impa...
ASIS&T
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
Warren Kibbe
 
An Introduction to Machine Learning and Genomics
An Introduction to Machine Learning and GenomicsAn Introduction to Machine Learning and Genomics
An Introduction to Machine Learning and Genomics
Brittany Lasseigne, Ph.D.
 
Open science and data sharing: the DataFirst experience/Martin Wittenberg
Open science and data sharing: the DataFirst experience/Martin WittenbergOpen science and data sharing: the DataFirst experience/Martin Wittenberg
Open science and data sharing: the DataFirst experience/Martin Wittenberg
African Open Science Platform
 
SWOT Analysis - What Does it Tell Us?
SWOT Analysis - What Does it Tell Us?SWOT Analysis - What Does it Tell Us?
SWOT Analysis - What Does it Tell Us?
Philip Bourne
 
Jane Howard
Jane HowardJane Howard
Jane Howard
Jane Howard
 
PhRMA Some Early Thoughts
PhRMA Some Early ThoughtsPhRMA Some Early Thoughts
PhRMA Some Early Thoughts
Philip Bourne
 

What's hot (16)

Data Governance in two different data archives: When is a federal data reposi...
Data Governance in two different data archives: When is a federal data reposi...Data Governance in two different data archives: When is a federal data reposi...
Data Governance in two different data archives: When is a federal data reposi...
 
Sharing Confidential Data in ICPSR
Sharing Confidential Data in ICPSRSharing Confidential Data in ICPSR
Sharing Confidential Data in ICPSR
 
Enhancing Our Capacity for Large Health Dataset Analysis
Enhancing Our Capacity for Large Health Dataset AnalysisEnhancing Our Capacity for Large Health Dataset Analysis
Enhancing Our Capacity for Large Health Dataset Analysis
 
Irt
IrtIrt
Irt
 
Irt
IrtIrt
Irt
 
Survey of research data management practices up2010digschol2011
Survey of research data management practices up2010digschol2011Survey of research data management practices up2010digschol2011
Survey of research data management practices up2010digschol2011
 
development of information technology
development of information technologydevelopment of information technology
development of information technology
 
DataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management PlanningDataONE Education Module 03: Data Management Planning
DataONE Education Module 03: Data Management Planning
 
Open interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBIOpen interoperability standards, tools and services at EMBL-EBI
Open interoperability standards, tools and services at EMBL-EBI
 
RDAP 16 Poster: Measuring adoption of Electronic Lab Notebooks and their impa...
RDAP 16 Poster: Measuring adoption of Electronic Lab Notebooks and their impa...RDAP 16 Poster: Measuring adoption of Electronic Lab Notebooks and their impa...
RDAP 16 Poster: Measuring adoption of Electronic Lab Notebooks and their impa...
 
Data Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health SystemData Harmonization for a Molecularly Driven Health System
Data Harmonization for a Molecularly Driven Health System
 
An Introduction to Machine Learning and Genomics
An Introduction to Machine Learning and GenomicsAn Introduction to Machine Learning and Genomics
An Introduction to Machine Learning and Genomics
 
Open science and data sharing: the DataFirst experience/Martin Wittenberg
Open science and data sharing: the DataFirst experience/Martin WittenbergOpen science and data sharing: the DataFirst experience/Martin Wittenberg
Open science and data sharing: the DataFirst experience/Martin Wittenberg
 
SWOT Analysis - What Does it Tell Us?
SWOT Analysis - What Does it Tell Us?SWOT Analysis - What Does it Tell Us?
SWOT Analysis - What Does it Tell Us?
 
Jane Howard
Jane HowardJane Howard
Jane Howard
 
PhRMA Some Early Thoughts
PhRMA Some Early ThoughtsPhRMA Some Early Thoughts
PhRMA Some Early Thoughts
 

Viewers also liked

Big data in biology
Big data in biologyBig data in biology
Big data in biology
Omkar Reddy
 
Masterworks talk on Big Data and the implications of petascale science
Masterworks talk on Big Data and the implications of petascale scienceMasterworks talk on Big Data and the implications of petascale science
Masterworks talk on Big Data and the implications of petascale scienceDeepak Singh
 
Network biology - A basis for large-scale biomedica data mining
Network biology - A basis for large-scale biomedica data miningNetwork biology - A basis for large-scale biomedica data mining
Network biology - A basis for large-scale biomedica data miningLars Juhl Jensen
 
Network biology - Large-scale integration of data and text
Network biology - Large-scale integration of data and textNetwork biology - Large-scale integration of data and text
Network biology - Large-scale integration of data and text
Lars Juhl Jensen
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
Dataconomy Media
 
Bioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big DataBioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big Data
Philip Bourne
 
Network biology: Large-scale data and text mining
Network biology: Large-scale data and text miningNetwork biology: Large-scale data and text mining
Network biology: Large-scale data and text mining
Lars Juhl Jensen
 
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...c.titus.brown
 
Network biology: Large-scale data integration and text mining
Network biology: Large-scale data integration and text miningNetwork biology: Large-scale data integration and text mining
Network biology: Large-scale data integration and text miningLars Juhl Jensen
 
Big data biology for pythonistas: getting in on the genomics revolution
Big data biology for pythonistas: getting in on the genomics revolutionBig data biology for pythonistas: getting in on the genomics revolution
Big data biology for pythonistas: getting in on the genomics revolution
Darya Vanichkina
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
Guy Coates
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use cases
Guy Coates
 
Future Architectures for genomics
Future Architectures for genomicsFuture Architectures for genomics
Future Architectures for genomics
Guy Coates
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
c.titus.brown
 
Systems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems levelSystems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems levelLars Juhl Jensen
 
Big Data, Computational Biology & the Future of Strategic Planning for Research
Big Data, Computational Biology & the Future of Strategic Planning for ResearchBig Data, Computational Biology & the Future of Strategic Planning for Research
Big Data, Computational Biology & the Future of Strategic Planning for Research
NBBJDesign
 

Viewers also liked (16)

Big data in biology
Big data in biologyBig data in biology
Big data in biology
 
Masterworks talk on Big Data and the implications of petascale science
Masterworks talk on Big Data and the implications of petascale scienceMasterworks talk on Big Data and the implications of petascale science
Masterworks talk on Big Data and the implications of petascale science
 
Network biology - A basis for large-scale biomedica data mining
Network biology - A basis for large-scale biomedica data miningNetwork biology - A basis for large-scale biomedica data mining
Network biology - A basis for large-scale biomedica data mining
 
Network biology - Large-scale integration of data and text
Network biology - Large-scale integration of data and textNetwork biology - Large-scale integration of data and text
Network biology - Large-scale integration of data and text
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
 
Bioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big DataBioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big Data
 
Network biology: Large-scale data and text mining
Network biology: Large-scale data and text miningNetwork biology: Large-scale data and text mining
Network biology: Large-scale data and text mining
 
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
2014 talk at NYU CUSP: "Biology Caught the Bus: Now what? Sequencing, Big Dat...
 
Network biology: Large-scale data integration and text mining
Network biology: Large-scale data integration and text miningNetwork biology: Large-scale data integration and text mining
Network biology: Large-scale data integration and text mining
 
Big data biology for pythonistas: getting in on the genomics revolution
Big data biology for pythonistas: getting in on the genomics revolutionBig data biology for pythonistas: getting in on the genomics revolution
Big data biology for pythonistas: getting in on the genomics revolution
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 
Life sciences big data use cases
Life sciences big data use casesLife sciences big data use cases
Life sciences big data use cases
 
Future Architectures for genomics
Future Architectures for genomicsFuture Architectures for genomics
Future Architectures for genomics
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
Systems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems levelSystems biology - Understanding biology at the systems level
Systems biology - Understanding biology at the systems level
 
Big Data, Computational Biology & the Future of Strategic Planning for Research
Big Data, Computational Biology & the Future of Strategic Planning for ResearchBig Data, Computational Biology & the Future of Strategic Planning for Research
Big Data, Computational Biology & the Future of Strategic Planning for Research
 

Similar to 2015 04-18-wilson cg

Data discovery and sharing at UCLH
Data discovery and sharing at UCLHData discovery and sharing at UCLH
Data discovery and sharing at UCLH
Jisc
 
2016 09 cxo forum
2016 09 cxo forum2016 09 cxo forum
2016 09 cxo forum
Chris Dwan
 
Using Big Data for Improved Healthcare Operations and Analytics
Using Big Data for Improved Healthcare Operations and AnalyticsUsing Big Data for Improved Healthcare Operations and Analytics
Using Big Data for Improved Healthcare Operations and AnalyticsPerficient, Inc.
 
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET Webinar: Creating and Sustaining a FAIR Biomedical Data Ecosystem 10/09...
dkNET
 
BIG DATA.ppt
BIG DATA.pptBIG DATA.ppt
BIG DATA.ppt
UsmanAliyuAminu
 
Cri big data
Cri big dataCri big data
Cri big data
Putchong Uthayopas
 
Lecture 6_Data acquisition.pptx power points
Lecture 6_Data acquisition.pptx power pointsLecture 6_Data acquisition.pptx power points
Lecture 6_Data acquisition.pptx power points
Josephmwanika
 
Hadoop Enabled Healthcare
Hadoop Enabled HealthcareHadoop Enabled Healthcare
Hadoop Enabled Healthcare
DataWorks Summit
 
Open Access as a Means to Produce High Quality Data
Open Access as a Means to Produce High Quality DataOpen Access as a Means to Produce High Quality Data
Open Access as a Means to Produce High Quality Data
CGIAR Research Program on Dryland Systems
 
Combining Patient Records, Genomic Data and Environmental Data to Enable Tran...
Combining Patient Records, Genomic Data and Environmental Data to Enable Tran...Combining Patient Records, Genomic Data and Environmental Data to Enable Tran...
Combining Patient Records, Genomic Data and Environmental Data to Enable Tran...
Perficient, Inc.
 
ANDS health and medical data webinar 16 May. Storing and Publishing Health an...
ANDS health and medical data webinar 16 May. Storing and Publishing Health an...ANDS health and medical data webinar 16 May. Storing and Publishing Health an...
ANDS health and medical data webinar 16 May. Storing and Publishing Health an...
ARDC
 
Precision and Participatory Medicine - MEDINFO 2015 Panel on big data
Precision and Participatory Medicine - MEDINFO 2015 Panel on big dataPrecision and Participatory Medicine - MEDINFO 2015 Panel on big data
Precision and Participatory Medicine - MEDINFO 2015 Panel on big data
Health and Biomedical Informatics Centre @ The University of Melbourne
 
HETT Conference Olympic Central 2014 Integrating Healthcare Delivery
HETT Conference Olympic Central 2014 Integrating Healthcare DeliveryHETT Conference Olympic Central 2014 Integrating Healthcare Delivery
HETT Conference Olympic Central 2014 Integrating Healthcare Delivery
Elmar Flamme
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypse
ENUG
 
Pistoia alliance debates analytics 15-09-2015 16.00
Pistoia alliance debates   analytics 15-09-2015 16.00Pistoia alliance debates   analytics 15-09-2015 16.00
Pistoia alliance debates analytics 15-09-2015 16.00
Pistoia Alliance
 
Genome sharing projects around the world nijmegen oct 29 - 2015
Genome sharing projects around the world   nijmegen oct 29 - 2015Genome sharing projects around the world   nijmegen oct 29 - 2015
Genome sharing projects around the world nijmegen oct 29 - 2015
Fiona Nielsen
 
Data Warehousing: Bridging Islands of Health Information Systems
Data Warehousing: Bridging Islands of Health Information Systems Data Warehousing: Bridging Islands of Health Information Systems
Data Warehousing: Bridging Islands of Health Information Systems
MEASURE Evaluation
 
One View of Data Science
One View of Data ScienceOne View of Data Science
One View of Data Science
Philip Bourne
 
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Susanna-Assunta Sansone
 

2015 04-18-wilson cg

  • 1. Big Data Infrastructure for Translational Research Christopher G. Wilson, Ph.D. Associate Professor Physiology and Pediatrics Center for Perinatal Biology Translational Medicine, April 18th, 2015
  • 2. Disclosures The work reported here was supported, in part, by NIH grants: 1R01HL081622-01 (NHLBI) 1R03HD064830-01 (NICHD)
  • 5. Outline • Defining “Big Data” • Big data is of multiple modes/types • Scaling data acquisition to build Big Data sets •  Patient bed •  Unit •  Institution-wide • Continuing challenges
  • 7. What is “Big Data”? • Big data is a blanket term for any collection of data sets so large and complex that it becomes difficult to process using the typical data management tools and data processing applications. • Big data usually includes data sets so large that commonly used software (like Microsoft Office) cannot be used to capture, curate, manage, and process the data quickly and efficiently. • Big data set sizes are a constantly moving target ranging from hundreds of gigabytes (10^9 bytes), to terabytes (10^12 bytes) and even to petabytes (10^15 bytes) of data in a single data set.
  • 8. A feast of data! • The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s. • Global Internet traffic has reached almost 1000 exabytes (10^18 bytes) annually and continues to grow.* • The challenge for both business and research science is coming up with the tools to extract usable information from this data. *Cisco Systems estimate
  • 9. Where does so much data come from? Data sets grow to vast size because they are increasingly being gathered by: • Ubiquitous information-sensing mobile devices (phones, fitbits, jawbones, etc.) • Surveillance technologies (remote sensing devices like drones or traffic cameras) • Software logs from your internet activity (Hello—Facebook!) • Radio-frequency identification (RFID) tags • Wireless sensor networks (once again, the kind of thing your phone “wants” to attach to when you are out and about) • And scientific instruments, clinical monitors, patient samples…
  • 11. Work-flow of “Big Data” analysis
  • 12. Or… • Obtain data • Scrub data • Explore data • Model the data • Interpret the data • Present the data
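The six workflow steps on this slide can be sketched as a chain of small functions. This is a minimal illustration only: the heart-rate values, the plausibility limits (30 to 300 bpm), the trend tolerance, and every function name are hypothetical and are not taken from the talk.

```python
import csv
import io
import statistics

# Hypothetical raw export: one missing value and one implausible value.
RAW_CSV = """time_s,heart_rate_bpm
0,142
1,145
2,
3,139
4,999
5,148
"""


def obtain(text):
    """Obtain: read rows from a CSV source (here an in-memory string)."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows, low=30.0, high=300.0):
    """Scrub: drop rows with missing or implausible heart rates (assumed limits)."""
    clean = []
    for row in rows:
        value = row["heart_rate_bpm"].strip()
        if not value:
            continue
        rate = float(value)
        if low <= rate <= high:
            clean.append((float(row["time_s"]), rate))
    return clean


def explore(samples):
    """Explore: basic summary statistics of the cleaned samples."""
    rates = [rate for _, rate in samples]
    return {"n": len(rates), "mean": statistics.mean(rates)}


def model(samples):
    """Model: least-squares slope and intercept of rate against time."""
    times = [t for t, _ in samples]
    rates = [r for _, r in samples]
    t_mean, r_mean = statistics.mean(times), statistics.mean(rates)
    numerator = sum((t - t_mean) * (r - r_mean) for t, r in samples)
    denominator = sum((t - t_mean) ** 2 for t in times)
    slope = numerator / denominator
    return slope, r_mean - slope * t_mean


def interpret(slope, tolerance=0.5):
    """Interpret: turn the fitted slope into a plain-language trend."""
    if slope > tolerance:
        return "rising"
    if slope < -tolerance:
        return "falling"
    return "stable"


def present(summary, trend):
    """Present: format a one-line report."""
    return f"n={summary['n']} mean={summary['mean']:.1f} bpm trend={trend}"


samples = scrub(obtain(RAW_CSV))
summary = explore(samples)
slope, intercept = model(samples)
report = present(summary, interpret(slope))
print(report)
```

Keeping each step as its own function means a team member can replace one stage (for example, a different cleaning rule) without touching the others.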
  • 13. Data analytics is a team sport! •  Project manager—responsible for setting clear project objectives and deliverables. The project manager should be someone with more experience in data analysis and a more comprehensive background than the other team members. •  Statistician—should have a strong mathematics/statistics background and will be responsible for reporting and developing the statistics workflow for the project. •  Visualization specialist—responsible for the design/development of data visualization (figures/animation) for the project. •  Database specialist—develops ontology/meta-tags to represent the data and incorporate this information in the team's chosen database schema. •  Content Expert—has the strongest background in the focus area of the project (Physiologist, systems biologist, molecular biologist, biochemist, clinician, etc.) and is responsible for providing background material relevant to the project's focus. •  Web developer/integrator—responsible for web-content related to the project, including the final report formatting (for web/hardcopy display). •  Data analyst/programmer—the most junior member of the team will take on general responsibilities to assist the other team members. This is a learning opportunity for a team member who is new to data analysis and needs time to develop the skills necessary to fully participate in the workflow.
  • 14. Data analytics is a team sport! • Project manager/content expert (physician/scientist) • Database/web developer • Statistician/data viz • Programmer • Team members can have multiple roles…
  • 15. What tools are typically used? • A 64-bit computing environment is typical (big RAM and big storage, massively parallel software running on clusters/cloud servers) • Data is acquired and stored in a database (SQL for some, but NoSQL and distributed stores like Hadoop, MongoDB, CouchDB, Clusterpoint, etc. are “better”) • Data screening & cleaning using “scripting” languages (Perl or Python typically) and processing using tools like MapReduce • “Industrial strength” statistical packages (typically R, SAS, or SPSS) • Visualization (D3/IDL/MATLAB/Python/Plot.ly, etc.) • Metadata tagging (XML and variants)
  • 17. How can we meet the challenge of Big Data collection/integration in a translational setting?
  • 18. What are the challenges for clinicians/researchers? The growing amount of available biomedical data provides both opportunity and challenge for the translational investigator. • Molecular biology has provided tools to allow understanding of genomics and proteomics. • There is growing data on the connectomics of signaling pathways. • Patient demographic data and other EHR/EMR metrics are a resource that is only now being widely deployed and interrogated. • Patient physiology (bedside monitors) can be used to provide fundamental information about patient health and adaptation to pathophysiologies. • The Health Insurance Portability and Accountability Act of 1996 (HIPAA) is a necessary challenge for data handling.
  • 19. Courtesy Michael De Georgia & J. Michael Schmidt
  • 20. Big Data to Decisions! » Technology challenges for “Data to Decisions” ~  Transforming data from multiple sources into meaningful information (evidence-context dependent) ~  Association of data from diverse heterogeneous, asynchronous sources ~  Merging/fusion of information for alerts and decision support ~  Human guided processing and analysis » Multi-source analysis for pattern discovery: extract & synthesize information from diverse data. [Diagram: multiple sources feed three stages. Source-to-Evidence (information processing & extraction): text analytics, image analysis, signal processing, data association. Data Fusion (alerting & decision support): combine information, weigh evidence, real-time alerting. User Interface (display & analysis): visualization, queries, data provenance, sensitivity.]
  • 21. Real-time Decision Support Providing useful information to the clinician » Real-time decision support to clinicians at the point of care ~  Codify best practice protocols ~  Enable efficient treatment decisions ~  Reduce needless procedures ~  Optimize coordination among care givers ~  Reduce the probability of mistakes being made » Key features that affect decision support ~  Methods to retrieve, merge, and present data and information ~  Algorithms to extract information from complex, heterogeneous data ~  Visualization/graphical feedback to better understand patient conditions » Automated alerting for conditions of concern ~  Combining information across data streams ~  Accumulation of weak evidence from multiple sources ~  Enhanced retrieval and visualization of information
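One common way to illustrate the “accumulation of weak evidence from multiple sources” is to add log-likelihood ratios to a prior, naive-Bayes style. The sketch below is an assumption about how such an alert could work, not the method used in the system described here. It treats the data streams as conditionally independent, and the prior, the likelihood ratios, and the alert threshold are invented for the example.

```python
import math


def accumulate_evidence(prior_probability, likelihood_ratios):
    """Combine a prior with per-stream likelihood ratios in log-odds space.

    Assumes the streams are conditionally independent (a simplification).
    Returns the posterior probability of the condition of concern.
    """
    log_odds = math.log(prior_probability / (1.0 - prior_probability))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


def should_alert(posterior_probability, threshold=0.4):
    """Raise an alert once the posterior crosses a (hypothetical) threshold."""
    return posterior_probability >= threshold


# Hypothetical example: three weak signals, none decisive on its own.
prior = 0.05
single_stream = accumulate_evidence(prior, [3.0])
all_streams = accumulate_evidence(prior, [2.0, 3.0, 2.5])

print(f"one stream:    {single_stream:.3f} alert={should_alert(single_stream)}")
print(f"three streams: {all_streams:.3f} alert={should_alert(all_streams)}")
```

No single stream crosses the threshold here, but the three together do, which is the point of combining information across data streams.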
  • 22. Challenges inherent in Big Data Analytics • Capture • Curation • Storage • Search • Sharing • Transfer • Analysis • Visualization
  • 23. Data is multi-modal. Unified data set: • Physiology waveforms (ECG, EEG, SaO2, BP) • Radiology (X-Ray, MRI, CAT, etc.) • EMR/EHR • “-omics” data
  • 24. Bedside Patient Data Acquisition
  • 25. Scaling to a hospital-wide data center • Ken Loparo • Michael De Georgia • Frank Jacono • Farhad Kaffashi
  • 27. CWRU IMEDS™ Proof of Concept Demonstration
  • 28. Why is IMEDS™ Different? The Approach ~  “Bottom-up” development with clinicians and engineers working side-by-side ~  Open source architecture design ~  Total integrated, “plug-and-play” system solution ~  Unbiased approach ~  Unified effort, rather than stove-piped, “one-off” solutions to small pieces of the problem ~  Non-profit nation-wide consortium ~  Builds on existing infrastructures ~  Leverages best available technology, regardless of source
  • 29. Courtesy Michael De Georgia & J. Michael Schmidt
  • 30. Challenges inherent in Big Data Analytics • Capture • Curation • Storage • Search • Sharing • Transfer • Analysis • Visualization
  • 34. IPython interface http://ipython.org •  Reproducible •  Version controlled (git) •  Interactive analysis
  • 35. Challenges inherent in Big Data Analytics • Capture • Curation • Storage • Search • Sharing • Transfer • Analysis • Visualization
  • 36. Worldwide movement for FAIR data Barend Mons and Susanna-Assunta Sansone http://bd2k.nih.gov/workshops.html#ADDS
  • 37. Launched on May 27th, 2014 • A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these • Credit for sharing your data • Focused on reuse and reproducibility • Peer reviewed, curated • Promoting Community Data Repositories • Open Access • Courtesy of Susanna-Assunta Sansone, PhD
  • 38. Challenges inherent in Big Data Analytics • Capture • Curation • Storage • Search • Sharing • Transfer • Analysis • Visualization
  • 39. Data Analysis Methods [Diagram labels: Integrated Patient Database; Data Processing; Decision Tree Analysis; Artificial Neural Network; Mechanistic Approaches; Graphical Approaches; Bayesian Network; Hierarchical Clustering; Probabilistic Approaches; Classical Statistical Inference; Bayesian Statistical Inference; Complex Systems Analysis; Time Domain; Frequency Domain; Scale Invariant (Fractal) Analysis; Approximate Entropy]
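Approximate entropy, one of the complex-systems measures named on this slide, can be sketched in a few lines of plain Python. This follows the standard textbook definition (template length m, tolerance r, self-matches counted) and is not the implementation used by the author's group. Here r is an absolute tolerance. A common convention, assumed rather than taken from the slides, is to set it to about 0.2 times the standard deviation of the series.

```python
import math
import random


def approximate_entropy(series, m=2, r=0.2):
    """Approximate entropy of a sequence of numbers.

    m is the template length and r the absolute matching tolerance.
    Lower values indicate a more regular, predictable signal.
    """
    n = len(series)

    def phi(length):
        count = n - length + 1
        templates = [series[i:i + length] for i in range(count)]
        total = 0.0
        for a in templates:
            # Count templates within tolerance r (Chebyshev distance),
            # including the self-match, so the logarithm is always defined.
            matches = sum(
                1 for b in templates
                if max(abs(x - y) for x, y in zip(a, b)) <= r
            )
            total += math.log(matches / count)
        return total / count

    return phi(m) - phi(m + 1)


constant = [1.0] * 40
periodic = [0.0, 1.0] * 20
rng = random.Random(0)  # fixed seed so the example is repeatable
noisy = [rng.random() for _ in range(100)]

print("constant:", approximate_entropy(constant))
print("periodic:", approximate_entropy(periodic))
print("noisy:   ", approximate_entropy(noisy))
```

The constant and strictly alternating series score at or near zero, while the pseudo-random series scores clearly higher. This direct version does O(n²) comparisons, so it suits short segments rather than long recordings.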
  • 41. Python as a data analytics environment
  • 42. Advantages to using a Big Data approach • Speed of data reduction and analysis • Visualization of complex data sets can be done relatively quickly • Capacity for storage and processing of vast data sets is inherent in the tool stack • Scalability of cloud/cluster storage • Potential for “Big Impact” on research and clinical care
  • 43. Disadvantages to a Big Data approach • Often not hypothesis driven (a fishing mission?) • Requires expensive computing technology depending upon data processing and storage needs • Requires significant programming skill to develop and use the tool stack • Typically requires “team based” data analysis and management (programmer, database manager, design/visualization person, etc.) • Having lots of data doesn’t mean you have an obvious or easy way to extract the information!
  • 44. Summary • We live in a data-rich era. • The data available to us is multi-modal and requires integration. • Data collection and integration can occur at many scales (bedside to institution) but the data must be converted into usable information. • Team-based science depends upon a wide range of data analytics skills. • Curation, reproducibility, and shared access to data remain ongoing challenges.
  • 45. Where do you find your data analytics team members?
  • 46. Syllabus Overview (10 week course) • Foundations 1: Using text editors, using the IPython notebook for data exploration, using version control software (git), using the class wiki. • Foundations 2: Using IPython/NumPy/SciPy, importing and manipulating data with Pandas, data visualization in IPython. • Analysis Methods: Basic signal theory overview, time-series data, plotting (lines, histograms, bars, etc.), dynamical systems analyses of data variability, information theory measures (entropy) of complexity, frequency domain/spectral measures (FFT, time-varying spectrum), wavelets. • Handling Sequence data: Using R/Bioconductor, differences between mRNA-Seq, gene-array, proteomics, and deep-sequencing data, visualizing data from gene/RNA arrays. • Data set storage and retrieval: Basics of relational databases, SQL vs. NoSQL, cloud storage/NAS/computing clusters, interfacing with Hadoop/MapReduce, metadata and ontology for biomedical/patient data (XML), using secure databases (REDCap). • Data integrity and security: The Health Insurance Portability and Accountability Act (HIPAA) and what it means for data management, de-identifying patient data (handling PHI), data security best practices, making data available to the public: implications for data transparency and large-scale data mining.
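As a small illustration of the frequency-domain measures listed in the syllabus, the sketch below finds the dominant frequency of a sampled signal with a direct discrete Fourier transform. It is written in plain Python so that it runs anywhere. In the course's own tool stack one would more likely use a NumPy FFT routine such as `numpy.fft.rfft`, which is an assumption about practice rather than something the slides show. The 2 Hz test signal and the 50 Hz sampling rate are invented for the example.

```python
import cmath
import math


def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC spectral component.

    Uses a direct O(n^2) discrete Fourier transform, which is fine for
    short segments; an FFT gives the same result much faster.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]  # remove the DC offset

    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        power = abs(coefficient) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate_hz / n


# Hypothetical signal: a 2 Hz oscillation sampled at 50 Hz for 4 seconds.
SAMPLE_RATE_HZ = 50.0
signal = [
    10.0 + math.sin(2 * math.pi * 2.0 * i / SAMPLE_RATE_HZ)
    for i in range(200)
]

peak_hz = dominant_frequency(signal, SAMPLE_RATE_HZ)
print(f"dominant frequency: {peak_hz} Hz")
```

The frequency resolution is the sampling rate divided by the number of samples (0.25 Hz here), so longer segments resolve closely spaced rhythms better.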
  • 48. The coding Queen and her Court… Abby Dobyns • Princesses of Python: Rhaya Johnson, Regie Felix, and Adaeze Anyanwu • And a Princeling… Jamie Tillett
  • 49. Acknowledgements • Loma Linda: Andy Hopper, Traci Marin, Charles Wang, Wilson Aruni, Valery Filippov • CWRU: Michael De Georgia, Kenneth Loparo, Frank Jacono, Farhad Kaffashi • UC Riverside: Thomas Girke (Bioinformatics) • La Sierra University: Marvin Payne • CSU San Bernardino: Art Concepcion (Bioinformatics) • UC Irvine: Alex Nicolau (Comp Sci/Bioinf) • My laboratory’s git repository: https://github.com/drcgw/bass
  • 51. Further reading • Doing Data Science by Cathy O’Neil and Rachel Schutt • Data Analysis with Open-Source Tools by Philipp Janert • The Art of R Programming by Norman Matloff • R for Everyone by Jared P. Lander • Python for Data Analysis by Wes McKinney • Think Python by Allen B. Downey • Think Stats by Allen B. Downey • Think Complexity by Allen B. Downey • Every one of Edward Tufte’s books (The Visual Display of Quantitative Information, Visual Explanations, Envisioning Information, Beautiful Evidence)
  • 52. Example: Patient physiology waveforms + EMR
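The slide's own example is not reproduced in this transcript. As a stand-in, the sketch below shows one simple way to bring the two data types together: tag each waveform-derived sample with the most recent EMR event at or before its timestamp. The field layout, the event labels, and the function name are hypothetical.

```python
import bisect


def attach_latest_event(samples, events):
    """Tag each sample with the most recent event at or before its time.

    samples: list of (time_s, value) tuples sorted by time
    events:  list of (time_s, label) tuples sorted by time
    Returns (time_s, value, label) tuples; label is None before any event.
    """
    event_times = [t for t, _ in events]
    merged = []
    for time_s, value in samples:
        index = bisect.bisect_right(event_times, time_s) - 1
        label = events[index][1] if index >= 0 else None
        merged.append((time_s, value, label))
    return merged


# Hypothetical data: heart rate every 10 s, plus two charted EMR events.
heart_rate = [(0, 140), (10, 150), (20, 160)]
emr_events = [(5, "caffeine dose"), (20, "position change")]

for row in attach_latest_event(heart_rate, emr_events):
    print(row)
```

Because both inputs are sorted by time, each lookup is a binary search, so the merge scales to long recordings.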