Building Data
Ecosystems for
Accelerated
Discovery
April 29, 2020
Adam Kraut
@adamkraut
adam@bioteam.net
2|
The BioTeam
Virtual company founded in 2002
Staffed by scientists turned technologists
Technology agnostic and vendor independent
Pioneers of open-source distributed computing
Translate scientific drivers into innovative solutions
Providing strategic guidance and deep collaboration
Assess > Design > Build > Implement > Train > Support
BioTeam is independent and committed to Science
3|
The Central
Problems
Our primary mission is to solve complex problems at the
intersection of science, technology, and data
Most of our clients are struggling with central problems:
Science is changing faster than IT
Advanced infrastructure increases complexity
Distributed data is difficult to manage at scale
Our data is not findable
Our data is not accessible
Our data is not interoperable
Our data is not reusable
4|
The Data
Ecosystem
A data ecosystem is a set of infrastructure and services that
empowers a community of scientists and engineers.
Key features of a healthy Life Sciences data ecosystem:
Data Discoverability
Data Integrity at the Origin
Common Languages
Pipelines and Infrastructure as Code
Microservices and frontends
Experiment tracking and shared Workspaces
Continuous Delivery mindset for ML and Discovery
5|
Science at the
Speed of Light
Science is rate limited by our ability to generate and test a
hypothesis
Consider the foundational layers of your ecosystem. Primarily we
look at the Science Network to understand the data movement
challenges and access patterns.
We recommend you plan ahead and have faster data paths
between lab instruments generating data and your analysis tools.
Bring compute to the data and data to the compute.
In a worst-case scenario, you actually halt experiments in
progress and destroy your potential with inferior networking.
In a best-case scenario, you have a loss-free high-speed network
designed to match the capabilities and capacities of your science.
photo: Ann Lingard
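A rough way to see why the data path matters is to estimate how long one instrument run takes to reach the analysis tools. The sketch below is illustrative arithmetic only: the 10 TB dataset size, the link speeds, and the 80% sustained efficiency are assumptions, not figures from this talk.

```python
def transfer_hours(dataset_terabytes: float, link_gbps: float,
                   efficiency: float = 0.8) -> float:
    """Estimate hours needed to move a dataset over a network link.

    efficiency is the assumed fraction of line rate actually sustained
    (protocol overhead, packet loss, storage bottlenecks).
    """
    if dataset_terabytes < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("sizes and speeds must be positive, efficiency in (0, 1]")
    bits = dataset_terabytes * 1e12 * 8            # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 10 TB run moved over three link speeds.
    for gbps in (1, 10, 100):
        print(f"{gbps:>3} Gb/s: {transfer_hours(10, gbps):.1f} h")
    #   1 Gb/s: 27.8 h
    #  10 Gb/s: 2.8 h
    # 100 Gb/s: 0.3 h
```

Under these assumptions, a 1 Gb/s path turns a single run into more than a day of waiting, which is the kind of delay that stalls the hypothesis-and-test loop.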
6|
Data
Discoverability
The primary goal of a data scientist is to locate data, make
sense of it, and evaluate if it is trustworthy or not.
Datasets often diverge into silos which become problematic.
Human nature creates silos.
Applications and databases create silos.
Businesses and geography create silos.
Searching and finding data is usually our primary objective.
Assessing the quality is a secondary supporting objective.
Need: Globally Unique IDs and resource resolver services.
Need: Defined metadata at the point of data instantiation.
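The two needs above can be sketched together as a tiny in-memory registry: every dataset gets a globally unique ID when it is created, required metadata is enforced at that moment, and a resolver maps the ID back to a location. The class names, the required field names, and the example location are hypothetical; a real resolver would be a networked service with persistent storage.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DataRecord:
    guid: str          # globally unique identifier
    location: str      # where the bytes live (path, URL, object key)
    metadata: dict     # descriptive metadata captured at creation
    created_at: str    # UTC timestamp in ISO 8601 format


class Resolver:
    """In-memory stand-in for a resource resolver service (illustrative only)."""

    # Hypothetical minimum metadata required at the point of instantiation.
    REQUIRED = ("instrument", "project", "sample_id")

    def __init__(self):
        self._records = {}

    def register(self, location: str, metadata: dict) -> DataRecord:
        missing = [key for key in self.REQUIRED if key not in metadata]
        if missing:
            raise ValueError(f"missing metadata fields: {missing}")
        guid = str(uuid.uuid4())  # random UUID, no central coordination needed
        record = DataRecord(guid, location, dict(metadata),
                            datetime.now(timezone.utc).isoformat())
        self._records[guid] = record
        return record

    def resolve(self, guid: str) -> DataRecord:
        # Raises KeyError for unknown identifiers.
        return self._records[guid]
```

Rejecting a registration that lacks metadata is the design choice that matters here: data that cannot be described does not get an ID, so it cannot quietly become a silo.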
7|
Data Integrity at
the Origin
Applying ML algorithms requires the highest level of data
integrity to be effective.
https://github.com/lyft/amundsen
Data objects should come with metadata that conforms to a
dictionary or ontology. A rich data store is harmonized, indexed
in various databases, discoverable, and queryable.
Good data hygiene is paramount. Promote upstream integrity of
the data objects to empower your downstream analytics.
Automatically infer partial metadata from information in silos.
We see an increased usage of graph databases such as Neo4j and
other scale-first storage systems like Redshift and SciDB.
The best case scenario is high-quality curated datasets for training
more accurate models and algorithms.
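As a minimal sketch of what "metadata that conforms to a dictionary" can mean in practice, the check below validates a record against a small data dictionary with a controlled vocabulary. The field names and allowed values are made up for illustration and do not come from any particular standard.

```python
from typing import Dict, List

# Hypothetical data dictionary: field name -> rule.
DATA_DICTIONARY: Dict[str, dict] = {
    "sample_id": {"type": str, "required": True},
    "organism": {"type": str, "required": True,
                 "allowed": {"Homo sapiens", "Mus musculus"}},
    "read_length": {"type": int, "required": False},
}


def validate(record: dict, dictionary: Dict[str, dict] = DATA_DICTIONARY) -> List[str]:
    """Return a list of problems; an empty list means the record conforms."""
    errors = []
    for name, rule in dictionary.items():
        if name not in record:
            if rule.get("required"):
                errors.append(f"{name}: missing required field")
            continue
        value = record[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
        elif "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{name}: {value!r} not in controlled vocabulary")
    # Fields the dictionary does not define are flagged rather than silently kept.
    for name in record:
        if name not in dictionary:
            errors.append(f"{name}: not defined in the dictionary")
    return errors
```

Running a check like this where the data is produced, rather than downstream, is one way to promote upstream integrity.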
8|
Common
Languages
Controlled Vocabularies, Ontologies, and Data Dictionaries
Cross-functional teams require more efficient communication and
alignment up and down the chain of command.
Adopt and align around standard semantics, APIs, and formats
such as GA4GH, OpenAPI, HL7, Parquet.
Establish new domain-specific languages to avoid sharp edges.
Choose your programming language wisely. Adopt a language with the
broadest compatibility across your tools and platforms.
We primarily recommend Python, Go, or JavaScript.
Gen3 Data Dictionary
9|
Pipelines and IaC
Informatics pipelines are benefitting from advances in software
development
Our team continues to use Ansible playbooks and Chef
cookbooks for server configuration, along with Terraform
and CloudFormation for cloud provisioning and overall
environment integration.
This is even more critical in Hybrid Cloud scenarios where
significant gaps exist in core infrastructure components.
In AI and ML projects we expect an increase in Kubernetes tooling
and frameworks such as Helm and Kubeflow.
10|
Microservices and
micro frontends
“Serverless” architecture trend creates new design patterns.
https://blog.acolyer.org/2020/03/02/firecracker/
A Berkeley View on Serverless Computing
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-3.pdf
Patterns for Serverless Functions
Data Lakes, internal/robust API, state machines
Event patterns, sidecars, eventual consistency
Formal Foundations of Serverless Computing
Composition and new abstractions focused on reuse
See also: TLA+
11|
Experiment
Tracking and
Workspaces
Data science methodology is iterative and requires
collaboration
Project Jupyter continues to see mainstream adoption as a go-to for
computational notebooks and literate programming.
JupyterHub as a multi-user notebook server is the most popular
analysis and visualization component among our clients.
Start off with shared spreadsheets or docs in a repo or wiki.
The objective is tracking experimental outcomes, performance,
parameters, data provenance, and access control authorizations.
Improving the UX of using GPUs and Accelerators.
See also: SageMaker, Colab, Nextflow, Cromwell, TensorBoard
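One step up from a shared spreadsheet is an append-only log file kept in a repo. The sketch below records parameters, outcomes, and a checksum of each input file for provenance. The function names and the JSON Lines layout are assumptions made for illustration, not a tool recommended in this talk.

```python
import hashlib
import json
import time


def file_sha256(path) -> str:
    """Checksum a file in chunks so large inputs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_run(log_path, params: dict, metrics: dict, input_files=()) -> dict:
    """Append one experiment run to a JSON Lines log and return the entry."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "params": params,
        "metrics": metrics,
        # Provenance: which exact input bytes produced these metrics.
        "inputs": {str(path): file_sha256(path) for path in input_files},
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry


def best_run(log_path, metric: str) -> dict:
    """Return the logged run with the highest value for the given metric."""
    with open(log_path, encoding="utf-8") as handle:
        runs = [json.loads(line) for line in handle if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])
```

Because entries are only ever appended, the log doubles as a history that the whole team can review and diff.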
12|
Continuous
Delivery for ML
and Data Science
Discipline of bringing DevOps principles and practices to ML
DevOps teams should bridge the gap between ML training
environments and deploying models using CI/CD techniques.
Eliminate manual handoffs between teams, reduce cycle time
between training models and deploying them.
Automate the end-to-end process. Versioning, Testing,
Deployments of ML components: data, model, and code.
Trend towards explainability of models as selection criteria.
An explainable model allows us to say how a decision was made.
Critical to understanding fundamental biology and chemistry.
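Versioning the three ML components together (data, model, and code) can be sketched as a single manifest whose release ID changes whenever any one of them changes. This is an illustrative sketch under assumed names; a real pipeline would hash files or artifacts in storage rather than in-memory bytes.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def release_manifest(data_bytes: bytes, model_bytes: bytes, code_bytes: bytes) -> dict:
    """Build a manifest that pins data, model, and code to one release ID."""
    components = {
        "data": digest(data_bytes),
        "model": digest(model_bytes),
        "code": digest(code_bytes),
    }
    # The release ID is derived from all three digests, so it is reproducible
    # and changes if any single component changes.
    combined = digest(json.dumps(components, sort_keys=True).encode("utf-8"))
    return {"components": components, "release_id": combined[:12]}
```

A CI/CD job could compare the manifest of a candidate release against the deployed one to decide which tests and deployment steps need to run.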
13|
The 10x Engineer
pitfall
The “Unicorn” AI or ML specialist is a red flag that should be
avoided. Data Science is a Team Sport!
Teams of expert generalists with solid leadership principles
are the most successful.
Diversity is key in high-performance teams.
Recruit people with mixed talent and experience.
Include clinicians, lawyers, and other outside expertise.
Continuous learning and improvement.
Every member of the team has an opportunity to lead.
Requires discipline at first and strong communication.
Check your ego, work hard, and put the team first.
Thanks!
April 29, 2020
Adam Kraut
@adamkraut
adam@bioteam.net

Building Data Ecosystems for Accelerated Discovery

  • 1. Building Data Ecosystems for Accelerated Discovery April 29, 2020 Adam Kraut @adamkraut adam@bioteam.net
  • 2. The BioTeam. Virtual company founded in 2002. Staffed by scientists turned technologists. Technology agnostic and vendor independent. Pioneers of open-source distributed computing. Translate scientific drivers into innovative solutions. Providing strategic guidance and deep collaboration. Assess > Design > Build > Implement > Train > Support. BioTeam is independent and committed to Science.
  • 3. The Central Problems. Our primary mission is to solve complex problems at the intersection of science, technology, and data. Most of our clients are struggling with central problems: Science is changing faster than IT. Advanced infrastructure increases complexity. Distributed data is difficult to manage at scale. Our data is not findable. Our data is not accessible. Our data is not interoperable. Our data is not reusable.
  • 4. The Data Ecosystem. A data ecosystem is a set of infrastructure and services that empowers a community of scientists and engineers. Key features of a healthy Life Sciences data ecosystem: Data Discoverability; Data Integrity at the Origin; Common Languages; Pipelines and Infrastructure as Code; Microservices and frontends; Experiment tracking and shared Workspaces; Continuous Delivery mindset for ML and Discovery.
  • 5. Science at the Speed of Light. Science is rate limited by our ability to generate and test a hypothesis. Consider the foundational layers of your ecosystem. Primarily we look at the Science Network to understand the data movement challenges and access patterns. We recommend you plan ahead and have faster data paths between lab instruments generating data and your analysis tools. Bring compute to the data and data to the compute. In a worst case scenario, you actually halt experiments in progress and destroy your potential with inferior networking. In a best case scenario, you have a loss-free high-speed network designed to match the capabilities and capacities of your science. Photo: Ann Lingard
  • 6. Data Discoverability. The primary goal of a data scientist is to locate data, make sense of it, and evaluate whether it is trustworthy. Datasets often diverge into silos, which become problematic. Human nature creates silos. Applications and databases create silos. Businesses and geography create silos. Searching and finding data is usually our primary objective. Assessing the quality is a secondary supporting objective. Need: Globally Unique IDs and resource resolver services. Need: Defined metadata at the point of data instantiation.
  • 7. Data Integrity at the Origin. Applying ML algorithms requires the highest level of data integrity to be effective. (https://github.com/lyft/amundsen) Data objects should come with metadata that conforms to a dictionary or ontology. A rich data store is harmonized, indexed in various databases, discoverable, and queryable. Good data hygiene is paramount. Promote upstream integrity of the data objects to empower your downstream analytics. Automatically infer partial metadata from information in silos. We see an increased usage of graph databases such as Neo4j and other scale-first storage systems like Redshift and SciDB. The best case scenario is high-quality curated datasets for training more accurate models and algorithms.
  • 8. Common Languages. Controlled Vocabularies, Ontologies, and Data Dictionaries. Cross-functional teams require more efficient communication and alignment up and down the chain of command. Adopt and align around standard semantics, APIs, and formats such as GA4GH, OpenAPI, HL7, and Parquet. Establish new domain-specific languages to avoid sharp edges. Choose your programming language wisely. Adopt a language with the broadest compatibility across your tools and platforms. We primarily recommend Python, Go, or JavaScript. Gen3 Data Dictionary
  • 9. Pipelines and IaC. Informatics pipelines are benefitting from advances in software development. Our team continues to use Ansible playbooks and Chef cookbooks for server configuration, along with Terraform and CloudFormation for cloud provisioning and overall environment integration. This is even more critical in Hybrid Cloud scenarios where significant gaps exist in core infrastructure components. In AI and ML projects we expect an increase in Kubernetes tooling and frameworks such as Helm and Kubeflow.
  • 10. Microservices and micro frontends. “Serverless” architecture trend creates new design patterns. https://blog.acolyer.org/2020/03/02/firecracker/ A Berkeley View on Serverless Computing: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-3.pdf Patterns for Serverless Functions: Data Lakes, internal/robust API, state machines; event patterns, sidecars, eventual consistency. Formal Foundations of Serverless Computing: composition and new abstractions focused on reuse. See also: TLA+
  • 11. Experiment Tracking and Workspaces. Data science methodology is iterative and requires collaboration. Jupyter Project continues to see mainstream adoption as a go-to for computational notebooks and literate programming. JupyterHub as a multi-user notebook server is the most popular analysis and visualization component among our clients. Start off with shared spreadsheets or docs in a repo or wiki. The objective is tracking experimental outcomes, performance, parameters, data provenance, and access control authorizations. Improving the UX of using GPUs and Accelerators. See also: SageMaker, Colab, Nextflow, Cromwell, TensorBoard
  • 12. Continuous Delivery for ML and Data Science. Discipline of bringing DevOps principles and practices to ML. DevOps teams should bridge the gap between ML training environments and deploying models using CI/CD techniques. Eliminate manual handoffs between teams, reduce cycle time between training models and deploying them. Automate the end-to-end process. Versioning, Testing, Deployments of ML components: data, model, and code. Trend towards explainability of models as selection criteria. An explainable model allows us to say how a decision was made. Critical to understanding fundamental biology and chemistry.
  • 13. The 10x Engineer pitfall. The “Unicorn” AI or ML specialist is a red flag that should be avoided. Data Science is a Team Sport! Teams of expert generalists with solid leadership principles are the most successful. Diversity is key in high-performance teams. Recruit people with mixed talent and experience. Include clinicians, lawyers, and other outside expertise. Continuous learning and improvement. Every member of the team has an opportunity to lead. Requires discipline at first and strong communication. Check your ego, work hard, and put the team first.
  • 14. Thanks! April 29, 2020 Adam Kraut @adamkraut adam@bioteam.net