Many of the data management and analysis challenges in microbiome research are shared with genomics and other life-science big-data disciplines. However, some aspects are specific: some are intrinsic to microbiome data, some relate to the maturity of the field, and others to extracting business value from the data.
Digital transformation of translational medicine (Eagle Genomics)
Anthony Finbow, Executive Chairman, and William Spooner, Chief Science Officer, discuss Eagle Genomics' software product, marketed at pharmaceutical and biotech companies, which enables radical improvements in the productivity of scientific research.
Expert Panel on Data Challenges in Translational Research (Eagle Genomics)
A panel of experts including Alexandre Passioukov, VP Translational Medicine at Pierre Fabre, Xose Fernandez, Chief Data Officer at Institut Curie, and Abel Ureta-Vidal, CEO at Eagle Genomics, share their first-hand experience of enabling translational research in pharmaceutical and biomedical organisations, and discuss the challenges around establishing streamlined, seamless data handling and governance to accelerate innovation.
Validating microbiome claims – including the latest DNA techniques (Eagle Genomics)
Abel Ureta-Vidal, Founder and CEO of Eagle Genomics, discusses how advanced DNA techniques help us to identify and characterise the microbiome, leading to ways to prove cosmetic claims. Presented at the in-cosmetics Formulation Summit, 25th October 2017.
Innovative applications of microphysiological systems (MPS) have been growing over the past decade, especially with respect to the use of complex human tissues for assessing the safety of drug candidates – but broad industry adoption of MPS methods has not yet become a reality.
This webinar addresses some recent advances in MPS development and begins to explore the barriers to increased incorporation of MPS to improve drug safety assessment and to provide safer, more effective drugs into the clinical pipeline.
Dr. Dennis Wang discusses possible ways to make ML methods more powerful for discovery and to reduce ambiguity within translational medicine, allowing data-informed decision-making to deliver the next generation of diagnostics and therapeutics to patients more quickly, at lower cost, and at scale.
The talk by Dr. Dennis Wang was followed by a panel discussion with Mr. Albert Wang, M. Eng., Head, IT Business Partner, Translational Research & Technologies, Bristol-Myers Squibb.
Federated Learning (FL) is a learning paradigm that enables collaborative learning without centralizing datasets. In this webinar, NVIDIA presents the concept of FL and discusses how it can help overcome some of the barriers seen in the development of AI-based solutions for pharma, genomics and healthcare. Following the presentation, the panel debates other elements that could drive the adoption of digital approaches more widely and help answer currently intractable science and business questions.
With the explosion of interest in both enhanced knowledge management and open science, the past few years have seen considerable discussion about making scientific data “FAIR” — findable, accessible, interoperable, and reusable. The problem is that most scientific datasets are not FAIR. When left to their own devices, scientists do an absolutely terrible job creating the metadata that describe the experimental datasets that make their way into online repositories. The lack of standardization makes it extremely difficult for other investigators to locate relevant datasets, to re-analyse them, and to integrate them with other data. The Center for Expanded Data Annotation and Retrieval (CEDAR) has the goal of enhancing the authoring of experimental metadata to make online datasets more useful to the scientific community. The CEDAR workbench for metadata management will be presented in this webinar. CEDAR illustrates the importance of semantic technology in driving open science. It also demonstrates a means for simplifying access to scientific datasets and enhancing the reuse of the data to drive new discoveries.
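To make the standardization point concrete, here is a minimal Python sketch of the difference between free-text metadata and the controlled, ontology-backed metadata that FAIR tooling encourages. The field names are hypothetical and the ontology IDs are just illustrative examples; this is not CEDAR's actual schema or API.

```python
# Hypothetical illustration of free-text vs. standardized experimental metadata.
# Field names and ontology term IDs are examples, not CEDAR's actual schema.

free_text_metadata = {
    "organism": "human",        # ambiguous: "human", "H. sapiens", "homo sapiens"...
    "tissue": "skin swab",      # free text defeats cross-dataset search
    "age": "34",                # units unstated
}

standardized_metadata = {
    "organism": {"label": "Homo sapiens", "term": "NCBITaxon:9606"},
    "tissue":   {"label": "skin of body", "term": "UBERON:0002097"},
    "age":      {"value": 34, "unit": "year"},
}

def is_machine_usable(record):
    """A record is machine-findable only if every field carries a controlled term or typed value."""
    return all(isinstance(v, dict) for v in record.values())

print(is_machine_usable(free_text_metadata))     # False
print(is_machine_usable(standardized_metadata))  # True
```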
In the late Fall and Winter of 2018, the Pistoia Alliance, in cooperation with Elsevier and the charitable organizations Cures within Reach and Mission: Cure, ran a datathon aiming to find drugs suitable for the treatment of childhood chronic pancreatitis, a rare disease that causes extreme suffering. The datathon resulted in the identification of four candidate compounds in just under three months. In this webinar our speakers discuss the technologies that made this leap possible.
The poster was made during a 2.5-day NSF-sponsored workshop, held August 5th–7th on the University of Washington, Seattle campus, which brought together 100 graduate students from diverse domain sciences and engineering with data scientists from industry and academia to discuss and collaborate on Big Data / Data Science challenges. In addition to keynote presentations from high-profile speakers, the participants presented posters covering their own research and worked collaboratively to begin to solve some of the Grand Challenge problems facing Data Enabled Science & Engineering disciplines.
This presentation was made by 10 graduate students whose common thread was high-dimensional biological data.
For more info, see http://depts.washington.edu/dswkshp/
Themes and objectives:
To position FAIR as a key enabler to automate and accelerate R&D process workflows
FAIR Implementation within the context of a use case
Grounded in precise outcomes (e.g. faster and bigger science / more reuse of data to enhance value / increased ability to share data for collaboration and partnership)
To make data actionable through FAIR interoperability
Speakers:
Mathew Woodwark, Head of Data Infrastructure and Tools, Data Science & AI, AstraZeneca
Erik Schultes, International Science Coordinator, GO-FAIR
Georges Heiter, Founder & CEO, Databiology
Practical Drug Discovery using Explainable Artificial Intelligence (Al Dossetter)
How to build AI systems that enable the drug-hunting medicinal chemist in their day-to-day work. Levels of AI are described, along with the meaning and context of Explainable AI for medicinal chemists. Six medicinal chemistry projects are described, covering Matched Molecular Pair Analysis (MMPA), Machine Learning and Permutative MMPA. In each case, the talk shows how a system can be built to drill back to chemical sub-structures so that effective decisions can be made.
Presentation by the Alliance of European Life Sciences Law Firms (Julian Hitchcock and Sofie van der Meulen) on legal aspects of big data in pharma. Topics: privacy, IP, medical devices and IVD.
Sharing and standards – clinical innovation and partnering... (Christopher Hart)
This talk acknowledges the increasing need for cooperation and collaboration in data sharing and access, describes the complexity this can bring, and then describes some ways to simplify it.
Originally presented at Terrapinn's Clinical Innovation and Partnering World, March 8-9, 2017.
http://www.terrapinn.com/conference/innovation-and-partnering/index.stm
Expert panel on industrialising microbiomics – with Unilever (Eagle Genomics)
A panel of experts, including Dr Barry Murphy, Microbiomics Science Lead at Unilever, Dr Craig McAnulla, Senior Consultant for Bioinformatics, and Dr Yasmin Alam-Faruque, Scientific Data Manager/Biocurator, discuss first-hand experience and views on how to get better insights faster from microbiome data.
FocalCXM presentation on improving productivity in life sciences research (FocalCXM)
This webinar is organized by FocalCXM in partnership with Salesforce. As part of this webinar, we will walk you through our vision and explain how the various aspects of Salesforce and Heroku come to life to improve the overall productivity of research in Life Sciences.
Workflow-Driven Geoinformatics Applications and Training in the Big Data Era (Ilkay Altintas, Ph.D.)
My slides from the Big Data and The Earth Sciences: Grand Challenges Workshop on May 31st, 2017. Workshop link: http://prp.ucsd.edu/events/big-data-and-the-earth-science-grand-challenges-workshop
Building an Intelligent Biobank to Power Research Decision-Making (Denodo)
This presentation is from the workshop "Building an Intelligent Biobank to Power Research Decision-Making" at the ISBER 2015 Annual Meeting, by Lori A. Ball (Chief Operating Officer, President of Integrated Client Solutions at BioStorage Technologies, Inc.), Brian Brunner (Senior Manager, Clinical Practice at LabAnswer) and Suresh Chandrasekaran (Senior Vice President at Denodo).
The workshop covers three topic areas:
- Research sample intelligence: the growing need for Global Data Integration (Biobank Sample and Data Stakeholders).
- Building a research data integration plan and cloud sourcing strategy (data integration).
- How data virtualization works and the value it delivers (a data virtualization introduction, solution portfolio and current customers in Life Sciences industry).
The biomedical R&D environment is increasingly dependent on data meta-analysis and bioinformatics to support research advancements. The integration of biorepository sample inventory data with biomarker and clinical research information has become a priority to R&D organizations. Therefore, a flexible IT system for managing sample collections, integrating sample data with clinical data and providing a data virtualization platform will enable the advancement of research studies. This workshop provides an overview of how sample data integration, virtualization and analytics can lead to more streamlined and unified sample intelligence to support global biobanking for future research.
Automating Data Science over a Human Genomics Knowledge Base (Vaticle)
Radouane Oudrhiri, the CTO of Eagle Genomics, talks about how Eagle Genomics is building a platform for automating data science over a human genomics knowledge base. Rad dives into Eagle Genomics' architecture and discusses how Grakn serves as the knowledge-base foundation of the system. Rad also gives a brief history of databases and semantic expressiveness, and explains how Grakn fits into the big picture.
Radouane Oudrhiri, CTO, Eagle Genomics
Radouane has extensive experience in leading world-class software and data-intensive system development across industries from telecoms to healthcare, nuclear, automotive and finance. He is a Lean/Six Sigma Master Black Belt specialising in high-tech, IT and software engineering, and is recognised as a leader and early adopter of Lean/Six Sigma and DFSS for IT and software. He is a Fellow of the Royal Statistical Society (RSS) and a member of the ISO Technical Committee TC69 (Applications of Statistical Methods), where he is co-author of the Lean & Six Sigma standard (ISO 18404) as well as a new standard under development (Design for Six Sigma). He is also part of the newly formed international group on Big Data (nominated by BSI as the UK representative/expert), and has chaired the ISPE (International Society for Pharmaceutical Engineering) working group on Measurement Systems for Automated Processes/Systems.
How Logilab ELN helps Organizations in Research Data Management (Agaram Technologies)
Research Data Management (RDM) refers to the methods of recording, organizing, storing, processing, and caring for information produced or used during a research project.
In this webinar, Dale Sanders will provide a pragmatic, step-by-step, and measurable roadmap for the adoption of analytics in healthcare – a roadmap that organizations can use to plot their strategy and evaluate vendors; and that vendors can use to develop their products. Attendees will have a chance to learn about:
1. The details of his eight-level model
2. A brief introduction to the HIMSS/IIA DELTA Model
3. The importance of permanent organizational teams to sustain improvements from analytic investments
4. The process of curating and maturing data governance
5. The coordination of a data acquisition strategy with payment and reimbursement strategies
If Big Data is data that exceeds the processing capacity of conventional systems, thereby necessitating alternative processing measures, we are looking at an essentially technological challenge that IT managers are best equipped to address.
The DCC is currently working with 18 HEIs to support and develop their capabilities in the management of research data and, whilst the aforementioned challenge is not usually core to their expressed concerns, are there particular issues of curation inherent to Big Data that might force a different perspective?
We have some understanding of Big Data from our contacts in the Astronomy and High Energy Physics domains, and the scale and speed of development in Genomics data generation is well known, but the inability to provide sufficient processing capacity is not one of their more frequent complaints.
That’s not to say that Big Science and its Big Data are free of challenges in data curation; only that they are shared with their lesser cousins, where one might say that the real challenge is less one of size than diversity and complexity.
This brief presentation explores those aspects of data curation that go beyond the challenges of processing power but which may lend a broader perspective to the technology selection process.
Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy.
(Em)Powering Science: High-Performance Infrastructure in Biomedical Science (Ari Berman)
We’ll explore current and future considerations in advanced computing architectures that empower the conversion of data into knowledge. The life sciences produce more data than any other major science domain, making analytics and scientific computing cornerstones of modern research programs and methodologies. We’ll highlight the remarkable biomedical discoveries that are emerging through combined efforts, and discuss where and how the right infrastructure can catalyze the advancement of human knowledge. On-premises architectures as well as cloud, hybrid, and exotic architectures will all be discussed. It’s likely that all life science researchers will require advanced computing to perform their research within the next year. However, there has been less focus on advanced computing infrastructure across the industry, given the increased availability of public cloud infrastructure and anything-as-a-service models.
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ... (Bonnie Hurwitz)
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to microbes. Overview of work underway to add applications and computational analysis pipelines to iPlant for metagenomics and microbial ecology.
Life Technologies' Journey to the Cloud (ENT208) | AWS re:Invent 2013 (Amazon Web Services)
Life Technologies initially planned to build out its own data center infrastructure, but when a cost analysis revealed that by using Amazon Web Services the company would save $325,000 in hardware alone for a single new initiative, the company decided to use AWS instead. Within 6 months of adopting AWS, Life Technologies launched their Digital Hub platform in production, which now undergirds Life Technologies' entire instrumentation product suite. This immediately began to decrease their time-to-market and enhance their customers' user experience. In this session, we provide an overview of our path to the AWS cloud, with particular focus on the evaluation criteria used to make a cloud vendor decision. We also discuss the lessons learned since going into production.
Data Harmonization for a Molecularly Driven Health System (Warren Kibbe)
Maximizing the value of data, computing, and data science in an academic medical center, or 'towards a molecularly informed Learning Health System'. Given in October at the University of Florida in Gainesville.
Considerations and challenges in building an end-to-end microbiome workflow
1. data - science - life
Considerations & Challenges in
Building an End-to-End
Microbiome Workflow
2. data - science - life
Craig earned his Ph.D. in Microbiology at the University of Warwick, then completed his postdoctoral research at the John Innes Centre before joining EBI and developing their bioinformatics capability. Dr McAnulla has extensive experience in developing large-scale pipelines for analyzing microbiome datasets. His skills cover both software and biology and he often provides deep insights to industry experts on these topics.
Presenters
Raminderpal earned his PhD in the semiconductor industry, where he spent a decade building advanced technologies for semiconductor modelling and data mining. In 2012, Dr Singh took the helm as Business Development Executive for IBM Research's genomics program, where he developed and led the execution of the go-to-market and business plan for the IBM Watson Genomics program. In addition to many published papers, Dr Singh has two books and twelve patents.
Dr Craig McAnulla
Senior Consultant Bioinformatics at Eagle Genomics
Dr Raminderpal Singh
Vice President Business Development at Eagle Genomics
Please Tweet your questions
@eaglegen or enter them
into GoTo Webinar
3. Agenda
1. Considerations
2. Challenges
3. Industrialized Microbiome Workflow & Collaboration Environment
4. Deployment of a Collaboration System
data - science - life
4. data - science - life
Considerations for successful microbiome data management and analysis:
1. Intrinsic factors
2. The relatively early maturity of the field
3. Calculation of business value
1. Considerations
5. data - science - life
Intrinsic
• Microbiomes are complex and diverse biological systems
• Many components – organisms, genes, pathways, metabolites, etc.
• More DNA/genes than the human genome
• Taxonomic diversity
• Many sources of error
• Appropriate study design and metadata collection are crucial.
1. Considerations - Intrinsic
6. data - science - life
Maturity
• Changing sequencing technologies
• Move from amplicon-based to shotgun
• Rapidly increasing data volumes
• Limited reference data
• Fast-moving software development
• Future data re-analysis with improved methods must be considered.
1. Considerations - Maturity
7. data - science - life
1. Considerations – Business value
Business value
• Commercial exploitation of the microbiome is in its infancy, and still at an exploratory phase.
• Many research findings remain controversial, and some early claims about the gut microbiome have not
been replicated.
• The true potential of microbiome data is still to be tapped, and novel experimental designs are being
developed to maximize the value generated.
8. data - science - life
Analysis of microbiomics data at scale faces several challenges:
• Metadata management
• Compute resources necessary for storage and analysis
• Statistical techniques
Metadata Management
• Crucial to analysis
• Often done poorly
• Needs to encompass all relevant information
• Check for batch effects (see the sketch after this slide)
2. Challenges
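As a hedged illustration of the batch-effect check flagged above, here is a minimal Python sketch; the file names ("counts.tsv", "metadata.tsv") and the "batch" column are hypothetical. Samples that separate by sequencing batch rather than by biological condition in a simple ordination are a warning sign.

```python
# Minimal sketch of a batch-effect check on a microbiome count table.
# File and column names ("counts.tsv", "metadata.tsv", "batch") are hypothetical.
import pandas as pd
from sklearn.decomposition import PCA

counts = pd.read_csv("counts.tsv", sep="\t", index_col=0)   # samples x taxa
meta = pd.read_csv("metadata.tsv", sep="\t", index_col=0)   # per-sample metadata

# Convert to relative abundances so library size does not masquerade as a batch effect.
rel = counts.div(counts.sum(axis=1), axis=0)

# Project samples onto the first two principal components.
pcs = PCA(n_components=2).fit_transform(rel)
summary = pd.DataFrame(pcs, index=rel.index, columns=["PC1", "PC2"]).join(meta["batch"])

# If per-batch means separate strongly on PC1/PC2, suspect a batch effect.
print(summary.groupby("batch")[["PC1", "PC2"]].mean())
```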
9. data - science - life
2. Challenges
Data Storage and Compute Resources
• Shotgun metagenomics example study - 1.7 billion paired-end Illumina reads, 266.5 Gb.
• Bigger data volume than the human genome at 30x coverage.
• Study size is increasing
How to store this?
• Analysis selection critical
• Analyses MUST work with these data volumes
• e.g. Kraken, HUMAnN2
• Compute resources
• e.g. Kraken needs a multicore machine with >100 GB of RAM (example invocation after this slide)
• Solution – the cloud
• AWS, Azure
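To make the compute point concrete, a sketch of classifying one paired-end shotgun sample with Kraken 2, driven from Python, is shown below. The database path and FASTQ names are placeholders, and memory needs depend on the database used (standard Kraken databases can exceed 100 GB of RAM).

```python
# Sketch of classifying one paired-end shotgun sample with Kraken 2.
# Paths ("/data/kraken2_db", the FASTQ names) are placeholders.
import subprocess

subprocess.run(
    [
        "kraken2",
        "--db", "/data/kraken2_db",     # standard databases can need >100 GB RAM
        "--threads", "16",
        "--paired",
        "--report", "sample1.kreport",  # per-taxon summary
        "--output", "sample1.kraken",   # per-read assignments
        "sample1_R1.fastq.gz",
        "sample1_R2.fastq.gz",
    ],
    check=True,
)
```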
10. data - science - life
Statistical Techniques
• Data normalization
• rarefaction, normalization, log transformation? (compared in the sketch after this slide)
• Statistical techniques
• parametric, nonparametric?
• Confusion
• No correct answer
• Depends on the data
• function vs taxonomy
• shotgun vs amplicon
2. Challenges
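As a sketch of why the normalization choice matters, the snippet below applies three common options (rarefaction, total-sum scaling, and a log transform with a pseudocount) to the same toy count vector. As the slide says, there is no single correct answer; the right choice depends on the data and the downstream test.

```python
# Three common normalizations for microbiome counts, applied to a toy sample.
# Illustrative only; the right choice depends on data type and downstream test.
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([500, 120, 40, 5, 0, 1])   # reads per taxon in one sample

# 1. Rarefaction: subsample reads to a fixed depth (here 200) without replacement.
reads = np.repeat(np.arange(counts.size), counts)   # one entry per read, labelled by taxon
rarefied = np.bincount(rng.choice(reads, size=200, replace=False),
                       minlength=counts.size)

# 2. Total-sum scaling: convert to relative abundances.
tss = counts / counts.sum()

# 3. Log transform with a pseudocount (zeros make a plain log undefined).
log_counts = np.log(counts + 1)

print(rarefied, tss.round(3), log_counts.round(2), sep="\n")
```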
11. data - science - life
• Eagle Genomics leads the industry in
configuring and deploying best-in-class
microbiome workflow solutions.
• These run large-scale parallelized
pipelines cost-efficiently, with seamless
connectivity between the source data
and the scientists.
3. Industrialized Microbiome
Workflow & Collaboration
Environment
www.eaglegenomics.com
“Unilever’s digital data program now
processes genetic sequences twenty
times faster - without incurring higher
compute costs. In addition, its robust
architecture supports ten times as many
scientists, all working simultaneously.”
Pete Keeley, eScience Technical Lead - R&D IT
at Unilever
12. data - science - life
3. Industrialized Microbiome
Workflow & Collaboration
Environment
13. data - science - life
[Slide diagram] DATA COLLABORATION ARCHITECTURE – solving the metadata problem. Elements recoverable from the diagram:
• Microbiome data files (amplicon or shotgun) in cloud storage, indexed by eaglecatalog
• Metadata layer: eaglecatalog, which implements the security model and supports query/filtering
• CROs deposit data; the ELN contributes experiment metadata
• Integration with other omics (RNASeq, proteomics); federated data & results integration
• Taxonomy/function analysis reports (R, QIIME), population cohorts, visualization
• Collaborative environment for partners, with data sharing
• Feedback to experimental design and to the cost of experiment
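Purely as a hypothetical sketch of the catalog-centred flow in the diagram above, registering a sequencing run together with its experiment metadata might look like the toy code below. The "CatalogClient" class and its methods are invented for illustration; they are not Eagle Genomics' actual eaglecatalog API.

```python
# Hypothetical illustration only: "CatalogClient" and its methods are invented
# to show the catalog-centred flow in the diagram; they are not the real
# eaglecatalog API.
from dataclasses import dataclass, field

@dataclass
class CatalogClient:
    """Toy in-memory stand-in for a metadata catalog with a security model."""
    entries: dict = field(default_factory=dict)

    def register(self, dataset_id, uri, metadata, visibility="internal"):
        # Visibility stands in for the security model: who may query this entry.
        self.entries[dataset_id] = {"uri": uri, "metadata": metadata,
                                    "visibility": visibility}

    def query(self, **filters):
        # Filter entries on metadata fields, e.g. sample_type="shotgun".
        return [d for d, e in self.entries.items()
                if all(e["metadata"].get(k) == v for k, v in filters.items())]

catalog = CatalogClient()
catalog.register(
    "run-0001",
    uri="s3://bucket/microbiome/run-0001/",   # cloud storage location
    metadata={"sample_type": "shotgun", "source": "CRO-A", "eln_id": "EXP-42"},
    visibility="partners",                    # shared with collaborators
)
print(catalog.query(sample_type="shotgun"))   # -> ['run-0001']
```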
14. data - science - life
4. Deployment of a Collaboration
System
Eagle Genomics Proposal
Accelerating metadata adoption and scale up
• Offering a new rapid deployment for microbiome R&D organizations looking to
cost-efficiently build world-class integrated workflows.
• Set up of an initial collaboration system for your team.
• 1 to 4 weeks
• Enabling you to self-load your data to the catalog – leading to a standardized and
streamlined workflow that is ready for future scale-out
• Managing internal data, partner data, or public data
15. data - science - life
BENEFITS

1. Current situation: Data in silos, not reused
   Desired state: An easily searchable, discoverable, accessible and intuitive portal giving researchers access to all data and results
   Negative consequences of non-action: Significant researcher time spent in “data wrangling”; breakage pending, based on a projected bottleneck
   Positive business impact, using Eagle software: Improved time-to-insight through radically improved researcher productivity

2. Current situation: No correlation between experiments
   Desired state: Quick and easy exploratory analyses across experiments and data
   Negative consequences of non-action: Missed opportunities for insights; failure to leverage data as an asset
   Positive business impact, using Eagle software: Faster identification and generation of successful product innovation and insight, leading to reduced time to market

3. Current situation: No unified view of data
   Desired state: Data and analysis results stored in a unified catalog, with standardized & repeatable access using best-in-class technology
   Negative consequences of non-action: Data (internal, external) and the associated learning not available or exploited as necessary
   Positive business impact, using Eagle software: Competitive advantage through cutting-edge research, driving innovation in the product pipeline

4. Current situation: Lack of governance and data provenance
   Desired state: Secure authenticated access with relevant data provenance of data sets; simple data sharing with vendors/collaborators
   Negative consequences of non-action: Costly, inefficient and limited data sharing with vendors; risk of inappropriate data sharing
   Positive business impact, using Eagle software: An efficient, compliant, documented, evidenced and enforced data-sharing process; effective (time & cost) collaboration between vendors and internal teams

5. Current situation: Data is, in effect, archived
   Desired state: Catalog integrated into the scientific workflow, used continuously and keeping pace with innovation in the industry
   Negative consequences of non-action: Missed insights; siloed research activities, with high barriers to collaboration
   Positive business impact, using Eagle software: Systematic data sharing and re-use, driving faster, unique cross-functional insights and reducing wasted spend