Lisa Johnson's talk at the #ICG13 GigaScience Prize Track: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. Shenzhen, 26th October 2018
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...Alejandra Gonzalez-Beltran
Metagenomic Data Provenance and Management using the ISA infrastructure - overview, implementation patterns & software tools
Slides presented at EBI Metagenomics Bioinformatics course: http://www.ebi.ac.uk/training/course/metagenomics2014
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Spark Summit
Whole-genome-based metagenomics analyses hold the key to discovering novel species from microbial communities, revealing their full metabolic potential, and understanding their interactions with each other. Metagenomics projects based on next-generation sequencing typically produce 100 GB to 1,000 GB of unstructured data. Unlike many other big data problems, analysis of metagenomics data often generates temporary files 100 to 1,000 times the original size, posing a significant challenge for both hardware infrastructure and software algorithms. Here we report our experience evaluating Apache Spark for metagenomics data analysis in terms of speed, scalability, robustness, and, most importantly, ease of programming. We developed a Spark-based scalable metagenomics application to deconvolute individual genomes from a complex microbial community with thousands of species. We then systematically tested its performance on synthetic and real-world datasets using the Elastic MapReduce framework provided by Amazon Web Services. Our preliminary results suggest Spark provides a cost-effective solution with rapid development/deployment cycles for metagenomics data analysis. This experience likely extends to other big genomics data analyses, in both research and production settings.
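The deconvolution step described in the abstract can be illustrated with a toy sketch. This is not the authors' code: it uses composition-based binning (tetranucleotide frequency vectors and nearest-centroid assignment) in plain Python as a stand-in for the Spark map/reduce pattern; all names and data are illustrative.

```python
# Hypothetical sketch of the map/reduce pattern the abstract describes:
# bin reads from a mixed community by tetranucleotide composition.
# Pure-Python stand-in for Spark's map/reduceByKey; names are illustrative.
from collections import Counter
from itertools import product

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetra_freq(read):
    """Map step: normalized tetranucleotide frequency vector for one read."""
    counts = Counter(read[i:i + 4] for i in range(len(read) - 3))
    total = sum(counts[t] for t in TETRAMERS) or 1
    return [counts[t] / total for t in TETRAMERS]

def nearest_bin(vec, centroids):
    """Assign a read's composition signature to the closest genome bin."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
    return dists.index(min(dists))

reads = ["ACGTACGTACGT", "GGGGCCCCGGGG"]
centroids = [tetra_freq("ACGTACGTACGTACGT"), tetra_freq("GGGGCCCCGGGGCCCC")]
bins = [nearest_bin(tetra_freq(r), centroids) for r in reads]
```

In a real Spark job, `tetra_freq` would be the map function over an RDD of reads and the bin assignment a subsequent stage; the arithmetic is the same.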
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Spark Summit
Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at the population level rather than for a small number of individuals. This provides new power to whole-genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.
As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the dataset from the 1000 Genomes Project, with genomes of 2,504 individuals, includes nearly 85M genomic variants with a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches.
Random forest is one of the methods found to be useful in this context, both because of its potential for parallelization and its robustness. Although a number of big data implementations are available (including Spark ML), they are tuned for typical datasets with a large number of samples and a relatively small number of variables, and either fail or are inefficient in the GWAS context, especially since costly data preprocessing is usually required.
To address these problems, we have developed RandomForestHD, a Spark-based implementation optimized for highly dimensional datasets. We have successfully applied RandomForestHD to datasets beyond the reach of other tools, and for smaller datasets found its performance superior. We are currently applying RandomForestHD, released as part of the VariantSpark toolkit, to a number of WGAS studies.
In the presentation we will introduce the domain of WGAS and its challenges, present RandomForestHD with its design principles and implementation details with regard to Spark, compare its performance with other tools, and finally showcase the results of a few WGAS applications.
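To illustrate why random forests scale to p >> n data, here is a toy pure-Python sketch (not the RandomForestHD implementation): each "tree" is reduced to a decision stump trained on a random sqrt(p)-sized feature subset, so per-tree work stays bounded as the number of variants grows. All data and names are illustrative.

```python
# Illustrative sketch of the "wide" random forest idea: each tree sees only
# a small random subset of the p >> n features. Not the RandomForestHD code.
import math, random

def best_stump(X, y, feat_idx):
    """Pick the feature in feat_idx that best splits y at threshold 0.5."""
    def accuracy(j):
        preds = [1 if row[j] > 0.5 else 0 for row in X]
        return sum(p == t for p, t in zip(preds, y))
    return max(feat_idx, key=accuracy)

def wide_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    mtry = max(1, int(math.sqrt(p)))  # sqrt(p) candidate features per tree
    return [best_stump(X, y, rng.sample(range(p), mtry)) for _ in range(n_trees)]

# Toy data: feature 0 is causal, the remaining 99 are noise.
rng = random.Random(1)
X = [[rng.random() for _ in range(100)] for _ in range(40)]
y = [1 if row[0] > 0.5 else 0 for row in X]
trees = wide_forest(X, y)
# Feature "importance" = how often each feature wins the split.
importance = {j: trees.count(j) for j in set(trees)}
```

Real implementations grow full trees and distribute the feature columns across a cluster; the point here is only the mtry subsampling that keeps wide data tractable.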
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedSpark Summit
In 2001, it cost ~$100M to sequence a single human genome. In 2014, due to dramatic improvements in sequencing technology far outpacing Moore’s law, we entered the era of the $1,000 genome. At the same time, the power of genetics to impact medicine has become evident: for example, drugs with supporting genetic evidence have twice the clinical trial success rate. These factors have led to an explosion in the volume of genetic data, in the face of which existing analysis tools are breaking down.
Therefore, we began the open-source Hail project (https://hail.is) to be a scalable platform built on Apache Spark to enable the worldwide genetics community to build, share, and apply new tools. Hail is focused on variant-level (post-read) data; querying genetic data, annotations and sample data; and performing rare and common variant association analyses. Hail has already been used to analyze datasets with hundreds of thousands of exomes and tens of thousands of whole genomes.
We will give an overview of the goals of the Hail project and its architecture. The challenge of efficiently manipulating genetic data in Spark has led to several innovations that may have wider applicability, including an RDD-like abstraction for representing multidimensional data and an OrderedRDD abstraction for ordered data (for example, data indexed by position in the genome). Finally, we will discuss Hail performance and future directions.
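The OrderedRDD idea, keeping records sorted by genomic position so range queries and joins can exploit order rather than a full shuffle, can be sketched in plain Python. This is a conceptual stand-in, not Hail code; the data and names are illustrative.

```python
# Conceptual sketch of ordered genomic data: records kept sorted by
# (contig, position) so lookups use binary search instead of a scan.
from bisect import bisect_left, bisect_right

variants = sorted([
    ("1", 15000, "A", "T"),
    ("1", 905372, "C", "G"),
    ("2", 1234, "G", "A"),
    ("X", 99, "T", "C"),
])  # sorted by (contig, position): the ordering invariant

def query_range(records, contig, start, end):
    """All variants on `contig` with start <= position < end, via binary search."""
    keys = [(c, p) for c, p, _, _ in records]
    lo = bisect_left(keys, (contig, start))
    hi = bisect_right(keys, (contig, end - 1))
    return records[lo:hi]
```

In a distributed setting the same invariant lets each partition cover a contiguous key range, so a range query touches only the relevant partitions.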
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Golden Helix Inc
With a focus on scalable architecture and optimized native code that fully utilizes the CPU and RAM available, we can scale genomic analysis into sizes conventionally considered Big Data on a single host. In this webcast, we demonstrate recent innovations and features in Golden Helix solutions that enable the analysis of big data on your own terms.
Science has evolved from the isolated individual tinkering in the lab, through the era of the “gentleman scientist” with his or her assistant(s), to group-based then expansive collaboration, and now to an opportunity to collaborate with the world. With the advent of the internet the opportunity for crowd-sourced contribution and large-scale collaboration has exploded and, as a result, scientific discovery has been further enabled. The contributions of enormous open data sets, liberal licensing policies, and innovative technologies for mining and linking these data have given rise to platforms that are beginning to deliver on the promise of semantic technologies and nanopublications, facilitated by the unprecedented computational resources available today, especially the increasing capabilities of handheld devices. The speaker will provide an overview of his experiences in developing a crowdsourced platform for chemists allowing for data deposition, annotation and validation. The challenges of mapping chemical and pharmacological data, especially with regard to data quality, will be discussed. The promise of distributed participation in data analysis is already being realized.
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaDatabricks
In 2001, it cost ~$100M to sequence a single human genome. In 2014, due to dramatic improvements in sequencing technology far outpacing Moore’s law, we entered the era of the $1,000 genome. At the same time, the power of genetics to impact medicine has become evident. For example, drugs with supporting genetic evidence are twice as likely to succeed in clinical trials. These factors have led to an explosion in the volume of genetic data, in the face of which existing analysis tools are breaking down.
As a result, the Broad Institute began the open-source Hail project (https://hail.is), a scalable platform built on Apache Spark, to enable the worldwide genetics community to build, share and apply new tools. Hail is focused on variant-level (post-read) data; querying genetic data, as well as annotations, on variants and samples; and performing rare and common variant association analyses. Hail has already been used to analyze datasets with hundreds of thousands of exomes and tens of thousands of whole genomes, enabling dozens of major research projects.
exFrame: a Semantic Web Platform for Genomics ExperimentsTim Clark
slides from talk given at Bio-ontologies 2013, Berlin DE, 20 July 2013
Emily Merrill*, Stephane Corlosquet*, Paolo Ciccarese†*, Tim Clark*†‡, Sudeshna Das†*
* Massachusetts General Hospital
† Harvard Medical School
‡ School of Computer Science, University of Manchester
PMR metabolomics and transcriptomics database and its RESTful web APIs: A dat...Araport
The PMR database is a community resource for the deposition and analysis of metabolomics data and related transcriptomics data. PMR currently houses metabolomics data from over 25 species of eukaryotes. In this talk, we introduce PMR's RESTful web APIs for data sharing and demonstrate their applications in research, using Araport to provide Arabidopsis metabolomics data.
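As a hedged illustration of how a client might consume such a RESTful API, the sketch below composes a query URL and decodes a JSON response. The host, resource path, and parameters are hypothetical stand-ins, not PMR's documented endpoints.

```python
# Sketch of a generic REST client of the kind the talk describes.
# BASE, the resource names, and the parameters are hypothetical.
from urllib.parse import urlencode
import json
import urllib.request

BASE = "https://example.org/pmr/api"  # placeholder, not the real PMR host

def build_query(resource, **params):
    """Compose a REST query URL for a resource plus filter parameters."""
    qs = urlencode(sorted(params.items()))
    return f"{BASE}/{resource}?{qs}" if qs else f"{BASE}/{resource}"

def fetch_json(url):
    """GET a URL and decode the JSON body (network call, shown but not run)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

url = build_query("metabolites", species="Arabidopsis thaliana", limit=10)
```

A caller would then do `records = fetch_json(url)` and work with the decoded list or dict; the actual PMR resource names and response schema are documented by the project itself.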
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at ScaleAndy Petrella
A talk given at the BioBankCloud conference in Feb 2015 about distributed computing in the contexts of genomics and health.
In this talk, we presented the results we obtained exploring the 1000 Genomes data using ADAM, followed by an introduction to our scalable GA4GH server implementation built using ADAM, Apache Spark and Play Framework 2.
Annotopia open annotation services platformTim Clark
Annotopia is an open-access, open-source, open annotation services platform developed for scientific annotation of documents and datasets on the web using the W3C Open Annotation model http://www.openannotation.org/spec/core/.
Using Annotopia, virtually any client application including lightweight web clients, can create, selectively share, and access annotation of web documents and data. This can be done regardless of the ownership of the base objects being annotated.
Annotopia supports unstructured, semi-structured and fully structured (semantic) annotation; manual and automated (text mining) annotation; and permissions, groups, and sharing. It also provides access to specialized vocabulary and text analytics services.
Annotopia is an open source platform licensed under Apache 2.0.
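To make the annotation model concrete, here is a minimal annotation body in the style of the W3C Open Annotation model, of the kind a lightweight client might POST to a server such as Annotopia. The field values, and the exact set of properties used, are illustrative examples rather than a verified Annotopia payload.

```python
# Minimal Open Annotation-style payload as JSON-LD; values are illustrative,
# and the property selection is an assumption based on the OA core model.
import json

annotation = {
    "@context": "http://www.w3.org/ns/oa-context-20130208.json",
    "@type": "oa:Annotation",
    "motivatedBy": "oa:commenting",
    "hasBody": {
        "@type": "cnt:ContentAsText",
        "chars": "This figure needs a scale bar.",
    },
    "hasTarget": "http://example.org/paper.html",
    "annotatedBy": {"@type": "foaf:Person", "name": "Jane Researcher"},
}

payload = json.dumps(annotation, indent=2)
```

The key idea of the model is the body/target split: the body carries the annotation content, while the target identifies the web document or data being annotated, independently of who owns it.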
REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND E...Lisa K. Johnson, Ph.D.
Talk for ASLO Ocean Science Meeting, Honolulu, HI
March 3, 2017
Lisa J. Cohen, Harriet Alexander, C. Titus Brown
The Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) facilitated the generation of 678 Illumina RNA sequence datasets from a wide diversity of organisms spanning more than 40 phyla of cultured microbial eukaryotes collected from a variety of marine environments. This is the largest publicly available set of RNA sequencing data from a diversity of eukaryotic taxa with a standardized library preparation. We developed an automated and modularized de novo transcriptome assembly pipeline for the MMETSP data set that is extensible to accommodate both future software updates and additional samples. With this large set of assemblies from a diversity of species, we were able to quantitatively evaluate the qualities of individual transcriptomes. Moreover, a meta-analysis across the dataset revealed lineage-specific transcriptome characteristics, such as predicted open reading frames, contig features, unique k-mers and evaluation scores. Ultimately, a better understanding of these assemblies and annotations will enhance our ability to accurately identify and characterize genes of ecological and biogeochemical significance.
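One of the evaluation metrics mentioned above, unique k-mer content, is simple enough to sketch directly. This is a minimal illustration in plain Python, not the MMETSP pipeline code, which handles hundreds of gigabytes with specialized tools.

```python
# Minimal unique k-mer counter for a set of assembled contigs; a toy
# stand-in for the k-mer-based assembly metrics the abstract mentions.
def unique_kmers(seqs, k=21):
    """Count distinct k-mers across a set of contigs."""
    kmers = set()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return len(kmers)

# A 4-periodic repeat contributes only 4 distinct 21-mers; a homopolymer
# contributes 1, so low-complexity contigs score low on this metric.
contigs = ["ACGT" * 10, "T" * 25]
n = unique_kmers(contigs, k=21)
```

Comparing distinct k-mer counts against input read k-mers gives a rough, reference-free sense of how much sequence diversity an assembly captured.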
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Surya Saha
Rapidly spreading invasive diseases in systems with little or no prior experimental data or resources pose a unique set of challenges for growers, scientists, and regulators. As part of a USDA NIFA CAPS project focused on the psyllid, Diaphorina citri, we have released improved genomics resources including high-quality genome assemblies and annotation. We have also created an open-access web portal for analyses around the Citrus Greening/Huanglongbing disease complex. Citrusgreening.org includes pathosystem-wide resources and bioinformatics tools for multiple Citrus spp. hosts, the Asian citrus psyllid vector (ACP, Diaphorina citri), and multiple pathogens including Candidatus Liberibacter asiaticus (CLas). To the best of our knowledge, this is the first example of a database to use the pathosystem as a holistic framework to understand an insect-transmitted plant disease. Users can submit relevant data sets to enable sharing and allow the community to leverage their data within an integrated system. The system includes the metabolic pathway databases CitrusCyc and DiaphorinaCyc with organism-specific pathways that can be used to mine metabolomics, transcriptomics and proteomics results to identify pathways and regulatory mechanisms involved in disease response. The Psyllid Expression Network (PEN) contains expression profiles of ACP genes from multiple life stages, tissues, conditions and hosts. The Citrus Expression Network (CEN) contains public expression data from multiple tissues and conditions for various citrus hosts. All tools connect to a central database. The portal also includes electrical penetration graph (EPG) recordings, information about citrus rootstock trials and metabolomics data in addition to traditional omics data types, with a goal of combining and mining all information related to the Huanglongbing pathosystem.
User-friendly manual curation tools will allow the continuous improvement of the knowledge base as more experimental research is published. The portal can be accessed at https://citrusgreening.org/.
Cross-Kingdom Standards in Genomics, Epigenomics and MetagenomicsChristopher Mason
Challenges and biases in preparing, characterizing, and sequencing DNA and RNA can have significant impacts on research in genomics across all kingdoms of life, including experiments in single cells, RNA profiling, and metagenomics. Technical artifacts and contaminations can arise at each point of sample manipulation, extraction, sequencing, and analysis. Thus, the measurement and benchmarking of these potential sources of error are of paramount importance as next-generation sequencing (NGS) projects become more global and ubiquitous.
Fortunately, a variety of methods, standards, and technologies have recently emerged that improve measurements in genomics and sequencing, from the initial input material to the computational pipelines that process and annotate the data.
This webinar will review work to develop standards and their applications in genomics, including the ABRF-NGS Phase II Study on DNA sequencing; the FDA's Sequencing Quality Control Consortium (SEQC2); metagenomics standards efforts (ABRF, ATCC, Zymo, Metaquins); and the Epigenomics QC group of SEQC2. The webinar will also review the computational methods for detection, validation, and implementation of these genomic measures.
Small molecule identification and the new MassBankSteffen Neumann
Since its beginnings more than 10 years ago, the MassBank system has provided a user-friendly web interface. We have now improved data access, version control and issue tracking by moving the data to GitHub, allowing a whole new workflow and access route for bio- and cheminformatics users.
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Larry Smarr
06.09.15
Invited Talk, 2006 Synthetic Biology Symposium, Aliso Creek Inn, Laguna Beach, CA
Title: Building a Community Cyberinfrastructure to Support Marine Microbial Ecology Metagenomics
Building bioinformatics resources for the global communityExternalEvents
http://www.fao.org/about/meetings/wgs-on-food-safety-management/en/
Building bioinformatics resources for the global community. Presentation from the Technical Meeting on the impact of Whole Genome Sequencing (WGS) on food safety management and GMI-9, 23-25 May 2016, Rome, Italy.
WikiPathways: how open source and open data can make omics technology more us...Chris Evelo
Presentation about the collaborative development of open-source pathway analysis code and pathways, and about its usage in analytical software distributed with analytical instruments such as mass spectrometers.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Reusable Software and Open Data To Optimize AgricultureDavid LeBauer
Abstract:
Humans need a secure and sustainable food supply, and science can help. We have an opportunity to transform agriculture by combining knowledge of organisms and ecosystems to engineer ecosystems that sustainably produce food, fuel, and other services. The challenge is that the information we have is difficult to combine: measurements, theories, and laws are scattered across publications, notebooks, software, and human brains. We homogenize, encode, and automate the synthesis of data and mechanistic understanding in a way that links understanding at different scales and across domains. This allows extrapolation, prediction, and assessment. Reusable components allow the automated construction of new knowledge that can be used to assess, predict, and optimize agro-ecosystems.
Developing reusable software and open-access databases is hard, and examples will illustrate how we use the Predictive Ecosystem Analyzer (PEcAn, pecanproject.org), the Biofuel Ecophysiological Traits and Yields database (BETYdb, betydb.org), and ecophysiological crop models to predict crop yield, decide which crops to plant, and which traits can be selected for the next generation of data driven crop improvement. A next step is to automate the use of sensors mounted on robots, drones, and tractors to assess plants in the field. The TERRA Reference Phenotyping Platform (TERRA-Ref, terraref.github.io) will provide an open access database and computing platform on which researchers can use and develop tools that use sensor data to assess and manage agricultural and other terrestrial ecosystems.
TERRA-Ref will adopt existing standards and develop modular software components and common interfaces, in collaboration with researchers from iPlant, NEON, AgMIP, USDA, rOpenSci, ARPA-E, many scientists and industry partners. Our goal is to advance science by enabling efficient use, reuse, exchange, and creation of knowledge.
---
Invited talk for the "Informatics for Reproducibility in Earth and Environmental Science Research" session at the American Geophysical Union Fall Meeting, Dec 17 2015.
Precise elucidation of the many different biological features encoded in any genome requires careful examination and review by researchers, who gather and evaluate the available evidence to corroborate and modify gene predictions and other biological elements. This curation process allows them to resolve discrepancies and validate automated gene model hypotheses and alignments. This approach is the well-established practice for well-known genomes such as human, mouse, zebrafish, Drosophila, et cetera. Desktop Apollo was originally developed to meet these needs.
The cost of sequencing a genome has been dramatically reduced by several orders of magnitude in the last decade, and the natural consequence is that more and more researchers are sequencing more and more new genomes, both within populations and across species. Because individual researchers can now readily sequence many genomes of interest, the need for a universally accessible genome curation tool logically follows. Each new exome or genome sequenced requires visualization and curation to obtain biologically accurate genomic feature sets, even for a limited set of genes, because computational genome analysis remains an imperfect art. Additionally, unlike earlier genome projects, which had the advantage of more highly polished genomes, recent projects usually have lower coverage. Researchers therefore face additional work correcting for more frequent assembly errors and annotating genes split across multiple contigs.
Genome annotation is an inherently collaborative task; researchers only very rarely work in isolation, turning to colleagues for second opinions and insights from those with expertise in particular domains and gene families. The new JavaScript-based Apollo allows researchers real-time interactivity, breaking down large amounts of data into manageable portions to mobilize groups of researchers with shared interests. We are also focused on training the next generation of researchers by reaching out to educators to make these tools available as part of curricula via workshops and webinars, and through widely applied systems such as iPlant and DNA Subway. Here we offer details of our progress.
Presentation at Genome Informatics, Session (3) on Databases, Data Mining, Visualization, Ontologies and Curation.
Authors: Monica C Munoz-Torres, Suzanna E. Lewis, Ian Holmes, Colin Diesh, Deepak Unni, Christine Elsik.
IDW2022: A decade's experience in transparent and interactive publication of ...GigaScience, BGI Hong Kong
Scott Edmunds at International Data Week 2022: A decade's experience in transparent and interactive publication of FAIR data and software via an end-to-end XML publishing platform. 21st June 2022
GigaByte Chief Editor Scott Edmunds presents on how to prepare a data paper for the TDR and WHO sponsored call for data papers describing datasets on vectors of human diseases launched in Nov 2021. Presented at the GBIF webinar on 25th January 2022 and aimed at authors interested in submitting a manuscript to the series.
STM Week: Demonstrating bringing publications to life via an End-to-end XML p...GigaScience, BGI Hong Kong
Scott Edmunds at the STM Week 2020 Digital Publishing seminar on Demonstrating bringing publications to life via an End-to-end XML publishing platform. 2nd December 2020
Scott Edmunds: A new publishing workflow for rapid dissemination of genomes u...GigaScience, BGI Hong Kong
Scott Edmunds on a new publishing workflow for rapid dissemination of genomes using GigaByte & GigaDB. Presented at Biodiversity 2020 in the Annotation & Databases track, 9th October 2020.
Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...GigaScience, BGI Hong Kong
Scott Edmunds' talk at CODATA2019 on Quantifying how FAIR is Hong Kong: The Hong Kong Shareability of Hong Kong University Research Experiment. 19th September 2019 in Beijing
Scott Edmunds talk at IARC: How can we make science more trustworthy and FAIR...GigaScience, BGI Hong Kong
Scott Edmunds talk at IARC, Lyon. How can we make science more trustworthy and FAIR? Principled publishing for more evidence based research. 8th July 2019
PAGAsia19 - The Digitalization of Ruili Botanical Garden Project: Production...GigaScience, BGI Hong Kong
A 3-part talk presented at PAG Asia 2019 in Shenzhen: The Digitalization of Ruili Botanical Garden Project: Production, Curation and Re-Use. Presented by Huan Liu (CNGB), Scott Edmunds (GigaScience) & Stephen Tsui (CUHK). 8th June 2019
Democratising biodiversity and genomics research: open and citizen science to...GigaScience, BGI Hong Kong
Scott Edmunds at the China National GeneBank Youth Biodiversity MegaData Forum: Democratising biodiversity and genomics research: open and citizen science to build trust and fill the data gaps. 18th December 2018
Ricardo Wurmus at #ICG13: Reproducible genomics analysis pipelines with GNU Guix. Presented at the GigaScience Prize Track at the International Conference on Genomics, Shenzhen, 26th October 2018
Paul Pavlidis at #ICG13: Monitoring changes in the Gene Ontology and their im...GigaScience, BGI Hong Kong
Paul Pavlidis talk at the #ICG13 GigaScience Prize Track: Monitoring changes in the Gene Ontology and their impact on genomic data analysis (GOtrack). Shenzhen, 26th October 2018
Stefan Prost at #ICG13: Genome analyses show strong selection on coloration, ...GigaScience, BGI Hong Kong
Stefan Prost presentation for the #ICG13 GigaScience Prize Track: Genome analyses show strong selection on coloration, morphological and behavioral phenotypes in birds-of-paradise. Shenzhen, 26th October, 2018
Reproducible method and benchmarking publishing for the data (and evidence) d...GigaScience, BGI Hong Kong
Scott Edmunds presentation on: Reproducible method and benchmarking publishing for the data (and evidence) driven era. The Silk Road Forensics Conference, Yantai, 18th September 2018
Mary Ann Tuli: What MODs can learn from Journals – a GigaDB curator’s perspec...GigaScience, BGI Hong Kong
Mary Ann Tuli's talk at the International Society for Biocuration meeting: What MODs can learn from Journals – a GigaDB curator's perspective. Shanghai, 9th April 2018
Laurie Goodman: Sharing and Reusing Cell Image Data, ASCB/EMBO 2017 Subgroup ...GigaScience, BGI Hong Kong
Laurie Goodman's pre-prepared slides for the Subgroup S: Sharing and Reusing Cell Image Data session at the 2017 ASCB|EMBO meeting in Philadelphia. December 2017
Richard's adventures in two entangled wonderlandsRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool used to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been achieved using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed-tissue imaging, IVM allows ultra-fast, high-resolution imaging of cellular processes over time and space in their natural environment. Real-time visualization of biological processes in the context of an intact organism maintains physiological relevance and provides insights into the progression of disease, responses to treatment, and developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system's unique features and user-friendly software enable researchers to probe fast, dynamic biological processes such as immune cell tracking, cell-cell interaction, vascularization, and tumor metastasis in exceptional detail. This webinar also gives an overview of IVM in drug development, offering a view into the intricate interactions between drugs/nanoparticles and tissues in vivo and allowing for the evaluation of therapeutic interventions in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
The ambient solar wind that fills the heliosphere originates from multiple sources in the solar corona and is highly structured. It is often described as high-speed, relatively homogeneous plasma streams from coronal holes and slow-speed, highly variable streams whose source regions are under debate. A key goal of ESA/NASA's Solar Orbiter mission is to identify solar wind sources and understand what drives the complexity seen in the heliosphere. By combining magnetic field modelling and spectroscopic techniques with high-resolution observations and measurements, we show that the solar wind variability detected in situ by Solar Orbiter in March 2022 is driven by spatio-temporal changes in the magnetic connectivity to multiple sources in the solar atmosphere. The magnetic field footpoints connected to the spacecraft moved from the boundaries of a coronal hole to one active region (12961) and then across to another region (12957). This is reflected in the in situ measurements, which show the transition from fast to highly Alfvénic then to slow solar wind that is disrupted by the arrival of a coronal mass ejection. Our results describe solar wind variability at 0.5 au but are applicable to near-Earth observatories.
Seminar of U.V. Spectroscopy by SAMIR PANDASAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that measures the amount of light absorbed by the analyte.
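The quantity actually measured is absorbance, related to concentration through the Beer-Lambert law, A = εlc. A minimal sketch of that relationship (the intensity and molar-absorptivity values in the note below are illustrative, not from the slides):

```python
import math

def absorbance(incident, transmitted):
    """A = log10(I0 / I): absorbance from incident vs. transmitted intensity."""
    return math.log10(incident / transmitted)

def concentration(a, molar_absorptivity, path_cm=1.0):
    """Beer-Lambert law A = e*l*c rearranged for concentration (mol/L)."""
    return a / (molar_absorptivity * path_cm)
```

If only 10% of the light is transmitted, A = 1; with ε = 50,000 L mol⁻¹ cm⁻¹ and a 1 cm cell, that corresponds to a 20 µM solution.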
Brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
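Cytoplasmic segregation and the loss of heteroplasmy can be illustrated with a tiny simulation: each daughter cell samples its organelles at random from the mother's pool, so a mixed population drifts toward a single genotype. The organelle counts and genotype labels below are arbitrary illustrations, not data from the slides:

```python
import random

def divide(cell, rng):
    """One division in a toy segregation model: the daughter draws its
    organelles at random (with replacement) from the mother's pool."""
    return [rng.choice(cell) for _ in cell]

def generations_to_homoplasmy(cell, rng, max_gen=100_000):
    """Divide repeatedly until only one organelle genotype remains."""
    generation = 0
    while len(set(cell)) > 1 and generation < max_gen:
        cell = divide(cell, rng)
        generation += 1
    return generation, cell[0]
```

Starting from a 50/50 heteroplasmic cell, fixation to one genotype typically occurs within a few times the organelle count in generations, which is why variegated patterns like those in Mirabilis arise from random sorting alone.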
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
This PDF is about schizophrenia.
For more details, visit the SELF-EXPLANATORY channel on YouTube:
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4–0.9 µm) and novel JWST images with 14 filters spanning 0.8–5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at >2.3 µm to construct an ultradeep image, reaching as deep as ≈31.4 AB mag in the stack and 30.3–31.0 AB mag (5σ, r = 0.1″ circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5–15. These objects show compact half-light radii of R1/2 ∼ 50–200 pc, stellar masses of M⋆ ∼ 10⁷–10⁸ M⊙, and star-formation rates of SFR ∼ 0.1–1 M⊙ yr⁻¹. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward-modeling approach to infer the properties of the evolving luminosity function without binning in redshift or luminosity that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes
1. Re-assembly, quality evaluation, and annotation of
678 marine microbial eukaryotic
reference transcriptomes
Lisa K. Johnson, Harriet Alexander, C. Titus Brown
Lab for Data Intensive Biology (DIB)
University of California, Davis
ICG-13
Session 6: GigaScience Prize Track
October 25, 2018
@monsterbashseq
ljcohen@ucdavis.edu
2. DNA sequencing technology has revolutionized the field of biology.
“New Computational Era”
• Now, limiting step is data analysis
• New tools and approaches constantly available
• What to do if:
– New samples to add to the project?
– New software tool is developed?
Re-analysis of old data with new tools and methods is not a common
practice. Should it be?
3. Marine Microbial Eukaryotic Transcriptome
Sequencing Project (MMETSP)
- Standardized data set, 1 sequencing facility and library preparation
- 678 Illumina PE 50 RNA sequence datasets, 1 TB raw data
- Wide diversity spanning more than 40 phyla
- Original assemblies by the U.S. National Center for Genome Resources (NCGR)
Keeling et al. 2014
PMID: 24959919
Caron et al. 2016
PMID: 27867198
4. • Adapted from the Brown lab, “Eel Pond mRNA-seq Protocol”: http://eel-pond.readthedocs.io/en/latest/
Titus Brown, Camille Scott, and Leigh Sheneman
• Dr. Tessa Pierce: https://github.com/dib-lab/eelpond (snakemake workflow)
Johnson, LK; Alexander, H; Brown, CT. 2018. GigaScience. In press.
https://www.biorxiv.org/content/early/2018/09/18/323576
Programmatically automated pipeline
(Python) x 678 transcriptomes
9. Most DIB assemblies have more unique content.
Unique k-mers (k=25), unique word combinations
Luiz Irber,
HyperLogLog:
https://doi.org/10.1101/056846
https://github.com/dib-lab/khmer
10. Dinophyta have more unique k-mers
Can we detect phylogenetic differences in the assemblies?
Unique k-mers = unique word combinations (k=25)
*
11. Ciliophora have lower ORF percentages
Can we detect phylogenetic differences in the assemblies?
*
12. • Re-assembly with new tools can yield new results (and content!)
• Automated and programmable pipelines can be used to process
arbitrarily many samples and test new tools
• Analyzing many samples using a common pipeline identifies
taxon-specific trends
Summary
13. Acknowledgements
• Data Intensive Biology Lab
–Camille Scott, Luiz Irber
• MSU iCER hpcc
• NSF-XSEDE, Jetstream
cloud
Photo by James Word
Editor's Notes
Hi, my name is Lisa Johnson, I’m a PhD student at UC Davis in Titus Brown’s Data Intensive Biology lab tackling questions surrounding k-mer based sequence analysis. Thank you for this opportunity to speak today. I would like to first acknowledge my co-authors, Harriet Alexander and my advisor, Titus Brown.
The Marine Microbial Eukaryotic Transcriptome Sequencing Project is a unique set of mRNA sequence data generated by a consortium of PIs who all got together and submitted their favorite marine microbial eukaryotes to one sequencing facility. These species represent 40 pelagic and endosymbiotic phyla, such as dinoflagellates, ciliates, and diatoms. They are both phylogenetically and geographically diverse, collected from all over the world.
This is a really exciting set of data for a few reasons, one is because it is one of the largest publicly available sets of RNA data with a standardized library preparation from different organisms with a total of about 1 TB of raw sequence data.
Second, it’s purposefully built, not a metatranscriptome. We technically know who is supposed to be in this data set, so we are generating reference transcriptomes for all of these species, some of which have never had any reference transcriptomes or genomes before.
Right after data were sequenced, the NCGR assembled the transcriptomes as references with their own pipeline, using the genome assembler ABySS with some modifications and post-processing for transcriptomes.
====================
Bottom panel, left to right:
Elphidium margaritaceum
http://zoology.bio.spbu.ru/Eng/Sci/Korsun/Foram2_E-margaritaceum.jpg
2. Acanthamoeba
https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Parasite140120-fig3_Acanthamoeba_keratitis_Figure_3B.png/220px-Parasite140120-fig3_Acanthamoeba_keratitis_Figure_3B.png
3. Gonyaulax spinifera
http://www.sms.si.edu/IRLSpec/images/Gonyaulax_Lg.jpg
4. Asterionellopsis glacialis
http://www.smhi.se/oceanografi/oce_info_data/plankton_checklist/diatoms/asterionellopsis_glacialis.gif
5. Tetraselmis
http://cfb.unh.edu/phycokey/Choices/Chlorophyceae/unicells/flagellated/TETRASELMIS/Tetraselmis_06_500x345.jpg
6. Oxyrrhis marina
http://cfb.unh.edu/phycokey/Choices/Dinophyceae/NonPS-dinos/OXYRRHIS/Oxyrrhis_04_300x246_marina.jpg
7. Alexandrium
http://www.whoi.edu/cms/images/dfino/2006/6/Alexandrium_en_11187_26907.jpg
8. Pseudonitzschia
https://upload.wikimedia.org/wikipedia/commons/5/5e/Pseudonitzschia2.jpg
9. Chlamydomonas
https://web.mst.edu/~microbio/BIO221_2009/images_2009/chlamydomonas-3.jpg
10. Emiliania_huxleyi
https://upload.wikimedia.org/wikipedia/commons/d/d9/Emiliania_huxleyi_coccolithophore_(PLoS).png
11. Symbiodinium
http://www.personal.psu.edu/tcl3/index.html
12. Phaeocystis antarctica
http://www.esf.edu/antarctica/images/Phaeo_montage2.jpg
13. Micromonas
http://roscoff-culture-collection.org/sites/default/files/field/image/micromonas-colored-350_0.jpg
14. Karenia brevis
http://www.sms.si.edu/irlspec/images/Kareni_brevis_2.jpg
15. Thalassiosira pseudonana
http://genome.jgi.doe.gov/Thaps3/Tpseudonana.jpg
16. Ditylum_brightwellii
https://cimt.pmc.ucsc.edu/images/HAB%20ID/diatom/Ditylum_brightwellii.jpg
Our modularized pipeline, which I wrote in Python, attempts to address these issues. It takes metadata from any data set in NCBI as input and decides which samples to run.
Raw sequence reads are downloaded from NCBI, quality trimmed, checked with fastqc, run through digital normalization, then assembled using the Trinity transcriptome assembler.
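The digital-normalization step can be sketched in a few lines of Python. This is a toy, exact-count version for illustration; the actual pipeline uses khmer, which tracks k-mer abundances probabilistically to keep memory bounded, and the k size and coverage cutoff below are illustrative:

```python
from collections import defaultdict
from statistics import median

def kmers(seq, k=20):
    """All overlapping k-length substrings of a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalization(reads, k=20, cutoff=3):
    """Keep a read only if the median abundance of its k-mers, counted
    over the reads kept so far, is still below the coverage cutoff.
    Redundant high-coverage reads are discarded before assembly."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            for km in kms:
                counts[km] += 1
    return kept
```

With a cutoff of 3, ten identical reads collapse to three, while any novel read is always retained.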
I’m glossing over a lot of details here because there is not enough time, but if you are interested please see me after to talk. There is a tutorial also available, called the “Eel pond protocol”, which is open access and has a small subset of data to run through the steps of a de novo assembly with Trinity.
A benefit of this pipeline to highlight is that you can pick up from where you left off if something crashes. As anyone who has used an institutional high performance computing cluster knows, stuff breaks, stops running. With this pipeline, if something stops, you can start it again.
This data set pushes the limits of our high-performance computing clusters with 1 TB raw data, in terms of both storage and compute resources. This took more than 8,000 computing hours. We have found that the resources required for these >600 assemblies are not trivial, and should be a consideration when planning for a project of this size in the future.
In evaluating our assemblies, it appears that our re-assemblies have more contigs. A contig is a linear prediction of a full transcript by the assembly software. In subsequent slides, I’ll be showing similar figures like this, so want to orient you first. On the y-axis is what we’re measuring – here it’s the number of contigs. This is a split violin plot showing the frequency distribution around the mean of each pipeline. In the blue on the right shows our re-assemblies, which I’ve labeled “DIB” because we’re the data intensive biology lab. In the gray on the left are assemblies from NCGR. The number on top in blue shows the numbers of assemblies where DIB has a higher value than NCGR or in gray where NCGR has a higher number.
In this case, we see that there were more DIB assemblies with higher numbers of contigs in comparison to the NCGR.
The mean of DIB is around 48,000 contigs, with some samples producing up to 190,000 contigs up here towards the tail of the distribution. While the mean of NCGR is around 25,000 contigs and fewer assemblies have high numbers of contigs, the highest is about 100,000.
So, these differences were interesting for us – and we came up with some questions (click)
In addition to having higher quality scores, there appears to be more content. The proportion of contigs from a comparison called a reciprocal best BLAST of NCGR vs. our DIB assemblies indicates that most of the content found in NCGR is also found in the DIB re-assemblies, but also that there is extra information in the DIB assemblies not found in the NCGR assemblies. This information was obtained by aligning the two assemblies against each other both ways: first with NCGR as the reference, then the reverse with DIB as the reference.
Engage with audience: As you can see here…our peak is about 0.7, or 70%. This means that we’re capturing 70% of the content in the NCGR assemblies. On the other hand, NCGR assemblies capture about 50% of the content of our assemblies. The difference is about 20%.
The ~20% difference between these 2 blast comparisons leads us to still question whether we have just assembled junk or if we actually have higher resolution assemblies.
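The reciprocal-best-hit bookkeeping behind these percentages can be sketched as follows. This is a simplified version assuming hits are already available as (query, subject, bitscore) rows; the actual comparison ran full BLAST searches in both directions:

```python
def best_hits(hits):
    """Best-scoring subject for each query from (query, subject, bitscore) rows."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}

def reciprocal_best_fraction(a_vs_b, b_vs_a):
    """Fraction of A queries whose best hit in B points straight back at them."""
    ab, ba = best_hits(a_vs_b), best_hits(b_vs_a)
    reciprocal = sum(1 for q, s in ab.items() if ba.get(s) == q)
    return reciprocal / len(ab) if ab else 0.0
```

Running this in both directions gives the two asymmetric percentages described above; the gap between them is the candidate extra content.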
Orient audience to graphs: left ORF on Y axis
Even though we have more contigs, the proportion of open reading frames (protein-coding regions) detected is similar, if not more tightly distributed toward the upper range. Most of the assemblies have slightly higher ORF content.
And on the right are BUSCO percentages, which is a set of benchmarking universal single copy orthologs expected to be found in all eukaryotic transcriptomes, like housekeeping genes.
While there are problems with using BUSCO scores as an absolute measurement of assembly quality, they can serve as a comparative metric relative to another pipeline. Our assemblies have a similar BUSCO content relative to NCGR. So, at least these haven’t gone down. The extra content we found is probably not all junk.
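The ORF metric boils down to asking, per contig, whether a sufficiently long start-to-stop stretch exists. A toy version, scanning forward frames only with the standard stop codons (real tools such as TransDecoder also scan the reverse strand and handle partial ORFs):

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Length in nt of the longest ATG-to-stop open reading frame,
    scanning the three forward frames only."""
    best = 0
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None
        for j, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = j
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (j - start + 1) * 3)
                start = None
    return best

def orf_percentage(contigs, min_len=150):
    """Percent of contigs containing an ORF of at least min_len nt."""
    if not contigs:
        return 0.0
    hits = sum(1 for c in contigs if longest_orf(c) >= min_len)
    return 100.0 * hits / len(contigs)
```

Note how a fixed stop-codon set would systematically under-score organisms with reassigned stop codons, which is exactly the ciliate effect discussed later in the talk.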
In digging deeper into the extra content, this is a plot of ONLY this extra content in the blue part. Samples are across the x axis, sorted by the number of extra contigs on the y axis. (pause, let this sink in, take a drink or something)
Highlighted in green is the number of these extra contigs that are actually annotated to a known gene.
I annotated the re-assemblies using this really great tool out of our lab by Camille Scott called 'dammit'. No, it's not an acronym; it was named out of frustration: "Just annotate it, dammit!" The dammit pipeline uses the highly curated Pfam and Rfam databases of known protein and RNA families, as well as OrthoDB for conserved orthologs. About 1/3 of the extra content has annotations.
Here we are comparing the raw sequence content, regardless of annotation, in terms of the number of kmers or unique word combinations with a k length of 25. We see that our assemblies fall above the 1:1 expectation, meaning that our assemblies have more unique words compared to the NCGR assemblies. This is kind of like taking two versions of the same book and digesting them down into individual 25 letter words found in the book. We found that our assemblies have more unique words than NCGR.
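Counting "unique words" is counting distinct k-mers. A toy exact version using a Python set (the actual analysis used khmer's HyperLogLog sketch, which estimates the same number in a fixed, small amount of memory for terabyte-scale data):

```python
def unique_kmers(sequences, k=25):
    """Number of distinct k-mers across all sequences in an assembly."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return len(seen)

def uniqueness_ratio(assembly_a, assembly_b, k=25):
    """Ratio > 1 means assembly A carries more distinct sequence content."""
    return unique_kmers(assembly_a, k) / unique_kmers(assembly_b, k)
```

A ratio above 1 for a DIB/NCGR pair is the "falling above the 1:1 expectation" shown on the slide.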
Therefore, we are able to answer that our assemblies probably have a bit more biologically-meaningful content
To address our second question about whether we can detect phylogenetic differences in the assemblies, we took a look at some of the assembly metrics grouped by taxa.
Explain figures: unique k-mers on the y, input reads on the x, colors indicate different taxa, plotting mean and stdev
The Dinoflagellates appear to have more unique k-mer content. This seems to make sense, knowing that Dinoflagellates have this steady-state gene expression thing going on, where they just keep expressing genes on and on, then regulate more at the translational level.
As far as the software, it might be useful to incorporate strain-specific information like this into assembly software.
Here again, colors are different taxonomic groupings, mean percentage of open reading frame predictions on the y, number of transcripts on the x
We see here that ciliate assemblies appear to have a lower open reading frame percentage. This is interesting since it has recently been found that ciliates have an alternative triplet codon dictionary, with codons normally encoding STOP serving a different purpose.
Dinoflagellates here have this high open reading frame content, and lots of contigs.
In this case, it is useful to know that our assembly evaluation tools might perform outside the range of what is normal for the organisms in question. The assemblies are not necessarily lower quality, but may be perceived as lower in quality because of cool and unique features like this.
Strain-specific trends may lead to understanding how raw data content affects the overall assembly quality