Lisa Johnson's talk at the #ICG13 GigaScience Prize Track: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. Shenzhen, 26th October 2018
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...Alejandra Gonzalez-Beltran
Metagenomic Data Provenance and Management using the ISA infrastructure - overview, implementation patterns & software tools
Slides presented at EBI Metagenomics Bioinformatics course: http://www.ebi.ac.uk/training/course/metagenomics2014
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Spark Summit
Whole-genome-based metagenomics analyses hold the key to discovering novel species from microbial communities, revealing their full metabolic potential, and understanding their interactions with each other. Metagenomics projects based on next-generation sequencing typically produce 100 GB to 1,000 GB of unstructured data. Unlike many other big data problems, analysis of metagenomics data often generates temporary files 100 to 1,000 times the original size, posing a significant challenge for both hardware infrastructure and software algorithms. Here we report our experience evaluating Apache Spark for metagenomics data analysis in terms of speed, scalability, robustness, and, most importantly, ease of programming. We developed a Spark-based scalable metagenomics application to deconvolute individual genomes from a complex microbial community with thousands of species. We then systematically tested its performance on synthetic and real-world datasets using the Elastic MapReduce framework provided by Amazon Web Services. Our preliminary results suggest Spark provides a cost-effective solution with rapid development/deployment cycles for metagenomics data analysis. This experience likely extends to other big genomics data analyses, in both research and production settings.
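The deconvolution step described in the abstract can be illustrated with a toy sketch. This is not the authors' code: it uses composition-based binning (tetranucleotide frequency vectors and nearest-centroid assignment) in plain Python as a stand-in for the Spark map/reduce pattern; all names and data are illustrative.

```python
# Hypothetical sketch of the map/reduce pattern the abstract describes:
# bin reads from a mixed community by tetranucleotide composition.
# Pure-Python stand-in for Spark's map/reduceByKey; names are illustrative.
from collections import Counter
from itertools import product

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]

def tetra_freq(read):
    """Map step: normalized tetranucleotide frequency vector for one read."""
    counts = Counter(read[i:i + 4] for i in range(len(read) - 3))
    total = sum(counts[t] for t in TETRAMERS) or 1
    return [counts[t] / total for t in TETRAMERS]

def nearest_bin(vec, centroids):
    """Assign a read's composition signature to the closest genome bin."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
    return dists.index(min(dists))

reads = ["ACGTACGTACGT", "GGGGCCCCGGGG"]
centroids = [tetra_freq("ACGTACGTACGTACGT"), tetra_freq("GGGGCCCCGGGGCCCC")]
bins = [nearest_bin(tetra_freq(r), centroids) for r in reads]
```

In a real Spark job, `tetra_freq` would be the map function over an RDD of reads and the bin assignment a subsequent stage; the arithmetic is the same.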
Finding Needles in Genomic Haystacks with “Wide” Random Forest: Spark Summit ...Spark Summit
Recent advances in genome sequencing technologies and bioinformatics have enabled whole genomes to be studied at the population level rather than for a small number of individuals. This provides new power to whole-genome association studies (WGAS), which now seek to identify the multi-gene causes of common complex diseases like diabetes or cancer.
As WGAS involve studying thousands of genomes, they pose both technological and methodological challenges. The volume of data is significant: for example, the dataset from the 1000 Genomes Project, with genomes of 2,504 individuals, includes nearly 85M genomic variants with a raw data size of 0.8 TB. The number of features is enormous and greatly exceeds the number of samples, which makes it challenging to apply traditional statistical approaches.
Random forest is one of the methods found to be useful in this context, both because of its potential for parallelization and its robustness. Although a number of big data implementations are available (including Spark ML), they are tuned for typical datasets with a large number of samples and a relatively small number of variables, and either fail or are inefficient in the GWAS context, especially since costly data preprocessing is usually required.
To address these problems, we have developed RandomForestHD, a Spark-based implementation optimized for highly dimensional datasets. We have successfully applied RandomForestHD to datasets beyond the reach of other tools, and for smaller datasets found its performance superior. We are currently applying RandomForestHD, released as part of the VariantSpark toolkit, to a number of WGAS studies.
In the presentation we will introduce the domain of WGAS and its challenges, present RandomForestHD with its design principles and implementation details with regard to Spark, compare its performance with other tools, and finally showcase the results of a few WGAS applications.
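To illustrate why random forests scale to p >> n data, here is a toy pure-Python sketch (not the RandomForestHD implementation): each "tree" is reduced to a decision stump trained on a random sqrt(p)-sized feature subset, so per-tree work stays bounded as the number of variants grows. All data and names are illustrative.

```python
# Illustrative sketch of the "wide" random forest idea: each tree sees only
# a small random subset of the p >> n features. Not the RandomForestHD code.
import math, random

def best_stump(X, y, feat_idx):
    """Pick the feature in feat_idx that best splits y at threshold 0.5."""
    def accuracy(j):
        preds = [1 if row[j] > 0.5 else 0 for row in X]
        return sum(p == t for p, t in zip(preds, y))
    return max(feat_idx, key=accuracy)

def wide_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    mtry = max(1, int(math.sqrt(p)))  # sqrt(p) candidate features per tree
    return [best_stump(X, y, rng.sample(range(p), mtry)) for _ in range(n_trees)]

# Toy data: feature 0 is causal, the remaining 99 are noise.
rng = random.Random(1)
X = [[rng.random() for _ in range(100)] for _ in range(40)]
y = [1 if row[0] > 0.5 else 0 for row in X]
trees = wide_forest(X, y)
# Feature "importance" = how often each feature wins the split.
importance = {j: trees.count(j) for j in set(trees)}
```

Real implementations grow full trees and distribute the feature columns across a cluster; the point here is only the mtry subsampling that keeps wide data tractable.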
Hail: SCALING GENETIC DATA ANALYSIS WITH APACHE SPARK: Keynote by Cotton SeedSpark Summit
In 2001, it cost ~$100M to sequence a single human genome. In 2014, due to dramatic improvements in sequencing technology far outpacing Moore’s law, we entered the era of the $1,000 genome. At the same time, the power of genetics to impact medicine has become evident: for example, drugs with supporting genetic evidence have twice the clinical trial success rate. These factors have led to an explosion in the volume of genetic data, in the face of which existing analysis tools are breaking down.
Therefore, we began the open-source Hail project (https://hail.is) to be a scalable platform built on Apache Spark to enable the worldwide genetics community to build, share, and apply new tools. Hail is focused on variant-level (post-read) data; querying genetic data, annotations and sample data; and performing rare and common variant association analyses. Hail has already been used to analyze datasets with hundreds of thousands of exomes and tens of thousands of whole genomes.
We will give an overview of the goals of the Hail project and its architecture. The challenge of efficiently manipulating genetic data in Spark has led to several innovations that may have wider applicability, including an RDD-like abstraction for representing multidimensional data and an OrderedRDD abstraction for ordered data (for example, data indexed by position in the genome). Finally, we will discuss Hail performance and future directions.
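The OrderedRDD idea, keeping records sorted by genomic position so range queries and joins can exploit order rather than a full shuffle, can be sketched in plain Python. This is a conceptual stand-in, not Hail code; the data and names are illustrative.

```python
# Conceptual sketch of ordered genomic data: records kept sorted by
# (contig, position) so lookups use binary search instead of a scan.
from bisect import bisect_left, bisect_right

variants = sorted([
    ("1", 15000, "A", "T"),
    ("1", 905372, "C", "G"),
    ("2", 1234, "G", "A"),
    ("X", 99, "T", "C"),
])  # sorted by (contig, position): the ordering invariant

def query_range(records, contig, start, end):
    """All variants on `contig` with start <= position < end, via binary search."""
    keys = [(c, p) for c, p, _, _ in records]
    lo = bisect_left(keys, (contig, start))
    hi = bisect_right(keys, (contig, end - 1))
    return records[lo:hi]
```

In a distributed setting the same invariant lets each partition cover a contiguous key range, so a range query touches only the relevant partitions.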
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Golden Helix Inc
With a focus on scalable architecture and optimized native code that fully utilizes the CPU and RAM available, we can scale genomic analysis into sizes conventionally considered Big Data on a single host. In this webcast, we demonstrate recent innovations and features in Golden Helix solutions that enable the analysis of big data on your own terms.
Science has evolved from the isolated individual tinkering in the lab, through the era of the “gentleman scientist” with his or her assistant(s), to group-based then expansive collaboration, and now to an opportunity to collaborate with the world. With the advent of the internet the opportunity for crowd-sourced contribution and large-scale collaboration has exploded and, as a result, scientific discovery has been further enabled. The contributions of enormous open data sets, liberal licensing policies, and innovative technologies for mining and linking these data have given rise to platforms that are beginning to deliver on the promise of semantic technologies and nanopublications, facilitated by the unprecedented computational resources available today, especially the increasing capabilities of handheld devices. The speaker will provide an overview of his experiences in developing a crowdsourced platform for chemists allowing for data deposition, annotation and validation. The challenges of mapping chemical and pharmacological data, especially with regard to data quality, will be discussed. The promise of distributed participation in data analysis is already being realized.
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaDatabricks
In 2001, it cost ~$100M to sequence a single human genome. In 2014, due to dramatic improvements in sequencing technology far outpacing Moore’s law, we entered the era of the $1,000 genome. At the same time, the power of genetics to impact medicine has become evident. For example, drugs with supporting genetic evidence are twice as likely to succeed in clinical trials. These factors have led to an explosion in the volume of genetic data, in the face of which existing analysis tools are breaking down.
As a result, the Broad Institute began the open-source Hail project (https://hail.is), a scalable platform built on Apache Spark, to enable the worldwide genetics community to build, share and apply new tools. Hail is focused on variant-level (post-read) data; querying genetic data, as well as annotations, on variants and samples; and performing rare and common variant association analyses. Hail has already been used to analyze datasets with hundreds of thousands of exomes and tens of thousands of whole genomes, enabling dozens of major research projects.
exFrame: a Semantic Web Platform for Genomics ExperimentsTim Clark
slides from talk given at Bio-ontologies 2013, Berlin DE, 20 July 2013
Emily Merrill*, Stephane Corlosquet*, Paolo Ciccarese†*, Tim Clark*†‡, Sudeshna Das†*
* Massachusetts General Hospital
† Harvard Medical School
‡ School of Computer Science, University of Manchester
PMR metabolomics and transcriptomics database and its RESTful web APIs: A dat...Araport
The PMR database is a community resource for the deposition and analysis of metabolomics data and related transcriptomics data. PMR currently houses metabolomics data from over 25 species of eukaryotes. In this talk, we introduce PMR's RESTful web APIs for data sharing and demonstrate their applications in research, using Araport to provide Arabidopsis metabolomics data.
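As a hedged illustration of how a client might consume such a RESTful API, the sketch below composes a query URL and decodes a JSON response. The host, resource path, and parameters are hypothetical stand-ins, not PMR's documented endpoints.

```python
# Sketch of a generic REST client of the kind the talk describes.
# BASE, the resource names, and the parameters are hypothetical.
from urllib.parse import urlencode
import json
import urllib.request

BASE = "https://example.org/pmr/api"  # placeholder, not the real PMR host

def build_query(resource, **params):
    """Compose a REST query URL for a resource plus filter parameters."""
    qs = urlencode(sorted(params.items()))
    return f"{BASE}/{resource}?{qs}" if qs else f"{BASE}/{resource}"

def fetch_json(url):
    """GET a URL and decode the JSON body (network call, shown but not run)."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

url = build_query("metabolites", species="Arabidopsis thaliana", limit=10)
```

A caller would then do `records = fetch_json(url)` and work with the decoded list or dict; the actual PMR resource names and response schema are documented by the project itself.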
BioBankCloud: Machine Learning on Genomics + GA4GH @ Med at ScaleAndy Petrella
A talk given at the BioBankCloud conference in Feb 2015 about distributed computing in the contexts of genomics and health.
In this talk, we presented the results we obtained exploring the 1000 Genomes data using ADAM, followed by an introduction to our scalable GA4GH server implementation built using ADAM, Apache Spark and Play Framework 2.
Annotopia open annotation services platformTim Clark
Annotopia is an open-access, open-source, open annotation services platform developed for scientific annotation of documents and datasets on the web using the W3C Open Annotation model http://www.openannotation.org/spec/core/.
Using Annotopia, virtually any client application including lightweight web clients, can create, selectively share, and access annotation of web documents and data. This can be done regardless of the ownership of the base objects being annotated.
Annotopia supports unstructured, semi-structured and fully structured (semantic) annotation; manual and automated (text mining) annotation; and permissions, groups, and sharing. It also provides access to specialized vocabulary and text analytics services.
Annotopia is an open source platform licensed under Apache 2.0.
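To make the annotation model concrete, here is a minimal annotation body in the style of the W3C Open Annotation model, of the kind a lightweight client might POST to a server such as Annotopia. The field values, and the exact set of properties used, are illustrative examples rather than a verified Annotopia payload.

```python
# Minimal Open Annotation-style payload as JSON-LD; values are illustrative,
# and the property selection is an assumption based on the OA core model.
import json

annotation = {
    "@context": "http://www.w3.org/ns/oa-context-20130208.json",
    "@type": "oa:Annotation",
    "motivatedBy": "oa:commenting",
    "hasBody": {
        "@type": "cnt:ContentAsText",
        "chars": "This figure needs a scale bar.",
    },
    "hasTarget": "http://example.org/paper.html",
    "annotatedBy": {"@type": "foaf:Person", "name": "Jane Researcher"},
}

payload = json.dumps(annotation, indent=2)
```

The key idea of the model is the body/target split: the body carries the annotation content, while the target identifies the web document or data being annotated, independently of who owns it.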
REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND E...Lisa K. Johnson, Ph.D.
Talk for ASLO Ocean Science Meeting, Honolulu, HI
March 3, 2017
Lisa J. Cohen, Harriet Alexander, C. Titus Brown
The Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) facilitated the generation of 678 Illumina RNA sequence datasets from a wide diversity of organisms spanning more than 40 phyla of cultured microbial eukaryotes collected from a variety of marine environments. This is the largest publicly available set of RNA sequencing data from a diversity of eukaryotic taxa with a standardized library preparation. We developed an automated and modularized de novo transcriptome assembly pipeline for the MMETSP data set that is extensible to accommodate both future software updates and additional samples. With this large set of assemblies from a diversity of species, we were able to quantitatively evaluate the qualities of individual transcriptomes. Moreover, a meta-analysis across the dataset revealed lineage-specific transcriptome characteristics, such as predicted open reading frames, contig features, unique k-mers and evaluation scores. Ultimately, a better understanding of these assemblies and annotations will enhance our ability to accurately identify and characterize genes of ecological and biogeochemical significance.
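One of the evaluation metrics mentioned above, unique k-mer content, is simple enough to sketch directly. This is a minimal illustration in plain Python, not the MMETSP pipeline code, which handles hundreds of gigabytes with specialized tools.

```python
# Minimal unique k-mer counter for a set of assembled contigs; a toy
# stand-in for the k-mer-based assembly metrics the abstract mentions.
def unique_kmers(seqs, k=21):
    """Count distinct k-mers across a set of contigs."""
    kmers = set()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return len(kmers)

# A 4-periodic repeat contributes only 4 distinct 21-mers; a homopolymer
# contributes 1, so low-complexity contigs score low on this metric.
contigs = ["ACGT" * 10, "T" * 25]
n = unique_kmers(contigs, k=21)
```

Comparing distinct k-mer counts against input read k-mers gives a rough, reference-free sense of how much sequence diversity an assembly captured.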
Saha UC Davis Plant Pathology seminar Infrastructure for battling the Citrus ...Surya Saha
Rapidly spreading invasive diseases in systems with little or no prior experimental data or resources pose a unique set of challenges for growers, scientists, and regulators. As part of a USDA NIFA CAPS project focused on the psyllid, Diaphorina citri, we have released improved genomics resources including high-quality genome assemblies and annotation. We have also created an open-access web portal for analyses around the Citrus Greening/Huanglongbing disease complex. Citrusgreening.org includes pathosystem-wide resources and bioinformatics tools for multiple Citrus spp. hosts, the Asian citrus psyllid vector (ACP, Diaphorina citri), and multiple pathogens including Candidatus Liberibacter asiaticus (CLas). To the best of our knowledge, this is the first example of a database to use the pathosystem as a holistic framework to understand an insect-transmitted plant disease. Users can submit relevant data sets to enable sharing and allow the community to leverage their data within an integrated system. The system includes the metabolic pathway databases CitrusCyc and DiaphorinaCyc with organism-specific pathways that can be used to mine metabolomics, transcriptomics and proteomics results to identify pathways and regulatory mechanisms involved in disease response. The Psyllid Expression Network (PEN) contains expression profiles of ACP genes from multiple life stages, tissues, conditions and hosts. The Citrus Expression Network (CEN) contains public expression data from multiple tissues and conditions for various citrus hosts. All tools connect to a central database. The portal also includes electrical penetration graph (EPG) recordings, information about citrus rootstock trials and metabolomics data in addition to traditional omics data types, with a goal of combining and mining all information related to the Huanglongbing pathosystem.
User-friendly manual curation tools will allow the continuous improvement of the knowledge base as more experimental research is published. The portal can be accessed at https://citrusgreening.org/.
Cross-Kingdom Standards in Genomics, Epigenomics and MetagenomicsChristopher Mason
Challenges and biases in preparing, characterizing, and sequencing DNA and RNA can have significant impacts on research in genomics across all kingdoms of life, including experiments in single cells, RNA profiling, and metagenomics. Technical artifacts and contaminations can arise at each point of sample manipulation, extraction, sequencing, and analysis. Thus, the measurement and benchmarking of these potential sources of error are of paramount importance as next-generation sequencing (NGS) projects become more global and ubiquitous.
Fortunately, a variety of methods, standards, and technologies have recently emerged that improve measurements in genomics and sequencing, from the initial input material to the computational pipelines that process and annotate the data.
This webinar will review work to develop standards and their applications in genomics, including the ABRF-NGS Phase II Study on DNA sequencing; the FDA's Sequencing Quality Control Consortium (SEQC2); metagenomics standards efforts (ABRF, ATCC, Zymo, Metaquins); and the Epigenomics QC group of SEQC2. The webinar will also review the computational methods for detection, validation, and implementation of these genomic measures.
Small molecule identification and the new MassBankSteffen Neumann
Since its beginnings more than 10 years ago, the MassBank system has provided a user-friendly web interface. We have now improved data access, version control and issue tracking by moving the data to GitHub, allowing a whole new workflow and access route for bio- and cheminformatics users.
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Larry Smarr
06.09.15
Invited Talk, 2006 Synthetic Biology Symposium, Aliso Creek Inn, Laguna Beach, CA
Title: Building a Community Cyberinfrastructure to Support Marine Microbial Ecology Metagenomics
Building bioinformatics resources for the global communityExternalEvents
http://www.fao.org/about/meetings/wgs-on-food-safety-management/en/
Building bioinformatics resources for the global community. Presentation from the Technical Meeting on the impact of Whole Genome Sequencing (WGS) on food safety management and GMI-9, 23-25 May 2016, Rome, Italy.
WikiPathways: how open source and open data can make omics technology more us...Chris Evelo
Presentation about the collaborative development of open-source pathway analysis code and pathways, and about its usage in analytical software distributed with analytical instruments such as mass spectrometers.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Reusable Software and Open Data To Optimize AgricultureDavid LeBauer
Abstract:
Humans need a secure and sustainable food supply, and science can help. We have an opportunity to transform agriculture by combining knowledge of organisms and ecosystems to engineer ecosystems that sustainably produce food, fuel, and other services. The challenge is that the information we have is difficult to combine: measurements, theories, and laws are scattered across publications, notebooks, software, and human brains. We homogenize, encode, and automate the synthesis of data and mechanistic understanding in a way that links understanding at different scales and across domains. This allows extrapolation, prediction, and assessment. Reusable components allow the automated construction of new knowledge that can be used to assess, predict, and optimize agro-ecosystems.
Developing reusable software and open-access databases is hard, and examples will illustrate how we use the Predictive Ecosystem Analyzer (PEcAn, pecanproject.org), the Biofuel Ecophysiological Traits and Yields database (BETYdb, betydb.org), and ecophysiological crop models to predict crop yield, decide which crops to plant, and which traits can be selected for the next generation of data driven crop improvement. A next step is to automate the use of sensors mounted on robots, drones, and tractors to assess plants in the field. The TERRA Reference Phenotyping Platform (TERRA-Ref, terraref.github.io) will provide an open access database and computing platform on which researchers can use and develop tools that use sensor data to assess and manage agricultural and other terrestrial ecosystems.
TERRA-Ref will adopt existing standards and develop modular software components and common interfaces, in collaboration with researchers from iPlant, NEON, AgMIP, USDA, rOpenSci, ARPA-E, many scientists and industry partners. Our goal is to advance science by enabling efficient use, reuse, exchange, and creation of knowledge.
---
Invited talk for the "Informatics for Reproducibility in Earth and Environmental Science Research" session at the American Geophysical Union Fall Meeting, Dec 17 2015.
Precise elucidation of the many different biological features encoded in any genome requires careful examination and review by researchers, who gather and evaluate the available evidence to corroborate and modify gene predictions and other biological elements. This curation process allows them to resolve discrepancies and validate automated gene model hypotheses and alignments. This approach is the well-established practice for well-known genomes such as human, mouse, zebrafish, Drosophila, et cetera. Desktop Apollo was originally developed to meet these needs.
The cost of sequencing a genome has been dramatically reduced by several orders of magnitude in the last decade, and the natural consequence is that more and more researchers are sequencing more and more new genomes, both within populations and across species. Because individual researchers can now readily sequence many genomes of interest, the need for a universally accessible genome curation tool logically follows. Each new exome or genome sequenced requires visualization and curation to obtain biologically accurate genomic feature sets, even for a limited set of genes, because computational genome analysis remains an imperfect art. Additionally, unlike earlier genome projects, which had the advantage of more highly polished genomes, recent projects usually have lower coverage. Researchers therefore face additional work correcting for more frequent assembly errors and annotating genes split across multiple contigs.
Genome annotation is an inherently collaborative task; researchers only very rarely work in isolation, turning to colleagues for second opinions and insights from those with expertise in particular domains and gene families. The new JavaScript-based Apollo allows researchers real-time interactivity, breaking down large amounts of data into manageable portions to mobilize groups of researchers with shared interests. We are also focused on training the next generation of researchers by reaching out to educators to make these tools available as part of curricula via workshops and webinars, and through widely applied systems such as iPlant and DNA Subway. Here we offer details of our progress.
Presentation at Genome Informatics, Session (3) on Databases, Data Mining, Visualization, Ontologies and Curation.
Authors: Monica C Munoz-Torres, Suzanna E. Lewis, Ian Holmes, Colin Diesh, Deepak Unni, Christine Elsik.
IDW2022: A decade's experience in transparent and interactive publication of ...GigaScience, BGI Hong Kong
Scott Edmunds at International Data Week 2022: A decade's experience in transparent and interactive publication of FAIR data and software via an end-to-end XML publishing platform. 21st June 2022
GigaByte Chief Editor Scott Edmunds presents on how to prepare a data paper for the TDR and WHO sponsored call for data papers describing datasets on vectors of human diseases launched in Nov 2021. Presented at the GBIF webinar on 25th January 2022 and aimed at authors interested in submitting a manuscript to the series.
STM Week: Demonstrating bringing publications to life via an End-to-end XML p...GigaScience, BGI Hong Kong
Scott Edmunds at the STM Week 2020 Digital Publishing seminar on Demonstrating bringing publications to life via an End-to-end XML publishing platform. 2nd December 2020
Scott Edmunds: A new publishing workflow for rapid dissemination of genomes u...GigaScience, BGI Hong Kong
Scott Edmunds on a new publishing workflow for rapid dissemination of genomes using GigaByte & GigaDB. Presented at Biodiversity 2020 in the Annotation & Databases track, 9th October 2020.
Scott Edmunds: Quantifying how FAIR is Hong Kong: The Hong Kong Shareability ...GigaScience, BGI Hong Kong
Scott Edmunds' talk at CODATA2019 on Quantifying how FAIR is Hong Kong: The Hong Kong Shareability of Hong Kong University Research Experiment. 19th September 2019 in Beijing
Scott Edmunds talk at IARC: How can we make science more trustworthy and FAIR...GigaScience, BGI Hong Kong
Scott Edmunds talk at IARC, Lyon. How can we make science more trustworthy and FAIR? Principled publishing for more evidence based research. 8th July 2019
PAGAsia19 - The Digitalization of Ruili Botanical Garden Project: Production...GigaScience, BGI Hong Kong
A 3-part talk presented at PAG Asia 2019 in Shenzhen: The Digitalization of Ruili Botanical Garden Project: Production, Curation and Re-Use. Presented by Huan Liu (CNGB), Scott Edmunds (GigaScience) & Stephen Tsui (CUHK). 8th June 2019
Democratising biodiversity and genomics research: open and citizen science to...GigaScience, BGI Hong Kong
Scott Edmunds at the China National GeneBank Youth Biodiversity MegaData Forum: Democratising biodiversity and genomics research: open and citizen science to build trust and fill the data gaps. 18th December 2018
Ricardo Wurmus at #ICG13: Reproducible genomics analysis pipelines with GNU Guix. Presented at the GigaScience Prize Track at the International Conference on Genomics, Shenzhen, 26th October 2018
Paul Pavlidis at #ICG13: Monitoring changes in the Gene Ontology and their im...GigaScience, BGI Hong Kong
Paul Pavlidis talk at the #ICG13 GigaScience Prize Track: Monitoring changes in the Gene Ontology and their impact on genomic data analysis (GOtrack). Shenzhen, 26th October 2018
Stefan Prost at #ICG13: Genome analyses show strong selection on coloration, ...GigaScience, BGI Hong Kong
Stefan Prost presentation for the #ICG13 GigaScience Prize Track: Genome analyses show strong selection on coloration, morphological and behavioral phenotypes in birds-of-paradise. Shenzhen, 26th October, 2018
Reproducible method and benchmarking publishing for the data (and evidence) d...GigaScience, BGI Hong Kong
Scott Edmunds presentation on: Reproducible method and benchmarking publishing for the data (and evidence) driven era. The Silk Road Forensics Conference, Yantai, 18th September 2018
Mary Ann Tuli: What MODs can learn from Journals – a GigaDB curator’s perspec...GigaScience, BGI Hong Kong
Mary Ann Tuli's talk at the International Society for Biocuration meeting: What MODs can learn from Journals – a GigaDB curator's perspective. Shanghai, 9th April 2018
Laurie Goodman: Sharing and Reusing Cell Image Data, ASCB/EMBO 2017 Subgroup ...GigaScience, BGI Hong Kong
Laurie Goodman's pre-prepared slides for the Subgroup S: Sharing and Reusing Cell Image Data session at the 2017 ASCB|EMBO meeting in Philadelphia. December 2017
Richard's adventures in two entangled wonderlandsRichard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool used to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been achieved using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed-tissue imaging, IVM allows ultra-fast, high-resolution imaging of cellular processes over time and space in their natural environment. Real-time visualization of biological processes in the context of an intact organism maintains physiological relevance and provides insights into the progression of disease, responses to treatment, and developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system's unique features and user-friendly software enable researchers to probe fast, dynamic biological processes such as immune cell tracking, cell-cell interaction, vascularization, and tumor metastasis in exceptional detail. This webinar also gives an overview of IVM in drug development, offering a view into the intricate interactions between drugs/nanoparticles and tissues in vivo and allowing for the evaluation of therapeutic interventions in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
The ambient solar wind that fills the heliosphere originates from multiple sources in the solar corona and is highly structured. It is often described as high-speed, relatively homogeneous plasma streams from coronal holes and slow-speed, highly variable streams whose source regions are under debate. A key goal of ESA/NASA's Solar Orbiter mission is to identify solar wind sources and understand what drives the complexity seen in the heliosphere. By combining magnetic field modelling and spectroscopic techniques with high-resolution observations and measurements, we show that the solar wind variability detected in situ by Solar Orbiter in March 2022 is driven by spatio-temporal changes in the magnetic connectivity to multiple sources in the solar atmosphere. The magnetic field footpoints connected to the spacecraft moved from the boundaries of a coronal hole to one active region (12961) and then across to another region (12957). This is reflected in the in situ measurements, which show the transition from fast to highly Alfvénic then to slow solar wind that is disrupted by the arrival of a coronal mass ejection. Our results describe solar wind variability at 0.5 au but are applicable to near-Earth observatories.
Seminar of U.V. Spectroscopy by SAMIR PANDASAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that measures the amount of light absorbed by the analyte.
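The quantity actually measured is absorbance, related to concentration through the Beer-Lambert law, A = εlc. A minimal sketch of that relationship (the intensity and molar-absorptivity values in the note below are illustrative, not from the slides):

```python
import math

def absorbance(incident, transmitted):
    """A = log10(I0 / I): absorbance from incident vs. transmitted intensity."""
    return math.log10(incident / transmitted)

def concentration(a, molar_absorptivity, path_cm=1.0):
    """Beer-Lambert law A = e*l*c rearranged for concentration (mol/L)."""
    return a / (molar_absorptivity * path_cm)
```

If only 10% of the light is transmitted, A = 1; with ε = 50,000 L mol⁻¹ cm⁻¹ and a 1 cm cell, that corresponds to a 20 µM solution.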
Brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
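Cytoplasmic segregation and the loss of heteroplasmy can be illustrated with a tiny simulation: each daughter cell samples its organelles at random from the mother's pool, so a mixed population drifts toward a single genotype. The organelle counts and genotype labels below are arbitrary illustrations, not data from the slides:

```python
import random

def divide(cell, rng):
    """One division in a toy segregation model: the daughter draws its
    organelles at random (with replacement) from the mother's pool."""
    return [rng.choice(cell) for _ in cell]

def generations_to_homoplasmy(cell, rng, max_gen=100_000):
    """Divide repeatedly until only one organelle genotype remains."""
    generation = 0
    while len(set(cell)) > 1 and generation < max_gen:
        cell = divide(cell, rng)
        generation += 1
    return generation, cell[0]
```

Starting from a 50/50 heteroplasmic cell, fixation to one genotype typically occurs within a few times the organelle count in generations, which is why variegated patterns like those in Mirabilis arise from random sorting alone.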
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
This PDF is about schizophrenia.
For more details, visit the SELF-EXPLANATORY channel on YouTube:
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4–0.9 µm) and novel JWST images with 14 filters spanning 0.8–5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at >2.3 µm to construct an ultradeep image, reaching as deep as ≈31.4 AB mag in the stack and 30.3–31.0 AB mag (5σ, r = 0.1″ circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5–15. These objects show compact half-light radii of R1/2 ∼ 50–200 pc, stellar masses of M⋆ ∼ 10⁷–10⁸ M⊙, and star-formation rates of SFR ∼ 0.1–1 M⊙ yr⁻¹. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward-modeling approach to infer the properties of the evolving luminosity function without binning in redshift or luminosity that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ∼2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
Lisa Johnson at #ICG13: Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes
1. Re-assembly, quality evaluation, and annotation of
678 marine microbial eukaryotic
reference transcriptomes
Lisa K. Johnson, Harriet Alexander, C. Titus Brown
Lab for Data Intensive Biology (DIB)
University of California, Davis
ICG-13
Session 6: GigaScience Prize Track
October 25, 2018
@monsterbashseq
ljcohen@ucdavis.edu
2. DNA sequencing technology has revolutionized the field of biology.
“New Computational Era”
• Now, limiting step is data analysis
• New tools and approaches constantly available
• What to do if:
– New samples to add to the project?
– New software tool is developed?
Re-analysis of old data with new tools and methods is not a common
practice. Should it be?
3. Marine Microbial Eukaryotic Transcriptome
Sequencing Project (MMETSP)
- Standardized data set, 1 sequencing facility and library preparation
- 678 Illumina PE 50 RNA sequence datasets, 1 TB raw data
- Wide diversity spanning more than 40 phyla
- Original assemblies by the U.S. National Center for Genome Resources (NCGR)
Keeling et al. 2014
PMID: 24959919
Caron et al. 2016
PMID: 27867198
4. • Adapted from the Brown lab, “Eel Pond mRNA-seq Protocol”: http://eel-pond.readthedocs.io/en/latest/
Titus Brown, Camille Scott, and Leigh Sheneman
• Dr. Tessa Pierce: https://github.com/dib-lab/eelpond (snakemake workflow)
Johnson, LK; Alexander, H; Brown, CT. 2018. GigaScience. In press.
https://www.biorxiv.org/content/early/2018/09/18/323576
Programmatically automated pipeline
(Python) x 678 transcriptomes
9. Most DIB assemblies have more unique content.
Unique k-mers (k=25), unique word combinations
Luiz Irber,
HyperLogLog:
https://doi.org/10.1101/056846
https://github.com/dib-lab/khmer
10. Dinophyta have more unique k-mers
Can we detect phylogenetic differences in the assemblies?
Unique k-mers = unique word combinations (k=25)
*
11. Ciliophora have lower ORF percentages
Can we detect phylogenetic differences in the assemblies?
*
12. • Re-assembly with new tools can yield new results (and content!)
• Automated and programmable pipelines can be used to process
arbitrarily many samples and test new tools
• Analyzing many samples using a common pipeline identifies
taxon-specific trends
Summary
13. Acknowledgements
• Data Intensive Biology Lab
–Camille Scott, Luiz Irber
• MSU iCER hpcc
• NSF-XSEDE, Jetstream
cloud
Photo by James Word
Editor's Notes
Hi, my name is Lisa Johnson, I’m a PhD student at UC Davis in Titus Brown’s Data Intensive Biology lab tackling questions surrounding k-mer based sequence analysis. Thank you for this opportunity to speak today. I would like to first acknowledge my co-authors, Harriet Alexander and my advisor, Titus Brown.
The Marine Microbial Eukaryotic Transcriptome Sequencing Project is a unique set of mRNA sequence data generated by a consortium of PIs who all got together and submitted their favorite marine microbial eukaryotes to one sequencing facility. These species represent 40 pelagic and endosymbiotic phyla, such as dinoflagellates, ciliates, and diatoms. They are both phylogenetically and geographically diverse, collected from all over the world.
This is a really exciting set of data for a few reasons, one is because it is one of the largest publicly available sets of RNA data with a standardized library preparation from different organisms with a total of about 1 TB of raw sequence data.
Second, it’s purposefully built, not a metatranscriptome. We technically know who is supposed to be in this data set, so we are generating reference transcriptomes for all of these species, some of which have never had any reference transcriptomes or genomes before.
Right after data were sequenced, the NCGR assembled the transcriptomes as references with their own pipeline, using the genome assembler ABySS with some modifications and post-processing for transcriptomes.
====================
Bottom panel, left to right:
Elphidium margaritaceum
http://zoology.bio.spbu.ru/Eng/Sci/Korsun/Foram2_E-margaritaceum.jpg
2. Acanthamoeba
https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/Parasite140120-fig3_Acanthamoeba_keratitis_Figure_3B.png/220px-Parasite140120-fig3_Acanthamoeba_keratitis_Figure_3B.png
3. Gonyaulax spinifera
http://www.sms.si.edu/IRLSpec/images/Gonyaulax_Lg.jpg
4. Asterionellopsis glacialis
http://www.smhi.se/oceanografi/oce_info_data/plankton_checklist/diatoms/asterionellopsis_glacialis.gif
5. Tetraselmis
http://cfb.unh.edu/phycokey/Choices/Chlorophyceae/unicells/flagellated/TETRASELMIS/Tetraselmis_06_500x345.jpg
6. Oxyrrhis marina
http://cfb.unh.edu/phycokey/Choices/Dinophyceae/NonPS-dinos/OXYRRHIS/Oxyrrhis_04_300x246_marina.jpg
7. Alexandrium
http://www.whoi.edu/cms/images/dfino/2006/6/Alexandrium_en_11187_26907.jpg
8. Pseudonitzschia
https://upload.wikimedia.org/wikipedia/commons/5/5e/Pseudonitzschia2.jpg
9. Chlamydomonas
https://web.mst.edu/~microbio/BIO221_2009/images_2009/chlamydomonas-3.jpg
10. Emiliania_huxleyi
https://upload.wikimedia.org/wikipedia/commons/d/d9/Emiliania_huxleyi_coccolithophore_(PLoS).png
11. Symbiodinium
http://www.personal.psu.edu/tcl3/index.html
12. Phaeocystis antarctica
http://www.esf.edu/antarctica/images/Phaeo_montage2.jpg
13. Micromonas
http://roscoff-culture-collection.org/sites/default/files/field/image/micromonas-colored-350_0.jpg
14. Karenia brevis
http://www.sms.si.edu/irlspec/images/Kareni_brevis_2.jpg
15. Thalassiosira pseudonana
http://genome.jgi.doe.gov/Thaps3/Tpseudonana.jpg
16. Ditylum_brightwellii
https://cimt.pmc.ucsc.edu/images/HAB%20ID/diatom/Ditylum_brightwellii.jpg
Our modularized pipeline, which I wrote in Python, attempts to address these issues. It takes metadata from any data set in NCBI as input and decides which samples to run.
Raw sequence reads are downloaded from NCBI, quality trimmed, checked with fastqc, run through digital normalization, then assembled using the Trinity transcriptome assembler.
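The digital-normalization step can be sketched in a few lines of Python. This is a toy, exact-count version for illustration; the actual pipeline uses khmer, which tracks k-mer abundances probabilistically to keep memory bounded, and the k size and coverage cutoff below are illustrative:

```python
from collections import defaultdict
from statistics import median

def kmers(seq, k=20):
    """All overlapping k-length substrings of a read."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalization(reads, k=20, cutoff=3):
    """Keep a read only if the median abundance of its k-mers, counted
    over the reads kept so far, is still below the coverage cutoff.
    Redundant high-coverage reads are discarded before assembly."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            for km in kms:
                counts[km] += 1
    return kept
```

With a cutoff of 3, ten identical reads collapse to three, while any novel read is always retained.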
I’m glossing over a lot of details here because there is not enough time, but if you are interested please see me after to talk. There is a tutorial also available, called the “Eel pond protocol”, which is open access and has a small subset of data to run through the steps of a de novo assembly with Trinity.
A benefit of this pipeline to highlight is that you can pick up from where you left off if something crashes. As anyone who has used an institutional high performance computing cluster knows, stuff breaks, stops running. With this pipeline, if something stops, you can start it again.
This data set pushes the limits of our high-performance computing clusters with 1 TB raw data, in terms of both storage and compute resources. This took more than 8,000 computing hours. We have found that the resources required for these >600 assemblies are not trivial, and should be a consideration when planning for a project of this size in the future.
In evaluating our assemblies, it appears that our re-assemblies have more contigs. A contig is a linear prediction of a full transcript by the assembly software. In subsequent slides, I’ll be showing similar figures like this, so want to orient you first. On the y-axis is what we’re measuring – here it’s the number of contigs. This is a split violin plot showing the frequency distribution around the mean of each pipeline. In the blue on the right shows our re-assemblies, which I’ve labeled “DIB” because we’re the data intensive biology lab. In the gray on the left are assemblies from NCGR. The number on top in blue shows the numbers of assemblies where DIB has a higher value than NCGR or in gray where NCGR has a higher number.
In this case, we see that there were more DIB assemblies with higher numbers of contigs in comparison to the NCGR.
The mean of DIB is around 48,000 contigs, with some samples producing up to 190,000 contigs up here towards the tail of the distribution. While the mean of NCGR is around 25,000 contigs and fewer assemblies have high numbers of contigs, the highest is about 100,000.
So, these differences were interesting for us – and we came up with some questions (click)
In addition to having higher quality scores, there appears to be more content. The proportion of contigs from a comparison called a reciprocal best BLAST of NCGR vs. our DIB assemblies indicates that most of the content found in NCGR is also found in the DIB re-assemblies, but also that there is extra information in the DIB assemblies not found in the NCGR assemblies. This information was obtained by aligning the two assemblies against each other both ways: first with NCGR as the reference, then the reverse with DIB as the reference.
Engage with audience: As you can see here…our peak is about 0.7, or 70%. This means that we’re capturing 70% of the content in the NCGR assemblies. On the other hand, NCGR assemblies capture about 50% of the content of our assemblies. The difference is about 20%.
The ~20% difference between these 2 blast comparisons leads us to still question whether we have just assembled junk or if we actually have higher resolution assemblies.
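The reciprocal-best-hit bookkeeping behind these percentages can be sketched as follows. This is a simplified version assuming hits are already available as (query, subject, bitscore) rows; the actual comparison ran full BLAST searches in both directions:

```python
def best_hits(hits):
    """Best-scoring subject for each query from (query, subject, bitscore) rows."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}

def reciprocal_best_fraction(a_vs_b, b_vs_a):
    """Fraction of A queries whose best hit in B points straight back at them."""
    ab, ba = best_hits(a_vs_b), best_hits(b_vs_a)
    reciprocal = sum(1 for q, s in ab.items() if ba.get(s) == q)
    return reciprocal / len(ab) if ab else 0.0
```

Running this in both directions gives the two asymmetric percentages described above; the gap between them is the candidate extra content.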
Orient audience to graphs: left ORF on Y axis
Even though we have more contigs, the proportion of open reading frames (protein-coding regions) detected is similar, if not more tightly distributed toward the upper range. Most of the assemblies have slightly higher ORF content.
And on the right are BUSCO percentages, which is a set of benchmarking universal single copy orthologs expected to be found in all eukaryotic transcriptomes, like housekeeping genes.
While there are problems with using BUSCO scores as an absolute measurement of assembly quality, they can serve as a comparative metric relative to another pipeline. Our assemblies have a similar BUSCO content relative to NCGR. So, at least these haven’t gone down. The extra content we found is probably not all junk.
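The ORF metric boils down to asking, per contig, whether a sufficiently long start-to-stop stretch exists. A toy version, scanning forward frames only with the standard stop codons (real tools such as TransDecoder also scan the reverse strand and handle partial ORFs):

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Length in nt of the longest ATG-to-stop open reading frame,
    scanning the three forward frames only."""
    best = 0
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None
        for j, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = j
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (j - start + 1) * 3)
                start = None
    return best

def orf_percentage(contigs, min_len=150):
    """Percent of contigs containing an ORF of at least min_len nt."""
    if not contigs:
        return 0.0
    hits = sum(1 for c in contigs if longest_orf(c) >= min_len)
    return 100.0 * hits / len(contigs)
```

Note how a fixed stop-codon set would systematically under-score organisms with reassigned stop codons, which is exactly the ciliate effect discussed later in the talk.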
In digging deeper into the extra content, this is a plot of ONLY this extra content in the blue part. Samples are across the x axis, sorted by the number of extra contigs on the y axis. (pause, let this sink in, take a drink or something)
Highlighted in green is the number of these extra contigs that are actually annotated to a known gene.
I annotated the re-assemblies using this really great tool out of our lab by Camille Scott called 'dammit'. No, it's not an acronym; it was named out of frustration: "Just annotate it, dammit!" The dammit pipeline uses the highly curated Pfam and Rfam databases of known protein and RNA families, as well as OrthoDB for conserved orthologs. About 1/3 of the extra content has annotations.
Here we are comparing the raw sequence content, regardless of annotation, in terms of the number of kmers or unique word combinations with a k length of 25. We see that our assemblies fall above the 1:1 expectation, meaning that our assemblies have more unique words compared to the NCGR assemblies. This is kind of like taking two versions of the same book and digesting them down into individual 25 letter words found in the book. We found that our assemblies have more unique words than NCGR.
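Counting "unique words" is counting distinct k-mers. A toy exact version using a Python set (the actual analysis used khmer's HyperLogLog sketch, which estimates the same number in a fixed, small amount of memory for terabyte-scale data):

```python
def unique_kmers(sequences, k=25):
    """Number of distinct k-mers across all sequences in an assembly."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return len(seen)

def uniqueness_ratio(assembly_a, assembly_b, k=25):
    """Ratio > 1 means assembly A carries more distinct sequence content."""
    return unique_kmers(assembly_a, k) / unique_kmers(assembly_b, k)
```

A ratio above 1 for a DIB/NCGR pair is the "falling above the 1:1 expectation" shown on the slide.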
Therefore, we are able to answer that our assemblies probably have a bit more biologically-meaningful content
To address our second question about whether we can detect phylogenetic differences in the assemblies, we took a look at some of the assembly metrics grouped by taxa.
Explain figures: unique k-mers on the y, input reads on the x, colors indicate different taxa, plotting mean and stdev
The Dinoflagellates appear to have more unique k-mer content. This seems to make sense, knowing that Dinoflagellates have this steady-state gene expression thing going on, where they just keep expressing genes on and on, then regulate more at the translational level.
As far as the software, it might be useful to incorporate strain-specific information like this into assembly software.
Here again, colors are different taxonomic groupings, mean percentage of open reading frame predictions on the y, number of transcripts on the x
We see here that ciliate assemblies appear to have a lower open reading frame percentage. This is interesting since it has recently been found that ciliates have an alternative triplet codon dictionary, with codons normally encoding STOP serving a different purpose.
Dinoflagellates here have this high open reading frame content, and lots of contigs.
In this case, it is useful to know that our assembly evaluation tools might perform outside the range of what is normal for the organisms in question. The assemblies are not necessarily lower quality, but may be perceived as lower in quality because of cool and unique features like this.
Strain-specific trends may lead to understanding how raw data content affects the overall assembly quality