1. The document discusses best practices for scientific software development including writing code for people to read, automating repetitive tasks, using version control, and avoiding redundancy.
2. Specific approaches mentioned are planning for mistakes, automated testing, continuous integration, and using style guides to ensure code is readable and consistently formatted.
3. Knitting allows analyzing and reporting in a single file by embedding R code chunks in markdown documents.
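To make the knitting idea concrete, here is a minimal sketch of an R Markdown file: narrative text and an embedded code chunk live in one file, and knitting runs the chunk and weaves its output into the rendered report. The chunk below uses knitr's python engine, assuming one is configured; {r} chunks are embedded the same way. The title, numbers, and output format are purely illustrative.

````
---
title: "Toy analysis report"
output: html_document
---

The mean of the toy measurements is computed by the embedded chunk below.

```{python}
# a trivial embedded analysis chunk; knitting inserts its printed output into the report
values = [2, 3, 5, 7, 11]
print(sum(values) / len(values))
```
````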
The document discusses the challenges and opportunities that will arise from the exponential growth of biological data in the coming years. It outlines four key areas: 1) Research approaches will need to effectively analyze infinite amounts of data. 2) Software and decentralized infrastructure will be needed to process the data. 3) Open science and reproducible research practices are important for data-driven biology. 4) Training the next generation of biologists in data analysis skills will be a major challenge. The document advocates for open source tools, reproducible research methods, and expanded training programs to help biology take advantage of the coming data deluge.
From peer-reviewed to peer-reproduced: a role for research objects in scholar... - Alejandra Gonzalez-Beltran
The document discusses how research objects and computational workflows can help capture experimental processes and reproduce findings in life sciences research. It describes a computational experiment evaluating three genome assembly algorithms on bacterial, insect, and human genomes. Key steps included identifying resources, designing the experimental workflow, running the experiment in Galaxy, and publishing results as nanopublications aggregated in a research object to enable verification and reuse. The goal is to improve reproducibility by making experimental descriptions and reviews more structured and transparent.
- Biology is generating vast amounts of data from techniques like DNA sequencing that is growing exponentially. However, biology is unprepared to effectively analyze and utilize this "big data".
- There are few researchers trained in both data analysis and biology. Data is often not shared between researchers, and the current publishing system discourages open sharing of knowledge. Most computational research also lacks reproducibility.
- The presenter advocates for open science, data sharing, and improved training for the next generation of researchers in both biology and data analysis skills to help address these challenges and better leverage the growing stores of biological data.
This document discusses the challenges and opportunities biology faces with increasing data generation. It outlines four key points:
1) Research approaches for analyzing infinite genomic data streams, such as digital normalization which compresses data while retaining information.
2) The need for usable software and decentralized infrastructure to perform real-time, streaming data analysis.
3) The importance of open science and reproducibility given most researchers cannot replicate their own computational analyses.
4) The lack of data analysis training in biology and efforts at UC Davis to address this through workshops and community building.
This document summarizes a talk titled "A Wager for 2016: How Software Will Beat Hardware in Biological Data Analysis". The talk discusses how software approaches can outpace hardware for analyzing large biological datasets. It notes that current variant calling approaches have limitations due to being I/O intensive and requiring multiple passes over data. The talk introduces approaches using lossy compression and streaming algorithms that can perform analysis more efficiently using less memory and in a single pass. This could enable analyzing a human genome on a desktop computer by 2016 as wagered. The talk argues that with better algorithmic tools, biological data analysis need not require large computers and can scale with the information content of data rather than just data size.
This document outlines a 12-step program for biology to adapt to the era of data-intensive science. It summarizes the author's background and research interests. It then discusses the rapid growth of biological data from techniques like DNA sequencing. It introduces the concept of digital normalization as a way to efficiently process large transcriptome datasets. Finally, it outlines some proposed steps for the field, including investing in computational training, a focus on biological questions, and moving to continuous data updating models.
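To make digital normalization concrete, the sketch below is a rough Python illustration, not the actual khmer implementation (which uses a compact probabilistic count table and streams reads from disk): a read is kept only while the median count of its k-mers is still below a coverage cutoff, so redundant reads are discarded once coverage saturates. The k-mer size, cutoff, and plain dictionary are illustrative choices.

```python
from collections import defaultdict

K = 20        # k-mer size (illustrative)
CUTOFF = 20   # keep a read only if its median k-mer coverage is still below this

counts = defaultdict(int)   # in-memory k-mer counts; the real tool uses a probabilistic table

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]

def normalize(reads):
    """Yield reads whose estimated coverage is below CUTOFF, updating counts as we go."""
    for read in reads:
        km = kmers(read)
        if not km:
            continue
        if median(counts[k] for k in km) < CUTOFF:   # read still adds information: keep it
            for k in km:
                counts[k] += 1
            yield read
        # otherwise the read is redundant at the current coverage and is dropped

# usage: kept = list(normalize(read_iterator))
```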
This document discusses complex metagenome assembly and career thoughts in bioinformatics. It begins with the speaker's research background and then discusses two main topics: 1) challenges with metagenome assembly due to low coverage regions and strain variation in sequencing data, and approaches using assembly graphs, and 2) the need for more "bioinformaticians in the middle" who are comfortable with both biology and computational analysis to integrate large-scale data into their research. The speaker provides advice for embracing computation and seeking formal training opportunities to develop skills at this intersection of disciplines.
The document discusses the ISA infrastructure, which provides a standardized format (ISA-TAB) for experimental metadata and data exchange. It can be used across various domains like toxicology, systems biology, and nanotechnology. The Risa R package integrates experimental metadata with analysis and allows updating metadata. Nature Scientific Data is a new publication for describing valuable datasets. The ISA framework has been adopted by over 30 public and private resources and is growing in use for facilitating reuse of investigations in various life science domains. Toxicity examples include EU projects on predictive toxicology and a rat study of drug candidates. Questions can be directed to the ISA tools group.
Bio-GraphIIn is a graph-based, integrative and semantically enabled repository for life science experimental data. It addresses the need for a system that supports retrospective data submissions, handles heterogeneous experimental data, and overcomes the fragmentation of existing data formats and databases. Bio-GraphIIn uses the Investigation/Study/Assay (ISA) framework and ontologies to semantically represent experimental metadata and enable rich queries across studies, with the goal of facilitating integrative data analysis.
The Taverna Workflow Management Software Suite - Past, Present, Future - myGrid team
The document summarizes the Taverna workflow management software. It discusses how Taverna allows users to visually design and execute workflows to analyze data through web services, scripts, and other tools. The summary highlights that Taverna uses a dataflow model and supports mixing different step types, nested workflows, and interactions. It also discusses how Taverna aims to advance scientific discovery by making workflows reusable, adaptive to different infrastructures, and able to process data at large scales.
Seminar at CIFASIS, Rosario, Argentina - Alejandra Gonzalez-Beltran
This document discusses data sharing and interoperability. It introduces BioSharing, a registry of life science data standards, policies and databases. It also describes the ISA infrastructure, a metadata framework for tracking experimental investigations. A case study is presented on reproducing results from the SOAPdenovo2 genome assembly algorithm using ISA-Tab, Galaxy workflows and other methods. The importance of data publication and enabling reproducible research is emphasized.
This document is a resume for Gautam Machiraju. It summarizes his education and research experience. He has a B.A. in Applied Mathematics from UC Berkeley with a concentration in Mathematical Biology and a minor in Bioengineering. He has worked on several research projects involving mathematical modeling and data analysis related to cancer biomarkers, genomics, and proteomics. His skills include programming, mathematics, data science, and laboratory techniques. He is currently a bioinformatics research assistant at Stanford University School of Medicine.
The document discusses the ISA infrastructure, which provides a framework for tracking metadata in bioscience experiments from data collection to sharing in linked data clouds. The infrastructure includes a metadata syntax, open source software tools, and a user community. It allows annotation of experimental metadata, materials, and processes using ontologies to make semantics explicit and enable integration and knowledge discovery. The infrastructure is growing with over 30 public and private resources adopting it to facilitate standards-compliant sharing of investigations across life science domains.
The document discusses knowledge management of experimental data through the ISA ecosystem. It describes the ISA-tab format and software suite that allows annotation and curation of experimental metadata. As a use case, it analyzes a dataset on metabolite profiling from a study of fatty acid amide hydrolase knockout mice. The ISA tools can represent investigations and assays, convert data to standardized formats, and facilitate sharing and analysis of experimental data.
This document discusses community standards for reproducible and reusable bioscience research. It outlines the importance of consistent reporting to maximize the value of collective scientific outputs. However, there are challenges due to the large number of bioscience reporting standards and lack of knowledge about how they relate. The document calls for a coherent catalogue of data sharing resources to evaluate standards, show relationships among them, and promote interoperability. This would help researchers make informed choices about standards and facilitate structured descriptions of experiments across domains.
This document provides a summary of Jennifer Shelton's background and experience in bioinformatics. It outlines her education in biology and post-baccalaureate studies. Her research focuses on de novo genome and transcriptome assembly using next-generation sequencing and BioNano Genomics data. She has extensive experience developing bioinformatics workflows and teaching coding skills through workshops. Currently she is the Bioinformatics Core Outreach Coordinator at Kansas State University where she continues her research and outreach efforts.
The document discusses the ISA (Investigation/Study/Assay) framework for enabling data reuse and reproducibility in bioscience research. The ISA framework provides a generic format for rich experimental descriptions and an infrastructure of open source software tools. It aims to minimize the burden of reporting, curating, sharing data and metadata from bioscience experiments to enable comprehension, reuse of data, and reproducibility. The framework promotes community engagement to develop community standards and document use cases.
This document discusses concepts and tools for exploring large sequencing datasets. It begins by providing background on building tools to analyze large sequencing datasets and enabling scientists to quickly generate hypotheses from data. The document then outlines goals of enabling hypothesis-driven biology through better hypothesis generation and refinement, and making sequence analysis less valuable and putting the author out of a job. It presents a narrative arc that discusses reconstructing community genomes from shotgun metagenomics, underlying enabling approaches and tools, and a plan for positive global influence through technology and training.
Ontomaton: NCBO BioPortal Ontology lookups in Google Spreadsheets produced by ISATeam at University of Oxford e-Research Centre (Eamonn Maguire, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra and Susanna Sansone) and NCBO (Trish Whetzel).
The work was presented during ICBO 2013 in Montreal by Trish Whetzel (Thanks Trish!)
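For a sense of what an OntoMaton-style lookup does behind the spreadsheet, here is a hedged Python sketch of a term search against the NCBO BioPortal REST service (the same service OntoMaton queries). The endpoint, parameters, and header format reflect my reading of the BioPortal API and may differ between API versions; the API key is a placeholder.

```python
import requests

BIOPORTAL_SEARCH = "http://data.bioontology.org/search"
API_KEY = "YOUR_NCBO_API_KEY"   # placeholder; a real BioPortal account key is required

def lookup(term, rows=5):
    """Return (preferred label, term IRI) pairs for the top BioPortal matches of a term."""
    resp = requests.get(
        BIOPORTAL_SEARCH,
        params={"q": term, "pagesize": rows},
        headers={"Authorization": "apikey token=" + API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return [(hit.get("prefLabel"), hit.get("@id")) for hit in resp.json()["collection"]]

# e.g. lookup("fatty acid amide hydrolase") would return candidate ontology terms for a spreadsheet cell
```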
BioSharing.org - mapping the landscape of community standards, databases, dat... - Alejandra Gonzalez-Beltran
This document summarizes Alejandra González-Beltrán's presentation on BioSharing.org, which maps the landscape of community standards, databases, and data policies. It discusses how BioSharing aims to help stakeholders make informed decisions for data interoperability by curating crowdsourced information on existing standards and policies. It also describes how BioSharing integrates information from the MIBBI Project and is working with the MICheckout tool to help users create and use modular standards components.
This presentation is a thorough guide to the use of Web Apollo, with details on User Navigation, Functionality, and the thought process behind manual annotation.
During this workshop, participants:
- Learn to identify homologs of known genes of interest in a newly sequenced genome.
- Become familiar with the environment and functionality of the Web Apollo genome annotation editing tool.
- Learn how to corroborate or modify automatically annotated gene models using available evidence in Web Apollo.
- Understand the process of curation in the context of genome annotation.
Next generation genomics: Petascale data in the life sciences - Guy Coates
Keynote presentation at OGF 28.
The year 2000 saw the release of "The" human genome, the product of the combined sequencing effort of the whole planet. In 2010, single institutions are sequencing thousands of genomes a year, producing petabytes of data. Furthermore, many of the large scale sequencing projects are based around international collaboration and consortia. The talk will explore how Grid and Cloud technologies are being used to share genomics data around the planet, revolutionizing life science research.
In this presentation from the DDN User Meeting at SC13, Tim Cutts from the Sanger Institute describes how the institute wrangles genomics data with DDN storage.
Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/
Precise elucidation of the many different biological features encoded in any genome requires careful examination and review by researchers, who gather and evaluate the available evidence to corroborate and modify gene predictions and other biological elements. This curation process allows them to resolve discrepancies and validate automated gene model hypotheses and alignments. This approach is the well-established practice for well-known genomes such as human, mouse, zebrafish, Drosophila, et cetera. Desktop Apollo was originally developed to meet these needs.
The cost of sequencing a genome has been dramatically reduced by several orders of magnitude in the last decade, and the natural consequence is that more and more researchers are sequencing more and more new genomes, both within populations and across species. Because individual researchers can now readily sequence many genomes of interest, the need for a universally accessible genomic curation tool logically follows. Each new exome or genome sequenced requires visualization and curation to obtain biologically accurate genomic feature sets, even for a limited set of genes, because computational genome analysis remains an imperfect art. Additionally, unlike earlier genome projects, which had the advantage of more highly polished genomes, recent projects usually have lower coverage. Therefore, researchers now face additional work correcting for more frequent assembly errors and annotating genes split across multiple contigs.
Genome annotation is an inherently collaborative task; researchers only very rarely work in isolation, turning to colleagues for second opinions and insights from those with expertise in particular domains and gene families. The new JavaScript-based Apollo allows researchers real-time interactivity, breaking down large amounts of data into manageable portions to mobilize groups of researchers with shared interests. We are also focused on training the next generation of researchers by reaching out to educators to make these tools available as part of curricula via workshops and webinars, and through widely applied systems such as iPlant and DNA Subway. Here we offer details of our progress.
Presentation at Genome Informatics, Session (3) on Databases, Data Mining, Visualization, Ontologies and Curation.
Authors: Monica C Munoz-Torres, Suzanna E. Lewis, Ian Holmes, Colin Diesh, Deepak Unni, Christine Elsik.
Science has evolved from the isolated individual tinkering in the lab, through the era of the “gentleman scientist” with his or her assistant(s), to group-based then expansive collaboration and now to an opportunity to collaborate with the world. With the advent of the internet, the opportunity for crowd-sourced contribution and large-scale collaboration has exploded and, as a result, scientific discovery has been further enabled. The contributions of enormous open data sets, liberal licensing policies and innovative technologies for mining and linking these data have given rise to platforms that are beginning to deliver on the promise of semantic technologies and nanopublications, facilitated by the unprecedented computational resources available today, especially the increasing capabilities of handheld devices. The speaker will provide an overview of his experiences in developing a crowdsourced platform for chemists allowing for data deposition, annotation and validation. The challenges of mapping chemical and pharmacological data, especially with regard to data quality, will be discussed. The promise of distributed participation in data analysis is already in place.
The document discusses making experimental data and methods more reproducible and accessible by providing structured metadata alongside narrative descriptions. It recommends using community standards and ontologies to semantically tag key information, and machine-readable formats to structure descriptions in a consistent way. Tools are proposed to help authors report structured information and curate it according to these standards to make data fully FAIR (findable, accessible, interoperable, reusable). The goal is to move from experiments that are difficult to reproduce to those that are "born reproducible".
VariantSpark: applying Spark-based machine learning methods to genomic inform... - Denis C. Bauer
Genomic information is increasingly used in medical practice, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. Here we introduce VariantSpark, which utilizes Hadoop/Spark along with its machine learning library, MLlib, providing the means of parallelisation for population-scale bioinformatics tasks. VariantSpark provides an interface to the standard variant call format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results.
To demonstrate the capabilities of VariantSpark, we clustered more than 3,000 individuals with 80 million variants each to determine the population structure in the dataset. VariantSpark is 80% faster than ADAM, the comparable Spark-based genome clustering approach, as well as the equivalent Hadoop/Mahout implementation and Admixture, a commonly used tool for determining individual ancestries. It is over 90% faster than traditional implementations using R and Python. These benefits in speed, resource consumption and scalability enable VariantSpark to open up the use of advanced, efficient machine learning algorithms for genomic data.
The package is written in Scala and available at https://github.com/BauerLab/VariantSpark.
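VariantSpark itself is written in Scala and has its own VCF-facing interface; purely to illustrate the underlying pattern of Spark MLlib clustering on a genotype matrix, here is a hedged PySpark sketch. The toy four-variant matrix, column names, and k=2 are invented for the example and are not VariantSpark's API.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("genotype-clustering").getOrCreate()

# toy genotype matrix: one row per individual, allele counts (0/1/2) at four variants
genotypes = spark.createDataFrame(
    [(0, 0, 2, 1), (0, 1, 2, 0), (2, 2, 0, 1), (2, 1, 0, 0)],
    ["v1", "v2", "v3", "v4"],
)
features = VectorAssembler(inputCols=genotypes.columns, outputCol="features").transform(genotypes)

# cluster individuals to expose population structure
model = KMeans(k=2, seed=1, featuresCol="features").fit(features)
model.transform(features).select("prediction").show()

spark.stop()
```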
The document discusses organizing computational biology projects. It recommends using a logical directory structure with a common root directory for related projects. Within each project directory, it suggests top-level directories for data, results, source code, documents, and binaries. For results directories, it advises creating subdirectories for each experiment with names indicating the date and topic. Maintaining a lab notebook to document experiments and their progress and conclusions is also recommended.
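Read as a concrete recipe, that layout can be scaffolded in a few lines. The Python sketch below is one reasonable interpretation of the recommended structure (folder names are illustrative, not a prescribed standard), with a dated subdirectory per experiment under results/ and a README stub in data/.

```python
from datetime import date
from pathlib import Path

def scaffold(project, root=Path("projects")):
    """Create a project skeleton: data, source, docs, binaries, and dated results directories."""
    base = root / project
    for sub in ("data", "src", "doc", "bin", "results"):
        (base / sub).mkdir(parents=True, exist_ok=True)
    # one dated, topic-named subdirectory per experiment so results sort chronologically
    experiment = base / "results" / "{:%Y-%m-%d}-first-experiment".format(date.today())
    experiment.mkdir(exist_ok=True)
    (base / "data" / "README.md").write_text("Record where each data file came from and when.\n")

scaffold("msms_analysis")   # creates projects/msms_analysis/{data,src,doc,bin,results/...}
```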
This document discusses genome assembly and sequencing. It provides advice on sampling, sequencing approaches, scaffolding, input data quality assurance, trimming, filtering, choosing an assembler, and testing parameters. It notes that there is no single best way and that testing many combinations of software, parameters, and formats is required. Understanding everything is not necessary, and 20% of the effort can yield 80% of the results. It also briefly describes the fire ant Solenopsis invicta and its two social forms.
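The "test many combinations" advice is essentially a parameter sweep. The Python sketch below iterates over assembler, k-mer size, and coverage-cutoff combinations and ranks them by N50; run_assembly is a hypothetical stand-in for invoking a real assembler (it returns dummy contig lengths so the example runs), while the N50 computation follows the standard definition.

```python
import itertools
import random

def run_assembly(reads_file, assembler, kmer, min_cov):
    """Hypothetical placeholder for running a real assembler (e.g. via subprocess);
    returns made-up contig lengths so the sweep below is self-contained."""
    random.seed(hash((assembler, kmer, min_cov)))
    return [random.randint(500, 50000) for _ in range(200)]

def n50(contig_lengths):
    """Length L such that contigs of length >= L contain at least half of the assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half, running = sum(lengths) / 2, 0
    for length in lengths:
        running += length
        if running >= half:
            return length

combos = itertools.product(["velvet", "soapdenovo2"], [31, 41, 51], [2, 5])
scores = {(a, k, c): n50(run_assembly("reads.fastq", a, k, c)) for a, k, c in combos}
best = max(scores, key=scores.get)
print("best (assembler, k, min_cov) by N50:", best, scores[best])
```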
This document provides a summary of best practices for scientific software development based on research and experience. It describes practices that can improve scientists' productivity and the reliability of their software, such as writing programs for people rather than computers, using version control, automating testing, and documenting work. Adopting these practices in concert can help reduce errors, make software easier to maintain, and save scientists time and effort.
The document discusses genomic analysis of the fire ant Solenopsis invicta. It notes that the genome sequencing of a Gp-9 B male fire ant revealed an expansion of lipid-processing gene families and over 420 putative olfactory receptors, more than any other insect. It also identified a functional DNA methylation system. Previous research had linked the fire ant's social structure to its Gp-9 locus, but genome sequencing provided more genomic context around this gene and others related to social behavior and chemical signaling.
The document discusses how DNA can provide information about evolutionary relationships. It notes that species relationships were previously determined by characteristics like bone structure, morphology, development, behavior, and ecological niche. However, DNA sequences change over time due to mutations during replication and from mutagens, viruses, and transposons. Analyzing DNA sequences can provide additional insights into evolutionary histories and current evolutionary contexts.
GS Rubber Industries is a leading manufacturer of molded rubber products founded in 1993. It focuses on quality, efficiency, and attention to detail fostered from its origins in the automotive and defense industries. The company offers rubber molding expertise and state-of-the-art technology for parts like caps, seals, mounts, and boots across many industries.
Improvement of Military Planes From War to War - Todd Roberts
Oswald Boelcke's 8 rules for aerial combat emphasized attacking from a position of advantage, committing fully to attacks, firing at close range, maintaining awareness of one's opponent, targeting vulnerable areas, countering opponents' attacks, maintaining an escape route, and coordinating attacks in groups. Over the decades, fighter aircraft increased dramatically in range, speed, altitude capability, and armaments to gain advantages in aerial combat.
This document provides an introduction to the Routing Policy Database (RPDB) in Linux and the iproute2 commands. It explains key concepts such as IP addresses, CIDR, tables, and scopes, and illustrates the use of the ip link, ip address, and ip route commands to configure and display network links, addresses, and routes. The goal is to cover both the theory and practical examples of network management with the iproute2 tools on Linux.
This document summarizes key points about tourism as an employment creator and career opportunity. It discusses how tourism generates direct jobs in industries like hotels and restaurants, and indirect jobs in sectors that support tourism. Some countries rely heavily on tourism income. The tourism industry offers over 400 occupations with various qualifications required depending on the role. Potential disadvantages of tourism jobs include casual work, lack of career growth, low pay and seasonality.
The document discusses human evolution and transitions from early hominins to modern humans. It describes several candidate ancestral species from the Miocene epoch, including Proconsul, Sivapithecus, Dryopithecus, and Ouranopithecus. Key transitions in human evolution discussed include bipedalism, increased brain size, use of tools, control of fire, and language/culture. Australopithecus species like Lucy are identified as being pivotal early hominins from 2-4 million years ago.
Joomla is an open source content management system (CMS) that makes it easy to build and manage websites. It allows non-technical users to add and edit content without needing HTML skills. Some key features include a user-friendly interface, built-in professional features, and hundreds of free and commercial extensions. Major organizations like the UN, MTV, and Harvard use Joomla for their websites. The document provides an overview of why to use Joomla, how to install it, and how to organize and manage content once installed.
This document provides information about an evolution and ecology course, including:
- The semester A lectures will focus on evolution, covering topics like the history of evolutionary thought, natural selection, human evolution, and more.
- Semester B lectures will focus on ecology, covering topics like food webs, biomes, global warming, and more.
- Assessment will include a workshop in each semester worth 20% and 80% for a final exam.
- Recommended reading materials are also provided to supplement the course content.
This document describes an untouched land holding the oldest mountains on the planet, which the eyes are about to witness, a spectacle unique in the world. It is an unspoiled place with very ancient rocks.
This document outlines the key concepts and evidence for evolution presented in an SBS 110 - Evolution course. It discusses the early ideas of fixity of species versus change, followed by the three main schools of evolutionary thought: Linnaeus proposed separate creation of species, Lamarck proposed inheritance of acquired characteristics, and Darwin and Wallace proposed evolution by natural selection and descent with modification. The document then covers Darwin's evidence for evolution including the fossil record, comparative anatomy, comparative embryology, vestigial structures, and domestication through artificial selection.
This document appears to be a list of photo locations from around the world including Kauai, Hawaii; Old Town and New Town in an unspecified location; Central Park in New York City; Lake Como; Joshua Tree; Prague; terraced rice fields in Indonesia; a rainbow eucalyptus tree in Maui, Hawaii; the Himalayas; a bridge in Vietnam; Cambodia; a colorful village on the sea; and terraced rice fields in Japan. The document also includes a music reference to Deep Purple and indicates it was created by Consul.
1. The document discusses best practices for scientific software development, including writing code for people rather than computers, automating repetitive tasks, using version control, and conducting code reviews.
2. Specific approaches and tools recommended are planning for mistakes, automated testing, continuous integration, and using a coding style guide. R and Ruby style guides are provided as examples.
3. The benefits of following such practices are improving productivity, reducing errors, making code easier to read and maintain, and allowing scientists to focus on scientific questions rather than software issues. Reproducible and sustainable software is the overall goal.
This document discusses best practices for organizing computational biology projects. It recommends creating a directory structure with folders for source code, data, documentation, results and binaries/executables. Data folders should include README files explaining where the data came from. Version control is important to track changes over time. Comments and documentation will help others understand the project and allow researchers to revisit past work without reconstructing their experiments from scratch. Organizing and documenting projects thoroughly makes computational experiments more reproducible, understandable and useful to both the original researchers and others in the future.
This document provides guidance on organizing computational biology projects. It recommends:
1. Organizing related projects together under a common root directory, with each project in its own subdirectory.
2. Using a logical top-level structure within each project with chronological organization at lower levels. This allows tracking experiments over time.
3. Including directories for data, source code, results, and documentation. Results subdirectories should be named with dates to allow sorting experiments chronologically.
Taking detailed, dated notes in lab notebooks or Markdown files integrated with analysis code allows fully documenting projects and easily repeating or extending prior work.
Open-source tools for generating and analyzing large materials data sets - Anubhav Jain
This document discusses open-source software tools for generating and analyzing large materials data sets developed by Anubhav Jain and collaborators. It summarizes several software packages including pymatgen for materials analysis, FireWorks for scientific workflows, custodian for error recovery in calculations, and matminer for data mining. Applications of the tools include generating the Materials Project database containing properties of over 65,000 materials compounds calculated using high-performance computing resources. The document emphasizes the importance of open-source collaborative software development and automation to accelerate materials discovery.
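As a flavour of what pymatgen-style analysis looks like, here is a minimal sketch assuming a recent pymatgen release: build a toy structure and query a few basic properties. The lattice constant and species are arbitrary.

```python
from pymatgen.core import Lattice, Structure

# a simple cubic cell with one atom, purely for illustration
structure = Structure(Lattice.cubic(3.36), ["Po"], [[0, 0, 0]])
print(structure.composition.reduced_formula)   # reduced chemical formula
print(structure.volume)                        # cell volume in cubic angstroms
print(structure.get_space_group_info())        # (spacegroup symbol, international number)
```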
This document summarizes a presentation given at ICME 2015 at the Cheyenne Mountain Resort in Colorado Springs, Colorado. The presentation was given by James Belak from Lawrence Livermore National Laboratory and discussed use-inspired research and development for integrated computational materials engineering. It addressed key computations needed for ICME like databases, expert systems, simulation methods, and continuum models. Integrating computations into materials engineering was a focus.
Software tools for calculating materials properties in high-throughput (pymat... - Anubhav Jain
This document discusses software tools for automating materials simulations. It introduces pymatgen, atomate, and FireWorks which can be used together to define a workflow of calculations, execute the workflow on supercomputers, and recover from errors or failures. The tools allow researchers to focus on designing and analyzing simulations rather than manual setup and execution of jobs. Workflows in atomate can compute many materials properties including elastic tensors, band structures, and transport coefficients. Parameters are customizable but sensible defaults are provided. FireWorks then executes the workflows across multiple supercomputing clusters.
This document discusses the challenges and opportunities presented by the increasing volume and complexity of biological data. It outlines four main areas: 1) Developing methods to efficiently store, access, and analyze large datasets; 2) Broadening our understanding of gene function beyond a small number of well-studied genes; 3) Accelerating research through improved sharing of data, results, and methods; and 4) Leveraging exploratory analysis of integrated datasets to generate new insights. The author advocates for lossy data compression, streaming analysis, preprint sharing, improved metadata collection, and incentivizing open data practices.
The Symbiotic Nature of Provenance and Workflow - Eric Stephan
This document discusses the symbiotic relationship between provenance and workflows in scientific research. It notes that workflows provide automation and integration capabilities, while provenance provides documentation of what transpired. The document provides examples of workflow and provenance technologies and outlines challenges around interoperability. It concludes that recognizing the interdependent relationship between provenance and workflows can help advance systems science research.
In this deck from the HPC User Forum, Rick Stevens from Argonne presents: AI for Science.
"Artificial Intelligence (AI) is making strides in transforming how we live. From the tech industry embracing AI as the most important technology for the 21st century to governments around the world growing efforts in AI, initiatives are rapidly emerging in the space. In sync with these emerging initiatives including U.S. Department of Energy efforts, Argonne has launched an “AI for Science” initiative aimed at accelerating the development and adoption of AI approaches in scientific and engineering domains with the goal to accelerate research and development breakthroughs in energy, basic science, medicine, and national security, especially where we have significant volumes of data and relatively less developed theory. AI methods allow us to discover patterns in data that can lead to experimental hypotheses and thus link data driven methods to new experiments and new understanding."
Watch the video: https://wp.me/p3RLHQ-kQi
Learn more: https://www.anl.gov/topic/science-technology/artificial-intelligence
and
http://hpcuserforum.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document discusses openness and reproducibility in computational science. It begins with an introduction and background on the challenges of analyzing non-model organisms. It then describes the goals and challenges of shotgun sequencing analysis, including assembly, counting, and variant calling. It emphasizes the need for efficient data structures, algorithms, and cloud-based analysis to handle large datasets. The document advocates for open science practices like publishing code, data, and analyses to ensure reproducibility of computational results.
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article: Carole Goble, "Better Software, Better Research," IEEE Computer Society, vol. 18, no. 5 (Sept.-Oct. 2014), pp. 4-8.
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
The document discusses how computation can accelerate the generation of new knowledge by enabling large-scale collaborative research and extracting insights from vast amounts of data. It provides examples from astronomy, physics simulations, and biomedical research where computation has allowed more data and researchers to be incorporated, advancing various fields more quickly over time. Computation allows for data sharing, analysis, and hypothesis generation at scales not previously possible.
This document summarizes a keynote presentation about challenges in bioinformatics software development and proposed solutions. Some of the key points made include: 1) bioinformatics software development involves multiple disciplines including computer science, software engineering, statistics, and biology, each with different priorities; 2) there is a massive proliferation of bioinformatics software packages that leads to many difficult choices for researchers; 3) proposed solutions include developing software in a more modular and automated way, using common benchmarks and protocols to evaluate tools, and focusing on reproducibility and usability.
Software tools for high-throughput materials data generation and data mining - Anubhav Jain
Atomate and matminer are open-source Python libraries for high-throughput materials data generation and data mining. Atomate makes it easy to automatically generate large datasets by running standardized computational workflows with different simulation packages. Matminer contains tools for featurizing materials data and integrating it with machine learning algorithms and data visualization methods. Both aim to accelerate materials discovery by automating and standardizing computational workflows and data analysis tasks.
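A minimal matminer sketch, assuming current matminer and pymatgen APIs: turn a composition into Magpie-style elemental-statistics features of the kind that feed the machine learning step. The composition is arbitrary.

```python
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition

featurizer = ElementProperty.from_preset("magpie")      # elemental-property statistics
features = featurizer.featurize(Composition("Fe2O3"))   # numeric feature vector for one composition
print(len(features), "features; first labels:", featurizer.feature_labels()[:3])
```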
This document is a resume for Gautam Machiraju. It summarizes his education and research experience. He has a B.A. in Applied Mathematics from UC Berkeley with a concentration in Mathematical Biology and a minor in Bioengineering. He has worked on several research projects involving mathematical modeling and data analysis related to biology and healthcare. These include modeling cancer biomarker shedding kinetics, mining literature for biomarker data, and using deep learning on patient time-series data. He has strong skills in programming, mathematics, and bioinformatics.
Docker in Open Science Data Analysis Challenges by Bruce Hoff (Docker, Inc.)
Typically in predictive data analysis challenges, participants are provided a dataset and asked to make predictions. Participants include with their prediction the scripts/code used to produce it. Challenge administrators validate the winning model by reconstructing and running the source code.
Often data cannot be provided to participants directly, e.g. due to data sensitivity (data may be from living human subjects) or data size (tens of terabytes). Further, predictions must be reproducible from the code provided by participants. Containerization is an excellent solution to these problems: rather than providing the data to the participants, we ask the participants to provide a Dockerized "trainable" model. We run both the training and validation phases of machine learning and guarantee reproducibility 'for free'.
We use the Docker tool suite to spin up and run servers in the cloud to process the queue of submitted containers, each essentially a batch job. This fleet can be scaled to match the level of activity in the challenge. We have used Docker successfully in our 2015 ALS Stratification Challenge and our 2015 Somatic Mutation Calling Tumour Heterogeneity (SMC-HET) Challenge, and are starting an implementation for our 2016 Digital Mammography Challenge.
This document discusses analyzing large sequencing datasets and summarizing metagenomic communities. It describes benchmarking different assembly methods on a mock community dataset. Digital normalization and partitioning treatments were found to save computational time without altering assembly results. Approximately 90% of genomes were recovered, with few misassemblies. Deeper sequencing is needed to fully reconstruct communities, with petabasepair sampling required. Computational resources must scale to analyze the large volumes of data that will be generated from deeper metagenomic surveys.
In this deck from the 2014 HPC User Forum in Seattle, Jack Collins from the National Cancer Institute presents: Genomes to Structures to Function: The Role of HPC.
Watch the video presentation: http://wp.me/p3RLHQ-d28
This document provides an introduction to next-generation sequencing (NGS) technologies. It discusses the history of DNA sequencing technologies leading up to NGS, including Sanger sequencing. It then describes the key principles of several major NGS platforms, including how they achieve massively parallel sequencing using amplification and signal detection. The document notes challenges of NGS like short read lengths, coverage depth needs, and large data volumes. It also outlines common applications of NGS data like variant detection and discusses future prospects.
HPC-MAQ: A PARALLEL SHORT-READ REFERENCE ASSEMBLER (cscpconf)
HPC-MAQ is a parallel implementation of the MAQ reference assembler that leverages Hadoop and MapReduce to enable MAQ to run in a distributed, parallel manner on large datasets. It addresses the computational challenges of reference assembly by splitting the input reads into chunks that are processed simultaneously. The MAQ algorithm is used to align reads to a reference sequence and call consensus genotypes. HPC-MAQ allows MAQ to scale to larger genomes by distributing the work across multiple nodes in a Hadoop cluster.
Similar to "2014-11-13 sbsm032 reproducible research"
This document discusses the experience of a researcher in genomics with applying FAIR and open approaches. It notes that making data and analysis methods FAIR and open can increase visibility, drive citations, and facilitate collaboration. However, it also makes it easier for competitors to access and use those resources without contributing. Striking the right balance between openness and protecting competitive advantages is challenging. Overall, the researcher finds FAIR and open principles have greatly increased the impact and robustness of their work, but there are also costs to consider.
2018-08 Reduce risks of genomics research (Yannick Wurm)
Geoffrey Chang, a protein crystallographer at The Scripps Research Institute, had his career trajectory disrupted when several of his high-profile papers describing protein structures had to be retracted. An in-house software program Chang's lab used to process diffraction data from protein crystals introduced a sign error that inverted the structures, invalidating biological interpretations. This included a 2001 Science paper describing the structure of the MsbA protein. A 2006 Nature paper by Swiss researchers casting doubt on Chang's MsbA structure led him to discover the software error. Chang and his co-authors sincerely regretted the confusion and unproductive research caused by the need to retract their influential papers.
Geoffrey Chang was a prominent structural biologist who received prestigious early career awards. However, his work came under scrutiny when other researchers discovered errors in his published protein structures due to a problem with his in-house data analysis software. This led Chang to retract 5 of his papers describing protein structures. The retractions were costly for Chang's career and reputation as well as for other researchers who had performed follow-up work based on the incorrect structures. The incident highlights the importance of using well-tested, reproducible analysis methods in scientific research.
Keynote talk given at Fairdom User meeting http://fair-dom.org/communities/users/barcelona-2016-first-user-meeting/ .
I begin by summarising how we apply molecular approaches to understand social behaviour in ants. Subsequently, I give an overview of the data-handling challenges the genomic bioinformatics community faces. Finally, I give an overview of some of the tools and approaches my lab have developed to help us get things done better, faster, more reliably and more reproducibly.
The document discusses the genetic basis of social organization in fire ant populations. Researchers used RAD sequencing of haploid males to discover SNPs and genotype individuals at over 2,400 loci. Principal component analysis separated individuals into two clusters corresponding to their social form (single or multiple queen), with the first principal component explaining over 12% of the variance. A region on chromosome 13 containing the Gp-9 gene was completely associated with social form. This research identified a major gene influencing an important social trait using next-generation sequencing techniques.
This document provides an agenda for a spring school on bioinformatics and population genomics, including practical sessions on analyzing genomic data from reads to reference genomes and gene predictions in 6 steps: inspecting and cleaning reads, genome assembly, assessing assembly quality, predicting protein-coding genes, assessing gene prediction quality, and assessing the overall process quality using biological measures. It also addresses wifi issues that could reduce bandwidth and lists the VM password.
This document provides information about a spring school on bioinformatics and population genomics that includes practical sessions. The sessions will cover topics like short read cleaning, genome assembly, gene prediction, quality control, mapping reads to call variants, visualizing variants, analyzing variants through PCA and measuring diversity and differentiation, inferring population sizes and gene flow, and analyzing gene expression from raw sequencing data to expression levels. The document lists the team of practitioners leading the sessions and encourages participants to share their favorite software packages.
2015-12-18 Avoid having to retract your genomics analysis - Popgroup Reprodu... (Yannick Wurm)
Brief (15min) talk I gave at #PopGroup49 in Edinburgh providing a few simple methods to reduce risk in genomics analyses.
Please cite: Avoid having to retract your genomics analysis (2015) Y Wurm. The Winnower 2, e143696.68941 https://thewinnower.com/papers/avoid-having-to-retract-your-genomics-analysis
This document contains information about programming in R, including practical examples. It discusses accessing and subsetting data, using regular expressions for text search, creating functions, and using loops. Examples are provided to demonstrate creating vectors, accessing subsets of vectors, using regular expressions to find patterns in text, creating functions to convert between units or estimate values, and using for loops to repeat operations over multiple elements. The document suggests R is useful for working with big data in biology and other fields due to its ability to automate tasks, integrate with other tools, and handle large datasets through programming.
This document describes oSwitch, a tool that allows easy access to other operating systems via one-line commands. It works by wrapping Docker containers, allowing commands to be run in different OS environments without disrupting the user's current environment. The document provides an example usage where a user is able to run an "abyss-pe" command in a Biolinux container after it is not found in their native OS. It notes how oSwitch aims to preserve the user's current working directory, login shell, home directory and file permissions during usage.
This document provides an outline for a lecture on the genetic basis of evolution. It begins with introducing key terms like gene, locus, allele, genotype, and phenotype. It then discusses genetic drift and how drift is influenced by population size. Selection is also introduced and defined as a process where individuals with different genotypes have different fitnesses. The document emphasizes that both genetic drift and selection influence evolution, and neither process should be overemphasized. It aims to move people away from only considering selection (pan-selectionism) and highlights the importance of genetic drift.
This document discusses human evolution and recent insights from genomics. It summarizes that Neanderthals were the closest evolutionary relatives to modern humans and lived in Europe and Western Asia until disappearing 30,000 years ago. A draft sequence of the Neanderthal genome from three individuals was presented, composed of over 4 billion nucleotides. Comparisons with five modern human genomes identified regions potentially affected by selection in ancestral modern humans, involving genes related to metabolism, cognition, and skeletal development. Analysis suggests Neanderthals shared more genetic variants with non-Africans, indicating gene flow from Neanderthals into their ancestors occurred before Eurasian groups diverged.
The document discusses analyzing ancient plant and insect DNA extracted from ice core samples in Greenland. Key points:
- Plant and insect DNA was recovered from silty ice samples taken between 2-3 km deep in the Dye 3 and JEG ice cores in Greenland, dating back to before the last glacial period.
- The DNA was identified as coming from tree species like pine and alder, indicating a boreal forest environment in southern Greenland at the time, rather than today's Arctic conditions.
- Other plant species identified include those from orders like Asterales, Poales, Rosales and Malpighiales. Insect DNA from Lepidoptera was also recovered.
This document provides an introduction to regular expressions (regex) for text search and pattern matching. It explains that regex allows for powerful text searches beyond simple keywords. Various special symbols and constructs are demonstrated that allow matching complex patterns and variants in text. Examples show matching names, sequences, microsatellite repeats and more with regex. Functions, loops and logical operators in R programming are also briefly covered.
The document discusses major geological drivers of evolution including tectonic plate movement, vulcanism, climate change, and meteorite impacts. Tectonic plate movement has caused continental drift and formation of supercontinents like Pangaea, affecting species distributions. Vulcanism causes both local and global climate changes through emission of gases and particles and formation of new land barriers and islands. Climate changes over geological timescales have also impacted evolution. Meteorite impacts have precipitated mass extinctions. These geological forces alter Earth's conditions and drive evolution through large-scale migrations, speciation events, mass extinctions, and adaptive radiations.
This document provides an overview and schedule for the course "SBC 361 Research Methods & Comms". The course is a mixture of advanced analytical skills taught in computer labs using the programming language R, and theoretical content covered in lectures and workshops. It includes two workshops on careers in science and popular science writing. Students will complete assignments involving the computer practicals and tutorials, and a mock exam. The schedule details the topics to be covered each week by different professors and teaching staff. It emphasizes the importance of attending classes, completing required work, and doing additional outside reading to succeed in the course.
This document discusses computational methods and challenges for genome assembly using next-generation sequencing data. It describes the four main stages of genome assembly as preprocessing filtering, graph construction, graph simplification, and postprocessing filtering. Each stage processes the data from the previous stage to build the assembly graph and reduce complexity, though some assemblers delay filtering steps.
This document outlines the course SBC322 Ecological and Evolutionary Genomics. It discusses how new genomic technologies have changed ecology and evolution research by merging molecular and ecological approaches. It aims to critically evaluate research questions, methods, experimental designs and applications in ecological and evolutionary genomics. The course will improve students' skills in critically reading literature, understanding interdisciplinary science, and oral and written scientific communication through interactive small group work, informal and formal presentations, blog posts, and peer review.
The document provides an overview of topics covered in a bioinformatics course, including using Unix, bioinformatics algorithms, biological databases, sequencing technologies, and genome assembly and variant identification. It lists challenges for students in each topic area and provides examples of concepts that will be covered, such as using HPC systems, dynamic programming for sequence alignment, accessing databases like NCBI, processing sequencing data, and identifying variants from assembly. Images are included of different organisms like ants and sequencing technologies. The document aims to outline the scope and challenges of the bioinformatics course.
Sustainable Software Institute Collaboration workshop (Yannick Wurm)
The document discusses tools for analyzing biological data. It summarizes four tools:
1. SequenceServer - A simple web interface for BLAST that handles formatting and installing BLAST locally.
2. oSwitch - Allows rapidly switching between operating systems and container environments to access specific bioinformatics software without installation.
3. GeneValidator - Helps curate gene predictions by identifying problematic predictions, choosing best alternative models, and aiding manual curation of individual genes.
4. Afra - A crowdsourcing platform that aims to crowdsource the visual inspection and correction of gene models by recruiting and training students, ensuring quality through tutorials, redundancy and senior review, and creating small, simple initial tasks.
13. Aquaculture in Offshore Zones
[Scanned Science "Letters" page, edited by Etta Kavanagh]
Retraction
We wish to retract our Research Article "Structure of MsbA from E. coli: A homolog of the multidrug resistance ATP binding cassette (ABC) transporters" and both of our Reports "Structure of the ABC transporter MsbA in complex with ADP•vanadate and lipopolysaccharide" and "X-ray structure of the EmrE multidrug transporter in complex with a substrate" (1–3).
The recently reported structure of Sav1866 (4) indicated that our MsbA structures (1, 2, 5) were incorrect in both the hand of the structure and the topology. Thus, our biological interpretations based on these inverted models for MsbA are invalid.
An in-house data reduction program introduced a change in sign for anomalous differences. This program, which was not part of a conventional data processing package, converted the anomalous pairs (I+ and I-) to (F- and F+), thereby introducing a sign change. As the diffraction data collected for each set of MsbA crystals and for the EmrE crystals were processed with the same program, the structures reported in (1–3, 5, 6) had the wrong hand.
The error in the topology of the original MsbA structure was a consequence of the low resolution of the data as well as breaks in the electron density for the connecting loop regions. Unfortunately, the use of the multicopy refinement procedure still allowed us to obtain reasonable refinement values for the wrong structures.
The Protein Data Bank (PDB) files 1JSQ, 1PF4, and 1Z2R for MsbA and 1S7B and 2F2M for EmrE have been moved to the archive of obsolete PDB entries. The MsbA and EmrE structures will be recalculated from the original data using the proper sign for the anomalous differences, and the new Cα coordinates and structure factors will be deposited.
We very sincerely regret the confusion that these papers have caused and, in particular, subsequent research efforts that were unproductive as a result of our original findings.
GEOFFREY CHANG, CHRISTOPHER B. ROTH,
CHRISTOPHER L. REYES, OWEN PORNILLOS,
YEN-JU CHEN, ANDY P. CHEN
Department of Molecular Biology, The Scripps Research Institute, La Jolla, CA 92037, USA.
References
1. G. Chang, C. B. Roth, Science 293, 1793 (2001).
2. C. L. Reyes, G. Chang, Science 308, 1028 (2005).
3. O. Pornillos, Y.-J. Chen, A. P. Chen, G. Chang, Science 310, 1950 (2005).
4. R. J. Dawson, K. P. Locher, Nature 443, 180 (2006).
5. G. Chang, J. Mol. Biol. 330, 419 (2003).
6. C. Ma, G. Chang, Proc. Natl. Acad. Sci. U.S.A. 101, 2852 (2004).
14. Reproducible Research & Sustainable Software
• Avoid costly mistakes
• Be faster: “stand on the shoulders of giants”
• Increase impact / visibility
16. “Big data” biology is hard.
• Biology/life is complex
• Field is young.
• Biologists lack computational training.
• Generally, analysis tools suck.
• badly written
• badly tested
• hard to install
• output quality… often questionable.
• Understanding/visualizing/massaging data is hard.
• Datasets continue to grow!
20. Best Practices for Scientific Computing
Greg Wilson, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, Katy Huff, Ian M. Mitchell, Mark D. Plumbley, Ben Waugh, Ethan P. White, and Paul Wilson
arXiv:1210.0530v3 [cs.MS] 29 Nov 2012
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists' productivity and the reliability of their software.
Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.
1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don’t repeat yourself (or others).
7. Plan for mistakes.
8. Optimize software only after it works correctly.
9. Document the design and purpose of code rather than its mechanics.
10. Conduct code reviews.
21.
22.
23. Specific Approaches/Tools
• Planning for mistakes
• Automated testing (a minimal sketch follows this list)
• Continuous integration
• Writing for people: use a style guide
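The slides themselves do not include a testing example; as a minimal, hypothetical illustration of automated testing in R (the function name and values below are made up for this sketch, not taken from the deck):
# Hypothetical example: a tiny function plus automated checks in base R.
# (A package such as testthat offers richer tooling; stopifnot() needs nothing extra.)
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}

# These checks stop with an error if any expectation fails.
stopifnot(
  celsius_to_fahrenheit(0)   == 32,
  celsius_to_fahrenheit(100) == 212,
  celsius_to_fahrenheit(-40) == -40
)
Run as part of every analysis (or on every commit via continuous integration), such checks catch silent breakage early.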
24. Code for people: Use a style guide
• For R: http://r-pkgs.had.co.nz/style.html
26. Coding for people: Indent your code! (after Damian Conway)
http://github.com/
27. R style guide extract
Line length
Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.

ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt', header=TRUE, sep='\t', col.names = c('colony', 'individual', 'headwidth', 'mass'))

ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt', header=TRUE,
                               sep='\t', col.names = c('colony', 'individual', 'headwidth', 'mass'))

ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt',
                               header = TRUE,
                               sep = '\t',
                               col.names = c('colony', 'individual', 'headwidth', 'mass'))
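One way to act on that advice, sketched here rather than taken from the slides (the helper name read_ant_measurements() is hypothetical):
# Sketch: encapsulate the long read.table() call in a small helper function
# so each call site stays well under 80 characters.
read_ant_measurements <- function(path) {
  read.table(file = path,
             header = TRUE,
             sep = '\t',
             col.names = c('colony', 'individual', 'headwidth', 'mass'))
}

ant_measurements <- read_ant_measurements('~/Downloads/Web/ant_measurements.txt')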
28. Code for people: Use a style guide
• For R: http://r-pkgs.had.co.nz/style.html
• For Ruby: https://github.com/bbatsov/ruby-style-guide
Automatically check your code:
install.packages("lint")  # once
library(lint)             # every time
lint("file_to_check.R")
30. knitr (sweave): Analyzing & reporting in a single file.
analysis.Rmd
A minimal R Markdown example
I know the value of pi is 3.1416, and 2 times pi is 6.2832.
To compile: library(knitr); knit("minimal.Rmd")
A paragraph here. A code chunk below:
1 + 1
## [1] 2
### in R:
library(knitr)
knit("analysis.Rmd")
# -- creates analysis.md
### in shell:
pandoc analysis.md -o analysis.pdf
# -- creates analysis.pdf
.4 - .7 + .3  # what? it is not zero!
## [1] 5.551e-17
Graphics work too
library(ggplot2)
qplot(speed, dist, data = cars) + geom_smooth()
Figure 1: A scatterplot of cars (dist vs. speed, with a smoothed fit)
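Putting the fragments above together, a minimal analysis.Rmd might look as follows (a sketch; the chunk contents are the examples shown on the slide, and the built-in cars data set ships with R):
# A minimal R Markdown example

A paragraph here. A code chunk below:

```{r}
1 + 1
.4 - .7 + .3  # floating-point arithmetic: not exactly zero
```

Graphics work too:

```{r cars-plot}
library(ggplot2)
qplot(speed, dist, data = cars) + geom_smooth()
```
Knitting the file as shown above (library(knitr); knit("analysis.Rmd"), then pandoc analysis.md -o analysis.pdf) turns this single file into a report containing the code, its output, and the figure.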
32. Education
A Quick Guide to Organizing Computational Biology
Projects
William Stafford Noble1,2*
1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America, 2 Department of Computer Science and
Engineering, University of Washington, Seattle, Washington, United States of America
Introduction
Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.
In each results folder:
• script getResults.rb
• intermediates
• output
Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts.
doi:10.1371/journal.pcbi.1000424.g001
The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on […] understanding your work or who may be evaluating your research skills. Most commonly, however, that "someone" is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.
This leads to the second principle, which is actually more like a version of Murphy's Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you've been working on over the past month, will probably […] under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.
Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.
Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. […] With this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment.
The Lab Notebook
In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, […] These types of entries provide a complete picture of the development of the project over time.
In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer […]
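As a concrete illustration (not from Noble's article), a few lines of R can create the kind of project skeleton described above; the project name msms and the example date come from Figure 1, while the manuscript subdirectory name is purely illustrative:
# Sketch: set up the directory skeleton that the excerpt describes.
project_root <- "msms"
subdirs <- c("data",                # fixed data sets, each with a README noting source URL and date
             "results/2009-01-15",  # one dated subdirectory per computational experiment
             "doc/paper1",          # one subdirectory per manuscript (name is illustrative)
             "src",                 # source code
             "bin")                 # compiled binaries or scripts
for (d in subdirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}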
39. Choosing a programming language
Excel. Good: quick & dirty. Bad: easy to make mistakes; doesn't scale.
R. Good: numbers, stats, genomics. Bad: programming.
Unix command line (== shell == bash). Good: can't escape it; quick & dirty; HPC. Bad: programming, complicated things.
Java. Good: 1990s user interfaces. Bad: overcomplicated.
Perl. 1980s. Everything.
Python. Good: scripting, text. Bad: ugly.
Ruby. Good: scripting, text.
Javascript/Node. Good: scripting, flexibility (web client), community. Bad: only little bio-stuff.
40. Ruby. “Friends don’t let friends do Perl” - reddit user
• example: “reverse each line in file”
• read file; with each line
• remove the invisible “end of line” character
• reverse the contents
• print the reversed line
### in Perl:
open(INFILE, 'my_file.txt');
while (defined ($line = <INFILE>)) {
    chomp($line);
    @letters = split(//, $line);
    @reverse_letters = reverse(@letters);
    $reverse_string = join('', @reverse_letters);
    print $reverse_string, "\n";
}
### in Ruby:
File.open('my_file.txt').each { |line|
  puts line.chomp.reverse
}
41. More Ruby examples.
5.times {
  puts 'Hello world'
}
# Sorting people
people_sorted_by_age = people.sort_by { |person| person.age }
+ many tools for bio-data, e.g. check http://biogems.info
42. Getting help.
• In real life: Make friends with people. Talk to them.
• Online:
• Specific discussion mailing lists (e.g.: R, Stacks, bioruby, MAKER...)
• Programming: http://stackoverflow.com
• Bioinformatics: http://www.biostars.org
• Sequencing-related: http://seqanswers.com
• Stats: http://stats.stackexchange.com
• Codeschool!
46. Anurag Priyam,
Sure, I can help you…
Mechanical engineering student, IIT Kharagpur
47. “Can you BLAST this for me?”
Antgenomes.org SequenceServer
BLAST made easy
(well, we’re trying...)
Aim: An open-source, idiot-proof web interface for custom BLAST