This document provides guidance on organizing computational biology projects. It recommends:
1. Organizing related projects together under a common root directory, with each project in its own subdirectory.
2. Using a logical top-level structure within each project with chronological organization at lower levels. This allows tracking experiments over time.
3. Including directories for data, source code, results, and documentation. Results subdirectories should be named with dates so that a plain directory listing sorts experiments chronologically (see the sketch below).
Taking detailed, dated notes in lab notebooks or Markdown files integrated with analysis code allows fully documenting projects and easily repeating or extending prior work.
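A minimal sketch of such a skeleton in R (the language used throughout these materials); the project name, directory names, and notebook entry are invented for illustration:

```r
# Sketch only: create a project skeleton with dated results directories.
# "fire_ant_rnaseq" and all paths below are hypothetical.
project <- "fire_ant_rnaseq"
today   <- format(Sys.Date(), "%Y-%m-%d")
dirs    <- file.path(project, c("data", "src", "doc",
                                file.path("results", today)))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# Append a dated notebook entry kept alongside the analysis code.
cat(sprintf("## %s\nInitial QC; outputs in results/%s/\n", today, today),
    file = file.path(project, "doc", "notebook.md"), append = TRUE)
```

Because the results/ subdirectories carry ISO-style dates, an alphabetical listing already orders experiments chronologically.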
The document discusses the challenges and opportunities that will arise from the exponential growth of biological data in the coming years. It outlines four key areas: 1) Research approaches will need to effectively analyze infinite amounts of data. 2) Software and decentralized infrastructure will be needed to process the data. 3) Open science and reproducible research practices are important for data-driven biology. 4) Training the next generation of biologists in data analysis skills will be a major challenge. The document advocates for open source tools, reproducible research methods, and expanded training programs to help biology take advantage of the coming data deluge.
Women Who Code-HSV Event:
'An Introduction to Machine Learning and Genomics'. Dr. Lasseigne will introduce the R programming language and the foundational concepts of machine learning with real-world examples including applications in the field of genomics with an emphasis on complex human disease research.
Brittany Lasseigne, PhD, is a postdoctoral fellow in the lab of Dr. Richard Myers at the HudsonAlpha Institute for Biotechnology and a 2016-2017 Prevent Cancer Foundation Fellow. Dr. Lasseigne received a BS in biological engineering from the James Worth Bagley College of Engineering at Mississippi State University and a PhD in biotechnology science and engineering from The University of Alabama in Huntsville. As a graduate student, she studied the role of epigenetics and copy number variation in cancer, identifying novel diagnostic biomarkers and prognostic signatures associated with kidney cancer. In her current position, Dr. Lasseigne’s research focus is the application of genetics and genomics to complex human diseases. Her recent work includes the identification of gene variants linked to ALS, characterization of gene expression patterns in schizophrenia and bipolar disorder, and development of non-invasive biomarker assays. Dr. Lasseigne is currently focused on integrating genomic data across cancers with functional annotations and patient information to explore novel mechanisms in cancer etiology and progression, identify therapeutic targets, and understand genomic changes associated with patient survival. Based upon those analyses, she is creating tools to share with the scientific community.
This document discusses best practices for organizing computational biology projects. It recommends creating a directory structure with folders for source code, data, documentation, results and binaries/executables. Data folders should include README files explaining where the data came from. Version control is important to track changes over time. Comments and documentation will help others understand the project and allow researchers to revisit past work without reconstructing their experiments from scratch. Organizing and documenting projects thoroughly makes computational experiments more reproducible, understandable and useful to both the original researchers and others in the future.
Reproducibility and Scientific Research: why, what, where, when, who, how - Carole Goble
This document discusses the importance of reproducibility in scientific research. It makes three key points:
1. For results to be considered valid, scientific publications should provide clear descriptions of methods and protocols so that other researchers can successfully repeat and extend the work.
2. Many factors can undermine reproducibility, such as publication pressures, poor training, disorganization, and outright fraud. Ensuring reproducible research requires transparency across experimental designs, data, software, and computational workflows.
3. Achieving reproducible science is challenging and poorly incentivized due to the resources and time required to prepare materials for independent verification. Overcoming these issues will require collective effort across the research community.
This document provides an overview and schedule for the course "SBC 361 Research Methods & Comms". The course is a mixture of advanced analytical skills taught in computer labs using the programming language R, and theoretical content covered in lectures and workshops. It includes two workshops on careers in science and popular science writing. Students will complete assignments involving the computer practicals and tutorials, and a mock exam. The schedule details the topics to be covered each week by different professors and teaching staff. It emphasizes the importance of attending classes, completing required work, and doing additional outside reading to succeed in the course.
A large part of the NECDMC curriculum uses case studies to teach best practices in data management for many different science disciplines. This presentation goes through the methodology of a case study, how to develop a case study, and presents an actual example of a research case study.
This document discusses problems with traditional scholarly publishing and proposes solutions centered around open data and transparency. It notes that traditional publishing hinders reproducibility due to lack of access to data and methods. This has led to an increasing number of non-reproducible findings and retractions. The document advocates for incentivizing the publication of data, software, workflows and other research objects to improve reproducibility and transparency. It highlights several examples where making these elements openly available improved scrutiny and identified errors in published works.
Capturing Context in Scientific Experiments: Towards Computer-Driven Science - dgarijo
Scientists publish computational experiments in ways that do not facilitate reproducibility or reuse. Significant domain expertise, time and effort are required to understand scientific experiments and their research outputs. To improve this situation, mechanisms are needed to capture the exact details and the context of computational experiments. Only then will intelligent systems be able to help researchers understand, discover, link and reuse the products of existing research.
In this presentation I will introduce my work and vision towards enabling scientists to share, link, curate and reuse their computational experiments and results. In the first part of the talk, I will present my work on capturing and sharing the context of scientific experiments using scientific workflows and machine-readable representations. Thanks to this approach, experiment results are described unambiguously, have a clear trace of their creation process, and include a pointer to the sources used to generate them. In the second part of the talk, I will describe examples of how the context of scientific experiments may be exploited to browse, explore and inspect research results. I will end the talk by presenting new ideas for improving and benefiting from the capture of experimental context, and for involving scientists in curating and creating abstractions over available research metadata.
Genome sharing projects around the world, Nijmegen, Oct 29 2015 - Fiona Nielsen
Genome sharing projects across the world
Did you ever wonder what happened to the exponential increase in genome sequencing data? It is out there around the world and a lot of it is consented for research use. This means that if you just know where to find the data, you can potentially analyse gigabytes of data to power your research.
In this talk Fiona will present community genome initiatives and genome sharing projects across the world, how you can benefit from this wealth of data in your work, and how you can boost your academic career through sharing and collaboration.
by Fiona Nielsen, Founder and CEO of DNAdigest and Repositive
With a background in software development, Fiona pursued her career in bioinformatics research at Radboud University Nijmegen. Now a scientist-turned-entrepreneur, Fiona founded DNAdigest and its social enterprise spin-out Repositive Ltd. Both the charity and the company focus on efficient and ethical sharing of genetic data for research, to accelerate diagnostics and cures for genetic diseases.
Tracking Social Practices with Big(ish) Data - Ben Anderson
Paper presented at the 'Methodology' session of the Practices, the Built Environment and Sustainability Early Career Researcher Network workshop, 26-27 June 2014, Cambridge.
Towards Incidental Collaboratories: Research Data Services - Anita de Waard
This document discusses enabling "incidental collaboratories" by collecting and connecting biological research data through a centralized framework. It argues that biology research is currently quite isolated due to its small scale and competitive nature. The framework would involve storing experimental data with metadata, allowing analyses across similar experiment types and biological subjects, and preserving data long-term with access controls. This could help move labs from being isolated to being "sensors in a network" and address objections around data ownership and quality.
Scott Edmunds: Channeling the Deluge: Reproducibility & Data Dissemination in the “Big-Data” Era - GigaScience, BGI Hong Kong
Scott Edmunds' talk at the 7th International Conference on Genomics (ICG7): “Channeling the Deluge: Reproducibility & Data Dissemination in the ‘Big-Data’ Era”. Hong Kong, 1 December 2012.
COM 578 Empirical Methods in Machine Learning and Data Mining - butest
This document provides an overview and summary of the COM 578 Empirical Methods in Machine Learning and Data Mining course. It outlines the course topics, grading structure, office hours, homework assignments, final project, and textbooks. Key topics covered in the course include decision trees, k-nearest neighbor, neural networks, support vector machines, association rules, clustering, and boosting/bagging. The final project involves applying machine learning techniques to train models on two different datasets.
This document discusses reproducible research and provides guidance on how to conduct research in a reproducible manner. It covers:
1. The importance of reproducible research due to large datasets, computational analyses, and the potential for human error. Ensuring reproducibility requires new expertise and infrastructure.
2. Key aspects of reproducible research include data management plans, version control, use of file formats and software/tools that allow reproducibility, and publishing data and code to allow others to replicate results.
3. Reproducible research benefits the scientific community by increasing transparency and allows researchers to re-analyze their own data in the future. Journals and funders are increasingly requiring reproducibility.
ISMB/ECCB 2013 Keynote, Goble: Results may vary: what is reproducible? why do o... - Carole Goble
Keynote given by Carole Goble on 23rd July 2013 at ISMB/ECCB 2013
http://www.iscb.org/ismbeccb2013
How could we evaluate research and researchers? Reproducibility underpins the scientific method: at least in principle, if not in practice. The willing exchange of results and the transparent conduct of research can only be expected up to a point in a competitive environment. Contributions to science are acknowledged, but not if the credit is for data curation or software. From a bioinformatics viewpoint, how far could our results be reproducible before the pain is just too high? Is open science a dangerous, utopian vision or a legitimate, feasible expectation? How do we move bioinformatics from results that are post-hoc "made reproducible" to results that are pre-hoc "born reproducible"? And why, in our computational information age, do we communicate results through fragmented, fixed documents rather than cohesive, versioned releases? I will explore these questions drawing on 20 years of experience in both the development of technical infrastructure for Life Science and the social infrastructure in which Life Science operates.
This document discusses the challenges and opportunities biology faces with increasing data generation. It outlines four key points:
1) Research approaches for analyzing effectively infinite genomic data streams, such as digital normalization, which compresses data while retaining information (see the sketch after this list).
2) The need for usable software and decentralized infrastructure to perform real-time, streaming data analysis.
3) The importance of open science and reproducibility given most researchers cannot replicate their own computational analyses.
4) The lack of data analysis training in biology and efforts at UC Davis to address this through workshops and community building.
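A toy version of the digital-normalization idea from point 1, as a hedged sketch: it assumes short, error-free reads and keeps exact k-mer counts in memory, whereas real implementations (such as the khmer software from the same group) use probabilistic counting to stay within fixed memory.

```r
# Toy digital normalization: keep a read only while the median coverage
# of its k-mers (among reads kept so far) is still below a cutoff.
diginorm <- function(reads, k = 4, cutoff = 5) {
  counts <- new.env(hash = TRUE)   # exact counts; real tools approximate
  kept <- character(0)
  for (read in reads) {
    kmers <- substring(read, 1:(nchar(read) - k + 1), k:nchar(read))
    cov <- vapply(kmers, get0, numeric(1), envir = counts, ifnotfound = 0)
    if (median(cov) < cutoff) {    # read still adds information: keep it
      kept <- c(kept, read)
      for (km in kmers)
        assign(km, get0(km, envir = counts, ifnotfound = 0) + 1,
               envir = counts)
    }
  }
  kept
}

length(diginorm(rep("ACGTACGTACGT", 500)))   # 500 identical reads -> 3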
1. The document discusses best practices for scientific software development including writing code for people to read, automating repetitive tasks, using version control, and avoiding redundancy.
2. Specific approaches mentioned are planning for mistakes, automated testing, continuous integration, and using style guides to ensure code is readable and consistently formatted.
3. Knitting allows analysis and reporting to live in a single file by embedding R code chunks in Markdown documents, as in the sketch below.
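For instance, a minimal R Markdown file (contents invented for illustration); knitting it runs the chunk and weaves code, results and prose into one report:

````markdown
---
title: "Example knitted report"
output: html_document
---

Summary statistics are recomputed every time the document is knit,
so the narrative and the results cannot drift apart:

```{r expression-summary}
# Hypothetical input file; substitute your own data.
expr <- read.csv("data/expression.csv")
summary(expr)
```
````

Rendering with rmarkdown::render("report.Rmd") (or RStudio's Knit button) regenerates the whole report from source.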
This document provides an overview of genomic tools and best practices for scientific computing. It discusses SequenceServer, a tool for BLAST searches, and Bionode, a collection of Node.js modules for bioinformatics. It also discusses challenges with gene prediction and introduces GeneValidator, a tool for visual inspection and manual correction of gene predictions. Key points include automating repetitive tasks, writing code for people through style guides, and using version control and modularization to improve code quality and reproducibility.
This document summarizes a keynote presentation about challenges in bioinformatics software development and proposed solutions. Some of the key points made include: 1) bioinformatics software development involves multiple disciplines including computer science, software engineering, statistics, and biology, each with different priorities; 2) there is a massive proliferation of bioinformatics software packages that leads to many difficult choices for researchers; 3) proposed solutions include developing software in a more modular and automated way, using common benchmarks and protocols to evaluate tools, and focusing on reproducibility and usability.
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article:
Carole Goble, “Better Software, Better Research”, IEEE Internet Computing, vol. 18, no. 5 (Sept.-Oct. 2014), pp. 4-8, IEEE Computer Society.
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
This document discusses openness and reproducibility in computational science. It begins with an introduction and background on the challenges of analyzing non-model organisms. It then describes the goals and challenges of shotgun sequencing analysis, including assembly, counting, and variant calling. It emphasizes the need for efficient data structures, algorithms, and cloud-based analysis to handle large datasets. The document advocates for open science practices like publishing code, data, and analyses to ensure reproducibility of computational results.
Presentation for Harvard's ABCD Technology in Education group:
The Institute for Quantitative Social Science (IQSS) is a unique entity at Harvard - it combines research, software development, and specialized services to provide innovative solutions to research and scholarship problems at Harvard and beyond. I will talk about the software projects that IQSS is currently working on (Dataverse, Zelig, Consilience, and OpenScholar), including the research and development processes, the benefits provided to the Harvard community, and the impacts on research and scholarship.
Docker in Open Science Data Analysis Challenges by Bruce Hoff - Docker, Inc.
Typically in predictive data analysis challenges, participants are provided with a dataset and asked to make predictions. Participants include, with their predictions, the scripts/code used to produce them. Challenge administrators validate the winning model by reconstructing and running the source code.
Often data cannot be provided to participants directly, e.g. due to data sensitivity (data may be from living human subjects) or data size (tens of terabytes). Further, predictions must be reproducible from the code provided by participants. Containerization is an excellent solution to these problems: rather than providing the data to the participants, we ask the participants to provide a Dockerized "trainable" model. We run both the training and validation phases of machine learning and guarantee reproducibility 'for free'.
We use the Docker tool suite to spin up and run servers in the cloud to process the queue of submitted containers, each essentially a batch job. This fleet can be scaled to match the level of activity in the challenge. We have used Docker successfully in our 2015 ALS Stratification Challenge and our 2015 Somatic Mutation Calling Tumour Heterogeneity (SMC-HET) Challenge, and are starting an implementation for our 2016 Digital Mammography Challenge.
Research results in peer-reviewed publications are reproducible, right? If only it was so clear cut. With high profile paper retractions and pushes for better data sharing by funders, publishers and the community, the spotlight is now focussing on the whole way research is conducted around the world.
This talk from the Software Sustainability Institute's Collaborations Workshop 2014 describes how cloud computing, with Microsoft Azure, is helping researchers realize the goals of scientific reproducibility.
Find out more at www.azure4research.com
This document summarizes a talk given by C. Titus Brown on best practices for scientific computing. Brown is an assistant professor at Michigan State University who works in microbiology, computer science, and bioinformatics. The talk discusses using version control, writing tests to prevent mistakes, automating repetitive tasks, and documenting code design and purpose. It emphasizes that adopting a few key practices can improve scientists' efficiency and the correctness and reproducibility of their work.
This document summarizes a talk titled "A Wager for 2016: How Software Will Beat Hardware in Biological Data Analysis". The talk discusses how software approaches can outpace hardware for analyzing large biological datasets. It notes that current variant calling approaches have limitations because they are I/O intensive and require multiple passes over the data. The talk introduces approaches using lossy compression and streaming algorithms that can perform analysis more efficiently, using less memory and a single pass. This could enable analyzing a human genome on a desktop computer by 2016, as wagered. The talk argues that with better algorithmic tools, biological data analysis need not require large computers and can scale with the information content of the data rather than its raw size.
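In that spirit, a hedged sketch of a fixed-memory, single-pass k-mer counter; the hash function and bucket count are invented for illustration, and bucket collisions are precisely what makes the approach lossy:

```r
# Toy streaming counter: one pass over the reads, constant memory.
count_kmers_lossy <- function(reads, k = 8, n_buckets = 2^16) {
  buckets <- integer(n_buckets)
  hash <- function(km)   # hand-rolled; distinct k-mers may collide
    (sum(utf8ToInt(km) * seq_len(nchar(km)) * 131L) %% n_buckets) + 1L
  for (read in reads) {
    kmers <- substring(read, 1:(nchar(read) - k + 1), k:nchar(read))
    for (km in kmers) {
      i <- hash(km)
      buckets[i] <- buckets[i] + 1L
    }
  }
  buckets   # size is fixed no matter how much data streams through
}
```

Looking up a k-mer's bucket afterwards gives an upper bound on its true count, in the manner of a one-row count-min sketch: memory tracks the table size, not the data size.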
Open-source tools for generating and analyzing large materials data sets - Anubhav Jain
This document discusses open-source software tools for generating and analyzing large materials data sets developed by Anubhav Jain and collaborators. It summarizes several software packages including pymatgen for materials analysis, FireWorks for scientific workflows, custodian for error recovery in calculations, and matminer for data mining. Applications of the tools include generating the Materials Project database containing properties of over 65,000 materials compounds calculated using high-performance computing resources. The document emphasizes the importance of open-source collaborative software development and automation to accelerate materials discovery.
1. The document discusses best practices for scientific software development, including writing code for people rather than computers, automating repetitive tasks, using version control, and conducting code reviews.
2. Specific approaches and tools recommended are planning for mistakes, automated testing (see the sketch after this list), continuous integration, and using a coding style guide; R and Ruby style guides are provided as examples.
3. The benefits of following such practices are improving productivity, reducing errors, making code easier to read and maintain, and allowing scientists to focus on scientific questions rather than software issues. Reproducible and sustainable software is the overall goal.
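To make the automated-testing point concrete, a minimal unit test with the testthat R package; normalise_counts is an invented example function, not one from the talk:

```r
library(testthat)

# A small helper we want to protect against silent regressions.
normalise_counts <- function(counts) {
  stopifnot(is.numeric(counts), all(counts >= 0))
  counts / sum(counts)
}

test_that("normalise_counts returns proportions summing to 1", {
  expect_equal(sum(normalise_counts(c(2, 3, 5))), 1)
  expect_equal(normalise_counts(c(1, 1)), c(0.5, 0.5))
  expect_error(normalise_counts(c(-1, 2)))   # a planned-for mistake
})
```

Run on every commit by a continuous-integration service, tests like this catch the silent errors that prose review misses.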
This document discusses complex metagenome assembly and career thoughts in bioinformatics. It begins with the speaker's research background and then discusses two main topics: 1) challenges with metagenome assembly due to low coverage regions and strain variation in sequencing data, and approaches using assembly graphs, and 2) the need for more "bioinformaticians in the middle" who are comfortable with both biology and computational analysis to integrate large-scale data into their research. The speaker provides advice for embracing computation and seeking formal training opportunities to develop skills at this intersection of disciplines.
Past, Present, and Future of Analyzing Software Data - Jeongwhan Choi
The document discusses the past, present, and future of analyzing software data. It traces the evolution from early pioneers in the 1950s and 1960s who began quantifying aspects of software like size and complexity, to modern academic experiments applying machine learning techniques in the 1980s-2000s, to widespread industrial adoption and conferences focused on the topic today. The future is predicted to include more data, algorithms, roles for data scientists, and real-time analysis to address big data challenges.
This document summarizes a presentation about Myria, a relational algorithmics-as-a-service platform developed by researchers at the University of Washington. Myria allows users to write queries and algorithms over large datasets using declarative languages like Datalog and SQL, and executes them efficiently in a parallel manner. It aims to make data analysis scalable and accessible for researchers across many domains by removing the need to handle low-level data management and integration tasks. The presentation provides an overview of the Myria architecture and compiler framework, and gives examples of how it has been used for projects in oceanography, astronomy, biology and medical informatics.
This document discusses computational reproducibility challenges in analyzing non-model organism sequencing data. It describes how shotgun sequencing is used to assemble genomes and transcriptomes and measure gene expression without a reference genome. K-mers are introduced as an implicit alignment method using overlapping fragments. Efficient data structures and algorithms are needed to analyze the large amounts of redundant sequencing data while retaining information. The author's lab approach is to develop novel methods at scale and apply them to real problems, then release everything openly online to enable reproducibility.
Scientific Software: Sustainability, Skills & Sociology - Neil Chue Hong
This document discusses the importance of software sustainability. It notes that software is everywhere, long-lived, and hard to define, which makes sustainability challenging. It emphasizes that software sustainability requires cultivating skills in developers and researchers, providing proper incentives, and recognizing that people are a key part of maintaining software over long periods of time.
eScience: A Transformed Scientific Method - Duncan Hull
The document discusses the concept of eScience, which involves synthesizing information technology and science. It explains how science is becoming more data-driven and computational, requiring new tools to manage large amounts of data. It recommends that organizations foster the development of tools to help with data capture, analysis, publication, and access across various scientific disciplines.
UMich CI Days: Scaling a code in the human dimension - matthewturk
This document provides an overview of the yt astrophysics analysis and visualization toolkit. It discusses yt's goals of addressing physical rather than computational questions and of getting out of the way of analysis. It also covers yt's community aspects, including the challenges of developing open source scientific software and the strategies yt uses, such as reducing barriers to entry, open communication, and emphasizing a community of peers. Key points discussed are deliberately designing the community you want, the challenges of academic reward structures, and yt's successes, such as its development by working astrophysicists and its use on supercomputers.
Similar to 2013-10-30-sbc361-reproducible-designs-and-sustainable-software (20)
This document discusses one genomics researcher's experience with applying FAIR and open approaches. It notes that making data and analysis methods FAIR and open can increase visibility, drive citations, and facilitate collaboration. However, it also makes it easier for competitors to access and use those resources without contributing in return. Striking the right balance between openness and protecting competitive advantages is challenging. Overall, the researcher finds that FAIR and open principles have greatly increased the impact and robustness of their work, though there are also costs to consider.
2018-08: Reduce risks of genomics research - Yannick Wurm
Geoffrey Chang, a protein crystallographer at The Scripps Research Institute, had his career trajectory disrupted when several of his high-profile papers describing protein structures had to be retracted. An in-house software program Chang's lab used to process diffraction data from protein crystals introduced a sign error that inverted the structures, invalidating biological interpretations. This included a 2001 Science paper describing the structure of the MsbA protein. A 2006 Nature paper by Swiss researchers casting doubt on Chang's MsbA structure led him to discover the software error. Chang and his co-authors sincerely regretted the confusion and unproductive research caused by the need to retract their influential papers.
Geoffrey Chang was a prominent structural biologist who received prestigious early career awards. However, his work came under scrutiny when other researchers discovered errors in his published protein structures due to a problem with his in-house data analysis software. This led Chang to retract 5 of his papers describing protein structures. The retractions were costly for Chang's career and reputation as well as for other researchers who had performed follow-up work based on the incorrect structures. The incident highlights the importance of using well-tested, reproducible analysis methods in scientific research.
Keynote talk given at Fairdom User meeting http://fair-dom.org/communities/users/barcelona-2016-first-user-meeting/ .
I begin by summarising how we apply molecular approaches to understand social behaviour in ants. Subsequently, I give an overview of the data-handling challenges the genomic bioinformatics community faces. Finally, I give an overview of some of the tools and approaches my lab have developed to help us get things done better, faster, more reliably and more reproducibly.
The document discusses the genetic basis of social organization in fire ant populations. Researchers used RAD sequencing of haploid males to discover SNPs and genotype individuals at over 2,400 loci. Principal component analysis separated individuals into two clusters corresponding to their social form (single or multiple queen), with the first principal component explaining over 12% of the variance. A region on chromosome 13 containing the Gp-9 gene was completely associated with social form. This research identified a major gene influencing an important social trait using next-generation sequencing techniques.
This document provides an agenda for a spring school on bioinformatics and population genomics, including practical sessions on analyzing genomic data from reads to reference genomes and gene predictions in 6 steps: inspecting and cleaning reads, genome assembly, assessing assembly quality, predicting protein-coding genes, assessing gene prediction quality, and assessing the overall process quality using biological measures. It also addresses wifi issues that could reduce bandwidth and lists the VM password.
This document provides information about a spring school on bioinformatics and population genomics that includes practical sessions. The sessions will cover topics like short read cleaning, genome assembly, gene prediction, quality control, mapping reads to call variants, visualizing variants, analyzing variants through PCA and measuring diversity and differentiation, inferring population sizes and gene flow, and analyzing gene expression from raw sequencing data to expression levels. The document lists the team of practitioners leading the sessions and encourages participants to share their favorite software packages.
2015-12-18: Avoid having to retract your genomics analysis - Popgroup Reprodu... - Yannick Wurm
Brief (15min) talk I gave at #PopGroup49 in Edinburgh providing a few simple methods to reduce risk in genomics analyses.
Please cite: Avoid having to retract your genomics analysis (2015) Y Wurm. The Winnower 2, e143696.68941 https://thewinnower.com/papers/avoid-having-to-retract-your-genomics-analysis
This document contains information about programming in R, including practical examples. It discusses accessing and subsetting data, using regular expressions for text search, creating functions, and using loops. Examples are provided to demonstrate creating vectors, accessing subsets of vectors, using regular expressions to find patterns in text, creating functions to convert between units or estimate values, and using for loops to repeat operations over multiple elements. The document suggests R is useful for working with big data in biology and other fields due to its ability to automate tasks, integrate with other tools, and handle large datasets through programming.
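A few lines in the spirit of those practicals (data and gene names invented for illustration):

```r
# Vectors and subsetting
lengths_bp <- c(gene_a = 1200, gene_b = 350, gene_c = 980)
lengths_bp[lengths_bp > 500]          # keep genes longer than 500 bp

# A small unit-conversion function
to_kb <- function(bp) bp / 1000

# A loop repeating an operation over elements
for (g in names(lengths_bp))
  cat(g, "is", to_kb(lengths_bp[[g]]), "kb\n")

# Regular-expression text search
genes <- c("Gp-9", "foraging", "vitellogenin")
grepl("^Gp", genes)                   # TRUE FALSE FALSE
```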
This document describes oSwitch, a tool that provides one-line access to other operating systems. It works by wrapping Docker containers, allowing commands to be run in different OS environments without disrupting the user's current environment. The document provides an example in which a user runs an "abyss-pe" command in a BioLinux container after finding that the command is not available in their native OS. It notes how oSwitch preserves the user's current working directory, login shell, home directory and file permissions during use.
This document provides an outline for a lecture on the genetic basis of evolution. It begins with introducing key terms like gene, locus, allele, genotype, and phenotype. It then discusses genetic drift and how drift is influenced by population size. Selection is also introduced and defined as a process where individuals with different genotypes have different fitnesses. The document emphasizes that both genetic drift and selection influence evolution, and neither process should be overemphasized. It aims to move people away from only considering selection (pan-selectionism) and highlights the importance of genetic drift.
This document discusses human evolution and recent insights from genomics. It summarizes that Neanderthals were the closest evolutionary relatives to modern humans and lived in Europe and Western Asia until disappearing 30,000 years ago. A draft sequence of the Neanderthal genome from three individuals was presented, composed of over 4 billion nucleotides. Comparisons with five modern human genomes identified regions potentially affected by selection in ancestral modern humans, involving genes related to metabolism, cognition, and skeletal development. Analysis suggests Neanderthals shared more genetic variants with non-Africans, indicating gene flow from Neanderthals into their ancestors occurred before Eurasian groups diverged.
The document discusses analyzing ancient plant and insect DNA extracted from ice core samples in Greenland. Key points:
- Plant and insect DNA was recovered from silty ice samples taken between 2-3 km deep in the Dye 3 and JEG ice cores in Greenland, dating back to before the last glacial period.
- The DNA was identified as coming from tree species like pine and alder, indicating a boreal forest environment in southern Greenland at the time, rather than today's Arctic conditions.
- Other plant species identified include those from orders like Asterales, Poales, Rosales and Malpighiales. Insect DNA from Lepidoptera was also recovered.
This document provides an introduction to regular expressions (regex) for text search and pattern matching. It explains that regex allows for powerful text searches beyond simple keywords. Various special symbols and constructs are demonstrated that allow matching complex patterns and variants in text. Examples show matching names, sequences, microsatellite repeats and more with regex. Functions, loops and logical operators in R programming are also briefly covered.
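For example, the microsatellite repeats mentioned above can be found with a back-referenced pattern (sequences invented; perl = TRUE selects the PCRE engine, which handles back-references reliably):

```r
seqs <- c("ACACACACGT", "TTGGC", "CAGCAGCAGCAGTT")

# A microsatellite here: a 2-6 bp motif repeated at least three times.
motif <- "([ACGT]{2,6})\\1{2,}"
grepl(motif, seqs, perl = TRUE)                     # TRUE FALSE TRUE

# Extract the repeat tract itself from each matching sequence.
regmatches(seqs, regexpr(motif, seqs, perl = TRUE))
```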
The document discusses major geological drivers of evolution including tectonic plate movement, vulcanism, climate change, and meteorite impacts. Tectonic plate movement has caused continental drift and formation of supercontinents like Pangaea, affecting species distributions. Vulcanism causes both local and global climate changes through emission of gases and particles and formation of new land barriers and islands. Climate changes over geological timescales have also impacted evolution. Meteorite impacts have precipitated mass extinctions. These geological forces alter Earth's conditions and drive evolution through large-scale migrations, speciation events, mass extinctions, and adaptive radiations.
This document discusses computational methods and challenges for genome assembly using next-generation sequencing data. It describes the four main stages of genome assembly as preprocessing filtering, graph construction, graph simplification, and postprocessing filtering. Each stage processes the data from the previous stage to build the assembly graph and reduce complexity, though some assemblers delay filtering steps.
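As a hedged sketch of the graph-construction stage alone, a toy de Bruijn graph from error-free reads (k and the reads are illustrative; real assemblers also track coverage and reverse complements):

```r
# Nodes are (k-1)-mers; each k-mer contributes one edge from its
# (k-1)-prefix to its (k-1)-suffix. Assumes reads longer than k.
build_debruijn <- function(reads, k = 4) {
  kmers <- unlist(lapply(reads, function(read)
    substring(read, 1:(nchar(read) - k + 1), k:nchar(read))))
  unique(cbind(from = substr(kmers, 1, k - 1),
               to   = substr(kmers, 2, k)))
}

build_debruijn(c("ACGTAC", "CGTACG"), k = 4)
```

The later simplification stage would then collapse unbranched paths of this graph into contigs.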
This document outlines the course SBC322 Ecological and Evolutionary Genomics. It discusses how new genomic technologies have changed ecology and evolution research by merging molecular and ecological approaches. It aims to critically evaluate research questions, methods, experimental designs and applications in ecological and evolutionary genomics. The course will improve students' skills in critically reading literature, understanding interdisciplinary science, and oral and written scientific communication through interactive small group work, informal and formal presentations, blog posts, and peer review.
The document provides an overview of topics covered in a bioinformatics course, including using Unix, bioinformatics algorithms, biological databases, sequencing technologies, and genome assembly and variant identification. It lists challenges for students in each topic area and provides examples of concepts that will be covered, such as using HPC systems, dynamic programming for sequence alignment, accessing databases like NCBI, processing sequencing data, and identifying variants from assembly. Images are included of different organisms like ants and sequencing technologies. The document aims to outline the scope and challenges of the bioinformatics course.
Software Sustainability Institute Collaborations Workshop - Yannick Wurm
The document discusses tools for analyzing biological data. It summarizes four tools:
1. SequenceServer - A simple web interface for BLAST that handles formatting and installing BLAST locally.
2. oSwitch - Allows rapidly switching between operating systems and container environments to access specific bioinformatics software without installation.
3. GeneValidator - Helps curate gene predictions by identifying problematic predictions, choosing best alternative models, and aiding manual curation of individual genes.
4. Afra - A crowdsourcing platform that aims to crowdsource the visual inspection and correction of gene models by recruiting and training students, ensuring quality through tutorials, redundancy and senior review, and creating small, simple initial tasks.
The document discusses genomic analysis of the fire ant Solenopsis invicta. It notes that the genome sequencing of a Gp-9 B male fire ant revealed an expansion of lipid-processing gene families and over 420 putative olfactory receptors, more than any other insect. It also identified a functional DNA methylation system. Previous research had linked the fire ant's social structure to its Gp-9 locus, but genome sequencing provided more genomic context around this gene and others related to social behavior and chemical signaling.
How to Make a Field Mandatory in Odoo 17 - Celine George
In Odoo, making a field required can be done through both Python code and XML views. When you set the required attribute to True in Python code, it makes the field required across all views where it's used. Conversely, when you set the required attribute in XML views, it makes the field required only in the context of that particular view.
Gender and Mental Health - Counselling and Family Therapy Applications and In...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
This document provides an overview of wound healing, its functions, stages, mechanisms, factors affecting it, and complications.
A wound is a break in the integrity of the skin or tissues, which may be associated with disruption of the structure and function.
Healing is the body’s response to injury in an attempt to restore normal structure and functions.
Healing can occur in two ways: Regeneration and Repair
There are 4 phases of wound healing: hemostasis, inflammation, proliferation, and remodeling. This document also describes the mechanism of wound healing. Factors that affect healing include infection, uncontrolled diabetes, poor nutrition, age, anemia, the presence of foreign bodies, etc.
Complications of wound healing like infection, hyperpigmentation of scar, contractures, and keloid formation.
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organisation by the Excellence Foundation for South Sudan on 08th and 09th June 2024 from 1 PM to 3 PM on each day.
Chapter wise All Notes of First year Basic Civil Engineering.pptxDenish Jangid
Chapter wise All Notes of First year Basic Civil Engineering
Syllabus
Chapter-1
Introduction to objective, scope and outcome the subject
Chapter 2
Introduction: Scope and Specialization of Civil Engineering, Role of civil Engineer in Society, Impact of infrastructural development on economy of country.
Chapter 3
Surveying: Object Principles & Types of Surveying; Site Plans, Plans & Maps; Scales & Unit of different Measurements.
Linear Measurements: Instruments used. Linear Measurement by Tape, Ranging out Survey Lines and overcoming Obstructions; Measurements on sloping ground; Tape corrections, conventional symbols. Angular Measurements: Instruments used; Introduction to Compass Surveying, Bearings and Longitude & Latitude of a Line, Introduction to total station.
Levelling: Instrument used Object of levelling, Methods of levelling in brief, and Contour maps.
Chapter 4
Buildings: Selection of site for Buildings, Layout of Building Plan, Types of buildings, Plinth area, carpet area, floor space index, Introduction to building byelaws, concept of sun light & ventilation. Components of Buildings & their functions, Basic concept of R.C.C., Introduction to types of foundation
Chapter 5
Transportation: Introduction to Transportation Engineering; Traffic and Road Safety: Types and Characteristics of Various Modes of Transportation; Various Road Traffic Signs, Causes of Accidents and Road Safety Measures.
Chapter 6
Environmental Engineering: Environmental Pollution, Environmental Acts and Regulations, Functional Concepts of Ecology, Basics of Species, Biodiversity, Ecosystem, Hydrological Cycle; Chemical Cycles: Carbon, Nitrogen & Phosphorus; Energy Flow in Ecosystems.
Water Pollution: Water Quality standards, Introduction to Treatment & Disposal of Waste Water. Reuse and Saving of Water, Rain Water Harvesting. Solid Waste Management: Classification of Solid Waste, Collection, Transportation and Disposal of Solid. Recycling of Solid Waste: Energy Recovery, Sanitary Landfill, On-Site Sanitation. Air & Noise Pollution: Primary and Secondary air pollutants, Harmful effects of Air Pollution, Control of Air Pollution. . Noise Pollution Harmful Effects of noise pollution, control of noise pollution, Global warming & Climate Change, Ozone depletion, Greenhouse effect
Text Books:
1. Palancharmy, Basic Civil Engineering, McGraw Hill publishers.
2. Satheesh Gopi, Basic Civil Engineering, Pearson Publishers.
3. Ketki Rangwala Dalal, Essentials of Civil Engineering, Charotar Publishing House.
4. BCP, Surveying volume 1
हिंदी वर्णमाला पीपीटी, hindi alphabet PPT presentation, hindi varnamala PPT, Hindi Varnamala pdf, हिंदी स्वर, हिंदी व्यंजन, sikhiye hindi varnmala, dr. mulla adam ali, hindi language and literature, hindi alphabet with drawing, hindi alphabet pdf, hindi varnamala for childrens, hindi language, hindi varnamala practice for kids, https://www.drmullaadamali.com
it describes the bony anatomy including the femoral head , acetabulum, labrum . also discusses the capsule , ligaments . muscle that act on the hip joint and the range of motion are outlined. factors affecting hip joint stability and weight transmission through the joint are summarized.
3. Regular expressions:
Text search on steroids.

Regular expression                      Finds
David                                   David
Dav(e|id)                               David, Dave
Dav(e|id|ide|o)                         David, Dave, Davide, Davo
At{1,2}enborough                        Attenborough, Atenborough
Atte[nm]borough                         Attenborough, Attemborough
At{1,2}[ei][nm]bo{0,1}ro(ugh){0,1}      Atimbro, attenbrough, etc.

Easy counting, replacing all with “Sir David Attenborough”
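A quick sketch of how these patterns behave in R (not from the slides; base R's grepl() and gsub(), applied to a made-up vector of names):

pattern <- "At{1,2}[ei][nm]bo{0,1}ro(ugh){0,1}"
names   <- c("Attenborough", "Atenborough", "Attemborough", "Atimbro", "Darwin")

# which names match the misspelling-tolerant pattern?
grepl(pattern, names, ignore.case = TRUE)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE

# easy counting...
sum(grepl(pattern, names, ignore.case = TRUE))
# [1] 4

# ...and replacing all matches with the correct form:
gsub(pattern, "Sir David Attenborough", names, ignore.case = TRUE)
# [1] "Sir David Attenborough" "Sir David Attenborough" "Sir David Attenborough"
# [4] "Sir David Attenborough" "Darwin"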
5. Functions
• R has many, e.g.: plot(), t.test()
• Making your own:

# assumes a named lookup vector of per-species growth rates
# (diameter gained per year), e.g.:
#   growth.rates <- c("White Oak" = 0.38, "Carya ovata" = 0.32)  # made-up values
tree_age_estimate <- function(diameter, species) {
  growth.rate  <- growth.rates[ species ]
  age.estimate <- diameter / growth.rate
  return(age.estimate)
}

> tree_age_estimate(25, "White Oak")
[1] 66
> tree_age_estimate(60, "Carya ovata")
[1] 190
6. “for” Loop

> possible_colours <- c('blue', 'cyan', 'sky-blue', 'navy blue',
+   'steel blue', 'royal blue', 'slate blue', 'light blue', 'dark blue',
+   'prussian blue', 'indigo', 'baby blue', 'electric blue')
> possible_colours
 [1] "blue"          "cyan"          "sky-blue"      "navy blue"
 [5] "steel blue"    "royal blue"    "slate blue"    "light blue"
 [9] "dark blue"     "prussian blue" "indigo"        "baby blue"
[13] "electric blue"
> for (colour in possible_colours) {
+   print(paste("The sky is oh so, so", colour))
+ }
[1] "The sky is oh so, so blue"
[1] "The sky is oh so, so cyan"
[1] "The sky is oh so, so sky-blue"
[1] "The sky is oh so, so navy blue"
[1] "The sky is oh so, so steel blue"
[1] "The sky is oh so, so royal blue"
[1] "The sky is oh so, so slate blue"
[1] "The sky is oh so, so light blue"
[1] "The sky is oh so, so dark blue"
[1] "The sky is oh so, so prussian blue"
[1] "The sky is oh so, so indigo"
[1] "The sky is oh so, so baby blue"
[1] "The sky is oh so, so electric blue"
9. Why consider experimental design?
• If you’re performing experiments:
  • Cost
  • Time (for the experiment, for the analysis)
  • Ethics
• If you’re deciding to fund? to buy? to approve? to compete?
  • Are the results real?
  • Can you trust the data?
11. Example: deer parasites
• Do red deer that feed in woodland have more parasites than deer that feed on moorland?
• Find a woodland + a moorland; collect faecal samples from 20 deer in each.
• Conclusion?
• But:
  • pseudoreplication (n = 1, not 20!):
    • shared environment (the deer influence each other)
    • relatedness
  • many confounding factors (e.g. altitude...)
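A hedged sketch of one fix (not from the slides): sample several sites of each habitat type, and treat site as a random effect so that site, rather than individual deer, is the unit of replication. Object and column names here are made up; glmer() is from the lme4 package.

library(lme4)

# deer: one row per sampled animal, with columns
#   parasite_count (faecal egg count), habitat (woodland/moorland), site (site ID)
fit <- glmer(parasite_count ~ habitat + (1 | site),
             family = poisson, data = deer)
summary(fit)  # the habitat effect is now tested against among-site variation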
12. Your turn: small & big Pheidole workers.
• Is there a genetic predisposition for becoming a larger worker?
• Design an experiment alone.
• Exchange ideas with your neighbor.
14. Your turn again: protein production
• Large amounts of potential superdrug takeItEasyProtein™ required for Phase II trials.
• 10 cell lines can produce takeItEasyProtein™.
• You have 5 possible growth media.
• Optimization question: Which combination of temperature, cell line, and growth medium will perform best?
• Constraints:
  • each assay takes 4 days
  • access to 2 incubators (each can contain 1-100 growth tubes)
  • large-scale production starts in 2 weeks
• Design an experiment alone (a sketch of the design space follows below).
• Exchange ideas with your neighbor.
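To get a feel for the size of the design space, a sketch in base R (the two candidate temperatures are made up; the slide leaves them open):

designs <- expand.grid(
  cell_line   = paste0("line_", 1:10),
  medium      = paste0("medium_", 1:5),
  temperature = c(30, 37)   # hypothetical levels, e.g. one per incubator
)
nrow(designs)
# [1] 100  -- candidate combinations to allocate across the two incubators
#            and the roughly three 4-day rounds that fit in two weeks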
20. Best Practices for Scientific Computing
Greg Wilson (Software Carpentry, gvwilson@software-carpentry.org), D.A. Aruliah (University of Ontario Institute of Technology), C. Titus Brown (Michigan State University, ctb@msu.edu), Neil P. Chue Hong (Software Sustainability Institute, N.ChueHong@epcc.ed.ac.uk), Matt Davis (Space Telescope Science Institute, mrdavis@stsci.edu), Richard T. Guy (University of Toronto, guy@cs.utoronto.ca), Steven H.D. Haddock (Monterey Bay Aquarium Research Institute, steve@practicalcomputing.org), Katy Huff (University of Wisconsin, khuff@cae.wisc.edu), Ian M. Mitchell (University of British Columbia), Mark D. Plumbley (Queen Mary University of London, mark.plumbley@eecs.qmul.ac.uk), Ben Waugh (University College London, b.waugh@ucl.ac.uk), Ethan P. White (Utah State University, ethan@weecology.org), Paul Wilson (University of Wisconsin, wilsonp@engr.wisc.edu)
arXiv:1210.0530v3 [cs.MS] 29 Nov 2012

Abstract: Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software. No set of practices will guarantee efficient, error-free software development, but used in concert they will reduce the number of errors in scientific software, make it easier to reuse, and save the authors of the software time and effort that can instead be spent focusing on the underlying scientific questions.

Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems. Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific knowledge.

The ten practices:
1. Write programs for people, not computers. Scientists writing software need to write code that executes correctly and can be easily read and understood by other programmers (especially the author’s future self). If software cannot be easily read and understood, it is much harder to know that it is actually doing what it is intended to do. To be productive, software developers must take aspects of human cognition into account; in particular, human working memory is limited.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don’t repeat yourself (or others).
7. Plan for mistakes.
8. Optimize software only after it works correctly.
9. Document the design and purpose of code rather than its mechanics.
10. Conduct code reviews.
21.
22. Education
A Quick Guide to Organizing Computational Biology Projects
William Stafford Noble (1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle; 2 Department of Computer Science and Engineering, University of Washington, Seattle)

Introduction
Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.

The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting progress. Someone unfamiliar with your project may need to understand your work, or may be evaluating your research skills; most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.

Keep all files for a project under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects; each such program might have a project directory of its own. Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.

Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts. doi:10.1371/journal.pcbi.1000424.g001

This leads to the second principle, which is actually more like a version of Murphy’s Law: everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. If you have organized and documented your work clearly, then repeating the experiment with the new data or the new parameterization is much easier.

Within the data and results directories, it is often tempting to apply a similar, chronologically organized structure, and the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before the experiment is complete.

The Lab Notebook
In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of your experiments. These types of entries provide a complete picture of the development of the project over time. In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collaborators.
In each results folder:
• script: getResults.rb or WHATIDID.txt or MyAnalysis.Rnw
• intermediates
• output
24. knitr (sweave): Analyzing & reporting in a single file.

MyFile.Rnw:

\documentclass{article}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
\usepackage{url}
\begin{document}

<<setup, include=FALSE, cache=FALSE, echo=FALSE>>=
# this is equivalent to \SweaveOpts{...}
opts_chunk$set(fig.path='figure/minimal-', fig.align='center', fig.show='hold')
options(replace.assign=TRUE,width=90)
@

\title{A Minimal Demo of knitr}
\author{Yihui Xie}
\maketitle

You can test if \textbf{knitr} works with this minimal demo. OK, let's
get started with some boring random numbers:

<<boring-random,echo=TRUE,cache=TRUE>>=
set.seed(1121)
(x=rnorm(20))
mean(x);var(x)
@

The first element of \texttt{x} is \Sexpr{x[1]}. Boring boxplots
and histograms recorded by the PDF device:

<<boring-plots,cache=TRUE,echo=TRUE>>=
## two plots side by side
par(mar=c(4,4,.1,.1),cex.lab=.95,cex.axis=.9,mgp=c(2,.7,0),tcl=-.3,las=1)
boxplot(x)
hist(x,main='')
@

Do the above chunks work? You should be able to compile the \TeX{} ...

### in R:
library(knitr)
knit("MyFile.Rnw")
# --> creates MyFile.tex

### in shell:
pdflatex MyFile.tex
# --> creates MyFile.pdf

Also works with Markdown instead of LaTeX!

(The compiled MyFile.pdf, "A Minimal Demo of knitr" by Yihui Xie, February 26, 2012, shows this same text with each chunk's code and its output inserted in place, e.g. mean(x) giving 0.3217.)
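The Markdown flavour looks like this; a minimal sketch with a hypothetical MyFile.Rmd (the chunk delimiters change, the knit() call stays the same):

MyFile.Rmd:

# A Minimal Demo of knitr

Some boring random numbers:

```{r boring-random}
set.seed(1121)
(x <- rnorm(20))
mean(x); var(x)
```

### in R:
library(knitr)
knit("MyFile.Rmd")
# --> creates MyFile.md (plain Markdown, e.g. for pandoc)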
25. Choosing a programming language
Excel
R
Unix command-line (i.e., shell, i.e., bash)
Perl
Java
Python
Ruby
Javascript
26. Ruby.
“Friends don’t let friends do Perl” - reddit user
example: reverse the contents of each line in a file
### in PERL:
open INFILE, "my_file.txt";
while (defined ($line = <INFILE>)) {
chomp($line);
@letters = split(//, $line);
@reverse_letters = reverse(@letters);
$reverse_string = join("", @reverse_letters);
print $reverse_string, "\n";
}
### in Ruby:
File.open("my_file.txt").each do |line|
puts line.chomp.reverse
end
27. More ruby examples.
5.times do
puts "Hello world"
end
# Sorting people
people_sorted_by_age = people.sort_by{ |person| person.age}
28. Getting help.
• In real life: Make friends with people. Talk to them.
• Online:
  • Specific discussion mailing lists (e.g.: R, Stacks, bioruby, MAKER...)
  • Programming: http://stackoverflow.com
  • Bioinformatics: http://www.biostars.org
  • Sequencing-related: http://seqanswers.com
  • Stats: http://stats.stackexchange.com