How learning, using, and teaching R has helped my career in the life sciences
#TokyoR 2018-7-15
Presented at Yahoo Japan, 2:45 pm
Tom Kelly, Postdoctoral Fellow (RIKEN IMS)
Sharing data with lightweight data standards, such as schema.org and Bioschemas: the KnetMiner case, an application for the agrifood domain and molecular biology.
Presented at Open Data Sicilia (#ODS2021)
Being Reproducible: SSBSS Summer School 2017 (Carole Goble)
Lecture 2:
Being Reproducible: Models, Research Objects and R* Brouhaha
Reproducibility is an R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield, raising concerns about credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange.
In this talk I will explore these issues in more depth using the FAIRDOM Platform and its support for reproducible modelling. The talk will cover initiatives and technical issues, and raise social and cultural challenges.
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications (Francesco Osborne)
TechMiner is a new approach that combines natural language processing, machine learning, and semantic technologies to extract information about technologies (such as applications, systems, languages, and formats) from research publications. It generates an ontology describing technologies and their relationships to other research entities. The approach was evaluated on a gold standard of manually annotated publications and found to improve precision and recall over alternative natural language processing approaches. Future work includes enriching the approach to identify additional scientific objects and applying it to other research fields.
Findable Accessible Interoperable Reusable < data | models | SOPs | samples | articles | * >. FAIR is a mantra; a meme; a myth; a mystery; a moan. For the past 15 years I have been working on FAIR in a range of Life Science projects and initiatives. Some are top-down, like the Life Science European Research Infrastructures ELIXIR and ISBE, and some are bottom-up, supporting research projects in Systems and Synthetic Biology (FAIRDOM), Biodiversity (BioVel), and Pharmacology (Open PHACTS), for example. Some have become movements, like Bioschemas, the Common Workflow Language and Research Objects. Others focus on cross-cutting approaches in reproducibility, computational workflows, metadata representation and scholarly sharing & publication. In this talk I will relate a series of FAIRy tales. Some of them are Grimm. Some have happy endings. Who are the villains and who are the heroes? What are the morals we can draw from these stories?
Being FAIR: FAIR data and model management, SSBSS 2017 Summer School (Carole Goble)
Lecture 1:
Being FAIR: FAIR data and model management
In recent years we have seen a change in expectations for the management of all the outcomes of research – that is, the "assets" of data, models, codes, SOPs and workflows. The "FAIR" (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship [1] have proved to be an effective rallying-cry. Funding agencies expect data (and, increasingly, software) management, retention and access plans. Journals are raising their expectations of the availability of data and codes both pre- and post-publication. The multi-component, multi-disciplinary nature of Systems and Synthetic Biology demands the interlinking and exchange of assets and the systematic recording of metadata for their interpretation.
Our FAIRDOM project (http://www.fair-dom.org) supports Systems Biology research projects with their research data, methods and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety. The FAIRDOM Platform has been installed by over 30 labs or projects. Our public, centrally hosted Asset Commons, the FAIRDOMHub.org, supports the outcomes of 50+ projects.
Now established as a grassroots association, FAIRDOM has over 8 years of experience of practical asset sharing and data infrastructure at the researcher coal-face, ranging across European programmes (the SysMO and ERASysAPP ERA-Nets), national initiatives (Germany's de.NBI and Systems Medicine of the Liver; Norway's Digital Life) and European Research Infrastructures (ISBE), as well as in PIs' labs and centres such as the SynBioChem Centre at Manchester.
In this talk I will explore how FAIRDOM has been designed to support Systems Biology projects and show examples of its configuration and use. I will also explore the technical and social challenges we face.
I will also refer to European efforts to support public archives for the life sciences. ELIXIR (http://www.elixir-europe.org/) is the European Research Infrastructure of 21 national nodes and a hub, funded by national agreements, that coordinates and sustains key data repositories and archives for the Life Science community, improves access to them and to related tools, supports training and creates a platform for dataset interoperability. As Head of the ELIXIR-UK Node and co-lead of the ELIXIR Interoperability Platform I will show how this work relates to your projects.
[1] Wilkinson et al., "The FAIR Guiding Principles for scientific data management and stewardship", Scientific Data 3 (2016), doi:10.1038/sdata.2016.18
Research Objects: more than the sum of the parts (Carole Goble)
Workshop on Managing Digital Research Objects in an Expanding Science Ecosystem, 15 Nov 2017, Bethesda, USA
https://www.rd-alliance.org/managing-digital-research-objects-expanding-science-ecosystem
Research output is more than just the rhetorical narrative. The experimental methods, computational codes, data, algorithms, workflows, Standard Operating Procedures, samples and so on are the objects of research that enable reuse and reproduction of scientific experiments, and they too need to be examined and exchanged as research knowledge.
A first step is to think of Digital Research Objects as a broadening out to embrace these artefacts or assets of research. The next is to recognise that investigations use multiple, interlinked, evolving artefacts. Multiple datasets and multiple models support a study; each model is associated with datasets for construction, validation and prediction; an analytic pipeline has multiple codes and may be made up of nested sub-pipelines, and so on. Research Objects (http://researchobject.org/) is a framework by which the many, nested and contributed components of research can be packaged together in a systematic way, and their context, provenance and relationships richly described.
1) The document describes the SOPHIA project, which aims to build altmetric networks of researchers and institutions to understand how research impacts spread in society.
2) SOPHIA collects data from Scopus and social media sources to build a heterogeneous graph network, and analyzes the network using graph metrics to measure the influence and authority of researchers and institutions.
3) The project has developed visualization and search tools to explore the altmetric networks, annotated documents, and metrics within a software prototype called SOPHIA.
Aspects of Reproducibility in Earth Science (Raul Palma)
The document discusses aspects of reproducibility in earth science research within the European Virtual Environment for Research - Earth Science Themes (EVEREST) project. The key objectives of EVEREST are to establish an e-infrastructure to facilitate collaborative earth science research through shared data, models, and workflows. Research Objects (ROs) will be used to capture and share workflows, processes, and results to help ensure reproducibility and preservation of earth science research. An example RO is described for mapping volcano deformation using satellite imagery and other data sources. Issues around reproducibility related to data access, software dependencies, and manual intervention in workflows are also discussed.
This document summarizes Professor Carole Goble's presentation on making research more reproducible and FAIR (Findable, Accessible, Interoperable, Reusable) through the use of research objects and related standards and infrastructure. It discusses challenges to reproducibility in computational research and proposes bundling datasets, workflows, software and other research products into standardized research objects that can be cited and shared to help address these challenges.
The document discusses the ISA infrastructure, which provides a generic format for experimental description and data exchange. The ISA infrastructure aims to support bio-scientists from experimental design to data publication. It does this through developing community standards, open source software tools, and engaging communities. The infrastructure provides a common framework to describe experiments in a way that allows data to flow between different systems and communities.
Some tools developed at OEG (Ontology Engineering Group) for facilitating ontology engineering activities as evaluation, documentation, releasing and publication.
A keynote given on the FAIR Data Principles at the FAIRplus Innovation and SME Forum, Hinxton Genome Campus, Cambridge, UK, 29 January 2020, covering the history of the principles, issues with the principles, and speculations about the future.
Short talk on Research Objects and their use for reproducibility and publishing in the Systems Biology Commons Platform FAIRDOMHub, and the underlying software SEEK.
Gene Ontology WormBase Workshop, International Worm Meeting 2015 (raymond91105)
This document summarizes how Gene Ontology (GO) annotations are used at WormBase to annotate genes in C. elegans. It describes the three aspects of GO (biological process, molecular function, cellular component) and how GO annotations associate genes to specific terms. It provides details on how to access and browse GO annotations at WormBase, including through gene pages, the Ontology Browser, and download files. It also describes using the PANTHER database to perform GO term enrichment analysis. The document outlines current and future improvements to GO annotations at WormBase.
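The enrichment step described above can be reproduced approximately from R. The following is a minimal, hedged sketch using the Bioconductor packages clusterProfiler and org.Ce.eg.db as stand-ins for the PANTHER web service mentioned in the workshop; the Entrez gene IDs are placeholders rather than a real worm gene list.

```r
# Hedged sketch: GO term enrichment for a C. elegans gene list in R.
# clusterProfiler/org.Ce.eg.db stand in for the PANTHER service described
# in the workshop; the gene IDs below are illustrative placeholders.
library(clusterProfiler)
library(org.Ce.eg.db)

my_genes <- c("172037", "175410", "180399")   # placeholder Entrez IDs

ego <- enrichGO(gene          = my_genes,
                OrgDb         = org.Ce.eg.db,
                keyType       = "ENTREZID",
                ont           = "BP",          # biological process aspect
                pAdjustMethod = "BH",
                qvalueCutoff  = 0.05)
head(as.data.frame(ego))
```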
Getting the best of Linked Data and Property Graphs: rdf2neo and the KnetMine... (Rothamsted Research, UK)
Graph-based modelling is becoming more popular, in the sciences and elsewhere, as a flexible and powerful way to exploit data to power world-changing digital applications. Compared to the initial vision of the Semantic Web, knowledge graphs and graph databases are becoming a practical and computationally less formal way to manage graph data. On the other hand, linked data based on Semantic Web standards are a complementary, rather than alternative, approach to deal with these data, since they still provide a common way to represent and exchange information. In this paper we introduce rdf2neo, a tool to populate Neo4j databases starting from RDF data sets, based on a configurable mapping between the two. By employing agrigenomics-related real use cases, we show how such mapping can allow for a hybrid approach to the management of networked knowledge, based on taking advantage of the best of both RDF and property graphs.
FAIR Workflows and Research Objects get a Workout (Carole Goble)
So, you want to build a pan-national digital space for bioscience data and methods? That works with a bunch of pre-existing data repositories and processing platforms? So you can share FAIR workflows and move them between services? Package them up with data and other stuff (or just package up data for that matter)? How? WorkflowHub (https://workflowhub.eu) and RO-Crate Research Objects (https://www.researchobject.org/ro-crate) that’s how! A step towards FAIR Digital Objects gets a workout.
Presented at DataVerse Community Meeting 2021
Reproducibility Using Semantics: An Overview (dgarijo)
Overview of the different approaches for addressing reproducibility (using semantics) in laboratory protocols, workflow description and publication, and workflow infrastructure. Furthermore, Research Objects are introduced as a means to capture the context and annotations of scientific experiments, together with the privacy and IPR concerns that may arise. This presentation was given at Dagstuhl Seminar 16041: http://www.dagstuhl.de/16041
ACS 248th Paper 67: Eureka Collaboration (Stuart Chalk)
This document discusses the Eureka Research Workbench (ERW), a digital platform for enabling international scientific collaboration. The ERW allows researchers to store all research notes, data, and files in a digital format using the Experiment Markup Language (ExptML) to capture different data types. It also facilitates collaboration between research groups by allowing all users to view shared data. The document describes a case study of international collaboration between research groups in Thailand and the US using the ERW to study endocrine disrupting chemicals. It also provides feedback from users and outlines future plans to improve translation features and data visualization tools to further support global scientific collaboration.
This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.
Lei Zheng has over 15 years of experience in areas such as machine learning, data mining, and software development. He currently works as a Senior Software Engineer at Yahoo, where he develops algorithms for spam filtering and detection of abusive behavior. Previously he held research positions at the University of Pittsburgh and JustSystems Evans Research, where he implemented algorithms and systems for information retrieval, natural language processing, and data mining.
ACS 248th Paper 136: JSmol/JSpecView Eureka Integration (Stuart Chalk)
Integration of the combined JSmol/JSpecView molecular viewer/spectral viewer software in the Eureka Research Workbench. It can display molecular structures, spectra, and a linked view in which clicking on a peak shows the corresponding molecular motion (IR).
This document discusses Neo4j and its applications in bioinformatics. It describes Bio4j, an open source bioinformatics graph database built using Neo4j that integrates data from sources like Uniprot, NCBI taxonomy, Gene Ontology, and more. Bio4j models biological data as nodes and relationships in a graph structure rather than tables. This allows for more flexible querying and knowledge integration. The document provides examples of how Bio4j can be accessed through its Java API, Cypher query language, Gremlin traversal language, and REST API. It also describes some tools and visualizations for exploring and analyzing Bio4j data.
Repeatable plant pathology bioinformatic analysis: Not everything is NGS data (Leighton Pritchard)
Presentation on use of Galaxy for plant pathology bioinformatics, presented by Peter Cock, at the Genomics for Non-Model Organisms workshop, ISMB/ECCB, Vienna, Austria, 19 July 2011
The Seven Deadly Sins of Bioinformatics (Duncan Hull)
Keynote talk at Bioinformatics Open Source Conference (BOSC) Special Interest Group at the 15th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2007) in Vienna, July 2007 by Carole Goble, University of Manchester.
The document summarizes updates and new features in the latest release (Araport11) of the Arabidopsis Information Portal (Araport). Key points include:
1) Araport assumed responsibility for the Arabidopsis thaliana Col-0 genome sequence and annotation.
2) The Araport11 release incorporates 113 RNA-seq datasets, contributions from NCBI, UniProt, and Arabidopsis researchers. Structural and functional annotation were performed.
3) Araport provides a "one-stop shop" for Arabidopsis data including updated gene models, protein coding genes, transcripts, community curation tools, and over 70 tracks of data in JBrowse.
This document introduces a workbook for analyzing geometric morphometric data using freely available software. It discusses the relationship between the workbook and its accompanying textbook. The workbook is meant to provide practical guidance on running specific analyses in various software packages, updating more frequently than the textbook. It reviews several freely available software options for geometric morphometrics and emphasizes "comprehensive" packages that allow many different analyses within a single program or related suite of programs. However, it notes that no single package can perform all possible analyses, so multiple packages may need to be used. It encourages the use of R as a flexible environment that can handle complex statistical models and analyses not found in specialized morphometrics packages.
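The workbook above recommends R without (in this summary) naming specific packages. Purely as an illustration of the kind of analysis it covers, a minimal sketch with the geomorph package and its bundled plethodon landmark data might look like this; geomorph is one of several free morphometrics packages and not necessarily the one used in the workbook.

```r
# Illustrative sketch only: generalised Procrustes alignment and PCA of shape
# variation in R using the geomorph package (an assumption, not the workbook's
# stated toolchain).
library(geomorph)

data(plethodon)               # 12 2D landmarks for 40 salamander specimens
gpa <- gpagen(plethodon$land) # generalised Procrustes analysis
pca <- gm.prcomp(gpa$coords)  # principal components of the aligned shapes

summary(pca)
plot(pca)                     # PC1 vs PC2 of shape variation
```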
Cool Informatics Tools and Services for Biomedical Research (David Ruau)
This document provides an overview of bioinformatics tools and services for analyzing big data in biomedical research. It discusses traditional bioinformatics tools, analyzing genomic data from microarrays and next-generation sequencing without and with code, interpreting results using protein interaction networks and pathways, tools for data storage, cleaning and visualization, and making research reproducible. Galaxy, R, and programming are presented as useful for automated, reproducible analysis of large genomic datasets.
Spark Summit Europe: Share and analyse genomic data at scale (Andy Petrella)
Share and analyse genomic data at scale with Spark, Adam, Tachyon & the Spark Notebook. The talk covers a sharp intro to genomics data, the challenges involved, distributed machine learning to the rescue, projects with distributed teams, research as a long process, and working towards maximum sharing for efficiency.
Introduction to Biological Network Analysis and Visualization with Cytoscape ... (Keiichiro Ono)
Introduction to biological network analysis and visualization with Cytoscape (using the latest version 3.4).
This is the first half of the Applied Bioinformatics lecture at TSRI.
Sarah Guido gave a presentation on analyzing data with Python. She discussed several Python tools for preprocessing, analysis, and visualization including Pandas for data wrangling, scikit-learn for machine learning, NLTK for natural language processing, MRjob for processing large datasets in parallel, and ggplot for visualization. For each tool, she provided examples and use cases. She emphasized that the best tools depend on the type of data and analysis needs.
This document discusses challenges and opportunities for integrating large, heterogeneous biological data sets. It outlines the types of analysis and discovery that could be enabled, such as comparing data across studies. Technical challenges include incompatible identifiers and schemas between data sources. Common solutions attempt standardization but have limitations. The document examines Amazon's approach as a model, with principles like exposing all data through programmatic interfaces. It argues for a "platform" approach and combining data-driven and model-driven analysis to gain new insights. Developing services with end users in mind could help maximize data reuse.
This document is a resume for Gautam Machiraju. It summarizes his education and research experience. He has a B.A. in Applied Mathematics from UC Berkeley with a concentration in Mathematical Biology and a minor in Bioengineering. He has worked on several research projects involving mathematical modeling and data analysis related to biology and healthcare. These include modeling cancer biomarker shedding kinetics, mining literature for biomarker data, and using deep learning on patient time-series data. He has strong skills in programming, mathematics, and bioinformatics.
This document is a resume for Gautam Machiraju. It summarizes his education and research experience. He has a B.A. in Applied Mathematics from UC Berkeley with a concentration in Mathematical Biology and a minor in Bioengineering. He has worked on several research projects involving mathematical modeling and data analysis related to cancer biomarkers, genomics, and proteomics. His skills include programming, mathematics, data science, and laboratory techniques. He is currently a bioinformatics research assistant at Stanford University School of Medicine.
Towards reproducibility and maximally-open data (Pablo Bernabeu)
Presented at the Open Scholarship Prize Competition 2021, organised by Open Scholarship Community Galway.
Video of the presentation: https://nuigalway.mediaspace.kaltura.com/media/OSW2021A+OSCG+Open+Scholarship+Prize+-+The+Final!/1_d7ekd3d3/121659351#t=56:08
This document provides an overview of cloud bioinformatics and the challenges of analyzing large datasets from next-generation sequencing (NGS). It discusses how bioinformatics uses computational methods to study genes, proteins, and genomes. The advent of NGS has led to huge datasets that require high-performance computing. Cloud computing provides access to pooled computing resources in a cost-effective manner and helps address the bioinformatics challenge of assembling and analyzing NGS data. The document also outlines common bioinformatics software and resources available through WestGrid and Galaxy that can be used for sequence assembly, annotation, and other applications.
The document discusses how computation can accelerate the generation of new knowledge by enabling large-scale collaborative research and extracting insights from vast amounts of data. It provides examples from astronomy, physics simulations, and biomedical research where computation has allowed more data and researchers to be incorporated, advancing various fields more quickly over time. Computation allows for data sharing, analysis, and hypothesis generation at scales not previously possible.
Keynote on software sustainability given at the 2nd Annual Netherlands eScience Symposium, November 2014.
Based on the article:
Carole Goble, "Better Software, Better Research", IEEE Internet Computing, vol. 18, no. 5 (Sept.-Oct. 2014), pp. 4-8, IEEE Computer Society.
http://www.computer.org/csdl/mags/ic/2014/05/mic2014050004.pdf
http://doi.ieeecomputersociety.org/10.1109/MIC.2014.88
http://www.software.ac.uk/resources/publications/better-software-better-research
FAIRification experience: clarifying the semantics of data matrices (Pistoia Alliance)
This webinar presents the Statistics Ontology (STATO), a semantic framework to support the creation of standardized analysis reports and to help with the review of results in the form of data matrices. STATO includes a hierarchy of classes and a vocabulary for annotating statistical methods used in life, natural and biomedical sciences investigations, text mining and statistical analyses.
The advent of social networks has changed research in computer science. Massive volumes of data are now produced in the form of Twitter, Facebook, emails and IoT (Internet of Things) streams, so the storage and analysis of these data has become a great challenge for researchers. Traditional frameworks have failed at processing data of this size. R is an open-source programming framework developed for the analysis of large data that can deliver good accuracy, and it offers the opportunity to implement the approach directly in the R programming language. This paper presents a study on the use of R for the classification of large social network data: the Naïve Bayes algorithm is used to classify a large Twitter data set. The experiments show that enormous amounts of data can be classified efficiently using the R framework, with promising results.
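As a concrete, hedged illustration of the approach summarised above (the paper's Twitter corpus and feature engineering are not reproduced here, and the tweets below are invented), a Naive Bayes text classifier in R can be set up along these lines with the e1071 package:

```r
# Minimal sketch of Naive Bayes classification of short texts in R,
# in the spirit of the study above; tweets and labels are made up.
library(e1071)

tweets <- c("great product love it",   "terrible service never again",
            "love the new update",     "worst experience ever",
            "really happy with this",  "awful very disappointed")
labels <- factor(c("pos", "neg", "pos", "neg", "pos", "neg"))

# Crude bag-of-words features: does each tweet contain each vocabulary term?
vocab <- unique(unlist(strsplit(tweets, " ")))
X <- t(sapply(tweets, function(txt) as.integer(vocab %in% strsplit(txt, " ")[[1]])))
colnames(X) <- vocab
X <- as.data.frame(lapply(as.data.frame(X), factor, levels = c(0, 1)))

model <- naiveBayes(X, labels)
predict(model, X)   # resubstitution predictions on the toy data
```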
The document describes tools developed by the ENCODE project to improve access and reproducibility of ENCODE data and analysis pipelines. A scalable metadata-driven system and REST API have been implemented to provide access to ENCODE data files and metadata. The structured metadata describes analysis pipelines, software, and steps to support reproducibility. The REST API and metadata standards can be used by researchers to further analyze ENCODE data and integrate their own data.
grizzly - informal overview - PyData Boston 2013 (adrianheilbut)
The document summarizes the motivation, goals, and core ideas behind the grizzly statistical analysis framework. It discusses how biological and scientific data is increasingly complex with multidimensional, hierarchical, and temporal structures. It outlines desiderata for reproducible, efficient analysis including correctness, verifiability, and interactivity. The document presents strategies like separating concerns and abstracting data management. It draws inspiration from fields like OLAP and scientific workflows. Core ideas include representing data as multidimensional cubes with semantic types and modeling computation as directed acyclic graphs of typed functions.
1. The document discusses how a biologist, Marco Roos, became interested in e-science through his work in molecular and cellular biology, bioinformatics, and data integration projects.
2. Roos describes how e-science allows for collaboration between different experts and disciplines through technologies like workflows, semantic web, and virtual laboratories.
3. Roos emphasizes that e-science should empower scientists by making tools and resources easy to use, share, and build upon so that scientists can focus on scientific problems rather than technical challenges.
Sharing massive data analysis: from provenance to linked experiment reports (Alban Gaignard)
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
Data Science Provenance: From Drug Discovery to Fake Fans (Jameel Syed)
Knowledge work adds value to raw data; how this activity is performed is critical for how reliably results can be reproduced and scrutinized. With a brief diversion into epistemology, the presentation will outline the challenges for practitioners and consumers of Big Data analysis, and demonstrate how these were tackled at Inforsense (life sciences workflow analytics platform) and Musicmetric (social media analytics for music).
The talk covers the following issues with concrete examples:
- Representations of provenance
- Considerations to allow analysis computation to be recreated
- Reliable collection of noisy data from the internet
- Archiving of data and accommodating retrospective changes
- Using linked data to direct Big Data analytics
This document summarizes computational analysis methods for the expectation values commonly reported by bioinformatics databases. It discusses tools such as BLAST and FASTA, and databases such as NCBI's, that allow sequences to be queried and analyzed. The expectation value estimates how many matches of at least that quality would be expected by chance alone, so lower values indicate more significant matches. These tools and databases facilitate customizable extraction of sequence data to enable further analysis and knowledge discovery in bioinformatics.
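For context, the expectation value reported by BLAST follows the standard Karlin-Altschul formula (a textbook result added here for clarity, not a statement quoted from the summarized document):

```latex
% Expected number of chance alignments scoring at least S:
%   m = effective query length, n = effective database length,
%   K and \lambda = parameters of the scoring system.
E = K \, m \, n \, e^{-\lambda S}
```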
Build it and they will come: An R interface to the Leiden clustering algorithm with reticulate
Presentation at Bio"Pack"athon 2020 #1
Date: 25/02/2020
Venue: RIKEN Yokohama, Japan
Primary language: English
https://sites.google.com/view/biopackathon/biopackathon20201
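The talk above describes wrapping the Python Leiden implementation via reticulate. As a rough, hedged stand-in that avoids the Python dependency (and is not the package interface presented in the talk), recent igraph releases expose a native Leiden implementation:

```r
# Hedged sketch: Leiden community detection on a toy graph in R, using
# igraph's built-in cluster_leiden() (igraph >= 1.2.7) as a stand-in for
# the reticulate-based interface presented at the Bio"Pack"athon talk.
library(igraph)

set.seed(1)
g  <- sample_gnp(100, p = 0.05)                        # random toy graph
cl <- cluster_leiden(g, objective_function = "modularity")

membership(cl)                    # community assignment for each node
length(unique(membership(cl)))    # number of communities found
```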
Learning from and teaching in communities
コミュニティーで学び、そこで教えた事
Can we bring “Software Carpentry” to Japan? 「ソフトウェア・カーペントリー」を日本でやりませんか?
Presentation in English (with slides in English and Japanese)
#TokyoR 73rd Meeting 2018-10-20
Tom Kelly (RIKEN IMS, Yokohama, Japan)
- The document discusses whether administering high-dose antimicrobial chemotherapy prevents the evolution of antibiotic resistance.
- It presents two opposing hypotheses - the "Hit Hard" hypothesis that higher doses eliminate bacteria more quickly, limiting resistance, versus the hypothesis that higher doses indirectly select for resistant strains by removing competition.
- Through mathematical modeling, it finds the risk of highly resistant strains emerging is highest at intermediate doses and lowest at either the maximum safe dose or minimum effective dose. The optimal strategy depends on specific infection parameters.
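The intermediate-dose effect described in the last point can be illustrated with a toy model. The sketch below is emphatically not the study's model: it is a deliberately simple two-strain competition simulation (sensitive and resistant bacteria sharing one carrying capacity, with the drug killing sensitive cells much faster), written with the deSolve package, in which sweeping the dose shows the resistant population ending highest at intermediate doses.

```r
# Toy illustration only (NOT the summarized study's model): two-strain
# competition under antimicrobial treatment, solved with deSolve.
library(deSolve)

two_strain <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    N  <- S + R
    dS <- r_s * S * (1 - N / K) - dose * kill_s * S   # sensitive strain
    dR <- r_r * R * (1 - N / K) - dose * kill_r * R   # resistant strain
    list(c(dS, dR))
  })
}

resistant_after_treatment <- function(dose) {
  parms <- c(r_s = 1.0, r_r = 0.9,       # resistance carries a small fitness cost
             kill_s = 1.0, kill_r = 0.3, # drug is far less effective on R
             K = 1e9, dose = dose)
  out <- ode(y = c(S = 1e6, R = 1e2), times = seq(0, 28, by = 0.1),
             func = two_strain, parms = parms)
  unname(tail(out[, "R"], 1))            # resistant population at day 28
}

doses  <- seq(0, 4, by = 0.25)
finalR <- sapply(doses, resistant_after_treatment)
plot(doses, finalR, type = "b", log = "y",
     xlab = "dose", ylab = "resistant cells after 28 days")
```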
This document describes a bioinformatic methodology to predict synthetic lethal drug targets for cancers deficient in the tumor suppressor gene E-cadherin (CDH1). The methodology analyzes gene expression data from public databases to identify genes whose expression levels correlate with CDH1. Known synthetic lethal interactions, like between BRCA and PARP1, were correctly predicted. Several candidate synthetic lethal partners of CDH1 were identified and grouped into biological pathways. This bioinformatic approach can efficiently predict synthetic lethal targets to guide experimental validation and help develop targeted therapies for CDH1-deficient cancers.
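A hedged sketch of the core correlation step described above follows: ranking genes by how strongly their expression tracks CDH1 across samples. The expression matrix and gene names are simulated placeholders, and this shows only the simplest flavour of the idea, not the published SLIPT method itself.

```r
# Toy sketch: rank genes by correlation of their expression with CDH1 across
# tumour samples. Data are simulated placeholders; this illustrates the
# correlation-ranking idea only, not the SLIPT method described above.
set.seed(1)
expr <- matrix(rnorm(200 * 50), nrow = 200,
               dimnames = list(paste0("gene", 1:200), paste0("sample", 1:50)))
rownames(expr)[1] <- "CDH1"

cdh1  <- expr["CDH1", ]
stats <- apply(expr[rownames(expr) != "CDH1", ], 1, function(g) {
  test <- cor.test(g, cdh1, method = "spearman")
  c(rho = unname(test$estimate), p = test$p.value)
})
stats <- as.data.frame(t(stats))
stats$fdr <- p.adjust(stats$p, method = "BH")

head(stats[order(stats$rho), ])   # candidate partners to inspect further
```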
Tom Kelly is a PhD candidate in genetics who uses various bioinformatics tools for data analysis and visualization, including R, Python, and Bash Shell. His favorite tool is R due to its vast array of packages supporting data analysis and visualization for computational biology. He is interested in emerging presentation tools like Prezi and Microsoft Sway that offer alternatives to PowerPoint. Kelly's research focuses on using the synthetic lethal concept to indirectly target tumor suppressor genes for personalized cancer therapy through computational analysis of genetic interactions and experimental screening.
eResearch Feb 2016: Sifting the needles in the haystack (Tom Kelly)
This document summarizes a bioinformatics analysis that used resampling techniques to compare predicted synthetic lethal gene interactions to experimental screening data. The analysis predicted synthetic lethal partners for the CDH1 gene in breast cancer using a method called SLIPT. Pathway enrichment analysis found several pathways enriched in both the SLIPT predictions and their intersections with experimental screens, including cell cycle, DNA repair, and WNT signaling pathways. Resampling by permutation was used to generate a null distribution of pathway enrichments and to test whether overlaps between SLIPT predictions and screens were higher than expected by chance. Several pathways, including translation, nonsense-mediated decay, and immune pathways, were significantly enriched in both datasets after multiple testing correction. The analysis provides computational validation of the predicted synthetic lethal interactions.
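The permutation step can be made concrete with a small, self-contained example. The sketch below uses made-up gene identifiers and set sizes; it demonstrates only the resampling logic (is the overlap between predicted partners and screen hits larger than expected by chance?), not the actual pathway-level analysis.

```r
# Toy sketch of the resampling idea: compare the observed overlap between
# SLIPT-predicted genes and screen hits against a permutation null.
# Gene identifiers and set sizes are illustrative only.
set.seed(42)
universe   <- paste0("gene", 1:5000)
predicted  <- sample(universe, 400)    # e.g. SLIPT candidate partners of CDH1
screen_hit <- sample(universe, 300)    # e.g. hits from an RNAi screen

observed <- length(intersect(predicted, screen_hit))

# Null distribution: overlap of equally sized random gene sets with the screen
null_overlap <- replicate(10000,
  length(intersect(sample(universe, length(predicted)), screen_hit)))

p_value <- (sum(null_overlap >= observed) + 1) / (length(null_overlap) + 1)

hist(null_overlap, main = "Permutation null", xlab = "overlap size")
abline(v = observed, col = "red")
p_value
```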
Bioinformatic Analysis of Synthetic Lethality in Breast Cancer (Tom Kelly)
This document summarizes a bioinformatic analysis of synthetic lethal genetic interactions in breast cancer. It describes how the researchers used gene expression data from breast cancer samples to predict potential synthetic lethal gene pairs through statistical testing. Many statistically significant interactions were found, including known synthetic lethal partners. The researchers validated some predictions and discuss applications for targeted cancer therapies and chemoprevention. High performance computing resources were crucial for analyzing large genome-scale datasets.
Hidden in Plain Sight - The Genetics of Zombies (Tom Kelly)
Tom Kelly argues that eugenics programs aimed at eliminating "zombie genes" would be ineffective and unethical. While some view zombies as a genetic disease, others see it as a contagious condition, with different policy implications. Historically, reactions to outbreaks have varied depending on whether the condition was seen as genetic or contagious, with genetic views sometimes leading to misguided and harmful eugenics policies rather than coexistence and understanding between the affected and unaffected.
AI in the Workplace Reskilling, Upskilling, and Future Work.pptxSunil Jagani
Discover how AI is transforming the workplace and learn strategies for reskilling and upskilling employees to stay ahead. This comprehensive guide covers the impact of AI on jobs, essential skills for the future, and successful case studies from industry leaders. Embrace AI-driven changes, foster continuous learning, and build a future-ready workforce.
Read More - https://bit.ly/3VKly70
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsScyllaDB
ScyllaDB monitoring provides a lot of useful information. But sometimes it’s not easy to find the root of the problem if something is wrong or even estimate the remaining capacity by the load on the cluster. This talk shares our team's practical tips on: 1) How to find the root of the problem by metrics if ScyllaDB is slow 2) How to interpret the load and plan capacity for the future 3) Compaction strategies and how to choose the right one 4) Important metrics which aren’t available in the default monitoring setup.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
This talk will cover ScyllaDB Architecture from the cluster-level view and zoom in on data distribution and internal node architecture. In the process, we will learn the secret sauce used to get ScyllaDB's high availability and superior performance. We will also touch on the upcoming changes to ScyllaDB architecture, moving to strongly consistent metadata and tablets.
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from downtime in 1 minute = $5-$10 thousand dollars. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for
seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...DanBrown980551
This LF Energy webinar took place June 20, 2024. It featured:
-Alex Thornton, LF Energy
-Hallie Cramer, Google
-Daniel Roesler, UtilityAPI
-Henry Richardson, WattTime
In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms.
This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups.
Three primary specifications will be discussed:
-Discovery and client registration, emphasizing transparent processes and secure and private access
-Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure
-Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
As AI pushes into IT, I have been asking myself, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud-native principles to it as well? What benefits could the two technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we will discuss what cloud/on-premise strategy we may need in order to apply AI to our own infrastructure and make it work from an enterprise perspective. I will give an overview of the infrastructure requirements and technologies that could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insight into the approaches I have already got working in practice.
Keywords: AI, Containers, Kubernetes, Cloud Native
Event Link: https://meine.doag.org/events/cloudland/2024/agenda/#agendaId.4211
"NATO Hackathon Winner: AI-Powered Drug Search", Taras KlobaFwdays
This session details how PostgreSQL features and Azure AI Services can be used effectively to significantly enhance the search functionality of any application.
In this session, we'll share insights on how we used PostgreSQL to facilitate precise searches across multiple fields in our mobile application. The techniques include using LIKE and ILIKE operators and integrating a trigram-based search to handle potential misspellings, thereby increasing the search accuracy.
We'll also discuss how the azure_ai extension on PostgreSQL databases in Azure and Azure AI Services were utilized to create vectors from user input, a feature beneficial when users wish to find specific items based on text prompts. While our application's case study involves a drug search, the techniques and principles shared in this session can be adapted to improve search functionality in a wide range of applications. Join us to learn how PostgreSQL and Azure AI can be harnessed to enhance your application's search capability.
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into an industry leader in the manufacture of product branding, automotive cockpit trim, and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
What is an RPA CoE? Session 2 – CoE RolesDianaGray10
In this session, we will review the players involved in the CoE and how each role impacts opportunities.
Topics covered:
• What roles are essential?
• What place in the automation journey does each role play?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how it shapes the CoE structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
My Research Journey with R
1. My Research Journey with R
How learning, using, and teaching R has helped my career in the life sciences
#TokyoR 2018-7-15
Tom Kelly
Postdoctoral Researcher
Epigenome Technology Exploration Unit
RIKEN Centre for Integrative Medical Sciences
Yokohama, Japan
Kelly, Tom
Postdoctoral researcher
Epigenome Technology Exploration Unit
RIKEN Center for Integrative Medical Sciences (National Research and Development Agency)
Yokohama, Japan
2. My Research Journey with R
Why I chose R to do (the vast majority of) my research
What I use R for in my research and what I’ve learned along the way
How my workflow has changed and package recommendations
Future challenges and hot topics
3. My Research Journey with R
Introduction
Studied at the University of Otago, Dunedin, New Zealand
Majored in genetics and mathematics
Focused on “bioinformatics” in postgrad
PhD on gene interactions in breast cancer for “precision
medicine” supervised by A/Prof. Mik Black (a statistician)
Worked at Tohoku University, Sendai, Miyagi Prefecture
Assisted with academic writing and data analysis in
Neuroscience and Bioengineering Laboratories
Taught statistical analysis and programming in R to
international postgraduate students (in English)
Currently a postdoc at RIKEN, Yokohama campus
Part of a Plant Stem Cell Analysis consortium
Focusing on single-cell genomics technologies
Continuing to develop new analysis techniques and
pipelines driven by new technology
Tom Kelly
Twitter: @tomkXY
GitHub: TomKellyGenetics
4. Why I Started With R
My supervisor was a statistician and a good example of how
R could be used in my field
An opportunity to learn new (transferable) computational
skills and work with “Big Data” (rather than theory or
experiments)
Free and Open-Source
A large (and growing) user community to engage with (and
seek help from) online and at events
A huge ecosystem of packages to do statistical analyses and
plotting (especially in the field of genomics/bioinformatics)
CRAN, Bioconductor, GitHub
Mik Black, University of Otago, Dunedin, New Zealand
5. What I Use R For
Pretty much everything . . .
Analysis of gene expression patterns (differential expression,
molecular subtypes, cluster analysis); a minimal sketch follows below
Pathway (functional group) enrichment and network (graph
structure) analysis
Develop and test novel analysis methods for genomics data
Analysis of heterogeneity (variation) at the single-cell level
(classification and markers of cell types)
Integrative “omics” analysis across data from different
techniques (genetic variant, mutation, gene expression,
protein, metabolism, epigenetic regulatory states, chromatin
structure)
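As a minimal illustration of the differential expression analysis above, here is a hedged sketch using limma (one of the packages listed on the next slide); the expression matrix and two-group design are simulated placeholders, not real data.
library(limma)

# Simulated placeholder data: 400 genes x 4 samples, two groups
expr  <- matrix(rnorm(400 * 4), nrow = 400,
                dimnames = list(paste0("gene", 1:400), paste0("sample", 1:4)))
group <- factor(c("tumour", "tumour", "normal", "normal"))

design <- model.matrix(~ group)        # intercept + tumour-vs-normal coefficient
fit    <- eBayes(lmFit(expr, design))  # gene-wise linear models, moderated t-statistics
topTable(fit, coef = 2, number = 10)   # top differentially expressed genes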
6. How I Use R
Data manipulation and statistical analysis
Built-in functions (“base R”, stats) and distributions (mvtnorm, extraDistr)
data.table (fread) and tibble for enhanced “data frames”
igraph for graph theory, pathway structure, and network analysis
Parallel computing with snow and OpenMPI (simulations and permutations)
Accessing genomics annotation and analysis packages
Genomic data (e.g., org.Hs.eg.db, reactome.db)
Statistical analysis (e.g., limma, edgeR)
Plotting and data visualisation
gplots (heatmap.2 and venn diagram), vioplot, and built-in plots (scatterplot,
lineplot, boxplot, histograms, titles, axes, legends, etc)
Dimension reduction techniques: SVD, PCA, tSNE (Rtsne), UMAP (umap)
Many of these are also provided in the “tidyverse”
readr, tidyr and dplyr for data manipulation
ggplot2 for visualisation
More and more and more utilities and packages from GitHub
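A brief, hypothetical sketch of how these pieces fit together in practice, assuming a counts.csv file with gene, sample, and count columns: fast reading with data.table's fread, summarising with dplyr, and plotting with ggplot2.
library(data.table)
library(dplyr)
library(ggplot2)

counts <- fread("counts.csv")              # fast reading of a (hypothetical) counts table

library_sizes <- counts %>%
  group_by(sample) %>%                     # per-sample summaries
  summarise(total    = sum(count),
            detected = sum(count > 0))

ggplot(library_sizes, aes(x = sample, y = total)) +
  geom_col() +
  labs(title = "Library size per sample", y = "Total counts")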
7. How I Use R
Shiny Apps
Build and share interactive apps
Even if you can’t write JavaScript
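To show how little code a Shiny app needs, here is a toy app of my own (an illustration, not one of the apps shown in the slides) with a slider controlling a simulated scatterplot.
library(shiny)

ui <- fluidPage(
  titlePanel("Toy single-cell scatter"),
  sliderInput("n", "Number of cells", min = 10, max = 500, value = 100),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n),
         xlab = "Dimension 1", ylab = "Dimension 2",
         main = paste(input$n, "simulated cells"))
  })
}

shinyApp(ui, server)  # run locally, or deploy (e.g. to shinyapps.io)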
14. How I Use R
Package development and code release with devtools
Develop R packages with devtools and roxygen2 (documentation)
Share functions and release code as a research output
Release: CRAN, Bioconductor, GitHub, ROpenSci
Cite: Zenodo, Journal of Open Source Software, Journal of Statistical Software
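A condensed sketch of that devtools/roxygen2 workflow; the package name and the example function are placeholders.
library(devtools)

create_package("mypackage")  # skeleton: DESCRIPTION, NAMESPACE, R/ (re-exported from usethis)

# In R/hello.R, document functions with roxygen2 comments, e.g.:
#   #' Greet a user
#   #' @param name Character, the name to greet.
#   #' @export
#   hello <- function(name) paste("Hello,", name)

document()   # generate man/ pages and NAMESPACE entries with roxygen2
check()      # run R CMD check before release
install()    # install locally; then release to CRAN/Bioconductor or push to GitHub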
15. How I Use R
Packages I’ve developed
Data visualisation
heatmap.2x for annotated heatmaps, extending heatmap.2 (gplots)
vioplot an enhanced version (proposed as version 0.3)
plot.igraph plotting directional graph structures, including
inhibitory links
Network analysis using igraph
graphsim simulate gene expression from pathway graph structures
pathway.structure.permutation perform permutation analysis
of gene candidates in a pathway structure
info.centrality compute network efficiency and information
centrality
igraph.extensions install all of the above
Gene expression analysis
slipt detect “synthetic lethal” gene interactions in expression data
DoubletDetection R implementation of a tool to detect technical
errors in single-cell RNA-Seq data
Developing packages has become a part of how I analyse data
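To give a feel for the inputs these packages work with, here is a toy example of my own (not code from the packages themselves): a small hypothetical signalling pathway as an igraph object, plus a violin-plot comparison of simulated values with vioplot.
library(igraph)
library(vioplot)

# A hypothetical five-gene signalling pathway as a directed graph
pathway <- graph_from_literal(RTK -+ RAS, RAS -+ RAF, RAF -+ MEK, MEK -+ ERK)
plot(pathway, edge.arrow.size = 0.5)

# Simulated expression values for two groups, compared with violin plots
expr_wt  <- rnorm(100, mean = 5)
expr_mut <- rnorm(100, mean = 7)
vioplot(expr_wt, expr_mut, names = c("wild-type", "mutant"))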
16. How I Use R
How my workflow has changed
Interactive use with the RStudio IDE (which I still use)
Using Projects (especially to develop packages)
Running scripts in the terminal (in the background with nohup) on a
local PC or remote servers
Developing (and documenting) functions and packages that I intend to
reuse and share
17. How I Use R
Biggest challenges
Being an early-adopter is hard
(and sometimes worth it)
Taking a project using different tools to your team is hard
(but there is help online!)
Keeping up with the latest tools in the field
(but there could be worse problems)
18. Engage with the community
Online (beyond the “help” system)
StackOverflow/StackExchange (Q&A)
GitHub (Share code)
Twitter (#Rstats #Rlang)
R blogs
Google (everyone does it!)
Workshops and community events
Software Carpentry / Data Carpentry
(swcarpentry @thecarpentries)
Research Bazaar (ResBaz)
HackyHour
Mozilla “Study Group”
R user groups (Meetup, #TokyoR)
19. It’s not just statistics: it’s a language
Mike Sumner
Australian Antarctic Division, Antarctic Climate and Ecosystems
Hobart, Australia
Twitter: @mdsumner
GitHub: mdsumner
#RLang
21. Learning in a community
Australia
Research Bazaar (2015) Melbourne
ResBaz organisers Software Carpentry Instructors
22. Learning in a community
New Zealand
ResBaz (2016) Dunedin ResBaz (2017) Auckland
ResBaz (Feb 2018) Dunedin
ResBaz (June 2018) Dunedin
23. R is a global community
R user groups (RUGs)
Joseph Rickert (@RStudioJoe)
ResBaz events (2017)
Software Carpentry Instructors
R User Groups (Meetup)
“RLadies” Groups
24. Programming is Learning
Things I want to learn more about or do better
Project management
Tracking package versions (packrat)
Testing functions and packages with Travis CI or AppVeyor
Version control (git) and containers (docker)
Calling other languages (use the best tool for the job)
Python (reticulate), Julia (RJulia), C++ (Rcpp)
The “tidyverse” from Hadley Wickham et al
readr, tidyr, glue, dplyr, purrr, ggplot2 (gganimate, gghighlight)
Analysis techniques
Machine Learning, Statistical Learning, AI
Bayesian modelling and inference
Techniques for “single-cell” analysis (Seurat, monocle, etc)
Plotting to communicate variation and uncertainty
Colour-blind “friendly” palettes (RColorBrewer, viridis); example below
Value-suppressing uncertainty palettes (VSUP)
Interactive plots (plotly, shiny, or D3.js)
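For example, a colour-blind friendly palette is one line with the viridis scales built into ggplot2; this minimal sketch uses a built-in dataset rather than a figure from the slides.
library(ggplot2)

ggplot(faithful, aes(x = eruptions, y = waiting, colour = waiting)) +
  geom_point(size = 2) +
  scale_colour_viridis_c() +   # perceptually uniform, colour-blind friendly
  labs(title = "Old Faithful eruptions", colour = "Waiting (min)")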
27. Advice
You never stop learning R
Everyone uses Google (and that’s ok!)
Seek projects that challenge you to learn more
Code is a means to an end: keep project goals in mind!
Code together; teach together; learn together