Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Juan Antonio Vizcaino
The document discusses the challenges and opportunities of big data in proteomics. It describes how proteomics data volumes are growing rapidly due to technological advances, creating both computational challenges for data analysis and opportunities to reuse large amounts of public data. The PRIDE Archive at EBI stores over 4,000 proteomics datasets and provides tools like PRIDE Inspector to help analyze and validate large datasets. However, challenges remain around data standardization, metadata completeness, and the need for greater computational infrastructure and expertise to fully leverage the large amounts of shared proteomics data.
Public proteomics data: a (mostly unexploited) gold mine for computational re...Juan Antonio Vizcaino
The document discusses public proteomics data available through the PRIDE Archive at the European Bioinformatics Institute. It provides statistics on data submissions and downloads, which continue to increase significantly each year. The author advocates for reusing public proteomics data through approaches like proteogenomics studies, discovery of new post-translational modifications, and meta-analysis studies. Spectrum clustering is presented as a method to further analyze and draw insights from large proteomics datasets.
The document introduces several proteomics data standards developed by the Proteomics Standards Initiative (PSI), including mzML, mzIdentML, mzQuantML, TraML, and mzTab. It provides an overview of each standard, describing what type of data it encodes (e.g. mass spectrometry data, identification data, quantification data), its timeline of development and versions, and its increasing adoption by proteomics software and databases. The document emphasizes that data standards are necessary for data sharing and integration in proteomics given the large number of experimental workflows and data types.
Mining the hidden proteome using hundreds of public proteomics datasetsJuan Antonio Vizcaino
The document discusses mining hidden proteomics data using public proteomics datasets. It describes how the PRIDE Cluster tool clusters over 250 million spectra from the PRIDE Archive, including over 190 million previously unidentified spectra. This clustering identified inconsistent clusters that could be reanalyzed, inferred identifications for 9.1 million originally unidentified spectra contained within reliable identification clusters, and consistently unidentified clusters that could be targeted for further analysis to identify unknown peptides. The clustering took 5 days on a 340-core system and generated 28 million clusters.
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Juan Antonio Vizcaino
The document discusses the challenges and opportunities of big data in proteomics. It describes how proteomics data volumes are growing rapidly due to technological advances, creating both computational challenges for data analysis and opportunities to reuse large amounts of public data. The PRIDE Archive at EBI stores over 4,000 proteomics datasets and provides tools like PRIDE Inspector to help analyze and validate large datasets. However, challenges remain around data standardization, metadata completeness, and the need for greater computational infrastructure and expertise to fully leverage the large amounts of shared proteomics data.
Public proteomics data: a (mostly unexploited) gold mine for computational re...Juan Antonio Vizcaino
The document discusses public proteomics data available through the PRIDE Archive at the European Bioinformatics Institute. It provides statistics on data submissions and downloads, which continue to increase significantly each year. The author advocates for reusing public proteomics data through approaches like proteogenomics studies, discovery of new post-translational modifications, and meta-analysis studies. Spectrum clustering is presented as a method to further analyze and draw insights from large proteomics datasets.
The document introduces several proteomics data standards developed by the Proteomics Standards Initiative (PSI), including mzML, mzIdentML, mzQuantML, TraML, and mzTab. It provides an overview of each standard, describing what type of data it encodes (e.g. mass spectrometry data, identification data, quantification data), its timeline of development and versions, and its increasing adoption by proteomics software and databases. The document emphasizes that data standards are necessary for data sharing and integration in proteomics given the large number of experimental workflows and data types.
Mining the hidden proteome using hundreds of public proteomics datasetsJuan Antonio Vizcaino
The document discusses mining hidden proteomics data using public proteomics datasets. It describes how the PRIDE Cluster tool clusters over 250 million spectra from the PRIDE Archive, including over 190 million previously unidentified spectra. This clustering identified inconsistent clusters that could be reanalyzed, inferred identifications for 9.1 million originally unidentified spectra contained within reliable identification clusters, and consistently unidentified clusters that could be targeted for further analysis to identify unknown peptides. The clustering took 5 days on a 340-core system and generated 28 million clusters.
The document discusses proteomics repositories and their role in sharing mass spectrometry (MS) proteomics data. It describes the main types of information stored in MS proteomics repositories, including raw experimental data, identification and quantification results, metadata, and other associated information. The document outlines some of the main existing repositories, including PRIDE Archive, PeptideAtlas, and Global Proteome Machine, and whether they reprocess data through a standardized pipeline or store data as published. Reprocessing repositories provide an updated view of data through consistent analysis, while no-reprocessing repositories preserve the original analysis. Data sharing is important for independent review, meta-analysis, and advancing the field.
This document discusses the ProteomeXchange Consortium and recent updates. It provides statistics on data submissions and downloads. Over 7,475 datasets have been submitted from over 50 countries, with the majority from the US, Germany, and China. PRIDE and MassIVE are the largest repositories. A new prospective member, iProX, is described which will be the main proteomics data sharing platform in China. Guidelines are being developed to handle reprocessed datasets submitted to repositories.
The document discusses data standards for proteomics, including those developed by the Proteomics Standards Initiative (PSI). It describes several existing PSI standards for mass spectrometry data, including mzML, mzIdentML, mzQuantML, and TraML. It provides an example of the successful mzML standard and discusses how mzIdentML has been widely adopted for representing peptide and protein identifications.
An overview of the PRIDE ecosystem of resources and computational tools for m...Juan Antonio Vizcaino
The document provides an overview of the PRIDE ecosystem of resources and computational tools for mass spectrometry proteomics data. It describes PRIDE Archive and ProteomeXchange as repositories for proteomics data, as well as tools like PRIDE Inspector for visualizing and validating data. It also discusses how public proteomics data is increasingly being reused, and added-value resources like PRIDE Cluster and PRIDE Proteomes that provide aggregated views of proteomics data.
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...Juan Antonio Vizcaino
The document discusses PRIDE and ProteomeXchange, which are resources that support the deposition of proteomics data to public repositories. PRIDE stores mass spectrometry-based proteomics data, and is one of the repositories that is part of ProteomeXchange, a framework that allows standard submission of proteomics data between major repositories. The document outlines the cultural change in proteomics towards public data sharing, and provides information on submitting proteomics data to PRIDE and accessing data deposited in PRIDE and ProteomeXchange.
The document discusses updates to the PRIDE Cluster project. PRIDE Cluster analyzes mass spectrometry proteomics data stored in the PRIDE database by clustering peptide spectra. The latest implementation clustered over 256 million spectra using Apache Hadoop. This resulted in 28 million clusters, including clusters with inconsistent identifications, clusters linking identified and unidentified spectra, and large clusters of consistently unidentified spectra that could help identify new peptides and post-translational modifications. The PRIDE Cluster provides a public resource for data mining the large collection of proteomics datasets in PRIDE.
Text and Non-textual Objects: Seamless access for scientists
Uwe Rosemann (German National Library of Science and Technology (TIB), Germany)
The European High Level Expert Group on Scientific data has formulated the challenges for a scientific infrastructure to be reached by 2030: “Our vision is a scientific e-infrastructure that supports seamless access, use, re-use, and trust of data. In a sense, the physical and technical infrastructure becomes invisible and the data themselves become the infrastructure – a valuable asset, on which science, technology, the economy and society can advance”.
Here, “data” is not restricted to primary data but also includes all non-textual material (graphs, spectra, videos, 3D-objects etc.).
The German National Library of Science and Technology (TIB) has developed a concept for a national competence center for non-textual materials which is now founded by the German State and by the German Federal Countries. The center has to perform the task: developing solutions and services together with the scientific community to make such data available, citable, sharable and usable, including visual search tools and enhanced content-based retrieval.
With solutions such as DataCite and modular development for extraction, indexing and visual searching of new scientific metadata, TIB will accept the challenge. And will make all data accessible to its users fast, convenient and easy to use.
The paper shows what special tools are developed by TIB in the context of scientific AV-media, 3D-objects and research data.
What is Reproducibility? The R* brouhaha (and how Research Objects can help)Carole Goble
presented at 1st First International Workshop on Reproducible Open Science @ TPDL, 9 Sept 2016, Hannover, Germany
http://repscience2016.research-infrastructures.eu/
ChemAxon is an industry leader in enterprise and web-based structure content management and delivery. They have the widest file format support and deployments throughout STM publishing and major pharma/biotech. Hot topics discussed include Markush search capabilities with Thomson Reuters content, biomolecule registration tools, and ongoing improvements to structure extraction from documents and databases. ChemAxon works with many commercial and public/free organizations and is interested in custom work and new product ideas from users.
NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel
FIndable, Accessible, Interoperable, Reusable Software and Data Citation: Europe, Research Objects, and BioSchemas.org
FAIR Data, Operations and Model management for Systems Biology and Systems Me...Carole Goble
This document discusses the FAIRDOM consortium's efforts to promote FAIR (Findable, Accessible, Interoperable, Reusable) principles for managing data, operations, and models from systems biology and systems medicine projects. It outlines challenges in asset management for multi-partner, multi-disciplinary projects using multiple formats and repositories. FAIRDOM provides pillars of support including community actions, platforms/tools, and a public project commons to help address these challenges and better enable sharing, reuse, and reproducibility of research assets according to FAIR principles.
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...Carole Goble
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording
of metadata for their interpretation.
The FAIR Guiding Principles for scientific data management and stewardship (http://www.nature.com/articles/sdata201618) has been an effective rallying-cry for EU and USA Research Infrastructures. FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has 8 years of experience of asset sharing and data infrastructure ranging across European programmes (SysMO and EraSysAPP ERANets), national initiatives (de.NBI, German Virtual Liver Network, UK SynBio centres) and PI's labs. It aims to support Systems and Synthetic Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Sys Bio, highlighting the challenges of and approaches to sharing, credit, citation and asset infrastructures in practice. I'll also highlight recent experiments in affecting sharing using behavioural interventions.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Presented at COMBINE 2016, Newcastle, 19 September.
http://co.mbine.org/events/COMBINE_2016
Reflections on a (slightly unusual) multi-disciplinary academic careerCarole Goble
Talk given at the School of Computer Science, The University of Manchester, UK Postgraduate Research Symposium 2019
the Carole Goble Doctoral Paper award was given for the first time
Some tools developed at OEG (Ontology Engineering Group) for facilitating ontology engineering activities as evaluation, documentation, releasing and publication.
Reproducibility, Research Objects and Reality, Leiden 2016Carole Goble
Presented at the Leiden Bioscience Lecture, 24 November 2016, Reproducibility, Research Objects and Reality
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. It all sounds very laudable and straightforward. BUT…..
Reproducibility is a R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield raising concerns of credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange
In this talk I will explore these issues in data-driven computational life sciences through the examples and stories from initiatives I am involved, and Leiden is involved in too including:
· FAIRDOM which has built a Commons for Systems and Synthetic Biology projects, with an emphasis on standards smuggled in by stealth and efforts to affecting sharing practices using behavioural interventions
· ELIXIR, the EU Research Data Infrastructure, and its efforts to exchange workflows
· Bioschemas.org, an ELIXIR-NIH-Google effort to support the finding of assets.
Reproducible Research: how could Research Objects helpCarole Goble
Reproducible Research: how could Research Objects help, given at 21st Genomic Standards Consortium Meeting
Dates: May 20-23, 2019
https://press3.mcs.anl.gov/gensc/meetings/gsc21/
FAIR data and model management for systems biology.FAIRDOM
Written and presented by Carole Goble (University of Manchester) as part of Intelligent Systems for Molecular Biology (ISMB), Dublin. July 10th - 14th 2015.
ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance...Juan Antonio Vizcaino
This document summarizes a presentation about the ProteomeXchange (PX) consortium, which provides a framework for standard data submission and dissemination between major proteomics repositories, including PRIDE, PeptideAtlas, and MassIVE. It describes how researchers can submit complete or partial datasets to PX via PRIDE using the PX submission tool. Complete submissions use mzIdentML for processed results, while partial submissions store search engine output files. Over 1,300 datasets have been submitted to PX from researchers worldwide.
This document provides information about recent developments in the DOI and DataCite community. It discusses new clients of the TIB DOI service including universities and research institutions in Germany and abroad. It also mentions new responsibilities for the TIB DOI service related to a research data management project. The document then summarizes new members that have joined DataCite recently from different countries and organizations. It concludes by discussing potential Chinese collaborations for DataCite.
Use of spark for proteomic scoring seattle presentationlordjoe
This document discusses using Apache Spark to parallelize proteomic scoring, which involves matching tandem mass spectra against a large database of peptides. The author developed a version of the Comet scoring algorithm and implemented it on a Spark cluster. This outperformed single machines by over 10x, allowing searches that took 8 hours to be done in under 30 minutes. Key considerations for running large jobs in parallel on Spark are discussed, such as input formatting, accumulator functions for debugging, and smart partitioning of data. The performance improvements allow searching larger databases and considering more modifications.
The document discusses proteomics repositories and their role in sharing mass spectrometry (MS) proteomics data. It describes the main types of information stored in MS proteomics repositories, including raw experimental data, identification and quantification results, metadata, and other associated information. The document outlines some of the main existing repositories, including PRIDE Archive, PeptideAtlas, and Global Proteome Machine, and whether they reprocess data through a standardized pipeline or store data as published. Reprocessing repositories provide an updated view of data through consistent analysis, while no-reprocessing repositories preserve the original analysis. Data sharing is important for independent review, meta-analysis, and advancing the field.
This document discusses the ProteomeXchange Consortium and recent updates. It provides statistics on data submissions and downloads. Over 7,475 datasets have been submitted from over 50 countries, with the majority from the US, Germany, and China. PRIDE and MassIVE are the largest repositories. A new prospective member, iProX, is described which will be the main proteomics data sharing platform in China. Guidelines are being developed to handle reprocessed datasets submitted to repositories.
The document discusses data standards for proteomics, including those developed by the Proteomics Standards Initiative (PSI). It describes several existing PSI standards for mass spectrometry data, including mzML, mzIdentML, mzQuantML, and TraML. It provides an example of the successful mzML standard and discusses how mzIdentML has been widely adopted for representing peptide and protein identifications.
An overview of the PRIDE ecosystem of resources and computational tools for m...Juan Antonio Vizcaino
The document provides an overview of the PRIDE ecosystem of resources and computational tools for mass spectrometry proteomics data. It describes PRIDE Archive and ProteomeXchange as repositories for proteomics data, as well as tools like PRIDE Inspector for visualizing and validating data. It also discusses how public proteomics data is increasingly being reused, and added-value resources like PRIDE Cluster and PRIDE Proteomes that provide aggregated views of proteomics data.
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...Juan Antonio Vizcaino
The document discusses PRIDE and ProteomeXchange, which are resources that support the deposition of proteomics data to public repositories. PRIDE stores mass spectrometry-based proteomics data, and is one of the repositories that is part of ProteomeXchange, a framework that allows standard submission of proteomics data between major repositories. The document outlines the cultural change in proteomics towards public data sharing, and provides information on submitting proteomics data to PRIDE and accessing data deposited in PRIDE and ProteomeXchange.
The document discusses updates to the PRIDE Cluster project. PRIDE Cluster analyzes mass spectrometry proteomics data stored in the PRIDE database by clustering peptide spectra. The latest implementation clustered over 256 million spectra using Apache Hadoop. This resulted in 28 million clusters, including clusters with inconsistent identifications, clusters linking identified and unidentified spectra, and large clusters of consistently unidentified spectra that could help identify new peptides and post-translational modifications. The PRIDE Cluster provides a public resource for data mining the large collection of proteomics datasets in PRIDE.
Text and Non-textual Objects: Seamless access for scientists
Uwe Rosemann (German National Library of Science and Technology (TIB), Germany)
The European High Level Expert Group on Scientific data has formulated the challenges for a scientific infrastructure to be reached by 2030: “Our vision is a scientific e-infrastructure that supports seamless access, use, re-use, and trust of data. In a sense, the physical and technical infrastructure becomes invisible and the data themselves become the infrastructure – a valuable asset, on which science, technology, the economy and society can advance”.
Here, “data” is not restricted to primary data but also includes all non-textual material (graphs, spectra, videos, 3D-objects etc.).
The German National Library of Science and Technology (TIB) has developed a concept for a national competence center for non-textual materials which is now founded by the German State and by the German Federal Countries. The center has to perform the task: developing solutions and services together with the scientific community to make such data available, citable, sharable and usable, including visual search tools and enhanced content-based retrieval.
With solutions such as DataCite and modular development for extraction, indexing and visual searching of new scientific metadata, TIB will accept the challenge. And will make all data accessible to its users fast, convenient and easy to use.
The paper shows what special tools are developed by TIB in the context of scientific AV-media, 3D-objects and research data.
What is Reproducibility? The R* brouhaha (and how Research Objects can help)Carole Goble
presented at 1st First International Workshop on Reproducible Open Science @ TPDL, 9 Sept 2016, Hannover, Germany
http://repscience2016.research-infrastructures.eu/
ChemAxon is an industry leader in enterprise and web-based structure content management and delivery. They have the widest file format support and deployments throughout STM publishing and major pharma/biotech. Hot topics discussed include Markush search capabilities with Thomson Reuters content, biomolecule registration tools, and ongoing improvements to structure extraction from documents and databases. ChemAxon works with many commercial and public/free organizations and is interested in custom work and new product ideas from users.
NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel
FIndable, Accessible, Interoperable, Reusable Software and Data Citation: Europe, Research Objects, and BioSchemas.org
FAIR Data, Operations and Model management for Systems Biology and Systems Me...Carole Goble
This document discusses the FAIRDOM consortium's efforts to promote FAIR (Findable, Accessible, Interoperable, Reusable) principles for managing data, operations, and models from systems biology and systems medicine projects. It outlines challenges in asset management for multi-partner, multi-disciplinary projects using multiple formats and repositories. FAIRDOM provides pillars of support including community actions, platforms/tools, and a public project commons to help address these challenges and better enable sharing, reuse, and reproducibility of research assets according to FAIR principles.
FAIRDOM - FAIR Asset management and sharing experiences in Systems and Synthe...Carole Goble
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs and so forth. Don’t stop reading. Data management isn’t likely to win anyone a Nobel prize. But publications should be supported and accompanied by data, methods, procedures, etc. to assure reproducibility of results. Funding agencies expect data (and increasingly software) management retention and access plans as part of the proposal process for projects to be funded. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. The multi-component, multi-disciplinary nature of Systems Biology demands the interlinking and exchange of assets and the systematic recording
of metadata for their interpretation.
The FAIR Guiding Principles for scientific data management and stewardship (http://www.nature.com/articles/sdata201618) has been an effective rallying-cry for EU and USA Research Infrastructures. FAIRDOM (Findable, Accessible, Interoperable, Reusable Data, Operations and Models) Initiative has 8 years of experience of asset sharing and data infrastructure ranging across European programmes (SysMO and EraSysAPP ERANets), national initiatives (de.NBI, German Virtual Liver Network, UK SynBio centres) and PI's labs. It aims to support Systems and Synthetic Biology researchers with data and model management, with an emphasis on standards smuggled in by stealth and sensitivity to asset sharing and credit anxiety.
This talk will use the FAIRDOM Initiative to discuss the FAIR management of data, SOPs, and models for Sys Bio, highlighting the challenges of and approaches to sharing, credit, citation and asset infrastructures in practice. I'll also highlight recent experiments in affecting sharing using behavioural interventions.
http://www.fair-dom.org
http://www.fairdomhub.org
http://www.seek4science.org
Presented at COMBINE 2016, Newcastle, 19 September.
http://co.mbine.org/events/COMBINE_2016
Reflections on a (slightly unusual) multi-disciplinary academic careerCarole Goble
Talk given at the School of Computer Science, The University of Manchester, UK Postgraduate Research Symposium 2019
the Carole Goble Doctoral Paper award was given for the first time
Some tools developed at OEG (Ontology Engineering Group) for facilitating ontology engineering activities as evaluation, documentation, releasing and publication.
Reproducibility, Research Objects and Reality, Leiden 2016Carole Goble
Presented at the Leiden Bioscience Lecture, 24 November 2016, Reproducibility, Research Objects and Reality
Over the past 5 years we have seen a change in expectations for the management of all the outcomes of research – that is the “assets” of data, models, codes, SOPs, workflows. The “FAIR” (Findable, Accessible, Interoperable, Reusable) Guiding Principles for scientific data management and stewardship have proved to be an effective rallying-cry. Funding agencies expect data (and increasingly software) management retention and access plans. Journals are raising their expectations of the availability of data and codes for pre- and post- publication. It all sounds very laudable and straightforward. BUT…..
Reproducibility is a R* minefield, depending on whether you are testing for robustness (rerun), defence (repeat), certification (replicate), comparison (reproduce) or transferring between researchers (reuse). Different forms of "R" make different demands on the completeness, depth and portability of research. Sharing is another minefield raising concerns of credit and protection from sharp practices.
In practice the exchange, reuse and reproduction of scientific experiments is dependent on bundling and exchanging the experimental methods, computational codes, data, algorithms, workflows and so on along with the narrative. These "Research Objects" are not fixed, just as research is not “finished”: the codes fork, data is updated, algorithms are revised, workflows break, service updates are released. ResearchObject.org is an effort to systematically support more portable and reproducible research exchange
In this talk I will explore these issues in data-driven computational life sciences through the examples and stories from initiatives I am involved, and Leiden is involved in too including:
· FAIRDOM which has built a Commons for Systems and Synthetic Biology projects, with an emphasis on standards smuggled in by stealth and efforts to affecting sharing practices using behavioural interventions
· ELIXIR, the EU Research Data Infrastructure, and its efforts to exchange workflows
· Bioschemas.org, an ELIXIR-NIH-Google effort to support the finding of assets.
Reproducible Research: how could Research Objects helpCarole Goble
Reproducible Research: how could Research Objects help, given at 21st Genomic Standards Consortium Meeting
Dates: May 20-23, 2019
https://press3.mcs.anl.gov/gensc/meetings/gsc21/
FAIR data and model management for systems biology.FAIRDOM
Written and presented by Carole Goble (University of Manchester) as part of Intelligent Systems for Molecular Biology (ISMB), Dublin. July 10th - 14th 2015.
ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance...Juan Antonio Vizcaino
This document summarizes a presentation about the ProteomeXchange (PX) consortium, which provides a framework for standard data submission and dissemination between major proteomics repositories, including PRIDE, PeptideAtlas, and MassIVE. It describes how researchers can submit complete or partial datasets to PX via PRIDE using the PX submission tool. Complete submissions use mzIdentML for processed results, while partial submissions store search engine output files. Over 1,300 datasets have been submitted to PX from researchers worldwide.
This document provides information about recent developments in the DOI and DataCite community. It discusses new clients of the TIB DOI service including universities and research institutions in Germany and abroad. It also mentions new responsibilities for the TIB DOI service related to a research data management project. The document then summarizes new members that have joined DataCite recently from different countries and organizations. It concludes by discussing potential Chinese collaborations for DataCite.
Use of spark for proteomic scoring seattle presentationlordjoe
This document discusses using Apache Spark to parallelize proteomic scoring, which involves matching tandem mass spectra against a large database of peptides. The author developed a version of the Comet scoring algorithm and implemented it on a Spark cluster. This outperformed single machines by over 10x, allowing searches that took 8 hours to be done in under 30 minutes. Key considerations for running large jobs in parallel on Spark are discussed, such as input formatting, accumulator functions for debugging, and smart partitioning of data. The performance improvements allow searching larger databases and considering more modifications.
The document discusses proteomics, which is the study of the proteome or total protein complement of a biological system. Proteomics aims to understand protein expression, functions, interactions, and modifications through various analytical techniques and faces many challenges due to the complexity of proteins. Key approaches in proteomics include expression profiling to compare protein levels between healthy and disease states, structural analysis to determine protein structures, and network mapping to study protein interactions. Mass spectrometry and bioinformatics tools play important roles in proteomic studies, which have applications in characterizing protein complexes and identifying disease biomarkers.
Soil biology is the study of microorganisms that live in soil, including their interactions with the environment and each other. There are many important groups of soil organisms including bacteria, fungi, protozoa, nematodes, micro- and mesofauna, and macrofauna like earthworms. Soil organisms carry out essential functions such as decomposition, nutrient cycling, and improving soil structure.
Soil microbiology and cycles of the elementsCara Molina
Soil is formed over long periods of time from weathered rock and decayed organic matter. It consists of minerals like sand, silt, and clay as well as organic matter and hosts a diverse array of microorganisms. Soil microorganisms play important roles in nutrient cycling, decomposition, and supporting plant growth. The most abundant microbes are bacteria and fungi, which break down organic residues. Other microbes like actinomycetes and mycorrhizal fungi also contribute to soil fertility. Protists and nematodes regulate microbe populations as predators. Overall, the complex web of soil microorganisms drives key ecosystem functions.
This document provides a summary of soil microorganisms and their functions in 3 sentences or less:
Soil is teeming with life including bacteria, fungi, protists, and animals that carry out essential functions like decomposing organic matter, fixing nitrogen, and forming symbiotic relationships with plant roots. There can be thousands of species of microbes like bacteria and fungi, and dozens of species of larger organisms like earthworms, mites and nematodes in a single handful of healthy soil. These diverse soil microorganisms interact and carry out critical processes in the soil ecosystem that support plant growth and agricultural production.
Soil microbiology is the study of microorganisms in soil such as bacteria, actinomycetes, fungi, algae and protozoa. These microorganisms are important because they affect soil structure and fertility through organic matter decomposition, nutrient transformations, and symbiotic relationships with plants. The four major groups of microbes found in soil are bacteria, actinomycetes, fungi, and algae, each playing an important role in soil health and plant growth.
This study aims to search for genetic and proteomic risk factors and protective factors associated with coronary heart disease (CHD) in order to develop new diagnostic techniques and therapies. The study will analyze gene expression patterns in peripheral blood monocytes and perform proteomics analysis of blood serum from five patient groups: 1) those with heart attack and risk factors, 2) those with heart attack without risk factors, 3) young individuals with risk factors but no heart attack, 4) elderly individuals with risk factors but no heart attack, and 5) healthy elderly individuals without risk factors. Gene expression profiles will be obtained using microarray analysis and validated with real-time PCR. Differentially expressed genes and proteins may help identify new targets for preventing and
The document discusses the field of proteomics, which is the large-scale study of proteins, including their functions and structures. It defines proteomics and describes several areas within it, such as functional proteomics, expressional proteomics, and structural proteomics. It outlines typical proteomics experiments and some key methods used, including two-dimensional electrophoresis, mass spectrometry, and protein-protein interaction prediction methods like phylogenetic profiling.
Soil microorganisms play important roles in maintaining soil health and fertility. They are involved in nutrient cycling by decomposing organic matter, fixing nitrogen, and carrying out other biochemical processes. The main types of microbes found in soil are bacteria, actinomycetes, fungi, algae, and protozoa. Soil microbes affect soil structure, plant growth, and carry out important processes like nitrogen fixation, nutrient availability, and degradation of pollutants. However, human activities like agricultural practices, urbanization, and climate change threaten soil microbes by reducing organic matter, increasing salinity, and introducing pollutants. Proper management is needed to protect these vital soil microorganisms.
Proteomics is the study of the structure and function of proteins. It involves identifying and quantifying the proteins expressed by a genome or cell type. Key aspects of proteomics include protein separation techniques like gel electrophoresis, mass spectrometry to identify proteins, and analyzing protein interactions and post-translational modifications. While genomes provide the blueprint, proteomics helps understand the diversity of proteins expressed and how they function together to direct cellular activities. It is a promising tool for disease diagnosis by identifying protein biomarkers.
Proteomics is the large-scale study of proteins, including their structures, functions, and interactions. It has become an important technology for understanding biological systems on a global scale. Mass spectrometry plays a key role in proteomic analysis by allowing researchers to identify and characterize proteins and their post-translational modifications like phosphorylation. There are challenges in analyzing post-translational modifications since proteins exist in multiple modified forms, but methods like affinity enrichment and tandem mass spectrometry are used to map modifications and locate them on protein sequences.
The document discusses PRIDE, a proteomics data repository at EMBL-EBI. It describes how PRIDE stores mass spectrometry proteomics data, its role within the ProteomeXchange consortium, and how researchers can submit data to PRIDE including the use of mzIdentML and PRIDE tools.
PRIDE is a proteomics database at EMBL-EBI that stores mass spectrometry-based proteomics data, including peptide and protein identifications and quantifications. It is part of the ProteomeXchange consortium, which aims to facilitate standardized data submission and dissemination between proteomics repositories. The document outlines the types of data stored in PRIDE, how to access and submit data, and tools for data conversion and visualization like PRIDE Converter 2 and PRIDE Inspector.
PRIDE resources and ProteomeXchange
- PRIDE is a proteomics data repository at EMBL-EBI that stores mass spectrometry-based proteomics data.
- It is part of the ProteomeXchange consortium, which provides a framework for standardized data submission and dissemination between proteomics repositories.
- This presentation discusses how to submit data to PRIDE/ProteomeXchange using PRIDE tools, including converting files to mzIdentML format and using the PX submission tool for metadata and file transfer.
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...Juan Antonio Vizcaino
Dr. Juan Antonio Vizcaíno presented on developing open data analysis pipelines in the cloud to enable large-scale analysis of proteomics data. He introduced PRIDE and ProteomeXchange as repositories for proteomics data that are seeing substantial growth. Moving analysis pipelines to the cloud will facilitate public reuse of large datasets, improve scalability, and ensure reproducibility. Initial pipelines have been created for identification, quantification, and quality control of mass spectrometry data and deployed on the EMBL-EBI cloud platform. Future work includes optimizing access to PRIDE data and developing pipelines for analysis of DIA and proteogenomics data.
This document summarizes Juan A. Vizcaíno's presentation on the ELIXIR Proteomics Community. It discusses the establishment of the community through an implementation study and strategy meeting. The community aims to develop standardized proteomics data analysis pipelines and deploy them in a cloud environment. It will also work to improve proteomics data standards and integrate proteomics with other omics data through activities like the Proteomics Standards Initiative. The ProteomeXchange database is a major resource overseen by the community for storing and sharing proteomics data internationally.
This document provides an overview and status update of ProteomeXchange in 2017. It discusses submission and download statistics showing growth in datasets submitted. There are now over 5,000 datasets in PRIDE from over 1,000 species. Download volumes have increased to over 200 TB in 2016. Citations of proteomics datasets are also increasing. A new prospective member, Firmiana, may join ProteomeXchange. The OmicsDI interface provides integrated access to datasets across multiple omics domains like proteomics, transcriptomics and metabolomics.
1) ProteomeXchange is a global database containing proteomics data from several repositories including PRIDE, MassIVE, and jPOST.
2) A new member, iProX, joined in 2017 and contains over 60 terabytes of data from China.
3) Usage of ProteomeXchange data is increasing, with PRIDE downloads growing from 50 terabytes in 2013 to over 295 terabytes in 2017.
PRIDE and ProteomeXchange – Making proteomics data accessible and reusableYasset Perez-Riverol
The document discusses ProteomeXchange (PX), a consortium that aims to make proteomics data accessible and reusable. PX includes repositories like PRIDE, PeptideAtlas, and MassIVE. It allows standard data submission between repositories through a common identifier space. The document outlines the PX submission workflow, describes components like the PX submission tool and PRIDE Inspector. It also provides statistics on data available through PX, with over 1,300 datasets contributed primarily from human, mouse and yeast studies. Future plans include better integration of proteomics resources to facilitate data reuse.
The document discusses a training webinar about PRIDE and ProteomeXchange. It begins with instructions for participating in the webinar and an overview of data resources at EMBL-EBI. It then covers PRIDE's mission to archive proteomics data, the ProteomeXchange consortium for standardized data submission, and tools for submitting data to PRIDE including PRIDE Converter, PRIDE Inspector, and the ProteomeXchange submission tool.
The ProteomeXchange consortium allows researchers to easily deposit and retrieve proteomics data. It includes repositories like PRIDE, PeptideAtlas, and recently MassIVE. The goal is to standardize submission and access across repositories through common identifiers and supported workflows. Over 1,300 datasets have been submitted, with many tools now supporting standard formats like mzIdentML for complete submissions. The most accessed datasets include large reference maps of the human proteome. Open source tools are improving submission and analysis of ProteomeXchange data.
A proteomics data “gold mine” at your disposal: Now that the data is there, w...Juan Antonio Vizcaino
The document discusses the reuse of public proteomics data. It describes how data from the PRoteomics IDEntifications (PRIDE) Archive can be reanalyzed to conduct proteogenomics studies, discover new post-translational modifications and variants, and enable meta-analysis studies of protein-protein interactions and associations. It also examines challenges around analyzing the "dark proteome" of consistently unidentified spectra in public datasets and developing open analysis pipelines for proteomics data in cloud environments.
An update of the activities of the ProteomeXchange Consortium of proteomics resources given at HUPO 2016 (Taipei). Some slides at the end of the presentation are from Nuno Bandeira.
The document provides an overview and status update of ProteomeXchange, including submission and citation statistics, new prospective members jPOST and iPROX, and the OmicsDI interface. It notes that ProteomeXchange currently includes over 3,800 datasets submitted primarily from the US, Germany, UK, and China, and that submissions and data reuse have grown substantially in recent years.
The document discusses the activities of the EMBL-EBI ELIXIR Node related to proteomics data and analysis. It describes how EMBL-EBI contributes to the ELIXIR platforms of data, tools, interoperability, compute, and training through its work on the PRIDE Archive and ProteomeXchange repository, development of proteomics data standards and software tools, implementation of reproducible proteomics pipelines, and proteomics training courses. The PRIDE Archive contains over 280 terabytes of mass spectrometry proteomics data from over 51 countries and has seen rapid growth in recent years.
Big Data Europe SC6 WS 3: Ron Dekker, Director CESSDA European Open Science A...BigData_Europe
Slides for keynote talk at the Big Data Europe workshop nr 3 on 11.9.2017 in Amsterdam co-located with SEMANTiCS2017 conference by Ron Dekker, Director CESSDA: European Open Science Agenda: where we are and where we are going?
TIB's action for research data managament as a national library's strategy in...Peter Löwe
The document discusses the TIB's strategy for research data management as a national library in the era of big data. It provides background on the TIB, including its size, budget, collections and networks. It then discusses key initiatives and projects related to research data management, including DataCite for assigning DOIs to datasets, the GOPORTIS library network, and the RADAR project which aims to create a research data repository. The goal is to improve access, discovery and preservation of research data by integrating datasets into the scholarly record through persistent identifiers and linking from publications.
A description of BRISSKit, an open source tool that may be used to combine datasets held in different locations and analyse them for the purpose of research. Talk give by Jonathan Tedds of Leicester Uni. for the Data Management in Practice workshop, which took place on Nov 14th 2013 at the London School of Hygiene and Tropical Medicine
Reusing and integrating public proteomics data to improve our knowledge of th...Juan Antonio Vizcaino
Dr. Juan Antonio Vizcaíno discusses reuse and integration of public proteomics data to improve knowledge of the human proteome. He describes how the PRIDE database stores mass spectrometry-based proteomics data and how ProteomeXchange provides a framework for data submission and dissemination between repositories. Reanalysis of public proteomics data is increasing and can be used for proteogenomics studies and meta-analyses to integrate proteomics and genomics data and better understand the human proteome.
This document provides an overview of proteomics data standards developed by the Proteomics Standards Initiative (PSI). It discusses the need for data standards, describes existing PSI standards like mzML for mass spectrometry data, mzIdentML for identification data, and mzTab for final results. The document also provides background on the development and adoption of these standards over time to support the evolving needs of the proteomics community.
Dr. Juan Antonio Vizcaíno presented on the reuse of public proteomics data. The submission of proteomics datasets to repositories like PRIDE has increased dramatically in recent years. Downloads and reuse of data from PRIDE has also grown significantly, reaching 295 terabytes in 2017. Common ways researchers reuse public proteomics data include verifying published results, building spectral libraries, finding interesting datasets to reanalyze for new discoveries, and benchmarking new algorithms. Data sharing allows information to be extracted and reused in new experiments, advancing protein knowledge in areas like UniProt and neXtProt databases.
PRIDE is a proteomics database that stores mass spectrometry-based proteomics data as part of the ProteomeXchange consortium. It contains identification and quantification data from peptide and protein expression analyses as well as post-translational modifications and mass spectra. Data is organized into datasets and assays and can be submitted to PRIDE via tools that export results into mzIdentML or mzTab format. Complete submissions contain identified spectra mapped to results, while partial submissions provide limited experimental details. PRIDE Inspector and the PX submission tool facilitate validation, visualization and submission of proteomics data to PRIDE.
1) There are several major proteomics repositories that serve different purposes, including repositories that store raw data without reprocessing it (PRIDE Archive, MassIVE, jPOST, iProx, PASSEL) and repositories that reprocess all raw data using standardized methods (PeptideAtlas, GPMDB, proteomicsDB, Human Proteome Map).
2) The document outlines the types of information commonly stored in proteomics repositories, including raw data, identification results, quantification, and metadata. It also discusses standards for file formats.
3) Data sharing in proteomics is becoming more important, driven by journals and funders, to enable reproducible science and maximize the value of research findings. Repositories support
Proteomics is the large-scale study of proteins. The document provides an overview of the history and concepts of proteomics, including definitions of key terms, descriptions of pioneering scientists and techniques, and the importance of bioinformatics in proteomics research. It discusses how proteomics has evolved from protein sequencing and gel electrophoresis to modern mass spectrometry-based techniques and quantitative analysis. The increasing role of proteomics in fields like structural biology and clinical applications is also noted.
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process...Juan Antonio Vizcaino
This document summarizes a webinar about developing open proteomics data analysis pipelines in the cloud. It discusses creating reusable workflows for common proteomics analysis tasks like identification, quantification, and quality control. These workflows would be deployed in cloud environments like the EMBL-EBI "Embassy Cloud" and connected to public proteomics databases like PRIDE. The goals are to make large-scale proteomics analysis more reproducible, scalable, and accessible to the community. An implementation study is underway to develop initial workflows for common analysis types, with plans to expand the available tools and optimize the pipelines for growing proteomics data volumes in the future.
This document provides an overview and status update of various proteomics data standards and related efforts from the PSI Proteome Informatics working group. It discusses the structure and timeline of developments for mzIdentML, mzQuantML, mzTab, and related proteogenomics formats. It also outlines plans for the meeting, including further developing mzTab for different applications and the new proVCF format for representing genetic variation at the protein level.
The document discusses the ELIXIR Proteomics Community and its plans. It describes how 11 ELIXIR nodes support the community to develop sustainable proteomics tools and resources and make them FAIR. It highlights existing resources like the PRIDE database and ProteomeXchange repository. Future plans include developing proteoform-centric approaches, integrating omics data, and improving analysis workflows and data management.
This document discusses the reuse of public proteomics data. It provides statistics on proteomics datasets submitted to PRIDE, including the top submitting countries, types of submissions, data volume, and most studied species. It then discusses several ways that public proteomics data is being reused, including to verify published results, build spectral libraries, find new splice isoforms or post-translational modifications, benchmark new tools, and contribute to protein evidence in databases like UniProt. Specific examples of data reuse are also provided, such as for spectral searching, meta-analysis, and repurposing data for proteogenomics studies or discovering novel PTMs.
This document discusses proteomics repositories and data sharing in proteomics. It describes the types of information stored in MS proteomics repositories, including raw data, identification results, quantification, and metadata. It outlines several main repositories, distinguishing between those that do not reprocess data, like PRIDE and MassIVE, and those that do reprocess data through a standardized pipeline, like PeptideAtlas and GPMDB. It also discusses resources focused on drafts of the human proteome, such as proteomicsDB and the Human Proteome Map. Overall, the document provides an overview of existing proteomics repositories and issues around data sharing in the field.
Proteomics is the large-scale study of proteins. It has become an important field due to developments in mass spectrometry and genomics. However, proteomics generates large amounts of complex data that requires bioinformatics analysis. The history of proteomics includes early pioneers in protein sequencing and mass spectrometry techniques. Current areas of focus include biomarker discovery, structural biology, and integrating proteomics with other omics data through systems biology approaches.
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum...Juan Antonio Vizcaino
The document discusses the spectra-cluster Toolsuite, which enhances proteomics analysis through spectrum clustering. It describes how the toolsuite was used to cluster the PRIDE database of mass spectrometry data, identifying consensus spectra and inferring identifications for originally unidentified spectra. It also discusses how the toolsuite can be used to cluster individual datasets to improve label-free quantification and characterize unknown samples. The toolsuite includes algorithms, APIs, and tools to enable clustering, development, and analysis capabilities.
Enabling automated processing and analysis of large-scale proteomics dataJuan Antonio Vizcaino
This document summarizes several presentations and events related to proteomics data analysis and ELIXIR activities. It describes a kickoff meeting in Tuebingen where 25 people from 11 ELIXIR nodes discussed future proteomics activities. It also outlines a new 1-year ELIXIR implementation project led by EMBL-EBI and ELIXIR-Germany to develop reusable proteomics analysis pipelines using the OpenMS framework and deploy them on the EMBL-EBI cloud for processing large proteomics datasets from the PRIDE repository, which saw over 243 terabytes of data downloaded in 2016.
The document discusses the Proteomics Standards Initiative (PSI), which develops data format standards for proteomics to facilitate data sharing and reproducibility. It notes that PSI has developed several standard file formats for mass spectrometry-based proteomics data, including mzML for MS data, mzIdentML for identification data, and mzTab for final results. It also maintains related controlled vocabularies and specifies minimum reporting guidelines. The document outlines PSI's process for developing and reviewing standards and lists its current objectives to improve adoption, extend standards to other omics fields, and facilitate reproducible analysis pipelines.
The document discusses the potential for reuse and repurposing of public proteomics data. It notes that datasets are being reused more through activities like contributing to protein knowledge bases, meta-analysis approaches, and spectral libraries. Specific resources that enable reuse are mentioned, such as SRMAtlas, PeptidePicker, and PRIDE Cluster. The document also discusses reprocessing repositories like PeptideAtlas and GPMDB that reanalyze raw data. Repurposing of data for areas like proteogenomics and discovering novel PTMs is highlighted. Overall, the document outlines the many ways that public proteomics data is being leveraged beyond its original purpose through reuse, reanalysis and integration with other omics data.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfSelcen Ozturkcan
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
Embracing Deep Variability For Reproducibility and Replicability
Abstract: Reproducibility (aka determinism in some cases) constitutes a fundamental aspect in various fields of computer science, such as floating-point computations in numerical analysis and simulation, concurrency models in parallelism, reproducible builds for third parties integration and packaging, and containerization for execution environments. These concepts, while pervasive across diverse concerns, often exhibit intricate inter-dependencies, making it challenging to achieve a comprehensive understanding. In this short and vision paper we delve into the application of software engineering techniques, specifically variability management, to systematically identify and explicit points of variability that may give rise to reproducibility issues (eg language, libraries, compiler, virtual machine, OS, environment variables, etc). The primary objectives are: i) gaining insights into the variability layers and their possible interactions, ii) capturing and documenting configurations for the sake of reproducibility, and iii) exploring diverse configurations to replicate, and hence validate and ensure the robustness of results. By adopting these methodologies, we aim to address the complexities associated with reproducibility and replicability in modern software systems and environments, facilitating a more comprehensive and nuanced perspective on these critical aspects.
https://hal.science/hal-04582287
PPT on Alternate Wetting and Drying presented at the three-day 'Training and Validation Workshop on Modules of Climate Smart Agriculture (CSA) Technologies in South Asia' workshop on April 22, 2024.
SDSS1335+0728: The awakening of a ∼ 106M⊙ black hole⋆Sérgio Sacani
Context. The early-type galaxy SDSS J133519.91+072807.4 (hereafter SDSS1335+0728), which had exhibited no prior optical variations during the preceding two decades, began showing significant nuclear variability in the Zwicky Transient Facility (ZTF) alert stream from December 2019 (as ZTF19acnskyy). This variability behaviour, coupled with the host-galaxy properties, suggests that SDSS1335+0728 hosts a ∼ 106M⊙ black hole (BH) that is currently in the process of ‘turning on’. Aims. We present a multi-wavelength photometric analysis and spectroscopic follow-up performed with the aim of better understanding the origin of the nuclear variations detected in SDSS1335+0728. Methods. We used archival photometry (from WISE, 2MASS, SDSS, GALEX, eROSITA) and spectroscopic data (from SDSS and LAMOST) to study the state of SDSS1335+0728 prior to December 2019, and new observations from Swift, SOAR/Goodman, VLT/X-shooter, and Keck/LRIS taken after its turn-on to characterise its current state. We analysed the variability of SDSS1335+0728 in the X-ray/UV/optical/mid-infrared range, modelled its spectral energy distribution prior to and after December 2019, and studied the evolution of its UV/optical spectra. Results. From our multi-wavelength photometric analysis, we find that: (a) since 2021, the UV flux (from Swift/UVOT observations) is four times brighter than the flux reported by GALEX in 2004; (b) since June 2022, the mid-infrared flux has risen more than two times, and the W1−W2 WISE colour has become redder; and (c) since February 2024, the source has begun showing X-ray emission. From our spectroscopic follow-up, we see that (i) the narrow emission line ratios are now consistent with a more energetic ionising continuum; (ii) broad emission lines are not detected; and (iii) the [OIII] line increased its flux ∼ 3.6 years after the first ZTF alert, which implies a relatively compact narrow-line-emitting region. Conclusions. We conclude that the variations observed in SDSS1335+0728 could be either explained by a ∼ 106M⊙ AGN that is just turning on or by an exotic tidal disruption event (TDE). If the former is true, SDSS1335+0728 is one of the strongest cases of an AGNobserved in the process of activating. If the latter were found to be the case, it would correspond to the longest and faintest TDE ever observed (or another class of still unknown nuclear transient). Future observations of SDSS1335+0728 are crucial to further understand its behaviour. Key words. galaxies: active– accretion, accretion discs– galaxies: individual: SDSS J133519.91+072807.4
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...Scintica Instrumentation
Targeting Hsp90 and its pathogen Orthologs with Tethered Inhibitors as a Diagnostic and Therapeutic Strategy for cancer and infectious diseases with Dr. Timothy Haystead.
Compositions of iron-meteorite parent bodies constrainthe structure of the pr...Sérgio Sacani
Magmatic iron-meteorite parent bodies are the earliest planetesimals in the Solar System,and they preserve information about conditions and planet-forming processes in thesolar nebula. In this study, we include comprehensive elemental compositions andfractional-crystallization modeling for iron meteorites from the cores of five differenti-ated asteroids from the inner Solar System. Together with previous results of metalliccores from the outer Solar System, we conclude that asteroidal cores from the outerSolar System have smaller sizes, elevated siderophile-element abundances, and simplercrystallization processes than those from the inner Solar System. These differences arerelated to the formation locations of the parent asteroids because the solar protoplane-tary disk varied in redox conditions, elemental distributions, and dynamics at differentheliocentric distances. Using highly siderophile-element data from iron meteorites, wereconstruct the distribution of calcium-aluminum-rich inclusions (CAIs) across theprotoplanetary disk within the first million years of Solar-System history. CAIs, the firstsolids to condense in the Solar System, formed close to the Sun. They were, however,concentrated within the outer disk and depleted within the inner disk. Future modelsof the structure and evolution of the protoplanetary disk should account for this dis-tribution pattern of CAIs.
Signatures of wave erosion in Titan’s coastsSérgio Sacani
The shorelines of Titan’s hydrocarbon seas trace flooded erosional landforms such as river valleys; however, it isunclear whether coastal erosion has subsequently altered these shorelines. Spacecraft observations and theo-retical models suggest that wind may cause waves to form on Titan’s seas, potentially driving coastal erosion,but the observational evidence of waves is indirect, and the processes affecting shoreline evolution on Titanremain unknown. No widely accepted framework exists for using shoreline morphology to quantitatively dis-cern coastal erosion mechanisms, even on Earth, where the dominant mechanisms are known. We combinelandscape evolution models with measurements of shoreline shape on Earth to characterize how differentcoastal erosion mechanisms affect shoreline morphology. Applying this framework to Titan, we find that theshorelines of Titan’s seas are most consistent with flooded landscapes that subsequently have been eroded bywaves, rather than a uniform erosional process or no coastal erosion, particularly if wave growth saturates atfetch lengths of tens of kilometers.
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...Creative-Biolabs
Neutralizing antibodies, pivotal in immune defense, specifically bind and inhibit viral pathogens, thereby playing a crucial role in protecting against and mitigating infectious diseases. In this slide, we will introduce what antibodies and neutralizing antibodies are, the production and regulation of neutralizing antibodies, their mechanisms of action, classification and applications, as well as the challenges they face.
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Proteomics public data resources: enabling "big data" analysis in proteomics
1. Proteomics public data resources:
enabling “big data” analysis in proteomics
Dr. Juan Antonio Vizcaíno
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with “Big data”: PRIDE Cluster
6. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with “Big data”: PRIDE Cluster
7. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Data resources at EMBL-EBI
Genes, genomes & variation
ArrayExpress
Expression Atlas PRIDE
InterPro Pfam UniProt
ChEMBL ChEBI
Molecular structures
Protein Data Bank in Europe
Electron Microscopy Data Bank
European Nucleotide Archive
European Variation Archive
European Genome-phenome Archive
Gene & protein expression
Protein sequences, families & motifs
Chemical biology
Reactions, interactions &
pathways
IntAct Reactome MetaboLights
Systems
BioModels Enzyme Portal BioSamples
Ensembl
Ensembl Genomes
GWAS Catalog
Metagenomics portal
Europe PubMed Central
Gene Ontology
Experimental Factor
Ontology
Literature & ontologies
8. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
What is a proteomics publication in 2016?
• Proteomics studies generate potentially large amounts of
data and results.
• Ideally, a proteomics publication needs to:
• Summarize the results of the study
• Provide supporting information for reliability of any
results reported
• Information in a publication:
• Manuscript
• Supplementary material
• Associated data submitted to a public repository
9. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
• PRIDE stores mass spectrometry (MS)-based
proteomics data:
• Peptide and protein expression data
(identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
PRIDE (PRoteomics IDEntifications) Archive
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
10. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
ProteomeXchange: A Global, distributed proteomics
database
PASSEL
(SRM data)
PRIDE
(MS/MS data)
MassIVE
(MS/MS data)
Raw
ID/Q
Meta
jPOST
(MS/MS data)
Mandatory raw data deposition
since July 2015
• Goal: Development of a framework to allow standard data submission and
dissemination pipelines between the main existing proteomics repositories.
http://www.proteomexchange.org
New in 2016
Vizcaíno et al., Nat Biotechnol, 2014
Deustch et al., NAR, 2017, in press
11. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
ProteomeCentral
Metadata /
Manuscript
Raw Data
Results
Journals
Peptide Atlas
Receiving repositories
PRIDE
Researcher’s results
Raw data
Metadata
PASSEL
Research
groups
Reanalysis of datasets
MassIVE
jPOST
MS/MS
data
(as complete
submissions)
Any other
workflow
(mainly partial
submissions)
DATASETS
SRM
data
Reprocessed results
MassIVE
ProteomeXchange data workflow
Vizcaíno et al., Nat Biotechnol, 2014
Deustch et al., NAR, 2017, in press
12. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
ProteomeCentral: Centralised portal for all PX
datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
13. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
ProteomeCentral
Metadata /
Manuscript
Raw Data
Results
Journals
Peptide Atlas
Receiving repositories
PRIDE
Researcher’s results
Raw data
Metadata
PASSEL
Research
groups
Reanalysis of datasets
MassIVE
jPOST
MS/MS
data
(as complete
submissions)
Any other
workflow
(mainly partial
submissions)
DATASETS
SRM
data
Reprocessed results
MassIVE
ProteomeXchange data workflow
Vizcaíno et al., Nat Biotechnol, 2014
Deustch et al., NAR, 2017, in press
14. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
ProteomeCentral
Metadata /
Manuscript
Raw Data
Results
Journals
UniProt/
neXtProtPeptide Atlas
Other DBs
Receiving repositories
PRIDE
GPMDBResearcher’s results
Raw data
Metadata
PASSEL
proteomicsDB
Research
groups
Reanalysis of datasets
MassIVE
jPOST
MS/MS
data
(as complete
submissions)
Any other
workflow
(mainly partial
submissions)
DATASETS
OmicsDI
Integration with other
omics datasets
SRM
data
Reprocessed results
MassIVE
ProteomeXchange data workflow
Vizcaíno et al., Nat Biotechnol, 2014
Deustch et al., NAR, 2017, in press
15. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE: Source of MS proteomics data
• PRIDE Archive already provides or
will soon provide MS proteomics
data to other EMBL-EBI resources
such as UniProt, Ensembl and the
EBI Expression Atlas.
http://www.ebi.ac.uk/pride/archive
16. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE Archive – over 4,500 datasets from
over 51 countries and 1,700 groups
• USA – 814 datasets
• Germany – 528
• UK – 338
• China – 328
• France – 222
• Netherlands – 175
• Canada - 137
Data volume:
• Total: ~275 TB
• Number of all files: ~560,000
• PXD000320-324: ~ 4 TB
• PXD002319-26 ~2.4 TB
• PXD001471 ~1.6 TB
• 1,973 datasets i.e. 52% of
all are publicly accessible
• ~90% of all
ProteomeXchange datasets
YearSubmissions
All submissions
Complete
PRIDE Archive growth
In the last 12 months: ~165 submitted datasets per month
Top Species studied by at least 100
datasets:
2,010 Homo sapiens
604 Mus musculus
191 Saccharomyces cerevisiae
140 Arabidopsis thaliana
127 Rattus norvegicus
>900 reported taxa in total
17. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with “Big data”: PRIDE Cluster
18. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE Components: Data Submission Process
PRIDE Converter 2
PRIDE Inspector PX Submission Tool
mzIdentML
PRIDE XML
In addition to PRIDE Archive, the PRIDE team develops
and maintains different tools and software libraries to
facilitate the handling and visualisation of MS proteomics
data and the submission process
19. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Current PSI Standard File Formats for MS
• mzMLMS data
• mzIdentMLIdentification
• mzQuantMLQuantitation
• mzTabFinal Results
• TraMLSRM
20. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE Inspector Toolsuite
Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., Bioinformatics,
2015
Perez-Riverol et al., MCP, 2016
• PRIDE Inspector - standalone tool to enable visualisation and validation of MS
data.
• Build on top of ms-data-core-api - open source algorithms and libraries for
computational proteomics.
• Supported file formats: mzIdentML, mzML, mzTab (PSI standards), and PRIDE
XML.
• Broad functionality.
https://github.com/PRIDE-Utilities/ms-data-core-api
https://github.com/PRIDE-Toolsuite/pride-inspector
Summary and QC charts Peptide spectra annotation and
visualization
21. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PX Submission Tool
Desktop application for data
submissions to ProteomeXchange via
PRIDE
• Implemented in Java 7
• Streamlines the submission process
• Capture mappings between files
• Retain metadata
• Fast file transfer with Aspera (FASP®
transfer technology) – FTP also
available
• Command line option
Submission tool screenshot
22. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with “Big data”: PRIDE Cluster
23. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Datasets are being reused more and more….
Vaudel et al., Proteomics, 2016
Data download volume for
PRIDE Archive in 2015: 198 TB
0
50
100
150
200
250
2013 2014 2015 2016
Downloads in TBs
25. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014 Kim et al., Nature, 2014
•Two independent groups claimed to have produced the
first complete draft of the human proteome by MS.
• Some of their findings are controversial and need further
validation… but generated a lot of discussion and put
proteomics in the spotlight.
•They used many different tissues.
Nature cover 29 May 2014
26. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Draft Human proteome papers published in 2014
Wilhelm et al., Nature, 2014
•Around 60% of the data used for the
analysis comes from previous
experiments, most of them stored in
proteomics repositories such as
PRIDE/ProteomeXchange, PASSEL or
MassIVE.
•They complement that data with “exotic”
tissues.
30. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Challenges for data reuse in proteomics
• Insufficient technical and biological metadata.
• Large computational infrastructure maybe needed (e.g. when
analysing many datasets together).
• Shortage of expertise (people).
• Lack of standardisation in the field.
31. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Summary of the talk so far
• PRIDE Archive and other ProteomeXchange resources make
possible data sharing in the MS proteomics field.
• Data sharing is becoming the norm in the field.
• Standalone tools: PRIDE Inspector and PX Submission tool.
• Datasets are increasingly reused (many opportunities):
• Example of one of the drafts of the human proteome.
• Proteogenomics approaches.
• But there are important challenges as well.
32. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Overview
• Intro: Concept of “Big data” in biology and proteomics
• PRIDE Archive and ProteomeXchange
• PRIDE tools
• Reuse of public proteomics data
• Working with Big data: PRIDE Cluster
34. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE Cluster: Initial Motivation
• Provide a QC-filtered peptide-centric view of PRIDE.
• Data is stored in PRIDE Archive as originally analysed by the
submitters (no data reprocessing is done).
• Heterogeneous quality, difficult to make the data comparable.
• Enable assessment of (published) proteomics data.
35. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE Cluster
• Provide an aggregated peptide centric view of PRIDE Archive.
• Hypothesis: same peptide will generate similar MS/MS spectra across
experiments.
• Enables QC of peptide-spectrum matches (PSMs). Infer reliable
identifications by comparing submitted identifications of spectra within a
cluster.
After clustering, a representative spectrum is built for all peptides
consistently identified across different datasets.
Griss et al., Nat. Methods, 2013
Griss et al., Nat. Methods,
2016
36. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE Cluster - Concept
NMMAACDPR
NMMAACDPR
PPECPDFDPPR
NMMAACDPR
NMMAACDPR NMMAACDPR
Consensus spectrum
PPECPDFDPPR
Threshold: At least 3 spectra in a
cluster and ratio >70%.
Originally submitted identified spectra
Spectrum
clustering
38. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE Cluster: Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified
spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two
calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
39. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE Cluster Iteration 2: Why?
• PRIDE Archive has experienced a huge increase in data
since 2013.
• We wanted to develop an algorithm that could also work
with unidentified spectra.
Year
Submissions
All submissions
Complete
PRIDE Archive growth
40. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Parallelizing Spectrum Clustering: Hadoop
• Optimizes work distribution among machines.
• Hadoop is a (open source) Framework for parallelism
using the Map-Reduce algorithm by Google.
• Solves many general issues of large parallel jobs:
• Scheduling
• inter-job communication
• failure
https://hadoop.apache.org/
41. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE Cluster: Second Implementation
• Griss et al., Nat. Methods, 2013
• Clustered all public, identified
spectra in PRIDE
• EBI compute farm, LSF
• 20.7 M identified spectra
• 610 CPU days, two
calendar weeks
• Validation, calibration
• Feedback into PRIDE datasets
• EBI farm, LSF
• Griss et al., Nat. Methods, 2016
• Clustered all public spectra in
PRIDE by April 2015
• Apache Hadoop.
• Starting with 256 M spectra.
• 190 M unidentified spectra (they
were filtered to 111 M for spectra
that are likely to represent a
peptide).
• 66 M identified spectra
• Result: 28 M clusters
• 5 calendar days on 30 node
Hadoop cluster, 340 CPU cores
42. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Examples: one perfect cluster
- 880 PSMs give the same peptide ID
- 4 species
- 28 datasets
- Same instruments
44. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE Cluster
Sequence-based
search engines
Spectrum clustering
Incorrectly or
unidentified spectra
45. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Output of the analysis
• 1. Inconsistent spectrum clusters
• 2. Clusters including identified and unidentified spectra.
• 3. Clusters just containing unidentified spectra.
46. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
1. Re-analysis of inconsistent clusters
NMMAACDPR
NMMAACDPR
IGGIGTVPVGR
NMMAACDPR
PPECPDFDPPR
VFDEFKPLVEEPQNLIK
NMMAACDPR
IGGIGTVPVGR
No sequence has a
proportion in the
cluster >50%
Consensus spectrum
PPECPDFDPPR
VFDEFKPLVEEP
QNLIK
Originally submitted identified spectra
Spectrum
clustering
47. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
1. Re-analysis of inconsistent clusters
• Re-analysed 3,997 large (>100 spectra), inconsistent clusters with
PepNovo, SpectraST, X!Tandem.
• 453 clusters (11%) were identified as peptides originated from
keratins, trypsin, albumin, and hemoglobin.
• In this case, it is likely that a contaminants DB was not used in the
search.
52. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
2. Inferring identifications for originally unidentified spectra
52
• 9.1 M unidentified spectra were contained in clusters with a reliable
identification.
• These are candidate new identifications (that need to be confirmed),
often missed due to search engine settings
• Example: 49,263 reliable clusters (containing 560,000 identified and
130,000 unidentified spectra) contained phosphorylated peptides,
many of them from non-enriched studies.
53. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
3. Consistently unidentified clusters
• 19 M clusters contain only unidentified spectra.
• 41,155 of these spectra have more than 100 spectra (= 12 M
spectra).
• Most of them are likely to be derived from peptides.
• They could correspond to PTMs or variant peptides.
• With various methods, we found likely identifications for about 20%.
• Vast amount of data mining remains to be done.
55. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
PRIDE Cluster as a Public Data Mining Resource
55
• http://www.ebi.ac.uk/pride/cluster
• Spectral libraries for 16 species.
• All clustering results, as well as specific subsets of interest available.
• Source code (open source) and Java API
56. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Public datasets from different omics: OmicsDI
http://www.ebi.ac.uk/Tools/omicsdi/
• Aims to integrate of ‘omics’ datasets (proteomics,
transcriptomics, metabolomics and genomics at present).
PRIDE
MassIVE
jPOST
PASSEL
GPMDB
ArrayExpress
Expression Atlas
MetaboLights
Metabolomics Workbench
GNPS
EGA
Perez-Riverol et al., 2016, BioRXxiv
60. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Summary part 2
• Using a “big data” approach we were able to get extra
knowledge from all the public data in PRIDE Archive.
• Spectrum clustering enables QC in proteomics resources
such as PRIDE Archive.
• It is possible to detect spectra that are consistently
unidentified across hundreds of datasets (maybe peptide
variants, or peptides with PTMs not initially considered).
• OmicsDI: new platform to identify public datasets coming
from different omics technologies (more possibilities for data
reuse!)
61. Juan A. Vizcaíno
juan@ebi.ac.uk
International de.NBI Symposium
Heidelberg, 9 November 2016
Aknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Johannes Griss
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Enrique Perez
Former team members, especially
Rui Wang, Florian Reisinger, Noemi
del Toro, Jose A. Dianes & Henning
Hermjakob
Acknowledgements: The PRIDE Team
All data submitters !!!
@pride_ebi
@proteomexchange