An update of the activities of the ProteomeXchange Consortium of proteomics resources given at HUPO 2016 (Taipei). Some slides at the end of the presentation are from Nuno Bandeira.
PRIDE resources and ProteomeXchange
- PRIDE is a proteomics data repository at EMBL-EBI that stores mass spectrometry-based proteomics data.
- It is part of the ProteomeXchange consortium, which provides a framework for standardized data submission and dissemination between proteomics repositories.
- This presentation discusses how to submit data to PRIDE/ProteomeXchange using PRIDE tools, including converting files to mzIdentML format and using the PX submission tool for metadata and file transfer.
This document discusses the reuse of public proteomics data. It provides statistics on proteomics datasets submitted to PRIDE, including the top submitting countries, types of submissions, data volume, and most studied species. It then discusses several ways that public proteomics data is being reused, including to verify published results, build spectral libraries, find new splice isoforms or post-translational modifications, benchmark new tools, and contribute to protein evidence in databases like UniProt. Specific examples of data reuse are also provided, such as for spectral searching, meta-analysis, and repurposing data for proteogenomics studies or discovering novel PTMs.
Mining the hidden proteome using hundreds of public proteomics datasets
Juan Antonio Vizcaino
The document discusses mining hidden proteomics data using public proteomics datasets. It describes how the PRIDE Cluster tool clusters over 250 million spectra from the PRIDE Archive, including over 190 million previously unidentified spectra. This clustering identified inconsistent clusters that could be reanalyzed, inferred identifications for 9.1 million originally unidentified spectra contained within reliable identification clusters, and consistently unidentified clusters that could be targeted for further analysis to identify unknown peptides. The clustering took 5 days on a 340-core system and generated 28 million clusters.
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Juan Antonio Vizcaino
The document discusses public proteomics data available through the PRIDE Archive at the European Bioinformatics Institute. It provides statistics on data submissions and downloads, which continue to increase significantly each year. The author advocates for reusing public proteomics data through approaches like proteogenomics studies, discovery of new post-translational modifications, and meta-analysis studies. Spectrum clustering is presented as a method to further analyze and draw insights from large proteomics datasets.
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
Juan Antonio Vizcaino
The document discusses PRIDE and ProteomeXchange, which are resources that support the deposition of proteomics data to public repositories. PRIDE stores mass spectrometry-based proteomics data, and is one of the repositories that is part of ProteomeXchange, a framework that allows standard submission of proteomics data between major repositories. The document outlines the cultural change in proteomics towards public data sharing, and provides information on submitting proteomics data to PRIDE and accessing data deposited in PRIDE and ProteomeXchange.
The document introduces several proteomics data standards developed by the Proteomics Standards Initiative (PSI), including mzML, mzIdentML, mzQuantML, TraML, and mzTab. It provides an overview of each standard, describing what type of data it encodes (e.g. mass spectrometry data, identification data, quantification data), its timeline of development and versions, and its increasing adoption by proteomics software and databases. The document emphasizes that data standards are necessary for data sharing and integration in proteomics given the large number of experimental workflows and data types.
An overview of the PRIDE ecosystem of resources and computational tools for m...
Juan Antonio Vizcaino
The document provides an overview of the PRIDE ecosystem of resources and computational tools for mass spectrometry proteomics data. It describes PRIDE Archive and ProteomeXchange as repositories for proteomics data, as well as tools like PRIDE Inspector for visualizing and validating data. It also discusses how public proteomics data is increasingly being reused, and added-value resources like PRIDE Cluster and PRIDE Proteomes that provide aggregated views of proteomics data.
The document discusses updates to the PRIDE Cluster project. PRIDE Cluster analyzes mass spectrometry proteomics data stored in the PRIDE database by clustering peptide spectra. The latest implementation clustered over 256 million spectra using Apache Hadoop. This resulted in 28 million clusters, including clusters with inconsistent identifications, clusters linking identified and unidentified spectra, and large clusters of consistently unidentified spectra that could help identify new peptides and post-translational modifications. The PRIDE Cluster provides a public resource for data mining the large collection of proteomics datasets in PRIDE.
Proteomics and the "big data" trend: challenges and new possibilities (Talk ...
Juan Antonio Vizcaino
The document discusses the challenges and opportunities of big data in proteomics. It describes how proteomics data volumes are growing rapidly due to technological advances, creating both computational challenges for data analysis and opportunities to reuse large amounts of public data. The PRIDE Archive at EBI stores over 4,000 proteomics datasets and provides tools like PRIDE Inspector to help analyze and validate large datasets. However, challenges remain around data standardization, metadata completeness, and the need for greater computational infrastructure and expertise to fully leverage the large amounts of shared proteomics data.
This document discusses the ProteomeXchange Consortium and recent updates. It provides statistics on data submissions and downloads. Over 7,475 datasets have been submitted from over 50 countries, with the majority from the US, Germany, and China. PRIDE and MassIVE are the largest repositories. The document also describes a new prospective member, iProX, which will be the main proteomics data sharing platform in China. Guidelines are being developed to handle reprocessed datasets submitted to repositories.
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
Juan Antonio Vizcaino
The document discusses the reuse of public proteomics data. It describes how data from the PRoteomics IDEntifications (PRIDE) Archive can be reanalyzed to conduct proteogenomics studies, discover new post-translational modifications and variants, and enable meta-analysis studies of protein-protein interactions and associations. It also examines challenges around analyzing the "dark proteome" of consistently unidentified spectra in public datasets and developing open analysis pipelines for proteomics data in cloud environments.
Small molecule identification and the new MassBank
Steffen Neumann
Since its beginning more than 10 years ago, the MassBank system has provided a user-friendly web interface. We have now improved data access, version control and issue tracking by moving the data to GitHub, allowing for a whole new workflow and access route for bio- and cheminformatics users.
This document discusses mass spectrometry informatics formats developed by the Proteomics Standards Initiative. It describes standard formats such as mzIdentML, mzQuantML, and mzTab that have been created for proteomics data as well as ongoing work to extend mzTab to support metabolomics and glycomics data. It also provides information on the current status and adoption of these standards by the proteomics community.
This document describes the Bio2RDF project, which aims to integrate biological data from multiple sources using Semantic Web technologies. It proposes applying linked data principles and semantic graph ranking methods to provide an integrated search interface for querying post-genomic knowledge about human and mouse. The results section describes the initial Bio2RDF knowledge map integrating data from 30 sources, with statistics on its coverage. A demo query about Paget disease is also presented to illustrate searching the data using SPARQL.
The document discusses proteomics repositories and their role in sharing mass spectrometry (MS) proteomics data. It describes the main types of information stored in MS proteomics repositories, including raw experimental data, identification and quantification results, metadata, and other associated information. The document outlines some of the main existing repositories, including PRIDE Archive, PeptideAtlas, and Global Proteome Machine, and whether they reprocess data through a standardized pipeline or store data as published. Reprocessing repositories provide an updated view of data through consistent analysis, while no-reprocessing repositories preserve the original analysis. Data sharing is important for independent review, meta-analysis, and advancing the field.
The document discusses data standards for proteomics, including those developed by the Proteomics Standards Initiative (PSI). It describes several existing PSI standards for mass spectrometry data, including mzML, mzIdentML, mzQuantML, and TraML. It provides an example of the successful mzML standard and discusses how mzIdentML has been widely adopted for representing peptide and protein identifications.
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...
Lisette Giepmans
BioSHaRE conference July 28th, 2015, Milan - Latest tools and services for data sharing
Stream 1: Tools for data sharing analysis and enhancement
Opal is a software application to manage study data, and includes a feature enabling data harmonisation and data integration across studies. As such, Opal supports the development and implementation of processing algorithms required to transform study-specific data into a common harmonised format. Moreover, when connected to a Mica web interface, Opal allows users to seamlessly and securely search distributed datasets across several Opal instances.
Opal is freely available for download at www.obiba.org and is provided under the GPL3 open source licence. All studies or networks of studies using the Opal software for data storage, data management or data harmonisation must mention Opal in manuscripts, presentations, or other works made public and include a web link to the Maelstrom Research website (www.maelstrom-research.org).
Mica is a software application developed to create web portals for individual epidemiological studies or for study consortia. Features supported by Mica include a standardised study catalogue, study-specific and harmonised variable data dictionary browsers, online data access request forms, and communication tools (e.g. forums, events, news).
When used in conjunction with the Opal software, Mica also allows authenticated users (i.e. with username and password) to perform distributed queries on the content of study databases hosted on remote servers, and retrieve summary statistics of that content.
Mica is a Java-based, cross-platform, client-server application and comes along with the following two clients: the administrators' user interface and a content management system (Drupal) used to render the catalogue content on the study or consortium.
Mica is freely available for download at www.obiba.org and is provided under the GPL3 open source license.
The eNanoMapper database for nanomaterial safety information: storage and query
Nina Jeliazkova
A number of challenges exist in engineered nanomaterials (ENM) data representation and integration, mainly due to data complexity and provenance. We have recently described the eNanoMapper database [doi:10.1109/BIBM.2014.699936] as part of the computational infrastructure for toxicological data management of ENM, developed within the EU FP7 eNanoMapper project. The ontology-supported data model is based on an exhaustive review of existing nano-related data models, databases, and nanomaterial-related entries in chemical and toxicogenomic databases. We demonstrate how this approach provides a common ground for integration of data represented in diverse formats (ISA-TAB, OECD HT, custom RDF and a set of spreadsheet templates used by the EU NanoSafety Cluster projects) and enables a uniform approach towards import, storage and searching of ENM physicochemical measurements and biological assay results. A configurable parser enables import of the data stored in spreadsheet templates, accommodating different organization of the data. The configuration metadata is defined in a separate file, mapping the spreadsheet into the internal data model. The demonstration data provided by eNanoMapper partners ((i) NanoWiki, (ii) a literature dataset on protein coronas and (iii) the ModNanoTox project dataset consisting of 86 assays and 100 different endpoints) illustrates the capability of the associated REST API to support a variety of tests and endpoints recommended by the OECD Working Party of Manufactured Nanomaterials. The API is tightly integrated with a chemical structure search, allowing the function as a core, coating or functionalisation to be highlighted. The REST API enables graphical summaries of the data and integration in applications such as NanoQSAR modelling via programmatic interaction.
ASMS Fall 2018 Metabolomics Informatics Workshop Peak Picking
Emma Schymanski
Principles of Peak Picking and Alignment in Pictures and further "doing". ASMS Fall Metabolomics Informatics Workshop 2018.
https://www.asms.org/conferences/fall-workshop/program
The ProteomeXchange Consortium aims to allow standard data submission and dissemination between major proteomics repositories, including PeptideAtlas, PRIDE, and MassIVE. It establishes a common identifier space (PXD IDs) and supports workflows for MS/MS and SRM data submitted from any experimental approach. Since 2012, over 3,800 datasets have been submitted from over 700 species, with over 1,900 publicly accessible. Submissions have grown significantly each year, and data downloads for reuse are also increasing. The goal is to make data sharing easier for researchers.
The document discusses PRIDE, a proteomics data repository at EMBL-EBI. It describes how PRIDE stores mass spectrometry proteomics data, its role within the ProteomeXchange consortium, and how researchers can submit data to PRIDE including the use of mzIdentML and PRIDE tools.
This document introduces several proteomics data standards developed by the Proteomics Standards Initiative (PSI), including mzML for mass spectrometry data, mzIdentML for peptide and protein identifications, mzQuantML for quantification data, and mzTab for final identification and quantification results. It describes how these standards address the need for data standardization in proteomics as the field has evolved. It also discusses how these standards have been implemented in proteomics databases, software tools, and data repositories like ProteomeXchange to facilitate data sharing and analysis.
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’...
Juan Antonio Vizcaino
Dr. Juan Antonio Vizcaíno presented on developing open data analysis pipelines in the cloud to enable large-scale analysis of proteomics data. He introduced PRIDE and ProteomeXchange as repositories for proteomics data that are seeing substantial growth. Moving analysis pipelines to the cloud will facilitate public reuse of large datasets, improve scalability, and ensure reproducibility. Initial pipelines have been created for identification, quantification, and quality control of mass spectrometry data and deployed on the EMBL-EBI cloud platform. Future work includes optimizing access to PRIDE data and developing pipelines for analysis of DIA and proteogenomics data.
This document summarizes a presentation about proteomics repositories. It discusses why sharing proteomics data is important, the types of information stored in repositories, and some of the main existing repositories and their characteristics. Some repositories, like PRIDE and MassIVE, store data as originally analyzed without reprocessing. Others, like PeptideAtlas and GPMDB, reprocess raw data using a standardized pipeline to provide an updated view. The document also discusses resources developed from draft human proteome papers, including ProteomicsDB and the Human Proteome Map.
Reusing and integrating public proteomics data to improve our knowledge of th...
Juan Antonio Vizcaino
Dr. Juan Antonio Vizcaíno discusses reuse and integration of public proteomics data to improve knowledge of the human proteome. He describes how the PRIDE database stores mass spectrometry-based proteomics data and how ProteomeXchange provides a framework for data submission and dissemination between repositories. Reanalysis of public proteomics data is increasing and can be used for proteogenomics studies and meta-analyses to integrate proteomics and genomics data and better understand the human proteome.
The document discusses updates to the PRIDE Cluster project. PRIDE Cluster analyzes mass spectrometry proteomics data stored in the PRIDE database by clustering peptide spectra. The latest implementation clustered over 256 million spectra using Apache Hadoop. This resulted in 28 million clusters, including clusters with inconsistent identifications, clusters linking identified and unidentified spectra, and large clusters of consistently unidentified spectra that could help identify new peptides and post-translational modifications. The PRIDE Cluster provides a public resource for data mining the large collection of proteomics datasets in PRIDE.
Proteomics and the "big data" trend: challenges and new possibilitites (Talk ...Juan Antonio Vizcaino
The document discusses the challenges and opportunities of big data in proteomics. It describes how proteomics data volumes are growing rapidly due to technological advances, creating both computational challenges for data analysis and opportunities to reuse large amounts of public data. The PRIDE Archive at EBI stores over 4,000 proteomics datasets and provides tools like PRIDE Inspector to help analyze and validate large datasets. However, challenges remain around data standardization, metadata completeness, and the need for greater computational infrastructure and expertise to fully leverage the large amounts of shared proteomics data.
This document discusses the ProteomeXchange Consortium and recent updates. It provides statistics on data submissions and downloads. Over 7,475 datasets have been submitted from over 50 countries, with the majority from the US, Germany, and China. PRIDE and MassIVE are the largest repositories. A new prospective member, iProX, is described which will be the main proteomics data sharing platform in China. Guidelines are being developed to handle reprocessed datasets submitted to repositories.
A proteomics data “gold mine” at your disposal: Now that the data is there, w...Juan Antonio Vizcaino
The document discusses the reuse of public proteomics data. It describes how data from the PRoteomics IDEntifications (PRIDE) Archive can be reanalyzed to conduct proteogenomics studies, discover new post-translational modifications and variants, and enable meta-analysis studies of protein-protein interactions and associations. It also examines challenges around analyzing the "dark proteome" of consistently unidentified spectra in public datasets and developing open analysis pipelines for proteomics data in cloud environments.
Small molecule identification and the new MassBankSteffen Neumann
Since the beginning more than 10 years ago, the MassBank system
provided a user-friendly web interface. We now have improved
data access, version control and issue tracking by moving
the data to github, allowing for a whole new workflow
and access route for Bio- and Cheminformatics users.
This document discusses mass spectrometry informatics formats developed by the Proteomics Standards Initiative. It describes standard formats such as mzIdentML, mzQuantML, and mzTab that have been created for proteomics data as well as ongoing work to extend mzTab to support metabolomics and glycomics data. It also provides information on the current status and adoption of these standards by the proteomics community.
This document describes the Bio2RDF project, which aims to integrate biological data from multiple sources using Semantic Web technologies. It proposes applying linked data principles and semantic graph ranking methods to provide an integrated search interface for querying post-genomic knowledge about human and mouse. The results section describes the initial Bio2RDF knowledge map integrating data from 30 sources, with statistics on its coverage. A demo query about Paget disease is also presented to illustrate searching the data using SPARQL.
The document discusses proteomics repositories and their role in sharing mass spectrometry (MS) proteomics data. It describes the main types of information stored in MS proteomics repositories, including raw experimental data, identification and quantification results, metadata, and other associated information. The document outlines some of the main existing repositories, including PRIDE Archive, PeptideAtlas, and Global Proteome Machine, and whether they reprocess data through a standardized pipeline or store data as published. Reprocessing repositories provide an updated view of data through consistent analysis, while no-reprocessing repositories preserve the original analysis. Data sharing is important for independent review, meta-analysis, and advancing the field.
The document discusses data standards for proteomics, including those developed by the Proteomics Standards Initiative (PSI). It describes several existing PSI standards for mass spectrometry data, including mzML, mzIdentML, mzQuantML, and TraML. It provides an example of the successful mzML standard and discusses how mzIdentML has been widely adopted for representing peptide and protein identifications.
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...Lisette Giepmans
BioSHaRE conference July 28th, 2015, Milan - Latest tools and services for data sharing
Stream 1: Tools for data sharing analysis and enhancement
Opal is a software application to manage study data, and includes a feature enabling data harmonisation and data integration across studies. As such, Opal supports the development and implementation of processing algorithms required to transform study-specific data into a common harmonised format. Moreover, when connected to a Mica web interface, Opal allows users to seamlessly and securely search distributed datasets across several Opal instances.
Opal is freely available for download at www.obiba.org and is provided under the GPL3 open source licence. All studies or networks of studies using the Opal software for data storage, data management or data harmonisation must mention Opal in manuscripts, presentations, or other works made public and include a web link to the Maelstrom Research website (www.maelstrom-research.org).
Mica is a software application developed to create web portals for individual epidemiological studies or for study consortia. Features supported by Mica include a standardised study catalogue, study-specific and harmonised variable data dictionary browsers, online data access request forms, and communication tools (e.g. forums, events, news).
When used in conjunction with the Opal software, Mica also allows authenticated users (i.e. with username and password) to perform distributed queries on the content of study databases hosted on remote servers, and retrieve summary statistics of that content.
Mica is a Java-based, cross-platform, client-server application and comes along with the following two clients: the administrators' user interface and a content management system (Drupal) used to render the catalogue content on the study or consortium.
Mica is freely available for download at www.obiba.org and is provided under the GPL3 open source license.
The eNanoMapper database for nanomaterial safety information: storage and queryNina Jeliazkova
A number of challenges exist in engineered nanomaterials (ENM) data representation and integration mainly due to data complexity and provenance. We have recently described the eNanoMapper database [doi:10.1109/BIBM.2014.699936] as part of the computational infrastructure for toxicological data management of ENM, developed within the EU FP7 eNanoMapper project. The ontology-supported data model is based on an exhaustive review of existing nano-related data models, databases, and nanomaterial related entries in chemical and toxicogenomic databases. We demonstrate how this approach provides a common ground for integration of data represented in diverse formats (ISA-TAB, OECD HT, custom RDF and set of spreadsheet templates used by the EU NanoSafety Cluster projects) and enables uniform approach towards import, storage and searching of ENM physicochemical measurements and biological assay results. A configurable parser enables import of the data stored in spreadsheet templates, accommodating different organization of the data. The configuration metadata is defined in a separate file, mapping the spreadsheet into the internal data model. The demonstration data provided by eNanoMapper partners ((i) NanoWiki, (ii) a literature dataset on protein coronas and (iii) the ModNanoTox project dataset consisting of 86 assays and 100 different endpoints) illustrates the capability of the associated REST API to support a variety of tests and endpoints, recommended by the OECD Working Party of Manufactured Nanomaterials. The API is tightly integrated with a chemical structure search, allowing highlighting the function as a core, coating or functionalisation. The REST API enables graphical summaries of the data and integration in applications such as NanoQSAR modelling via programmatic interaction.
ASMS Fall 2018 Metabolomics Informatics Workshop Peak PickingEmma Schymanski
Principles of Peak Picking and Alignment in Pictures and further "doing". ASMS Fall Metabolomics Informatics Workshop 2018.
https://www.asms.org/conferences/fall-workshop/program
The ProteomeXchange Consortium aims to allow standard data submission and dissemination between major proteomics repositories, including PeptideAtlas, PRIDE, and MassIVE. It establishes a common identifier space (PXD IDs) and supports workflows for MS/MS and SRM data submitted from any experimental approach. Since 2012, over 3,800 datasets have been submitted from over 700 species, with over 1,900 publicly accessible. Submissions have grown significantly each year, and data downloads for reuse are also increasing. The goal is to make data sharing easier for researchers.
The document discusses PRIDE, a proteomics data repository at EMBL-EBI. It describes how PRIDE stores mass spectrometry proteomics data, its role within the ProteomeXchange consortium, and how researchers can submit data to PRIDE including the use of mzIdentML and PRIDE tools.
This document introduces several proteomics data standards developed by the Proteomics Standards Initiative (PSI), including mzML for mass spectrometry data, mzIdentML for peptide and protein identifications, mzQuantML for quantification data, and mzTab for final identification and quantification results. It describes how these standards address the need for data standardization in proteomics as the field has evolved. It also discusses how these standards have been implemented in proteomics databases, software tools, and data repositories like ProteomeXchange to facilitate data sharing and analysis.
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’... (Juan Antonio Vizcaino)
Dr. Juan Antonio Vizcaíno presented on developing open data analysis pipelines in the cloud to enable large-scale analysis of proteomics data. He introduced PRIDE and ProteomeXchange as repositories for proteomics data that are seeing substantial growth. Moving analysis pipelines to the cloud will facilitate public reuse of large datasets, improve scalability, and ensure reproducibility. Initial pipelines have been created for identification, quantification, and quality control of mass spectrometry data and deployed on the EMBL-EBI cloud platform. Future work includes optimizing access to PRIDE data and developing pipelines for analysis of DIA and proteogenomics data.
This document summarizes a presentation about proteomics repositories. It discusses why sharing proteomics data is important, the types of information stored in repositories, and some of the main existing repositories and their characteristics. Some repositories, like PRIDE and MassIVE, store data as originally analyzed without reprocessing. Others, like PeptideAtlas and GPMDB, reprocess raw data using a standardized pipeline to provide an updated view. The document also discusses resources developed from draft human proteome papers, including proteomicsDB and the Human Proteome Map.
Reusing and integrating public proteomics data to improve our knowledge of th... (Juan Antonio Vizcaino)
Dr. Juan Antonio Vizcaíno discusses reuse and integration of public proteomics data to improve knowledge of the human proteome. He describes how the PRIDE database stores mass spectrometry-based proteomics data and how ProteomeXchange provides a framework for data submission and dissemination between repositories. Reanalysis of public proteomics data is increasing and can be used for proteogenomics studies and meta-analyses to integrate proteomics and genomics data and better understand the human proteome.
The document provides an overview and status update of ProteomeXchange, including submission and citation statistics, new prospective members jPOST and iPROX, and the OmicsDI interface. It notes that ProteomeXchange currently includes over 3,800 datasets submitted primarily from the US, Germany, UK, and China, and that submissions and data reuse have grown substantially in recent years.
This document provides an overview and status update of ProteomeXchange in 2017. It discusses submission and download statistics showing growth in datasets submitted. There are now over 5,000 datasets in PRIDE from over 1,000 species. Download volumes have increased to over 200 TB in 2016. Citations of proteomics datasets are also increasing. A new prospective member, Firmiana, may join ProteomeXchange. The OmicsDI interface provides integrated access to datasets across multiple omics domains like proteomics, transcriptomics and metabolomics.
This document discusses proteomics repositories and data sharing in proteomics. It describes the types of information stored in MS proteomics repositories, including raw data, identification results, quantification, and metadata. It outlines several main repositories, distinguishing between those that do not reprocess data, like PRIDE and MassIVE, and those that do reprocess data through a standardized pipeline, like PeptideAtlas and GPMDB. It also discusses resources focused on drafts of the human proteome, such as proteomicsDB and the Human Proteome Map. Overall, the document provides an overview of existing proteomics repositories and issues around data sharing in the field.
1) ProteomeXchange is a global database containing proteomics data from several repositories including PRIDE, MassIVE, and jPOST.
2) A new member, iProX, joined in 2017 and contains over 60 terabytes of data from China.
3) Usage of ProteomeXchange data is increasing, with PRIDE downloads growing from 50 terabytes in 2013 to over 295 terabytes in 2017.
Enabling automated processing and analysis of large-scale proteomics data (Juan Antonio Vizcaino)
This document summarizes several presentations and events related to proteomics data analysis and ELIXIR activities. It describes a kickoff meeting in Tuebingen where 25 people from 11 ELIXIR nodes discussed future proteomics activities. It also outlines a new 1-year ELIXIR implementation project led by EMBL-EBI and ELIXIR-Germany to develop reusable proteomics analysis pipelines using the OpenMS framework and deploy them on the EMBL-EBI cloud for processing large proteomics datasets from the PRIDE repository, which saw over 243 terabytes of data downloaded in 2016.
The document discusses a training webinar about PRIDE and ProteomeXchange. It begins with instructions for participating in the webinar and an overview of data resources at EMBL-EBI. It then covers PRIDE's mission to archive proteomics data, the ProteomeXchange consortium for standardized data submission, and tools for submitting data to PRIDE including PRIDE Converter, PRIDE Inspector, and the ProteomeXchange submission tool.
The document discusses the potential for reuse and repurposing of public proteomics data. It notes that datasets are being reused more through activities like contributing to protein knowledge bases, meta-analysis approaches, and spectral libraries. Specific resources that enable reuse are mentioned, such as SRMAtlas, PeptidePicker, and PRIDE Cluster. The document also discusses reprocessing repositories like PeptideAtlas and GPMDB that reanalyze raw data. Repurposing of data for areas like proteogenomics and discovering novel PTMs is highlighted. Overall, the document outlines the many ways that public proteomics data is being leveraged beyond its original purpose through reuse, reanalysis and integration with other omics data.
This document provides an overview of proteomics data standards developed by the Proteomics Standards Initiative (PSI). It discusses the need for data standards, describes existing PSI standards like mzML for mass spectrometry data, mzIdentML for identification data, and mzTab for final results. The document also provides background on the development and adoption of these standards over time to support the evolving needs of the proteomics community.
Dr. Juan Antonio Vizcaíno presented on the reuse of public proteomics data. The submission of proteomics datasets to repositories like PRIDE has increased dramatically in recent years. Downloads and reuse of data from PRIDE has also grown significantly, reaching 295 terabytes in 2017. Common ways researchers reuse public proteomics data include verifying published results, building spectral libraries, finding interesting datasets to reanalyze for new discoveries, and benchmarking new algorithms. Data sharing allows information to be extracted and reused in new experiments, advancing protein knowledge in areas like UniProt and neXtProt databases.
PRIDE is a proteomics database that stores mass spectrometry-based proteomics data as part of the ProteomeXchange consortium. It contains identification and quantification data from peptide and protein expression analyses as well as post-translational modifications and mass spectra. Data is organized into datasets and assays and can be submitted to PRIDE via tools that export results into mzIdentML or mzTab format. Complete submissions contain identified spectra mapped to results, while partial submissions provide limited experimental details. PRIDE Inspector and the PX submission tool facilitate validation, visualization and submission of proteomics data to PRIDE.
1) There are several major proteomics repositories that serve different purposes, including repositories that store raw data without reprocessing it (PRIDE Archive, MassIVE, jPOST, iProx, PASSEL) and repositories that reprocess all raw data using standardized methods (PeptideAtlas, GPMDB, proteomicsDB, Human Proteome Map).
2) The document outlines the types of information commonly stored in proteomics repositories, including raw data, identification results, quantification, and metadata. It also discusses standards for file formats.
3) Data sharing in proteomics is becoming more important, driven by journals and funders, to enable reproducible science and maximize the value of research findings. Repositories support
Proteomics is the large-scale study of proteins. The document provides an overview of the history and concepts of proteomics, including definitions of key terms, descriptions of pioneering scientists and techniques, and the importance of bioinformatics in proteomics research. It discusses how proteomics has evolved from protein sequencing and gel electrophoresis to modern mass spectrometry-based techniques and quantitative analysis. The increasing role of proteomics in fields like structural biology and clinical applications is also noted.
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process... (Juan Antonio Vizcaino)
This document summarizes a webinar about developing open proteomics data analysis pipelines in the cloud. It discusses creating reusable workflows for common proteomics analysis tasks like identification, quantification, and quality control. These workflows would be deployed in cloud environments like the EMBL-EBI "Embassy Cloud" and connected to public proteomics databases like PRIDE. The goals are to make large-scale proteomics analysis more reproducible, scalable, and accessible to the community. An implementation study is underway to develop initial workflows for common analysis types, with plans to expand the available tools and optimize the pipelines for growing proteomics data volumes in the future.
This document provides an overview and status update of various proteomics data standards and related efforts from the PSI Proteome Informatics working group. It discusses the structure and timeline of developments for mzIdentML, mzQuantML, mzTab, and related proteogenomics formats. It also outlines plans for the meeting, including further developing mzTab for different applications and the new proVCF format for representing genetic variation at the protein level.
The document discusses the ELIXIR Proteomics Community and its plans. It describes how 11 ELIXIR nodes support the community to develop sustainable proteomics tools and resources and make them FAIR. It highlights existing resources like the PRIDE database and ProteomeXchange repository. Future plans include developing proteoform-centric approaches, integrating omics data, and improving analysis workflows and data management.
This document summarizes Juan A. Vizcaíno's presentation on the ELIXIR Proteomics Community. It discusses the establishment of the community through an implementation study and strategy meeting. The community aims to develop standardized proteomics data analysis pipelines and deploy them in a cloud environment. It will also work to improve proteomics data standards and integrate proteomics with other omics data through activities like the Proteomics Standards Initiative. The ProteomeXchange database is a major resource overseen by the community for storing and sharing proteomics data internationally.
Proteomics is the large-scale study of proteins. It has become an important field due to developments in mass spectrometry and genomics. However, proteomics generates large amounts of complex data that requires bioinformatics analysis. The history of proteomics includes early pioneers in protein sequencing and mass spectrometry techniques. Current areas of focus include biomarker discovery, structural biology, and integrating proteomics with other omics data through systems biology approaches.
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum... (Juan Antonio Vizcaino)
The document discusses the spectra-cluster Toolsuite, which enhances proteomics analysis through spectrum clustering. It describes how the toolsuite was used to cluster the PRIDE database of mass spectrometry data, identifying consensus spectra and inferring identifications for originally unidentified spectra. It also discusses how the toolsuite can be used to cluster individual datasets to improve label-free quantification and characterize unknown samples. The toolsuite includes algorithms, APIs, and tools to enable clustering, development, and analysis capabilities.
The document discusses the activities of the EMBL-EBI ELIXIR Node related to proteomics data and analysis. It describes how EMBL-EBI contributes to the ELIXIR platforms of data, tools, interoperability, compute, and training through its work on the PRIDE Archive and ProteomeXchange repository, development of proteomics data standards and software tools, implementation of reproducible proteomics pipelines, and proteomics training courses. The PRIDE Archive contains over 280 terabytes of mass spectrometry proteomics data from over 51 countries and has seen rapid growth in recent years.
The document discusses the Proteomics Standards Initiative (PSI), which develops data format standards for proteomics to facilitate data sharing and reproducibility. It notes that PSI has developed several standard file formats for mass spectrometry-based proteomics data, including mzML for MS data, mzIdentML for identification data, and mzTab for final results. It also maintains related controlled vocabularies and specifies minimum reporting guidelines. The document outlines PSI's process for developing and reviewing standards and lists its current objectives to improve adoption, extend standards to other omics fields, and facilitate reproducible analysis pipelines.
1. The ProteomeXchange Consortium: 2016 update
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-European Bioinformatics Institute
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
HUPO 2016 World Conference
Taipei, 20 September 2016
PSI Spring Meeting 2017
Beijing Proteome Research Center, China
April 24-26, 2017
April 23: 2nd PHOENIX Mini-Symposium
on Frontiers of Proteomics
April 27: Hiking the Great Wall
Focus topics:
• Quality control: qcML
• Proteogenomics formats
• proXI: proteomics eXpression Interface
• Privacy and Proteomics Data
3. Overview
• General introduction to ProteomeXchange
• Overall submission statistics
• Updated HPP guidelines
• Specifics about MassIVE (Nuno)
4. ProteomeXchange: A global, distributed proteomics database
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Repositories: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data)
• Data types: raw data (Raw), identification and quantification results (ID/Q), metadata (Meta)
• Mandatory raw data deposition since July 2015
http://www.proteomexchange.org
5. ProteomeXchange: A global, distributed proteomics database
• Repositories: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data), jPOST (MS/MS data; new in 2016)
• Data types: raw data (Raw), identification and quantification results (ID/Q), metadata (Meta)
• Mandatory raw data deposition since July 2015
http://www.proteomexchange.org
6. ProteomeXchange data workflow
[Diagram: Research groups submit datasets (raw data, metadata and the researcher's results) to the receiving repositories. MS/MS data (as complete submissions) and any other workflow (mainly partial submissions) go to PRIDE, MassIVE or jPOST; SRM data go to PASSEL. Metadata / manuscript links connect the workflow to ProteomeCentral and the journals. PeptideAtlas and MassIVE reanalyse datasets and provide reprocessed results.]
7. ProteomeCentral: Centralised portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
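As a minimal sketch of how a dataset page on the portal could be addressed programmatically, the helper below builds a link from a PXD identifier. Two things here are assumptions rather than statements from the slides: that accessions have the form `PXD` followed by six digits, and that the portal accepts the accession in an `ID` query parameter. The function name `dataset_url` is hypothetical. No network request is made.

```python
import re
from urllib.parse import urlencode

# Base address taken from the slide.
BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# Assumption: accessions look like "PXD" plus six digits (e.g. PXD000001).
PXD_PATTERN = re.compile(r"^PXD\d{6}$")


def dataset_url(accession: str) -> str:
    """Return a ProteomeCentral link for one dataset accession.

    Assumption: the portal reads the accession from an `ID` query parameter.
    """
    normalized = accession.strip().upper()
    if not PXD_PATTERN.match(normalized):
        raise ValueError(f"not a PXD accession: {accession!r}")
    return f"{BASE_URL}?{urlencode({'ID': normalized})}"


if __name__ == "__main__":
    print(dataset_url("PXD000001"))
```

Validating the accession before building the link keeps malformed identifiers from turning into misleading URLs.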
9. ProteomeXchange data workflow
[Diagram: Research groups submit datasets (raw data, metadata and the researcher's results) to the receiving repositories. MS/MS data (as complete submissions) and any other workflow (mainly partial submissions) go to PRIDE, MassIVE or jPOST; SRM data go to PASSEL. Metadata / manuscript links connect the workflow to ProteomeCentral and the journals. PeptideAtlas, GPMDB, proteomicsDB and MassIVE reanalyse datasets and provide reprocessed results. Further consumers shown: UniProt/neXtProt, other DBs, and OmicsDI (integration with other omics datasets).]
10. OmicsDI: Portal for omics datasets
http://www.ebi.ac.uk/Tools/omicsdi/
• Aims to integrate ‘omics’ datasets (proteomics, transcriptomics, metabolomics and genomics at present).
• Resources shown: PRIDE, MassIVE, jPOST, PASSEL, GPMDB, ArrayExpress, Expression Atlas, MetaboLights, Metabolomics Workbench, GNPS, EGA
Perez-Riverol et al., 2016, bioRxiv
11. OmicsDI: Portal for omics datasets
Perez-Riverol et al., 2016, bioRxiv
12. Overview
• General introduction to ProteomeXchange
• Overall submission statistics
• Updated HPP guidelines
• Specifics about MassIVE (Nuno)
13. ProteomeXchange: 4,534 datasets up until 31st July, 2016
• Datasets by repository: PRIDE (4067), MassIVE (339), PeptideAtlas/PASSEL (115), jPOST (13)
• Publicly accessible: 2597 datasets, 57% of all: PRIDE (2334), MassIVE (135), PASSEL (115), jPOST (13)
• Datasets per year: 2012 (102), 2013 (527), 2014 (963), 2015 (1758), 2016 until the end of July (1184)
• Countries with at least 100 datasets: USA (1105), Germany (546), United Kingdom (411), China (356), France (229), Netherlands (188), Canada (178), Switzerland (150), Australia (125), Spain (123), Denmark (123), Japan (117), Sweden (101)
• Species studied by at least 100 datasets: Homo sapiens (2010), Mus musculus (604), Saccharomyces cerevisiae (191), Arabidopsis thaliana (140), Rattus norvegicus (127)
• 936 reported taxa in total
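The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers quoted above; the variable names are illustrative and nothing is fetched from ProteomeXchange.

```python
# Counts copied from the slide (status: 31 July 2016).
by_repository = {"PRIDE": 4067, "MassIVE": 339, "PeptideAtlas/PASSEL": 115, "jPOST": 13}
public_by_repository = {"PRIDE": 2334, "MassIVE": 135, "PASSEL": 115, "jPOST": 13}
by_year = {2012: 102, 2013: 527, 2014: 963, 2015: 1758, 2016: 1184}

total = sum(by_repository.values())
public_total = sum(public_by_repository.values())
yearly_total = sum(by_year.values())

# Share of all datasets that is publicly accessible, as a whole percentage.
public_share = round(100 * public_total / total)

print(f"total datasets:       {total}")          # 4534
print(f"sum over years:       {yearly_total}")   # 4534
print(f"public datasets:      {public_total}")   # 2597
print(f"public share (in %):  {public_share}")   # 57
```

The per-repository and per-year breakdowns both add up to the stated 4,534 datasets, and 2597 of 4534 rounds to the 57% given on the slide.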
14. Datasets are being reused more and more
Data download volume for PRIDE in 2015: ~ 200 TB
Vaudel et al., Proteomics, 2016
15. Overview
• General introduction to ProteomeXchange
• Overall submission statistics
• Updated HPP guidelines
• Specifics about MassIVE (Nuno)
17. Complete vs Partial submissions: processed results
[Screenshot panels: Complete, Partial]
For complete submissions, the spectra can be connected to the processed identification results, and both can be visualized.
18. Complete vs Partial submissions: experimental metadata
[Screenshot panels: Complete, Partial]
General experimental metadata about the projects is similar. However, at the assay level, the information in partial submissions is less detailed.
19. An observer of the ProteomeXchange consortium: iProX
• Proteome data sharing platform in China
• Focusing on:
  • Collection and sharing of proteome experiment raw data
  • Standardized metadata of proteome experiments
  • Visualization of proteome datasets
• Providing:
  • A user-friendly data submission pipeline
  • Structured management of datasets
  • An effective user authority system
  • Standardized metadata collection
  • Powerful computing, storage, and network resources to support the pipeline
  • Remote data backup and synchronous update
www.iprox.org
20. Overview
• General introduction to ProteomeXchange
• Overall submission statistics
• Updated HPP guidelines
• Specifics about MassIVE (Nuno)
21. MassIVE update
Mingxun Wang1,2,4, Jeremy Carver1,4, Nuno Bandeira1-4
1Center for Computational Mass Spectrometry
2Computer Science and Engineering
3Skaggs School of Pharmacy and Pharmaceutical Sciences
4University of California, San Diego
http://massive.ucsd.edu
22. MassIVE Interactivity
• MassIVE = Mass spectrometry Interactive Virtual Environment
http://massive.ucsd.edu
http://proteomics.ucsd.edu
23. MassIVE reanalysis
• Community knowledge requires reproducible, well-characterized results
• MS-GF+ standard database search
• Reanalyzed 15 TB of human data with ~185M MS/MS spectra
• 79 million new FDR-controlled PSMs
• 3.6 million modified versions of 2.8 million unique peptide sequences
• CPTAC colon cancer available with 5 different result sets:
  • [Original] Imported CPTAC results: 6.9M PSMs
  • [Reanalysis] MS-GF+ database search: 8.9M PSMs, 70k mod variants (169k total)
  • [Reanalysis] Spectral library search (MSPLIT): 10M PSMs, including 387K mixture spectra
  • [Reanalysis] Proteogenomics searches of TCGA transcriptomics sequences (Enosi): 6.8M total PSMs, 19,728 proteogenomic events
  • [Reanalysis] Blind modification search (MODa): 7.8M PSMs, 2.8M PSMs for 221k mod variants (306k total), 203K new mod variants (unique modified peptides)
http://massive.ucsd.edu
http://proteomics.ucsd.edu
24. MassIVE: Do it yourself
1. MS-GF+: Database search engine
2. MSPLIT: Spectral library search engine
3. ENOSI: Proteogenomic search engine
4. MODa: Multi-blind modification database search engine
5. Spectral Networks: spectral alignment-based analysis and propagation of identifications
6. Multi-pass: MSPLIT, MSGFDB, MODa cascade search workflow
7. MSGFDB: Database search engine
8. MSPLIT-DIA: Spectral library search for SWATH
9. Upload your own! (mzIdentML, mzTab, TSV)
http://massive.ucsd.edu
http://proteomics.ucsd.edu
25. Check what others think the spectrum is: MassIVE Search
• Find peptides, proteins, PTMs
• Agreement in spectrum identification?
• One-stop search across tens of millions of PSMs
• Result sets shown: Original, Reanalysis
http://massive.ucsd.edu
http://proteomics.ucsd.edu
26. What can you do?
• How can the community work together to reveal the whole human proteome?
• Mass spectrometrists share Data
  • At least: partial submissions with raw mass spectrometry data and enough metadata to allow for reanalysis
  • Especially useful: rare tissues/conditions or very deep acquisition
• Biologists share Knowledge
  • At least: complete submissions with FDR-filtered results in open format (mzIdentML or mzTab)
  • Especially useful: human-curated knowledge of proteins, PTMs, endogenous peptides, etc.
• Bioinformaticians share Reanalyses
  • At least: FDR-filtered results in open format (mzIdentML or mzTab)
  • Especially useful: algorithms that identify new types of PSMs (e.g., PTM-specific, mixtures)
http://massive.ucsd.edu
http://proteomics.ucsd.edu
27. Acknowledgements: People
Attila Csordas
Tobias Ternent
Gerhard Mayer (de.NBI)
Yasset Perez-Riverol
Manuel Bernal-Llinares
Andrew Jarnuczak
Former team members, especially:
Rui Wang
Florian Reisinger
Noemi del Toro
Jose A. Dianes
Henning Hermjakob
Acknowledgements: The PRIDE Team and all PX partners
All data submitters !!!
Eric Deutsch
Zhi Sun
David Campbell
Nuno Bandeira
Mingxun Wang
Jeremy Carver
Yasushi Ishihama
Shujiro Okuda
Shin Kawano
Follow new datasets @proteomexchange