This document summarizes a presentation about the ProteomeXchange (PX) consortium, which provides a framework for standard data submission and dissemination between major proteomics repositories, including PRIDE, PeptideAtlas, and MassIVE. It describes how researchers can submit complete or partial datasets to PX via PRIDE using the PX submission tool. Complete submissions use mzIdentML for processed results, while partial submissions store search engine output files. Over 1,300 datasets have been submitted to PX from researchers worldwide.
The ProteomeXchange consortium allows researchers to easily deposit and retrieve proteomics data. It includes repositories like PRIDE, PeptideAtlas, and recently MassIVE. The goal is to standardize submission and access across repositories through common identifiers and supported workflows. Over 1,300 datasets have been submitted, with many tools now supporting standard formats like mzIdentML for complete submissions. The most accessed datasets include large reference maps of the human proteome. Open source tools are improving submission and analysis of ProteomeXchange data.
PRIDE and ProteomeXchange – Making proteomics data accessible and reusable
Yasset Perez-Riverol
The document discusses ProteomeXchange (PX), a consortium that aims to make proteomics data accessible and reusable. PX includes repositories like PRIDE, PeptideAtlas, and MassIVE. It allows standard data submission between repositories through a common identifier space. The document outlines the PX submission workflow and describes components such as the PX submission tool and PRIDE Inspector. It also provides statistics on data available through PX, with over 1,300 datasets contributed, primarily from human, mouse, and yeast studies. Future plans include better integration of proteomics resources to facilitate data reuse.
Mining the hidden proteome using hundreds of public proteomics datasets
Juan Antonio Vizcaino
The document discusses mining hidden proteomics data using public proteomics datasets. It describes how the PRIDE Cluster tool clusters over 250 million spectra from the PRIDE Archive, including over 190 million previously unidentified spectra. This clustering identified inconsistent clusters that could be reanalyzed, inferred identifications for 9.1 million originally unidentified spectra contained within reliable identification clusters, and consistently unidentified clusters that could be targeted for further analysis to identify unknown peptides. The clustering took 5 days on a 340-core system and generated 28 million clusters.
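To put the clustering figures quoted above in proportion, the following is a rough back-of-the-envelope sketch. It uses only the numbers given in the summary, all of which are approximate ("over 250 million" and so on), so the derived ratios are indicative only; the variable names are made up for illustration.

```python
# Figures quoted in the summary above (approximate, "over N" values).
total_spectra = 250_000_000
unidentified_spectra = 190_000_000
inferred_identifications = 9_100_000
clusters = 28_000_000
days, cores = 5, 340

# Share of the clustered spectra that had no identification to begin with.
unidentified_fraction = unidentified_spectra / total_spectra

# Average number of spectra per cluster.
mean_cluster_size = total_spectra / clusters

# Share of unidentified spectra that gained an inferred identification.
rescued_fraction = inferred_identifications / unidentified_spectra

# Total compute budget, assuming all cores were busy for the whole run.
core_hours = days * 24 * cores

print(f"unidentified: {unidentified_fraction:.0%}")
print(f"mean cluster size: {mean_cluster_size:.1f} spectra")
print(f"rescued: {rescued_fraction:.1%}")
print(f"compute: {core_hours} core-hours")
```

Under these assumptions, roughly three quarters of the input was unidentified, and about one in twenty of those spectra received an inferred identification.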
Proteomics and the "big data" trend: challenges and new possibilities (Talk ...
Juan Antonio Vizcaino
The document discusses the challenges and opportunities of big data in proteomics. It describes how proteomics data volumes are growing rapidly due to technological advances, creating both computational challenges for data analysis and opportunities to reuse large amounts of public data. The PRIDE Archive at EBI stores over 4,000 proteomics datasets and provides tools like PRIDE Inspector to help analyze and validate large datasets. However, challenges remain around data standardization, metadata completeness, and the need for greater computational infrastructure and expertise to fully leverage the large amounts of shared proteomics data.
The document discusses PRIDE, a proteomics data repository at EMBL-EBI. It describes how PRIDE stores mass spectrometry proteomics data, its role within the ProteomeXchange consortium, and how researchers can submit data to PRIDE including the use of mzIdentML and PRIDE tools.
Distributed Query Processing for Federated RDF Data Management
OlafGoerlitz
PhD defense talk about SPLENDID, a state-of-the-art implementation for efficient distributed SPARQL query processing on Linked Data using SPARQL endpoints and voiD descriptions.
"4th Natural Language Interface over the Web of Data (NLIWoD) workshop and QALD-9 Question Answering over Linked Data Challenge" as presented in the 17th International Semantic Web Conference ISWC, 8th - 12th of October 2018, held in Monterey, California, USA
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
PRIDE is a proteomics database at EMBL-EBI that stores mass spectrometry-based proteomics data, including peptide and protein identifications and quantifications. It is part of the ProteomeXchange consortium, which aims to facilitate standardized data submission and dissemination between proteomics repositories. The document outlines the types of data stored in PRIDE, how to access and submit data, and tools for data conversion and visualization like PRIDE Converter 2 and PRIDE Inspector.
Persistent Identifiers, Herbarium workshop at Kongsvold, September 1 to 4, 2014
Dag Endresen
Implementation of persistent and globally unique identifiers for specimens held in natural history collections worldwide will open up new opportunities for referring to these physical resources in an interlinked digital context such as the Internet. Here, we will describe the approach for persistent identification of collection specimens developed and implemented at the Natural History Museum in Oslo (NHM-UiO) by the Norwegian participant node to the Global Biodiversity Information Facility (GBIF-Norway). The Norwegian university museums are invited to use our resolver service at "http://purl.org/gbifnorway/id/<uuid>" when publishing biodiversity data to GBIF. All occurrence records published through GBIF-Norway, with appropriate PURL-UUID identifiers mapped to the Darwin Core occurrenceID, will automatically be added to our resolver service and kept updated.
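As an illustration of the PURL-UUID pattern described above, here is a minimal Python sketch that builds a resolver URL from a UUID. The base URL comes from the summary; the function name `occurrence_purl` and the example UUID are hypothetical, and the sketch does not check whether the resolver actually knows the identifier.

```python
import uuid

# Resolver base quoted in the summary above.
RESOLVER_BASE = "http://purl.org/gbifnorway/id/"


def occurrence_purl(identifier: str) -> str:
    """Return the resolver URL for a UUID string.

    uuid.UUID() validates the input and raises ValueError if it is
    malformed; str() then gives the canonical lowercase form.
    """
    parsed = uuid.UUID(identifier)
    return RESOLVER_BASE + str(parsed)


# Made-up UUID, used only to show the shape of the result.
example = occurrence_purl("12345678-1234-5678-1234-567812345678")
print(example)
```

A value built this way is what would be mapped to the Darwin Core occurrenceID field when publishing records.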
The document provides an overview and status update of ProteomeXchange, including submission and citation statistics, new prospective members jPOST and iPROX, and the OmicsDI interface. It notes that ProteomeXchange currently includes over 3,800 datasets submitted primarily from the US, Germany, UK, and China, and that submissions and data reuse have grown substantially in recent years.
An overview of the PRIDE ecosystem of resources and computational tools for m...
Juan Antonio Vizcaino
The document provides an overview of the PRIDE ecosystem of resources and computational tools for mass spectrometry proteomics data. It describes PRIDE Archive and ProteomeXchange as repositories for proteomics data, as well as tools like PRIDE Inspector for visualizing and validating data. It also discusses how public proteomics data is increasingly being reused, and added-value resources like PRIDE Cluster and PRIDE Proteomes that provide aggregated views of proteomics data.
Presentation at the NEH-Funded Linked Ancient World Data Institute, ISAW/NYU, New York, May 2012. Discusses the use of RDF and linked data in representing geographic information relationships between resources.
This document discusses the ProteomeXchange Consortium and recent updates. It provides statistics on data submissions and downloads. Over 7,475 datasets have been submitted from over 50 countries, with the majority from the US, Germany, and China. PRIDE and MassIVE are the largest repositories. A new prospective member, iProX, is described which will be the main proteomics data sharing platform in China. Guidelines are being developed to handle reprocessed datasets submitted to repositories.
GBIF-Norway status for the 6th European GBIF nodes meeting April 2014
Dag Endresen
Slides prepared for the 6th European GBIF nodes meeting in Brussels. At the meeting these slides were replaced by a live online demo of these tools. Topics include citizen science transcription of specimen labels, persistent identifiers and custom collection portals. All slides are CC-by.
Public proteomics data: a (mostly unexploited) gold mine for computational re...
Juan Antonio Vizcaino
The document discusses public proteomics data available through the PRIDE Archive at the European Bioinformatics Institute. It provides statistics on data submissions and downloads, which continue to increase significantly each year. The author advocates for reusing public proteomics data through approaches like proteogenomics studies, discovery of new post-translational modifications, and meta-analysis studies. Spectrum clustering is presented as a method to further analyze and draw insights from large proteomics datasets.
A proteomics data “gold mine” at your disposal: Now that the data is there, w...
Juan Antonio Vizcaino
The document discusses the reuse of public proteomics data. It describes how data from the PRoteomics IDEntifications (PRIDE) Archive can be reanalyzed to conduct proteogenomics studies, discover new post-translational modifications and variants, and enable meta-analysis studies of protein-protein interactions and associations. It also examines challenges around analyzing the "dark proteome" of consistently unidentified spectra in public datasets and developing open analysis pipelines for proteomics data in cloud environments.
Data exchange alternatives, GIGA TAG (2009)
Dag Endresen
GIGA TAG meeting at Bioversity International, Rome, Italy 18th May 2009. Data exchange alternatives for the Global Information on Germplasm Accessions (GIGA) project. Dag Endresen (Bioversity/NordGen).
I Mapreduced a Neo store: Creating large Neo4j Databases with Hadoop
GoDataDriven
When exploring very large raw datasets containing massive interconnected networks, it is sometimes helpful to extract your data, or a subset thereof, into a graph database like Neo4j. This allows you to easily explore and visualize networked data to discover meaningful patterns.
When your graph has 100M+ nodes and 1000M+ edges, using the regular Neo4j import tools will make the import very time-intensive (as in many hours to days).
In this talk, I’ll show you how we used Hadoop to scale the creation of very large Neo4j databases by distributing the load across a cluster and how we solved problems like creating sequential row ids and position-dependent records using a distributed framework like Hadoop.
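The abstract does not say how the sequential row ids were produced. One common way to get gap-free ids in a distributed setting is a two-pass scheme: count the records in each partition, turn the counts into starting offsets with a prefix sum, then let every partition number its own records independently. The plain-Python sketch below simulates that idea on in-memory lists; it is an assumed illustration, not the speakers' Hadoop implementation, and the function names are hypothetical.

```python
from itertools import accumulate


def partition_offsets(partition_sizes):
    """Starting id of each partition: the exclusive prefix sum of the sizes."""
    return [0] + list(accumulate(partition_sizes))[:-1]


def assign_sequential_ids(partitions):
    """Give every record a globally sequential id without a shared counter."""
    # Pass 1: each partition only reports how many records it holds.
    sizes = [len(part) for part in partitions]
    offsets = partition_offsets(sizes)
    # Pass 2: each partition numbers its records from its own offset,
    # so partitions can be processed in parallel.
    return [
        [(offset + i, record) for i, record in enumerate(part)]
        for offset, part in zip(offsets, partitions)
    ]


partitions = [["a", "b"], ["c"], ["d", "e", "f"]]
numbered = assign_sequential_ids(partitions)
print(numbered)
```

The only information that has to be shared between workers is the small list of partition sizes, which is what makes the scheme suitable for a cluster.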
What is the fuss about triple stores? Will triple stores eventually replace relational databases? This talk looks at the big picture, explains the technology and tries to look at the road ahead.
Opening and Integration of CASDD and Germplasm Data to AGRIS by Prof. Xuefu Z...
CIARD Movement
Presentation delivered at the Agricultural Data Interoperability Interest Group -- Research Data Alliance (RDA) 4th Plenary Meeting -- Amsterdam, September 2014
The document discusses Wetnet, a system for smart water network management. It consists of sensors that monitor flows, a central processing unit, and a web application. The system [1] converts normal networks into smart networks to improve efficiency, [2] analyzes sensor data to detect leaks and other issues, and [3] provides reports to aid decision making. Overall, Wetnet aims to efficiently manage water distribution through continuous monitoring and performance analysis.
The best gifts have the recipient in mind. Give Mom what she never gives herself this Mother’s Day. Let her relax, work on her own schedule and unwind.
Ways to connect with your child using yoga techniques. Includes a link to a home practice manual that guides you and your child through ways to practice breath connection, physical bonding and trust.
3 d pie chart circular puzzle with hole in center pieces 6 stages style 4 pow...
SlideTeam.net
The document describes a set of 6 circular puzzle piece diagrams that can be downloaded and edited in PowerPoint. The diagrams allow users to customize text, change colors, sizes and orientations of icons. Individual puzzle pieces can be ungrouped for further editing. All images are fully editable to bring presentations to life and engage audiences. The diagrams can be downloaded from a provided website URL.
Breast cancer tumors are classified by their hormone-receptor status and the presence, or absence, of certain proteins. The subtype of breast cancer determines how it will be treated.
Learn about the three main subtypes and how each is treated.
The ProteomeXchange Consortium aims to allow standard data submission and dissemination between major proteomics repositories, including PeptideAtlas, PRIDE, and MassIVE. It establishes a common identifier space (PXD IDs) and supports workflows for MS/MS and SRM data submitted from any experimental approach. Since 2012, over 3,800 datasets have been submitted from over 700 species, with over 1,900 publicly accessible. Submissions have grown significantly each year, and data downloads for reuse are also increasing. The goal is to make data sharing easier for researchers.
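The common identifier space makes it easy to spot dataset accessions in text such as the data availability statement of a paper. The sketch below assumes the commonly seen form of a PXD identifier, the letters PXD followed by six digits (for example PXD000001); the summary above only says "PXD IDs", so the exact pattern and the helper names are assumptions.

```python
import re

# Assumed accession shape: "PXD" followed by exactly six digits.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def is_pxd_accession(text: str) -> bool:
    """True if the whole string looks like a single PXD accession."""
    return PXD_PATTERN.fullmatch(text) is not None


def find_pxd_accessions(text: str) -> list:
    """Return every PXD-style accession mentioned in a piece of text."""
    return PXD_PATTERN.findall(text)


statement = (
    "Data are available via ProteomeXchange with identifiers "
    "PXD000001 and PXD012345."
)
found = find_pxd_accessions(statement)
print(found)
```

A match only shows that a string has the right shape; confirming that the dataset exists still requires looking it up in ProteomeXchange.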
This document describes the different types of blogs, including personal blogs, microblogging, educational blogs, and reverse blogs. It explains the steps for creating a successful blog, such as choosing a topic, name, platform, template, and posting frequency, as well as the importance of actively participating in the blogging community.
This document summarizes the activities carried out during a field work placement at Pelatihan Multilogika (Orbit). The activities included taking notes on course material, backing up data, an introduction to computer hardware, operating system installation, PC assembly, cable installation, work visits, and many more.
Why does your business need a mobile app
Jatin Dabas
The need and the purpose of creating an app for your business is to stay in touch with target audience in an effective way and also to boost marketing opportunities.
http://www.oodlestechnologies.com/blogs/Why-Does-Your-Business-Need-a-Mobile-App
This infographic presents several reasons why small business owners need to add video marketing to their promotional strategies. Building your business requires a good mix of approaches and mediums. The use of video in promoting businesses has exploded over the last few years as its effectiveness in engaging and educating is being understood. Technology has given even a small business with a meager budget the ability to produce smart, eye-catching videos that can lead to sales.
This document contains 10 quotes from founders and leaders of successful companies like Google, Twitter, Groupon, and Alibaba. The quotes provide advice for startups such as delivering more than expected, focusing on making ideas happen rather than just having ideas, learning from unhappy customers, and hiring great people and giving them freedom. The full quotes cover additional topics like not waiting for perfection, planning being less useful than taking action, and chasing vision over money.
The document provides a checklist for companies to follow before, during, and after implementing a reduction in force (RIF). It recommends that companies 1) adequately prepare by clarifying goals, criteria, and processes for selection; 2) efficiently conduct the RIF by balancing respect and security while clearly defining the start and end; and 3) deal with the aftermath humanely by implementing communication plans, being sensitive to emotions, and paying attention to key players. It also lists six common mistakes to avoid when conducting a RIF.
This document describes the impact of information and communication technologies (ICT) on music. ICT has made it possible to edit scores, make high-quality recordings, slow down performances, and create multimedia compositions. Applications such as Incredibox, Tony-b, Music Maker, and Music Grid let users create and share music online easily. ICT has also made it easier for creators to share and teach their musical work to others over the Internet.
Daily chart of the DAX Performance Index for 10-08-2012
Experiencia Trading
This document presents a technical analysis of the DAX Performance Index for 10 August 2012. It includes charts that plot several simple moving averages as potential support and resistance lines. It analyzes possible scenarios if the index holds or loses certain key levels, and gives bullish targets and resistance levels. It also explains how the charts were built and how to interpret them.
The document discusses representing imaging mass spectrometry (MS) data. It describes imzML, a common data standard for MS imaging data. It also outlines how MS imaging data can be submitted to the ProteomeXchange repository via the PRIDE database. MS imaging generates data from tissue sections, and imzML encodes both the raw data files and metadata about the images. Submitting to ProteomeXchange involves uploading raw data files, result files, and metadata descriptions to allow sharing and reuse of MS imaging experiments.
PRIDE and ProteomeXchange: supporting the cultural change in proteomics publi...
Juan Antonio Vizcaino
The document discusses PRIDE and ProteomeXchange, which are resources that support the deposition of proteomics data to public repositories. PRIDE stores mass spectrometry-based proteomics data, and is one of the repositories that is part of ProteomeXchange, a framework that allows standard submission of proteomics data between major repositories. The document outlines the cultural change in proteomics towards public data sharing, and provides information on submitting proteomics data to PRIDE and accessing data deposited in PRIDE and ProteomeXchange.
Dr. Juan Antonio Vizcaíno presents two algorithms for spectral clustering in the PRIDE proteomics database: PRIDE Cluster and PRIDE Cluster H. PRIDE Cluster originally clustered over 20 million spectra but did not scale well with increasing data. PRIDE Cluster H was developed to address this using the Hadoop parallel processing framework. It clustered over 54 million spectra from PRIDE in around 2.5 days. The algorithms aim to assess identification quality and find consensus spectra across experiments. Cluster results will be made available in the PRIDE Archive and used for annotation transfer and identification of previously unidentified spectra.
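To make the idea of spectrum clustering and consensus spectra concrete, here is a deliberately simple greedy sketch: each spectrum is a fixed-length intensity vector, it joins the first cluster whose consensus it resembles closely enough (cosine similarity), and the consensus is updated as a running mean. This is a generic illustration under those assumptions, not the PRIDE Cluster or PRIDE Cluster H algorithm; the threshold of 0.9 and all names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_cluster(spectra, threshold=0.9):
    """Assign each spectrum to the first sufficiently similar cluster."""
    clusters = []  # each cluster: {"consensus": [...], "members": [indices]}
    for index, spectrum in enumerate(spectra):
        for cluster in clusters:
            if cosine(spectrum, cluster["consensus"]) >= threshold:
                n = len(cluster["members"])
                # Update the consensus as the running mean of its members.
                cluster["consensus"] = [
                    (mean * n + value) / (n + 1)
                    for mean, value in zip(cluster["consensus"], spectrum)
                ]
                cluster["members"].append(index)
                break
        else:
            # No similar cluster found: start a new one.
            clusters.append({"consensus": list(spectrum), "members": [index]})
    return clusters


spectra = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
result = greedy_cluster(spectra)
print([cluster["members"] for cluster in result])
```

In a setting like the one described above, an unidentified spectrum that lands in a cluster of consistently identified spectra becomes a candidate for annotation transfer.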
PRIDE resources and ProteomeXchange
- PRIDE is a proteomics data repository at EMBL-EBI that stores mass spectrometry-based proteomics data.
- It is part of the ProteomeXchange consortium, which provides a framework for standardized data submission and dissemination between proteomics repositories.
- This presentation discusses how to submit data to PRIDE/ProteomeXchange using PRIDE tools, including converting files to mzIdentML format and using the PX submission tool for metadata and file transfer.
ProteomeXchange: Update for the C-HPP Consortium.
10th C-HPP Workshop: “Proteome data management and identification of missing proteins".
Bangkok, Thailand. 09/08/2015. Remote presentation.
This document provides an overview and status update of ProteomeXchange in 2017. It discusses submission and download statistics showing growth in datasets submitted. There are now over 5,000 datasets in PRIDE from over 1,000 species. Download volumes have increased to over 200 TB in 2016. Citations of proteomics datasets are also increasing. A new prospective member, Firmiana, may join ProteomeXchange. The OmicsDI interface provides integrated access to datasets across multiple omics domains like proteomics, transcriptomics and metabolomics.
The document discusses PRIDE and ProteomeXchange, resources for sharing public proteomics datasets. It describes how PRIDE stores mass spectrometry-based proteomics data and supports data sharing in the field. It also outlines the ProteomeXchange consortium which aims to standardize data submission and dissemination between proteomics repositories, and how data can be submitted to PRIDE using tools that support standard file formats.
Data submissions and archiving raw data in life sciences. A pilot with Proteo...
Rafael C. Jimenez
European Life Sciences Infrastructure for Biological Information aims to provide data infrastructure for biological information sharing. It is running a pilot project with proteomics data to enable standardized submission and dissemination of data between major proteomics resources like PRIDE and PeptideAtlas. The pilot allows direct archiving of raw proteomics data in PRIDE for the first time. It uses the EUDAT program for data storage and access and ProteomeXchange as a framework to link proteomics databases together. The goal is to prepare for the rapid growth of life sciences data and keep up with processing and storing the large volumes of raw data being generated.
Sharing massive data analysis: from provenance to linked experiment reports
Gaignard Alban
The document discusses scientific workflows, provenance, and linked data. It covers:
1) Scientific workflows can automate data analysis at scale, abstract complex processes, and capture provenance for transparency.
2) Provenance represents the origin and history of data and can be represented using standards like PROV. It allows reasoning about how results were produced.
3) Capturing and publishing provenance as linked open data can help make scientific results more reusable and queryable, but challenges remain around multi-site studies and producing human-readable reports.
Presentation used for my oral Master's Thesis defense for the Universitat Autònoma de Barcelona. It shows the development of a Perl script for the automated generation of a report of the somatic mutations found in a Normal/Tumor cancer experiment.
This document discusses mass spectrometry informatics formats developed by the Proteomics Standards Initiative. It describes standard formats such as mzIdentML, mzQuantML, and mzTab that have been created for proteomics data as well as ongoing work to extend mzTab to support metabolomics and glycomics data. It also provides information on the current status and adoption of these standards by the proteomics community.
The document discusses a training webinar about PRIDE and ProteomeXchange. It begins with instructions for participating in the webinar and an overview of data resources at EMBL-EBI. It then covers PRIDE's mission to archive proteomics data, the ProteomeXchange consortium for standardized data submission, and tools for submitting data to PRIDE including PRIDE Converter, PRIDE Inspector, and the ProteomeXchange submission tool.
Results may vary: Collaborations Workshop, Oxford 2014, by Carole Goble
Thoughts on computational science reproducibility with a focus on software. Given at the Software Sustainability Institute's 2014 Collaborations Workshop
Data-intensive applications on cloud computing resources: Applications in lif..., by Ola Spjuth
Presentation at the de.NBI 2017 symposium “The Future Development of Bioinformatics in Germany and Europe” held at the Center for Interdisciplinary Research (ZiF) of Bielefeld University, October 23-25, 2017.
https://www.denbi.de/symposium2017
What's mine is yours (and vice versa): Data sharing in vibrational spectroscopy, by Alex Henderson
Presentation given at SPEC 2014, Krakow, Poland. 17-22 August 2014
[some slides do not display correctly, download the pdf for better quality]
In our day-to-day practice we collect data, convert this to information, hopefully extract knowledge, and then pass this on to our peers, thereby advancing the global understanding of our field. This is a very linear process. What if we were to share our data? Have others take our information and combine it with their own? Such a branched process would likely result in more rapid discoveries and, potentially, a greater understanding. In order to facilitate data sharing we must define at least two interfaces with our peers;
1. A mechanism of them understanding the language of our data
2. A mechanism of passing on the context of our experiment
Of course, both of these must work in reverse; we must understand their data and also their experimental context. These are separate yet related ideas. Our data are meaningless without context, but because we are ‘close to the action’ we do not explicitly document them.
Recording the nature of our experiments can have benefits closer to home. Too often we find ourselves searching for results that we know we recorded, but have difficulty locating. Then there is the issue of recalling the exact experimental procedure involved in the sample preparation or data reduction. Documentation of these will lead to better laboratory practice all round.
Earlier this year, a network of academic, clinical and industrial groups was constituted in the UK, with some international partners, to consider how best to push forward the use of infrared and Raman spectroscopies in the clinical arena: CLIRSPEC [1]. One of the work packages of the CLIRSPEC network is the development of standard protocols for data sharing. The work package falls, initially, into two parts;
1. How to easily and uniformly transfer our data between research teams and, by association, into an accessible archive.
2. How to record the provenance of our samples, the treatments they undergo, the experiments performed on them and the manner the resulting data was manipulated: the metadata.
In this presentation we will outline the current position of the CLIRSPEC work package, both in terms of the performance of various candidate data formats (JCAMP-DX, SPC, netCDF, …), and the options for the recording of the metadata associated with the experimental procedure (controlled vocabularies, XML, RDF, ISA-TAB, …). Included here is the concept of a minimum reporting requirement for IR and Raman, particularly in the clinical arena, that we can all try to meet.
None of this can happen without the buy-in of the community. We seek to engage everyone in a dialogue that will result in more consistent, and hopefully better, practice across all laboratories to further our understanding of clinical vibrational spectroscopy.
[1] http://clirspec.org
Data Integration vs Transparency: Tackling the tension, by Paul Groth
Paul Groth discussed the tension between data integration and transparency. He explained that while integrating data from multiple sources is important for analysis, it can reduce transparency about where the data came from. Provenance, or recording the origin and process of data, was presented as a solution. Groth outlined challenges in provenance collection and proposed techniques like taint tracking and record and replay from software security to help automate provenance capture while data is integrated and analyzed.
Similar to ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets (20)
Reusing and integrating public proteomics data to improve our knowledge of th..., by Juan Antonio Vizcaino
Dr. Juan Antonio Vizcaíno discusses reuse and integration of public proteomics data to improve knowledge of the human proteome. He describes how the PRIDE database stores mass spectrometry-based proteomics data and how ProteomeXchange provides a framework for data submission and dissemination between repositories. Reanalysis of public proteomics data is increasing and can be used for proteogenomics studies and meta-analyses to integrate proteomics and genomics data and better understand the human proteome.
This document provides an overview of proteomics data standards developed by the Proteomics Standards Initiative (PSI). It discusses the need for data standards, describes existing PSI standards like mzML for mass spectrometry data, mzIdentML for identification data, and mzTab for final results. The document also provides background on the development and adoption of these standards over time to support the evolving needs of the proteomics community.
Dr. Juan Antonio Vizcaíno presented on the reuse of public proteomics data. The submission of proteomics datasets to repositories like PRIDE has increased dramatically in recent years. Downloads and reuse of data from PRIDE has also grown significantly, reaching 295 terabytes in 2017. Common ways researchers reuse public proteomics data include verifying published results, building spectral libraries, finding interesting datasets to reanalyze for new discoveries, and benchmarking new algorithms. Data sharing allows information to be extracted and reused in new experiments, advancing protein knowledge in areas like UniProt and neXtProt databases.
PRIDE is a proteomics database that stores mass spectrometry-based proteomics data as part of the ProteomeXchange consortium. It contains identification and quantification data from peptide and protein expression analyses as well as post-translational modifications and mass spectra. Data is organized into datasets and assays and can be submitted to PRIDE via tools that export results into mzIdentML or mzTab format. Complete submissions contain identified spectra mapped to results, while partial submissions provide limited experimental details. PRIDE Inspector and the PX submission tool facilitate validation, visualization and submission of proteomics data to PRIDE.
1) There are several major proteomics repositories that serve different purposes, including repositories that store raw data without reprocessing it (PRIDE Archive, MassIVE, jPOST, iProx, PASSEL) and repositories that reprocess all raw data using standardized methods (PeptideAtlas, GPMDB, proteomicsDB, Human Proteome Map).
2) The document outlines the types of information commonly stored in proteomics repositories, including raw data, identification results, quantification, and metadata. It also discusses standards for file formats.
3) Data sharing in proteomics is becoming more important, driven by journals and funders, to enable reproducible science and maximize the value of research findings. Repositories support
Proteomics is the large-scale study of proteins. The document provides an overview of the history and concepts of proteomics, including definitions of key terms, descriptions of pioneering scientists and techniques, and the importance of bioinformatics in proteomics research. It discusses how proteomics has evolved from protein sequencing and gel electrophoresis to modern mass spectrometry-based techniques and quantitative analysis. The increasing role of proteomics in fields like structural biology and clinical applications is also noted.
ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Process..., by Juan Antonio Vizcaino
This document summarizes a webinar about developing open proteomics data analysis pipelines in the cloud. It discusses creating reusable workflows for common proteomics analysis tasks like identification, quantification, and quality control. These workflows would be deployed in cloud environments like the EMBL-EBI "Embassy Cloud" and connected to public proteomics databases like PRIDE. The goals are to make large-scale proteomics analysis more reproducible, scalable, and accessible to the community. An implementation study is underway to develop initial workflows for common analysis types, with plans to expand the available tools and optimize the pipelines for growing proteomics data volumes in the future.
This document provides an overview and status update of various proteomics data standards and related efforts from the PSI Proteome Informatics working group. It discusses the structure and timeline of developments for mzIdentML, mzQuantML, mzTab, and related proteogenomics formats. It also outlines plans for the meeting, including further developing mzTab for different applications and the new proVCF format for representing genetic variation at the protein level.
1) ProteomeXchange is a global database containing proteomics data from several repositories including PRIDE, MassIVE, and jPOST.
2) A new member, iProX, joined in 2017 and contains over 60 terabytes of data from China.
3) Usage of ProteomeXchange data is increasing, with PRIDE downloads growing from 50 terabytes in 2013 to over 295 terabytes in 2017.
Developing open data analysis pipelines in the cloud: Enabling the ‘big data’..., by Juan Antonio Vizcaino
Dr. Juan Antonio Vizcaíno presented on developing open data analysis pipelines in the cloud to enable large-scale analysis of proteomics data. He introduced PRIDE and ProteomeXchange as repositories for proteomics data that are seeing substantial growth. Moving analysis pipelines to the cloud will facilitate public reuse of large datasets, improve scalability, and ensure reproducibility. Initial pipelines have been created for identification, quantification, and quality control of mass spectrometry data and deployed on the EMBL-EBI cloud platform. Future work includes optimizing access to PRIDE data and developing pipelines for analysis of DIA and proteogenomics data.
The document discusses the ELIXIR Proteomics Community and its plans. It describes how 11 ELIXIR nodes support the community to develop sustainable proteomics tools and resources and make them FAIR. It highlights existing resources like the PRIDE database and ProteomeXchange repository. Future plans include developing proteoform-centric approaches, integrating omics data, and improving analysis workflows and data management.
This document summarizes Juan A. Vizcaíno's presentation on the ELIXIR Proteomics Community. It discusses the establishment of the community through an implementation study and strategy meeting. The community aims to develop standardized proteomics data analysis pipelines and deploy them in a cloud environment. It will also work to improve proteomics data standards and integrate proteomics with other omics data through activities like the Proteomics Standards Initiative. The ProteomeXchange database is a major resource overseen by the community for storing and sharing proteomics data internationally.
This document discusses the reuse of public proteomics data. It provides statistics on proteomics datasets submitted to PRIDE, including the top submitting countries, types of submissions, data volume, and most studied species. It then discusses several ways that public proteomics data is being reused, including to verify published results, build spectral libraries, find new splice isoforms or post-translational modifications, benchmark new tools, and contribute to protein evidence in databases like UniProt. Specific examples of data reuse are also provided, such as for spectral searching, meta-analysis, and repurposing data for proteogenomics studies or discovering novel PTMs.
This document discusses proteomics repositories and data sharing in proteomics. It describes the types of information stored in MS proteomics repositories, including raw data, identification results, quantification, and metadata. It outlines several main repositories, distinguishing between those that do not reprocess data, like PRIDE and MassIVE, and those that do reprocess data through a standardized pipeline, like PeptideAtlas and GPMDB. It also discusses resources focused on drafts of the human proteome, such as proteomicsDB and the Human Proteome Map. Overall, the document provides an overview of existing proteomics repositories and issues around data sharing in the field.
The document introduces several proteomics data standards developed by the Proteomics Standards Initiative (PSI), including mzML, mzIdentML, mzQuantML, TraML, and mzTab. It provides an overview of each standard, describing what type of data it encodes (e.g. mass spectrometry data, identification data, quantification data), its timeline of development and versions, and its increasing adoption by proteomics software and databases. The document emphasizes that data standards are necessary for data sharing and integration in proteomics given the large number of experimental workflows and data types.
Proteomics is the large-scale study of proteins. It has become an important field due to developments in mass spectrometry and genomics. However, proteomics generates large amounts of complex data that requires bioinformatics analysis. The history of proteomics includes early pioneers in protein sequencing and mass spectrometry techniques. Current areas of focus include biomarker discovery, structural biology, and integrating proteomics with other omics data through systems biology approaches.
The spectra-cluster toolsuite: Enhancing proteomics analysis through spectrum..., by Juan Antonio Vizcaino
The document discusses the spectra-cluster Toolsuite, which enhances proteomics analysis through spectrum clustering. It describes how the toolsuite was used to cluster the PRIDE database of mass spectrometry data, identifying consensus spectra and inferring identifications for originally unidentified spectra. It also discusses how the toolsuite can be used to cluster individual datasets to improve label-free quantification and characterize unknown samples. The toolsuite includes algorithms, APIs, and tools to enable clustering, development, and analysis capabilities.
Enabling automated processing and analysis of large-scale proteomics data, by Juan Antonio Vizcaino
This document summarizes several presentations and events related to proteomics data analysis and ELIXIR activities. It describes a kickoff meeting in Tuebingen where 25 people from 11 ELIXIR nodes discussed future proteomics activities. It also outlines a new 1-year ELIXIR implementation project led by EMBL-EBI and ELIXIR-Germany to develop reusable proteomics analysis pipelines using the OpenMS framework and deploy them on the EMBL-EBI cloud for processing large proteomics datasets from the PRIDE repository, which saw over 243 terabytes of data downloaded in 2016.
ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets
1. ProteomeXchange Experience: PXD Identifiers and Release of Data on Acceptance, Uploading Large Data Sets
Dr. Juan Antonio Vizcaíno
PRIDE Group Coordinator
Proteomics Services Team
EMBL-EBI
Hinxton, Cambridge, UK
2. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Some HPP related things
Juan A. Vizcaíno, juan@ebi.ac.uk
13th HUPO World Congress, Madrid, 5 October 2014
3. ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org
4. ProteomeXchange data workflow
[Figure: data workflow diagram. Labels: Results, Raw Data*, Metadata / Manuscript; Journals; receiving repositories PRIDE (MS/MS data), PASSEL (SRM data) and MassIVE (MS/MS data); ProteomeCentral; PeptideAtlas, GPMDB, UniProt/neXtProt, Other DBs. Legend: researcher’s results, reprocessed results, raw data*, metadata.]
Vizcaíno et al., Nat Biotechnol, 2014
5. MassIVE (UCSD)
• Just joined ProteomeXchange, in June 2014
• Only partial submissions. A few datasets so far.
http://proteomics.ucsd.edu/service/massive/
6. PASSEL: repository for SRM data
• Suitable for SRM assays
• Part of the PeptideAtlas set of resources.
http://www.peptideatlas.org/passel/
Farrah et al., Proteomics, 2012
7. ProteomeXchange data workflow
[Figure: workflow diagram repeated from slide 4.]
Vizcaíno et al., Nat Biotechnol, 2014
8. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Some HPP related things
9. Manuscript just out detailing the process
http://www.proteomexchange.org/submission
Ternent et al., Proteomics, 2014
Example dataset: PXD000764
- Title: “Discovery of new CSF biomarkers for meningitis in children”
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data
10. PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files: Optional files:
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
[Diagram labels: Published, Raw files, Other files]
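The difference between complete and partial submissions described on this slide can be sketched in Python. This is only an illustration: the function names, the category labels other than those listed on the slide, and the file extensions are assumptions made for the example, not the actual logic of the PX submission tool.

```python
# Hypothetical extension lists, chosen only for this illustration.
RESULT_SUFFIXES = (".mzid", ".mzid.gz", ".pride.xml")  # mzIdentML or PRIDE XML
SPECTRA_SUFFIXES = (".raw", ".mzml", ".mzxml")          # raw data or peak lists


def file_category(name: str) -> str:
    """Assign a file name to a coarse category based on its extension."""
    lower = name.lower()
    if lower.endswith(RESULT_SUFFIXES):
        return "RESULT"
    if lower.endswith(SPECTRA_SUFFIXES):
        return "SPECTRA"
    if lower.endswith(".fasta"):
        return "FASTA"
    return "OTHER"


def submission_type(file_names: list[str]) -> str:
    """A submission counts as 'complete' here only if it has a RESULT file."""
    categories = {file_category(name) for name in file_names}
    return "complete" if "RESULT" in categories else "partial"


print(submission_type(["run1.raw", "run1.mzid"]))  # complete
print(submission_type(["run1.raw", "run1.dat"]))   # partial
```

In this sketch a search engine output file such as `run1.dat` falls into `OTHER`, so the second dataset is treated as partial, mirroring the rule that unsupported workflows are stored in their original form.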
11. Complete vs Partial submissions: experimental metadata
[Screenshots: Complete, Partial]
General experimental metadata about the projects is similar. However, at the assay level, the information in partial submissions is less detailed.
12. Complete vs Partial submissions: processed results
For complete submissions, it is possible to connect the spectra with the processed identification results, and they can be visualized.
[Screenshots: Complete, Partial]
13. Complete submissions using mzIdentML
An increasing number of tools support export to mzIdentML 1.1:
- Mascot
- MSGF+
- MyriMatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab internal pipelines, …
- Referenced spectral files need to be submitted as well (all open formats are supported).
[Diagram labels: Search engines; Search engine results + MS files; mzIdentML]
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
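Because the referenced spectral files have to be submitted together with the mzIdentML file, it can help to list them before uploading. The following Python sketch reads the `location` attribute of the `SpectraData` elements. The sample document is a stripped-down fragment written for this example (a valid mzIdentML file contains many more elements), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Minimal, incomplete mzIdentML-like fragment used only for illustration.
SAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection>
    <Inputs>
      <SpectraData id="SD_1" location="file:///data/run1.mgf"/>
      <SpectraData id="SD_2" location="file:///data/run2.mzML"/>
    </Inputs>
  </DataCollection>
</MzIdentML>"""


def referenced_spectra_files(xml_text: str) -> list[str]:
    """Return the file names referenced by SpectraData elements."""
    root = ET.fromstring(xml_text)
    names = []
    for element in root.iter():
        # Strip the XML namespace, if any, to get the local tag name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "SpectraData":
            location = element.get("location", "")
            names.append(location.rsplit("/", 1)[-1])
    return names


print(referenced_spectra_files(SAMPLE))  # ['run1.mgf', 'run2.mzML']
```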
14. Tools, ‘RESULT’ file generation, final ‘RESULT’ file
[Diagram with three columns. Tools: Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others. ‘RESULT’ file generation: now native file export. Final ‘RESULT’ file: mzIdentML ‘RESULT’ plus spectra files.]
15. Available for complete submissions
PRIDE Inspector 2.0
Wang et al., Nat. Biotechnology, 2012
PRIDE Inspector 2.0 supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab Ident (work in progress)
http://code.google.com/p/pride-toolsuite/wiki/PRIDEInspector
16. PX submission tool: data submission
• Capture the mappings between the different types of files.
• Add the mandatory metadata annotation.
• Make the file upload process straightforward for the submitter (it transfers all the files using Aspera or FTP).
• Command line alternative: some scripting is needed.
http://www.proteomexchange.org/submission
[Diagram labels: Published, Raw, Other files, PX submission tool]
17. Uploading large datasets: Aspera
- Aspera is the default file transfer protocol to PRIDE:
  - PX Submission tool
  - Command line
- Up to 50X faster than FTP
File transfer speed should not be a problem!!
18. ProteomeXchange: 1329 datasets up until October 2014
Origin:
271 USA
166 Germany
115 United Kingdom
73 Switzerland
70 China
68 Netherlands
67 France
55 Canada
44 Spain
42 Belgium
33 Sweden
31 Australia
31 Denmark
31 Japan
20 India
20 Norway
19 Taiwan
17 Ireland
16 Austria
14 Finland
14 Italy
12 Republic of Korea
11 Brazil
9 Russia
8 Israel
7 Singapore …
Type:
437 PRIDE complete
792 PRIDE partial
63 PeptideAtlas/PASSEL complete
14 MassIVE
23 reprocessed
Publicly Accessible:
691 datasets, 52% of all
86% PRIDE
12% PASSEL
2% MassIVE
Top species (studied by at least 10 datasets):
577 Homo sapiens
165 Mus musculus
56 Saccharomyces cerevisiae
53 Arabidopsis thaliana
29 Rattus norvegicus
22 Escherichia coli
17 Bos taurus
16 Mycobacterium tuberculosis
13 Oryza sativa
13 Drosophila melanogaster
13 Glycine max
~ 290 species in total
Data volume:
Total: ~55 TB
Number of all files: ~131,000
PXD000320-324: ~ 5 TB
PXD000065: ~ 1.4TB
Datasets/year:
2012: 102
2013: 527
2014: 700
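The figures on this slide are internally consistent, which a short Python check makes explicit. All numbers are copied from the slide; the per-dataset average is a rough derived value that assumes 1 TB = 1000 GB and is not stated in the talk.

```python
total_datasets = 1329

by_type = {
    "PRIDE complete": 437,
    "PRIDE partial": 792,
    "PeptideAtlas/PASSEL complete": 63,
    "MassIVE": 14,
    "reprocessed": 23,
}
by_year = {2012: 102, 2013: 527, 2014: 700}
public_datasets = 691
total_volume_tb = 55  # approximate total from the slide

# Both breakdowns add up to the headline number of datasets.
assert sum(by_type.values()) == total_datasets
assert sum(by_year.values()) == total_datasets

# Share of publicly accessible datasets, rounded to a whole percent.
public_share = round(100 * public_datasets / total_datasets)
print(public_share)  # 52

# Rough average dataset size in GB (assumes 1 TB = 1000 GB).
average_gb = total_volume_tb * 1000 / total_datasets
print(round(average_gb, 1))  # 41.4
```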
19. Public data release: when does it happen?
• When the author tells us to do it (the authors can do it by themselves)
• When we find out that a dataset has been published
• We look for PXD identifiers in PubMed abstracts.
• If your PXD identifier is not in the abstract, a paper may have been published while the data is still private. Let us know!
• New web form on the PRIDE website to facilitate the process
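Looking for PXD identifiers in an abstract can be illustrated with a regular expression. This is a minimal sketch, not the PRIDE team's actual procedure: the pattern assumes identifiers of the form `PXD` followed by six digits, as in the examples in this talk (PXD000764, PXD000065), and the abstract text below is invented.

```python
import re

# Assumed identifier shape: "PXD" followed by exactly six digits.
PXD_PATTERN = re.compile(r"\bPXD\d{6}\b")


def find_pxd_identifiers(text: str) -> list[str]:
    """Return the distinct PXD identifiers in text, in order of appearance."""
    found = []
    for identifier in PXD_PATTERN.findall(text):
        if identifier not in found:
            found.append(identifier)
    return found


# Invented abstract text, used only to exercise the function.
abstract = (
    "Data are available via ProteomeXchange with identifier PXD000764. "
    "We also reanalysed PXD000065 and compared it with PXD000764."
)
print(find_pxd_identifiers(abstract))  # ['PXD000764', 'PXD000065']
```

An abstract without any identifier returns an empty list, which is the case where the data would stay private unless the authors get in touch.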
20. ProteomeXchange: 1329 datasets up until October 2014
(Statistics repeated from slide 18.)
21. Partial submissions can be used to store other data types
• Everything can be stored, not only MS/MS data: a very flexible mechanism able to capture all types of datasets
• PRIDE does not store SRM data (it goes to PASSEL)
• Top down proteomics datasets.
• Mass Spectrometry Imaging datasets.
• Data independent acquisition techniques: e.g. SWATH-MS datasets.
22. Imaging MS datasets: partial submissions
From original publication [13]:
1. Thermo RAW data / UDP
2. Mirion Software (JLU)
Reconstructed ProteomeXchange data:
1. Thermo RAW data / UDP
2. Convert to imzML
3. Upload to PRIDE repository (EBI, Cambridge, UK)
4. Download from PRIDE
5. Display in MSiReader
- Vendor-independent data format
- Freely available software (open source)
- ‘open data’: free to reuse
- Anybody can do this!
[Figure labels: panels C and D; PRIDE Database, European Bioinformatics Institute, Cambridge, UK]
23. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Some HPP related things
24. ProteomeXchange data workflow
[Figure: ProteomeXchange data workflow. Labels shown: submitted data (researcher's results, raw data*, metadata / manuscript); receiving repositories: PRIDE (MS/MS data), PASSEL (SRM data), MassIVE (MS/MS data); ProteomeCentral; reprocessed results; PeptideAtlas, GPMDB, UniProt/neXtProt, other DBs; journals.]
Vizcaíno et al., Nat Biotechnol, 2014
25. ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
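A minimal sketch of how a link to a single dataset page could be built from the portal address above. The base URL is the one on the slide; the `ID` query parameter and the `PXD` plus six digits identifier pattern are assumptions based on how identifiers are written in this deck, and `dataset_url` is a hypothetical helper, not part of any PX tool.

```python
import re
from urllib.parse import urlencode

# Portal address as given on the slide.
BASE_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"

# ASSUMPTION: identifiers look like "PXD" followed by six digits,
# as in the examples quoted in this presentation (e.g. PXD000561).
PXD_PATTERN = re.compile(r"PXD\d{6}")


def dataset_url(pxd_id: str) -> str:
    """Return a ProteomeCentral link for one dataset identifier."""
    if not PXD_PATTERN.fullmatch(pxd_id):
        raise ValueError(f"not a PXD identifier: {pxd_id!r}")
    # ASSUMPTION: the page takes the identifier as an `ID` query parameter.
    return f"{BASE_URL}?{urlencode({'ID': pxd_id})}"


print(dataset_url("PXD000561"))
```

Validating the identifier before building the link keeps malformed input out of the query string; no network request is made here.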
26. Get notified about new PX datasets
- Subscribe to the RSS feed to receive information about the new datasets:
http://groups.google.com/group/proteomexchange/feed/rss_v2_0_msgs.xml
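Once the feed has been downloaded, its entries can be listed with the Python standard library. The sketch below assumes a plain RSS 2.0 layout (`channel` / `item` / `title` / `link`); the sample feed text, its titles, and its links are made up for illustration and are not actual announcements. `list_items` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample standing in for the downloaded feed text.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Hypothetical dataset announcements</title>
    <item>
      <title>Hypothetical announcement: PXD000561</title>
      <link>http://example.org/announcements/1</link>
    </item>
    <item>
      <title>Hypothetical announcement: PXD000865</title>
      <link>http://example.org/announcements/2</link>
    </item>
  </channel>
</rss>"""


def list_items(feed_xml: str) -> list[tuple[str, str]]:
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


for title, link in list_items(SAMPLE_FEED):
    print(title, link)
```

In practice the string passed to `list_items` would be the content fetched from the feed address; fetching is left out so the example runs offline.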
Proteome Central Researchers
27. Overview
• The ProteomeXchange (PX) consortium
• How to submit and access data in PX via PRIDE
• How to access PX data
• Some HPP related things
28. PX submission tool: HPP tags
29. HPP datasets are now tagged
The projects are now tagged and can be browsed as a group of datasets.
Tags for: HPP, C-HPP and B/D-HPP
30. HPP PX datasets: some numbers
Since January 2014, we have been capturing the PI information.
- 25 HPP datasets: 22 C-HPP and 3 B/D-HPP
- Countries represented in C-HPP:
  - 5 Spain
  - 4 South Korea
  - 3 Brazil, China
Only a small proportion of the datasets have been made publicly available, at least through ProteomeXchange.
31. Which are the most accessed datasets?
PXD Identifier | Hits | Dataset title | Publication
PXD000561 | 153512 | A draft map of the human proteome | Kim et al., Nature, 2014. PMID: 24870542
PXD000851 | 111587 | Membrane proteomic analysis of colorectal cancer tissue | Kume et al., MCP, 2014. PMID: 24687888
PXD000865 | 51639 | Mass spectrometry based draft of the human proteome | Wilhelm et al., Nature, 2014. PMID: 24870543
32. Which are the most accessed datasets?
Total Numbers
33. Conclusions
• ProteomeXchange is widely used.
• PRIDE contains most of the MS/MS datasets.
• ProteomeXchange now has a new consortium member: MassIVE (UCSD).
• Around half of the datasets are already public.
• Different open source tools are available to facilitate the process.
• File transfer speed should not be a problem (Aspera support).
• Data deposition enables and promotes data reuse.
• ProteomeXchange is open to new members.
34. Acknowledgements
PeptideAtlas Team (ISB, Seattle)
Eric Deutsch
Terry Farrah
Zhi Sun
Andrew R. Jones
Lennart Martens
Juan Pablo Albar
Martin Eisenacher
Gil Omenn
Nuno Bandeira
And many other PX partners and stakeholders
PRIDE Team
Attila Csordas
Rui Wang
Florian Reisinger
Jose A. Dianes
Tobias Ternent
Yasset Perez-Riverol
Noemi del Toro
Henning Hermjakob
EU FP7 grant number 260558
35. Questions?
36. Connecting different data types
How to connect different data types (genomics, metabolomics, etc)?
It can be used for:
- ArrayExpress/GEO identifiers
- MetaboLights identifiers
- etc.
37. Pilot project started in the context of ELIXIR
[Diagram labels: PRIDE (EMBL-EBI); CSC; BILS; Site B; Site C; B2SAFE; ELIXIR; EUDAT CDI]