Standardization in Proteomics: From raw data to metadata files

From Raw Data to Metadata Files
Yasset Perez Riverol
Proteomics & Bioinformatics
CIGB

Common Proteomic Workflow
Mixture/Sample → Separation Techniques (1D, 2D) → LC-MS/MS → Identification (OMSSA)
– Different providers (annotations, software converters & viewers).
– For raw data formats, there is also the very real problem of “aging”.
Different:
– Protocols.
– Outputs.
– Providers.
Different:
– Strategies.
– Search Engines.
– Post-Processing Analysis.
– File Outputs.

LC-MS/MS (different instruments)
Raw File Raw File Raw File Raw File
Raw data is binary!!! You cannot read it with Notepad, and you cannot read it at all without the vendors' programs and libraries.
Peaks without processing!!!

LC-MS/MS (“aging” problem.)
Thermo XCalibur MassLynx Trapper Compass
FrameWork
“Next to the problem with proprietary raw data formats, there is also the very real problem of ‘aging’ that comes with any binary formatted data. As time goes by, support for certain formats tends to evaporate and within the space of several years, readers can no longer be found for the format.”
Martens et al., Proteomics 2005, 5, 3501–3505

Information inside Raw files
• Raw files contain all the individual peaks as registered
by the instrument detector.
• Peaks without processing!!!
• For LC-MS machines, can store elution profiles and
times for the LC part.
• Depending on the vendor and make of the machine,
other useful instrument-related information can be
stored in these files as well.

File Formats Evolution
Pure Peak Formats (pkl, ms2, mgf)
mzXML (2004)
mzData (2006)
mzML (2008)
Nature Biotechnology. 2004, 22 (11), 1459-1466.
mzData, http://psidev.info/index.php?q=node/80#mzdata.
Mol Cell Proteomics. 2011, 10(1),

Pure Peak File
mgf (Mascot generic format)
BEGIN IONS
PEPMASS=406.283
CHARGE=2+,3+
TITLE=Experiment_1
145.119100 8
217.142900 75
409.221455 11
438.314735 46
567.400183 24
714.447552 31
116.113400 72
91.2165000 32
405.288933 94
39.3021000 12
549.379462 21
715.466300 81
15.1098000 62
45.1358430 28
pkl (peak list)
814.27 22673800 1
221.06 2529.3
223.84 220.9
226.91 1026.9
227.97 1037.9
231.06 110.6
239.05 7193.1
239.74 2513.3
240.27 363.4
240.79 1314.7
241.45 629.9
254.85 332.5
259.71 200.5
260.93 2437.7
dta
539.3453 2
86.1006 4.0000
112.1109 3.0000
115.0906 2.0000
120.0817 5.0000
175.0219 2.0000
225.1467 2.0000
225.7205 2.0000
228.1194 2.0000
230.1106 2.0000
234.1836 2.0000
238.6206 2.0000
240.1569 3.0000
251.1396 2.0000
254.1557 2.0000
261.1669 9.0000
261.6609 2.0000
268.1504 8.0000
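Because pure peak files are plain text, a few lines of code are enough to read them. The following is a minimal sketch in Python (not the code of any tool named in these slides) that reads mgf-style text into dictionaries. The function name `parse_mgf` and the sample string are illustrative assumptions, and the sample is closed with an END IONS line, which a complete mgf block normally carries.

```python
def parse_mgf(text):
    """Parse mgf-style text into a list of {"params": ..., "peaks": ...} dicts."""
    spectra = []
    current = None  # spectrum being filled, or None when outside a block
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                # Header lines such as PEPMASS=406.283 or TITLE=Experiment_1
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak lines: m/z followed by intensity
                mz, intensity = line.split()[:2]
                current["peaks"].append((float(mz), float(intensity)))
    return spectra


# Illustrative sample based on the first lines of the mgf example above.
SAMPLE_MGF = """\
BEGIN IONS
PEPMASS=406.283
CHARGE=2+,3+
TITLE=Experiment_1
145.119100 8
217.142900 75
409.221455 11
END IONS
"""

spectra = parse_mgf(SAMPLE_MGF)
print(spectra[0]["params"]["TITLE"], len(spectra[0]["peaks"]))
```

Note how little metadata survives in such a file: a title, a precursor mass and a charge, but nothing about the sample or the instrument.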

mzXML
mzXML
Parent FileList
MsInstrument
SeparationTechnique
dataProcessing
spotting
scanList
scan
scanDescription
msLevel
PrecursorList
binaryDataArray
binaryDataArray
• • •
scan
scan
scanOrigin
deisotoped
centroided
deconvoluted
mzXML was the first XML-based file format developed for proteomics experiments. It was developed by the Systems Biology Group, USA.
The annotations in the file are string based, that is, they take the form (name attribute, value).
It does not support chromatogram information.
It is very difficult to extend. The structure of the file does not allow new parameters or features to be defined for each element. For example, msInstrument is defined only by the name of the instrument. Also, if the spectrum is preprocessed with any program, it is difficult to incorporate that information.
More than 4 versions of the schema currently exist. The schema is supported by the Systems Biology Group, USA-Zurich.

Controlled Vocabularies &
Ontology Lookup Service
TOF, T.O.F., time of flight, time-of-flight → 100173
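The idea of a controlled vocabulary can be sketched in a few lines: many spellings, one accession. The example below is an illustration only, not the OLS API. The dictionary `VOCABULARY` and the functions `normalise`, `build_index` and `lookup` are hypothetical names, and the accession is simply the number shown on the slide, not checked against a real ontology.

```python
import re

# Hypothetical mini-vocabulary: accession -> preferred name and synonyms.
# "100173" is copied from the slide and is not verified against an ontology.
VOCABULARY = {
    "100173": {
        "name": "time-of-flight",
        "synonyms": ["TOF", "T.O.F.", "time of flight"],
    },
}


def normalise(label):
    """Lower-case the label and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def build_index(vocabulary):
    """Map every normalised name and synonym to its accession."""
    index = {}
    for accession, term in vocabulary.items():
        for label in [term["name"], *term["synonyms"]]:
            index[normalise(label)] = accession
    return index


INDEX = build_index(VOCABULARY)


def lookup(label):
    """Return the accession for a free-text label, or None if unknown."""
    return INDEX.get(normalise(label))


for spelling in ["TOF", "T.O.F.", "time of flight", "time-of-flight"]:
    print(spelling, "->", lookup(spelling))
```

Storing the accession instead of the free text is what lets two files written by different vendors be compared safely.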

OLS
• It is a web-service-oriented system developed in Java.
• It was developed and is maintained by the PRIDE Team!!!
• We have the service installed on a local machine!!!!
• I know the library and the source code. We have a strong collaboration with the developers of the service!!!

mzML
mzML
cvList
referenceableParamGroupList
sampleList
instrumentConfigurationList
softwareList
dataProcessingList
acquisitionSettingsList
run
spectrum
spectrumDescription
precursorList
scan
binaryDataArray
binaryDataArray
• • •
spectrumList
spectrum
spectrum
• • •
chromatogramList
chromatogram
chromatogram
• • •
chromatogram
binaryDataArray
binaryDataArray
Metadata about the spectra plus all the spectra themselves.
The header at the top of the file encodes information about the source of the data, as well as information about the sample, instrument and software that processed the data.
CV terms are used to define the metadata and the properties of each element (software, instrument, sample, scan settings, etc.).
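A short sketch of what "properties as CV terms" looks like in practice, using only the Python standard library. The XML snippet is hand-written for illustration and is not a complete, valid mzML file. The function name `read_cv_params` is an assumption, and the two accessions are quoted from memory of the PSI-MS vocabulary, so please check them against the ontology before relying on them.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of an mzML <spectrum> element.
SNIPPET = """\
<spectrum xmlns="http://psi.hupo.org/ms/mzml" id="scan=1" index="0" defaultArrayLength="2">
  <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
  <cvParam cvRef="MS" accession="MS:1000127" name="centroid spectrum" value=""/>
</spectrum>
"""

# mzML elements live in this XML namespace, so lookups must be qualified.
NS = {"mz": "http://psi.hupo.org/ms/mzml"}


def read_cv_params(element):
    """Return {accession: (name, value)} for the cvParam children of an element."""
    params = {}
    for cv in element.findall("mz:cvParam", NS):
        params[cv.get("accession")] = (cv.get("name"), cv.get("value", ""))
    return params


spectrum = ET.fromstring(SNIPPET)
cv_params = read_cv_params(spectrum)
for accession, (name, value) in cv_params.items():
    print(accession, name, value)
```

The same function works on any element that carries cvParam children (software, instrument, sample), which is the point of describing everything with one mechanism.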
Chromatograms may be encoded in mzML in a special element that contains one or
more cvParams to describe the type of chromatogram, followed by two base64-
encoded binary data arrays.
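The base64-encoded arrays can be decoded with the standard library alone. This is a minimal sketch, assuming little-endian 32- or 64-bit floats with optional zlib compression. In a real file the precision and compression of each array are declared by its cvParams, whereas here they are passed by hand. The function names are illustrative.

```python
import base64
import struct
import zlib


def decode_binary_array(encoded, precision=64, compressed=False):
    """Decode a base64 string into a list of floats."""
    raw = base64.b64decode(encoded)
    if compressed:
        raw = zlib.decompress(raw)
    code = "d" if precision == 64 else "f"
    count = len(raw) // struct.calcsize("<" + code)
    return list(struct.unpack(f"<{count}{code}", raw))


def encode_binary_array(values, precision=64, compressed=False):
    """Inverse of decode_binary_array, used here to build test input."""
    code = "d" if precision == 64 else "f"
    raw = struct.pack(f"<{len(values)}{code}", *values)
    if compressed:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


# A tiny chromatogram: one time array and one intensity array.
times = [0.0, 0.5, 1.0]
intensities = [10.0, 250.0, 40.0]
encoded_times = encode_binary_array(times)
encoded_intensities = encode_binary_array(intensities, compressed=True)

print(decode_binary_array(encoded_times))
print(decode_binary_array(encoded_intensities, compressed=True))
```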

Comparison table
Metadata / file format    mzML                            mzData  mzXML  mgf  pkl  ms2  dta
Species                   X                               X       -      -    -    -    -
Tissue                    X                               X       -      -    -    -    -
Instrument                X                               X       X      -    -    -    -
Experiment Description    X                               -       -      -    -    -    -
References                X                               -       -      -    -    -    -
Contacts                  X                               X       X      -    -    -    -
Additional                X (FileContent / creationDate)  X       X      -    -    -    -
Samples                   X                               X       -      -    -    -    -
Instrument Configuration  X                               X       X      -    -    -    -
Data Processing           X                               X       X      -    -    -    -
mzML is supported by:
- Institute for Systems Biology , Seattle.
- Swiss Institute for Bioinformatics and Geneva Bioinformatics,
Switzerland.
- European Bioinformatics Institute, Hinxton, UK.
- Thermo Fisher, San Jose, CA.
- Indigo Biosystems, Carmel, IN.
mzML and mzXML are compatible with:
- Mascot!!!!, X! Tandem, OMSSA.
- PeptideProphet
It is not binary!!! You can read it with Notepad, and also with your own libraries and code.

ProteoWizard
msConvert
Vendor APIs: Thermo, Bruker, Agilent, Waters
File Input Supported:
– Thermo
– Bruker
– Agilent
– Waters
– pkl
– mgf
– dta
– ms2
File Output Supported:
– mzML
– mzXML
– mzData
– pkl
– mgf
Cross-platform !!!!
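A conversion is usually a single command-line call. The sketch below only builds and prints the command, so it can be read without ProteoWizard installed. The flags `--mzML`, `--mzXML`, `--mgf` and `-o` are the ones I understand msconvert to accept (confirm with `msconvert --help`). The helper name `build_msconvert_command` and the file names are hypothetical.

```python
import shlex

# Output format -> msconvert flag (subset; check `msconvert --help` for the full list).
FORMAT_FLAGS = {"mzML": "--mzML", "mzXML": "--mzXML", "mgf": "--mgf"}


def build_msconvert_command(input_path, output_format="mzML", output_dir="."):
    """Return the argument list for one msconvert call, without running it."""
    if output_format not in FORMAT_FLAGS:
        raise ValueError(f"unsupported output format: {output_format}")
    return ["msconvert", input_path, FORMAT_FLAGS[output_format], "-o", output_dir]


command = build_msconvert_command("sample.raw", "mzXML", "converted")
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run(command, check=True)` once the tool is on the PATH.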

Identification
Database Search: Mascot, X! Tandem, OMSSA, Phenyx
Post-processing: Mascot Percolator, PeptideProphet, Scaffold
De Novo Sequencing: Peaks, PepNovo
Spectral Library: SpectraST, NIST
Thousands of approaches!!! You can combine different programs, with different parameters and different workflows.

File Formats?
AnalysisXML: v1.0 candidate (Dec 08)
.dat
pepXML, protXML, AnalysisXML
Seattle Proteome Center at the Institute for Systems Biology
Programs with Excel output
OMSSA
Programs with their own output format

mzidentml
Collection of use cases agreed to cover, e.g. PMF, MS/MS, sequence tag, de novo, spectral library.
Protein Result Set
  Ambiguity Group 1
    Protein Hypothesis 1
      Pep Evidence 1
      Pep Evidence 2
    Protein Hypothesis 2
      Pep Evidence 1
      Pep Evidence 2
  Ambiguity Group 2
    Protein Hypothesis 1
      Pep Evidence 1
      Pep Evidence 2
    …
  …
Multiple Search Engines!!!
Protocol Descriptions, Database Properties, Search Engines, Parameters, Modifications.
Fully compatible with ontologies!!!
Supported by Mascot!!!
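The result hierarchy on this slide (result set, ambiguity groups, protein hypotheses, peptide evidence) maps naturally onto nested objects. The sketch below is an illustration, not the jmzidentml API. The class names follow the labels on the slide rather than the element names of the schema, and the accessions and peptide sequences are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PeptideEvidence:
    peptide_sequence: str


@dataclass
class ProteinHypothesis:
    accession: str
    evidence: List[PeptideEvidence] = field(default_factory=list)


@dataclass
class AmbiguityGroup:
    group_id: str
    hypotheses: List[ProteinHypothesis] = field(default_factory=list)


@dataclass
class ProteinResultSet:
    groups: List[AmbiguityGroup] = field(default_factory=list)


def accessions_for_peptide(result_set, sequence):
    """Return the sorted accessions of every hypothesis supported by a peptide."""
    found = set()
    for group in result_set.groups:
        for hypothesis in group.hypotheses:
            if any(e.peptide_sequence == sequence for e in hypothesis.evidence):
                found.add(hypothesis.accession)
    return sorted(found)


def ambiguous_groups(result_set):
    """Return the ids of groups holding more than one protein hypothesis."""
    return [g.group_id for g in result_set.groups if len(g.hypotheses) > 1]


# Made-up data: two proteins explained by the same peptides, one unambiguous protein.
shared = [PeptideEvidence("PEPTIDEA"), PeptideEvidence("PEPTIDEB")]
results = ProteinResultSet(groups=[
    AmbiguityGroup("group1", [ProteinHypothesis("P1", list(shared)),
                              ProteinHypothesis("P2", list(shared))]),
    AmbiguityGroup("group2", [ProteinHypothesis("P3", [PeptideEvidence("PEPTIDEC")])]),
])

print(accessions_for_peptide(results, "PEPTIDEA"))
print(ambiguous_groups(results))
```

Keeping the ambiguity explicit, instead of reporting only one protein per group, is what lets results from multiple search engines be compared later.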

mzidentml
• Results in mzIdentML format can be exported directly from Mascot (export of mzIdentML version 1.1 is available in Mascot version 2.3).
• Converters are currently available for Sequest and Proteome Discoverer output (.msf and .protXML), e.g. within ProCon (http://www.medizinisches-proteom-center.de/ProCon), and for OMSSA and X!Tandem (http://code.google.com/p/mzidentml-parsers/).
• The pipeline applications Scaffold (import into Scaffold PTM and export of mzIdentML available in Scaffold version 3) and TPP (results can be exported to mzIdentML via the ProteoWizard converter).
• A beta exporter is also available for Phenyx.
• OpenMS implements C++ code for reading and (as of release 1.9) writing mzIdentML.
• An open-source Java API for reading and writing mzIdentML has also been developed, available from http://code.google.com/p/jmzidentml/

Gels (nobody cares)
― Only limited support for the storage of detailed descriptions of all stages of a
gel-based proteomics workflow.
― Information is mostly restricted to unstructured text paragraphs.
Different scenarios: 1D, 2D, OFFGEL electrophoresis.
One of the reasons is the lack of widely accepted standards for representing gel data and the difficulties encountered in modelling the range of workflows employed in different settings.

gelml
GelML is basically a metadata file that contains the URI of the image file.
The structure of the schema is complex!!! One of the reasons is the number of different protocols.
It is not well documented, has a small community behind it, and is not widely adopted in the community!!!

Before Technical Things!!!
• The number of tools based on XML standard files is growing exponentially.
Why:
– Easy to read and write!!!
– They are standards!!!!
– Repository support (PRIDE, PeptideAtlas).
– They contain enough information for most programs.

APIs
• jmzml: Library to read/write information from mzML files.
• jmzidentml: Library to read/write information from mzIdentML files.
• jgelml: Library to read/write information from GelML files (currently in development).
• Developed by the PRIDE team.
• Java libraries.
• Still growing.
• Open-source and free.

ms-core-api
Applications: proteolims, N-terminal Identification, Web services
ms-core-api
APIs: jmzml, jmzxml, jmzData, jmzReader, jmzidml, jgelml
ms-core-api is a Java framework, a common object model to represent different file formats.
Currently supports:
― mzIdentML
― mzML, mzData, mzXML
― PRIDE XML, PRIDE database
― pkl, mgf, ms2, dta
― GelML (work in progress)
Cross-platform and well documented!!!
The aim of the ms-core-api library is to give our current development tools a common language of objects and classes!!!!
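The "common object model" idea can be sketched as one spectrum class fed by format-specific readers. The example is written in Python for brevity and is not the ms-core-api code, which is Java. Every name in it (`Spectrum`, `read_pkl`, `read_dta`, `read_spectrum`) is hypothetical. It assumes that the first pkl line holds precursor m/z, intensity and charge, and that the first dta line holds the singly protonated mass (MH+) and the charge.

```python
from dataclasses import dataclass
from typing import List, Tuple

PROTON_MASS = 1.007276  # approximate mass of a proton in Da


@dataclass
class Spectrum:
    """Format-independent view of one MS/MS spectrum."""
    precursor_mz: float
    charge: int
    peaks: List[Tuple[float, float]]


def _split(text):
    """Return the header line and the peak list of a single-spectrum file."""
    header, *rest = [line for line in text.splitlines() if line.strip()]
    peaks = [tuple(float(x) for x in line.split()[:2]) for line in rest]
    return header, peaks


def read_pkl(text):
    header, peaks = _split(text)
    mz, _intensity, charge = header.split()
    return Spectrum(float(mz), int(charge), peaks)


def read_dta(text):
    header, peaks = _split(text)
    mh, charge = header.split()
    z = int(charge)
    # Convert MH+ to the observed m/z at charge z (assumption stated above).
    mz = (float(mh) + (z - 1) * PROTON_MASS) / z
    return Spectrum(mz, z, peaks)


READERS = {"pkl": read_pkl, "dta": read_dta}


def read_spectrum(text, file_format):
    """Dispatch to the right reader; callers only ever see Spectrum objects."""
    return READERS[file_format](text)


PKL_TEXT = "814.27 22673800 1\n221.06 2529.3\n223.84 220.9\n"
DTA_TEXT = "539.3453 2\n86.1006 4.0000\n112.1109 3.0000\n"

pkl_spectrum = read_spectrum(PKL_TEXT, "pkl")
dta_spectrum = read_spectrum(DTA_TEXT, "dta")
print(pkl_spectrum.precursor_mz, dta_spectrum.precursor_mz)
```

A viewer or report written against `Spectrum` does not need to change when a new reader is added to `READERS`.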

The relevance of APIs concept
• Different programs can be used to implement the main functionalities.
• If you have APIs, then you just need to think about integration, scalability and presentation…
• Easy to maintain, to scale and to share…
• They are the “MAIN CORE”!!!

ms-core-api: good for…?
Spectrum Viewer
Identification Report

Think about reviewing our experiments
MetaData Report
Reviewer Panel

conclusion
• mzML is the current standard for MS/MS storage.
• mzIdentML will be the future standard in the proteomics community for peptide/protein identification storage.
• GelML is not widely adopted in the community, but so far it is the best option for gel information storage.
• ms-core-api supports mzML and mzIdentML, and in the near future GelML.
