Small, fast and useful – MMTF a new paradigm in macromolecular data transmission

mmtf.rcsb.org

Anthony R. Bradley, Alexander S. Rose, Yana Valasatava, Jose M. Duarte, Andreas Prlić, Peter W. Rose

Yet another file format???
Applications
Funding and acknowledgements
BD2K Targeted Software Development, Grant Number: U01 CA198942
Get the data
Three ways to get involved
http://mmtf.rcsb.org/

Already several early adopters
APIs provided
Cole Christie and Chris Randle
•  Steep increase in atoms per structure (37% between 2012 and 2016)
•  10,000 new structures added per year
•  68 of the 100 largest structures were deposited in the past three years
•  Largest structure contains 2.5 M atoms
•  EM has seen a sharp rise in recent years

Outcomes
•  Small: ~75% compression over mmCIF GZIP
•  Fast: parsing 2 orders of magnitude faster
•  Self-contained: no need for calls to external resources
•  Useful: bonding (bond order) and secondary structure info included in all files

What is it?
•  Binary: MessagePack (binary JSON format) used as a data container, http://msgpack.org/
•  Custom lossless compression: delta, run-length and dictionary encoding used to compress data
•  Open-source: specification and software libraries developed under Apache/MIT licenses
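The delta and run-length steps named above can be illustrated with a minimal Python sketch. It shows the general technique on a list of integers, such as atom serial numbers; it is not the exact byte layout of the MMTF specification, and all function names are hypothetical.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode by taking a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values: List[int]) -> List[int]:
    """Collapse runs into a flat [value, count, value, count, ...] list."""
    out: List[int] = []
    for v in values:
        if out and out[-2] == v:
            out[-1] += 1          # extend the current run
        else:
            out.extend([v, 1])    # start a new run
    return out


def run_length_decode(pairs: List[int]) -> List[int]:
    """Invert run_length_encode."""
    out: List[int] = []
    for value, count in zip(pairs[0::2], pairs[1::2]):
        out.extend([value] * count)
    return out


# Mostly consecutive serial numbers become a handful of integers.
serials = [1, 2, 3, 4, 5, 6, 10, 11]
packed = run_length_encode(delta_encode(serials))
print(packed)  # [1, 6, 4, 1, 1, 1]
assert delta_decode(run_length_decode(packed)) == serials
```

Chaining the two steps works well here because delta encoding turns a nearly sequential column into long runs of identical small numbers, which run-length encoding then collapses.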

Fast

•  Whole PDB archive converted to MMTF weekly
•  Individual files available from a REST API:
   wget http://mmtf.rcsb.org/v0.2/full/4hhb.mmtf.gz
•  Whole archive as a Hadoop sequence file:
   wget http://mmtf.rcsb.org/v0.2/hadoopfiles/full.tar
•  More details: http://mmtf.rcsb.org/download.html
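The per-entry download can also be scripted. The sketch below uses only the Python standard library and assumes the URL pattern from the wget example (v0.2/full/<id>.mmtf.gz); the function names are hypothetical, lowercasing the ID is an assumption based on that single example, and the server may not be reachable from every environment.

```python
import gzip
import urllib.request

# Prefix copied from the wget example; treat it as an assumption.
BASE_URL = "http://mmtf.rcsb.org/v0.2/full"


def mmtf_url(pdb_id: str) -> str:
    """Build the download URL for one PDB entry."""
    return f"{BASE_URL}/{pdb_id.lower()}.mmtf.gz"


def fetch_mmtf_bytes(pdb_id: str, timeout: float = 30.0) -> bytes:
    """Download one entry and return the gunzipped payload as raw bytes."""
    with urllib.request.urlopen(mmtf_url(pdb_id), timeout=timeout) as response:
        return gzip.decompress(response.read())


print(mmtf_url("4HHB"))  # http://mmtf.rcsb.org/v0.2/full/4hhb.mmtf.gz
```

The returned bytes are the MessagePack container; decoding them requires a MessagePack library or one of the MMTF APIs, which is not shown here.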

•  MMTF allows interactive data mining of the entire PDB archive
•  No need for SQL or setting up a database or schema
•  Queries on the entire archive in only a couple of minutes

1.  Use – use our API to do your own processing
2.  Adopt – incorporate MMTF into your toolkit
3.  Contribute – fork us on GitHub

Data mining
Efficient contact finding
Fragment generation
•  Generate all fragments from the protein chains in the PDB
•  Commonly done in, e.g., ab initio structure prediction
•  I/O is a key bottleneck in this process
•  MMTF allows such analysis to be done in a fraction of the time
•  More experiments can be done per day
•  No need to compromise on dataset size or parameters
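Once a chain is in memory, fragment generation itself is a sliding window. The sketch below is a generic illustration, not the authors' pipeline: it assumes a chain is any sequence (a string of residue codes here, but a list of residue records works the same way), and the function name is hypothetical.

```python
from typing import Iterator, Sequence, Tuple


def fragments(chain: Sequence, length: int) -> Iterator[Tuple[int, Sequence]]:
    """Yield (start_index, fragment) for every window of the given length."""
    if length <= 0:
        raise ValueError("fragment length must be positive")
    # A chain shorter than the window yields nothing.
    for start in range(len(chain) - length + 1):
        yield start, chain[start:start + length]


print(list(fragments("MKVLAT", 4)))
# [(0, 'MKVL'), (1, 'KVLA'), (2, 'VLAT')]
```

Because the window step is this cheap, reading and parsing the input tends to dominate the total run time, which is consistent with the I/O bottleneck noted above.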

Using a Mac mini with 2.6 GHz Intel Core i5 (4 cores) and 16GB RAM.

Small

High performance analysis
Hadoop sequence files are optimized for fast parallel and sequential access

Spark is a fast in-memory big data engine with clean and expressive APIs
http://spark.apache.org/

•  APIs and tools designed using the Apache Spark framework for fast parallel in-memory processing
•  Spark handles running code in a multi-threaded manner – no need to manage thread pools
•  Python, Java and Scala APIs available
•  Spark is used widely in other areas of bioinformatics (e.g., ADAM in genomics, http://bdgenomics.org/)

Efficient hashing algorithm
Inefficient looping algorithm

•  Inter-atomic contacts are often analyzed, e.g., for empirical force fields
•  MMTF lets an efficient contact-finding algorithm have a strong impact
•  Using mmCIF, the efficient algorithm provides only a ~10% speedup
•  Using MMTF, the same algorithm gives a ~90% speedup
•  MMTF promotes efficient downstream algorithm design
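The poster contrasts a looping algorithm with a hashing algorithm but does not give their details. The sketch below shows one common form of each, assuming coordinates are (x, y, z) tuples: an all-pairs loop, and a spatial hash that bins atoms into cubic cells of side `cutoff` and compares only neighbouring cells. It is an illustration of the general technique, not the authors' implementation, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple

Point = Tuple[float, float, float]


def contacts_brute_force(coords: Sequence[Point], cutoff: float) -> Set[Tuple[int, int]]:
    """Looping version: test every pair, O(n^2) distance checks."""
    pairs = set()
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.add((i, j))
    return pairs


def contacts_grid(coords: Sequence[Point], cutoff: float) -> Set[Tuple[int, int]]:
    """Hashing version: only atoms in the same or adjacent cells are compared."""
    grid: Dict[Tuple[int, int, int], List[int]] = defaultdict(list)
    for idx, (x, y, z) in enumerate(coords):
        cell = (math.floor(x / cutoff), math.floor(y / cutoff), math.floor(z / cutoff))
        grid[cell].append(idx)

    pairs = set()
    for (cx, cy, cz), members in grid.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            # .get avoids inserting empty cells while iterating over the grid
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                for i in members:
                    if i < j and math.dist(coords[i], coords[j]) <= cutoff:
                        pairs.add((i, j))
    return pairs


atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(contacts_grid(atoms, 5.0))  # {(0, 1)}
```

Both functions return the same set of index pairs; the grid version simply skips pairs that are too far apart to be in contact. `math.dist` requires Python 3.8 or later.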

Element    Occurrences    % of PDB
Carbon     431,487,468    43%
Oxygen     174,153,905    17%
Nitrogen   121,509,487    12%
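A count like the one in the table is a map-and-reduce over structures. The sketch below uses plain Python as a stand-in for a Spark job, on toy data rather than real PDB entries; it assumes each structure has already been reduced to a list of element symbols, and the function names are hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Dict, Iterable, List


def count_elements(elements: Iterable[str]) -> Counter:
    """Map step: tally element symbols within one structure."""
    return Counter(elements)


def count_archive(structures: Iterable[Iterable[str]]) -> Counter:
    """Reduce step: merge the per-structure tallies into one total."""
    return reduce(lambda a, b: a + b, map(count_elements, structures), Counter())


def percentages(totals: Counter) -> Dict[str, float]:
    """Express each element's count as a percentage of all atoms."""
    n = sum(totals.values())
    return {element: 100.0 * count / n for element, count in totals.items()}


# Toy input: two tiny "structures", not real PDB data.
toy: List[List[str]] = [["C", "C", "N", "O"], ["C", "O", "FE", "C"]]
totals = count_archive(toy)
print(totals.most_common(2))       # [('C', 4), ('O', 2)]
print(percentages(totals)["C"])    # 50.0
```

The per-structure tallies are independent, so the map step can run in parallel across structures, which is the property a framework such as Spark exploits.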

•  Efficient transmission and parsing of data is integral to Big Data initiatives, e.g., ADAM
•  No compressed format for macromolecules
•  Processing and analyzing macromolecules is a bottleneck
•  Visualizing large structures is challenging

•  Clean APIs to the data provided in commonly used languages
•  No need to write your own parser
•  No more parsers breaking

https://github.com/rcsb/mmtf-python
https://github.com/rcsb/mmtf-java
https://github.com/rcsb/mmtf-javascript

[Figure: Atoms per structure in the PDB]
[Figure: Time taken to find all C-alpha-C-alpha contacts using mmCIF and MMTF]
[Figure: Time to count all the elements in the PDB]
[Figure: Experiments run per 24 hours]
[Figure: EM atoms added to the PDB]
[Figure: Whole PDB archive GZIP compressed]
Series in the bar charts: MMTF, mmCIF. Other label: BioJava.
Bar labels recovered from the figures, in extraction order: 30 GB, 7 GB; <2 minutes, 400 minutes; 50, 6; 448, 404, 4; 640, 402, 4.

•  The Protein Data Bank (PDB) is a worldwide archive of macromolecular structures
•  Established in 1972, it has seen large growth over the past 30 years
•  Data are currently stored and transmitted in the PDB and mmCIF archival file formats
•  These formats are not appropriate for web-based and Big Data applications
