Compact representation of 3D macromolecular structures from the PDB
Presented by Yana Valasatava
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
The PDB evolving complexity
Structural biology efforts meet a big-data era:
● Growing size: ~120K structures, with annual growth of ~10K structures
● Evolving complexity: growing compositional heterogeneity and size
● Increasing usage: > 300,000 users per month from over 160 countries
PDB archive: > 30 GB
3J3Q: ~250 MB in mmCIF format
3J3Q has more than 1 million atoms
The PDB has more than 1 billion atoms
Scalability issues
★ Interactive visualization
○ slow network transfer
○ slow parsing
○ slow rendering
★ Mobile visualization
○ limited bandwidth
○ limited memory
★ Large-scale structural analysis
○ slow repeated I/O
○ slow repeated parsing
PDBx/mmCIF
Flexible, extensible, and verbose format with rich metadata, well suited for archival purposes.
● repetitive information
● redundant annotations
● inefficient representation
PDB/MMTF
The MacroMolecular Transmission Format
MMTF has the following advantages:
❏ it occupies less space (less disk I/O)
❏ it is faster to read (no time-consuming string parsing)
❏ it contains precalculated information useful for structural analysis
and visualisation (covalent bonds and bond orders)
Fields:
○ Format data (e.g. the version number of the specification)
○ Metadata (e.g. rFree and resolution)
○ Structure data (e.g. number of models, chains, groups, atoms)
○ Chain data (e.g. list of chain IDs, chain names)
○ Group data (e.g. list of group names, formal charges, bonds)
○ Atom data (e.g. B-factors, coordinates, occupancies)
https://github.com/rcsb/mmtf/blob/master/spec.md
MMTF compression pipeline
[Figure: pipeline stages shown on the slide: extract structural data (calculate bonds, SSE); integer encoding; dictionary encoding; run-length encoding; delta encoding; recursive indexing; the binary container format of MMTF; GZIP]
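As a rough illustration of why the encodings run before GZIP, the sketch below compresses the same integer series twice: once packed as-is and once after delta encoding. The `struct` packing is only a stand-in for the binary container, and the series, function names, and sizes are assumptions made for this example, not values from the talk.

```python
import gzip
import struct


def pack_int32(values):
    """Pack integers as big-endian 32-bit values (stand-in for a binary container)."""
    return struct.pack(f">{len(values)}i", *values)


def delta_encode(values):
    """Store the first value, then the difference to the previous value."""
    out = []
    previous = 0
    for v in values:
        out.append(v - previous)
        previous = v
    return out


# Hypothetical, regularly increasing series (e.g. serial numbers).
positions = list(range(0, 100_000, 7))

raw_size = len(gzip.compress(pack_int32(positions)))
delta_size = len(gzip.compress(pack_int32(delta_encode(positions))))

print(f"gzip of raw values:   {raw_size} bytes")
print(f"gzip of delta values: {delta_size} bytes")
```

Delta encoding turns the series into a long run of identical values, which a general-purpose compressor handles far better than the original increasing numbers.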
Compression pipeline: dictionary encoding
Group  Id  Symb.  AtmId  ResId  ChainIds  ResNum  x (Å)   y (Å)   z (Å)   Occ.  B-factor
ATOM   1   N      N      ARG    A         18      14.699  61.369  62.050  1.00  39.19
ATOM   2   C      CA     ARG    A         18      14.500  62.241  60.856  1.00  38.35
ATOM   3   C      C      ARG    A         18      13.762  61.516  59.729  1.00  36.05
Each distinct group is stored once. The ARG entry (index: 1):
{ "groupName": "ARG",
"singleLetterCode": "R",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": [ "N", "CA", "C" ],
"elementList": [ "N", "C", "C"] }
The sequence then refers to groups by index:
SER-GLY-ARG-SER-SER
groupTypeList: [ 2, 0, 1, 2, 2 ]
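A minimal sketch of the dictionary-encoding idea, assuming each group is reduced to its name (real entries also carry atom names, elements, and so on). The function names are illustrative and are not part of any MMTF library. Indices here are assigned in order of first appearance, so they differ from the numbering on the slide; the decoder is then run on the slide's own table and index list.

```python
def dictionary_encode(group_names):
    """Store each distinct group once and replace the sequence by indices."""
    unique_groups = []
    index_of = {}
    indices = []
    for name in group_names:
        if name not in index_of:
            index_of[name] = len(unique_groups)
            unique_groups.append(name)
        indices.append(index_of[name])
    return unique_groups, indices


def dictionary_decode(unique_groups, indices):
    """Rebuild the sequence by looking up each index in the group table."""
    return [unique_groups[i] for i in indices]


sequence = ["SER", "GLY", "ARG", "SER", "SER"]
groups, group_type_list = dictionary_encode(sequence)
print(groups, group_type_list)

# The slide's numbering: GLY = 0, ARG = 1, SER = 2.
slide_groups = ["GLY", "ARG", "SER"]
print("-".join(dictionary_decode(slide_groups, [2, 0, 1, 2, 2])))
```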
Compression pipeline: encodings
Group  Id  Symb.  AtmId  ResId  ChainIds  ResNum  x (Å)   y (Å)   z (Å)   Occ.  B-factor
ATOM   1   N      N      ARG    A         18      14.699  61.369  62.050  1.00  39.19
ATOM   2   C      CA     ARG    A         18      14.500  62.241  60.856  1.00  38.35
ATOM   3   C      C      ARG    A         18      13.762  61.516  59.729  1.00  36.05
integer + delta (x coordinates): 14.699, 14.500 -> 14699, 14500 -> 14699, -199
delta + run-length (atom Ids): 1,2,3 -> 1,1,1 -> 1,3
integer encoding: floating-point numbers are mapped to integers
run-length encoding: stretches of equal values are represented by the value itself and the occurrence count
delta encoding: differences (deltas) between consecutive numbers are stored
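The three encodings can be sketched in a few lines each. This is an illustrative implementation under simple assumptions (a multiplier of 1000 for three decimal places, flat value/count pairs for run lengths); the function names are made up for the example and are not taken from the MMTF libraries.

```python
def integer_encode(values, factor=1000):
    """Map floats to integers by scaling and rounding."""
    return [round(v * factor) for v in values]


def integer_decode(ints, factor=1000):
    """Map scaled integers back to floats."""
    return [i / factor for i in ints]


def delta_encode(ints):
    """Keep the first value, then store differences to the previous value."""
    out = []
    previous = 0
    for v in ints:
        out.append(v - previous)
        previous = v
    return out


def delta_decode(deltas):
    """Rebuild the values as a running sum of the deltas."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(ints):
    """Represent each run of equal values as a flat (value, count) pair."""
    out = []
    for v in ints:
        if out and out[-2] == v:
            out[-1] += 1
        else:
            out.extend([v, 1])
    return out


def run_length_decode(pairs):
    """Expand flat (value, count) pairs back into the full list."""
    out = []
    for value, count in zip(pairs[0::2], pairs[1::2]):
        out.extend([value] * count)
    return out


x_coords = [14.699, 14.500, 13.762]
x_encoded = delta_encode(integer_encode(x_coords))   # integer + delta
atom_ids = [1, 2, 3]
ids_encoded = run_length_encode(delta_encode(atom_ids))  # delta + run-length
print(x_encoded, ids_encoded)
```

Decoding applies the inverse steps in reverse order, for example `integer_decode(delta_decode(x_encoded))`.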
Compression pipeline: Recursive Indexing
Values are stored as an array of 8-bit integers, so every stored value must fit in the range -128 to 127. A value at or beyond a limit is split into repeated limit values (127 or -128) followed by a remainder that lies in the open interval (-128, 127):
Recursive Indexing: [-50, -128, 7, 127, 268] -> [-50, -128, 0, 7, 127, 0, 127, 127, 14]
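A short sketch of recursive indexing for 8-bit limits, written to reproduce the example on the slide. The function names and default arguments are assumptions for illustration.

```python
def recursive_index_encode(values, max_value=127, min_value=-128):
    """Split values that reach a limit into limit values plus a remainder."""
    out = []
    for v in values:
        if v >= 0:
            while v >= max_value:
                out.append(max_value)
                v -= max_value
        else:
            while v <= min_value:
                out.append(min_value)
                v -= min_value
        out.append(v)  # remainder, strictly between the two limits
    return out


def recursive_index_decode(encoded, max_value=127, min_value=-128):
    """Sum consecutive limit values with the remainder that follows them."""
    out = []
    total = 0
    for v in encoded:
        total += v
        if v != max_value and v != min_value:
            out.append(total)
            total = 0
    return out


original = [-50, -128, 7, 127, 268]
encoded = recursive_index_encode(original)
print(encoded)
print(recursive_index_decode(encoded))
```

Because a limit value always means "more follows", a value equal to a limit is written as the limit plus a remainder of 0, which is why 127 becomes 127, 0.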
Overview of data
Full format
• all atoms (useful for structural bioinformatics analysis)
• coordinates with 3 decimal place precision (no loss after decoding)
Reduced format
• C-alpha/phosphate backbone atoms and ligands (useful for visualisation and some structural bioinformatics)
• coordinates with 1 decimal place precision (almost a further 40% reduction in size)
• exactly the same data structure as the full format (parsers work for both)
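The effect of the two precisions can be checked with a small round trip. This sketch assumes coordinates are scaled by a power of ten and rounded; the helper names are hypothetical, and the sample values are the x coordinates from the earlier table.

```python
def quantize(values, decimals):
    """Scale by 10**decimals and round to the nearest integer."""
    factor = 10 ** decimals
    return [round(v * factor) for v in values]


def restore(ints, decimals):
    """Undo the scaling applied by quantize."""
    factor = 10 ** decimals
    return [i / factor for i in ints]


coords = [14.699, 14.500, 13.762]

full = restore(quantize(coords, 3), 3)     # 3 decimal places
reduced = restore(quantize(coords, 1), 1)  # 1 decimal place

max_error = max(abs(a - b) for a, b in zip(coords, reduced))
print(full, reduced, max_error)
```

With three decimal places the input values come back unchanged, while one decimal place keeps each coordinate within 0.05 Å of the original.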
MMTF size and parsing speed
* Parsing using Java libraries
Using MMTF
To efficiently store, transmit, and visualize the 3D structures of biological
macromolecules
To perform large-scale structural calculations such as geometric queries or
structural comparisons over the entire PDB archive held in memory
Presented by Anthony Bradley
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
Goals
• Analysis should be easy and simple
• Whole archive analysis of the PDB should be trivial AND fast
• Big Data tools (e.g. Spark and Hadoop) are available
mmtf-python
mmtf-java
Nobody should (have to) write their own parser. Ever.
MMTF-Spark - Simple API
MMTF-Spark - Simple API (continued)
Data mining - speed advantage
Contact finding
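The code shown on this slide is not part of the transcript. As a stand-in, the sketch below finds contacts with a brute-force pairwise distance check; it is not the MMTF-Spark API, and the function name, the cutoff values, and the fourth atom are assumptions made for illustration.

```python
import math
from itertools import combinations


def find_contacts(coords, cutoff=5.0):
    """Return index pairs (i, j), i < j, of atoms closer than `cutoff` Å."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if math.dist(a, b) < cutoff
    ]


atoms = [
    (14.699, 61.369, 62.050),  # N  of ARG 18
    (14.500, 62.241, 60.856),  # CA of ARG 18
    (13.762, 61.516, 59.729),  # C  of ARG 18
    (30.000, 30.000, 30.000),  # hypothetical distant atom
]

print(find_contacts(atoms))
print(find_contacts(atoms, cutoff=2.0))
```

A pairwise loop scales quadratically with the number of atoms, so for large structures a spatial grid or tree would normally replace it.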
Pros and cons
Pros:
● Looping through the whole library performing simple
analyses
● Simple to parallelize code
● Much more complete data
Cons:
● Tied to Java
● Not a magic unicorn
Thanks!
• http://mmtf.rcsb.org/
• https://github.com/rcsb/mmtf-javascript
• https://github.com/rcsb/mmtf-java
• https://github.com/rcsb/mmtf-python
• http://spark.apache.org/
Acknowledgements
NCI/NIH (U01 CA198942)
