Compact representation
of 3D macromolecular
structures from the PDB
Presented by Yana Valasatava
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
The PDB evolving complexity
PDB archive
> 30 GB
~250 MB in
mmCIF format
Structural biology meets the big-data era:
● Growing size: ~120K structures, with annual
growth of ~10K structures
● Evolving complexity: growing
compositional heterogeneity and size
● Increasing usage: > 300,000 users per
month from over 160 countries
3J3Q
3J3Q has more than 1 million atoms
The PDB has more than 1 billion atoms
★ Interactive visualization
○ slow network transfer
○ slow parsing
○ slow rendering
★ Mobile visualization
○ limited bandwidth
○ limited memory
★ Large-scale structural analysis
○ slow repeated I/O
○ slow repeated parsing
Scalability issues
PDBx/mmCIF
Flexible, extensible, and verbose
format with rich metadata, well suited
for archival purposes.
repetitive information
redundant annotations
inefficient representation
PDB/MMTF
The MacroMolecular Transmission Format
MMTF has the following advantages:
❏ it occupies less space (less disk I/O)
❏ it is faster to read (no time-consuming string parsing)
❏ it contains precalculated information useful for structural analysis
and visualisation (covalent bonds and bond orders)
Fields:
○ Format data (e.g. the version number of the specification)
○ Metadata (e.g. rFree and resolution)
○ Structure data (e.g. number of models, chains, groups, atoms)
○ Chain data (e.g. list of chain IDs, chain names)
○ Group data (e.g. list of group names, formal charges, bonds)
○ Atom data (e.g. B-factors, coordinates, occupancies)
https://github.com/rcsb/mmtf/blob/master/spec.md
MMTF compression pipeline
integer encoding
dictionary encoding
run-length encoding
delta encoding
GZIP
recursive indexing
extract structural data
calculate bonds, SSE
The binary container format of MMTF
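The last stages of the pipeline turn encoded integers into bytes and compress them. The following is a minimal sketch of that idea only: it assumes big-endian 32-bit integers and uses Python's standard `struct` and `gzip` modules. The helper names `pack_int32` and `unpack_int32` and the sample data are hypothetical, and the actual container layout is defined in the MMTF specification, not reproduced here.

```python
import gzip
import struct


def pack_int32(values):
    """Pack a list of integers as big-endian signed 32-bit values (assumed layout)."""
    return struct.pack(f">{len(values)}i", *values)


def unpack_int32(blob):
    """Inverse of pack_int32: recover the integer list from the raw bytes."""
    count = len(blob) // 4
    return list(struct.unpack(f">{count}i", blob))


# Hypothetical, highly repetitive encoded array standing in for real data.
encoded = [14699, -199, -738] * 1000

raw = pack_int32(encoded)            # fixed-width binary, no string parsing needed
compressed = gzip.compress(raw)      # general-purpose compression as the final step
restored = unpack_int32(gzip.decompress(compressed))

print(len(raw), len(compressed), restored == encoded)
```

Because every value has a fixed width, a reader can recover the array with a single unpack call instead of tokenizing text.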
Compression pipeline: dictionary encoding
Group  Id  Symb.  AtmId  ResId  ChainIds  x, y, z coordinates (Å)   Occ.  B-factor
ATOM   1   N      N      ARG    A 18      14.699  61.369  62.050    1.00  39.19
ATOM   2   C      CA     ARG    A 18      14.500  62.241  60.856    1.00  38.35
ATOM   3   C      C      ARG    A 18      13.762  61.516  59.729    1.00  36.05
{ "groupName": "ARG",
"singleLetterCode": "R",
"chemCompType": "L-PEPTIDE LINKING",
"atomNameList": [ "N", "CA", "C" ],
"elementList": [ "N", "C", "C"] }
index: 1
SER-GLY-ARG-SER-SER
groupTypeList: [ 2, 0, 1, 2, 2 ]
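A minimal sketch of how the dictionary lookup above can be decoded. Only the ARG entry and the `groupTypeList` values come from the slide; the GLY and SER entries, the truncated atom lists, and the helper name `decode_groups` are assumptions made for illustration.

```python
# Each distinct group (residue type) is stored once; index 1 is the ARG entry
# shown on the slide. The GLY and SER entries are hypothetical placeholders.
group_list = [
    {"groupName": "GLY", "singleLetterCode": "G", "atomNameList": ["N", "CA", "C"]},
    {"groupName": "ARG", "singleLetterCode": "R", "atomNameList": ["N", "CA", "C"]},
    {"groupName": "SER", "singleLetterCode": "S", "atomNameList": ["N", "CA", "C"]},
]

# One small integer per residue replaces the repeated per-atom text.
group_type_list = [2, 0, 1, 2, 2]


def decode_groups(type_list, dictionary):
    """Expand dictionary indices back into group names."""
    return [dictionary[index]["groupName"] for index in type_list]


sequence = decode_groups(group_type_list, group_list)
print("-".join(sequence))  # SER-GLY-ARG-SER-SER
```

The saving comes from storing the atom names, elements, and chemical type once per group type instead of once per atom.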
Compression pipeline: encodings
Group  Id  Symb.  AtmId  ResId  ChainIds  x, y, z coordinates (Å)   Occ.  B-factor
ATOM   1   N      N      ARG    A 18      14.699  61.369  62.050    1.00  39.19
ATOM   2   C      CA     ARG    A 18      14.500  62.241  60.856    1.00  38.35
ATOM   3   C      C      ARG    A 18      13.762  61.516  59.729    1.00  36.05
Coordinates (integer + delta):
14.699 -> 14699
14.500 -> 14500
delta: 14500 - 14699 = -199
Atom serial numbers (delta + run-length):
1,2,3 -> 1,1,1 -> 1,3
integer encoding: floating-point numbers are mapped to integers by multiplying by a fixed factor (1000 in the example above)
run-length encoding: stretches of equal values are represented by the value itself and the
occurrence count
delta encoding: only the differences (deltas) between consecutive numbers are stored
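The three encodings defined above can be sketched in a few lines. This is an illustration under stated assumptions, not the reference implementation: the function names are hypothetical, and the factor of 1000 is inferred from the 3-decimal example.

```python
def integer_encode(values, factor=1000):
    """Map floats to integers by scaling; factor=1000 keeps 3 decimal places."""
    return [int(round(v * factor)) for v in values]


def delta_encode(values):
    """Keep the first value, then store each difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [cur - prev for prev, cur in zip(values, values[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original values."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values):
    """Flat list of (value, count) pairs for each stretch of equal values."""
    out = []
    for v in values:
        if out and out[-2] == v:
            out[-1] += 1          # extend the current run
        else:
            out.extend([v, 1])    # start a new run
    return out


# x coordinates of the three atoms in the table: integer + delta
x_int = integer_encode([14.699, 14.500, 13.762])
x_delta = delta_encode(x_int)
print(x_int, x_delta)  # [14699, 14500, 13762] [14699, -199, -738]

# atom serial numbers: delta + run-length
serial_rle = run_length_encode(delta_encode([1, 2, 3]))
print(serial_rle)  # [1, 3]
```

Delta encoding pays off when neighbouring values are close, and run-length encoding pays off when the deltas repeat, which is why the two are chained for serial numbers.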
Compression pipeline: Recursive Indexing
Group  Id  Symb.  AtmId  ResId  ChainIds  x, y, z coordinates (Å)   Occ.  B-factor
ATOM   1   N      N      ARG    A 18      14.699  61.369  62.050    1.00  39.19
ATOM   2   C      CA     ARG    A 18      14.500  62.241  60.856    1.00  38.35
ATOM   3   C      C      ARG    A 18      13.762  61.516  59.729    1.00  36.05
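The output is an array of 8-bit integer values, so the open interval is (-128, 127). Values strictly inside it are stored as-is; a value at or beyond a limit is written as repeated limit values followed by the remainder:
Recursive Indexing: [-50, -128, 7, 127, 268] -> [-50, -128, 0, 7, 127, 0, 127, 127, 14]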
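A short sketch of recursive indexing for 8-bit output that reproduces the example above. The function names and default limits are assumptions chosen for illustration.

```python
def recursive_index_encode(values, max_value=127, min_value=-128):
    """Split each value into limit-sized chunks plus a remainder that fits in 8 bits."""
    out = []
    for v in values:
        while v >= max_value:      # emit the upper limit until the rest fits
            out.append(max_value)
            v -= max_value
        while v <= min_value:      # same for the lower limit
            out.append(min_value)
            v -= min_value
        out.append(v)              # remainder, strictly inside the limits
    return out


def recursive_index_decode(values, max_value=127, min_value=-128):
    """Sum consecutive elements until one is not equal to either limit."""
    out, total = [], 0
    for v in values:
        total += v
        if v != max_value and v != min_value:
            out.append(total)
            total = 0
    return out


encoded = recursive_index_encode([-50, -128, 7, 127, 268])
print(encoded)  # [-50, -128, 0, 7, 127, 0, 127, 127, 14]
print(recursive_index_decode(encoded))  # [-50, -128, 7, 127, 268]
```

A value exactly equal to a limit is followed by a 0 so that the decoder knows where it ends.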
Overview of data
Full format
• all atoms (useful for structural bioinformatics analysis)
• coordinates with 3 decimal place precision (no loss after decoding)
Reduced format
• C-alpha/phosphate backbone atoms and ligands (useful for
visualisation and some structural bioinformatics)
• coordinates with 1 decimal place precision (a further reduction in size
of almost 40%)
• exactly the same data structure as the full format (parsers work for both)
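The reduction described above can be sketched as a filter plus coarser integer encoding. This is a simplified illustration: the tuple layout, the `BACKBONE_TRACE` set, and the function name `reduce_atoms` are hypothetical, atoms are selected by name only, and the retention of ligands is left out.

```python
# Hypothetical atom records: (atom name, group name, x, y, z)
atoms = [
    ("N", "ARG", 14.699, 61.369, 62.050),
    ("CA", "ARG", 14.500, 62.241, 60.856),
    ("C", "ARG", 13.762, 61.516, 59.729),
]

# C-alpha for proteins, phosphate for nucleic acids (matched by atom name only).
BACKBONE_TRACE = {"CA", "P"}


def reduce_atoms(atoms, factor=10):
    """Keep backbone-trace atoms and store coordinates with 1 decimal place."""
    reduced = []
    for name, group, x, y, z in atoms:
        if name in BACKBONE_TRACE:
            scaled = tuple(int(round(c * factor)) for c in (x, y, z))
            reduced.append((name, group) + scaled)
    return reduced


reduced = reduce_atoms(atoms)
print(reduced)  # [('CA', 'ARG', 145, 622, 609)]
```

Because the output keeps the same record shape as the input, code written for the full data works on the reduced data without changes.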
MMTF size and parsing speed
* Parsing using Java libraries
Using MMTF
To efficiently store, transmit, and visualize the 3D structures of biological
macromolecules
To perform large-scale structural calculations such as geometric queries or
structural comparisons over the entire PDB archive held in memory
Presented by Anthony Bradley
Postdoctoral Researcher
Structural Bioinformatics Group
San Diego Supercomputer Center
Goals
• Analysis should be easy and simple
• Whole archive analysis of the PDB should be trivial
AND fast
• Big Data tools (e.g. Spark and Hadoop) are available
mmtf-python
mmtf-java
Nobody should (have to) write their own parser. Ever.
MMTF-Spark - Simple API
Data mining - speed advantage
Contact finding
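As a rough illustration of what contact finding involves once coordinates are decoded, the sketch below lists atom pairs closer than a distance cutoff. It is a brute-force pairwise check in plain Python, not the Spark code used in the talk; the function name, the cutoff, and the fourth far-away point are assumptions.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def find_contacts(coords, cutoff=5.0):
    """Return index pairs (i, j), i < j, whose distance is within the cutoff (in Å)."""
    contacts = []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        if dist(a, b) <= cutoff:
            contacts.append((i, j))
    return contacts


# The three ARG atoms from the earlier table plus one hypothetical distant atom.
coords = [
    (14.699, 61.369, 62.050),  # N
    (14.500, 62.241, 60.856),  # CA
    (13.762, 61.516, 59.729),  # C
    (50.000, 50.000, 50.000),  # hypothetical atom far from the others
]

contacts = find_contacts(coords, cutoff=2.0)
print(contacts)  # [(0, 1), (1, 2)]
```

The pairwise loop scales quadratically with the number of atoms, so large structures would need a spatial index or a grid; the sketch only shows the query itself.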
Pros and cons
Pros:
● Looping through the whole library performing simple
analyses
● Simple to parallelize code
● Much more complete data
Cons:
● Tied to Java
● Not a magic unicorn
Thanks!
• http://mmtf.rcsb.org/
• https://github.com/rcsb/mmtf-javascript
• https://github.com/rcsb/mmtf-java
• https://github.com/rcsb/mmtf-python
• http://spark.apache.org/
Acknowledgements
NCI/NIH (U01 CA198942)
