Introducing
MMseqs
MARTIN STEINEGGER
GENE CENTER MUNICH
Motivation
Sequence genome
Gene prediction
Search reads against UniProt
Map to protein / organism
Reads: 7 lanes × 200M reads ~ 7 × 200M seqs of 50 amino acids (1.4×10^9 protein seqs)
UniProt: 5×10^7 seqs
Blast: ~40 000 days (16 cores)
MMseqs: ~40 days (16 cores)
Growth of the UniProtKB/TrEMBL Protein Sequence Database
Result Protein Search

Method        Build & read index    Search time    Speed-up factor
MMseqs s=4    1h 17m                6m             950x
MMseqs s=7    1h 17m                11m            518x
swipe         36m                   2d 5h 34m      1.8x
BLAST         36m                   3d 23h 01m     1x
ublast        1h 52m                46m            127x
RAPsearch     2h 11m                10h 56m        9.5x
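
The speed-up column can be related to the search times with a few lines of arithmetic. The sketch below assumes the speed-up factor is the BLAST search time divided by the search time of the other tool; the helper names are made up for illustration. Because the times on the slide are rounded, some rows (for example ublast and RAPsearch) come out slightly different from the reported factors.

```python
import re

# Minutes per unit for durations written like "3d 23h 01m".
UNIT_MINUTES = {"d": 24 * 60, "h": 60, "m": 1}


def to_minutes(duration):
    """Convert a duration such as '2d 5h 34m' to a number of minutes."""
    return sum(int(value) * UNIT_MINUTES[unit]
               for value, unit in re.findall(r"(\d+)\s*([dhm])", duration))


def speed_up(baseline, other):
    """Ratio of the baseline search time to another tool's search time."""
    return to_minutes(baseline) / to_minutes(other)


if __name__ == "__main__":
    blast = "3d 23h 01m"
    for tool, search_time in [("MMseqs s=4", "6m"), ("MMseqs s=7", "11m"),
                              ("swipe", "2d 5h 34m")]:
        print(tool, round(speed_up(blast, search_time), 1))
```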
Benchmark: 7 616 proteins searched against UniProt (54 790 250 sequences)
ROC5

Ranked hits per query:
query 1: db 50, db 48
query 3: db 65, db 63, db 62, db 59, db 56
query 2: db 55, db 43
query 4: db 100, db 99

ROC5 value per query:
query 4: 0.2
query 1: 0.4
query 3: 0.6
query 2: 1.0

[Figure: fraction of queries (y axis: .25, .50, .75, 1.00) versus ROC5 (x axis: .2, .4, .6, 1.0); TP and FP marked; AUC 0.6]
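
The slide does not spell out how a ROC5 value is computed. The sketch below uses one common definition (for each of the first n false positives, count the true positives ranked above it, then normalise by n times the number of true positives); this definition and the function names are assumptions, not taken from the slides. The second function gives the "fraction of queries" curve plotted against ROC5.

```python
def roc_n(ranked_is_tp, total_positives, n=5):
    """ROC_n of one ranked hit list (best hit first).

    ranked_is_tp: list of booleans, True for a true positive, False for a
    false positive (ignored hits must be removed beforehand).
    total_positives: number of true positives that exist for this query.
    """
    if total_positives <= 0:
        raise ValueError("query needs at least one true positive")
    tp_seen = 0
    fp_seen = 0
    area = 0
    for is_tp in ranked_is_tp:
        if is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen          # TPs ranked above this FP
            if fp_seen == n:
                break
    # If the list ended before the n-th FP, pad with the TPs found so far.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_positives)


def fraction_of_queries_at_least(roc_values, threshold):
    """Fraction of queries whose ROC_n value reaches the threshold."""
    return sum(value >= threshold for value in roc_values) / len(roc_values)


if __name__ == "__main__":
    per_query = [0.4, 1.0, 0.6, 0.2]   # the four values listed on the slide
    for threshold in (0.2, 0.4, 0.6, 1.0):
        print(threshold, fraction_of_queries_at_least(per_query, threshold))
```

With the four values from the slide, the curve steps through 1.00, .75, .50 and .25, which matches the tick labels of the plot.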
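Result Protein Search
Queries: SCOP25 (7 616 sequences)
Database: SCOP25 + UniProtKB (283 406 sequences)
- true positive: same SCOP superfamily
- false positive: different SCOP fold
- ignore: same fold, different superfamily

[Figure: fraction of queries versus ROC5 for each method; curves annotated with the speed-up factors 950x, 518x, 127x, 9.5x, 1.8x and 1x]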
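
The true-positive / false-positive rule can be written down directly. This is a minimal sketch assuming each sequence carries a SCOP classification string of the form class.fold.superfamily.family (for example "a.1.1.2"); the function name is hypothetical.

```python
def classify_hit(query_sccs, hit_sccs):
    """Label a hit as 'TP', 'FP' or 'ignore' from two SCOP classification strings."""
    query = query_sccs.split(".")
    hit = hit_sccs.split(".")
    if query[:3] == hit[:3]:      # same class, fold and superfamily
        return "TP"
    if query[:2] != hit[:2]:      # different fold
        return "FP"
    return "ignore"               # same fold, different superfamily


if __name__ == "__main__":
    print(classify_hit("a.1.1.2", "a.1.1.3"))
    print(classify_hit("a.1.1.2", "a.1.2.1"))
    print(classify_hit("a.1.1.2", "b.1.1.1"))
```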
Workflow Protein Search
Prefiltering: search space 10^8 × 10^8, ~7 days for UniProt (5.4×10^7 sequences)
Alignment: search space 10^8 × 10^2, ~2 days

[Diagram: Query 1 ... Query n are matched against the database; prefiltering hits are passed on to the alignment step]

Prefiltering result:
query 1:
db 5: (123)
db 23: (68)
db 2: (32)
. . .

Alignment result:
query 1:
hit 1: 123
hit 2: 68
hit 3: 32
query n
...
Filtering Sequences with k-mers
[Figure: k-mer match dot plots of Sequence 1 against Sequence 2, one panel for homologous proteins and one for unrelated proteins]
Filtering Sequences with k-mers
2014/5/8
Exact matches of length 3 Similar matches of length 6
Filtering Sequences with k-mers
Exact 3-mer matches / Similar 6-mer matches
Information is power: k-mers as long as possible
But we need inexact matches to keep sensitivity high

k-mer setting         Prob. of chance k-mer match    Prob. of homologous match at 25% sequence identity
3-mer, exact          1.2×10^-3                      1/64
5-mer, exact          3×10^-7                        1/1024
5-mer, 25 similar     7.5×10^-6                      1/40
6-mer, 100 similar    1.5×10^-6                      1/40
7-mer, 400 similar    3×10^-7                        1/40

Chance match probability: keep low for high speed!
Homologous match probability: keep high for high sensitivity!
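
Several entries of the table can be approximated with simple arithmetic. The sketch below rests on two assumptions that are not stated on the slide: a uniform 20-letter amino acid alphabet for chance matches, and independent positions that each match with probability equal to the sequence identity for exact homologous matches. The 1/40 values for similar k-mers are taken from the slide and are not derived here, and the slide's own figures may rest on a different background model.

```python
ALPHABET_SIZE = 20  # assumption: uniform amino acid frequencies


def chance_match_probability(k, similar_kmers=1):
    """Probability that a random k-mer equals one of `similar_kmers` accepted k-mers."""
    return similar_kmers / ALPHABET_SIZE ** k


def exact_homologous_match_probability(k, identity=0.25):
    """Probability that k consecutive aligned positions are all identical."""
    return identity ** k


if __name__ == "__main__":
    for k, similar in [(5, 1), (5, 25), (6, 100), (7, 400)]:
        print(f"{k}-mer, {similar} accepted: {chance_match_probability(k, similar):.2e}")
    for k in (3, 5):
        print(f"exact {k}-mer at 25% identity: 1/{round(1 / exact_homologous_match_probability(k))}")
```

Longer k-mers drive the chance probability down, while allowing a list of similar k-mers keeps the homologous match probability from collapsing, which is the trade-off the slide points at.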
Prefiltering Algorithm
Most critical part of MMseqs regarding speed and memory consumption.
Calculates similarity scores on multiple CPUs.
Computationally intense parts are vectorized.

Query set: . . L G T M H W V R Q A . .

List of k-mers similar to the query k-mer (with score):
MHWVRQ  42
MHWVKQ  34
MHWVRE  34
. . .

Index table of database (k-mer → Seq. Ids):
AAAAAA
AAAAAR
. . .
MHWVRE → 5351, 43314, 2314, . . .
. . .
XXXXXX

Sum of scores per Db. Seq. Idx.:
1: 23 ... 2314: 11+34 ... 5351: 42+34 ... 43314: 12+34

Result of query 1:
db 5351: (123)
db 2314: (68)
db 2: (62)
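
The data flow on this slide (similar k-mer list, index table lookup, score sum per database sequence) can be sketched as follows. Everything here is a simplified assumption: the scoring function is a hypothetical toy rather than a real substitution matrix, similar k-mers are limited to single substitutions, and none of the names come from MMseqs itself.

```python
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical toy similarity groups, NOT a real substitution matrix.
SIMILAR_PAIRS = {frozenset("KR"), frozenset("DE"), frozenset("QE"), frozenset("IL")}


def pair_score(a, b):
    """Toy residue score: identity 4, similar pair 1, anything else -2."""
    if a == b:
        return 4
    return 1 if frozenset((a, b)) in SIMILAR_PAIRS else -2


def build_index(database, k):
    """Index table: k-mer -> set of ids of database sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def similar_kmers(kmer, min_score):
    """The k-mer itself plus single-substitution variants scoring >= min_score."""
    base = sum(pair_score(c, c) for c in kmer)
    found = {kmer: base}
    for pos, original in enumerate(kmer):
        for letter in ALPHABET:
            if letter == original:
                continue
            score = base - pair_score(original, original) + pair_score(original, letter)
            if score >= min_score:
                found[kmer[:pos] + letter + kmer[pos + 1:]] = score
    return found


def prefilter(query, index, k, min_score):
    """Sum similar-k-mer scores per database sequence, best sequence first."""
    totals = defaultdict(int)
    for i in range(len(query) - k + 1):
        for neighbour, score in similar_kmers(query[i:i + k], min_score).items():
            for seq_id in index.get(neighbour, ()):
                totals[seq_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    database = {5351: "LGTMHWVRQA", 2314: "AAMHWVKQAA", 2: "CCCCCCCCCC"}
    print(prefilter("GTMHWVRQ", build_index(database, 6), k=6, min_score=21))
```

The index is built once per database, so each query only pays for generating its similar k-mers and for the table lookups.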
Z-scores correct for background k-mer matches
- summed k-mer match score of query with target protein
- with ... from calibration run
- expected score from background matches
- number of expected chance k-mer matches
- Poisson-distributed matches
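
The formula itself did not survive on this slide. The sketch below is therefore only one plausible reading, not the authors' definition: it assumes the number of chance k-mer matches is Poisson distributed and that match scores are independent of it, so the background score follows a compound Poisson distribution with mean λ·E[X] and variance λ·E[X²]. All parameter names are hypothetical.

```python
import math


def background_z_score(summed_score, expected_matches,
                       mean_match_score, mean_squared_match_score):
    """Z-score of a summed k-mer score against a compound Poisson background.

    expected_matches: expected number of chance k-mer matches (Poisson mean).
    mean_match_score / mean_squared_match_score: first and second moment of a
    single chance match score, e.g. estimated in a calibration run.
    """
    expected_score = expected_matches * mean_match_score
    std_dev = math.sqrt(expected_matches * mean_squared_match_score)
    return (summed_score - expected_score) / std_dev


if __name__ == "__main__":
    # Hypothetical numbers: 4 expected chance matches, each scoring exactly 10.
    print(background_z_score(123, 4, 10, 100))
```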
Fast Smith-Waterman alignment using SSE2
Using Michael Farrar's version of the Smith-Waterman algorithm to align prefiltering outputs.

[Diagram: the prefiltering result of Query 1 ... Query n is turned into the alignment result, starting with Hit 1]
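
For reference, the recurrence that the vectorized version accelerates can be written as a plain scalar routine. This is a minimal sketch of the Smith-Waterman local alignment score with a linear gap penalty and simple match/mismatch scores (both assumptions); it is not Farrar's striped SSE2 implementation and makes no attempt at speed.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GTMHWVRQ", "MHWV"))
```

Only two rows of the dynamic programming matrix are kept, which is enough when just the score is needed.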
Multi core parallelization over query sequence
Thread level parallelization with OpenMP.
Splits the query database into packages and matches them against the database set.

Node:
Core 1: Query seqs. 0 - 25.000
Core 2: Query seqs. 25.001 - 50.000
Core 3: Query seqs. 50.001 - 75.000
Core 4: Query seqs. 75.001 - 100.000

Each core matches its package against the database set and writes its own results:
query 1:
db 5: 123
db 23: 68
db 2: 32
. . .
query k1:
db 12: 103
db 71: 58
db 92: 52
. . .
query k2:
db 15: 152
db 23: 88
db 24: 32
. . .
query k3:
db 5: 123
db 23: 68
db 2: 32
. . .
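
The package scheme can be illustrated without OpenMP. The sketch below swaps in Python's `ThreadPoolExecutor` purely to show the structure (split queries into packages, search each package independently, merge the results); the toy scoring function and all names are assumptions, and Python threads will not give the speed-up that native OpenMP threads do.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_packages(queries, n_packages):
    """Split a list of queries into at most n_packages contiguous packages."""
    size = max(1, -(-len(queries) // n_packages))  # ceiling division
    return [queries[i:i + size] for i in range(0, len(queries), size)]


def shared_kmers(a, b, k=3):
    """Toy similarity: number of distinct k-mers the two sequences share."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def search_package(package, database):
    """Match every query of one package against the whole database set."""
    results = {}
    for query_id, query in package:
        hits = [(db_id, shared_kmers(query, seq)) for db_id, seq in database.items()]
        results[query_id] = sorted((hit for hit in hits if hit[1] > 0),
                                   key=lambda hit: hit[1], reverse=True)
    return results


def parallel_search(queries, database, n_cores=4):
    """Run one package per worker and merge the per-package results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        partials = pool.map(lambda package: search_package(package, database),
                            split_into_packages(queries, n_cores))
        for partial in partials:
            merged.update(partial)
    return merged


if __name__ == "__main__":
    database = {5: "LGTMHWVRQA", 23: "GLTRETVSR", 2: "CCCCCC"}
    queries = [("q1", "MHWVRQ"), ("q2", "GLTRET"), ("q3", "AAAAAA")]
    print(parallel_search(queries, database))
```

Because every query is handled by exactly one worker, the merge step needs no locking.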
Multi node parallelization over database sequence
Levels of parallelism, from top to bottom:
1. Message Passing Interface
2. Thread Level Parallelism
3. Data Level Parallelism

Node 1: DB Seq. 0 - 100.000, all queries
Node 2: DB Seq. 100.001 - 200.000, all queries
Node 3: DB Seq. 200.001 - 300.000, all queries
The per-node results are aggregated.
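
The database split and the aggregation step can be sketched without MPI by treating each node as a plain function call. The scoring function and all names below are hypothetical; the point is only that every node sees all queries but a different database slice, and that merging the per-node hit lists reproduces the single-node result.

```python
def split_database(database, n_nodes):
    """Split the database into at most n_nodes slices by sequence id."""
    items = sorted(database.items())
    size = max(1, -(-len(items) // n_nodes))  # ceiling division
    return [dict(items[i:i + size]) for i in range(0, len(items), size)]


def identical_positions(a, b):
    """Toy score: number of identical residues at the same position."""
    return sum(x == y for x, y in zip(a, b))


def node_search(queries, db_slice):
    """What a single node does: all queries against its own database slice."""
    return {query_id: [(db_id, identical_positions(query, seq))
                       for db_id, seq in db_slice.items()]
            for query_id, query in queries.items()}


def aggregate(per_node_results):
    """Merge the hit lists of all nodes and re-sort them by score."""
    merged = {}
    for node_result in per_node_results:
        for query_id, hits in node_result.items():
            merged.setdefault(query_id, []).extend(hits)
    return {query_id: sorted(hits, key=lambda hit: hit[1], reverse=True)
            for query_id, hits in merged.items()}


if __name__ == "__main__":
    database = {1: "AAAA", 2: "AAAC", 3: "CCCC"}
    queries = {"q": "AAAA"}
    slices = split_database(database, 3)
    print(aggregate(node_search(queries, db_slice) for db_slice in slices))
```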
Why Sequence Clustering
[Figure: sequences (e.g. GLTRETVSR) grouped into clusters]
Workflow of MMseqs
Prefiltering → Alignment → Clustering
[Diagram: Query 1 ... Query n are matched against the database; hits (Hit 1, ...) are aligned and then clustered]
Clustering with greedy set cover
Linear time and space greedy set cover algorithm to cluster results.

[Diagram: query set and database set → alignment result → clustering result]
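
The greedy set cover idea can be illustrated as follows: repeatedly pick the sequence that covers the most still-unassigned sequences, make it a cluster representative, and assign its neighbours to it. This is a simple sketch under assumptions; it rescans all candidates in every round and is therefore quadratic, unlike the linear time and space algorithm named on the slide, and the input format (a neighbour set per sequence, as an alignment step might produce) is hypothetical.

```python
def greedy_set_cover(neighbours):
    """Cluster sequences given, for each sequence, the set of sequences it aligns to.

    Returns a dict mapping each representative to the set of its members.
    """
    unassigned = set(neighbours)
    clusters = {}
    while unassigned:
        # Representative = sequence covering the most unassigned sequences;
        # sorting makes ties deterministic.
        representative = max(
            sorted(unassigned),
            key=lambda seq: len((neighbours[seq] | {seq}) & unassigned))
        members = (neighbours[representative] | {representative}) & unassigned
        clusters[representative] = members
        unassigned -= members
    return clusters


if __name__ == "__main__":
    alignment_result = {
        "a": {"b", "c"}, "b": {"a"}, "c": {"a"},
        "d": {"e"}, "e": {"d"}, "f": set(),
    }
    print(greedy_set_cover(alignment_result))
```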
Cascaded Clustering
Each step runs Prefiltering → Alignment → Clustering on the remaining data to cluster:
Step 1: 90% sequence identity
Step 2: 50% sequence identity
Step 3: 20% sequence identity
[Diagram: speed versus sensitivity across the steps]
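
The cascade can be sketched as repeated clustering of cluster representatives at decreasing identity thresholds, so that the more sensitive steps only see the representatives left over from the previous step. The identity measure (ungapped, position by position) and the simple first-fit clustering below are stand-ins chosen for brevity, not the MMseqs procedure.

```python
def identity(a, b):
    """Toy ungapped sequence identity relative to the longer sequence."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def cascaded_clustering(sequences, thresholds=(0.9, 0.5, 0.2)):
    """Merge clusters step by step; only representatives are compared."""
    clusters = {name: [name] for name in sequences}  # representative -> members
    for threshold in thresholds:
        merged = {}
        for representative, members in clusters.items():
            for other in merged:
                if identity(sequences[representative], sequences[other]) >= threshold:
                    merged[other].extend(members)  # join an existing cluster
                    break
            else:
                merged[representative] = list(members)  # stays a representative
        clusters = merged
    return clusters


if __name__ == "__main__":
    sequences = {
        "s1": "AAAAAAAAAA",
        "s2": "AAAAAAAAAC",  # 90% identical to s1
        "s3": "AAAAACCCCC",  # 50% identical to s1
        "s4": "WWWWWWWWWW",  # unrelated
    }
    print(cascaded_clustering(sequences))
```

In this toy run, s2 joins s1 in the 90% step, s3 joins in the 50% step, and s4 stays alone.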
Updating
We created an updating mechanism that is able to detect changes and update the current database.
We also guarantee stable cluster identifiers.

[Diagram: old sequences, new sequences and deleted sequences; update = old result + new against old + new against new]

Updating: N × ΔN
Reclustering: N × N
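
A minimal sketch of such an update, under assumptions of my own: clusters are stored as identifier → member list with the first member acting as representative, deleted sequences are dropped, and each new sequence is compared against existing representatives (new against old) and against clusters created during the update (new against new). Old identifiers are never renumbered. The similarity test is a hypothetical callback.

```python
def update_clusters(old_clusters, current_sequences, is_similar):
    """Update a clustering without reclustering the old sequences.

    old_clusters: dict cluster_id (int) -> list of member names, first = representative.
    current_sequences: dict name -> sequence for everything that still exists.
    is_similar: function(seq_a, seq_b) -> bool.
    """
    updated = {}
    known = set()
    for cluster_id, members in old_clusters.items():
        kept = [name for name in members if name in current_sequences]
        if kept:                       # drop clusters that lost all members
            updated[cluster_id] = kept
            known.update(kept)
    next_id = max(old_clusters, default=-1) + 1   # fresh ids never reuse old ones
    for name, sequence in current_sequences.items():
        if name in known:
            continue                   # old sequence, already clustered
        for cluster_id, members in updated.items():
            if is_similar(sequence, current_sequences[members[0]]):
                members.append(name)
                break
        else:
            updated[next_id] = [name]
            next_id += 1
    return updated


if __name__ == "__main__":
    def mostly_equal(a, b):
        return sum(x == y for x, y in zip(a, b)) >= 3

    old = {0: ["a", "b"], 1: ["c"]}
    current = {"a": "AAAA", "c": "CCCC", "d": "AAAT", "e": "GGGG", "f": "GGGT"}
    print(update_clusters(old, current, mostly_equal))
```

Only the ΔN new sequences trigger comparisons, which is where the N × ΔN cost on the slide comes from, as opposed to N × N for a full reclustering.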
Clustering Results
Data set clustered: SCOP25 (7 616) + UniProtKB, 283 406 sequences

Method                    Clusters    Corrupted clusters    Seq. per cluster    Time
MMseqs s=4 naive clust    85 780      3.4                   3.4                 4m 03s
MMseqs s=4 set cover      60 915      1                     4.7                 4m 02s
MMseqs cascaded s=4       41 173      3                     7.0                 3m 35s
MMseqs s=7                29 801      2                     9.7                 9m 26s
MMseqs cascaded s=7       22 541      1                     12.9                5m 07s
blastclust                21 890      1                     13.3                7h 25m 01s
CD-HIT                    114 386     260                   2.5                 1h 25m 01s
kClust                    91 681      1                     3.2                 9m 57s
Usearch                   157 981     11                    1.8                 45s
Summary
- BLAST-like searches at up to 1000x speed
- Application on metagenomics datasets
- Copes with huge sequence data amounts
- Clustering large protein seq data sets with best sensitivity/speed

Outlook
- More sensitive core algorithm
- Profile searches => boosts sensitivity at same speed
- Applications in metagenomics
  - E.g. gut microbiomes for medical research, soil for agriculture etc.
- Nucleotide sequence version to be tested
Thanks
Maria Hauser: Development (Gene Center Munich, Ludwig-Maximilians-Universität)
Johannes Söding: PI (Max Planck Institute Göttingen)
Justas Dapkunas: Betatest (Institute of Biotechnology, Vilnius University)
Klaus Faidt: Betatest (Max Planck Institute Tübingen)
Borisas Bursteinas: Betatest (EBI: UniProt development)
Andreas Hauser: FFindex
Thank you
for your time.
Discussion
Backup
2014/5/8
MILOT MIRDITA
Result Protein Search
[Figure: TP versus FP]

ROC5

Ranked hits per query:
query 1: db 50, db 48
query 3: db 65, db 63, db 62, db 59, db 56
query 2: db 55, db 43
query 4: db 100, db 99

ROC over all queries (pooled ranking):
db 100, db 99, db 65, db 63, db 62, db 59, db 56, db 55, db 50, db 48, db 43
query 3 contributes ½ of the scores
query 4 contributes all highest scores

ROC5 value per query:
query 4: 0.2
query 1: 0.4
query 3: 0.6
query 2: 1.0

[Figure: fraction of queries (y axis: .25, .50, .75, 1.00) versus ROC5 (x axis: .2, .4, .6, 1.0); TP and FP marked; AUC 0.6]

MMseqs NGS 2014
