Genome_annotation@BioDec:
Python all over the place.
Ivan Rossi
ivan@biodec.com
@rouge2507
Hello
● BioDec has been doing bioinformatics since 2002
● Bioinformatics software development
● Bioinformation management system, BioDecoders
● Bioinformatics Consulting
● Development, engineering and integration of custom solutions
● Annotated databases of biosequences (e.g. genomes)
● Our Forte
● Protein-sequence analysis
● Trans-membrane proteins
● Machine-learning
● Python is everywhere
The Challenge:
from Sequence to Function
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Protein Function
Gene Sequence
Protein Sequence (~10^7)
Protein Structure (10^5)
Problems in Sequence Analysis
Information Overflow:
very large sets of data available
High Throughput:
New data must be processed at high speed
(volume of data, time constraints)
Open Problems:
difficult to provide a simple first-principle or a
model-based solution
Alignments
OmpA APKDNTWYTGAKLGWS QYHDTGLINNNGPTHEN KLGAGAFGGYQV NPYVGFEMGYDWLGR
OEP21 IDTNTFFQVRGGLD TKT---------------GQPS SGSALIRHF YPNFSATLGVGVRYD
OmpA MPYKGSVENGA YKAQGVQLTAKLGYP ITDDLDIYTRLGGMVWRADT YSNVYGKN HDTGVS
OEP21 KQDSVGVRYAKND KLRYTVLAKKT FPVTNDGLVNFKIK GGCDVDQD-------FKE WKSR
OmpA PVFAGGVEYA I-TPEIATRLEYQW TNNIGDAHTIGTRPDNG MLSLGVSYRF G-----
OEP21 GGAEFSWNVF NFQKDQDVRLRIGYE AFEQV-PYLQIRE NNWTFNADYKGRWNVRYD L
Alignments of some kind are the main tool for
sequence comparison and database search
OmpA: PDB 1BXW, SwissProt OMPA_ECOLI
OEP21: Transmembrane Domain (24-177)
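As a reminder of what the simplest alignment of this kind computes, here is a minimal sketch of global alignment scoring (Needleman-Wunsch) in plain Python. The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not the ones behind the OmpA/OEP21 alignment above; real protein alignments typically use a substitution matrix and affine gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b (Needleman-Wunsch)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("AC", "AGC"))  # one gap, two matches
```

Only one row of the dynamic-programming table is kept at a time, which is enough when the score alone is needed; recovering the alignment itself would require the full table or back-pointers.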
Tools from machine learning
Artificial Neural Networks (ANNs)
Hidden Markov Models (HMMs)
Support Vector Machines (SVMs)
[Diagram: known sequences (DB subsets, e.g. TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN), known mapping, known structures; ANN, HMM, SVM; general rules; new sequence; ANN, HMM, SVM; prediction]
A 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 70 0 0 0 0 60 0 0 0 0 20 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 70 0 0 0
F 0 0 0 10 0 33 0 0 0 0 0 0 0 0 0 0
G 10 0 30 0 30 0 100 0 0 0 0 50 0 0 0 0
H 0 0 0 0 10 0 0 10 30 0 0 0 0 0 0 0
K 0 40 0 0 0 0 0 0 10 100 70 0 0 0 0 100
I 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 0 0 0 60 0 0
N 0 0 0 0 10 0 0 0 0 0 30 10 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Q 0 0 0 0 40 0 0 0 30 0 0 0 0 0 0 0
R 0 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 33 0 0 0 0 0 0 10 10 0 0
T 20 0 0 0 0 33 0 0 0 0 0 30 0 30 100 0
V 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0
W 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Y 70 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0
Evolutionary
Information
1 Y K D Y H S - D K K K G E L - -
2 Y R D Y Q T - D Q K K G D L - -
3 Y R D Y Q S - D H K K G E L - -
4 Y R D Y V S - D H K K G E L - -
5 Y R D Y Q F - D Q K K G S L - -
6 Y K D Y N T - H Q K K N E S - -
7 Y R D Y Q T - D H K K A D L - -
8 G Y G F G - - L I K N T E T T K
9 T K G Y G F G L I K N T E T T K
10 T K G Y G F G L I K N T E T T K
Sequence position
MSA
Seq. Profile
Sequence profile
Given a Multiple Sequence Alignment (MSA) of similar sequences,
associate with each position a 20-valued vector containing the
relative amino-acid composition of the aligned sequences.
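The definition can be sketched in a few lines of plain Python. The alignment below is the ten-row MSA from this slide; normalising by the number of non-gap residues in each column is an assumption, chosen because it reproduces the 33 and 100 entries of the table.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

MSA = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]


def sequence_profile(msa):
    """Return one {residue: percent} dict per alignment column, ignoring gaps."""
    profile = []
    for column in zip(*msa):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append(
            {aa: (100.0 * counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


profile = sequence_profile(MSA)
# Print only the non-zero entries of each column.
for position, column in enumerate(profile, start=1):
    print(position, {aa: round(p) for aa, p in column.items() if p})
```

Each column becomes a 20-valued vector, so a protein of length N is encoded as an N x 20 matrix that can be fed to the predictors.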
Why Python? (2.1.x, in 2002)
● Common ground, easy to pick up
● Expressive: productive, fast prototyping
● Maintainable: readable after months
● Useful tools and libs (e.g. BioPython)
● Retrospective:
We were f...ing RIGHT!
Hidden Markov Models
Very powerful tools when:
● The system can be modeled in probabilistic terms.
● There is a ‘grammar of the problem’
● There is a “limited sequential dependency” that can
model the problem (at least to a rough approx)
[Diagram: two-state model with states N and T; transition probabilities 0.99 and 0.01]
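To make the two-state picture concrete, here is a minimal Viterbi decoding sketch in plain Python. It assumes that 0.99 is the probability of staying in a state and 0.01 the probability of switching, and it reads N and T as "non-transmembrane" and "transmembrane"; both readings, the start probabilities and the emission probabilities (a single hydrophobic/non-hydrophobic split) are assumptions made for illustration, not the model used in production.

```python
import math

STATES = ("N", "T")
# Transition probabilities taken from the two-state diagram (assumed layout).
TRANS = {"N": {"N": 0.99, "T": 0.01}, "T": {"N": 0.01, "T": 0.99}}
START = {"N": 0.5, "T": 0.5}       # assumed uniform start
HYDROPHOBIC = set("AILMFVW")       # assumed residue classes
P_HYDROPHOBIC = {"N": 0.3, "T": 0.8}  # hypothetical emission probabilities


def emission(state, residue):
    p = P_HYDROPHOBIC[state]
    return p if residue in HYDROPHOBIC else 1.0 - p


def viterbi(sequence):
    """Most probable state path for the sequence, computed in log space."""
    if not sequence:
        return ""
    scores = {s: math.log(START[s]) + math.log(emission(s, sequence[0])) for s in STATES}
    back = []  # one back-pointer dict per residue after the first
    for residue in sequence[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev] + math.log(TRANS[best_prev][s])
                             + math.log(emission(s, residue)))
            pointers[s] = best_prev
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("D" * 10 + "L" * 20 + "D" * 10))
```

The high self-transition probability is what encodes the "limited sequential dependency": a single odd residue is not enough to flip the state, but a long hydrophobic stretch is.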
99HMMers
End
Start
Signal Peptide
TM1
TM2
TM3
TM4
TM5
TM6
TM7
Insertion loop
Inside loop
Outside loop
Profile-HMM, based on:
http://www.biocomp.unibo.it/piero/PHMM
BioPython
BioPython (http://biopython.org) is a community-developed (O|B|F)
set of Python libraries and tools for bioinformatics.
● Parsers for file formats and application output (vital)
● The Sequence objects
● Bio.SeqIO, Bio.AlignIO, Bio.PDB
● Specialized External-application wrappers
● BioSQL interface
BioSQL
BioSQL (http://www.biosql.org) is a generic
relational model (a schema) covering
sequences, features, sequence and feature
annotation, a reference taxonomy, and
ontologies.
● Works with all O|B|F Bio* projects
● We extended it to suit our special needs
Ruffus
Ruffus (http://www.ruffus.org.uk/) is a
Computation Pipeline library for Python, designed
to allow easy analysis automation.
● Acts like a pythonic Make on steroids
● Write your Python functions and decorate them
– @originate, @transform, @merge and more
● Pipeline handling
– Run pipelines make-style (pipeline_run)
– Schedule pipelines on SGE compute clusters (run_job)
Angler pipeline
Proteome
Generate
profiles
Predictions:
Signal peptides
Betabarrels
Alpha-helical TMP
Fold recognition
Coiled coils
Disordered regions
Sub-cellular localization
Classify
Proteome
Atlas (a DB)
Angler annotates and classifies protein sequences
ZenDock
Analyzes the protein solvent-exposed surface for putative
“interactor” residues, returning a “fuzzy” (probabilistic) answer.
Interactors are correlated and grouped into patches.
Results are mapped on the protein 3D structure and made
available through a web interface.
Contact-shell profile
Int non-Int
If you can't outrun them...
The Problem
● Full profile building is the slow step
– It takes 30 s to 5 min for a 3-pass PsiBlast run
(uniref90)
– Repeat for ~10^5 sequences … CPU weeks per genome.
● Major genomes updated every 3 months
● Micro-SME: limited resources
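The "CPU weeks" figure follows directly from the numbers above. A back-of-the-envelope check, assuming one profile per sequence and a single core:

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def cpu_weeks(n_sequences, seconds_per_profile):
    """Total single-core time, in weeks, to build one profile per sequence."""
    return n_sequences * seconds_per_profile / SECONDS_PER_WEEK


low = cpu_weeks(10**5, 30)        # 30 s per profile
high = cpu_weeks(10**5, 5 * 60)   # 5 min per profile
print(f"{low:.1f} to {high:.1f} CPU-weeks")
```

That is roughly 5 to 50 CPU-weeks per pass over ~10^5 sequences, to be repeated at every database update.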
… try to outsmart them.
● Sequence space is redundant
– Both intra-genome and inter-genome
● Profiles are built incrementally
– PsiBlast is an iterative algorithm
● PsiBlast is deterministic
– Given the same sequence, database, and number
of iterations you get the same profile
Our accelerator: the PyBlastCache
1) Hash the sequence
2) Version the reference protein database
3) Store computed profiles in a key-value store
– Key: a combination of seq. hash and DB version
4) Compute
● If full_key_match: skip_and_copy()
● If seq_key_match: update_profile(seq, itn=1)
● If no_key: create_profile(seq, itn=3)
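The decision logic can be sketched as follows. This is not the PyBlastCache code: the class name, the SHA-256 digest, the in-memory dict standing in for the key-value store, the callback signatures and the DB version strings are all assumptions made so that the example is self-contained and runs without PsiBlast.

```python
import hashlib


def seq_hash(sequence):
    """Stable digest of a protein sequence (SHA-256 is an assumed choice)."""
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


class ProfileCache:
    """In-memory stand-in for the key-value store of computed profiles."""

    def __init__(self, create_profile, update_profile):
        self._create = create_profile   # hypothetical: full 3-pass run
        self._update = update_profile   # hypothetical: one extra pass on an old profile
        self._store = {}                # (seq_hash, db_version) -> profile
        self._latest = {}               # seq_hash -> db_version stored last

    def get(self, sequence, db_version):
        digest = seq_hash(sequence)
        key = (digest, db_version)
        if key in self._store:          # full key match: reuse as-is
            return self._store[key], "copy"
        if digest in self._latest:      # sequence known, other DB version
            old = self._store[(digest, self._latest[digest])]
            profile = self._update(sequence, old, db_version, itn=1)
            action = "update"
        else:                           # never seen: build from scratch
            profile = self._create(sequence, db_version, itn=3)
            action = "create"
        self._store[key] = profile
        self._latest[digest] = db_version
        return profile, action


# Dummy callbacks that only record how many passes would have been run.
def create_profile(sequence, db_version, itn):
    return {"db": db_version, "passes": itn}


def update_profile(sequence, old_profile, db_version, itn):
    return {"db": db_version, "passes": old_profile["passes"] + itn}


cache = ProfileCache(create_profile, update_profile)
for version in ("2015_01", "2015_01", "2015_02"):
    profile, action = cache.get("MYSFPNSFRFGWSQAGFQ", version)
    print(version, action, profile)
```

The saving comes from the last two branches: a repeated sequence costs nothing, and a known sequence against a new database version costs one pass instead of three.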
The (Python) front-ends
● Plone: a CMS
– https://plone.org
● Web2py: an MVC framework
– http://www.web2py.com
● Galaxy: web interface + workflow engine
– Focus on reproducible research
– https://wiki.galaxyproject.org/
– SaaS: https://usegalaxy.org
Plone4Bio
● A BioSQL browser, based on Plone, to search and
display data and metadata (annotations) from
biosequence databases. Could integrate predictors;
● We publicly released the base version as open-source
software at http://plone4bio.org;
● Used to be the base for some commercial software
we sold to clients.
Plone4Bio screenshots
Bologna, 21/1/2010
LIMS features
Galaxy
Galaxy is an open, web-based platform for accessible,
reproducible, and transparent computational biomedical
research.
– Users without programming experience can easily specify
parameters and run tools and workflows.
– Galaxy captures information in order to allow complete repeats
of a computational analysis.
– Users share and publish analyses via the web and create
Pages, interactive, web-based documents that describe a
complete analysis.
● Accepted as material by peer-reviewed journals
Galaxy highlights
Galaxy is useful to both end users and bioinformatics devs.
● Get data directly from online DBs (UCSC, Biomart,...)
● Handling of data from lab instrumentation (e.g. NGS seqs)
● Map calculated data on online viewers (e.g. genome viewer)
● Easily extensible: wrapping a foreign tool is as simple as
writing an XML file.
● Data sharing (workflows, libraries, tools...)
● The community!
Snapshots
From https://usegalaxy.org
Visual programming
Thou Shalt Care For The DATA
● So much junk in the literature!!
– Both for features and data sets
● Use training, testing and validation sets
● The sets should always be disjoint
– Below 25% seq ID
● Redundancy is THE ENEMY
● Avoid feature bloat, use feature selection
● Always compare results with a nearest-neighbor method
– Good ones are really hard to beat
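A minimal sketch of the redundancy rule, in plain Python. It assumes the sequences are already aligned to the same length and measures identity over non-gap columns; in practice the identity would come from a proper alignment or clustering tool. The function names and the greedy strategy are illustrative choices.

```python
def percent_identity(a, b):
    """Percent identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def non_redundant(sequences, max_identity=25.0):
    """Greedy filter: keep a sequence only if it is below max_identity to all kept ones."""
    kept = []
    for seq in sequences:
        if all(percent_identity(seq, k) < max_identity for k in kept):
            kept.append(seq)
    return kept


def sets_are_disjoint(train, test, max_identity=25.0):
    """True if no training/testing pair reaches the identity threshold."""
    return all(percent_identity(a, b) < max_identity for a in train for b in test)


print(non_redundant(["AAAAAAAA", "AAAAAAAC", "CCCCCCCC"]))
print(sets_are_disjoint(["AAAAAAAA"], ["AAAAAAAC"]))
```

The same check applies across sets: a test sequence that is a near-copy of a training sequence makes any method look better than it is.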
No Free Lunch
● There is no killer method
– Choose the method that best models your domain
(e.g. sequences → HMMs)
– Data curation is always more important
● Be Humble, be Honest!
Meditation hint: http://www.no-free-lunch.org/
The community is your friend.
Give back to the community.
