Genome_annotation@BioDec: Python all over the place

Ivan Rossi
ivan@biodec.com
@rouge2507

Hello
● BioDec has been doing bioinformatics since 2002
● Bioinformatics software development
● Bioinformation management system, BioDecoders
● Bioinformatics Consulting
● Development, engineering and integration of custom solutions
● Annotated databases of biosequences (e.g. genomes)
● Our Forte
– Protein-sequence analysis
– Trans-membrane proteins
– Machine-learning
● Python is everywhere

The Challenge:
from Sequence to Function
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Protein Function
Gene Sequence
Protein Sequence (~10^7)
Protein Structure (10^5)

Problems in Sequence Analysis
Information Overflow:
very large sets of data available
High Throughput:
New data must be processed at high speed
(volume of data, time constraints)
Open Problems:
it is difficult to provide a simple first-principles or
model-based solution

Alignments
OmpA APKDNTWYTGAKLGWS QYHDTGLINNNGPTHEN KLGAGAFGGYQV NPYVGFEMGYDWLGR
OEP21 IDTNTFFQVRGGLD TKT---------------GQPS SGSALIRHF YPNFSATLGVGVRYD
OmpA MPYKGSVENGA YKAQGVQLTAKLGYP ITDDLDIYTRLGGMVWRADT YSNVYGKN HDTGVS
OEP21 KQDSVGVRYAKND KLRYTVLAKKT FPVTNDGLVNFKIK GGCDVDQD-------FKE WKSR
OmpA PVFAGGVEYA I-TPEIATRLEYQW TNNIGDAHTIGTRPDNG MLSLGVSYRF G-----
OEP21 GGAEFSWNVF NFQKDQDVRLRIGYE AFEQV-PYLQIRE NNWTFNADYKGRWNVRYD L
Alignments of some kind are the main tool for
sequence comparison and database search
OmpA: PDB 1BXW, SwissProt OMPA_ECOLI
OEP21: Transmembrane Domain (24-177)

Tools from machine learning
[Diagram: known sequences (DB subsets), e.g. TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN, and known structures provide a known mapping used to train an ANN, HMM or SVM; the trained model encodes general rules and yields a prediction for a new sequence]
● Artificial Neural Networks (ANNs)
● Hidden Markov Models (HMMs)
● Support Vector Machines (SVMs)

A 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 70 0 0 0 0 60 0 0 0 0 20 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 70 0 0 0
F 0 0 0 10 0 33 0 0 0 0 0 0 0 0 0 0
G 10 0 30 0 30 0 100 0 0 0 0 50 0 0 0 0
H 0 0 0 0 10 0 0 10 30 0 0 0 0 0 0 0
K 0 40 0 0 0 0 0 0 10 100 70 0 0 0 0 100
I 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 0 0 0 60 0 0
N 0 0 0 0 10 0 0 0 0 0 30 10 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Q 0 0 0 0 40 0 0 0 30 0 0 0 0 0 0 0
R 0 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 33 0 0 0 0 0 0 10 10 0 0
T 20 0 0 0 0 33 0 0 0 0 0 30 0 30 100 0
V 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0
W 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Y 70 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0
Evolutionary
Information
1 Y K D Y H S - D K K K G E L - -
2 Y R D Y Q T - D Q K K G D L - -
3 Y R D Y Q S - D H K K G E L - -
4 Y R D Y V S - D H K K G E L - -
5 Y R D Y Q F - D Q K K G S L - -
6 Y K D Y N T - H Q K K N E S - -
7 Y R D Y Q T - D H K K A D L - -
8 G Y G F G - - L I K N T E T T K
9 T K G Y G F G L I K N T E T T K
10 T K G Y G F G L I K N T E T T K
[Figure labels: MSA (rows 1-10) → sequence profile (one column per sequence position)]
Sequence profile
Given a Multiple Sequence Alignment (MSA) of similar sequences,
associate with each position a 20-valued vector containing the
relative amino-acid composition of the aligned sequences.
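The definition can be sketched in a few lines of Python. This is a minimal illustration, not BioDec's code: the function name `sequence_profile` is hypothetical, the rows are transcribed from the MSA on this slide, and gaps are assumed to be excluded from the per-column totals (which is what the 33/33/33 column of the slide's table suggests).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The ten aligned rows shown on the slide ("-" marks a gap).
MSA = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]


def sequence_profile(msa):
    """Return one {residue: percent} dict per alignment column.

    Assumption: gap characters are ignored, so each column is
    normalised by the number of residues actually aligned there.
    """
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (100.0 * counts[aa] / total if total else 0.0)
             for aa in AMINO_ACIDS}
        )
    return profile


profile = sequence_profile(MSA)
first_column = {aa: pct for aa, pct in profile[0].items() if pct > 0}
```

Here `first_column` holds the non-zero entries of the first position (Y, G and T), and each entry of `profile` is the 20-valued vector that a predictor would take as input for that position.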

Why Python? (2.1.x, in 2002)
● Common ground, easy to pick up
● Expressive: productive, fast prototyping
● Maintainable: readable after months
● Useful tools and libs (e.g. BioPython)
● Retrospective:
We were f...ing RIGHT!

Hidden Markov Models
Very powerful tools when:
● The system can be modeled in probabilistic terms.
● There is a ‘grammar of the problem’
● There is a “limited sequential dependency” that can
model the problem (at least as a rough approximation)
[Diagram: two-state model with states N and T and transition probabilities 0.99 and 0.01]
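To make the "limited sequential dependency" idea concrete, here is a minimal Viterbi decoder for a two-state N/T model like the one drawn on this slide. It is a sketch under stated assumptions: 0.99 is read as the probability of staying in a state and 0.01 as the probability of switching, while the start probabilities, the hydrophobic residue set and the emission values are invented for illustration and do not come from the slides.

```python
import math

STATES = ("N", "T")

# Assumed reading of the slide: 0.99 = stay in the state, 0.01 = switch.
TRANS = {
    "N": {"N": 0.99, "T": 0.01},
    "T": {"N": 0.01, "T": 0.99},
}
START = {"N": 0.5, "T": 0.5}      # assumption: no prior preference
HYDROPHOBIC = set("AFILMVW")      # illustrative residue class


def emission(state, residue):
    """Hypothetical emissions over two classes: hydrophobic or not."""
    p_hydrophobic = 0.8 if state == "T" else 0.3
    return p_hydrophobic if residue in HYDROPHOBIC else 1.0 - p_hydrophobic


def viterbi(sequence):
    """Return the most probable state path as a string of 'N'/'T'."""
    if not sequence:
        return ""
    score = {s: math.log(START[s]) + math.log(emission(s, sequence[0]))
             for s in STATES}
    backpointers = []
    for residue in sequence[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(emission(s, residue)))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


labels = viterbi("D" * 20 + "L" * 20)
```

With these toy parameters the high self-transition probability discourages frequent switching, so the decoder produces long runs of a single label rather than flipping state at every residue.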

99HMMers
[Diagram: HMM topology with the states Start, Signal Peptide, TM1 to TM7, Insertion loop, Inside loop, Outside loop and End]
Profile-HMM, based on:
http://www.biocomp.unibo.it/piero/PHMM

BioPython
BioPython (http://biopython.org) is a community-
developed (O|B|F) set of Python libraries and tools for
bioinformatics.
● The parsers for file formats and applications (vital)
● The Sequence objects
● Bio.SeqIO, Bio.AlignIO, Bio.PDB
● Specialized External-application wrappers
● BioSQL interface

BioSQL
BioSQL (http://www.biosql.org) is a generic
relational model (a schema) covering
sequences, features, sequence and feature
annotation, a reference taxonomy, and
ontologies.
● Works with all O|B|F Bio* projects
● We extended it to suit our special needs

Ruffus
Ruffus (http://www.ruffus.org.uk/) is a
Computation Pipeline library for Python, designed
to allow easy analysis automation.
● Acts like a pythonic Make on steroids
● Write your Python functions and decorate them
– @originate, @transform, @merge and more
● Pipeline handling
– Run pipelines make-style (pipeline_run)
– Schedule pipelines on SGE compute clusters (run_job)

Angler pipeline
Angler annotates and classifies protein sequences.
Proteome → Generate profiles → Predictions → Classify → Proteome Atlas (a DB)
Predictions:
● Signal peptides
● Betabarrels
● Alpha-helical TMP
● Fold recognition
● Coiled coils
● Disordered regions
● Sub-cellular localization

ZenDock
Analyzes the solvent-exposed surface of a protein for putative “interactor” residues, returning a “fuzzy” (probabilistic) answer.
Interactors are correlated and grouped into patches.
Results are mapped onto the protein 3D structure and made available through a web interface.
[Figure: contact-shell profile, Int vs non-Int residues]

If you can't outrun them...
The Problem
● Full Profile building is the slow step
– A 3-pass PsiBlast run takes from 30 seconds to 5 minutes
(against UniRef90)
– Repeat for ~10^5 … CPU weeks per genome.
● Major genomes updated every 3 months
● Micro-SME: limited resources
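The "CPU weeks" figure follows from simple arithmetic. The sketch below assumes that ~10^5 is the number of sequences needing a profile and that each run uses a single CPU; both are readings of the slide, not stated facts.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def cpu_weeks(n_sequences, seconds_per_run):
    """Total single-CPU time, in weeks, for one run per sequence."""
    return n_sequences * seconds_per_run / SECONDS_PER_WEEK


n_sequences = 10**5                        # assumption: one profile per sequence
fastest = cpu_weeks(n_sequences, 30)       # 30 seconds per run
slowest = cpu_weeks(n_sequences, 5 * 60)   # 5 minutes per run
print(f"{fastest:.1f} to {slowest:.1f} CPU weeks")  # 5.0 to 49.6 CPU weeks
```

Under these assumptions a full recomputation costs between about 5 and 50 CPU weeks, which is hard to sustain when major genomes are updated every three months.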

… try to outsmart them.
● Sequence space is redundant
– Both intra-genome and inter-genome
● Profiles are built incrementally
– PsiBlast is an iterative algorithm
● PsiBlast is deterministic
– Given the same sequence, database, and number
of iterations you get the same profile

Our accelerator: the PyBlastCache
1) Hash the sequence
2) Version the reference protein database
3) Store computed profiles in a key-value store
– Key as a combination of seq. hash and DB version
4) Compute
● If full_key_match: skip_and_copy()
● If seq_key_match: update_profile(seq, itn=1)
● If no_key: create_profile(seq, itn=3)
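The lookup logic above can be sketched as follows. This is not the PyBlastCache source: the class `ProfileCache`, the helper `sequence_hash`, the in-memory dict standing in for the key-value store, and the signatures of the `build` and `update` callables are all hypothetical. It also assumes that an update restarts from the most recently stored profile of the same sequence.

```python
import hashlib


def sequence_hash(seq):
    """Hash a normalised sequence so identical proteins share one key."""
    return hashlib.sha256(seq.strip().upper().encode("ascii")).hexdigest()


class ProfileCache:
    """Cache of profiles keyed by (sequence hash, database version)."""

    def __init__(self, build, update, store=None):
        self.build = build      # callable(seq, db_version, itn) -> profile
        self.update = update    # callable(seq, old_profile, db_version, itn) -> profile
        # Stand-in for a real key-value store: {seq_hash: {db_version: profile}}
        self.store = {} if store is None else store

    def get_profile(self, seq, db_version):
        versions = self.store.setdefault(sequence_hash(seq), {})
        if db_version in versions:
            # Full key match: reuse the stored profile as is.
            return versions[db_version], "hit"
        if versions:
            # Sequence known from an older DB version: one extra iteration,
            # starting from the most recently stored profile (assumption).
            old_profile = list(versions.values())[-1]
            profile = self.update(seq, old_profile, db_version, itn=1)
            status = "updated"
        else:
            # Unknown sequence: build the profile from scratch.
            profile = self.build(seq, db_version, itn=3)
            status = "created"
        versions[db_version] = profile
        return profile, status
```

In a real deployment `build` and `update` would wrap PsiBlast runs and the dict would be replaced by a persistent store; the point of the sketch is only that the expensive three-iteration run happens once per distinct sequence.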

The (Python) front-ends
● Plone: a CMS
– https://plone.org
● Web2py: an MVC framework
– http://www.web2py.com
● Galaxy: web interface + workflow engine
– Focus on reproducible research
– https://wiki.galaxyproject.org/
– SaaS: https://usegalaxy.org

Plone4Bio
● A BioSQL browser, based on Plone, to search and
display data and metadata (annotations) from
biosequence databases. Could integrate predictors;
● We publicly released the base version as open-source
software at http://plone4bio.org;
● Used to be the base for some commercial software
we sold to clients.

Bologna, 21/1/2010
LIMS features

Galaxy
Galaxy is an open, web-based platform for accessible,
reproducible, and transparent computational biomedical
research.
– Users without programming experience can easily specify
parameters and run tools and workflows.
– Galaxy captures information in order to allow complete repeats
of a computational analysis.
– Users share and publish analyses via the web and create
Pages, interactive, web-based documents that describe a
complete analysis.
● Accepted as material by peer-reviewed journals

Galaxy highlights
Galaxy is useful to both end users and bioinformatics developers.
● Get data directly from online DBs (UCSC, Biomart,...)
● Handling of data from lab instrumentation (e.g. NGS seqs)
● Map calculated data on online viewers (e.g. genome viewer)
● Easily extensible: wrapping a foreign tool is as simple as
writing an XML file.
● Data sharing (workflows, libraries, tools...)
● The community!

Snapshots
From https://usegalaxy.org

Thou Shalt Care For The DATA
● So much junk in the literature!!
– Both for features and data sets
● Use training, testing and validation sets
● The sets should always be disjoint
– Below 25% seq ID
● Redundancy is THE ENEMY
● Avoid feature bloat, use feature selection
● Always compare results with a nearest-neighbor method
– Good ones are really hard to beat
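One way to keep the sets disjoint is to split by cluster rather than by sequence. The sketch below assumes the sequences have already been clustered by an external tool at the chosen identity threshold (e.g. 25%); the function name `split_by_cluster` and the input mapping are hypothetical, and no identity computation is done here.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split sequence ids into (training, testing, validation) lists.

    cluster_of maps each sequence id to a precomputed cluster id.
    Whole clusters are assigned to one set, so similar sequences
    never end up on both sides of a split.
    """
    clusters = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        clusters[cluster_id].append(seq_id)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)   # reproducible shuffle

    n = len(cluster_ids)
    cut1 = round(fractions[0] * n)
    cut2 = cut1 + round(fractions[1] * n)
    groups = (cluster_ids[:cut1], cluster_ids[cut1:cut2], cluster_ids[cut2:])
    return tuple(
        sorted(seq_id for cid in group for seq_id in clusters[cid])
        for group in groups
    )
```

Because the fractions apply to clusters, not sequences, the resulting set sizes depend on how large the clusters are; this is the price of keeping redundancy out of the evaluation.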

No Free Lunch
● There is no killer method
– Choose the method that best models your domain
(e.g. sequences → HMMs)
– Data curation is always more important
● Be Humble, be Honest!
Meditation hint: http://www.no-free-lunch.org/

The community is your friend.
Give back to the community.
