Introduction to Bioinformatics
Andrzej Stefan Czech
Leeuwarden
2015-03-23

What is bioinformatics?
Bioinformatics is an interdisciplinary field that
develops methods and software tools for
understanding biological data. As an
interdisciplinary field of science, bioinformatics
combines computer science, statistics,
mathematics, and engineering to study and
process biological data.
http://en.wikipedia.org/wiki/Bioinformatics

A little bit of history
• 1951 – Sequencing peptide (Frederick Sanger)
• 1965 – Sequencing RNA (Robert Holley)
• 1970 – Term BIOINFORMATICS coined by
Paulien Hogeweg & Ben Hesper
• 1977 – Sequencing DNA (Frederick Sanger)
• 1990 – Human Genome Project started
(expected duration 15 years)
• 2003 – Human Genome Project completed

Why is bioinformatics so important?
• It’s all about money!!!!

Cost of sequencing
Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125

Cost of sequencing & data analysis
Sboner et al. Genome Biology 2011 12:125 doi:10.1186/gb-2011-12-8-125

Future of biological research
• With rapidly advancing automation, less human effort will be needed for sample preparation
• As the amount of information grows, data analysis will become more important
• The information output of experiments is growing beyond human capability, so high-level summaries and statistics are needed

Different flavors of bioinformatics
• Sequence analysis
• Annotation
• Data analysis
• Computational biology
• Structural bioinformatics
• Systems biology
• Scientific programming
• Expression analysis
• Network biology
• Biostatistics
• Computational genomics
• Databases

BIOINFORMATICS IN SEQUENCE ANALYSIS

High throughput sequencing
Platform | Run time | Read length | Output per run
• 454 Roche | ~10 hours | 400bp | 500 Mbp
• Illumina | ~250 hours | 2x100bp | >150 Gbp
• PacBio | 14 hours | 1..10kbp | ~50 Mbp
• Ion Torrent | 4 hours | 400bp | <10 Gbp
• Nanopore | 1..? hours | 1..?bp | 1..?bp

High throughput sequencing
Billions of short sequence fragments (= ‘reads’)
[Figure labels: ~170 DVDs; 800 Gb; 600.000.000.000 nucleotides = ~800.000x]

Quality filtering and trimming
TAGCGCAATACTTTCTGTTAGCGCAAATCCTAGTAGTGCAT
CCATGTGTGGGTTGTGTTNNNNNNNNNNNNNNNNNNNNNNN
AGTGGTATCAACGCAGAGTACGGGGGACCTTNNNNNNNNNN
CCATGTGTGGGTTGTGTTNNNNNNNNNNNNNNNNNNNNNNN
AGTGGTATCAACGCAGAGTACGGGGGACCTTNNNNNNNNNN

Quality filtering and trimming
AGTGGTATCAACGCAGAGTACGGG
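The trimming step on the two slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it strips the run of uncalled bases (N) at the end of each read and discards reads that end up too short. The function names and the `min_length=20` threshold are hypothetical, and real trimmers also use per-base quality scores, which the slides do not show.

```python
def trim_trailing_n(read: str) -> str:
    """Remove the run of uncalled bases (N) at the 3' end of a read."""
    return read.rstrip("N")


def filter_reads(reads, min_length=20):
    """Trim every read and keep only those still at least min_length long.

    min_length is an assumed threshold, not a value taken from the slides.
    """
    trimmed = (trim_trailing_n(read) for read in reads)
    return [read for read in trimmed if len(read) >= min_length]


# The three distinct reads shown on the slide.
reads = [
    "TAGCGCAATACTTTCTGTTAGCGCAAATCCTAGTAGTGCAT",
    "CCATGTGTGGGTTGTGTTNNNNNNNNNNNNNNNNNNNNNNN",
    "AGTGGTATCAACGCAGAGTACGGGGGACCTTNNNNNNNNNN",
]
kept = filter_reads(reads, min_length=20)
```

With these inputs the second read is dropped because too little sequence is left once its N tail is removed.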

Sequence search (BLAST)
• BLAST is one of the most commonly used bioinformatics programs
• It finds short sub-sequences of your query in the subject sequence
• It matches short words from the query against the subject database, then uses heuristics to verify and extend each match

Sequence search
Subject: GAGATGGTATCAACGCAGAGATCTGGTGTT
Query: AGAGATATGGT
Query words (length 4): AGAG, GAGA, AGAT, GATA, ATAT, TATG, ATGG, TGGT
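The word-matching step can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: it cuts the query into overlapping words of length 4, indexes the words of the subject in a dictionary, and reports every exact word hit. The function names are hypothetical.

```python
def words(seq, k=4):
    """All overlapping words (k-mers) of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def seed_hits(query, subject, k=4):
    """Return (query_pos, subject_pos) for every exact word match."""
    index = {}
    for pos, word in enumerate(words(subject, k)):
        index.setdefault(word, []).append(pos)
    hits = []
    for qpos, word in enumerate(words(query, k)):
        for spos in index.get(word, []):
            hits.append((qpos, spos))
    return hits


subject = "GAGATGGTATCAACGCAGAGATCTGGTGTT"
query = "AGAGATATGGT"
hits = seed_hits(query, subject)
```

Each hit is a seed that a search tool would then try to extend into a longer alignment.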

Sequence search
HSP = high-scoring segment pair
Subject (with gap): GAGA--TGGTATCAACGCAGAGATCTGGTGTT

Still possible HSP:
subject   GAGA--TGGT
match     ||||  ||||
query    AGAGATATGGT
score    0+1+1+1+1-3-2+1+1+1+1 = 3

Better HSP:
subject  AGAGATCTGGT
match    |||||| ||||
query    AGAGATATGGT
score    1+1+1+1+1+1-1+1+1+1+1 = 9
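The two scores on this slide can be reproduced with a small scoring function. The scheme used here is read off the slide's score lines and should be treated as an assumption: +1 for a match, -1 for a mismatch, -3 for opening a gap and -2 for extending it, with the query's overhanging first base scored as 0 (it is simply left out of the aligned strings).

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-2):
    """Score two equal-length aligned strings in which '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            # The first gap column costs gap_open, later ones gap_extend.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


# Placement at the start of the subject (gapped), overhanging A omitted.
gapped = score_alignment("GAGATATGGT", "GAGA--TGGT")
# Placement further along the subject (one mismatch, no gaps).
ungapped = score_alignment("AGAGATATGGT", "AGAGATCTGGT")
```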

Sequence/genome alignment
• Global alignment
– global optimization that "forces" the alignment
to span the entire sequences
(Needleman–Wunsch algorithm or Clustal style)
• Local alignment
– identify short regions of similarity within long
divergent sequences
(Smith–Waterman algorithm or BLAST style)

Sequence alignment
• Global alignment
seq 1  FTFTALILLAVAV
match  |  ||| ||| ||
seq 2  F--TAL-LLA-AV
• Local alignment
seq 1  FTFTALILL-AVAV
match    |||| || ||
seq 2  --FTAL-LLAAV--
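The difference between the two approaches shows up directly in the dynamic programming recurrences. The sketch below computes only the optimal scores (no traceback) for Needleman–Wunsch and Smith–Waterman. The scoring values (+1 match, -1 mismatch, -1 per gap position) are an assumption chosen for simplicity, so the optimal alignments may differ from the ones drawn on the slide.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning all of a with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


seq1 = "FTFTALILLAVAV"
seq2 = "FTALLLAAV"
print(global_score(seq1, seq2), local_score(seq1, seq2))
```

The only structural differences are the initialisation of the first row and column and the extra `0` in the local recurrence, which is what lets a local alignment ignore the unmatched ends.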

Genome alignment
• Glocal alignment
• Uses a word matching method
• Creates a suffix tree for faster search
• Searches the suffix tree for exact matches of words, clusters them, and then uses local alignment methods to extend each match
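The exact-match lookup can be illustrated with a suffix array, used here as a simpler stand-in for the suffix tree named on the slide (both index all suffixes of the reference so that a word can be located without scanning the whole sequence). The construction below is naive and only suitable for short examples, and the function names are hypothetical.

```python
from bisect import bisect_left


def build_suffix_array(text):
    """Sorted (suffix, start) pairs; naive construction for small inputs only."""
    return sorted((text[i:], i) for i in range(len(text)))


def find_exact(word, suffix_array):
    """Return the sorted start positions at which word occurs in the text."""
    # All suffixes that start with word are adjacent in sorted order,
    # so a binary search finds the first one and a short scan finds the rest.
    i = bisect_left(suffix_array, (word,))
    positions = []
    while i < len(suffix_array) and suffix_array[i][0].startswith(word):
        positions.append(suffix_array[i][1])
        i += 1
    return sorted(positions)


reference = "GAGATGGTATCAACGCAGAGATCTGGTGTT"
index = build_suffix_array(reference)
print(find_exact("TGGT", index))
```

A genome aligner would go on to cluster such exact matches and extend them with local alignment, which this sketch does not attempt.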

Suffix tree
[Figure: suffix tree; visible label: accg]
Source: Delcher et al. Nucleic Acids Research, 1999, Vol. 27, No. 11

Genome alignment

Assembly
• Short read assembly is extremely difficult and computationally intensive!
• For longer reads, Overlap-Layout-Consensus (OLC) assemblers are used
• For shorter reads (and in high numbers), De Bruijn graph assemblers are better
Source: Commins, Toft & Fares (CC BY-SA 2.5)
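A De Bruijn graph can be built from reads in a few lines. The sketch below is a toy illustration under strong assumptions (error-free reads, one strand, no repeats): every k-mer in a read becomes an edge from its first k-1 bases to its last k-1 bases, and a contig is spelled by walking a path that never branches. The reads and function names are made up for the example.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def spell_simple_path(graph):
    """Spell the contig for a graph that is one unbranched, acyclic path."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node")
    node = starts[0]
    contig = node
    seen = {node}
    while node in graph:
        successors = graph[node]
        if len(successors) != 1:
            raise ValueError("graph branches; a real assembler must resolve this")
        node = next(iter(successors))
        if node in seen:
            raise ValueError("graph contains a cycle")
        seen.add(node)
        contig += node[-1]
    return contig


reads = ["ACGTC", "CGTCA", "GTCAG"]  # hypothetical overlapping reads
graph = de_bruijn(reads, k=3)
contig = spell_simple_path(graph)
```

Real assemblers spend most of their effort on exactly the cases this sketch rejects: branches caused by repeats and sequencing errors.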

Source: Schatz, Delcher, Salzberg, http://www.genome.org/cgi/doi/10.1101/gr.101360.109.

Genome annotation
• Prediction of:
– Genes
– Repeats
– Non-coding RNAs (rRNAs, tRNAs, miRNAs, snRNAs, siRNAs, ta-siRNAs)
– Promoters
– Enhancers
– Protein binding sites
– …

Genome annotation
[Figure: gene model, 5’ to 3’: 5’ UTR / Promoter / RBS, Start codon, Exon 1, Intron, Exon 2, Intron, Exon 3, Stop codon, 3’ UTR]

Genome annotation
ACCTCTCACTCTTTCTTCTCATCTTCTTCAATTATAACAACCTAACCATGTCTTTAGAACAAGAAT
TTCTCTTCCTAATTCCTTCAACAATGGCCAATAATCTCAATCTCTTCCTTTGTTTCCTTCTCTTTATT
TGTTTCTTCACTCTCTGCCTTAGCCCTGGTGGTCTAGCTTGGGCTCTAATTTCAAAGCCAAAAAA
CCAATCCATCATTCCTAGAATCAGCTTATGAGCTTTTGTTTCATAGAGCAATGGGGTTTGCTCCAT
TGGAGAGTGTGTTTGTGTTGGTGATGAAGAAAATAAGAACAGCCCTAATGATAATGAAGATGAA
GATTTTGTTGATGTGTTGCTTGATTTGGAAAAGGAAAACAAACTCACCGACTCTGATATGATTGC
TGTGTTAGGTACGTATGTGTATTATAATTTCTTGTTTCATTACTATTTTGATATTTTTCTACTGCACT
TCAATTTTAATCGGTTTGAAATGATTTTTTAATATGCTCTTACAAGATTATGACTTGGGAAAGATTC
TTACATCTTTAAATATTTCAATTTTTTGTGTGATACATGAAATGCATGACTGTTTTTTACTTGCGATT
TACATGTTGAAATTTTCTTTACTTTGATATTCTATGTTTTTTTAACAAATTTTCTCTTAAATAAATGA
CATGTAGGAAATGACCTCCCAAATCTTCCTTATCTCCATGCAGTCGTCAAAGAGACTCTTAGAAT
GCACCCTCCCGGCCCACTTCTCTCTTGGGCACGCCTTGCCATCCATGACACCCATGTCGCAGGC
CACTACATTCCTGCTGGCACCACTGCGATGGTCAACATGTGGGCCATAACCCACGACGACCAAA
CTGTGGCTCGCTCAGTTAGTTCATAAGTTCGAATGGGTTCAAGCTGATGAATCGAAAATCAAAG
TGGATTTGTCTGAGTGTTTGAAGCTATCTCTGGAAATGAAACACCCTTTGATTTGTAGGGCTATC
CCGAGGAATGTAGGGTTCGAGTCTCACCCTGATCATGCATGACAGATTAAAAAAAAAAAGAA
AAGGCACATCTAGGGGAGCTTATTATGATATTATCATATGTTGAAAATTAAATGTGTTTGTTGCTTT
CTTTTCTTTTTTTCTTTTTCCTTTCTTCTTTCTCTTAATCAATTGATATTATATCTTGTGTGGAACAA
ATAGTATCGGATTCGAGATTTAATGTTGGGATAATCCTTAAATGTAATTCCGTTATTAAGTGTGAA

Genome annotation
ACCTCTCACTCTTTCTTCTCATCTTCTTCAATTATAACAACCTAACC[ATG]TCTTTAGAACAAGAAT
TTCTCTTCCTAATTCCTTCAACAATGGCCAATAATCTCAATCTCTTCCTTTGTTTCCTTCTCTTTATT
TGTTTCTTCACTCTCTGCCTTAGCCCTGGTGGTCTAGCTTGGGCTCTAATTTCAAAGCCAAAAAA
CCAATCCATCATTCCTAGAATCAGCTTATGAGCTTTTGTTTCATAGAGCAATGGGGTTTGCTCCAT
TGGAGAGTGTGTTTGTGTTGGTGATGAAGAAAATAAGAACAGCCCTAATGATAATGAAGATGAA
GATTTTGTTGATGTGTTGCTTGATTTGGAAAAGGAAAACAAACTCACCGACTCTGATATGATTGC
TGTGTTAG|GTACGTATGTGTATTATAATTTCTTGTTTCATTACTATTTTGATATTTTTCTACTGCACT
TCAATTTTAATCGGTTTGAAATGATTTTTTAATATGCTCTTACAAGATTATGACTTGGGAAAGATTC
TTACATCTTTAAATATTTCAATTTTTTGTGTGATACATGAAATGCATGACTGTTTTTTACTTGCGATT
TACATGTTGAAATTTTCTTTACTTTGATATTCTATGTTTTTTTAACAAATTTTCTCTTAAATAAATGA
CATGTAG|GAAATGACCTCCCAAATCTTCCTTATCTCCATGCAGTCGTCAAAGAGACTCTTAGAAT
GCACCCTCCCGGCCCACTTCTCTCTTGGGCACGCCTTGCCATCCATGACACCCATGTCGCAGGC
CACTACATTCCTGCTGGCACCACTGCGATGGTCAACATGTGGGCCATAACCCACGACGACCAAA
CTGTGGCTCGCTCAGTTAGTTCATAAGTTCGAATGGGTTCAAGCTGATGAATCGAAAATCAAAG
TGGATTTGTCTGAGTGTTTGAAGCTATCTCTGGAAATGAAACACCCTTTGATTTGTAGGGCTATC
CCGAGGAATGTAGGGTTCGAGTCTCACCC[TGA]TCATGCATGACAGATTAAAAAAAAAAAGAA
AAGGCACATCTAGGGGAGCTTATTATGATATTATCATATGTTGAAAATTAAATGTGTTTGTTGCTTT
CTTTTCTTTTTTTCTTTTTCCTTTCTTCTTTCTCTTAATCAATTGATATTATATCTTGTGTGGAACAA
ATAGTATCGGATTCGAGATTTAATGTTGGGATAATCCTTAAATGTAATTCCGTTATTAAGTGTGAA
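The markers in the annotated sequence above ([ATG] for the start codon, | for the exon/intron boundaries, [TGA] for the stop codon) are enough to reconstruct the spliced coding sequence. The helper below is a hypothetical sketch that assumes the | markers come in pairs enclosing each intron; it is demonstrated on a short made-up sequence rather than on the slide's gene, and it does not check the reading frame.

```python
import re


def spliced_cds(marked):
    """Join the exons between a marked start codon and a marked stop codon."""
    marked = "".join(marked.split())  # drop line breaks and spaces
    start = re.search(r"\[ATG\]", marked)
    stop = re.search(r"\[(TAA|TAG|TGA)\]", marked)
    if start is None or stop is None or stop.start() < start.start():
        raise ValueError("need a marked start codon followed by a marked stop codon")
    region = marked[start.start():stop.end()].replace("[", "").replace("]", "")
    segments = region.split("|")
    # Segments alternate exon, intron, exon, ... so the count must be odd.
    if len(segments) % 2 == 0:
        raise ValueError("splice markers must come in pairs")
    return "".join(segments[0::2])


def codons(cds):
    """Split a coding sequence into consecutive triplets."""
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


toy = "CC[ATG]GCC|GTAAGTTTCAG|AAA[TGA]TT"  # made-up example sequence
cds = spliced_cds(toy)
```

Gene prediction software has to find these boundaries itself; here they are given, which is the easy part.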

Genome annotation
Source: Joint Genome Institute

Expression analysis

STRUCTURAL BIOINFORMATICS

PDB and structural information
• The Protein Data Bank holds information about the structure of proteins, nucleic acids and complexes – over 100 000 entries!
• The 3D structure can be determined by:
– X-ray diffraction
– NMR
– Electron microscopy
– Simulations

HEADER TRANSCRIPTION 18-MAR-04 1VD4
TITLE SOLUTION STRUCTURE OF THE ZINC FINGER DOMAIN OF TFIIE ALPHA
COMPND 2 MOLECULE: TRANSCRIPTION INITIATION FACTOR IIE, ALPHA
COMPND 8 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE 10 EXPRESSION_SYSTEM_PLASMID: PET11D
KEYWDS ZINC FINGER, TRANSCRIPTION
EXPDTA SOLUTION NMR
NUMMDL 20
AUTHOR M.OKUDA,A.TANAKA,Y.ARAI,M.SATOH,H.OKAMURA,A.NAGADOI,
REMARK 500 CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400
REMARK 500
REMARK 500 M RES CSSEQI PSI PHI
REMARK 500 1 GLU A 118 -36.12 -163.20
REMARK 500 1 ARG A 119 -92.03 -138.92
REMARK 500 1 THR A 122 -70.74 -110.33
SITE 1 AC1 5 CYS A 129 CYS A 132 CYS A 154 CYS A 157
SITE 2 AC1 5 THR A 159
CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 1.000000 0.000000 0.000000 0.00000
SCALE3 0.000000 0.000000 1.000000 0.00000
MODEL 1
ATOM 1 N ARG A 113 1.980 -19.277 -19.127 1.00 0.00 N
ATOM 2 CA ARG A 113 1.202 -19.280 -17.853 1.00 0.00 C
ATOM 3 C ARG A 113 0.666 -17.875 -17.557 1.00 0.00 C
ATOM 4 O ARG A 113 0.625 -17.023 -18.421 1.00 0.00 O
ATOM 5 CB ARG A 113 2.199 -19.713 -16.778 1.00 0.00 C
ATOM 6 CG ARG A 113 2.435 -21.222 -16.875 1.00 0.00 C
ATOM 7 CD ARG A 113 3.604 -21.619 -15.971 1.00 0.00 C
ATOM 8 NE ARG A 113 2.986 -21.899 -14.645 1.00 0.00 N
ATOM 9 CZ ARG A 113 3.125 -23.073 -14.094 1.00 0.00 C
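The ATOM records above carry one atom per line: serial number, atom name, residue name, chain, residue number, x/y/z coordinates, occupancy, temperature factor and element. The sketch below parses them by splitting on whitespace, which works for the excerpt as shown on the slide. This is an assumption for illustration only: real PDB files use fixed column positions, and a whitespace split can fail when neighbouring fields run together.

```python
import math


def parse_atom(line):
    """Parse a whitespace-separated ATOM record into a dictionary."""
    fields = line.split()
    if len(fields) != 12 or fields[0] != "ATOM":
        raise ValueError("not a 12-field ATOM record")
    return {
        "serial": int(fields[1]),
        "name": fields[2],
        "res_name": fields[3],
        "chain": fields[4],
        "res_seq": int(fields[5]),
        "xyz": tuple(float(value) for value in fields[6:9]),
        "occupancy": float(fields[9]),
        "b_factor": float(fields[10]),
        "element": fields[11],
    }


def distance(atom_a, atom_b):
    """Euclidean distance between two atoms, in the file's units (angstroms)."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


# The first two ATOM lines of the excerpt.
n_atom = parse_atom("ATOM 1 N ARG A 113 1.980 -19.277 -19.127 1.00 0.00 N")
ca_atom = parse_atom("ATOM 2 CA ARG A 113 1.202 -19.280 -17.853 1.00 0.00 C")
bond = distance(n_atom, ca_atom)
```

The distance between the backbone N and CA atoms of this residue comes out at roughly 1.5 angstroms, a plausible covalent bond length.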

Source: PDB

COMPUTATIONAL BIOLOGY / SYSTEMS BIOLOGY

Molecular networks
• Bioinformatics is needed to describe interactions between proteins, DNA, drugs…
• When thousands of interactions are analyzed, network science comes into play
• The set of all protein-protein interactions in a single cell is called the interactome
• A single interaction can be studied in vivo/in vitro, but more complex networks can only be investigated in silico
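An interaction network is usually stored as a graph. The sketch below, with made-up protein names P1 to P5, builds an undirected adjacency structure from a list of pairwise interactions and picks out highly connected proteins (hubs). The `min_degree` cut-off is an arbitrary assumption.

```python
from collections import defaultdict


def build_network(interactions):
    """Undirected graph: each protein maps to the set of its partners."""
    network = defaultdict(set)
    for a, b in interactions:
        network[a].add(b)
        network[b].add(a)
    return network


def hubs(network, min_degree):
    """Proteins with at least min_degree interaction partners, sorted by name."""
    return sorted(
        protein for protein, partners in network.items()
        if len(partners) >= min_degree
    )


# Hypothetical interaction list, for illustration only.
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P5", "P4")]
network = build_network(interactions)
print(hubs(network, min_degree=3))
```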

Molecular networks
Source: Hennah & Porteous. PLoS ONE 4 (3): e4906. doi:10.1371/journal.pone.0004906

Metabolic pathways
• Bioinformatics is also useful for describing series of biochemical reactions that often happen in different cellular compartments
• Special (graph) databases had to be designed to describe pathways
• Modeling the flow of metabolites through a pathway is virtually impossible without computers
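A pathway can be represented as a directed graph from substrates to products. The sketch below encodes the first steps of glycolysis in a plain dictionary and uses a breadth-first search to find the shortest route between two metabolites. It is a deliberately reduced picture: enzymes, cofactors, reversibility and compartments are all left out, and the function name is hypothetical.

```python
from collections import deque


def shortest_route(pathway, source, target):
    """Breadth-first search for the shortest chain of reactions, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        route = queue.popleft()
        if route[-1] == target:
            return route
        for product in pathway.get(route[-1], []):
            if product not in seen:
                seen.add(product)
                queue.append(route + [product])
    return None


# Upper part of glycolysis, substrate -> products (simplified).
glycolysis = {
    "glucose": ["glucose 6-phosphate"],
    "glucose 6-phosphate": ["fructose 6-phosphate"],
    "fructose 6-phosphate": ["fructose 1,6-bisphosphate"],
    "fructose 1,6-bisphosphate": [
        "glyceraldehyde 3-phosphate",
        "dihydroxyacetone phosphate",
    ],
    "dihydroxyacetone phosphate": ["glyceraldehyde 3-phosphate"],
}
route = shortest_route(glycolysis, "glucose", "glyceraldehyde 3-phosphate")
```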

Metabolic pathways
Source: KEGG, www.genome.jp/kegg/pathway/map/map00010.html

Metabolic pathways
Source: KEGG, www.genome.jp/kegg/pathway/map/map01100.html

Simulation of biological systems
• Simulation of cell-cell interactions
• Description of interactions inside a population
• Interactions between species
• Food chains => food web
• Social relations
• Evolution of populations
• Modeling in pharmacology

BIOINFORMATICS TOOLS

Databases
• Relational databases (SQL): MySQL, Microsoft SQL Server, SAP
• Non-relational databases (NoSQL): Cassandra, MongoDB, CouchDB
• Graph databases: Neo4j, Bio4j
• RDF: N-Triples, RDF/XML, Bio2RDF

Databases
• Different types of public resources are available:
– Sequence data: nucleic sequence, protein sequence, EST, genome
– Metadata/Ontologies: functional annotation, gene models, gene ontologies
– Structural data: protein structure, complexes structure, RNA structure
– Variation data: SNP, SSR, indels
– Metabolic data: interactions, pathways

Databases
• Where to look for the data?

Databases
• How to use them?
– Browsing websites directly
– Downloading
– Using API

Text/data mining
• Obtaining information from several scientific resources is becoming more difficult as the volume of information grows
• The number of different resources/databases is growing, and a simple search has to be repeated for each of them
• Filtering relevant information is a big intellectual/computational burden

Text mining
• Retrieval, analysis and formatting (parsing)
of information into searchable databases
• Recognition of patterns
• Recognition of natural language
• Extraction of semantic or grammatical
relationships
• Coreference: terms that refer to the same
object

Text mining example
• Query: find promoters known to work in E. coli with the σ70 holoenzyme (Eσ70), also known as σD
• The query in SPARQL:
PREFIX sbol: <http://sbols.org/sbol.owl#>
PREFIX pr: <http://partsregistry.org/#>
SELECT DISTINCT ?name
WHERE {
  ?part a sbol:Part ;
        sbol:status ?st ;
        sbol:name ?name ;
        sbol:dnaSequence ?seq ;
        a pr:promoter ;
        a ?cl .
  FILTER (?cl = pr:sigma70_ecoli_prokaryote_rnap
          && ?st != 'Deleted')
}

Open source software
• Software that anyone can use, modify, share and distribute
• Source code is available and can (should!) be modified to fit the user's requirements
• Community-driven development
• Dynamic development and early releases
• Security and transparency

Open source software repositories
• CRAN (The Comprehensive R Archive Network)
• CodePlex

Specificity of working with Big Data

CAN I BE A BIOINFORMATICIAN, TOO?

How to become a bioinformatician?
• Get a computer with Linux
• Learn how to use the bash shell and how to run programs from the command line
• Learn to code in Python or Perl
• Try solving basic problems on

How to become a bioinformatician?
• Read blogs:
• Read fora for geeks:
• Get an account on:

Want to know more?
• Join my network on
http://nl.linkedin.com/in/andrzejstefanczech
• Come to Wageningen for an internship at
Genetwister Technologies B.V.
http://www.genetwister.nl/
• Slides from this lecture are also available on
SlideShare

Image credits: Biocomicals
