Bioinformatics. Prof. Wim Van Criekinge, 18 December 2012, VUmc, Amsterdam
Demo/Example • Exercise: how good are the random numbers? • If they are good, you can use them for Pi ...
Compute Pi using two random numbers ...
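Estimating Pi from pairs of random numbers can be sketched as a small Monte Carlo experiment: draw points (x, y) uniformly in the unit square and count how many fall inside the quarter circle of radius 1. The function name `estimate_pi`, the fixed seed and the number of trials are assumptions made for illustration; they are not taken from the slides.

```perl
use strict;
use warnings;

# Estimate Pi: the fraction of random points with x^2 + y^2 <= 1
# approximates the area of a quarter circle, which is Pi/4.
sub estimate_pi {
    my ($trials) = @_;
    my $inside = 0;
    for (1 .. $trials) {
        my ($x, $y) = (rand(1), rand(1));
        $inside++ if $x * $x + $y * $y <= 1;
    }
    return 4 * $inside / $trials;
}

srand(1);                          # fixed seed so that runs are repeatable
my $pi = estimate_pi(100_000);
printf "Pi is approximately %.4f\n", $pi;
```

The estimate improves only slowly with more trials, so a poor random number generator shows up as an estimate that does not settle near 3.14.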
Introduction: Buffon's Needle is one of the oldest problems in the field of geometrical probability. It was first stated in 1...
–http://www.csse.monash.edu.au/~damian/papers/HTML/Perligata.html                                                         ...
Programming       • Variables       • Flow control (if, regex …)       • Loops       • input/output       • Subroutines/ob...
What is a regular expression? • A regular expression (regex) is simply a   way of describing text. • Regular expressions a...
Regular Expression Review• A regular expression (regex) is a way of  describing text.• Regular expressions are built up of...
Why would you use a regex?• Often you wish to test a string for the  presence of a specific character, word, or  phrase  E...
Regular Expressions: Match to a sequence of characters. The EcoRI restriction enzyme cuts at the consensus sequence GAATTC. To ...
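A minimal sketch of matching the EcoRI site with a regular expression. The helper name `ecori_position` and the example sequence are made up for illustration; the match is case-sensitive as written (adding the `i` modifier would accept lowercase input too).

```perl
use strict;
use warnings;

# Return the 0-based start of the first EcoRI site (GAATTC), or -1 if absent.
sub ecori_position {
    my ($sequence) = @_;
    # After a successful match, $-[0] holds the offset where the match began.
    return $sequence =~ /GAATTC/ ? $-[0] : -1;
}

my $dna = "AACGTGAATTCGGTA";       # made-up example sequence
my $pos = ecori_position($dna);
print $pos >= 0 ? "EcoRI site at position $pos\n" : "no EcoRI site\n";
```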
Regex-style              [m]/PATTERN/[g][i][o]         s/PATTERN/PATTERN/[g][i][e][o]     tr/PATTERNLIST/PATTERNLIST/[c][d...
Regular Expressions: Match to a character class • Example • The BstYI restriction enzyme cuts at the consensus sequence rGATC...
Constructing a Regex • Pattern starts and ends with a /          /pattern/    if you want to match a /, you need to escap...
Looking for a pattern• By default, a regular expression is applied to $_  (the default variable)   if (/a+/) {die}     lo...
Regular Expression Atoms • An 'atom' is the smallest unit of a regular expression. • Character atoms: 0-9, a-Z match the...
Quantifiers • You can specify the number of times you want to see an atom. Examples • \d* : Zero or more times • \d+ :...
Anchors • Anchors force a pattern match to a certain location • ^ : start matching at beginning of string • $ : start ma...
Remembering Stuff• Being able to match patterns is good, but  limited.• We want to be able to keep portions of the  regula...
Memory Parentheses (pattern memory)• Since we almost always want to keep portions  of the string we have matched, there is...
Getting at pattern memory• Perl stores the matches in a series of default  variables. The first parentheses set goes into ...
Finding all instances of a match • Use the 'g' modifier to the regular expression: @sites = $sequence =~ /(TATTA)/g; ...
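The global match can be sketched as follows. In list context `/g` returns every match at once; in a `while` loop it steps through the matches one at a time, which also lets you record where each one starts. The helper `site_starts` and the example sequence are illustrative assumptions.

```perl
use strict;
use warnings;

# Return the 0-based start positions of every (non-overlapping) motif match.
sub site_starts {
    my ($sequence, $motif) = @_;
    my @starts;
    while ($sequence =~ /\Q$motif\E/g) {
        push @starts, $-[0];       # @- holds the start offset of the last match
    }
    return @starts;
}

my $sequence = "GGTATTACCTATTAGG"; # made-up example sequence
my @sites    = $sequence =~ /(TATTA)/g;          # list context: all matches
my @starts   = site_starts($sequence, "TATTA");

print scalar(@sites), " sites found, starting at: @starts\n";
```

Note that `/g` continues after the end of the previous match, so overlapping occurrences are not reported.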
Perl is Greedy • In addition to taking all your time, Perl regular expressions also try to match the largest possible str...
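A short sketch of greedy versus non-greedy matching. The sequence is a made-up example with two TAG triplets; adding `?` after a quantifier makes it match as little as possible instead of as much as possible.

```perl
use strict;
use warnings;

my $orf = "ATGAAATAGCCCTAG";               # made-up example: ATG ... TAG ... TAG

my ($greedy) = $orf =~ /ATG(.*)TAG/;       # .* runs to the LAST TAG
my ($lazy)   = $orf =~ /ATG(.*?)TAG/;      # .*? stops at the FIRST TAG

print "greedy capture: $greedy\n";
print "lazy capture:   $lazy\n";
```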
Substitute function• s/pattern1/pattern2/;• Looks kind of like a regular expression   Patterns constructed the same way• ...
tr function• translate or transliterate• tr/characterlist1/characterlist2/;• Even less like a regular expression than s• s...
Translations               56
Using tr • Creating the complementary DNA sequence: $sequence =~ tr/atgc/TACG/; • Sneaky Perl trick for the day: tr does two...
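Two common uses of `tr` on DNA can be sketched like this: complementing bases, and counting characters (`tr` returns the number of characters it matched). The reverse complement goes one step beyond the plain complement, and the helper names `reverse_complement` and `gc_count` are illustrative assumptions.

```perl
use strict;
use warnings;

# Reverse the string, then swap each base for its complement (case preserved).
sub reverse_complement {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# With an empty replacement list and no /d, tr only counts; the string is unchanged.
sub gc_count {
    my ($dna) = @_;
    return ($dna =~ tr/GCgc//);
}

my $dna = "ATGCGC";                # made-up example sequence
print "reverse complement: ", reverse_complement($dna), "\n";
print "G+C count: ", gc_count($dna), "\n";
```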
Regex-Related Special Variables• Perl has a host of special variables that get filled after every m// or s///  regex match...
Example: Which of the following 4 sequences (seq1/2/3/4) a) contains a “Galactokinase signature” http://us.expas...
>SEQ1MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT  YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTF...
ArraysDefinitions• A scalar variable contains a scalar value: one number or one string.  A string might contain many words...
The foreach construct     The foreach construct iterates over a list of scalar values     (e.g. that are contained in an a...
Examples for using the foreach construct - cont.• Calculate sum of all array elements:  #!/usr/local/bin/perl  @msr = (3, ...
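Summing the elements of an array with `foreach` can be sketched as below. The values in `@measurements` are placeholders chosen for illustration, not the numbers from the slide.

```perl
use strict;
use warnings;

my @measurements = (3, 5, 8);      # placeholder values

my $sum = 0;
foreach my $value (@measurements) {
    $sum += $value;                # add each element in turn
}

print "sum of (@measurements) is $sum\n";
```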
Accessing individual array elementsIndividual array elements may be accessed by  indicating their position in the list (th...
The sort function: The sort function receives a list of variables (or an array) and returns the sorted list. @array2 = sort (...
The push and shift functions: The push function adds a variable or a list of variables to the end of a given array. Example:...
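A small sketch of `push` and `shift` used together, treating an array as a first-in, first-out queue. The codon strings are arbitrary example data.

```perl
use strict;
use warnings;

my @queue = ("ATG", "CCC");

push @queue, "TAG";                # add to the end:        ATG CCC TAG
my $first = shift @queue;          # remove from the front: returns ATG

print "removed: $first\n";
print "remaining: @queue\n";
```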
Perl Array review• An array is designated with the „@‟ sign• An array is a list of individual elements• Arrays are ordered...
Generate random sequence string: for ($n=1; $n<=50; $n++) { @a = ("A","C","G","T"); $b = $a[rand(@a)]; $r .= $b; } print $r; ...
Text Processing Functions: The split function • The split function splits a string into a list of substrings according to the ...
Text Processing Functions: The join function • The join function does the opposite of split. It receives a delimiter and a l...
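`split` and `join` are often used as a pair, for example to convert a tab-delimited record into a comma-delimited one. The record below is invented for illustration.

```perl
use strict;
use warnings;

my $line = "gene1\tchr22\t1000\t5000";    # made-up tab-delimited record

my @fields = split /\t/, $line;           # break the line at each tab
my $csv    = join ",", @fields;           # glue the pieces back with commas

print "number of fields: ", scalar(@fields), "\n";
print "as CSV: $csv\n";
```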
When is an array not good enough?• Sometimes you want to associate a given value  with another value. (name/value pairs)  ...
Problem solved: The associative array• As the name suggests, an associative array allows  you to link a name with a value•...
The „structure‟ of a Hash• An array looks something like this:                     0      1      2      Index     @array =...
The „structure‟ of a Hash • An array looks something like this:                       0      1     2      Index      @arra...
Creating a hash• There are several methods for creating a hash.  The most simple way – assign a list to a hash.   %hash =...
Getting at values• You should expect by now that there is some  way to get at a value, given a key.• You access a hash key...
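Creating a hash and looking up values by key can be sketched as follows. The hash name `%amino_acid` and the three codons are chosen for illustration.

```perl
use strict;
use warnings;

# Keys are codons, values are amino acid names.
my %amino_acid = (
    ATG => "Met",
    TGG => "Trp",
    TAA => "Stop",
);

my $aa = $amino_acid{"ATG"};              # look up a value by its key
$amino_acid{"TGA"} = "Stop";              # add a new key/value pair

# exists tests for a key without creating it
my $known = exists $amino_acid{"GGG"} ? "yes" : "no";

for my $codon (sort keys %amino_acid) {   # hashes are unordered, so sort the keys
    print "$codon => $amino_acid{$codon}\n";
}
print "ATG codes for $aa; GGG in table: $known\n";
```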
Programming in general and Perl in particular• There is more than one right way to do it. Unfortunately, there are also  m...
Programming in general and Perl in particular Develop your program in stages. Once part of it works, save  the working ver...
CPAN       • CPAN: The Comprehensive Perl Archive         Network is available at www.cpan.org         and is a very large...
What is BioPerl?• An „open source‟ project   http://bio.perl.org or http://www.cpan.org• A loose international collaborat...
Multi-line parsing            use strict;            use Bio::SeqIO;            my $filename="sw.txt";            my $sequ...
Live.pl: #!e:\Perl\bin\perl.exe -w # script for looping over GenBank entries, printing out name use...
Bioperl 101: 2 ESSENTIAL TOOLS: Data::Dumper to find out what class you're in; Perl bptutoria...
Outline• Scripting  Perl (Bioperl/Python)       examples spiders/bots• Databases  Genome Browser       examples biomart,...
Overview• Bots and Spiders   The web   Bots   Spiders   Real world examples     Bioinformatics applications   Perl –...
The web• The WWW-part of the  Internet is based on  hyperlinks• So if one started to  follow all hyperlinks, it  would be ...
Bots• Webbots (web robots, WWW robots, bots): software applications  that run automated tasks over the Internet• Bots perf...
Spiders • Webspiders / crawlers are programs or automated scripts which browse the World Wide Web in a methodical, auto...
SpidersUse of webcrawlers:  Mainly used to create a copy of all the visited pages for later   processing by a search eng...
Spiders          90
Perl - LWPLWP (also known as libwww-perl)    The World-Wide Web library for Perl    Set of Perl modules which provides ...
Perl - LWP   Some more advanced features LWP::UserAgent (demo2 – show server access logs) Fill in forms and parse resul...
Google hacks   Why not make use of crawls, indexing and    serving technologies of others (e.g. Google)    Google allows...
Advanced APIs   An application programming interface (API) is a source    code interface that an operating system, librar...
Advanced APIs   Google example used Google API / SOAP   NCBI API     The NCBI Web service is a web program that enables...
Fetch data from NCBI   A NCBI database, frequently used is PubMed    PubMed can be queried using E-Utils    Uses syntax...
Fetch data from NCBI   Example: PubMeth    Get data from NCBI PubMed    Get all genes and all aliases for human genes a...
Outline• Scripting  Perl (Bioperl/Python)       examples spiders/bots• Databases  Genome Browser       examples biomart,...
The three genome browsers• There are three main browsers:   Ensembl   NCBI MapViewer   UCSC• At first glance their main...
MapViewer                  Homehttp://www.ncbi.nlm.nih.gov/mapview/   100
MapViewer Master Map                       101
Selecting tracks on MapViewer                                102
MapViewer strengths• Good coverage of plant and fungal genomes.• Close integration with other NCBI tools and  databases, s...
MapViewer limitations• Little cross-species conservation or alignment  data.• Inability to upload custom annotations and d...
UCSC Genome Browser                      105                       105
http://genome.ucsc.edu/                          106                           106
UCSC Genome Browser                      107                       107
Strengths of the UCSC Browser (I)  For this course I will be focusing primarily on the  UCSC Browser for several reasons:•...
UCSC Browser Strengths (II)• Well suited for batch and automated querying of both  gene and intergenic regions.• Comprehen...
UCSC browser limitations• Lack of “overview” mode can make it harder to see  genomic context.• Syntenic regions cannot be ...
Human, mouse,rat synteny in MapViewer                                        111
Browser/Database BatchQuerying                         112                          112
Batch querying overview• Introduction / motivation• UCSC table browser• Custom tracks and frames• Galaxy and direct SQL da...
Why batch querying• Interactive querying is difficult if you want to study  numerous “interesting” genomic regions.• Query...
Batch querying examples• As an example, say you have found one hundred candidate  polymorphisms and you want to know:    ...
Other examples• Other examples include characterizing multiple:   Non-coding RNA candidates   ultra-conserved regions  ...
Browsers and databases• Each of the genome browsers is built on top of  multiple relational databases.• Typically data for...
The UCSC Table Browser• For batch queries, you need to query the  browser databases.• The conventional way of querying a r...
Browser Database Formats Nevertheless, even with the Table Browser, you needsome understanding of the underlying track, ta...
Main UCSC Data Formats • GFF/GTF • BED (Browser Extensible Data)    lists of genomic blocks • PSL    RNA/DNA alignments ...
Custom Tracks• Custom tracks are essentially BED, PSL or GTF files  with formatting lines so they can be displayed on the ...
Selecting custom track output                                122
Sending custom track to browser                                  123                                   123
Adding a custom track                        124                         124
Adding a custom track (II)                             125
Custom track example browser position chr22:10000000-10020000 browser hide all track name=clones description="Clones” visi...
Limitations of the table browser• Can be difficult to create more complex queries.• With hundreds of tables, finding the o...
Ensembl          128
Ensembl Home   http://www.ensembl.org/                                         129
Ensembl ContigView                     130
Ensembl ContigView                     131
Detail and Basepair view                           132
Changing tracks in Ensembl                             133
Ensembl strengths (I)• Multiple view levels shows genomic context.• Some annotations are more complete and/or are  more cl...
Ensembl snpView                  135
Ensembl strengths (II)• Batch and automated querying well supported and  documented (especially for perl and java).• API (...
Ensembl is “community oriented” • Close alliances with Wormbase, Flybase, SGD • “support for easy integration with third p...
Ensembl limitations• Limited data quantifying cross-species  sequence conservation.• Batch queries for intergenic regions ...
BioMart• BioMart - the Ensembl “Table browser”• Similar to the Table Browser and Galaxy tools.• Previous version was calle...
The Galaxy Website• Galaxy website: http://g2.bx.psu.edu• Galaxy objective: Provide sequence and data  manipulation tools ...
Demo: Galaxy Genomics Toolkit• Galaxy is a web interface to bioinformatics tools that  deal with genome-scale data• There ...
Genome-Scale Data• Bioinformatics work is challenging on  very large “genomics” data sets   sequencing, gene expression, ...
The Galaxy Interface has 3 parts                                       History =List of Tools    Central work panel   data...
Load Data from UCSC             Or upload from your computer   145
Demo: Galaxy Genomics Toolkit • A Galaxy instance is running at http://athos.ugent.be:8080. • Log in (as admin: new@new.be,...
Workflows• Galaxy saves your data, and results in the  History• The exact commands and parameters used with  each operatio...
• Galaxy has many public  data sets and public  workflows, which can be  easily used in your projects  (or a tutorial)    ...
NGS tools• Galaxy has recently been expanded with tools to  analyze Next-Gen Sequence data• File format conversions• Analy...
• NGS tools include fileformat conversion, mappingto reference genome,ChIPseq peak calling, RNA-seq gene expression, etc. ...
A number of Groups have set up custom Galaxyservers with special tools                                               151
The SPARQLing future                       152
Outline• Scripting  Perl (Bioperl/Python)       examples spiders/bots• Databases  Genome Browser       examples biomart,...
What is 'intelligent'? • Intelligence = the ability to learn and understand, to solve problems, to ...
Turing test for intelligence: THE IMITATION GAME. Woman; Man/Machine; Interrogator: Which of the two is the woman? ...
What is 'artificial'? • Artificial = man-made, not of natural origin • in the context ...
Data mining • WHAT? Extracting knowledge from data • Data is divided into three groups: training set, validation set, test set...
Clustering • WHAT? 'Unsupervised learning': the answer for the training data is not known • Result usually as a tree structure ...
Cluster Analysis• Unsupervised methods• Descriptive modeling   Grouping of genes with “similar” expression    profiles  ...
Linkage in Hierarchical Clustering • Single linkage: S(A,B) = min_a min_b d(a,b) ...
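The single-linkage formula can be sketched in a few lines: the distance between two clusters is the smallest distance between any member of A and any member of B. For simplicity the points here are plain numbers and `distance` is the absolute difference; both are assumptions for illustration, and with expression profiles you would substitute a suitable distance on vectors.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Distance between two 1-D points (illustrative choice of d).
sub distance {
    my ($p, $q) = @_;
    return abs($p - $q);
}

# Single linkage: minimum over all pairs (a in A, b in B) of d(a, b).
sub single_linkage {
    my ($cluster_a, $cluster_b) = @_;
    my @all;
    for my $p (@$cluster_a) {
        for my $q (@$cluster_b) {
            push @all, distance($p, $q);
        }
    }
    return min(@all);
}

my $s = single_linkage([1, 2, 3], [7, 9]);
print "single-linkage distance: $s\n";
```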
Hierarchical Clustering                          3 clusters?                          2 clusters?                         ...
Classification • WHAT? 'Supervised learning': the answer for the training data is known • Various classification methods...
Decision tree. Example: tennis
Neural networks. STRUCTURE: neurons and connections. TASK: processing input data; machine learning ...
Support Vector Machines: Performing a linear separation of the data by adjusting the dimensions ...
Bioinformatics applications • Decision tree: searching for DNA sequences homologous to a given DNA sequence • Neural n...
Bioinformatics applications • Hierarchical clustering: constructing phylogenetic trees based on DNA sequences • G...
Outline          168
Classification                        C N                      N NCC                        NC                          OM...
Classification                      R N                    N NRR                      NR                        OMS       ...
Outline          171
OMS Classifier using “Methylation”               Patient              Sample         Measuring Methylation       Gene     ...
Why use methylation as a biomarker ?• What is feature/biomarker ?   A characteristic that is objectively    measured and ...
Outline          174
Data preparation and modelling• Data preparation   Construct binary features « Methylated » from    PCR data (Ct and Temp...
Data Preparation: Feature Construction                           Sample                             Methylation Specific  ...
Construction of features « Methylated »• Per gene: find boolean function   Methylated IFF:    Ct below upperbound AND    ...
Construction of features « Methylated »Plot of all Ct and Temp measurements for a given gene    Temp                      ...
Noise • Noise: random error or variance in a measured variable • Incorrect attribute values may be due to: Quantit...
Construction of features «Methylated»        Taking into account noise                             QC: StdDev of Ct and Tm...
Construction of features « Methylated »  Taking into account noise                     Good Reproducibility           Bad ...
Construction of features « Methylated » Taking into account noiseFind most robust cut-off for each gene                   ...
Construction of features « Methylated »  Methylated: inside red box                                          183
Construction of features « Methylated »Methylated   Unmethylated     Ranked GenesCancerNormal                             ...
Data preparation and modelling• Data preparation   Construct binary features « Methylated » from    PCR data (Ct and Temp...
Selection of modelling technique• In theory, many techniques applicable   Data type: boolean methylation table, discrete ...
Decision trees  The Weka tool@relation weather.symbolic@attribute   outlook {sunny, overcast, rainy}@attribute   temperatu...
Decision trees  Attribute selection      outlook    temperature   humidity   windy       play      sunny      hot         ...
Decision trees                                                                                         play               ...
Decision trees                               outlook                                      playAttribute selection         ...
playDecision trees                               outlookAttribute selection                                               ...
Decision treesfinal tree                                              play                                              do...
Decision trees Basic algorithm• Initialize top node to all examples• While impure leaves available   select next impure l...
Decision tree built from methylation table                                             Leave-one-out experiment           ...
Outline          195
Evaluation and deployment• Decide whether to use Classification results   Can we use 12 gene decision tree for classifyin...
Attempt to rebuild decision tree  with at most ~5 genes                                                 Minimal leaf size ...
Evaluation and deploymentThe impact of « cost »• Market conditions, cost of goods &  royalty structure can limit the amoun...
Evaluation and deploymentThe importance of « understandability »                                          199
Evaluation and deploymentThe importance of « understandability »Pre and postmarket requirements imposed for IVDMIA (510k e...
Outline• Scripting  Perl (Bioperl/Python)       examples spiders/bots• Databases  Genome Browser       examples biomart,...
WEKA:: Introduction• A collection of open source ML  algorithms   pre-processing   classifiers   clustering   associat...
WEKA:: Installation• Download software from  http://www.cs.waikato.ac.nz/ml/weka/   If you are interested in    modifying...
Main GUI• Three graphical user interfaces   “The Explorer” (exploratory data    analysis)   “The Experimenter” (experime...
Explorer: pre-processing the data• Data can be imported from a file in  various formats: ARFF, CSV, C4.5,  binary• Data ca...
WEKA only deals with “flat” files@relation heart-disease-simplified@attribute age numeric@attribute sex { female, male}@at...
WEKA only deals with “flat” files@relation heart-disease-simplified@attribute age numeric@attribute sex { female, male}@at...
2   University of Waikato   12/18/2012   2090
Explorer: building “classifiers”• Classifiers in WEKA are models for  predicting nominal or numeric  quantities• Implement...
Decision Tree Induction: Training Dataset               age    income student credit_rating   buys_computer             <=...
Output: A Decision Tree for “buys_computer”                                 age?                  <=30          overcast  ...
2012 12 12_adam_v_final
  1. Bioinformatics. Prof. Wim Van Criekinge, 18 December 2012, VUmc, Amsterdam
  2. Outline
     • Scripting: Perl (Bioperl/Python), examples: spiders/bots
     • Databases: Genome Browser, examples: biomart, galaxy
     • AI: Classification and clustering, examples: WEKA (R, Rapidminer)
  3. Bioinformatics, a life science discipline … Math (Molecular) Informatics Biology
  4. Bioinformatics, a life science discipline … Math Computer Science Theoretical Biology (Molecular) Informatics Biology Computational Biology
  5. Bioinformatics, a life science discipline … Math Computer Science Theoretical Biology Bioinformatics (Molecular) Informatics Biology Computational Biology
  6. Bioinformatics, a life science discipline … management of expectations Math Computer Science Theoretical Biology NP AI, Image Analysis Datamining structure prediction (HTX) Bioinformatics Interface Design Expert Annotation Sequence Analysis (Molecular) Informatics Biology Computational Biology
  7. Bioinformatics, a life science discipline … management of expectations Math Computer Science Theoretical Biology NP AI, Image Analysis Datamining structure prediction (HTX) Bioinformatics Discovery Informatics – Computational Genomics Interface Design Expert Annotation Sequence Analysis (Molecular) Informatics Biology Computational Biology
  8. Bioinformatics
  10. What is Perl?
      • Perl is a high-level scripting language
      • Larry Wall created Perl in 1987
        - Practical Extraction (a)nd Reporting Language
        - (or Pathologically Eclectic Rubbish Lister)
      • Born from a system administration tool
      • Faster than sh or csh
      • Slower than C
      • No need for sed, awk, tr, wc, cut, …
      • Perl is open and free
      • http://conferences.oreillynet.com/eurooscon/
  11. What is Perl?
      • Perl is available for most computing platforms: all flavors of UNIX (Linux), MS-DOS/Win32, Macintosh, VMS, OS/2, Amiga, AS/400, Atari
      • Perl is a computer language that is:
        - Interpreted, compiles at run-time (need for perl.exe!)
        - Loosely “typed”
        - String/text oriented
        - Capable of using multiple syntax formats
      • In Perl, “there's more than one way to do it”
  12. Why use Perl for bioinformatics?
      • Ease of use by novice programmers
      • Flexible language: fast software prototyping (quick and dirty creation of small analysis programs)
      • Expressiveness. Compact code, Perl Poetry: @{$_[$#_]||[]}
      • Glutility: read disparate files and parse the relevant data into a new format
      • Powerful pattern matching via “regular expressions” (Best Regular Expressions on Earth)
      • With the advent of the WWW, Perl has become the language of choice to create Common Gateway Interface (CGI) scripts to handle form submissions and create compute servers on the WWW.
      • Open Source, free. Availability of Perl modules for bioinformatics and Internet.
  13. Why NOT use Perl for bioinformatics?
      • Some tasks are still better done with other languages (heavy computations / graphics)
        - C(++), C#, Fortran, Java (Pascal, Visual Basic)
      • With Perl you can write simple programs fast, and it is also suitable for large and complex programs (yet it is not adequate for very large projects)
        - Python
      • Larry Wall: “For programmers, laziness is a virtue”
  14. What bioinformatics tasks are suited to Perl?
      • Sequence manipulation and analysis
      • Parsing results of sequence analysis programs (Blast, Genscan, Hmmer etc.)
      • Parsing database (e.g. GenBank) files
      • Obtaining multiple database entries over the internet
      • …
  15. Perl installation
      • Perl
        - Perl is available for various operating systems. To download Perl and install it on your computer, have a look at the following resources:
        - www.perl.com (O'Reilly). Downloading Perl Software
        - ActiveState. ActivePerl for Windows, as well as for Linux and Solaris. ActivePerl binary packages.
        - CPAN
      • PHPTriad:
        - contains Apache/PHP and MySQL: http://sourceforge.net/projects/phptriad
  16. Check installation
      • Command-line flags for perl
        - perl -v : gives the current version of Perl
        - perl -e : executes Perl statements from the command line
              perl -e "print 42;"
              perl -e "print \"Two\nlines\n\";"
        - perl -we : executes and prints warnings
              perl -we "print 'hello'; $x++;"
  17. TextPad
      • Syntax highlighting
      • Run program (prompt for parameters)
      • Show line numbers
      • Clip-ons for web with perl syntax
      • …
  18. Customize TextPad part 1: Create Document Class
  19. Document classes
  20. Customize TextPad part 2: Add Perl to “Tools Menu”
  21. Unzip to TextPad samples directory
  22. General Remarks
      • Perl is mostly a free-format language: add spaces, tabs or new lines wherever you want.
      • For clarity, it is recommended to write each statement on a separate line, and to use indentation in nested structures.
      • Comments: anything from the # sign to the end of the line is a comment. (There are no multi-line comments.)
      • A Perl program consists of all of the Perl statements of the file taken collectively as one big routine to execute.
  23. Three Basic Data Types
      • Scalars - $
      • Arrays of scalars - @
      • Associative arrays of scalars, or hashes - %
  24. 2 + 2 = ?
          $a = 2;
          $b = 2;
          $c = $a + $b;
      • $ indicates a variable
      • ; ends every command
      • = assigns a value to a variable
      Or:
          $c = 2 + 2;
          $c = 2 * 2;
          $c = 2 / 2;
          $c = 2 ** 4;    # 2 to the power 4 = 16 (in Perl, ^ is bitwise XOR, not exponentiation)
          $c = 1.35 * 2 - 3 / (0.12 + 1);
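A minimal sketch of the arithmetic on this slide, written with lexical variables (the names `$x`, `$y`, `$sum` and so on are illustrative, not from the slides). One point worth checking for yourself: in Perl the exponentiation operator is `**`, while `^` is bitwise XOR, so `2 ^ 4` evaluates to 6 rather than 16.

```perl
use strict;
use warnings;

my $x   = 2;
my $y   = 2;
my $sum = $x + $y;                        # addition: 4

my $power = 2 ** 4;                       # exponentiation: 16
my $xor   = 2 ^ 4;                        # bitwise XOR (binary 010 ^ 100 = 110): 6
my $mixed = 1.35 * 2 - 3 / (0.12 + 1);    # usual precedence: * and / before -

print "2 + 2 = $sum\n";
print "2 ** 4 = $power, but 2 ^ 4 = $xor\n";
print "1.35 * 2 - 3 / (0.12 + 1) = $mixed\n";
```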
  25. 25. Ok, $c is 4. How do we know it? $c = 4; print “$c”; print command: “ ” - bracket output expression print “Hello n”; n - print a end-of-the-line character (equivalent to pressing „Enter‟)Strings concatenation: print “Hello everyonen”; print “Hello” . ” everyone” . “n”;Expressions and strings together: print “2 + 2 = “ . (2+2) . ”n”; 2 + 2 = 4 expression
  26. Loops and cycles (for statement)
          # Output all the numbers from 1 to 100
          for ($n=1; $n<=100; $n+=1) {
              print "$n \n";
          }
      1. Initialization: for ( $n=1 ; ; ) { … }
      2. Increment: for ( ; ; $n+=1 ) { … }
      3. Termination (the loop keeps running as long as the condition is true): for ( ; $n<=100 ; ) { … }
      4. Body of the loop, the commands inside curly brackets: for ( ; ; ) { … }
  27. FOR & IF: all the even numbers from 1 to 100
          for ($n=1; $n<=100; $n+=1) {
              if (($n % 2) == 0) {
                  print "$n\n";
              }
          }
      Note: $a % $b is the modulus, the remainder when $a is divided by $b
  28. Two brief diversions (warnings & strict)
      • use warnings; makes Perl report suspicious constructs, such as a variable that is used only once or a value that was never initialized.
      • strict forces you to 'declare' a variable the first time you use it.
          - usage: use strict; (somewhere near the top of your script)
      • declare variables with 'my'
          - usage: my $variable;
          - or: my $variable = 'value';
      • my sets the 'scope' of the variable: the variable exists only within the current block of code
      • use strict and my both help you to debug errors, and help prevent mistakes.
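      A minimal sketch of a script written under strict and warnings. The subroutine name sum_even is made up for this illustration; the point is only that every variable is declared with my and lives in the smallest block that needs it.

```perl
use strict;
use warnings;

# Sum the even numbers from 1 to $limit; every variable is declared with 'my'.
sub sum_even {
    my ($limit) = @_;
    my $total = 0;                 # visible only inside this subroutine
    for my $n (1 .. $limit) {      # $n exists only inside the loop
        $total += $n if $n % 2 == 0;
    }
    return $total;
}

print "Sum of even numbers up to 100: ", sum_even(100), "\n";
```

      Misspelling $total as $totl inside the subroutine would stop the script at compile time under strict, instead of silently creating a new empty variable.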
  29. Text Processing Functions: the substr function
      • Definition: the substr function extracts a substring out of a string and returns it. The function receives 3 arguments: a string value, a position in the string (counting from 0) and a length.
      • Example:
          $a = "university";
          $k = substr ($a, 3, 5);
          # $k is now "versi"; $a remains unchanged.
      • If length is omitted, everything to the end of the string is returned.
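      As an illustration of substr on sequence data, the sketch below cuts a DNA string into codons. The helper name codons and the test sequence are made up for this example; incomplete trailing bases are simply dropped.

```perl
use strict;
use warnings;

# Split a DNA string into consecutive 3-letter codons using substr.
sub codons {
    my ($dna) = @_;
    my @codons;
    for (my $pos = 0; $pos + 3 <= length $dna; $pos += 3) {
        push @codons, substr($dna, $pos, 3);   # 3 characters starting at $pos
    }
    return @codons;
}

print join(" ", codons("ATGGCCTAAG")), "\n";
print substr("university", 3), "\n";           # length omitted: everything to the end
```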
  30. Random
          $x = rand(1);
      • srand: the default seed for srand, which used to be time, has been changed. Now it's a heady mix of difficult-to-predict system-dependent values, which should be sufficient for most everyday purposes. Previous to version 5.004, calling rand without first calling srand would yield the same sequence of random numbers on most or all machines. Now, when perl sees that you're calling rand and haven't yet called srand, it calls srand with the default seed. You should still call srand manually if your code might ever be run on a pre-5.004 system, of course, or if you want a seed other than the default.
  31. Demo/Example
      • Exercise: how good are the random numbers?
      • If they are good, you can use them to calculate pi …
      • A good random generator is important for good random sequences that we can later use in simulations
  32. Calculate pi using two random numbers [figure: axes x and y, marked at 1]
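      One way to do this exercise is a Monte Carlo estimate: draw two random numbers x and y between 0 and 1 and count how often the point (x, y) falls inside the quarter circle of radius 1. The sketch below is one possible solution, not necessarily the one shown in the course; the name estimate_pi, the sample size and the seed are arbitrary choices.

```perl
use strict;
use warnings;

# Estimate pi from $n random points in the unit square.
# A point (x, y) lies inside the quarter circle when x*x + y*y <= 1,
# which happens with probability pi/4 if rand() is uniform.
sub estimate_pi {
    my ($n) = @_;
    my $inside = 0;
    for (1 .. $n) {
        my $x = rand(1);
        my $y = rand(1);
        $inside++ if $x * $x + $y * $y <= 1;
    }
    return 4 * $inside / $n;
}

srand(42);    # fixed, arbitrary seed so that repeated runs give the same estimate
printf "pi is approximately %.4f\n", estimate_pi(100_000);
```

      The estimate converges slowly (the error shrinks roughly with the square root of the number of points), so a generator with poor randomness shows up as an estimate that stays off even for large samples.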
  33. Introduction
      • Buffon's Needle is one of the oldest problems in the field of geometrical probability. It was first stated in 1777. It involves dropping a needle on a lined sheet of paper and determining the probability of the needle crossing one of the lines on the page. The remarkable result is that the probability is directly related to the value of pi.
      • http://www.angelfire.com/wa/hurben/buff.html
      • In PostScript you send it to the printer … PS has no variables but "stacks"; you can mimic this in Perl by recursively loading and rewriting a subroutine
  34. http://www.csse.monash.edu.au/~damian/papers/HTML/Perligata.html
  35. Programming
      • Variables
      • Flow control (if, regex …)
      • Loops
      • Input/output
      • Subroutines/objects
  36. What is a regular expression?
      • A regular expression (regex) is simply a way of describing text.
      • Regular expressions are built up of small units (atoms) which can represent the type and number of characters in the text
      • Regular expressions can be very broad (describing everything), or very narrow (describing only one pattern).
  38. Regular Expression Review
      • A regular expression (regex) is a way of describing text.
      • Regular expressions are built up of small units (atoms) which can represent the type and number of characters in the text
      • You can group or quantify atoms to describe your pattern
      • Always use the bind operator (=~) to apply your regular expression to a variable
  39. Why would you use a regex?
      • Often you wish to test a string for the presence of a specific character, word, or phrase
      • Examples:
          - "Are there any letter characters in my string?"
          - "Is this a valid accession number?"
          - "Does my sequence contain a start codon (ATG)?"
  40. Regular Expressions: match to a sequence of characters
      • The EcoRI restriction enzyme cuts at the consensus sequence GAATTC.
      • To find out whether a sequence contains a restriction site for EcoRI, write:
          if ($sequence =~ /GAATTC/) { ... };
  41. Regex-style
          [m]/PATTERN/[g][i][o]
          s/PATTERN/PATTERN/[g][i][e][o]
          tr/PATTERNLIST/PATTERNLIST/[c][d][s]
  42. Regular Expressions: match to a character class
      • Example: the BstYI restriction enzyme cuts at the consensus sequence rGATCy, namely A or G in the first position, then GATC, and then T or C. To find out whether a sequence contains a restriction site for BstYI, write:
          if ($sequence =~ /[AG]GATC[TC]/) {...};    # This will match all of AGATCT, GGATCT, AGATCC, GGATCC.
      • Definition: when a list of characters is enclosed in square brackets [], one and only one of these characters must be present at the corresponding position of the string in order for the pattern to match. You may specify a range of characters using a hyphen -.
      • A caret ^ at the front of the list negates the character class.
      • Examples:
          if ($string =~ /[AGTC]/) {...};            # matches any nucleotide
          if ($string =~ /[a-z]/) {...};             # matches any lowercase letter
          if ($string =~ /chromosome[1-6]/) {...};   # matches chromosome1, chromosome2 ... chromosome6
          if ($string =~ /[^xyzXYZ]/) {...};         # matches any character except x, X, y, Y, z, Z
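      The BstYI character-class pattern can be tried on a small string. This is a sketch only: the helper name bstyi_sites and the test sequence are made up, and the /i modifier is added so that lowercase sequence is matched as well.

```perl
use strict;
use warnings;

# Return every BstYI site (rGATCy) found in a DNA string, ignoring case.
sub bstyi_sites {
    my ($dna) = @_;
    my @sites = $dna =~ /([AG]GATC[TC])/gi;
    return @sites;
}

my $sequence = "ttAGATCTccGGATCCaaTGATCA";   # made-up test sequence
my @found = bstyi_sites($sequence);
print "Found ", scalar(@found), " site(s): @found\n";
```

      TGATCA at the end is not reported, because T is not allowed in the first position and A is not allowed in the last.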
  43. Constructing a Regex
      • Pattern starts and ends with a /: /pattern/
          - if you want to match a /, you need to escape it: \/ (backslash, forward slash)
          - you can change the delimiter to some other character, but you probably won't need to: m|pattern|
      • any 'modifiers' to the pattern go after the last /
          - i : case insensitive, e.g. /[a-z]/i
          - o : compile once
          - g : global, find all matches (in list context they are returned as a list)
          - m or s : match over multiple lines (m lets ^ and $ match at line boundaries, s lets . match a newline)
  44. Looking for a pattern
      • By default, a regular expression is applied to $_ (the default variable)
          - if (/a+/) {die} looks for one or more 'a' in $_
      • If you want to look for the pattern in any other variable, you must use the bind operator
          - if ($value =~ /a+/) {die} looks for one or more 'a' in $value
      • The bind operator is in no way similar to the '=' sign! = is assignment, =~ is bind.
          - if ($value = /[a-z]/) {die} looks for a lowercase letter in $_, not in $value, and overwrites $value with the result of the match!
  45. Regular Expression Atoms
      • An 'atom' is the smallest unit of a regular expression.
      • Character atoms
          - 0-9, a-z, A-Z match themselves
          - . (dot) matches any single character (except a newline, unless the s modifier is used)
          - [atgcATGC] : a character class (group)
          - [a-z] : another character class, a through z
  46. Quantifiers
      • You can specify the number of times you want to see an atom. Examples:
          - \d* : zero or more times
          - \d+ : one or more times
          - \d{3} : exactly three times
          - \d{4,7} : at least four, and not more than seven
          - \d{3,} : three or more times
      • We could rewrite /\d\d\d-\d\d\d\d/ as: /\d{3}-\d{4}/
  47. Anchors
      • Anchors force a pattern match to a certain location
          - ^ : match at the beginning of the string
          - $ : match at the end of the string
          - \b : match at a word boundary (between \w and \W)
      • Example: /^\d\d\d-\d\d\d\d$/ matches only strings that consist entirely of a phone number of the form ddd-dddd
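      Quantifiers and anchors combined, as a small sketch. The helper name is_phone_number is made up; the pattern is the ddd-dddd phone number pattern from the slides.

```perl
use strict;
use warnings;

# True (1) only if the whole string is a phone number of the form ddd-dddd.
sub is_phone_number {
    my ($text) = @_;
    return $text =~ /^\d{3}-\d{4}$/ ? 1 : 0;
}

for my $candidate ("353-7236", "phone: 353-7236", "3537-236") {
    printf "%-16s %s\n", $candidate, is_phone_number($candidate) ? "valid" : "not valid";
}
```

      Without the ^ and $ anchors the second string would also match, because the pattern would be allowed to start anywhere inside it.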
  48. Remembering Stuff
      • Being able to match patterns is good, but limited.
      • We want to be able to keep portions of the regular expression for later.
          - Example: $string = 'phone: 353-7236'
          - We want to keep the phone number only
          - Just figuring out that the string contains a phone number is insufficient, we need to keep the number as well.
  49. Memory Parentheses (pattern memory)
      • Since we almost always want to keep portions of the string we have matched, there is a mechanism built into Perl.
      • Anything in parentheses within the regular expression is kept in memory.
          'phone:353-7236' =~ /^phone:(.+)$/;
      • Perl knows we want to keep everything that matches '.+' in the above pattern
  50. Getting at pattern memory
      • Perl stores the matches in a series of default variables. The first parentheses set goes into $1, the second into $2, etc.
          - This is why you can't use a digit as the name of your own variable
          - Memory variables are created only in the amounts needed. If you have three sets of parentheses, you have ($1,$2,$3).
          - Memory variables are created for each matched set of parentheses. If you have one set contained within another set, you get two variables; sets are numbered by the position of their opening parenthesis, so the outer set gets the lower number.
          - Memory variables are only valid in the current scope
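      A short sketch of pattern memory with nested parentheses. The helper name parse_phone is made up for this illustration; it shows that $1 belongs to the set whose opening parenthesis comes first.

```perl
use strict;
use warnings;

# Extract a phone number and its two parts from a line such as "phone: 353-7236".
# Groups are numbered by their opening parenthesis: $1 is the whole number,
# $2 the first three digits, $3 the last four.
sub parse_phone {
    my ($line) = @_;
    if ($line =~ /^phone:\s*((\d{3})-(\d{4}))$/) {
        return ($1, $2, $3);
    }
    return;    # empty list when the line does not match
}

my ($number, $prefix, $rest) = parse_phone("phone: 353-7236");
print "number=$number prefix=$prefix rest=$rest\n";
```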
  51. Finding all instances of a match
      • Use the 'g' modifier to the regular expression
          @sites = $sequence =~ /(TATTA)/g;
          - think g for global
          - Returns a list of all the matches (in order), and stores them in the array
          - If you have more than one pair of parentheses, your array gets values in sets ($1,$2,$3,$1,$2,$3...)
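      The list form above returns the matched text only. When the positions are needed as well, the g modifier can be used in a while loop together with pos. A sketch, with a made-up helper name and test sequence:

```perl
use strict;
use warnings;

# Return the 0-based start position of every occurrence of $site in $dna.
sub site_positions {
    my ($dna, $site) = @_;
    my @positions;
    while ($dna =~ /\Q$site\E/g) {
        # pos() is the offset just after the current match
        push @positions, pos($dna) - length($site);
    }
    return @positions;
}

my $sequence = "GGTATTACCTATTAGG";                 # made-up test sequence
my @sites    = $sequence =~ /(TATTA)/g;          # list context: all matches
my @starts   = site_positions($sequence, "TATTA");
print scalar(@sites), " sites at positions @starts\n";
```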
  52. Perl is Greedy
      • In addition to taking all your time, Perl regular expressions also try to match the largest possible string which fits your pattern
          - /ga+t/ matches gat, gaat, gaaat
          - 'Doh! No doughnuts left!' =~ /(d.+t)/   # $1 contains 'doughnuts left'
      • If this is not what you wanted to do, use the '?' modifier
          - /(d.+?t)/   # match as few '.'s as you can and still make the pattern work
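      The difference between the greedy and the non-greedy form can be checked directly. A small sketch using the sentence from the slide; the two helper names are made up.

```perl
use strict;
use warnings;

# Greedy: .+ takes as many characters as possible before the final 't'.
sub greedy_match {
    my ($text) = @_;
    my ($match) = $text =~ /(d.+t)/;
    return $match;
}

# Non-greedy: .+? takes as few characters as possible.
sub lazy_match {
    my ($text) = @_;
    my ($match) = $text =~ /(d.+?t)/;
    return $match;
}

my $text = "Doh! No doughnuts left!";
print "greedy: ", greedy_match($text), "\n";
print "lazy:   ", lazy_match($text), "\n";
```

      The capital D of "Doh" is not matched in either case, because the pattern asks for a lowercase d.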
  53. Substitute function
      • s/pattern/replacement/;
      • Looks kind of like a regular expression
          - The pattern is constructed the same way; the second part is the text that replaces each match
      • Inherited from previous languages, so it can be a bit different.
          - Changes the variable it is bound to!
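      A sketch of the substitute function in use. The helper name strip_whitespace and the example strings are made up; the sketch shows that s/// edits the bound variable in place and returns the number of substitutions, and how to work on a copy when the original must be kept.

```perl
use strict;
use warnings;

# Remove all whitespace from a sequence string.
# Returns the cleaned string and the number of substitutions made.
sub strip_whitespace {
    my ($seq) = @_;
    my $count = ($seq =~ s/\s+//g);   # edits $seq in place, returns the number of substitutions
    return ($seq, $count || 0);       # s/// returns an empty string when nothing matched
}

my ($clean, $removed) = strip_whitespace("ATG TTT GA\n");
print "$clean ($removed runs of whitespace removed)\n";

# DNA to RNA: copy first, then substitute on the copy so that $dna is kept
my $dna = "ATGTTTGA";
(my $rna = $dna) =~ s/T/U/g;
print "$dna -> $rna\n";
```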
  55. tr function
      • translate or transliterate
      • tr/characterlist1/characterlist2/;
      • Even less like a regular expression than s
      • substitutes characters in the first list with characters in the second list
          $string =~ tr/a/A/;   # changes every 'a' to an 'A'
      • No need for the g modifier when using tr.
  56. Translations
  57. Using tr
      • Creating a complementary DNA sequence
          $sequence =~ tr/atgc/TACG/;
      • Sneaky Perl trick for the day: tr does two things
          1. Changes characters in the bound variable
          2. Counts the number of times it does this
      • Super-fast character counter™
          $a_count = $sequence =~ tr/a/a/;
        replaces an 'a' with an 'a' (no net change), and assigns the result (number of substitutions) to $a_count
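      Both uses of tr in one sketch. The helper names reverse_complement and gc_content are made up; note that tr on its own gives the complement, so the string is reversed first to obtain the reverse complement.

```perl
use strict;
use warnings;

# Reverse complement of a DNA string (upper or lower case).
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;            # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;     # complement each base
    return $rc;
}

# Fraction of G and C bases, counted with tr.
sub gc_content {
    my ($dna) = @_;
    return 0 unless length $dna;
    my $gc = ($dna =~ tr/GCgc//);     # empty replacement list: count only, string unchanged
    return $gc / length $dna;
}

print reverse_complement("ATGC"), "\n";
printf "GC content: %.2f\n", gc_content("ATGC");
```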
  58. Regex-Related Special Variables
      • Perl has a host of special variables that get filled after every m// or s/// regex match. $1, $2, $3, etc. hold the backreferences. $+ holds the last (highest-numbered) backreference. $& (dollar ampersand) holds the entire regex match.
      • @- is an array of match-start indices into the string. $-[0] holds the start of the entire regex match, $-[1] the start of the first backreference, etc. Likewise, @+ holds match-end indices (ends, not lengths).
      • $' (dollar followed by an apostrophe or single quote) holds the part of the string after (to the right of) the regex match. $` (dollar backtick) holds the part of the string before (to the left of) the regex match. Using these variables is not recommended in scripts when performance matters, as it causes Perl to slow down all regex matches in your entire script.
      • All these variables are read-only, and persist until the next regex match is attempted. They are dynamically scoped, as if they had an implicit local at the start of the enclosing scope. Thus if you do a regex match, and call a sub that does a regex match, when that sub returns, your variables are still set as they were for the first match.
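      A sketch of @- and @+ in use, which avoids $&, $` and $'. The helper name match_span and the test string are made up for this illustration.

```perl
use strict;
use warnings;

# Return (start, end, matched text) for the first match of $regex in $text.
# $-[0] is the offset where the match starts, $+[0] the offset just after it ends.
sub match_span {
    my ($text, $regex) = @_;
    if ($text =~ $regex) {
        return ($-[0], $+[0], substr($text, $-[0], $+[0] - $-[0]));
    }
    return;    # empty list when there is no match
}

my ($start, $end, $matched) = match_span("CCGAATTCGG", qr/GAATTC/);
print "matched '$matched' from offset $start to $end\n";
```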
  59. Example
      Which of the following 4 sequences (seq1/2/3/4)
      a) contains a "Galactokinase signature"? (http://us.expasy.org/prosite/)
      b) How many of them?
      c) Where (hints: pos and $&)?
  60. >SEQ1
      MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT
      YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS
      LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS
      LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT
      NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK
      SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM
      MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI
      ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK
      SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR
      >SEQ2
      MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT
      YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE
      MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN
      VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN
      VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE
      LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS
      TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK
      ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ
      >SEQ3
      MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY
      SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY
      LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP
      QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET
      CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH
      TSTTL
      >SEQ4
      MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE
      TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF
      IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE
      GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG
      KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF
      HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL
      WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI
      GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA
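      One possible approach to the exercise, as a hedged sketch. The signature is not given on the slides; the pattern below assumes the PROSITE galactokinase signature is G-R-x-N-[LIV]-I-G-[DE]-H-x-D-Y, which should be checked against the current PROSITE entry. The helper name find_signature is made up, and only short excerpts of the sequences are used, so the reported positions are relative to the excerpts, not to the full sequences.

```perl
use strict;
use warnings;

# Report every hit of the (assumed) galactokinase signature as "start:match",
# with a 1-based start position inside the string that was searched.
sub find_signature {
    my ($protein) = @_;
    my @hits;
    while ($protein =~ /(GR.N[LIV]IG[DE]H.DY)/g) {
        my $start = pos($protein) - length($1) + 1;
        push @hits, "$start:$1";
    }
    return @hits;
}

# Short excerpts copied from the sequences above (spaces removed)
my %excerpt = (
    SEQ1 => "KQYIFGRANIIGEHNDYVVVRL",
    SEQ2 => "MLDDRARMEAAKKEKVEQIL",
    SEQ3 => "SRVGRSGRDNVIGDHVDYVTDVY",
);

for my $name (sort keys %excerpt) {
    my @hits = find_signature($excerpt{$name});
    print "$name: ", (@hits ? "@hits" : "no signature"), "\n";
}
```

      To answer the exercise for the full sequences, read each FASTA record, remove the whitespace and pass the resulting string to the same subroutine.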
  61. Arrays
      Definitions
      • A scalar variable contains a scalar value: one number or one string. A string might contain many words, but Perl regards it as one unit.
      • An array variable contains a list of scalar data: a list of numbers or a list of strings or a mixed list of numbers and strings. The order of elements in the list matters.
      Syntax
      • Array variable names start with an @ sign.
      • You may use in the same program a variable named $var and another variable named @var, and they will mean two different, unrelated things.
      Example
      • Assume we have a list of numbers which were obtained as a result of some measurement. We can store this list in an array variable as follows:
          @msr = (3, 2, 5, 9, 7, 13, 16);
  62. The foreach construct
      The foreach construct iterates over a list of scalar values (e.g. those contained in an array) and executes a block of code for each of the values.
      • Example:
          foreach $i (@some_array) {
              statement_1;
              statement_2;
              statement_3;
          }
        Each element in @some_array is aliased to the variable $i in turn, and the block of code inside the curly brackets {} is executed once for each element.
      • The variable $i (or give it any other name you wish) is local to the foreach loop and regains its former value upon exiting the loop.
      • Remark: if you leave out the loop variable, foreach uses $_.
  63. Examples for using the foreach construct (cont.)
      • Calculate the sum of all array elements:
          #!/usr/local/bin/perl
          @msr = (3, 2, 5, 9, 7, 13, 16);
          $sum = 0;
          foreach $i (@msr) {
              $sum += $i;
          }
          print "sum is: $sum\n";
  64. Accessing individual array elements
      Individual array elements may be accessed by indicating their position in the list (their index).
      Example:
          @msr = (3, 2, 5, 9, 7, 13, 16);

          index   value
          0       3
          1       2
          2       5
          3       9
          4       7
          5       13
          6       16
      First element: $msr[0] (here has the value of 3),
      Third element: $msr[2] (here has the value of 5),
      and so on.
  65. The sort function
      The sort function receives a list of variables (or an array) and returns the sorted list.
          @array2 = sort (@array1);
      Example:
          #!/usr/local/bin/perl
          @countries = ("Israel", "Norway", "France", "Argentina");
          @sorted_countries = sort (@countries);
          print "ORIG: @countries\n", "SORTED: @sorted_countries\n";
      Output:
          ORIG: Israel Norway France Argentina
          SORTED: Argentina France Israel Norway
      Example:
          #!/usr/local/bin/perl
          @numbers = (1, 2, 4, 16, 18, 32, 64);
          @sorted_num = sort (@numbers);
          print "ORIG: @numbers \n", "SORTED: @sorted_num \n";
      Output:
          ORIG: 1 2 4 16 18 32 64
          SORTED: 1 16 18 2 32 4 64
      Note that sorting numbers does not happen numerically, but by the string values of each number.
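      To sort numerically, sort can be given a comparison block that uses the numeric comparison operator <=>. A minimal sketch; the helper name sort_numerically is made up.

```perl
use strict;
use warnings;

# Sort a list of numbers by numeric value instead of by string value.
# $a and $b are the two elements that sort is currently comparing.
sub sort_numerically {
    my @values = @_;
    return sort { $a <=> $b } @values;
}

my @numbers    = (64, 4, 18, 1, 32, 2, 16);
my @as_strings = sort @numbers;
my @ascending  = sort_numerically(@numbers);
my @descending = sort { $b <=> $a } @numbers;

print "as strings:  @as_strings\n";
print "numerically: @ascending\n";
print "descending:  @descending\n";
```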
  66. The push and shift functions
      The push function adds a variable or a list of variables to the end of a given array.
      Example:
          $a = 5;
          $b = 7;
          @array = ("David", "John", "Gadi");
          push (@array, $a, $b);
          # @array is now ("David", "John", "Gadi", 5, 7)
      The shift function removes the first element of a given array and returns this element.
      Example:
          @array = ("David", "John", "Gadi");
          $k = shift (@array);
          # @array is now ("John", "Gadi");
          # $k is now "David"
      Note that after both the push and shift operations the given array @array is changed!
  67. Perl Array review
      • An array is designated with the '@' sign
      • An array is a list of individual elements
      • Arrays are ordered
          - Your list stays in the same order that you created it, although you can add or remove elements at the front or back of the list
      • You access array elements by number, using the special syntax:
          - $array[1] returns the second element of the array (remember Perl starts counting at zero)
      • You can do anything with an array element that you can do with a scalar variable (addition, subtraction, printing … whatever)
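  68. Generate random sequence string
          @a = ("A","C","G","T");       # the alphabet only needs to be defined once
          for ($n=1; $n<=50; $n++) {
              $b = $a[rand(@a)];        # rand(@a) is a number from 0 up to (not including) 4; the index is truncated to an integer
              $r .= $b;
          }
          print $r;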
  69. Text Processing Functions: the split function
      • The split function splits a string into a list of substrings according to the positions of a given delimiter. The delimiter is written as a pattern enclosed by slashes: /PATTERN/.
      • Examples:
          $string = "programming::course::for::bioinformatics";
          @list = split (/::/, $string);
          # @list is now ("programming", "course", "for", "bioinformatics")
          # $string remains unchanged.

          $string = "protein kinase C\t450 Kilodaltons\t120 Kilobases";
          @list = split (/\t/, $string);    # \t indicates tab
          # @list is now ("protein kinase C", "450 Kilodaltons", "120 Kilobases")
  70. Text Processing Functions: the join function
      • The join function does the opposite of split. It receives a delimiter and a list of strings, and joins the strings into a single string, such that they are separated by the delimiter.
      • Note that the delimiter is written inside quotes.
      • Examples:
          @list = ("programming", "course", "for", "bioinformatics");
          $string = join ("::", @list);
          # $string is now "programming::course::for::bioinformatics"

          $name = "protein kinase C";
          $mol_weight = "450 Kilodaltons";
          $seq_length = "120 Kilobases";
          $string = join ("\t", $name, $mol_weight, $seq_length);
          # $string is now:
          # "protein kinase C\t450 Kilodaltons\t120 Kilobases"
  71. When is an array not good enough?
      • Sometimes you want to associate a given value with another value (name/value pairs)
          (Rob => 353-7236, Matt => 353-7122, Joe_anonymous => 555-1212)
          (Acc#1 => sequence1, Acc#2 => sequence2, Acc#n => sequence-n)
      • You could put this information into an array, but it would be difficult to keep your names and values together (what happens when you sort? Yuck)
  72. Problem solved: the associative array
      • As the name suggests, an associative array allows you to link a name with a value
      • In Perl-speak: associative array = hash
          - 'hash' is the preferred term, for various arcane reasons, including that it is easier to say.
      • Consider an array: the elements (values) are each associated with a name, the index position. These index positions are numerical, sequential, and start at zero.
      • A hash is similar to an array, but we get to name the index positions anything we want
  74. The 'structure' of a Hash
      • An array looks something like this:
                     0          1          2          Index
          @array =   val1       val2       val3       Value
      • A hash looks something like this:
                     Rob        Matt       Joe_A      Key (name)
          %phone =   353-7236   353-7122   555-1212   Value
  75. Creating a hash
      • There are several methods for creating a hash. The simplest way: assign a list to a hash.
          %hash = ('rob', 56, 'joe', 17, 'jeff', 'green');
      • Perl is smart enough to know that since you are assigning a list to a hash, you meant to alternate keys and values.
          %hash = ('rob' => 56, 'joe' => 17, 'jeff' => 'green');
      • The arrow ('=>') notation helps some people, and clarifies which keys go with which values. The Perl interpreter sees '=>' as a comma.
  76. Getting at values
      • You should expect by now that there is some way to get at a value, given a key.
      • You access a hash value like this:
          $hash{'key'}
      • This should look somewhat familiar
          - $array[21] : refers to a value associated with a specific index position in an array
          - $hash{key} : refers to a value associated with a specific key in a hash
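      A short sketch of hash lookups. The phone numbers are the ones used on the earlier slide; the helper name count_bases, the missing key Ann and the test sequence are made up for this illustration.

```perl
use strict;
use warnings;

# Count how often each base occurs: the base is the key, the count is the value.
sub count_bases {
    my ($dna) = @_;
    my %count;
    $count{$_}++ for split //, uc $dna;
    return %count;
}

my %phone = (Rob => '353-7236', Matt => '353-7122', Joe_anonymous => '555-1212');
print "Matt: $phone{Matt}\n";                          # look up a value by its key
print "No entry for Ann\n" unless exists $phone{Ann};  # test for a key without creating it

my %count = count_bases("GATTACA");
for my $base (sort keys %count) {                      # keys come back in no particular order
    print "$base: $count{$base}\n";
}
```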
  77. Programming in general and Perl in particular
      • There is more than one right way to do it. Unfortunately, there are also many wrong ways.
          1. Always check and make sure the output is correct and logical. Consider what errors might occur, and take steps to ensure that you are accounting for them.
          2. Check to make sure you are using every variable you declare. Use strict!
          3. Always go back to a script once it is working and see if you can eliminate unnecessary steps. Concise code is good code. You will learn more if you optimize your code. Concise does not mean comment free. Please use as many comments as you think are necessary. Sometimes you want to leave easy to understand code in, rather than short but difficult to understand tricks. Use your judgment. Remember that in the future, you may wish to use or alter the code you wrote today. If you don't understand it today, you won't tomorrow.
  78. Programming in general and Perl in particular
      • Develop your program in stages. Once part of it works, save the working version to another file (or use a source code control system like RCS) before continuing to improve it.
      • When running interactively, show the user signs of activity. There is no need to dump everything to the screen (unless requested to), but a few words or a number change every few minutes will show that your program is doing something.
      • Comment your script. Any information on what it is doing or why might be useful to you a few months later.
      • Decide on a coding convention and stick to it. For example:
          - for variable names, begin globals with a capital letter and privates (my) with a lower case letter
          - indent new control structures with (say) 2 spaces
          - line up closing braces, as in:
              if (....) {
                ...
                ...
              }
  79. CPAN
      • CPAN: The Comprehensive Perl Archive Network is available at www.cpan.org and is a very large repository of Perl modules for all kinds of tasks (including BioPerl)
  80. What is BioPerl?
      • An 'open source' project
          - http://bio.perl.org or http://www.cpan.org
      • A loose international collaboration of biologists/programmers
          - Nobody (that I know of) gets paid for this
      • A collection of Perl modules and methods for doing a number of bioinformatics tasks
          - Think of it as subroutines to do biology
      • Consider it a 'tool-box'
          - There are a lot of nice tools in there, and (usually) somebody else takes care of fixing parsers when they break
      • BioPerl code is portable: if you give somebody a script, it will probably work on their system
  81. Multi-line parsing
          use strict;
          use Bio::SeqIO;

          my $filename = "sw.txt";
          my $sequence_object;
          my $seqio = Bio::SeqIO->new(
              -format => 'swiss',       # quoted: a bareword is not allowed under strict
              -file   => $filename,
          );
          # read the Swiss-Prot file record by record and print each sequence
          while ($sequence_object = $seqio->next_seq) {
              my $sequentie = $sequence_object->seq();
              print $sequentie . "\n";
          }
  82. Live.pl
          #!e:\Perl\bin\perl.exe -w
          # script for fetching a GenBank entry and printing out its id and sequence
          use Bio::DB::GenBank;
          use Data::Dumper;

          $gb = Bio::DB::GenBank->new();
          $sequence_object = $gb->get_Seq_by_id('MUSIGHBA1');
          print Dumper($sequence_object);
          $seq1_id = $sequence_object->display_id();
          $seq1_s = $sequence_object->seq();
          print "seq1 display id is $seq1_id \n";
          print "seq1 sequence is $seq1_s \n";
  83. BioPerl 101: 2 essential tools
      • Data::Dumper to find out what class you're in
      • perl bptutorial (100 Bio::Seq) to find the available methods for that class
  84. Outline
      • Scripting: Perl (BioPerl/Python), examples spiders/bots
      • Databases: Genome Browser, examples biomart, galaxy
      • AI: Classification and clustering, examples WEKA (R, Rapidminer)
  85. Overview
      • Bots and Spiders
          - The web
          - Bots
          - Spiders
          - Real world examples
          - Bioinformatics applications
          - Perl: LWP libraries
          - Google hacks
          - Advanced APIs
          - Fetch data from NCBI / Ensembl /
  86. The web
      • The WWW part of the Internet is based on hyperlinks
      • So if one started to follow all hyperlinks, it would be possible to map almost the entire WWW
      • Everything you can do as a human (clicking, filling in forms, …) can be done by machines
  87. Bots
      • Webbots (web robots, WWW robots, bots): software applications that run automated tasks over the Internet
      • Bots perform tasks that are:
          - Simple
          - Structurally repetitive
          - Carried out at a much higher rate than would be possible for a human
      • An automated script fetches, analyses and files information from web servers at many times the speed of a human
      • Other uses:
          - Chatbots
          - IM / Skype / Wiki bots
          - Malicious bots and bot networks (zombies)
  88. Spiders
      • Webspiders / crawlers are programs or automated scripts which browse the World Wide Web in a methodical, automated manner. A spider is one type of bot
      • The spider starts with a list of URLs to visit, called the seeds
          - As the crawler visits these URLs, it identifies all the hyperlinks in the page
          - It adds them to the list of URLs to visit, called the crawl frontier
          - URLs from the frontier are recursively visited according to a set of policies
      • This process is called web crawling or spidering: in most cases a mean
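      The seed and frontier logic can be sketched without any network access. In the example below a hash stands in for the web (page name => pages it links to); the page names, the hash and the helper name crawl are all hypothetical. The frontier is handled as a queue, so pages are visited breadth-first, which is one possible visiting policy among many.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for the web: page => pages it links to.
my %links = (
    'a.html' => ['b.html', 'c.html'],
    'b.html' => ['c.html', 'd.html'],
    'c.html' => ['a.html'],
    'd.html' => [],
);

# Visit every page reachable from the seeds exactly once, in breadth-first order.
sub crawl {
    my ($graph, @seeds) = @_;
    my @frontier = @seeds;                    # URLs still to visit
    my %seen     = map { $_ => 1 } @seeds;    # URLs already queued or visited
    my @visited;
    while (@frontier) {
        my $url = shift @frontier;            # take the oldest URL from the frontier
        push @visited, $url;
        for my $next (@{ $graph->{$url} || [] }) {
            next if $seen{$next}++;           # skip links we have already queued
            push @frontier, $next;            # new link: add it to the frontier
        }
    }
    return @visited;
}

print join(" -> ", crawl(\%links, 'a.html')), "\n";
```

      In a real spider the hash lookup would be replaced by fetching the page and extracting its hyperlinks; the %seen hash is what keeps the crawler from looping forever on pages that link back to each other.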
  89. 89. SpidersUse of webcrawlers:  Mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches  Automating maintenance tasks on a website, such as checking links or validating HTML code  Can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses Most common used crawler is probably the GoogleBot crawler  Crawls  Indexes (content + key content tags and attributes, such as Title tags and ALT attributes)  Serves results: PageRank Technology 89
  90. Spiders
  91. Perl - LWP
      • LWP (also known as libwww-perl)
          - The World-Wide Web library for Perl
          - Set of Perl modules which provides a simple and consistent application programming interface (API) to the World-Wide Web
          - Free book: http://lwp.interglacial.com/
      • LWP for newbies
          - LWP::Simple (demo1)
          - Go to a URL, fetch data, ready to parse
          - Attention: HTML tags and regular expressions
  92. Perl - LWP
      • Some more advanced features
          - LWP::UserAgent (demo2: show server access logs)
          - Fill in forms and parse results
          - Depending on content: follow hyperlinks to other pages and parse these again, …
      • Bioinformatics examples
          - Use genome browser data (demo3) and sequences
          - Get gene aliases and symbols from GeneCards (demo4)
  93. Google hacks
      • Why not make use of the crawling, indexing and serving technologies of others (e.g. Google)?
          - Google allows automated queries: 1000 queries a day per account
          - Google uses snippets: the short pieces of text you get in the main search results
          - This is the result of its indexing and parsing algorithms
      • Demo5: LWP and Google combined, and parsing the results
  94. Advanced APIs
      • An application programming interface (API) is a source code interface that an operating system, library or service provides to support requests made by computer programs
          - Language-dependent APIs
          - Language-independent APIs are written in a way that they can be called from several programming languages. This is a desired feature for a service-style API which is not bound to a particular process or system and is available as a remote procedure call
  95. Advanced APIs
      • Google example used Google API / SOAP
      • NCBI API
          - The NCBI Web service is a web program that enables developers to access Entrez Utilities via the Simple Object Access Protocol (SOAP)
          - Programmers may write software applications that access the E-Utilities using any SOAP development tool
          - Main tools (demo6):
              E-Search: searches and retrieves primary IDs and term translations, and optionally retains results for future use in the user's environment
              E-Fetch: retrieves records in the requested format from a list of one or more primary IDs
      • Ensembl API (demo7)
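      As a rough sketch of what an E-Search request looks like, the code below only builds the request URL and does not contact NCBI. Note the swap: it uses the plain URL interface of the E-utilities rather than the SOAP service described on the slide. The helper names url_encode and esearch_url and the example query are made up; the query is assumed to contain ASCII characters only.

```perl
use strict;
use warnings;

# Base address of the E-utilities URL interface
my $EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';

# Percent-encode everything except unreserved URL characters (ASCII input assumed).
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Build an E-Search URL: db = database, term = query, retmax = maximum number of IDs.
sub esearch_url {
    my ($db, $term, $retmax) = @_;
    return "$EUTILS/esearch.fcgi?db=" . url_encode($db)
         . "&term=" . url_encode($term)
         . "&retmax=" . int($retmax);
}

print esearch_url('pubmed', 'DNA methylation AND cancer', 20), "\n";
```

      The resulting URL can be fetched with LWP::Simple or LWP::UserAgent; the IDs in the response are then passed to E-Fetch to retrieve the records themselves.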
  96. Fetch data from NCBI
      • A frequently used NCBI database is PubMed
          - PubMed can be queried using E-Utils
          - Uses the same syntax as the regular PubMed website
          - Get the data back in the same data formats as on the website (XML, plain text)
          - Parse XML results, and more advanced text-mining techniques
      • Demo8
          - Parse results and present them in an interface (http://matrix.ugent.be/mate/methylome/result1.html)
  97. Fetch data from NCBI
      • Example: PubMeth
          - Get data from NCBI PubMed
          - Get all genes and all aliases for human genes and their annotations from Ensembl & GeneCards
          - Get all cancer types from a cancer thesaurus
          - Parse PubMed results: find genes and aliases; keywords
          - Keep variants in mind (regexes are very useful)
          - Sort the PubMed abstracts and store found genes and keywords in a database; apply a scoring scheme
  98. Outline
      • Scripting: Perl (BioPerl/Python), examples spiders/bots
      • Databases: Genome Browser, examples biomart, galaxy
      • AI: Classification and clustering, examples WEKA (R, Rapidminer)
  99. The three genome browsers
      • There are three main browsers:
          - Ensembl
          - NCBI MapViewer
          - UCSC
      • At first glance their main distinguishing features are:
          - MapViewer is arranged vertically.
          - Ensembl has multiple (22) different "Views".
          - UCSC has a single "View" for (almost) everything.
  100. MapViewer Home: http://www.ncbi.nlm.nih.gov/mapview/
  101. MapViewer Master Map
  102. Selecting tracks on MapViewer
  103. MapViewer strengths
      • Good coverage of plant and fungal genomes.
      • Close integration with other NCBI tools and databases, such as Model Maker, trace archives or Celera assemblies.
      • Vertical view enables convenient overview of regional gene descriptions.
      • Discontiguous MEGABLAST is probably the most sensitive tool available for cross-species sequence queries.
      • Ability to view multiple assemblies (e.g. Celera and reference) simultaneously.
  104. MapViewer limitations
      • Little cross-species conservation or alignment data.
      • Inability to upload custom annotations and data.
      • Limited capability for batch data access.
      • Limited support for automated database querying.
      • Vertical view makes base-pair level annotation cumbersome.
  105. UCSC Genome Browser
  106. http://genome.ucsc.edu/
  107. UCSC Genome Browser
  108. Strengths of the UCSC Browser (I)
      For this course I will be focusing primarily on the UCSC Browser for several reasons:
      • Strong comparative genomics capabilities.
      • Fast response
          - sequence searches performed with BLAT.
          - code is written in speed-optimized C.
          - multiple indexing and non-normalized tables for fast database retrieval.
      • (Essentially) single "view" from single base-pair to entire chromosome.
      • Easiest interface for loading custom annotations.
  109. UCSC Browser Strengths (II)
      • Well suited for batch and automated querying of both gene and intergenic regions.
      • Comprehensive: tends to have the most species, genes and annotations.
      • Annotations frequently updated (GenBank/RefSeq daily, ESTs weekly).
      • Able to find "similar" genes easily with GeneSorter.
      • Rapid access to in situ images with VisiGene.
  110. 110. UCSC browser limitations• Lack of “overview” mode can make it harder to see genomic context.• Syntenic regions cannot be viewed simultaneously.• Cross species sequence queries with BLAT are often insensitive.• Comprehensiveness of database can make user interface intimidating.• Code access for commercial users requires licensing. 110
  111. 111. Human, mouse, rat synteny in MapViewer
  112. 112. Browser/Database Batch Querying
  113. 113. Batch querying overview• Introduction / motivation• UCSC table browser• Custom tracks and frames• Galaxy and direct SQL database querying• A batch query example• UCSC Database “gotchas”• Batch querying on Ensembl 113
  114. 114. Why batch querying
• Interactive querying is difficult if you want to study numerous “interesting” genomic regions.
• Querying each region interactively is:
  ◦ Tedious
  ◦ Time-consuming
  ◦ Error prone
  115. 115. Batch querying examples
• As an example, say you have found one hundred candidate polymorphisms and you want to know:
  ◦ Are they in dbSNP?
  ◦ Do they occur in any known ESTs?
  ◦ Are the sites conserved in other vertebrates?
  ◦ Are they near any “LINE” repeat sequences?
Of course you could repeat the procedures described in Part II one hundred times, but that would get “old” very fast…
  116. 116. Other examples
• Other examples include characterizing multiple:
  ◦ Non-coding RNA candidates
  ◦ Ultra-conserved regions
  ◦ Introns hosting snoRNA genes
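The idea behind a batch query can be sketched in a few lines of Python: instead of looking up each candidate site by hand, all sites are checked against a set of annotated intervals in one pass. This is only an illustration; the interval labels, coordinates and the helper name `annotate` are hypothetical and do not come from any real browser table.

```python
# Hypothetical annotation intervals: (chrom, start, end, label), 0-based, half-open.
ANNOTATIONS = [
    ("chr22", 10000000, 10004000, "LINE_repeat_1"),
    ("chr22", 10005000, 10009000, "EST_block_1"),
    ("chr1", 500, 900, "LINE_repeat_2"),
]

# Hypothetical candidate polymorphism positions: (chrom, position).
CANDIDATES = [("chr22", 10003500), ("chr22", 10004500), ("chr1", 600)]


def annotate(candidates, annotations):
    """Return {site: [labels of all intervals that contain the site]}."""
    by_chrom = {}
    for chrom, start, end, label in annotations:
        by_chrom.setdefault(chrom, []).append((start, end, label))
    report = {}
    for chrom, pos in candidates:
        # Simple linear scan per chromosome; fine for a sketch, slow for whole genomes.
        report[(chrom, pos)] = [
            label for start, end, label in by_chrom.get(chrom, []) if start <= pos < end
        ]
    return report


report = annotate(CANDIDATES, ANNOTATIONS)
for site, hits in report.items():
    print(site, hits or "no overlap")
```

For one hundred sites the loop is the same; only the input lists grow.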
  117. 117. Browsers and databases• Each of the genome browsers is built on top of multiple relational databases.• Typically data for each genome assembly are stored in a separate database and auxiliary data, e.g. gene ontology (GO) data, are stored in yet other databases.• These databases may have hundreds of tables, many with millions of entries. 117
  118. 118. The UCSC Table Browser• For batch queries, you need to query the browser databases.• The conventional way of querying a relational database is via “Structured Query Language” (SQL).• However with the Table Browser, you can query the database without using SQL. 118
  119. 119. Browser Database Formats
Nevertheless, even with the Table Browser, you need some understanding of the underlying track, table and file formats.
• Table formats describe how data is stored in the (relational) databases.
• Track formats describe how the data is presented on the browser.
• File formats describe how the data is stored in “flat files”, i.e. conventional computer files.
• Finally, for understanding the underlying computer code (as we will do in the last part of this tutorial) you will need to learn about the “C” structures which hold the data in the source code.
  120. 120. Main UCSC Data Formats
• GFF/GTF
• BED (Browser Extensible Data): lists of genomic blocks
• PSL: RNA/DNA alignments
• .chain: pair-wise cross-species alignments
• .maf: multiple genome alignments
• .wig: numerical data
  121. 121. Custom Tracks
• Custom tracks are essentially BED, PSL or GTF files with formatting lines so they can be displayed on the browser.
• A custom track file can contain multiple tracks, which may be in different formats.
• Custom tracks are useful for:
  ◦ Display of regions of interest on the browser.
  ◦ Sharing custom data with others.
  ◦ Input of multiple, arbitrary regions for annotation by the Table Browser.
• Custom tracks can be made by the Table Browser, or you can make them easily yourself.
  122. 122. Selecting custom track output 122
  123. 123. Sending custom track to browser 123 123
  124. 124. Adding a custom track 124 124
  125. 125. Adding a custom track (II) 125
  126. 126. Custom track example
browser position chr22:10000000-10020000
browser hide all
track name=clones description="Clones" visibility=3 color=0,128,0 useScore=1
chr22 10000000 10004000 cloneA 960
chr22 10002000 10006000 cloneB 200
chr22 10005000 10009000 cloneC 700
chr22 10006000 10010000 cloneD 600
chr22 10011000 10015000 cloneE 300
chr22 10012000 10017000 cloneF 100
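A custom track like the one above can also be generated from a list of regions. The following is a minimal sketch, assuming five-column BED records (chrom, start, end, name, score); the function name `make_custom_track` and its parameters are hypothetical, and only the `browser` and `track` line keywords shown on the slide are used.

```python
def make_custom_track(name, description, regions, position=None,
                      visibility=3, color=(0, 128, 0), use_score=True):
    """Build the text of a UCSC-style custom track from BED-like records."""
    lines = []
    if position:
        lines.append(f"browser position {position}")
    lines.append("browser hide all")
    color_text = ",".join(str(c) for c in color)
    track = (f'track name={name} description="{description}" '
             f"visibility={visibility} color={color_text}")
    if use_score:
        track += " useScore=1"
    lines.append(track)
    for chrom, start, end, label, score in regions:
        # BED fields are whitespace-separated; tabs are used here.
        lines.append(f"{chrom}\t{start}\t{end}\t{label}\t{score}")
    return "\n".join(lines) + "\n"


clones = [
    ("chr22", 10000000, 10004000, "cloneA", 960),
    ("chr22", 10002000, 10006000, "cloneB", 200),
]
track_text = make_custom_track("clones", "Clones", clones,
                               position="chr22:10000000-10020000")
print(track_text)
```

The resulting text can be saved to a file or pasted into the browser's custom track form.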
  127. 127. Limitations of the table browser• Can be difficult to create more complex queries.• With hundreds of tables, finding the one(s) you want can be confusing.• Getting intersections or unions of genomic regions is often a multi-step process and can be tedious or error prone.• May be slower than direct SQL query.• Not designed for fully automated operation. 127
  128. 128. Ensembl 128
  129. 129. Ensembl Home http://www.ensembl.org/ 129
  130. 130. Ensembl ContigView 130
  131. 131. Ensembl ContigView 131
  132. 132. Detail and Basepair view 132
  133. 133. Changing tracks in Ensembl 133
  134. 134. Ensembl strengths (I)• Multiple view levels shows genomic context.• Some annotations are more complete and/or are more clearly presented (e.g. snpView of multiple mouse strain data.)• Possible to create query over more than one genome database at a time (with BioMart). 134 134
  135. 135. Ensembl snpView 135
  136. 136. Ensembl strengths (II)• Batch and automated querying well supported and documented (especially for perl and java).• API (programmer interface) is designed to be identical for all databases in a release.• Ensembl tends to be more “community oriented” - using standard, widely used tools and data formats.• All data and code are completely free to all. 136
  137. 137. Ensembl is “community oriented” • Close alliances with Wormbase, Flybase, SGD • “support for easy integration with third party data and/or programs” – BioMart • Close integration with R/ Bioconductor software • More use of community standard formats and programs, e.g. DAS, GFF/GTF, Bioperl ( Note: UCSC also supports GFF/GTF and is compatible with R/Bioconductor and DAS, but UCSC tends to use more “homegrown” formats, e.g. BED, PSL, and tools.) 137
  138. 138. Ensembl limitations• Limited data quantifying cross-species sequence conservation.• Batch queries for intergenic regions with BioMart are difficult.• BioMart offers less complete access to database than UCSC Table Browser. (However, the user interface to BioMart is easier.) 138
  139. 139. BioMart• BioMart - the Ensembl “Table browser”• Similar to the Table Browser and Galaxy tools.• Previous version was called EnsMart.• Fewer tables can be accessed with BioMart than with UCSC Table Browser. In particular, non-gene oriented queries may be difficult.• However, the user interface is simpler.• Tight interface with Bioconductor project for annotation of microarray genes. 139
  140. 140. The Galaxy Website• Galaxy website: http://g2.bx.psu.edu• Galaxy objective: Provide sequence and data manipulation tools (a la SRS or the UCSD Biology Workbench) that are capable of being applied to genomic data.• The intent is to provide an easy interface to numerous analysis tools with varied output formats that can work on data from multiple browsers / databases. 140
  142. 142. Demo: Galaxy Genomics Toolkit• Galaxy is a web interface to bioinformatics tools that deal with genome-scale data• There is a public server with many pre-installed tools• Many tools work with genomic intervals• Other tools work with various types of tab delimited data formats, and some directly on DNA sequences• It has excellent tools to access public data• It can be installed on a local computer or set up as an institutional server• Can access a standard or custom build on Amazon “Cloud”• Any command line tool or web service can easily be wrapped into the Galaxy interface. 142
  143. 143. Genome-Scale Data
• Bioinformatics work is challenging on very large “genomics” data sets:
  ◦ sequencing, gene expression, variants, ChIP-seq
• Complex command line programs
• Genome Browsers
• New tools
  144. 144. The Galaxy Interface has 3 parts
• List of tools
• Central work panel
• History = data & results
  145. 145. Load Data from UCSC Or upload from your computer 145
  146. 146. Demo: Galaxy Genomics Toolkit
• A Galaxy instance is running at http://athos.ugent.be:8080.
• Log in (as admin: new@new.be, password: newnew).
• The cleanfq history contains 2 pairs of FASTQ files, a reference FASTA file and a reference GTF file.
  147. 147. Workflows• Galaxy saves your data, and results in the History• The exact commands and parameters used with each operation with each tool are also saved.• These operations can be saved as a “Workflow”, which can be reused, and shared with other users. 147
  148. 148. • Galaxy has many public data sets and public workflows, which can be easily used in your projects (or a tutorial) 148
  149. 149. NGS tools• Galaxy has recently been expanded with tools to analyze Next-Gen Sequence data• File format conversions• Analysis methods specific to different sequencing platforms (454, Illumina, SOLID)• Analysis methods specific to different applications (RNA-seq, ChIP-seq, mutation finding, metagenomics, etc). 149
  150. 150.
• NGS tools include file format conversion, mapping to a reference genome, ChIP-seq peak calling, RNA-seq gene expression, etc.
• NGS data analysis uses large files, which are slow to upload and slow to process on a public server.
  151. 151. A number of groups have set up custom Galaxy servers with special tools
  152. 152. The SPARQLing future 152
  153. 153. Outline• Scripting Perl (Bioperl/Python) examples spiders/bots• Databases Genome Browser examples biomart, galaxy• AI Classification and clustering examples WEKA (R, Rapidminer) 153
  154. 154. What is “intelligent”?
• Intelligence = the ability to learn and understand, to solve problems, and to make decisions.
Machine learning …
  155. 155. Turing test for intelligence
THE IMITATION GAME
• Woman
• Man/Machine
• Interrogator: which of the two is the woman?
  156. 156. What is “artificial”?
• Artificial = man-made, not of natural origin.
• In the context of A.I.: machines, usually a digital computer.
• H. Simon: analogy between a human and a digital computer:
  ◦ memory
  ◦ execution unit
  ◦ control unit
  157. 157. Data mining
• WHAT? Extracting knowledge from data.
• Split the data into three groups:
  ◦ training set
  ◦ validation set
  ◦ test set
• Clustering/Classification
  158. 158. Clustering
• WHAT? “Unsupervised learning”: the answer for the training data is not known.
• The result is usually presented as a tree structure.
• Important method: hierarchical clustering, which starts by building a distance matrix.
  159. 159. Cluster Analysis
• Unsupervised methods
• Descriptive modeling:
  ◦ Grouping of genes with “similar” expression profiles
  ◦ Grouping of disease tissues, cell lines, or toxicants with “similar” effects on gene expression
• Clustering algorithms:
  ◦ Self-organizing maps
  ◦ Hierarchical clustering
  ◦ K-means clustering
  ◦ SVD
  160. 160. Linkage in Hierarchical Clustering
For two clusters $A$ and $B$, with $a \in A$, $b \in B$ and point distance $d(a,b)$:
• Single linkage: $S(A,B) = \min_{a} \min_{b} d(a,b)$
• Average linkage: $A(A,B) = \left(\sum_{a} \sum_{b} d(a,b)\right) / (|A|\,|B|)$
• Complete linkage: $C(A,B) = \max_{a} \max_{b} d(a,b)$
• Centroid linkage: $M(A,B) = d(\mathrm{mean}(A), \mathrm{mean}(B))$
• Hausdorff linkage: $h(A,B) = \max_{a} \min_{b} d(a,b)$, $H(A,B) = \max(h(A,B), h(B,A))$
• Ward linkage: $W(A,B) = |A|\,|B|\,M(A,B)^2 / (|A| + |B|)$
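The linkage definitions on this slide translate almost directly into code. The sketch below assumes clusters are lists of coordinate tuples and that $d$ is the Euclidean distance (the slide does not fix a particular distance); the function names are chosen for illustration.

```python
from itertools import product
from math import dist  # Euclidean distance, Python 3.8+


def single(A, B):
    return min(dist(a, b) for a, b in product(A, B))


def complete(A, B):
    return max(dist(a, b) for a, b in product(A, B))


def average(A, B):
    return sum(dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))


def centroid(A, B):
    def mean(points):
        return tuple(sum(coords) / len(points) for coords in zip(*points))
    return dist(mean(A), mean(B))


def hausdorff(A, B):
    def h(X, Y):
        # Directed distance: the worst-case nearest-neighbour distance from X to Y.
        return max(min(dist(x, y) for y in Y) for x in X)
    return max(h(A, B), h(B, A))


def ward(A, B):
    return len(A) * len(B) * centroid(A, B) ** 2 / (len(A) + len(B))


A = [(0, 0), (0, 1)]
B = [(3, 0), (4, 0)]
for linkage in (single, average, complete, centroid, hausdorff, ward):
    print(linkage.__name__, round(linkage(A, B), 3))
```

Single linkage looks only at the closest pair and complete linkage only at the farthest pair, which is why the two can rank the same clusters very differently.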
  161. 161. Hierarchical Clustering 3 clusters? 2 clusters? 161
  162. 162. Classification
• WHAT? “Supervised learning”: the answer for the training data is known.
• Several classification methods:
  ◦ decision tree
  ◦ neural networks
  ◦ support vector machines
  163. 163. Decision tree
Example: tennis
  164. 164. Neural networks
• STRUCTURE: neurons and connections
• TASK: processing input data; machine learning
  165. 165. Support Vector Machines
Achieving a linear separation of the data by changing the dimensions.
  166. 166. Bioinformatics applications
• Decision tree: searching for DNA sequences homologous to a given DNA sequence.
• Neural networks: modelling and analysing gene expression data, predicting protease cleavage sites.
• Support Vector Machines: identifying genes involved in anti-cancer mechanisms, detecting homology between proteins, gene expression analysis.
  167. 167. Bioinformatics applications
• Hierarchical clustering: building phylogenetic trees from DNA sequences.
• Genetic algorithms: molecular recognition, elucidating the relationship between structure and function, Multiple Sequence Alignment.
• Expert systems: detecting injuries, early detection of heart valve abnormalities.
• Fuzzy logic: primer design, predicting the function of an unknown gene, expression analysis.
  168. 168. Outline 168
  169. 169. Classification
[Figure: samples labelled C (cancer) or N (normal) are separated by the OMS classifier.]
  170. 170. Classification
[Figure: samples labelled R (responder) or N (non-responder) are separated by the OMS classifier.]
  171. 171. Outline 171
  172. 172. OMS Classifier using “Methylation”
Patient sample → measuring methylation → OMS classifier → Cancer / Normal
Gene        Gen 1  Gen 2  Gen 3  …  Gen n
Methylated  +      -      -      …  +
  173. 173. Why use methylation as a biomarker?
• What is a feature/biomarker?
  ◦ A characteristic that is objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention.
• Business/biological feature selection/reduction:
  ◦ Of all possible (molecular and clinical) features, oncomethylome measures methylation (in cancer/onco).
  174. 174. Outline 174
  175. 175. Data preparation and modelling
• Data preparation
  ◦ Construct binary features « Methylated » from PCR data (Ct and Temp)
• Modelling
  ◦ Construct classifier (cancer vs normal) from features « Methylated »
  176. 176. Data Preparation: Feature Construction
Sample → Methylation Specific Quantitative PCR:
Gene  Gen 1  Gen 2  Gen 3  …  Gen n
Temp  78     81     69     …  72
Ct    25     38     24     …  27
Feature construction: “gene Methylated in sample”:
Gene        Gen 1  Gen 2  Gen 3  …  Gen n
Methylated  +      -      -      …  +
Compute « methylated » as a function of Temp and Ct.
  177. 177. Construction of features « Methylated »
• Per gene: find a boolean function
  ◦ Methylated IFF: Ct below upper bound AND Temp above lower bound
• Taking into account:
  ◦ All Ct and Temp measurements from Methylation Specific Quantitative PCR (QMSP) for normals and cancers
  ◦ Noise in QMSP measurements, as observed per gene during Quality Control
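The boolean rule on this slide can be written as a one-line function. The sketch below applies it to the Ct and Temp values from the feature-construction table; the cut-offs `CT_UPPER = 30` and `TEMP_LOWER = 70` are assumed purely for illustration (the slides state that cut-offs are found per gene and do not give their values).

```python
def is_methylated(ct, temp, ct_upper, temp_lower):
    """Methylated IFF Ct is below its upper bound AND Temp is above its lower bound."""
    return ct < ct_upper and temp > temp_lower


# (Ct, Temp) per gene, taken from the feature-construction table.
MEASUREMENTS = {
    "Gen 1": (25, 78),
    "Gen 2": (38, 81),
    "Gen 3": (24, 69),
    "Gen n": (27, 72),
}

# Hypothetical cut-offs, shared by all genes here for simplicity.
CT_UPPER, TEMP_LOWER = 30, 70

calls = {
    gene: is_methylated(ct, temp, CT_UPPER, TEMP_LOWER)
    for gene, (ct, temp) in MEASUREMENTS.items()
}
for gene, methylated in calls.items():
    print(gene, "+" if methylated else "-")
```

With these assumed cut-offs the calls reproduce the +, -, -, + pattern shown in the table; Gen 2 fails on Ct and Gen 3 fails on Temp.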
  178. 178. Construction of features « Methylated »
[Figure: plot of all Ct and Temp measurements for a given gene; axes: Temp and Ct.]
What about noise?
  179. 179. Noise
Noise: random error or variance in a measured variable.
Incorrect attribute values may be due to:
  ◦ Quantity not correctly compared to calibration (e.g., ruler slips)
  ◦ Inaccurate calibration device (e.g., ruler > 1 m)
  ◦ Precision (e.g., truncated to nearest mile or Ångstrom unit)
  ◦ Data entry problems
  ◦ Data transmission problems
  ◦ Inconsistency in naming convention
  180. 180. Construction of features « Methylated »
Taking into account noise. QC: StdDev of Ct and Tm in IVM.
[Figure: cancer and normal measurements relative to the cut-off for a non-robust and a robust assay; StdDev labels shown: 1.6, 0.3, 0.02, 3.5.]
  181. 181. Construction of features « Methylated »
Taking into account noise.
[Figure: methylated calls under good and bad reproducibility, for a blunt cut-off and for a sharp cut-off.]
  182. 182. Construction of features « Methylated »
Taking into account noise: find the most robust cut-off for each gene.
• Compute quality with increasing noise levels (0-2 times StdDev).
[Figure: quality (0 to 1) versus noise level (0 to 2 StdDev) for a non-robust and a robust assay.]
• Quality score based on a binomial test:
  ◦ 46 or more successes with 58 trials: unlikely (when probability of success = 80/179; expected number of successes = 21)
  ◦ 16 or more successes with 44 trials: likely (when probability of success = 77/175; expected number of successes = 19)
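The "likely" and "unlikely" verdicts of such a binomial test come from the upper tail probability $P(X \ge k)$ for $X \sim \mathrm{Binomial}(n, p)$. Below is a minimal sketch using only the standard library; pairing the trial counts with the success probabilities in the order they are listed on the slide is an assumption, and how the tail probability is turned into the quality score is not specified in the slides.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): probability of k or more successes."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed pairing of the figures listed on the slide.
p_first = binom_tail(46, 58, 80 / 179)   # 46 or more successes in 58 trials
p_second = binom_tail(16, 44, 77 / 175)  # 16 or more successes in 44 trials

print(f"P(X >= 46 | n=58, p=80/179) = {p_first:.2e}")
print(f"P(X >= 16 | n=44, p=77/175) = {p_second:.2f}")
```

A very small tail probability means the observed number of successes is hard to explain by chance at the background rate, whereas a large one means it is unremarkable.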
  183. 183. Construction of features « Methylated »
[Figure: Ct/Temp plot; methylated = inside the red box.]
  184. 184. Construction of features « Methylated »
[Figure: methylated/unmethylated calls for the ranked genes, cancer versus normal samples.]
  185. 185. Data preparation and modelling
• Data preparation
  ◦ Construct binary features « Methylated » from PCR data (Ct and Temp)
• Modelling
  ◦ Construct classifier (cancer vs normal) from « Methylated » features
  186. 186. Selection of modelling technique
• In theory, many techniques are applicable
  ◦ Data type: boolean methylation table, discrete classes
  ◦ See other talks today
• But additional requirements follow from business understanding (more details below)
  ◦ Feature selection: the final test should be based on at most ~5 genes
  ◦ Understandability
  ◦ Both provide a direct competitive advantage
• Example of an acceptable technique: decision trees
  187. 187. Decision trees: the Weka tool
http://www.cs.waikato.ac.nz/ml/weka/
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
  188. 188. Decision trees: attribute selection
outlook   temperature  humidity  windy  play
sunny     hot          high      FALSE  no
sunny     hot          high      TRUE   no
overcast  hot          high      FALSE  yes
rainy     mild         high      FALSE  yes
rainy     cool         normal    FALSE  yes
rainy     cool         normal    TRUE   no
overcast  cool         normal    TRUE   yes
sunny     mild         high      FALSE  no
sunny     cool         normal    FALSE  yes
rainy     mild         normal    FALSE  yes
sunny     mild         normal    TRUE   yes
overcast  mild         high      TRUE   yes
overcast  hot          normal    FALSE  yes
rainy     mild         high      TRUE   no
Play: $p_{yes} = 9/14$. Don't play: $p_{no} = 5/14$.
Select the attribute with maximal gain of information, i.e. maximal reduction of entropy:
$\mathrm{Entropy} = -p_{yes} \log_2 p_{yes} - p_{no} \log_2 p_{no} = -\tfrac{9}{14} \log_2 \tfrac{9}{14} - \tfrac{5}{14} \log_2 \tfrac{5}{14} = 0.94$ bits
http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.html
http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
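The 0.94 bits on this slide can be checked with a short entropy function. This is a minimal sketch; the function name `entropy` and its count-based interface are chosen here for illustration.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    # Empty classes contribute nothing (0 * log 0 is taken as 0).
    return -sum(c / total * log2(c / total) for c in counts if c)


play_entropy = entropy([9, 5])  # 9 "yes" and 5 "no" examples
print(f"{play_entropy:.2f} bits")
```

A pure node has zero entropy (`entropy([4, 0])` is 0) and an evenly split two-class node has the maximum of 1 bit.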
  189. 189. Decision trees: attribute selection
Root node (all 14 examples): 0.94 bits. The entropy of each branch is the amount of information required to specify the class of an example given that it reaches that node.
attribute    value     play  don't play  entropy    weight
outlook      sunny     2     3           0.97 bits  5/14
outlook      overcast  4     0           0.0 bits   4/14
outlook      rainy     3     2           0.97 bits  5/14
humidity     high      3     4           0.98 bits  7/14
humidity     normal    6     1           0.59 bits  7/14
temperature  hot       2     2           1.0 bits   4/14
temperature  mild      4     2           0.92 bits  6/14
temperature  cool      3     1           0.81 bits  4/14
windy        false     6     2           0.81 bits  8/14
windy        true      3     3           1.0 bits   6/14
Weighted sums and gains:
• outlook: 0.69 bits, gain 0.25 bits
• humidity: 0.79 bits, gain 0.15 bits
• temperature: 0.91 bits, gain 0.03 bits
• windy: 0.89 bits, gain 0.05 bits
  190. 190. Decision trees: attribute selection
Branch outlook = sunny: 0.97 bits.
outlook  temperature  humidity  windy  play
sunny    hot          high      FALSE  no
sunny    hot          high      TRUE   no
sunny    mild         high      FALSE  no
sunny    cool         normal    FALSE  yes
sunny    mild         normal    TRUE   yes
attribute    value   entropy    weight
humidity     high    0.0 bits   3/5
humidity     normal  0.0 bits   2/5
temperature  hot     0.0 bits   2/5
temperature  mild    1.0 bits   2/5
temperature  cool    0.0 bits   1/5
windy        false   0.92 bits  3/5
windy        true    1.0 bits   2/5
Weighted sums and gains:
• humidity: 0.0 bits, gain 0.97 bits
• temperature: 0.40 bits, gain 0.57 bits
• windy: 0.95 bits, gain 0.02 bits
  191. 191. Decision trees: attribute selection
Branch outlook = rainy: 0.97 bits.
outlook  temperature  humidity  windy  play
rainy    mild         high      FALSE  yes
rainy    cool         normal    FALSE  yes
rainy    cool         normal    TRUE   no
rainy    mild         normal    FALSE  yes
rainy    mild         high      TRUE   no
attribute    value   entropy    weight
humidity     high    1.0 bits   2/5
humidity     normal  0.92 bits  3/5
temperature  mild    0.92 bits  3/5
temperature  cool    1.0 bits   2/5
windy        false   0.0 bits   3/5
windy        true    0.0 bits   2/5
Weighted sums and gains:
• humidity: 0.95 bits, gain 0.02 bits
• temperature: 0.95 bits, gain 0.02 bits
• windy: 0.0 bits, gain 0.97 bits
  192. 192. Decision trees: final tree
outlook
  sunny → humidity
    high → don't play
    normal → play
  overcast → play
  rainy → windy
    false → play
    true → don't play
  193. 193. Decision trees: basic algorithm
• Initialize the top node to all examples
• While impure leaves are available:
  ◦ select the next impure leaf L
  ◦ find the splitting attribute A with maximal information gain
  ◦ for each value of A, add a child to L
  194. 194. Decision tree built from methylation table
• Leave-one-out experiment, to avoid overfitting
• Decision tree: test based on 12 genes
• Sensitivity: 80%
• Specificity: 88%
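The basic algorithm can be sketched in Python on the weather data from the Weka example. This is a recursive variant of the loop described on the slide (each impure node is split as soon as it is created) and is only an illustration, not Weka's implementation; the function names are chosen here.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "windy"]
ROWS = [line.split(",") for line in """sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no""".splitlines()]


def entropy(rows):
    """Entropy in bits of the class label (last column)."""
    counts = Counter(row[-1] for row in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def gain(rows, i):
    """Information gain obtained by splitting rows on attribute column i."""
    groups = {}
    for row in rows:
        groups.setdefault(row[i], []).append(row)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - remainder


def build(rows, attrs):
    """Return a class label for a pure leaf, else {attribute: {value: subtree}}."""
    classes = {row[-1] for row in rows}
    if len(classes) == 1:
        return classes.pop()
    if not attrs:  # no attribute left: fall back to the majority class
        return Counter(row[-1] for row in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda i: gain(rows, i))
    rest = [i for i in attrs if i != best]
    values = sorted({row[best] for row in rows})
    return {ATTRS[best]: {v: build([r for r in rows if r[best] == v], rest)
                          for v in values}}


tree = build(ROWS, list(range(len(ATTRS))))
print(tree)
```

On this data the sketch splits on outlook first, then on humidity in the sunny branch and on windy in the rainy branch.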
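A leave-one-out experiment and the resulting sensitivity and specificity can be sketched as follows. The toy methylation table, the labels and the one-gene "stump" learner are all hypothetical stand-ins; the slides used a Weka decision tree on real data, which is not reproduced here.

```python
def fit_stump(samples, labels, positive="cancer"):
    """Pick the gene whose rule 'methylated => positive' is most accurate in training."""
    def correct(j):
        return sum(bool(s[j]) == (y == positive) for s, y in zip(samples, labels))
    return max(range(len(samples[0])), key=correct)


def predict_stump(gene, sample, positive="cancer", negative="normal"):
    return positive if sample[gene] else negative


def leave_one_out(samples, labels):
    """Train on all samples but one, predict the held-out sample, and repeat."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit_stump(train_x, train_y)
        predictions.append(predict_stump(model, samples[i]))
    return predictions


def sensitivity_specificity(labels, predictions, positive="cancer"):
    pairs = list(zip(labels, predictions))
    tp = sum(y == positive and p == positive for y, p in pairs)
    fn = sum(y == positive and p != positive for y, p in pairs)
    tn = sum(y != positive and p != positive for y, p in pairs)
    fp = sum(y != positive and p == positive for y, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical table: one row per sample, one 0/1 "Methylated" flag per gene.
SAMPLES = [(1, 0), (1, 1), (1, 0), (0, 1), (0, 1), (0, 1), (0, 0), (1, 0)]
LABELS = ["cancer"] * 4 + ["normal"] * 4

predictions = leave_one_out(SAMPLES, LABELS)
sens, spec = sensitivity_specificity(LABELS, predictions)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Sensitivity is the fraction of cancers called cancer and specificity the fraction of normals called normal, each computed only from predictions on held-out samples.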
  195. 195. Outline 195
  196. 196. Evaluation and deployment
• Decide whether to use the classification results
  ◦ Can we use the 12-gene decision tree for classifying new patients?
• Verification of all steps
  ◦ Exercise. The above modelling procedure contains a classical mistake: the test sets used for cross-validation (see leave-one-out) have actually been used for training the model. How? (Weka is not to blame.) And how can we fix this?
• Check whether business goals have been met
  ◦ No: a test based on 12 genes is not useful (max ~5)
  ◦ Iteration required
  197. 197. Attempt to rebuild decision tree with at most ~5 genes
• Minimal leaf size increased to 12
• New decision tree: test based on 4 genes
• Sensitivity decreased from 80% to 64%
• Specificity increased from 88% to 90%
  198. 198. Evaluation and deployment: the impact of « cost »
• Market conditions, cost of goods and royalty structure can limit the number of genes that can be tested.
  199. 199. Evaluation and deployment: the importance of « understandability »
  200. 200. Evaluation and deployment: the importance of « understandability »
• Pre- and postmarket requirements are imposed for IVDMIA (510k etc.)
• Understandability (NO black boxes) is becoming an important asset
  201. 201. Outline• Scripting Perl (Bioperl/Python) examples spiders/bots• Databases Genome Browser examples biomart, galaxy• AI Classification and clustering examples WEKA (R, Rapidminer) 201
  202. 202. WEKA:: Introduction
• A collection of open source ML algorithms:
  ◦ pre-processing
  ◦ classifiers
  ◦ clustering
  ◦ association rules
• Created by researchers at the University of Waikato in New Zealand
• Java based
  203. 203. WEKA:: Installation
• Download software from http://www.cs.waikato.ac.nz/ml/weka/
  ◦ If you are interested in modifying/extending weka there is a developer version that includes the source code
• Set the weka environment variable for java:
  ◦ setenv WEKAHOME /usr/local/weka/weka-3-0-2
  ◦ setenv CLASSPATH $WEKAHOME/weka.jar:$CLASSPATH
• Download some ML data from http://mlearn.ics.uci.edu/MLRepository.html
  205. 205. Main GUI
• Three graphical user interfaces:
  ◦ “The Explorer” (exploratory data analysis)
  ◦ “The Experimenter” (experimental environment)
  ◦ “The KnowledgeFlow” (new process model inspired interface)
  206. 206. Explorer: pre-processing the data
• Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary
• Data can also be read from a URL or from an SQL database (using JDBC)
• Pre-processing tools in WEKA are called “filters”
• WEKA contains filters for:
  ◦ Discretization, normalization, resampling, attribute selection, transforming and combining attributes, …
  207. 207. WEKA only deals with “flat” files
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
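To make the structure of such a flat ARFF file concrete, here is a minimal reader sketch. It handles only the simple subset shown on the slide (unquoted names, dense comma-separated rows, `?` for missing values) and is not a replacement for Weka's own loader; the function name `parse_arff` is chosen for illustration.

```python
def parse_arff(text):
    """Parse a simple ARFF text into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([None if v == "?" else v for v in values])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


ARFF_TEXT = """@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""

relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), "attributes,", len(rows), "rows")
```

The header declares one attribute per column and the `@data` section holds one instance per line, which is what "flat" means here: no nesting and no links between tables.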
  209. 209. 2 University of Waikato 12/18/2012 2090
  230. 230. Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:
  ◦ Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes' nets, …
  231. 231. Decision Tree Induction: Training Dataset
This follows an example of Quinlan's ID3 (Playing Tennis).
age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no
  232. 232. Output: A Decision Tree for “buys_computer”
age?
  <=30 → student?
    no → no
    yes → yes
  31..40 → yes
  >40 → credit rating?
    excellent → no
    fair → yes
