    Regexp master 2011: Presentation Transcript

    • Parsing a File with Perl: Regexp, substr and one-liners
      Bioinformatics master course, '11/'12
      Paolo Marcatili
    • Agenda
      Today we will see how to:
        • Extract information from a file
        • Use substr and regexp
      We already know how to use:
        • Scalar variables $ and arrays @
        • if, for, while, open, print, close…
    • Task Today
    • Protein Structures
      1st task:
        • Open a PDB file
        • Apply a symmetry transformation
        • Extract data from the file header
    • Zinc Finger
      2nd task:
        • Open a FASTA file
        • Find all occurrences of zinc fingers (homework?)
    • Parsing
    • Rationale
      Biological data -> human-readable files
      If you can read it, Perl can read it as well
      *BUT* it can be tricky
    • Parsing flow-chart
      Open the file
      For each line {
          look for "grammar" and store data
      }
      Close the file
      Use the data
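
The flow-chart above could be sketched in Perl roughly as follows. This is only a minimal illustration: the sample text, the sub name parse_lines and the "grammar" (lines starting with ATOM) are assumptions made for the example, and an in-memory filehandle stands in for a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data; a real script would open a file on disk instead.
my $text = "HEADER demo\nATOM 1\nATOM 2\nEND\n";

# parse_lines: read every line from a filehandle and keep the ones that
# match our "grammar" (here, simply: the line starts with ATOM).
sub parse_lines {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        push @records, $line if substr($line, 0, 4) eq 'ATOM';
    }
    return @records;
}

# Open the "file" (a reference to a scalar gives an in-memory filehandle)
open(my $in, '<', \$text) or die "cannot open in-memory data: $!";
# For each line: look for the grammar and store the data
my @atoms = parse_lines($in);
# Close the file
close $in;
# Use the data
print scalar(@atoms), " ATOM lines found\n";
```

Keeping the loop in its own sub means the same code works unchanged on a filehandle opened from a real PDB file.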
    • Substr
    • Substr
      substr($data, start, length)
      returns a substring from the expression supplied as first argument.
    • Substr
      substr($data, start, length)
        $data:  your string
        start:  offset of the first character, counted from 0
        length: you can omit this (you will extract up to the end of the string)
    • Substr
      substr($data, start, length)
      Examples:
      my $data = "il mattino ha l'oro in bocca";
      print substr($data, 0) . "\n";     # prints the whole string
      print substr($data, 3, 5) . "\n";  # prints matti
      print substr($data, 23) . "\n";    # prints bocca
      print substr($data, -5) . "\n";    # prints bocca
    • PDB rotation
    • PDB
      ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O
      ATOM      5  N   ILE L   2      44.679 -10.569  69.673  1.00 48.19           N
      …
      COLUMNS   DATA TYPE      FIELD     DEFINITION
      ---------------------------------------------------------------------------
       1 -  6   Record name    "ATOM  "
       7 - 11   Integer        serial    Atom serial number.
      13 - 16   Atom           name      Atom name.
      17        Character      altLoc    Alternate location indicator.
      18 - 20   Residue name   resName   Residue name.
      22        Character      chainID   Chain identifier.
      23 - 26   Integer        resSeq    Residue sequence number.
      27        AChar          iCode     Code for insertion of residues.
      31 - 38   Real(8.3)      x         Orthogonal coordinates for X in Angstroms.
      39 - 46   Real(8.3)      y         Orthogonal coordinates for Y in Angstroms.
      47 - 54   Real(8.3)      z         Orthogonal coordinates for Z in Angstroms.
      55 - 80   Other fields (not useful for our purposes)
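
As an illustration of how the column table maps onto substr, here is a small sketch that pulls a few fields out of one ATOM record. The sub names parse_atom and trim are made up for this example, and the record is rebuilt with sprintf so that the padding matches the column layout. Remember that the table counts columns from 1 while substr counts from 0, so column 31 is offset 30.

```perl
use strict;
use warnings;

# Rebuild the first example record with the correct fixed-width padding.
my $line = sprintf("%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f",
    'ATOM', 4, ' O', ' ', 'ASP', 'L', 1, ' ', 43.716, -12.235, 68.502, 1.00, 70.05);

# trim: remove leading and trailing whitespace from a field.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# parse_atom: return a hash reference with some fields of an ATOM record,
# or undef if the line is not an ATOM record.
sub parse_atom {
    my ($rec) = @_;
    return undef unless substr($rec, 0, 4) eq 'ATOM';
    my %atom = (
        serial  => trim(substr($rec,  6, 5)),   # columns  7 - 11
        name    => trim(substr($rec, 12, 4)),   # columns 13 - 16
        resName => trim(substr($rec, 17, 3)),   # columns 18 - 20
        chainID => substr($rec, 21, 1),         # column  22
        resSeq  => trim(substr($rec, 22, 4)),   # columns 23 - 26
        x       => substr($rec, 30, 8) + 0,     # columns 31 - 38
        y       => substr($rec, 38, 8) + 0,     # columns 39 - 46
        z       => substr($rec, 46, 8) + 0,     # columns 47 - 54
    );
    return \%atom;
}

my $atom = parse_atom($line);
printf "%s%s %s: x=%.3f y=%.3f z=%.3f\n",
    $atom->{resName}, $atom->{resSeq}, $atom->{name},
    $atom->{x}, $atom->{y}, $atom->{z};
```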
    • Symmetry
      X->Z, Y->X, Z->Y
      [Figure: coordinate axes X and Y]
    • Rotation
      #!/usr/bin/perl -w
      use strict;
      # Copy IG.pdb to IG_rotated.pdb, permuting the coordinates of every ATOM record
      open(IG, "<IG.pdb") || die "cannot open IG.pdb: $!";
      open(IGR, ">IG_rotated.pdb") || die "cannot open IG_rotated.pdb: $!";
      while (my $line = <IG>) {
          if (substr($line, 0, 4) eq "ATOM") {
              # x, y, z occupy columns 31-38, 39-46, 47-54 (offsets 30, 38, 46)
              my $X = substr($line, 30, 8);
              my $Y = substr($line, 38, 8);
              my $Z = substr($line, 46, 8);
              # write Z, X, Y into the x, y, z columns; keep the rest of the line
              print IGR substr($line, 0, 30) . $Z . $X . $Y . substr($line, 54);
          }
          else {
              print IGR $line;
          }
      }
      close IG;
      close IGR;
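
To try the coordinate swap without touching any file, the body of the loop can be moved into a sub and applied to a single record. The sketch below is one possible way to do that; the name rotate_line and the length check are additions for this example, not part of the original script.

```perl
use strict;
use warnings;

# rotate_line: for ATOM records, write the old Z, X, Y fields into the
# x, y, z columns; every other line is returned unchanged.
sub rotate_line {
    my ($line) = @_;
    return $line unless substr($line, 0, 4) eq 'ATOM';
    return $line if length($line) < 54;      # too short to hold x, y and z
    my $x = substr($line, 30, 8);
    my $y = substr($line, 38, 8);
    my $z = substr($line, 46, 8);
    return substr($line, 0, 30) . $z . $x . $y . substr($line, 54);
}

# A correctly padded ATOM record built from the values of the example line.
my $atom_line = sprintf("%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f\n",
    'ATOM', 4, ' O', ' ', 'ASP', 'L', 1, ' ', 43.716, -12.235, 68.502, 1.00, 70.05);

print rotate_line($atom_line);
print rotate_line("REMARK this line is copied as it is\n");
```

Because the three fields have the same width, the rotated record keeps exactly the same length and column layout as the original one.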
    • RegExp
    • Regular Expressions
      PDB files have a "fixed" structure.
      What if we want to do something like "check for a valid email address"?
        1. There must be some letters or numbers
        2. There must be an @
        3. Other letters
        4. .something
      paolo.marcatili@gmail.com is good
      paolo.marcatili@.com is not good
    • Regular Expressions
      $line =~ m/^[a-z0-9._]+\@[^.]+\.[a-z]{2,}$/
      WHAAAT???
      This means: check if $line has some chars at the beginning, then an @, then some non-dots, then a dot, then at least two letters.
      OK, let's start from something simpler :)
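
A quick way to see what this pattern accepts is to wrap it in a small sub and try the two addresses from the earlier slide. This is just a sketch: the name looks_like_email is invented for the example, and the pattern is the deliberately simple one from the slide, not a full validator (it rejects, for instance, upper-case letters and domains containing more than one dot).

```perl
use strict;
use warnings;

# The pattern from the slide, compiled once with qr//.
# \@ keeps Perl from treating @ as the start of an array name,
# and \. matches a literal dot.
my $email_re = qr/^[a-z0-9._]+\@[^.]+\.[a-z]{2,}$/;

# looks_like_email: 1 if the string matches the simple pattern, 0 otherwise.
sub looks_like_email {
    my ($s) = @_;
    return $s =~ $email_re ? 1 : 0;
}

# Single quotes, so that @gmail is not interpolated as an array.
for my $address ('paolo.marcatili@gmail.com', 'paolo.marcatili@.com') {
    printf "%-28s %s\n", $address,
        looks_like_email($address) ? 'is good' : 'is not good';
}
```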
    • Regular Expressions
      $line =~ m/^ATOM/
        Line starts with ATOM
      $line =~ m/^ATOM\s+/
        Line starts with ATOM, then there are some spaces
      $line =~ m/^ATOM\s+[-0-9]+/
        Line starts with ATOM, then there are some spaces, then there are some digits or -
      $line =~ m/^ATOM\s+-?[0-9]+/
        Line starts with ATOM, then there are some spaces, then there can be a minus, then some digits
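
The last pattern becomes more useful once parentheses are added around the number, because the matched text is then available in $1. A minimal sketch, with a made-up sub name (first_number) and made-up test lines:

```perl
use strict;
use warnings;

# first_number: return the (possibly negative) integer that follows ATOM
# and some spaces, or undef if the line does not match.
sub first_number {
    my ($line) = @_;
    if ($line =~ m/^ATOM\s+(-?[0-9]+)/) {
        return $1;          # text captured by the parentheses
    }
    return undef;
}

for my $line ('ATOM      4  O   ASP L   1', 'ATOM    -12  N', 'HETATM    7  ZN') {
    my $n = first_number($line);
    print defined $n ? "matched, number = $n\n" : "no match\n";
}
```

The third line does not match because ATOM is not at the very beginning of the string, which is exactly what the ^ anchor asks for.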
    • Regular Expressions
    • PDB Header
      We want to find the %id for the L and H chains
      $pidL = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_L:([.0-9]+)/);
      $pidH = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_H:([.0-9]+)/);
      ONELINER!!
      cat IG.pdb | perl -ne 'print "$1\n" if ($_ =~ m/^REMARK SUMMARY-ID_GLOB_([LH]:[.0-9]+)/);'
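
The two header patterns can also be folded into one by capturing the chain letter together with the value. The sketch below assumes REMARK lines of the form shown in the patterns above; the sub name chain_identity and the numeric values in the sample header are invented for illustration and do not come from a real IG.pdb.

```perl
use strict;
use warnings;

# chain_identity: return (chain, percent identity) for a matching REMARK
# line, or an empty list for any other line.
sub chain_identity {
    my ($line) = @_;
    if ($line =~ m/^REMARK\s+SUMMARY-ID_GLOB_([LH]):([.0-9]+)/) {
        return ($1, $2);
    }
    return;
}

# Hypothetical header lines, values made up for the example.
my @header = (
    "REMARK SUMMARY-ID_GLOB_L:93.5",
    "REMARK SUMMARY-ID_GLOB_H:87.2",
    "ATOM      4  O   ASP L   1",
);

my %pid;
for my $line (@header) {
    my ($chain, $value) = chain_identity($line);
    $pid{$chain} = $value if defined $chain;
}
print "L: $pid{L}\nH: $pid{H}\n";
```

Storing the values in a hash keyed by chain avoids one variable and one pattern per chain.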
    • Zinc Finger
    • Zinc Finger
      Zinc fingers are a large superfamily of protein domains that can bind to DNA.
      A zinc finger consists of two antiparallel β strands and an α helix.
      The zinc ion is crucial for the stability of this domain type: in the absence of the metal ion the domain unfolds, as it is too small to have a hydrophobic core.
      The consensus sequence of a single finger is:
      C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H
    • Homework
      Find all occurrences of the ZF motif in zincfinger.fasta
      Put them in the file ZF_motif.fasta
      e.g.
      weofjpihouwefghoicalcvgnfglapglifhtylhyuiui
                       calcvgnfglapglifhtylh
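
One possible starting point for the homework is sketched below. It translates the consensus into a Perl pattern (X becomes ".", X{2-4} becomes ".{2,4}") and scans FASTA records for it. The sub names, the record name "demo" and the output naming scheme are assumptions made for the example, and an in-memory filehandle replaces zincfinger.fasta so that the sketch runs on its own.

```perl
use strict;
use warnings;

# Consensus C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H as a regexp;
# /i because the example sequence is written in lower case.
my $zf_re = qr/C.{2,4}C.{3}[LIVMFYWC].{8}H.{3}H/i;

# find_motifs: return every non-overlapping match found in one sequence.
sub find_motifs {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ /($zf_re)/g) {
        push @hits, $1;
    }
    return @hits;
}

# read_fasta: return a hash (id => sequence); sequence lines are joined.
sub read_fasta {
    my ($fh) = @_;
    my (%seqs, $id);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {
            $id = $1;
            $seqs{$id} = '';
        }
        elsif (defined $id) {
            $seqs{$id} .= $line;
        }
    }
    return %seqs;
}

# The example sequence from the slide, split over two lines as FASTA allows.
my $fasta = ">demo\nweofjpihouwefghoicalcvgn\nfglapglifhtylhyuiui\n";
open(my $in, '<', \$fasta) or die "cannot open in-memory data: $!";
my %seqs = read_fasta($in);
close $in;

# Print each hit as a FASTA record (to STDOUT here instead of ZF_motif.fasta).
for my $id (sort keys %seqs) {
    my $n = 0;
    for my $hit (find_motifs($seqs{$id})) {
        $n++;
        print ">${id}_ZF$n\n$hit\n";
    }
}
```

Note that a global match like this one reports only non-overlapping occurrences; overlapping motifs would need a different approach.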