Regexp master 2011

Published in: Technology

  1. Parsing a File with Perl: Regexp, substr and one-liners. Bioinformatics master course, '11/'12, Paolo Marcatili
  2. Agenda
     Today we will see how to:
     • Extract information from a file
     • Use substr and regexp
     We already know how to use:
     • Scalar variables $ and arrays @
     • if, for, while, open, print, close…
  3. Task Today
  4. Protein Structures
     1st task:
     • Open a PDB file
     • Apply a symmetry transformation
     • Extract data from the file header
  5. Zinc Finger
     2nd task:
     • Open a FASTA file
     • Find all occurrences of Zinc Fingers (homework?)
  6. Parsing
  7. Rationale
     Biological data -> human-readable files.
     If you can read it, Perl can read it as well.
     *BUT* it can be tricky.
  8. Parsing flow-chart
     Open the file
     For each line {
         look for "grammar"
         and store data
     }
     Close file
     Use data
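
The flow-chart above can be sketched in Perl roughly as follows. This is a minimal illustration, not the course's own script: the sample text and the rule "keep lines that start with ATOM" are assumptions chosen for the example, and the data is read from an in-memory string so the sketch runs without a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical file content, held in a string so no real file is needed.
my $text = "ATOM first\nHEADER something\nATOM second\n";

# 1. Open the "file" (here: an in-memory filehandle on $text).
open(my $in, '<', \$text) or die "cannot open in-memory data: $!";

# 2. For each line: look for the "grammar" and store the data.
my @records;
while (my $line = <$in>) {
    chomp $line;
    if (substr($line, 0, 4) eq "ATOM") {
        push @records, $line;
    }
}

# 3. Close the file.
close $in;

# 4. Use the data.
print scalar(@records), " ATOM lines found\n";
```

To read a real file, the string reference `\$text` would be replaced with a file name such as "IG.pdb".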
  9. Substr
  10. Substr
      substr($data, start, length)
      returns a substring of the expression supplied as the first argument.
  11. Substr
      substr($data, start, length)
               ^      ^      ^
               |      |      you can omit this (you will extract up to the end of the string)
               |      start (counting from 0)
               your string
  12. Substr
      substr($data, start, length)
      Examples:
          my $data = "il mattino ha l'oro in bocca";
          print substr($data, 0) . "\n";     # prints the whole string
          print substr($data, 3, 5) . "\n";  # prints matti
          print substr($data, 23) . "\n";    # prints bocca
          print substr($data, -5) . "\n";    # prints bocca
  13. PDB rotation
  14. PDB
      ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O
      ATOM      5  N   ILE L   2      44.679 -10.569  69.673  1.00 48.19           N
      …

      COLUMNS   DATA TYPE      FIELD     DEFINITION
      ------------------------------------------------------------------------
       1 -  6   Record name    "ATOM  "
       7 - 11   Integer        serial    Atom serial number.
      13 - 16   Atom           name      Atom name.
      17        Character      altLoc    Alternate location indicator.
      18 - 20   Residue name   resName   Residue name.
      22        Character      chainID   Chain identifier.
      23 - 26   Integer        resSeq    Residue sequence number.
      27        AChar          iCode     Code for insertion of residues.
      31 - 38   Real(8.3)      x         Orthogonal coordinates for X in Angstroms
      39 - 46   Real(8.3)      y         Orthogonal coordinates for Y in Angstroms
      47 - 54   Real(8.3)      z         Orthogonal coordinates for Z in Angstroms
      55 - 80   Bla Bla Bla (not useful for our purposes)
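
As a hedged illustration of how the column table maps onto substr, the sketch below extracts a few fields from the first ATOM record shown above. The helper `field` and the hash `%atom` are hypothetical names introduced for this example; the record is rebuilt with sprintf so that the fixed-width layout does not depend on how whitespace is rendered here.

```perl
use strict;
use warnings;

# Rebuild the first ATOM record of the slide in fixed-width layout
# (columns 1-66 only; the trailing element symbol is left out).
my $line = sprintf("%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f",
    "ATOM", 4, " O", "", "ASP", "L", 1, "", 43.716, -12.235, 68.502, 1.00, 70.05);

# The table counts columns from 1, substr counts from 0,
# so column N of the table is offset N-1.
sub field {
    my ($record, $first_col, $last_col) = @_;
    my $value = substr($record, $first_col - 1, $last_col - $first_col + 1);
    $value =~ s/^\s+|\s+$//g;    # trim the padding spaces
    return $value;
}

my %atom = (
    serial  => field($line,  7, 11),
    name    => field($line, 13, 16),
    resName => field($line, 18, 20),
    chainID => field($line, 22, 22),
    resSeq  => field($line, 23, 26),
    x       => field($line, 31, 38),
    y       => field($line, 39, 46),
    z       => field($line, 47, 54),
);

print "$atom{resName} $atom{chainID} $atom{resSeq}: x=$atom{x} y=$atom{y} z=$atom{z}\n";
```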
  15. Symmetry
      X->Z   Y->X   Z->Y
      (figure: coordinate axes, labelled Y and X)
  16. Rotation
          #!/usr/bin/perl -w
          use strict;

          open(IG, "<IG.pdb") || die "cannot open IG.pdb:$!";
          open(IGR, ">IG_rotated.pdb") || die "cannot open IG_rotated.pdb:$!";

          while (my $line = <IG>) {
              if (substr($line, 0, 4) eq "ATOM") {
                  # x, y, z are columns 31-38, 39-46, 47-54 (offsets 30, 38, 46)
                  my $X = substr($line, 30, 8);
                  my $Y = substr($line, 38, 8);
                  my $Z = substr($line, 46, 8);
                  # write the old Z, X, Y values into the x, y, z columns
                  print IGR substr($line, 0, 30) . $Z . $X . $Y . substr($line, 54);
              }
              else {
                  print IGR $line;
              }
          }

          close IG;
          close IGR;
  17. RegExp
  18. Regular Expressions
      PDB files have a "fixed" structure. What if we want to do something like "check for a valid email address"…
  19. Regular Expressions
      PDB files have a "fixed" structure. What if we want to do something like "check for a valid email address"…
      1. There must be some letters or numbers
      2. There must be a @
      3. Other letters
      4. .something
      paolo.marcatili@gmail.com is good
      paolo.marcatili@.com is not good
  20. Regular Expressions
      $line =~ m/^[a-z0-9._]+@[^.]+\.[a-z]{2,}$/
      WHAAAT???
      This means: check if $line has some characters (lowercase letters, digits, dots or underscores) at the beginning, then a @, then some non-dot characters, then a dot, then at least two letters…
      OK, let's start from something simpler :)
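
A small sketch of how such a pattern could be tried on the two addresses from the previous slide. The function name `looks_like_email` is hypothetical, and the check is only the simplified rule set described here, not a complete email validator. The @ is written as `\@` so that Perl never tries to interpolate an array inside the pattern.

```perl
use strict;
use warnings;

# Simplified check: local part, @, non-dot characters, a dot, at least two letters.
sub looks_like_email {
    my ($candidate) = @_;
    return $candidate =~ m/^[a-z0-9._]+\@[^.]+\.[a-z]{2,}$/ ? 1 : 0;
}

for my $address ('paolo.marcatili@gmail.com', 'paolo.marcatili@.com') {
    my $verdict = looks_like_email($address) ? "is good" : "is not good";
    print "$address $verdict\n";
}
```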
  22. Regular Expressions
      $line =~ m/^ATOM/
          Line starts with ATOM
      $line =~ m/^ATOM\s+/
          Line starts with ATOM, then there is some whitespace
      $line =~ m/^ATOM\s+[-0-9]+/
          Line starts with ATOM, then there is some whitespace, then there are some digits or -
      $line =~ m/^ATOM\s+-?[0-9]+/
          Line starts with ATOM, then there is some whitespace, then there can be a minus, then some digits
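
To see the last pattern at work, the following sketch adds a pair of capturing parentheses and returns the number that follows ATOM. The function name `atom_serial` and the sample lines are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Return the (optionally negative) integer that follows "ATOM" at the start
# of a line, or undef if the line does not match.
sub atom_serial {
    my ($line) = @_;
    return $line =~ m/^ATOM\s+(-?[0-9]+)/ ? $1 : undef;
}

my @lines = (
    "ATOM      4  O   ASP L   1",
    "REMARK this is not an ATOM record",
);

for my $line (@lines) {
    my $serial = atom_serial($line);
    print defined $serial ? "serial $serial\n" : "no match\n";
}
```

Because of the `^` anchor, the second line does not match even though it contains the word ATOM.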
  23. Regular Expressions
  24. PDB Header
      We want to find the %id for the L and H chains
  25. PDB Header
      We want to find the %id for the L and H chains
          $pidL = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_L:([.0-9]+)/);
          $pidH = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_H:([.0-9]+)/);
      ONELINER!!
          cat IG.pdb | perl -ne 'print "$1\n" if ($_ =~ m/^REMARK SUMMARY-ID_GLOB_([LH]:[.0-9]+)/);'
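
The same extraction can be sketched as a small script that stores both values in a hash. The header lines and the values 0.85 and 0.79 below are hypothetical, written only to be consistent with the pattern on this slide; a real IG.pdb header may look different.

```perl
use strict;
use warnings;

# Hypothetical header lines, consistent with the REMARK pattern used above.
my $header = <<'END';
REMARK SUMMARY-ID_GLOB_L:0.85
REMARK SUMMARY-ID_GLOB_H:0.79
ATOM      4  O   ASP L   1
END

my %pid;    # chain letter => %id value
open(my $in, '<', \$header) or die "cannot open in-memory data: $!";
while (my $line = <$in>) {
    # $1 is the chain (L or H), $2 is the number after the colon.
    if ($line =~ m/^REMARK SUMMARY-ID_GLOB_([LH]):([.0-9]+)/) {
        $pid{$1} = $2;
    }
}
close $in;

print "$_: $pid{$_}\n" for sort keys %pid;
```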
  26. Zinc Finger
  27. Zinc Finger
      Zinc fingers are a large superfamily of protein domains that can bind to DNA.
      A zinc finger consists of two antiparallel β strands and an α helix. The zinc ion is crucial for the stability of this domain type: in the absence of the metal ion the domain unfolds, as it is too small to have a hydrophobic core.
      The consensus sequence of a single finger is:
      C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H
  28. Homework
      Find all occurrences of the ZF motif in zincfinger.fasta
      Put them in the file ZF_motif.fasta
      e.g.
      weofjpihouwefghoicalcvgnfglapglifhtylhyuiui
  29. Homework
      Find all occurrences of the ZF motif in zincfinger.fasta
      Put them in the file ZF_motif.fasta
      e.g.
      weofjpihouwefghoicalcvgnfglapglifhtylhyuiui
      calcvgnfglapglifhtylh
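
One possible starting point, offered as a sketch rather than the expected solution: the consensus C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H can be translated into a Perl regex by writing each X as `.` (any character). The function name `find_zinc_fingers` is hypothetical. The sketch works on a single sequence string; reading zincfinger.fasta, skipping the header lines that start with `>`, joining wrapped sequence lines and writing ZF_motif.fasta are left out.

```perl
use strict;
use warnings;

# Return all non-overlapping matches of the zinc finger consensus in a sequence.
# /i makes the match case-insensitive, /g finds every occurrence in turn.
sub find_zinc_fingers {
    my ($sequence) = @_;
    my @motifs;
    while ($sequence =~ m/(C.{2,4}C.{3}[LIVMFYWC].{8}H.{3}H)/gi) {
        push @motifs, $1;
    }
    return @motifs;
}

my @found = find_zinc_fingers("weofjpihouwefghoicalcvgnfglapglifhtylhyuiui");
print "$_\n" for @found;
```

Since `.` accepts any character, a stricter version could replace it with a character class of valid amino acid letters.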
