Introduction to Perl and BioPerl

BCBB Bioinformatics Development Series
April 30, 2014
Andrew Oler, PhD
High-throughput Sequencing Bioinformatics Specialist
BCBB/OCICB/NIAID/NIH

Bioinformatics & Computational Biology
Branch (BCBB)

Biocomputing Research Consulting and
Scientific Software Development
High
Throughput
Illustration
Animation
http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx
ScienceApps@niaid.nih.gov

Outline
§  Introduction
§  Perl programming principles
o Variables
o Flow controls/Loops
o File manipulation
o Regular expressions
§  BioPerl
o What is BioPerl?
o How do you use BioPerl?
o How do you learn more about BioPerl?

Introduction
•  An interpreted programming language created in
1987 by Larry Wall
•  Good at processing and transforming plain text, like
GenBank or PDB files
•  Official motto: “TMTOWTDI” (There’s More Than
One Way To Do It!)
•  Extensible – currently has a large and active user
base who are constantly adding new functional
libraries
•  Portable – can use in Windows, Mac, & Linux/Unix

Introduction
"Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing,
summarizing and otherwise mangling text. Although the biological sciences
do involve a good deal of numeric analysis now, most of the primary data is
still text: clone names, annotations, comments, bibliographic references.
Even DNA sequences are textlike. Interconverting incompatible data
formats is a matter of text mangling combined with some creative
guesswork. Perl's powerful regular expression matching and string
manipulation operators simplify this job in a way that isn't equalled by any
other modern language."
- Lincoln Stein, "How Perl Saved the Human Genome Project"

Examples of Bioinformatics Software with
Perl components
§  GBrowse, GMOD
§  samtools
§  Illumina CASAVA
§  MEME
§  Velvet
§  miRDeep
§  Rosetta
§  ViennaRNA
§  RUM
§  Trinity
§  NCBI BLAST
§  I-TASSER
§  MAKER
§  ...Many more
§  http://stackoverflow.com/questions/2527170/why-is-perl-used-so-extensively-in-biology-research
§  http://programmers.stackexchange.com/questions/92916/why-is-perl-so-heavily-used-in-bioinformatics

Getting Perl
•  Latest version – 5.18.2
•  http://www.perl.org/
•  Version shipped with Mac OS X Lion – 5.12.3

Getting Help
•  perl -v
•  Perl manual pages:
–  perldoc perl
–  perldoc perlintro
•  Books and Documentation:
–  http://www.perl.org/docs.html
–  The O’Reilly Books:
§  Learning Perl
§  Programming Perl
§  Perl Cookbook, etc.
•  http://www.cpan.org
•  http://perldoc.perl.org/perlintro.html
•  BCBB – for help writing your custom scripts

File Manager/Browser by Operating System
OS:  Windows    Mac OSX    Unix
FM:  Explorer   Finder     Shell
Input Method: [icons not preserved]
Running Perl scripts

Anatomy of the Terminal, “Command Line”,
or “Shell”
[Figure: terminal window with labeled parts: Prompt (computer_name:current_directory username), Cursor, Command, Argument, Window, Output]
Mac: Applications -> Utilities -> Terminal
Windows: Download open source software
•  PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/)
•  Other SSH Clients (http://en.wikipedia.org/wiki/Comparison_of_SSH_clients)
•  Cygwin (http://www.cygwin.com/)

How to execute a command
[Figure: a terminal session showing a command followed by its argument, then the lines of output it produces]

cd (“change directory”) and
mkdir (“make directory”)
cd ~             change to home directory
cd test_data     change to “test_data” directory
cd ..            change to the parent directory (“go up”)
cd ~/unix_hpc    change to the “unix_hpc” directory inside the home directory
mkdir dir_name   make directory “dir_name”
pwd              “print working directory”
***See Handout “HPC Cluster Computing and Unix Basics Handout” for
more helpful Unix Terminal commands.***

"Hello world" script
•  hello_world.pl file
#!/usr/bin/perl
# This is a comment
print "Hello world\n";
•  Run hello_world.pl
>perl hello_world.pl
Hello world
>perl -e 'print "Hello world\n";'
Hello world
•  The shebang line (#!) must be the first line. It tells the computer where to find perl.
•  print is a Perl function name
•  Double quotes are used for strings
•  A semicolon must end every statement in Perl

A Few Helpful Things for a Template
§  #!/usr/bin/env perl
§  $| = 1;            # Flush output immediately so it appears in order (for debugging)
§  use warnings;      # Helpful warnings (for debugging)
§  use diagnostics;   # Longer explanations of warnings (for debugging)
§  use strict;        # Requires you to declare variables
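Put together, the template lines above could start a script roughly like this. It is only a sketch: the greeting subroutine and its message are placeholders added for illustration, not part of the original slides.

```perl
#!/usr/bin/env perl
use strict;          # require variables to be declared with "my"
use warnings;        # warn about likely mistakes, such as undefined values
# use diagnostics;   # optional: longer explanations for each warning

$| = 1;              # autoflush STDOUT so output is not held in a buffer

# Placeholder subroutine: replace with your own code.
sub greeting {
    my ($name) = @_;
    return "Hello, $name!";
}

print greeting("world"), "\n";
```

Using "#!/usr/bin/env perl" instead of a fixed path lets the system pick whichever perl comes first on the PATH.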

Basic Programming Concepts
•  Variables
–  Scalars
–  Arrays
–  Hashes
•  Flow Control
–  if/else
–  unless
•  Loops
–  for
–  foreach
–  while
–  until
•  Files
•  Regexes

Variables
§ In computer programming, a variable is a symbolic
name used to refer to a value – Wikipedia
o Examples
•  Variable names can contain letters, numbers, and _,
but cannot begin with a number
•  $five_prime OK
•  $5_prime NO
  $x = 4;
  $y = 1.0;
  $name = 'Bob';
  $seq = "ACTGTTGTAAGC";
Perl treats both integers and floating
point numbers as numbers, so $x and
$y can be used together in an
equation.
Strings are indicated by either
single or double quotes.

Perl Variables
•  Scalar
•  Array
•  Hash

Variables - Scalar
•  Can store a single string, or number
•  Begins with a $
•  Single or double quotes for strings
  my $x = 4; # use "my" to declare a variable
  my $name = 'Bob';
  my $seq= "ACTGTTGTAAGC";
  print "My name is $name.";
#prints My name is Bob.

Scalar Operators
http://perldoc.perl.org/perlintro.html
Arithmetic
+     addition
-     subtraction
*     multiplication
/     division
++    increment (by one)
--    decrement (by one)
+=    increment (by value)
-=    decrement (by value)
Numeric Comparison
==    equality
!=    inequality
<     less than
>     greater than
<=    less than or equal
>=    greater than or equal
String Comparison
eq    equality
ne    inequality
lt    less than
gt    greater than
le    less than or equal
ge    greater than or equal
Boolean Logic
&&    and
||    or
!     not
Miscellaneous
=     assignment
.     string concatenation
..    range operator
Examples:
$m = 3;
$f = "hello";
if ($x == 3)
if ($x eq 'hi')
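The difference between the numeric and string comparison operators is easiest to see side by side. The following is a small sketch; the subroutine name compare_both is made up for this illustration.

```perl
use strict;
use warnings;

# Report how two values compare numerically (==) and as strings (eq).
sub compare_both {
    my ($left, $right) = @_;
    my $numeric = ($left == $right) ? "equal" : "different";
    my $string  = ($left eq $right) ? "equal" : "different";
    return "numeric: $numeric, string: $string";
}

print compare_both("1.0", "1"), "\n";    # numeric: equal, string: different

my $codon     = "AT" . "G";              # concatenation gives "ATG"
my @positions = (1 .. 3);                # range gives (1, 2, 3)
my $total     = 0;
$total += $_ for @positions;             # 1 + 2 + 3 = 6
print "$codon @positions $total\n";      # ATG 1 2 3 6
```

As a rule of thumb, use ==, <, > and friends for numbers, and eq, lt, gt and friends for text such as sequence names.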

Common Scalar Functions
Function Name Description
length Length of the scalar value
lc Lower case value
uc Upper case value
reverse Returns the value in the opposite order
substr Returns a substring
chomp Removes a trailing newline (\n) character
chop Removes the last character
defined Checks if scalar value exists
split Splits scalar into array
http://perldoc.perl.org/index-functions.html How to use any Perl function

Common Scalar Functions Examples
my $string = "This string has a newline.\n";
chomp $string;
print $string;
#prints "This string has a newline."
$string = lc($string);
print $string;
#prints "this string has a newline."
my @array = split(" ", $string);
#array looks like ("this", "string", "has",
# "a", "newline.")

Scalar Variables Exercise
§  Write a program that computes the circumference of a
circle with a radius of 12.5
§  C = 2 * π * r
§  (Answer should be about 78.5)

Array
Value: Andrew  Burke  Darrell  Vijay  Mike
Index: 0       1      2        3      4
•  Stores a list of scalar values (strings or numbers)
•  Zero based index

Variables - Array
•  Begins with @
•  Use the () brackets for creating
•  Use the $ and [] brackets for retrieving a single
element in the array
my @grades = (75, 80, 35);
my @mixnmatch = (5, "A", 4.5);
my @names = ("Bob", "Vivek", "Jane");
# zero-based index
my $first_name = $names[0];
# retrieve the last item in an array
my $last_name = $names[-1];

Common Array Functions
scalar Size of the array
push Add value to end of an array
pop Removes the last element from an array
shift Removes the first element from an array
unshift Add value to the beginning of an array
join Convert array to scalar
splice Removes or replaces specified range of elements from array
grep Search array elements
sort Orders array elements
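A few of the functions in the table, applied to one small array. This is a minimal sketch; the names are the same sample values used on the push/pop slides.

```perl
use strict;
use warnings;

my @names = ("Tim", "Molly", "Betty", "Chris");

my $count  = scalar(@names);          # number of elements: 4
my $joined = join(", ", @names);      # one string: "Tim, Molly, Betty, Chris"
my @sorted = sort @names;             # alphabetical: Betty Chris Molly Tim
my @with_y = grep { /y/ } @names;     # elements containing "y": Molly Betty

# splice(ARRAY, OFFSET, LENGTH) removes LENGTH elements starting at OFFSET
my @removed = splice(@names, 1, 2);   # removes Molly and Betty from @names

print "$count names: $joined\n";
print "Sorted: @sorted\n";
print "Contain 'y': @with_y\n";
print "Removed: @removed; remaining: @names\n";
```

Note that sort and grep return new lists and leave the original array alone, while splice (like push, pop, shift and unshift) changes the array itself.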

push/pop modifies the end of an array
@names = Tim Molly Betty Chris
push(@names, "Charles");
@names = Tim Molly Betty Chris Charles
pop(@names);
@names = Tim Molly Betty Chris

shift/unshift modifies the start of an array
@names = Tim Molly Betty Chris
unshift(@names, "Charles");
@names = Charles Tim Molly Betty Chris
shift(@names);
@names = Tim Molly Betty Chris

Variables - Hashes
KEYS VALUES
Title Programming Perl, 3rd Edition
Publisher O’Reilly Media
ISBN 978-0-596-00027-1
•  Stores data using key, value pairs

Variables - Hash
§  Indicated with %
§  Use the () brackets and => pointer for creating
§  Use the $ and {} brackets for setting or retrieving a
single element from the hash
my %book_info = (
title =>"Perl for bioinformatics",
author => "James Tisdall",
pages => 270,
price => 40
);
print $book_info{"author"};
#prints James Tisdall

Common Hash Functions
keys Returns array of keys
values Returns array of values
reverse Swaps keys and values (each value becomes a key)
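The three hash functions in the table can be tried on a tiny lookup table. This is a sketch with a three-codon hash chosen for illustration; a full codon table would map several codons to the same amino acid, in which case reverse would keep only one of them.

```perl
use strict;
use warnings;

my %codon_to_aa = (
    ATG => "Met",
    TGG => "Trp",
    AAA => "Lys",
);

# keys and values come back in no particular order, so sort them for display
my @codons = sort keys %codon_to_aa;      # AAA ATG TGG
my @amino  = sort values %codon_to_aa;    # Lys Met Trp

# reverse swaps keys and values; only safe when the values are unique
my %aa_to_codon = reverse %codon_to_aa;

print "Codons: @codons\n";
print "Amino acids: @amino\n";
print "Trp is encoded by $aa_to_codon{Trp}\n";   # TGG
```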

Retrieving keys or values of a hash
•  Retrieving single value
•  Retrieving all the keys/values as an
array
•  NOTE: Keys and values are unordered
my $book_title = $book_info{"title"};
#$book_title has stored "Perl for bioinformatics"
my @book_attributes = keys %book_info;
my @book_attribute_values = values %book_info;

Variables summary
# A. Scalar variable
my $first_name = "andrew";
my $last_name = "oler";
# B. Array variable
# use 'circular' bracket and @ symbol for assignment
my @personal_info = ("andrew", $last_name);
# use 'square' bracket and the integer index to access an entry
my $fname = $personal_info[0];
# C. Hash variable
# use 'circular' brackets (similar to array) and % symbol for assignment
my %personal_info = (
first_name => "andrew",
last_name => "oler"
);
# use 'curly' brackets to access a single entry
my $fname1 = $personal_info{first_name};

Tutorial 1
§ Create a variable with the following sequence:
ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE ALA LEU
GLY SER GLY ALA THR THR
§ print in lowercase
§ split into an array
§ print the array
§ print the first value in the array
§ shift the first value off the array and store it in a
variable
§ print the variable and the array
§ push the variable onto the end of the array
§ print the array

Flow Controls
•  If/elsif/else
•  unless
  my $x = 4;
  if ($x > 4) {
  print "I am greater than 4";
  }elsif ($x == 4) {
  print "I am equal to 4";
  }else {
  print "I am less than 4";
  }
  unless($x > 4) {
  print "I am not greater than 4";
  }

Post-condition
# the traditional way
if ($x == 4) {
print "I am 4.";
}
# this line below is equivalent to the
# if statement above, but you can only
# use it if you have a one line action
print "I am 4." if ( $x == 4 );
print "I am not 4." unless ( $x == 4);

Loops
•  for (EXPR; EXPR; EXPR)
•  foreach
  for ( my $x = 0; $x < 4 ; $x++ ) {
  print "$x\n";
  }
  #prints 0, 1, 2, 3 on separate lines
  my @names = ("Bob", "Vivek", "Jane");
 
  foreach my $name (@names) {
  print "My name is $name.\n";
  }
  #prints:
  #My name is Bob.
  #My name is Vivek.
  #My name is Jane.

Hashes with foreach
my %book_info = (
title =>"Perl for Bioinformatics",
author => "James Tisdall");
  foreach my $key (keys %book_info) {
  print "$key : $book_info{$key}\n";
  }
  #prints:
  #title : Perl for Bioinformatics
  #author : James Tisdall
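Because hash keys are unordered, the two lines above may print in either order from one run to the next. Sorting the keys gives a repeatable order. A minimal sketch, with sorted_lines as a made-up helper name:

```perl
use strict;
use warnings;

my %book_info = (
    title  => "Perl for Bioinformatics",
    author => "James Tisdall",
);

# Build "key : value" lines in alphabetical key order.
sub sorted_lines {
    my (%hash) = @_;
    return map { "$_ : $hash{$_}" } sort keys %hash;
}

print "$_\n" for sorted_lines(%book_info);
# author : James Tisdall
# title : Perl for Bioinformatics
```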

Loops - continued
•  while
•  until
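  # until: repeat as long as the condition is false
  my $x = 0;
  until ($x >= 4) {
      print "$x\n";
      $x++;
  }
  # while: repeat as long as the condition is true
  $x = 0;
  while ($x < 4) {
      print "$x\n";
      $x++;
  }
  # both loops print 0, 1, 2, 3 on separate lines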

Tutorial 2
§  Iterate through the array (using foreach) and print
everything unless ILE
§  Use a hash to count how many times each amino acid
occurs
§  Iterate through the hash and print the counts in a table

Files
•  Existence
o  if(-e $file)
•  Open
o  Read open(FILE, "< $file");
o  New open(FILE, "> $file");
o  Append open(FILE, ">> $file");
•  Read (for input/read file handle)
o  while(<FILE>){ }
o  Each line is assigned to special variable $_
•  Write (for output--new/append--file handle)
o  print FILE $string;
•  Close
o  close(FILE);
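The open/read/write/close steps above can also be written with the three-argument form of open and a lexical filehandle (my $fh), which is the style current Perl documentation recommends. This is a sketch: write_lines and read_lines are made-up helper names, and a temporary file from the core File::Temp module stands in for a real data file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);   # core module; used only to get a scratch file

# Write a list of strings to $path, one per line.
sub write_lines {
    my ($path, @lines) = @_;
    open(my $out, '>', $path) or die "Cannot write $path: $!";
    print $out "$_\n" for @lines;
    close($out) or die "Cannot close $path: $!";
}

# Read $path and return its lines without trailing newlines.
sub read_lines {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot read $path: $!";
    my @lines;
    while (my $line = <$in>) {
        chomp $line;
        push @lines, $line;
    }
    close($in);
    return @lines;
}

my ($fh, $path) = tempfile(UNLINK => 1);
close($fh);
write_lines($path, "ATGC", "GGCC");
print join(",", read_lines($path)), "\n";   # ATGC,GGCC
```

Keeping the mode ('<', '>', '>>') separate from the file name avoids surprises when a file name itself begins with one of those characters.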

Directory
•  Existence
o  if(-d $directory)
•  Open
o  opendir(DIR, "$directory")
•  Read
o  readdir(DIR)
•  Close
o  closedir(DIR)
•  Create
o  mkdir($directory) unless (-d $directory)

# A. Reading file
# create a variable that can tell the program where to find your data
my $file = "/Users/oleraj/Documents/perlTutorials/myFile.txt";
# Check if file exists and read through it
if(-e $file){
open(FILE, "<$file") or die "cannot open file $file: $!";
while(<FILE>){
chomp;
my $line = $_;
#do something useful here
}
close(FILE);
}
# B. Reading directory
my $directory = "/Users/oleraj";
if(-d $directory){
opendir(DIR, $directory) or die "cannot open directory $directory: $!";
my @files = readdir(DIR);
closedir(DIR);
print "$_\n" foreach @files;
}
Notice the special variable $_.
When it is used here, it holds the
line that was just read from the file.
The array @files will hold the name
of every entry in the directory (including "." and "..").
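Since readdir also returns the special entries "." (the directory itself) and ".." (its parent), a listing usually filters them out. A minimal sketch, assuming a made-up helper name list_entries and a temporary directory from the core File::Temp module:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);   # core module; used only to get a scratch directory

# Return the sorted names in $dir, leaving out "." and "..".
sub list_entries {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @entries = grep { $_ ne '.' && $_ ne '..' } readdir($dh);
    closedir($dh);
    my @sorted = sort @entries;
    return @sorted;
}

my $dir = tempdir(CLEANUP => 1);
for my $name ("b.fa", "a.fa") {
    open(my $fh, '>', "$dir/$name") or die "Cannot create $name: $!";
    close($fh);
}
print join(" ", list_entries($dir)), "\n";   # a.fa b.fa
```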

Regular Expressions (REGEX)
•  "A regular expression ... is a set of pattern matching rules
encoded in a string according to certain syntax rules." -wikipedia
•  Fast and efficient for "Fuzzy" matches
•  Applications:
•  Checking if a string fits a pattern
•  Extracting a pattern match from a string
•  Altering the pattern within the string
•  Example - Find all sequences from human
•  $seq_name =~ /(human|Homo sapiens)/i;
•  Uses
1.  Find/match only (yes/no) with m// or //
§  e.g., m/regex/; m/human/
2.  Find and replace a string with s///
§  e.g., s/regex/replacement/; s/human/Homo sapiens/
3.  Translate character by character with tr///
§  e.g., tr/list/newlist/; tr/abcd/1234/;
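The three uses listed above, shown on short strings. This is a minimal sketch; the sample strings are made up for illustration.

```perl
use strict;
use warnings;

my $seq_name = "Homo sapiens hemoglobin";

# 1. Match: true or false; parentheses capture the matched text into $1
my $matched = "";
if ($seq_name =~ /(human|Homo sapiens)/i) {
    $matched = $1;
}
print "Matched: $matched\n";              # Matched: Homo sapiens

# 2. Substitute: s/// edits a string in place (here, a copy of $seq_name)
(my $short = $seq_name) =~ s/Homo sapiens/human/;
print "$short\n";                         # human hemoglobin

# 3. Transliterate: tr/// maps characters one-for-one and returns a count
my $dna = "ATGCCG";
(my $complement = $dna) =~ tr/ACGT/TGCA/;
my $gc_count = ($dna =~ tr/GC//);         # counts G and C; $dna is unchanged
print "$complement $gc_count\n";          # TACGGC 4
```

The "(my $copy = $original) =~ ..." idiom copies the string first, so the original is left untouched.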

Beginning Perl for Bioinformatics - James Tisdall

Simple Examples
my $protein = "MET SER ASN ASN THR SER";
$protein =~ s/SER/THR/g;
print $protein;
#prints "MET THR ASN ASN THR THR";
$protein =~ m/asn/i;
#will match ASN

Symbol     Meaning
.          Match any one character (except newline)
^          Match at beginning of string
$          Match at end of string
\n         Match the newline
\t         Match a tab
\s         Match any whitespace character
\w         Match any word character (alphanumeric plus "_")
\W         Match any non-word character
\d         Match any digit character
[A-Za-z]   Match any letter
[0-9]      Same as \d
my $string = "See also xyz";
$string =~ /See also ./;
#matches "See also x"
$string =~ /^./;
#matches "S"
$string =~ /.$/;
#matches "z"
$string =~ /\w\s\w/;
#matches "e a"
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

Quantifier   Meaning
*            Match 0 or more times
+            Match at least once
?            Match 0 or 1 times
{COUNT}      Match exactly COUNT times
{MIN,}       Match at least MIN times (maximal)
{MIN,MAX}    Match at least MIN but not more than MAX times (maximal)
my $string = "See also xyz";
$string =~ /See also .*/;
#matches "See also xyz"
$string =~ /^.*/;
#matches "See also xyz"
$string =~ /.?$/;
#matches "z"
$string =~ /\w+\s+\w+/;
#matches "See also"

REGEX Examples
my $string = ">ref|XP_001882498.1| retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]";
$string =~ /\s.*virus/;
#will match " retrovirus"
$string =~ /XP_\d+/;
#will match "XP_001882498"
$string =~ /XP_\d/;
#will match "XP_0"
$string =~ /\[.*\]$/;
#will match "[Laccaria bicolor S238N-H82]"
$string =~ /^.*\|/;
#will match ">ref|XP_001882498.1|"
$string =~ /^.*?\|/;
#will match ">ref|"
$string =~ s/\|/:/g;
#string becomes ">ref:XP_001882498.1: retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]"
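The same header can be taken apart in one pass by putting parentheses around the pieces of interest; each pair of parentheses captures its match into $1, $2, and so on. This is a sketch under the assumption that headers follow the ">db|accession.version| description [organism]" layout of the example; parse_header is a made-up helper name.

```perl
use strict;
use warnings;

# Split a header into database tag, accession, version, description and
# organism. Returns an empty list if the header does not fit the pattern.
sub parse_header {
    my ($header) = @_;
    if ($header =~ /^>(\w+)\|(\w+)\.(\d+)\|\s*(.*?)\s*\[(.*)\]$/) {
        return (db => $1, accession => $2, version => $3,
                description => $4, organism => $5);
    }
    return;
}

my %field = parse_header(
    ">ref|XP_001882498.1| retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]"
);
print "$field{accession} v$field{version}: $field{description} ($field{organism})\n";
```

Note the backslashes: "|", "[", "]" and "." all have special meanings in a regular expression, so they must be escaped to match the literal characters.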

Tutorial 3
§  open the file "example.fa"
§  read through the file
§  print the id lines for the human sequences (NOTE: the
ids will start with HS)

Summary of Basics
•  Variables
–  Scalar
–  Array
–  Hash
•  Flow Control
–  if/else
–  unless
•  Loops
–  for
–  foreach
–  while
–  until
•  Files
•  Regexes

Longer Script Examples
§  Take a bed file of junctions from RNA-seq analysis
(e.g., TopHat output) and print out some basic
statistics
•  Open up the file bed_file_stats.pl
§  Other examples you would like to discuss?

Time for a little break...
Regular Expressions
https://xkcd.com/208/

Outline
§  What is a module (in Perl)?
§  Where do you get BioPerl?
§  What is BioPerl?
§  How do you use BioPerl?
§  How do you learn more about BioPerl?
§  Additional Resources

What is a module (in Perl)?
§  A module is a set of Perl variables and methods that are
written to accomplish a particular task
•  Enables the reuse of methods and variables
between Perl scripts / programs
•  Tested
•  End in “.pm” extension
§  Comprehensive Perl Archive Network (CPAN)
–  http://www.cpan.org
–  Type “cpan” in terminal to open

Creating a Module
# Save this file as Foo.pm
package Foo;
use strict;
use warnings;

sub bar {
    print "Hello $_[0]\n";
}

sub blat {
    print "World $_[0]\n";
}

1;   # a module file must end with a true value

Using a Module
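#!/usr/bin/perl
use strict;
use warnings;
use lib '.';   # look for Foo.pm in the current directory
use Foo;

# Foo does not export its subroutines, so call them by their full names
Foo::bar("a");    # prints "Hello a"
Foo::blat("b");   # prints "World b"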
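To call a module's subroutines by their short names, the module has to export them, typically with the core Exporter module. The sketch below inlines a made-up package called Greeter so that it runs as a single file; in practice the package would live in its own Greeter.pm and be loaded with "use Greeter qw(hello);".

```perl
use strict;
use warnings;

{
    package Greeter;               # would normally be the file Greeter.pm
    use Exporter 'import';         # gives Greeter an import() method
    our @EXPORT_OK = qw(hello);    # names a caller may import on request

    sub hello {
        my ($name) = @_;
        return "Hello $name\n";
    }
}

# With a separate Greeter.pm on disk this line would be: use Greeter qw(hello);
Greeter->import('hello');

my $imported  = hello("a");            # short name, available after import
my $qualified = Greeter::hello("b");   # full name works without importing
print $imported, $qualified;
```

Listing names in @EXPORT_OK (rather than @EXPORT) means nothing is imported unless the caller asks for it, which avoids name clashes between modules.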

-  Jason Stajich, Ph.D.
Assistant Professor at the University of California, Riverside
BioPerl developer since 2000

Where do you get BioPerl?
§  In-class tutorial
•  Already installed! Yeah!
§  URL
•  www.BioPerl.org
§  Modules
•  Bioperl-core
•  Bioperl-run
•  Bioperl-network
•  Bioperl-DB

What is BioPerl?
§  BioPerl is:
•  A collection of Perl modules for biological data and
analysis
•  An open source toolkit with many contributors
•  A flexible and extensible system for doing bioinformatics
data manipulation
•  Consists of >1500 modules; ~1000 considered core
§  Modules are interfaces to data types:
•  Sequences
•  Alignments
•  (Sequence) Features
•  Locations
•  Databases
Slide adapted from: Jason Stajich

With BioPerl you can…
§  Retrieve sequence data from NCBI
§  Transform sequence files from one format to another
§  Parse (or search) BLAST result files
§  Manipulate sequences, reverse complement, translate
coding DNA sequence to protein
§  And so on…
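As a taste of the sequence manipulation item above, a Bio::Seq object can reverse-complement and translate a DNA string. This is a sketch, not a complete program: the short sequence is made up, revcom_and_translate is a hypothetical helper name, and the code checks whether BioPerl is installed before using it.

```perl
use strict;
use warnings;

# Reverse-complement and translate a DNA string with Bio::Seq.
# Returns an empty list when BioPerl is not installed.
sub revcom_and_translate {
    my ($dna) = @_;
    return unless eval { require Bio::Seq; 1 };
    my $seq = Bio::Seq->new(-seq => $dna, -alphabet => 'dna');
    return ($seq->revcom->seq, $seq->translate->seq);
}

my ($revcom, $protein) = revcom_and_translate("ATGGCCAAATAA");
if (defined $revcom) {
    print "Reverse complement: $revcom\n";
    print "Translation:        $protein\n";
} else {
    print "BioPerl (Bio::Seq) is not installed\n";
}
```

Both revcom and translate return new sequence objects, so the original $seq is left unchanged and calls can be chained.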

Major Domains Covered

Additional Domains

Hypothetical Research Project
§  Interested in looking for universal vaccine candidates for
an Influenza virus
•  Would ultimately involve other programs and data (i.e.
epitope data)
§  Protocol
•  Obtain influenza HA sequence
–  2009 pandemic influenza virus hemagglutinin sequence for A/California/04/2009(H1N1) “FJ966082”
–  Convert into other formats
•  BLAST sequence to find similar sequences
•  Parse BLAST metadata and load into Excel
•  Align similar sequences and save alignment
•  Find motifs in sequences
•  Compute basic sequence metadata

Module:
Bio::SeqIO
§  Biological Sequence Input & Output
§  Bioinformatics file reading and writing
§  Enables easy file conversion
§  Example supported formats:
•  ABI, BSML, Fasta, Fastq, GCG, Genbank, Interpro,
KEGG, Lasergene, Phred Phd, Phred Qual, Pir,
Swissprot

How do we get Genbank sequence / file if
we have accession?
Sequence Retrieval from NCBI using Bio::DB::GenBank and Bio::SeqIO
#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;
use Bio::SeqIO;

my $accession = 'FJ966082';
my $genBank = Bio::DB::GenBank->new;
my $seq = $genBank->get_Seq_by_acc($accession);
my $seqOut = Bio::SeqIO->new(-format => 'genbank',
                             -file   => ">$accession.gb");
$seqOut->write_seq($seq);

(The downloaded file "FJ966082.gb" can also be found in the class folder)

Convert from GenBank to FASTA Format
#!/usr/bin/perl

use warnings;
use strict;
use Bio::SeqIO;

# create one SeqIO object to read in, and another to write out
my $seq_in = Bio::SeqIO->new(
    -file   => "FJ966082.gb",
    -format => "genbank"
);
my $seq_out = Bio::SeqIO->new(
    -file   => ">FJ966082.fa",
    -format => "fasta"
);

# write each entry in the input file to the output file
while (my $inseq = $seq_in->next_seq) {
    $seq_out->write_seq($inseq);
}
Slide adapted from: BioPerl HowTo

Bio::SeqIO Sequence Object Methods
Source: http://www.bioperl.org/wiki/HOWTO:Beginners

How to BLAST a Sequence
§  Options to BLAST a single sequence:
•  Go to NCBI GenBank website and BLAST
§  Options to BLAST multiple sequences
•  Use NCBI GenBank website / server to BLAST
through an API (application programming interface)
•  Setup BLAST software and databases on local
computer

A Few BLAST Details 
Query: ...GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDL...

Neighborhood words for the query word PQG, with scores:
PQG 18
PEG 15
PKG 14
PRG 14
PDG 13
PHG 13
PMG 13
PNG 13
PSG 13
PQA 12
PQN 12
etc…

Query:   325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365
             +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Subject: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330

Source: S. Altschul http://www.cs.umd.edu/class/fall2011/cmsc858s/BLAST.pdf
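The scores in the word list (PQG 18, PEG 15, and so on) can be reproduced by adding up substitution-matrix scores position by position. The slide does not name the matrix; the values below are BLOSUM62 entries, which happen to give exactly the listed scores, so treat that choice as an assumption. Only the pairs needed for this word are included, and word_score is a made-up helper name.

```perl
use strict;
use warnings;

# Subset of BLOSUM62 scores, keyed by "query residue + word residue".
my %blosum62 = (
    PP => 7, QQ => 5, GG => 6,
    QE => 2, QK => 1, QR => 1,
    QD => 0, QH => 0, QM => 0, QN => 0, QS => 0,
    GA => 0, GN => 0,
);

# Score a candidate word against the query word, position by position.
sub word_score {
    my ($query, $word) = @_;
    my $score = 0;
    for my $i (0 .. length($query) - 1) {
        my $pair = substr($query, $i, 1) . substr($word, $i, 1);
        die "No score for pair $pair in this subset\n"
            unless exists $blosum62{$pair};
        $score += $blosum62{$pair};
    }
    return $score;
}

for my $word (qw(PQG PEG PKG PQA)) {
    printf "%s %d\n", $word, word_score("PQG", $word);
}
# PQG 18
# PEG 15
# PKG 14
# PQA 12
```

For example, PEG scores 7 (P with P) + 2 (Q with E) + 6 (G with G) = 15.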

Module:
Bio::SearchIO
§  Biological Search Input & Output
§  Plugging in different parsers for pairwise alignment
objects
§  Searches parsed with Bio::SearchIO are organized as
follows (see SearchIO HOWTO and Parsing BLAST
HSPs for much more detail):
§  the Bio::SearchIO object contains
•  Results, which contain
–  Hits, which contain
§  HSPs.
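The Result, Hit, HSP nesting is easier to picture as nested data. The sketch below uses plain Perl hashes and arrays (via references) with made-up values; it is not the Bio::SearchIO API, only an illustration of the same three-level shape and of filtering HSPs by length and percent identity.

```perl
use strict;
use warnings;

# Made-up values for illustration only; not real BLAST output.
my @results = (
    {
        query => "query1",
        hits  => [
            { name => "hitA",
              hsps => [ { length => 120, percent_identity => 85 },
                        { length => 40,  percent_identity => 95 } ] },
            { name => "hitB",
              hsps => [ { length => 200, percent_identity => 45 } ] },
        ],
    },
);

# Walk results -> hits -> HSPs and keep the HSPs that pass both cutoffs.
sub filter_hsps {
    my ($results, $min_length, $min_identity) = @_;
    my @rows;
    for my $result (@$results) {
        for my $hit (@{ $result->{hits} }) {
            for my $hsp (@{ $hit->{hsps} }) {
                next unless $hsp->{length} > $min_length
                         && $hsp->{percent_identity} >= $min_identity;
                push @rows, join("\t", $result->{query}, $hit->{name},
                                 $hsp->{length}, $hsp->{percent_identity});
            }
        }
    }
    return @rows;
}

print "$_\n" for filter_hsps(\@results, 50, 50);
```

With Bio::SearchIO the three loops become calls to next_result, next_hit and next_hsp, but the nesting is the same.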

Parse BLAST output
#!/usr/bin/perl -w
use strict;
use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => 'blast',
                            -file   => 'blast-results.txt');

open(OUTFILE, '>blast-data.txt') or die "cannot open blast-data.txt: $!";

while (my $result = $in->next_result()) {
    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->length('total') > 50 && $hsp->percent_identity() >= 50) {
                print OUTFILE "Query = " . $result->query_name() . "\t" .
                    "Hit = " . $hit->name() . "\t" .
                    "Length = " . $hsp->length('total') . "\t" .
                    "Percent_id = " . $hsp->percent_identity() . "\n";
            }
        }
    }
}
close(OUTFILE);

Module:
Bio::SearchIO Methods
http://www.bioperl.org/wiki/HOWTO:SearchIO
Method                 Example                               Description
algorithm              BLASTX                                algorithm string
algorithm_version      2.2.4 [Aug-26-2002]                   algorithm version
query_name             20521485|dbj|AP004641.2               query name
query_accession        AP004641.2                            query accession
query_length           3059                                  query length
query_description      Oryza sativa ... 977CE9AF checksum.   query description
database_name          test.fa                               database name
database_letters       1291                                  number of residues in database
database_entries       5                                     number of database entries
available_statistics   effectivespaceused ... dbletters      statistics used
available_parameters   gapext matrix allowgaps gapopen       parameters used
num_hits               1                                     number of hits

Parsed Output in Excel
§  Drag blast-data.txt file onto Microsoft Excel icon to
open
§  Enables user to quickly harness Excel knowledge and
abilities to do meta analysis of BLAST results

Module:
Bio::AlignIO
§  Bioinformatics multiple sequence alignment input &
output
§  Pluggable parsers and renderers for multiple
sequence alignments
§  A summary of multiple alignment formats is also a
good introduction to the file formats

Extract the HSPs to a FASTA file using
Bio::AlignIO
#!/usr/bin/perl -w
use strict;
use Bio::AlignIO;
use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => 'blast', -file => 'blast-results.txt');

my $alnIO = Bio::AlignIO->new(-format => "fasta", -file => ">hsp.fas");

while (my $result = $in->next_result()) {
    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->length('total') > 50 && $hsp->percent_identity() >= 50) {
                my $aln = $hsp->get_aln;
                $alnIO->write_aln($aln);
            }
        }
    }
}

Finding Motifs in Sequences
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $file = 'hsp.fas';
my $motif = "[ATG]A";
#my $motif = '(A[^T]{2,}){2,}';

my $in = Bio::SeqIO->new(-format => 'fasta', -file => $file);
my $motif_count = 0;

while (my $seq = $in->next_seq) {
    my $str = $seq->seq;           # get the sequence as a string
    if ($str =~ /$motif/i) {
        $motif_count++;            # number of sequences that have this motif
    }
}

printf "%d sequences have the motif %s\n", $motif_count, $motif;

Using Bio::SeqIO to
Calculate Sequence Metadata
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $file = "hsp.fas";
my $seq_in = Bio::SeqIO->new(-file => $file, -format => "fasta");
my ($seqcount, $basecount, $basecount_nostops) = (0, 0, 0);

while (my $inseq = $seq_in->next_seq) {
    $seqcount++;                          # count the number of sequences
    $basecount += $inseq->length;         # count bases in whole db
    my $str = $inseq->seq;                # get the sequence as a string
    $str =~ s/\*//g;                      # remove all '*' from sequence
    $basecount_nostops += length($str);   # add bases from the cleaned string
}

print "In $file there are $seqcount sequences, and $basecount bases ($basecount_nostops ignoring *)\n";

Additional Bioperl Examples
§  Review “examples” directory within bioperl directory

Resources
§  BioPerl API (the details)
•  http://doc.bioperl.org/releases/bioperl-1.6.1/
§  BioPerl Tutorials
•  http://www.BioPerl.org/wiki/HOWTOs
§  BCBB Handout(s)
•  http://collab.niaid.nih.gov/sites/research/SIG/Bioinformatics/seminars.aspx
§  Jason Stajich
•  https://github.com/hyphaltip/htbda_perl_class/tree/master/examples/BioPerl
•  http://courses.stajich.org/gen220/lectures/

EMBOSS
§  European Molecular Biology Open Source Suite
§  Command line programs to accomplish many
bioinformatics tasks
§  Bioperl-run has numerous wrappers for EMBOSS
programs
§  Download
•  http://emboss.sourceforge.net
§  Try out
•  http://helixweb.nih.gov/emboss/

Thank you!

If you have Questions or Comments, please contact us:
andrew.oler@nih.gov
ScienceApps@niaid.nih.gov
http://bioinformatics.niaid.nih.gov
