BCBB Bioinformatics Development Series
April 30, 2014
Andrew Oler, PhD
High-throughput Sequencing Bioinformatics Specialist
BCBB/OCICB/NIAID/NIH
Bioinformatics & Computational Biology
Branch (BCBB)
Biocomputing Research Consulting and
Scientific Software Development
High
Throughput
Illustration
Animation
http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx
ScienceApps@niaid.nih.gov
Outline
§  Introduction
§  Perl programming principles
o Variables
o Flow controls/Loops
o File manipulation
o Regular expressions
§  BioPerl
o What is BioPerl?
o How do you use BioPerl?
o How do you learn more about BioPerl?
Introduction
•  An interpreted programming language created in
1987 by Larry Wall
•  Good at processing and transforming plain text, like
GenBank or PDB files
•  Official motto: “TMTOWTDI” (There’s More Than
One Way To Do It!)
•  Extensible – currently has a large and active user
base who are constantly adding new functional
libraries
•  Portable – can use in Windows, Mac, & Linux/Unix
Introduction
"Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing,
summarizing and otherwise mangling text. Although the biological sciences
do involve a good deal of numeric analysis now, most of the primary data is
still text: clone names, annotations, comments, bibliographic references.
Even DNA sequences are textlike. Interconverting incompatible data
formats is a matter of text mangling combined with some creative
guesswork. Perl's powerful regular expression matching and string
manipulation operators simplify this job in a way that isn't equalled by any
other modern language."
Examples of Bioinformatics Software with
Perl components
§  GBrowse, GMOD
§  samtools
§  Illumina CASAVA
§  MEME
§  Velvet
§  miRDeep
§  Rosetta
§  ViennaRNA
§  RUM
§  Trinity
§  NCBI BLAST
§  I-TASSER
§  MAKER
§  ...Many more
§  http://stackoverflow.com/questions/2527170/why-is-perl-used-so-
extensively-in-biology-research
§  http://programmers.stackexchange.com/questions/92916/why-is-perl-so-
heavily-used-in-bioinformatics
Getting Perl
•  Latest version – 5.18.2
•  http://www.perl.org/
5.12.3
(Lion)
Getting Help
•  perl -v
•  Perl manual pages
•  Books and Documentation:
–  http://www.perl.org/docs.html
–  The O’Reilly Books:
§  Learning Perl
§  Programming Perl
§  Perl Cookbook, etc.
•  http://www.cpan.org
•  http://perldoc.perl.org/perlintro.html
•  BCBB – for help writing your custom scripts
perldoc perl
perldoc perlintro
File Manager/Browser by Operating System
OS: Windows Mac OSX Unix
FM: Explorer Finder Shell
Input
Method:
Running Perl scripts
Anatomy of the Terminal, “Command Line”,
or “Shell”
Prompt (computer_name:current_directory username)
Cursor
Command Argument
Window
Output
Mac: Applications -> Utilities -> Terminal
Windows: Download open source software
PuTTY http://www.chiark.greenend.org.uk/~sgtatham/putty/
Other SSH Clients (http://en.wikipedia.org/wiki/Comparison_of_SSH_clients)
Cygwin (http://www.cygwin.com/)
How to execute a command
command argument
output
output
cd (“change directory”) and
mkdir (“make directory”)
cd ~ change to home directory
cd test_data change to “test_data” directory
cd .. change to higher directory (“go up”)
cd ~/unix_hpc change to home directory > “unix_hpc” directory
mkdir dir_name make directory “dir_name”
pwd “print working directory”
***See Handout “HPC Cluster Computing and Unix Basics Handout” for
more helpful Unix Terminal commands.***
"Hello world" script
•  hello_world.pl file
•  Run hello_world.pl
#!/usr/bin/perl
# This is a comment
print "Hello world\n";
>perl hello_world.pl
Hello world
>perl -e 'print "Hello world\n";'
Hello world
The shebang line must be the first line.
It tells the computer where to find perl.
•  print is a Perl function name
•  Double quotes are used for Strings
•  The semi-colon must be present at the end of
every command in Perl
A Few Helpful Things for a Template
§  #!/usr/bin/env perl
§  $| = 1; # Unbuffered output, so prints show up immediately and in order (for debugging)
§  use warnings; # Helpful warnings (for debugging)
§  use diagnostics; # Longer explanations of those warnings (for debugging)
§  use strict; # Requires you to declare variables
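Put together, the template lines above might look like the following skeleton. This is only a sketch: the `main` subroutine and its message are placeholders invented for illustration, and `use diagnostics` is left out here because it is mostly useful while debugging.

```perl
#!/usr/bin/env perl
use strict;      # variables must be declared with "my"
use warnings;    # report likely mistakes, such as using an undefined value

$| = 1;          # autoflush STDOUT so output is not held in a buffer

# Hypothetical entry point; replace the body with your own code.
sub main {
    my @args = @_;
    my $message = "Template is running with " . scalar(@args) . " argument(s)";
    print "$message\n";
    return $message;
}

main(@ARGV);
```

Saving this as a file and running it with `perl` followed by the file name prints the message with the number of command-line arguments it received.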
Basic Programming Concepts
•  Variables
–  Scalars
–  Arrays
–  Hashes
•  Flow Control
–  if/else
–  unless
•  Loops
–  for
–  foreach
–  while
–  until
•  Files
•  Regexes
Variables
§ In computer programming, a variable is a symbolic
name used to refer to a value – Wikipedia
o Examples
•  Variable names can contain letters, numbers, and _,
but cannot begin with a number
•  $five_prime OK
•  $5_prime NO
  $x = 4;
  $y = 1.0;
  $name = 'Bob';
  $seq = "ACTGTTGTAAGC";
Perl will treat integers and floating
point numbers as numbers, so $x and
$y can be used together in an
equation.
Strings are indicated by either
single or double quotes.
Perl Variables
•  Scalar
•  Array
•  Hash
Variables - Scalar
•  Can store a single string, or number
•  Begins with a $
•  Single or double quotes for strings
  my $x = 4; # use “my” to declare a variable
  my $name = 'Bob';
  my $seq= "ACTGTTGTAAGC";
  print "My name is $name.";
#prints My name is Bob.
http://perldoc.perl.org/perlintro.html
Scalar Operators
Arithmetic
+ addition
- subtraction
* multiplication
/ division
++ increment (by one)
-- decrement (by one)
+= increment (by value)
-= decrement (by value)
Numeric Comparison
== equality
!= inequality
< less than
> greater than
<= less than or equal
>= greater than or equal
String Comparison
eq equality
ne inequality
lt less than
gt greater than
le less than or equal
ge greater than or equal
Boolean Logic
&& and
|| or
! not
Miscellaneous
= assignment
. string concatenation
.. range operator
Examples:
$m = 3;
$f = "hello";
if ($x == 3)
if ($x eq 'hi')
Common Scalar Functions
Function Name Description
length Length of the scalar value
lc Lower case value
uc Upper case value
reverse Returns the value in the opposite order
substr Returns a substring
chomp Removes the trailing newline (\n) character
chop Removes the last character
defined Checks if scalar value exists
split Splits scalar into array
http://perldoc.perl.org/index-functions.html How to use any Perl function
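As a small illustration of the functions in the table, the sketch below applies a few of them to a DNA string. The sequence and variable names are chosen only for this example.

```perl
use strict;
use warnings;

my $seq = "actgttgtaagc\n";
chomp($seq);                              # remove the trailing newline
my $upper       = uc($seq);               # ACTGTTGTAAGC
my $length      = length($seq);           # 12
my $first_codon = substr($upper, 0, 3);   # 3 characters starting at offset 0: ACT
my $backwards   = reverse($upper);        # scalar context reverses the string

print "$upper is $length bases long\n";
print "First codon: $first_codon\n";
print "Reversed: $backwards\n";
```

Note that `reverse` reverses the characters of a string only in scalar context, as in the assignment above. Written as `print reverse($upper)`, it is in list context and the string would come out unchanged.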
Common Scalar Functions Examples
my $string = "This string has a newline.\n";
chomp $string;
print $string;
#prints "This string has a newline."
$string = lc($string);
print $string;
#prints "this string has a newline."
my @array = split(" ", $string);
#array looks like ["this", "string", "has",
# "a", "newline."]
Scalar Variables Exercise
§  Write a program that computes the circumference of a
circle with a radius of 12.5
§  C = 2 * π * r
§  (Answer should be about 78.5)
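One possible solution to the exercise, offered as a sketch. Perl has no built-in constant for π, so this version derives it from the built-in `atan2` function; typing 3.14159 directly would work just as well.

```perl
use strict;
use warnings;

my $pi     = 4 * atan2(1, 1);    # atan2(1, 1) is pi/4
my $radius = 12.5;
my $circumference = 2 * $pi * $radius;

# %.1f rounds the number to one decimal place for display
printf "Circumference: %.1f\n", $circumference;
```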
Array
Value:  Andrew  Burke  Darrell  Vijay  Mike
Index:  0       1      2        3      4
•  Stores a list of scalar values (strings or numbers)
•  Zero based index
Variables - Array
•  Begins with @
•  Use the () brackets for creating
•  Use the $ and [] brackets for retrieving a single
element in the array
my @grades = (75, 80, 35);
my @mixnmatch = (5, "A", 4.5);
my @names = ("Bob", "Vivek", "Jane");
# zero-based index
my $first_name = $names[0];
# retrieve the last item in an array
my $last_name = $names[-1];
Common Array Functions
Function Name Description
scalar Size of the array
push Add value to end of an array
pop Removes the last element from an array
shift Removes the first element from an array
unshift Add value to the beginning of an array
join Convert array to scalar
splice Removes or replaces specified range of elements from array
grep Search array elements
sort Orders array elements
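The sketch below tries out several of the functions from the table on a small list of numbers. The data is made up for illustration.

```perl
use strict;
use warnings;

my @grades = (75, 80, 35, 100);

my $count   = scalar(@grades);               # number of elements: 4
my @sorted  = sort { $a <=> $b } @grades;    # numeric sort: 35 75 80 100
my @passing = grep { $_ >= 50 } @grades;     # keep elements that pass the test
my $line    = join(", ", @sorted);           # one string: "35, 75, 80, 100"
my @removed = splice(@sorted, 1, 2);         # remove 2 elements starting at index 1

print "$count grades: $line\n";
print "Passing: @passing\n";
print "Removed @removed, leaving @sorted\n";
```

The `{ $a <=> $b }` block matters: without it, `sort` compares values as strings, so 100 would sort before 35.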
push/pop modifies the end of an array
@names = Tim Molly Betty Chris
push(@names, "Charles");
@names = Tim Molly Betty Chris Charles
pop(@names);
@names = Tim Molly Betty Chris
shift/unshift modifies the start of an array
@names = Tim Molly Betty Chris
unshift(@names, "Charles");
@names = Charles Tim Molly Betty Chris
shift(@names);
@names = Tim Molly Betty Chris
Variables - Hashes
KEYS VALUES
Title Programming Perl, 3rd Edition
Publisher O’Reilly Media
ISBN 978-0-596-00027-1
•  Stores data using key, value pairs
Variables - Hash
§  Indicated with %
§  Use the () brackets and => pointer for creating
§  Use the $ and {} brackets for setting or retrieving a
single element from the hash
my %book_info = (
title =>"Perl for bioinformatics",
author => "James Tisdall",
pages => 270,
price => 40
);
print $book_info{"author"};
#returns "James Tisdall"
Common Hash Functions
Function Name Description
keys Returns array of keys
values Returns array of values
reverse Swaps keys and values in a hash (values become keys)
Retrieving keys or values of a hash
•  Retrieving single value
•  Retrieving all the keys/values as an
array
•  NOTE: Keys and values are unordered
my $book_title = $book_info{"title"};
#$book_title has stored "Perl for bioinformatics"
my @book_attributes = keys %book_info;
my @book_attribute_values = values %book_info;
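Beyond reading keys and values, a hash can be updated, tested, and pruned. The following is a minimal sketch using the same book hash; the `isbn` key and the new price are invented for the example.

```perl
use strict;
use warnings;

my %book_info = (
    title  => "Perl for bioinformatics",
    author => "James Tisdall",
    pages  => 270,
    price  => 40,
);

$book_info{price} = 35;        # overwrite the value of an existing key
delete $book_info{pages};      # remove a key and its value

# exists tests whether a key is present in the hash
my $has_isbn = exists $book_info{isbn} ? "yes" : "no";

# sort the keys so the output order is the same on every run
my @lines;
foreach my $key (sort keys %book_info) {
    push(@lines, "$key = $book_info{$key}");
}
print "$_\n" foreach @lines;
print "Has ISBN? $has_isbn\n";
```

Sorting the keys is a common way to get repeatable output, since a hash does not keep its keys in any particular order.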
Variables summary
# A. Scalar variable
my $first_name = "andrew";
my $last_name = "oler";
# B. Array variable
# use 'circular' bracket and @ symbol for assignment
my @personal_info = ("andrew", $last_name);
# use 'square' bracket and the integer index to access an entry
my $fname = $personal_info[0];
# C. Hash variable
# use 'circular' brackets (similar to array) and % symbol for assignment
my %personal_info = (
first_name => "andrew",
last_name => "oler"
);
# use 'curly' brackets to access a single entry
my $fname1 = $personal_info{first_name};
Tutorial 1
§ Create a variable with the following sequence:
ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE ALA LEU
GLY SER GLY ALA THR THR
§ print in lowercase
§ split into an array
§ print the array
§ print the first value in the array
§ shift the first value off the array and store it in a
variable
§ print the variable and the array
§ push the variable onto the end of the array
§ print the array
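One way to work through the tutorial steps, given as a sketch rather than the official answer. The variable names are arbitrary.

```perl
use strict;
use warnings;

my $seq = "ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE ALA LEU "
        . "GLY SER GLY ALA THR THR";

print lc($seq), "\n";                 # print in lowercase

my @residues = split(" ", $seq);      # split into an array
print "@residues\n";                  # print the array
print "$residues[0]\n";               # print the first value

my $first = shift(@residues);         # remove the first value and keep it
print "$first\n";
print "@residues\n";

push(@residues, $first);              # add it back at the end
print "@residues\n";
```

Putting an array inside double quotes, as in `"@residues"`, prints its elements separated by spaces.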
Flow Controls
•  If/elsif/else
•  unless
  $x = 4;
  if ($x > 4) {
  print "I am greater than 4";
  }elsif ($x == 4) {
  print "I am equal to 4";
  }else {
  print "I am less than 4";
  }
  unless($x > 4) {
  print "I am not greater than 4";
  }
Post-condition
# the traditional way
if ($x == 4) {
print "I am 4.";
}
# this line below is equivalent to the
if statement above, but you can only
use it if you have a one line action
print "I am 4." if ( $x == 4 );
print "I am not 4." unless ( $x == 4);
Loops
•  for (EXPR; EXPR; EXPR)
•  foreach
  for ( my $x = 0; $x < 4 ; $x++ ) {
    print "$x\n";
  }
  #prints 0, 1, 2, 3 on separate lines
  my @names = ("Bob", "Vivek", "Jane");

  foreach my $name (@names) {
    print "My name is $name.\n";
  }
  #prints:
  #My name is Bob.
  #My name is Vivek.
  #My name is Jane.
Hashes with foreach
my %book_info = (
title =>"Perl for Bioinformatics",
author => "James Tisdall");
  foreach my $key (keys %book_info) {
    print "$key : $book_info{$key}\n";
  }
  #prints (key order is not guaranteed):
  #title : Perl for Bioinformatics
  #author : James Tisdall
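Loops - continued
•  while
•  until
  # until: repeat as long as the condition is false
  my $x = 0;
  until ($x >= 4) {
    print "$x\n";
    $x++;
  }
  # while: repeat as long as the condition is true
  $x = 0;
  while ($x < 4) {
    print "$x\n";
    $x++;
  }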
Tutorial 2
§  Iterate through the array (using foreach) and print
everything unless ILE
§  Use a hash to count how many times each amino acid
occurs
§  Iterate through the hash and print the counts in a table
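A possible solution to this tutorial, again only a sketch. It assumes the array holds the residues from Tutorial 1 and rebuilds it here so the example runs on its own.

```perl
use strict;
use warnings;

my $seq = "ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE ALA LEU "
        . "GLY SER GLY ALA THR THR";
my @residues = split(" ", $seq);

# print every residue unless it is ILE
foreach my $residue (@residues) {
    print "$residue\n" unless $residue eq "ILE";
}

# count how many times each amino acid occurs
my %count;
foreach my $residue (@residues) {
    $count{$residue}++;    # a missing key starts at 0 and becomes 1
}

# print the counts as a two-column, tab-separated table
foreach my $residue (sort keys %count) {
    print "$residue\t$count{$residue}\n";
}
```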
Files
•  Existence
o  if(-e $file)
•  Open
o  Read open(my $fh, "<", $file);
o  New open(my $fh, ">", $file);
o  Append open(my $fh, ">>", $file);
•  Read (for input/read file handle)
o  while(<$fh>){ }
o  Each line is assigned to special variable $_
•  Write (for output--new/append--file handle)
o  print $fh $string;
•  Close
o  close($fh);
Directory
•  Existence
o  if(-d $directory)
•  Open
o  opendir(my $dh, $directory)
•  Read
o  readdir($dh)
•  Close
o  closedir($dh)
•  Create
o  mkdir($directory) unless (-d $directory)
# A. Reading file
# create a variable that can tell the program where to find your data
my $file = "/Users/oleraj/Documents/perlTutorials/myFile.txt";
# Check if file exists and read through it
if (-e $file) {
    # $! holds the system's reason if the open fails
    open(my $fh, "<", $file) or die "cannot open file $file: $!";
    while (<$fh>) {
        chomp;
        my $line = $_;
        # do something useful here
    }
    close($fh);
}
# B. Reading directory
my $directory = "/Users/oleraj";
if (-d $directory) {
    opendir(my $dh, $directory) or die "cannot open directory $directory: $!";
    my @files = readdir($dh);
    closedir($dh);
    # print one entry per line
    print "$_\n" foreach @files;
}
Notice the special variable $_.
Inside the while loop, it holds the
line that was just read from the file.
The array @files will hold the name
of every entry in the directory (including "." and "..").
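The example above only reads. The sketch below shows the other two modes, writing a new file and appending to it, then reads the result back. It uses a temporary file from the core `File::Temp` module so nothing on disk is overwritten; that choice, and the text written, are assumptions made for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# create a scratch file that is deleted when the program exits
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close($tmp_fh);

# ">" creates the file, or empties it if it already exists
open(my $out, ">", $file) or die "cannot write $file: $!";
print $out "line one\n";
close($out);

# ">>" adds to the end of the file
open(my $app, ">>", $file) or die "cannot append to $file: $!";
print $app "line two\n";
close($app);

# "<" opens the file for reading
my @lines;
open(my $in, "<", $file) or die "cannot read $file: $!";
while (my $line = <$in>) {
    chomp $line;
    push(@lines, $line);
}
close($in);

print scalar(@lines), " lines read: @lines\n";
```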
Regular Expressions (REGEX)
•  "A regular expression ... is a set of pattern matching rules
encoded in a string according to certain syntax rules." -wikipedia
•  Fast and efficient for "Fuzzy" matches
•  Applications:
•  Checking if a string fits a pattern
•  Extracting a pattern match from a string
•  Altering the pattern within the string
•  Example - Find all sequences from human
•  $seq_name =~ /(human|Homo sapiens)/i;
•  Uses
1.  Find/match only (yes/no) with m// or //
§  e.g., m/regex/; m/human/
2.  Find and replace a string with s///
§  e.g., s/regex/replacement/; s/human/Homo sapiens/
3.  Translate character by character with tr///
§  e.g., tr/list/newlist/; tr/abcd/1234/;
Beginning Perl for Bioinformatics - James Tisdall
Simple Examples
my $protein = "MET SER ASN ASN THR SER";
$protein =~ s/SER/THR/g;
print $protein;
#prints "MET THR ASN ASN THR THR";
$protein =~ m/asn/i;
#will match ASN
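The three operators also return useful values, which the short sketch below puts to work. The DNA string and variable names are made up for the example.

```perl
use strict;
use warnings;

my $protein = "MET SER ASN ASN THR SER";

# m// returns true or false, so it can be used directly as a condition
my $has_asn = ($protein =~ m/asn/i) ? "yes" : "no";

# s///g returns the number of replacements it made
my $mutated  = $protein;
my $replaced = ($mutated =~ s/SER/THR/g);

# tr/// with an empty replacement list counts characters without changing them
my $dna      = "ACTGTTGTAAGC";
my $gc_count = ($dna =~ tr/GC//);

# tr/// maps each listed character to its partner in the second list
(my $rna = $dna) =~ tr/T/U/;

print "Contains ASN: $has_asn\n";
print "$replaced replacements gave $mutated\n";
print "$dna has $gc_count G or C bases\n";
print "As RNA: $rna\n";
```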
Regular Expressions (REGEX)
Symbol Meaning
. Match any one character (except
newline).
^ Match at beginning of string
$ Match at end of string
\n Match the newline
\t Match a tab
\s Match any whitespace character
\w Match any word
character (alphanumeric plus "_")
\W Match any non-word character
\d Match any digit character
[A-Za-z] Match any letter
[0-9] same as \d
my $string = "See also xyz";
$string =~ /See also ./;
#matches "See also x"
$string =~ /^./;
#matches "S"
$string =~ /.$/;
#matches "z"
$string =~ /\w\s\w/;
#matches "e a"
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
Regular Expressions (REGEX)
Quantifier Meaning
* Match 0 or more times
+ Match at least once
? Match 0 or 1 times
{COUNT} Match exactly COUNT times.
{MIN,} Match at least MIN times (maximal).
{MIN,MAX} Match at least MIN but not more
than MAX times (maximal).
my $string = "See also xyz";
$string =~ /See also .*/;
#matches "See also xyz"
$string =~ /^.*/;
#matches "See also xyz"
$string =~ /.?$/;
#matches "z"
$string =~ /\w+\s+\w+/;
#matches "See also"
REGEX Examples
my $string = ">ref|XP_001882498.1| retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]";
$string =~ /\s.*virus/;
#will match " retrovirus"
$string =~ /XP_\d+/;
#will match "XP_001882498"
$string =~ /XP_\d/;
#will match "XP_0"
$string =~ /\[.*\]$/;
#will match "[Laccaria bicolor S238N-H82]"
$string =~ /^.*\|/;
#will match ">ref|XP_001882498.1|"
$string =~ /^.*?\|/;
#will match ">ref|"
$string =~ s/\|/:/g;
#string becomes ">ref:XP_001882498.1: retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]"
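To extract the matched pieces rather than only test for them, parentheses can capture parts of a match into `$1`, `$2`, and so on. The sketch below pulls the fields out of the same header line; `parse_header` and its field names are hypothetical, and the pattern assumes headers shaped exactly like this one.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of fields, or an empty list if no match.
sub parse_header {
    my ($header) = @_;
    if ($header =~ /^>(\w+)\|(XP_\d+)\.(\d+)\|\s*(.*?)\s*\[(.*)\]$/) {
        return (
            database    => $1,    # text between ">" and the first "|"
            accession   => $2,    # XP_ followed by digits
            version     => $3,    # digits after the "."
            description => $4,    # free text before the "["
            organism    => $5,    # text inside the final [ ]
        );
    }
    return;
}

my %field = parse_header(
    ">ref|XP_001882498.1| retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]"
);

print "$_: $field{$_}\n" foreach sort keys %field;
```

The characters `|`, `.`, `[`, and `]` have special meanings in a regular expression, so each one is escaped with a backslash to match it literally.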
Tutorial 3
§  open the file "example.fa"
§  read through the file
§  print the id lines for the human sequences (NOTE: the
ids will start with HS)
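A sketch of one approach to this tutorial. Since the contents of example.fa are not shown here, the code reads a few made-up records from a string instead, and it assumes that an id line looks like `>HS...`; the `human_id_lines` helper is hypothetical.

```perl
use strict;
use warnings;

# Return the id lines that start with ">HS" from an open file handle.
sub human_id_lines {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        chomp $line;
        push(@ids, $line) if $line =~ /^>HS/;
    }
    return @ids;
}

# Made-up records standing in for example.fa
my $fasta = ">HS000001 sample human record\nACTG\n"
          . ">MM000002 sample mouse record\nGGCC\n"
          . ">HS000003 another human record\nTTAA\n";

# Passing a reference to a string opens it as an in-memory file
open(my $fh, "<", \$fasta) or die "cannot open in-memory file: $!";
my @human = human_id_lines($fh);
close($fh);

print "$_\n" foreach @human;
```

To run it on the real file, open it with `open(my $fh, "<", "example.fa")` in place of the in-memory string.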
Summary of Basics
•  Variables
–  Scalar
–  Array
–  Hash
•  Flow Control
–  if/else
–  unless
•  Loops
–  for
–  foreach
–  while
–  until
•  Files
•  Regexes
Longer Script Examples
§  Take a bed file of junctions from RNA-seq analysis
(e.g., TopHat output) and print out some basic
statistics
•  Open up the file bed_file_stats.pl
§  Other examples you would like to discuss?
Time for a little break...
Regular Expressions
https://xkcd.com/208/
Outline
§  What is a module (in Perl)?
§  Where do you get BioPerl?
§  What is BioPerl?
§  How do you use BioPerl?
§  How do you learn more about BioPerl?
§  Additional Resources
What is a module (in Perl)?
§  A module is a set of Perl variables and methods that are
written to accomplish a particular task
•  Enables the reuse of methods and variables
between Perl scripts / programs
•  Tested
•  End in “.pm” extension
§  Comprehensive Perl Archive Network (CPAN)
–  http://www.cpan.org
–  Type “cpan” in terminal to open
Creating a Module
#!/usr/bin/perl
# Save this file as Foo.pm

package Foo;
sub bar {
    print "Hello $_[0]\n";
}

sub blat {
    print "World $_[0]\n";
}
# a module file must end with a true value
1;
Using a Module
#!/usr/bin/perl

use Foo;

# Foo does not export its subroutines, so call them with the package name
Foo::bar( "a" );
Foo::blat( "b" );
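For a script to call `bar("a")` without a package prefix, the module has to export that name, for instance with the core `Exporter` module. The sketch below keeps the package and the calling code in one file so it runs on its own; in practice the `Foo` package would sit in its own Foo.pm file. Returning the strings instead of printing them is a change made here for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In a real project this package would live in its own file, Foo.pm.
package Foo;
use Exporter 'import';            # borrow Exporter's import method
our @EXPORT_OK = qw(bar blat);    # names a caller is allowed to request

sub bar  { return "Hello $_[0]\n" }
sub blat { return "World $_[0]\n" }

package main;

# With Foo.pm on disk this line would be: use Foo qw(bar blat);
Foo->import(qw(bar blat));

my $greeting = bar("a");
my $reply    = blat("b");
print $greeting;
print $reply;
```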
-  Jason Stajich, Ph.D.
Assistant Professor at the University of California, Riverside
BioPerl developer since 2000
Where do you get BioPerl?
§  In-class tutorial
•  Already installed! Yeah!
§  URL
•  www.BioPerl.org
§  Modules
•  Bioperl-core
•  Bioperl-run
•  Bioperl-network
•  Bioperl-DB
What is BioPerl?
§  BioPerl is:
•  A collection of Perl modules for biological data and
analysis
•  An open source toolkit with many contributors
•  A flexible and extensible system for doing bioinformatics
data manipulation
•  Consists of >1500 modules; ~1000 considered core
§  Modules are interfaces to data types:
•  Sequences
•  Alignments
•  (Sequence) Features
•  Locations
•  Databases
Slide adapted from: Jason Stajich
With BioPerl you can…
§  Retrieve sequence data from NCBI
§  Transform sequence files from one format to another
§  Parse (or search) BLAST result files
§  Manipulate sequences, reverse complement, translate
coding DNA sequence to protein
§  And so on…
Slide adapted from: Jason Stajich
Major Domains Covered
Slide adapted from: Jason Stajich
Additional Domains
Slide adapted from: Jason Stajich
Hypothetical Research Project
§  Interested in looking for universal vaccine candidates for
an Influenza virus
•  Would ultimately involve other programs and data (i.e.
epitope data)
§  Protocol
•  Obtain influenza HA sequence
–  2009 pandemic influenza virus hemagglutinin sequence for A/
California/04/2009(H1N1) “FJ966082”
–  Convert into other formats
•  BLAST sequence to find similar sequences
•  Parse BLAST metadata and load into Excel
•  Align similar sequences and save alignment
•  Find motifs in sequences
•  Compute basic sequence metadata
Module:
Bio::SeqIO
§  Biological Sequence Input & Output
§  Bioinformatics file reading and writing
§  Enables easy file conversion
§  Example supported formats:
•  ABI, BSML, Fasta, Fastq, GCG, Genbank, Interpro,
KEGG, Lasergene, Phred Phd, Phred Qual, Pir,
Swissprot
How do we get Genbank sequence / file if
we have accession?
Sequence Retrieval from NCBI using Bio::DB::GenBank and Bio::SeqIO
#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;
use Bio::SeqIO;

my $accession = 'FJ966082';
# connect to GenBank and fetch the record by accession number
my $genBank = Bio::DB::GenBank->new;
my $seq = $genBank->get_Seq_by_acc($accession);
# write the record to FJ966082.gb in GenBank format
my $seqOut = Bio::SeqIO->new(-format => 'genbank',
                             -file => ">$accession.gb");
$seqOut->write_seq($seq);

(The downloaded file "FJ966082.gb" can also be found in the class folder)
Slide adapted from: Jason Stajich
Convert from GenBank to FASTA Format
#!/usr/bin/perl

use warnings;
use strict;
use Bio::SeqIO;

# create one SeqIO object to read in, and another to write out
my $seq_in = Bio::SeqIO->new(
    -file => "FJ966082.gb",
    -format => "genbank"
);
my $seq_out = Bio::SeqIO->new(
    -file => ">FJ966082.fa",
    -format => "fasta"
);

# write each entry in the input file to the output file
while (my $inseq = $seq_in->next_seq) {
    $seq_out->write_seq($inseq);
}
Slide adapted from: BioPerl HowTo
Bio::SeqIO Sequence Object Methods
Source: http://www.bioperl.org/wiki/HOWTO:Beginners
How to BLAST a Sequence
§  Options to BLAST a single sequence:
•  Go to NCBI GenBank website and BLAST
§  Options to BLAST multiple sequences
•  Use NCBI GenBank website / server to BLAST
through an API (application programmers interface)
•  Setup BLAST software and databases on local
computer
A Few BLAST Details 
Query: ...GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDL...

PQG 18
PEG 15
PKG 14
PRG 14
PDG 13
PHG 13
PMG 13
PNG 13
PSG 13
PQA 12
PQN 12
etc…





Query: 325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365
a +LA++L+ TP G R++ +W+ P+ D + ER + A
Subject: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330

Source: S. Altschul http://www.cs.umd.edu/class/fall2011/cmsc858s/BLAST.pdf
Detail:
BLOSUM Matrix
Module:
Bio::SearchIO
§  Biological Search Input & Output
§  Plugging in different parsers for pairwise alignment
objects
§  Searches parsed with Bio::SearchIO are organized as
follows (see SearchIO HOWTO and Parsing BLAST
HSPs for much more detail):
§  the Bio::SearchIO object contains
•  Results, which contain
–  Hits, which contain
§  HSPs.
Parse BLAST output
#!/usr/bin/perl -w
use strict;
use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => 'blast',
                            -file => 'blast-results.txt');

open(my $out, '>', 'blast-data.txt') or die "Cannot write blast-data.txt: $!";

while (my $result = $in->next_result()) {
    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            # keep HSPs longer than 50 with at least 50% identity
            if ($hsp->length('total') > 50 && $hsp->percent_identity() >= 50) {
                # one tab-separated line per HSP
                print $out "Query = " . $result->query_name() . "\t" .
                           "Hit = " . $hit->name() . "\t" .
                           "Length = " . $hsp->length('total') . "\t" .
                           "Percent_id = " . $hsp->percent_identity() . "\n";
            }
        }
    }
}
close($out);
Module:
Bio::SearchIO Methods
http://www.bioperl.org/wiki/HOWTO:SearchIO
Method                Example                              Description
algorithm             BLASTX                               algorithm string
algorithm_version     2.2.4 [Aug-26-2002]                  algorithm version
query_name            20521485|dbj|AP004641.2              query name
query_accession       AP004641.2                           query accession
query_length          3059                                 query length
query_description     Oryza sativa ... 977CE9AF checksum.  query description
database_name         test.fa                              database name
database_letters      1291                                 number of residues in database
database_entries      5                                    number of database entries
available_statistics  effectivespaceused ... dbletters     statistics used
available_parameters  gapext matrix allowgaps gapopen      parameters used
num_hits              1                                    number of hits
Parsed Output in Excel
§  Drag blast-data.txt file onto Microsoft Excel icon to
open
§  Enables user to quickly harness Excel knowledge and
abilities to do meta analysis of BLAST results
Module:
Bio::AlignIO
§  Bioinformatics multiple sequence alignment input &
output
§  Pluggable parsers and renderers for multiple
sequence alignments
§  A summary of multiple alignment formats is also a
good introduction to the file formats
Extract the HSPs to a FASTA file using
Bio::AlignIO
#!/usr/bin/perl -w
use strict;
use Bio::AlignIO;
use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => 'blast', -file => 'blast-results.txt');

my $alnIO = Bio::AlignIO->new(-format => "fasta", -file => ">hsp.fas");

while (my $result = $in->next_result()) {
    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            if ($hsp->length('total') > 50 && $hsp->percent_identity() >= 50) {
                # get the HSP as an alignment object and write it out
                my $aln = $hsp->get_aln;
                $alnIO->write_aln($aln);
            }
        }
    }
}
Finding Motifs in Sequences
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $file = 'hsp.fas';
my $motif = "[ATG]A";
#my $motif = '(A[^T]{2,}){2,}';

my $in = Bio::SeqIO->new(-format => 'fasta', -file => $file);
my $motif_count = 0;

while ( my $seq = $in->next_seq) {
    my $str = $seq->seq;        # get the sequence as a string
    if ( $str =~ /$motif/i ) {
        $motif_count++;         # number of sequences that have this motif
    }
}

printf "%d sequences have the motif $motif\n", $motif_count;
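The script above counts sequences that contain the motif at least once. To see where the motif occurs within one sequence, a match with the `/g` flag can be repeated in a `while` loop. This is a plain-Perl sketch; the `motif_positions` helper and the sample sequence are made up for illustration, and overlapping occurrences are not reported.

```perl
use strict;
use warnings;

# Return the 1-based start position of each non-overlapping motif match.
sub motif_positions {
    my ($sequence, $motif) = @_;
    my @positions;
    while ($sequence =~ /$motif/gi) {
        # $-[0] is the 0-based offset where the latest match began
        push(@positions, $-[0] + 1);
    }
    return @positions;
}

my @hits = motif_positions("GATTACA", "[ATG]A");
print scalar(@hits), " matches at positions: @hits\n";
```

With a BioPerl sequence object, the string returned by `$seq->seq` could be passed as the first argument.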
Using Bio::SeqIO to
Calculate Sequence Metadata
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $file = "hsp.fas";
my $seq_in = Bio::SeqIO->new(-file => $file, -format => "fasta");
my ($seqcount, $basecount, $basecount_nostops) = (0, 0, 0);

while (my $inseq = $seq_in->next_seq) {
    $seqcount++;                           # count the number of sequences
    $basecount += $inseq->length;          # count bases in whole db
    my $str = $inseq->seq;                 # get the sequence as a string
    $str =~ s/\*//g;                       # remove all '*' from sequence
    $basecount_nostops += length($str);    # add bases from the cleaned string
}

print "In $file there are $seqcount sequences, and $basecount bases ($basecount_nostops ignoring *)\n";
Slide adapted from: Jason Stajich
Additional Bioperl Examples
§  Review “examples” directory within bioperl directory
Resources
§  BioPerl API (the details)
•  http://doc.bioperl.org/releases/bioperl-1.6.1/
§  BioPerl Tutorials
•  http://www.BioPerl.org/wiki/HOWTOs
§  BCBB Handout(s)
•  http://collab.niaid.nih.gov/sites/research/SIG/
Bioinformatics/seminars.aspx
§  Jason Stajich
•  https://github.com/hyphaltip/htbda_perl_class/tree/
master/examples/BioPerl
•  http://courses.stajich.org/gen220/lectures/
EMBOSS
§  European Molecular Biology Open Source Suite
§  Command line programs to accomplish many
bioinformatics tasks
§  Bioperl-run has numerous wrappers for EMBOSS
programs
§  Download
•  http://emboss.sourceforge.net
§  Try out
•  http://helixweb.nih.gov/emboss/
Thank you!
If you have Questions or Comments, please contact us:
andrew.oler@nih.gov
ScienceApps@niaid.nih.gov
http://bioinformatics.niaid.nih.gov

More Related Content

What's hot

Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
Zeeshan Hanjra
 
PAM : Point Accepted Mutation
PAM : Point Accepted MutationPAM : Point Accepted Mutation
PAM : Point Accepted Mutation
Amit Kyada
 
Dynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignmentDynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignment
GeethanjaliAnilkumar2
 
Threading modeling methods
Threading modeling methodsThreading modeling methods
Threading modeling methods
ratanvishwas
 
Homology modeling
Homology modelingHomology modeling
Protein Data Bank (PDB)
Protein Data Bank (PDB)Protein Data Bank (PDB)
Motif andpatterndatabase
Motif andpatterndatabaseMotif andpatterndatabase
Motif andpatterndatabase
Sucheta Tripathy
 
Genome annotation 2013
Genome annotation 2013Genome annotation 2013
Genome annotation 2013
Karan Veer Singh
 
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICSSTRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
SHEETHUMOLKS
 
Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins
Vijay Hemmadi
 
YEAST TWO HYBRID SYSTEM
 YEAST TWO HYBRID SYSTEM YEAST TWO HYBRID SYSTEM
YEAST TWO HYBRID SYSTEM
Md Nahidul Islam
 
Express sequence tags
Express sequence tagsExpress sequence tags
Express sequence tags
Dhananjay Desai
 
Functional proteomics, and tools
Functional proteomics, and toolsFunctional proteomics, and tools
Functional proteomics, and tools
KAUSHAL SAHU
 
Cath
CathCath
Cath
Ramya S
 
clustal omega.pptx
clustal omega.pptxclustal omega.pptx
clustal omega.pptx
Aindrila
 
Structural databases
Structural databases Structural databases
Structural databases
Priyadharshana
 
Functional proteomics, methods and tools
Functional proteomics, methods and toolsFunctional proteomics, methods and tools
Functional proteomics, methods and tools
KAUSHAL SAHU
 
Genomic databases
Genomic databasesGenomic databases
Protein data bank
Protein data bankProtein data bank
Protein data bank
Yogesh Joshi
 

What's hot (20)

Protein Structure Prediction
Protein Structure PredictionProtein Structure Prediction
Protein Structure Prediction
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
PAM : Point Accepted Mutation
PAM : Point Accepted MutationPAM : Point Accepted Mutation
PAM : Point Accepted Mutation
 
Dynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignmentDynamic programming and pairwise sequence alignment
Dynamic programming and pairwise sequence alignment
 
Threading modeling methods
Threading modeling methodsThreading modeling methods
Threading modeling methods
 
Homology modeling
Homology modelingHomology modeling
Homology modeling
 
Protein Data Bank (PDB)
Protein Data Bank (PDB)Protein Data Bank (PDB)
Protein Data Bank (PDB)
 
Motif andpatterndatabase
Motif andpatterndatabaseMotif andpatterndatabase
Motif andpatterndatabase
 
Genome annotation 2013
Genome annotation 2013Genome annotation 2013
Genome annotation 2013
 
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICSSTRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
 
Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins Secondary Structure Prediction of proteins
Secondary Structure Prediction of proteins
 
YEAST TWO HYBRID SYSTEM
 YEAST TWO HYBRID SYSTEM YEAST TWO HYBRID SYSTEM
YEAST TWO HYBRID SYSTEM
 
Express sequence tags
Express sequence tagsExpress sequence tags
Express sequence tags
 
Functional proteomics, and tools
Functional proteomics, and toolsFunctional proteomics, and tools
Functional proteomics, and tools
 
Cath
CathCath
Cath
 
clustal omega.pptx
clustal omega.pptxclustal omega.pptx
clustal omega.pptx
 
Structural databases
Structural databases Structural databases
Structural databases
 
Functional proteomics, methods and tools
Functional proteomics, methods and toolsFunctional proteomics, methods and tools
Functional proteomics, methods and tools
 
Genomic databases
Genomic databasesGenomic databases
Genomic databases
 
Protein data bank
Protein data bankProtein data bank
Protein data bank
 

Introduction to Perl and BioPerl

  • 1. BCBB Bioinformatics Development Series April 30, 2014 Andrew Oler, PhD High-throughput Sequencing Bioinformatics Specialist BCBB/OCICB/NIAID/NIH
  • 2. Bioinformatics & Computational Biology Branch (BCBB)
  • 3. Biocomputing Research Consulting and Scientific Software Development High Throughput Illustration Animation http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx ScienceApps@niaid.nih.gov
  • 4. Outline §  Introduction §  Perl programming principles o Variables o Flow controls/Loops o File manipulation o Regular expressions §  BioPerl o What is BioPerl? o How do you use BioPerl? o How do you learn more about BioPerl?
  • 5. Introduction •  An interpreted programming language created in 1987 by Larry Wall •  Good at processing and transforming plain text, like GenBank or PDB files •  Official motto: “TMTOWTDI” (There’s More Than One Way To Do It!) •  Extensible – currently has a large and active user base who are constantly adding new functional libraries •  Portable – can use in Windows, Mac, & Linux/Unix
  • 6. Introduction "Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing, summarizing and otherwise mangling text. Although the biological sciences do involve a good deal of numeric analysis now, most of the primary data is still text: clone names, annotations, comments, bibliographic references. Even DNA sequences are textlike. Interconverting incompatible data formats is a matter of text mangling combined with some creative guesswork. Perl's powerful regular expression matching and string manipulation operators simplify this job in a way that isn't equalled by any other modern language."
  • 7. Examples of Bioinformatics Software with Perl components §  GBrowse, GMOD §  samtools §  Illumina CASAVA §  MEME §  Velvet §  miRDeep §  Rosetta §  ViennaRNA §  RUM §  Trinity §  NCBI BLAST §  I-TASSER §  MAKER §  ...Many more §  http://stackoverflow.com/questions/2527170/why-is-perl-used-so-extensively-in-biology-research §  http://programmers.stackexchange.com/questions/92916/why-is-perl-so-heavily-used-in-bioinformatics
  • 8. Getting Perl •  Latest version – 5.18.2 •  http://www.perl.org/ 5.12.3 (Lion)
  • 9. Getting Help •  perl -v •  Perl manual pages: perldoc perl, perldoc perlintro •  Books and Documentation: –  http://www.perl.org/docs.html –  The O’Reilly Books: §  Learning Perl §  Programming Perl §  Perl Cookbook, etc. •  http://www.cpan.org •  http://perldoc.perl.org/perlintro.html •  BCBB – for help writing your custom scripts
  • 10. Running Perl scripts: File Manager/Browser by Operating System. Windows: Explorer. Mac OS X: Finder. Unix: Shell.
  • 11. Anatomy of the Terminal, “Command Line”, or “Shell”. Labeled parts of the terminal window: Prompt (computer_name:current_directory username), Cursor, Command, Argument, Window, Output. Mac: Applications -> Utilities -> Terminal. Windows: Download open source software PuTTY http://www.chiark.greenend.org.uk/~sgtatham/putty/ Other SSH Clients (http://en.wikipedia.org/wiki/Comparison_of_SSH_clients) Cygwin (http://www.cygwin.com/)
  • 12. How to execute a command (figure labels: command, argument, output)
  • 13. cd (“change directory”) and mkdir (“make directory”)
      cd ~              change to home directory
      cd test_data      change to the “test_data” directory
      cd ..             change to the parent directory (“go up”)
      cd ~/unix_hpc     change to the “unix_hpc” directory inside the home directory
      mkdir dir_name    make directory “dir_name”
      pwd               “print working directory”
      ***See Handout “HPC Cluster Computing and Unix Basics Handout” for more helpful Unix Terminal commands.***
  • 14. "Hello world" script
      •  hello_world.pl file:
          #!/usr/bin/perl
          # This is a comment
          print "Hello world\n";
      •  Run hello_world.pl:
          >perl hello_world.pl
          Hello world
          >perl -e 'print "Hello world\n";'
          Hello world
      The shebang line must be the first line. It tells the computer where to find perl.
      •  print is a Perl function name
      •  Double quotes are used for strings
      •  A semicolon must be present at the end of every statement in Perl
  • 15. A Few Helpful Things for a Template
      #!/usr/bin/env perl
      $| = 1;            # Unbuffered output, so printed text and warnings appear in order (for debugging)
      use warnings;      # Helpful warnings (for debugging)
      use diagnostics;   # Longer explanations of those warnings (for debugging)
      use strict;        # Requires you to declare variables
  • 16. Basic Programming Concepts •  Variables –  Scalars –  Arrays –  Hashes •  Flow Control –  if/else –  unless •  Loops –  for –  foreach –  while –  until •  Files •  Regexes
  • 17. Variables § In computer programming, a variable is a symbolic name used to refer to a value – WikiPedia o Examples •  Variable names can contain letters, numbers, and _, but cannot begin with a number •  $five_prime OK •  $5_prime NO   $x = 4;   $y = 1.0;   $name = 'Bob';   $seq = "ACTGTTGTAAGC”; Perl will treat integers and floating point numbers as numbers, so x and y can be used together in an equation. Strings are indicated by either single or double quotes.
  • 19. Variables - Scalar
      •  Can store a single string, or number
      •  Begins with a $
      •  Single or double quotes for strings
          my $x = 4;                      # use "my" to declare a variable
          my $name = 'Bob';
          my $seq = "ACTGTTGTAAGC";
          print "My name is $name.";      # prints My name is Bob.
  • 20. Scalar Operators (http://perldoc.perl.org/perlintro.html)
      Arithmetic:          + addition, - subtraction, * multiplication, / division, ++ increment (by one), -- decrement (by one), += increment (by value), -= decrement (by value)
      Numeric Comparison:  == equality, != inequality, < less than, > greater than, <= less than or equal, >= greater than or equal
      String Comparison:   eq equality, ne inequality, lt less than, gt greater than, le less than or equal, ge greater than or equal
      Boolean Logic:       && and, || or, ! not
      Miscellaneous:       = assignment, . string concatenation, .. range operator
      Examples: $m = 3; $f = "hello"; if ($x == 3) if ($x eq 'hi')
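The difference between the numeric and string comparison operators is easy to miss, so here is a minimal sketch of it. The helper name compare_both and the sample values are made up for illustration; they are not from the slides.

```perl
use strict;
use warnings;

# Hypothetical helper: compare two values numerically (==) and as strings (eq).
sub compare_both {
    my ($left, $right) = @_;
    my $numeric = ($left == $right) ? 1 : 0;   # compares the numbers 1.0 and 1
    my $string  = ($left eq $right) ? 1 : 0;   # compares the text "1.0" and "1"
    return ($numeric, $string);
}

my ($num_equal, $str_equal) = compare_both("1.0", "1");
print "numeric: $num_equal, string: $str_equal\n";   # prints numeric: 1, string: 0

my $greeting    = "hello" . " " . "world";   # . concatenates strings
my @one_to_five = (1 .. 5);                  # .. builds the list 1, 2, 3, 4, 5
print "$greeting @one_to_five\n";            # prints hello world 1 2 3 4 5
```

As a rule of thumb, use == and its relatives for numbers and eq and its relatives for text such as sequence names.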
  • 21. Common Scalar Functions
      Function Name   Description
      length          Length of the scalar value
      lc              Lower case value
      uc              Upper case value
      reverse         Returns the value in the opposite order
      substr          Returns a substring
      chomp           Removes the last newline (\n) character
      chop            Removes the last character
      defined         Checks if scalar value exists
      split           Splits scalar into array
      How to use any Perl function: http://perldoc.perl.org/index-functions.html
  • 22. Common Scalar Functions Examples
      my $string = "This string has a newline.\n";
      chomp $string;
      print $string;                       # prints "This string has a newline."
      $string = lc($string);
      print $string;                       # prints "this string has a newline."
      my @array = split(" ", $string);
      # @array is now ("this", "string", "has", "a", "newline.")
  • 23. Scalar Variables Exercise §  Write a program that computes the circumference of a circle with a radius of 12.5 §  C = 2 * π * r §  (Answer should be about 78.5)
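One possible solution to the exercise above, offered as a sketch rather than the presenter's answer. The subroutine name circumference is my own, and π is derived from the built-in atan2 function instead of being typed in by hand.

```perl
use strict;
use warnings;

# Hypothetical helper: circumference of a circle, C = 2 * pi * r.
sub circumference {
    my ($radius) = @_;
    my $pi = 4 * atan2(1, 1);   # atan2(1, 1) is pi/4
    return 2 * $pi * $radius;
}

printf "%.1f\n", circumference(12.5);   # prints 78.5
```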
  • 24. Array •  Stores a list of scalar values (strings or numbers) •  Zero based index. Example (index: value): 0: Andrew, 1: Burke, 2: Darrell, 3: Vijay, 4: Mike
  • 25. Variables - Array
      •  Begins with @
      •  Use the () brackets for creating
      •  Use the $ and [] brackets for retrieving a single element in the array
          my @grades = (75, 80, 35);
          my @mixnmatch = (5, "A", 4.5);
          my @names = ("Bob", "Vivek", "Jane");
          # zero-based index
          my $first_name = $names[0];
          # retrieve the last item in an array
          my $last_name = $names[-1];
  • 26. Common Array Functions
      Function Name   Description
      scalar          Size of the array
      push            Adds a value to the end of an array
      pop             Removes the last element from an array
      shift           Removes the first element from an array
      unshift         Adds a value to the beginning of an array
      join            Converts an array to a scalar (joins the elements into one string)
      splice          Removes or replaces a specified range of elements in an array
      grep            Searches array elements
      sort            Orders array elements
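The slides illustrate push, pop, shift and unshift but not the other functions in the table, so the following sketch tries scalar, sort, grep, join and splice on a small array. The variable names are illustrative only.

```perl
use strict;
use warnings;

my @names = ("Tim", "Molly", "Betty", "Chris");

my $count   = scalar(@names);          # number of elements: 4
my @sorted  = sort @names;             # alphabetical order
my @with_y  = grep { /y/ } @names;     # keep only names containing "y"
my $joined  = join(", ", @sorted);     # one string, elements separated by ", "
my @removed = splice(@names, 1, 2);    # remove 2 elements starting at index 1

print "$count names: $joined\n";               # prints 4 names: Betty, Chris, Molly, Tim
print "Names with y: @with_y\n";               # prints Names with y: Molly Betty
print "Removed: @removed; left: @names\n";     # prints Removed: Molly Betty; left: Tim Chris
```

Note that splice changes @names itself, while sort and grep return new lists and leave the original untouched.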
  • 27. push/pop modifies the end of an array
      @names = (Tim, Molly, Betty, Chris)
      push(@names, "Charles");    # @names = (Tim, Molly, Betty, Chris, Charles)
      pop(@names);                # @names = (Tim, Molly, Betty, Chris)
  • 28. shift/unshift modifies the start of an array
      @names = (Tim, Molly, Betty, Chris)
      unshift(@names, "Charles");    # @names = (Charles, Tim, Molly, Betty, Chris)
      shift(@names);                 # @names = (Tim, Molly, Betty, Chris)
  • 29. Variables - Hashes
      •  Stores data using key, value pairs
      KEYS        VALUES
      Title       Programming Perl, 3rd Edition
      Publisher   O’Reilly Media
      ISBN        978-0-596-00027-1
  • 30. Variables - Hash
      §  Indicated with %
      §  Use the () brackets and the => operator for creating
      §  Use the $ and {} brackets for setting or retrieving a single element from the hash
          my %book_info = (
              title  => "Perl for bioinformatics",
              author => "James Tisdall",
              pages  => 270,
              price  => 40
          );
          print $book_info{"author"};    # prints "James Tisdall"
  • 31. Common Hash Functions
      Function Name   Description
      keys            Returns a list of the keys
      values          Returns a list of the values
      reverse         Swaps keys and values in the hash
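A short sketch of the three hash functions in the table. The codon hash is an illustrative example of my own, not data from the slides. Keys and values are sorted here only to make the printed order predictable.

```perl
use strict;
use warnings;

# Illustrative data: three codons and the amino acids they encode.
my %codon_to_aa = (ATG => "Met", TGG => "Trp", AAA => "Lys");

my @codons = sort keys %codon_to_aa;      # AAA, ATG, TGG
my @aas    = sort values %codon_to_aa;    # Lys, Met, Trp

# reverse swaps keys and values. This is only safe when every value is
# unique; duplicate values would overwrite each other as keys.
my %aa_to_codon = reverse %codon_to_aa;

print "Codons: @codons\n";                        # prints Codons: AAA ATG TGG
print "Met is encoded by $aa_to_codon{Met}\n";    # prints Met is encoded by ATG
```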
  • 32. Retrieving keys or values of a hash
      •  Retrieving single value
          my $book_title = $book_info{"title"};
          # $book_title now holds "Perl for bioinformatics"
      •  Retrieving all the keys/values as an array
          my @book_attributes = keys %book_info;
          my @book_attribute_values = values %book_info;
      •  NOTE: Keys and values are unordered
  • 33. Variables summary
      # A. Scalar variable
      my $first_name = "andrew";
      my $last_name = "oler";
      # B. Array variable
      # use 'circular' bracket and @ symbol for assignment
      my @personal_info = ("andrew", $last_name);
      # use 'square' bracket and the integer index to access an entry
      my $fname = $personal_info[0];
      # C. Hash variable
      # use 'circular' brackets (similar to array) and % symbol for assignment
      my %personal_info = (
          first_name => "andrew",
          last_name  => "oler"
      );
      # use 'curly' brackets to access a single entry
      my $fname1 = $personal_info{first_name};
  • 34. Tutorial 1 § Create a variable with the following sequence: ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE ALA LEU GLY SER GLY ALA THR THR § print in lowercase § split into an array § print the array § print the first value in the array § shift the first value off the array and store it in a variable § print the variable and the array § push the variable onto the end of the array § print the array
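One way to work through Tutorial 1, given as a sketch and not the official solution. The variable names $sequence, @residues and $first are my own choices.

```perl
use strict;
use warnings;

my $sequence = "ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE ALA LEU GLY SER GLY ALA THR THR";

# Print in lowercase
print lc($sequence), "\n";

# Split into an array on whitespace, then print the array and its first value
my @residues = split(" ", $sequence);
print "@residues\n";
print "$residues[0]\n";

# Shift the first value off the array and store it in a variable
my $first = shift(@residues);
print "$first\n";
print "@residues\n";

# Push the variable onto the end of the array
push(@residues, $first);
print "@residues\n";
```

After the last step the array still holds 21 residues, but the leading ILE has moved to the end.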
  • 35. Basic Programming Concepts •  Variables –  Scalars –  Arrays –  Hashes •  Flow Control –  if/else –  unless •  Loops –  for –  foreach –  while –  until •  Files •  Regexes
  • 36. Flow Controls
      •  if/elsif/else
      •  unless
          my $x = 4;
          if ($x > 4) {
              print "I am greater than 4";
          } elsif ($x == 4) {
              print "I am equal to 4";
          } else {
              print "I am less than 4";
          }
          unless ($x > 4) {
              print "I am not greater than 4";
          }
  • 37. Post-condition
      # the traditional way
      if ($x == 4) {
          print "I am 4.";
      }
      # the line below is equivalent to the if statement above,
      # but you can only use it if you have a one-line action
      print "I am 4." if ( $x == 4 );
      print "I am not 4." unless ( $x == 4 );
  • 38. Basic Programming Concepts •  Variables –  Scalars –  Arrays –  Hashes •  Flow Control –  if/else –  unless •  Loops –  for –  foreach –  while –  until •  Files •  Regexes
  • 39. Loops
      •  for (EXPR; EXPR; EXPR)
      •  foreach
          for ( my $x = 0; $x < 4; $x++ ) {
              print "$x\n";
          }
          # prints 0, 1, 2, 3 on separate lines

          my @names = ("Bob", "Vivek", "Jane");
          foreach my $name (@names) {
              print "My name is $name.\n";
          }
          # prints:
          # My name is Bob.
          # My name is Vivek.
          # My name is Jane.
  • 40. Hashes with foreach
      my %book_info = (
          title  => "Perl for Bioinformatics",
          author => "James Tisdall"
      );
      foreach my $key (keys %book_info) {
          print "$key : $book_info{$key}\n";
      }
      # prints (hash order is not guaranteed):
      # title : Perl for Bioinformatics
      # author : James Tisdall
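  • 41. Loops - continued
      •  while
      •  until
          # until: repeats as long as the condition is false
          my $x = 0;
          until ($x >= 4) {
              print "$x\n";
              $x++;
          }
          # while: the same loop, repeating as long as the condition is true
          $x = 0;
          while ($x < 4) {
              print "$x\n";
              $x++;
          }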
  • 42. Tutorial 2 §  Iterate through the array (using foreach) and print everything unless ILE §  Use a hash to count how many times each amino acid occurs §  Iterate through the hash and print the counts in a table
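A possible approach to Tutorial 2, again only a sketch. It assumes the array is the one built in Tutorial 1, rebuilt here with qw so the example stands alone. The name %count is my own.

```perl
use strict;
use warnings;

my @residues = qw(ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE ALA LEU GLY SER GLY ALA THR THR);

# Print everything unless it is ILE
foreach my $residue (@residues) {
    print "$residue\n" unless $residue eq "ILE";
}

# Count how many times each amino acid occurs
my %count;
foreach my $residue (@residues) {
    $count{$residue}++;
}

# Print the counts as a tab-separated table, sorted by name
foreach my $residue (sort keys %count) {
    print "$residue\t$count{$residue}\n";
}
```

Sorting the keys is optional. It is used here because hash keys come back in no particular order.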
  • 43. Basic Programming Concepts •  Variables –  Scalars –  Arrays –  Hashes •  Flow Control –  if/else –  unless •  Loops –  for –  foreach –  while –  until •  Files •  Regexes
  • 44. Files
      •  Existence
          o  if (-e $file)
      •  Open
          o  Read:    open(FILE, "< $file");
          o  New:     open(FILE, "> $file");
          o  Append:  open(FILE, ">> $file");
      •  Read (for input/read file handle)
          o  while (<FILE>) { }
          o  Each line is assigned to the special variable $_
      •  Write (for output, i.e. new or append, file handle)
          o  print FILE $string;
      •  Close
          o  close(FILE);
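The slide uses bareword file handles and the two-argument form of open. A commonly recommended alternative is the three-argument form with a lexical file handle, sketched below. It writes to a temporary file created with the core File::Temp module so it can be run safely. The file contents are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file so the sketch does not touch real data.
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close($tmp_fh);

# New: ">" creates the file or overwrites an existing one.
open(my $out, ">", $file) or die "Cannot write $file: $!";
print $out "ACTG\n";
close($out);

# Append: ">>" adds to the end of the file.
open(my $app, ">>", $file) or die "Cannot append to $file: $!";
print $app "TTGA\n";
close($app);

# Read: "<" opens the file for input, one line per loop iteration.
my @lines;
open(my $in, "<", $file) or die "Cannot read $file: $!";
while (my $line = <$in>) {
    chomp $line;            # remove the trailing newline
    push(@lines, $line);
}
close($in);

print "Read ", scalar(@lines), " lines: @lines\n";   # prints Read 2 lines: ACTG TTGA
```

Keeping the mode separate from the file name avoids surprises when a name contains characters such as > or a leading space.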
  • 45. Directory
      •  Existence
          o  if (-d $directory)
      •  Open
          o  opendir(DIR, "$directory")
      •  Read
          o  readdir(DIR)
      •  Close
          o  closedir(DIR)
      •  Create
          o  mkdir($directory) unless (-d $directory)
  • 46. Reading a file and a directory
      # A. Reading file
      # create a variable that can tell the program where to find your data
      my $file = "/Users/oleraj/Documents/perlTutorials/myFile.txt";
      # Check if file exists and read through it
      if (-e $file) {
          open(FILE, "<$file") or die "cannot open file";
          while (<FILE>) {
              chomp;
              my $line = $_;
              # do something useful here
          }
          close(FILE);
      }
      # B. Reading directory
      my $directory = "/Users/oleraj";
      if (-d $directory) {
          opendir(DIR, $directory) or die "cannot open directory";
          my @files = readdir(DIR);
          closedir(DIR);
          print @files;
      }
      Notice the special variable $_. When it is used here, it holds the line that was just read from the file.
      The array @files will hold the name of every entry in the directory (including the special entries . and ..).
  • 47. Basic Programming Concepts •  Variables –  Scalars –  Arrays –  Hashes •  Flow Control –  if/else –  unless •  Loops –  for –  foreach –  while –  until •  Files •  Regexes
  • 48. Regular Expressions (REGEX) •  "A regular expression ... is a set of pattern matching rules encoded in a string according to certain syntax rules." - Wikipedia •  Fast and efficient for "fuzzy" matches •  Applications: •  Checking if a string fits a pattern •  Extracting a pattern match from a string •  Altering the pattern within the string •  Example - Find all sequences from human •  $seq_name =~ /(human|Homo sapiens)/i; •  Uses 1.  Find/match only (yes/no) with m// or // §  e.g., m/regex/; m/human/ 2.  Find and replace a string with s/// §  e.g., s/regex/replacement/; s/human/Homo sapiens/ 3.  Translate character by character with tr/// §  e.g., tr/list/newlist/; tr/abcd/1234/;
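As an illustration of the third use, character-by-character translation with `tr///`, here is a small sketch applied to a DNA string. The sequence and the helper names `reverse_complement` and `gc_count` are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reverse-complement a DNA string: reverse it, then swap each base for its partner.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;           # scalar context reverses the characters
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # A<->T and C<->G, preserving case
    return $rc;
}

# Count G and C bases: with an empty replacement list, tr/// changes nothing
# and simply returns the number of matching characters.
sub gc_count {
    my ($dna) = @_;
    my $n = ($dna =~ tr/GCgc//);
    return $n;
}

my $dna = "ATGCCGTTA";    # hypothetical sequence
print reverse_complement($dna), "\n";
print gc_count($dna), "\n";
```

Unlike `s///`, `tr///` does not use regular expressions; it works on literal lists of characters, which makes it fast for tasks like complementing bases.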
  • 49. Beginning Perl for Bioinformatics - James Tisdall
  • 50. Simple Examples
    my $protein = "MET SER ASN ASN THR SER";
    $protein =~ s/SER/THR/g;
    print $protein;          # prints "MET THR ASN ASN THR THR"
    $protein =~ m/asn/i;     # will match ASN
  • 51. Regular Expressions (REGEX)
    Symbol     Meaning
    .          Match any one character (except newline)
    ^          Match at beginning of string
    $          Match at end of string
    \n         Match the newline
    \t         Match a tab
    \s         Match any whitespace character
    \w         Match any word character (alphanumeric plus "_")
    \W         Match any non-word character
    \d         Match any digit character
    [A-Za-z]   Match any letter
    [0-9]      Same as \d

    my $string = "See also xyz";
    $string =~ /See also ./;    # matches "See also x"
    $string =~ /^./;            # matches "S"
    $string =~ /.$/;            # matches "z"
    $string =~ /\w\s\w/;        # matches "e a"
    http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
  • 52. Regular Expressions (REGEX)
    Quantifier   Meaning
    *            Match 0 or more times
    +            Match at least once
    ?            Match 0 or 1 times
    {COUNT}      Match exactly COUNT times
    {MIN,}       Match at least MIN times (maximal)
    {MIN,MAX}    Match at least MIN but not more than MAX times (maximal)

    my $string = "See also xyz";
    $string =~ /See also .*/;   # matches "See also xyz"
    $string =~ /^.*/;           # matches "See also xyz"
    $string =~ /.?$/;           # matches "z"
    $string =~ /\w+\s+\w+/;     # matches "See also"
  • 53. REGEX Examples
    my $string = ">ref|XP_001882498.1| retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]";
    $string =~ /\s.*virus/;     # will match " retrovirus"
    $string =~ /XP_\d+/;        # will match "XP_001882498"
    $string =~ /XP_\d/;         # will match "XP_0"
    $string =~ /\[.*\]$/;       # will match "[Laccaria bicolor S238N-H82]"
    $string =~ /^.*\|/;         # will match ">ref|XP_001882498.1|"
    $string =~ /^.*?\|/;        # will match ">ref|"
    $string =~ s/\|/:/g;        # string becomes ">ref:XP_001882498.1: retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]"
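The slides list "extracting a pattern match from a string" as an application. Parentheses in a regex capture the matched text into `$1`, `$2`, and so on. The sketch below applies that to the header string from the slide above; `parse_header` and the field names are illustrative choices, and the pattern assumes headers laid out exactly like this one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull the database tag, accession, description and organism out of a
# header shaped like ">db|accession| description [organism]".
# Returns an empty list if the header does not fit that shape.
sub parse_header {
    my ($header) = @_;
    if ($header =~ /^>(\w+)\|([^|]+)\|\s*(.*?)\s*\[(.+)\]$/) {
        return (
            db          => $1,
            accession   => $2,
            description => $3,
            organism    => $4,
        );
    }
    return;
}

my $string = ">ref|XP_001882498.1| retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]";
my %field  = parse_header($string);
foreach my $name (sort keys %field) {
    print "$name\t$field{$name}\n";
}
```

The non-greedy `.*?` keeps the description from swallowing the bracketed organism name at the end of the line.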
  • 54. Tutorial 3 §  open the file "example.fa" §  read through the file §  print the id lines for the human sequences (NOTE: the ids will start with HS)
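A possible outline for Tutorial 3 is sketched here. It assumes the id lines look like `>HS...`, meaning the FASTA ">" marker followed directly by the HS prefix. The demonstration reads from an in-memory string with made-up records so that it runs without the class file; `human_ids` is a name chosen for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the id lines that start with ">HS" from an open filehandle.
sub human_ids {
    my ($fh) = @_;
    my @ids;
    while (my $line = <$fh>) {
        chomp $line;
        push @ids, $line if $line =~ /^>HS/;
    }
    return @ids;
}

# Made-up records standing in for the contents of example.fa.
my $fasta = ">HS0001 hypothetical human record\nATGC\n"
          . ">MM0001 hypothetical mouse record\nGGCC\n"
          . ">HS0002 hypothetical human record\nTTAA\n";

open(my $in, '<', \$fasta) or die "cannot open in-memory file: $!";
print "$_\n" for human_ids($in);
close($in);
```

To run it on the tutorial file instead, open the handle with `open(my $in, '<', 'example.fa')` and pass it to `human_ids` in the same way.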
  • 55. Summary of Basics •  Variables –  Scalar –  Array –  Hash •  Flow Control –  if/else –  unless •  Loops –  for –  foreach –  while –  until •  Files •  Regexes
  • 56. Longer Script Examples §  Take a bed file of junctions from RNA-seq analysis (e.g., TopHat output) and print out some basic statistics •  Open up the file bed_file_stats.pl §  Other examples you would like to discuss?
  • 57. Time for a little break... Regular Expressions: https://xkcd.com/208/
  • 59. Outline §  What is a module (in Perl)? §  Where do you get BioPerl? §  What is BioPerl? §  How do you use BioPerl? §  How do you learn more about BioPerl? §  Additional Resources
  • 60. What is a module (in Perl)? §  A module is a set of Perl variables and methods that are written to accomplish a particular task •  Enables the reuse of methods and variables between Perl scripts / programs •  Tested •  File names end in the ".pm" extension §  Comprehensive Perl Archive Network (CPAN) –  http://www.cpan.org –  Type "cpan" in a terminal to open the CPAN shell
  • 61. Creating a Module
    #!/usr/bin/perl

    package Foo;
    sub bar {
        print "Hello $_[0]\n"
    }

    sub blat {
        print "World $_[0]\n"
    }
    1;
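  • 62. Using a Module
    #!/usr/bin/perl

    use lib '.';    # look for Foo.pm in the current directory
    use Foo;

    # Foo exports nothing, so call its subs by their fully qualified names
    Foo::bar( "a" );
    Foo::blat( "b" );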
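If you would rather call `bar` and `blat` without the `Foo::` prefix, the module has to export them. A common way is the core `Exporter` module. The sketch below is an assumption about how that could look, not part of the original slides. To keep it runnable as a single file, the package sits in a `BEGIN` block instead of its own `Foo.pm`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Normally this block would be the contents of Foo.pm (ending with "1;").
BEGIN {
    package Foo;
    use Exporter 'import';            # borrow Exporter's import method
    our @EXPORT_OK = qw(bar blat);    # subs a caller may ask for by name

    sub bar  { print "Hello $_[0]\n" }
    sub blat { print "World $_[0]\n" }
}

# With Foo.pm on disk, this line would read: use Foo qw(bar blat);
BEGIN { Foo->import(qw(bar blat)) }

bar("a");
blat("b");
```

Listing the subs in `@EXPORT_OK` rather than `@EXPORT` means nothing is imported unless the calling script asks for it, which avoids accidental name clashes.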
  • 63. -  Jason Stajich, Ph.D. Assistant Professor at the University of California, Riverside BioPerl developer since 2000
  • 64. Where do you get BioPerl? §  In-class tutorial •  Already installed! Yeah! §  URL •  www.BioPerl.org §  Modules •  Bioperl-core •  Bioperl-run •  Bioperl-network •  Bioperl-DB
  • 65. What is BioPerl? §  BioPerl is: •  A collection of Perl modules for biological data and analysis •  An open source toolkit with many contributors •  A flexible and extensible system for doing bioinformatics data manipulation •  Consists of >1500 modules; ~1000 considered core §  Modules are interfaces to data types: •  Sequences •  Alignments •  (Sequence) Features •  Locations •  Databases (Slide adapted from: Jason Stajich)
  • 66. With BioPerl you can… §  Retrieve sequence data from NCBI §  Transform sequence files from one format to another §  Parse (or search) BLAST result files §  Manipulate sequences, reverse complement, translate coding DNA sequence to protein §  And so on… (Slide adapted from: Jason Stajich)
  • 67. Major Domains Covered (Slide adapted from: Jason Stajich)
  • 68. Additional Domains (Slide adapted from: Jason Stajich)
  • 70. Hypothetical Research Project §  Interested in looking for universal vaccine candidates for an influenza virus •  Would ultimately involve other programs and data (e.g., epitope data) §  Protocol •  Obtain influenza HA sequence –  2009 pandemic influenza virus hemagglutinin sequence for A/California/04/2009(H1N1) "FJ966082" –  Convert into other formats •  BLAST sequence to find similar sequences •  Parse BLAST metadata and load into Excel •  Align similar sequences and save alignment •  Find motifs in sequences •  Compute basic sequence metadata
  • 71. Module: Bio::SeqIO §  Biological Sequence Input & Output §  Bioinformatics file reading and writing §  Enables easy file conversion §  Example supported formats: •  ABI, BSML, Fasta, Fastq, GCG, Genbank, Interpro, KEGG, Lasergene, Phred Phd, Phred Qual, Pir, Swissprot
  • 72. How do we get a GenBank sequence / file if we have the accession? Sequence Retrieval from NCBI using Bio::DB::GenBank and Bio::SeqIO
    #!/usr/bin/perl -w
    use strict;
    use Bio::DB::GenBank;
    use Bio::SeqIO;

    my $accession = 'FJ966082';
    my $genBank = Bio::DB::GenBank->new;
    my $seq = $genBank->get_Seq_by_acc($accession);
    my $seqOut = Bio::SeqIO->new(-format => 'genbank',
                                 -file   => ">$accession.gb");
    $seqOut->write_seq($seq);
    (The downloaded file "FJ966082.gb" can also be found in the class folder)
    Slide adapted from: Jason Stajich
  • 73. Convert from GenBank to FASTA Format
    #!/usr/bin/perl

    use warnings;
    use strict;
    use Bio::SeqIO;

    # create one SeqIO object to read in, and another to write out
    my $seq_in = Bio::SeqIO->new(
        -file   => "FJ966082.gb",
        -format => "genbank"
    );
    my $seq_out = Bio::SeqIO->new(
        -file   => ">FJ966082.fa",
        -format => "fasta"
    );

    # write each entry in the input file to the output file
    while (my $inseq = $seq_in->next_seq) {
        $seq_out->write_seq($inseq);
    }
    Slide adapted from: BioPerl HowTo
  • 74. Bio::SeqIO Sequence Object Methods (Source: http://www.bioperl.org/wiki/HOWTO:Beginners)
  • 75. How to BLAST a Sequence §  Options to BLAST a single sequence: •  Go to NCBI GenBank website and BLAST §  Options to BLAST multiple sequences •  Use NCBI GenBank website / server to BLAST through an API (application programming interface) •  Set up BLAST software and databases on a local computer
  • 76. A Few BLAST Details
    Query: ...GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDL...
    Neighborhood words for the query word PQG, with scores:
    PQG 18, PEG 15, PKG 14, PRG 14, PDG 13, PHG 13, PMG 13, PNG 13, PSG 13, PQA 12, PQN 12, etc.
    Query:   325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365
                 +LA++L+   TP G R++ +W+  P+ D   + ER   + A
    Subject: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330
    Source: S. Altschul http://www.cs.umd.edu/class/fall2011/cmsc858s/BLAST.pdf
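The word scores on this slide (PQG 18, PEG 15, PQA 12, and so on) are consistent with adding up one substitution score per position. The slide does not name the scoring matrix. The sketch below assumes BLOSUM62, whose values reproduce the numbers shown, and includes only the handful of pairs needed here. `word_score` is a name made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A small subset of BLOSUM62 scores (assumed matrix), keyed as
# query residue followed by word residue.
my %score = (
    PP => 7, GG => 6, QQ => 5,
    QE => 2, QK => 1, QR => 1,
    QD => 0, QH => 0, QM => 0, QN => 0, QS => 0,
    GA => 0, GN => 0,
);

# Score a candidate word against the query word, position by position.
sub word_score {
    my ($query, $word) = @_;
    my $total = 0;
    for my $i (0 .. length($query) - 1) {
        my $pair = substr($query, $i, 1) . substr($word, $i, 1);
        die "no score stored for pair $pair" unless exists $score{$pair};
        $total += $score{$pair};
    }
    return $total;
}

for my $word (qw(PQG PEG PKG PDG PQA)) {
    printf "%s %d\n", $word, word_score('PQG', $word);
}
```

BLAST keeps the words whose score against the query word reaches a threshold and uses them as seeds for alignments such as the Query/Subject pair shown on the slide.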
  • 78. Module: Bio::SearchIO §  Biological Search Input & Output §  Plugging in different parsers for pairwise alignment objects §  Searches parsed with Bio::SearchIO are organized as follows (see SearchIO HOWTO and Parsing BLAST HSPs for much more detail): §  the Bio::SearchIO object contains •  Results, which contain –  Hits, which contain §  HSPs.
  • 79. Parse BLAST output
    #!/usr/bin/perl -w
    use strict;
    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(-format => 'blast',
                                -file   => 'blast-results.txt');

    open(OUTFILE, '>blast-data.txt') or die "cannot open blast-data.txt";

    while (my $result = $in->next_result()) {
        while (my $hit = $result->next_hit()) {
            while (my $hsp = $hit->next_hsp()) {
                if ($hsp->length('total') > 50 && $hsp->percent_identity() >= 50) {
                    print OUTFILE "Query = " . $result->query_name() . "\t" .
                                  "Hit = " . $hit->name() . "\t" .
                                  "Length = " . $hsp->length('total') . "\t" .
                                  "Percent_id = " . $hsp->percent_identity() . "\n";
                }
            }
        }
    }
    close(OUTFILE);
  • 80. Module: Bio::SearchIO Methods
    Method                Example                              Description
    algorithm             BLASTX                               algorithm string
    algorithm_version     2.2.4 [Aug-26-2002]                  algorithm version
    query_name            20521485|dbj|AP004641.2              query name
    query_accession       AP004641.2                           query accession
    query_length          3059                                 query length
    query_description     Oryza sativa ... 977CE9AF checksum.  query description
    database_name         test.fa                              database name
    database_letters      1291                                 number of residues in database
    database_entries      5                                    number of database entries
    available_statistics  effectivespaceused ... dbletters     statistics used
    available_parameters  gapext matrix allowgaps gapopen      parameters used
    num_hits              1                                    number of hits
    Source: http://www.bioperl.org/wiki/HOWTO:SearchIO
  • 81. Parsed Output in Excel §  Drag blast-data.txt file onto Microsoft Excel icon to open §  Enables user to quickly harness Excel knowledge and abilities to do meta analysis of BLAST results
  • 82. Module: Bio::AlignIO §  Bioinformatics multiple sequence alignment input & output §  Pluggable parsers and renderers for multiple sequence alignments §  A summary of multiple alignment formats is also a good introduction to the file formats
  • 83. Extract the HSPs to a FASTA file using Bio::AlignIO
    #!/usr/bin/perl -w
    use strict;
    use Bio::AlignIO;
    use Bio::SearchIO;

    my $in = Bio::SearchIO->new(-format => 'blast', -file => 'blast-results.txt');

    my $alnIO = Bio::AlignIO->new(-format => "fasta", -file => ">hsp.fas");

    while (my $result = $in->next_result()) {
        while (my $hit = $result->next_hit()) {
            while (my $hsp = $hit->next_hsp()) {
                if ($hsp->length('total') > 50 && $hsp->percent_identity() >= 50) {
                    my $aln = $hsp->get_aln;
                    $alnIO->write_aln($aln);
                }
            }
        }
    }
  • 84. Finding Motifs in Sequences
    #!/usr/bin/perl -w
    use strict;
    use Bio::SeqIO;

    my $file = 'hsp.fas';
    my $motif = "[ATG]A";
    #my $motif = '(A[^T]{2,}){2,}';

    my $in = Bio::SeqIO->new(-format => 'fasta', -file => $file);
    my $motif_count = 0;

    while ( my $seq = $in->next_seq ) {
        my $str = $seq->seq;          # get the sequence as a string
        if ( $str =~ /$motif/i ) {
            $motif_count++;           # number of sequences that have this motif
        }
    }

    # pass the motif as an argument so that any "%" in it is not read as a format
    printf "%d sequences have the motif %s\n", $motif_count, $motif;
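The script above only reports whether each sequence contains the motif. To see where the motif occurs inside one sequence, a global match in a `while` loop can be used. This plain-Perl sketch uses a made-up sequence, and `motif_positions` is a hypothetical helper name. Matches found this way do not overlap.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the 1-based start position of every non-overlapping motif match.
sub motif_positions {
    my ($str, $motif) = @_;
    my @starts;
    while ($str =~ /$motif/gi) {
        push @starts, $-[0] + 1;    # $-[0] is the 0-based offset where the match began
    }
    return @starts;
}

my $sequence = "GATTACAGAA";        # hypothetical sequence
my @starts   = motif_positions($sequence, "[ATG]A");
print "Motif starts at: @starts\n";
print "Total matches: ", scalar(@starts), "\n";
```

In the BioPerl loop, the string returned by `$seq->seq` could be passed to the same helper for each sequence.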
  • 85. Using Bio::SeqIO to Calculate Sequence Metadata
    #!/usr/bin/perl -w
    use strict;
    use Bio::SeqIO;

    my $file = "hsp.fas";
    my $seq_in = Bio::SeqIO->new(-file => $file, -format => "fasta");
    my ($seqcount, $basecount, $basecount_nostops) = (0, 0, 0);

    while (my $inseq = $seq_in->next_seq) {
        $seqcount++;                          # count the number of sequences
        $basecount += $inseq->length;         # count bases in whole db
        my $str = $inseq->seq;                # get the sequence as a string
        $str =~ s/\*//g;                      # remove all '*' from sequence
        $basecount_nostops += length($str);   # add bases from the cleaned string
    }

    print "In $file there are $seqcount sequences, and $basecount bases ($basecount_nostops ignoring *)\n";
    Slide adapted from: Jason Stajich
  • 86. Additional Bioperl Examples §  Review "examples" directory within bioperl directory
  • 87. Resources §  BioPerl API (the details) •  http://doc.bioperl.org/releases/bioperl-1.6.1/ §  BioPerl Tutorials •  http://www.BioPerl.org/wiki/HOWTOs §  BCBB Handout(s) •  http://collab.niaid.nih.gov/sites/research/SIG/Bioinformatics/seminars.aspx §  Jason Stajich •  https://github.com/hyphaltip/htbda_perl_class/tree/master/examples/BioPerl •  http://courses.stajich.org/gen220/lectures/
  • 88. EMBOSS §  European Molecular Biology Open Source Suite §  Command line programs to accomplish many bioinformatics tasks §  Bioperl-run has numerous wrappers for EMBOSS programs §  Download •  http://emboss.sourceforge.net §  Try out •  http://helixweb.nih.gov/emboss/
  • 89. Thank you! If you have questions or comments, please contact us: andrew.oler@nih.gov, ScienceApps@niaid.nih.gov, http://bioinformatics.niaid.nih.gov