This document provides an overview of using the Perl programming language for bioinformatics applications. It introduces Perl variables like scalars, arrays, and hashes. It also covers flow control and loops. The document demonstrates how to open and read/write files in Perl. It provides examples of commonly used bioinformatics tools that incorporate Perl components and recommends resources for learning more about Perl and BioPerl.
3. Biocomputing Research Consulting and
Scientific Software Development
High
Throughput
Illustration
Animation
http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx
ScienceApps@niaid.nih.gov
4. Outline
§ Introduction
§ Perl programming principles
o Variables
o Flow controls/Loops
o File manipulation
o Regular expressions
§ BioPerl
o What is BioPerl?
o How do you use BioPerl?
o How do you learn more about BioPerl?
5. Introduction
• An interpreted programming language created in
1987 by Larry Wall
• Good at processing and transforming plain text, like
GenBank or PDB files
• Official motto: “TMTOWTDI” (There’s More Than
One Way To Do It!)
• Extensible – currently has a large and active user
base who are constantly adding new functional
libraries
• Portable – can use in Windows, Mac, & Linux/Unix
6. Introduction
"Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing,
summarizing and otherwise mangling text. Although the biological sciences
do involve a good deal of numeric analysis now, most of the primary data is
still text: clone names, annotations, comments, bibliographic references.
Even DNA sequences are textlike. Interconverting incompatible data
formats is a matter of text mangling combined with some creative
guesswork. Perl's powerful regular expression matching and string
manipulation operators simplify this job in a way that isn't equalled by any
other modern language."
9. Getting Help
• perl -v
• Perl manual pages
• Books and Documentation:
– http://www.perl.org/docs.html
– The O’Reilly Books:
§ Learning Perl
§ Programming Perl
§ Perl Cookbook, etc.
• http://www.cpan.org
• http://perldoc.perl.org/perlintro.html
• BCBB – for help writing your custom scripts
perldoc perl
perldoc perlintro
10. File Manager/Browser by Operating System
OS: Windows Mac OSX Unix
FM: Explorer Finder Shell
Input
Method:
Running Perl scripts
11. Anatomy of the Terminal, “Command Line”,
or “Shell”
Prompt (computer_name:current_directory username)
Cursor
Command Argument
Window
Output
Mac: Applications -> Utilities -> Terminal
Windows: Download open source software
PuTTY http://www.chiark.greenend.org.uk/~sgtatham/putty/
Other SSH Clients (http://en.wikipedia.org/wiki/Comparison_of_SSH_clients)
Cygwin (http://www.cygwin.com/)
12. How to execute a command
command argument
output
output
13. cd (“change directory”) and
mkdir (“make directory”)
cd ~             change to home directory
cd test_data     change to "test_data" directory
cd ..            change to higher directory ("go up")
cd ~/unix_hpc    change to the "unix_hpc" directory inside the home directory
mkdir dir_name   make directory "dir_name"
pwd              "print working directory"
***See Handout “HPC Cluster Computing and Unix Basics Handout” for
more helpful Unix Terminal commands.***
14. "Hello world" script
• hello_world.pl file
#!/usr/bin/perl
# This is a comment
print "Hello world\n";
• Run hello_world.pl
>perl hello_world.pl
Hello world
>perl -e 'print "Hello world\n";'
Hello world
The shebang line must be the first line.
It tells the computer where to find perl.
• print is a Perl function name
• Double quotes are used for Strings
• The semi-colon must be present at the end of
every command in Perl
15. A Few Helpful Things for a Template
#!/usr/bin/env perl
$| = 1;            # Flush output immediately so messages appear in order (for debugging)
use warnings;      # Helpful warnings (for debugging)
use diagnostics;   # Longer explanations of warnings (for debugging)
use strict;        # Requires you to declare variables
16. Basic Programming Concepts
• Variables
– Scalars
– Arrays
– Hashes
• Flow Control
– if/else
– unless
• Loops
– for
– foreach
– while
– until
• Files
• Regexes
17. Variables
§ In computer programming, a variable is a symbolic
name used to refer to a value (Wikipedia)
• Variable names can contain letters, numbers, and _,
but cannot begin with a number
• $five_prime OK
• $5_prime NO
o Examples
$x = 4;
$y = 1.0;
$name = 'Bob';
$seq = "ACTGTTGTAAGC";
Perl will treat integers and floating
point numbers as numbers, so $x and
$y can be used together in an
equation.
Strings are indicated by either
single or double quotes.
19. Variables - Scalar
• Can store a single string, or number
• Begins with a $
• Single or double quotes for strings
my $x = 4; # use "my" to declare a variable
my $name = 'Bob';
my $seq = "ACTGTTGTAAGC";
print "My name is $name.";
#prints My name is Bob.
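The difference between single and double quotes matters once a string contains a variable. A minimal sketch, reusing the slide's example values (the names $sum, $double and $single are only illustrative):

```perl
use strict;
use warnings;

my $x = 4;
my $y = 1.0;
my $sum = $x + $y;             # integers and floats mix freely in arithmetic

my $name   = 'Bob';
my $double = "Hello, $name";   # double quotes interpolate variables
my $single = 'Hello, $name';   # single quotes keep the text literal

print "$sum\n";      # prints 5
print "$double\n";   # prints Hello, Bob
print "$single\n";   # prints Hello, $name
```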
20. Scalar Operators
http://perldoc.perl.org/perlintro.html
Arithmetic
+ addition
- subtraction
* multiplication
/ division
++ increment (by one)
-- decrement (by one)
+= increment (by value)
-= decrement (by value)
Numeric Comparison
== equality
!= inequality
< less than
> greater than
<= less than or equal
>= greater than or equal
String Comparison
eq equality
ne inequality
lt less than
gt greater than
le less than or equal
ge greater than or equal
Boolean Logic
&& and
|| or
! not
Miscellaneous
= assignment
. string concatenation
.. range operator
Examples:
$m = 3;
$f = "hello";
if ($x == 3)
if ($x eq 'hi')
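Numeric and string comparison operators can give different answers for the same values, which is why Perl keeps two sets. A short sketch with made-up values ($p, $q and the other names are illustrative only):

```perl
use strict;
use warnings;

my $p = "10";
my $q = "9";

my $numeric_gt = ($p >  $q) ? "yes" : "no";   # compares 10 and 9 as numbers
my $string_gt  = ($p gt $q) ? "yes" : "no";   # compares "10" and "9" as text

my $seq = "ACT" . "G";        # string concatenation
my @positions = (1 .. 5);     # range operator builds the list 1,2,3,4,5

my $count = 0;
$count++;                     # increment by one
$count += 10;                 # increment by value

print "$numeric_gt $string_gt $seq $count\n";   # prints yes no ACTG 11
```

As text, "10" sorts before "9" because the comparison goes character by character, so use == and > for numbers and eq and gt for strings.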
21. Common Scalar Functions
Function Name Description
length Length of the scalar value
lc Lower case value
uc Upper case value
reverse Returns the value in the opposite order
substr Returns a substring
chomp Removes the trailing newline (\n) character
chop Removes the last character
defined Checks whether the scalar has a defined value (is not undef)
split Splits scalar into array
http://perldoc.perl.org/index-functions.html How to use any Perl function
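Several of the functions in the table above can be tried on a short DNA string. This is only a sketch; the sequence and variable names are invented for illustration:

```perl
use strict;
use warnings;

my $seq = "actgttgtaagc";

my $length   = length($seq);       # number of characters
my $upper    = uc($seq);           # upper-case copy
my $codon    = substr($seq, 0, 3); # 3 characters starting at offset 0
my $backward = reverse($seq);      # assigned to a scalar, so the characters are reversed

my $nothing;                                   # declared but never assigned
my $is_defined = defined($nothing) ? 1 : 0;    # 0, because the value is undef

print "$length $upper $codon $backward $is_defined\n";
# prints 12 ACTGTTGTAAGC act cgaatgttgtca 0
```

Note that reverse only reverses the characters of a string in scalar context; in list context (for example directly inside print) it reverses the order of its arguments instead, so write scalar reverse($seq) there.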
22. Common Scalar Functions Examples
my $string = "This string has a newline.\n";
chomp $string;
print $string;
#prints "This string has a newline."
$string = lc($string);
print $string;
#prints "this string has a newline."
my @array = split(" ", $string);
#array looks like ("this", "string", "has",
"a", "newline.")
23. Scalar Variables Exercise
§ Write a program that computes the circumference of a
circle with a radius of 12.5
§ C = 2 * π * r
§ (Answer should be about 78.5)
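One possible solution sketch for the exercise. Core Perl has no built-in pi constant, so this version derives it with atan2; the variable names are illustrative:

```perl
use strict;
use warnings;

my $pi     = 4 * atan2(1, 1);   # atan2(1, 1) is pi/4
my $radius = 12.5;

my $circumference = 2 * $pi * $radius;

printf "%.1f\n", $circumference;   # prints 78.5
```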
24. Array
Value: Andrew Burke Darrell Vijay Mike
Index: 0      1     2       3     4
• Stores a list of scalar values (strings or numbers)
• Zero based index
25. Variables - Array
• Begins with @
• Use the () brackets for creating
• Use the $ and [] brackets for retrieving a single
element in the array
my @grades = (75, 80, 35);
my @mixnmatch = (5, "A", 4.5);
my @names = ("Bob", "Vivek", "Jane");
# zero-based index
my $first_name = $names[0];
# retrieve the last item in an array
my $last_name = $names[-1];
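Two things the rules above lead to in practice are asking an array for its size and reading or changing elements by index. A brief sketch using the same @names array (the other variable names are illustrative):

```perl
use strict;
use warnings;

my @names = ("Bob", "Vivek", "Jane");

my $size       = scalar(@names);   # number of elements
my $last_index = $#names;          # index of the last element (size - 1)
my @first_two  = @names[0, 1];     # an array slice uses @ because it returns a list

$names[1] = "Vic";                 # assign to a single element with $ and []

print "$size $last_index @first_two $names[1]\n";   # prints 3 2 Bob Vivek Vic
```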
26. Common Array Functions
Function Name Description
scalar Size of the array
push Add value to end of an array
pop Removes the last element from an array
shift Removes the first element from an array
unshift Add value to the beginning of an array
join Convert array to scalar
splice Removes or replaces specified range of elements from array
grep Search array elements
sort Orders array elements
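The functions in the table that are not illustrated on the following slides (sort, join, grep, splice) might be used like this. The pass mark of 50 is an invented value for the sketch:

```perl
use strict;
use warnings;

my @grades = (75, 80, 35);

my @sorted  = sort { $a <=> $b } @grades;   # numeric sort; plain sort compares as strings
my $joined  = join(", ", @sorted);          # array to a single string
my @passing = grep { $_ >= 50 } @grades;    # keep elements where the block is true
my @removed = splice(@grades, 1, 1);        # remove 1 element starting at index 1

print "$joined | @passing | @removed | @grades\n";
# prints 35, 75, 80 | 75 80 | 80 | 75 35
```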
27. push/pop modifies the end of an array
@names = Tim Molly Betty Chris
push(@names, "Charles");
@names = Tim Molly Betty Chris Charles
pop(@names);
@names = Tim Molly Betty Chris
28. shift/unshift modifies the start of an array
@names = Tim Molly Betty Chris
unshift(@names, "Charles");
@names = Charles Tim Molly Betty Chris
shift(@names);
@names = Tim Molly Betty Chris
29. Variables - Hashes
KEYS VALUES
Title Programming Perl, 3rd Edition
Publisher O’Reilly Media
ISBN 978-0-596-00027-1
• Stores data using key, value pairs
30. Variables - Hash
§ Indicated with %
§ Use the () brackets and => pointer for creating
§ Use the $ and {} brackets for setting or retrieving a
single element from the hash
my %book_info = (
title =>"Perl for bioinformatics",
author => "James Tisdall",
pages => 270,
price => 40
);
print $book_info{"author"};
#prints "James Tisdall"
31. Common Hash Functions
Function Name Description
keys Returns array of keys
values Returns array of values
reverse Swaps keys and values in a hash (values should be unique)
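A small sketch of these hash functions on a made-up two-entry codon table. It also uses exists and delete, which are standard Perl built-ins not listed in the table above; all variable names are illustrative:

```perl
use strict;
use warnings;

my %codon_to_aa = (ATG => "MET", TGG => "TRP");

my @codons = sort keys %codon_to_aa;      # sort because hash order is not guaranteed
my @aas    = sort values %codon_to_aa;

my %aa_to_codon = reverse %codon_to_aa;   # values become keys and keys become values

my $has_stop = exists($codon_to_aa{TAA}) ? "yes" : "no";   # key is not present yet

$codon_to_aa{TAA} = "STOP";               # add a new key/value pair
my $size = scalar(keys %codon_to_aa);     # number of keys
delete $codon_to_aa{TAA};                 # remove the pair again

print "@codons | @aas | $aa_to_codon{MET} | $has_stop | $size\n";
# prints ATG TGG | MET TRP | ATG | no | 3
```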
32. Retrieving keys or values of a hash
• Retrieving single value
• Retrieving all the keys/values as an
array
• NOTE: Keys and values are unordered
my $book_title = $book_info{"title"};
#$book_title has stored "Perl for bioinformatics"
my @book_attributes = keys %book_info;
my @book_attribute_values = values %book_info;
33. Variables summary
# A. Scalar variable
my $first_name = "andrew";
my $last_name = "oler";
# B. Array variable
# use 'circular' bracket and @ symbol for assignment
my @personal_info = ("andrew", $last_name);
# use 'square' bracket and the integer index to access an entry
my $fname = $personal_info[0];
# C. Hash variable
# use 'circular' brackets (similar to array) and % symbol for assignment
my %personal_info = (
first_name => "andrew",
last_name => "oler"
);
# use 'curly' brackets to access a single entry
my $fname1 = $personal_info{first_name};
34. Tutorial 1
§ Create a variable with the following sequence:
ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE ALA LEU
GLY SER GLY ALA THR THR
§ print in lowercase
§ split into an array
§ print the array
§ print the first value in the array
§ shift the first value off the array and store it in a
variable
§ print the variable and the array
§ push the variable onto the end of the array
§ print the array
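One way the tutorial steps could be written, as a sketch rather than the official answer (the variable names are illustrative):

```perl
use strict;
use warnings;

my $seq = "ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE ALA LEU GLY SER GLY ALA THR THR";

print lc($seq), "\n";                # print in lowercase

my @residues = split(" ", $seq);     # split into an array
print "@residues\n";                 # print the array
print "$residues[0]\n";              # print the first value

my $first = shift(@residues);        # remove the first value and keep it
print "$first\n";
print "@residues\n";

push(@residues, $first);             # put it back at the end
print "@residues\n";
```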
35. Basic Programming Concepts
• Variables
– Scalars
– Arrays
– Hashes
• Flow Control
– if/else
– unless
• Loops
– for
– foreach
– while
– until
• Files
• Regexes
36. Flow Controls
• If/elsif/else
• unless
$x = 4;
if ($x > 4) {
print "I am greater than 4";
}elsif ($x == 4) {
print "I am equal to 4";
}else {
print "I am less than 4";
}
unless($x > 4) {
print "I am not greater than 4";
}
37. Post-condition
# the traditional way
if ($x == 4) {
print "I am 4.";
}
# the line below is equivalent to the
# if statement above, but you can only
# use it if you have a one-line action
print "I am 4." if ( $x == 4 );
print "I am not 4." unless ( $x == 4);
38. Basic Programming Concepts
• Variables
– Scalars
– Arrays
– Hashes
• Flow Control
– if/else
– unless
• Loops
– for
– foreach
– while
– until
• Files
• Regexes
39. Loops
• for (EXPR; EXPR; EXPR)
• foreach
for ( my $x = 0; $x < 4 ; $x++ ) {
    print "$x\n";
}
#prints 0, 1, 2, 3 on separate lines
my @names = ("Bob", "Vivek", "Jane");
foreach my $name (@names) {
    print "My name is $name.\n";
}
#prints:
#My name is Bob.
#My name is Vivek.
#My name is Jane.
40. Hashes with foreach
my %book_info = (
    title => "Perl for Bioinformatics",
    author => "James Tisdall");
foreach my $key (keys %book_info) {
    print "$key : $book_info{$key}\n";
}
#prints (key order is not guaranteed):
#title : Perl for Bioinformatics
#author : James Tisdall
41. Loops - continued
• while
• until
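# until: repeat until the condition becomes true
my $x = 0;
until($x >= 4) {
    print "$x\n";
    $x++;
}
# while: repeat while the condition is true
$x = 0;
while($x < 4) {
    print "$x\n";
    $x++;
}
# both loops print 0, 1, 2, 3 on separate lines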
42. Tutorial 2
§ Iterate through the array (using foreach) and print
everything unless ILE
§ Use a hash to count how many times each amino acid
occurs
§ Iterate through the hash and print the counts in a table
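A possible sketch for this tutorial, assuming the array holds the residues from Tutorial 1 (written here with qw for brevity). Counting with a hash relies on $count{$key}++ creating the entry the first time a key is seen:

```perl
use strict;
use warnings;

my @residues = qw(ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN
                  SER ILE ALA LEU GLY SER GLY ALA THR THR);

# Print everything unless it is ILE
foreach my $residue (@residues) {
    print "$residue\n" unless $residue eq 'ILE';
}

# Count how many times each amino acid occurs
my %count;
foreach my $residue (@residues) {
    $count{$residue}++;
}

# Print the counts as a two-column table, sorted by name
foreach my $residue (sort keys %count) {
    printf "%-5s %d\n", $residue, $count{$residue};
}
```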
43. Basic Programming Concepts
• Variables
– Scalars
– Arrays
– Hashes
• Flow Control
– if/else
– unless
• Loops
– for
– foreach
– while
– until
• Files
• Regexes
44. Files
• Existence
o if(-e $file)
• Open
o Read open(FILE, "< $file");
o New open(FILE, "> $file");
o Append open(FILE, ">> $file");
• Read (for input/read file handle)
o while(<FILE>){ }
o Each line is assigned to special variable $_
• Write (for output--new/append--file handle)
o print FILE $string;
• Close
o close(FILE);
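The same open, write, append, read and close steps can also be written with the three-argument form of open and a lexical file handle, a style commonly recommended in current Perl documentation. This sketch writes to a scratch file created with the core File::Temp module so it does not touch real data; the file contents are invented:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file for the demonstration (removed automatically at exit)
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close($tmp_fh);

# New: '>' creates the file or empties an existing one
open(my $out, '>', $file) or die "Cannot write $file: $!";
print $out "ACTG\n";
close($out);

# Append: '>>' adds to the end of the file
open($out, '>>', $file) or die "Cannot append to $file: $!";
print $out "GGCC\n";
close($out);

# Read: '<' opens for input; each pass of the loop reads one line
my @lines;
open(my $in, '<', $file) or die "Cannot read $file: $!";
while (my $line = <$in>) {
    chomp $line;
    push(@lines, $line);
}
close($in);

print "@lines\n";   # prints ACTG GGCC
```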
45. Directory
• Existence
o if(-d $directory)
• Open
o opendir(DIR, "$directory")
• Read
o readdir(DIR)
• Close
o closedir(DIR)
• Create
o mkdir($directory) unless (-d $directory)
46. # A. Reading file
# create a variable that can tell the program where to find your data
my $file = "/Users/oleraj/Documents/perlTutorials/myFile.txt";
# Check if file exists and read through it
if(-e $file){
open(FILE, "<$file") or die "cannot open file";
while(<FILE>){
chomp;
my $line = $_;
#do something useful here
}
close(FILE);
}
# B. Reading directory
my $directory = "/Users/oleraj";
if(-d $directory){
opendir(DIR, $directory);
my @files = readdir(DIR);
closedir(DIR);
print @files;
}
Notice the special variable $_.
When it is used here, it holds the
line that was just read from the file.
The array @files will hold the name
of every entry in the directory
(including the "." and ".." entries).
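The same file and directory steps can also be written with lexical handles and the three-argument form of open, which many Perl style guides prefer. This is a sketch under assumptions: the sub names write_lines, read_lines and list_dir are made up here, and a scratch directory from the core File::Temp module stands in for the paths on the slide.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);    # core module; keeps the sketch inside a scratch directory

# Write lines to a file; $mode is ">" (new file) or ">>" (append).
sub write_lines {
    my ( $mode, $path, @lines ) = @_;
    open( my $fh, $mode, $path ) or die "cannot open $path: $!";
    print $fh "$_\n" foreach @lines;
    close($fh) or die "cannot close $path: $!";
}

# Read a file line by line and return the lines without their newlines.
sub read_lines {
    my ($path) = @_;
    my @lines;
    open( my $fh, "<", $path ) or die "cannot open $path: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        push @lines, $line;
    }
    close($fh);
    return @lines;
}

# List a directory, skipping the "." and ".." entries that readdir returns.
sub list_dir {
    my ($dir) = @_;
    opendir( my $dh, $dir ) or die "cannot open directory $dir: $!";
    my @names = grep { $_ ne "." && $_ ne ".." } readdir($dh);
    closedir($dh);
    return sort @names;
}

my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/myFile.txt";
write_lines( ">",  $file, "first line" );     # create the file
write_lines( ">>", $file, "second line" );    # append to it
print "$_\n" foreach read_lines($file);
print "$_\n" foreach list_dir($dir);
```

Including $! in the die message reports why the open failed (for example, a missing file or a permissions problem).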
48. Regular Expressions (REGEX)
• "A regular expression ... is a set of pattern matching rules
encoded in a string according to certain syntax rules." -wikipedia
• Fast and efficient for "Fuzzy" matches
• Applications:
• Checking if a string fits a pattern
• Extracting a pattern match from a string
• Altering the pattern within the string
• Example - Find all sequences from human
• $seq_name =~ /(human|Homo sapiens)/i;
• Uses
1. Find/match only (yes/no) with m// or //
§ e.g., m/regex/; m/human/
2. Find and replace a string with s///
§ e.g., s/regex/replacement/; s/human/Homo sapiens/
3. Translate character by character with tr///
§ e.g., tr/list/newlist/; tr/abcd/1234/;
50. Simple Examples
my $protein = "MET SER ASN ASN THR SER";
$protein =~ s/SER/THR/g;
print $protein;
#prints "MET THR ASN ASN THR THR";
$protein =~ m/asn/i;
#will match ASN
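A short sketch of the translation and substitution operators applied to DNA. Perl's translation operator is written tr///. The sub names reverse_complement and gc_count and the example sequence are made up for illustration.

```perl
use strict;
use warnings;

# Reverse complement of a DNA string using tr/// (character-by-character translation).
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;           # reverse the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # swap each base for its complement
    return $rc;
}

# Count G and C bases; tr/// returns the number of characters it matched,
# and an empty replacement list leaves the string unchanged.
sub gc_count {
    my ($dna) = @_;
    return ( $dna =~ tr/GCgc// );
}

my $dna = "ATGCGTAA";
print reverse_complement($dna), "\n";
print gc_count($dna), "\n";

# s/// with the /r flag (Perl 5.14 or later) returns the changed copy
# and leaves $dna itself untouched.
my $rna = $dna =~ s/T/U/gr;
print "$rna\n";
```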
51. Regular Expressions (REGEX)
Symbol     Meaning
.          Match any one character (except newline)
^          Match at beginning of string
$          Match at end of string
\n         Match the newline
\t         Match a tab
\s         Match any whitespace character
\w         Match any word character (alphanumeric plus "_")
\W         Match any non-word character
\d         Match any digit character
[A-Za-z]   Match any letter
[0-9]      Same as \d
my $string = "See also xyz";
$string =~ /See also ./;
# matches "See also x"
$string =~ /^./;
# matches "S"
$string =~ /.$/;
# matches "z"
$string =~ /\w\s\w/;
# matches "e a"
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm
52. Regular Expressions (REGEX)
Quantifier   Meaning
*            Match 0 or more times
+            Match at least once
?            Match 0 or 1 times
{COUNT}      Match exactly COUNT times
{MIN,}       Match at least MIN times (maximal)
{MIN,MAX}    Match at least MIN but not more than MAX times (maximal)
my $string = "See also xyz";
$string =~ /See also .*/;
# matches "See also xyz"
$string =~ /^.*/;
# matches "See also xyz"
$string =~ /.?$/;
# matches "z"
$string =~ /\w+\s+\w+/;
# matches "See also"
53. REGEX Examples
my $string = ">ref|XP_001882498.1| retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]";
$string =~ /\s.*virus/;
# will match " retrovirus"
$string =~ /XP_\d+/;
# will match "XP_001882498"
$string =~ /XP_\d/;
# will match "XP_0"
$string =~ /\[.*\]$/;
# will match "[Laccaria bicolor S238N-H82]"
$string =~ /^.*\|/;
# will match ">ref|XP_001882498.1|"
$string =~ /^.*?\|/;
# will match ">ref|"
$string =~ s/\|/:/g;
# string becomes ">ref:XP_001882498.1: retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]"
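Parentheses in a regex capture the matched text into $1, $2, and so on, which makes it possible to pull several fields out of a header in one match. The sketch below is an illustration only: the sub name parse_header and the field names are made up, and the pattern assumes headers shaped like the example string.

```perl
use strict;
use warnings;

# Split an NCBI-style FASTA header into database tag, accession, version,
# description, and organism. Returns an empty list if the header does not fit.
sub parse_header {
    my ($header) = @_;
    if ( $header =~ /^>(\w+)\|(\w+)\.(\d+)\|\s*(.*?)\s*\[(.*)\]$/ ) {
        return (
            db          => $1,
            accession   => $2,
            version     => $3,
            description => $4,
            organism    => $5,
        );
    }
    return;
}

my $string = ">ref|XP_001882498.1| retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]";
my %field = parse_header($string);
print "$_ : $field{$_}\n" foreach sort keys %field;
```

The description group uses the non-greedy .*? so that it stops before the whitespace and opening bracket of the organism name.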
54. Tutorial 3
§ open the file "example.fa"
§ read through the file
§ print the id lines for the human sequences (NOTE: the
ids will start with HS)
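One possible shape for this exercise, as a sketch only. The contents of example.fa are not shown on the slide, so the code writes a small made-up FASTA file to a temporary location and reads that instead; the sub name human_id_lines and the record names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);    # core module; used to create a stand-in input file

# Return the id lines (those starting with ">") whose id begins with "HS".
sub human_id_lines {
    my ($path) = @_;
    my @ids;
    open( my $fh, "<", $path ) or die "cannot open $path: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        push @ids, $line if ( $line =~ /^>HS/ );
    }
    close($fh);
    return @ids;
}

# Made-up stand-in for example.fa so the sketch runs on its own.
my ( $out, $fasta ) = tempfile( SUFFIX => ".fa", UNLINK => 1 );
print $out ">HS001 hypothetical human record\nATGC\n";
print $out ">MM001 hypothetical mouse record\nGGCC\n";
print $out ">HS002 another hypothetical human record\nTTAA\n";
close($out);

print "$_\n" foreach human_id_lines($fasta);
```

Anchoring the pattern with ^> keeps a sequence line that happens to contain "HS" from being reported as an id line.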
55. Summary of Basics
• Variables
– Scalar
– Array
– Hash
• Flow Control
– if/else
– unless
• Loops
– for
– foreach
– while
– until
• Files
• Regexes
56. Longer Script Examples
§ Take a bed file of junctions from RNA-seq analysis
(e.g., TopHat output) and print out some basic
statistics
• Open up the file bed_file_stats.pl
§ Other examples you would like to discuss?
57. Time for a little break...
Regular Expressions
https://xkcd.com/208/
59. Outline
§ What is a module (in Perl)?
§ Where do you get BioPerl?
§ What is BioPerl?
§ How do you use BioPerl?
§ How do you learn more about BioPerl?
§ Additional Resources
60. What is a module (in Perl)?
§ A module is a set of Perl variables and methods that are
written to accomplish a particular task
• Enables the reuse of methods and variables
between Perl scripts / programs
• Tested
• End in “.pm” extension
§ Comprehensive Perl Archive Network (CPAN)
– http://www.cpan.org
– Type “cpan” in terminal to open
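To make the idea concrete, here is a minimal sketch of what a module looks like. The package name SeqUtils and the sub gc_content are hypothetical; in practice the package would sit in its own file, SeqUtils.pm, and a script would load it with "use SeqUtils;". Both parts are shown in one file here so the sketch runs on its own.

```perl
use strict;
use warnings;

# Hypothetical module; in a real project this part would be the file SeqUtils.pm.
package SeqUtils;

# Fraction of G and C bases in a DNA string (0 for an empty string).
sub gc_content {
    my ($dna) = @_;
    return 0 unless length($dna);
    my $gc = ( $dna =~ tr/GCgc// );    # tr/// returns the number of matches
    return $gc / length($dna);
}

1;    # a module file must end with a true value

# The calling script; with a separate file this would start with "use SeqUtils;".
package main;

printf "GC content: %.2f\n", SeqUtils::gc_content("ATGCGC");
```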
63. - Jason Stajich, Ph.D.
Assistant Professor at the University of California, Riverside
BioPerl developer since 2000
64. Where do you get BioPerl?
§ In-class tutorial
• Already installed! Yeah!
§ URL
• www.BioPerl.org
§ Modules
• Bioperl-core
• Bioperl-run
• Bioperl-network
• Bioperl-DB
65. What is BioPerl?
§ BioPerl is:
• A collection of Perl modules for biological data and
analysis
• An open source toolkit with many contributors
• A flexible and extensible system for doing bioinformatics
data manipulation
• Consists of >1500 modules; ~1000 considered core
§ Modules are interfaces to data types:
• Sequences
• Alignments
• (Sequence) Features
• Locations
• Databases
Slide adapted from: Jason Stajich
66. With BioPerl you can…
§ Retrieve sequence data from NCBI
§ Transform sequence files from one format to another
§ Parse (or search) BLAST result files
§ Manipulate sequences, reverse complement, translate
coding DNA sequence to protein
§ And so on…
Slide adapted from: Jason Stajich
70. Hypothetical Research Project
§ Interested in looking for universal vaccine candidates for
an Influenza virus
• Would ultimately involve other programs and data (i.e.
epitope data)
§ Protocol
• Obtain influenza HA sequence
– 2009 pandemic influenza virus hemagglutinin sequence for
A/California/04/2009(H1N1) “FJ966082”
– Convert into other formats
• BLAST sequence to find similar sequences
• Parse BLAST metadata and load into Excel
• Align similar sequences and save alignment
• Find motifs in sequences
• Compute basic sequence metadata
72. How do we get Genbank sequence / file if we have accession?
Sequence Retrieval from NCBI using Bio::DB::GenBank and Bio::SeqIO

#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;
use Bio::SeqIO;

my $accession = 'FJ966082';
my $genBank   = Bio::DB::GenBank->new;
my $seq       = $genBank->get_Seq_by_acc($accession);
my $seqOut    = Bio::SeqIO->new(-format => 'genbank',
                                -file   => ">$accession.gb");
$seqOut->write_seq($seq);

(The downloaded file "FJ966082.gb" can also be found in the class folder)
Slide adapted from: Jason Stajich
73. Convert from GenBank to FASTA Format
#!/usr/bin/perl

use warnings;
use strict;
use Bio::SeqIO;

# create one SeqIO object to read in, and another to write out
my $seq_in = Bio::SeqIO->new(
    -file   => "FJ966082.gb",
    -format => "genbank"
);
my $seq_out = Bio::SeqIO->new(
    -file   => ">FJ966082.fa",
    -format => "fasta"
);

# write each entry in the input file to the output file
while (my $inseq = $seq_in->next_seq) {
    $seq_out->write_seq($inseq);
}
Slide adapted from: BioPerl HowTo
75. How to BLAST a Sequence
§ Options to BLAST a single sequence:
• Go to NCBI GenBank website and BLAST
§ Options to BLAST multiple sequences
• Use NCBI GenBank website / server to BLAST
through an API (application programming interface)
• Setup BLAST software and databases on local
computer
76. A Few BLAST Details
Query: ...GSVEDTTGSQSLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEAFVEDAELRQTLQEDL...
PQG 18
PEG 15
PKG 14
PRG 14
PDG 13
PHG 13
PMG 13
PNG 13
PSG 13
PQA 12
PQN 12
etc…
Query:   325 SLAALLNKCKTPQGQRLVNQWIKQPLMDKNRIEERLNLVEA 365
             +LA++L+   TP G R++ +W+  P+ D   + ER   + A
Subject: 290 TLASVLDCTVTPMGSRMLKRWLHMPVRDTRVLLERQQTIGA 330
Source: S. Altschul http://www.cs.umd.edu/class/fall2011/cmsc858s/BLAST.pdf
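The word list above appears to score three-letter words against the query word PQG by summing a substitution score for each position. The sketch below reproduces that arithmetic. It rests on an assumption: the slide does not name its scoring matrix, and the values used here are BLOSUM62 entries, which give the same totals as the slide. Only the pairs needed for this word list are included, and the sub name word_score is made up.

```perl
use strict;
use warnings;

# Partial substitution scores (BLOSUM62 values), keyed as query residue
# followed by word residue. Assumption: the slide does not name its matrix.
my %score = (
    "PP" => 7, "QQ" => 5, "GG" => 6,
    "QE" => 2, "QK" => 1, "QR" => 1,
    "QD" => 0, "QH" => 0, "QM" => 0, "QN" => 0, "QS" => 0,
    "GA" => 0, "GN" => 0,
);

# Score a candidate word against the query word, position by position.
sub word_score {
    my ( $query, $word ) = @_;
    my $total = 0;
    for my $i ( 0 .. length($query) - 1 ) {
        my $pair = substr( $query, $i, 1 ) . substr( $word, $i, 1 );
        die "no score for pair $pair in this partial table"
            unless exists $score{$pair};
        $total += $score{$pair};
    }
    return $total;
}

foreach my $word (qw(PQG PEG PKG PRG PDG PQA)) {
    printf "%s %d\n", $word, word_score( "PQG", $word );
}
```

For example, PEG scores 7 (P with P) + 2 (Q with E) + 6 (G with G) = 15, matching the list.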
78. Module: Bio::SearchIO
§ Biological Search Input & Output
§ Plugging in different parsers for pairwise alignment
objects
§ Searches parsed with Bio::SearchIO are organized as
follows (see SearchIO HOWTO and Parsing BLAST
HSPs for much more detail):
§ the Bio::SearchIO object contains
• Results, which contain
– Hits, which contain
§ HSPs.
80. Module: Bio::SearchIO Methods
http://www.bioperl.org/wiki/HOWTO:SearchIO
Method                 Example                               Description
algorithm              BLASTX                                algorithm string
algorithm_version      2.2.4 [Aug-26-2002]                   algorithm version
query_name             20521485|dbj|AP004641.2               query name
query_accession        AP004641.2                            query accession
query_length           3059                                  query length
query_description      Oryza sativa ... 977CE9AF checksum.   query description
database_name          test.fa                               database name
database_letters       1291                                  number of residues in database
database_entries       5                                     number of database entries
available_statistics   effectivespaceused ... dbletters      statistics used
available_parameters   gapext matrix allowgaps gapopen       parameters used
num_hits               1                                     number of hits
81. Parsed Output in Excel
§ Drag blast-data.txt file onto Microsoft Excel icon to
open
§ Enables user to quickly harness Excel knowledge and
abilities to do meta analysis of BLAST results
82. Module: Bio::AlignIO
§ Bioinformatics multiple sequence alignment input &
output
§ Pluggable parsers and renderers for multiple
sequence alignments
§ A summary of multiple alignment formats is also a
good introduction to the file formats
83. Extract the HSPs to a FASTA file using Bio::AlignIO
#!/usr/bin/perl -w
use strict;
use Bio::AlignIO;
use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => 'blast', -file => 'blast-results.txt');
my $alnIO = Bio::AlignIO->new(-format => "fasta", -file => ">hsp.fas");

while (my $result = $in->next_result()) {
    while (my $hit = $result->next_hit()) {
        while (my $hsp = $hit->next_hsp()) {
            # keep HSPs with total length over 50 and at least 50% identity
            if ($hsp->length('total') > 50 && $hsp->percent_identity() >= 50) {
                my $aln = $hsp->get_aln;
                $alnIO->write_aln($aln);
            }
        }
    }
}
84. Finding Motifs in Sequences
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $file = 'hsp.fas';
my $motif = "[ATG]A";
#my $motif = '(A[^T]{2,}){2,}';

my $in = Bio::SeqIO->new(-format => 'fasta', -file => $file);
my $motif_count = 0;

while ( my $seq = $in->next_seq ) {
    my $str = $seq->seq;    # get the sequence as a string
    if ( $str =~ /$motif/i ) {
        $motif_count++;     # number of sequences that have this motif
    }
}

printf "%d sequences have the motif %s\n", $motif_count, $motif;
85. Using Bio::SeqIO to Calculate Sequence Metadata
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my $file = "hsp.fas";
my $seq_in = Bio::SeqIO->new(-file => $file, -format => "fasta");
my ($seqcount, $basecount, $basecount_nostops) = (0, 0, 0);

while (my $inseq = $seq_in->next_seq) {
    $seqcount++;                           # count the number of sequences
    $basecount += $inseq->length;          # count bases in whole db
    my $str = $inseq->seq;                 # get the sequence as a string
    $str =~ s/\*//g;                       # remove all '*' from sequence
    $basecount_nostops += length($str);    # add bases from string
}

print "In $file there are $seqcount sequences, and $basecount bases ",
      "($basecount_nostops ignoring *)\n";
Slide adapted from: Jason Stajich
87. Resources
§ BioPerl API (the details)
• http://doc.bioperl.org/releases/bioperl-1.6.1/
§ BioPerl Tutorials
• http://www.BioPerl.org/wiki/HOWTOs
§ BCBB Handout(s)
• http://collab.niaid.nih.gov/sites/research/SIG/Bioinformatics/seminars.aspx
§ Jason Stajich
• https://github.com/hyphaltip/htbda_perl_class/tree/master/examples/BioPerl
• http://courses.stajich.org/gen220/lectures/
88. EMBOSS
§ European Molecular Biology Open Source Suite
§ Command line programs to accomplish many
bioinformatics tasks
§ Bioperl-run has numerous wrappers for EMBOSS
programs
§ Download
• http://emboss.sourceforge.net
§ Try out
• http://helixweb.nih.gov/emboss/
89. Thank you!
If you have Questions or Comments, please contact us:
andrew.oler@nih.gov
ScienceApps@niaid.nih.gov
http://bioinformatics.niaid.nih.gov