Intro to Perl and Bioperl

bcbbslides 2,967 views 62 slides May 02, 2017

Introduction to Perl/BioPerl
Presented by: Jennifer Dommer, Vivek Gopalan
Bioinformatics Developer
Bioinformatics and Computational Biosciences Branch
National Institute of Allergy and Infectious Diseases
Office of Cyber Infrastructure and Computational Biology

Who Are We?
• Bioinformatics and Computational Biosciences Branch (BCBB)
• NIH/NIAID/OD/OSMO/OCICB/BCBB
• group of 28 people
• http://bioinformatics.niaid.nih.gov
• [email protected]

Outline
• Introduction
• Perl programming principles
o Variables
o Flow controls/Loops
o File manipulation
o Regular expressions
• BioPerl
o Introduction
o SeqIO
o SearchIO

Introduction
• PERL – Practical Extraction and Report Language
• An interpreted programming language created in
1987 by Larry Wall
• Good at processing and transforming plain text,
like GenBank or PDB files
• Not strongly typed – Variables don’t require a type
and are not required to be declared in advance.
You can’t do this in C- or Java-like languages.
• Extensible – has a large and active user base that is constantly adding new functional libraries
• Portable – runs on Windows, Mac, and Linux/Unix

Introduction
"Perl is remarkably good for slicing, dicing, twisting, wringing, smoothing,
summarizing and otherwise mangling text. Although the biological sciences
do involve a good deal of numeric analysis now, most of the primary data is
still text: clone names, annotations, comments, bibliographic references.
Even DNA sequences are textlike. Interconverting incompatible data
formats is a matter of text mangling combined with some creative
guesswork. Perl's powerful regular expression matching and string
manipulation operators simplify this job in a way that isn't equalled by any
other modern language."

Getting Perl
• Latest version – 5.12.3
• http://www.perl.org/

Getting Help
• perl -v
• Perl manual pages:
perldoc perl
perldoc perlintro
• Books and Documentation:
o http://www.perl.org/docs.html
o The O’Reilly Books: Programming Perl, etc.
o http://www.cpan.org
o http://perldoc.perl.org/perlintro.html
• BCBB – for help writing your custom scripts

"Hello world" script
• hello_world.pl file

#!/usr/bin/perl

# This is a comment
print "Hello world\n";

The shebang line must be the first line.
It tells the computer where to find Perl.
• print is a Perl function name
• Double quotes are used for strings
• A semicolon must be present at the end of every statement in Perl

• Run hello_world.pl

>perl hello_world.pl
Hello world
>perl -e 'print "Hello world\n";'
Hello world

Basic Programming Concepts
• Variables
o Scalars
o Arrays
o Hashes
• Flow Control
o if/else
o unless
• Loops
o for
o foreach
o while
o until
• Files
• Regexes

Variables
• In computer programming, a variable is a symbolic name used to refer to a value – Wikipedia
• Variable names can contain letters, digits, and _, but cannot begin with a digit
o Examples
$x = 4;
$y = 1.0;
$name = 'Bob';
$seq = "ACTGTTGTAAGC";

Perl will treat integers and floating point numbers as numbers, so $x and $y can be used together in an equation.
Strings are indicated by either single or double quotes.
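
A minimal sketch of the two points above: numbers of different kinds mix freely in one expression, and single and double quotes behave differently when a string contains a variable name. The names $sum, $single, and $double are illustrative and are not from the slides.

```perl
use strict;
use warnings;

my $x = 4;      # integer
my $y = 1.0;    # floating point
my $sum = $x + $y;    # both are simply numbers to Perl

my $name   = 'Bob';
my $single = 'Hello $name';    # single quotes: the text is kept literally
my $double = "Hello $name";    # double quotes: $name is interpolated

print "$sum\n";       # prints 5
print "$single\n";    # prints Hello $name
print "$double\n";    # prints Hello Bob
```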

Perl Variables
• Scalar
• Array
• Hash

Variables - Scalar
• Can store a single string or number
• Begins with a $
• Single or double quotes for strings

my $x = 4;
my $name = 'Bob';
my $seq = "ACTGTTGTAAGC";
print "My name is $name.";
#prints My name is Bob.

Scalar Operators
Arithmetic:
+ addition
- subtraction
* multiplication
/ division
++ increment (by one)
-- decrement (by one)
+= increment (by value)
-= decrement (by value)
Numeric Comparison:
== equality
!= inequality
< less than
> greater than
<= less than or equal
>= greater than or equal
String Comparison:
eq equality
ne inequality
lt less than
gt greater than
le less than or equal
ge greater than or equal
Boolean Logic:
&& and
|| or
! not
Miscellaneous:
= assignment
. string concatenation
.. range operator
Source: http://perldoc.perl.org/perlintro.html
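
The difference between the numeric and string comparison operators is easiest to see side by side. The sketch below is illustrative only; the variable names are made up for this example.

```perl
use strict;
use warnings;

# Numeric comparison looks at the value, string comparison at the characters
my $numeric_equal = ("10" == 10.0)   ? "yes" : "no";    # yes
my $string_equal  = ("10" eq "10.0") ? "yes" : "no";    # no

my $dna = "ACT" . "GTT";      # concatenation gives ACTGTT
my @positions = (1 .. 5);     # range operator gives 1, 2, 3, 4, 5

my $count = 3;
$count += 2;    # now 5
$count++;       # now 6

print "$numeric_equal $string_equal $dna @positions $count\n";
# prints yes no ACTGTT 1 2 3 4 5 6
```

Using == on text or eq on numbers is a common source of silent bugs, so it helps to decide up front whether a value is being treated as a number or as a string.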

Common Scalar Functions
Function Name Description
length Length of the scalar value
lc Lower case value
uc Upper case value
reverse Returns the value in the opposite order
substr Returns a substring
chomp Removes the last newline (\n) character
chop Removes the last character
defined Checks if scalar value exists
split Splits scalar into array
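
A short sketch that tries several of the functions in the table on the DNA string used earlier in the slides. Note that reverse only reverses the characters of a string when it is called in scalar context, which is why scalar appears in front of it here.

```perl
use strict;
use warnings;

my $seq = "ACTGTTGTAAGC";

my $length   = length($seq);            # 12
my $lower    = lc($seq);                # actgttgtaagc
my $reversed = scalar reverse($seq);    # CGAATGTTGTCA
my $codon    = substr($seq, 0, 3);      # ACT (offset 0, length 3)

print "$length $lower $reversed $codon\n";
# prints 12 actgttgtaagc CGAATGTTGTCA ACT
```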

Common Scalar Functions Examples
my $string = "This string has a newline.\n";
chomp $string;
print $string;
#prints "This string has a newline."

my @array = split(" ", $string);
#array looks like ("This", "string", "has", "a", "newline.")

Array
• Stores a list of scalar values (strings or numbers)
• Zero-based index
Value: Vivek Jennifer Jason Darrell Qina
Index: 0 1 2 3 4

Variables - Array
• Begins with @
• Use the () brackets for creating
• Use the $ and [] brackets for retrieving a single
element in the array
my @grades = (75, 80, 35);
my @mixnmatch = (5, "A", 4.5);
my @names = ("Bob", "Vivek", "Jane");

# zero-based index
my $first_name = $names[0];

# $#names is the index of the last item, so this retrieves the last item in the array
my $last_name = $names[$#names];

Common Array Functions
Function Name Description
scalar Size of the array
push Add value to end of an array
pop Removes the last element from an array
shift Removes the first element from an array
unshift Add value to the beginning of an array
join Convert array to scalar
splice Removes or replaces specified range of elements from array
grep Search array elements
sort Orders array elements

push/pop
@names = Tim Molly Betty Chris
push(@names, "Charles");
@names = Tim Molly Betty Chris Charles
pop(@names);
@names = Tim Molly Betty Chris

shift/unshift
@names = Tim Molly Betty Chris
unshift(@names, "Charles");
@names = Charles Tim Molly Betty Chris
shift(@names);
@names = Tim Molly Betty Chris
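
The push/pop and shift/unshift steps above can be run as one short script. This is a sketch using the same names as the slides; $popped, $shifted, $size, and $sorted are illustrative additions that also show scalar, sort, and join from the function table.

```perl
use strict;
use warnings;

my @names = ("Tim", "Molly", "Betty", "Chris");

push(@names, "Charles");        # Tim Molly Betty Chris Charles
my $popped = pop(@names);       # removes and returns Charles
unshift(@names, "Charles");     # Charles Tim Molly Betty Chris
my $shifted = shift(@names);    # removes and returns Charles

my $size   = scalar(@names);             # 4
my $sorted = join(", ", sort @names);    # Betty, Chris, Molly, Tim

print "$popped $shifted $size\n";    # prints Charles Charles 4
print "$sorted\n";                   # prints Betty, Chris, Molly, Tim
```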

Variables - Hashes
KEYS VALUES
Title Programming Perl, 3rd Edition
Publisher O’Reilly Media
ISBN 978-0-596-00027-1

Variables - Hash
• Stores data using key, value pairs
• Indicated with %
• Use the () brackets for creating
• Use the $ and {} brackets for setting or retrieving
a single element from the hash
my %book_info = (
title =>"Perl for bioinformatics",
author => "James Tisdall",
pages => 270,
price => 40
);
$book_info{"author"};
#returns "James Tisdall"

Variables - Hash
• Retrieving single value or all the keys/
values
• NOTE: Keys and values are
unordered
my $book_title = $book_info{"title"};
#refers to "Perl for bioinformatics"

my @book_attributes = keys %book_info;
my @book_attribute_values = values %book_info;

Common Hash Functions
Function Name Description
keys Returns array of keys
values Returns array of values
reverse Swaps the keys and values of a hash
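
A minimal sketch of the three hash functions, reusing the %book_info hash from the earlier slide. Because keys are returned in no particular order, the example sorts them to get stable output; the names @attributes and %by_value are illustrative.

```perl
use strict;
use warnings;

my %book_info = (
    title  => "Perl for bioinformatics",
    author => "James Tisdall",
    pages  => 270,
    price  => 40,
);

# keys come back in no particular order, so sort them for stable output
my @attributes = sort keys %book_info;    # author pages price title
my $how_many   = scalar(@attributes);     # 4

# reverse swaps keys and values; this is only safe when the values are unique
my %by_value  = reverse %book_info;
my $attribute = $by_value{270};           # pages

print "@attributes\n";    # prints author pages price title
print "$attribute\n";     # prints pages
```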

Variables summary
# A. Scalar variable
my $first_name = "vivek";
my $last_name = "gopalan";

# B. Array variable
# use parentheses and the @ symbol for assignment
my @personal_info = ("vivek", $last_name);
# use square brackets and the integer index to access an entry
my $fname = $personal_info[0];

# C. Hash variable
# use parentheses (similar to array) and the % symbol for assignment
my %personal_info = (
    first_name => "vivek",
    last_name => "gopalan"
);
# use curly brackets to access a single entry
my $fname1 = $personal_info{first_name};

Tutorial 1
• Create a variable with the following sequence:
ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE
ALA LEU GLY SER GLY ALA THR THR
• print in lowercase
• split into an array
• print the array
• print the first value in the array
• shift the first value off the array and store it in a variable
• print the variable and the array
• push the variable onto the end of the array
• print the array
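
One possible way to work through the steps above. This is a sketch rather than the presenters' own solution, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $protein = "ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE "
            . "ALA LEU GLY SER GLY ALA THR THR";

print lc($protein), "\n";               # print in lowercase

my @residues = split(" ", $protein);    # split into an array
print "@residues\n";                    # print the array
print "$residues[0]\n";                 # print the first value

my $first = shift(@residues);           # remove the first value and keep it
print "$first\n";
print "@residues\n";

push(@residues, $first);                # put it back on the end
print "@residues\n";
```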

Flow Controls
• if/elsif/else
my $x = 4;
if ($x > 4) {
    print "I am greater than 4";
} elsif ($x == 4) {
    print "I am equal to 4";
} else {
    print "I am less than 4";
}
• unless
unless ($x > 4) {
    print "I am not greater than 4";
}

Post-condition
# the traditional way
if ($x == 4) {
print "I am 4.";
}

# this line is equivalent to the if statement above, but you can
# only use it if you have a one-line action
print "I am 4." if ($x == 4);
print "I am not 4." unless ($x == 4);

Loops
• for
for (my $x = 0; $x < 4; $x++) {
    print "$x";
}
• foreach
my @names = ("Bob", "Vivek", "Jane");

foreach my $name (@names) {
    print "My name is $name.\n";
}
#prints:
#My name is Bob.
#My name is Vivek.
#My name is Jane.

Hashes with foreach
my %book_info = (
    title => "Perl for Bioinformatics",
    author => "James Tisdall");

foreach my $key (keys %book_info) {
    print "$key : $book_info{$key}\n";
}
#prints (hash order is not guaranteed):
#title : Perl for Bioinformatics
#author : James Tisdall

Loops - continued
• while
my $x = 0;
while ($x < 4) {
    print "$x";
    $x++;
}
• until
$x = 0;    # reset the counter
until ($x >= 4) {
    print "$x";
    $x++;
}

Tutorial 2
• iterate through the array
• print everything unless ILE
• use a hash to count how many times each AA
occurs
• iterate through the hash
• print the counts
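
One possible way to approach these steps, assuming the array is the list of residues built in Tutorial 1. It is a sketch, not the presenters' own solution; the hash name %count is illustrative.

```perl
use strict;
use warnings;

my $protein = "ILE GLY GLY ASN ALA GLN ALA THR ALA ALA ASN SER ILE "
            . "ALA LEU GLY SER GLY ALA THR THR";
my @residues = split(" ", $protein);

my %count;
foreach my $residue (@residues) {
    # print everything unless it is ILE
    print "$residue\n" unless ($residue eq "ILE");
    # a missing key starts at 0, so ++ works the first time too
    $count{$residue}++;
}

# sort the keys so the counts print in a stable order
foreach my $residue (sort keys %count) {
    print "$residue: $count{$residue}\n";
}
```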

Files
• Existence
o if(-e $file)
• Open
o Read - open(FILE, "< $file");
o New - open(FILE, "> $file");
o Append - open(FILE, ">> $file");
• Read
o while(<FILE>)
• Write
o print FILE $string;
• Close
o close(FILE)

Directory
• Existence
o if(-d $directory)
• Open
o opendir(DIR, "$directory")
• Read
o readdir(DIR)
• Close
o closedir(DIR)
• Create
o mkdir($directory) unless (-d $directory)

# A. Reading file
# create a variable that can tell the program where to find your data
my $file = "/User/Vivek/Documents/perlTutorials/myFile.txt";

# Check if file exists and read through it
if (-e $file) {
    open(FILE, "<$file") or die "cannot open file";
    while (<FILE>) {
        # Notice the special variable $_. When it is used here, it holds
        # the line that was just read from the file.
        chomp;
        my $line = $_;
        #do something useful here
    }
    close(FILE);
}

# B. Reading directory
my $directory = "/User/Vivek";
if (-d $directory) {
    opendir(DIR, $directory) or die "cannot open directory";
    # The array @files will hold the name of every entry in the
    # directory (including . and ..).
    my @files = readdir(DIR);
    closedir(DIR);
    print @files;
}
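
The slides use bareword filehandles (FILE) and the two-argument form of open. A variant often preferred in newer Perl code is a lexical filehandle with the three-argument open, sketched below for the write, append, and read modes. The temporary file from File::Temp (a core module) is only there so the example can run anywhere; it is not part of the original slides.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create an empty scratch file that is deleted when the script exits
my ($tmp_fh, $file) = tempfile(UNLINK => 1);
close($tmp_fh);

# New: ">" creates the file or overwrites its contents
open(my $out, '>', $file) or die "cannot write $file: $!";
print $out "ACTG\n";
print $out "GGCC\n";
close($out);

# Append: ">>" adds to the end of the file
open(my $append, '>>', $file) or die "cannot append to $file: $!";
print $append "TTAA\n";
close($append);

# Read: "<" opens the file for reading, one line per loop
my @lines;
open(my $in, '<', $file) or die "cannot read $file: $!";
while (my $line = <$in>) {
    chomp $line;
    push(@lines, $line);
}
close($in);

print join(",", @lines), "\n";    # prints ACTG,GGCC,TTAA
```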

Regular Expressions (REGEX)
• "A regular expression ... is a set of
pattern matching rules encoded in a
string according to certain syntax
rules." – Wikipedia
• Fast and efficient for "Fuzzy" matches
• Example - Find all sequences from
human
o $seq_name =~ /(human|Homo sapiens)/i;

Beginning Perl for Bioinformatics - James Tisdall

Simple Examples
my $protein = "MET SER ASN ASN THR SER";
$protein =~ s/SER/THR/g;
print $protein;
#prints "MET THR ASN ASN THR THR";

$protein =~ m/asn/i;
#will match ASN

Regular Expressions (REGEX)
Symbol Meaning
. Match any one character (except newline)
^ Match at beginning of string
$ Match at end of string
\n Match the newline
\t Match a tab
\s Match any whitespace character
\w Match any word character (alphanumeric plus "_")
\W Match any non-word character
\d Match any digit character
[A-Za-z] Match any letter
[0-9] Same as \d

my $string = "See also xyz";

$string =~ /See also ./;
#matches "See also x"

$string =~ /^./;
#matches "S"

$string =~ /.$/;
#matches "z"

$string =~ /\w\s\w/;
#matches "e a"

Regular Expressions (REGEX)
Quantifier Meaning
* Match 0 or more times
+ Match at least once
? Match 0 or 1 times
*? Match 0 or more times (minimal)
+? Match 1 or more times (minimal)
?? Match 0 or 1 time (minimal)
{COUNT} Match exactly COUNT times
{MIN,} Match at least MIN times (maximal)
{MIN,MAX} Match at least MIN but not more than MAX times (maximal)

my $string = "See also xyz";

$string =~ /See also .*/;
#matches "See also xyz"

$string =~ /^.*/;
#matches "See also xyz"

$string =~ /.?$/;
#matches "z"

$string =~ /\w+\s+\w+/;
#matches "See also"
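
The comments above state what each pattern matches. To check this in a script, parentheses can be added to capture the matched text. The sketch below uses the same string; the capture variables are illustrative names, and the last two lines contrast a maximal (greedy) quantifier with a minimal one.

```perl
use strict;
use warnings;

my $string = "See also xyz";

# In list context a match returns what the parentheses captured
my ($two_words) = $string =~ /(\w+\s+\w+)/;    # See also
my ($last_char) = $string =~ /(.?)$/;          # z

# After a successful match in an if test, the first capture is in $1
my $after_also = "";
if ($string =~ /also (\w+)/) {
    $after_also = $1;    # xyz
}

# Maximal versus minimal matching up to a space
my ($greedy)  = $string =~ /^(.*) /;     # See also
my ($minimal) = $string =~ /^(.*?) /;    # See

print "$two_words|$last_char|$after_also|$greedy|$minimal\n";
# prints See also|z|xyz|See also|See
```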

REGEX Examples
my $string = ">ref|XP_001882498.1| retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]";

$string =~ /\s.*virus/;
#will match " retrovirus" (including the leading space)

$string =~ /XP_\d+/;
#will match "XP_001882498"
$string =~ /XP_\d/;
#will match "XP_0"

$string =~ /\[.*\]$/;
#will match "[Laccaria bicolor S238N-H82]"

$string =~ /^.*\|/;
#will match ">ref|XP_001882498.1|"

$string =~ /^.*?\|/;
#will match ">ref|"

$string =~ s/\|/:/g;
#string becomes ">ref:XP_001882498.1: retrovirus-related pol polyprotein [Laccaria bicolor S238N-H82]"
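
In practice the goal is usually to pull pieces out of a header rather than just test for a match. A minimal sketch using capture groups on the same header line follows; the variable names $database, $accession, and $organism are illustrative.

```perl
use strict;
use warnings;

my $header = ">ref|XP_001882498.1| retrovirus-related pol polyprotein "
           . "[Laccaria bicolor S238N-H82]";

# Each pair of parentheses captures the part of the header we want to keep
my ($database)  = $header =~ /^>(\w+)\|/;         # ref
my ($accession) = $header =~ /(XP_\d+\.\d+)/;     # XP_001882498.1
my ($organism)  = $header =~ /\[(.*)\]$/;         # Laccaria bicolor S238N-H82

print "$database\n$accession\n$organism\n";
```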

Tutorial 3
• open the file example.fa
• read through the file
• print the id lines for the human sequences
(NOTE: the ids will start with HS)
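
One possible approach to this exercise, as a sketch only. The real example.fa is not reproduced here, so the script first writes a small stand-in file with made-up ids and sequences (all hypothetical) and then reads it back; with the real file, $file would simply be set to "example.fa".

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for example.fa: three made-up records (hypothetical ids and sequences)
my ($fh, $file) = tempfile(UNLINK => 1);
print $fh ">HS0001 hypothetical human record\nACTGTT\n";
print $fh ">MM0001 hypothetical mouse record\nGGCCAA\n";
print $fh ">HS0002 hypothetical human record\nTTAACC\n";
close($fh);

my @human_ids;
open(my $in, '<', $file) or die "cannot open $file: $!";
while (my $line = <$in>) {
    chomp $line;
    # id lines start with ">", and the human ids start with HS
    if ($line =~ /^>HS/) {
        print "$line\n";
        push(@human_ids, $line);
    }
}
close($in);
```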

Summary of Basics
• Variables
o Scalar
o Array
o Hash
• Flow Control
o if/else
o unless
• Loops
o for
o foreach
o while
o until
• Files
• Regexes

Basic BioPerl
• GenBank file manipulation using Bio::SeqIO
o Fetch from NCBI
o Select a subsequence
o Print to a FASTA file
• Analyzing BLAST results using Bio::SearchIO
o Retrieve hits with greater than 75% identity and length greater than 50

BioPerl
• BioPerl is a collection of Perl libraries for analyzing
biological data.
• Sequence Analysis, Phylogenetic Analysis,
Protein Structure Analysis, etc.
• Installation instructions can be found at
www.bioperl.org
• It is NOT a separate programming language.

Getting BioPerl
• Installation instructions can be found
at www.bioperl.org
• Current version 1.6.1
• Documentation:
o http://search.cpan.org/~cjfields/BioPerl/
o http://doc.bioperl.org/
o use perldoc

BioPerl Notes
• All of the BioPerl libraries begin with "Bio::"
• The libraries are grouped by function
• Align, Phylogeny, DB, Seq, Search, Structure,
etc.
• All of the parsing libraries end in "IO"

Hello GenBank
#!/usr/bin/perl
use strict;
use warnings;

# Import the Bioperl Library
use Bio::DB::GenBank;

# create GenBank download handle
my $gb = Bio::DB::GenBank->new();

# this returns a Seq object via internet connection to GenBank:
my $seq = $gb->get_Seq_by_acc('AF303112');
print "ID: " . $seq->display_id() . "\nSEQ: " . $seq->seq() . "\n";
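
Once a Seq object is available, selecting a subsequence and printing it as FASTA could look like the sketch below. To stay runnable offline it builds a Bio::Seq object from a short literal string instead of fetching from GenBank; the id example_seq and the coordinates are made up for illustration. BioPerl is loaded inside eval so that the script prints a notice instead of failing where BioPerl is not installed.

```perl
use strict;
use warnings;

my $subseq;    # stays undefined if BioPerl is not available

if (eval { require Bio::Seq; require Bio::SeqIO; 1 }) {
    # Hypothetical sequence object standing in for one fetched from GenBank
    my $seq = Bio::Seq->new(-seq        => 'ACTGTTGTAAGC',
                            -display_id => 'example_seq',
                            -alphabet   => 'dna');

    # subseq returns a plain string; coordinates are 1-based and inclusive
    $subseq = $seq->subseq(1, 3);

    # trunc returns a new sequence object, which SeqIO can write out
    my $piece = $seq->trunc(1, 3);
    my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta');
    $out->write_seq($piece);
} else {
    print "BioPerl is not installed; skipping the example\n";
}
```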

File Handling with Perl
• Existence
o if(-e $file)
• Open
o Read - open(FILE, "< $file");
o New - open(FILE, "> $file");
o Append - open(FILE, ">> $file");
• Read
o while(<FILE>)
• Write
o print FILE $string;
• Close
o close(FILE)

Files With BioPerl
• Open
o Read - my $seq_in = Bio::SeqIO->new(-file => "<$infile", -format => 'Genbank');
o New - my $seq_out = Bio::SeqIO->new(-file => ">$outfile", -format => 'Genbank');
o Append - my $seq_out = Bio::SeqIO->new(-file => ">>$outfile", -format => 'Genbank');
• Read
o while (my $inseq = $seq_in->next_seq())
• Write
o $seq_out->write_seq($inseq);
• NOTE: use double quotes around the file name so that $infile and $outfile are interpolated

#!/usr/bin/perl
use strict;
use warnings;
##--------- Divide GB File Based on Species ---------##
use lib "/Users/afniuser/Downloads/BioPerl-1.6.1";
use Bio::SeqIO;

my $infile = "myGenbankFile.gb";

my $inseq = Bio::SeqIO->new(-file => "<$infile", -format => 'Genbank');

# Create the two output files.
my $humanFile = Bio::SeqIO->new(-file => '>human.gb', -format => 'Genbank');
my $otherFile = Bio::SeqIO->new(-file => '>other.gb', -format => 'Genbank');

while (my $seqin = $inseq->next_seq()) {
    #here we make use of the Bio::Seq object’s species attribute, which
    #returns a Bio::Species object, which has a binomial attribute that
    #holds the species name of the source of the sequence

    # Use a REGEX to decide which file to write the sequence to.
    if ($seqin->species()->binomial() =~ m/Homo sapiens/) {
        $humanFile->write_seq($seqin);
    } else {
        $otherFile->write_seq($seqin);
    }
}

Bio::SearchIO
• These objects represent the three components of a BLAST or FASTA pairwise database search result
o Result - a container object for a given query sequence; there will be a Result for every query sequence in a database search
o Hit - a container object for each identified sequence found to be similar to the query sequence; it contains HSPs
o HSP - represents the alignment of the query and hit sequence. For BLAST there can be multiple HSPs, while FASTA results have only a single one. The HSP object will contain references to the query and subject alignment start and end.
• Hierarchy: Result > Hit > HSP

#!/usr/bin/perl
use strict;
use warnings;

use Bio::SearchIO;

my $in = Bio::SearchIO->new(-format => 'blast',
                            -file   => 'report.bls');

# We need to look at all of the results, hits, and hsps, so we’ll
# use nested while loops.
while (my $result = $in->next_result()) {
    ## $result is a Bio::Search::Result::ResultI compliant object
    while (my $hit = $result->next_hit()) {
        ## $hit is a Bio::Search::Hit::HitI compliant object
        while (my $hsp = $hit->next_hsp()) {
            ## $hsp is a Bio::Search::HSP::HSPI compliant object

            if ($hsp->length('total') > 50 && $hsp->percent_identity() >= 75) {
                print "Query = " . $result->query_name() .
                      " Hit = " . $hit->name() .
                      " Length = " . $hsp->length('total') .
                      " Percent_id = " . $hsp->percent_identity() . "\n";
            }
        }
    }
}

Tutorial 4
• open the fasta file
• create two output files in genbank format, one for
human, one for other
• if the sequence ids start with HS, print to the
human file
• if the id doesn't start with HS, print to the other
file

Summary
• Perl
o Variables
o Flow Control
o Loops
o Files
o Regular Expressions
• BioPerl
o SeqIO
o SearchIO

Contact Us
[email protected]
http://bioinformatics.niaid.nih.gov