What is Biopython?
•A set of freely available Python tools for
computational molecular biology
•Written for people who like to program in Python
•Goal: make it as easy as possible to use Python
for bioinformatics by providing high-quality,
reusable modules and scripts
What can Biopython do?
•Manipulate DNA and protein sequences
•Run BLAST
•Access public databases
•Manipulate protein structures
•Population genetics
•Supervised learning methods
•Networks of various kinds
Obtaining Biopython
•http://www.biopython.org
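Making sure it worked
>>> from Bio.Seq import Seq
>>> new_seq = Seq("GATCAGAAG")  # example sequence, chosen for illustration
>>> new_seq.complement()
>>> new_seq.reverse_complement()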
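To show what the two calls above compute, here is a minimal pure-Python sketch that does not use Biopython. The helper names `complement` and `reverse_complement` and the example sequence are illustrative assumptions, and only the four unambiguous DNA letters are handled.

```python
# Illustrative sketch only; not Biopython's implementation.
# Maps each unambiguous DNA base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the complementary strand, read in the same direction."""
    return dna.translate(_COMPLEMENT)


def reverse_complement(dna):
    """Return the complementary strand, read in the opposite direction."""
    return complement(dna)[::-1]


if __name__ == "__main__":
    example = "GATCAGAAG"  # hypothetical example sequence
    print(complement(example))
    print(reverse_complement(example))
```

Taking the reverse complement twice returns the original string, which makes a handy sanity check.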
Working with sequences
•A Biopython Seq object has two important
attributes:
–data: as the name implies, the actual
sequence string
–alphabet: an object describing what the individual
characters making up the string "mean" and how they
should be interpreted
•Two advantages of the alphabet
1. it indicates the type of information the data object
contains
2. it provides a means of constraining the information you have
in the data object, as a form of type checking
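The type-checking idea can be sketched in plain Python. This is only an illustration of the concept: the class `TypedSequence` and the two letter sets are hypothetical names, not Biopython objects. Recent Biopython releases no longer attach an alphabet to `Seq`, so a check like this would have to be done by hand there.

```python
# Illustrative sketch only; the names below are hypothetical and
# this is not how Biopython represents alphabets.
UNAMBIGUOUS_DNA = frozenset("GATC")
PROTEIN = frozenset("ACDEFGHIKLMNPQRSTVWY")


class TypedSequence:
    """A sequence string paired with the set of letters it may contain."""

    def __init__(self, data, alphabet):
        # Constrain the data: every letter must belong to the alphabet.
        bad = sorted(set(data.upper()) - alphabet)
        if bad:
            raise ValueError("letters not in alphabet: " + "".join(bad))
        self.data = data
        self.alphabet = alphabet

    def __add__(self, other):
        # Refuse to mix sequence types, e.g. DNA with protein.
        if self.alphabet != other.alphabet:
            raise TypeError("incompatible alphabets")
        return TypedSequence(self.data + other.data, self.alphabet)


if __name__ == "__main__":
    dna = TypedSequence("GATCAGAAG", UNAMBIGUOUS_DNA)
    print((dna + dna).data)
```

Both advantages show up here: the alphabet records what kind of data the object holds, and the constructor and `+` reject data that does not fit.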
Parsing biological file formats (easier)
from Bio import SeqIO

# the with block closes the file when the loop finishes
with open("ls_orchid.fasta") as my_file:
    for seq_record in SeqIO.parse(my_file, "fasta"):
        print(seq_record.id)
        print(repr(seq_record.seq))
        print(len(seq_record))
FASTA files as Dictionaries
def get_accession_num(fasta_record):
    title_atoms = fasta_record.title.split()
    # all of the accession number information is stuck in the
    # first element and separated by '|'s
    accession_atoms = title_atoms[0].split('|')
    # the accession number is the 4th element
    gb_name = accession_atoms[3]
    # strip the version info (a suffix such as ".1") before returning
    return gb_name[:-2]
FASTA files as Dictionaries (easier)
>>> from Bio import Fasta
>>> Fasta.index_file("ls_orchid.fasta", "my_orchid_dict.idx",
...                  get_accession_num)
>>> from Bio.Alphabet import IUPAC
>>> dna_parser = Fasta.SequenceParser(IUPAC.ambiguous_dna)
>>> orchid_dict = Fasta.Dictionary("my_orchid_dict.idx", dna_parser)
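The `Bio.Fasta` module used above belongs to older Biopython releases; in current releases `SeqIO.to_dict` and `SeqIO.index` (both of which accept a `key_function`) fill this role. The idea itself can be sketched without Biopython at all. The helper names `accession_from_title` and `fasta_to_dict` and the two sample records below are made up for illustration.

```python
# Illustrative sketch only; helper names and sample records are hypothetical.
def accession_from_title(title):
    """Extract the accession from a 'gi|...|emb|ACC.V|...' style title."""
    first_word = title.split()[0]
    accession = first_word.split("|")[3]
    # Drop the version suffix; unlike gb_name[:-2], this also
    # copes with multi-digit versions such as ".12".
    return accession.split(".")[0]


def fasta_to_dict(lines, key_function=accession_from_title):
    """Map key_function(title) to the sequence string of each record."""
    records = {}
    key = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            key = key_function(line[1:])
            if key in records:
                raise ValueError("duplicate key: " + key)
            records[key] = ""
        elif line and key is not None:
            records[key] += line  # sequences may span several lines
    return records


if __name__ == "__main__":
    sample = [
        ">gi|0000001|emb|X00001.1|EXAMPLE1 first made-up record",
        "CGTAACAAGG",
        "TTTCCGTAGG",
        ">gi|0000002|emb|X00002.1|EXAMPLE2 second made-up record",
        "GATCAGAAG",
    ]
    orchid_dict = fasta_to_dict(sample)
    print(sorted(orchid_dict))
    print(orchid_dict["X00001"])
```

Unlike the on-disk index, this keeps every sequence in memory, so it suits small files only.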
Blast
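from Bio import SeqIO
from Bio.Blast import NCBIWWW

# each query is sent to the NCBI BLAST server over the network
for seq in SeqIO.parse('marker.fa', 'fasta'):
    b_results = NCBIWWW.qblast('blastn', 'nr',
                               seq.seq, format_type='Text')
    print(b_results.read())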
More information
http://www.biopython.org
Problem
•Write a program to read a FASTA file and print
the number of sequences, number of residues,
and minimum, maximum and average lengths of
the sequences.
> python read-fasta-file.py sample.fa
Number of sequences = 7
Number of residues = 285
Minimum length = 21
Maximum length = 94
Average length = 40.7
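One possible approach to the problem, sketched in plain Python without Biopython, is to accumulate one length per record and then summarize. The function names `read_lengths` and `summarize` are illustrative choices, and the sketch assumes a well-formed FASTA file with at least one record.

```python
# Illustrative sketch only; one possible solution, not the reference one.
import sys


def read_lengths(lines):
    """Return the length of each sequence in FASTA-formatted lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # a header starts a new record
        elif line and lengths:
            lengths[-1] += len(line)   # residues may span several lines
    return lengths


def summarize(lengths):
    """Return the report as text; assumes at least one sequence."""
    total = sum(lengths)
    return "\n".join([
        "Number of sequences = %d" % len(lengths),
        "Number of residues = %d" % total,
        "Minimum length = %d" % min(lengths),
        "Maximum length = %d" % max(lengths),
        "Average length = %.1f" % (total / len(lengths)),
    ])


if __name__ == "__main__":
    # Usage: python read-fasta-file.py sample.fa
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            print(summarize(read_lengths(handle)))
```

The same counts could also be collected by looping over `SeqIO.parse` and calling `len()` on each record.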