10518261_biopython_python_slides_notes.ppt

SangeethaM386158 19 views 17 slides Aug 11, 2024

About This Presentation

Python


Slide Content

Biopython
1

What is Biopython?
•Tools for computational molecular biology
•For people who program in Python: the project aims to make it as
easy as possible to use Python for bioinformatics
by creating high-quality, reusable modules and
scripts
2

What can Biopython do?
•Manipulate DNA and protein sequences
•Run BLAST
•Access public databases
•Manipulate protein structures
•Population genetics
•Supervised learning methods
•Networks of various kinds

Obtaining Biopython
•http://www.biopython.org
4

Making sure it worked
>>> from Bio.Seq import Seq
>>> new_seq = Seq('GATCAGAAG')   # any short DNA string will do
>>> new_seq.complement()
>>> new_seq.reverse_complement()
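
To show what the two calls above compute, here is a minimal pure-Python sketch. It is not Biopython's implementation; the function names and the example sequence are assumptions made for illustration, and only unambiguous DNA letters are handled.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Return the base-by-base complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the complement read in the opposite direction."""
    return complement(seq)[::-1]


example = "GATCAGAAG"  # hypothetical test sequence
print(complement(example))          # CTAGTCTTC
print(reverse_complement(example))  # CTTCTGATC
```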
5

Working with sequences
•A Biopython Seq object has two important attributes:
–data: as the name implies, this is the actual sequence data string
–alphabet: an object describing what the individual characters
making up the string "mean" and how they should be interpreted
•Two advantages of the alphabet:
1.It gives an idea of the type of information the data object
contains
2.It provides a means of constraining the information you have
in the data object, as a form of type checking
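
The idea of pairing sequence data with an alphabet can be sketched in plain Python. The class below is hypothetical (`SimpleSeq`, `DNA` and `PROTEIN` are names invented here) and is only meant to illustrate the "type checking" point; Biopython's own Seq class may behave differently.

```python
DNA = "ACGT"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"


class SimpleSeq:
    """A sequence string paired with the set of letters it may contain."""

    def __init__(self, data, alphabet):
        self.data = data
        self.alphabet = frozenset(alphabet)
        # Constrain the data: reject letters the alphabet does not define.
        unknown = set(data) - self.alphabet
        if unknown:
            raise ValueError("letters outside alphabet: %s" % sorted(unknown))

    def __add__(self, other):
        # Type checking: only sequences of the same kind can be joined.
        if self.alphabet != other.alphabet:
            raise TypeError("incompatible alphabets")
        return SimpleSeq(self.data + other.data, self.alphabet)


joined = SimpleSeq("ACGT", DNA) + SimpleSeq("GG", DNA)
print(joined.data)  # ACGTGG
```

Adding a protein sequence to a DNA sequence with this class raises TypeError, which is the kind of mistake the alphabet is there to catch.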
6

Working with sequences
7

Working with sequences
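>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq('EVRNAK', IUPAC.protein)
>>> dna_seq = Seq('ACGT', IUPAC.unambiguous_dna)
>>> protein_seq + dna_seq          # raises TypeError: the alphabets are incompatible
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> my_seq.tostring()
>>> my_seq[5] = 'G'                # fails: a Seq object is immutable
>>> mutable_seq = my_seq.tomutable()
>>> print(mutable_seq)
>>> mutable_seq[5] = 'T'
>>> print(mutable_seq)
>>> mutable_seq.remove('T')        # removes the first 'T' only
>>> print(mutable_seq)
>>> mutable_seq.reverse()
>>> print(mutable_seq)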
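
The immutable/mutable distinction on this slide mirrors Python's own str and list types. The sketch below uses only built-ins and a hypothetical short sequence; it is an analogy, not Biopython code.

```python
seq = "GATCGATGGGCC"  # hypothetical sequence; a Python str cannot be edited in place

try:
    seq[5] = "T"
    immutable = False
except TypeError:
    immutable = True   # item assignment is rejected, as with an immutable sequence

mutable = list(seq)    # a list of letters plays the role of the mutable copy
mutable[5] = "T"       # in-place substitution at index 5
mutable.remove("T")    # drops the first 'T' only
mutable.reverse()      # reverses in place (no complementing)
result = "".join(mutable)
print(result)          # CCGGGTTGCAG
```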
8

Parsing biological file formats
>gi|6273290|gb|AF191664.1|AF191664 Opuntia clavata rpl16 gene; chloroplast
gene for...
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAA
TCTAAATGATATAGGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTAAT
AAA...
from Bio.ParserSupport import AbstractConsumer

class SpeciesExtractor(AbstractConsumer):
    def __init__(self):
        self.species_list = []

    def title(self, title_info):
        # called by the scanner with the text of each '>' title line
        title_atoms = title_info.split()
        # the second whitespace-separated item follows the gi|...| identifier
        new_species = title_atoms[1]
        if new_species not in self.species_list:
            self.species_list.append(new_species)
9

Parsing biological file formats
from Bio import Fasta

def extract_organisms(file, num_records):
    scanner = Fasta._Scanner()
    consumer = SpeciesExtractor()
    file_to_parse = open(file, 'r')
    # each feed() call reads one record and reports its parts to the consumer
    for fasta_record in range(num_records):
        scanner.feed(file_to_parse, consumer)
    file_to_parse.close()
    # the collected names live on the consumer object
    return consumer.species_list
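
The scanner/consumer pair above boils down to: look at every title line and keep the second whitespace-separated word. A minimal sketch of that logic without Biopython follows; the function signature and the sample records are hypothetical, and as on the slide the extracted word is really the genus.

```python
def extract_organisms(lines):
    """Collect the second word of each FASTA title line, without duplicates."""
    species_list = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence data, not a title
        title_atoms = line[1:].split()
        if len(title_atoms) > 1 and title_atoms[1] not in species_list:
            species_list.append(title_atoms[1])
    return species_list


# Hypothetical records shaped like the example on the previous slide.
sample = [
    ">id1 Opuntia clavata rpl16 gene",
    "TATACATTAAAGGAGGGGGATGCGGATAAATGG",
    ">id2 Opuntia bradtiana rpl16 gene",
    "TATACATTAAAGAAGGGGGATGCGGATAAATGG",
    ">id3 Cypripedium irapeanum 5.8S rRNA gene",
    "CGTAACAAGGTTTCCGTAGGTGAACCTGCGG",
]
print(extract_organisms(sample))  # ['Opuntia', 'Cypripedium']
```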
10

Parsing biological file formats (easier)
>>> from Bio import Fasta
>>> parser = Fasta.RecordParser()
>>> file = open("ls_orchid.fasta")
>>> iterator = Fasta.Iterator(file, parser)
>>> cur_record = iterator.next()
>>> dir(cur_record)
>>> print cur_record.title
>>> print cur_record
11

Parsing biological file formats (easier)
from Bio import SeqIO

myFile = open("ls_orchid.fasta")
# SeqIO.parse yields one SeqRecord per FASTA entry
for seq_record in SeqIO.parse(myFile, "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
myFile.close()
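
For readers curious what a record-by-record FASTA iterator does under the hood, here is a simplified pure-Python generator. It is a sketch under the assumption of well-formed input, not the SeqIO implementation, and `parse_fasta` is a name chosen for this example.

```python
import io


def parse_fasta(handle):
    """Yield (title, sequence) pairs from a file-like object of FASTA text."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)  # sequence data may span several lines
    if title is not None:
        yield title, "".join(chunks)


demo = io.StringIO(">a first record\nAC\nGT\n>b\nTT\n")  # hypothetical input
for title, sequence in parse_fasta(demo):
    print(title, len(sequence))
```

Because it is a generator, only one record is held in memory at a time, which is the same reason the slide iterates over the parser instead of loading the whole file.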
12

FASTA files as Dictionaries
def get_accession_num(fasta_record):
    title_atoms = fasta_record.title.split()
    # all of the accession number information is stuck in the first element
    # and separated by '|'s
    accession_atoms = title_atoms[0].split('|')
    # the accession number is the 4th element
    gb_name = accession_atoms[3]
    # strip the version info (e.g. '.1') before returning;
    # this assumes a single-digit version
    return gb_name[:-2]
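
The slice `[:-2]` only works when the version suffix is a dot plus one digit. A slightly more defensive variant, sketched here on a plain title string instead of a record object (an assumption made to keep the example self-contained), splits on the dot:

```python
def get_accession_num(title):
    """Return the GenBank accession, without version, from a gi|...| title."""
    first_atom = title.split()[0]          # e.g. 'gi|6273290|gb|AF191664.1|AF191664'
    accession = first_atom.split("|")[3]   # e.g. 'AF191664.1'
    return accession.split(".")[0]         # drop '.1', '.12', or nothing at all


title = "gi|6273290|gb|AF191664.1|AF191664 Opuntia clavata rpl16 gene"
print(get_accession_num(title))  # AF191664
```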
13

FASTA files as Dictionaries (easier)
>>> from Bio import Fasta
>>> Fasta.index_file("ls_orchid.fasta", "my_orchid_dict.idx",
...                  get_accession_num)
>>> from Bio.Alphabet import IUPAC
>>> dna_parser = Fasta.SequenceParser(IUPAC.ambiguous_dna)
>>> orchid_dict = Fasta.Dictionary("my_orchid_dict.idx", dna_parser)
14

Blast
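from Bio import SeqIO
from Bio.Blast import NCBIWWW

# submit each sequence in the file to NCBI BLAST over the network
for seq in SeqIO.parse('marker.fa', 'fasta'):
    b_results = NCBIWWW.qblast('blastn', 'nr',
                               seq.seq, format_type='Text')
    print(b_results.read())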
15

More information
http://www.biopython.org

Problem
•Write a program to read a FASTA file and print
the number of sequences, number of residues,
and minimum, maximum and average lengths of
the sequences.
> python read-fasta-file.py sample.fa
Number of sequences = 7
Number of residues = 285
Minimum length = 21
Maximum length = 94
Average length = 40.7
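
One possible approach to the problem, sketched with the standard library only so it runs without Biopython. The function names are hypothetical, and the demo input is made up; it is not the sample.fa from the slide.

```python
import io


def read_lengths(handle):
    """Return the length of every sequence in a FASTA file-like object."""
    lengths = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # start counting a new sequence
        elif line and lengths:
            lengths[-1] += len(line)   # sequence data may span several lines
    return lengths


def format_report(lengths):
    """Build the five report lines in the format shown on the slide."""
    total = sum(lengths)
    return "\n".join([
        "Number of sequences = %d" % len(lengths),
        "Number of residues = %d" % total,
        "Minimum length = %d" % min(lengths),
        "Maximum length = %d" % max(lengths),
        "Average length = %.1f" % (total / float(len(lengths))),
    ])


demo = io.StringIO(">a\nACGT\nAC\n>b\nGG\n")  # hypothetical two-record input
print(format_report(read_lengths(demo)))
```

To turn this into the command-line script from the slide, open the path given in `sys.argv[1]` and pass the file object to `read_lengths`; an empty file would need a separate check, since `min` and `max` fail on an empty list.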