Sequence file formats in bioinformatics

30,741 views 31 slides Jun 08, 2014

About This Presentation

methods and tools


Slide Content

SEQUENCE FILE FORMATS

Introduction
- Data in a biological database is stored either as sequences or in molecular form.
- Each database represents its data in a unique file format.
- File formats fall into two categories: sequence database formats and molecular database formats.

Sequence file formats
- GenBank flat-file format
- FASTA format
- Multi-FASTA format
- GCG format
- GCG-MSF format
- EMBL format
- Clustal format
- Swiss-Prot format

GenBank flat-file format
- Used by NCBI.
- A record is divided into three parts:
  - Header: a brief, precise introductory part.
  - Features: all genes in the sequence, the location of genes in the genome, protein products, coding genes, etc.
  - Sequence: starts with the keyword ORIGIN and ends with the terminator //, for example:
    ORIGIN atcgatcgatgcgctat //
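The three-part layout can be sketched in a few lines of Python. This is only a minimal illustration, assuming that FEATURES and ORIGIN each begin a line and that // closes the record; the function name split_genbank and the toy record are made up for this example and are not an NCBI tool.

```python
def split_genbank(record_text):
    """Split one GenBank-style record into header, features, and sequence.

    Assumption: FEATURES and ORIGIN start a line, and '//' ends the record.
    """
    header, features, seq_lines = [], [], []
    section = "header"
    for line in record_text.splitlines():
        if line.startswith("FEATURES"):
            section = "features"
        elif line.startswith("ORIGIN"):
            section = "sequence"
            continue  # the ORIGIN keyword line itself carries no residues
        elif line.startswith("//"):
            break  # end-of-record terminator
        if section == "header":
            header.append(line)
        elif section == "features":
            features.append(line)
        else:
            seq_lines.append(line)
    # Keep letters only: this drops position numbers and spaces.
    sequence = "".join(ch for ch in "".join(seq_lines) if ch.isalpha())
    return header, features, sequence


# Toy record, invented for illustration (not a real database entry).
example = """LOCUS       DEMO
DEFINITION  toy record for illustration only
FEATURES             Location/Qualifiers
     gene            1..17
ORIGIN
        1 atcgatcgat gcgctat
//"""

header, features, sequence = split_genbank(example)
print(sequence)
```

Real records contain many more header keywords and nested feature qualifiers, so a production parser would need considerably more care than this sketch.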

Description of GenBank flat-file identifiers
HEADERS
- Locus
- Definition
- Accession
- Version
- Dbsource: dates for creation and modifications
- Keywords
- Source
- Organism
- References
  - Authors
  - Title
  - Journal
  - Medline ID: all published sources
- Comment
FEATURES
SEQUENCE

[Figure: example record retrieved from NCBI]

FASTA format
- A one-line header that starts with > followed by the name of the gene.
- The sequence of the gene or protein follows the header.
- Blank spaces, paragraph marks, and numerals are all ignored.
- An asterisk (*) marks the end of the sequence.

FASTA format (example)
>p53
ctcgaggggc ctagacattg ccctccagag agagcaccca acaccctcca ggcttgaccg
61 gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
121 tgggacacca gctggccttc aaggtctctg cctccctcca gccaccccac tacacgctgc
181 tgggatcctg gatctcagct ccctggccga caacactggc aaactcctac tcatccacga
241 aggccctcct gggcatggtg gtccttccca gcctggcagt ctgttcctca cacaccttgt
301 tagtgcccag cccctgaggt tgcagctggg ggtgtctctg aagggctgtg agcccccagg
361 aagccctggg gaagtgcctg ccttgcctcc ccccggccct gccagcgcct ggctctgccc *
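The FASTA rules (a header starting with >, then a sequence in which blanks, line breaks, numerals, and the closing asterisk are ignored) can be sketched as follows. The function name read_fasta_record is hypothetical, and the shortened record is only an excerpt of the p53 example.

```python
def read_fasta_record(text):
    """Return (name, sequence) for a single FASTA record.

    Assumption: everything that is not a letter (spaces, line breaks,
    numerals, the trailing '*') is dropped from the sequence.
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    name = lines[0][1:].strip()
    body = "".join(lines[1:])
    sequence = "".join(ch for ch in body if ch.isalpha())
    return name, sequence


# Shortened excerpt of the p53 example, with a numeral and an asterisk.
record = """>p53
ctcgaggggc ctagacattg
61 gccagggtgt ccccttccta *"""

name, seq = read_fasta_record(record)
print(name, len(seq))
```

Dropping every non-letter is a simplification: some protein files use * inside a sequence to mark a stop codon, which this sketch would silently remove.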

Multi-FASTA format
- An aggregation of FASTA records as described above.
- Multiple sequences follow one after the other in a single file.
- Accepted by several programs, e.g. Clustal W and Multalin.

Multi-FASTA format (example)
>jhuma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
>bhuma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
>puma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
>zuma
gccagggtgt ccccttccta ccttggagag agcagcccca gggcatcctg cagggggtgc
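Since a multi-FASTA file is just several FASTA records in a row, reading one amounts to starting a new record at every > line. A minimal sketch, assuming record names are unique (a repeated name would overwrite the earlier record); parse_multi_fasta is a hypothetical name.

```python
def parse_multi_fasta(text):
    """Map each record name to its sequence.

    Assumptions: each '>' line starts a new record, names are unique,
    and non-letter characters inside sequence lines are ignored.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append("".join(ch for ch in line if ch.isalpha()))
    return {n: "".join(parts) for n, parts in records.items()}


multi = """>jhuma
gccagggtgt ccccttccta
>bhuma
ccttggagag agcagcccca
"""

records = parse_multi_fasta(multi)
for record_name, record_seq in records.items():
    print(record_name, record_seq)
```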

GCG format
- GCG stands for Genetics Computer Group.
- The first line says it all by naming the sequence type:
  !!NA_SEQUENCE 1.0 (nucleic acid)
  !!AA_SEQUENCE 1.0 (amino acid)
- A simple format in which we just get to know the sequence of the gene or protein.
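Because the first line of a GCG file names the sequence type, a reader can decide how to treat the file before looking at anything else. The sketch below assumes the markers are spelled !!NA_SEQUENCE and !!AA_SEQUENCE; the function name is hypothetical.

```python
def gcg_sequence_type(first_line):
    """Guess the sequence type from the first line of a GCG file.

    Assumption: nucleic acid files begin with '!!NA_SEQUENCE' and
    protein files with '!!AA_SEQUENCE'. Returns None for anything else.
    """
    line = first_line.strip()
    if line.startswith("!!NA_SEQUENCE"):
        return "nucleic acid"
    if line.startswith("!!AA_SEQUENCE"):
        return "amino acid"
    return None


print(gcg_sequence_type("!!NA_SEQUENCE 1.0"))
print(gcg_sequence_type("!!AA_SEQUENCE 1.0"))
```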

[Figure: GCG format example]

GCG-MSF format
- Holds multiple sequences: the sequence names, the sequences, and their alignment.
- The word PileUp indicates that the file contains multiple sequences.
- The mandatory word MSF tells that the file is an MSF GCG file and not just a GCG file.
- Comments are terminated with //.
- 2 consecutive blank lines.
- Multiple sequences.
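The role of the MSF word and the // divider can be illustrated with a small reader. This is a sketch under stated assumptions: the header contains the token "MSF:", a line holding only // separates header from alignment, and each alignment row is a name followed by sequence chunks. The toy file and the name read_msf_alignment are invented for illustration.

```python
def read_msf_alignment(text):
    """Return {name: aligned sequence} from an MSF-style text.

    Assumptions: the header holds the token 'MSF:', a line containing
    only '//' ends the header, and rows look like 'name chunk chunk ...'.
    """
    lines = text.splitlines()
    divider = None
    for i, line in enumerate(lines):
        if line.strip() == "//":
            divider = i
            break
    if divider is None:
        raise ValueError("no '//' divider found")
    if not any("MSF:" in line for line in lines[:divider]):
        raise ValueError("header lacks the mandatory MSF word")

    rows = {}
    for line in lines[divider + 1:]:
        parts = line.split()
        # Skip blank lines and coordinate rulers made only of numbers.
        if len(parts) < 2 or all(p.isdigit() for p in parts):
            continue
        rows.setdefault(parts[0], []).extend(parts[1:])
    return {name: "".join(chunks) for name, chunks in rows.items()}


# Toy alignment, invented for illustration; check values are placeholders.
toy = """PileUp

 MSF: 8  Type: N  Check: 0  ..

 Name: seqA  Len: 8  Check: 0  Weight: 1.00
 Name: seqB  Len: 8  Check: 0  Weight: 1.00

//

seqA  ACGT ACGT
seqB  AC.T ACGA
"""

alignment = read_msf_alignment(toy)
print(alignment)
```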

[Figure: GCG-MSF format example]

EMBL format
- The sequence format of the European Molecular Biology Laboratory (EMBL) database.
- Each entry starts with an ID (identification) line.
- Each entry ends with // as the terminator.
- Different line types each have their own format.
- Used to record various forms of data, e.g. DNA, RNA, gene, and protein sequences.
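The "starts with ID, ends with //" rule is enough to split a file into entries. A minimal sketch, assuming every entry opens with a line beginning with ID and closes with //; the entries shown are invented placeholders, not real EMBL records, and split_embl_entries is a hypothetical name.

```python
def split_embl_entries(text):
    """Split EMBL-style text into entries (each a list of lines).

    Assumptions: an entry starts with an 'ID' line and ends with '//'.
    """
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith("//"):
            if current:
                entries.append(current)
            current = []
        elif line.strip():
            current.append(line)
    if current:
        raise ValueError("last entry is missing its '//' terminator")
    for entry in entries:
        if not entry[0].startswith("ID"):
            raise ValueError("entry does not start with an ID line")
    return entries


# Two toy entries, invented for illustration only.
toy = """ID   TOY0001; illustrative entry only
DE   first toy entry
SQ   Sequence 8 BP;
     acgtacgt        8
//
ID   TOY0002; illustrative entry only
DE   second toy entry
SQ   Sequence 4 BP;
     ttga        4
//
"""

entries = split_embl_entries(toy)
print(len(entries))
```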

[Figure: EMBL format example]

Clustal format
- Clustal is among the most widely used sequence alignment tools; its versions include CLUSTAL W and CLUSTAL X.
- The format holds aligned protein or gene sequences.

[Figure: Clustal X]

Swiss-Prot format
A protein sequence database. Each line starts with a two-letter code:
- ID: identification
- AC: accession number
- DE: description
- GN: gene name
- OS: organism species
- OG: organelle
- OC: organism classification
- OX: organism taxonomy cross reference
- RN: reference number
- RP: reference position

Continued…
- RC: reference comment
- RX: reference cross reference
- RA: reference author
- RT: reference title
- RL: reference location
- CC: comments
- DR: database cross reference
- KW: keyword
- FT: feature table
- SQ: sequence
- //: end of the entry
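The two-letter line codes lend themselves to a simple lookup table. The sketch below labels each line of an entry with the meaning of its code; CC is mapped to "comments", its usual meaning in Swiss-Prot entries. The toy entry, the table name, and the function name are invented for illustration.

```python
# Line codes as listed on the slides; CC is taken to mean comments.
SWISS_PROT_CODES = {
    "ID": "identification", "AC": "accession number", "DE": "description",
    "GN": "gene name", "OS": "organism species", "OG": "organelle",
    "OC": "organism classification",
    "OX": "organism taxonomy cross reference",
    "RN": "reference number", "RP": "reference position",
    "RC": "reference comment", "RX": "reference cross reference",
    "RA": "reference author", "RT": "reference title",
    "RL": "reference location", "CC": "comments",
    "DR": "database cross reference", "KW": "keyword",
    "FT": "feature table", "SQ": "sequence header",
}


def label_lines(entry_text):
    """Return (meaning, content) pairs for one entry, stopping at '//'."""
    labelled = []
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break
        if not line.strip():
            continue
        if line.startswith("  "):
            # Indented lines after SQ are assumed to hold the residues.
            labelled.append(("sequence data", line.strip()))
            continue
        code = line[:2]
        meaning = SWISS_PROT_CODES.get(code, "unlisted code " + code)
        labelled.append((meaning, line[2:].strip()))
    return labelled


# Toy entry, invented for illustration (not a real Swiss-Prot record).
toy = """ID   TOY_HUMAN   illustrative entry only
AC   X00000;
DE   Toy protein.
OS   Homo sapiens (Human).
SQ   SEQUENCE   5 AA;
     MKTAY
//"""

for meaning, content in label_lines(toy):
    print(meaning + ": " + content)
```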

Sequence conversion tools
- Several software tools have been designed by … ?
- The aim of these tools is to make a detailed conversion of one sequence format into another.
- Some of the tools widely used for sequence interconversion are:
  - ReadSeq
  - GCG
  - SeqVerter
  - Seqret

ReadSeq
- Developed by Dr. D. G. Gilbert.
- Automated conversion.
- Supports 18 file formats, which can be interconverted into one another.
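To give a feel for what format interconversion involves, here is a hand-written sketch that turns a FASTA record into a GenBank-style ORIGIN block (numbered lines of 60 residues in groups of 10). It is not how ReadSeq or any of the listed tools work internally; the function name and the layout parameters are assumptions made for this example.

```python
def fasta_to_origin_block(fasta_text, group=10, per_line=60):
    """Convert one FASTA record into a GenBank-style ORIGIN block.

    Assumptions: residues are written in lowercase, in groups of
    `group`, `per_line` per line, each line led by its start position.
    """
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    seq = "".join(ch for ch in "".join(lines[1:]) if ch.isalpha()).lower()

    out = ["ORIGIN"]
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        # Right-align the 1-based start position, then list the groups.
        out.append(f"{start + 1:>9} " + " ".join(groups))
    out.append("//")
    return "\n".join(out)


block = fasta_to_origin_block(">demo\natcgatcgatgcgctat")
print(block)
```

Only the sequence part is converted here; the header and feature information that GenBank records carry has no counterpart in a FASTA file, which is one reason real converters are more involved.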

Assignment
Make a file in each of the following formats by this Friday and send the files as attachments in an email:
- FASTA
- Multi-FASTA
- GenBank flat file
- GCG format
- EMBL
- Clustal
- Swiss-Prot

Molecular file formats continued…