GenBank is a leading portal for bioinformatics research and a comprehensive source of sequence information.
Added: Jan 31, 2019
GenBank Databases
Hafiz M. Zeeshan Raza, Research Associate, HEC NRPU, COMSATS University Sahiwal
Overview
Introduction; Sections of the Database; Importance of GenBank
Historical background
The first major bioinformatics project was undertaken by Margaret Dayhoff in 1965, who developed the first protein sequence database, called the Atlas of Protein Sequence and Structure. Subsequently, in the early 1970s, the Brookhaven National Laboratory established the Protein Data Bank for archiving three-dimensional protein structures. The first sequence alignment algorithm was developed by Needleman and Wunsch in 1970. This was a fundamental step in the development of bioinformatics, and it paved the way for the routine sequence comparisons and database searching practiced by modern biologists. The 1980s saw the establishment of GenBank and the development of fast database searching algorithms such as FASTA by William Pearson and BLAST by Stephen Altschul and coworkers.
Introduction
GenBank is the most complete collection of annotated nucleic acid sequence data for almost every organism. The content includes genomic DNA, mRNA, cDNA, ESTs, high-throughput raw sequence data, and sequence polymorphisms. There is also a GenPept database for protein sequences. The majority of its entries are conceptual translations of DNA sequences, although a small number of the amino acid sequences are derived using peptide sequencing techniques.
How to search GenBank
There are two ways to search for sequences in GenBank. One is to use text-based keywords, similar to a PubMed search. The other is to use a molecular sequence as the query and search by sequence similarity using BLAST.
GenBank Sequence Format
Searching GenBank effectively with the text-based method requires an understanding of the GenBank sequence format. GenBank is a relational database. However, the search output for sequence files is produced as flat files for easy reading. The resulting flat files contain three sections: Header, Features, and Sequence entry. There are many fields in the Header and Features sections. Each field has a unique identifier for easy indexing by computer software. Understanding the structure of GenBank files helps in designing effective search strategies.
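The three-part layout can be illustrated with a small sketch. It assumes that the Features section starts at a line beginning with `FEATURES`, that the sequence starts at `ORIGIN`, and that the record ends at `//`. The function name `split_genbank_record` and the toy record are hypothetical and made up for illustration; the record is not a real GenBank entry.

```python
# Made-up toy record for illustration only (not a real GenBank entry).
TOY_RECORD = """\
LOCUS       TOY00001      12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up toy record for illustration only.
ACCESSION   TOY00001
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 gaattctaac gg
//
"""


def split_genbank_record(text):
    """Split one flat-file record into header, features, and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
        sections[current].append(line)
    return sections


sections = split_genbank_record(TOY_RECORD)
print(len(sections["header"]), "header lines")
```

A real parser would also need to handle multi-record files and wrapped field values; this sketch only shows where the section boundaries fall.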
1st section: Header Part
The “DEFINITION” line provides summary information for the sequence record, including the name of the sequence, the name and taxonomy of the source organism if known, and whether the sequence is complete or partial. This is followed by an accession number for the sequence, which is a unique identifier assigned to a piece of DNA when it is first submitted to GenBank and which stays permanently associated with that sequence. This is the number that should be cited in publications. It has two different formats: one letter followed by five digits (for example, E01306) or two letters followed by six digits.
Continued…
For a nucleotide sequence that has been translated into a protein sequence, a new “accession number” is given in the form of a string of alphanumeric characters. In addition to the accession number, there is also a version number and a GenInfo Identifier (gi) number. The purpose of these numbers is to identify the current version of the sequence. If the sequence is revised at a later date, the accession number remains the same, but the version number is incremented and a new gi number is assigned. A translated protein sequence also has a different gi number from the DNA sequence it is derived from.
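As a rough illustration of the accession and version numbers, the sketch below splits an identifier such as `E01306.1` into its two parts. It assumes only the two classic nucleotide accession shapes (one letter plus five digits, or two letters plus six digits); other accession shapes exist and are not covered. The function name `parse_accession` is hypothetical.

```python
import re

# Assumption: one letter + five digits, or two letters + six digits,
# optionally followed by ".<version>".
ACCESSION_VERSION = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")


def parse_accession(text):
    """Return (accession, version) or None if the text does not match."""
    match = ACCESSION_VERSION.match(text.strip())
    if match is None:
        return None
    accession, version = match.groups()
    return accession, (int(version) if version else None)


print(parse_accession("E01306.1"))
```

The accession part stays fixed across revisions, while the integer after the dot is the part that changes when the sequence is updated.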
Continued…
The next line in the Header section is the “ORGANISM” field, which gives the source organism by the scientific name of the species and sometimes the tissue type. The scientific name is accompanied by the taxonomic classification of the organism. The different levels of the classification are hyperlinked to the NCBI taxonomy database, which has more detailed descriptions.
Continued…
This is followed by the “REFERENCE” field, which provides the publication citation related to the sequence entry. The REFERENCE part includes the authors and title of the published work (or a tentative title for unpublished work). The “JOURNAL” field includes the citation information as well as the date of sequence submission. The citation is often hyperlinked to the PubMed record for access to the original literature. The last part of the Header is the contact information of the sequence submitter.
2nd section: Features
The “Features” section includes annotation information about the gene and gene product, as well as regions of biological significance reported in the sequence, with identifiers and qualifiers. The “Source” field provides the length of the sequence, the scientific name of the organism, and the taxonomy identification number. Optional information includes the clone source, the tissue type, and the cell line. The “gene” field gives information about the nucleotide coding sequence and its name. For DNA entries, there is a “CDS” field, which gives the boundaries of the sequence that can be translated into amino acids. For eukaryotic DNA, this field also gives the locations of exons, and the translated protein sequence is included.
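The CDS boundaries and exon locations are written as coordinate ranges, for example `34..180` for a single span or `join(1..50,100..200)` for several exons. The sketch below extracts the ranges from such a string. It is a simplification that ignores strand (`complement(...)`) and partial-end markers, and the function names `parse_location` and `coding_length` are hypothetical.

```python
import re


def parse_location(location):
    """Return a list of (start, end) pairs found in a feature location string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def coding_length(location):
    """Total number of bases covered by all ranges (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end in parse_location(location))


exons = parse_location("join(1..50,100..200)")
print(exons, coding_length("join(1..50,100..200)"))
```

Because both endpoints are inclusive, a range `1..50` covers 50 bases rather than 49.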
3rd section: Sequence
The third section of the flat file is the sequence itself, starting with the label “ORIGIN”. The format of the sequence display can be changed by choosing options from a Display pull-down menu at the upper left corner. For DNA entries, there is a BASE COUNT report that gives the numbers of A, G, C, and T in the sequence. This section, for both DNA and protein sequences, ends with two forward slashes (the “//” symbol).
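A base count of the kind described above can be recomputed from the sequence lines themselves. The sketch below assumes the usual flat-file layout in which each sequence line carries a position number followed by blocks of lowercase bases; the sample line is made up, and the function name `base_count` is hypothetical.

```python
from collections import Counter


def base_count(origin_lines):
    """Count a, c, g, t in sequence lines, skipping digits and spaces."""
    counts = Counter()
    for line in origin_lines:
        counts.update(ch for ch in line.lower() if ch in "acgt")
    return {base: counts[base] for base in "acgt"}


# Made-up sequence line in the style of the ORIGIN section.
print(base_count(["        1 gaattctaac gg"]))
```

Comparing such a recount with the reported totals is a simple consistency check after editing or trimming a record.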
Importance
When retrieving DNA or protein sequences from GenBank, the search can be limited to particular annotation fields such as “organism,” “accession number,” “authors,” and “publication date.” One can use a combination of the “Limits” and “Preview/Index” options. Alternatively, a number of search qualifiers can be used, each corresponding to one of the fields in a GenBank file. The qualifiers are similar to, but not the same as, the field tags in PubMed. For example, in GenBank, [GENE] represents the field for gene name, [AUTH] for author name, and [ORGN] for organism name. Frequently used GenBank qualifiers have to be in uppercase and in brackets.
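To show how the bracketed qualifiers combine, the sketch below assembles a query string and an NCBI E-utilities `esearch` URL without sending any request. The helper names `build_term` and `build_search_url`, the example gene and organism values, and the choice of the `nuccore` database with `retmax=20` are assumptions for illustration.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(**fields):
    """Join value[QUALIFIER] pairs with AND, writing qualifiers in uppercase."""
    parts = [f"{value}[{tag.upper()}]" for tag, value in fields.items()]
    return " AND ".join(parts)


def build_search_url(term, database="nuccore", retmax=20):
    """Build (but do not fetch) an esearch URL for the given query term."""
    query = urlencode({"db": database, "term": term, "retmax": retmax})
    return ESEARCH + "?" + query


term = build_term(gene="IGF1", orgn="Homo sapiens")  # example values only
print(term)
print(build_search_url(term))
```

Keeping the term builder separate from the URL builder makes it easy to paste the same term into the web search box.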
Alternative sequence Formats
In bioinformatics, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. FASTA is one of the simplest and most popular sequence formats because it contains plain sequence information that is readable by many bioinformatics analysis programs. It has a single definition line that begins with a right angle bracket (>) followed by a sequence name. Sometimes extra information, such as a gi number or comments, is given, separated from the sequence name by a “|” symbol.
FASTA Format Sequence
>E01306.1 DNA encoding human insulin-like growth factor I(IGFI)
GAATTCTAACGGTCCCGAAACTCTGTGCGGTGTGAATGGTTGACGCTCTGCAG
TTGTTTGCGGTGACCGTGGTTTTTATTTTAACAAACCCACTGGTTATGGTTCTT
TTCTCGTCGTGCTCCCCAGACTGGTATTGTTGAGAATGCTGCTTTCGTTCTTG
GACCTGCGTCGTCTGGAAATGTATTGCGCTCCCCTGAAACCCGC
The extra information is considered optional and is ignored by sequence analysis programs. The plain sequence in standard one-letter symbols starts on the second line. Each line of sequence data is limited to sixty to eighty characters in width. The drawback of this format is that much annotation information is lost.
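Because the format is just a definition line followed by sequence lines, reading it takes very little code. The sketch below is a minimal reader under that assumption; the function name `parse_fasta` and the toy record are hypothetical and made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples from FASTA text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line.replace(" ", ""))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Made-up toy record, not a real database entry.
toy = ">toy1 made-up example\nGAATTCTAAC\nGGTCCC\n"
for definition, sequence in parse_fasta(toy):
    print(definition.split(None, 1)[0], len(sequence))
```

The reader keeps the whole definition line and leaves it to the caller to split off the sequence name, since everything after the name is optional.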