This file contains: introduction, classification, primary databases, nucleic acid databases, protein sequence databases, protein structure database
Added: May 20, 2023
Slides: 34 pages
Slide Content
Primary Bioinformatics Database
Contents: Introduction; Classification of databases; Primary databases; Nucleic acid databases (GenBank, EMBL, DDBJ); Protein sequence databases (SWISS-PROT, UNIPROT, PIR); Protein structure database (PDB); Conclusion; References
Introduction Bioinformatics databases, or biological databases, are storehouses of biological information. They can be defined as libraries containing data collected from scientific experiments, published literature and computational analysis. They provide users with an interface that facilitates easy and efficient recording, storing, analysis and retrieval of biological data through computer software. Biological data come in several different formats, such as text, sequence data, structures and links, and these need to be taken into account when creating the databases.
CLASSIFICATION OF DATABASES The databases can be classified into three categories on the basis of the information stored: primary databases, secondary databases and composite databases.
Primary Database Primary databases (also known as data repositories) are highly organised, user-friendly gateways to the huge amount of biological data produced by researchers around the world. The primary databases were first developed in the 1980s and 1990s for the storage of experimentally determined DNA and protein sequences. Nowadays, sequence submissions are made by individual laboratories, as well as "in bulk" by sequencing centres around the world. Most protein sequences found in databases are the product of conceptual translation of genes and genomes determined by DNA sequencing.
Primary databases Primary databases are also called archival databases. They are populated with experimentally derived data such as nucleotide sequences, protein sequences or macromolecular structures. Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature. Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.
Once data are deposited in primary databases, they can be accessed freely by anyone around the world. For example, suppose researchers are working on a Staphylococcus aureus strain that was isolated from a patient. After some investigation, they suspect that this strain might be genetically different from previously identified strains. They decide to sequence it and, after comparing it with the DNA sequences already placed in the public repository (the "known" strains), they conclude that their strain is indeed different. The research community benefits from having this new sequence in the public repository: the next time researchers isolate a similar strain, they will be able to recognise whether their isolate is novel or related to strains previously sequenced.
There are three nucleotide repositories, or primary databases, for the submission of nucleotide and genome sequences: GenBank, hosted by the National Center for Biotechnology Information (NCBI); the European Nucleotide Archive (ENA), hosted by the European Bioinformatics Institute (EMBL-EBI), which is part of the European Molecular Biology Laboratory (EMBL); and the DNA Data Bank of Japan (DDBJ), hosted by the National Institute of Genetics (NIG).
GenBank The GenBank sequence database is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration (INSDC). Data formats: XML; ASN.1; GenBank flat-file format. Data types captured: nucleotide sequence; protein sequence. A GenBank release occurs every two months and is available from the NCBI FTP site.
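To make the GenBank flat-file format mentioned above more concrete, the following is a minimal sketch of reading a few fields from one record. It assumes the standard layout (a keyword in the first 12 columns, an ORIGIN block holding the sequence, and "//" ending the record). The function name parse_genbank_record and the sample record DEMO0001 are made up for illustration and do not correspond to a real GenBank entry; a production parser would need to handle many more fields.

```python
# Hypothetical, made-up record used only to illustrate the flat-file layout.
SAMPLE_RECORD = """\
LOCUS       DEMO0001                  20 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up example record for illustration
            only.
ACCESSION   DEMO0001
VERSION     DEMO0001.1
ORIGIN
        1 acgtacgtac gtacgtacgt
//
"""


def parse_genbank_record(text):
    """Extract locus, accession, version, definition and sequence from one record."""
    record = {"locus": None, "accession": None, "version": None, "definition": ""}
    current = None      # last keyword seen in the first 12 columns
    in_origin = False   # True once the ORIGIN (sequence) block has started
    bases = []
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            bases.extend(line.split()[1:])
            continue
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:
            current = keyword
        if current == "ORIGIN":
            in_origin = True
        elif current == "DEFINITION":
            # DEFINITION may continue on indented follow-on lines.
            record["definition"] = (record["definition"] + " " + value).strip()
        elif keyword in ("LOCUS", "ACCESSION", "VERSION") and value:
            record[keyword.lower()] = value.split()[0]
    record["sequence"] = "".join(bases).upper()
    return record
```

Calling parse_genbank_record(SAMPLE_RECORD) returns a dictionary whose "sequence" entry is the 20 bases of the ORIGIN block joined into one upper-case string.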
Access to GenBank There are several ways to search and retrieve data from GenBank. Search GenBank for sequence identifiers and annotations with Entrez Nucleotide. Search and align GenBank sequences to a query sequence using BLAST (Basic Local Alignment Search Tool); see BLAST info for more information about the numerous BLAST databases. Search, link and download sequences programmatically using NCBI E-utilities. GenBank Data Usage NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright or other intellectual property rights in all or a portion of the data they have submitted.
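As an illustration of programmatic access, here is a minimal sketch that builds a request to the E-utilities EFetch endpoint and parses the FASTA text it returns. The helper names build_efetch_url, parse_fasta and fetch_fasta are made up for this example, the choice of the "nuccore" database and FASTA output is an assumption about what the reader wants, and the accession passed in is up to the user. NCBI's usage guidelines (request rate limits, API keys) are not handled here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an EFetch URL that returns one record as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_fasta(accession):
    """Download one nucleotide record and parse it (requires network access)."""
    with urlopen(build_efetch_url(accession)) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Only fetch_fasta touches the network; build_efetch_url and parse_fasta can be tried offline, which makes it easy to check the request before sending it.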
EMBL The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is maintained at the European Bioinformatics Institute (EBI) in an international collaboration with the DNA Data Bank of Japan (DDBJ) and GenBank (USA). EMBL itself was founded in 1974; the nucleotide sequence database, originally the EMBL Data Library, was established in 1980. Data are exchanged amongst the collaborating databases on a daily basis. The major contributors to the EMBL database are individual authors and genome project groups. WEBIN is the preferred web-based submission system for individual submitters, while automatic procedures allow incorporation of sequence data from large-scale genome sequencing centres and from the European Patent Office (EPO).
Database releases are produced quarterly. Network services allow free access to the most up-to-date data collection via Internet and WWW interfaces. EBI's Sequence Retrieval System (SRS) is a network browser for databanks in molecular biology, integrating and linking the main nucleotide and protein databases plus many specialised databases. For sequence similarity searching, a variety of tools (e.g., BLITZ, FASTA, BLAST) are available that allow external users to compare their own sequences against the most current data in the EMBL Nucleotide Sequence Database and SWISS-PROT. The database is accessed through the URL http://www.ebi.ac.uk/embl
PIR database The Protein Information Resource (PIR) database was established in 1984 by the National Biomedical Research Foundation (NBRF). It is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies. It assists researchers in the identification and interpretation of protein sequence information. PIR can be searched for entries or used for sequence similarity searches. It can be accessed at http://www.pir.georgetown.edu/. PIR offers a variety of resources mainly oriented towards assisting the propagation and standardization of protein annotation.
Conclusion Bioinformatics databases are storehouses of biological information. Primary databases are populated with experimentally derived data such as nucleotide sequences, protein sequences and macromolecular structures. Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature. Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record. Examples include GenBank, EMBL, DDBJ, PIR, SWISS-PROT, UNIPROT and PDB.