DATABASE
Information available on a particular topic or subject is called data. A database is a computerized archive that stores and organizes data so that information can be retrieved easily using a variety of search criteria. Computerized databases offer several advantages. Required information is easy to search for and obtain. Redundancy of data can be reduced, which also avoids inconsistencies, since a change to the data need not be carried out at several places in the database. The data can be shared more easily because a database may be accessed by several users simultaneously. The data can be authenticated and standards can be enforced more easily.
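The points about search criteria and reduced redundancy can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names, column names and accession values below are invented for illustration and do not come from any real database. Each organism name is stored once and referenced by ID, so a correction is made in a single place.

```python
import sqlite3

def build_demo_db():
    """Create a small in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
    conn.execute(
        "CREATE TABLE sequence ("
        "accession TEXT PRIMARY KEY, "
        "organism_id INTEGER REFERENCES organism(id), "
        "length INTEGER)"
    )
    # The organism name is stored once and referenced by id (reduced redundancy).
    conn.executemany(
        "INSERT INTO organism VALUES (?, ?)",
        [(1, "Escherichia coli"), (2, "Mus musculus")],
    )
    conn.executemany(
        "INSERT INTO sequence VALUES (?, ?, ?)",
        [("DEMO0001", 1, 1200), ("DEMO0002", 1, 450), ("DEMO0003", 2, 980)],
    )
    return conn

def find_sequences(conn, organism, min_length):
    """Retrieve accessions by two search criteria: organism and minimum length."""
    rows = conn.execute(
        "SELECT s.accession FROM sequence s "
        "JOIN organism o ON o.id = s.organism_id "
        "WHERE o.name = ? AND s.length >= ? "
        "ORDER BY s.accession",
        (organism, min_length),
    )
    return [row[0] for row in rows]

conn = build_demo_db()
print(find_sequences(conn, "Escherichia coli", 1000))
```

Renaming an organism requires one UPDATE on the organism table, and every sequence that refers to it picks up the change.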
BIOLOGICAL DATABASE
A biological database is a collection of biological data arranged in computer-readable form that speeds up search and retrieval and is convenient to use. These databases provide a range of information collected from scientific experiments and published literature, including biological sequences, structures, binding sites, metabolic interactions, functional relationships, protein families, motifs (short conserved regions in a DNA or protein sequence) and homologs (biological molecules related to one another by divergent evolution from a common ancestor). They link knowledge obtained from various fields of biology and medicine. Biological databases are of three types: primary databases, secondary databases and composite databases.
PRIMARY DATABASES
Primary databases store raw experimental data and contain only sequence or structure information. The different types of primary databases are described below.
1. Primary nucleic acid databases
They hold experimentally determined nucleotide sequence information, together with the protein sequences inferred from the conceptual translation of these nucleotide sequences. The sequences are submitted directly by scientists and genome sequencing groups, or are taken from the literature and patents. The three primary nucleotide sequence databases are the Nucleotide Sequence Database maintained by EMBL, GenBank and DDBJ. Together they comprise the International Nucleotide Sequence Database Collaboration. Entries are exchanged daily between the three databases, so they function as a virtually unified database called INSD (International Nucleotide Sequence Database). These databases can be used without any legal restrictions.
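Conceptual translation, mentioned above, can be sketched in a few lines of Python. This is a simplified illustration and not the procedure used by any of the databases: it assumes the standard genetic code, reads a single forward frame from the first base, and stops at the first stop codon. The function name conceptual_translation is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered by the bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def conceptual_translation(dna):
    """Translate a DNA coding sequence in frame 1 up to the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases are reported as "X".
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)

print(conceptual_translation("ATGGCCTAA"))
```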
a) GenBank
GenBank is a public database of publicly available nucleotide sequences and their protein translations, with supporting bibliographic and biological annotation. It is built and maintained by NCBI. Besides sequence data, GenBank files contain information such as accession numbers, gene names, phylogenetic classification and references to published literature. Data may be submitted using BankIt (a web-based submission tool), Sequin (NCBI's stand-alone submission software) or the Barcode Submission Tool (a web-based submission tool). Data are retrieved through the Entrez system, a database retrieval system that provides access to the database entries.
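Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds the request URL for a GenBank flat-file record and does not contact the network. The helper name genbank_efetch_url is hypothetical, and U49845 is used purely as an example accession.

```python
from urllib.parse import urlencode

# Base address of the E-utilities EFetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def genbank_efetch_url(accession):
    """Build an EFetch URL requesting one nucleotide record in GenBank format."""
    params = {
        "db": "nuccore",    # Entrez nucleotide database
        "id": accession,    # accession number of the record
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",  # plain text rather than XML
    }
    return EUTILS_EFETCH + "?" + urlencode(params)

print(genbank_efetch_url("U49845"))
```

The resulting URL can be opened in a browser or passed to any HTTP client. NCBI's usage guidelines on request rates apply to automated use.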
b) EMBL (European Molecular Biology Laboratory)
The EMBL database constitutes Europe's primary nucleotide sequence resource. The data originate from large-scale genome sequencing projects, direct submissions from individual scientists and the European Patent Office. The whole database is released quarterly, while new and updated records are distributed daily. EMBL entries are grouped into divisions based mainly on taxonomy, with a few exceptions such as the HTG (High-Throughput Genome Sequences) and GSS (Genome Survey Sequences) divisions, for which grouping is based on the specific nature of the underlying data. The divisions thus provide subsets of the database that reflect the areas of interest of many users. The EMBL database currently consists of 17 divisions, with each entry belonging to exactly one division. Sequences can be retrieved via the EBI SRS (Sequence Retrieval System) server, the FTP server, or Dbfetch (database fetch), a tool for simple sequence retrieval via HTTP.
c) DDBJ (DNA Data Bank of Japan)
DDBJ is the only nucleotide sequence databank in Asia certified to collect nucleotide sequences from researchers and to issue internationally recognized accession numbers to data submitters. It collects sequence data mainly from Japanese researchers. The principal purpose of DDBJ operations is to improve the quality of INSD: when researchers make their data public through INSD, scientists at DDBJ work to describe the data as richly as possible, according to the unified rules of INSD. To submit their data, Japanese genome teams use the mass submission tool (MST).
2. PRIMARY PROTEIN SEQUENCE DATABASES
They contain entries which describe protein domains, families and functional sites. They also contain associated patterns and profiles used to identify protein domains and families. Swiss-Prot, TrEMBL (translated EMBL) and PIR (Protein Information Resource) are the primary protein databases and are distinct from the nucleotide databases. These databases are curated, i.e., they are created and maintained by groups of scientists.
Swiss-Prot
Swiss-Prot aims to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications and variants) and a minimum level of redundancy. It has a high level of integration with other databases. A Swiss-Prot entry contains a large number of annotations. Each line begins with a two-letter code, many of which are self-explanatory, e.g. ID (identification), AC (accession number), DT (date), DE (description), GN (gene name) and CC (comment). Swiss-Prot not only presents a fairly comprehensive description of the protein and its functions but also provides cross-references to the relevant entries in secondary databases such as PROSITE, PRINTS and Pfam. The Swiss-Prot database has some legal restrictions. The entries themselves are copyrighted, but freely accessible and usable by academic researchers. Commercial companies must pay a license fee to use Swiss-Prot.
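The two-letter line codes make an entry easy to process mechanically. The following is a minimal sketch that groups the lines of a flat-file entry by their code. The sample entry is invented for illustration (DEMO_HUMAN and X00000 are not real identifiers), and the function name group_by_line_code is hypothetical. It assumes that the content of each line starts at column 6 and that an entry ends with a "//" line.

```python
from collections import defaultdict

# Invented entry used only to demonstrate the line-code layout.
SAMPLE_ENTRY = """\
ID   DEMO_HUMAN              Reviewed;         120 AA.
AC   X00000;
DE   RecName: Full=Hypothetical demo protein;
GN   Name=DEMO1;
CC   -!- FUNCTION: Illustrative entry only.
CC   -!- SUBUNIT: Not a real protein.
//
"""

def group_by_line_code(entry_text):
    """Return a dict mapping each two-letter line code to its list of lines."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        if len(line) < 2:
            continue
        code = line[:2]             # two-letter line code
        content = line[5:].strip()  # content starts at column 6
        fields[code].append(content)
    return dict(fields)

fields = group_by_line_code(SAMPLE_ENTRY)
print(fields["AC"], fields["GN"], len(fields["CC"]))
```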
TrEMBL
TrEMBL is a computer-annotated supplement of Swiss-Prot and contains all the translations of EMBL sequence entries that are not yet integrated in Swiss-Prot. The annotation of a TrEMBL entry has not yet reached the standard required for inclusion in Swiss-Prot. As further data confirm the reliability of the annotations, TrEMBL entries are moved to Swiss-Prot. Swiss-Prot and TrEMBL are developed by the Swiss-Prot groups at the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).
PIR
PIR is a database of functionally annotated protein sequences. It aims to be comprehensive, well organised, accurate and consistently annotated, but its entry annotation does not reach the level of completeness of Swiss-Prot. It is a division of the NBRF (National Biomedical Research Foundation) in the US. It has collaborated with EBI and SIB to establish UniProt (the universal protein database), which provides a single, centralised, authoritative resource for protein sequences and functional information. PIR also produces NRL-3D, a database of sequences extracted from the 3D structures in the PDB. NRL-3D makes the sequence information in the PDB available for similarity searches and retrieval, and provides cross-reference information for use with the other PIR protein sequence databases. Swiss-Prot and PIR overlap extensively, but there are still many sequences that can be found in only one of them.
3. PRIMARY STRUCTURE DATABASE
Primary structure databases pertain to macromolecular structure and store data on protein and nucleic acid structure. The primary resource for protein structure data is the Protein Data Bank (PDB). It is the worldwide archive of structural data, maintained by the Research Collaboratory for Structural Bioinformatics (RCSB) at Rutgers University. The associated Nucleic Acid Data Bank (NDB) is also maintained there.
The PDB is the main primary database for 3D structures of biological macromolecules. Data from X-ray crystallography and NMR spectroscopic studies are deposited in the PDB using a web-based interface called the AutoDep Input Tool. The data are extensively checked and verified by human curators before acceptance. The PDB also accepts the experimental data used to determine the structures, as well as homology models. PDB entries contain atomic coordinates and some structural parameters connected with the atoms. PDB entries are annotated, but not as comprehensively as Swiss-Prot entries. There are no legal restrictions on the use of the PDB. It was established in 1971 at the Brookhaven National Laboratory, New York, US. It is maintained by the RCSB (Research Collaboratory for Structural Bioinformatics).
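The atomic coordinates in a classic PDB-format file are stored in fixed-width ATOM records. The sketch below reads the main fields of one such record by column position. It assumes the legacy PDB text layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in 31-54). The sample line is generated in the code with invented values, and the function name parse_atom_record is hypothetical.

```python
def parse_atom_record(line):
    """Extract selected fields from one fixed-width ATOM record."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }

# Build a sample record with the column widths listed above (invented values).
SAMPLE_ATOM = (
    "{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
    "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
).format("ATOM", 1, " N", "", "MET", "A", 1, "", 27.340, 24.430, 2.614, 1.00, 9.67)

atom = parse_atom_record(SAMPLE_ATOM)
print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```

Slicing by column, rather than splitting on whitespace, matters because adjacent fields in real files can run together without a separating space.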
Secondary databases
Secondary databases contain information derived from the data in the primary databases. They consolidate, summarise, standardise, classify, index and comment on primary databases. They are very important for inferring protein function. Examples are PROSITE, PRINTS and BLOCKS.
Composite databases
A composite database amalgamates the information held in two or more primary databases. This means that only one database needs to be searched, rather than performing multiple searches on individual primary databases. Examples are OWL (Swiss-Prot, PIR, GenPept and NRL-3D) and NRDB (Swiss-Prot and TrEMBL).
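The idea behind a composite database can be sketched as a merge of several sources in which identical sequences are stored only once. This is a toy illustration and not the procedure actually used to build OWL or NRDB. The source names, identifiers and sequences are invented, and the function name merge_non_redundant is hypothetical.

```python
def merge_non_redundant(*sources):
    """Merge (source_name, records) pairs, keeping one entry per sequence.

    Each record is an (identifier, sequence) tuple. Identical sequences from
    different sources are collapsed into one entry that lists every origin.
    """
    merged = {}
    for source_name, records in sources:
        for identifier, sequence in records:
            entry = merged.setdefault(sequence, {"sequence": sequence, "ids": []})
            entry["ids"].append(f"{source_name}:{identifier}")
    return list(merged.values())

# Invented records standing in for two primary databases.
source_a = ("SourceA", [("A001", "MKTAYIAK"), ("A002", "MSLLTEVE")])
source_b = ("SourceB", [("B104", "MKTAYIAK"), ("B377", "MGDVEKGK")])

composite = merge_non_redundant(source_a, source_b)
for entry in composite:
    print(entry["sequence"], entry["ids"])
```

A single search over the merged entries covers both sources, and the shared sequence appears once with both identifiers attached.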
Organism-specific databases
These contain information, links and resources dedicated to a particular species. They cover sequence data, gene expression, mutant phenotypes, genome maps, genome sequencing projects and relevant scientific literature, and provide links to resources for obtaining clones and mutants as well as for contacting researchers. Examples are EcoGene (a database for E. coli), the Mouse Genome Database (MGD) for mouse, and OMIM (Online Mendelian Inheritance in Man).
Specialised sequence databases
These databases hold particular types of nucleic acid or protein sequences. For example, there are databases specifically for rRNA and tRNA sequences.
Commercial databases
Unlike public databases, which can be accessed freely by anyone over the web, commercial databases require a subscription, as they are the result of a single company's research and investment. Examples are Incyte, UniGene, etc.
Literature databases
A literature database contains the abstracts and, in some cases, the full text and figures of published articles. Such databases can be searched using text strings to find words in the title, abstract or keywords, or by author or author's institution. Medline was one of the earliest comprehensive online library resources. It has now been incorporated into a larger resource called PubMed, maintained by the NCBI. Other examples are the Web of Science and BioMedNet.
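The kind of text-string search described above can be sketched over a small in-memory list of records. This is only a toy model of how a literature database matches a query against the title, abstract, keywords and author fields. The article records are invented and the function name search_articles is hypothetical.

```python
# Invented records standing in for literature database entries.
ARTICLES = [
    {
        "title": "Protein family classification using sequence motifs",
        "abstract": "We describe a motif-based approach to grouping proteins.",
        "keywords": ["motif", "protein family"],
        "authors": ["Rao A", "Meyer L"],
    },
    {
        "title": "A genome map of a model bacterium",
        "abstract": "A physical map is presented together with sequence data.",
        "keywords": ["genome map"],
        "authors": ["Meyer L"],
    },
]

def search_articles(articles, text=None, author=None):
    """Return titles whose text fields contain `text` and whose authors match."""
    hits = []
    for article in articles:
        searchable = " ".join(
            [article["title"], article["abstract"], " ".join(article["keywords"])]
        ).lower()
        if text and text.lower() not in searchable:
            continue  # text string not found in title, abstract or keywords
        if author and not any(author.lower() in a.lower() for a in article["authors"]):
            continue  # no author name contains the requested string
        hits.append(article["title"])
    return hits

print(search_articles(ARTICLES, text="motif"))
print(search_articles(ARTICLES, author="Meyer"))
```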