BIOLOGICAL DATABASES :
A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system.
The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information.
Example. A few popular databases are GenBank from NCBI (National Center for Biotechnology Information), SwissProt from the Swiss Institute of Bioinformatics and PIR from the Protein Information Resource.
IMPORTANCE OF DATABASES :
1. Databases act as a storehouse of information.
2. Databases are used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria.
3. They allow knowledge discovery, that is, the identification of connections between pieces of information that were not known when the information was first entered. This facilitates the discovery of new biological insights from raw data.
4. Secondary databases have become the molecular biologist’s reference library over the past decade or so, providing a wealth of information on just about any gene or gene product that has been investigated by the research community.
5. They handle cases where many users want to access the same data entries.
6. They allow the data to be indexed.
7. They help to remove redundancy in the data.
TYPES OF BIOLOGICAL DATABASES:
Biological databases are classified in two ways:
1. Based on content of biological data
2. Based on the nature of data.
1. BASED ON CONTENT OF BIOLOGICAL DATA :
Based on their contents, biological databases can be roughly divided into two categories:
1. Primary databases
2. Secondary databases
Added: Nov 04, 2023
Slides: 56 pages
Slide Content
Biological Databases SMT. P.SANGEETHA LECTURER IN BIOTECHNOLOGY KVRGCW(A), KURNOOL
Biological Databases A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information. Example. A few popular databases are GenBank from NCBI (National Center for Biotechnology Information), SwissProt from the Swiss Institute of Bioinformatics and PIR from the Protein Information Resource.
Importance of Databases 1. Databases act as a storehouse of information. 2. Databases are used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. 3. They facilitate the discovery of new biological insights from raw data.
Importance of Databases 4. Secondary databases have become the molecular biologist’s reference library over the past decade or so, providing a wealth of information on just about any gene or gene product that has been investigated by the research community. 5. They handle cases where many users want to access the same data entries. 6. They allow the data to be indexed. 7. They help to remove redundancy in the data.
Types of Biological Databases
1. Based on content of biological data
1. Primary databases Primary databases are also called archival databases. They are populated with experimentally derived data such as nucleotide sequences, protein sequences or macromolecular structures. Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature. Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.
1. Primary databases Examples: GenBank and DDBJ (nucleotide sequences); Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures).
2. Secondary databases Secondary databases comprise data derived from the results of analysing primary data. Secondary databases often draw upon information from numerous sources, including other databases (primary and secondary), controlled vocabularies and the scientific literature. They are highly curated, often using a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from the public record of science.
2. Secondary databases Examples: InterPro (protein families, motifs and domains); UniProt Knowledgebase (sequence and functional information on proteins); Ensembl (variation, function, regulation and more layered onto whole genome sequences).
2. Based on the nature of data
1. Structural databases
2. Sequence databases
i. Nucleic acid sequence databases
ii. Protein sequence databases
1. Structural databases Structural databases contain three-dimensional structural information for each macromolecule, derived from the analysis of experimental data such as diffraction data. Examples: PDB, CATH and SCOP.
PDB (Protein Data Bank) www.rcsb.org/pdb/ The PDB was established in 1971 at Brookhaven National Laboratory on Long Island, New York, USA. In 1999, its management was moved to the Research Collaboratory for Structural Bioinformatics (RCSB), a joint organisation whose members include Rutgers University and the San Diego Supercomputer Center. PDB entries contain the atomic coordinates and some structural parameters connected with the atoms or computed from the structures (for example, secondary structure).
PDB (Protein Data Bank) PDB entries contain some annotations, but these are not as comprehensive as those in SWISS-PROT. There are no legal restrictions on the use of the data in the PDB. The Protein Data Bank is an archive of experimentally determined three-dimensional (3D) structures of biological macromolecules, serving a global community of researchers, educators and students.
PDB (Protein Data Bank) The archive contains atomic coordinates, bibliographic citations, primary and secondary structure information, as well as crystallographic structure factors and NMR (Nuclear Magnetic Resonance) experimental data. The PDB is the main primary database for 3D structures of biological macromolecules determined by X-ray crystallography and NMR.
PDB (Protein Data Bank) Structural biologists usually deposit their structures in the PDB on publication, and some scientific journals require this before accepting a paper. The PDB also accepts the experimental data used to determine the structures (from X-ray crystallography and NMR). Homology (theoretical) models were accepted in the past, but since 2006 depositions have been limited to experimentally determined structures.
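The atomic coordinates mentioned above are stored in fixed-width ATOM records in the classic PDB text format. The following is a minimal sketch of reading one such record, assuming the standard column layout of that format; the record itself, the `Atom` type and the `parse_atom_line` helper are made up for illustration and do not come from the slides.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-width ATOM record (hypothetical helper)."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    # Python slices are 0-based, so PDB columns 7-11 become line[6:11], etc.
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


# Made-up record, assembled field by field so the column widths stay visible.
SAMPLE_LINE = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: blank
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "MET"         # columns 18-20: residue name
    " "           # column 21: blank
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and blanks
    "  27.340"    # columns 31-38: x coordinate (angstroms)
    "  24.430"    # columns 39-46: y coordinate
    "   2.614"    # columns 47-54: z coordinate
)

atom = parse_atom_line(SAMPLE_LINE)
print(atom)
```

Real entries carry further fields (occupancy, temperature factor, element symbol) after the coordinates, which this sketch ignores.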
2. Sequence databases A sequence database is a type of biological database composed of a large collection of nucleic acid, protein or other polymer sequences stored on a computer. These include I. Nucleotide databases II. Protein databases
NCBI (National Center for Biotechnology Information) www.ncbi.nlm.nih.gov NCBI is a publicly available resource on the web. NCBI was established in November 1988 at the National Library of Medicine (NLM) in the United States. The NLM was chosen because it had experience in creating and maintaining biomedical databases and because, as part of the National Institutes of Health (NIH), it could establish a research program in computational molecular biology.
NCBI (National Center for Biotechnology Information) The mission of NCBI is to develop new information technologies to aid in the understanding of the fundamental molecular and genetic processes that control health and disease. More specifically, NCBI has been charged with creating automated systems for storing and analysing knowledge about molecular biology, biochemistry and genetics; facilitating the use of such databases and software by the research and medical community; coordinating efforts to gather biotechnology information both nationally and internationally; and performing research into advanced methods of computer-based information processing for analysing the structure and function of biologically important molecules.
NCBI maintains several databases and resources. They are as follows:
1. Literature databases
2. Entrez databases
3. Nucleotide databases
4. Genome-specific resources
5. Tools for data mining
6. Tools for sequence analysis
7. Tools for 3D structure display and similarity searching
8. Maps
9. Resource statistics
10. Collaborative cancer research
11. FTP (File Transfer Protocol)
I. Nucleotide databases The nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, etc. 1. PRIMARY DATABASES OF NUCLEOTIDE SEQUENCES: These are the chief databases that store raw nucleic acid sequences and make them available to the public and to researchers. They are referred to as primary nucleotide sequence databases since they are the repository of all the nucleic acid sequences. Examples: GenBank, DDBJ, EMBL.
1. EMBL (European Molecular Biology Laboratory) www.ebi.ac.uk EMBL is the nucleotide sequence database from the EBI (European Bioinformatics Institute). The EBI manages databases of biological data, including nucleic acid sequences, protein sequences and macromolecular structures. The EBI is a centre for research and services in bioinformatics and a pioneer of novel bioinformatics research.
1. EMBL (European Molecular Biology Laboratory) The mission of the EBI is to ensure that the growing body of information from molecular biology and genome research is placed in the public domain and is freely accessible. The database is produced in collaboration with DDBJ and GenBank. Information can be retrieved from EMBL using the SRS (Sequence Retrieval System); this links the principal DNA and protein sequence databases with motif, structure, mapping and other specialist databases.
1. EMBL (European Molecular Biology Laboratory) SRS is one of the most powerful data browsing and retrieval tools available. SRS provides rapid, user-friendly access to the large volumes of diverse and heterogeneous life science data stored in more than 400 internal and public domain databases. It can be used to browse the various biological sequence and literature databases. The EBI also provides access to many other tools for browsing and retrieving biological sequence and literature data.
2. DDBJ (DNA Data Bank of Japan) www.ddbj.nig.ac.jp DDBJ began in 1986 as a collaboration with EMBL and GenBank. The database is produced, maintained and distributed at the National Institute of Genetics. Sequences may be submitted to it from all corners of the world by means of a web-based data submission tool. The web is also used to provide standard search tools such as FASTA and BLAST.
2. DDBJ (DNA Data Bank of Japan) DDBJ is the sole DNA data bank in Japan that is officially certified to collect DNA sequences from researchers and to issue internationally recognised accession numbers to data submitters. DDBJ is one of the three international DNA databases, alongside the EBI (responsible for the EMBL database) and NCBI (responsible for the GenBank database). Consequently, DDBJ collaborates with the other two data banks by exchanging data and information over the Internet and by holding two meetings, the International DNA Data Banks Advisory Meeting (IAM) and the International DNA Data Banks Collaborative Meeting (ICM).
3. GenBank GenBank, the DNA database from NCBI, incorporates sequences from publicly available sources. Information can be retrieved from GenBank using the Entrez integrated retrieval system; this combines data from the principal DNA and protein sequence databases with information from genome maps and protein structures. Additional information on sequences can be accessed via the MEDLINE facility, which provides abstracts of the original published articles.
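Entrez can also be queried programmatically through NCBI's E-utilities web service, which the slides do not cover. The sketch below only builds a request URL for the `efetch` endpoint and does not contact the server; the `build_efetch_url` helper is hypothetical, and the accession `U49845` is used purely as an illustrative identifier.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, db: str = "nuccore") -> str:
    """Build a URL asking for one record in GenBank flat-file text format."""
    params = {
        "db": db,           # Entrez database to query
        "id": accession,    # accession number or other record identifier
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",  # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


url = build_efetch_url("U49845")
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the entry as plain text, subject to NCBI's usage guidelines.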
3. GenBank GenBank may be searched with a user's query sequence by means of NCBI’s web interface to the BLAST suite of programs. A GenBank release includes the sequence files, indices created on various database fields, and information derived from the database (e.g. GenPept, a database of translated coding sequences in FASTA format). The most commonly used file is the sequence entry file, which contains the sequence itself and descriptive information relating to it.
3. GenBank A GenBank entry consists of keywords, relevant associated sub-keywords and an optional Feature Table. After the Feature Table, the entry continues with a BASE COUNT record, which details the frequency of occurrence of the different base types in the sequence. The end of the entry is indicated by a // terminator.
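The keyword layout and the // terminator described above can be illustrated with a small parser. This is a rough sketch that assumes a heavily simplified entry: the sample record and the `parse_genbank_entry` helper are made up, sub-keyword and Feature Table lines are skipped rather than interpreted, and the base frequencies are recomputed from the sequence instead of being read from a BASE COUNT record.

```python
from collections import Counter

# Made-up, simplified entry; real GenBank entries carry many more fields.
SAMPLE_ENTRY = """\
LOCUS       DEMO0001  12 bp  DNA  linear  SYN 01-JAN-2000
DEFINITION  Made-up demonstration sequence.
ACCESSION   DEMO0001
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 atggcgtaat ag
//
"""


def parse_genbank_entry(text: str) -> dict:
    """Collect top-level keyword lines and the sequence of one entry."""
    fields = {}
    sequence_parts = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry terminator
            break
        if in_origin:
            # Sequence lines: drop the position numbers and the spaces.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line and not line[0].isspace():
            # Top-level keywords start in the first column.
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
        # Indented lines (sub-keywords, feature table rows) are skipped here.
    fields["sequence"] = "".join(sequence_parts)
    return fields


entry = parse_genbank_entry(SAMPLE_ENTRY)
base_count = Counter(entry["sequence"])
print(entry["ACCESSION"], dict(base_count))
```

For real work an established parser, such as the one in Biopython, is a safer choice than hand-written code like this.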
2. Secondary databases of nucleotide sequences Many of the secondary databases are simply sub-collections of sequences culled from one or other of the primary databases, such as GenBank or EMBL. 1. Omniome database 2. FlyBase database 3. ACeDB
2. Secondary databases of nucleotide sequences 1. Omniome database: a comprehensive microbial resource maintained by TIGR (The Institute for Genomic Research). It holds not only the sequence and annotation of each of the completed genomes, but also associated information about the organisms (such as taxon and Gram stain pattern), the structure and composition of their DNA molecules, and many other attributes of the protein sequences predicted from the DNA sequences.
2. Secondary databases of nucleotide sequences 2. FlyBase database: a database for the fruit fly D. melanogaster, whose entire genome was sequenced by a consortium to a high degree of completeness and quality. 3. ACeDB: a repository of not only the sequence but also the genetic map and phenotypic information for the nematode worm C. elegans.
II. PROTEIN DATABASES: A protein database is one or more datasets about a protein's amino acid sequence, conformation, structure and features such as active sites. 1. Primary databases of proteins: The primary databases hold experimentally determined protein sequences as well as sequences inferred from the conceptual translation of nucleotide sequences.
1. PIR (Protein Information Resource) www.pir.georgetown.edu The Protein Sequence Database was developed at the National Biomedical Research Foundation (NBRF) in the US. It is maintained in collaboration with the Martinsried Institute for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID). PIR was developed by Margaret Dayhoff as a collection of sequences for investigating evolutionary relationships among proteins.
1. PIR (Protein Information Resource) The PIR database is split into four distinct sections, PIR1 to PIR4, which differ in the quality of the data and the level of annotation provided.
PIR1: contains fully classified and annotated entries.
PIR2: includes preliminary entries, which have not been thoroughly reviewed and may contain redundancy.
PIR3: contains unverified entries, which have not been reviewed.
PIR4: entries fall into four categories:
1. Conceptual translations of artefactual sequences.
2. Conceptual translations of sequences that are not transcribed or translated.
3. Protein sequences or conceptual translations that are genetically engineered.
4. Sequences that are not genetically encoded and not produced on ribosomes.
One can search for entries or run sequence similarity searches at the PIR site. The database can be downloaded as a set of files.
2. SWISS-PROT www.expasy.ch/sprot/ SWISS-PROT is a protein sequence database established in 1986. It was produced collaboratively by the Department of Medical Biochemistry at the University of Geneva and the EMBL; after 1994, the collaboration moved to EMBL’s UK outstation, the EBI. In 1998, the collaboration moved to the Swiss Institute of Bioinformatics (SIB). Hence, the database is now maintained collaboratively by SIB and EBI/EMBL.
2. SWISS-PROT SWISS-PROT strives to provide a high level of annotation (such as a description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. In 1996, a computer-annotated supplement to SWISS-PROT was created, termed TrEMBL.
2. SWISS-PROT In SWISS-PROT, as in many sequence databases, two classes of data can be distinguished: core data and annotation.
1. Core data consists of:
1. Sequence data
2. Citation information (bibliographic references)
3. Taxonomic data (description of the biological source of the protein)
2. Annotation describes:
1. Function of the protein
2. Post-translational modifications
3. Domains and sites
4. Secondary structure
5. Quaternary structure
6. Similarities to other proteins
7. Diseases associated with any number of deficiencies in the protein
8. Sequence conflicts, variants
2. SWISS-PROT Sequence Entry File Each line is flagged with a two-letter code, which helps to present the information in a structured way. Entries begin with the identification (ID) line and end with a // terminator. ID codes can sometimes change, so an additional identifier, the accession number (AC), is also provided, which ought to remain static between database releases.
2. SWISS-PROT Sequence Entry File Next, the DT lines provide information about the date of entry of the sequence into the database and details of when it was last modified. The following lines give the gene name (GN), the organism species (OS) and the organism classification (OC) within the biological kingdoms.
2. SWISS-PROT Sequence Entry File The comment (CC) lines describe the function of the protein, post-translational modifications, similarity and tissue specificity. Database cross-reference (DR) lines follow the comment field; these provide links to other biomolecular databases. The DR lines are followed by keyword (KW) lines and then a number of FT lines.
2. SWISS-PROT Sequence Entry File The FT (Feature Table) lines highlight regions of interest in the sequence, including secondary structure, ligand-binding sites and post-translational modifications. The final section of the database entry contains the sequence (SQ) itself. The entry ends with a // terminator. SWISS-PROT has become the most widely used protein sequence database in the world.
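The two-letter line codes described above make an entry easy to split mechanically. The following sketch groups the lines of one entry by their code and collects the sequence; the sample entry (including its ID and accession number) is made up and much shorter than a real record, and `parse_swissprot_entry` is a hypothetical helper rather than part of any library.

```python
from collections import defaultdict

# Made-up entry for illustration only; not a real database record.
SAMPLE_ENTRY = """\
ID   DEMO_HUMAN     Reviewed;     10 AA.
AC   X00000;
DT   01-JAN-2000, integrated into the database.
GN   Name=DEMO;
OS   Homo sapiens (Human).
KW   Hypothetical example.
FT   CHAIN           1..10
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//
"""


def parse_swissprot_entry(text: str):
    """Group the lines of one entry by two-letter code; return (lines, sequence)."""
    lines_by_code = defaultdict(list)
    sequence_parts = []
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry terminator
            break
        code, content = line[:2], line[5:].strip()
        if code.strip() == "":
            # Lines after SQ carry no code: they hold the sequence itself.
            sequence_parts.append(content.replace(" ", ""))
        else:
            lines_by_code[code].append(content)
    return dict(lines_by_code), "".join(sequence_parts)


records, sequence = parse_swissprot_entry(SAMPLE_ENTRY)
print(records["AC"], records["OS"], sequence)
```

Keeping a list per code matters because codes such as CC, DR and FT usually occur on many lines within one entry.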
3. PubMed PubMed is a free resource supporting the search and retrieval of biomedical and life sciences literature with the aim of improving health, both globally and personally. 1. The PubMed database contains more than 33 million citations and abstracts of biomedical literature. 2. It does not include full-text journal articles; however, links to the full text are often present when it is available from other sources, such as the publisher's website or PubMed Central (PMC).
3. PubMed 3. It has been available to the public online since 1996. 4. PubMed was developed and is maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH). 5. Citations in PubMed primarily stem from the biomedicine and health fields, and from related disciplines such as the life sciences, behavioural sciences, chemical sciences and bioengineering.
3. PubMed PubMed facilitates searching across several NLM literature resources: 1. MEDLINE 2. PubMed Central (PMC) 3. Bookshelf. 1. MEDLINE MEDLINE is the largest component of PubMed and consists primarily of citations from journals selected for MEDLINE; the articles are indexed with MeSH (Medical Subject Headings) and curated with funding, genetic, chemical and other metadata.
3. PubMed 2. PubMed Central (PMC) Citations for PubMed Central (PMC) articles make up the second-largest component of PubMed. PMC is a full-text archive that includes articles from journals reviewed and selected by NLM for archiving (current and historical), as well as individual articles collected for archiving in compliance with funder policies.
3. PubMed 3. Bookshelf The final component of PubMed is citations for books and some individual chapters available on Bookshelf. Bookshelf is a full-text archive of books, reports, databases and other documents related to the biomedical, health and life sciences.
2. Secondary databases of proteins The secondary databases are so termed because they contain the results of analysis of the sequences held in primary databases. 1. PROSITE: Some databases collect the patterns found in protein sequences rather than the complete sequences. PROSITE is one such pattern database. Protein motifs and patterns are encoded as regular expressions. The information corresponding to each entry in PROSITE takes two forms: the pattern and the related descriptive text.
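Because PROSITE patterns are regular expressions written in their own notation, they can be translated into the syntax of a general-purpose regex engine. The sketch below handles only the common elements of the notation (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends); `prosite_to_regex` is a hypothetical helper, the test sequence is made up, and the N-glycosylation site pattern is used as the example.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into Python regex syntax."""
    regex = pattern.rstrip(".")                        # patterns end with a period
    regex = regex.replace("-", "")                     # '-' only separates positions
    regex = regex.replace("x", ".")                    # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 repeats
    regex = regex.replace("<", "^").replace(">", "$")   # sequence start / end
    return regex


# N-glycosylation site pattern: Asn, not Pro, Ser or Thr, not Pro.
n_glyc = prosite_to_regex("N-{P}-[ST]-{P}.")

# Made-up sequence containing one matching site (NGSA) and one non-match (NPTV).
sequence = "MKNGSAQNPTV"
matches = [(m.start(), m.group()) for m in re.finditer(n_glyc, sequence)]
print(n_glyc, matches)
```

Note that `re.finditer` reports non-overlapping matches only, so overlapping sites would need a lookahead-based search instead.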
2. Secondary databases of proteins 2. PRINTS: In the PRINTS database, protein sequence patterns are stored as “fingerprints”. Each entry includes: 1. A first section with cross-links to other databases that hold more information about the characterised family. 2. A second section with a table showing how many of the motifs that make up the fingerprint occur in how many of the sequences of that family. 3. A last section containing the actual fingerprints, which are stored as multiple aligned sets of sequences; the alignment is made without gaps.
2. Secondary databases of proteins 3. Pfam: Pfam contains profiles built using Hidden Markov Models (HMMs). An HMM models the pattern as a series of match, substitute, insert or delete states, with scores assigned for the alignment to move from one state to another.
2. Secondary databases of proteins 4. TrEMBL: TrEMBL (Translated EMBL) was created in 1996 as a computer-annotated supplement to SWISS-PROT. It contains translations of all the coding sequences (CDS) in EMBL. TrEMBL was designed to address the need for a well-structured, SWISS-PROT-like resource that would allow very rapid access to sequence data from the genome projects.