Biological databases

6,458 views 33 slides Aug 05, 2020

About This Presentation

Details on biological databases and their examples


Slide Content

Types of Biological Data; Biological Databases: Nucleic Acid and Protein Sequence and Protein Structure Databases
Presented by: Syeda Tamanna Yasmin, Doctoral Research Scholar, Department of Microbiology

INTRODUCTION
Data: A collection of facts from which conclusions may be drawn.
Biological data: Data relating to, caused by, or affecting life or living organisms.
TYPES OF BIOLOGICAL DATA

BIOLOGICAL DATABASES
Database: A structured, searchable collection of data that is updated periodically.
Biological databases: Libraries of life sciences information collected from scientific experiments, published literature, high-throughput experiment technology, and computational analysis.
The data stored in biological databases are of two types: raw and curated (or annotated).
Type and content of data: sequence or structure; nucleic acid or protein.

Databases can be classified into three categories on the basis of the information stored.
Primary Databases: These contain data that are derived experimentally. They can be divided into protein or nucleotide databases, which can be further divided into sequence or structure databases. The most commonly used primary databases are:
DNA Data Bank of Japan (DDBJ)
European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database
GenBank
Protein Data Bank (PDB)
SWISS-PROT
Protein Information Resource (PIR)

Secondary Databases: These contain data obtained through the analysis or treatment of data present in primary databases. They can contain conserved protein sequences, signature sequences, and active site residues of protein families. These databases can be further classified as metabolic pathway databases, protein family databases, etc. The most common examples are:
Class Architecture Topology Homology (CATH)
Kyoto Encyclopedia of Genes and Genomes (KEGG)
Protein Families (Pfam)
Structural Classification of Proteins (SCOP)

Composite Databases: Composite databases are collections of several (usually more than two) primary database resources. They reduce the tedious task of searching through multiple databases that refer to the same data. For example, DrugBank offers details on drugs and their targets, BioGraph incorporates assorted knowledge of biomedical science, and BioModels is a repository of computational models of biological processes. NCBI, being a composite database, stores a large number of nucleotide and protein sequences on its servers and therefore suffers from high redundancy in the deposited data (IASRI, n.d.).

Biological Databases
  Nucleotide databases: GenBank, EMBL, DDBJ
  Protein databases
    Sequence: PROSITE, Pfam, SwissProt, TrEMBL, PIR
    Structure: PDB, SCOP, CATH, CSD

Primary Nucleotide Databases

GenBank: The GenBank sequence database is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration (INSDC). The database was started in 1982 by Walter Goad and Los Alamos National Laboratory.

EMBL (European Molecular Biology Laboratory): The EMBL Nucleotide Sequence Database is a comprehensive collection of primary nucleotide sequences maintained at the European Bioinformatics Institute (EBI). Data are received from genome sequencing centres, individual scientists, and patent offices. EMBL was created in 1974 and is an intergovernmental organization funded by public research money from its member states. It was the idea of Leó Szilárd, James Watson, and John Kendrew.

DDBJ (DNA Data Bank of Japan): DDBJ is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is the only nucleotide sequence data bank in Asia. DDBJ began data bank activities in 1986 at NIG and is funded by the Japanese Ministry of Education, Culture, Sports, Science and Technology.

Secondary Nucleotide Databases

Omniome Database: The Omniome Database is a comprehensive microbial resource maintained by TIGR (The Institute for Genomic Research). It supports multi-genome searches and analyses, for instance alignment of entire genomes and comparison of the physical properties of proteins and genes from different genomes.

FlyBase Database: A consortium sequenced the entire genome of the fruit fly D. melanogaster to a high degree of completeness and quality. FlyBase is one of the organizations contributing to the Generic Model Organism Database (GMOD).

Primary Databases of Protein

Protein Information Resource (PIR) - Protein Sequence Database (PIR-PSD): The PIR-PSD is a collaborative endeavor between the PIR, the MIPS (Munich Information Centre for Protein Sequences, Germany), and the JIPID (Japan International Protein Information Database, Japan). A unique characteristic of the PIR-PSD is that it classifies protein sequences based on the superfamily concept, as well as on homology domains and sequence motifs.

Protein Data Bank (PDB): The PDB is a crystallographic database for the three-dimensional structures of large biological molecules, such as proteins. It was established in 1971 at Brookhaven National Laboratory under the leadership of Walter Hamilton and originally contained 7 structures. After Hamilton's untimely death, Tom Koetzle began to lead the PDB in 1973, followed by Joel Sussman in 1994. The database holds data derived mainly from three sources: structures determined by X-ray crystallography, NMR experiments, and molecular modeling.

SWISS-PROT: UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase. It is a high-quality, annotated, and non-redundant protein sequence database. Since 2002 it has been maintained by the UniProt consortium and is accessible via the UniProt website. The data in each entry can be considered separately as core data and annotation.

TrEMBL (Translated EMBL): TrEMBL is a computer-annotated protein sequence database released as a supplement to SWISS-PROT. It contains the translations of all coding sequences present in the EMBL Nucleotide Sequence Database that have not yet been fully annotated.

Secondary Databases of Protein

PROSITE: One of a set of databases that collect patterns found in protein sequences rather than the complete sequences. PROSITE was created in 1988 by Amos Bairoch, who directed the group for more than 20 years. Since July 2018, the director of PROSITE and Swiss-Prot has been Alan Bridge. Protein motifs and patterns are encoded as "regular expressions".

PRINTS: In the PRINTS database, protein sequence patterns are stored as "fingerprints". The information in a PRINTS entry can be divided into three sections. The first section contains cross-links to other databases that have more information about the characterized family. The second section provides a table showing how many of the motifs that make up the fingerprint occur in how many of the sequences in that family. The last section contains the actual fingerprints, which are stored as multiple aligned sets of sequences.

MHCPep: MHCPep is a database comprising over 13,000 peptide sequences known to bind the Major Histocompatibility Complex of the immune system. It was established in 1994.

Pfam: Pfam contains profiles built using hidden Markov models (HMMs). Each Pfam entry consists of four elements. The first is the annotation, which records the source used to make the entry, the method used, and some numbers that serve as figures of merit. The second is the seed alignment, which is used to bootstrap the rest of the sequences. The third is the HMM profile. The fourth is the complete alignment of all the sequences identified in that family. The most recent version, Pfam 33.1, was released in May 2020 and contains 18,259 families.

Examples of Some Structural Databases

The Cambridge Structural Database (CSD): The CSD was originally a project of the University of Cambridge, set up to collect the published three-dimensional structures of small organic molecules. All of these crystal structures have been obtained using X-ray or neutron diffraction techniques. For each entry in the CSD, three distinct types of information are stored: bibliographic information, chemical connectivity information, and the three-dimensional coordinates.

The Structural Classification of Proteins database (SCOP): SCOP is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. SCOP was created in 1994 in the Centre for Protein Engineering and the Laboratory of Molecular Biology. It was maintained by Alexey G. Murzin and his colleagues in the Centre for Protein Engineering until its closure in 2010 and subsequently at the Laboratory of Molecular Biology in Cambridge, England.

CATH: The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues, including Janet Thornton and David Jones. Domains are classified within the CATH structural hierarchy at four levels:
Class (C)
Architecture (A)
Topology/fold (T)
Homologous superfamily (H)

CluSTr (Clusters of SWISS-PROT and TrEMBL proteins): This database offers an automatic classification of the entries in the SWISS-PROT and TrEMBL databases into groups of related proteins. The clustering is based on the analysis of all pairwise comparisons between protein sequences.

ProDom (protein domain database): This database is a compilation of homologous domains that have been automatically identified by sequence comparison and clustering methods using the program PSI-BLAST. The focus is on finding complete, self-contained structural domains, and the search methods include signals for such features.
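The four-level CATH hierarchy is commonly written as a dotted identifier with one number per level, in the order C.A.T.H. The sketch below splits such an identifier into its levels. The names `CathId` and `parse_cath_id` are hypothetical and the example identifier is used only for illustration; this is not part of any CATH software.

```python
from typing import NamedTuple


class CathId(NamedTuple):
    """One number for each level of the CATH hierarchy."""
    class_level: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_id(code: str) -> CathId:
    """Split a dotted C.A.T.H identifier into its four numeric levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathId(*(int(part) for part in parts))


# Illustrative identifier written in C.A.T.H order.
domain = parse_cath_id("3.40.50.720")
print(domain.class_level, domain.architecture, domain.topology,
      domain.homologous_superfamily)
```

Keeping the levels as separate fields makes it straightforward to group domains at any depth, for example all domains sharing the same class and architecture.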

Retrieval Databases

Data retrieval: Data retrieval is the process of identifying and extracting data from a database, based on a query provided by the user or an application. Three retrieval systems are described here; they differ in the databases they search and the links they have to other information.

Sequence Retrieval System (SRS): SRS is a homogeneous interface to over 80 biological databases, developed at the European Bioinformatics Institute (EBI) at Hinxton. It includes databases of sequences, metabolic pathways, transcription factors, application results (such as BLAST, SSEARCH, and FASTA), protein 3-D structures, genomes, mappings, mutations, and locus-specific mutations.

Entrez: Entrez is a molecular biology database and retrieval system developed by the National Center for Biotechnology Information (NCBI). It is the entry point for exploring distinct but integrated databases.

DBGET: DBGET is an integrated database retrieval system for handling the web of molecular biology databases. It is used as a backbone system in GenomeNet and KEGG and was developed at the University of Tokyo. It provides access to 20 databases, one at a time.
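As a rough illustration of query-based retrieval from Entrez, the sketch below builds a request URL for NCBI's E-utilities `efetch` service, the programmatic interface to the Entrez databases. The helper name `build_efetch_url` and the example accession are assumptions made for illustration. The code only constructs the URL and does not contact NCBI; anyone sending real requests should follow NCBI's usage guidelines.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(db: str, record_id: str,
                     rettype: str = "fasta", retmode: str = "text") -> str:
    """Build an efetch URL asking one Entrez database for one record.

    Hypothetical helper: it assembles the query string and nothing more.
    """
    params = {
        "db": db,            # Entrez database to query, e.g. "nuccore" or "protein"
        "id": record_id,     # accession or UID of the record
        "rettype": rettype,  # requested record type
        "retmode": retmode,  # requested output encoding
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Example accession chosen for illustration only.
url = build_efetch_url("nuccore", "NM_000546")
print(url)
```

The query here is simply a database name plus a record identifier; the same pattern extends to other Entrez databases by changing the `db` value.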

BLAST and FASTA

BLAST (Basic Local Alignment Search Tool): A BLAST search enables a researcher to compare a query protein or nucleotide sequence with a library or database of sequences, and to identify library sequences that resemble the query sequence above a certain threshold.

FASTA: FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985.

FASTA format: The FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows sequence names and comments to precede the sequences.

A sequence in FASTA format consists of:
One line starting with a ">" sign, followed by a sequence identification code.
One or more lines containing the sequence itself.
A file in FASTA format may comprise more than one sequence. The FASTA format is sometimes also referred to as the "Pearson" format (after the author of the FASTA program and of the format).
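The structure described above can be read with a few lines of code. The following is a minimal sketch of a FASTA reader, assuming the text is already in memory as a string; the function name `parse_fasta` and the toy records are hypothetical, and established libraries handle many more edge cases than this does.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text.

    Minimal sketch: a line starting with '>' opens a new record, and the
    lines that follow are joined to form that record's sequence.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header = line[1:].strip()     # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:                # close the final record
        records.append((header, "".join(chunks)))
    return records


# Toy input with two made-up records.
sample = """>seq1 example nucleotide record
ATGCGT
GGTA
>seq2 example protein record
MKVLA
"""

for name, sequence in parse_fasta(sample):
    print(name, len(sequence))
```

Because each ">" line starts a new record, the same function handles files with one sequence or many.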

REFERENCES
https://www.toppr.com/guides/maths/statistics/frequency-distribution/
https://www.enago.com/academy/biological-databases-an-overview-and-future-perspectives/
Biotechnology: Expanding Horizons by B.D. Singh, Kalyani Publishers, reprinted 2016, pages 736-743.
A Textbook of Bioinformatics by Sharma, Munjal, Shankar, Rastogi Publications, pages 153-160.
https://www.slideshare.net/vidhyakalaivani29/major-databases-in-bioinformatics-71778405