Sequence and Structural Databases of DNA and Protein, and Their Significance in Scientific Research

Nov 25, 2021

About This Presentation

ATGCU + 0s & 1s = Discoveries 101
Biological databases/ programming languages/ bioinformatic tools/


Slide Content

S. Bituila, II M.Sc.
Sequence and structural databases of DNA and protein, and their significance in scientific research.

DNA Databases:
- Sequence Databases
- Structural Databases

DNA Sequence Databases:
- NCBI
- EMBL
- DDBJ
- Ensembl
- GenBank
- EBI
- UniGene

NCBI (National Center for Biotechnology Information)
Established in 1988.
It aims to create public databases, develop software tools for sequence analysis and disseminate biomedical information, mainly to aid research in computational biology.
Roles:
- Maintains several biological databases, e.g. GenBank, the nucleic acid sequence database.
- Provides a data retrieval system (e.g. Entrez).
- Provides computational resources for the analysis of GenBank data and a variety of other biological databases.

Tools available in NCBI: Entrez, BLAST (standard BLAST, MegaBLAST, PSI-BLAST, RPS-BLAST)
Types of databases:
- Nucleotide database
- Literature database
- Protein database
- Gene expression
- Structural database
- Chemical and others
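As an illustration of the Entrez retrieval system mentioned above, the following sketch builds a request URL for NCBI's E-utilities `efetch` service, which is the programmatic interface to Entrez. It only constructs the URL and does not contact NCBI. The function name `build_efetch_url` and the accession `NM_000546` are choices made for this example and are not taken from the slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (the programmatic interface to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(accession, db="nucleotide", rettype="fasta"):
    """Return an EFetch URL for one record. No network access happens here."""
    params = {
        "db": db,            # Entrez database to query
        "id": accession,     # accession number or UID of the record
        "rettype": rettype,  # requested record format
        "retmode": "text",   # plain-text response
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Example accession chosen for illustration only (an assumption, not from the slides).
print(build_efetch_url("NM_000546"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the record as FASTA text, subject to NCBI's usage guidelines.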

EMBL (European Molecular Biology Laboratory)
Established in 1974 by Leo Szilard, James Watson and John Kendrew.
Roles:
- Incorporates, organizes and distributes nucleotide sequences from public sources.
- Performs basic research in molecular biology and medicine, and trains scientists, students and visitors.
Tools:
- PPSearch, GeneQuiz, FASTA, DALI, BLAST-2, Radar, DaliLite etc.

DDBJ (DNA Data Bank of Japan)
Established in 1986.
Roles:
- Collects nucleotide sequence data and makes it freely available.
- Provides a supercomputer system to support research activities in the life sciences.
Tools:
- getentry, SRS, TXSearch, LIBRA, GIB.

Ensembl:
Launched in 1999 in response to the imminent completion of the Human Genome Project.
A joint project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute.
- It aims to provide a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of our own species, other vertebrates and model organisms.
- Provides genome databases for vertebrates and other eukaryotic species.
- It is one of the best-known genome browsers for the retrieval of genomic information.
- Plays a major role in the ENCODE (Encyclopedia of DNA Elements) Consortium project.
Tools: BLAST, Data Slicer, Variant Effect Predictor, Assembly Converter etc.

GenBank:
Started in 1982 by Walter Goad and Los Alamos National Laboratory.
Produced and maintained by the National Center for Biotechnology Information (NCBI) as part of the International Nucleotide Sequence Database Collaboration (INSDC).
Roles:
- An open-access, annotated collection of all publicly available nucleotide sequences and their protein translations.
- Provides and encourages access within the scientific community.
Tools: BarSTool, Sequin, BLAST

EBI (European Bioinformatics Institute):
Its roots go back to 1980, when the EMBL Nucleotide Sequence Data Library was established.
EMBL-EBI is a centre for research and services in bioinformatics, and is part of the European Molecular Biology Laboratory (EMBL).
It hosts a number of publicly open, free-to-use life sciences resources, including biomedical databases, analysis tools and bio-ontologies, such as:
- ArrayExpress: an archive of gene expression experiments.
- BioModels: a database of computational models relevant to the life sciences.
- BioStudies: a database that serves as a generic data archive at EMBL-EBI for biomolecular datasets.
- European Nucleotide Archive (ENA): a resource for nucleotide sequencing information.

UniGene:
It is an NCBI database of the transcriptome and thus, despite the name, not primarily a database of genes.
It provides information on protein similarities, gene expression, cDNA clones and genomic location.

DNA Structural Databases:

RNase P Database:
- A compilation of RNase P sequences, sequence alignments, secondary structures, three-dimensional models and accessory information.
- Also contains secondary structures of bacterial and archaeal RNAs, including specially annotated 'reference' secondary structures of the E. coli and Bacillus subtilis RNase P RNAs, a minimum phylogenetic consensus structure, and coordinates for models of three-dimensional structure.

Protein Databases:
- Protein Sequence Databases
- Protein Structural Databases

Protein Sequence Databases:
- PIR
- Swiss-Prot
- TrEMBL
- iProClass
- Pfam

PIR (Protein Information Resource):
Established in 1984 by the National Biomedical Research Foundation (NBRF).
Roles:
- A source of annotated protein databases and analysis tools for researchers.
- Provides an introduction to a range of biological databases, highlights the distinction between different data types and indicates where the most important resources are maintained.
- Supports genomic and proteomic research and scientific discovery.

PIR is split into four sections:
- PIR1: contains fully classified and annotated entries.
- PIR2: includes preliminary entries, which have not been thoroughly reviewed and may contain redundancy.
- PIR3: contains unverified entries, which have not been reviewed.
- PIR4: entries fall into one of four categories:
  - conceptual translations of artefactual sequences
  - conceptual translations of sequences that are not transcribed or translated
  - protein sequences or conceptual translations that are extensively genetically engineered
  - sequences that are not genetically encoded and not produced on ribosomes.

Swiss-Prot:
Founded in 1986 by Amos Bairoch, developed by the Swiss Institute of Bioinformatics and subsequently by Rolf Apweiler at the EBI.
- Provides high-level annotations, including descriptions of the function of the protein, the structure of its domains, its post-translational modifications, variants etc.
- Minimal redundancy and integration with other databases.

TrEMBL (Translated EMBL)
Founded in 1996 as a computer-annotated supplement to Swiss-Prot.
Contains translations of all coding sequences present in the EMBL, GenBank and DDBJ nucleotide sequence databases, as well as protein sequences extracted from the literature or submitted to Swiss-Prot.
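The idea behind a translated database such as TrEMBL is conceptual translation: reading a coding sequence codon by codon and mapping each codon to an amino acid. The sketch below shows that step with the standard genetic code. It is a simplified illustration, not the TrEMBL pipeline; the function name `translate` and the choice to stop at the first stop codon are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for all three codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence (DNA or RNA) into a protein string.

    Translation stops at the first stop codon; codons containing
    unrecognised characters are reported as 'X'.
    """
    seq = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*":  # stop codon ends the conceptual translation
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # ATG -> M, GCC -> A, TAA -> stop
```

Real annotation pipelines also have to handle alternative genetic codes, reading frames and partial sequences, which this sketch leaves out.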

iProClass (Integrated Protein Knowledgebase)
- First released in 2000.
- Provides a comprehensive description of protein family, function and structure for UniProt protein sequences.
- Contains value-added descriptions of proteins, including family relationships at global and local levels.
- Serves as a framework for data integration in a distributed networking environment.
- Can also be used to support protein sequence annotation and genomic/proteomic research, and to obtain comprehensive, up-to-date information on proteins.

Uses:
iProClass provides two types of protein sequence reports.
- The first type covers information on genetics, gene, family, structure, function, taxonomy and literature, with cross-references to molecular databases.
- The second type presents PIR superfamily membership information with length, taxonomy and keyword statistics.
It also provides links to various molecular biology databases.

Pfam
Founded in 1995 by Erik Sonnhammer, Sean Eddy and Richard Durbin as a collection of commonly occurring protein domains that could be used to annotate the protein-coding genes of multicellular animals.
- It is a database of protein families.
- Includes annotations and multiple sequence alignments of protein families generated using hidden Markov models.
- The general purpose of the Pfam database is to provide a complete and accurate classification of protein families.
- It has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions.

Uses:
- By experimental biologists researching specific proteins.
- By structural biologists to identify new targets for structure determination.
- By computational biologists to organize sequences.
- By evolutionary biologists for tracing the origins of proteins.
It also allows users to submit protein or DNA sequences to search for matches to families in the database.

Structural Databases of Protein:
- PDB
- CATH
- SCOP
- Gene3D
- DBAli
- E-MSD

PDB (Protein Data Bank):
Established in 1971 by Brookhaven National Laboratory, New York.
It is a database of three-dimensional structural data for large biological molecules, such as proteins and nucleic acids.
Roles:
- It is a key resource in areas of structural biology, such as structural genomics.
- Provides protein structures to many other databases, e.g. SCOP and CATH.
Tools:
- ADIT (AutoDep Input Tool), pdb_extract, OOSTAR, OpenRasMol, CIFTr, MAXIT, Biopython, mmLib, XML2PDB
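To make "three-dimensional structural data" concrete, the sketch below reads one atom record from the legacy PDB flat-file format, where each field sits in fixed columns. The column positions follow the published PDB format for ATOM/HETATM records; the `Atom` class, the function name `parse_atom_line` and the sample line are illustrative assumptions, and the sketch ignores fields such as occupancy and B-factor.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int     # atom serial number
    name: str       # atom name, e.g. "CA"
    res_name: str   # residue name, e.g. "MET"
    chain_id: str   # chain identifier
    res_seq: int    # residue sequence number
    x: float        # orthogonal coordinates in angstroms
    y: float
    z: float


def parse_atom_line(line):
    """Parse one fixed-width ATOM/HETATM record; return None for other records."""
    if not line.startswith(("ATOM", "HETATM")):
        return None
    return Atom(
        serial=int(line[6:11]),       # columns 7-11
        name=line[12:16].strip(),     # columns 13-16
        res_name=line[17:20].strip(), # columns 18-20
        chain_id=line[21],            # column 22
        res_seq=int(line[22:26]),     # columns 23-26
        x=float(line[30:38]),         # columns 31-38
        y=float(line[38:46]),         # columns 39-46
        z=float(line[46:54]),         # columns 47-54
    )


if __name__ == "__main__":
    # Hypothetical sample record, spaced to match the fixed columns above.
    sample = "ATOM      1  N   MET A   1      27.340  24.430   2.614"
    print(parse_atom_line(sample))
```

For real work, a library such as Biopython (listed among the tools above) is the more robust choice, since it also handles the newer mmCIF format.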

CATH (Class, Architecture, Topology and Homology)
Created in the mid-1990s by Professor Christine Orengo and colleagues, including Janet Thornton and David Jones, at University College London.
- It is a protein structure classification database and shares many broad features with the SCOP resource.
- It provides information on the evolutionary relationships of protein domains.
Levels:
- Class: domains are assigned according to their secondary structure content.
- Architecture: information on the arrangement of secondary structures in three-dimensional space is used for assignment. It describes the gross secondary structure content and packing.
- Topology: encompasses both the overall shape and the connectivity of secondary structures.
- Homology: groups domains that are thought to share a common ancestor, for example those with more than 35% sequence identity.

The four levels of the CATH hierarchy:
#   Level                     Description
1.  Class                     The overall secondary structure content of the domain.
2.  Architecture              High structural similarity but no evidence of homology.
3.  Topology                  A large-scale grouping of topologies which share particular structural features.
4.  Homologous superfamily    Indicative of a demonstrable evolutionary relationship.
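The four levels above map directly onto CATH identifiers, which are commonly written as four dot-separated numbers (class.architecture.topology.homologous superfamily). The sketch below splits such an identifier into its levels. The function name `parse_cath_id`, the class-name table and the example identifier are assumptions added for illustration and do not come from the slides.

```python
# Names of the main CATH classes (added for illustration, not from the slides).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


def parse_cath_id(cath_id):
    """Split a C.A.T.H identifier into the code for each level of the hierarchy."""
    parts = cath_id.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {cath_id!r}")
    c, a, t, h = parts
    return {
        "class": c,
        "class_name": CATH_CLASSES.get(int(c), "unknown"),
        "architecture": f"{c}.{a}",
        "topology": f"{c}.{a}.{t}",
        "homologous_superfamily": f"{c}.{a}.{t}.{h}",
    }


# Example identifier used only to show the output shape.
for level, code in parse_cath_id("3.40.50.300").items():
    print(f"{level}: {code}")
```

Each level's code is a prefix of the next, which is what makes the classification a strict hierarchy: two domains in the same homologous superfamily necessarily share class, architecture and topology.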

SCOP (Structural Classification of Proteins)
Created in 1994 at the Centre for Protein Engineering and the Laboratory of Molecular Biology.
Roles:
- Describes the structural and evolutionary relationships between proteins of known structure.
- Provides a broad survey of all known protein folds, detailed information about the close relatives of any particular protein, and a framework for future research and classification.

E-MSD
Established in 1996.
- Provides clean macromolecular structure data.
- Accepts and processes depositions to the PDB.
- Transforms the PDB flat-file archive into a relational database system.
- Manages and distributes data on molecular structures in close collaboration with the PDB.
Tools: AutoDep and EMDep

Gene3D:
- Provides structural annotation for proteins in the CATH sequence database.
- Uses the information in CATH to predict the locations of structural domains on millions of protein sequences available in public databases.
- Provides comprehensive structural and functional annotation of most available protein sequences, including the UniProt, RefSeq and Integr8 resources.

References:
- Bioinformatics by Sabu M Thampi
- Bioinformatics by Dardel
- Bioinformatics for Biologists by Dr. Murtada Alshareifi
- https://bioinf.comav.upv.es

Thank you