INTRODUCTION
WHAT ARE SEQUENCE SUBMISSION TOOLS?
Submission tools are web-based or standalone software that help researchers submit new biological data or update existing sequence data in the available biological databases. Sequence data is shared between several databases on a regular basis. After submission to a particular database, each new sequence or data record is assigned an accession number, which is the same across all databases.
Every database has its own sequence submission portal:
NCBI: three ways of submission (Sequin, BankIt, tbl2asn)
EBI-EMBL European Nucleotide Archive: Webin
DDBJ: Nucleotide Sequence Submission System (NSSS)
NCBI data submission system:
BankIt is used to submit data to GenBank. It accepts genomic DNA (protein-coding genes), transcripts (e.g. mRNA), or small genomes (organelles, plasmids, and phages) from any organism. A BankIt submission can include:
A single sequence
A few unrelated sequences, or a few sequences with different features and/or source information
A large set of sequences with a small number of the same features or source information
A small batch of sequences with a small number of features or source information
BankIt can only be used to submit simple types of biological data. It is suitable only when the data does not involve complicated annotations, and it cannot be used when advanced sequence analysis tools are required. The following categories of information are necessary for sequence submission:
Reference information: authors' names, publication
Source information: organism genus and species, taxonomic lineage, cultured/uncultured
Source category: original sequence or third-party sequence
Features: exon, intron
Sequence: in FASTA format and at least 200 nucleotides long
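The sequence requirement above (FASTA format, at least 200 nucleotides) can be checked locally before a submission is attempted. The following is a minimal sketch, not part of BankIt itself: the function names parse_fasta and check_records are hypothetical, and the allowed alphabet (IUPAC nucleotide codes, no gap characters) is an assumption rather than a statement of what BankIt accepts.

```python
MIN_LENGTH = 200  # minimum length stated in the slide
ALLOWED = set("ACGTURYSWKMBDHVN")  # assumption: IUPAC nucleotide codes, no gaps


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def check_records(text, min_length=MIN_LENGTH):
    """Return (header, problem) tuples; an empty list means no problems were found."""
    problems = []
    for header, seq in parse_fasta(text):
        if len(seq) < min_length:
            problems.append((header, f"too short: {len(seq)} < {min_length}"))
        bad = sorted(set(seq) - ALLOWED)
        if bad:
            problems.append((header, f"unexpected characters: {''.join(bad)}"))
    return problems
```

A check like this only catches formatting problems; the reference, source, and feature information listed above still has to be supplied through the submission form.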
SEQUIN
Sequin is an interactive, graphically oriented program based on screen forms and controlled vocabularies that guides you through the process of entering your sequence and providing biological and bibliographic annotation. Sequin is designed to simplify the sequence submission process and to provide increased data-handling capabilities, accommodating very long sequences and complex annotations with robust error checking. Sequin is used for submitting, editing, and updating both nucleotide and protein sequence data to NCBI, EMBL, and DDBJ.
Sequin has the capacity to handle long sequences and sets of sequences such as:
Segmented entries
Multiple annotations
Population, phylogenetic, and mutation studies
Sequin is more sophisticated software and has advanced features such as:
Graphical viewing
Automatic annotation of complex sequences
Built-in validation functions for enhanced quality assurance, and better editing features
DATABASES
Biological databases store and organize biological data for easy retrieval of information. These centralized resources contain DNA and protein sequences and their associated information. Primary databases store raw sequence data and make it publicly available. Primary databases alone may not provide all the necessary information, as they often contain minimal annotation. Secondary databases provide an added layer of information by curating, processing, and analyzing the raw data from primary databases.
Secondary databases are derived from primary databases and include manually curated or computationally processed information. The amount of computational processing in secondary databases varies greatly, depending on the level of information they provide. Some secondary databases may simply archive translated sequence data, while others may provide extensive annotations and information on structure and function.
PROSITE
PROSITE is a database of protein families, domains, and functional sites that contains manually curated information on amino acid patterns and profiles of proteins. It is a secondary protein database that provides tools for the analysis of protein sequences and the identification of motifs. The database contains a large collection of signature patterns and profiles of biological importance. Each signature is associated with important biological information such as protein family, domain, or functional site. PROSITE uses two types of signatures, patterns and generalized profiles, to identify conserved regions. These signatures can be used to predict the function and structure of proteins and help in the annotation of new protein sequences.
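To illustrate how a pattern-type signature identifies a conserved site, the sketch below translates a pattern written in PROSITE's notation into a Python regular expression and scans a sequence with it. It is a simplified illustration, not PROSITE's own scanning software: the function names prosite_to_regex and find_sites are hypothetical, the example sequence is made up, and rarer syntax (such as an anchor placed inside square brackets) is not handled. The pattern used in the usage note, N-{P}-[ST]-{P}, is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern such as 'N-{P}-[ST]-{P}' into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":  # x means any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # {..} lists excluded residues
            core = "[^" + core[1:-1] + "]"
        # [..] (allowed residues) and single letters are already valid regex.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return (1-based start, matched text) for every match, including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]
```

For example, find_sites("N-{P}-[ST]-{P}", "MKNGSAANPTQ") reports the site starting at position 3 (NGSA) and skips the N at position 8, because it is followed by a proline.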
PRINTS
The PRINTS database contains protein family fingerprints, which are groups of motifs. It provides groups of aligned, unweighted sequence motifs, or fingerprints. It serves as a diagnostic resource for newly determined sequences: it uses groups of motifs to build a characteristic signature for each family. The PRINTS fingerprinting method detects distant relatives of large and highly divergent protein superfamilies by exploiting conserved regions within sequence alignments.
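The idea of a fingerprint as a group of motifs that should all be found, in order, in a family member can be sketched as follows. This is a toy illustration under stated assumptions, not the PRINTS scanning algorithm: each motif is turned into an unweighted residue-count matrix, the scoring rule and the 0.6 threshold are arbitrary choices, and the names frequency_matrix, best_hit, and match_fingerprint are hypothetical.

```python
from collections import Counter


def frequency_matrix(motif):
    """Per-column residue counts for one ungapped motif (equal-length strings)."""
    return [Counter(column) for column in zip(*motif)]


def best_hit(matrix, sequence):
    """Return (score, start) of the best window; score is the sum of column counts."""
    width = len(matrix)
    best = (-1, -1)  # stays (-1, -1) when the sequence is shorter than the motif
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col[res] for col, res in zip(matrix, window))
        if score > best[0]:
            best = (score, start)
    return best


def match_fingerprint(motifs, sequence, min_fraction=0.6):
    """Return 0-based start positions of motifs found in order without overlap."""
    hits, offset = [], 0
    for motif in motifs:
        matrix = frequency_matrix(motif)
        max_score = sum(max(col.values()) for col in matrix)
        score, start = best_hit(matrix, sequence[offset:])
        if start >= 0 and score >= min_fraction * max_score:
            hits.append(offset + start)
            offset += start + len(matrix)  # the next motif must lie downstream
    return hits
```

Under this sketch, a sequence matches the whole fingerprint when the number of returned positions equals the number of motifs; a partial match, where only some motifs are found, may still hint at a distant relative.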
BLOCKS
BLOCKS is a collection of ungapped multiple alignments of segments of related protein sequences, called blocks, that represent the most conserved regions of proteins. It contains blocks for a wide variety of protein families, including enzymes, receptors, transporters, and structural proteins. Each block is assigned a unique identifier and annotated with information about the proteins it represents, including their names, functions, and structures. The database is widely used as a tool for protein family classification, protein structure prediction, and functional annotation.
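Because a block is an ungapped alignment, every segment has the same length and each column can be summarized directly. The sketch below computes a consensus and a simple per-column conservation value (the fraction of segments sharing the most common residue). The function name column_profile and the example segments are made up for illustration, and this conservation measure is an assumption, not the weighting scheme used by the BLOCKS database.

```python
from collections import Counter


def column_profile(block):
    """Return (consensus, conservation) for an ungapped block of equal-length segments."""
    if len({len(segment) for segment in block}) != 1:
        raise ValueError("all segments in a block must have the same length")
    consensus, conservation = [], []
    for column in zip(*block):
        # Most common residue in this column and how many segments carry it.
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue)
        conservation.append(count / len(block))
    return "".join(consensus), conservation


# Usage with three made-up segments.
consensus, conservation = column_profile(["HCDGK", "HCEGK", "HCDGR"])
```

Columns with a conservation of 1.0 are identical in every segment, which is the kind of position a block is meant to capture.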
APPLICATIONS OF SECONDARY DATABASES
Secondary databases can be used to predict the structure and function of proteins by identifying homologous proteins with known structures.
Secondary databases contain functional annotation information, which helps to better understand the roles of proteins in different organisms.
Secondary databases help to identify conserved regions within a sequence, which can reveal important functional domains and motifs.
Secondary databases support evolutionary analysis by comparing protein sequences across different species to study the evolution of proteins.
Secondary databases can be used to identify potential drug targets by analyzing protein families and identifying conserved motifs that are essential for protein function.