Finding sequence records with NCBI's Entrez Nucleotides

abrarhaider15 25 views 49 slides Jun 02, 2024

About This Presentation

GENBANK related information


Slide Content

Finding sequence records with NCBI's Entrez Nucleotides

Limit the search

Limit to genomic DNA and GenBank
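The slides apply these limits through the Entrez web form. As a rough programmatic sketch, the same kind of limited search can be expressed as a URL for NCBI's E-utilities ESearch service. This swaps in E-utilities for the web interface shown in the deck; the helper name `build_search_url` is hypothetical, and the two limit terms are assumed spellings that should be checked against the Entrez help pages before use.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(query, limits=(), retmax=20):
    """Join a query and its limit terms with AND and return an ESearch URL.

    Hypothetical helper for illustration; it only builds the URL and does
    not contact NCBI.
    """
    term = " AND ".join([query, *limits])
    params = {"db": "nucleotide", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Assumed limit terms for "genomic DNA" and "GenBank only"; verify the exact
# property names in the Entrez documentation.
url = build_search_url(
    "Limulus polyphemus[Organism]",
    limits=["biomol_genomic[PROP]", "srcdb_genbank[PROP]"],
)
print(url)
```

Keeping the limits as a separate list mirrors the web form, where the query box and the limit settings are filled in independently.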

Retrieve sequence: click the accession number to retrieve the sequence in GenBank format

Some useful search fields and limits

Primary (1°) Sequence Database: GenBank
- Nucleotide-only sequence database
- Archival in nature

Submission of GenBank data to NCBI
- Direct submissions of individual records via the Web (BankIt, Sequin)
- Batch submissions of bulk sequences via email (EST, GSS, STS)
- FTP accounts for sequencing centers

[Figure: growth of GenBank, 1983 to 2004. Axes: sequence records (millions) and total base pairs (billions).]
Release 143: 37.3 million records, 41.8 billion nucleotides. Average doubling time ≈ 14 months.
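The doubling time quoted on this slide implies simple exponential growth. The sketch below only illustrates that arithmetic; it assumes the 14-month doubling time holds constant, which the slide does not claim for later years, and the function name is hypothetical.

```python
def projected_size(current, months_ahead, doubling_months=14.0):
    """Size after `months_ahead` months if it doubles every `doubling_months`."""
    return current * 2 ** (months_ahead / doubling_months)


# Starting from the 41.8 billion nucleotides of Release 143,
# one doubling time later the total would be twice as large.
print(projected_size(41.8, 14))
```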

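[Figure: The International Sequence Database Collaboration. GenBank (NCBI, NIH; searched with Entrez; submissions via Sequin, BankIt, and ftp), EMBL (EBI; searched with SRS), and DDBJ (NIG, CIB; searched with getentry) each receive submissions and updates and exchange them with one another.]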

Organization of GenBank: GenBank Divisions (gbdiv)

Records are divided into 17 divisions: 11 traditional, 5 high-throughput (bulk), and 1 patent (11 files). Numbers in parentheses are file counts.

Traditional divisions: direct submissions (Sequin and BankIt); accurate, well characterized
  PRI (28)   Primate
  PLN (12)   Plant and Fungal
  BCT (10)   Bacterial and Archaeal
  INV (6)    Invertebrate
  ROD (13)   Rodent
  VRL (3)    Viral
  VRT (7)    Other Vertebrate
  MAM (1)    Mammalian (excluding ROD and PRI)
  PHG (1)    Phage
  SYN (1)    Synthetic (cloning vectors)
  UNA (1)    Unannotated

Bulk divisions: batch submissions (email and FTP); inaccurate, poorly characterized
  EST (335)  Expressed Sequence Tag
  GSS (116)  Genome Survey Sequence
  HTG (61)   High Throughput Genomic
  STS (5)    Sequence Tagged Site
  HTC (6)    High Throughput cDNA
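The division codes on this slide lend themselves to a small lookup table. The following is a minimal sketch that covers only the codes listed above; the patent division is left out because the slide gives no three-letter code for it, and the function name is hypothetical.

```python
# Division codes exactly as listed on the slide.
TRADITIONAL = {
    "PRI": "Primate", "PLN": "Plant and Fungal", "BCT": "Bacterial and Archaeal",
    "INV": "Invertebrate", "ROD": "Rodent", "VRL": "Viral",
    "VRT": "Other Vertebrate", "MAM": "Mammalian", "PHG": "Phage",
    "SYN": "Synthetic", "UNA": "Unannotated",
}
BULK = {
    "EST": "Expressed Sequence Tag", "GSS": "Genome Survey Sequence",
    "HTG": "High Throughput Genomic", "STS": "Sequence Tagged Site",
    "HTC": "High Throughput cDNA",
}


def describe_division(code):
    """Return (full name, group) for a division code, or None if unknown."""
    code = code.upper()
    if code in TRADITIONAL:
        return TRADITIONAL[code], "traditional"
    if code in BULK:
        return BULK[code], "bulk"
    return None


print(describe_division("INV"))
print(describe_division("EST"))
```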

File Formats of the Sequence Databases
Each sequence is represented by a text record called a flat file.
- GenBank/GenPept (useful for scientists)
- FASTA (the simplest format)
- ASN.1 and XML (useful for programmers)

A Traditional "GenBank" Record

LOCUS       AF062069  3808 bp  mRNA  INV  02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.

Callouts on the record:
- LOCUS line: accession number (AF062069), length, molecule type (mRNA = cDNA, DNA = genomic), division, and date of most recent modification
- DEFINITION = title
- ACCESSION and VERSION: accession number, Accession.Version (AF062069.2), and GI number (7144484)
- ORGANISM: NCBI's Taxonomy
- REFERENCE blocks: references

Lower down in the GenBank Record

FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal
                     myosin heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT  201 a 689 c 782 g 1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//

Callouts on the record:
- FEATURES: the feature table
- /protein_id="AAC16332.2" and /db_xref="GI:7144485": the GenPept protein ID

The Header

LOCUS AF111785 5925 bp mRNA PRI 01-SEP-1999

Fields, left to right: accession number (used as the locus name), length, molecule type, GenBank division code, date.
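To make the field order concrete, here is a minimal sketch that splits a LOCUS line into its parts. It assumes exactly the seven whitespace-separated fields shown on this slide; LOCUS lines in real records can carry extra fields that this hypothetical `parse_locus` helper does not handle.

```python
from datetime import date

# Month abbreviations as they appear in GenBank dates, mapped to 1..12.
MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}


def parse_locus(line):
    """Parse a LOCUS line of the form shown on the slide into a dict."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS layout: " + line)
    _, name, length, _, molecule, division, stamp = fields
    day, month, year = stamp.split("-")
    return {
        "locus_name": name,
        "length": int(length),
        "molecule_type": molecule,
        "division": division,
        "date": date(int(year), MONTHS[month.upper()], int(day)),
    }


record = parse_locus("LOCUS AF111785 5925 bp mRNA PRI 01-SEP-1999")
print(record)
```

The month table is spelled out by hand so that parsing does not depend on the locale settings of the machine running the code.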

Locus Name
- The first element on the LOCUS line
- Must begin with a letter
- Cannot be longer than 10 characters
- The accession number, whose uniqueness is ensured, can serve as the locus name, e.g. AF111785

Length
- Sequences range from 1 to 350,000 base pairs (bp) in a single record.
- GenBank and the other databases seldom accept sequences shorter than 50 bp.
- The 350 kb limit is a practical one: records longer than 350 kb are acceptable if the sequence represents a single gene.

Molecule Type
- Usually DNA or RNA; it can also indicate strandedness (single or double, written as ss or ds).
- The acceptable molecule types are DNA, RNA, tRNA, rRNA, mRNA, and uRNA; they are intended to represent the original biological molecule.

GenBank division code
A three-letter code, assigned either on taxonomic grounds or for other classification purposes.

Date
- The date on which the record was last made public.
- If the record has not been updated since it was first made public, this is the date of its first release.
- If any features or annotations were updated and the record was released again, this is the date of the most recent release.
- The record also contains a second date: the date on which it was submitted to the database.
- The databases make no claim that the dates are error free.

DEFINITION

DEFINITION Homo sapiens myosin heavy chain IIx/d mRNA, complete cds

- The definition line attempts to summarize the biology of the record, so that the biology and the source of the DNA are clear to both the user and the database staff.
- Generalized syntax for a genomic record: Genus species product name (gene symbol) gene, complete cds.
- Generalized syntax for an mRNA record: Genus species product name (gene symbol) mRNA, complete cds.

Rules applied to organelle sequences

DEFINITION Genus species protein x (xxx) gene, complete cds; [one choice from below]
OR
DEFINITION Genus species xxS ribosomal RNA gene, complete sequence; [one choice from below]

Choices:
- nuclear gene(s) for mitochondrial product(s)
- nuclear gene(s) for chloroplast product(s)
- mitochondrial gene(s) for mitochondrial product(s)
- chloroplast gene(s) for chloroplast product(s)

Genus-species name
- Full genus and species names are given in the definition line.
- Common names (e.g., human) and abbreviated genus names (e.g., H. sapiens) are no longer used.
- One organism is exempt from this rule: the human immunodeficiency virus is represented in the definition line as HIV1 or HIV2.

ACCESSION NO
- The accession number, on the third line of the record, is the primary key for referencing a given record in the database.
- Even if the sequence is updated, the accession number remains unchanged.
- It comes in one of two formats, "1+5" and "2+6": 1+5 is one uppercase letter followed by five digits, and 2+6 is two uppercase letters followed by six digits. Most new records are of the latter variety.
- Most GenBank records have only one accession number. Where more than one is shown, the first is the primary one.
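The two accession formats described above can be checked with a regular expression. This is a minimal sketch that recognizes only the "1+5" and "2+6" patterns named on the slide; any other accession format is reported as unrecognized, and the function name is hypothetical.

```python
import re

# One uppercase letter + five digits, or two uppercase letters + six digits.
ACCESSION_RE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def accession_style(accession):
    """Return "1+5" or "2+6" for a matching accession number, else None."""
    if ACCESSION_RE.fullmatch(accession) is None:
        return None
    return "1+5" if len(accession) == 6 else "2+6"


print(accession_style("U03518"))     # one letter, five digits
print(accession_style("AF111785"))   # two letters, six digits
print(accession_style("AF0620069"))  # seven digits: not a valid 2+6 accession
```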

VERSION

VERSION AF111785.1 GI:4808814

- The version line contains the Accession.version and the gi (GenInfo identifier).
- If the sequence changes, the version number in the Accession.version is incremented by one and the gi changes to the next available integer.
- The example shows accession number AF111785, version 1, and gi number 4808814.
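As an illustration of how the VERSION line breaks down, the sketch below extracts the accession, the version number, and the gi. It assumes the line layout shown on this slide (with or without a space after "GI:"); the helper name is hypothetical.

```python
import re

# Accession, a dot, the version number, then "GI:" followed by the gi.
VERSION_RE = re.compile(r"VERSION\s+(\w+)\.(\d+)\s+GI:\s*(\d+)")


def parse_version(line):
    """Return (accession, version, gi) from a VERSION line."""
    match = VERSION_RE.match(line.strip())
    if match is None:
        raise ValueError("not a VERSION line: " + line)
    accession, version, gi = match.groups()
    return accession, int(version), int(gi)


print(parse_version("VERSION AF111785.1 GI: 4808814"))
```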

KEYWORDS
NCBI discourages the use of keywords and includes them only on request.

SOURCE

SOURCE human
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

- The source line has either the common name of the organism or its scientific name.
- Older records may contain other source information in this field.

REFERENCE

REFERENCE 1 (bases 1 to 5925)
AUTHORS Weiss, A., Mc Donough…….
TITLE Organisation of human and mouse skeletal myosin heavy chain gene clusters is highly conserved
JOURNAL Proc. Natl. Acad. Sci. U.S.A. 96 (6), 2958-2963 (1999)
MEDLINE 99178997
PUBMED 10077619

Each GenBank record must have at least one reference or citation. In many cases there are two or more reference blocks.

REFERENCE
- A published reference carries a MEDLINE and a PubMed identifier.
- Other references may be annotated as unpublished or serve as placeholders for a publication.

REFERENCE 1 (bases 1 to 3291)
AUTHORS ………………….
TITLE CHIP, a widely expressed chromosomal protein required for …………..
JOURNAL Unpublished

REFERENCE 3 (bases 1 to 5925)
AUTHORS
TITLE Direct submission
JOURNAL Submitted (09-Dec-1998) MCDB, University of Colorado at Boulder, Campus Box 0347, Boulder, Colorado 80309-0347, USA

Comment
- The comment is the last part of the header section in the GenBank flat file (GBFF).
- It holds a great variety of notes and comments. Genome centers like to include their contact information and acknowledgements here.
- This section is optional and is not found in most GenBank records.
- It may also include e-mail addresses or URLs, but NCBI discourages this practice.
- It also contains information about the record's history: if the sequence of a record is updated, the comment contains a pointer to the previous version of the record.

COMMENT

In the current record, the comment points back to the version it replaced:
COMMENT On Dec 23, 1999 this sequence version replaced gi:4454562

If an earlier version of the record is retrieved, the comment points forward to the newer version of the sequence, and also backward if there was a still earlier version:
COMMENT [WARNING] On Dec 23, 1999 this sequence was replaced by a newer version gi:6633795

Feature Table
- The feature table is the most important direct representation of the biological information in the record.
- A full set of annotations facilitates quick extraction of the relevant biological features.
- The GenBank feature table documentation describes in great detail the legal features (i.e., the ones that are allowed) and which qualifiers are permitted with them.
- Unfortunately, this has often invited an excess of invalid, speculative, or computed annotations.
- In the NCBI data model, "features" are annotations on a part of the sequence, whereas annotations that describe the whole sequence are called "descriptors".

The Source Feature
- In the data model view, the source feature is really a descriptor (the BioSource, which refers to the whole sequence).
- It is the only feature that must be present on all GenBank records.
- All features have a series of legal qualifiers, some of which are mandatory (e.g., /organism for source).

The Source Feature

source 1..5925
  /organism="Homo sapiens"
  /db_xref="taxon:9606"
  /chromosome="17"
  /map="17p13.1"
  /tissue_type="skeletal muscle"
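Qualifiers follow a regular /name=value pattern, so they can be pulled out of a feature with a short regular expression. This is a minimal sketch, assuming plain ASCII double quotes and values without embedded quotes; it is not a complete feature-table parser, and `parse_qualifiers` is a hypothetical name.

```python
import re

# /name="quoted value"  or  /name=bare_value
QUALIFIER_RE = re.compile(r'/(\w+)=(?:"([^"]*)"|(\S+))')


def parse_qualifiers(text):
    """Collect qualifiers into a dict of lists (a qualifier may repeat)."""
    qualifiers = {}
    for name, quoted, bare in QUALIFIER_RE.findall(text):
        qualifiers.setdefault(name, []).append(quoted if quoted else bare)
    return qualifiers


source_feature = '''
/organism="Homo sapiens"
/db_xref="taxon:9606"
/chromosome="17"
/map="17p13.1"
/tissue_type="skeletal muscle"
'''
print(parse_qualifiers(source_feature))
```

Values are stored in lists because a qualifier such as /db_xref can appear several times on one feature.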

The CDS Feature
- The CDS feature is a set of instructions to the reader on how to join two sequences together, or on how to make an amino acid sequence from the indicated coordinates and the inferred genetic code.
- It uses database cross-references (db_xref). The list of db_xref databases is maintained by the International Nucleotide Sequence Database Collaboration.

/protein_id="CAH55620.1"
/db_xref="GI:56311022"

The CDS Feature

CDS <1..>405
  /gene="cagA"
  /codon_start=1
  /transl_table=11
  /product="cytotoxin associated protein CagA"
  /protein_id="CAH55620.1"
  /db_xref="GI:56311022"
  /db_xref="InterPro:IPR005169"
  /db_xref="UniProt/TrEMBL:Q5QRA2"
  /translation="FSDIRKELSEKLFGNSNNNNNGLKNNTEPIYAQVNKKKAGQAIS
  EEPIYAQVAKKVSAKIDQLNEATSAINRKIDRINKIASAGKGVGGFSGAGRSASPEP
  IYATIDFDEANQAGFPLRRSAAVNDLSKVGLSR"
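To show how coordinates plus a genetic code yield a protein, here is a minimal translation sketch. It assumes a simple start..end location on the forward strand (no join, no complement, no partial markers) and uses the standard codon assignments, which translation table 11 shares for every codon; the function name and its handling of unknown codons are illustrative choices, not part of GenBank.

```python
# Standard codon table, written in NCBI's T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(sequence, start, end, codon_start=1):
    """Translate the 1-based, inclusive span start..end of a nucleotide sequence.

    codon_start (1, 2, or 3) gives the first base of the first full codon,
    as in the /codon_start qualifier. Translation stops at the first stop codon.
    """
    cds = sequence[start - 1:end].upper().replace("U", "T")
    cds = cds[codon_start - 1:]
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE.get(cds[pos:pos + 3], "X")  # X = unknown codon
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# The first 15 coding bases of the thyroid peroxidase mRNA used in this deck.
print(translate_cds("ATGAGAGCGCTGGCT", 1, 15))
```

For a CDS such as 258..3302 with /codon_start=1, the call would be `translate_cds(sequence, 258, 3302)` once the full nucleotide sequence is loaded.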

Protein Sequence

/protein_id="CAH55620.1"
/db_xref="GI:56311022"

- NCBI assigns an accession number and a gi (GenInfo identifier) to all sequences.
- Each protein sequence is assigned a protein_id, or protein accession number.
- The format of this accession number is "3+5": three letters followed by five digits.
- Like the nucleotide accession number, the protein accession number is represented as Accession.version.
- When the protein sequence in the record changes, the version of the accession number is incremented by one and the gi also changes.
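The "3+5" protein accession format with its version suffix can be validated in the same spirit. A minimal sketch, assuming only the pattern described on this slide; the helper name is hypothetical.

```python
import re

# Three uppercase letters, five digits, a dot, and the version number.
PROTEIN_ID_RE = re.compile(r"([A-Z]{3}\d{5})\.(\d+)")


def split_protein_id(protein_id):
    """Return (accession, version) for a 3+5 protein_id such as CAH55620.1."""
    match = PROTEIN_ID_RE.fullmatch(protein_id.strip())
    if match is None:
        raise ValueError("not a 3+5 protein_id: " + protein_id)
    return match.group(1), int(match.group(2))


print(split_protein_id("CAH55620.1"))
print(split_protein_id("AAC16332.2"))
```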

Data Formats
- A vast number of sequence databases have been created in the last decade.
- So that both humans and computers can read them, the databases are designed around a flat file format.
- Some flat file formats: FASTA, GenBank/EMBL/DDBJ, SWISS-PROT, Pfam, PROSITE.

Main file formats used in Bioinformatics

ASN 1: Abstract Syntax Notation 1 used by NCBI

EMBL/Swiss Prot ( http://www.ebi.ac.uk/help/formats_frame.html)

FASTA Format
- The most common sequence format.
- The first line consists of ">" followed by an identifier, which contains no white space.
- More information can be added to the definition line after the identifier without breaking this rule.
- A FASTA file may contain more than one sequence entry. The entries are simply concatenated, with each ">"-prefixed line marking the start of a new entry.
- It is recommended that all lines of text be shorter than 80 characters.

FASTA format

>gi|4680721|gb|AAA61217.2| thyroid peroxidase [Homo sapiens]
MRALAVLSVTLVMACTEAFFPFISRGKELLWGKPEESRVSSVLEESKRLVDTAMYATMQRNLKKRGILSG
AQLLSFSKLPEPTSGVIARAAEIMETSIQAMKRKVNLKTQQSQHPTDALSEDLLSIIANMSGCLPYMLPP
KCPNTCLANKYRPITGACNNRDHPRWGASNTALARWLPPVYEDGFSQPRGWNPGFLYNGFPLPPVREVTR
HVIQVSNEVVTDDDRYSDLLMAWGQYIDHDIAFTPQSTSKAAFGGGSDCQMTCENQNPCFPIQLPEEARP
AAGTACLPFYRSSAACGTGDQGALFGNLSTANPRQQMNGLTSFLDASTVYGSSPALERQLRNWTSAEGLL
RVHGRLRDSGRAYLPFVPPRAPAACAPEPGNPGETRGPCFLAGDGRASEVPSLTALHTLWLREHNRLAAA
LKALNAHWSADAVYQEARKVVGALHQIITLRDYIPRILGPEAFQQYVGPYEGYDSTANPTVSNVFSTAAF
RFGHATIHPLVRRLDASFQEHPDLPGLWLHQAFFSPWTLLRGGGLDPLIRGLLARPAKLQVQDQLMNEEL
TERLFVLSNSSTLDLASINLQRGRDHGLPGYNEWREFCGLPRLETPADLSTAIASRSVADKILDLYKHPD
NIDVWLGGLAENFLPRARTGPLFACLIGKQMKALRDGDWFWWENSHVFTDAQRRELEKHSLSRVICDNTG
LTRVPMDAFQVGKFPEDFESCDSITGMNLEAWRETFPQDDKCGFPESVENGDFVHCEESGRRVLVYSCRH
GYELQGREQLTCTQEGWDFQPPLCKDVNECADGAHPPCHASARCRNTKGGFQCLCADPYELGDDGRTCVD
...
>gi|4680720|gb|M17755.2|HUMTPOC Homo sapiens thyroid peroxidase (TPO) mRNA, complete cds
GAGGCAATTGAGGCGCCCATTTCAGAAGAGTTACAGCCGTGAAAATTACTCAGCAGTGCAGTTGGCTGAG
AAGAGGAAAAAAGAATGAGAGCGCTGGCTGTGCTGTCTGTCACGCTGGTTATGGCCTGCACAGAAGCCTT
CTTCCCCTTCATCTCGAGAGGGAAAGAACTCCTTTGGGGAAAGCCTGAGGAGTCTCGTGTCTCTAGCGTC
TTGGAGGAAAGCAAGCGCCTGGTGGACACCGCCATGTACGCCACGATGCAGAGAAACCTCAAGAAAAGAG
GAATCCTTTCTGGAGCTCAGCTTCTGTCTTTTTCCAAACTTCCTGAGCCAACAAGCGGAGTGATTGCCCG
AGCAGCAGAGATAATGGAAACATCAATACAAGCGATGAAAAGAAAAGTCAACCTGAAAACTCAACAATCA
CAGCATCCAACGGATGCTTTATCAGAAGATCTGCTGAGCATCATTGCAAACATGTCTGGATGTCTCCCTT
ACATGCTGCCCCCAAAATGCCCAAACACTTGCCTGGCGAACAAATACAGGCCCATCACAGGAGCTTGCAA
CAACAGAGACCACCCCAGATGGGGCGCCTCCAACACGGCCCTGGCACGATGGCTCCCTCCAGTCTATGAG
GACGGCTTCAGTCAGCCCCGAGGCTGGAACCCCGGCTTCTTGTACAACGGGTTCCCACTGCCCCCGGTCC
GGGAGGTGACAAGACATGTCATTCAAGTTTCAAATGAGGTTGTCACAGATGATGACCGCTATTCTGACCT
CCTGATGGCATGGGGACAATACATCGACCACGACATCGCGTTCACACCACAGAGCACCAGCAAAGCTGCC
...
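Because a FASTA file is just ">" header lines followed by sequence lines, a reader for it takes only a few lines of code. The following is a minimal sketch of such a reader; the function names are hypothetical, and it treats the first whitespace-delimited token of the header as the identifier, as the format description above suggests.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous entry, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header):
    """Split a header into (identifier, description) at the first space."""
    identifier, _, description = header.partition(" ")
    return identifier, description


sample = """>gi|4680721|gb|AAA61217.2| thyroid peroxidase [Homo sapiens]
MRALAVLSVTLVMACTEAFFPFISRGKELLWGKPEESRVSSVLEESKRLVDTAMYATMQRNLKKRGILSG
AQLLSFSKLPEPTSGVIARAAEIMETSIQAMKRKVNLKTQQSQHPTDALSEDLLSIIANMSGCLPYMLPP
>gi|4680720|gb|M17755.2|HUMTPOC Homo sapiens thyroid peroxidase (TPO) mRNA, complete cds
GAGGCAATTGAGGCGCCCATTTCAGAAGAGTTACAGCCGTGAAAATTACTCAGCAGTGCAGTTGGCTGAG
"""
for header, sequence in parse_fasta(sample):
    print(split_header(header)[0], len(sequence))
```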

GCG
- Exactly one sequence per file
- Begins with annotation lines
- The start of the sequence is marked by a line ending with ".."
- This line also contains the sequence identifier, the sequence length, and a checksum

ID   AA03518 standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE   rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ   Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
AA03518 Length: 237 Check: 4514 ..
    1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
   61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
  121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
  181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
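The ".." marker line makes a GCG file straightforward to read. The sketch below is a minimal reader that relies only on the layout described on this slide; it reads the length and checksum fields but does not recompute the checksum, and `parse_gcg` is a hypothetical name.

```python
import re


def parse_gcg(text):
    """Read a single-sequence GCG file into a dict.

    Everything before the line ending in ".." is treated as annotation;
    everything after it is sequence, with position numbers and spaces dropped.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError('no line ending with ".." found')
    marker = lines[index]
    sequence = "".join(
        ch for line in lines[index + 1:] for ch in line if ch.isalpha()
    )
    return {
        "identifier": marker.split()[0],
        "length": int(re.search(r"Length:\s*(\d+)", marker).group(1)),
        "check": int(re.search(r"Check:\s*(\d+)", marker).group(1)),
        "sequence": sequence,
    }


sample = """DE   A shortened, made-up example entry.
XX
DEMO1 Length: 24 Check: 0 ..
    1 aacctgcgga aggatcatta ccga
"""
entry = parse_gcg(sample)
print(entry["identifier"], entry["length"], len(entry["sequence"]))
```

Comparing `entry["length"]` with `len(entry["sequence"])` gives a quick sanity check that the whole sequence was read.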

GenBank/GenPept
- A file can contain several sequences
- Each entry starts with "LOCUS"
- The sequence itself starts with "ORIGIN"
- The sequence ends with "//"

LOCUS       AAU03518  237 bp  DNA  PLN  04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and
            18S rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT  41 a 77 c 67 g 52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
//
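The three markers named above (LOCUS, ORIGIN, //) are enough to pull the raw sequences out of a GenBank flat file. The following is a minimal sketch under that assumption; it ignores all annotation and is not a full GenBank parser, and the function name is hypothetical.

```python
def genbank_sequences(text):
    """Return a list of (locus name, sequence) pairs from GenBank flat-file text."""
    records = []
    name, in_origin, chunks = None, False, []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # Start of a new entry; the locus name is the second field.
            name, in_origin, chunks = line.split()[1], False, []
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of the entry: store what was collected.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, in_origin = None, False
        elif in_origin:
            # Drop the position numbers and spaces, keep the bases.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return records


sample = """LOCUS       AAU03518  237 bp  DNA  PLN  04-FEB-1995
ACCESSION   U03518
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg
//
"""
for locus, sequence in genbank_sequences(sample):
    print(locus, len(sequence))
```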

Tools for file format conversion
- http://bioportal.bic.nus.edu.sg/readseq/readseq.html
- http://www-bimas.cit.nih.gov/molbio/readseq/
- http://bioweb.pasteur.fr/seqanal/interfaces/readseq-simple.html

Translation table: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG11

The Universal Genetic Code Table

Annotation

References
- http://www.ebi.ac.uk/embl/index.html
- http://www.ebi.ac.uk/Documentation/Release_nots/current/relnotes.html
- http://www.ncbi.nih.gov/genbank/gbrel.txt
- ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
- ftp://ftp.ncbi.nih.gov/genbank/
- http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html