UniProt

4,880 views 4 slides Mar 23, 2021

UniProt
The Universal Protein Resource (UniProt) is a freely accessible database of protein
sequence and functional information, with many entries derived from genome sequencing projects.
It contains a large amount of information about the biological function of proteins, derived from
the research literature.
UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the
SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR). Across the
three institutes, more than 100 people are involved in tasks such as database
curation, software development and support.
UniProt provides a central resource for storing and interconnecting information from
large and disparate sources, and is the most comprehensive catalog of protein sequence and
functional annotation.
There are three UniProt components:
• UniProt Knowledgebase (UniProtKB):
The central hub for the collection of functional information on proteins.
• UniProt Reference Clusters (UniRef):
Provides clustered sets of sequences.
• UniProt Archive (UniParc):
A comprehensive and non-redundant database that contains most of the publicly available
protein sequences.
As the number of completely sequenced genomes continues to increase, huge efforts are being
made in the research community to understand as much as possible about the proteins encoded
by these genomes. This work is critical to many areas of science including biology, medicine and
biotechnology - and is generating a wealth of data.
UniProt is important because it provides:
1. Up-to-date knowledge about proteins.
2. A comprehensive body of protein information.
3. A resource that facilitates scientific discovery by collecting, interpreting and organising
this information, which saves researchers countless hours of work.
4. Support for a wide range of tasks, from finding out about your protein of interest to
comparing its sequence with other proteins.
5. Proteomes for species with completely sequenced genomes.
UniProtKB entries are available in three file formats:

1. Flat Text
2. XML
3. RDF/XML
UniProtKB entries in these formats each contain only one protein sequence, the so-called
'canonical' sequence. The canonical sequence is also available in FASTA format, as are the additional
manually curated isoform sequences that are described in UniProtKB/Swiss-Prot.
Below we describe how these sets can be accessed.
In addition to the predefined FASTA, XML, RDF/XML and text formats, search results can also
be downloaded in tab-separated or Excel format, reflecting your own customizable column
settings.
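As a rough illustration of working with a tab-separated download, the sketch below reads such a table into dictionaries keyed by column header. The column names (Entry, Entry Name, Length), the accessions and the sample rows are all made up for the example; a real download contains whichever columns you selected.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text whose first line is a header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# Hypothetical sample standing in for a downloaded result table.
# The accessions and entry names below are placeholders, not real entries.
SAMPLE_TSV = (
    "Entry\tEntry Name\tLength\n"
    "ACC001\tEX1_HUMAN\t120\n"
    "ACC002\tEX2_HUMAN\t345\n"
)

rows = read_tsv(SAMPLE_TSV)

# Example filter: keep accessions of proteins longer than 200 residues.
long_entries = [row["Entry"] for row in rows if int(row["Length"]) > 200]
```

All values are read as strings, so numeric columns such as the length need an explicit conversion before comparison.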
UniProt provides several application programming interfaces (APIs) to query and access its data
programmatically:
• UniProt website REST API. ...
• Proteins REST API. ...
• UniProt SPARQL API. ...
• UniProt Java API.
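The sketch below shows one possible way to call the website REST API from Python with only the standard library. It assumes the base URL https://rest.uniprot.org, a per-entry path of the form /uniprotkb/ACCESSION.FORMAT, and a /uniprotkb/search endpoint taking query, format and size parameters; none of these details appear in the text above, so check them against the API documentation for the release you use. The helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://rest.uniprot.org"  # assumed base URL of the REST API


def entry_url(accession, fmt="fasta"):
    """Build the URL for a single UniProtKB entry in the given format."""
    return f"{BASE_URL}/uniprotkb/{accession}.{fmt}"


def search_url(query, fmt="fasta", size=10):
    """Build a search URL; urlencode escapes characters such as ':'."""
    params = urlencode({"query": query, "format": fmt, "size": size})
    return f"{BASE_URL}/uniprotkb/search?{params}"


def fetch(url, timeout=30):
    """Download a URL and return its body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With network access, fetch(entry_url("P05067")) would be expected to return the entry's canonical sequence as FASTA text. URL construction is kept separate from the download so the URLs can be inspected or logged before any request is sent.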

Retrieving sequences from the website
• Perform your favorite query and view the resulting list of entries (e.g. a query that retrieves
all UniProtKB entries that are part of the human proteome).
• Click the Download button in the query result page.
• Choose the desired download format (Flat Text, XML, RDF/XML, tab-separated, Excel,
or FASTA if additional isoform sequences are desired).
• Choosing Flat Text, XML, or RDF/XML allows retrieval of all entries (and their
canonical sequences) from the result list in the desired format.
• Choosing FASTA (canonical) format allows retrieval of all canonical sequences from the
query result list. This can include canonical sequences from UniProtKB/Swiss-Prot
and/or UniProtKB/TrEMBL entries.
• Choosing the option FASTA (canonical and isoform) allows retrieval of all canonical
sequences plus all manually reviewed isoform sequences described within
UniProtKB/Swiss-Prot. These manually reviewed isoform sequences are available as
distinct sequences in FASTA format only within this expanded downloadable set.
• Choosing Tab-separated or Excel allows retrieval of your search result table reflecting the
columns you have chosen to include.
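To make the difference between the canonical and the canonical-plus-isoform FASTA sets concrete, here is a minimal sketch that parses FASTA text and separates the two kinds of record. It assumes headers of the form db|ACCESSION|ENTRY_NAME followed by a description, and that isoform accessions carry a numeric suffix such as -2. Both are assumptions about the file layout, and the sample records are invented.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession_of(header):
    # Assumes a header laid out as 'db|ACCESSION|ENTRY_NAME description'.
    return header.split("|")[1]


def is_isoform(accession):
    # Assumes isoform accessions end in '-<number>', e.g. 'ACC001-2'.
    _base, sep, suffix = accession.rpartition("-")
    return bool(sep) and suffix.isdigit()


# Invented sample: one canonical record and one isoform record.
SAMPLE_FASTA = """>sp|ACC001|EX1_HUMAN Example protein
MKTAYIAK
QRQISFVK
>sp|ACC001-2|EX1_HUMAN Isoform 2 of Example protein
MKTAYIAKQR
"""

records = list(parse_fasta(SAMPLE_FASTA))
accessions = [accession_of(header) for header, _ in records]
canonical = [acc for acc in accessions if not is_isoform(acc)]
isoforms = [acc for acc in accessions if is_isoform(acc)]
```

Sequence lines that wrap over several lines are joined, so each record ends up with a single uninterrupted sequence string.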

Retrieving sequences from the FTP site
The UniProt FTP sites (accessible via the Download latest release link located on the home
page) provide the most frequently requested data sets in each of the file formats (Flat Text,
XML, RDF/XML, FASTA). The additional manually curated isoform sequences that are
described in UniProtKB/Swiss-Prot are available in a separate FASTA file
(uniprot_sprot_varsplic.fasta.gz). Our FTP directory also includes expanded FASTA sets,
containing both the canonical and manually reviewed isoform sequences, for all reference
proteomes.
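Files from the FTP site are gzip-compressed, so they can be read as a stream without unpacking them first. The sketch below counts the records in a gzipped FASTA file; the function names are hypothetical, and the demonstration runs on a tiny in-memory payload rather than a real download.

```python
import gzip
import io


def count_fasta_records(stream):
    """Count header lines (those starting with '>') in a FASTA text stream."""
    return sum(1 for line in stream if line.startswith(">"))


def count_records_in_gz(path):
    """Open a gzip-compressed FASTA file in text mode and count its records."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_fasta_records(handle)


# Self-contained demonstration on an invented, in-memory gzip payload.
payload = gzip.compress(b">a\nMK\n>b\nMR\n")
with gzip.open(io.BytesIO(payload), "rt", encoding="utf-8") as handle:
    demo_count = count_fasta_records(handle)
```

For a downloaded file, count_records_in_gz("uniprot_sprot_varsplic.fasta.gz") would apply the same counting to the isoform set, assuming the file sits in the working directory.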
UniProt has a number of datasets that you can navigate to and search within. You can click on
the dropdown menu to the left of the search box to see these datasets and select the one you are
interested in (Figure 10A). UniProtKB is selected by default.

Figure 10. Accessing the UniProt datasets via (A) the drop down menu or (B) the tiles on the
homepage
You can also see these datasets as tiles on the home page (Figure 10B). Clicking on one of these
tiles will take you to the entire dataset, where you can explore its contents or search within them.
Navigating the UniProt tools
UniProt provides four main tools:
• the Basic Local Alignment Search Tool (BLAST) for sequence search;
• the 'Align' multiple sequence alignment tool;
• the 'Retrieve/ID Mapping' tool, where you can submit a list of identifiers to retrieve the
corresponding UniProt entries, or map them from or to an external database;
• the 'Peptide search' tool, which allows you to submit short peptide sequences of at least 3
residues and find all UniProtKB sequences that have an exact match to the query
sequence.
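The idea behind an exact-match peptide search can be sketched locally as a substring test over a set of sequences. This is only an illustration of the concept, not how the UniProt tool is implemented; the function name, the sample identifiers and the sequences are invented, and the minimum length of 3 is taken from the description above.

```python
MIN_PEPTIDE_LENGTH = 3  # minimum query length stated in the text above


def peptide_search(peptide, sequences):
    """Return the IDs of sequences that contain an exact match to the peptide.

    `sequences` maps an identifier to its amino-acid sequence.
    """
    peptide = peptide.strip().upper()
    if len(peptide) < MIN_PEPTIDE_LENGTH:
        raise ValueError(
            f"peptide must have at least {MIN_PEPTIDE_LENGTH} residues"
        )
    return [
        seq_id
        for seq_id, sequence in sequences.items()
        if peptide in sequence.upper()
    ]


# Invented sample sequences keyed by placeholder identifiers.
SAMPLE_SEQUENCES = {
    "ACC001": "MKTAYIAKQRQISFVK",
    "ACC002": "MSLLTEVETYVLSIIP",
    "ACC003": "GGAKQRQAAL",
}

hits = peptide_search("kqrq", SAMPLE_SEQUENCES)
```

A linear scan like this is fine for a handful of sequences; searching a database of UniProtKB's size would need an index, which is outside the scope of this sketch.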
You can navigate to these tools by clicking on their corresponding links in the header or in the
footer (Figure 11).

Figure 11. UniProt tools can be accessed from links on the header (A) and footer (B) of every
page.

Conclusion:
Our knowledge of the protein world is rapidly growing, and the complexity of biological systems is becoming
increasingly clear. UniProt continues to adapt its data gathering, data processing and data display to improve the
availability and utility of protein information for the benefit of all. UniProt resources can be accessed via the
website at http://www.uniprot.org/.