INTRODUCTION
Proteins are generally composed of one or more functional regions, commonly termed domains. Different combinations of domains give rise to the diverse range of proteins found in nature and to their functional diversity (Vogel et al., 2004), which includes their ability to form different protein interactions.
HISTORY
Pfam was founded in 1995 by Erik Sonnhammer, Sean Eddy and Richard Durbin as a collection of commonly occurring protein domains that could be used to annotate the protein-coding genes of multicellular animals. One of its major aims at inception was to aid in the annotation of the C. elegans genome. The project was partly driven by the assertion in ‘One thousand families for the molecular biologist’ by Cyrus Chothia that the large majority of proteins belong to no more than about one thousand families. Counter to this assertion, the Pfam database has grown to more than 16,000 entries corresponding to unique protein domains and families. However, many of these families contain structural and functional similarities indicating a shared evolutionary origin.
WHAT IS Pfam
Pfam is a database of protein families that includes multiple sequence alignments generated using hidden Markov models. In addition, each family has associated annotation, literature references, and links to other databases. The most recent version, Pfam 35.0, was released in November 2021 and contains 19,632 families.
The entries in Pfam are freely available via the web and in flat file format. Pfam is available in Europe at http://www.sanger.ac.uk/Software/Pfam/ (UK), http://www.cgb.ki.se/Pfam/ (Sweden), and http://pfam.jouy.inra.fr/ (France), and in the United States at http://pfam.wustl.edu/. Pfam is a founding member database of InterPro and is therefore also available via the InterPro site at http://ebi.ac.uk/interpro.
The general purpose of the Pfam database is to provide a complete and accurate classification of protein families and domains. Originally, the rationale behind creating the database was to have a semiautomated method of curating information on known protein families to improve the efficiency of annotating genomes. The Pfam classification of protein families has been widely adopted by biologists because of its wide coverage of proteins and sensible naming conventions.
USES
Pfam is used by experimental biologists researching specific proteins, by structural biologists to identify new targets for structure determination, by computational biologists to organise sequences and by evolutionary biologists tracing the origins of proteins. Early genome projects, such as human and fly, used Pfam extensively for functional annotation of genomic data. The Pfam website allows users to submit protein or DNA sequences to search for matches to families in the database.
If DNA is submitted, a six-frame translation is performed and each frame is then searched. Rather than performing a typical BLAST search, Pfam uses profile hidden Markov models, which give greater weight to matches at conserved sites. This allows better remote homology detection and makes them more suitable for annotating genomes of organisms with no well-annotated close relatives. Pfam has also been used in the creation of other resources such as iPfam, which catalogs domain-domain interactions within and between proteins, based on information in structure databases and mapping of Pfam domains onto these structures.
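The six-frame translation applied to submitted DNA can be illustrated with a short Python sketch. It is not Pfam's own code: the function names are hypothetical, the standard genetic code (NCBI translation table 1) is assumed, and incomplete trailing codons are simply dropped.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
# Assumption: the submitted DNA uses the standard code.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna: str) -> dict[str, str]:
    """Translate three forward frames (+1..+3) and three reverse frames (-1..-3)."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames
```

Each of the six resulting protein strings would then be searched separately against the family models. Stop codons appear as `*` in the output, so a real pipeline would likely split each frame at those positions before searching.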
For each family in Pfam one can: view a description of the family; look at multiple alignments; view protein domain architectures; examine species distribution; follow links to other databases; and view known protein structures.
FEATURES
Entries can be of several types: family, domain, repeat or motif. Family is the default class, which simply indicates that members are related. A domain is defined as an autonomous structural unit or reusable sequence unit that can be found in multiple protein contexts. Repeats are not usually stable in isolation; they are usually required to form tandem repeats in order to form a domain or extended structure. Motifs are usually shorter sequence units found outside of globular domains. The descriptions of Pfam families are managed by the general public using Wikipedia.
Domains of unknown function
Domains of unknown function (DUFs) represent a growing fraction of the Pfam database. These families are so named because they are conserved across species but perform an unknown role. Each newly added DUF is named in order of addition. Names of these entries are updated as their functions are identified. Normally, when the function of at least one protein belonging to a DUF has been determined, the function of the entire DUF is updated and the family is renamed. Some named families are still domains of unknown function and are named after a representative protein, e.g. YbbR.
CLANS
Clans are groupings of related families that share a single evolutionary origin, as confirmed by structural, functional, sequence and HMM comparisons. Clans were first introduced to the Pfam database in 2005. To identify possible clan relationships, Pfam curators use the Simple Comparison Of Outputs Program (SCOOP) as well as information from the ECOD database. ECOD is a semi-automated hierarchical database of protein families with known structures, with families that map readily to Pfam entries and homology levels that usually map to Pfam clans.
Pfam was originally hosted on three mirror sites around the world to provide redundancy. However, between 2012 and 2014, the Pfam resource was moved to EMBL-EBI, which allowed the website to be hosted from one domain, using duplicate independent data centres.
What are profile hidden Markov models?
Profile hidden Markov models are one of the computational approaches used for predicting protein structure and function. They identify significant protein sequence similarities, allowing the detection of homologs and consequently the transfer of information, i.e. sequence homology-based inference of knowledge.
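The way a profile gives extra weight to conserved positions can be shown with a deliberately simplified Python sketch. It models match states only, as a position-specific log-odds matrix, and omits the insert and delete states and transition probabilities of a real profile HMM. The function names, the uniform background distribution and the pseudocount of 1.0 are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Per-column log2-odds scores against a uniform background (assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        # Gap characters and other non-standard symbols are ignored.
        counts = Counter(residue for residue in column if residue in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile: list[dict[str, float]], sequence: str) -> float:
    """Sum the column scores for an ungapped sequence of the same length."""
    return sum(column.get(residue, 0.0) for column, residue in zip(profile, sequence))


# Toy alignment: columns 1 and 2 are fully conserved, columns 3 and 4 vary.
toy_alignment = ["ACDE", "ACEF", "ACKG", "ACRH"]
toy_profile = build_profile(toy_alignment)
```

In this toy profile, matching the conserved `A` in the first column scores higher than matching any single residue in the variable last column, and a mismatch at a conserved column is penalised with a negative score.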
Pfam-A and Pfam-B
Pfam-A
A Pfam-A entry is a hand-curated, profile HMM-based entry built using a small number of representative sequences. Curators manually set a threshold value for each profile HMM and search the models against the UniProtKB database. All of the sequences that score above the threshold for a Pfam entry are included in the entry’s full alignment.
Pfam-B
Pfam-B is a set of unannotated, computationally generated multiple sequence alignments. These alignments are one of the sources used for creating Pfam-A entries.
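The thresholding step for a Pfam-A entry can be sketched in a few lines of Python. The function name, the accessions and the scores below are hypothetical, and treating a score exactly equal to the threshold as a hit is an assumption rather than something stated above.

```python
def full_alignment_members(scores: dict[str, float], threshold: float) -> list[str]:
    """Return accessions scoring at or above the threshold, best score first.

    Assumption: the comparison is inclusive (score >= threshold).
    """
    hits = [accession for accession, value in scores.items() if value >= threshold]
    return sorted(hits, key=lambda accession: scores[accession], reverse=True)


# Hypothetical bit scores from searching one model against a sequence database.
example_scores = {"SEQ_A": 85.2, "SEQ_B": 24.9, "SEQ_C": 31.0}
members = full_alignment_members(example_scores, threshold=25.0)
```

Here `SEQ_B` falls just below the threshold and is excluded, which shows how the manually chosen cut-off decides membership of the full alignment.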
1. Search by sequence
2. Enter the sequence in FASTA format
3. Search

1. Search by text
2. Select the protein
3. Select according to requirement

1. Select by domain architecture
2. Add the domain of interest
3. Select domain architecture or according to the requirement
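Because the sequence search expects FASTA input, it can help to check a sequence file before submitting it. The following is a minimal Python sketch of a FASTA parser. The function name is hypothetical, and it performs no validation of the residue alphabet.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs.

    Raises ValueError if sequence data appears before the first '>' header.
    """
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("FASTA input must start with a '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

A record whose sequence is wrapped over several lines is joined into a single string, so the same function handles both wrapped and single-line FASTA files.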