2. INTRODUCTION
Proteins are generally composed of one or more functional regions, commonly termed
domains.
Different combinations of domains give rise to the diverse range of proteins found in
nature.
These combinations also give rise to functional diversity (Vogel et al., 2004),
including the ability of proteins to form different interactions.
3. HISTORY
Pfam was founded in 1995 by Erik Sonnhammer, Sean Eddy and Richard Durbin as a collection of
commonly occurring protein domains that could be used to annotate the protein-coding genes of
multicellular animals.
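One of its major aims at inception was to aid in the annotation of the C. elegans genome. The project
was partly driven by the assertion in ‘One thousand families for the molecular biologist’ by Cyrus
Chothia that there were around 1500 different families of proteins and that the majority of proteins
fell into just 1000 of these.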
Counter to this assertion, Pfam 35.0 contains 19,632 entries corresponding to unique
protein domains and families. However, many of these families contain structural and functional
similarities indicating a shared evolutionary origin.
4. WHAT IS Pfam
Pfam is a database of protein families that
includes their annotations and multiple sequence
alignments generated using hidden Markov
models.
In addition, each family has associated annotation,
literature references, and links to other databases.
The most recent version, Pfam 35.0, was released
in November 2021 and contains 19,632 families.
5. The entries in Pfam are freely available via the web and in flat file format.
(Pfam is available in Europe at
http://www.sanger.ac.uk/Software/Pfam/ (UK), http://www.cgb.ki.se/Pfam/ (Sweden), and
http://pfam.jouy.inra.fr/ (France), and in the United States at http://pfam.wustl.edu/.)
Pfam is a founding member database of InterPro and, therefore, also available via the InterPro site at
http://ebi.ac.uk/interpro.
6. The general purpose of the Pfam database is to provide a complete and accurate classification of
protein families and domains.
Originally, the rationale behind creating the database was to have a semiautomated method of
curating information on known protein families to improve the efficiency of annotating genomes.
The Pfam classification of protein families has been widely adopted by biologists because of its
wide coverage of proteins and sensible naming conventions.
USES
7. Pfam is used by experimental biologists researching specific proteins, by structural biologists to identify
new targets for structure determination, by computational biologists to organise sequences and by
evolutionary biologists tracing the origins of proteins.
Early genome projects, such as those for human and fly, used Pfam extensively for functional annotation of
genomic data.
The Pfam website allows users to submit protein or DNA sequences to search for matches to families
in the database.
8. If DNA is submitted, a six-frame translation is performed, then each frame is searched.
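The six-frame translation step can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and an uppercase or lowercase A/C/G/T input; the function names are hypothetical and are not taken from the Pfam website's own code.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; unrecognised codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate the three forward and three reverse-strand frames."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six protein strings would then be searched separately against the family models.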
Rather than performing a typical BLAST search, Pfam uses profile hidden Markov models,
which give greater weight to matches at conserved sites. This allows better remote homology
detection and makes them more suitable for annotating genomes of organisms with no
well-annotated close relatives.
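To illustrate why conserved sites carry more weight, here is a deliberately simplified sketch: a per-column log-odds score computed from a toy ungapped alignment. It is not a full profile HMM (there are no insert or delete states) and not how Pfam builds its models. The alignment, the uniform background frequency and the pseudocount are all assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency


def column_log_odds(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    scores = []
    for column in zip(*alignment):
        counts = Counter(column)
        scores.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return scores


def score_sequence(sequence, scores):
    """Sum the per-column scores of an ungapped sequence."""
    return sum(col[res] for col, res in zip(scores, sequence))


# Hypothetical data: column 0 is fully conserved (C), column 2 is variable.
toy_alignment = ["CAK", "CGL", "CAM", "CSN"]
scores = column_log_odds(toy_alignment)

print("match at conserved column:", round(scores[0]["C"], 2))
print("match at variable column: ", round(scores[2]["K"], 2))
print("CAK vs WAK:", score_sequence("CAK", scores), score_sequence("WAK", scores))
```

A match to the conserved cysteine contributes more to the total than a match in the variable third column, which is the behaviour that helps profile methods detect remote homologs.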
Pfam has also been used in the creation of other resources such as iPfam, which catalogs
domain-domain interactions within and between proteins, based on information in structure
databases and mapping of Pfam domains onto these structures.
9. For each family in Pfam one can:
a. View a description of the family
b. Look at multiple alignments
c. View protein domain architectures
d. Examine species distribution
e. Follow links to other databases
f. View known protein structures
FEATURES
10. Entries can be of several types: family, domain, repeat or motif.
a. Family is the default class, which simply indicates that members are related.
b. A domain is defined as an autonomous structural unit or reusable sequence unit that can be
found in multiple protein contexts.
c. Repeats are not usually stable in isolation, but rather are usually required to form tandem
repeats in order to form a domain or extended structure.
d. Motifs are usually shorter sequence units found outside of globular domains.
The descriptions of Pfam families are managed by the general public using Wikipedia.
11. Domains of unknown function
Domains of unknown function (DUFs) represent a growing fraction of the Pfam database.
The families are so named because they have been found to be conserved across species, but
perform an unknown role. Each newly added DUF is named in order of addition.
Names of these entries are updated as their functions are identified.
Normally when the function of at least one protein belonging to a DUF has been determined, the
function of the entire DUF is updated and the family is renamed.
Some named families are still domains of unknown function; these are named after a representative
protein, e.g. YbbR.
12. CLANS
Clans are groupings of related families that share a single evolutionary origin, as confirmed by
structural, functional, sequence and HMM comparisons.
Clans were first introduced to the Pfam database in 2005.
To identify possible clan relationships, Pfam curators use the Simple Comparison Of Outputs
Program (SCOOP) as well as information from the ECOD database.
ECOD is a semi-automated hierarchical database of protein families with known structures, with
families that map readily to Pfam entries and homology levels that usually map to Pfam clans.
13. Pfam was originally hosted on three mirror sites around the world to provide redundancy.
However, between 2012 and 2014, the Pfam resource was moved to EMBL-EBI, which allowed
for hosting of the website from one domain, using duplicate independent data centres.
14. What are profile hidden Markov models?
Profile hidden Markov models are one of the computational algorithms used
for predicting protein structure and function. They identify
significant protein sequence similarities, allowing the
detection of homologs and consequently the transfer of
information, i.e. sequence homology-based inference of
knowledge.
15. Pfam-A and Pfam-B
Pfam-A
A hand-curated, profile-HMM-based Pfam entry that is built using a small number of
representative sequences.
Curators manually set a threshold value for each profile HMM and search the models against the
UniProtKB database. All of the sequences which score above the threshold for a Pfam entry are
included in the entry’s full alignment.
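The inclusion rule described above can be sketched in a few lines. The hit list, the accessions and the threshold value here are hypothetical; in practice the scores would come from searching the entry's profile HMM against UniProtKB.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    accession: str    # hypothetical sequence identifier
    bit_score: float  # score of the sequence against the profile HMM


def full_alignment_members(hits, threshold):
    """Keep hits scoring above the curated threshold, best first."""
    kept = [hit for hit in hits if hit.bit_score > threshold]
    return sorted(kept, key=lambda hit: hit.bit_score, reverse=True)


# Made-up scores for illustration only.
hits = [Hit("SEQ_A", 85.2), Hit("SEQ_B", 24.9), Hit("SEQ_C", 31.0)]
members = full_alignment_members(hits, threshold=25.0)
print([hit.accession for hit in members])
```

Because the threshold is set per entry, the same score could qualify a sequence for one family and exclude it from another.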
Pfam-B
A set of unannotated, computationally generated multiple sequence alignments. They are one of the
sources that are used for creating Pfam-A entries.