Presence Server Assignment Unassigned Funds

If LDAP Sync is enabled on Cisco Unified Communications Manager, you must move the users to the new Organizational Unit (OU) from which their new cluster synchronizes if the deployment uses a separate LDAP structure (OU divided) for each cluster, where users are only synchronized from LDAP to their home cluster.


Note


You do not need to move the users if the deployment uses a flat LDAP structure, that is, all users are synchronized to all Cisco Unified Communications Manager and IM and Presence Service clusters where users are licensed to only one cluster.


For more information about how to move the migrating users to the relevant OU of the new home cluster, see the LDAP Administration documentation.

After you move the users, you must delete the LDAP entries from the old LDAP cluster.

Proceed to synchronize the users to the new home cluster.

Abstract

Motivation: Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed.

Results: We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was, thus, applied first to sequence/HMM alignments, then HMM–HMM alignments and then structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, presented in a database called PDBfam, contains Pfams for 99.4% of chains >50 residues.

Availability: The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.

Contact: Roland.Dunbracks@fccc.edu

1 INTRODUCTION

Clustering proteins of known structures into families or superfamilies is a long-standing task of particular importance in understanding structure–function relationships and for protein structure prediction by homology. Usually, protein classification in the PDB is accomplished at the level of domains—substructures that recur as functional units in different protein contexts.

Structure-based domain classifications of the PDB, such as SCOP (Murzin et al., 1995) and CATH (Orengo et al., 1997), are constructed by comparing the available protein structures in the PDB and creating classifications of new folds and superfamilies manually. Existing structure-based classifications cover only a portion of the PDB. The most recent SCOP release (v. 1.75A) is 2 years behind the PDB and only covers 61% of current PDB entries. CATH was last updated in November 2011 and covers 64% of the current PDB.

Sequence-based approaches, such as Pfam (Sonnhammer et al., 1997), ProDom (Servant et al., 2002), InterPro (Hunter et al., 2009), SMART (Schultz et al., 1998) and SUPfam (Pandit et al., 2002), can be readily applied to new structures because most new structures fit into existing sequence clusters. In this way, they are well suited for rapid and automated classification of new structures in a way that structure-based classifications are not. Some differences between sequence-based and structure-based classifications occur when a single sequence domain is structurally two or more domains with separate hydrophobic cores or a single structural domain is two sequence domains (Zhang et al., 2005). Structure-based methods are often superior in recognizing remote relationships between families, as these relationships may be apparent only from structural similarity and in the absence of any recognizable sequence similarity. Even with structural information, it may be difficult to distinguish between divergent and convergent evolutionary relationships (Tress et al., 2005).

Our aims in this article are twofold: first, to develop a general procedure that can be used to make rigorous assignments of existing protein family classification systems for any set of protein sequences, and second, to perform this task for the entire PDB. For the PDB, we wish to define a method that can be run automatically on a weekly or monthly basis. We have thus chosen the sequence-based domain classification given by Pfam, as new proteins in the PDB can be readily assigned to existing Pfams, without manual intervention required by structure-based classification systems.

Pfam is a database of protein families, in which each family is represented by a hidden Markov model (HMM) created from manually curated multiple sequence alignments (Sonnhammer et al., 1997). The Pfam classification of protein families has gained widespread acceptance among biologists because of its wide coverage of proteins and a sensible naming convention related to protein functions and commonly used names (Pkinase, SH2, etc.). Pfam was recently used in the Protein Structure Initiative to select targets and divide them among different high-throughput centres (Dessailly et al., 2009). Some Pfam families are seeded by structures in the PDB (Finn et al., 2010). Two or more related Pfam families are grouped into a Pfam clan (Finn et al., 2006). Such relationships are often identified through structural similarity, as they are in the structural classification systems.

Several assignments of Pfam domains to the PDB are currently available, including Pfam itself (Punta et al., 2012), SIFTS (Velankar et al., 2005) and the RCSB (Research Collaboratory for Structural Bioinformatics), covering 45, 87 and 94% of unique sequences in the PDB, respectively. Each of the currently available sources of Pfam assignments to the PDB suffers from one or more of a number of problems. First, because they use only the original PDB or UniProt sequences against the Pfam HMMs, they miss many potential assignments that occur when the sequence is not closely related to any single Pfam family. It is likely that sequence methods based on profile–profile comparison may identify these relationships and provide higher levels of statistical significance. Second, in some cases, these sources also provide completely overlapping assignments, sometimes when two different Pfams in the same clan align to the same region of a PDB sequence with good E-values. If we want to cluster at the family level and then the superfamily level, this produces discrepancies. Third, some proteins have long insertions relative to the Pfam HMM definition, and HMMER may produce two alignment segments, one on either side of the insertion. These two segments cover non-overlapping regions of the HMM, and together should comprise a single Pfam assignment. However, the publicly available sources simply list these separately, and they cannot easily be distinguished from repeated domains in the same protein. Fourth, some protein structures are composed of two chains that together comprise a single Pfam domain, i.e. two non-overlapping regions of the same Pfam HMM. Pfam, SIFTS and the RCSB do not properly account for domains split by insertions or split between different chains.

We overcome some of the deficiencies of other Pfam databases using several strategies. The first is to use consensus sequences derived from PSI-BLAST profiles and to run these through the Pfam HMM library. Such sequences can be fed to the Pfam HMMs like any protein sequence, and they usually produce more complete alignments with better E-values than the original sequences. Similar techniques have been used by us previously (Kahsay et al., 2005) and by others (Przybylski and Rost, 2008). Secondly, we utilize HHblits (Remmert et al., 2012) to produce HMMs for PDB sequences and their parent UniProt sequences, and then apply HMM–HMM alignment of these HMMs against the Pfam HMMs using HHsearch (Söding, 2005). The third approach is to utilize structure alignment of statistically confident and complete structures in Pfam families with weak hits in the same Pfam families—either those with poor E-values and/or alignments that cover only a portion of the Pfam HMM. This allows us to verify whether a weak assignment is correct and to extend short alignments.

Finally, we have developed a procedure for optimally combining assignments from these multiple sources into Pfam architectures for each protein in the PDB. The procedure combines non-overlapping or minimally overlapping partial assignments to the same Pfam into single assignments, thus, accounting for large insertions or domains split across multiple protein chains. We assign additional repeat domains at weaker E-values if the same repeat family is assigned earlier in the procedure at an E-value better than the general cut-off.

We explore the properties of the regions and proteins that cannot be assigned to a Pfam domain and the interactions between Pfam domains in the biological assemblies of structures in the PDB, according to our Pfam assignments. Regions not assigned to Pfams have a greater tendency to be disordered in protein structures and to have lower rates of regular secondary structure than regions assigned to Pfam domains. The number of Pfam–Pfam interactions is increased by the additional assignments made using the methods described here but is also critically dependent on the use of biological assemblies from crystal structures rather than the asymmetric units.

The Pfam assignments can be searched through the ProtCID server (http://dunbrack2.fccc.edu/protcid) by PDB codes, Pfam codes and sequences. Downloadable files of the Pfam assignments and those proteins that cannot be assigned are also available on the website http://dunbrack2.fccc.edu/ProtCID/PDBfam.

Our procedure is general and can be applied to other domain classification systems and other target sequence sets. Even if the target sequence set is not the PDB, structural information may still be used for proteins in the sequence set that can be readily aligned with proteins of known structure.

2 METHODS

2.1 Searching Pfam through PSI-BLAST consensus sequences and HHsearch

Pfam v26 files Pfam-A.hmm and Pfam-B.hmm were downloaded from the Pfam website and were used as HMMER3 profile databases (Finn et al., 2010). The PDB sequences (Berman et al., 2000) were parsed from pdbx_seq_one_letter_code records in the PDB XML files (Westbrook et al., 2005). UniProt sequences were downloaded from the UniProt website (Bairoch et al., 2005). The XML files from the SIFTS database (Velankar et al., 2005) were used to find the residue correspondence between the UniProt and PDB sequences.

For each unique PDB sequence, we used one iteration of our modified PSI-BLAST (Altschul et al., 1997) from MolIDE (Wang et al., 2008) to generate a profile from sequences in the UniRef90 database (Li et al., 2000). The parameters for PSI-BLAST were ‘-e 10 -h 0.0001 -v 5000 -b 5000 -N 25 -f 16’. A PSI-BLAST profile is a position-specific scoring matrix (PSSM), which provides a log-odds score and percentage of occurrences for each of the 20 amino acid types at each position in the query sequence. A consensus sequence is a 1D simplification of a PSI-BLAST profile obtained by substituting the 20-dimensional vector in each residue position by the highest scoring or most common amino acid observed at that position. In this article, a ‘percentage consensus sequence’ is composed of the most frequent residues in each column, whereas a ‘PSSM consensus sequence’ is composed of the highest scoring amino acid at each position. We also applied the same procedure to the full UniProt sequences from which PDB sequences are derived, as identified by SIFTS. We, thus, have the following six sets of sequences: PDB sequences, PDB percentage consensus sequences, PDB PSSM consensus sequences, UniProt sequences, UniProt percentage consensus sequences and UniProt PSSM consensus sequences. In this article, we denote those sequences as PDB, PDB-percent, PDB-pssm, UNP, UNP-percent and UNP-pssm, respectively. We ran HMMER3 on all six sets of sequences against Pfam A and Pfam B HMM models. We refer to these six sets of alignments as ‘HMMER hits’.
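The two kinds of consensus sequence described above can be illustrated with a minimal Python sketch. It assumes a toy profile representation (one dictionary per query position) rather than the actual PSI-BLAST output format; the function name, the example values and the fallback to the query residue for empty columns are all illustrative assumptions, not part of the published procedure.

```python
def consensus_sequence(columns, query):
    """Return the consensus for a toy profile.

    columns: one dict per query position mapping amino acid -> value
             (percentage of occurrences, or PSSM log-odds score).
    query:   the original sequence, used where a column holds no data
             (this fallback is an assumption of the sketch).
    """
    residues = []
    for column, query_residue in zip(columns, query):
        if column:
            # sorted() makes ties deterministic: max() keeps the first maximum.
            residues.append(max(sorted(column), key=column.get))
        else:
            residues.append(query_residue)
    return "".join(residues)


# Hypothetical three-residue query with made-up profile values.
query = "AIK"
percent_columns = [{"A": 60, "G": 40}, {"L": 30, "I": 70}, {}]
pssm_columns = [{"A": 2, "G": 5}, {"L": 4, "I": 1}, {}]

percent_consensus = consensus_sequence(percent_columns, query)  # most frequent residue
pssm_consensus = consensus_sequence(pssm_columns, query)        # highest-scoring residue
```

The same function serves both variants because only the meaning of the per-column values changes; the two consensus sequences can differ at a position when the most frequent residue is not the highest-scoring one.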

We ran HHblits on unique sequences in the PDB and UniProt sequences to generate HMMs on database uniprot20_29Mar11, which is a database of HMMs created from a clustering of UniProt sequences at 20% identity (Remmert et al., 2012). We searched the Pfam HMMs with the HHblits-derived PDB and UniProt HMMs with HHsearch (Söding, 2005) to generate Pfam to PDB alignments through HMM–HMM alignments. We refer to these as ‘HH hits’.

2.2 Pfam E-value and FATCAT P-value cut-offs

To determine the cut-off of HMMER E-values and structure alignment P-values for each Pfam A present in the six sets of alignments, we collected those Pfam hits with HMMER E-value of <10⁻⁵, HMM coverage >0.9, and then selected the alignment with the largest number of match states assigned to residues with Cartesian coordinates in the PDB structures as the representative hit. A total of 5134 Pfams were selected. With HMMER3, we aligned each PDB sequence of these representative hits to all of the 5134 Pfam HMMs. The resulting data points were divided into the following two classes: same clan and different clan, depending on whether the two Pfams were in the same or different clans according to the clan definitions in the Pfam v26 MySQL database.

Smoothed density function curves were calculated using kernel density estimates (Sheather and Jones, 1991) in the R project (http://www.r-project.org/) by calculating probability density estimates of same-clan and different-clan prediction as a function of log10(E-value):

$$f(A) = \frac{1}{n} \sum_{i=1}^{n} K_h(A - A_i)$$

where A_i is the log10(E-value) of the i-th of n data points, and K_h is a Gaussian kernel with bandwidth h:

$$K_h(x) = \frac{1}{h\sqrt{2\pi}} \exp\left(-\frac{x^2}{2h^2}\right)$$

The probability at A is calculated using Bayes’ rule:

$$P(\mathrm{same} \mid A) = \frac{P(A \mid \mathrm{same})\,P(\mathrm{same})}{P(A \mid \mathrm{same})\,P(\mathrm{same}) + P(A \mid \mathrm{diff})\,P(\mathrm{diff})}$$

where P(A|same) and P(A|diff) are calculated from f(A) using the same-clan and different-clan sets of E-values, respectively. P(same) and P(diff) are the fractions of data points from the same-clan class and the different-clan class. From a value of A such that P(same|A) > 95%, we selected a threshold for the Pfam E-values of 10⁻⁵ (see ‘Results’ section).
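The classification step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a fixed bandwidth `h` instead of the Sheather and Jones bandwidth selection, pure Python instead of R, and made-up log10(E-value) samples; the function names are hypothetical.

```python
import math


def gaussian_kde(samples, h):
    """Return f(a) = (1/n) * sum_i K_h(a - A_i) for a Gaussian kernel K_h."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))

    def density(a):
        return norm * sum(math.exp(-0.5 * ((a - s) / h) ** 2) for s in samples)

    return density


def p_same_given_a(a, same, diff, h=0.5):
    """Bayes' rule: posterior probability of 'same clan' at log10(E-value) a."""
    total = len(same) + len(diff)
    prior_same = len(same) / total
    prior_diff = len(diff) / total
    numerator = gaussian_kde(same, h)(a) * prior_same
    denominator = numerator + gaussian_kde(diff, h)(a) * prior_diff
    return numerator / denominator


# Made-up log10(E-value) samples for the two classes (illustration only).
same_clan = [-20.0, -15.0, -12.0, -8.0, -6.0]
diff_clan = [-3.0, -1.0, 0.0, 1.0, 2.0]

strong = p_same_given_a(-10.0, same_clan, diff_clan)  # deep in the same-clan range
weak = p_same_given_a(1.0, same_clan, diff_clan)      # deep in the different-clan range
```

A threshold is then read off as the value of A at which the posterior crosses the chosen confidence level; with real data the densities would be evaluated on a grid of A values.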

To select a Pfam E-value threshold for HH hits, we applied the same procedure on the HH alignments, which contain 5387 Pfam hits. The threshold of HHsearch E-value for P(same|A) > 95% is 10⁻⁴ (see ‘Results’ section).

We performed structure alignment with the FATCAT program (Ye and Godzik, 2003) of each structure with every other structure in the 5134 Pfam set. The data points consisting of log10(P-values) were defined as either same clan or different clan. Kernel density estimates and Bayes’ rule were used to obtain P(same|A) where A is the log10(P-value) from FATCAT. From the value of A such that P(same|A) > 95%, we selected a threshold for the FATCAT P-values of 10⁻³ (see ‘Results’ section).

2.3 General greedy algorithm

From any set of alignments of PDB sequences to Pfam HMMs, we use the same general procedure based on a simple greedy algorithm to create a unique assignment of a Pfam to each residue in a PDB sequence. Such an assignment constitutes a Pfam ‘architecture’ or arrangement of domains in the PDB sequence, allowing only for short overlaps.

For a given PDB sequence, we start by assigning the hit with the best E-value. If there is any region in the query of >30 amino acids that occurs within the boundaries of the alignment to the best HMM but which is not aligned to HMM match states, we create a ‘split assignment’. A split assignment indicates that match states in the HMM align to separate non-contiguous regions of the query sequence. The residues in the inserted region of the query are then ‘unassigned’, which means they are available for subsequent assignments. For each additional hit in order by E-value (best to worst), we check whether it overlaps the current Pfam assignments by >10 residues on either end. If it does not, then an assignment is made. Again, long insertions in the query result in split assignments and the insertions are unassigned.
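The greedy pass described above can be sketched as follows. This is a simplified illustration, not the published implementation: hits are represented by hypothetical `Hit` records whose `blocks` are the query ranges aligned to HMM match states, the Pfam names are placeholders, and the overlap test compares each new hit against every earlier assignment individually.

```python
from dataclasses import dataclass

MAX_OVERLAP = 10  # residues of overlap tolerated with an earlier assignment
MIN_INSERT = 30   # an unaligned stretch longer than this triggers a split


@dataclass
class Hit:
    pfam: str
    evalue: float
    blocks: list  # inclusive (start, end) query ranges aligned to match states


def split_segments(blocks):
    """Join aligned blocks, leaving insertions of > MIN_INSERT residues open."""
    segments = []
    for start, end in sorted(blocks):
        if segments and start - segments[-1][1] - 1 <= MIN_INSERT:
            segments[-1] = (segments[-1][0], max(end, segments[-1][1]))
        else:
            segments.append((start, end))
    return segments


def greedy_assign(hits):
    """Accept hits from best to worst E-value unless they overlap too much."""
    accepted = []  # (pfam, segments, residue set)
    for hit in sorted(hits, key=lambda h: h.evalue):
        segments = split_segments(hit.blocks)
        residues = {r for s, e in segments for r in range(s, e + 1)}
        if any(len(residues & prev) > MAX_OVERLAP for _, _, prev in accepted):
            continue
        accepted.append((hit.pfam, segments, residues))
    return [(pfam, segments) for pfam, segments, _ in accepted]


hits = [
    Hit("PF_A", 1e-40, [(1, 100), (160, 250)]),  # 59-residue insertion: split
    Hit("PF_B", 1e-12, [(105, 155)]),            # fits inside the insertion
    Hit("PF_C", 1e-8, [(90, 200)]),              # overlaps PF_A by > 10: rejected
    Hit("PF_D", 1e-6, [(245, 300)]),             # overlaps PF_A by 6: accepted
]
architecture = greedy_assign(hits)
```

Because the residues inside a long insertion are never added to the residue set of the split assignment, they remain available to later hits, which is how `PF_B` is placed inside `PF_A` in this example.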

If at any time, the same Pfam model aligns more than once to a query sequence, we check if the HMM match states align only once to the query and in order allowing short overlaps of <10 amino acids in the HMM. If yes, then we combine them into one assignment to the HMM. The assignment is split if there are >30 residues between the assigned regions, and the intervening residues are left unassigned. If the assignments to the Pfam cover the HMM match states more than once, then there is more than one copy of the Pfam in the sequence (e.g. repeated domains), and multiple assignments of the Pfam are made.
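The decision between one split domain and several repeated copies can be sketched as below. The tuple layout and function name are assumptions made for illustration; the check follows the rule stated above, namely that fragments are combined only when their HMM ranges occur in order with fewer than 10 overlapping HMM positions.

```python
MAX_HMM_OVERLAP = 10  # overlapping HMM positions tolerated between fragments


def group_copies(hits):
    """Group hits of one Pfam into copies that each cover the HMM once.

    hits: (query_start, query_end, hmm_start, hmm_end) tuples, all inclusive.
    Returns a list of copies; each copy is a list of fragments to combine.
    """
    copies = []
    for hit in sorted(hits):  # order along the query sequence
        if copies:
            previous = copies[-1][-1]
            hmm_overlap = previous[3] - hit[2] + 1
            in_order = hit[2] > previous[2]
            if in_order and hmm_overlap < MAX_HMM_OVERLAP:
                copies[-1].append(hit)  # continues the same domain copy
                continue
        copies.append([hit])  # HMM restarts: a new copy (repeat)
    return copies


# One domain interrupted by an insertion: HMM positions 1-70, then 68-140.
split_domain = group_copies([(10, 80, 1, 70), (130, 200, 68, 140)])

# Three repeats: each fragment starts again near the beginning of the HMM.
repeats = group_copies([(10, 50, 1, 40), (55, 95, 1, 40), (100, 140, 2, 40)])
```

Whether a combined copy is then reported as a split assignment depends on the distance between its fragments on the query (more than 30 residues in the text above), which this sketch leaves to the caller.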

We also check whether the same Pfam aligned to different sequences within the same PDB entry. In some cases, these hits do not overlap in the HMM by >10 amino acids, and they are then combined into a single assignment.

In our procedure, we always used HMMER hits first, then HH hits.

2.4 Using structure alignments to improve Pfam assignments

We use structure alignment to verify whether Pfam–PDB alignments with weak E-values are correct and to extend short alignments to Pfam HMMs. To do so, we need to identify structures (or domains within structures) that cover Pfams in their entirety with good E-values. We call such structures exemplars for their Pfams. Only a subset of Pfams in the PDB has such high-quality alignments.

To identify exemplars, we first applied the greedy algorithm on all Pfam alignments in the six sets of sequences and consensus sequences with a conservative HMMER E-value of ≤10⁻⁵, obtaining split and combined Pfam assignments. Some split assignments may be possible where one component has significant E-value, whereas the other is weaker. Therefore, we continue the greedy algorithm with alignments with E-value of >10⁻⁵ if the same Pfam has already been assigned to the PDB sequence, up to an E-value of 1.0. We continued the greedy algorithm with the HH hits with an E-value cut-off of 10⁻⁴. The reason for applying the HMMER alignments before the HH alignments is discussed in the ‘Results’ section. For Pfams assigned in this procedure, we identify an exemplar structure, defined as the structure with the largest number of match states assigned to residues with Cartesian coordinates in the PDB entry, with a coverage of the Pfam HMM of at least 80%. HMM coverage is the number of the sequence residues with coordinates aligned to a Pfam HMM match state divided by the length of the model. In the event of a tie, the structure with the best E-value is used.
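Exemplar selection as defined above reduces to a filter on HMM coverage followed by a two-key ranking. The sketch below assumes a hypothetical record layout and placeholder structure identifiers; only the 80% coverage requirement and the tie-break on E-value come from the text.

```python
def hmm_coverage(covered_match_states, hmm_length):
    """Residues with coordinates aligned to match states / model length."""
    return covered_match_states / hmm_length


def pick_exemplar(candidates, min_coverage=0.8):
    """Choose the candidate with most covered match states; ties go to the
    best (lowest) E-value. Returns None when no candidate reaches min_coverage."""
    eligible = [
        c for c in candidates
        if hmm_coverage(c["covered"], c["hmm_length"]) >= min_coverage
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (-c["covered"], c["evalue"]))


# Placeholder candidates for one Pfam with a 200-state HMM (made-up values).
candidates = [
    {"pdb": "xxxA", "covered": 180, "hmm_length": 200, "evalue": 1e-50},
    {"pdb": "yyyA", "covered": 190, "hmm_length": 200, "evalue": 1e-30},
    {"pdb": "zzzA", "covered": 190, "hmm_length": 200, "evalue": 1e-45},
    {"pdb": "wwwA", "covered": 120, "hmm_length": 200, "evalue": 1e-60},
]
exemplar = pick_exemplar(candidates)
```

In this example the candidate with the best E-value overall is excluded because its coverage is only 60%, and the tie between the two candidates covering 190 match states is resolved in favour of the lower E-value.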

We divided the HMMER Pfam hits of all six sets into two non-overlapping sets: {Strong Hits} and {Weak Hits}. Strong hits are those hits with E-value of ≤10⁻⁵ and <10 residues missing from the N or C terminal end of the HMM, whereas weak hits comprise the remaining alignments. For each hit in {Weak Hits}, we checked whether there are exemplar structures for that Pfam and/or other Pfams in the same clan. If there are, we perform structure alignments with the FATCAT program (Ye and Godzik, 2003) on the region(s) of the weak hit structure not previously aligned to the {Strong Hits}. We performed this procedure separately for HH Pfam hits with E-value of ≤10⁻⁴.

If the FATCAT P-value is better than 10⁻³, we create an alignment of the PDB query to the Pfam HMM via the exemplar structure as a transitive alignment. For residue pairs AB and BC, (A to B) + (B to C) = (A to C). Here, A to B is the HMM to exemplar alignment, B to C is the structure alignment of the exemplar to the weak assignment and A to C is HMM to the weak assignment. Once this alignment is created, we move the alignment from {Weak Hits} to a new set {Struct Hits}.
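The transitive alignment amounts to composing two residue mappings. The following sketch assumes that each alignment is stored as a dictionary from one position to its aligned partner, which is a representation chosen here for illustration; the positions are made up.

```python
def compose_alignments(hmm_to_exemplar, exemplar_to_query):
    """Compose (A to B) and (B to C) into (A to C).

    A pair survives only if the exemplar residue is aligned in both inputs.
    """
    return {
        hmm_pos: exemplar_to_query[exemplar_pos]
        for hmm_pos, exemplar_pos in hmm_to_exemplar.items()
        if exemplar_pos in exemplar_to_query
    }


# A to B: HMM match state -> exemplar residue (from the sequence alignment).
hmm_to_exemplar = {1: 10, 2: 11, 3: 12, 4: 13}
# B to C: exemplar residue -> query residue (from the structure alignment).
exemplar_to_query = {11: 40, 12: 41, 13: 43, 14: 44}

hmm_to_query = compose_alignments(hmm_to_exemplar, exemplar_to_query)
```

HMM match state 1 drops out because its exemplar residue has no structural partner in the query, and exemplar residue 14 drops out because it is not aligned to any match state.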

2.5 The full algorithm for assigning Pfams to PDB sequences

The full procedure of creating Pfam assignments to PDB sequences is as follows. We have in hand six sets of alignments, {HMMER Strong Hits}, {HH Strong Hits}, {HMMER Struct Hits}, {HH Struct Hits}, {HMMER Weak Hits} and {HH Weak Hits}, the last two containing those weak hits (too short and/or too weak an E-value) for which structure alignment was not possible or did not produce a significant alignment. We use the {HMMER Strong Hits} first in the greedy algorithm until no more assignments can be made, and then continue with the {HH Strong Hits}. Second, we continue the greedy algorithm with the alignments in the {HMMER Struct Hits} and {HH Struct Hits} sets in that order until no more assignments can be made. Third, we apply the greedy algorithm to the remaining {HMMER Weak Hits} and {HH Weak Hits} with E-value of ≤10⁻⁵ (HMMER) or ≤10⁻⁴ (HH). These hits have strong statistical significance but >10 residues missing from the N or C terminal end of the HMM. Fourth, we proceed with the remaining {HMMER Weak Hits} and {HH Weak Hits} up to a value of 1.0, but we only add these if the same Pfam has already been assigned in one of the earlier steps. Some of these will be combined with earlier assignments to produce split assignments. Some will be repeated domains. Pfam B assignments are treated as weak hits and added only if the E-value is better than the appropriate threshold.
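The order in which the six hit sets are fed to the greedy algorithm can be sketched as a small driver. This is an illustration only: the record fields, the `try_assign` callback (standing in for the greedy overlap test) and the placeholder Pfam names are assumptions, and the special handling of Pfam B is omitted.

```python
CUTOFF = {"HMMER": 1e-5, "HH": 1e-4}  # significance cut-offs from the text
SOURCES = ("HMMER", "HH")             # HMMER hits are always used before HH hits


def run_stages(hits, try_assign):
    """hits: dicts with 'kind' (strong/struct/weak), 'source', 'pfam', 'evalue'.
    try_assign(hit) -> bool is the greedy acceptance test supplied by the caller."""
    accepted, assigned_pfams = [], set()

    def feed(batch, require_known_pfam=False):
        for hit in sorted(batch, key=lambda h: h["evalue"]):
            if require_known_pfam and hit["pfam"] not in assigned_pfams:
                continue
            if try_assign(hit):
                accepted.append(hit)
                assigned_pfams.add(hit["pfam"])

    # Steps 1 and 2: strong hits, then structure-verified hits.
    for kind in ("strong", "struct"):
        for src in SOURCES:
            feed([h for h in hits if h["kind"] == kind and h["source"] == src])
    weak = [h for h in hits if h["kind"] == "weak"]
    # Step 3: incomplete hits that still pass the significance cut-off.
    for src in SOURCES:
        feed([h for h in weak if h["source"] == src and h["evalue"] <= CUTOFF[src]])
    # Step 4: remaining weak hits up to E = 1.0, only for Pfams assigned earlier.
    for src in SOURCES:
        feed([h for h in weak if h["source"] == src
              and CUTOFF[src] < h["evalue"] <= 1.0], require_known_pfam=True)
    return accepted


hits = [
    {"kind": "strong", "source": "HMMER", "pfam": "PF_X", "evalue": 1e-30},
    {"kind": "strong", "source": "HH", "pfam": "PF_Y", "evalue": 1e-10},
    {"kind": "weak", "source": "HMMER", "pfam": "PF_X", "evalue": 0.5},
    {"kind": "weak", "source": "HMMER", "pfam": "PF_Z", "evalue": 0.5},
    {"kind": "weak", "source": "HH", "pfam": "PF_W", "evalue": 1e-6},
    {"kind": "weak", "source": "HMMER", "pfam": "PF_Q", "evalue": 5.0},
]
order = [h["pfam"] for h in run_stages(hits, try_assign=lambda hit: True)]
```

With an acceptance test that always succeeds, the example shows the staging alone: the second `PF_X` hit is admitted in the last step because `PF_X` was already assigned, whereas `PF_Z` and `PF_Q` are never added.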

3 RESULTS

3.1 Establishing E-value and P-value cut-offs

We investigated the HMMER3 E-value level at which Pfam assignments are likely to be reliable. We created a set of 5134 Pfams with E-values to unique PDB sequences ≤10⁻⁵ and HMM coverage ≥90%. The associated PDB for each sequence was the one with the largest number of match states assigned to residues with Cartesian coordinates in the PDB entry. We aligned the PDB sequences against the other 5133 Pfams in the set with HMMER3 and classified the resulting alignments and E-values depending on whether the PDB sequence and the Pfam belonged to the same clan or different clans, according to Pfam v. 26. The probability density functions and classification functions versus log10(E-value) are shown in Figure 1a. The classification function refers to how likely the Pfam of the query sequence and the Pfam of the hit HMM belong to the same or different clan as a function of log10(E-value). A hit has equal probability of being in the same clan as a different clan when the E-value is 0.01 (log10 = −2). When the E-value is 10⁻⁵, the probability that a sequence belongs to the same clan is >95%. In this article, we define a Pfam assignment to be a strong assignment when its E-value is ≤10⁻⁵.

Fig. 1.

Probability density functions and classification functions of Pfam E-values by HMMER and HHsearch and FATCAT P-values. (a) Pfam E-values from the exemplars and Pfam A v26 profile database by HMMER3. Only log10(E-value) from −10 to 5 are shown. (b) Pfam E-values by HHsearch. Only log10(E-value) from −10 to 1 are shown. (c) FATCAT P-values. Only log10(P-value) from −5 to 0 are shown

The same analysis was performed for the HHsearch alignments, using a set of 5387 Pfams with E-value of ≤10⁻⁴. Figure 1b shows the classification functions for the HH hits. When the HHsearch E-value is ≤10⁻⁴, the probability of being in the same clan is >95%.

FATCAT provides a P-value for the significance of the structural similarity between two proteins. We ran FATCAT on all pairs of structures in the set of 5134 PDB structures used for the evaluation of HMMER. The P-values were also divided into two classes: same clan and different clan. The probability density and classification functions are shown in Figure 1c. When the P-value is <0.001, the probability that two structures are in the same clan is >95%. FATCAT suggests a P-value cut-off of 0.05 for two similar structures with a 95% confidence interval. Our cut-off is more restrictive because we are trying to identify not only similar structural patterns but also probable homology. In this article, we use the stricter P-value cut-off of 0.001.

3.2 Comparison of HMMER and HHsearch

To determine the relative utility of HMMER3 and HHsearch for assigning Pfams to the PDB, we performed alignments of PDB sequences and UniProt sequences against the Pfam HMMs using both programs. We first calculated PSI-BLAST profiles using one round of search on UniRef90 for all unique protein sequences in the PDB. From these profiles, we determined consensus sequences using the most common amino acid in each position (given in the PSI-BLAST profile output in percentage terms) and the highest PSSM scoring amino acid. The means and SDs of the sequence identities between PDB and PDB-percent and PDB-pssm are 65.1 ± 10.0% and 63.3 ± 11.2%, respectively. We ran HMMER3 with the original PDB and UniProt sequences and their consensus sequences as queries against the HMMs in Pfam-A. A probability density estimate of the E-values from the original sequences demonstrated a maximum in the density at an E-value of 10⁻²⁰, whereas the consensus sequences were shifted to a mode at 10⁻²⁵. At a poor E-value of 10⁻⁵, the original sequences have almost twice as many hits as the consensus sequences, which have all been shifted to higher statistical significance. A total of 38% of the consensus alignments were longer than the original-sequence alignments, whereas only 10% were shorter. Most of the shorter assignments occurred when the alignments of the consensus sequences were broken into two or more fragments, when the original PDB sequence alignment was not. These fragments will be joined in the application of the greedy algorithm.

We applied the greedy algorithm on the alignments to Pfam from the original PDB and UniProt sequences alone and a set combining these alignments with those from the consensus PDB and UniProt sequences. The set of assignments from the combined consensus and original sequence alignments contains 371 more Pfam-As than the original PDB and UniProt sequences and increases residue assignments by 4%.

It is often assumed that HMM–HMM alignments should be better than sequence-HMM alignments; therefore, we compared the Pfam assignments from the consensus sequences with HMMER3 (described earlier in the text) and those from HMM–HMM alignments produced by HHsearch, applying the general greedy algorithm to each set with the E-value cut-offs given above. HHsearch produced assignments to 2% more entries and 2% more sequences than HMMER3. HHsearch also produced a much larger number of weak hits, >65% more than HMMER3. This indicates that HHsearch may be most useful when HMMER3 fails to make any assignment. HHsearch produced 60% fewer assignments of repeats than HMMER3 did (1881 versus 5006). HMMER and HHsearch assignments to the PDB are compared in Table 1.

However, the most significant drawback of the HHsearch assignments was the tendency to assign more remotely related Pfams in a Pfam clan compared with the HMMER3 assignments. We compared Pfam assignments from HMMER3 and HHsearch with ≥80% overlap on the PDB sequence. A total of 120 517 (91.6%) of 131 585 domain assignments were with the same Pfam in the two sources, whereas 10 629 assignments (8.1%) were to different Pfams within the same clan. Only 439 assignments (0.3%) belonged to different clans. Because HHsearch is expected to find more remote hits than HMMER, it seems likely that the HMMER assignment is correct, whereas HHsearch assigns a more remotely related Pfam. As we want to make correct assignments at the Pfam level and the clan level, we prefer the HMMER assignments over the HHsearch assignments, when both are statistically significant.

3.3 Structure alignments

Both HMMER3 and HHsearch produce many alignments to PDB sequences with weak E-values and/or alignments shorter than the Pfam model definition. We investigated whether we could confirm some of the weak hits and extend short alignments by comparing structures. We define exemplars as structure/Pfam pairs with good HMMER E-values (≤10⁻⁵) or HH E-value (≤10⁻⁴) to the Pfam and HMM coverage of at least 80%. A total of 81% of Pfams in the PDB have exemplars. The structures of weak Pfam hits were aligned to the exemplar structures in the same clan, including the Pfam of the weak hit. A total of 7381 structure assignments were added to our Pfam assignments by replacing the original alignment to the HMM by a transitive alignment through the structure alignment. The number of PDB residues aligned to Pfam HMMs for these sequences rises by 36%. An example is shown in Figure 2.

Fig. 2.

Structure alignments verify and expand the Pfam assignments for the PDB entry 2EAB. Left: the initial Pfam assignments from the consensus sequences. Right: the Pfam assignments to the exemplar 1H54. Middle: the Pfam assignments of 2EAB after structure alignment

The ability of structure alignments to verify weak Pfam assignments varies with the statistical significance of the Pfam alignment. At E-values better than 10⁻¹⁰, >80% of structure alignments are statistically significant (P-values of <0.001); these alignments are used solely to extend the Pfam domain assignments. At E-values of >0.01 and <10, about one-third of assignments are confirmed by structure alignment.

3.4 Pfam architectures for the PDB

Several domain assignments through Pfam and other classifications are publicly available. In Table 2
