Homology detection plays a key role in bioinformatics, and the substitution matrix is one of its most important components. Thus, besides improving alignment algorithms, another effective way to enhance the accuracy of homology detection is to use proper substitution matrices or even to construct new ones. Studying the features of various matrices and comparing their performance in homology detection enables us to choose the most suitable matrix for a specific application. In this paper, taking the BLOSUM matrices as an example, some detailed features of matrices in homology detection are studied by calculating the distributions of the numbers of recognized proteins over different sequence identities and sequence lengths. Our results clearly show that different matrices have different preferences and abilities in recognizing remote homologous proteins. Furthermore, the detailed features of the various matrices can be used to improve the accuracy of homology detection.
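The distribution of recognized proteins over sequence identity described above can be sketched as a simple binning step. This is a minimal illustration and not the authors' pipeline: the function name `identity_histogram`, the 10% bin width, and the input values are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def identity_histogram(identities: Iterable[float], bin_width: int = 10) -> Dict[int, int]:
    """Count recognized proteins per sequence-identity bin.

    `identities` holds the percent identity (0 to 100) of each protein that a
    given substitution matrix recognized. Keys of the result are the lower
    edges of the bins. The bin width is an assumption for illustration.
    """
    counts: Counter = Counter()
    for value in identities:
        if not 0 <= value <= 100:
            raise ValueError(f"identity out of range: {value}")
        # Place exactly 100% into the last bin instead of opening a new one.
        lower = min(int(value // bin_width) * bin_width, 100 - bin_width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Hypothetical identities of proteins recognized with one matrix.
example = [12.5, 18.0, 25.0, 100.0]
histogram = identity_histogram(example)
```

Running the same binning on the hits obtained with each matrix, and likewise on sequence lengths, would give per-matrix distributions that can be compared bin by bin.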
The orientation between the backbone residues of proteins is defined based on the local configurations, and the corresponding preferences are analyzed statistically. It is found that all residue pairs have specific orientational preferences. The statistical analysis concentrates mainly on the orientational distributions for two kinds of residue groupings, based on hydrophobicity and on secondary structural features. The statistics for these two types of groupings show different orientational preferences: for the former grouping the orientational preference is rather weak, while for the latter it is strong. This suggests that the formation of local structures and of secondary structures is highly related to the orientational preferences.
Sequence alignment is a common method for finding structurally conserved or similar regions of proteins. However, sequence alignment is often inaccurate if the sequence identity between the sequences to be aligned is less than 30%. This is because, in such sequences, different residues may play similar structural roles, and they are incorrectly aligned when a substitution matrix over all 20 residue types is used. Based on the similarity of their physicochemical features, residues can be clustered into a few groups. With such simplified alphabets, the complexity of protein sequences is reduced while the key information encoded in the sequences is retained. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, using a database of aligned protein structures (DAPS), a new clustering method based on substitution scores is proposed for the grouping of residues, and substitution matrices of residues at different levels of simplification are constructed. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are then applied to the recognition of structurally conserved or similar regions of proteins by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced alphabet, with N around 9.
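The idea of rewriting sequences in a reduced alphabet can be sketched as follows. The 9-group partition `EXAMPLE_GROUPS` is a hypothetical grouping chosen by rough physicochemical similarity for illustration only; it is not the clustering derived from DAPS in this work, and the helper names are likewise assumptions.

```python
from typing import Dict, List

# Hypothetical 9-group partition of the 20 residues (illustration only,
# NOT the grouping obtained by the clustering method of the paper).
EXAMPLE_GROUPS: List[str] = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KR", "H", "G", "P"]


def build_mapping(groups: List[str]) -> Dict[str, str]:
    """Map every residue to the first letter of its group."""
    mapping: Dict[str, str] = {}
    for group in groups:
        representative = group[0]
        for residue in group:
            mapping[residue] = representative
    return mapping


def reduce_sequence(seq: str, mapping: Dict[str, str]) -> str:
    """Rewrite a sequence in the reduced alphabet; gaps and unknown symbols are kept."""
    return "".join(mapping.get(residue, residue) for residue in seq)


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two aligned sequences, ignoring gap columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


mapping = build_mapping(EXAMPLE_GROUPS)
raw_identity = identity("ILVK", "VLIR")
reduced_identity = identity(reduce_sequence("ILVK", mapping),
                            reduce_sequence("VLIR", mapping))
```

In this toy pair the residues differ at three of four positions in the full alphabet but fall into the same groups, so the reduced sequences coincide. This is the effect the reduced alphabets are meant to exploit for low-identity sequences.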
LI Jing1 & WANG Wei1,2 1 National Laboratory of Solid State Microstructure and Department of Physics, Nanjing University, Nanjing 210093, China