Abstract:
Predicting secondary structure from the protein sequence plays a crucial role in predicting the 3D structure and understanding the function of proteins. As new genes and proteins are discovered, the protein databases and datasets that can be used for training prediction models grow considerably. A two-stage hybrid classifier that employs dynamic Bayesian networks and a support vector machine (SVM) has been shown to provide state-of-the-art prediction accuracy. However, SVMs are not efficient on large datasets because model training involves a quadratic optimization problem. In this paper, we implemented two techniques on the CB513 benchmark for reducing the number of
samples in the training set of the SVM. The first method randomly selects a fraction of the data samples from the training set using a stratified selection strategy. This approach can remove approximately 50% of the data samples from the training set and reduce the model training time by 82.38% without significantly decreasing the prediction accuracy. The second
method clusters the data samples with a hierarchical clustering algorithm and replaces the training set samples with the nearest neighbors of the cluster centers. We employed single linkage clustering, average linkage clustering, and Ward's method for clustering the feature vectors. We optimized the number of clusters and the maximum number of nearest neighbors by computing the prediction accuracy on validation sets. We observed that clustering can also reduce the size of the training set by 50% without sacrificing prediction accuracy. Among the clustering techniques, Ward's method provided the best accuracy on test data.
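
As an illustration only, the two sample-reduction strategies summarized above might be sketched in plain Python as follows. The function names, the naive toy-scale Ward implementation, the Euclidean distance, and the choice to pick nearest neighbors only among each cluster's own members are all assumptions made for this sketch; they are not taken from the implementation evaluated in the paper.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices that keep about `fraction` of each class.

    Hypothetical helper: sampling is done separately per class label so
    that the class proportions of the full set are roughly preserved.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for indices in by_class.values():
        n_keep = max(1, round(fraction * len(indices)))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


def sq_dist(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with Ward's merge criterion.

    Each step merges the pair of clusters whose union gives the smallest
    increase in within-cluster sum of squares. The pairwise search makes
    this suitable for toy inputs only (assumption: small data).
    """
    clusters = [[i] for i in range(len(points))]
    centers = [list(p) for p in points]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                cost = na * nb / (na + nb) * sq_dist(centers[a], centers[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        na, nb = len(clusters[a]), len(clusters[b])
        # The merged center is the size-weighted mean of the two centers.
        centers[a] = [(na * x + nb * y) / (na + nb)
                      for x, y in zip(centers[a], centers[b])]
        clusters[a].extend(clusters[b])
        del clusters[b], centers[b]
    return clusters, centers


def nearest_to_centers(points, clusters, centers, max_neighbors):
    """Keep up to `max_neighbors` members closest to each cluster center.

    Assumption: neighbors are chosen among the cluster's own members.
    """
    kept = []
    for members, center in zip(clusters, centers):
        ranked = sorted(members, key=lambda i: sq_dist(points[i], center))
        kept.extend(ranked[:max_neighbors])
    return sorted(kept)
```

In both cases the returned indices would select the reduced training set passed to the SVM. In the setting described above, the sampling fraction, the number of clusters, and `max_neighbors` would be chosen by validation accuracy.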