Evaluation of different biological data and computational classification methods for use in protein interaction prediction

doi:10.1002/prot.20865

Comparative Study

Yanjun Qi¹, Ziv Bar-Joseph, Judith Klein-Seetharaman

PMID: 16450363
PMCID: PMC3250929
DOI: 10.1002/prot.20865

Proteins. 2006 May 15;63(3):490-500.

Affiliation

¹ School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA.

Abstract

Protein-protein interactions play a key role in many biological systems. High-throughput methods can directly detect the set of interacting proteins in yeast, but the results are often incomplete and exhibit high false-positive and false-negative rates. Recently, many different research groups independently suggested using supervised learning methods to integrate direct and indirect biological data sources for the protein interaction prediction task. However, the data sources, approaches, and implementations varied. Furthermore, the protein interaction prediction task itself can be subdivided into prediction of (1) physical interaction, (2) co-complex relationship, and (3) pathway co-membership. To investigate systematically the utility of different data sources and the way the data are encoded as features for predicting each of these types of protein interactions, we assembled a large set of biological features and varied their encoding for use in each of the three prediction tasks. Six different classifiers were used to assess the accuracy in predicting interactions: Random Forest (RF), RF similarity-based k-Nearest-Neighbor, Naïve Bayes, Decision Tree, Logistic Regression, and Support Vector Machine. For all classifiers, the three prediction tasks had different success rates, and co-complex prediction appears to be an easier task than the other two. Independently of prediction task, however, the RF classifier consistently ranked as one of the top two classifiers for all combinations of feature sets. Therefore, we used this classifier to study the importance of different biological datasets. First, we used the splitting function of the RF tree structure, the Gini index, to estimate feature importance. Second, we determined classification accuracy when only the top-ranking features were used as an input in the classifier. We find that the importance of different features depends on the specific prediction task and the way they are encoded. Strikingly, gene expression is consistently the most important feature for all three prediction tasks, while the protein interactions identified using the yeast two-hybrid system were not among the top-ranking features under any condition.
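
The Gini-based ranking mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' Random Forest code: it scores each feature by the largest Gini impurity decrease achievable with a single threshold split, which is the quantity a tree node optimizes, whereas a Random Forest accumulates such decreases over many nodes and trees. The feature names and the toy data are hypothetical and were constructed for illustration only; they do not reproduce any result of the study.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_gini_decrease(values, labels):
    """Largest impurity decrease over all threshold splits of one feature."""
    n = len(labels)
    parent = gini_impurity(labels)
    best = 0.0
    # Candidate thresholds: every distinct value except the largest;
    # a pair goes left when value <= threshold, right otherwise.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted_child = (
            len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
        ) / n
        best = max(best, parent - weighted_child)
    return best


def rank_features(rows, labels, names):
    """Return (name, score) pairs sorted from most to least informative."""
    scores = {
        name: best_gini_decrease([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one row per protein pair, label 1 = interacting.
names = ["expression_correlation", "y2h_detected"]
rows = [(0.9, 1), (0.8, 0), (0.7, 1), (0.2, 0), (0.1, 1), (0.3, 0)]
labels = [1, 1, 1, 0, 0, 0]

ranking = rank_features(rows, labels, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Keeping only the first few entries of `ranking` and retraining on those columns would correspond to the second analysis described in the abstract, where accuracy is measured using only the top-ranking features.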

Figures

**Fig. 1**
Venn diagram of the overlap between the three gold standard positive datasets.

**Fig. 2**
Precision versus Recall curves for the six classifiers, Support Vector Machine (SVM), Decision Tree (DT), Naïve Bayes (NB), Logistic Regression (LR), Random Forest (RF), and RF-based kNN approach (kRF) for the co-complex prediction task using the MIPS data set. (a) Features were encoded as “Summary.” (b) Features were encoded as “Detailed.”

**Fig. 3**
R50 Partial AUC score comparison between the six classification methods for all three prediction tasks. Each subgraph describes one specific prediction task: (a) protein co-complex relationship (MIPS), (b) direct protein–protein interaction (DIP), and (c) protein co-pathway relationship (KEGG). Within each subgraph, the white and gray bars represent “Detailed” and “Summary” encodings of the features, respectively. Abbreviations of the six classifiers are the same as in Figure 2. The averages of the R50 values are shown as bars. For each bar, the confidence interval of the mean of the estimated R50 is drawn as a thin error bar.

**Fig. 4**
Precision versus Recall curves for the co-complex prediction task (MIPS) when varying the training set size to have (a) 50 interaction pairs, (b) 200 interaction pairs, (c) 500 interaction pairs, (d) 1000 interaction pairs. Features were encoded using the “Detailed” encoding style.

**Fig. 5**
R50 Partial AUC score comparison when using the top 6 ranked feature categories for each prediction task. The features were added one after the other according to the order in Table IV. The Random Forest classifier with “Detailed” feature encoding was used for this experiment. Each subgraph describes one specific prediction task: (a) protein co-complex relationship (MIPS), (b) direct protein–protein interaction (DIP), and (c) protein co-pathway relationship (KEGG). Each bar represents the score using all features up to that rank (1 to 6). The seventh bar presents the R50 score when using the full set of features.
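
The two evaluation measures named in the captions above can be sketched in a few lines. This excerpt does not define R50, so the sketch assumes it is the area under the ROC curve accumulated until the 50th negative (false-positive) prediction, normalized to the range 0 to 1; that definition, the function names, and the toy scores are assumptions made for illustration, and tied scores are not treated specially.

```python
def precision_recall_points(scores, labels):
    """(precision, recall) after each prediction, ranked by descending score.

    labels are 0/1 integers, with 1 marking a true interaction.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    true_pos = 0
    points = []
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        points.append((true_pos / k, true_pos / total_pos))
    return points


def partial_auc(scores, labels, n_neg=50):
    """ROC area up to the first n_neg false positives, scaled to [0, 1].

    With n_neg=50 this matches the assumed meaning of R50.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    total_neg = len(labels) - total_pos
    limit = min(n_neg, total_neg)
    if total_pos == 0 or limit == 0:
        return 0.0
    true_pos = 0
    false_pos = 0
    area = 0
    for _, label in ranked:
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
            # Each false positive adds one ROC column of height true_pos.
            area += true_pos
            if false_pos == limit:
                break
    return area / (total_pos * limit)


# Hypothetical classifier scores for five candidate protein pairs.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]

pr_points = precision_recall_points(scores, labels)
r2 = partial_auc(scores, labels, n_neg=2)  # tiny cutoff to suit the toy data
print(pr_points)
print(r2)
```

Cutting the ROC curve off after a fixed number of false positives focuses the score on the top of the ranking, which is the region that matters when true interactions are a small fraction of all candidate pairs.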

Cited by

Atypical cytostatic mechanism of N-1-sulfonylcytosine derivatives determined by in vitro screening and computational analysis.
Supek F, Kralj M, Marjanović M, Suman L, Smuc T, Krizmanić I, Zinić B. Invest New Drugs. 2008 Apr;26(2):97-110. doi: 10.1007/s10637-007-9084-1. Epub 2007 Sep 27. PMID: 17898928
Predicting protein targets for drug-like compounds using transcriptomics.
Pabon NA, Xia Y, Estabrooks SK, Ye Z, Herbrand AK, Süß E, Biondi RM, Assimon VA, Gestwicki JE, Brodsky JL, Camacho CJ, Bar-Joseph Z. PLoS Comput Biol. 2018 Dec 7;14(12):e1006651. doi: 10.1371/journal.pcbi.1006651. eCollection 2018 Dec. PMID: 30532261
Exploring Machine Learning Algorithms to Unveil Genomic Regions Associated With Resistance to Southern Root-Knot Nematode in Soybeans.
Canella Vieira C, Zhou J, Usovsky M, Vuong T, Howland AD, Lee D, Li Z, Zhou J, Shannon G, Nguyen HT, Chen P. Front Plant Sci. 2022 May 3;13:883280. doi: 10.3389/fpls.2022.883280. eCollection 2022. PMID: 35592556
Supervised learning and prediction of physical interactions between human and HIV proteins.
Dyer MD, Murali TM, Sobral BW. Infect Genet Evol. 2011 Jul;11(5):917-23. doi: 10.1016/j.meegid.2011.02.022. Epub 2011 Mar 5. PMID: 21382517
Computational prediction of the human-microbial oral interactome.
Coelho ED, Arrais JP, Matos S, Pereira C, Rosa N, Correia MJ, Barros M, Oliveira JL. BMC Syst Biol. 2014 Feb 27;8:24. doi: 10.1186/1752-0509-8-24. PMID: 24576332
