GENE ONTOLOGY BASED SIMULATION FOR FEATURE SELECTION

Christopher E. Gillies, Mohammad-Reza Siadat, Nilesh V. Patel, George Wilson

2011

Abstract

Researchers are increasingly interested in techniques that incorporate prior biological knowledge into gene expression profile classifiers. Specifically, they want to learn how classification is affected when prior knowledge is incorporated into a classifier instead of relying only on the statistical properties of the dataset. In this paper, we investigate this impact through simulation. Our simulation relies on an algorithm that generates gene expression data from the Gene Ontology. We discuss experiments comparing two classifiers, one trained using only statistical properties and one trained with a combination of statistical properties and Gene Ontology knowledge. Experimental results suggest that incorporating Gene Ontology information improves classifier performance. In addition, we discuss how the distance between the means of the class distributions and the training sample size relate to classification accuracy.
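The last point of the abstract, how the distance between class means and the training sample size relate to accuracy, can be illustrated with a toy simulation. The sketch below is not the authors' Gene Ontology based generator: it assumes independent unit-variance Gaussian features, a nearest-centroid classifier, and hypothetical names (`simulate`, `fit_centroids`, `predict`, `accuracy`, `delta`) chosen only for illustration.

```python
import random
import statistics


def simulate(n_per_class, delta, n_features, rng):
    """Draw n_per_class samples per class; class 1 feature means are shifted by delta."""
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = [rng.gauss(label * delta, 1.0) for _ in range(n_features)]
            data.append((x, label))
    return data


def fit_centroids(train):
    """Estimate one mean vector per class (nearest-centroid classifier)."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def accuracy(delta, n_train, n_features=5, n_test=500, seed=0):
    """Train on n_train samples per class, test on n_test samples per class."""
    rng = random.Random(seed)
    train = simulate(n_train, delta, n_features, rng)
    test = simulate(n_test, delta, n_features, rng)
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, x) == y for x, y in test)
    return hits / len(test)


if __name__ == "__main__":
    # Sweep the mean shift and the training size (values are arbitrary).
    for delta in (0.25, 1.0, 2.0):
        for n_train in (5, 50):
            print(delta, n_train, round(accuracy(delta, n_train), 3))
```

Under these assumptions, accuracy stays near chance when `delta` is zero regardless of sample size, and the training size matters most when the class means are close together.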



Paper Citation


in Harvard Style

Gillies, C. E., Siadat, M.-R., Patel, N. V. and Wilson, G. (2011). GENE ONTOLOGY BASED SIMULATION FOR FEATURE SELECTION. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011), ISBN 978-989-8425-79-9, pages 286-294. DOI: 10.5220/0003665502940302


in Bibtex Style

@conference{kdir11,
author={Christopher E. Gillies and Mohammad-Reza Siadat and Nilesh V. Patel and George Wilson},
title={GENE ONTOLOGY BASED SIMULATION FOR FEATURE SELECTION},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},
year={2011},
pages={286-294},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003665502940302},
isbn={978-989-8425-79-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)
TI - GENE ONTOLOGY BASED SIMULATION FOR FEATURE SELECTION
SN - 978-989-8425-79-9
AU - Gillies, Christopher E.
AU - Siadat, Mohammad-Reza
AU - Patel, Nilesh V.
AU - Wilson, George
PY - 2011
SP - 286
EP - 294
DO - 10.5220/0003665502940302