Prediction of Essential Genes based on Machine Learning and Information Theoretic Features

Dawit Nigatu, Werner Henkel


Computational tools make the prediction of essential genes (EGs) relatively simple; the alternative is costly and tedious gene knockout experiments. We present a machine learning based predictor that uses information-theoretic features derived exclusively from DNA sequences. The features are entropy, mutual information, conditional mutual information, and Markov chain models. We employed a support vector machine (SVM) classifier and predicted the EGs in 15 prokaryotic genomes. A fivefold cross-validation on the bacteria E. coli, B. subtilis, and M. pulmonis resulted in AUC scores of 0.85, 0.81, and 0.89, respectively. In cross-organism prediction, the EGs of a given bacterium are predicted using a model trained on the rest of the bacteria. This yielded AUC scores ranging from 0.66 to 0.9, with an average of 0.8. The average AUC of the classifier on a one-to-one prediction among E. coli, B. subtilis, and Acinetobacter is 0.85. The performance of our predictor is comparable with that of recent state-of-the-art predictors. Given that we used only sequence information to address a much more complex problem, we consider these results very good.
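
To make the kind of feature named in the abstract concrete, the following is a minimal sketch of two sequence-only quantities: the Shannon entropy of the base composition and the mutual information between bases a fixed distance apart. The abstract does not give the exact feature definitions, so the function names, the `lag` and `max_lag` parameters, and the choice of single-nucleotide symbols are assumptions made for illustration, not the authors' implementation. The sketch expects a non-empty sequence.

```python
import math
from collections import Counter
from typing import List


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the single-base distribution of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(seq: str, lag: int = 1) -> float:
    """Mutual information (in bits) between bases that are `lag` positions apart."""
    pairs = list(zip(seq, seq[lag:]))  # (base at i, base at i + lag)
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    # sum over observed pairs of p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )


def feature_vector(seq: str, max_lag: int = 3) -> List[float]:
    """Hypothetical feature layout: entropy followed by MI at lags 1..max_lag."""
    features = [shannon_entropy(seq)]
    features += [mutual_information(seq, k) for k in range(1, max_lag + 1)]
    return features


if __name__ == "__main__":
    gene = "ATGGCGTACGTTAGCCTAGGCTAACGTTAG"  # made-up example sequence
    print(feature_vector(gene))
```

One such vector per gene, paired with an essential or non-essential label, is the kind of input an SVM classifier could be trained on; conditional mutual information and Markov chain features would extend the vector in the same way.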



Paper Citation

in Harvard Style

Nigatu D. and Henkel W. (2017). Prediction of Essential Genes based on Machine Learning and Information Theoretic Features. In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017) ISBN 978-989-758-214-1, pages 81-92. DOI: 10.5220/0006165700810092

in Bibtex Style

author={Dawit Nigatu and Werner Henkel},
title={Prediction of Essential Genes based on Machine Learning and Information Theoretic Features},
booktitle={Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017)},

in EndNote Style

JO - Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 3: BIOINFORMATICS, (BIOSTEC 2017)
TI - Prediction of Essential Genes based on Machine Learning and Information Theoretic Features
SN - 978-989-758-214-1
AU - Nigatu D.
AU - Henkel W.
PY - 2017
SP - 81
EP - 92
DO - 10.5220/0006165700810092