
The graph shows that all tested variants of the 
proposed classifier outperformed the C5.0 tree by 
10-30%. The other standard classifiers tested were 
outperformed as well, especially when a small 
number of topics was used. When the dimensionality 
reduction is less aggressive, information about word 
order in documents does not improve the 
classification much. 
8  CONCLUSIONS 
We proposed a network representation of text 
documents that captures information about token 
sequences and makes it possible to exploit the 
features extracted by latent semantic analysis. We 
then illustrated how this network representation 
improves classification accuracy. 
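To make the idea concrete, a context network over token sequences could be sketched as follows. This is a minimal illustrative sketch only: the node/edge construction, the `window` parameter, and the function name are our assumptions, not the paper's exact definition.

```python
from collections import defaultdict

def build_context_network(tokens, window=2):
    """Sketch of a context network: nodes are tokens, and weighted
    directed edges connect tokens that co-occur within a sliding
    window, so word-order information is preserved.
    (Illustrative assumption; the paper's construction may differ.)"""
    edges = defaultdict(int)
    for i, token in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            edges[(token, tokens[j])] += 1  # direction keeps order info
    return dict(edges)

doc = "latent semantic analysis maps documents to latent topics".split()
network = build_context_network(doc, window=2)
```

With `window=2` only adjacent token pairs become edges; a larger window would connect more distant tokens, trading edge sparsity for richer context.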
When information about context was present in the 
input features, the classifiers performed 
considerably better, especially when the 
dimensionality reduction was significant. We 
achieved a 10-30% improvement over the standard 
representation combined with the kNN or C5.0 
algorithms. 
The size of the context window influences 
classification accuracy less markedly. We observed 
that a larger context yields a slightly better 
classifier: the largest context of ten tokens 
outperformed the shortest context of two tokens by 
2% on average. 
Possible modifications of the proposed method 
include: 
  - tokenization of documents into n-grams instead 
of words before SVD and the context networks 
are applied, 
  - application of different methods for 
constructing the context topic vector u. 
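The first modification, tokenizing into n-grams before SVD and the context networks are applied, could look like the following sketch. The function name and the default n-gram size are our illustrative assumptions.

```python
def word_ngrams(text, n=2):
    """Tokenize a document into word n-grams (bigrams by default)
    that would replace single-word tokens as the input units for
    SVD and the context network construction.
    (Illustrative sketch; granularity is an assumption.)"""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = word_ngrams("Latent semantic analysis of text documents")
```

Each n-gram then plays the role a single word played before, so the rest of the pipeline is unchanged.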
Our future work will focus on improving the 
algorithm to speed up the construction and 
comparison of larger context networks. 
 
KDIR 2012 - International Conference on Knowledge Discovery and Information Retrieval