An Episode-based Approach to Identify Website User Access Patterns

Madhuka Udantha; Surangika Ranathunga; Gihan Dias

doi:10.5220/0005752703430350

2016

Abstract

Mining web access log data is a popular technique for identifying frequent access patterns of website users. Many mining techniques, such as clustering, sequential pattern mining and association rule mining, can identify these frequent access patterns. Each can find interesting access patterns and group users, but none can identify the slight differences between the access patterns included in an individual cluster. In reality, these differences could carry important information about attacks. This paper introduces a methodology that identifies access patterns at a much finer level of detail than traditional clustering techniques, such as nearest-neighbour-based techniques and classification techniques, provide. The technique uses the concept of episodes to represent web sessions, and these episodes are expressed as regular expressions. To the best of our knowledge, this is the first application of regular expressions to identifying user access patterns in web server log data. In addition to identifying frequent patterns, we demonstrate that the technique can identify access patterns that occur rarely, which traditional clustering mechanisms would simply have treated as noise.
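As an illustration only, the idea of expressing episodes as regular expressions over web sessions might be sketched as follows. This is not the authors' implementation: the page-to-symbol mapping, the episode names, the patterns, and the `max_count` threshold are all hypothetical choices made for this sketch. Each session is encoded as a string of page symbols, matched against the episode patterns, and episodes that match only a few sessions are reported as rare rather than discarded as noise.

```python
import re
from collections import Counter

# Hypothetical mapping from requested path to a one-letter page symbol.
PAGE_SYMBOLS = {"/home": "H", "/login": "L", "/search": "S", "/item": "I", "/admin": "A"}

# Hypothetical episodes, each written as a regular expression over page symbols.
EPISODES = {
    "browse": re.compile(r"H(SI+)+"),
    "login_then_browse": re.compile(r"HL(SI+)+"),
    "repeated_login": re.compile(r"L{3,}A?"),
}


def encode_session(paths):
    """Encode a list of requested paths as a symbol string; unknown pages become '?'."""
    return "".join(PAGE_SYMBOLS.get(p, "?") for p in paths)


def label_session(paths):
    """Return the name of the first episode matching the whole session, else None."""
    encoded = encode_session(paths)
    for name, pattern in EPISODES.items():
        if pattern.fullmatch(encoded):
            return name
    return None


def episode_counts(sessions):
    """Count how many sessions fall under each episode (None = no episode matched)."""
    return Counter(label_session(s) for s in sessions)


def rare_episodes(sessions, max_count=1):
    """List episodes matched by at most max_count sessions, sorted by name."""
    counts = episode_counts(sessions)
    return sorted(name for name, n in counts.items()
                  if name is not None and n <= max_count)
```

Calling `rare_episodes` on a list of sessions returns the episode names that occur at most `max_count` times, which is one simple way to surface infrequent access patterns alongside the frequent ones.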

Paper Citation

in Harvard Style

Udantha M., Ranathunga S. and Dias G. (2016). An Episode-based Approach to Identify Website User Access Patterns. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-173-1, pages 343-350. DOI: 10.5220/0005752703430350

in Bibtex Style

@conference{icpram16,
author={Madhuka Udantha and Surangika Ranathunga and Gihan Dias},
title={An Episode-based Approach to Identify Website User Access Patterns},
booktitle={Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2016},
pages={343-350},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005752703430350},
isbn={978-989-758-173-1},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - An Episode-based Approach to Identify Website User Access Patterns
SN - 978-989-758-173-1
AU - Udantha M.
AU - Ranathunga S.
AU - Dias G.
PY - 2016
SP - 343
EP - 350
DO - 10.5220/0005752703430350