The best-performing experiment, experiment 3, which applied the complete preprocessing pipeline including stopword removal, achieved the highest accuracy of 0.722. The model in this experiment also performed best on the International class, with an F1-score of 0.88 and a recall of 1.00, meaning that all International documents in the test set were correctly classified. The National class performed consistently well across all experiments, with F1-scores between 0.76 and 0.81, indicating that documents in this category have features the model recognizes easily. The lowest performance was consistently found in the Undefined class, with an F1-score of only 0.39 in experiment 3. This indicates that documents in this category have text structures that are inconsistent or similar to those of other classes, making them difficult to distinguish automatically. In general, adding preprocessing stages such as stopword removal improves the quality of the feature representation and positively impacts classification results, both overall (accuracy) and per class, with the exception of the Undefined class.
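As an illustration of how stopword removal cleans the feature space before TF-IDF weighting, the following minimal Python sketch applies lowercasing, punctuation removal, and stopword filtering; the stopword list shown is a short hypothetical sample for Indonesian text, not the list used in the experiments.

```python
# Minimal sketch of the stopword-removal step, assuming Indonesian-language
# OCR text and an illustrative stopword list; the study's actual list and
# tokenization rules are not reproduced here.
import re

# Hypothetical, abbreviated stopword list for illustration only.
STOPWORDS = {"dan", "di", "yang", "pada", "untuk", "dengan", "ke", "dari"}

def preprocess(text: str) -> str:
    """Lowercase, strip non-alphabetic characters, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove digits and punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("Juara 1 Lomba Matematika di Tingkat Nasional"))
# -> "juara lomba matematika tingkat nasional"
```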
These results show that the KNN method with TF-IDF vectorization and cosine similarity can classify proof-of-achievement documents effectively. This approach can automate and speed up the SNBP selection process and reduce the committee's manual classification workload. For future research, additional preprocessing steps such as stemming or lemmatization should be added to make the text feature representation cleaner and more informative. It is also recommended to explore other classification algorithms, such as Support Vector Machine (SVM), Naive Bayes, or Random Forest, which take different approaches to handling text data, especially OCR-extracted data that tends to be inconsistent.
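The described pipeline can be sketched as follows; this is a hedged illustration using scikit-learn, with toy texts, placeholder labels, and k = 3 standing in for the paper's actual data and parameters.

```python
# Hedged sketch of the described pipeline: TF-IDF vectorization followed by
# a K-Nearest Neighbors classifier using cosine distance. Texts, labels, and
# k=3 are placeholders; the study's exact settings are not reproduced here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the OCR-extracted achievement texts.
texts = [
    "juara lomba matematika tingkat internasional",
    "juara olimpiade sains tingkat nasional",
    "sertifikat kegiatan sekolah",
]
labels = ["International", "National", "Undefined"]

model = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),  # cosine distance
)
model.fit(texts, labels)
print(model.predict(["juara olimpiade fisika tingkat nasional"]))
```

For the suggested comparison, alternative classifiers such as LinearSVC, MultinomialNB, or RandomForestClassifier could be substituted for the KNN step in the same pipeline.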
Image preprocessing in this study was performed
manually (e.g., rotation and cropping) due to
efficiency considerations and time constraints. This
approach was chosen so that the research could focus
on testing the performance of the classification
algorithm, while automation of preprocessing could
be pursued in future studies.
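For reference, the kind of manual adjustment described here can be expressed in a few lines of Python with Pillow; this is only an illustration with a hypothetical rotation angle, crop box, and file names, not the exact procedure used in the study.

```python
# Illustrative sketch of the manual rotation-and-cropping step performed
# before OCR; the angle, crop box, and file names are hypothetical examples,
# not values taken from the study.
from PIL import Image

img = Image.open("certificate_scan.jpg")              # hypothetical input file
img = img.rotate(-2, expand=True, fillcolor="white")  # straighten a slight tilt
img = img.crop((50, 100, 1200, 800))                  # keep the text region only
img.save("certificate_clean.jpg")
# The cleaned image is then passed on to the OCR stage.
```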
DISCLAIMER
AI tools were used to assist with this writing in order to improve the grammar.