Optimizing the Classification of SNBP Student Candidate Files Using
the K-Nearest Neighbor (KNN) Algorithm
Diah Ayu Larasati, Amil Ahmad Ilham and Ady Wahyudi Paundu
Department of Informatics, Hasanuddin University, Gowa, Indonesia
Keywords: SNBP, K-Nearest Neighbor (KNN), Certificate Classification, International
Abstract: In addition to academic achievement, non-academic achievement is one of the critical factors that can increase
the chances of being accepted through the SNBP (Seleksi Nasional Berdasarkan Prestasi) pathway. Non-academic achievement certificates are assessed against several criteria defined in the portfolio guidelines of the Ministry of Education and Culture. This research aims to optimize the classification of each
certificate according to its grouping in the portfolio to make it easier for reviewers to check files. The KNN
algorithm's evaluation using the cross-validation method and evaluation metrics such as precision, recall, and
F1-score shows that text preprocessing has a significant influence on model performance. The best experiment,
experiment 3, which uses complete preprocessing including stopword removal, gives the highest accuracy of
0.722. The lowest performance was consistently found in the “Undefined” class, with an F1-score of only
0.39 in the 3rd experiment. These results show that the KNN method with TF-IDF vectorization and cosine
similarity can classify proof of achievement well.
1 INTRODUCTION
SNPMB (Seleksi Nasional Penerimaan Mahasiswa
Baru) is a selection system organized by the
Indonesian government to attract outstanding
students to continue their education at higher levels,
such as State Universities and State Polytechnics.
One of the selection channels in SNPMB is SNBP
(Seleksi Nasional Berdasarkan Prestasi), which
considers report cards and evidence of academic and
non-academic achievements uploaded by students on
the official SNPMB website.
So far, the selection committee has classified thousands of proofs of achievement manually, a process that is time-consuming and labor-intensive. Therefore, an automated
classification system is needed to help the selection
process. Classification in the context of information
systems is a technique of mapping data into certain
predetermined classes. One of the most popular
classification algorithms is K-Nearest Neighbor
(KNN). This algorithm classifies new data according to its computed distance to data whose classes are already known.
The KNN algorithm has advantages due to its
simplicity and effectiveness in handling small to
medium datasets. However, the accuracy of KNN is
greatly influenced by the selection of the K value and
the quality of the data used. In this research, a text
mining and OCR (Optical Character Recognition)
approach is used to process raw data in the form of
certificate files into text form. This research explicitly
compares the KNN algorithm's accuracy across three different text preprocessing scenarios.
The main contribution of this research is a
comprehensive analysis of the effect of text
preprocessing on classification accuracy. In addition,
this research provides practical benefits by offering a
system that can accelerate and streamline the
selection process, while also serving as a basis for the
development of future achievement document
classification methods.
2 METHODOLOGY
2.1 Optical Character Recognition
(OCR)
Optical Character Recognition, or OCR, is a
technology that converts text in image or picture file
format into text that can be read and edited by
computer applications (e.g., Notepad and Microsoft
Word) (Firdaus et al., 2021).
In image-type media, text content cannot be
extracted directly because it is considered part of the
visual image, so a special method is needed to extract
text from an image, namely Optical Character
Recognition or OCR (Nugraha, 2024). In the context
of this research, OCR is a fundamental stage because
the proof of achievement data uploaded by students is
in the form of digital files that need to have their text
information extracted. The OCR process involves
several stages of image preprocessing, such as
rotating to correct the document's orientation,
cropping to focus on the area containing the text, and
excluding non-certificate files to ensure the quality of
the data to be processed.
OCR is frequently embedded in document-management systems, digitization initiatives, and digital data-entry workflows. OCR can also be used to recognize handwriting, although this application is less common because handwriting varies greatly from person to person. Older OCR systems relied on hand-crafted rules, which demanded substantial manual effort and produced limited accuracy. Classical machine learning models applied to this task include Support Vector Machines (SVM), Naive Bayes, and K-Nearest Neighbors (KNN) (Francis & Sangeetha, 2025).
Modern OCR technology uses deep learning
approaches to improve character recognition
accuracy, especially on documents with varying
image quality. In this research, the EasyOCR library
was chosen for its ability to handle various fonts and
layouts of certificate documents accurately. The PDF
to image conversion process is performed using the
PyMuPDF library, which maintains the visual quality
of the original document.
2.2 Text Mining
Text mining is the extraction of useful information and knowledge from large volumes of unstructured textual data (Pertiwi, 2022). Text mining is a well-known method
used in areas like computer science, information
science, mathematics, and management to find useful
information from large amounts of data (Kumar et al.,
2021). In the context of document classification, text
mining consists of three main stages: preprocessing,
data mining, and postprocessing. The preprocessing
stage includes data selection, text cleaning, and
feature extraction, to transform documents into
representations that machine learning algorithms can
process. Text mining can employ dictionary-based methods, machine learning techniques, and advanced deep learning approaches; its goal is to determine what the text conveys (Lai & Chen, 2023).
In this research, text mining is vital in converting
OCR-generated text into numerical features that the
KNN algorithm can process. This process involves
various preprocessing techniques such as case
folding, punctuation removal, space normalization,
and stop words removal. Each preprocessing
technique has a different impact on the quality of the
resulting features and ultimately affects the
classification accuracy. Figure 1 presents the design
of the system workflow that outlines the main
processes involved in this research.
Figure 1: Design of the developed system workflow.
2.3 K-Nearest Neighbor
The K-NN method assigns items to classes based on the training examples that are most similar to the item being examined (Danny et al., 2024). The K-
Nearest Neighbor algorithm is a supervised learning
method that performs classification based on
closeness or similarity (Cholil et al., 2021). This
algorithm works by finding the K nearest neighbors
of the test data to the training data and determining
the class based on the majority of the neighboring
classes. KNN has lazy learning characteristics
because it does not build an explicit model during the
training phase but instead stores all the training data
and performs computations during the prediction
process.
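As a hedged, minimal sketch of this lazy-learning behavior (the vectors, labels, and k value below are illustrative, not the paper's actual data), prediction reduces to finding the K most similar training vectors and taking a majority vote among their classes:

from collections import Counter
import numpy as np

def knn_predict(x, train_vectors, train_labels, k=3):
    # Similarity between the query and every training vector
    # (cosine similarity, discussed below).
    sims = train_vectors @ x / (
        np.linalg.norm(train_vectors, axis=1) * np.linalg.norm(x)
    )
    # Indices of the k most similar neighbors; all computation
    # happens here, at prediction time (lazy learning).
    nearest = np.argsort(sims)[-k:]
    # Majority vote among the neighbors' classes.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]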
The most commonly used distance measure in KNN is Euclidean distance. In this research, however, similarity is measured with cosine similarity, which is well suited to the sparse text vectors produced by TF-IDF. Cosine similarity is calculated with the following equation (1):

\[
\cos(P, Q) = \frac{P \cdot Q}{\|P\|\,\|Q\|} = \frac{\sum_{i=1}^{n} P_i\,Q_i}{\sqrt{\sum_{i=1}^{n} P_i^{2}}\;\sqrt{\sum_{i=1}^{n} Q_i^{2}}} \tag{1}
\]
P and Q are different document vectors, and the
components of the two vectors are the frequencies of
certain words in the document. Cosine similarity was
chosen because of its effectiveness in measuring the
similarity of text documents represented in the form
of TF-IDF vectors.
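To make equation (1) concrete, here is a minimal sketch assuming two small dense vectors of term weights (the values are illustrative only):

import numpy as np

def cosine_similarity(p, q):
    # cos(P, Q) = (P . Q) / (||P|| * ||Q||), as in equation (1).
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

# Toy term-weight vectors for two documents (illustrative values).
p = np.array([0.3, 0.0, 0.7, 0.1])
q = np.array([0.2, 0.5, 0.6, 0.0])
print(cosine_similarity(p, q))  # 1.0 would mean identical direction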
2.4 Term Weighting
Term weighting is a method of weighting words in
documents used to convert text into numerical
representations in text mining algorithms. This
weighting aims to provide a value that reflects the
importance or relevance of a word in the document
and the document collection as a whole.
Long text can generally be weighted using the
usual term weighting method such as TF-IDF
(Ni’mah & Syuhada, 2022). The term weighting
process consists of several main components. First,
Term Frequency (TF) calculates the frequency of
occurrence of a word in a particular document, where
the more often the word appears, the greater its
weight. Second, Inverse Document Frequency (IDF)
measures the uniqueness of the word in the entire
document collection by calculating the logarithm of
the ratio of total documents to the number of
documents containing the word. Combining these two components yields the TF-IDF weight, which assigns a high weight to words that appear frequently in a particular document but rarely in the overall document collection (Sholehhudin et al., 2018).
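Written out explicitly (notation ours, following the description above), with \(\mathrm{tf}_{t,d}\) the frequency of term \(t\) in document \(d\), \(N\) the total number of documents, and \(\mathrm{df}_t\) the number of documents containing \(t\), the TF-IDF weight is:

\[
w_{t,d} = \mathrm{tf}_{t,d} \times \log \frac{N}{\mathrm{df}_t}
\]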
In this research, the term weighting
implementation uses the TF-IDF scheme with
ngram_range parameters of 1-2 to capture the context
of unigrams and bigrams, and max_features of 1000
to limit the feature dimension and avoid overfitting.
Bigrams are essential to capture meaningful phrases
in the context of certificates of achievement, such as
“first place” or “national level”, which have different
meanings when separated into individual words. TF-IDF performs effectively when the same words recur across documents (Gani et al., 2022).
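A minimal scikit-learn sketch of this weighting configuration (the two-document corpus is illustrative only; the parameters are those stated above):

from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams and bigrams, capped at 1000 features as described above.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=1000)

docs = [
    "juara pertama tingkat nasional",   # "first place, national level"
    "juara kedua tingkat provinsi",     # "second place, provincial level"
]
X = vectorizer.fit_transform(docs)      # sparse TF-IDF matrix
print(vectorizer.get_feature_names_out())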
2.5 Text Classification
Text classification is a supervised learning process
that aims to find a model to distinguish data classes
or concepts and predict the class of unidentified
objects. Unlike unsupervised clustering, text
classification requires training data that has been
labeled to build a prediction model (Iqbal Mubarok et
al., 2024).
Text classification specifically focuses on
categorizing and organizing text data to facilitate
easier management and analysis. These techniques
leverage natural language processing, machine
learning, and data mining to extract meaningful
patterns and categorize text data (Taha et al., 2024).
In the context of this research, text classification is
used to categorize proof of achievement based on its
level of achievement: International, National,
Provincial, Regency / City, and Undefined.
The success of text classification depends on the
quality of preprocessing and feature representation
used. Improper preprocessing can produce noise that
interferes with the learning process, while a good
feature representation can improve the algorithm's
ability to distinguish between classes.
3 RESULTS
3.1 Initial Data Collection and
Preparation
The data processing begins by collecting proof of
achievement in PDF/PNG/JPG format from 2022,
2023, and 2024 with 4,338, 6,578, and 11,472 files,
respectively. This research focuses on the PDF format
to homogenize the data processing process. Each PDF
file page is converted into an image in JPG/PNG
format using the PyMuPDF library with sufficient
resolution to maintain the quality of the text to be
extracted.
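A minimal sketch of this page-to-image step, assuming PyMuPDF (the fitz module); the zoom factor and file paths are illustrative choices, not values reported by this research:

import fitz  # PyMuPDF

def pdf_to_images(pdf_path, out_prefix, zoom=2.0):
    doc = fitz.open(pdf_path)
    matrix = fitz.Matrix(zoom, zoom)  # upscale to preserve text quality
    for page_number, page in enumerate(doc, start=1):
        pix = page.get_pixmap(matrix=matrix)
        pix.save(f"{out_prefix}-{page_number}.png")
    doc.close()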
The image preprocessing stage is done manually
to ensure data quality. Rotating was done to correct
the orientation of tilted or upside-down documents,
while cropping focused on areas that contained
important information. After data cleaning, 3,335
files for 2022, 2,846 for 2023, and 4,571 for 2024
were obtained.
Each image file is then named according to the
convention of SNBP registration year—ID—page
number of the original PDF. This systematic naming
facilitates tracking and debugging during system
development. As illustrated in Figure 2, examples of
the cleaned certificate images are presented.
Figure 2: Examples of cleaned certificate images.
Extracting text from images is done using the
EasyOCR library configured for Indonesian and
English, considering that many certificates of
achievement use both languages. The extracted text is then stored as a full-text record for each registration year.
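A hedged sketch of this extraction step with EasyOCR, configured for Indonesian and English as described (the helper name and paths are ours):

import easyocr

# Indonesian + English reader, matching the bilingual certificates.
reader = easyocr.Reader(["id", "en"])

def extract_fulltext(image_path):
    # detail=0 returns only the recognized strings, in reading order.
    segments = reader.readtext(image_path, detail=0)
    return " ".join(segments)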
3.2 Text Preprocessing
Pre-processing includes case folding, data cleaning,
tokenization, and stopword removal to prepare raw
data in a format suitable for analysis, modeling, and
machine learning tasks (Lewu et al., 2024).
Data pre-processing is the first step in machine learning, in which the data is transformed or encoded into a state that the machine can readily parse (Maharana et al., 2022).
The achievement level labeling process is carried
out based on SNBP guidelines that categorize
achievements into five classes, namely International,
National, Provincial, Regency/City, and Undefined.
Labeling is done manually by involving domain
experts to ensure the accuracy of categorization. Data
distribution shows the dominance of national
achievement levels, while international achievement
levels are in the minority. To address the imbalance
in the data, stratified sampling was conducted using
the amount of international data as a reference,
resulting in a more balanced dataset with 205 national data
points, 109 district/city data points, 108 provincial
data points, 102 international data points, and 102
undefined data points.
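One way to realize this balancing is sketched below with pandas, assuming the labels live in a hypothetical level column; note that the paper's final per-class counts are not perfectly uniform, so this uniform downsampling is only an approximation of the procedure described:

import pandas as pd

def downsample_to_reference(df, label_col="level", seed=42):
    # Use the size of the smallest class (here, International)
    # as the reference count for every class.
    n_ref = df[label_col].value_counts().min()
    sampled = [
        group.sample(n=min(len(group), n_ref), random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(sampled).reset_index(drop=True)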
The implementation of three preprocessing
scenarios was carried out in stages to analyze the
impact of each technique on classification accuracy.
The first scenario without preprocessing directly uses
raw text from OCR, which is then vectorized using
TF-IDF. Figure 3 provides an example of the OCR-
generated full text, where each recognized segment is
annotated with the corresponding achievement level
label. The second scenario applies text cleaning,
which includes case folding to convert all characters
into lowercase letters, punctuation removal to remove
irrelevant punctuation marks, number removal to
remove numbers that do not provide essential
information, and whitespace normalization to
normalize excess spaces. The third scenario adds
stopword removal to the text cleaning process. The
removed stopwords come from a combination of the
Sastrawi and NLTK libraries.
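A hedged sketch of scenarios 2 and 3 (the function names are ours; the stopword list combines the Sastrawi and NLTK libraries as described above):

import re
import nltk
from nltk.corpus import stopwords
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

nltk.download("stopwords", quiet=True)
STOPWORDS = set(StopWordRemoverFactory().get_stop_words()) | set(
    stopwords.words("indonesian") + stopwords.words("english")
)

def clean_text(text):
    text = text.lower()                       # case folding
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation removal
    text = re.sub(r"\d+", " ", text)          # number removal
    return re.sub(r"\s+", " ", text).strip()  # whitespace normalization

def remove_stopwords(text):                   # scenario 3 only
    return " ".join(w for w in text.split() if w not in STOPWORDS)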
Figure 3: Example of OCR Output (fulltext) with labeled
achievement levels.
3.3 Model Evaluation
Model evaluation is performed using a stratified train-
test split with a ratio of 80:20 to ensure a balanced
class distribution in the training and testing data. The
K=3 value in the KNN algorithm was selected based
on the results of preliminary experiments that showed
optimal performance at this value. A small K value
was chosen to avoid bias towards the majority class
in the dataset.
The KNN model is built using the cosine distance
metric, which is more suitable for text-based data or
sparse vectors such as the results of the TF-IDF
process. After the model is trained, the test data will
be predicted to calculate the evaluation metric. A 3-
fold cross-validation process is performed using the
cross_val_score function to obtain more robust
performance estimates. Each fold maintains a
stratified class distribution to ensure the validity of
the evaluation. The evaluation metrics calculated
include accuracy and a classification report that
includes precision, recall, and F1-score for each class
as shown in Figure 4. These evaluation results
provide a comprehensive overview of the model's
performance in general and for each class. To provide
a clearer performance overview, Table 1 presents a
detailed comparison of the evaluation metrics derived
from the three experiments.
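A minimal sketch of this evaluation pipeline, assuming X is the TF-IDF matrix and y the achievement-level labels produced by the earlier steps (the random seed is an illustrative choice):

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X_train, y_train)

# Precision, recall, and F1-score per achievement level.
print(classification_report(y_test, knn.predict(X_test)))

# 3-fold cross-validation (stratified by default for classifiers)
# for a more robust accuracy estimate.
print(cross_val_score(knn, X, y, cv=3).mean())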
Figure 4: Comparison chart of precision, recall, and F1-score for each class across the three experiments.
Table 1: Detailed comparison of evaluation metrics from the three experiments.
Experiment | Preprocessing                                  | Accuracy | Class             | Precision | Recall | F1-score | Support
Exp 1      | Without Cleaning                               | 0.714    | 0 (International) | 0.75      | 1.00   | 0.86     | 21
           |                                                |          | 1 (City/District) | 0.79      | 0.68   | 0.73     | 22
           |                                                |          | 2 (National)      | 0.76      | 0.76   | 0.76     | 41
           |                                                |          | 3 (Province)      | 0.64      | 0.64   | 0.64     | 22
           |                                                |          | 4 (Undefined)     | 0.56      | 0.45   | 0.50     | 20
Exp 2      | OCR Cleaning + Casefolding + Punct             | 0.706    | 0 (International) | 0.72      | 0.95   | 0.83     | 21
           |                                                |          | 1 (City/District) | 0.75      | 0.77   | 0.69     | 22
           |                                                |          | 2 (National)      | 0.75      | 0.88   | 0.81     | 41
           |                                                |          | 3 (Province)      | 0.72      | 0.59   | 0.65     | 22
           |                                                |          | 4 (Undefined)     | 0.53      | 0.50   | 0.51     | 20
Exp 3      | OCR Cleaning + Casefolding + Punct + Stopword  | 0.722    | 0 (International) | 0.78      | 1.00   | 0.88     | 21
           |                                                |          | 1 (City/District) | 0.75      | 0.68   | 0.71     | 22
           |                                                |          | 2 (National)      | 0.76      | 0.83   | 0.79     | 41
           |                                                |          | 3 (Province)      | 0.78      | 0.64   | 0.70     | 22
           |                                                |          | 4 (Undefined)     | 0.44      | 0.35   | 0.39     | 20
4 CONCLUSIONS
This research concludes by creating an automatic
classification system for SNBP proof of achievement
selection using the K-Nearest Neighbor (KNN)
algorithm. The KNN algorithm's evaluation using the
cross-validation method and evaluation metrics such
as precision, recall, and F1-score shows that text
preprocessing has a significant influence on model
performance.
The best experiment, experiment 3, which uses
complete preprocessing including stopword removal,
gives the highest accuracy of 0.722. The model in this
experiment also performed best on the international
class, with an F1-score of 0.88 and recall of 1.00,
meaning that all International data in the test set was
correctly classified. The National class performed
consistently high in all experiments, with F1-score
between 0.76 and 0.81, indicating that documents in
this category have features easily recognized by the
model. The lowest performance was consistently
found in the “Undefined” class, with an F1-score of
only 0.39 in the 3rd experiment. This indicates that
documents in this category have text structures that
are inconsistent or similar to other classes, making it
challenging to distinguish automatically. In general,
adding preprocessing stages such as stopword
removal improves the quality of the feature representation and positively impacts classification results, both overall (accuracy) and per class, except for the Undefined class.
These results show that the KNN method with TF-
IDF vectorization and cosine similarity can classify
proof of achievement well. This approach can speed
up the SNBP selection process automatically and
efficiently and reduce the burden of manual
classification by the committee. For future research,
further preprocessing steps such as stemming or
lemmatization should be added to make the text
feature representation cleaner and more informative.
It is also recommended to explore other classification
algorithms such as Support Vector Machine (SVM),
Naive Bayes, or Random Forest, which have different
approaches in handling text data, especially on OCR
extracted data that tends to be inconsistent.
Image preprocessing in this study was performed
manually (e.g., rotation and cropping) due to
efficiency considerations and time constraints. This
approach was chosen so that the research could focus
on testing the performance of the classification
algorithm, while automation of preprocessing could
be pursued in future studies.
DISCLAIMER
AI tools were used to improve the grammar of this manuscript.
REFERENCES
Cholil, S. R., Handayani, T., Prathivi, R., & Ardianita, T.
(2021). Implementasi Algoritma Klasifikasi K-Nearest
Neighbor (KNN) Untuk Klasifikasi Seleksi Penerima
Beasiswa. IJCIT (Indonesian Journal on Computer and
Information Technology), 6(2), 118–127.
https://doi.org/10.31294/ijcit.v6i2.10438
Danny, M., Muhidin, A., & Jamal, A. (2024). Application
of the K-Nearest Neighbor Machine Learning
Algorithm to Predict Sales of Best-Selling Products.
Brilliance: Research of Artificial Intelligence, 4(1),
255–264. https://doi.org/10.47709/brilliance.v4i1.4063
Firdaus, A., Syamsu Kurnia, M., Shafera, T., & Firdaus, W. I. (2021). Implementasi Optical Character Recognition (OCR) Pada Masa Pandemi Covid-19. Jurnal JUPITER, 13(2), 188–194.
Francis, S. A., & Sangeetha, M. (2025). A comparison
study on optical character recognition models in
mathematical equations and in any language. Results in
Control and Optimization, 18, 100532.
https://doi.org/10.1016/j.rico.2025.100532
Gani, M. O., Ayyasamy, R. K., Alhashmi, S. M.,
Sangodiah, A., & Fui, Y. T. (2022). ETFPOS-IDF: A
Novel Term Weighting Scheme for Examination
Question Classification Based on Bloom’s Taxonomy.
IEEE Access, 10(December), 132777–132785.
https://doi.org/10.1109/ACCESS.2022.3230592
Iqbal Mubarok, M., Purwantoro, P., & Carudin, C. (2024).
Penerapan Algoritma K-Nearest Neighbor (Knn)
Dalam Klasifikasi Penilaian Jawaban Ujian Esai. JATI
(Jurnal Mahasiswa Teknik Informatika), 7(5), 3446–
3452. https://doi.org/10.36040/jati.v7i5.7676
Kumar, S., Kar, A. K., & Ilavarasan, P. V. (2021).
Applications of text mining in services management: A
systematic literature review. International Journal of
Information Management Data Insights, 1(1), 100008.
https://doi.org/10.1016/j.jjimei.2021.100008
Lai, Y. W., & Chen, M. Y. (2023). Review of Survey
Research in Fuzzy Approach for Text Mining. IEEE
Access, 11(February), 39635–39649.
https://doi.org/10.1109/ACCESS.2023.3268165
Lewu, R. Y., Kusrini, K., & Yaqin, A. (2024). Comparing
text classification algorithms with n-grams for
mediation prediction. IJCCS (Indonesian Journal of
Computing and Cybernetics Systems), 18(2).
https://doi.org/10.22146/ijccs.93929
Maharana, K., Mondal, S., & Nemade, B. (2022). A review:
Data pre-processing and data augmentation techniques.
Global Transitions Proceedings, 3(1), 91–99.
https://doi.org/10.1016/j.gltp.2022.04.020
Ni’mah, A. T., & Syuhada, F. (2022). Term Weighting
Based Indexing Class and Indexing Short Document for
Indonesian Thesis Title Classification. Journal of
Computer Science and Informatics Engineering (J-
Cosine), 6(2), 167–175.
https://doi.org/10.29303/jcosine.v6i2.471
Nugraha, K. A. (2024). Penerapan Optical Character
Recognition untuk Pengenalan Variasi Teks pada
Media Presentasi Pembelajaran. Jurnal Buana Informatika, 69–78.
Pertiwi, L. (2022). Penerapan Algoritma Text Mining,
Steaming Dan Texrank Dalam Peringkasan Bahasa
Inggris. BIMASATI (Bulletin of Multi-Disciplinary
Science and Applied Technology), 1(3), 100–104.
Sholehhudin, M., Fauzi Ali, M., & Adinugroho, S. (2018).
Implementasi Metode Text Mining dan K-Means
Clustering untuk Pengelompokan Dokumen Skripsi (Studi Kasus: Universitas Brawijaya). 2(11), 5518–5524.
Taha, K., Yoo, P. D., Yeun, C., Homouz, D., & Taha, A.
(2024). A comprehensive survey of text classification
techniques and their research applications:
Observational and experimental insights. Computer
Science Review.