
quently preferred in e-commerce applications because
of its low response times (Garbe, 2019). Wang and Zhang (2021) note that while SymSpell offers high real-time performance, its lack of contextual awareness can lead to incorrect corrections. These shortcomings also create difficulties in separating concatenated words.
Therefore, it has been suggested that dictionary-based
methods alone are not sufficient and should be used in
conjunction with other models. In Language Model
(LM) based approaches, models such as BERT, Dis-
tilBERT, and T5 are used to analyze the context of a
word and correct errors more accurately (Dutta and
Pande, 2024a), whose experiments show that BART and T5 models increase the F1 score in spell checking by 4%.
However, LM-based models have high computational
costs and response times, making them difficult to
use in real-time systems. For this reason, more op-
timized models like DistilBERT are used (Kakkar
and Pande, 2023). SymSpell offers a speed advan-
tage but does not consider contextual meaning, so hy-
brid systems that combine it with Transformer-based
models are recommended. Guo et al. (2024) have shown that ranking candidate words generated by SymSpell with a language model increases accuracy. Similarly, a re-ranking approach tailored to separating concatenated words in e-commerce searches has improved correction quality (Dutta and
Pande, 2024b). In Turkish, separating concatenated
words presents additional difficulties because the lan-
guage is agglutinative and morphologically rich. In-
correct segmentation can lead to meaning loss and er-
roneous suggestions. In the context of e-commerce,
examples of brands, products, and models that should
not be segmented make the process particularly chal-
lenging. These difficulties have been addressed us-
ing language models and rule-based systems (Uzun,
2022). Phonetic analysis and multi-approach mod-
els have the potential to improve the process of sepa-
rating concatenated words (Behrooznia et al., 2024).
E-commerce platforms face various challenges in deploying spell checkers and word segmenters, stemming from brand names, product names, model names, and users' natural-language queries (Pande, 2022). Developing customized models
is crucial for improving the search experience and in-
creasing conversion rates.
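The SymSpell-plus-re-ranking hybrid discussed above can be sketched as follows. This is a minimal illustration, not any of the cited systems: the deletion-based candidate lookup mirrors SymSpell's core idea, while the toy bigram table stands in for a language-model scorer.

```python
# Sketch of a hybrid corrector: SymSpell-style deletion lookup for candidate
# generation, followed by a context-based re-ranking step. The vocabulary and
# bigram frequencies below are illustrative assumptions.
from itertools import combinations

def deletions(word, max_edits=1):
    """All strings reachable from `word` by deleting up to `max_edits` characters."""
    out = {word}
    for k in range(1, max_edits + 1):
        for idx in combinations(range(len(word)), k):
            out.add("".join(c for i, c in enumerate(word) if i not in idx))
    return out

def build_index(vocab):
    """Map every deletion variant back to the dictionary words that produce it."""
    index = {}
    for w in vocab:
        for d in deletions(w):
            index.setdefault(d, set()).add(w)
    return index

def candidates(index, term):
    """SymSpell-style lookup: intersect deletion neighborhoods of term and vocab."""
    found = set()
    for d in deletions(term):
        found |= index.get(d, set())
    return found

def rerank(cands, prev_word, bigram_freq):
    """Re-rank candidates by how often each follows the preceding word."""
    return sorted(cands, key=lambda w: -bigram_freq.get((prev_word, w), 0))

vocab = {"frame", "flame", "brown", "black"}
index = build_index(vocab)
bigram_freq = {("black", "frame"): 100, ("black", "flame"): 2}
cands = candidates(index, "frme")          # misspelling of "frame"
best = rerank(cands, "black", bigram_freq)[0]
```

The key design point is that the fast dictionary lookup only proposes candidates; the (here, trivial) context model decides among them, which is what restores the contextual awareness SymSpell alone lacks.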
3 DATASET
During the dataset preparation phase, data obtained
from various tables within an e-commerce platform
were used. This data includes the search terms users
have entered into the search engine and the frequency
of these terms. The search terms and their search fre-
quencies were drawn from the last 6 months of data. From this data, terms searched at least 10 times were retained, and meaningless terms were removed. A dataset
containing approximately 10 million search terms and
their frequencies was created. This dataset is specific
to the e-commerce platform where the study was con-
ducted and is not publicly accessible.
3.1 Data Cleaning
Search terms with a frequency below 10 were re-
moved from the dataset. Symbols, emojis, and ex-
tra spaces within the search terms were removed, and
characters were converted to lowercase. Terms con-
sisting only of symbols or numbers, and terms con-
taining no characters, were removed from the dataset.
The words that make up each search term were sorted alphabetically; when two terms shared the same sorted form but different frequencies, such as "blue tshirt" and "tshirt blue", the lower-frequency term was removed.
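The cleaning steps above can be sketched as follows. The regular expression and the term-to-frequency data layout are assumptions for illustration, not the platform's actual pipeline; in particular, the ASCII-only character class is a simplification, since real Turkish queries would need Unicode-aware handling.

```python
# Sketch of the data-cleaning steps: frequency threshold, symbol/emoji removal,
# lowercasing, dropping symbol- or digit-only terms, and deduplicating terms
# whose alphabetically sorted words coincide (keeping the higher-frequency one).
import re

def clean_terms(raw, min_freq=10):
    cleaned = {}
    for term, freq in raw.items():
        if freq < min_freq:
            continue
        # Keep only letters, digits, and spaces; lowercase everything.
        # (ASCII-only here for simplicity; Turkish text needs a wider class.)
        term = re.sub(r"[^a-z0-9 ]", " ", term.lower())
        term = " ".join(term.split())  # collapse extra whitespace
        # Drop empty terms and terms consisting only of digits.
        if not term or term.replace(" ", "").isdigit():
            continue
        # Canonical key: words sorted alphabetically, so "blue tshirt" and
        # "tshirt blue" collide; keep whichever variant has higher frequency.
        key = " ".join(sorted(term.split()))
        if key not in cleaned or freq > cleaned[key][1]:
            cleaned[key] = (term, freq)
    return {t: f for t, f in cleaned.values()}

raw = {"blue tshirt": 120, "tshirt blue": 30, "!!!": 50, "1234": 40, "rare term": 3}
result = clean_terms(raw)
# keeps only "blue tshirt" with frequency 120
```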
3.2 Dictionary Creation
The dictionary used in dictionary-based algorithms
is of great importance. To create the dictionary, the
search terms were separated into their unigrams and
bigrams. After creating unigrams and bigrams for
each search term, the search frequency of the term
was divided by the number of words in the term. The
resulting value corresponds to the search frequency
of the unigram in the term. If the same unigram ap-
pears multiple times in the dataset, the values from
each search term are summed to get a final value. For
example, if ”black frame” was searched 100 times and
”brown frame” was searched 70 times, the calculated
frequency for ”frame” would be 85. The same process
was repeated for bigrams. After obtaining unigrams
and bigrams, three different dictionaries were created.
The first dictionary, dictionary_wa, was created from whole search terms and their frequencies, without separating them into unigrams and bigrams. The second, dictionary_ou, was created from unigrams only. The third, dictionary_ub, was created from both unigrams and bigrams.
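The frequency computation described above can be sketched as follows. The unigram share (term frequency divided by word count) follows the text directly; normalizing bigrams by the number of bigrams in the term is our reading of "the same process" and is an assumption.

```python
# Sketch of the unigram/bigram frequency computation: each term's search
# frequency is split evenly over its n-grams, and shares for the same n-gram
# are summed across all terms.
from collections import Counter

def ngram_frequencies(term_freqs):
    unigrams, bigrams = Counter(), Counter()
    for term, freq in term_freqs.items():
        words = term.split()
        share = freq / len(words)  # each word gets an equal share of the frequency
        for w in words:
            unigrams[w] += share
        pairs = list(zip(words, words[1:]))
        if pairs:
            # Assumed analogous normalization: divide by the number of bigrams.
            share = freq / len(pairs)
            for a, b in pairs:
                bigrams[f"{a} {b}"] += share
    return unigrams, bigrams

uni, bi = ngram_frequencies({"black frame": 100, "brown frame": 70})
# reproduces the paper's example: uni["frame"] == 100/2 + 70/2 == 85.0
```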