with each state corresponding to a character and transitions between states indicating the probability of moving from one character to the next. HMMs rest on two sets of parameters: state transition probabilities and observation (emission) probabilities. Modeling both allows HMMs to handle noisy or degraded text effectively, making them adaptable to various languages, fonts, and writing styles.
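To make these two parameter sets concrete, the following is a minimal Viterbi-decoding sketch for a character-level HMM in Python; the two-character alphabet and all probability values are illustrative, not taken from any cited system.

```python
import numpy as np

# Minimal Viterbi decoding for a character-level HMM (illustrative values;
# a real OCR system would estimate these from training data).
states = ["a", "b"]                      # hidden states: candidate characters
trans = np.array([[0.7, 0.3],            # P(next char | current char)
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],             # P(observed glyph feature | char)
                 [0.2, 0.8]])
start = np.array([0.5, 0.5])             # initial character probabilities

def viterbi(obs):
    """Return the most likely character sequence for observation indices."""
    n, m = len(obs), len(states)
    delta = np.zeros((n, m))              # best log-prob ending in each state
    psi = np.zeros((n, m), dtype=int)     # backpointers
    delta[0] = np.log(start) + np.log(emit[:, obs[0]])
    for t in range(1, n):
        scores = delta[t - 1][:, None] + np.log(trans)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emit[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi([0, 1, 1]))  # e.g. ['a', 'b', 'b']
```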
Cao et al. (Cao, 2011) demonstrated the use of
HMMs in recognizing mixed handwritten and
typewritten text. The system uses OCR to identify
word boundaries and applies HMMs to classify text
types. By leveraging context and features such as image intensity, the HMM-based system achieved a lower error rate (4.75%) than earlier methods, improving OCR accuracy and reliability for mixed-type documents.
2.3 CNN
CNNs are deep learning models specifically designed
for processing and analyzing image data, widely used
in OCR for recognizing both printed and handwritten
text. CNNs enhance recognition accuracy by
automatically extracting features such as edges,
textures, and shapes from images, making them
effective in handling variations like deformations,
rotations, and scaling. The key components of CNNs are convolutional layers that capture local features, pooling layers that reduce spatial resolution while preserving essential information, activation functions such as the Rectified Linear Unit (ReLU) that let the network learn complex patterns, and fully connected layers that map the extracted features to output classes for final text recognition.
This structure allows CNNs to minimize the need for
manual feature engineering and remain robust against
image variations.
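As an illustration of this layer structure, the sketch below assembles a small character classifier in PyTorch; the 32x32 grayscale input, the 36-class output, and all layer sizes are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

# Minimal CNN sketch for character classification, assuming 32x32 grayscale
# inputs and a hypothetical 36-class output (digits + Latin letters).
class CharCNN(nn.Module):
    def __init__(self, num_classes=36):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # local feature maps
            nn.ReLU(),                                   # non-linearity
            nn.MaxPool2d(2),                             # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # map to classes

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

logits = CharCNN()(torch.randn(4, 1, 32, 32))  # batch of 4 images
print(logits.shape)                            # torch.Size([4, 36])
```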
In a study by Alrehali et al. (Alrehali, 2020),
CNNs were applied to recognize characters in
historical Arabic manuscripts. The architecture used
multiple convolutional and pooling layers to learn
character shapes and reduce feature dimensions, reaching an accuracy of 88.20% once the number of training samples was increased. Although this performance fell slightly below that of traditional methods, the CNN showed distinctive strengths in recognizing complex manuscripts and clear potential for further improvement.
2.4 RNN and LSTM
RNNs are designed for processing sequential data,
making them ideal for OCR, particularly in
recognizing handwritten text. Unlike traditional
feedforward networks, RNNs utilize contextual
information from sequences, allowing them to handle
continuous characters and lengthy texts more
effectively. They achieve this by maintaining a
"recurrent" structure, where the current hidden state
is updated at each step based on both the current input
and the previous hidden state. RNNs share the same weights across all time steps, which lets them handle sequences of varying lengths, and they learn temporal dependencies through the Backpropagation Through Time (BPTT) algorithm.
However, traditional RNNs struggle with long
sequences due to the vanishing gradient problem,
which often necessitates the use of LSTM networks
or Gated Recurrent Units (GRU) to improve their
capacity for modeling long-term dependencies.
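The recurrence described above fits in a few lines; the sketch below runs a single vanilla RNN forward pass with weights shared across time steps. All dimensions are illustrative, and training via BPTT is omitted.

```python
import torch

# Vanilla RNN recurrence: the same weights are reused at every time step,
# so the sequence length can vary freely (sizes are illustrative).
x_dim, h_dim = 8, 16
W_xh = torch.randn(h_dim, x_dim) * 0.1   # input-to-hidden weights
W_hh = torch.randn(h_dim, h_dim) * 0.1   # hidden-to-hidden weights
b = torch.zeros(h_dim)

def rnn_forward(inputs):
    """inputs: list of x_t vectors; returns the final hidden state."""
    h = torch.zeros(h_dim)
    for x_t in inputs:   # h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b)
        h = torch.tanh(W_xh @ x_t + W_hh @ h + b)
    return h

print(rnn_forward([torch.randn(x_dim) for _ in range(5)]).shape)
```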
LSTM networks are a more advanced type of
RNN specifically designed to address the vanishing
gradient problem encountered in long data sequences.
LSTMs are particularly effective in OCR tasks that
require long-term dependency recognition, such as
handwritten and complex printed text. They employ
memory cells with gating mechanisms to control
information storage, updating, and forgetting, which
helps retain relevant context while filtering out noise.
This adaptability to various OCR scenarios, including
multilingual recognition and document analysis,
enhances both text recognition accuracy and
reliability.
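As a minimal illustration, the snippet below runs a sequence through PyTorch's nn.LSTM, whose internal input, forget, and output gates implement the gating mechanism described here; the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

# nn.LSTM implements the input, forget, and output gates internally
# (dimensions here are illustrative only).
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 50, 8)                # 2 sequences of 50 feature vectors
out, (h_n, c_n) = lstm(x)
print(out.shape, h_n.shape, c_n.shape)   # per-step outputs, final h and c
# c_n is the memory cell that lets the network retain long-range context.
```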
In a study by Jun (Ma, 2024), RNNs and LSTMs
were applied to sequence modeling for handwritten
text recognition. While RNNs could handle
sequential data, they struggled with long sequences.
LSTMs overcame this limitation with gates that
capture long-term dependencies. The study used a
Bidirectional LSTM, which processes data in both
forward and backward directions, improving context
capture for letters and words. By combining CNNs
for feature extraction with LSTMs for sequence
modeling, the hybrid approach reduced the Word
Error Rate (WER) to 12.04% and the Character Error
Rate (CER) to 5.13%, outperforming standalone
CNN models.
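A rough sketch of such a CNN-plus-bidirectional-LSTM pipeline is shown below; it is not the architecture from the cited study, and every size (image height, channel counts, class count) is an assumption.

```python
import torch
import torch.nn as nn

# Hybrid sketch: a CNN turns a text-line image into a sequence of column
# features, which a bidirectional LSTM then models in both directions.
class CNNBiLSTM(nn.Module):
    def __init__(self, num_classes=80):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.bilstm = nn.LSTM(32 * 16, 128, batch_first=True,
                              bidirectional=True)
        self.head = nn.Linear(2 * 128, num_classes)   # both directions

    def forward(self, img):                   # img: (batch, 1, 32, width)
        f = self.cnn(img)                     # (batch, 32, 16, width/2)
        f = f.permute(0, 3, 1, 2).flatten(2)  # sequence of column features
        out, _ = self.bilstm(f)
        return self.head(out)                 # per-step class scores

scores = CNNBiLSTM()(torch.randn(2, 1, 32, 128))
print(scores.shape)                           # torch.Size([2, 64, 80])
```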
Similarly, Su et al. (Su, 2015) employed RNNs
and LSTMs to enhance scene text recognition without
requiring character segmentation. Their method converted word images into sequences of Histogram of Oriented Gradients (HOG) feature vectors, which were classified using a multi-layer RNN. By integrating LSTMs with input, forget, and
output gates, the model captured long-term
dependencies effectively. Using Connectionist
Temporal Classification (CTC), the approach aligned
unsegmented input with correct word outputs,
achieving high accuracy rates (up to 92%) on datasets
like ICDAR 2003, ICDAR 2011, and SVT. This
combined RNN-LSTM method proved effective in
recognizing text in complex, real-world conditions,
setting a strong benchmark for future research.
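For reference, the snippet below shows how CTC alignment is typically set up in PyTorch via nn.CTCLoss; the time-step count, class count, and blank index are illustrative and not taken from the cited work.

```python
import torch
import torch.nn as nn

# CTC aligns unsegmented per-step predictions with a target character
# string, so no character segmentation is needed (sizes are illustrative).
T, B, C = 40, 2, 28                      # time steps, batch, classes (+ blank)
log_probs = torch.randn(T, B, C).log_softmax(2)
targets = torch.randint(1, C, (B, 10))   # target character indices (0 = blank)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```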