Word and Image Embeddings in Pill Recognition
Richárd Rádli (https://orcid.org/0009-0009-3160-1275), Zsolt Vörösházi (https://orcid.org/0009-0004-3032-8784) and László Czúni (https://orcid.org/0000-0001-7667-9513)
University of Pannonia, 8200 Veszprém, Egyetem u. 10., Hungary
Keywords:
Metrics Learning, Pill Recognition, Multi-Modal Learning, Multihead Attention, Multi-Stream Network,
Dynamic Margin Triplet Loss.
Abstract:
Pill recognition is a key task in healthcare and has a wide range of applications. In this study, we address the challenge of improving the accuracy of pill recognition in a metrics learning framework. A multi-stream visual feature extraction and processing architecture, with multi-head attention layers, is used to estimate the similarity of pills. We introduce an essential enhancement of the triplet loss function that leverages word embeddings to inject textual pill similarity into the visual model. This improvement refines the visual embedding on a finer scale than conventional triplet loss models, resulting in higher accuracy of the visual model. Experiments and evaluations are made on a new, freely available pill dataset.
1 INTRODUCTION
Accurate recognition of prescription pill images,
based on their visual characteristics, can play a cru-
cial role in ensuring patient safety and optimizing
modern healthcare systems, particularly for elderly
patients. Since medication errors are the most com-
mon mistakes in healthcare (Cronenwett et al., 2007),
recognition technology has the potential to prevent
errors throughout the pharmaceutical supply chain,
enhance the expertise of poison control profession-
als, improve medication adherence, mitigate losses of
medications and prescriptions during evacuation sce-
narios, and drive advancements in remote and self-
diagnosis technologies as well as smart healthcare ap-
plications. Pill recognition systems can significantly
enhance the quality of medication dispensing, either
for home usage or in large-scale automated pill dis-
pensing systems.
In theory, medication pills are designed with distinctive features, including size, colour, shape, engravings, and imprints. However, various factors contribute to the challenges of accurate recognition, such as:
- Pill photographs are captured under diverse conditions, such as varying illumination, viewing angles, distances, and camera settings.
- Due to their small size, local features on pills are often not clearly visible or may be distorted.
- The number of possible classes can be very large (tens of thousands of possible pills), while few-shot learning is a requirement for many applications.
The two main approaches for optical recognition
or verification systems are object classification and
metric learning (Yang and Jin, 2006). In both cases
characteristics such as imprinted or carved signs, size,
color, and shape are considered as observed by the
camera sensors during training and inference pro-
cesses. Both of these methods apply 'yes' or 'no' evaluations (meaning that a pill belongs to the same or another class) in the sense that no scaled similarity is considered between the different classes. Metric learning aims to measure the similarity among objects using a distance metric function. In the procedure we have elaborated, we also take into account textual information from pill information leaflets, which was previously ignored.
In our models we deploy multi-stream metrics learning, since this way we can control the usage of different image features in the model, and few-shot learning is also easy to carry out.
The main contributions of our paper are the following:
- We showed that high-level textual information, gained from free text, can be utilized in optical pill recognition.
- We introduced a new triplet loss (dynamic margin triplet loss - DMTL), where the margin is dynamically controlled with the distances of word embeddings of class textual descriptors;
- We created a new dataset (OGYEIv2) of 112 pill classes with 4480 images; the dataset is available at https://www.kaggle.com/datasets/richardradli/ogyeiv2;
- We evaluated our model on our dataset with 5-fold cross-validation.
2 RELATED WORKS
Generic deep neural network (DNN) object detectors
have been applied for pill recognition in several recent
articles, such as (Tan et al., 2021), (Nguyen et al.,
2022), and (Heo et al., 2023).
In (Tan et al., 2021) three well-known DNN ob-
ject detectors (YOLOv3, RetinaNet, and SSD) were
compared on a custom dataset, resulting in only small differences in mAP (∼2%, all above 0.80). More
specific approaches are described in (Nguyen et al.,
2022) and (Heo et al., 2023). In (Nguyen et al.,
2022) the proposed solution used a prescription-based
knowledge graph, representing the relationship be-
tween pills. A graph embedding network extracted
the relational features of pills and a framework was
applied to fuse the graph-based relational information
with the image-based visual features for the final clas-
sification. The drawback of this method is that it requires medical prescriptions; equivalently, it can only be applied when there are multiple pills in the image.
In (Heo et al., 2023) the authors trained on not only the RGB images of the pills but also on the imprinted characters. In the pill recognition step, the different modules separately recognize both the features of pills and their imprints, while correcting the recognized imprint to fit the actual data of the other features. A trained lan-
guage model was also applied to the imprint correc-
tion. It was shown through an ablation study that the
language model could significantly improve the pill
identification ability of the system. The drawback of
this solution is that a specific language model (includ-
ing an OCR - optical character recognition module) is
required for the application.
In contrast to these approaches, our solution
avoids the use of specific language models and uses
only generic models to process the information leaflet
of pills. In the above models the training would re-
quire the processing of textual printed information
and/or language-specific OCR modules. They face
problems when texts are not visible (see Fig. 1 for illustration) or when new classes are to be added to the model; moreover, these texts would have to be added manually.
Our primary purpose is to elaborate a more general
and easily extensible framework.
For the above reasons, we followed the approaches of (Zeng et al., 2017) and (Ling et al., 2020), where the utilization of metrics learning was demonstrated in order to embed the pill images.
The winner (Zeng et al., 2017) of an algorithm
challenge on pill recognition in 2016, announced
by the United States National Library of Medicine
(NLM-NIH) (Yaniv et al., 2016), used a multi-stream
technique. In (Zeng et al., 2017) the visual infor-
mation (e.g. colour, gray-scale, and gradient images
of already localized pills) is processed by so-called
’training CNNs’. A knowledge distillation model
compression framework then condensed the training
CNNs into smaller footprint CNNs (’student CNNs’),
employed during inference. The CNNs were designed
to embed features in a metric space, where cosine
distance was utilized as a metric to determine how
similar the features produced by the CNNs are. During
the training of the streams, siamese neural networks
(SNNs) were used with three inputs: the anchor im-
age, a positive, and a negative sample, while the applied triplet loss was responsible for minimizing the distance between the anchor image and positive samples, and for increasing the distance between the anchor image and negative samples.
The winning model was improved in (Ling et al., 2020), with better accuracy tested on the CURE dataset. The teacher-student compression approach was dropped, and a separate OCR stream and a stream fusion network were introduced. The OCR stream
was responsible for text localization, geometric nor-
malization, and feature embedding with the Deep
TextSpotter (Busta et al., 2017). In addition to the
OCR stream, RGB, texture, and contour streams were
used; segmentation was performed using an improved
U-Net model to generate the stream inputs.
Our approach has similar structure to (Ling et al.,
2020) but with a few modifications: we replaced
the OCR method with an LBP (local binary pattern)
(Ojala et al., 1994) stream, we utilize a more refined
backbone in the streams, we use a state-of-the-art YOLO network for object detection, and we added attention mechanisms to the models. The performance of our multi-stream framework was compared to the architecture of (Ling et al., 2020) in (Rádli et al., 2023b), using the CURE dataset, showing an advantage of a few percentage points in all test settings. The main contribution of
this article is the improvement of our previous model
by the introduction of a new triplet loss, which uti-
lizes textual information about medicines. Details of
our custom model are given in Section 4.
3 DATASETS
3.1 Pill Images
Our fundamental use case is the operation of a dispensing verification device that captures images of various pharmaceutical pills, mostly in a controlled environment. Our custom dataset (OGYEIv2)
was created under the following conditions:
- pills have a uniform mid-gray background,
- some images were taken under an upper LED lamp equipped with a diffuser, which gave the tablets a good overall appearance, with no significant shadows but clear imprints,
- other images were captured using a side-mounted LED strip lamp: the engraved surface patterns become clearly visible.
All images in our dataset went through lens distortion correction (undistortion) and were also pixel-wise annotated. The main parameters of
OGYEIv2 are given in Table 1. For comparison we
included those of CURE.
Table 1: Comparison of the CURE and our novel OGYEIv2 dataset.

                                      CURE                 OGYEIv2
Number of pill classes                196                  112
Number of images                      8973                 4480
Raw image resolution                  800×800, 2448×2448   3840×2160
Image resolution after undistortion   -                    3746×2019
Instances per class                   40-50                40
Backgrounds                           6                    1
Segmentation labels                   No                   Yes
Free-text description                 No                   Yes
3.2 Text Obtained from Pill
Information Leaflets
To complement the visual data, we have collected the
official information leaflets about the pills provided
by the manufacturers. These leaflets include a general
description of the visual appearance of medicines,
which should be localized first. This was performed using NLP (natural language processing) methods, since this information (typically two or three sentences long) appears as free text in a specific section of the documents. These sentences form the textual part of our freely available OGYEIv2 dataset. To our knowledge, this is currently the only freely available free-text+image pill dataset. The extracted sentences were then tokenized, with particular attention to features such as pharmaceutical form, colour, shape, convexity, edge, and imprint or engravings. In the resulting text dataset, the class with the highest word count contains 28 words, while the class with the lowest comprises only 5 words.

Figure 1: Pill images are captured with different lamps. First row: engraved text is only visible with side-light. Second row: the same pills in different poses, where the white imprint is only visible in the right photograph.
The word embedding procedure is described in Sub-
section 4.3.
4 ENHANCING OBJECT
RECOGNITION WITH
TEXTUAL CLASS
INFORMATION
4.1 Overview
A schematic representation of our proposed model is
depicted in Figure 2. Addressing the limitations of
metrics embedding - which does not inherently solve the challenge of object detection and localization - the initial step is inference using a state-of-the-art YOLOv7 object detection model trained for detecting pills in an image. It draws bounding boxes around
the tablets it detects, which we later crop and feed to
our multi-stream network.
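As a rough illustration of this localization step, the sketch below assumes a generic detect_pills function standing in for the trained YOLOv7 detector (a hypothetical placeholder, not the authors' inference code) and crops each detected bounding box for the embedding network.

```python
# Minimal sketch of the localization step. `detect_pills` is a hypothetical
# stand-in for any trained pill detector returning pixel-coordinate boxes.
from PIL import Image

def crop_detections(image_path, detect_pills):
    image = Image.open(image_path).convert("RGB")
    crops = []
    for (x1, y1, x2, y2) in detect_pills(image):    # one box per detected pill
        crops.append(image.crop((x1, y1, x2, y2)))  # crop fed to the multi-stream network
    return crops
```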
Subsequently, in the initial phase of image embed-
ding (Phase 1), four parallel data streams are used,
characterized by closely aligned structures. In the sec-
ond stage (Phase 2), which is trained independently,
the information content of these distinct branches is
fused.
Figure 2: Overview of the proposed approach.
4.2 Image Streams
The implementation of different sub-streams is mo-
tivated by the intention to enforce the extraction and
utilization of various image features that may be bene-
ficial in different circumstances and for different types
of pills. Our approach implements the following sub-streams (a preprocessing sketch is given after the list):
1. Contour-stream: Contour images are produced by
applying the Canny edge detector on a smoothed
grayscale representation of the images, performed
by a 7 × 7 Gaussian convolution kernel.
2. LBP-stream: LBP (Ojala et al., 1994), a widely
adopted handcrafted local descriptor, finds appli-
cation in numerous computer vision tasks, such as
OCR (for both handwritten and printed text). We
computed LBP representations from the grayscale
inputs and integrated them into the same type of
streams as the other descriptors.
3. RGB-stream: For RGB images, colour represen-
tations are directly fed into the embedding deep
neural network.
4. Texture-stream: The images are generated by the
subtraction of the smoothed grayscale representa-
tions from the original (grayscale) pill images.
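The stream inputs can be generated with standard image processing operations; the snippet below is a minimal sketch using OpenCV and scikit-image. Apart from the 7×7 Gaussian kernel mentioned above, the Canny thresholds, the LBP parameters, and the reuse of the same smoothed image for the texture stream are assumptions, not the exact settings of the paper.

```python
# Sketch of the four stream inputs (contour, LBP, RGB, texture).
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

def make_stream_inputs(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    smoothed = cv2.GaussianBlur(gray, (7, 7), 0)

    # 1. Contour-stream: Canny edges of the smoothed grayscale image
    #    (thresholds are illustrative assumptions).
    contour = cv2.Canny(smoothed, 50, 150)

    # 2. LBP-stream: local binary pattern of the grayscale input
    #    (P=8 neighbours, R=1 radius are assumed settings).
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")

    # 3. RGB-stream: the colour image itself (converted to RGB order).
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)

    # 4. Texture-stream: original grayscale minus its smoothed version.
    texture = cv2.subtract(gray, smoothed)

    return contour, lbp.astype(np.float32), rgb, texture
```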
In (Rádli et al., 2023) we analyzed the positive contribution of the LBP-stream and the attention mechanism; the ablation study of all sub-streams is omitted from this article due to space considerations.
For all image-based sub-streams we applied the well-proven EfficientNetV2 S (Tan and Le, 2021). The
training mechanism is explained in Subsection 4.4.
4.3 Text Embedding
In our paper, we used the hu_core_news_lg CNN-
based, large, pre-trained language model from the
Python library called HuSpaCy (Orosz et al., 2023).
The hu_core_news_lg model is trained on the large Hungarian Webcorpus 2.0, which includes more than 9 billion words based on news texts, and it is part of spaCy's language model pipeline. This
model provides tokenization, sentence splitting, part-
of-speech tagging, lemmatization, dependency anal-
ysis and named entity recognition, and includes pre-
trained word vectors. The ”lg” in the model name
stands for ”large”, indicating that it is a relatively
large-sized model with a broader vocabulary and po-
tentially better performance on certain tasks com-
pared to smaller models. It contains 200 000 unique
word vectors and produces 300-dimensional floret
vectors.
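As an illustration, the following sketch (our assumption of the workflow, not the exact published code) embeds the free-text class descriptions with the hu_core_news_lg model and builds the pill-to-pill Euclidean distance matrix used later for the dynamic margin. The example descriptions are hypothetical placeholders.

```python
# Sketch: embed class descriptions with HuSpaCy and compute their
# pairwise Euclidean distance matrix (cf. Figure 4).
import numpy as np
import spacy

nlp = spacy.load("hu_core_news_lg")  # HuSpaCy model, installed separately

# Hypothetical free-text leaflet descriptions, one per pill class.
descriptions = {
    "ClassA": "Fehér, kerek, domború tabletta.",
    "ClassB": "Sárga, hosszúkás filmtabletta.",
}

# doc.vector averages the word vectors of a description (300-dimensional).
vectors = np.stack([nlp(text).vector for text in descriptions.values()])

# Euclidean distances between every pair of class descriptions.
diff = vectors[:, None, :] - vectors[None, :, :]
dist_matrix = np.linalg.norm(diff, axis=-1)
print(dist_matrix)
```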
4.4 Training of Sub-Streams
To train the stream networks, we employ SNNs (siamese neural networks) featuring three inputs, where an anchor image is denoted by I_a, a positive example is represented by I_p, and a negative example is denoted by I_n (see the schematic illustration in Figure 3). In contrast to the approach employed
by Ling et al. (Ling et al., 2020), where relatively
light convolutional neural networks (CNNs) were uti-
lized, our models take advantage of the more refined
architecture of EfficientNetv2 S (small-scaled DNN
model). EfficientNet (Tan and Le, 2019) is a gen-
erally applicable network for computer vision tasks,
optimized for depth, width, and resolution. In con-
trast to the EfficientNet (v1) backbone EfficientNetV2
(Tan and Le, 2021) has several significant improve-
ments. First, EfficientNetV2 extensively incorporates
both MBConv (inverted residual block) and fused-
MBConv, which has already proven to be very effi-
cient in MobileNetV2 (Sandler et al., 2018). Secondly, one of the key differences is that EfficientNetV2 favors smaller expansion ratios in MBConvs, aiming to minimize the memory access overhead associated with higher ratios. Moreover, EfficientNetV2 uses smaller 3×3 kernel sizes. To compensate for this reduced receptive field, more layers are utilized. Finally, another
distinctive feature is the complete removal of the last
stride-1 stage present in the original EfficientNet.
According to (Tan and Le, 2021), EfficientNetV2 outperformed its competitors in accuracy on ImageNet ILSVRC2012 while learning significantly faster than others; nevertheless, other backbone networks could also be used for the proposed sub-streams. The CNN of (Ling et al., 2020) has 9M parameters for the RGB stream and 2.2M for the texture and contour streams, while EfficientNetV2 S, with multihead attention involved, has 20.4M parameters.
Before the concatenation of the embedding vec-
tors, we integrated a multihead attention module
(Vaswani et al., 2017) within each stream. To gather
the information from these data streams, the resulting
output vectors were concatenated and a hidden layer
and an output layer were combined to create the final
embedding (see Figure 2). Notably, during the train-
ing of the fusion network the individual streams were
kept frozen, preserving their existing parameters.
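A simplified PyTorch sketch of this fusion stage is given below; the attention and layer sizes are illustrative assumptions, not the published configuration.

```python
# Sketch of the fusion network (Phase 2): per-stream multihead attention,
# concatenation, and a hidden + output layer forming the final embedding.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, stream_dim=128, num_streams=4, num_heads=4, out_dim=128):
        super().__init__()
        # One multihead attention module per stream (contour, LBP, RGB, texture).
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(stream_dim, num_heads, batch_first=True)
             for _ in range(num_streams)]
        )
        # Hidden layer + output layer applied to the concatenated vectors.
        self.head = nn.Sequential(
            nn.Linear(stream_dim * num_streams, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, stream_embeddings):
        # stream_embeddings: list of (batch, stream_dim) tensors from the frozen streams.
        attended = []
        for attn, x in zip(self.attn, stream_embeddings):
            q = x.unsqueeze(1)            # (batch, 1, dim)
            out, _ = attn(q, q, q)        # self-attention on the stream embedding
            attended.append(out.squeeze(1))
        fused = torch.cat(attended, dim=1)  # concatenate the four streams
        return self.head(fused)             # final joint embedding
```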
Figure 3: Overview of the training of a stream.
4.5 Loss Functions
The objective of metric embedding learning is to acquire a parametric function f_θ(I) : R^F → R^E, parameterized by θ (the parameters of the embedding network). This function is designed to map images into the embedding space so that similar images are placed at metrically close positions and dissimilar images at distant points. This is achieved by the triplet loss (Schroff et al., 2015), presented with the inputs I_a, I_p, and I_n. Formally:

L_tri = \sum_{(I_a, I_p, I_n)} [ m + D(f(I_a), f(I_p)) - D(f(I_a), f(I_n)) ]_+        (1)

where D is the distance function, and the margin m defines how far negative samples are to be placed.
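For reference, Eq. 1 with a fixed margin m corresponds to the standard triplet margin loss; a minimal PyTorch sketch with the Euclidean distance as D:

```python
# Fixed-margin triplet loss of Eq. 1 (sketch); f_a, f_p, f_n are embedding
# vectors of the anchor, positive, and negative images.
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, m=0.5):
    d_ap = F.pairwise_distance(f_a, f_p)   # D(f(I_a), f(I_p))
    d_an = F.pairwise_distance(f_a, f_n)   # D(f(I_a), f(I_n))
    return torch.clamp(m + d_ap - d_an, min=0).mean()
```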
4.5.1 Dynamic Margin Triplet Loss
The proposed dynamic margin triplet loss (DMTL) is responsible for maintaining larger margins for less similar pills and smaller margins for more similar ones. This is achieved by changing the value of m in Eq. 1 as a function of the distances of word embeddings:

m = \alpha \cdot d_i^{Norm}        (2)

where α is the default margin value (set to 0.5). The normalized distance d_i^{Norm} is based on the distance between the anchor and the negative word embeddings:

d_i^{Norm} = 1 + \frac{(u - 1) \cdot (d_i - d_{min})}{d_{max} - d_{min}}        (3)

where d_i is the distance value from the Euclidean distance matrix, d_max is the maximum and d_min is the minimum distance in a row of the matrix, and u is the upper limit.
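The dynamic margin of Eq. 2 and Eq. 3 can be computed directly from a row of the word-embedding distance matrix; the sketch below reflects our reading of the equations (with α = 0.5) and plugs the resulting margin into the triplet loss.

```python
# Dynamic margin triplet loss (DMTL) sketch: the margin is scaled by the
# normalized word-embedding distance between the anchor and negative classes.
import torch

def dynamic_margin(d_i, d_row, u=4.0, alpha=0.5):
    # d_i: word-embedding distance between anchor and negative class,
    # d_row: the anchor's row of the Euclidean distance matrix.
    d_min, d_max = d_row.min(), d_row.max()
    d_norm = 1.0 + (u - 1.0) * (d_i - d_min) / (d_max - d_min)   # Eq. 3
    return alpha * d_norm                                        # Eq. 2

def dmtl(f_a, f_p, f_n, d_i, d_row, u=4.0, alpha=0.5):
    m = dynamic_margin(d_i, d_row, u, alpha)
    d_ap = torch.norm(f_a - f_p, dim=1)   # visual anchor-positive distance
    d_an = torch.norm(f_a - f_n, dim=1)   # visual anchor-negative distance
    return torch.clamp(m + d_ap - d_an, min=0).mean()
```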
Figure 4: Visual illustration of pill-to-pill Euclidean dis-
tances based on word embeddings of the applied pre-trained
language model.
Figure 4 shows the distance values (d_i) of pills
based on the pre-trained language model embeddings
of the official free-text description leaflets.
In Figure 5 we have selected the pill which is least
similar (Algoflex Rapid - AR) to most of the others
and the one that is very similar to others (Algopyrin -
A). AR is responsible for the yellow vertical and hor-
izontal lines in Figure 4. In the first row of Figure
5 the most similar (Tritace - T) and least similar (CalciKid - CK) pills are given with their distance values.
The second row shows the pills most similar to Algo-
pyrin - namely Dorithricin (D) and Naprosyn (N). Although we can intuitively agree on the distances within
the rows, we can think that N is more similar to A
than T is to AR, which is not represented by distances
based on word embeddings. Thus, the pre-trained em-
beddings contain usable information but could be im-
proved in the future. The impact of using DMTL is
revealed in the next section.
Figure 5: Visual illustration of examples of pill-pill Eu-
clidean distances (d) based on word embeddings of the ap-
plied pre-trained model. First row: Algoflex Rapid and the
two pills that are most and least similar to it. Bottom row:
Algopyrin and the two most similar drugs to it.
Investigating the literature we found the most
similar idea in (Zhou et al., 2020) called Ladder-
loss. Ladder-loss was proposed as a loss function
for visual-semantic embedding. There, the triplet loss is used to minimize the distance between a query image and its corresponding sentence while maximizing the distances between the query image and a set of irrelevant sentences. First, the relevance degrees between images and each sentence are calculated using similarity functions based on an NLP embedding model. After that, these relevance degree values are divided into L levels with predefined thresholds (the rungs of the ladder). The similarities of a query image with the texts are placed in an inequality chain, based on this level-based classification of the texts. Finally, these inequality chains define individual margin values inserted into the triplet loss (see Eq. 1).
In contrast to Ladder-loss, in our approach, we
are injecting semantic (textual) information into our
purely visual embedding model. Thus we improve
the training of the visual metrics model from a bi-
nary (similar/dissimilar) approach to a more refined
scaled method. Moreover, we do not classify the samples into categories with distinct levels of relevance but apply the above-defined normalization to directly obtain continuous margin values, as defined by Eq. 2 and Eq. 3.
5 EXPERIMENTS
In our paper, we conduct ”two-sided” tests, where
both sides of the pills are categorized into the same
class, and we apply 5-fold cross-validation to get bet-
ter statistical reliability.
During the evaluation procedures we adhered to
the standard method where the query image under-
went the embedding process and the resulting embed-
ding vector was compared to the embedding vectors
of randomly chosen reference pills using Euclidean
distance. (In the future, we intend to test k-nearest
neighbours matching, as there are typically several
available reference images per class.) We ranked the
results to determine the Top 1 and Top 5 accuracy values, listed in Tables 2, 3, and 4 for different values of u in Eq. 3. We have also investigated all possible configurations for including the dynamic margin triplet loss in the two phases of our approach (see the first two columns of Tables 2-4).
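The ranking-based evaluation can be summarized by the following sketch, a simplified version of the protocol that assumes one reference embedding per class:

```python
# Sketch of the evaluation: a query embedding is compared with the reference
# embeddings by Euclidean distance, and Top-1 / Top-k accuracy is counted.
import numpy as np

def top_k_accuracy(query_embs, query_labels, ref_embs, ref_labels, k=5):
    top1 = topk = 0
    for emb, label in zip(query_embs, query_labels):
        dists = np.linalg.norm(ref_embs - emb, axis=1)    # Euclidean distances
        ranked = np.array(ref_labels)[np.argsort(dists)]  # classes ranked by distance
        top1 += int(ranked[0] == label)
        topk += int(label in ranked[:k])
    n = len(query_labels)
    return top1 / n, topk / n
```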
During the training of the sub-streams, the follow-
ing hyper-parameters were set: the Adam optimizer
was used, the learning rate was set to 1 × 10^-4 for all four streams, weight decay regularization was applied with a coefficient of 1 × 10^-5, and the batch size was set to 32. The default value of the margin m was
chosen to be 0.5 for both loss functions. Each model
was trained for a total of 30 epochs and only the best
weight file was kept.
As for the fusion phase (Phase 2), distinct hyper-
parameter settings were implemented. The batch
size was set to 32, the learning rate was adjusted to 2 × 10^-4 for enhanced stability, and weight decay was initialized at 1 × 10^-8. In this phase, we intro-
duced a learning rate scheduling mechanism where
the initial learning rate was updated every 5 epochs
with a gamma of 0.1. The network was trained over
30 epochs and again only the file containing the best
weighting factors was preserved.
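In PyTorch terms, the Phase 2 optimization setup described above corresponds roughly to the following sketch; fusion_net and the training step are placeholders, not the actual training code.

```python
# Sketch of the Phase 2 (fusion) optimization setup: Adam, lr 2e-4,
# weight decay 1e-8, StepLR with step 5 and gamma 0.1, 30 epochs.
import torch

fusion_net = torch.nn.Linear(512, 128)  # placeholder for the real fusion network
optimizer = torch.optim.Adam(fusion_net.parameters(), lr=2e-4, weight_decay=1e-8)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(30):
    # ... one epoch of DMTL training over the frozen-stream embeddings ...
    optimizer.step()      # after back-propagating the loss of each batch
    scheduler.step()      # decay the learning rate once per epoch
```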
Figure 6 displays the training and validation loss
curves of our model in Phase 2 (both phases used
word embeddings, and the u parameter in our loss
function was set to 4). The loss curves indicate a sta-
ble model where no signs of overfitting or underfitting
are observed.
The training and testing processes were run on an
NVIDIA Quadro RTX 5000 GPU card equipped with
16 GB of VRAM. The following tables show
our experimental results:
Throughout the experiments, the use of DMTL
had a clearly positive effect on the accuracy of the
model. Specifically, when DMTL was applied in both
Phase 1 and Phase 2, the model consistently achieved
the highest performance, reaching a peak Top 1 accuracy of 93.56% for u = 3 and a peak Top 5 accuracy of 99.84% for u = 4. This confirms the effectiveness of DMTL in enhancing the model's capabilities for pill recognition.

Figure 6: Evolution of loss functions during training.

Table 2: Results of the ablation study of the dynamic margin triplet loss (DMTL) on the OGYEIv2 dataset, u = 2.

Phase 1     Phase 2     Top 1 Acc.   Top 5 Acc.
w/o DMTL    w/o DMTL    91.72        98.125
w/o DMTL    w/ DMTL     91.72        98.125
w/ DMTL     w/o DMTL    92.34        98.43
w/ DMTL     w/ DMTL     93.43        99.06

Table 3: Results of the ablation study of the dynamic margin triplet loss (DMTL) on the OGYEIv2 dataset, u = 3.

Phase 1     Phase 2     Top 1 Acc.   Top 5 Acc.
w/o DMTL    w/o DMTL    91.72        98.125
w/o DMTL    w/ DMTL     92.81        99.21
w/ DMTL     w/o DMTL    92.96        99.21
w/ DMTL     w/ DMTL     93.56        99.37

Table 4: Results of the ablation study of the dynamic margin triplet loss (DMTL) on the OGYEIv2 dataset, u = 4.

Phase 1     Phase 2     Top 1 Acc.   Top 5 Acc.
w/o DMTL    w/o DMTL    91.72        98.125
w/o DMTL    w/ DMTL     92.5         99.06
w/ DMTL     w/o DMTL    92.96        99.06
w/ DMTL     w/ DMTL     93.53        99.84
6 CONCLUSION
In the domain of pill recognition, we have adopted
the metrics learning methodology of previously suc-
cessful approaches (Zeng et al., 2017), (Ling et al., 2020), (Rádli et al., 2023), and (Rádli et al., 2023b). Besides the fact that our data streams are more uniform and can be trained in more straightforward ways than the previous models of (Zeng et al., 2017) and (Ling et al., 2020), we introduced a new triplet loss called the dynamic margin triplet loss. With the help of DMTL we reached notable improvements (1.84% in Top 1 accuracy, u = 3) compared to the best previous model (Rádli et al., 2023b). DMTL injects textual information, generated by a general language model embedding, into the visual model. Thus we learnt that even short free text can add useful information to visual models in object recognition. We utilized k-fold
cross-validation to test the robustness of our model,
ensuring the evaluation of its performance across di-
verse subsets of the data. Additionally, we also cre-
ated a novel pill dataset, named OGYEIv2.
In the near future, we are going to extend our tests
to larger datasets, such as the NLM-NIH (Yaniv et al.,
2016), and plan to perform not only two-sided but also
one-sided tests.
ACKNOWLEDGEMENTS
This work has been partly supported by the 2020-
1.1.2-PIACI-KFI-2021-00296 and the TKP2021-
NVA-10 project of the National Research, Develop-
ment and Innovation Fund. We also acknowledge the
financial support of the Hungarian Scientific Research
Fund grant OTKA K-135729. We are grateful to the
NVIDIA corporation for supporting our research with
GPUs obtained by the NVIDIA Hardware Grant Pro-
gram. Last but not least, we would like to thank József Bene for his work in creating the dataset.
REFERENCES
Busta, M., Neumann, L., and Matas, J. (2017). Deep
textspotter: An end-to-end trainable scene text local-
ization and recognition framework. In Proceedings of
the IEEE International Conference on Computer Vi-
sion, pages 2204–2212.
Cronenwett, L. R., Bootman, J. L., Wolcott, J., Aspden, P.,
et al. (2007). Preventing medication errors. National
Academies Press.
Heo, J., Kang, Y., Lee, S., Jeong, D.-H., and Kim, K.-M.
(2023). An accurate deep learning–based system for
automatic pill identification: Model development and
validation. J. Med. Internet Res., 25:e41043.
Ling, S., Pastor, A., Li, J., Che, Z., Wang, J., Kim, J., and
Callet, P. L. (2020). Few-shot pill recognition. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 9789–9798.
Nguyen, A. D., Nguyen, T. D., Pham, H. H., Nguyen, T. H.,
and Nguyen, P. L. (2022). Image-based contextual pill
recognition with medical knowledge graph assistance.
In Asian Conference on Intelligent Information and
Database Systems, pages 354–369. Springer.
Ojala, T., Pietikainen, M., and Harwood, D. (1994). Perfor-
mance evaluation of texture measures with classifica-
tion based on Kullback discrimination of distributions.
In Proceedings of 12th International Conference on
Pattern Recognition, volume 1, pages 582–585. IEEE.
Orosz, G., Szabó, G., Berkecz, P., Szántó, Z., and Farkas,
R. (2023). Advancing Hungarian Text Processing
with HuSpaCy: Efficient and Accurate NLP Pipelines.
In Text, Speech, and Dialogue, pages 58–69, Cham.
Springer Nature Switzerland.
Rádli, R., Vörösházi, Z., and Czúni, L. (2023b). Pill met-
rics learning with multihead attention. In Proceedings
of the 15th International Joint Conference on Knowl-
edge Discovery, Knowledge Engineering and Knowl-
edge Management, pages 132–140. SCITEPRESS - Science and Technology Publications.
Rádli, R., Vörösházi, Z., and Czúni, L. (2023). Multi-
stream pill recognition with attention. In 2023 IEEE
12th International Conference on Intelligent Data Ac-
quisition and Advanced Computing Systems: Tech-
nology and Applications (IDAACS), volume 1, pages
942–946.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
Chen, L.-C. (2018). Mobilenetv2: Inverted residu-
als and linear bottlenecks. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 4510–4520.
Schroff, F., Kalenichenko, D., and Philbin, J. (2015).
Facenet: A unified embedding for face recognition
and clustering. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition,
pages 815–823.
Tan, L., Huangfu, T., Wu, L., and Chen, W. (2021). Com-
parison of RetinaNet, SSD, and YOLOv3 for real-time
pill identification. BMC Medical Informatics and De-
cision Making, 21:1–11.
Tan, M. and Le, Q. (2019). EfficientNet: Rethinking model
scaling for convolutional neural networks. In Interna-
tional Conference on Machine Learning, pages 6105–
6114. PMLR.
Tan, M. and Le, Q. (2021). Efficientnetv2: Smaller mod-
els and faster training. In International conference on
machine learning, pages 10096–10106. PMLR.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in Neural
Information Processing Systems, 30.
Yang, L. and Jin, R. (2006). Distance metric learning:
A comprehensive survey. Michigan State University,
2(2):4.
Yaniv, Z., Faruque, J., Howe, S., Dunn, K., Sharlip, D.,
Bond, A., Perillan, P., Bodenreider, O., Ackerman,
M. J., and Yoo, T. S. (2016). The National Library
of Medicine pill image recognition challenge: An ini-
tial report. In 2016 IEEE Applied Imagery Pattern
Recognition Workshop (AIPR), pages 1–9. IEEE.
Zeng, X., Cao, K., and Zhang, M. (2017). Mobiledeep-
pill: A small-footprint mobile deep learning system
for recognizing unconstrained pill images. In Pro-
ceedings of the 15th Annual International Conference
on Mobile Systems, Applications, and Services, pages
56–67.
Zhou, M., Niu, Z., Wang, L., Gao, Z., Zhang, Q., and Hua,
G. (2020). Ladder loss for coherent visual-semantic
embedding. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 34, pages 13050–
13057.