Ontology-Driven Deep Learning Model for Multitask Visual Food Analysis
Daniel Ponte¹, Eduardo Aguilar¹,², Mireia Ribera¹ and Petia Radeva¹,³

¹Dept. de Matemàtiques i Informàtica, Universitat de Barcelona, Gran Via de les Corts Catalanes 585, Barcelona, Spain
²Dept. de Ingeniería de Sistemas y Computación, Universidad Católica del Norte, Angamos 0610, Antofagasta, Chile
³Computer Vision Center, Cerdanyola (Barcelona), Spain
Keywords: Food Ontology, Food Image Analysis, Multitask Learning.
Abstract: Food analysis from images is a challenging task that has gained significant attention due to its multiple applications, especially in the field of health and nutrition. Ontology-driven deep learning techniques have shown promising results in improving model performance. A food ontology can leverage domain-specific information to guide model learning and thus substantially enhance food analysis. In this paper, we propose a new ontology-driven multi-task learning approach for food recognition. To this end, we deal with multi-modal information, text and images: from the text we extract a food ontology, which represents prior knowledge about the relationships between food concepts at different semantic levels (e.g., food groups and food names), and we use this information to guide the learning of the multi-task model. The proposed method was validated on the public food dataset MAFood-121, specifically on dishes belonging to Mexican cuisine, outperforming the results obtained in single-label food recognition and multi-label food group recognition. Moreover, the proposed integration of the ontology into the deep learning framework allows the model to provide more consistent results across tasks.
1 INTRODUCTION
Food recognition from images has gained significant attention due to its various applications, including diet tracking (Ming et al., 2018), food recommendation (Deldjoo et al., 2020), and health analysis (Allegra et al., 2020). Despite this, food recognition remains a challenging task due to the complexity of food images, which can vary in terms of appearance, size, shape, texture, and color (Jiang et al., 2019). Additionally, food may be presented in different settings, such as plates, bowls, or trays, adding another layer of complexity to the classification task. Therefore, a well-designed deep learning model is required to correctly address the classification of food images.
In recent years, the use of ontologies in the classification of food images has shown promising results (Zhao et al., 2021; Wang et al., 2022), which has led to the development of ontology-based image classification techniques.
However, the lack of a standardized food ontology is a major obstacle to food interoperability between different systems and applications. This can lead to ambiguity and confusion in how food is classified and described, making integration across systems difficult. It can also have a negative impact on areas such as public health, nutrition research, agriculture, and the food industry, among others, where accurate and complete information about food is important.
The integration of a food ontology into a deep learning framework could provide several benefits, such as: a) relating the food concepts predicted by the model to specific diseases, to prevent harm to the health of people suffering from them (Donadello and Dragoni, 2019); b) relating the food concepts predicted by the model to allergens, to prevent food allergy or intolerance; and c) training a multi-task model that is aware of the coexistence and exclusion of food concepts at different semantic levels (Wang et al., 2022), just to mention a few.
In the food domain, (Popovski et al., 2020) provides an overview and comparison of named-entity
recognition methods, which can be used for automated extraction of information about food from text; four methods are discussed: FoodIE (Popovski et al., 2019), NCBO (SNOMED CT), NCBO (OntoFood) and NCBO (FoodON) (Stojanov et al., 2020). The comparison is performed using a dataset of 1,000 recipes taken from Allrecipes (Song et al., 2023). Ontologies have also been used for extracting food information from text (Popovski et al., 2020) or for substituting ingredients in recipes (Ławrynowicz et al., 2022). Although these approaches are interesting, they do not directly focus on classifying food images.
In the area of large-scale visual recognition, several approaches have been proposed that combine ontologies and deep learning (Kuang et al., 2018; Zhang et al., 2019). In particular, in the field of food image recognition, there has lately been a growing interest. In (Divakar et al., 2019), the problem of predicting Type 2 Diabetes Mellitus is addressed by proposing an ontology-based model to improve the accuracy of food recognition algorithms. (Wang et al., 2022) proposes an ingredient ontology and joint learning, (Zhao et al., 2021) presents a fusion learning framework with semantic embedding, (Donadello and Dragoni, 2019) focuses on ontologies and deep neural networks, and (Kuang et al., 2018) introduces multi-level deep learning.
Our method differs from previous approaches in
that it focuses on building a well-structured food on-
tology to enable knowledge transfer from it to a neu-
ral network, with the aim of leveraging multi-modal
data (text and image) for uni-modal multitask recog-
nition of food images. The idea behind this work is
that an adequate ontology is essential for the accurate
classification of foods. By building a strong ontol-
ogy, a clear and coherent structure can be established
to represent knowledge about food and its properties.
This ontology can be used to guide the classification
of foods through a neural network, allowing greater
precision and a better understanding of the character-
istics of each food. Moreover, transferring knowledge
from the ontology to the neural network can also help
improve the generalization capability of the network.
Our main contributions are:
- Careful building of a specialized food ontology from textual data retrieved from public food recipes.
- Ensuring consistency in multi-task results by integrating prior knowledge extracted from the ontology into a food image classification model.
- A proposed ontology-driven method that improves performance at both the dish and the food group level compared to the baseline approach.
Figure 1: Ontology creation diagram.
2 METHODOLOGY
This section describes the process of building a food
ontology and its application in the classification of
food groups and dishes.
2.1 Food Ontology
The food ontology serves as a fundamental pillar of this methodology. In the building of the proposed
food ontology, we focus on two semantic levels of
food concepts: the dish name (e.g., caesar salad, gua-
camole) and the food groups (e.g., vegetables, bread).
An illustration of the stages involved in the proposed
process can be seen in Fig. 1. First, information is
collected from various web sources, including food
websites. From these resources, lists of recipes cover-
ing a wide range of food dishes are obtained. To carry
out the construction of the ontology, the most repre-
sentative recipes for each dish are extracted. The on-
tological hierarchy is established by grouping ingre-
dients at different semantic levels that allow classify-
ing and relating food groups to dishes. This process
involves converting ingredients into more general cat-
egories, following the guidance provided by the He-
lis food ontology (Dragoni et al., 2018). This step
significantly simplifies the organization and search of
the food data. The resulting ontology is populated with precise and coherent relationships between food groups and dishes.
Each stage is described in more detail in the fol-
lowing subsections.
2.1.1 Recipe Crawler and Assignment
The first stage consists of collecting recipes related
to a specific dish from food web pages enriched with
Google recipe metadata, such as Yummly (Yummly, 2023) and AllRecipes (Allrecipes, 2023), to identify the ingredients commonly used in its preparation. The retrieval is performed with the Beautiful Soup Python library using web scraping strategies and taking advantage of the structured information. From that, a list of R recipes is compiled for each food dish, taking into account the similarity between the dish name and the recipe title.
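As an illustration, a minimal sketch of this retrieval step is shown below, assuming (as is common on pages enriched with Google recipe metadata) that each page embeds a schema.org Recipe object as JSON-LD; the URL handling and field names are assumptions, not the authors' published code.

```python
# Minimal sketch: extract the schema.org Recipe object embedded as JSON-LD
# in a recipe page. The page structure is an assumption for illustration.
import json
import requests
from bs4 import BeautifulSoup

def fetch_recipe(url: str) -> dict:
    """Return title and ingredient list from a recipe page, or {} if absent."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.get_text())
        except json.JSONDecodeError:
            continue
        # Recipe objects may come as a single dict, a list, or a @graph node.
        items = data if isinstance(data, list) else data.get("@graph", [data])
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Recipe":
                return {"title": item.get("name", ""),
                        "ingredients": item.get("recipeIngredient", [])}
    return {}
```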
2.1.2 Unification of the Ingredients List
Once all the recipes are linked to the dishes belonging
to the target dataset, the next step is to extract a list of
the ingredients from the information provided in the
recipes. For this purpose, natural language processing techniques (available in the NLTK library) are applied to analyze the text, including the removal of anything that differs from the ingredient names as part of the normalization process (e.g., stopwords, verbs, culinary measurements, etc.). As a result, a unique list of ingredients is created. This list is generated to facilitate the construction of the ontology and to normalize the data, which is essential for the subsequent stage. Finally, the unique list of ingredients is subjected to a further refinement process. Selective removals are made manually on ingredients that are not visible or would not be used in the experiments (e.g., salt, pepper, vinegar). This process is essential to ensure that the ontology is composed only of relevant ingredients.
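A minimal sketch of such a normalization step is given below, using the NLTK stopword list and part-of-speech tags; the measurement vocabulary and filtering rules are illustrative assumptions, since the paper does not publish the exact rules.

```python
# Minimal sketch: strip stopwords, verbs/adverbs, numbers and culinary
# measurements from a raw ingredient line, keeping the ingredient name.
import nltk
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

for pkg in ("stopwords", "punkt", "averaged_perceptron_tagger"):
    nltk.download(pkg, quiet=True)

STOP = set(stopwords.words("english"))
MEASUREMENTS = {"cup", "cups", "tablespoon", "tablespoons", "teaspoon",
                "teaspoons", "oz", "ounce", "pound", "g", "kg", "ml"}

def normalize_ingredient(line: str) -> str:
    """Keep only tokens that can plausibly be part of an ingredient name."""
    kept = []
    for token, tag in pos_tag(word_tokenize(line.lower())):
        if token in STOP or token in MEASUREMENTS or not token.isalpha():
            continue
        if tag.startswith(("VB", "RB")):  # drop verbs ("chopped") and adverbs
            continue
        kept.append(token)
    return " ".join(kept)

print(normalize_ingredient("2 cups of finely chopped red onion"))  # -> "red onion"
```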
2.1.3 Building Ontology
In this stage, the recipes and their corresponding ingredients are linked to each dish. In fact, the ingredients are grouped into high-level food concepts (food groups) such as "meat", "vegetables", "fruits", "cereals", "dairy products" and "spices", among others, identified through SPARQL queries on the Helis ontology (Donadello and Dragoni, 2019) with the GraphDB tool (Güting, 1994). These food groups may be different from those available in the target dataset. Therefore, an additional step linking the food groups to the available annotations is necessary. This step is performed manually in our experiments, where, for example, food groups such as 'beans' and 'fruits' were considered vegetables to preserve the original annotations of the target dataset.
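The following sketch illustrates how such a grouping query could be issued from Python, assuming a local GraphDB repository exposing the Helis ontology over SPARQL; the endpoint URL and the property path are hypothetical, not the authors' actual queries.

```python
# Minimal sketch: ask a GraphDB SPARQL endpoint for the ancestor classes
# (food groups) of an ingredient in the Helis hierarchy.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:7200/repositories/helis"  # hypothetical repository

def food_groups_of(ingredient_iri: str) -> list:
    query = f"""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?group WHERE {{
        <{ingredient_iri}> rdfs:subClassOf+ ?group .
    }}"""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["group"]["value"] for b in results["results"]["bindings"]]
```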
2.1.4 Making a Relationship Ontology Matrix
To deepen the analysis and exploitation of the ontol-
ogy, a coexistence matrix is created. This matrix cap-
tures the relationship between food groups and dishes.
However, it is highlighted that this structure can be scaled to include more semantic levels, which would allow a more detailed and granular representation.

Figure 2: The framework of the proposed method.
The central purpose of the matrix is to quantify the
presence of food groups in each of the dishes. Each
cell in the matrix stores the number of times a specific
food group appears in a particular dish. This provides
a quantitative view of the composition of each dish in
terms of food groups.
The relationship between semantic levels is re-
flected in the structure of the relationship matrix (see
equation 1). The columns represent the names of the
dishes, while the rows refer to the food groups. The
values in the cells of the matrix indicate the strength
of the relationship between food groups and dishes.
This reflects not only how many times a food group is found in a dish, but also the diversity of food groups present in it. The matrix also allows us to understand the relationship between the semantic levels of the food groups and their distribution in the dishes.
The relationship matrix RM is formally defined as
follows:
$$
RM = \begin{pmatrix}
\frac{\sum_{r=1}^{|R_1|} |FG_1 \cap R_1^r|}{|R_1|} & \cdots & \frac{\sum_{r=1}^{|R_D|} |FG_1 \cap R_D^r|}{|R_D|} \\
\vdots & \frac{\sum_{r=1}^{|R_d|} |FG_g \cap R_d^r|}{|R_d|} & \vdots \\
\frac{\sum_{r=1}^{|R_1|} |FG_G \cap R_1^r|}{|R_1|} & \cdots & \frac{\sum_{r=1}^{|R_D|} |FG_G \cap R_D^r|}{|R_D|}
\end{pmatrix} \quad (1)
$$
where $|R_d|$ corresponds to the number of recipes linked to the $d$-th dish, $R_d^r$ denotes the list of food groups for the $r$-th recipe linked to the $d$-th dish, and $FG_g$ denotes the $g$-th food group.
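A minimal sketch of how RM could be computed from the crawled data is shown below; the input structure (a list of recipes per dish, each reduced to its set of food groups) is an assumption consistent with the definitions above.

```python
# Minimal sketch: build the relationship matrix RM of Eq. 1.
import numpy as np

def build_relationship_matrix(recipes_per_dish, food_groups):
    """RM[g, d] = fraction of recipes of dish d that contain food group g.

    recipes_per_dish: list over dishes; each entry is a list of recipes,
    where each recipe is the set of food groups among its ingredients.
    """
    G, D = len(food_groups), len(recipes_per_dish)
    rm = np.zeros((G, D))
    for d, recipes in enumerate(recipes_per_dish):
        for g, group in enumerate(food_groups):
            rm[g, d] = sum(group in r for r in recipes) / max(len(recipes), 1)
    return rm
```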
2.2 Ontology-Driven Multitask Food
Recognition
The proposed ontology-driven deep learning method
for performing multi-task food recognition is illus-
trated in Fig. 2. In deep learning, a multi-task ap-
proach can be performed from a generic network,
where all parameters are shared to extract features, to
a specific network, where independent networks are
used for each task (Misra et al., 2016). For multi-task
food recognition, a generic network is considered due to the similarity of the tasks at hand (food recognition and food group recognition), where both can benefit from the general features extracted by the backbone (e.g., ResNet50 (He et al., 2016)). On top of the backbone, a dropout layer is added to avoid overfitting, followed by a task-specific fully connected layer for each task.
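A minimal sketch of this shared-backbone architecture is shown below, assuming a PyTorch implementation (the paper does not publish code); 2048 is the ResNet50 feature dimension.

```python
# Minimal sketch: shared ResNet50 backbone with one head per task.
import torch.nn as nn
from torchvision import models

class MultiTaskFoodNet(nn.Module):
    def __init__(self, num_dishes: int, num_groups: int, p_drop: float = 0.1):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop final fc
        self.dropout = nn.Dropout(p_drop)
        self.dish_head = nn.Linear(2048, num_dishes)    # single-label dish task
        self.group_head = nn.Linear(2048, num_groups)   # multi-label food-group task

    def forward(self, x):
        h = self.dropout(self.features(x).flatten(1))   # shared 2048-d features
        return self.dish_head(h), self.group_head(h)    # one logits vector per task
```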
For food recognition, a softmax activation is applied on the logits layer to provide the probability of the most likely dish. For food group recognition, as in any multi-label task, a sigmoid activation is applied on the logits layer to provide an independent probability for each group. Afterwards, the probability of each food concept belonging to each task is obtained. An essential component of this network is the integration of the food ontology, specifically the relationship matrix, which acts as an additional layer reflecting the hierarchical and semantic relationships between food groups and dishes. The output of this layer is aggregated with the output of the food group head to provide refined food group probabilities. The ontology provides contextual information to the network, allowing a better understanding of the composition of the dishes and a more precise classification. The interaction between the ontology and the network is achieved by a custom layer, which weights the model predictions based on the relationships established in the ontology.
The probability of the refined food groups is formally defined as follows:

$$
p(y_g|W, RM) = \lambda \cdot p(y_g|W) + (1-\lambda) \cdot \sum_{d=1}^{D} p(y_d = d|W) \cdot p(y_g|y_d = d), \quad (2)
$$

$$
p(y_g|W) = \frac{1}{1 + e^{-f_W^g(x)}}, \quad (3)
$$

$$
p(y_d|W) = \frac{e^{f_W^d(x)_d}}{\sum_{k=1}^{K} e^{f_W^d(x)_k}}, \quad (4)
$$

$$
p(y_g|y_d) = RM[g, d], \quad (5)
$$

where $p(y_g|W)$ represents the conditional probability that a specific food group $y_g$ is present in the image; $W$ denotes the model weights; $f_W^g(x)$ the logits output for the food groups; $p(y_d|W)$ the probability that a specific dish $y_d$ is the correct class; $f_W^d(x)_d$ the $d$-th logits output for the dishes; $p(y_g|y_d)$ the probability, extracted from the relationship matrix, of having food group $y_g$ given the dish $y_d$; $K$ and $D$ the number of dishes; $x$ the input image; and $\lambda$ a hyperparameter that weights the contribution of both terms.
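A minimal sketch of the refinement step defined by Eqs. 2-5 is shown below, assuming PyTorch tensors and the relationship matrix of Eq. 1 as a tensor of shape (G, D).

```python
# Minimal sketch: ontology-driven refinement of food-group probabilities.
import torch

def refine_group_probs(dish_logits, group_logits, rm, lam=0.9):
    """dish_logits: (B, D); group_logits: (B, G); rm: (G, D) from Eq. 1."""
    p_dish = torch.softmax(dish_logits, dim=1)     # Eq. 4
    p_group = torch.sigmoid(group_logits)          # Eq. 3
    # Eq. 2: convex combination of the visual prediction and the ontology
    # prior p(y_g | y_d) = RM[g, d] (Eq. 5), marginalized over the dishes.
    return lam * p_group + (1 - lam) * p_dish @ rm.T
```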
For model learning, two equally weighted loss functions are used: Cross-Entropy Loss (CELoss) for the dish task and Binary Cross-Entropy Loss (BCELoss) for the food group task. The CELoss for a single input image is expressed as follows:
$$
CELoss = -\sum_{d=1}^{D} \hat{y}_d \cdot \log(p(y_d|W)), \quad (6)
$$
where $\hat{y}_d$ is the Ground Truth (GT) label in one-hot encoding and $p(y_d|W)$ is the probability given by the model for the $d$-th dish. The CELoss is calculated individually for each image and then averaged.
Regarding the BCELoss, it is formally defined as
follows:
$$
BCELoss = -\sum_{g=1}^{G} \Big[ \hat{y}_g \cdot \log(p(y_g|W, RM)) + (1-\hat{y}_g) \cdot \log\big(1 - p(y_g|W, RM)\big) \Big], \quad (7)
$$
where $G$ represents the number of food groups; $\hat{y}_g$ is the GT in one-hot encoding for the $g$-th food group, indicating whether food group $g$ is present in the sample; and $p(y_g|W, RM)$ is the refined probability for the $g$-th food group. The BCELoss is calculated individually for each image and then averaged over the total number of images and food groups.
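A minimal sketch of the combined objective (Eqs. 6-7 with equal weights) is shown below, assuming PyTorch conventions: dish targets as class indices and food group targets as multi-hot vectors.

```python
# Minimal sketch: equally weighted multi-task loss over one batch.
import torch.nn.functional as F

def multitask_loss(dish_logits, refined_group_probs, dish_target, group_target):
    ce = F.cross_entropy(dish_logits, dish_target)                  # Eq. 6
    # Clamp to keep log() finite; refined probs already lie in (0, 1).
    bce = F.binary_cross_entropy(refined_group_probs.clamp(1e-7, 1 - 1e-7),
                                 group_target.float())              # Eq. 7
    return ce + bce
```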
It is interesting to note that although the proposed method explicitly refines only the food group predictions, using the dish probabilities together with the relationship matrix for the refinement also indirectly refines the dish predictions.
3 VALIDATION
In this section, we present the dataset, the experimen-
tal setting, and the different evaluation metrics.
3.1 Dataset
The data set used in this research, known as MAFood-
121 (Aguilar et al., 2019), consists of a total of 21,175
images representing traditional dishes from eleven of
the most popular cuisines in the world. These dishes,
which reflect the richness and diversity of global gas-
tronomy, have been grouped into a dataset covering
121 dishes in total. Each of the images belonging to
these dishes is labeled with at least one of the ten food
groups previously defined: bread, egg, fried foods,
meat, noodles/pasta, rice, seafood, soup, dumpling
and vegetables.
For the purposes of this research, we focus on Mexican cuisine, selecting 11 representative dishes with a total of 2,242 images that encapsulate the essence and culinary variety of this tradition. These dishes range from classics like 'caesar salad', 'enchiladas' and 'tacos', to delicacies like 'guacamole', 'pozole' and 'tostadas'. This selection is carried out with the purpose of focusing the classification task on a specific and representative context, which allows a detailed and precise analysis of Mexican gastronomy. In the experiments, we maintain the original division of the dataset: 73.6% of the images are used for training, 12.44% for validation, and the remaining 13.96% for testing.
3.2 Experimental Setup
For the implementation of the proposed method,
ResNet50 is selected as the backbone. This network
is also used as a baseline for comparison purposes.
ResNet50 is pretrained on ImageNet and then the
baseline and proposed method are retrained for a to-
tal of 20 epochs, using a set of empirically selected
hyperparameters. A learning rate (LR) of 0.001 is
used and the batch size is set to 64 to balance com-
putational efficiency and training stability. As part
of the regularization, a Dropout layer after the last
convolutional layer with a rate of 0.1 is included to
avoid overfitting in both methods. The inclusion of
this layer provides effective regularization and improves
the model’s ability to generalize to unseen data. The
Adam optimizer is used to minimize the loss function.
Additionally, simple data preprocessing is applied, which includes image resizing to 224x224 pixels and normalization with a mean and standard deviation of 0.5. On the other hand, the number of recipes for each dish is set to 20 ($|R_d| = 20$), because we detected that, as this number increases, recipes that do not represent the queried dish well are retrieved. Finally, λ is varied from 0.5 to 0.9 with a step of 0.1.
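A minimal sketch of this configuration is shown below, assuming torchvision transforms and the MultiTaskFoodNet sketch from Section 2.2; dataset loading and the training loop are omitted.

```python
# Minimal sketch: preprocessing and optimizer as stated above
# (224x224 resize, 0.5 mean/std normalization, Adam, LR 0.001, batch 64).
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

BATCH_SIZE = 64
model = MultiTaskFoodNet(num_dishes=11, num_groups=10)  # Mexican subset of MAFood-121
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```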
Traditional evaluation metrics were selected for each of the target tasks. For the single-label food recognition problem, whose aim is to classify the general context of the image with the most likely food, the Accuracy metric was used. For the multi-label food group recognition problem, which involves categorizing food images with the food groups representing each ingredient contained in the food, four metrics were selected: 1) Precision (P), the proportion of correctly predicted instances relative to the total number of instances predicted under that label; 2) Recall (R), which measures the model's ability to capture all true labels; 3) F1 score, which combines precision and recall into a single metric, providing a comprehensive assessment of model performance; and 4) the Jaccard index, which evaluates the overlap between true and predicted labels, quantifying the degree of similarity between the label sets. Furthermore, the Multi-task accuracy (MTA) metric (Aguilar et al., 2019) was used to quantify the consistency of model predictions across the classification tasks.
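As an illustration, the four multi-label metrics can be computed with scikit-learn as sketched below; the sample-wise averaging is an assumption, as the paper does not state its averaging scheme.

```python
# Minimal sketch: multi-label metrics on multi-hot ground truth/predictions
# of shape (num_samples, num_groups).
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, jaccard_score

y_true = np.array([[1, 0, 1], [0, 1, 1]])  # toy ground truth
y_pred = np.array([[1, 0, 0], [0, 1, 1]])  # toy predictions

for name, fn in [("P", precision_score), ("R", recall_score),
                 ("F1", f1_score), ("Jaccard", jaccard_score)]:
    print(name, fn(y_true, y_pred, average="samples", zero_division=0))
```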
4 RESULTS
In this work, we evaluate the performance of the
multi-task classification methods applied to the chal-
lenge of identifying food dishes and their food groups
from images. For this purpose, ResNet50 and the
proposed ontology-driven ResNet50 (OD-ResNet50)
were evaluated on Mexican dishes belonging to the
MAFood-121 dataset. Regarding OD-ResNet50, the parameter λ is set after analyzing the results on the validation set. Five experiments were performed, varying λ from 0.5 to 0.9. The results obtained for the training and validation sets can be seen in Fig. 3. As observed, a low λ tends to provide lower performance than a high λ. We identify that λ values of 0.8 or 0.9 are appropriate, and these are therefore used to evaluate the performance of the model on the test set.
Table 1 summarizes the results obtained by evaluating the performance of three models: ResNet50, OD-ResNet50 with λ equal to 0.9, and OD-ResNet50 with λ equal to 0.8. The evaluations were performed on the test, validation and training sets. The results are expressed in terms of Precision (P), Recall (R), F1-score (F1), Jaccard index, Accuracy and MTA. On the validation and test sets, it is observed that the proposed method outperforms the baseline in all the evaluated metrics, both in food recognition and in food group recognition, as well as in the joint evaluation (MTA). On the other hand, comparing the proposed method with λ equal to 0.8 against λ equal to 0.9, a low λ provides a better P for the food groups, while a high λ provides better results in all remaining metrics. This suggests that the stricter the ontological integration, the fewer false predictions the model tends to make, although it loses some ability to find all food groups. As for the training set, a different behavior is observed: the baseline performs slightly better on the food groups than the proposed method, although its accuracy for food recognition is still lower. These results demonstrate the ability of the proposed method to generalize better, particularly with respect to food groups.
Figure 3: Performance of OD-ResNet50 on the training and validation sets using λ from 0.5 to 0.9.
Table 1: Comparison of the performance of classification methods with single-label, multi-label and multitask metrics.

Method        λ    P       R       F1      Jaccard index  Accuracy  MTA
Test set
ResNet50      -    0.8439  0.8311  0.8374  0.7203         0.6741    0.5271
OD-ResNet50   0.9  0.8498  0.8356  0.8427  0.7281         0.7093    0.5536
OD-ResNet50   0.8  0.8562  0.8250  0.8403  0.7246         0.6997    0.5531
Validation set
ResNet50      -    0.8289  0.8034  0.8160  0.6891         0.7419    0.5820
OD-ResNet50   0.9  0.8333  0.8120  0.8225  0.6985         0.7527    0.5972
OD-ResNet50   0.8  0.8417  0.8000  0.8203  0.6954         0.7348    0.5826
Training set
ResNet50      -    0.8887  0.8685  0.8785  0.7833         0.8916    0.7477
OD-ResNet50   0.9  0.8847  0.8564  0.8703  0.7704         0.9055    0.7472
OD-ResNet50   0.8  0.8983  0.8539  0.8755  0.7786         0.9073    0.7570
Figure 4: Confusion matrix of the food recognition provided by ResNet50 (left) and OD-ResNet50 (right) with λ equal to 0.9.
Figure 4 shows the confusion matrices for food recognition provided by the baseline and the proposed method. In general, it is observed that for most dishes the OD-ResNet50 model provides equal or better performance than ResNet50. We can also see that OD-ResNet50 is less sensitive to class imbalance. In particular, this can be noticed for the chilaquiles dish, which contains the smallest number of images in the dataset. In this case, the model misclassified far fewer images than ResNet50.
Figure 5: Success and failure cases of OD-ResNet50 with λ equals to 0.9 on MAFood-121.
Qualitative results of the ResNet50 and OD-ResNet50 models are presented in Fig. 5, from which the positive influence of the ontology on the performance of the classification models can be clearly observed. In the results obtained by OD-ResNet50 for the first three examples, an increase in the accuracy of the predictions for both food dishes and food groups is observed. For example, in the case of 'enchiladas', the model with ontology achieves a confidence score of 0.81, while the ResNet50 model (without ontology) achieves a score of 0.80, a slight improvement in prediction confidence. This pattern is repeated in other examples, such as 'huevos rancheros' and 'pozole', where the OD-ResNet50 model clearly outperforms the ResNet50 model. This reinforces the idea that the inclusion of an ontology provides additional and consistent information that supports the classification process, thus improving the accuracy of the predictions. However, in a few cases we noticed a negative influence of the ontology. For example, in the case of 'guacamole', the ResNet50 model slightly outperforms the model with ontology in food group classification. The reason is that the food group bread is not common for guacamole, so the model must be very confident in order not to lose that prediction due to the prior knowledge incorporated by the ontology.
In summary, the inclusion of an ontology in the classification process brings substantial improvements in most cases, suggesting its relevance for future applications in the field of computer vision and food classification.
5 CONCLUSIONS
This work highlights the relevance of ontologies in
food classification from images. Incorporating the on-
tology into the deep learning model resulted in a sub-
stantial improvement in accuracy in the classification
of food groups and food dishes. Additionally, greater
consistency in responses and more effective knowl-
edge transfer from the ontology to the learning model
were observed from the results, improving its general-
ization ability. Furthermore, the proposed multimodal
information integration, which combines text and im-
age data, enriches the model learning process. This
combination allowed the model to acquire knowledge
more accurately and effectively, resulting in greater
accuracy in food classification. In terms of future re-
search, a wide spectrum of possibilities opens up. In-
tegration of additional information, such as nutritional
data, could be explored in order to further improve
accuracy in food classification. Likewise, it could be
investigated how the ontology could be used in the
detection of allergens in food, which would have fun-
damental implications for food safety. Additionally,
the applications of ontology and multi-modal infor-
mation in food classification in various cultures and
regions of the world could be studied.
ACKNOWLEDGEMENTS
This work has been partially supported by the Spanish project PID2022-136436NB-I00 (AEI-MICINN), Horizon EU project MUSAE (No. 01070421), 2021-SGR-01094 (AGAUR), Icrea Academia'2022 (Generalitat de Catalunya), Robo STEAM (2022-1-BG01-KA220-VET-000089434, Erasmus+ EU), DeepSense (ACE053/22/000029, ACCIÓ), DeepFoodVol (AEI-MICINN, PDC2022-133642-I00), PID2022-141566NB-I00 (AEI-MICINN), CERCA Programme / Generalitat de Catalunya, and Agencia Nacional de Investigación y Desarrollo de Chile (ANID) (Grant No. FONDECYT INICIACIÓN 11230262). D. Ponte acknowledges the support of Secretaría Nacional de Ciencia, Tecnología e Innovación Senacyt Panamá (Scholarship No. 270-2022-125).
REFERENCES
Aguilar, E., Bolaños, M., and Radeva, P. (2019). Regularized uncertainty-based multi-task learning model for food analysis. Journal of Visual Communication and Image Representation, 60:360–370.
Allegra, D., Battiato, S., Ortis, A., Urso, S., and Polosa,
R. (2020). A review on food recognition technology
for health applications. Health Psychology Research,
8(3).
Allrecipes, I. (2023). Allrecipes.
Deldjoo, Y., Schedl, M., Cremonesi, P., and Pasi, G. (2020).
Recommender systems leveraging multimedia con-
tent. ACM Computing Surveys (CSUR), 53(5):1–38.
Divakar, H., Ramesh, D., and Prakash, B. (2019). An on-
tology driven system to predict diabetes with machine
learning techniques. International Journal of Innova-
tive Technology and Exploring Engineering (IJITEE),
9:4005–4011.
Donadello, I. and Dragoni, M. (2019). Ontology-driven
food category classification in images. In ICIAP,
pages 607–617. Springer.
Dragoni, M., Bailoni, T., Maimone, R., and Eccher, C.
(2018). Helis: An ontology for supporting healthy
lifestyles. In The Semantic Web–ISWC 2018: 17th In-
ternational Semantic Web Conference, Monterey, CA,
USA, October 8–12, 2018, Proceedings, Part II 17,
pages 53–69. Springer.
Güting, R. H. (1994). GraphDB: Modeling and querying graphs in databases. In VLDB, volume 94, pages 12–15. Citeseer.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In CVPR, pages
770–778.
Jiang, S., Min, W., Liu, L., and Luo, Z. (2019). Multi-
scale multi-view deep feature aggregation for food
recognition. IEEE Transactions on Image Processing,
29:265–276.
Kuang, Z., Yu, J., Li, Z., Zhang, B., and Fan, J. (2018). In-
tegrating multi-level deep learning and concept ontol-
ogy for large-scale visual recognition. Pattern Recog-
nition, 78:198–214.
Ławrynowicz, A., Wróblewska, A., Adrian, W. T., Kulczyński, B., and Gramza-Michałowska, A. (2022). Food recipe ingredient substitution ontology design pattern. Sensors, 22(3):1095.
Ming, Z.-Y., Chen, J., Cao, Y., Forde, C., Ngo, C.-W., and
Chua, T. S. (2018). Food photo recognition for di-
etary tracking: System and experiment. In MultiMe-
dia Modeling: 24th International Conference, MMM
2018, Bangkok, Thailand, February 5-7, 2018, Pro-
ceedings, Part II 24, pages 129–141. Springer.
Misra, I., Shrivastava, A., Gupta, A., and Hebert, M.
(2016). Cross-stitch networks for multi-task learning.
In CVPR, pages 3994–4003.
Popovski, G., Kochev, S., Korousic-Seljak, B., and Efti-
mov, T. (2019). Foodie: A rule-based named-entity
recognition method for food information extraction.
In ICPRAM, pages 915–922. SCITEPRESS.
Popovski, G., Seljak, B. K., and Eftimov, T. (2020). A sur-
vey of named-entity recognition methods for food in-
formation extraction. IEEE Access, 8:31586–31594.
Song, Y., Yang, X., and Xu, C. (2023). Self-supervised
calorie-aware heterogeneous graph networks for food
recommendation. ACM Transactions on Multime-
dia Computing, Communications and Applications,
19(1s):1–23.
Stojanov, R., Kocev, I., Gramatikov, S., Popovski, G., Sel-
jak, B. K., and Eftimov, T. (2020). Toward robust food
ontology mapping. In 2020 IEEE International Con-
ference on Big Data (Big Data), pages 3596–3601.
IEEE.
Wang, Z., Min, W., Li, Z., Kang, L., Wei, X., Wei, X.,
and Jiang, S. (2022). Ingredient-guided region dis-
covery and relationship modeling for food category-
ingredient prediction. IEEE Transactions on Image
Processing, 31:5214–5226.
Yummly, I. (2023). Yummly.
Zhang, Y., Qu, Y., Li, C., Lei, Y., and Fan, J. (2019).
Ontology-driven hierarchical sparse coding for large-
scale image classification. Neurocomputing, 360:209–
219.
Zhao, H., Yap, K.-H., and Kot, A. C. (2021). Fusion learn-
ing using semantics and graph convolutional network
for visual food recognition. In WACV, pages 1711–
1720.