Uncertainty Modeling and Deep Learning Applied to Food Image
Analysis
Eduardo Aguilar¹,², Bhalaji Nagarajan², Rupali Khatun², Marc Bolaños² and Petia Radeva²,³
¹Departamento de Ingeniería de Sistemas y Computación, Universidad Católica del Norte, Avenida Angamos 0610, Antofagasta, Chile
²Departament de Matemàtiques i Informàtica, Universitat de Barcelona, Gran Via de les Corts Catalanes 585, 08007 Barcelona, Spain
³Computer Vision Center, Cerdanyola, Barcelona, Spain
eaguilar02@ucn.cl, {bhalaji.nagarajan, rupali.khatun, marc.bolanos, petia.ivanova}@ub.edu
ORCID: E. Aguilar https://orcid.org/0000-0002-2463-0301; B. Nagarajan https://orcid.org/0000-0003-2473-2057; R. Khatun https://orcid.org/0000-0002-9682-5888; M. Bolaños https://orcid.org/0000-0001-9838-1435; P. Radeva https://orcid.org/0000-0003-0047-5172
Keywords:
Uncertainty Modeling, Food Recognition, Deep Learning.
Abstract:
Recognizing food images is a difficult image recognition task due to the high intra-class variance and low inter-class variance of food categories. Deep learning has been shown to be a promising methodology for addressing such difficult problems as food image recognition, which can be considered a fine-grained object recognition problem. We argue that, in order to continue improving performance on this task, it is necessary to better understand what the model learns instead of treating it as a black box. In this paper, we show how uncertainty analysis can help us gain a better understanding of the model in the context of food recognition. Furthermore, we make decisions to improve its performance based on this analysis and propose a new data augmentation approach that considers sample-level uncertainty. The results of evaluating our method on a public food dataset are very encouraging.
1 INTRODUCTION
In today's fast-paced world, unhealthy eating habits are at the root of many chronic diseases (such as obesity, diabetes, cardiovascular diseases and thyroid disorders). All over the world, problems with nutritional habits are related to a lack of knowledge about what people eat on a daily basis. Unhealthy habits can more easily be prevented if people are aware of the nutritional value of the food they consume in their daily meals (Alliance, 2019). The problem is that more than 80% of people are not completely aware of how much they eat or of what percentage of proteins, carbohydrates, salt, etc. is consumed in every plate. Moreover, it is quite difficult for people to calculate the nutritional content of every meal they consume (Sahoo et al., 2019). Manual calculation of this information is quite time-consuming and often imprecise.
Figure 1: Example of high within-class variability in the ravioli food class.
This creates the need for automatic systems able to log the food a person consumes every day (Bruno and
Silva Resende, 2017). This would enable both the pa-
tients and the health care professionals to better man-
age chronic conditions related to nutrition (El Khoury
et al., 2019).
Automatic food recognition is not only used in the dietary management of patients, but also has a wide variety of applications in the food and restaurant industries. Food detection in smart restaurants is be-
coming a practical application rather than a research
problem (Aguilar et al., 2018). Automatic food recog-
nition faces challenging computer vision and machine
learning problems due to the nature of images that are
used in this task (see Fig. 1).
Deep learning algorithms have become very popular, and they owe this popularity to their exceptional performance, enhanced processing abilities, large datasets, and outstanding classification abilities compared to traditional machine learning methods (Subhi et al., 2019). However, despite their good performance, deep learning algorithms need huge amounts of data or they become prone to overfitting. To avoid this, one of the most difficult and general problems in this field is obtaining an adequate dataset, which means not only a large dataset, but also one composed of very diverse and carefully curated samples.
Data augmentation is a popular strategy adopted
to prevent deep learning methods from overfitting.
It consists of applying transformations to the origi-
nal data in order to increase the sample size and its
variability. Examples of standard transformations in
images are: random crops, image flips or reflections
and color distortions. On the other hand, novel solu-
tions have been provided by Generative Adversarial
Network-based methods (GANs), which can generate
synthetic, new and plausible images. However, most data augmentation strategies have been applied indiscriminately to all images, without taking into account that particular classes or images can be harder to classify and may require more specific data augmentation methods. On the other hand, uncertainty analysis can give us a good clue to understand what the model learns and, from this, we can expand the dataset to overcome the deficiencies we find. In this work, we propose to ex-
plore a combination of both fields: GANs and uncer-
tainty modeling, with the aim of generating new data
focusing on the samples that the model has not been
able to learn well (with high uncertainty).
The major contributions of this work are as fol-
lows: a) to use Epistemic Uncertainty to find the sam-
ples that are the hardest for the model to learn; and
b) to use Generative Adversarial Networks to perform data augmentation, creating images visually similar to the hard samples in the dataset. The rest of the work is
organized as follows. Next section details the recent
relevant literature. Section 3 explains the proposed
methodology. Experimental details are provided in
Section 4, followed by conclusions in the last Section.
2 RELATED WORK
Food image analysis is an active area of research,
which analyses food data from various sources and
applies it to solve different food-related tasks. Here,
the most relevant recent literature is discussed.
2.1 Food Recognition
Food recognition is a challenging computer vision
task, due to the complex nature of food images. The
images could contain dishes that are mixed or could
contain many food items (Wang et al., 2019). The task
is of a fine-grained nature, where the classes have high
intra-class variability and high inter-class similarity.
The initial works related to the recognition task
used different hand-crafted features such as color, tex-
ture and shape (Matsuda et al., 2012; Chen et al.,
2009; Joutou and Yanai, 2009; Bosch et al., 2011).
These works were primarily concerned with tack-
ling the problem in a constrained environment. The datasets used in these studies had smaller numbers of images or classes and were restrictive in the conditions in which the images were taken (Ciocca et al., 2017a; Matsuda et al., 2012; Chen and Ngo, 2016).
With the advent of Convolutional Neural Networks (CNNs), food recognition tasks of a more complex nature were also tackled. CNNs were able to outperform traditional food recognition algorithms by far. Datasets started to contain larger numbers of images and a larger number of dishes were being recognized (Bossard et al., 2014; Ciocca et al.,
2017b; Donadello and Dragoni, 2019; Kaur et al.,
2019). Different CNNs have been successfully applied to the food recognition task, such as AlexNet (Yanai and
Kawano, 2015), GoogLeNet (Wu et al., 2016; Meyers
et al., 2015; Liu et al., 2016a), Network-In-Networks
(Tanno et al., 2016), Inception V3 (Hassannejad et al.,
2016), ResNet-50 (Ming et al., 2018), Ensemble NN
(Nag et al., 2017), Wide Residual Networks (Martinel
et al., 2018), and CleanNet (Lee et al., 2018).
Food images in the wild often contain more than
one food class. Therefore, multi-label food recognition and detection have an increased complexity. Also, different foods can be located very close to each other or even mixed. In this case, food recogni-
tion is usually preceded by food detection (Anzawa
et al., 2019). Earlier works involved using color and texture-based food segmentation (Anthimopou-
los et al., 2014). (Aguilar et al., 2019) proposed a
semantic food framework, covering food segmenta-
tion, detection and recognition. (Chen et al., 2017)
focused on multi-label ingredient recognition, while
multi-task learning has been proposed in the works
of (Aguilar et al., 2019; Zhou et al., 2016).
The high inter-class similarity of the food images
makes it difficult to train models that could be used to
recognize dishes in the wild. Although large datasets
are created with more classes, the images do not rep-
resent the complex nature of the food. Therefore, gen-
erative models could be used to create new synthetic
data that are similar to real-world data.
2.2 Generative Adversarial Network
A Generative Adversarial Network (Goodfellow
et al., 2014) is a deep learning method that generates
very realistic synthetic images in the domain of inter-
est. In the GAN framework, two networks compete with each other, playing the roles of a thief (the generator) and the police (the discriminator).
The Generator, as its name states, generates fake sam-
ples from random noise and tries to fool the Discrim-
inator. On the other hand, the Discriminator has the
role of distinguishing between real and fake samples.
They both compete with each other in the training
phase. The steps are repeated several times in order
for the Generator and Discriminator to get better in
their respective jobs after each iteration.
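To make this thief-and-police dynamic concrete, the sketch below shows one adversarial training iteration in PyTorch. It is an illustration under our own assumptions, not the setup used later in this paper: G, D (assumed to end in a sigmoid) and the noise dimension z_dim are placeholders.

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, real, opt_g, opt_d, z_dim=100):
    """One adversarial iteration: D separates real from fake, G tries to fool D."""
    b = real.size(0)
    ones = torch.ones(b, 1)
    zeros = torch.zeros(b, 1)

    # Discriminator ("police"): push real images towards 1, generated ones towards 0.
    opt_d.zero_grad()
    fake = G(torch.randn(b, z_dim))
    d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    d_loss.backward()
    opt_d.step()

    # Generator ("thief"): make the discriminator output 1 on the fake samples.
    opt_g.zero_grad()
    g_loss = bce(D(fake), ones)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

Repeating this step many times is what lets both players improve after each iteration, as described above.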
One of the most popular extensions of Generative
Adversarial Nets is the conditional model. Note that
in an unconditioned generative model, there is no con-
trol over modes of the data being generated. How-
ever, in the Conditional GAN (CGAN) (Mirza and
Osindero, 2014), the generator learns to create new
samples with a specific condition or set of character-
istics. Such conditioning could be based on class la-
bels, on some part of data for inpainting like (Good-
fellow et al., 2013), or even on data from different
modalities. Thus, in CGAN both the generator and
the discriminator are conditioned on some extra in-
formation y, where y could be any kind of auxiliary information, such as a label associated with an image or a more detailed tag, rather than a generic sample from
an unknown noisy distribution.
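For reference, the two-player conditional objective from (Mirza and Osindero, 2014) can be written as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x \mid y)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z \mid y))\big)\big],$$

where both the generator and the discriminator receive the auxiliary information y in addition to their usual inputs.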
A further extension of the GAN architecture,
which is built upon the CGAN extension, is the Auxil-
iary Classifier GANs (ACGAN) (Odena et al., 2017).
In ACGAN, the input is the latent space along with a
class label. Furthermore, every generated sample has
a corresponding class label. The Generator model receives as input a random point from the latent space and a class label, and gives as output the generated image. The Discriminator model receives an image as input and returns as output both the probability that the provided image is real and the probability of the image belonging to each known class. As is well known, unbalanced data is a big problem for object recognition, where models tend to classify the dominant classes much better. In the case of GANs, this problem is also present, producing low-quality synthetic images for classes with few samples. Some propos-
als have addressed this problem (Mariani et al., 2018;
Ali-Gombe and Elyan, 2019), which are discussed in
the following paragraphs.
In BAGAN (Mariani et al., 2018), an augmenta-
tion framework is proposed to restore balance in un-
balanced datasets by creating new synthetic images
for minority classes. The proposed approach requires
two training steps: the first corresponds to initial-
izing the GAN with the features learned by means
of an auto-encoder, and then the entire model is re-
trained. Another approach to restore balance is MFC-
GAN (Ali-Gombe and Elyan, 2019). In contrast to BAGAN, MFC-GAN is simpler to train and needs just
one training step. This model uses multiple fake
classes to ensure a fine-grained generation and clas-
sification of the minority class instances.
A novel method called SINGAN (Shaham et al., 2019) has recently been published. Unlike previous GANs, this model is an unconditional generative model that can be learned from a single natural image. The SINGAN model is trained to capture the internal distribution of the image patches; it then gen-
erates high quality, diverse samples that contain the
same visual content as the image. The pyramid struc-
ture of fully convolutional layers of SINGAN learns
the patch distribution of the image at a different scale
in each layer. This results in generating new samples
of arbitrary size and aspect ratios that have significant
variability, yet maintain both the global structure and
the fine textures of the training image.
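In the notation of (Shaham et al., 2019), generation proceeds sequentially through this pyramid, starting at the coarsest scale N:

$$\tilde{x}_N = G_N(z_N), \qquad \tilde{x}_n = G_n\!\big(z_n, (\tilde{x}_{n+1})\!\uparrow^r\big), \quad n < N,$$

where $z_n$ is per-scale noise and $(\cdot)\!\uparrow^r$ denotes upsampling of the previous scale's output by a factor $r$, so coarse scales fix the global structure and fine scales add texture.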
SINGAN requires training a separate model for each image from which one wishes to generate new samples, thus becoming a very expensive
technique. However, this can be very useful if we
only need to increase a small subset of the images.
Uncertainty modeling can help us decide which sub-
set is best suited to improve model performance.
2.3 Uncertainty Modeling
Uncertainty can be explained simply as a state of
doubt about what the model has or has not learned
from the input data. In Bayesian modeling, uncertainty can mainly be presented in two different ways (Kendall and Gal, 2017): aleatoric uncertainty, which captures noise inherent in the observations; and epistemic uncertainty, which can be explained away given enough data. Uncertainty can be captured with a Bayesian Neural Network (BNN); however, in a deep learning scheme this becomes intractable (Blundell et al., 2015; Gal and Ghahramani, 2016; Sensoy et al., 2018). Instead, variational
Bayesian methods have been adopted in the literature
(Blundell et al., 2015; Gal and Ghahramani, 2016;
Molchanov et al., 2017; Louizos and Welling, 2017),
where MC-dropout (Gal and Ghahramani, 2016) is the most popular technique for estimating uncertainty due to its simplicity of implementation.
Figure 2: Main scheme of our UAGAN food recognition method.
Recent methods of image classification have
adopted this technique to estimate the uncertainty in
their scheme (Aguilar et al., 2019; Khan et al., 2019;
Aguilar and Radeva, 2019b; Aguilar and Radeva,
2019a; Nielsen and Okoniewski, 2019). In the case
of (Aguilar et al., 2019), the aleatoric uncertainty is
used in order to weigh dynamically different kinds
of losses for multi-label and single-label food-related
tasks. On the other hand, in (Khan et al., 2019),
the authors deal with the imbalanced object classi-
fication problem. They redefined the large-margin
softmax loss (Liu et al., 2016b), incorporating un-
certainty at the class-level and sample-level based on
the Bayesian uncertainty measure to address the rarity
of the classes and the difficulty level of the individ-
ual samples. Regarding (Aguilar and Radeva, 2019b;
Aguilar and Radeva, 2019a), the analysis of the epis-
temic uncertainty has been applied for different pur-
poses: to identify the best data augmentation that will
be applied in a particular class (Aguilar and Radeva,
2019a) and to decide whether a flat or hierarchical classifier should be used (Aguilar and Radeva, 2019b). A work closer to our proposal, but not in the food recognition field, is that published by (Nielsen and Okoniewski, 2019), which proposes an active learning scheme based
on acquisition function sampling. This mechanism
considers the prediction uncertainty of the classifier
to determine the GAN samples to incorporate in the
training set, which are labeled by an external oracle.
The main differences between our proposal and
(Nielsen and Okoniewski, 2019) are the following: a) our aim is completely different: we apply the uncertainty analysis to discover complex samples on which to perform data augmentation, not after the data augmentation to select the samples to be used during training; b) our training scheme is done in two phases, not several, and does not require an external oracle, because the labels are automatically assigned; and c) we adopt a GAN that generates a new sample preserving the high-quality content of the input image, instead of generating a sample by merging different input images, which in some cases can be very noisy and insert a bias towards the most frequent content.
3 UNCERTAINTY-AWARE
GAN-AUGMENTED FOOD
RECOGNITION
In this section, we describe all phases involved in
the Uncertainty-Aware GAN-Augmented (UAGAN)
method to perform food recognition using uncertainty
modeling and GANs. As can be seen in Fig. 2, the method comprises three main phases with the following purposes: a) hard sample discovery, b) synthetic
image generation and c) final training.
3.1 Hard Sample Discovery
The first step of our proposed approach involves the
analysis of the food images of the training set, with the aim of identifying those that are difficult to classify. To do this, our criterion is based on the analy-
sis of Epistemic Uncertainty (EU) through the calcu-
lation of the entropy. The samples with high uncer-
tainty are those in which the model has not been able
to learn well their discriminant features and, there-
fore, are considered hard samples. We adopt the MC-dropout method (Gal and Ghahramani, 2016) for EU estimation, mainly due to its simple implementation. Basically, we add
a dropout layer before each fully connected layer, and
after the training, we perform K predictions with the
dropout turned on. The K probabilities (softmax out-
puts) are averaged and then the entropy is calculated
to reflect the EU. Finally, the images are ordered with respect to their EU, and we select the top n images with the highest EU for the next step.
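To make the procedure concrete, with $f^{\hat{\omega}_k}(x)$ the network output under the k-th sampled dropout mask, the EU estimate we use is the entropy of the MC-averaged prediction:

$$\hat{p}_c(x) = \frac{1}{K}\sum_{k=1}^{K} \operatorname{softmax}\!\big(f^{\hat{\omega}_k}(x)\big)_c, \qquad H(x) = -\sum_{c} \hat{p}_c(x) \log \hat{p}_c(x).$$

A minimal PyTorch sketch of this step follows; the framework, the API calls and the number of passes K=20 are our assumptions (the paper does not fix them), with `model` the trained classifier containing the dropout layer and `loader` iterating over the training set:

```python
import torch

@torch.no_grad()
def epistemic_entropy(model, loader, K=20, eps=1e-12):
    """Entropy of the MC-dropout averaged softmax, one value per image."""
    model.eval()
    # MC-dropout: keep the weights fixed but re-enable the dropout layers.
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    scores = []
    for images, _ in loader:
        probs = torch.stack(
            [torch.softmax(model(images), dim=1) for _ in range(K)]
        ).mean(dim=0)                                   # average of K predictions
        scores.append(-(probs * (probs + eps).log()).sum(dim=1))  # entropy
    return torch.cat(scores)

# Hard samples: the n images with the highest epistemic uncertainty.
# hard_idx = epistemic_entropy(model, train_loader).topk(n).indices
```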
3.2 Image Generation
Once the images have been chosen, the next step corresponds to augmenting the data with images that are close in visual appearance. We believe that one of the determining factors preventing the model from learning the features of hard images is the fact that they differ from most images that represent a particular class. These hard images may be present in the training set due to the complexity of the acquisition and also as a result of dividing the data for training.
The latter is due to the fact that during the generation
of subsets only the sample size is considered and not
the variability of the sample. Therefore, we propose
to make new images by applying small changes to the
original ones. The best method for this purpose is
the recent GAN-based method called SINGAN (Sha-
ham et al., 2019), which can learn from a single image
and generate different samples carrying the same vi-
sual content of the input image. In this step, we adopt
the SINGAN to generate one synthetic image for each
chosen image according to the uncertainty criterion.
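For illustration only, the overall generation loop might be organized as below; `train_singan` and `sample_singan` are hypothetical placeholders standing in for the authors' released SinGAN training and sampling code, not real library calls, and one model is fitted per selected image.

```python
from pathlib import Path

def train_singan(image_path):
    raise NotImplementedError("placeholder: fit one SinGAN model on this image")

def sample_singan(model):
    raise NotImplementedError("placeholder: draw one synthetic sample")

def augment_hard_samples(hard_image_paths, out_dir="synthetic"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in map(Path, hard_image_paths):
        model = train_singan(path)        # one model per hard image (expensive)
        synthetic = sample_singan(model)  # one new sample per image, as in the paper
        synthetic.save(out / f"syn_{path.name}")
```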
3.3 Final Training
Finally, in the last step, the whole CNN model is
trained with both types of images: the synthetic im-
ages obtained with SINGAN and the original images.
4 VALIDATION
In this section, we first describe the dataset used to
evaluate the proposed approach, which is composed
of public images of food belonging to Italian cuisine.
Next, we describe the evaluation metric and exper-
imental setup. Finally, we present the results obtained
with the baseline methods and our proposal.
4.1 Dataset
From the dataset MAFood-121 (Aguilar et al., 2019),
we use all the images of the dishes that belong to the
Italian cuisine. In total, 11 dishes were chosen, comprising 2468 images, with a maximum, minimum and average of 250, 104 and 224 images per class, respectively. The data is distributed as 72% of the images for
training, 11% for validation and 17% for test.
4.2 Metric
In order to evaluate our proposal, we use the standard object recognition metric, overall Accuracy (Acc). We run each experiment 5 times and report the results in terms of average accuracy and the respective standard deviation.
4.3 Experimental Setup
For classification purposes, ResNet-50 (He et al.,
2016) was adopted as the base CNN architecture. We
adapted this model for MC-dropout by removing the output layer and instead adding one fully connected layer of 2048 neurons, followed by a dropout layer with a probability of 0.5, ending with an output layer of 11 neurons with softmax activation. For simplicity, we refer to this architecture by the same name as the original (ResNet50). As for training, we use the categorical cross-entropy loss and the Adam optimizer to train all models for 40 epochs with a batch size of 32, an initial learning rate of 0.0002, a decay of 0.2 every 8 epochs and a patience of 10 epochs.
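A minimal sketch of this configuration in PyTorch (the framework and exact calls are our assumptions: the intermediate activation is not stated in the text, and the softmax output is folded into PyTorch's cross-entropy loss, which expects logits):

```python
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR
from torchvision import models

model = models.resnet50(pretrained=True)   # fine-tuned from ImageNet weights
model.fc = nn.Sequential(                  # replaces the original output layer
    nn.Linear(2048, 2048),                 # added fully connected layer
    nn.ReLU(),                             # activation assumed, not stated
    nn.Dropout(p=0.5),                     # dropout reused for MC-dropout
    nn.Linear(2048, 11),                   # 11 Italian dish classes
)

criterion = nn.CrossEntropyLoss()          # categorical cross-entropy on logits
optimizer = Adam(model.parameters(), lr=2e-4)
scheduler = StepLR(optimizer, step_size=8, gamma=0.2)  # decay 0.2 every 8 epochs
```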
Three different training strategies of the same model are used for benchmarking purposes:
ResNet50: baseline model trained on the original images without data augmentation.
ResNet50+SDA: baseline model with standard data augmentation applied during training, such as random crops and horizontal flips.
UAGAN: ResNet50+SDA trained on both the real and the synthetic images.
Figure 3: Histogram of the entropy of the predicted images.
With respect to the image generation, we use the
default parameters proposed by the authors of SIN-
GAN.
4.4 Results
In this section, we present the results obtained by the
proposed method. The first step of our method corre-
sponds to selecting those images difficult to classify
(with high uncertainty). After training the model with
the original images, we determine the EU and build a
histogram for the training images (see Fig. 3). The
right side of the histogram corresponds to all the im-
ages considered to generate the new ones. The criterion applied corresponds to selecting all images with an EU equal to or greater than the average of the maximum and minimum uncertainty predicted over all images. A total of 120 images were selected.
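In symbols, denoting by $H(x_i)$ the epistemic uncertainty (entropy) of training image $x_i$, this selection rule is:

$$\tau = \frac{\max_i H(x_i) + \min_i H(x_i)}{2}, \qquad \mathcal{S} = \{\, x_i : H(x_i) \geq \tau \,\},$$

which here yields $|\mathcal{S}| = 120$ images.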
In Fig. 4, we can see the distribution of the training images across each dish, the average entropy over all the images, the proportion of selected images and
the average entropy for the selected images. Unlike
the evidence shown in (Khan et al., 2019) for CIFAR-
10, for this type of data, the frequency of the images is
not a factor that determines a high or low uncertainty
for a specific class. In our case, we believe that uncertainty occurs due to the great variability of visual appearance that may be present in images belonging to the same dish class, where the factor to consider is the diversity of the collected sample and not only its size. To fill the gap left by a poorly represented sample of a class, we duplicate the presence of images with high uncertainty by generating synthetic images with the SINGAN method. In Fig.
5, some examples of the generated images are shown.
With a total of 1889 training images, 1679 origi-
nals and 120 synthetic ones, we train the final model.
The results obtained by three different training strate-
gies of the same model are shown in Table 1. All models were fine-tuned from ImageNet weights (Krizhevsky et al., 2012), retraining the whole network on the target training set. For each strategy, 5 models were trained with random initialization values and a random order of images. Then, we calculated the average accuracy and standard deviation achieved by the best model obtained in each run according to the performance on the validation set. From the results achieved, we can see that UAGAN improved the performance in terms of accuracy with respect to the rest of the strategies. Specifically, the improvement is 3.17% over ResNet50 and 1.32% over ResNet50+SDA.
Figure 4: Training images vs epistemic uncertainty.
Table 1: Results obtained on the test set in terms of accuracy with the standard deviation.

Method                  Acc      Std
ResNet50                79.15%   0.60%
ResNet50 + SDA          81.00%   0.78%
UAGAN (our proposal)    82.32%   0.96%
Figure 5: Synthetic images generated from the selected images of the training set.
5 CONCLUSIONS
In this paper, we presented a novel method for
sample-level uncertainty-aware data augmentation
composed of three phases: 1) identification of hard
samples, by means of analysis of the epistemic uncer-
tainty; 2) generating new data from identified sam-
ples; and 3) performing the final training with the
original and synthetic images. We demonstrated the effectiveness of the proposed approach on the Italian dishes from the public MAFood-121 dataset. The results obtained show that our proposal improves classification by incorporating only 120 synthetic images selected through the uncertainty analysis (about 5% of the total). As future work, we will explore both sample-
level and class-level uncertainty to increase deep
learning datasets in an active learning framework.
ACKNOWLEDGEMENTS
This work was partially funded by TIN2018-095232-
B-C21, SGR-2017 1742, Nestore ID: 769643,
Validithi and CERCA Programme/Generalitat de
Catalunya. E. Aguilar acknowledges the support of
CONICYT Becas Chile. P. Radeva is partially sup-
ported by ICREA Academia 2014. We acknowledge
the support of NVIDIA Corporation with the donation
of Titan Xp GPUs.
REFERENCES
Aguilar, E., Bolaños, M., and Radeva, P. (2019). Regular-
ized uncertainty-based multi-task learning model for
food analysis. Journal of Visual Communication and
Image Representation, 60:360–370.
Aguilar, E. and Radeva, P. (2019a). Class-conditional data
augmentation applied to image classification. In Inter-
national Conference on Computer Analysis of Images
and Patterns, pages 182–192. Springer.
Aguilar, E. and Radeva, P. (2019b). Food recognition by
integrating local and flat classifiers. In Iberian Con-
ference on Pattern Recognition and Image Analysis,
pages 65–74. Springer.
Aguilar, E., Remeseiro, B., Bolaños, M., and Radeva, P.
(2018). Grab, pay, and eat: Semantic food detection
for smart restaurants. IEEE Transactions on Multime-
dia, 20(12):3266–3275.
Ali-Gombe, A. and Elyan, E. (2019). Mfc-gan: class-
imbalanced dataset classification using multiple fake
class generative adversarial network. Neurocomput-
ing, 361:212–221.
Alliance, I. U. N. (2019). National adult nutrition survey.
Public Health.
Anthimopoulos, M. M., Gianola, L., Scarnato, L., Diem, P.,
and Mougiakakou, S. G. (2014). A food recognition
system for diabetic patients based on an optimized
bag-of-features model. IEEE journal of biomedical
and health informatics, 18(4):1261–1271.
Anzawa, M., Amano, S., Yamakata, Y., Motonaga, K.,
Kamei, A., and Aizawa, K. (2019). Recognition of
multiple food items in a single photo for use in a
buffet-style restaurant. IEICE TRANSACTIONS on In-
formation and Systems, 102(2):410–414.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. (2015). Weight uncertainty in neural network. In
ICML, pages 1613–1622.
Bosch, M., Zhu, F., Khanna, N., Boushey, C. J., and Delp,
E. J. (2011). Combining global and local features for
food identification in dietary assessment. In 2011 18th
IEEE International Conference on Image Processing,
pages 1789–1792. IEEE.
Bossard, L., Guillaumin, M., and Van Gool, L. (2014).
Food-101 – mining discriminative components with random forests. In European Conference on Computer
Vision.
Bruno, V. and Silva Resende, C. J. (2017). A survey on
automated food monitoring and dietary management
systems. Journal of health & medical informatics,
8(3).
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and
Yuille, A. L. (2017). Deeplab: Semantic image seg-
mentation with deep convolutional nets, atrous convo-
lution, and fully connected crfs. IEEE transactions on
pattern analysis and machine intelligence, 40(4):834–
848.
Chen, M., Dhingra, K., Wu, W., Yang, L., Sukthankar, R.,
and Yang, J. (2009). Pfid: Pittsburgh fast-food image
dataset. In 2009 16th IEEE International Conference
on Image Processing (ICIP), pages 289–292. IEEE.
Ciocca, G., Napoletano, P., and Schettini, R. (2017a). Food
recognition: a new dataset, experiments and results.
IEEE Journal of Biomedical and Health Informatics,
21(3):588–598.
Ciocca, G., Napoletano, P., and Schettini, R. (2017b).
Learning cnn-based features for retrieval of food im-
ages. In International Conference on Image Analysis
and Processing, pages 426–434. Springer.
Donadello, I. and Dragoni, M. (2019). Ontology-driven
food category classification in images. In Interna-
tional Conference on Image Analysis and Processing
(2), volume 11752 of Lecture Notes in Computer Sci-
ence, pages 607–617. Springer.
El Khoury, C. F., Karavetian, M., Halfens, R. J., Crutzen,
R., Khoja, L., and Schols, J. M. (2019). The effects of
dietary mobile apps on nutritional outcomes in adults
with chronic diseases: A systematic review. Journal
of the Academy of Nutrition and Dietetics.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian
approximation: Representing model uncertainty in
deep learning. In ICML, pages 1050–1059.
Goodfellow, I., Mirza, M., Courville, A., and Bengio, Y.
(2013). Multi-prediction deep boltzmann machines.
In Advances in neural information processing sys-
tems, pages 548–556.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. In
Advances in neural information processing systems,
pages 2672–2680.
Hassannejad, H., Matrella, G., Ciampolini, P., De Munari,
I., Mordonini, M., and Cagnoni, S. (2016). Food
image recognition using very deep convolutional net-
works. In Proceedings of the 2nd International Work-
shop on MADiMa, pages 41–49. ACM.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Chen, J.-j. and Ngo, C.-w. (2016). Deep-based ingredient recognition for cooking recipe retrieval. ACM Multimedia.
Joutou, T. and Yanai, K. (2009). A food image recognition
system with multiple kernel learning. In 2009 16th
IEEE International Conference on Image Processing
(ICIP), pages 285–288. IEEE.
Kaur, P., Sikka, K., Wang, W., Belongie, S., and Divakaran,
A. (2019). Foodx-251: A dataset for fine-grained food
classification. arXiv preprint arXiv:1907.06167.
Kendall, A. and Gal, Y. (2017). What uncertainties do
we need in bayesian deep learning for computer vi-
sion? In Advances in neural information processing
systems, pages 5574–5584.
Khan, S., Hayat, M., Zamir, S. W., Shen, J., and Shao, L.
(2019). Striking the right balance with uncertainty.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 103–112.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
Lee, K.-H., He, X., Zhang, L., and Yang, L. (2018). Clean-
net: Transfer learning for scalable image classifier
training with label noise. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 5447–5456.
Liu, C., Cao, Y., Luo, Y., Chen, G., Vokkarane, V., and
Ma, Y. (2016a). Deepfood: Deep learning-based food
image recognition for computer-aided dietary assess-
ment. In International Conference on Smart Homes
and Health Telematics, pages 37–48. Springer.
Liu, W., Wen, Y., Yu, Z., and Yang, M. (2016b). Large-
margin softmax loss for convolutional neural net-
works. In ICML, volume 2, page 7.
Louizos, C. and Welling, M. (2017). Multiplicative normal-
izing flows for variational bayesian neural networks.
In ICML-Volume 70, pages 2218–2227. JMLR. org.
Mariani, G., Scheidegger, F., Istrate, R., Bekas, C., and
Malossi, C. (2018). Bagan: Data augmentation with
balancing gan. arXiv preprint arXiv:1803.09655.
Martinel, N., Foresti, G. L., and Micheloni, C. (2018).
Wide-slice residual networks for food recognition.
In 2018 IEEE Winter Conference on Applications of
Computer Vision (WACV), pages 567–576. IEEE.
Matsuda, Y., Hoashi, H., and Yanai, K. (2012). Recogni-
tion of multiple-food images by detecting candidate
regions. In 2012 IEEE International Conference on
Multimedia and Expo, pages 25–30. IEEE.
Meyers, A., Johnston, N., Rathod, V., Korattikara, A., Gor-
ban, A., Silberman, N., Guadarrama, S., Papandreou,
G., Huang, J., and Murphy, K. P. (2015). Im2calories:
towards an automated mobile vision food diary. In
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 1233–1241.
Ming, Z.-Y., Chen, J., Cao, Y., Forde, C., Ngo, C.-W., and
Chua, T. S. (2018). Food photo recognition for dietary
tracking: System and experiment. In International
Conference on Multimedia Modeling, pages 129–141.
Springer.
Mirza, M. and Osindero, S. (2014). Conditional generative
adversarial nets. arXiv preprint arXiv:1411.1784.
Molchanov, D., Ashukha, A., and Vetrov, D. (2017). Vari-
ational dropout sparsifies deep neural networks. In
ICML-Volume 70, pages 2498–2507. JMLR. org.
Nag, N., Pandey, V., and Jain, R. (2017). Health multime-
dia: Lifestyle recommendations based on diverse ob-
servations. In Proceedings of the 2017 ACM on Inter-
national Conference on Multimedia Retrieval, pages
99–106. ACM.
Nielsen, C. and Okoniewski, M. (2019). Gan data augmen-
tation through active learning inspired sample acquisi-
tion. In Proceedings of the IEEE conference on com-
puter vision and pattern recognition Workshops, pages
109–112.
Odena, A., Olah, C., and Shlens, J. (2017). Conditional im-
age synthesis with auxiliary classifier gans. In ICML-
Volume 70, pages 2642–2651. JMLR. org.
Sahoo, D., Hao, W., Ke, S., Xiongwei, W., Le, H.,
Achananuparp, P., Lim, E.-P., and Hoi, S. C. (2019).
Foodai: Food image recognition via deep learning for
smart food logging.
Sensoy, M., Kaplan, L., and Kandemir, M. (2018). Evi-
dential deep learning to quantify classification uncer-
tainty. In Advances in neural information processing
systems, pages 3179–3189.
Shaham, T. R., Dekel, T., and Michaeli, T. (2019). Sin-
gan: Learning a generative model from a single nat-
ural image. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
4570–4580.
Subhi, M. A., Ali, S. H., and Mohammed, M. A. (2019).
Vision-based approaches for automatic food recogni-
tion and dietary assessment: A survey. IEEE Access,
7:35370–35381.
Tanno, R., Okamoto, K., and Yanai, K. (2016). Deepfood-
cam: A dcnn-based real-time mobile food recognition
system. In Proceedings of the 2nd International Work-
shop on MADiMa, pages 89–89. ACM.
Wang, Y., Chen, J.-j., Ngo, C.-W., Chua, T.-S., Zuo, W.,
and Ming, Z. (2019). Mixed dish recognition through
multi-label learning. In Proceedings of the 11th Work-
shop on Multimedia for Cooking and Eating Activi-
ties, CEA ’19, page 1–8, New York, NY, USA. Asso-
ciation for Computing Machinery.
Wu, H., Merler, M., Uceda-Sosa, R., and Smith, J. R.
(2016). Learning to make better mistakes: Semantics-
aware visual food recognition. In Proceedings of the
24th ACM international conference on Multimedia,
pages 172–176. ACM.
Yanai, K. and Kawano, Y. (2015). Food image recognition
using deep convolutional network with pre-training
and fine-tuning. In 2015 IEEE International Confer-
ence on Multimedia And Expo Workshops (ICMEW),
pages 1–6. IEEE.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Tor-
ralba, A. (2016). Learning deep features for discrim-
inative localization. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 2921–2929.