In deep CNNs, the input image passes through many
convolution layers, as shown in Figure 3. Across these
layers, the network learns increasingly complex features.
The first convolution layers learn features such as edges
and simple textures, intermediate layers learn more
complex textures and patterns, and the last convolution
layers learn features such as objects or parts of objects.
Figure 3: Different features recognised at different layers.
The semantic representation generated by a given
CNN for an input image is the union of the feature maps
produced by each convolution layer, not only the final
feature map. Relying on several feature maps provides
the network with information at different spatial scales.
Feature maps at different scales can be used as a common
representation for all tasks; in particular, the Feature
Pyramid Network (FPN), adopted for multi-scale feature
representation in object detection, has proven accurate.
The main contribution of FPN is to enhance the semantic
representation capability of shallow-layer feature maps
using the semantic information encoded in deeper-layer
feature maps. The main weakness of feature maps
generated by shallow layers is that they are not as
semantically rich as those generated by deeper layers.
This is because the encoding of an input image into
feature maps is a hierarchical process: basic semantics
appear in early-layer feature maps, while more complex
semantics appear in the feature maps of deeper layers.
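To make the idea of multi-scale feature maps concrete, the following sketch collects intermediate feature maps of a backbone at several spatial scales. It is a minimal illustration assuming a standard torchvision ResNet-50 backbone, not the exact backbone or implementation used in this work.

import torch
import torchvision

# Illustrative backbone; the backbone used in this work may differ.
backbone = torchvision.models.resnet50(weights=None)

def multi_scale_features(x):
    # Stem
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    # Residual stages: each stage halves the spatial resolution,
    # so the returned maps form a pyramid C2..C5.
    c2 = backbone.layer1(x)   # shallow: edges, simple textures
    c3 = backbone.layer2(c2)
    c4 = backbone.layer3(c3)
    c5 = backbone.layer4(c4)  # deep: object-level semantics
    return [c2, c3, c4, c5]

feats = multi_scale_features(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])  # resolutions 56, 28, 14, 7 for a 224x224 input

The union of these maps, rather than the final map alone, is what the multi-scale representation exposes to the downstream tasks.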
The attention mechanism in neural networks mimics
the cognitive attention of human beings. Its main aim is
to emphasize the important parts of the information and
de-emphasize the non-relevant parts. Since working
memory in both humans and machines is limited, this
process is key to not overwhelming a system's memory.
In deep learning, attention can be interpreted as a vector
of importance weights: when predicting an element,
which could be a pixel in an image or a word in a
sentence, we use the attention vector to infer how strongly
it is related to the other elements.
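As a concrete illustration of attention as a vector of importance weights, the sketch below uses a generic scaled dot-product formulation; it is not necessarily the exact attention used in our architecture. Each element is scored against a query and the scores are normalised with a softmax.

import torch
import torch.nn.functional as F

def attention_weights(query, keys):
    # query: (d,)  keys: (n, d)
    # Scores measure how related each element is to the query;
    # softmax turns them into importance weights that sum to 1.
    scores = keys @ query / keys.shape[-1] ** 0.5
    return F.softmax(scores, dim=0)

keys = torch.randn(5, 64)   # 5 elements, e.g. spatial positions or words
query = torch.randn(64)
w = attention_weights(query, keys)
attended = (w.unsqueeze(-1) * keys).sum(dim=0)  # weighted summary of the elements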
Our hypothesis is that the combination of multi-scale
features and scale-specific attention is optimal for
extracting heterogeneous product attributes in an MTL
setup. We perform an ablation study that alters how the
multi-scale features and the attention are computed, and
we demonstrate the efficacy of the proposed method. The
details of the variants of the method are given in
subsequent sections.
Figure 4: Baseline architecture proposed in (Parekh et al.,
2021).
3 RELATED WORK
In an image classification task or an image-based
multi-task learning setup, different tasks or concepts may
require different spatial and semantic information to
improve performance. However, traditional classification
or multi-task networks use features only from the last
layer. In this paper we propose an architecture that
leverages features from multiple scales to improve the
performance of multiple tasks.
Using feature maps from multiple scales has been an
important idea in object detection. It helps in detecting
objects of different scales, aspect ratios, and regions of
interest on various benchmarks (Lin et al., 2014;
Everingham et al., 2015). The Single-Shot Detector
(SSD) (Liu et al., 2016) was one of the first networks to
use features from different layers of the network to detect
objects of different scales: it used the output of early
convolution layers to detect smaller objects and the
output of later layers to detect larger objects. However,
SSD has problems detecting small-scale objects because
early convolution layers contain low-level information
but little semantic information for tasks such as
classification. FPN (Lin et al., 2017a) solves this problem
by having both bottom-up and top-down pathways, so
that the reconstructed higher-resolution feature maps also
carry rich semantic information. FPN also has lateral
connections between the bottom-up and top-down feature
maps that help the detector predict locations better. There
have been many extensions of the original FPN, e.g.
BiFPN (Tan et al., 2020), NAS-FPN (Ghiasi et al., 2019),
PANet (Liu et al., 2018), etc., but little work has been
done on using FPN for classification.
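The sketch below illustrates the FPN idea of (Lin et al., 2017a) in a minimal form; the channel sizes and module structure are assumptions for illustration (a ResNet-50 C2..C5 pyramid), not the implementation used here. Deeper maps are upsampled along the top-down pathway and fused with shallower maps through 1x1 lateral convolutions, so high-resolution maps also carry rich semantics.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    # Minimal FPN-style top-down pathway with lateral connections.
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats: [C2, C3, C4, C5], shallow to deep
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down: upsample the deeper map and add it to the lateral map,
        # enriching shallow, high-resolution maps with deep semantics.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]  # [P2, P3, P4, P5]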
In (Baloian et al., 2021), the authors present evidence
on how features from different scales can be useful for
extracting certain attributes like texture or