MobText: A Compact Method for Scene Text Localization

Luis Gustavo Lorgus Decker∗, Allan da Silva Pinto, Jose Luis Flores Campana, Manuel Cordova Neira, Andreza A. dos Santos, Jhonatas S. Conceição, Marcus A. Angeloni, Lin Tzy Li and Ricardo da S. Torres

RECOD Lab., Institute of Computing, University of Campinas, 13083-852, Brazil

AI R&D Lab, Samsung R&D Institute Brazil, 13097-160, Brazil

Department of ICT and Natural Sciences, Norwegian University of Science and Technology (NTNU), Ålesund, Norway

Keywords:

Scene Text Detection, Mobile Devices, Object Detector Networks, MobileNetV2, Single Shot Detector.

Abstract:

Multiple research initiatives have reported highly effective results for the text detection problem. However, most of those solutions are computationally costly, which hampers their use in applications that run on devices with restricted processing power, such as smartwatches and mobile phones. In this paper, we address this issue by investigating the use of efficient object detection networks for this problem. We propose the combination of two light architectures, MobileNetV2 and Single Shot Detector (SSD), for the text detection problem. Experimental results on the ICDAR’11 and ICDAR’13 datasets demonstrate that our solution yields the best trade-off between effectiveness and efficiency and also achieves state-of-the-art results on the ICDAR’11 dataset, with an f-measure of 96.09%.

1 INTRODUCTION

Reading text in images is still an open problem in the computer vision and image understanding research fields. This problem has attracted considerable attention from these communities due to the large number of modern applications that can potentially benefit from this knowledge, such as self-driving vehicles (Yan et al., 2018; Zhu et al., 2018), robot navigation, scene understanding (Wang et al., 2018), and assistive technologies (Yi et al., 2014), among others.

Several methods have been recently proposed in the literature for localizing textual information in scene images. In general, the text reading problem is divided into two separate tasks, localization and recognition: the former seeks to localize delimited candidate regions that contain textual information, while the latter is responsible for recognizing the text inside the candidate regions found during

https://orcid.org/0000-0002-6959-3890

https://orcid.org/0000-0001-9772-263X

∗ Part of the results presented in this work were obtained through the “Algoritmos para Detecção e Reconhecimento de Texto Multilíngue” project, funded by Samsung Eletrônica da Amazônia Ltda., under the Brazilian Informatics Law 8.248/91.

Figure 1: Examples of textual elements with different font sizes and styles.

the localization task. In both tasks, the inherent variability of text (e.g., size, color, font style, background clutter, and perspective distortions), as illustrated in Fig. 1, makes text reading a very challenging problem.

Among the approaches for localizing text in images, deep-learning-based techniques are the most promising strategy for reaching high detection accuracy. He et al., for example, presented a novel technique

Decker, L., Pinto, A., Campana, J., Neira, M., Santos, A., Conceição, J., Angeloni, M., Li, L. and Torres, R.

MobText: A Compact Method for Scene Text Localization.

DOI: 10.5220/0008954103430350

In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2020) - Volume 5: VISAPP, pages

343-350

ISBN: 978-989-758-402-2; ISSN: 2184-4321


Figure 2: Overview of the proposed method for text localization. [Diagram labels: MobileNetV2, convolutional layers, detections, non-maximum suppression.]

for scene text detection by proposing a Convolutional Neural Network (CNN) architecture (He et al., 2016) that focuses on extracting text-related regions and text-specific characteristics. The authors introduced a deep multi-task learning mechanism to train the Text-CNN efficiently, in which each level of supervised information (text/non-text label, character label, and character mask) is formulated as a learning task. They also proposed a pre-processing method that enhances the contrast of small regions, improving the local stability of text regions.

Although the proposed CNN was reasonably efficient at detecting candidate regions, with a processing time of about 0.5 seconds per image, the pre-processing step requires about 4.1 seconds per image, which may prevent real-time detection.

Another avenue that may render outstanding results in terms of effectiveness consists in combining different deep learning architectures to benefit from complementary information and make a better decision. In this vein, Zhang et al. (2016) introduced an approach based on two Fully Convolutional Network (FCN) architectures: one predicts a salient map of text regions in a holistic manner (named Text-Block FCN), and the other predicts the centroid of each character. Similarly, Tang and Wu (2017) proposed an ensemble of three modified VGG-16 networks: the first extracts candidate text regions (CTR); the second refines the coarse CTR detected by the first model, segmenting them into text; and finally, the refined CTR are fed to a classification network that filters out non-text regions and yields the final text regions. The CTR extractor network is a modified VGG-16 that, during training, receives the edges of the text as supervisory information in the first blocks of convolutional layers and the segmented text regions in the last blocks. Both strategies present computational efficiency issues that could make their use unfeasible in restrictive computing scenarios (e.g., mobile devices).

Towards a truly single-stage text detector, Liao et al. (2018) proposed an end-to-end solution named TextBoxes++, which handles arbitrarily oriented word bounding boxes and whose architecture inherits from VGG-16. Similarly to TextBoxes++, Zhu et al. (2018) proposed a deep learning approach also based on the VGG-16 architecture, but for detecting text-based traffic signs. Both techniques presented outstanding detection rates, although they rely on the VGG-16 architecture, which could be considered inadequate for restrictive computing scenarios due to its model size. In contrast, lighter CNN architectures, such as MobileNet (Howard et al., 2017), present a very competitive alternative for this scenario, with a model size of 4.2 million parameters and a computational cost of 569 million multiply-add operations, for instance.

With those remarks in mind, we propose a novel method for text localization that considers the trade-off between efficiency and effectiveness. Our approach combines two light architectures that were originally proposed for object detection, MobileNetV2 (Sandler et al., 2018) and SSD (Liu et al., 2016), and adapts them to our problem. The main contributions of this paper are: (i) the proposal of an effective method for the text localization task in scene images, which presented better or competitive results (when compared with state-of-the-art methods) at a low computational cost in terms of model size and processing time; and (ii) a comparative study, in the context of text localization, comprising widely used CNN architectures recently proposed for object detection.

2 PROPOSED METHOD

Fig. 2 illustrates the overall framework of our approach for text localization, which uses MobileNetV2 as the feature extractor and SSD (convolutional layers) as the detector of multiple text bounding boxes. Next, we detail the CNN architectures used and then explain the learning mechanism adopted for finding a proper CNN model for the problem.

2.1 Characterization of Text Regions with MobileNetV2

MobileNetV2 is a CNN specifically designed for restrictive computing environments. It includes two main mechanisms for decreasing the memory footprint and the number of operations while keeping the effectiveness of its precursor architecture, MobileNet (Sandler et al., 2018): linear bottlenecks and inverted residuals.

Fig. 3 shows the MobileNetV2 architecture used to characterize text candidate regions. The bottleneck

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications


Figure 3: MobileNetV2 architecture used in this work and its parameters. More detail on the bottleneck residual block can be found in (Sandler et al., 2018). [Diagram labels: input 300x300; conv. layer (3x3), 32 channels, stride 2; bottleneck residual blocks with 16 channels (stride 1) and with 24, 32, 64, 96, 160, and 320 channels (stride 2 each); conv. layer (1x1), 1280 channels, stride 1; repetition markers x2 and x3.]

residual block implements the aforementioned optimization mechanisms using convolutional operations with a kernel of size 3 × 3. The first bottleneck block uses an expansion factor of 1, while the remaining blocks use an expansion factor of 6, as suggested by Sandler et al. (2018).

2.2 Detecting Multiple Text Bounding Boxes via SSD

The localization of text regions in scene images is challenging due to the inherent variability of text, such as size, color, font style, and distortions. A text localization method should handle multiple scales and bounding boxes with varying aspect ratios. Although several authors use an image pyramid to perform multi-scale detection, this is quite costly and may be impractical in a restrictive computing scenario. Thus, we use the Single Shot Detector (SSD) framework (Liu et al., 2016), a state-of-the-art method for object detection. The SSD approach includes a feature pyramid mechanism that allows the identification of text regions at multiple scales. Specifically, in the framework, the authors adopt a top-down fusion strategy to build new features with strong semantics while keeping fine details. Text detection is performed on each of the newly constructed feature maps during a single forward pass. The detection results from all layers are then refined by means of a non-maximum suppression (NMS) process (Neubeck and Gool, 2006).
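To make the NMS refinement step concrete, the following is a minimal sketch of the common greedy variant of non-maximum suppression. It is not the authors' implementation nor the efficient algorithm of Neubeck and Gool; the function names, the (x1, y1, x2, y2) box format, the IoU threshold of 0.5, and the sample boxes are all assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes that overlap the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Hypothetical detections: two near-duplicates of one word and one separate word.
boxes = [(10, 10, 110, 60), (12, 12, 112, 62), (200, 40, 260, 70)]
scores = [0.90, 0.75, 0.80]
kept = non_maximum_suppression(boxes, scores)
print(kept)  # [0, 2]
```

In this example the second box overlaps the first with an IoU of about 0.89, so only the higher-scoring of the two survives, while the non-overlapping third box is kept.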

2.3 Using Linear Bottlenecks and Inverted Residuals Bottlenecks for Memory Efficiency

Besides the use of depthwise separable convolution operations, MobileNetV2 introduced linear bottlenecks in the convolutional blocks. This reduces the number of parameters of the network and captures the low-dimensional subspace, under the assumption that such a low-dimensional subspace is embedded in the manifold formed by a set of activation tensors. Sandler et al. (2018) showed empirical evidence that the use of linear layers is important to prevent non-linearities from destroying information. Experiments conducted by the authors showed that non-linear bottlenecks, built with rectified linear units, can decrease performance significantly in comparison with linear bottlenecks. By using the idea of inverted residual bottlenecks, the authors achieved better memory use and a significant reduction in computation. We follow this idea in this paper.
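As a rough illustration of why these building blocks are light, the sketch below counts convolution weights for a standard convolution, a depthwise separable convolution, and an inverted residual block (1 × 1 expansion, 3 × 3 depthwise convolution, linear 1 × 1 projection). The function names and the channel sizes are assumptions chosen for the example, and biases and batch-normalization parameters are ignored.

```python
def standard_conv_weights(c_in, c_out, k=3):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_weights(c_in, c_out, k=3):
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    return k * k * c_in + c_in * c_out


def inverted_residual_weights(c_in, c_out, expansion=6, k=3):
    """Weights of an inverted residual block with the given expansion factor."""
    hidden = c_in * expansion
    expand = c_in * hidden        # 1 x 1 expansion convolution
    depthwise = k * k * hidden    # k x k depthwise convolution at the expanded width
    project = hidden * c_out      # linear 1 x 1 projection (no non-linearity)
    return expand + depthwise + project


# Assumed example: 64 input and output channels, expansion factor 6.
print(standard_conv_weights(64, 64))        # 36864
print(depthwise_separable_weights(64, 64))  # 4672
print(inverted_residual_weights(64, 64))    # 52608
print(standard_conv_weights(384, 384))      # 1327104
```

Under these assumptions, the block filters at the expanded width of 384 channels with 52,608 weights in total, whereas a standard 3 × 3 convolution at that same width would need 1,327,104; the tensors entering and leaving the block stay at the thin bottleneck width.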

2.4 Learning

The main decisions we took in the learning phase of

our network are described below.

Objective Function. Similar to (Liu et al., 2016), we use a multi-task loss function to learn the bounding box locations and text/non-text predictions (Eq. 1). Specifically, $x_{ij}$ indicates a match ($x_{ij} = 1$) or non-match ($x_{ij} = 0$) between the $i$-th default bounding box and the $j$-th ground-truth bounding box; $N$ is the number of matches; and the $\alpha$ parameter weights the localization loss ($L_{loc}$) relative to the confidence loss ($L_{conf}$). The loss function is defined as:

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right) \qquad (1)$$

We adopted the smooth L1 loss (Girshick, 2015) for $L_{loc}$ between the predicted box ($l$) and the ground-truth box ($g$), and a sigmoid function for $L_{conf}$. In addition, we set $\alpha = 1$ in the same fashion as (Girshick, 2015).
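The loss in Eq. 1 can be sketched in a few lines. This is an illustrative version under stated assumptions rather than the training code used in the paper: box offsets are assumed to be already encoded relative to their matched default boxes, the confidence term is taken to be a sigmoid cross-entropy on a single text/non-text logit, and all function and parameter names are hypothetical.

```python
import math


def smooth_l1(pred, target):
    """Smooth L1 loss summed over the coordinates of one box."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def sigmoid_cross_entropy(logit, label):
    """Numerically stable cross-entropy between sigmoid(logit) and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def multitask_loss(conf_logits, conf_labels, loc_preds, loc_targets, alpha=1.0):
    """(L_conf + alpha * L_loc) / N, where N is the number of matched boxes.

    conf_* cover every default box used in training (text and non-text);
    loc_* cover only the matched (text) boxes.
    """
    n = len(loc_preds)
    if n == 0:
        return 0.0  # convention from the SSD paper when there are no matches
    l_conf = sum(sigmoid_cross_entropy(z, y) for z, y in zip(conf_logits, conf_labels))
    l_loc = sum(smooth_l1(p, t) for p, t in zip(loc_preds, loc_targets))
    return (l_conf + alpha * l_loc) / n


# Hypothetical batch: one matched (text) box and one negative (non-text) box.
loss = multitask_loss(
    conf_logits=[1.2, -2.0],
    conf_labels=[1, 0],
    loc_preds=[[0.5, 2.0, 0.0, 0.0]],
    loc_targets=[[0.0, 0.0, 0.0, 0.0]],
)
print(round(loss, 4))
```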

Hard Example Mining. The hard example miner is a mechanism used to mitigate the imbalance between negative and positive examples in the training phase. When searching for text during training, we usually have many non-text bounding boxes and few text bounding boxes. To mitigate training with imbalanced data, we sort the negative bounding boxes according to their confidence and select the negative samples with the highest confidence values, keeping a ratio of 3:1 between negative and positive samples.
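The selection rule described above can be sketched as follows. The function name and the sample scores are hypothetical, and the sketch assumes that the "confidence" of a negative box is its predicted text score, so that the most confidently wrong negatives are kept.

```python
def mine_hard_negatives(neg_scores, num_positives, ratio=3):
    """Return indices of the hardest negatives, at most ratio * num_positives."""
    limit = min(len(neg_scores), ratio * num_positives)
    # Rank negatives from most to least confidently (and wrongly) predicted as text.
    ranked = sorted(range(len(neg_scores)), key=lambda i: neg_scores[i], reverse=True)
    return ranked[:limit]


# Hypothetical text scores of eight non-text default boxes; two positives in the image.
neg_text_confidence = [0.05, 0.90, 0.40, 0.10, 0.75, 0.60, 0.02, 0.33]
hard_negatives = mine_hard_negatives(neg_text_confidence, num_positives=2)
print(hard_negatives)  # [1, 4, 5, 2, 7, 3]
```

With two positive boxes, six negatives are retained, and the two easiest negatives (scores 0.05 and 0.02) are left out of the loss.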

3 EXPERIMENTAL PROTOCOL

This section presents the datasets, metrics, and proto-

cols used for evaluating the proposed method.


3.1 Datasets

We evaluated the proposed method on two datasets widely used for evaluating text localization methods: ICDAR’11 (Karatzas et al., 2011), which contains 551 digitally created images, such as headers, logos, and captions, among others; and ICDAR’13 (Karatzas et al., 2013), which contains 462 born-digital or scene text images captured under a wide variety of conditions, such as blur, varying distance to the camera, font styles and sizes, color, and texture. We also used the SynthText dataset (Gupta et al., 2016) to help train our network, due to the small size of the ICDAR datasets. We have not used ICDAR’15 (Karatzas et al., 2015) or other newer datasets because our method is not tailored to the prediction of oriented bounding boxes, which is required to handle such multi-oriented datasets.

3.2 Evaluation Metrics

Effectiveness. We evaluated the effectiveness of the methods in terms of recall, precision, and f-measure. We consider a detection correct (true positive) if the overlap between the ground-truth annotation and the detected bounding box, measured as the intersection over union, is greater than 50%. Otherwise, the detected bounding box is considered an incorrect detection.
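The criterion above can be sketched as a small evaluation routine. This is a simplified illustration, not the official ICDAR evaluation tool: it assumes boxes in (x1, y1, x2, y2) format and a greedy one-to-one matching in which each ground-truth box can validate at most one detection; the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def evaluate(detections, ground_truth, iou_threshold=0.5):
    """Return (precision, recall, f-measure) for one set of detections."""
    matched = set()
    true_positives = 0
    for det in detections:
        best_j, best_iou = None, 0.0
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue
            overlap = iou(det, gt)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None and best_iou > iou_threshold:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical example: one correct detection and one false alarm, one missed word.
dets = [(0, 0, 10, 10), (50, 50, 60, 60)]
gts = [(1, 1, 10, 10), (100, 100, 120, 120)]
print(evaluate(dets, gts))  # (0.5, 0.5, 0.5)
```

As a sanity check on the formula, a precision of 97.40% and a recall of 94.81% give an f-measure of 96.09%.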

Efficiency. The efficiency evaluation considered both the processing time and the disk usage (in MB). We used the GNU/Linux time command to measure the processing time, while the disk usage corresponds to the size of the learned models. All experiments were performed on an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz with 12 logical cores, an Nvidia GTX 1080 TI GPU, and 64GB of RAM.

3.3 Evaluation Protocols

The experiments were divided into three steps: training, fine-tuning, and test. For the training step, we used four subsets of the SynthText dataset. This dataset comprises images with synthetic text added to different backgrounds, and we selected samples of the dataset considering 10 (9.25%), 20 (18.48%), and 30 (27.71%) images per background, and finally the whole dataset. Each resulting subset was again divided into training and validation sets, using 70% for training and 30% for validation. Using these collections, we trained a model with randomly initialized parameters for 30 epochs. For the fine-tuning step, we took the model trained on SynthText and continued its training using the ICDAR’11 or ICDAR’13 training subsets, stopping when we reached 2000 epochs. The number of epochs was defined empirically. Finally, for the test step, we evaluated each fine-tuned model on the test subset of ICDAR’11 or ICDAR’13.

Experimental Setup. We conducted the training of

the proposed method considering a single-scale in-

put, and therefore, all input images were resized to

300 × 300 pixels. The training phase was performed

using a batch size of 24 and we used the RMSprop op-

timizer (Tieleman and Hinton, 2012) with a learning

rate of 4 × 10

. We also use the regularization L2-

norm, with a λ = 4 × 10

, to prevent possible over-

ﬁtting.

3.4 State-of-the-Art Object Detection Methods for Comparison

This section provides an overview of the methods chosen for comparison. For a fair comparison, we selected recent approaches specifically designed for fast detection, including SqueezeDet and YOLOv3. As baselines, we also use text localization methods that present a good compromise between effectiveness and efficiency. All of them are briefly described in this section.

TextBoxes. This method consists of a Fully Convolutional Network (FCN) adapted for text detection and recognition (Liao et al., 2017). The network uses VGG-16 as the feature extractor, followed by multiple output layers (text-box layers), similar to the SSD network. At the end, non-maximum suppression (NMS) is applied to the aggregated outputs of all text-box layers.

TextBoxes++. This method extends the TextBoxes method (Liao et al., 2017) toward detecting arbitrarily oriented text, instead of only (near-)horizontal bounding boxes (Liao et al., 2018). TextBoxes++ also brings improvements in the training phase, which lead to a further boost in accuracy, especially for detecting text at multiple scales. In that work, the authors combine the detection scores of the CRNN recognition method (Shi et al., 2017) with TextBoxes++ to improve the localization results and to obtain an end-to-end solution.

SSTD. The single-shot text detector proposed by He et al. (2017) is a natural scene text detector that directly outputs word-level bounding boxes without post-processing, except for a simple NMS. The detector can be decomposed into three parts: a convolutional component, a text-specific component, and a box prediction component. The convolutional and box prediction components are inherited from the SSD detector (Liu et al., 2016), and the authors proposed a text-specific component that consists of a text attention module and a hierarchical inception module.

SqueezeDet. This network was proposed to detect objects for the autonomous driving problem, which requires real-time detection (Wu et al., 2017). SqueezeDet consists of a single-stage detection pipeline with three components: (i) an FCN responsible for generating the feature map for the input images; (ii) a convolutional layer responsible for detecting, localizing, and classifying objects at the same time; and (iii) the non-maximum suppression (NMS) method, which is applied to remove overlapping bounding boxes.

YOLOv3. This is a convolutional network originally proposed for the object detection problem (Redmon and Farhadi, 2018). Similarly to the SSD network, YOLOv3 predicts bounding boxes and class probabilities at the same time.

4 RESULTS

This section presents the experimental results of the proposed method (SSD-MobilenetV2) and a comparison with state-of-the-art methods for text localization. Table 1 shows the results of the evaluated methods on the ICDAR’11 dataset. In this case, the SSD-MobilenetV2 method achieved the best results, with Precision, Recall, and F-measure values of 97.40%, 94.81%, and 96.09%, respectively. On the other hand, the SqueezeDet network presented the lowest Precision and F-measure among the evaluated methods (56.36% and 66.01%, respectively). In turn, TextBoxes achieved the lowest Recall (71.93%).

With regard to the ICDAR’13 dataset, the SSTD method presented the highest Recall (82.19%) and F-measure (86.33%), while YOLOv3 reached the best Precision (92.01%; Table 1). Note, however, that SSD-MobileNetV2 yields very competitive results for this dataset as well in terms of Precision.

As we could observe, the proposed approach presented some difficulty in localizing scene text in the ICDAR’13 dataset. In comparison with the results achieved for ICDAR’11, the precision and recall rates decreased by 9.02 and 28.14 percentage points, respectively (Table 1), which suggests that our network did not localize several candidate regions containing text.

Figure 4: Comparison results among the evaluated methods considering aspects of efficacy and efficiency.

Figure 5: Two high-resolution examples from the ICDAR’13 dataset with both medium-sized text (detected by our method) and small-sized text (not detected).

To understand why the proposed method had this difficulty in localizing text in the ICDAR’13 dataset, we analyzed the failure cases taking into account the relative area of the missed bounding boxes. Fig. 6 presents a box plot that shows the distribution of the relative area of bounding boxes (i.e., ratio of bounding box area to


Table 1: Comparison of effectiveness among the evaluated deep learning-based methods for the ICDAR’11 and ICDAR’13 datasets.

                         ICDAR’11                  ICDAR’13
Methods            P (%)   R (%)   F (%)     P (%)   R (%)   F (%)
SSD-MobilenetV2    97.40   94.81   96.09     88.38   66.67   76.00
SSTD               89.28   78.53   83.56     90.91   82.19   86.33
TextBoxes          92.15   71.93   80.80     88.84   74.16   80.83
TextBoxes++        95.76   90.51   93.06     90.49   80.82   85.38
YOLOv3             94.27   89.21   91.67     92.01   75.71   83.07
SqueezeDet         56.36   79.66   66.01     29.41   62.47   39.99

Figure 6: Comparison among distributions of relative areas of bounding boxes from Ground-Truth (GT), False Negative (FN) cases, and False Positive (FP) cases. We omitted the points considered outliers for a better visualization.

image area) for the ground-truth, false positive cases,

and false negative cases.

As we can observe, the missed bounding boxes (false negative cases) have a small relative area. More precisely, 75% of the false negative cases (third quartile of the FN box plot) have a relative area of up to 0.01, a range that corresponds to 50% of the bounding boxes present in the ground truth (median of the GT box plot). These results suggest that high-resolution images with relatively small text (see Fig. 5) are especially challenging for our method. To overcome this limitation, future investigations can be conducted to devise an architecture that better localizes bounding boxes at multiple scales, such as Feature Pyramid Networks (FPNs), as proposed by Lin et al. (2017).
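The relative-area analysis can be reproduced with a few lines of code. The sketch below is only an illustration: the helper names and the sample boxes are hypothetical, and it assumes the "inclusive" quartile convention of Python's `statistics.quantiles`, which may differ slightly from the convention used by the plotting tool behind the box plots.

```python
from statistics import quantiles


def relative_area(box, image_width, image_height):
    """Ratio of the area of a (x1, y1, x2, y2) box to the area of its image."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1) / (image_width * image_height)


def quartiles(values):
    """First quartile, median, and third quartile of a list of values."""
    return tuple(quantiles(values, n=4, method="inclusive"))


# Hypothetical missed (false negative) boxes in a 1000 x 500 image.
missed = [(0, 0, 20, 10), (0, 0, 40, 20), (0, 0, 60, 30), (0, 0, 80, 40), (0, 0, 100, 50)]
areas = [relative_area(box, 1000, 500) for box in missed]
q1, median, q3 = quartiles(areas)
print(f"Q1={q1:.4f} median={median:.4f} Q3={q3:.4f}")
```

Comparing the third quartile of the false negatives with the median of the ground-truth boxes, as done in the text, only requires running `quartiles` on each set of relative areas.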

In terms of the efficiency of the presented methods, Fig. 4 summarizes the results by relating the effectiveness of the evaluated methods, measured by the F-measure, to the metrics used for measuring their efficiency, considering the ICDAR’11 and ICDAR’13 datasets.

Regarding efficiency (processing time and disk usage), the proposed method (SSD-MobilenetV2) yielded very competitive results, taking only 0.45 and 0.55 seconds per image on ICDAR’11 and ICDAR’13, respectively. Compared with the baseline methods originally proposed for text localization (TextBoxes, TextBoxes++, and SSTD), the proposed method presented very competitive results, with a processing time of 0.67 seconds per image and disk usage of about 37.0MB. In contrast, the most effective baseline methods, the SSTD and TextBoxes++ networks, presented competitive results in terms of effectiveness but worse results in terms of processing time in comparison with the proposed method. Regarding disk usage, SSD-MobileNetV2 also presented the best balance between accuracy and model size.

When compared with the state-of-the-art approaches for object detection, the proposed method also presented competitive results. In this case, the fastest approach for text localization was the SqueezeDet network, which takes about 0.1 seconds per image on average. However, when we take into account the trade-off between efficiency and effectiveness, the proposed method presented a better compromise between these two measures. Fig. 7 and Fig. 8 provide some cases of success and failure of the proposed method for the ICDAR’11 and ICDAR’13 datasets, respectively. For the former (see Fig. 7), the proposed method was able to localize textual elements with different font styles and even multi-oriented text. For the latter, the proposed method was able to localize text in several contexts, such as airport signs, traffic signs, and text on objects, among others. Failures are due to compression artifacts, high similarity between the background and text colors, small text, and lighting conditions.

5 CONCLUSIONS

How can efficient and effective text detection in scene images be performed in restrictive computing environments? To address that research problem, we presented a new method based on the combination of two


Figure 7: Examples of success (first row) and failure (second row) cases of the proposed approach for the ICDAR’11 dataset. Green bounding boxes indicate the regions correctly localized (true positive cases), while red bounding boxes show candidate regions that were not detected by our method (false negative cases).

Figure 8: Examples of success (first row) and failure (second row) cases of the proposed approach for the ICDAR’13 dataset. Green bounding boxes indicate the regions correctly localized (true positive cases), while red bounding boxes show candidate regions that were not detected by our method (false negative cases).

light architectures, MobileNetV2 and Single Shot Detector (SSD), which yielded better or comparable effectiveness when compared with state-of-the-art baselines while having a low processing time and a small model size. Compared with other object detector solutions, our method is the most promising. Our findings disagree with the discussion provided in (Ye and Doermann, 2015), as we demonstrated that adapting object detector networks for text detection is a promising research avenue.

Future research efforts will focus on better characterizing both small and large candidate regions in order to localize text at multiple scales, for example by using Feature Pyramid Networks.

REFERENCES

Girshick, R. (2015). Fast R-CNN. In The IEEE International Conference on Computer Vision (ICCV).

Gupta, A., Vedaldi, A., and Zisserman, A. (2016). Synthetic data for text localisation in natural images. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2315–2324.

He, P., Huang, W., He, T., Zhu, Q., Qiao, Y., and Li, X. (2017). Single shot text detector with regional attention. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 3066–3074.

He, T., Huang, W., Qiao, Y., and Yao, J. (2016). Text-attentional convolutional neural network for scene text detection. IEEE Transactions on Image Processing, 25(6):2529–2541.

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861.

Karatzas, D., Gomez-Bigorda, L., Nicolaou, A., Ghosh, S., Bagdanov, A., Iwamura, M., Matas, J., Neumann, L., Chandrasekhar, V. R., Lu, S., Shafait, F., Uchida, S., and Valveny, E. (2015). ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160.

Karatzas, D., Mestre, S. R., Mas, J., Nourbakhsh, F., and Roy, P. P. (2011). ICDAR 2011 robust reading competition - challenge 1: Reading text in born-digital images (web and email). In 2011 International Conference on Document Analysis and Recognition, pages 1485–1490.

Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Bigorda, L. G. i., Mestre, S. R., Mas, J., Mota, D. F., Almazán, J. A., and de las Heras, L. P. (2013). ICDAR 2013 robust reading competition. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, ICDAR ’13, pages 1484–1493, Washington, DC, USA.

Liao, M., Shi, B., and Bai, X. (2018). TextBoxes++: A single-shot oriented scene text detector. IEEE Transactions on Image Processing, 27(8):3676–3690.

Liao, M., Shi, B., Bai, X., Wang, X., and Liu, W. (2017). TextBoxes: A fast text detector with a single deep neural network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 4161–4167.

Lin, T., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. In Leibe, B., Matas, J., Sebe, N., and Welling, M., editors, Computer Vision – ECCV 2016, pages 21–37, Cham. Springer International Publishing.

Neubeck, A. and Gool, L. V. (2006). Efficient non-maximum suppression. In 18th International Conference on Pattern Recognition (ICPR’06), volume 3, pages 850–855.

Redmon, J. and Farhadi, A. (2018). YOLOv3: An incremental improvement. CoRR, abs/1804.02767.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510–4520.

Shi, B., Bai, X., and Yao, C. (2017). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304.

Tang, Y. and Wu, X. (2017). Scene text detection and segmentation based on cascaded convolution neural networks. IEEE Transactions on Image Processing, 26(3):1509–1520.

Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31.

Wang, L., Wang, Z., Qiao, Y., and Van Gool, L. (2018). Transferring deep object and scene representations for event recognition in still images. International Journal of Computer Vision, 126(2):390–409.

Wu, B., Iandola, F., Jin, P. H., and Keutzer, K. (2017). SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 446–454.

Yan, C., Xie, H., Liu, S., Yin, J., Zhang, Y., and Dai, Q. (2018). Effective Uyghur language text detection in complex background images for traffic prompt identification. IEEE Trans. Intelligent Transportation Systems, 19(1):220–229.

Ye, Q. and Doermann, D. S. (2015). Text detection and recognition in imagery: A survey. IEEE Trans. Pattern Anal. Mach. Intell., 37(7):1480–1500.

Yi, C., Tian, Y., and Arditi, A. (2014). Portable camera-based assistive text and product label reading from hand-held objects for blind persons. IEEE/ASME Transactions on Mechatronics, 19(3):808–817.

Zhang, Z., Zhang, C., Shen, W., Yao, C., Liu, W., and Bai, X. (2016). Multi-oriented text detection with fully convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Zhu, Y., Liao, M., Yang, M., and Liu, W. (2018). Cascaded segmentation-detection networks for text-based traffic sign detection. IEEE Transactions on Intelligent Transportation Systems, 19(1):209–219.
