Fast Scene Text Detection with RT-LoG Operator and CNN
Dinh Cong Nguyen¹,², Mathieu Delalandre¹, Donatello Conte¹ and The Anh Pham²
¹Tours University, Tours City, France
²Hong Duc University, Thanh Hoa, Vietnam
Keywords:
Scene Text Detection, Keypoint Grouping, RT-LoG, Character Pattern.
Abstract:
Text detection in scene images is of particular importance for computer-based applications. Text detection methods must be robust against the variabilities and deformations of text entities. In addition, to be embedded into mobile devices, the methods have to be time-efficient. In this paper, a keypoint grouping method is proposed: the real-time Laplacian of Gaussian operator (RT-LoG) is first applied to detect keypoints. These keypoints are then grouped to produce character patterns. The patterns are filtered by a CNN model before being aggregated into words. Performance evaluation is discussed on the ICDAR2017 RRC-MLT and the Challenge 4 of ICDAR2015 datasets. The results are given in terms of detection accuracy and processing time against different end-to-end systems in the literature. Our system offers one of the strongest detection accuracies while running at approximately 15.6 frames per second at HD resolution on a regular CPU architecture. It is one of the best candidates to guarantee the trade-off between accuracy and speed in the literature.
1 INTRODUCTION
Scene text detection has been a key topic in the literature for several years (Long et al., 2018) due to its influence on real-life applications. Text detection can be applied in a wide range of applications such as traffic sign recognition, blind assistance, augmented reality and so on. However, detecting and localizing scene text still remains a challenge due to degradation. This covers different aspects such as texture, illumination changes, differences in languages, scales of characters and background/foreground transitions (Figure 1). Robust methods must be designed against the variabilities and deformations of text entities.
Moreover, another crucial problem is to adapt the methods to be time-efficient so that they can be embedded into mobile devices. This involves an almost complete reformulation of the methods to make them real-time, in order to respect the time constraint for detection (Neumann and Matas, 2015).
The real-time methods in the literature apply a
two-stage strategy for localization followed by text
verification. The localization determines the posi-
tions of candidate text elements in the image at a low
complexity level. The main goal is to process with a strong recall so as not to miss text elements. After that, text verification specifies which candidates are text or not, filtering out the false alarms with verification procedures. The two-stage strategy is opposed to the end-to-end strategy, which merges localization and verification (Long et al., 2018).

Figure 1: Examples of text in natural scenes with specific degradations: (a) blurring, (b) different sizes of characters, (c) illumination changes.
A core component of the two-stage strategy is the local operator, which extracts candidate keypoints at the locations of text elements in the image. Different real-time operators have been proposed in the literature for scene text detection, such as the FASText operator (Busta et al., 2015), the Canny Text Detector (Cho et al., 2016) and the Maximally Stable Extremal Regions (MSER) operator (Gomez and Karatzas, 2014).
In this paper, we propose a new system applying a
two-stage strategy for the real-time detection of scene
text. Compared with the other systems in the litera-
ture, our system applies in the first stage an RT-LoG
operator. The RT-LoG operator is the real-time im-
plementation of the Laplacian of Gaussian (LoG) op-
erator (Fragoso et al., 2014; Nguyen et al., 2019). It
is competitive for time processing and can be adapted
to text detection. Furthermore, it provides meaningful
scale-space and contrast information for detection.
This paper gives several key contributions:
- We propose a new grouping approach to embed the RT-LoG operator into a two-stage system.
- We highlight how the RT-LoG operator can support the full pipeline for scene text detection.
- We provide a performance evaluation where our overall system achieves one of the strongest detection accuracies of the literature, while requiring roughly two orders of magnitude fewer processing resources than our competitors.
The rest of the paper is organized as follows. Section 2 presents our system. Then, the performance evaluation of the system is discussed in section 3. At last, section 4 summarizes and gives some perspectives. For convenience, Table 1 provides the meaning of the main symbols used in the paper.
2 PROPOSED APPROACH
The general architecture is presented in Figure 2. At
the first stage, we employ the RT-LoG operator to de-
tect the keypoints among the strokes composing the
characters. Next, a dedicated algorithm is proposed
to group the keypoints using the spatial/scale-space
representation of the RT-LoG operator. The grouping
method results in Regions of Interest (RoIs) consti-
tuted of character patterns (characters, parts of char-
acters, connected characters). These RoIs are verified into text/non-text regions by a CNN. Before the verification, these RoIs are normalized using the RT-LoG and geometric features. This normalization relaxes the CNN verification from invariance problems such as contrast, scale and orientation. A method for text line grouping is applied in the final stage. We detail each stage in the following subsections.

Table 1: The main symbols used in the paper.

Symbols            Meaning
x, y               Image coordinates
g = f(x,y)         f the image function returning the gray-level g at any location (x,y)
w                  Stroke width parameter, w ∈ [w_min, w_max], with {w_min,...,w_max} the set of values
m                  Scale parameter, m = (w_max - w_min) + 1
h(x,y)             Map of the responses h for all the pixels in the image, with h ∈ [-1, 1]
β                  A thresholding parameter
p = (x,y,g,h,w)    A keypoint
n, k               Numbers of keypoints, with k << n
S                  A set of keypoints {p_1,...,p_n}
Q ⊂ S              A subset of keypoints {p_1,...,p_k}
q                  The number of Regions of Interest (RoIs)
R                  A set of RoIs R = {R_1,...,R_i,...,R_q}, with R_i a given RoI
2.1 The RT-LoG Operator
To start, we process the image with the RT-LoG operator to detect the keypoints constituting the characters (Figure 3). The optimization of the operator is obtained with an estimation of the LoG function and fast Gaussian convolution (Fragoso et al., 2014). The adaptation to stroke detection is given by the stroke model for the scale-space representation (Nguyen et al., 2019; Liu et al., 2014).
The operator is controlled with a range of widths w ∈ [w_min, w_max] of the strokes to detect. Within this range, m = (w_max - w_min) + 1 is not only the number of stroke widths to detect but also the parameter of the scale-space problem. An additional parameter β serves to threshold the operator responses and to tune the precision/recall of the detection.
The operator results in a set of n keypoints S = {p_1,...,p_n}, where a keypoint p = (x,y,g,h,w) is given with (x,y) the spatial coordinates, g the gray level at the keypoint location, h the operator response and w the stroke width parameter. The operator response h is either positive or negative depending on the background/foreground transition of the character. Figure 3 gives an example of keypoint detection with the corresponding response map h(x,y) providing the h responses at every location (x,y).
The RT-LoG operator provides meaningful scale-
space and contrast information. This can drive a
grouping method to obtain the character patterns and
to guide a text verification stage. We discuss these aspects in the next subsections.
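To make this stage concrete, the listing below gives a minimal, non-real-time Python sketch of multi-scale LoG keypoint detection. The sigma = w/2 mapping between stroke width and Gaussian scale, the response normalization and the default parameters are illustrative assumptions; the actual RT-LoG reaches real-time speed through a box-filter estimation of the LoG (Fragoso et al., 2014), which is not reproduced here.

    import numpy as np
    from scipy.ndimage import gaussian_laplace, maximum_filter

    def log_keypoints(image, w_min=2, w_max=12, beta=0.2):
        # image: 2D uint8 gray-level array; returns keypoints p = (x, y, g, h, w).
        f = image.astype(np.float64) / 255.0
        keypoints = []
        for w in range(w_min, w_max + 1):
            sigma = w / 2.0  # assumed stroke-width-to-scale mapping
            # Scale-normalized LoG; the sign of h encodes the
            # background/foreground transition of the stroke.
            h = (sigma ** 2) * gaussian_laplace(f, sigma)
            h /= max(np.abs(h).max(), 1e-9)  # normalize responses to [-1, 1]
            # 3 x 3 non-maximum suppression on the response magnitude.
            peaks = np.abs(h) == maximum_filter(np.abs(h), size=3)
            ys, xs = np.nonzero(peaks & (np.abs(h) >= beta))
            for x, y in zip(xs, ys):
                keypoints.append((int(x), int(y), int(image[y, x]),
                                  float(h[y, x]), w))
        return keypoints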
2.2 The Spatial/Scale-space Grouping
The RT-LoG operator gives out, for each image, a set of keypoints belonging to the strokes composing the characters. These keypoints must be grouped together to constitute character patterns. The grouping of keypoints is a well-known topic in the image processing and computer vision field (Dan et al., 2015). The main challenge is to outline the keypoints belonging to particular objects or RoIs in the image. Different strategies can be applied, such as area-split with load imbalance, clustering (e.g. K-means, density-based spatial clustering), grouping with machine learning (while characterizing the keypoints with descriptors) or geometric consistency.
The grouping methods depend on the features used for the keypoints and the considered local operators. Several grouping methods for LoG-based operators have been proposed in the literature (Table 2).

Figure 2: The detail of our approach.

Figure 3: (a) input image, (b) detected keypoints with circles of radius w/2, (c) the corresponding feature map h(x,y).
These methods target particular applications such as general object detection or scene text detection. The scale-space model is a key property to ensure consistency when detecting the keypoints. An optimal model could be established for stroke detection (Liu et al., 2014; Nguyen et al., 2019) but not for general object detection. In that case, the strategy of the grouping methods is driven by a high-level characterization of the keypoints with handcrafted features and/or machine learning methods.
To the best of our knowledge, a dedicated grouping method for the RT-LoG operator, embedding the stroke model, has not yet been investigated in the literature. In this paper, we propose a new approach to drive a consistent grouping with optimization using the spatial/scale-space representation of the RT-LoG operator. Our method uses three main steps, as detailed in Figure 4. We detail these steps in the next paragraphs.
Table 2: The overview of the grouping methods for LoG-based operators.

References                         Method                                              Application
(Dong et al., 2015)                The stroke width values are locally calculated     Text detection
                                   and gathered into an SWT map.
(Mao et al., 2013)                 The geometric context of SIFT keypoints            Text detection
                                   is discovered.
(Shivakumara and Phan, 2010)       K-means clustering is used to identify candidate   Text detection
                                   text regions based on the maximum difference.
(Xu et al., 2016)                  Keypoints are clustered into different groups by   Object detection
                                   the mean-shift algorithm and geometric closeness.
(Fernández-Caramés et al., 2014)   Vertical lines are produced by the Haar-like       Object detection
                                   features and the integral image.
Foreground/Background Partitioning (Step 1): in this step, we need to cope with both polarities in scene text detection, such as a light character on a dark background producing negative responses, and vice-versa. A keypoint p = (x,y,g,h,w) is given with a normalized response h of the operator. This response h ∈ [-1, 1] results in two conditions, of white (h ∈ [-1, 0[) and dark (h ∈ ]0, 1]) foregrounds, respectively. The keypoints belonging to a same character pattern are supposed to have a similar response. This step splits the set of keypoints into two subsets based on the sign of their responses.
Scale-space Partitioning and Grouping (Steps 2, 3): the RT-LoG operator is scale-invariant and provides a stroke width parameter w for each detected keypoint. This is useful information that can be used as a local threshold to group the keypoints. The RT-LoG operator applies a non-maximum suppression (NMS) within a 3 × 3 local neighborhood. This guarantees an overlapping between the keypoints when setting the operator with w_min ≥ 2.
The grouping is performed for each keypoint p_i in the subsets obtained from step 1. A keypoint p_i is grouped with a keypoint p_j, with i, j ∈ [1, n], if their Euclidean distance fits Eq. (1), as illustrated in Figure 5. In that case, the label of the keypoint p_i is propagated to the keypoint p_j.

‖p_i - p_j‖_2 ≤ (w_i + w_j) / 2    (1)
Similar to (Cabaret and Lacassagne, 2017), the overall grouping must apply forward/backward requests and propagation to merge the labels between the keypoints. The requests could be time-consuming, depending significantly on the request algorithm used and the number of keypoints. For optimization, the keypoints can be indexed with a KD-tree based DBSCAN (Vijayalaksmi and Punithavalli, 2012). The method is known as one of the fastest in the literature for the regular range searching problem. The overall grouping can then be achieved with O(n log n) comparisons.
We propose here a strategy for optimization taking into account the specificity of the detection problem with the RT-LoG operator. A large number of keypoints belonging to a character pattern have close stroke widths (Figure 6 (a)). We use this property to group the keypoints within the same scale at a low complexity level. After that, the grouping is extended to
the in-between scales. We use here a two-stage strat-
egy for grouping, as detailed below.
Step 2: the keypoints are first partitioned into subsets according to their stroke width parameter w ∈ [w_min, w_max]. This is obtained with a linear scan of the overall set S = {p_1,...,p_n} at a complexity O(n). A subset is noted Q = {p_1,...,p_k}, where k << n. Every subset Q is indexed with the KD-tree based DBSCAN to get the cluster components corresponding to the character patterns (Figure 6 (b)). This process has a complexity O(k log k) for every subset Q. Each cluster is assigned an ID.
Step 3: the grouping is extended to the in-between scales (Figure 6 (c)). For optimization, clusters from the different scales are compared, and the comparison is stopped at the first matching among clusters. The forward/backward propagation is applied to merge the labels and the clusters to get the set of RoIs R = {R_1,...,R_i,...,R_q}.
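The listing below sketches steps 2 and 3 in Python under simplifying assumptions: within a single scale w_i = w_j = w, Eq. (1) reduces to a radius of w, which is used directly as the DBSCAN eps; scikit-learn's tree-based DBSCAN stands in for the KD-tree based DBSCAN of (Vijayalaksmi and Punithavalli, 2012); and the cross-scale merge is a naive pairwise pass rather than the forward/backward label propagation.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def group_keypoints(keypoints):
        # keypoints: list of (x, y, g, h, w) from one partition of step 1.
        # Step 2: linear scan, O(n), to partition the set S by stroke width w.
        by_width = {}
        for p in keypoints:
            by_width.setdefault(p[4], []).append(p)

        clusters = []
        for w, subset in by_width.items():
            pts = np.array([(p[0], p[1]) for p in subset], dtype=np.float64)
            # Within one scale, Eq. (1) gives ||p_i - p_j||_2 <= w.
            labels = DBSCAN(eps=w, min_samples=1,
                            algorithm='kd_tree').fit_predict(pts)
            for lbl in np.unique(labels):
                clusters.append([p for p, l in zip(subset, labels) if l == lbl])

        # Step 3: extend the grouping to the in-between scales, stopping at
        # the first pair of keypoints satisfying Eq. (1).
        def match(c1, c2):
            for (x1, y1, _, _, w1) in c1:
                for (x2, y2, _, _, w2) in c2:
                    if np.hypot(x1 - x2, y1 - y2) <= (w1 + w2) / 2.0:
                        return True
            return False

        rois = []  # the set of RoIs R
        for c in clusters:
            for r in rois:
                if match(c, r):
                    r.extend(c)
                    break
            else:
                rois.append(c)
        return rois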
Finally, we compute a set of geometric features from the keypoints belonging to a RoI: the centroid, the area/perimeter, the bounding box and the orientation. These features are used in the scene text detection pipeline to drive the image normalization and the text line construction (Figure 2).
2.3 Verification and Text Line
Construction
The grouping algorithm results in the detection of RoIs. As the RT-LoG operator detects strokes, the detected RoIs could be either character patterns or background elements from the natural scene image. The RoIs must be classified to filter out the character patterns from the background. This is the text verification problem that takes part in scene text detection. The traditional approach is to apply hand-crafted features with classification. Recently, CNNs have become prominent in the field, where a main issue is to design end-to-end systems (Long et al., 2018). When applied to text verification, the CNN classifies the RoIs of images to verify whether a RoI belongs to a text part or not.

Table 3: CNN models for text verification.

Image size            (24 × 24) - (32 × 32)
Layer    Filter size          Filter No     Pooling
Conv1    (5 × 5)-(8 × 8)      ×20 - ×96     (2 × 2)-(5 × 5)
Conv2    (2 × 2)-(5 × 5)      ×50 - ×256    (2 × 2)
Several CNN models have been proposed in the literature for text verification (Yang et al., 2015; Ray et al., 2016; Wang et al., 2018; Zhang et al., 2015; Ghosh et al., 2019). Table 3 gives a summary. The fundamental model processes with two convolutional layers, using a ReLU function for the non-linearity and a max pooling for optimization. A fully-connected (FC) layer set with a softmax function is used in the final stage for classification. Some works have investigated deeper models using up to four convolutional layers, as (Zhang et al., 2015). For the sake of performance evaluation, we customize our model inspired from (Wang et al., 2018; Ghosh et al., 2019). It is produced with two convolutional layers and one FC layer, where each convolutional layer is followed by a ReLU activation and an average pooling layer.
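A minimal Keras sketch of such a verifier is given below. Only the overall structure (two convolutional layers, each followed by a ReLU and an average pooling, then one FC layer with softmax) is fixed by the text above; the 32 × 32 input, the filter counts (20 and 50) and the 5 × 5 kernels are assumptions taken from the ranges of Table 3.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_verifier():
        # Two conv layers (ReLU + average pooling) and one FC softmax
        # layer, classifying a 32x32 gray-level RoI as text / non-text.
        return tf.keras.Sequential([
            tf.keras.Input(shape=(32, 32, 1)),
            layers.Conv2D(20, (5, 5), activation='relu'),  # assumed filters
            layers.AveragePooling2D((2, 2)),
            layers.Conv2D(50, (5, 5), activation='relu'),  # assumed filters
            layers.AveragePooling2D((2, 2)),
            layers.Flatten(),
            layers.Dense(2, activation='softmax'),  # text / non-text
        ])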
A core problem for text verification is the variability of character patterns. These patterns are not normalized and suffer from distortions (Figure 7 (a)-(d)). Another problem comes from cursive-looking characters, which result in word patterns after detection and grouping (Figure 7 (e)). The normalization of these patterns to 32 × 32 images introduces other distortions, such as down-scaling and the modification of the aspect ratio. This tends to burden the learning period of the CNN training for text verification.
To solve these problems, we apply an image normalization before classifying the character patterns with the CNN (Figure 2). We use the RT-LoG and geometric features to drive the normalization, as they provide meaningful information about the character patterns. Our process is detailed here.
- Skew correction: we use the geometric features to correct the skew of characters with an image rotation (Figure 7 (a)).
- Background/foreground normalization: the keypoints belonging to a character pattern provide a common operator response h. This response is either positive or negative depending on the background/foreground transition. We look for all the character patterns having a positive response h > 0 and invert the image for normalization (Figure 7 (b)).
- Contrast normalization: the character patterns appear with different background/foreground amplitudes. This is a contrast normalization problem that can be solved from the operator response h. Indeed, the operator response h ∈ [-1, 1], where |h| = 1 is obtained from a bilevel black-and-white image having a maximum amplitude. We use the response h and the g values to get a lookup table to normalize the contrast (Figure 7 (c)).
- Scale normalization: the normalization of character patterns to a 32 × 32 image distorts the aspect ratio (Figure 7 (d)). We use the average stroke width w of the character patterns as a scale estimator. This estimator serves to compute a scale ratio w_0/w, with w_0 an offline parameter obtained from the training database. The parameter w_0 is fixed in order to embed the character patterns into a 32 × 32 image, on average. If not, we split the character patterns into 32 × 32 patches while respecting an overlapping threshold (Figure 7 (e)).

Figure 4: The detail of the spatial/scale-space grouping method.

Figure 5: The grouping rule of two keypoints.
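The sketch below consolidates the background/foreground, contrast and scale normalizations. It is an approximation: the paper derives a lookup table from h and the g values, modeled here as a simple linear stretch driven by |h|, and w0 = 6 is a hypothetical value for the offline stroke-width parameter w_0.

    import numpy as np
    import cv2

    def normalize_pattern(patch, h, w, w0=6.0):
        # patch: gray-level RoI; h: common operator response of its
        # keypoints; w: average stroke width of the character pattern.
        img = patch.astype(np.float64)
        if h > 0:              # background/foreground normalization:
            img = 255.0 - img  # invert the patterns with a positive response
        # Contrast normalization: |h| = 1 corresponds to a bilevel image
        # with maximum amplitude, so the amplitude is stretched by 1/|h|.
        mean = img.mean()
        img = np.clip(mean + (img - mean) / max(abs(h), 1e-3), 0.0, 255.0)
        # Scale normalization by the ratio w0 / w; too-wide patterns would
        # then be split into overlapping 32x32 patches (not shown).
        s = w0 / float(w)
        return cv2.resize(img.astype(np.uint8), None, fx=s, fy=s)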
In the final step, after verification with normalization, the close text regions must be grouped to get the text lines. Similar to (Cho et al., 2016), we use the minimum-area encasing rectangle method to provide consistent bounding boxes.
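This final step can be sketched with OpenCV as follows, assuming the verified RoIs of a line are given as axis-aligned boxes (x, y, width, height):

    import numpy as np
    import cv2

    def text_line_box(rois):
        # Collect the corners of all RoI boxes belonging to one text line
        # and fit the minimum-area encasing rectangle.
        corners = []
        for (x, y, w, h) in rois:
            corners += [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]
        rect = cv2.minAreaRect(np.array(corners, dtype=np.float32))
        return cv2.boxPoints(rect)  # 4 corners of the rotated bounding box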
3 PERFORMANCE EVALUATION
In this section, we present the performance evalua-
tion of our system. Section 3.1 introduces the used
datasets. Section 3.2 details the characterization met-
rics. Results are discussed in terms of detection accu-
racy and time processing in sections 3.3 and 3.4.
3.1 Datasets
Several public datasets have been proposed over the years for the performance evaluation of text detection methods. The ICDAR2017 and ICDAR2019 RRC-MLT are the two most up-to-date datasets in the literature (Nayef et al., 2017; Nayef et al., 2019). The ICDAR2019 RRC-MLT is a recent dataset and a slight update of the ICDAR2017 RRC-MLT, adding 1K images to the existing set of 9K images. For our performance evaluation, we have selected the ICDAR2017 RRC-MLT dataset, for which more comparative results are available in the literature. It includes 7200 training images, 1800 validation images and 9000 test images. The images are given at different resolutions (VGA, HD, Full-HD, Quad-HD, 4K). This dataset has a particular focus on multi-lingual scene text detection in 9 languages and offers a deep challenge for scalability. Figure 8 shows some visual examples of images.
For the processing time, it is more common in
the literature to use the Challenge 4 of ICDAR2015
dataset (Karatzas and Gomez-Bigorda, 2015). This
dataset contains 1000 training images and 500 test im-
ages at HD resolution (1280 × 720).
3.2 Characterization Metrics
For the characterization metrics, we have followed the recommendations of the international contest (Nayef et al., 2017). The characterization is achieved at two levels, applying the Intersection over Union (IoU) criterion and computing the F-measure. The output of the text detection system is provided as bounding boxes. A detection (i.e. a true positive) is obtained if a detected bounding box has more than 50% overlap (the IoU criterion) with a bounding box in the groundtruth. The unmatched boxes in the detection and the groundtruth are false positives and false negatives, respectively. The detection cases serve to compute the regular metrics: precision (P), recall (R) and F-measure (F); we do not detail these aspects here and refer to (Nayef et al., 2017). Note that some degraded texts in the dataset are marked as "don't care" boxes and ignored in the evaluation process.
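For illustration, the IoU criterion and the derived metrics can be sketched as follows for axis-aligned boxes (the MLT groundtruth actually uses quadrilaterals, handled analogously with polygon intersections):

    def iou(a, b):
        # a, b: boxes (x1, y1, x2, y2); a detection is a true positive
        # when iou(detection, groundtruth) > 0.5.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / float(union)

    # From the matched counts: P = TP / (TP + FP), R = TP / (TP + FN)
    # and F = 2 * P * R / (P + R).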
3.3 Scene Text Detection
Table 4 shows our evaluation in comparison with the state-of-the-art. For clarity, we restrict the comparison to the top 10 competitive systems of the literature. In addition, our system is given in two implementations, with and without image normalization. Our goal here is to illustrate how the RT-LoG operator can support the CNN for classification.
Table 4: Comparison of methods: (P) Precision, (R) Recall and (F) F-measure on the ICDAR2017 RRC-MLT dataset.

Rank   Method                                   P(%)    R(%)    F(%)
1      PMTD (Liu et al., 2019)                  85.15   72.77   78.48
2      FCN-MOML (He et al., 2018)               82.66   72.53   77.26
3      R-CNN-PAN (Huang et al., 2019)           80      69.8    74.3
4      LOMO MS (Zhang et al., 2019)             80.2    67.2    73.1
5      Proposed method with normalization       65.2    82.1    72.6
6      MOSTD (Lyu et al., 2018b)                74.3    70.6    72.4
7      Proposed method without normalization    62.5    82.7    71.2
8      FOTS (Liu et al., 2018)                  81.86   62.30   70.75
9      AF-RPN (Zhong et al., 2018)              75      66      70
10     Attention Model (Wang et al., 2019)      72      63.5    67.48
11     SCUT DLVClab1 (Nayef et al., 2017)       80.3    54.5    65

Figure 6: (a) Keypoints within the scale-space partitions, (b) grouping within the scale-space partitions, (c) grouping within the in-between scale-spaces.

Figure 7: Image normalization.

Figure 8: Images from the ICDAR2017 RRC-MLT dataset (Nayef et al., 2017).
As highlighted in Table 4, our system appears in the top 5 for the F-measure score. Furthermore, our system achieves the strongest recall score of the literature. This is ensured by the use of the RT-LoG operator and the dedicated grouping method, which allow a near complete detection of text elements. The normalization obtained with the RT-LoG operator results in a slight F-measure gain of 1%-2%.
Visual examples of true and missed detections are shown in Figure 9. The operator fails to detect text in images with very low contrast. This can be explained by the real-time implementation of the LoG operator, which is not contrast invariant (Nguyen et al., 2019).

Figure 9: Text detection samples with (green) the true detections and (red) the missed cases.
3.4 Processing Time
Table 5: Average processing time in milliseconds (ms) and amounts of pixels, keypoints and RoIs for each step of the proposed method.

Step                           HD time    Data workflow
RT-LoG                         320 ms     1.2 Mpixels
Brute-force KD-tree/DBSCAN     662 ms     5.2 Kkeypoints
Optimized KD-tree/DBSCAN       258 ms     5.2 Kkeypoints
Verification                   336 ms     90 RoIs
We characterize and compare in this section the processing time of our system against other systems in the literature. As discussed in section 3.1, the ICDAR2015 dataset has been used for testing, as it is common for processing time evaluation.
We have first computed the processing time of each step of our method using a single thread/core implementation (Table 5). Our implementation is done in C++ and tested on Mac OS with an Intel(R) Core(TM) i7-4770HQ CPU at 2.2 GHz, having a near 32 GFLOPS single-precision performance.
We provide in addition in Table 5 information about the data workflow in the pipeline (the amounts of pixels, keypoints and RoIs). The optimization achieved by our grouping method results in an acceleration factor of approximately two to three. Compared to R-CNN-based systems for scene text detection (Liu et al., 2019), our system produces fewer RoIs. These RoIs are resized to 32 × 32 before verification, which explains the low processing time required by our text verification stage.
As highlighted in Table 5, the RT-LoG operator results in a large reduction of the data to process. Despite the difference in the amount of data, the processing time is close between the RT-LoG operator and the rest of the pipeline (grouping with CNN). This can be explained by the real-time implementation of the operator, which ensures a strong optimization of the spatial convolution.
In a second step, we have evaluated the frame rate in frames per second (FPS) of our system on the ICDAR2015 dataset (Table 6). This evaluation has been done with full parallelism support on the CPU, applying multi-core/threading. We provide in addition the FPS of the top systems in the literature. For a fair comparison, Table 6 reports the test architectures of the different systems (either GPU or CPU) with their relative performances, as obtained from CPU/GPU benchmarks (www.techpowerup.com; www.cpubenchmark.net).
Table 6: Frames per second (FPS) among methods on the Challenge 4 of the ICDAR2015 dataset.

Method                        FPS     Architecture      Performance (TFLOPS)
FOTS-RT (Liu et al., 2018)    22.6    TITAN-Xp GPU      12.15
Ours                          15.6    CPU 2.2 GHz       0.032
SSTD (He et al., 2017)        7.7     TITAN X GPU       6.691
EAST (Zhou et al., 2017)      6.52    TITAN-Xp GPU      12.15
MTS (Lyu et al., 2018a)       4.8     TITAN Xp GPU      12.15
MOSTD (Lyu et al., 2018b)     3.6     Tesla K40m GPU    5.046
As emphasized in Table 6, our system has the second highest FPS while processing with a difference of two orders of magnitude in terms of processing resources. All the top systems in the literature perform with end-to-end CNN models requiring a GPU architecture. For clarification, Figure 10 presents a general comparison of the performance considering the F-measure scores, the FPS and the test architectures.

Figure 10: F-measure scores and FPS of scene text detection systems with respect to their architectures.
4 CONCLUSIONS AND
PERSPECTIVES
This paper presents a new two-stage system for scene text detection. It applies in the first stage the RT-LoG operator, which supports the full pipeline for scene text detection. A dedicated algorithm is applied to group the keypoints into RoIs using the RT-LoG features. The RoIs are then classified into text/non-text regions by a CNN. Before the verification, the RoIs are normalized with the RT-LoG features. This normalization relaxes the verification from invariance problems. The proposed system is in the top 5 for the F-measure score and achieves the strongest recall of the literature. It performs at one of the highest FPS rates while processing with a difference of two orders of magnitude in terms of processing resources.
As a main perspective, the precision rate of the system can be consolidated. This can be obtained by making the RT-LoG operator contrast-invariant, to deal with the missed detection cases. Deeper CNN models could also be applied to make the text verification stage more robust. At last, an acceleration of the RT-LoG operator could be obtained by taking advantage of the Gaussian kernel distribution and its decomposition with box filtering.
REFERENCES
Busta, M., Neumann, L., and Matas, J. (2015). Fastext: Ef-
ficient unconstrained scene text detector. In Interna-
tional Conference on Computer Vision (ICCV), pages
1206–1214.
Cabaret, L. and Lacassagne, L. (2017). Distanceless label
propagation: an efficient direct connected component
labeling algorithm for gpus. In International Confer-
ence on Image Processing Theory, Tools and Applica-
tions (IPTA), pages 1–6.
Cho, H., Sung, M., and Jun, B. (2016). Canny text detec-
tor: Fast and robust scene text localization algorithm.
In International Conference on Computer Vision and
Pattern Recognition (CVPR), pages 3566–3573.
Dan, G., Khan, M., and Fodor, V. (2015). Characterization of surf and brisk interest point distribution for distributed feature extraction in visual sensor networks. Transactions on Multimedia (TMM), 17(5):591–602.
Dong, W., Lian, Z., Tang, Y., and Xiao, J. (2015). Text de-
tection in natural images using localized stroke width
transform. In International Conference on Multimedia
Modeling (ICMM), pages 49–58.
Fernández-Caramés, C., Moreno, V., and Curto, B. (2014). A real-time door detection system for domestic robotic navigation. Journal of Intelligent Robotic Systems, 76(1):119–136.
Fragoso, V., Srivastava, G., Nagar, A., and Li, Z. (2014).
Cascade of box (cabox) filters for optimal scale
space approximation. In International Conference on
Computer Vision and Pattern Recognition Workshops
(CVPRW), pages 126–131.
Ghosh, M., Mukherjee, H., Obaidullah, S., Santosh, K.,
Das, N., and Roy, K. (2019). Identifying the pres-
ence of graphical texts in scene images using cnn. In
International Conference on Document Analysis and
Recognition Workshops (ICDARW), volume 1, pages
86–91.
Gomez, L. and Karatzas, D. (2014). Mser-based real-time
text detection and tracking. In International Con-
ference on Pattern Recognition (ICPR), pages 3110–
3115.
He, P., Huang, W., He, T., and Zhu, Q. (2017). Single shot
text detector with regional attention. In International
Conference on Computer Vision (ICCV), pages 3047–
3055.
He, W., Zhang, X., Yin, F., and Liu, C. (2018). Multi-
oriented and multi-lingual scene text detection with
direct regression. Transactions on Image Processing
(TIP), 27(11):5406–5419.
Huang, Z., Zhong, Z., Sun, L., and Huo, Q. (2019). Mask
r-cnn with pyramid attention network for scene text
detection. In Conference on Applications of Computer
Vision (WACV), pages 764–772.
Karatzas, D. and Gomez-Bigorda, L. (2015). Icdar 2015
competition on robust reading. In International Con-
ference on Document Analysis and Recognition (IC-
DAR), pages 1156–1160.
Liu, J., Liu, X., Sheng, J., and Liang, D. (2019). Pyramid
mask text detector. arXiv preprint:1903.11800.
Liu, X., Liang, D., Yan, S., and Chen, D. (2018). Fots: Fast oriented text spotting with a unified network. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5676–5685.
Liu, Y., Zhang, D., Zhang, Y., and Lin, S. (2014). Real-time scene text detection based on stroke model. In International Conference on Pattern Recognition (ICPR), pages 3116–3120.
Long, S., He, X., and Yao, C. (2018). Scene text detection and recognition: The deep learning era. arXiv preprint:1811.04256.
Lyu, P., Liao, M., Yao, C., and Wu, W. (2018a). Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In The European Conference on Computer Vision (ECCV).
Lyu, P., Yao, C., Wu, W., and Yan, S. (2018b). Multi-oriented scene text detection via corner localization and region segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 7553–7563.
Mao, J., Li, H., Zhou, W., Yan, S., and Tian, Q. (2013).
Scale based region growing for scene text detection. In
International conference on Multimedia (ACM), pages
1007–1016.
Nayef, N., Patel, Y., Busta, M., and Chowdhury, P. (2019). Icdar2019 robust reading challenge on multi-lingual scene text detection and recognition - rrc-mlt-2019. arXiv preprint:1907.00945.
Nayef, N., Yin, F., Bizid, I., and Choi, H. (2017). Icdar2017
robust reading challenge on multi-lingual scene text
detection and script identification-rrc-mlt. In Interna-
tional Conference on Document Analysis and Recog-
nition (ICDAR), volume 1, pages 1454–1459.
Neumann, L. and Matas, J. (2015). Real-time lexicon-free
scene text localization and recognition. Transactions
on pattern analysis and machine intelligence (PAMI),
38(9):1872–1885.
Nguyen, D., Delalandre, M., Conte, D., and Pham, T.
(2019). Performance evaluation of real-time and
scale-invariant log operators for text detection. In
International Conference on Computer Vision, Imag-
ing and Computer Graphics Theory and Applications
(VISAPP), pages 344–353.
Ray, A., Shah, A., and Chaudhury, S. (2016). Recogni-
tion based text localization from natural scene images.
In International Conference on Pattern Recognition
(ICPR), pages 1177–1182.
Shivakumara, P. and Phan, T. (2010). A laplacian approach
to multi-oriented text detection in video. Transactions
on pattern analysis and machine intelligence (PAMI),
33(2):412–419.
Vijayalaksmi, S. and Punithavalli, M. (2012). A fast ap-
proach to clustering datasets using dbscan and pruning
algorithms. International Journal of Computer Appli-
cations (IJCA), 60(14).
Wang, H., Rong, X., and Tian, Y. (2019). Towards accu-
rate instance-level text spotting with guided attention.
In International Conference on Multimedia and Expo
(ICME), pages 994–999.
Wang, Y., Shi, C., Xiao, B., Wang, C., and Qi, C. (2018).
Crf based text detection for natural scene images us-
ing convolutional neural network and context infor-
mation. Neurocomputing, 295:46–58.
Xu, H., Lu, C., Berendt, R., and Jha, N. (2016). Auto-
matic nuclei detection based on generalized laplacian
of gaussian filters. Journal of biomedical and health
informatics, 21(3):826–837.
Yang, H., Wang, C., Che, X., Luo, S., and Meinel, C.
(2015). An improved system for real-time scene text
recognition. In International Conference on Multime-
dia Retrieval (ACM), pages 657–660.
Zhang, C., Liang, B., Huang, Z., En, M., and Han, J. (2019).
Look more than once: An accurate detector for text of
arbitrary shapes. arXiv preprint:1904.06535.
Zhang, C., Yao, C., Shi, B., and Bai, X. (2015). Automatic
discrimination of text and non-text natural images. In
International Conference on Document Analysis and
Recognition (ICDAR), pages 886–890.
Zhong, Z., Sun, L., and Huo, Q. (2018). An anchor-free re-
gion proposal network for faster r-cnn based text de-
tection approaches. arXiv preprint:1804.09003.
Zhou, X., Yao, C., Wen, H., and Wang, Y. (2017). East:
an efficient and accurate scene text detector. In Inter-
national conference on Computer Vision and Pattern
Recognition (CVPR), pages 5551–5560.