Deep Learning with Sparse Prior: Application to Text Detection in the Wild

Adleni Mallek¹, Fadoua Drira¹, Rim Walha¹, Adel M. Alimi¹ and Frank LeBourgeois²
¹ReGIM-lab, University of Sfax, ENIS, BP 1173, 3038 Sfax, Tunisia
²LIRIS, University of Lyon, INSA-Lyon, CNRS, UMR5205, F-69621 Lyon, France
{adleni.mallek.tn, fadoua.drira, rim.walha, adel.alimi}@ieee.org, franck.lebourgeois@insa-lyon.fr
Keywords: PCANet, Deep Learning, Sparse Coding, Text Detection.

Abstract: Text detection in the wild remains a very challenging task in computer vision. According to the state of the art, no text detection system robust to all circumstances exists to date. For instance, the complexity and the diversity of degradations in natural scenes make traditional text detection methods very limited and inefficient. Recent studies reveal the performance of texture-based approaches, especially those including deep models. Indeed, the main strength of these models is the availability of a learning framework coupling feature extraction and classification. Therefore, this study focuses on developing a new texture-based approach for text detection that takes advantage of deep learning models. In particular, we investigate a sparse prior within the structure of PCANet, a convolutional neural network known for its simplicity and speed and based on cascaded principal component analysis (PCA). The added value of the sparse coding is the representation of each feature map via coupled dictionaries in order to migrate from one resolution level to an adequate lower-resolution level. The specificity of the dictionary is the use of oriented patterns well suited to textual pattern description. The experimental study performed on the standard ICDAR 2003 benchmark shows that the proposed method achieves very promising results.
1 INTRODUCTION
Text in an image is a vital source of information, very useful mainly for understanding the content of the image. Nowadays, many studies focus on text detection for various applications such as content-based image retrieval and document image analysis. However, despite the tremendous amount of work done on this subject, text detection in the wild remains a very challenging task in computer vision. The study of the state of the art shows that no text detection system robust to all circumstances exists to date. A major difficulty arises from the widespread use of digital cameras, since the image acquisition process can introduce additional distortions in terms of blur, noise, non-uniform illumination and perspective degradations, on top of the diversity of character layouts (shape, size, position and color). Moreover, the presence of background objects in natural scene images can be at the origin of many false-positive detections. As an illustration, Figure 1 shows some images extracted from the well-known ICDAR 2003 database. These images clearly illustrate the above-mentioned challenges of text detection in the wild.
Figure 1: Extracts from the ICDAR 2003 dataset.
Faced with these different constraints, traditional methods for text detection have proven to be insufficient. Thus, training a text detector, although it is not a trivial task, represents a possible path towards good text detection. The state of the art distinguishes between two categories of text detection methods according to the data taken into account (Ye and Doermann, 2015): (1) connected component analysis and (2) sliding window classification.
The first category includes bottom-up methods, which in general proceed by a pre-processing step for noise removal, then a segmentation step to extract character regions, followed by a grouping step for word detection. The Stroke Width Transform (SWT) (Epshtein et al., 2010; Neumann and Matas, 2013) and Maximally Stable Extremal Regions (MSER) (Neumann and Matas, 2012) are two representative methods of this category for scene text detection. The work of Gao et al. (Gao et al., 2014) introduces a bottom-up visual saliency model utilizing color features. The work of Yi et al. (Yi and Tian, 2011) uses color features for the extraction of connected components. Even though the simple implementation behind this overall detection process explains its popularity, it remains very sensitive to the size and shape of characters, background noise and degraded text patterns.
An alternative solution is sliding window based methods, characterized by the use of image patches of different sizes over an image pyramid and the application of text/non-text classifiers (Wang et al., 2011; Lee et al., 2011). Thus, this solution includes top-down methods. In particular, these methods exploit predominant characteristics of text regions (e.g. texture, edge maps) to separate them from the other objects of the background. The classification of text/non-text regions can use either heuristic methods (Garcia and Apostolidis, 2000; Zhong et al., 1995) or machine learning methods (Lienhart and Wernicke, 2002). The efficiency of heuristic methods relies on an adequate choice of features and heuristic filters. In order to improve robustness, machine learning methods investigate relevant features over a given neighborhood, together with a robust classifier trained with machine learning techniques to distinguish textual regions. For example, Anthimopoulos et al. (Anthimopoulos et al., 2013) create hand-crafted features specifically designed for text detection; these are dynamically-normalized edge features which generate local binary patterns within the sliding window image patch. Wang et al. (Wang et al., 2011) use a classifier trained on histogram of oriented gradients features in a sliding window scenario to find characters in an image, grouping them using a pictorial structures model for a fixed lexicon.
The main limitation of these methods becomes noticeable for objects whose structural texture is similar to that of text. Moreover, they remain inefficient in terms of computational cost. Therefore, the focus of a text detection system using machine learning tools is twofold: (1) the choice of a robust learning tool with a low computational cost to perform an efficient classification and (2) the choice of an adequate feature space for a better description of the textual patterns.
Deep learning techniques have recently become a popular trend in machine learning applications. Their main strength is the availability of a learning framework coupling feature extraction and classification. Due to their efficiency, we focus on developing a new texture-based approach for text detection which takes advantage of these techniques. In particular, we propose to work with the structure of the simple convolutional neural network PCANet, extended with a sparse prior. The choice of this architecture is explained by the fact that it alleviates some shortcomings of classic convolutional neural networks, including the high training computational cost, special parameter tuning and technical problems. Furthermore, sparse coding offers an automatic generation of sparse vectors according to a dictionary to encode an input signal. The network is a stack of multiple feature extraction stages, each of which comprises a convolutional filter bank layer and a feature pooling layer, followed by a non-linearity stage. For the filter bank in each convolution layer, we present an unsupervised feature learning scheme in which the filters are automatically learned by PCA. The feature maps generated by the convolutions are aggregated by a pooling layer via a sparse prior, taking benefit of coupled dictionaries to migrate from one resolution level to an adequate lower-resolution level. The predefined dictionaries contain patches of writing patterns extracted from a collection of high-quality character images and thus describe the specificities of writing. The non-linearity stage involves binary hashing and then concatenates block-wise histograms, whose results are finally fed into a trained classifier.
The remainder of the paper is organized as follows: Section 2 presents related works dealing with deep learning based text detection in the wild. Section 3 gives a description of our contribution; an emphasis is made on properly describing both PCANet, the deep learning network architecture, and the sparse coding representation, for a better comprehension of the overall algorithm. After that, Section 4 gives experimental results for the performance evaluation of the proposed method within a comparative study on the well-known ICDAR 2003 database. Section 5 closes this study with overall conclusions and suggestions for future research.
2 RELATED WORKS
The efficiency of deep learning models has been noticeable in various applications of computer vision
like image classification and object detection. In particular, text/non-text classification in natural images is among the most extensively studied applications in the literature. While traditional approaches are based on hand-engineering better sets of features (Srinivas et al., 2016), which are very limited for highly challenging textual scene images, recent studies investigate deep learning, heavily based on convolutional deep neural networks (ConvNets), to compute hierarchical features or representations from raw images. Each neuron is related to a feature, and each subsequent layer generalizes the essential features of the previous one (Krizhevsky et al., 2012; Girshick et al., 2014; Simonyan and Zisserman, 2015; Szegedy et al., 2015).
In general, a deep network architecture comprises (1) a convolutional filter bank layer, (2) a nonlinear processing layer, and (3) a feature pooling layer. For each layer, the learning process can be achieved using one of the well-known learning techniques such as restricted Boltzmann machines, regularized auto-encoders (Ciresan et al., 2011) or their variations (Socher et al., 2011), or the PCANet. The latter is a very simple deep learning network that can offer a valuable baseline for studying more complex deep learning architectures suited for large-scale image applications (Chan et al., 2014).
The underlying assumption for feature design is the use of text/non-text information for training. Therefore, this information is very important to learn a discriminative representation. The deep learning based methods proposed by Wang et al. (Wang et al., 2012), Jaderberg et al. (Jaderberg et al., 2014) and Gupta et al. (Gupta et al., 2016) compute global image features, which lead to an insufficient and non-generic text/non-text representation. For a discriminative feature learning, He et al. (He et al., 2016) propose a text CNN model that particularly involves text-related specific characteristics such as the text region mask, character labels and binary text/non-text information.
Therefore, an efficient deep learning based text detection method must combine discriminative text patterns with a deep architecture of low computational cost.
3 CONTRIBUTIONS
This section details the proposed text detection system, which relies on two of the most popular models in machine learning: deep learning and sparse coding representation. In fact, we propose to combine the merits of the PCANet model and the domain expertise of sparse coding in order to improve the performance of the text detection system with faster training and an adequate feature representation. Therefore, we can learn, in a simple and fast manner, the most appropriate feature representation for text images from data. Figure 2 illustrates an overview of the proposed system. Further details are discussed in the following subsections.
3.1 PCANet: Brief Review
PCANet is an example of a simplified deep learning network. Therefore, it shares with ConvNets many of their fundamental representational and functional properties. In particular, it automatically learns features from a given region of interest, but takes benefit of a more abstract expression of some properties. The convolutional filter bank consists of Principal Component Analysis (PCA) filters. The non-linear layer is the binary hashing (quantization). The pooling layer is the block-wise histogram of the decimal values of the binary vectors. Therefore, the structure of the PCANet model contains the following steps: patch-mean removal, PCA filter convolutions to produce a set of feature maps, binary quantization and mapping, block-wise histograms, and an output classifier (Chan et al., 2014).
Let us denote by $N$ the number of input training images $\{I_i\}_{i=1}^{N}$ of size $m \times n$. For each image, we take around each pixel a $k_1 \times k_2$ patch. Then, the mean is subtracted from each patch to obtain the corresponding feature vector. After that, the filters of the next step are defined by taking into account the largest eigenvalues. For the second stage, the same process is repeated.
The PCANet model includes a number of PCA stages followed by an output stage. The number of these stages can be varied, but a typical PCANet has two stages. The work of (Chan et al., 2014) emphasized that the two-stage PCANet model outperforms the single-stage one in most cases. However, increasing the number of stages does not always improve the final performance. In particular, this study focuses on the two-stage PCANet model.
We note that the output stage contains the operations of binary hashing and block-wise histogram computation. A step function is applied to generate binary values for each patch, and these values are converted to decimal values via binary hashing. The block-wise histogram operates on these decimal values to generate the final output features. These features are then fed into a trained classifier such as an SVM. We suppose that the number of filters obtained in stage 1, after applying PCA to the $N$ training images, is $F_1$, and that the number of filters in stage 2 is $F_2$.
3.2 Sparse Coding: Brief Review
The goal of sparse coding is to automatically construct sparse vectors from the training inputs $Y$. For sparse representation, the following objective function is used to learn a dictionary $D$ and jointly find a sparse linear combination of its basis vectors by minimizing the data reconstruction error:

$D = \arg\min_{D,\alpha} \|Y - D\alpha\|_2^2 + \lambda \|\alpha\|_1$   (1)

where $\alpha$ denotes the sparse vectors that represent the training samples $Y$ according to the dictionary $D$, and $\lambda$ is a constant that controls the trade-off between the sparsity penalty and the fidelity of the approximation to $Y$.
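As a hedged illustration of the coding step in Eq. (1) for a fixed dictionary, the following Python sketch solves the l1-penalized least-squares problem with scikit-learn's Lasso; the dictionary, signal and penalty value are random placeholders rather than quantities from our system.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy setup: a random dictionary D (signal_dim x n_atoms) and a signal y.
# In practice D would be learned from training patches.
rng = np.random.default_rng(0)
signal_dim, n_atoms = 49, 256          # e.g. vectorized 7x7 patches, 256 atoms
D = rng.standard_normal((signal_dim, n_atoms))
D /= np.linalg.norm(D, axis=0)         # unit-norm atoms
y = rng.standard_normal(signal_dim)

# Solve min_a ||y - D a||_2^2 + lam * ||a||_1 (the coding sub-problem of Eq. (1)).
# sklearn's Lasso minimizes (1/(2n)) ||y - D a||^2 + alpha ||a||_1, hence the rescaling.
lam = 0.1
coder = Lasso(alpha=lam / (2 * signal_dim), fit_intercept=False, max_iter=5000)
coder.fit(D, y)
alpha = coder.coef_                    # sparse code of y over D

print("non-zero coefficients:", np.count_nonzero(alpha))
print("reconstruction error:", np.linalg.norm(y - D @ alpha))
```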
In the literature, we have noticed the recourse to coupled sparse feature spaces, namely high-resolution and low-resolution feature spaces, mainly in patch-based super-resolution tasks (Walha et al., 2015; Walha et al., 2014; Yang et al., 2010). These spaces can be seen as the observation space and the latent space, tied together by a mapping function. The underlying assumption is to ensure a collaborative learning between these two coupled dictionaries in order to efficiently reconstruct a signal in the latent feature space given the sparse representation of its corresponding signal in the observation space.
Let us denote by $\{Y^i_{l_j}\}_{j=1,2;\, i=1\ldots C}$ the patch pairs at level 1 ($l_1$) and level 2 ($l_2$) of each cluster $i$, where $C$ is the total number of clusters. The coupled dictionaries $\{D^i_{l_j}\}_{j=1,2;\, i=1\ldots C}$ are thus defined as follows:

$D^i_{l_j} = \arg\min_{D^i_{l_j},\,\alpha} \|Y^i_{l_j} - D^i_{l_j}\alpha\|_2^2 + \lambda \|\alpha\|_1.$   (2)
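Coupled dictionaries of this kind can be obtained in several ways; one common strategy, used for instance in the joint training of Yang et al. (2010), learns a single dictionary over stacked patch pairs and then splits it. The sketch below follows that idea on random stand-in data; the dimensions, penalty and atom count are illustrative assumptions, not the actual settings of Walha et al. (2015).

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def learn_coupled_dictionaries(Y_l1, Y_l2, n_atoms=64, lam=1.0, seed=0):
    """Jointly learn D_l1 and D_l2 so that paired patches share one sparse code."""
    # Stack each level-1 / level-2 patch pair into a single training vector.
    Y = np.hstack([Y_l1, Y_l2])                            # (n_pairs, d1 + d2)
    dl = DictionaryLearning(n_components=n_atoms, alpha=lam,
                            transform_algorithm='lasso_lars',
                            random_state=seed, max_iter=50)
    dl.fit(Y)
    D = dl.components_.T                                   # (d1 + d2, n_atoms)
    d1 = Y_l1.shape[1]
    return D[:d1, :], D[d1:, :]                            # D_l1, D_l2

# Toy usage: 1000 random 7x7 / 4x4 patch pairs standing in for one cluster's data.
rng = np.random.default_rng(0)
Y_l1 = rng.standard_normal((1000, 49))
Y_l2 = rng.standard_normal((1000, 16))
D_l1, D_l2 = learn_coupled_dictionaries(Y_l1, Y_l2, n_atoms=64)
print(D_l1.shape, D_l2.shape)                              # -> (49, 64) (16, 64)
```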
3.3 Proposed PCANet with Sparse Prior
In this section, we propose an approach that combines the simplicity and efficiency of the PCANet architecture with the domain expertise of sparse coding. The main difference, compared with the PCANet model described above, is that the feature maps generated by the convolutions are aggregated by a pooling layer via a sparse prior. This is made possible by the use of coupled dictionaries to migrate from one resolution level to an adequate lower-resolution level. In this case, the training database must be selected carefully for a successful learning process. Attracted by the competitive results of the work of Walha et al. (Walha et al., 2015), we choose to use their training database; even though it was originally collected to enhance the spatial resolution of textual images, we adapt it to the context of our study. In fact, this database is composed of several image patches extracted around character edges in high-quality images. These character images, generated using the FreeType graphic library, encapsulate a variety of sizes, styles and fonts. The high-resolution training set is composed of 124,000 image patches. Figure 3 shows samples of the high-resolution training set.
Algorithm 1: PCANet with sparse prior.

Input: Training patches (ICDAR 2003), coupled dictionaries for level 1 and level 2.
1: Stage I: First feature extraction via PCA
   1.a: Construct the first filter bank for layer 1 via PCA applied to the $N$ training patches.
   1.b: Convolution step: for each training patch, extract $L_1$ feature maps by convolving it with the PCA-based filter bank.
2: Stage II: Feature pooling via sparse prior
   2.a: Represent each feature map via sparse coding using the coupled dictionaries to migrate from one resolution level to an adequate lower-resolution level.
3: Stage III: Second feature extraction via PCA
   3.a: Construct the second filter bank for layer 2 via PCA applied to the pooled feature maps.
   3.b: Convolution step: extract $L_2$ feature maps by convolving the output of the second stage with the second PCA-based filter bank.
4: Stage IV: Feature preparation for the SVM classifier
   4.a: Apply binary hashing.
   4.b: Concatenate block-wise histograms.
Output: Learned features, learned SVM classifier.
As output of Stage I, we obtain $N F_1$ feature maps, where $N$ is the number of input images and $F_1$ is the number of filters of the first stage. Firstly, we extract $l$ patches of size $k_1 \times k_2$ from the $i$-th image, each vectorized into a column vector, with a stride of $s$ pixels. Thus, we collect all patches extracted from the same input image, which forms a matrix of size $(k_1 k_2) \times T$, denoted as $X_i$, where

$T = \left(\left\lceil \frac{m-k_1}{s} \right\rceil + 1\right)\left(\left\lceil \frac{n-k_2}{s} \right\rceil + 1\right);$   (3)

$X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,j}, \ldots, x_{i,T}] \in \mathbb{R}^{(k_1 k_2) \times T}$   (4)

where $x_{i,j} \in \mathbb{R}^{k_1 k_2}$ and $j \in \{1, \ldots, l\}$ denotes the $j$-th vectorized patch in $I_i$.
Figure 2: The four-stage architecture of the proposed PCANet with sparse prior for text detection.

Figure 3: From top to bottom: illustration of the selection process of patch images from a character image, and samples of the writing patterns from the training database (Walha et al., 2015).

In order to improve the feature quality within a neighborhood and to remove the noise corresponding to the small eigenvalues of the data covariance matrix, we subtract the mean value of the corresponding patch from each column vector in the matrix $X_i$ and obtain the following matrix:

$\bar{X}_i = [\bar{x}_{i,1}, \bar{x}_{i,2}, \ldots, \bar{x}_{i,j}, \ldots, \bar{x}_{i,T}] \in \mathbb{R}^{(k_1 k_2) \times T}$   (5)

where $\bar{x}_{i,j}$ is a mean-removed patch.
Once the same process is applied to all input images, we regroup the resulting matrices to form a large matrix defined as:

$X = [\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_N] \in \mathbb{R}^{(k_1 k_2) \times N m n}$   (6)
The eigenvectors associated with the largest eigenvalues of the covariance matrix are then extracted. This step is performed on the matrix $XX^T$. Therefore, we take as convolutional filters the $F_1$ principal eigenvectors of $XX^T$.
The learned PCA filters can be expressed as follows:

$W^1_f = \mathrm{mat}_{k_1 \times k_2}\!\left(q_f(XX^T)\right) \in \mathbb{R}^{k_1 \times k_2}, \quad f = 1, 2, \ldots, F_1$   (7)

where $\mathrm{mat}_{k_1 \times k_2}(v)$ is a function that maps a vector $v$ to a matrix $W \in \mathbb{R}^{k_1 \times k_2}$, and $q_f(XX^T)$ designates the $f$-th principal eigenvector of the matrix $XX^T$. These principal eigenvectors capture the main variation of all the mean-removed training patches, and we obtain $F_1$ filters of size $k_1 \times k_2$.
After that, we convolve each input image with the learned filters to generate filter responses at each pixel location, namely the filtering feature maps. Each feature map, extracted at the corresponding location in the image, represents particular features.
Let the $f$-th filter output of the first stage be:

$I^f_i = I_i \ast W^1_f, \quad i = 1, 2, \ldots, N$   (8)

where $\ast$ denotes the 2D convolution and $I_i$ is padded with zeros before convolution.
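To make Stage I concrete, the minimal numpy/scipy sketch below extracts mean-removed patches (Eqs. (3)-(6)), learns the PCA filters as the leading eigenvectors of $XX^T$ (Eq. (7)) and convolves the zero-padded images with them (Eq. (8)); the image size, patch size, stride and number of filters are illustrative values, and the helper names are ours, not part of the original implementation.

```python
import numpy as np
from scipy.signal import convolve2d

def extract_patches(img, k, s):
    """Collect all k x k patches with stride s, vectorized as columns (Eq. (4))."""
    m, n = img.shape
    cols = [img[r:r + k, c:c + k].ravel()
            for r in range(0, m - k + 1, s)
            for c in range(0, n - k + 1, s)]
    return np.stack(cols, axis=1)                      # shape (k*k, T)

def learn_pca_filters(images, k=7, s=1, n_filters=8):
    """PCA filter bank: leading eigenvectors of X X^T on mean-removed patches (Eq. (7))."""
    X = np.hstack([extract_patches(im, k, s) for im in images])
    X = X - X.mean(axis=0, keepdims=True)              # patch-mean removal (Eq. (5))
    eigval, eigvec = np.linalg.eigh(X @ X.T)           # (k*k, k*k) symmetric matrix
    order = np.argsort(eigval)[::-1][:n_filters]       # most energetic eigenvalues first
    return [eigvec[:, f].reshape(k, k) for f in order] # F_1 filters of size k x k

def convolve_bank(img, filters):
    """Zero-padded 2D convolution of an image with each PCA filter (Eq. (8))."""
    return [convolve2d(img, W, mode='same', boundary='fill', fillvalue=0)
            for W in filters]

# Toy usage on random 32 x 32 'images' standing in for training patches.
rng = np.random.default_rng(0)
images = [rng.standard_normal((32, 32)) for _ in range(10)]
W1 = learn_pca_filters(images, k=7, s=1, n_filters=8)
feature_maps = convolve_bank(images[0], W1)            # F_1 maps for the first image
print(len(feature_maps), feature_maps[0].shape)        # -> 8 (32, 32)
```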
For Stage II, we apply a pooling layer via a sparse prior that represents the feature maps using coupled dictionaries. This stage recovers a patch $J^f_i$ at a lower resolution level from an input patch $I^f_i$ at the current resolution level. Given the learned coupled dictionaries $\{D_{l_1}, D_{l_2}\}$ (Walha et al., 2015), sparse coding generates the sparse vector that encodes an input patch $I^f_i$ over the dictionary $D_{l_1}$. This can be mathematically formulated as:

$\alpha = \arg\min_{\alpha} \|I^f_i - D_{l_1}\alpha\|_2^2$   (9)

where $\alpha$ is the sparse representation of $I^f_i$ over $D_{l_1}$.
Then, we select only the representation $\hat{\alpha}$ that minimizes the local reconstruction error according to the appropriate dictionary $\hat{D}_{l_1}$:

$\hat{\alpha} = \arg\min_{\alpha} \|I^f_i - \hat{D}_{l_1}\alpha\|_2^2$   (10)
The optimal solution $\hat{\alpha}$ is then used to generate the patch $J^f_i$ at the lower resolution level from the corresponding dictionary $D_{l_2}$, based on $J^f_i = D_{l_2}\,\hat{\alpha}$.
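A hedged sketch of this pooling step (Eqs. (9)-(10) and the reconstruction $J^f_i = D_{l_2}\hat{\alpha}$) is given below: a vectorized feature-map patch is encoded over a level-1 dictionary and decoded with the coupled level-2 dictionary. The dictionaries here are random placeholders; the actual system uses the coupled dictionaries of Walha et al. (2015) and selects the best cluster, which is omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_pool(patch_vec, D_l1, D_l2, lam=0.1):
    """Encode a level-1 patch over D_l1 (Eq. (9)) and map it to level 2 via D_l2."""
    coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    coder.fit(D_l1, patch_vec)
    alpha_hat = coder.coef_            # sparse code of the input patch
    return D_l2 @ alpha_hat            # lower-resolution patch J = D_l2 * alpha_hat

# Placeholder coupled dictionaries: 7x7 patches at level 1, 4x4 patches at level 2,
# sharing the same sparse code (256 atoms); assumed sizes for illustration only.
rng = np.random.default_rng(0)
n_atoms = 256
D_l1 = rng.standard_normal((49, n_atoms))   # level-1 atoms (7*7 = 49)
D_l2 = rng.standard_normal((16, n_atoms))   # level-2 atoms (4*4 = 16)
D_l1 /= np.linalg.norm(D_l1, axis=0)
D_l2 /= np.linalg.norm(D_l2, axis=0)

patch = rng.standard_normal(49)             # a vectorized 7x7 feature-map patch
low_res_patch = sparse_pool(patch, D_l1, D_l2).reshape(4, 4)
print(low_res_patch.shape)                  # -> (4, 4)
```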
For Stage III, we apply the same process as in the first stage. Firstly, we collect
all the overlapping patches of $J^f_i$. Then, we subtract the patch mean from each vectorized patch and join all the resulting vectors to form a matrix denoted as

$Y^f_i = [y_{i,f,1}, y_{i,f,2}, \ldots, y_{i,f,j}, \ldots, y_{i,f,T}] \in \mathbb{R}^{(k_1 k_2) \times T}$   (11)

where $y_{i,f,j}$ is a mean-removed patch. We further collect the mean-removed patches of the $f$-th filter output over all images, and concatenate the matrices $Y^f_i$ into

$Y^f = \left[Y^f_1, Y^f_2, \ldots, Y^f_N\right] \in \mathbb{R}^{(k_1 k_2) \times N m n}$   (12)
The PCA filters of this stage are learned and can be expressed as follows:

$W^2_f = \mathrm{mat}_{k_1 \times k_2}\!\left(q_f(YY^T)\right) \in \mathbb{R}^{k_1 \times k_2}, \quad f = 1, 2, \ldots, F_2$   (13)
For each input $J^f_i$ of the third stage, we take the first $L_2$ principal eigenvectors as PCA filters and convolve with $W^2_f$ for $f = 1, 2, \ldots, F_2$:

$V^f_i = \left\{ J^f_i \ast W^2_f \right\}_{f=1}^{L_2}$   (14)
The number of feature maps of the third stage, i.e. the output of the second feature extraction, is $F_1 N F_2$.
Stage IV is the output stage: we use binary hashing and histogram statistics, which operate on the feature maps to form the final representation of the input image, as in PCANet (Chan et al., 2014). Each of the $F_1$ feature maps input to the second PCA stage yields $F_2$ feature maps as outputs $\{J^f_i \ast W^2_f\}_{f=1}^{F_2}$. We binarize these output maps by applying $\{H(J^f_i \ast W^2_f)\}_{f=1}^{F_2}$, where $H(\cdot)$ is the Heaviside step function, whose value is one for positive entries and zero otherwise.

Around each pixel, we consider the vector of $F_2$ binary bits as a decimal number. This converts the $F_2$ outputs produced in the second PCA stage back into a single integer-valued image.
For each of the $F_1$ integer-valued images, we partition it into $B$ blocks. We compute the histogram of the decimal values in each block, and concatenate all the $B$ histograms into one vector. After this encoding process, the feature of the input image $I_i$ becomes the set of block-wise histograms. The local blocks can be either overlapping or non-overlapping, depending on the application. The output is then fed into a trained classifier.
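The output encoding can be sketched as follows: the $F_2$ binarized maps of one group are fused into a single integer-valued image and summarized by block-wise histograms. The map size, block size and number of maps below are illustrative assumptions.

```python
import numpy as np

def binary_hash(maps):
    """Fuse F_2 binarized maps into one integer image: sum_f 2^f * H(map_f)."""
    bits = [(m > 0).astype(np.int64) for m in maps]           # Heaviside step H(.)
    return sum((2 ** f) * b for f, b in enumerate(bits))      # values in [0, 2^F2 - 1]

def blockwise_histogram(int_img, block=8, n_bins=256):
    """Histogram of the decimal values in each non-overlapping block, concatenated."""
    h, w = int_img.shape
    hists = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            blk = int_img[r:r + block, c:c + block]
            hists.append(np.bincount(blk.ravel(), minlength=n_bins))
    return np.concatenate(hists)                              # final feature vector

# Toy usage: 8 second-stage maps of size 32 x 32 for one first-stage map.
rng = np.random.default_rng(0)
maps = [rng.standard_normal((32, 32)) for _ in range(8)]
feature = blockwise_histogram(binary_hash(maps), block=8, n_bins=2 ** 8)
print(feature.shape)                                          # -> (4096,) = 16 blocks x 256 bins
```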
4 EXPERIMENTAL RESULTS
4.1 Experimental Framework
4.1.1 Datasets and Evaluation Protocols
Our experiments were implemented on the Matlab 2015b platform and run on a 64-bit Microsoft Windows 10 machine powered by an Intel(R) Core(TM) i7-4750HQ CPU at 2.00 GHz with 8 GB of RAM.

In this section, we choose the ICDAR 2003 dataset, with its challenging scene text images, to evaluate our text detection method. This dataset, proposed by S. M. Lucas et al., contains 507 natural scene images in total; 258 images are prepared for training and 249 images for testing. Most images vary in size from 600 × 450 to 1280 × 960 and contain about 4 text regions per image on average.
To evaluate text detection, we used three important performance metrics: Precision (P), Recall (R) and F-Measure (FM). Precision is the ratio between the successfully extracted text regions (TP) and all detected regions (E). Recall is the ratio of the successfully extracted text regions (TP) to the ground-truth regions (T). The F-Measure summarizes the performance of an algorithm by combining the previous two measures. The metrics are defined as follows:

$P = |TP| / |E|$   (15)

$R = |TP| / |T|$   (16)

$FM = \frac{2 \times P \times R}{P + R}$   (17)
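For completeness, the three metrics can be computed as in the minimal sketch below; the region counts are placeholder values, not results from our experiments.

```python
def detection_metrics(n_tp, n_detected, n_ground_truth):
    """Precision, Recall and F-Measure as defined in Eqs. (15)-(17)."""
    precision = n_tp / n_detected
    recall = n_tp / n_ground_truth
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Example with placeholder counts (not taken from our experiments).
p, r, fm = detection_metrics(n_tp=90, n_detected=95, n_ground_truth=100)
print(round(p, 3), round(r, 3), round(fm, 3))   # -> 0.947 0.9 0.923
```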
4.1.2 Experimental Setup
Our text detection system is based on the PCANet architecture and on a sparse coding representation. We use samples of the ICDAR 2003 dataset to learn PCA filters, with $L_1$ filters in the first stage and $L_1$ groups in the third stage, each group containing $L_2$ filters. To extract the features, we adopt a convolution layer that convolves each 32 × 32 input image, using a patch size of $k_1 \times k_2$, with the learned PCA-based filters to produce a set of feature maps. To train the dictionary, the number of atoms varies between 128 and 1024. For the classification layer, these outputs are fully connected to train the network by back-propagating the $L_2$-SVM classification error.
For the experimental study, we set the number of filters to $L_1 = L_2 = 8$, inspired by the common setting of Gabor filters with 8 orientations. The learned filters are shown in Figure 4.
4.2 Performance Evaluation
Figure 4: Filters learned on the ICDAR 2003 dataset. (a) The 8 filters of the first stage; (b) the 8 groups of the second stage, each group containing 8 filters, shown in a column.

In this section, we evaluate the performance of the proposed system on the ICDAR 2003 Robust Reading dataset. For text detection, we generate a two-way text/non-text classification dataset by cropping patches from this dataset.
4.2.1 Impact of the Patch Size
We study the influence of the patch size $y_1$ of the first stage, varied between 7 and 11 pixels. To account for the overlap between adjacent patches, scanned from left to right over the input image, the patch size $y_2$ of the third stage is derived with the following formula:

$y_2 = \frac{y_1 + 1}{2}.$   (18)

For the ICDAR 2003 dataset, Table 1 shows that the best patch sizes are 7 and 4 for Stage I and Stage III, respectively.
Table 1: Impact of the patch size in Stage I and Stage III of our proposition: case study on the ICDAR 2003 database.

Patch size (Stage I / Stage III)    P       R       FM
7 / 4                               0.975   0.965   0.97
9 / 5                               0.966   0.958   0.962
11 / 6                              0.96    0.962   0.961
4.2.2 Impact of the Number of Atoms
In this subsection, the patch size is fixed to 7 × 7 for Stage I and 4 × 4 for Stage III. We study the relationship between the classification performance and the number of dictionary atoms. To generate the dictionary, we applied the K-means clustering algorithm to the patches. We used the dictionaries of Walha et al. (Walha et al., 2015) to build the two coupled dictionaries, one for the original resolution level and one for the lower-resolution level. Table 2 shows that a number of atoms equal to 256 gives better results than the other settings.

Table 2: Impact of the number of atoms for our proposition: case study on the ICDAR 2003 database.

Number of atoms   P       R       FM
128               0.97    0.962   0.966
256               0.975   0.965   0.97
512               0.976   0.96    0.968
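A minimal sketch of this dictionary-generation step is given below, assuming the atoms are taken as normalized K-means centroids of vectorized patches; the patch data is random and only one of the tested atom counts is shown.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_dictionary(patches, n_atoms=256, seed=0):
    """Cluster vectorized patches and use the centroids as dictionary atoms."""
    km = KMeans(n_clusters=n_atoms, n_init=10, random_state=seed).fit(patches)
    D = km.cluster_centers_.T                    # shape (patch_dim, n_atoms)
    return D / np.linalg.norm(D, axis=0)         # unit-norm atoms

# Toy usage: 5000 random vectorized 7x7 patches standing in for training patches.
rng = np.random.default_rng(0)
patches = rng.standard_normal((5000, 49))
D = kmeans_dictionary(patches, n_atoms=256)
print(D.shape)                                   # -> (49, 256)
```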
4.2.3 Comparative Methods Study
We compared our proposed text detection system to state-of-the-art methods using the comparative metrics defined above.

Table 3: Comparison of different text detection methods: case study on the ICDAR 2003 database.

Algorithm              P      R      FM
Epshtein et al.        0.73   0.60   0.66
Yi and Tian            0.71   0.62   0.62
Huang et al.           0.81   0.74   0.72
Basic PCANet           0.95   0.95   0.95
PCANet/Sparse prior    0.97   0.96   0.97

According to the Precision, Recall and F-Measure values obtained by the different algorithms tested on the ICDAR 2003 dataset, including the basic PCANet (without sparse prior), the modified PCANet (with sparse prior) and the works of (Epshtein et al., 2010; Yi and Tian, 2011; Huang et al., 2013), Table 3 shows that the modified variant of PCANet with sparse prior achieves the best text detection results.
5 CONCLUSIONS
We proposed a novel approach which takes advantage of unsupervised deep learning. More precisely, we combine the simplicity and efficiency of the PCANet architecture with the domain expertise of a sparse prior. Our proposition provides competitive results compared with state-of-the-art methods and opens prospects for further improvements.

Further investigations will focus on the integration of a dedicated feature to define the filter numbers automatically. The study of the neighborhood of a given image block will also be integrated to improve the overall detection performance of the system.
REFERENCES
Anthimopoulos, M., Gatos, B., and Pratikakis, I. (2013).
Detection of artificial and scene text in images and
video frames. Pattern Anal. Appl., 16(3):431–446.
Chan, T. H., Jia, K., Gao, S., Lu, J., Zeng, Z., and Ma, Y. (2014). PCANet: A simple deep learning baseline for image classification. IEEE Trans. on Image Processing, 24(12):5017–5032.
Ciresan, D. C., Meier, U., Masci, J., Gambardella, L. M., and Schmidhuber, J. (2011). High performance neural networks for visual object classification. Technical Report IDSIA-01-11.
Epshtein, B., Ofek, E., and Wexler, Y. (2010). Detect-
ing text in natural scenes with stroke width trans-
form. IEEE Computer Vision and Pattern Recogni-
tion, pages 2963–2970.
Gao, R., Uchida, S., Shahab, A., Shafait, F., and Frinken, V. (2014). Visual saliency models for text detection in real world. PLoS ONE, 9(12):e114539.
Garcia, C. and Apostolidis, X. (2000). Text detection and
segmentation in complex color images. In Proc. Int.
Conf. on Acoustics, Speech and Signal Processing,
pages 2326–2329.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. IEEE Computer Vision and Pattern Recognition.
Gupta, A., Vedaldi, A., and Zisserman, A. (2016). Synthetic data for text localisation in natural images. IEEE Computer Vision and Pattern Recognition.
He, T., Huang, W., Qiao, Y., and Yao, J. (2016). Text-
attentional convolutional neural network for scene
text detection. IEEE Trans. Image Processing, pages
2529–2541.
Huang, W., Lin, Z., Yang, J., and Wang, J. (2013). Text lo-
calization in natural images using stroke feature trans-
form and text covariance descriptors. IEEE Int. Conf.
on Computer Vision, pages 1241–1248.
Jaderberg, M., Vedaldi, A., and Zisserman, A. (2014). Deep
features for text spotting. European Conf. on Com-
puter Vision.
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). Im-
agenet classification with deep convolutional neural
networks. Neural Information Processing Systems.
Lee, J.-J., Lee, P.-H., Lee, S.-W., Yuille, A., and Koch, C.
(2011). Adaboost for text detection in natural scene.
In Int. Conf. on Document Analysis and Recognition,
pages 429–434.
Lienhart, R. and Wernicke, A. (2002). Localizing and seg-
menting text in images and videos. IEEE Trans. on
Circuits and Systems for Video Technology, 12:256–
268.
Neumann, L. and Matas, J. (2012). Real-time scene text
localization and recognition. IEEE Computer Vision
and Pattern Recognition, pages 3538–3545.
Neumann, L. and Matas, J. (2013). Scene text localization
and recognition with oriented stroke detection. IEEE
Int. Conf. on Computer Vision, pages 97–104.
Simonyan, K. and Zisserman, A. (2015). Very deep con-
volutional networks for large-scale image recognition.
Int. Conf. on Learning Representation.
Socher, R., Pennington, J., Huang, E., Ng, A., and Man-
ning, C. (2011). Semi-supervised recursive autoen-
coders for predicting sentiment distributions. In Conf.
on Empirical Methods in Natural Language Process-
ing, pages 151–161.
Srinivas, S., Sarvadevabhatla, R., Mopuri, K., Prabhu, N.,
Kruthiventi, S., and Radhakrishnan, V. (2016). A tax-
onomy of deep convolutional neural nets for computer
vision. Frontiers in Robotics and AI.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., and Reed, S.
(2015). Going deeper with convolutions. Computer
Vision and Pattern Recognition.
Walha, R., Drira, F., Lebourgeois, F., Garcia, C., and Alimi,
A. (2014). Sparse coding with a coupled dictionary
learning approach for textual image super-resolution.
Int. Conf. on Pattern Recognition, pages 4459–4464.
Walha, R., Drira, F., Lebourgeois, F., Garcia, C., and Al-
imi, A. (2015). Resolution enhancement of textual
images via multiple coupled dictionaries and adaptive
sparse representation selection. Int. Journal of Docu-
ment Analysis and Recognition, 18(1):87–107.
Wang, K., Babenko, B., and Belongie, S. (2011). End-to-
end scene text recognition. In Int. Conf. on Computer
Vision, pages 1457–1464.
Wang, T., Wu, D. J., Coates, A., and Ng, A. Y. (2012). End-
to-end text recognition with convolutional neural net-
works. Int. Conf. on Pattern Recognition, pages 3304–
3308.
Yang, J., Wright, J., Huang, T., and Ma, Y. (2010). Im-
age super-resolution via sparse representation. IEEE
Trans. Image Process, 19(11):2861–2873.
Ye, Q. and Doermann, D. (2015). Text detection and recog-
nition in imagery: A survey. IEEE Trans. on Pattern
Analysis and Machine Intelligence, 37(7):1480–1500.
Yi, C. and Tian, Y. (2011). Text string detection from natural scenes by structure-based partition and grouping. IEEE Trans. on Image Processing, 20(9):2594–2605.
Zhong, Y., Karu, K., and Jain, A. (1995). Locating text
in complex color images. Pattern Recognition, pages
1523–1536.