Towards the Rectification of Highly Distorted Texts
Stefania Calarasanu¹, Séverine Dubuisson² and Jonathan Fabrizio¹
¹EPITA-LRDE, 14-16, Rue Voltaire, F-94276, Le Kremlin Bicêtre, France
²Sorbonne Universités, UPMC Univ Paris 06, CNRS, UMR 7222, ISIR, F-75005, Paris, France
Keywords: Text Rectification, Text in Perspective.
Abstract: A frequent challenge for many text understanding systems is to tackle the variety of text characteristics in born-digital and natural scene images, to which current OCRs are not well adapted. For example, texts in perspective are frequently present in real-world images, but despite the ability of some detectors to accurately localize such text objects, the recognition stage fails most of the time. Indeed, most OCRs are not designed to handle text strings in perspective, but rather expect horizontal texts in a fronto-parallel plane to provide a correct transcription. In this paper, we propose a rectification procedure that can correct highly distorted texts subject to rotation, shearing and perspective deformations. The method is based on an accurate estimation of the quadrangle bounding the deformed text, which is used to compute a homography that transforms this quadrangle (and its content) into a horizontal rectangle. The rectification is validated on the dataset proposed during the ICDAR 2015 Competition on Scene Text Rectification.
1 INTRODUCTION
Retrieving textual information from born-digital and real-scene images can often be a challenging task due to the variety of text properties (color, size, font, orientation), but also to external causes, such as lighting conditions (shadows, specularity, reflections, etc.), cluttered backgrounds, possible occlusions, poor image resolution and quality, or situations where the text plane is not parallel to the camera plane. These circumstances affect not only the text detection process but also the text recognition stage. Unfortunately, most current OCRs perform poorly when recognizing curved, inclined, vertical or perspective-distorted texts. Such text examples are illustrated in Fig. 1.

Figure 1: Examples of real scene images with deformed texts from the ICDAR 2015 Competition on Scene Text Rectification dataset (Liu and Wang, 2015).

Our contribution is a rectification method that can simultaneously correct rotation, shearing and perspective deformations. It uses a homography that maps the image coordinates onto the world coordinate system and brings the deformed texts to a fronto-parallel view. The validation of this method is done on a recent dataset, used during the ICDAR 2015 Competition on Scene Text Rectification (Liu and Wang, 2015). It contains a very large number of challenging texts, extracted from synthetic and real scenes, with different transformations. The paper is organized as follows: we first give a brief overview of the state of the art in Sec. 2, then describe the entire rectification procedure in Sec. 3 and present the experimental results in Sec. 4. Concluding remarks and perspectives are given in Sec. 5.
2 RELATED WORK
In the literature, the problem of managing distorted texts has been handled in different ways. Some works ((Santosh and Wendling, 2015), (Almazan et al., 2013)) tackled it by proposing powerful recognition stages capable of managing distorted characters. In contrast, many works first rectify the distortions, then recognize the text. A first category of approaches relies on feature learning but, when texts are strongly distorted, these methods fail to provide a correct transcription. In such cases, a rectification procedure is a better alternative. Some text rectification procedures also target specific cases, such as multi-oriented, italic or perspective texts. Some
works have exclusively focused on rectifying italic
texts to enhance the performance of OCRs that have
difficulties in providing an accurate transcription of
sheared texts. The authors in (Zhang et al., 2004)
proposed an approach based on the statistical analysis
of stroke patterns extracted from the wavelet decom-
position of text images. In (Fan and Huang, 2005), the authors introduced a method that rectifies italic texts
using a shear transform. First, the characters are clas-
sified into three classes of angles. Then, the shear an-
gle is determined differently for each character based
on its corresponding italic class. Perspective recov-
ery needs to be applied when the camera optical axis
is not perpendicular to the text plane. When a text is in perspective, the characters change their original structure, which makes the OCR perform poorly and produce low accuracy scores. However, a series of works proposed recognition modules capable
of identifying oriented characters or texts in perspec-
tive. The authors in (Lu and Tan, 2006) proposed a
recognition technique capable of recognizing charac-
ters in perspective by extracting perspective invariant
features, such as character ascenders and descenders
or number of centroid intersections. Cross ratio spec-
trum and Dynamic Time Warping techniques were
employed during the recognition process in (Li and
Tan, 2008), (Zhou et al., 2009). In (Phan et al., 2013)
SIFT features were extracted to recognize texts in
perspective in different orientations. To correct the
perspective distortion, many works rely on the ho-
mography transformation (Myers et al., 2005), (Ye
et al., 2007), (Cambra and Murillo, 2011), (Kiran
and Murali, 2013). In (Ye et al., 2007), the rectifi-
cation is done based on a correlation between a set
of feature points and a plane-to-plane homography
transformation. The extension of this work, presented
in (Cambra and Murillo, 2011), consists of an opti-
mization of the parameters of the homography. The
method in (Kiran and Murali, 2013) involves a first stage where text borders are captured using a geometry-based segmentation; feature points are then extracted using the Harris corner detector. The authors in (Merino-Gracia et al., 2013) performed a parallel rectification using a homography and a shearing transform. The method first performs horizontal foreshortening correction by detecting the upper and lower lines bounding the text region. Next, vertical foreshortening and shearing are corrected using a linear regression based on the shear variation of the characters. The authors in (Chen et al., 2004) used an affine transformation to correct the perspective deformations, but the method requires the camera parameters to be known.
Such an assumption was also required in the work
in (Clark et al., 2001). Borderline analysis was employed in (Ferreira et al., 2005). The main problem of these approaches is that they rely on the hypothesis that text regions are bounded by rectangles. The
work in (Zhang et al., 2013) used the Transform Invariant Low-rank Textures (TILT) algorithm to rectify English and Chinese characters and digits. The method presented in (Busta et al., 2015) proposed a skew text rectification in real scene images based on five skew estimators used for character segmentation (or polygon approximation): vertical dominant, vertical dominant on convex hull, longest edge, thinnest profile and symmetric glyph. In (Myers et al., 2005), the au-
thors used a projective transformation to correct text
in perspective. The parameters used for the rectifi-
cation are derived from a series of features extracted
from each text line, such as top and baselines of a
text, or the dominant vertical direction of character
strokes. In (Yonemoto, 2014) a correction method
based on a quadrangle estimation is proposed, which
supposes that the text contains a sufficient number
of horizontal and vertical strokes. Authors in (Hase
et al., 2001) proposed a generic method to correct in-
clined, curved and distorted texts. Text is first clas-
sified with respect to the alignment and distortion of
its characters, then different types of corrections are
applied. A rectification approach for license plate images was proposed in (Deng et al., 2014), using the Hough transform and different types of projections.
The method, based on finding parallel lines, consists
of applying two transformations: a horizontal tilt and
a vertical shear transform. Many of the approaches discussed above operate on individual text lines. Some works proposed rectification algorithms
on whole documents. The work in (Stamatopoulos
et al., 2011) targeted the rectification of distorted doc-
uments. It performs a curved surface projection, a
word baseline fitting and a horizontal alignment. The
authors in (Liang et al., 2008) proposed a rectifica-
tion method for planar and curved documents by esti-
mating 3D document shapes from texture flow infor-
mation. Contrary to the previously cited works, which follow the same global approach of applying an affine transformation for perspective correction followed by a shearing rectification, our method has the advantage of using a single projective transformation (a homography) that can rectify, individually or simultaneously, texts subject to rotation, shearing and perspective deformations. This transformation is obtained from a very accurate quadrangle estimation of the distorted text, described in the next section.
3 PROPOSED METHOD
Let us consider a text string as a set of $N$ characters defined as $C = \{C_i\}_{i=1..N}$, where $C_i$ is the individual connected component (CC) corresponding to the $i$-th character. We call $G = \{G_i\}_{i=1..N}$ the set of the centroids corresponding to each $C_i$ in $C$. Similarly, we denote by $W = \{W_i\}_{i=1..N}$ the set of the weighted centroids that belong to each $C_i$ in $C$. We classify the CCs into two categories: extremity CCs, corresponding to the first ($C_{e1}$) or last ($C_{e2}$) characters of the text string (in reading order); and inner CCs, corresponding to any of the characters located between the two extremity CCs.
3.1 Overview of the Rectification Process
The perspective rectification process relies on finding a homography matrix that rectifies the text image. This involves several stages. The method first applies a CC filtering, described in Sec. 3.2, during which punctuation signs and the dots over some characters are temporarily removed. The filtering is followed by an extremity CC identification procedure, discussed in Sec. 3.3, which aims at identifying the first and last characters of a text string. The process then precisely estimates a quadrangle that bounds the distorted text (see Sec. 3.4). The four points that define the quadrangle are then used to compute the homography matrix. Finally, this homography matrix is used to map all the points of the deformed text onto a fronto-parallel plane, as explained in Sec. 3.5.
3.2 Connected Component Filtering
Before applying the rectification, we need to filter the CCs and remove punctuation marks such as “.”, “,” or “:”, as well as the dots over characters such as “i” and “j”. Such a removal is needed because the entire text correction is based on the relative position of each CC with respect to the other ones. We define $l_d^i$ the length of the diagonal of the box bounding $C_i$, computed as $l_d^i = \sqrt{h_{C_i}^2 + w_{C_i}^2}$, where $h_{C_i}$ and $w_{C_i}$ are respectively the height and width of the bounding box of $C_i$. Hence, $C_i$ is kept during the filtering procedure as long as its diagonal satisfies the constraint $l_d^i > l_d^{av} \cdot T_{pt}$, with $l_d^{av} = \frac{1}{N}\sum_{i=1}^{N} l_d^i$, where $l_d^{av}$ is the average of all diagonal lengths and $T_{pt}$ is a threshold that was experimentally set to 0.35. This constraint removes all CCs whose diagonal is considerably smaller than the average diagonal.
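As an illustration, here is a minimal sketch of this filtering step in Python (assuming each CC is available as a bounding box; the helper name and box format are hypothetical):

import math

# Hypothetical CC bounding boxes (x, y, width, height) of a text string.
T_PT = 0.35  # threshold from the paper, experimentally set

def filter_ccs(boxes, t_pt=T_PT):
    # Keep only CCs whose bounding-box diagonal exceeds t_pt times the average diagonal.
    diagonals = [math.hypot(w, h) for (_, _, w, h) in boxes]
    avg_diag = sum(diagonals) / len(diagonals)
    return [box for box, d in zip(boxes, diagonals) if d > avg_diag * t_pt]

# Example: the small third box (e.g. the dot of an "i") is removed.
boxes = [(0, 0, 20, 30), (25, 0, 18, 28), (48, 25, 4, 4)]
print(filter_ccs(boxes))  # -> [(0, 0, 20, 30), (25, 0, 18, 28)]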
Figure 2: Centroids and reference line fitting using LSM: (a) weighted centroids (red); (b) the reference line that best fits the weighted centroids (yellow).
3.3 Extremity Connected Components
After the CC filtering, we need to find the two extremity CCs, i.e. those corresponding to the first and last characters. This requires the steps described below.

Fitting the Reference Line. An approximation of the text orientation is obtained by using the Least-Squares Method (LSM) to fit a reference line, called $L_{Ref}$, to the set of weighted centroids $W$. The slope of this line gives an approximation of the text orientation. Fig. 2 shows examples of centroids and a reference line for the text string “International”.
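A minimal sketch of this fit (a plain least-squares polynomial fit via NumPy stands in for whatever solver is used; it assumes the text is not strictly vertical, so a finite slope exists):

import numpy as np

def fit_reference_line(weighted_centroids):
    # Fit y = m*x + b by least squares; the slope m approximates
    # the text orientation.
    pts = np.asarray(weighted_centroids, dtype=float)
    m, b = np.polyfit(pts[:, 0], pts[:, 1], deg=1)
    return m, b

# Example: centroids of a slightly inclined text string (image y-axis points down).
m, b = fit_reference_line([(10, 52), (30, 49), (50, 47), (70, 44)])
print(f"slope = {m:.3f}")  # negative slope: text rises from left to right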
Finding the Two Extremity CCs. First, we identify the two closest neighbors of each CC $C_i$, denoted $C_i^{n1}$ and $C_i^{n2}$. If $C_i$ is the first extremity, its two nearest neighbors are the two following characters. If $C_i$ is the last extremity, they are its two preceding characters. If $C_i$ is an inner CC, its two closest neighbors are its predecessor and its successor. Let $W_i^{n1}$ and $W_i^{n2}$ be the weighted centroids of the two neighbors of $C_i$. We then define $l_i^{n1}$ and $l_i^{n2}$ the lines that join $W_i$ to $W_i^{n1}$ and $W_i^{n2}$ respectively: $l_i^{n1} = (W_i^{n1}, W_i)$ and $l_i^{n2} = (W_i^{n2}, W_i)$. Let $\theta_i$ be the orientation angle of $C_i$, computed as $\theta_i = \mathrm{angle}(l_i^{n1}, l_i^{n2})$. All CCs for which this angle is smaller than 45 degrees are selected as extremity CC candidates. If more than two CCs satisfy this constraint, we compute the distance between each pair of candidate CCs; the pair whose centroids are farthest apart is identified as the two extremities $C_{e1}$ and $C_{e2}$, with $e1, e2 \in [1, N]$. This stage is illustrated in Fig. 3(a) and 3(b); a sketch of the candidate test follows.
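A sketch of this candidate test on centroid coordinates (the helper below is hypothetical and uses simple vector arithmetic for the angle):

import math
from itertools import combinations

def extremity_ccs(centroids):
    # Return the indices of the two extremity CCs among the weighted centroids.
    # A CC is a candidate when the angle between the lines to its two nearest
    # neighbors is below 45 degrees (both neighbors lie on the same side).
    def angle_at(i):
        n1, n2 = sorted((j for j in range(len(centroids)) if j != i),
                        key=lambda j: math.dist(centroids[i], centroids[j]))[:2]
        v1 = (centroids[n1][0] - centroids[i][0], centroids[n1][1] - centroids[i][1])
        v2 = (centroids[n2][0] - centroids[i][0], centroids[n2][1] - centroids[i][1])
        cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

    candidates = [i for i in range(len(centroids)) if angle_at(i) < 45]
    # If more than two candidates remain, keep the farthest-apart pair.
    return max(combinations(candidates, 2),
               key=lambda p: math.dist(centroids[p[0]], centroids[p[1]]))

print(extremity_ccs([(0, 0), (10, 1), (20, 2), (30, 3)]))  # -> (0, 3)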
Figure 3: The procedure for finding the extremity CCs: (a) the angle between the lines (in green) joining the centroid of an extremity CC and the centroids of its two closest neighbors; (b) the angle between the lines (in magenta) joining the centroid of an inner CC (“a”) and the centroids of its two closest neighbors (“n” and “t”).

Identifying the First and Last Extremities. Based on the slope $m(L_{Ref})$ of $L_{Ref}$, we can identify which of the two extremities $C_{e1}$ and $C_{e2}$ is the first and which is the last, as detailed below. If $m(L_{Ref}) \in [-0.1, 0.1]$, the text is considered horizontal and we determine the first and last characters from the x-coordinates of the weighted centroids of the two CCs. If $m(L_{Ref}) < -0.1$, the text is inclined from bottom-left to top-right; in this case, we choose as first character the CC closer to the bottom origin point, defined as $O_b = (0, y_{max})$. If $m(L_{Ref}) > 0.1$, the text runs from top-left to bottom-right and the first and last characters are chosen based on the smallest distance between the upper origin point $O_u = (0, 0)$ and the two centroids $W_{e1}$ and $W_{e2}$.
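The case analysis above translates directly into code; a small sketch (image y-axis pointing down, with $y_{max}$ the image height; the function name is an assumption):

def order_extremities(w_e1, w_e2, slope, y_max):
    # Decide which extremity centroid is the first character of the string.
    if -0.1 <= slope <= 0.1:            # horizontal text: leftmost centroid first
        return tuple(sorted((w_e1, w_e2), key=lambda p: p[0]))
    if slope < -0.1:                    # bottom-left to top-right: closest to (0, y_max)
        ox, oy = 0.0, y_max
    else:                               # top-left to bottom-right: closest to (0, 0)
        ox, oy = 0.0, 0.0
    return tuple(sorted((w_e1, w_e2),
                        key=lambda p: (p[0] - ox) ** 2 + (p[1] - oy) ** 2))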
3.4 Quadrangle Approximation
We are now interested in finding the quadrangle that best fits a text string. This consists in identifying the four lines that bound the text, referred to here as the bottom ($L_b$), upper ($L_u$), left ($L_l$) and right ($L_r$) lines.
Bottom and Top Boundary Line Fitting. Let us consider $P_u = \{P_i^u\}$ and $P_b = \{P_i^b\}$ the sets containing the upper and lower extremity points of each $C_i$, respectively. In order to find these points, we use the slope of the reference line as a guideline:

1. We consider lines parallel to $L_{Ref}$ at different distances $d$, in two directions (positive and negative) corresponding to the upper and bottom points. We denote these lines $L_s^d$, where $s$ is the direction sign. For each direction sign, the procedure consists in increasing $d$ by one, computing the intersections between $L_s^d$ and the CCs and storing them into $P_u$ and $P_b$, until $L_s^d$ no longer intersects any CC (Fig. 4(a) and 4(b); see the sketch after this list).

2. The LSM is then used to fit $L_u$ on the set of points $P_u$ and $L_b$ on $P_b$ (Fig. 4(c)).

3. Finally, we check whether $L_u$ and $L_b$ correctly bound the text string. If $L_u$ or $L_b$ intersects the set of CCs $C$, the lines are shifted (parallel to $L_u$ or $L_b$) until they perfectly bound the text string (Fig. 4(d)).
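A minimal sketch of the sweep in step 1 (CC pixels given as point sets; the signed perpendicular offset from $L_{Ref}$ plays the role of the distance $d$; function name and data layout are assumptions):

import math

def upper_lower_extremity_points(cc_pixels, m, b, step=1.0):
    # Sweep lines parallel to the reference line y = m*x + b in both directions;
    # for each CC, keep the pixels hit by the last parallel that still intersects it.
    norm = math.hypot(m, 1.0)
    def offset(p):  # signed perpendicular distance from point p to the reference line
        return (p[1] - (m * p[0] + b)) / norm

    p_u, p_b = [], []
    for pixels in cc_pixels:
        offs = [offset(p) for p in pixels]
        d = 0.0                          # bottom side (positive offsets, y-axis down)
        while any(o >= d for o in offs):
            d += step
        p_b.extend(p for p, o in zip(pixels, offs) if o >= d - step)
        d = 0.0                          # upper side (negative offsets)
        while any(o <= -d for o in offs):
            d += step
        p_u.extend(p for p, o in zip(pixels, offs) if o <= -(d - step))
    return p_u, p_b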
Figure 4: Bottom and top boundary line fitting procedure: (a) parallels to the reference line in both directions: upper (blue) and bottom (magenta); (b) extremity points: upper (blue) and bottom (magenta); (c) LSM line fitting of the upper and bottom extremity points; (d) shifting of the initial upper and lower lines.
Left and Right Boundary Line Fitting. We call $P_l = \{P_i^l\}$ the set containing the left extremity points and $P_r = \{P_i^r\}$ the set containing the right extremity points. To get the left and right boundary lines, the positions of the first and last CCs are used, following the stages described below:

1. Let us define $L_{Ref}^P$ the line perpendicular to $L_{Ref}$ that passes through $W_i$. The procedure consists of tracing parallels to $L_{Ref}^P$ until the parallel lines no longer intersect the CC extremities. All border points that belong to both the last parallel line and the extremity CC are considered as extremity points and stored into the two sets $P_l$ and $P_r$ (Fig. 5(a) and 5(b)).

2. For each of the two sets $P_l$ and $P_r$, the average left point $P_{av}^l$ and right point $P_{av}^r$ are computed:
$$P_{av}^l = \left( \frac{\sum_{j=1}^{|P_l|} x_{P_j^l}}{|P_l|}, \frac{\sum_{j=1}^{|P_l|} y_{P_j^l}}{|P_l|} \right), \quad P_{av}^r = \left( \frac{\sum_{j=1}^{|P_r|} x_{P_j^r}}{|P_r|}, \frac{\sum_{j=1}^{|P_r|} y_{P_j^r}}{|P_r|} \right)$$

3. We look for the lines $L_l$ (resp. $L_r$) that best fit $P_{av}^l$ (resp. $P_{av}^r$) and $C_{e1}$ (resp. $C_{e2}$). Let us define $L_P$ the line perpendicular to $L_{Ref}$ passing through $P_{av}^l$. The best fitting left line is obtained by rotating $L_P$ until it covers the maximum number of border points (Fig. 5(c)). The same reasoning is used to obtain $L_r$ (see Fig. 5(d)). A sketch of this rotation search is given after this list.

4. Finally, we check whether $L_l$ and $L_r$ correctly bound the text string. If $L_l$ or $L_r$ intersects the set of CCs $C$, the lines are shifted (parallel to $L_l$ or $L_r$) until they perfectly bound the text string.
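A sketch of the rotation search in step 3 (border points as 2-D points; the angle range and pixel tolerance below are illustrative assumptions, not values from the paper):

import math

def best_boundary_line(anchor, border_points, ref_slope, tol=1.5, sweep_deg=30):
    # Rotate a line around `anchor`, starting perpendicular to the reference line,
    # and return the direction covering the most border points within `tol` pixels.
    perp = math.atan2(-1.0, ref_slope)   # direction perpendicular to y = ref_slope*x + b
    best_angle, best_count = perp, -1
    for ddeg in range(-sweep_deg, sweep_deg + 1):
        a = perp + math.radians(ddeg)
        ux, uy = math.cos(a), math.sin(a)
        # perpendicular distance of each point to the line through `anchor`
        count = sum(abs((px - anchor[0]) * uy - (py - anchor[1]) * ux) <= tol
                    for (px, py) in border_points)
        if count > best_count:
            best_angle, best_count = a, count
    return best_angle                    # the line passes through `anchor` with this direction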
Figure 5: Left and right boundary line fitting procedure: (a) finding the left extremity points; (b) line variation to find the best fitting left lines; (c) left and right boundary lines; (d) left and right boundary lines after shifting.

Figure 6: Perspective distortion rectification: (a) bounding quadrangle estimation; (b) rectified text.
3.5 Homography
The relationship that maps a point $(x, y)$ from one projective plane to a point $(x', y')$ in another projective plane is defined as $(x', y', 1)^T \sim H(x, y, 1)^T$ (equality up to a scale factor), where $H$ is the homography matrix. To compute $H$ we need four point correspondences, i.e. eight points: four points in the image plane and their four corresponding points in the real-world plane. In our case, we use the four corners of the input quadrilateral that bounds the distorted text and provide the coordinates of the output rectangular plane onto which we want to map this text string. From the approximations of the lines $L_u$, $L_b$, $L_l$ and $L_r$, we obtain the four corners ($P_1$, $P_2$, $P_3$ and $P_4$) as their pairwise intersections. The perspective deformation is then corrected by transforming all points $(x, y)$ that belong to the input quadrangle to their corresponding positions $(x', y')$ in the output rectangle. Fig. 6 illustrates a rectification result after applying the homography.
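A minimal sketch of this final step using OpenCV (assuming the four corner intersections $P_1..P_4$ are known; cv2.getPerspectiveTransform solves for $H$ from the four correspondences; output size is an arbitrary choice):

import cv2
import numpy as np

def rectify_text(image, corners, out_w=300, out_h=80):
    # Map the bounding quadrangle of the distorted text onto a horizontal rectangle.
    # `corners` are P1..P4 in the order top-left, top-right, bottom-right, bottom-left.
    src = np.float32(corners)
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    H = cv2.getPerspectiveTransform(src, dst)   # the 3x3 homography matrix
    return cv2.warpPerspective(image, H, (out_w, out_h))

# Example usage on a hypothetical image and quadrangle:
# rectified = rectify_text(img, [(120, 40), (410, 95), (400, 170), (105, 110)])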
4 EXPERIMENTAL RESULTS
We evaluate the performance of our rectification method on the two datasets proposed for the ICDAR 2015 Competition on Scene Text Rectification (Liu and Wang, 2015). The first dataset contains 2500 English and Chinese synthetic texts, obtained by applying random deformation types, such as rotation, shearing, horizontal foreshortening and vertical foreshortening with different parameters, to 1000 text samples. The second dataset is derived from the real-scene MSRA-TD500 dataset (Yao, 2012). It contains 60 image samples with English texts and 60 with Chinese texts, subject to orientation and perspective distortions.

Figure 7: Quality-quantity histograms for text rectification: accuracy values for the synthetic and the real-scene datasets.
The evaluation of the rectification method is done using the performance measures proposed during the ICDAR 2015 Competition on Scene Text Rectification (Liu and Wang, 2015). The OCR accuracy is computed between the recognition result $R$ of a rectified text string and its corresponding ground-truth (GT) transcription $G$, and is defined as $Acc(R, G) = 1 - \frac{L(R, G)}{\max(|R|, |G|)}$, where $|R|$ and $|G|$ represent the lengths of the two strings and $L(R, G)$ the Levenshtein distance between $R$ and $G$. The rectification performance metric considers the OCR results before and after the rectification and is defined as $RP(R, D, G) = Acc(R, G) - Acc(D, G)$, where $D$ is the recognition result of the distorted text before applying the rectification. $RP$ reflects both the impact of the rectification method on the final OCR transcription and the level of difficulty of the text recognition process. In our experiments we used the Tesseract OCR engine (Smith, 2007) to obtain the recognition results.

We define $Acc_{OV}^b$ and $Acc_{OV}^a$ as the overall accuracy before and after the rectification over a dataset of $N$ text strings, computed as $Acc_{OV}^b = \frac{1}{N}\sum_{i=1}^{N} Acc(D_i, G_i)$ and $Acc_{OV}^a = \frac{1}{N}\sum_{i=1}^{N} Acc(R_i, G_i)$. Similarly, we define the overall rectification performance as $RP_{OV} = \frac{1}{N}\sum_{i=1}^{N} RP(R_i, D_i, G_i)$.
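These metrics are straightforward to reproduce; a sketch in Python (the Levenshtein distance is written out, since it is the only non-trivial ingredient):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between strings a and b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def acc(r, g):
    # OCR accuracy between a recognition result r and the ground truth g.
    longest = max(len(r), len(g))
    return 1 - levenshtein(r, g) / longest if longest else 1.0

def rp(r, d, g):
    # Rectification performance: accuracy gain brought by the rectification.
    return acc(r, g) - acc(d, g)

# Overall scores over a dataset of (rectified, distorted, ground-truth) triples:
def overall(results):
    n = len(results)
    acc_a = sum(acc(r, g) for r, _, g in results) / n    # Acc_OV^a
    rp_ov = sum(rp(r, d, g) for r, d, g in results) / n  # RP_OV
    return acc_a, rp_ov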
The validation of the proposed rectification approach is based exclusively on our own results, as no results from participants in the ICDAR 2015 Competition on Scene Text Rectification are publicly available. Table 1 reports the scores on both datasets.
Figure 8: Rectification results on the synthetic dataset: original image (top), text rectification result (middle) and OCR transcription using Tesseract (bottom).
Discussion on the Results Obtained on the Synthetic Dataset. Fig. 8 illustrates rectification results on some synthetic deformed texts. The accuracy on the synthetic dataset (see Table 1) is approximately 0.72. On the other hand, the accuracy before the rectification is very low (0.08), which indicates both the difficulty of dealing with text string deformations and the efficiency of our method. Hence, the rectification performance $RP$ is equal to 0.64. Fig. 7 shows the quantity-quality histograms (Calarasanu et al., 2015) containing the distributions of accuracy values. By looking at the frequencies in the last bins of the histogram, one can notice that half of the rectified texts obtained almost perfect recognition accuracies (i.e. accuracy values in the intervals [0.8, 0.9[ and [0.9, 1]). Approximately 10% of the texts got a low accuracy rate, belonging to the interval [0, 0.1[, which corresponds to rectification failures.
Table 1: Rectification evaluation results on the ICDAR 2015 Competition on Scene Text Rectification datasets.

DATASET            $Acc_{OV}^a$   $RP_{OV}$   $Acc_{OV}^b$
Synthetic          0.721979       0.637037    0.0849421
Real-scene (EN)    0.65149        0.187165    0.464326
Most of the problems come from an incorrect approximation of the quadrangle that bounds the deformed text. The left and right bounding lines do not always capture the best orientation of the extremity CC. For characters such as “A”, “T” or “L”, the procedure looks for the lines that maximize the number of border points; however, these lines have a different direction than that of the characters, which can produce an inaccurate rectification, as seen in Fig. 9(c). Furthermore, an imprecise approximation of the quadrangle can be caused by the upper and/or bottom bounding lines, which can lead to disproportionate character sizes, as seen in Fig. 9(b). Here, the upper and lower lines are not approximations but the unique lines passing through the upper and bottom extremity points, which produces wrong lines (i.e. not parallel to the real direction of the text). The OCR also produces erroneous text transcriptions when the deformed texts are rectified upward (Fig. 9(a)).

Figure 9: Rectification failures: original image (top), text rectification result (middle) and OCR transcription using Tesseract (bottom); (a) upward rectified text strings; (b) rectified text strings with disproportionate character sizes; (c) text with an extremity character “T”.

Figure 10: OCR recognition failures due to (a) challenging font, (b) small text size and (c) complex design: original image (top), text rectification result (middle) and OCR transcription using Tesseract (bottom).

Figure 11: OCR recognition before and after the rectification process: original image (top), text rectification result (middle), OCR transcriptions before and after the rectification using Tesseract (bottom).
One of the advantages of the proposed rectification method is its ability to correct very challenging deformations that make text strings unreadable. Although the rectification is not always very accurate, which consequently leads to imprecise OCR transcriptions, the visual results remain notable: from a visual point of view, we succeed in transforming unreadable texts into readable ones. Such examples are depicted in Fig. 8. The recognition performance of Tesseract when dealing with inclined texts varies depending on the case. For example, the OCR performs better on the last part of the sequence “Gt. Yarmouth” (“mouth”) than on the apparently easier-to-read part “Gt. Ya”, depicted in the last example of Fig. 8.
Discussion on the Results Obtained on the Real-scene Dataset. The accuracy score obtained on the real-scene dataset (see Table 1) is slightly lower than the one obtained on the synthetic dataset (approximately 0.65). The rectification performance is very low (0.19) for several reasons: challenging fonts that are not correctly handled by the OCR (Fig. 10(a)); texts with small characters, which can affect the rectification process; complex text designs in which characters are composed of multiple CCs (Fig. 10(b)), for which the rectification method fails, as it can only handle characters represented by one CC; the presence of few challenging text distortions (mainly oriented texts and few text strings in perspective), which explains the low value of $RP$ in Table 1 compared to the synthetic dataset (Fig. 11); and the dataset size (60 images, versus 2000 images in the synthetic dataset), which leads to less representative scores.

Moreover, the accuracy histogram in Fig. 7 shows that approximately the same proportion of deformed texts has been incorrectly rectified as in the case of the synthetic dataset. On the other hand, the distribution of accuracy values is concentrated in the intervals [0.3, 0.5[, [0.8, 0.9[ and [0.9, 1], whereas the values computed on the synthetic dataset were more scattered. Nonetheless, both histograms in Fig. 7 present a similar behavior, which supports the claim that the proposed rectification method is independent of the text type, i.e. synthetic or natural.
5 CONCLUSION
In this paper we have presented a perspective rectification method that accurately corrects highly deformed text strings. The proposed rectification relies on a homographic transformation that maps the camera coordinates onto a fronto-parallel plane. The homographic transformation is powerful, as it handles rotation and perspective projections, including shearing effects, but it depends on how accurate the estimation of the bounding quadrangle of a distorted text is. Our two-stage procedure to find this quadrangle first approximates the top and bottom boundary lines based on a reference line computed using the LSM. Secondly, we provide a precise estimation of the lines bounding the extremity characters by iterating over candidate lines until finding the one that best bounds the two extremity CCs. This technique implies, however, some limitations: the deformed text should contain characters represented by single CCs and, moreover, the text should be upward only (rotated or not). The experiments were conducted on the two ICDAR 2015 Competition on Scene Text Rectification datasets, for which the rectification procedure gives similar performance results. A slightly lower recognition accuracy was obtained on the real-scene dataset for a number of reasons, such as the low performance of the Tesseract OCR on texts with complex fonts or designs. On the other hand, many texts in this dataset present only rotation or slight perspective deformations, compared to the synthetic dataset, which contains more challenging texts subject to multiple transformations at the same time. The difficulty of the synthetic dataset is also reflected in its high rectification performance score. We have demonstrated that the proposed rectification method can successfully correct oriented, sheared or perspective-distorted texts. We have also shown that we can rectify unreadable texts and obtain satisfactory OCR accuracy scores. Future work will focus on a deeper analysis of the shape of characters such as “A”, “L” or “T”, namely a study of their symmetry, to prevent inaccurate quadrangle approximations. The evaluation of the rectification procedure is also influenced by the OCR used: Tesseract expects a very accurate text rectification and often fails when the characters are slightly inclined. For example, the letter “t” is often interpreted as “f”, “l” as the symbol “\”, and “L” as “Z”. Hence, more powerful OCR engines, such as CuneiForm, ABBYY or OmniPage, are being considered for further tests.
ACKNOWLEDGEMENTS
This work was supported by FUI 14 (LINX project).
REFERENCES
Almazán, J., Fornés, A., and Valveny, E. (2013). Deformable HOG-based shape descriptor. In ICDAR, pages 1022–1026.
Busta, M., Drtina, T., Helekal, D., Neumann, L., and Matas,
J. (2015). Efficient character skew rectification in
scene text images. In ACCV, pages 134–146.
Calarasanu, S., Fabrizio, J., and Dubuisson, S. (2015). Us-
ing histogram representation and earth mover’s dis-
tance as an evaluation tool for text detection. In IC-
DAR.
Cambra, A. and Murillo, A. (2011). Towards robust and
efficient text sign reading from a mobile phone. In
ICCV, pages 64–71.
Chen, X., Yang, J., Zhang, J., and Waibel, A. (2004). Auto-
matic detection and recognition of signs from natural
scenes. TIP, 13(1):87–99.
Clark, P., Mirmehdi, M., and Doermann, D. (2001). Recognizing text in real scenes. IJDAR, 4:243–257.
Deng, H., Zhu, Q., Tao, J., and Feng, H. (2014). Rectification of license plate images based on Hough transformation and projection. TELKOMNIKA IJEE, 12(1):584–591.
Fan, K. C. and Huang, C. H. (2005). Italic detection and
rectification. JISE, 23:403–419.
Ferreira, S., Garin, V., and Gosselin, B. (2005). A text detection technique applied in the framework of a mobile camera-based application. In CBDAR, pages 133–139.
Hase, H., Yoneda, M., Shinokawa, T., and Suen, C. (2001).
Alignment of free layout color texts for character
recognition. In ICDAR, pages 932–936.
Kiran, A. G. and Murali, S. (2013). Automatic rectifica-
tion of perspective distortion from a single image us-
ing plane homography. IJCSA, 3(5):47–58.
Li, L. and Tan, C. (2008). Character recognition under se-
vere perspective distortion. In ICPR.
Liang, J., DeMenthon, D., and Doermann, D. (2008). Geo-
metric rectification of camera-captured document im-
ages. PAMI, 30(4):591–605.
Liu, C. and Wang, B. (2015). ICDAR 2015 Competition on Scene Text Rectification. http://ocrserv.ee.tsinghua.edu.cn/icdar2015 str/.
Lu, S. and Tan, C. (2006). Camera text recognition based
on perspective invariants. In ICPR, volume 2, pages
1042–1045.
Merino-Gracia, C., Mirmehdi, M., Sigut, J., and González-Mora, J. L. (2013). Fast perspective recovery of text in natural scenes. IVC, 31(10):714–724.
Myers, G., Bolles, R., Luong, Q.-T., Herson, J., and Arad-
hye, H. (2005). Rectification and recognition of text
in 3-d scenes. IJDAR, 7(2-3):147–158.
Phan, T. Q., Shivakumara, P., Tian, S., and Tan, C. L.
(2013). Recognizing text with perspective distortion
in natural scenes. In ICCV, pages 569–576.
Santosh, K. and Wendling, L. (2015). Character recognition
based on non-linear multi-projection profiles measure.
FCS, 9(5):678–690.
Smith, R. (2007). An overview of the Tesseract OCR engine. In ICDAR, pages 629–633.
Stamatopoulos, N., Gatos, B., Pratikakis, I., and Perantonis,
S. (2011). Goal-oriented rectification of camera-based
document images. TIP, 20(4):910–920.
Yao, C. (2012). Detecting texts of arbitrary orientations in
natural images. In CVPR, pages 1083–1090.
Ye, Q., Jiao, J., Huang, J., and Yu, H. (2007). Text detec-
tion and restoration in natural scene images. VCIR,
18(6):504–513.
Yonemoto, S. (2014). A method for text detection and recti-
fication in real-world images. In ICIV, pages 374–377.
Zhang, L., Lu, Y., and Tan, C. (2004). Italic font recognition
using stroke pattern analysis on wavelet decomposed
word images. In ICPR, volume 4, pages 835–838.
Zhang, X., Lin, Z., Sun, F., and Ma, Y. (2013). Rectification
of optical characters as transform invariant low-rank
textures. In ICDAR, pages 393–397.
Zhou, P., Li, L., and Tan, C. (2009). Character recognition
under severe perspective distortion. In ICDAR, pages
676–680.