Stereoscopic Text-based CAPTCHA on Head-Mounted Displays

Tadaaki Hosaka and Shinnosuke Furuya

School of Management, Tokyo University of Science, Chiyoda-city, Tokyo, Japan

Keywords:

CAPTCHA, Image Authentication, 3D Vision, Stereo Matching, Virtual Reality.

Abstract:

Text-based CAPTCHAs (completely automated public Turing test to tell computers and humans apart) are

widely used to prevent unauthorized access by bots. However, there have been advancements in image seg-

mentation and character recognition techniques, which can be used for bot access; therefore, distorted char-

acters that are difﬁcult even for humans to recognize are often utilized. Thus, a new text-based CAPTCHA

technology with anti-segmentation properties is required. In this study, we propose a CAPTCHA that uses

stereoscopy based on binocular disparity. Because the character area and its background are generated with identical

color patterns, the character regions cannot be extracted if the left and right images are analyzed

separately, which is a major advantage of our method. However, character regions can be extracted by using

disparity estimation or subtraction processing using both images; thus, to prevent such attacks, we intention-

ally add noise to the image. The parameters characterizing the amount of added noise are adjusted based on

experiments with subjects wearing a head-mounted display to realize stereo vision. With optimal parameters,

the recognition rate reaches 0.84; moreover, sufﬁcient robustness against bot attacks is achieved.

1 INTRODUCTION

Various types of CAPTCHA (completely automated

public Turing test to tell computers and humans apart)

have been proposed (Roshanbin and Miller, 2013;

Hasan, 2016), all of which present a task, which is

easy for humans but highly difﬁcult for machines to

perform, to the user requesting authentication. In typ-

ical text-based CAPTCHAs, some distorted letters or

digits are displayed, which must be input into the re-

sponse ﬁeld by the user.

However, there have been advances in image seg-

mentation and character recognition techniques to

break CAPTCHAs (Bursztein et al., 2014; Gao et al.,

2016; Chen et al., 2017); therefore, it was necessary

to present characters with a large degree of shape dis-

tortion. Consequently, such authentication sometimes

became difﬁcult even for humans to pass. Further-

more, recent developments in machine learning are

remarkable and most CAPTCHAs can be broken if

segmentation is correctly performed (George et al.,

2017; Ye et al., 2018); thus, it is important to realize

anti-segmentation. Therefore, a new text-based

CAPTCHA that does not overly deform characters

and has anti-segmentation properties is required.

In this study, we propose a text-based CAPTCHA

that uses binocular-vision-based stereoscopy, which

has not been previously considered in this research

ﬁeld, for technologies that use three-dimensional

(3D) applications. The proposed method uses stereo

images that contain a character in front of a wall-like

background. By using the same color pattern for the

character and background, segmentation of either left

or right image into these two regions becomes impos-

sible, which is a major advantage of our method over

conventional CAPTCHA methods. In plain stereo im-

ages that enable stereoscopic viewing based on binoc-

ular disparity, character regions can be extracted us-

ing disparity estimation or background subtraction;

the proposed method intentionally adds noise to in-

capacitate such image processing. The validity of our

proposed method is conﬁrmed by experiments with

subjects wearing a head-mounted display (HMD) to

realize stereo vision.

2 PREVIOUS RESEARCH ON TEXT-BASED CAPTCHA

Extensive research has been conducted on text-based CAPTCHA. Large IT enterprises such as Microsoft (Chellapilla et al., 2005) and Google (Kluever and Zanibbi, 2009) have actively conducted research on CAPTCHAs. In particular, reCAPTCHA, currently provided by Google, has been introduced on many websites. Although the latest version of reCAPTCHA, i.e., reCAPTCHA v3, evaluates the user’s behavior on the website as a score and determines whether the user is a bot, traditional text-based CAPTCHAs are still used on numerous websites.

Hosaka, T. and Furuya, S.

Stereoscopic Text-based CAPTCHA on Head-Mounted Displays.

DOI: 10.5220/0008893407670774

In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2020) - Volume 5: VISAPP, pages 767-774

ISBN: 978-989-758-402-2; ISSN: 2184-4321

Most traditional text-based CAPTCHA methods

enhance robustness against bot attacks by using tech-

niques such as character distortion, rotation, twist-

ing, and overlap (Roshanbin and Miller, 2013; Hasan,

2016). However, as characters are increasingly

transformed, it becomes more difﬁcult for humans

to recognize them. To solve this problem, three-

dimensional (3D) text-based CAPTCHAs have been

proposed (Macias and Izquierdo, 2009; Imsamai and

Phimoltares, 2010; Ye et al., 2014). In these methods,

characters as 3D objects are deformed by operations

such as rotation and projected onto a two-dimensional

(2D) plane corresponding to the user’s monitor. The

validity of these methods is based on the fact that hu-

mans can perceive 3D space even in a 2D image.

Research on methods to break CAPTCHAs has

also been actively conducted (Bursztein et al., 2014;

Gao et al., 2016; Chen et al., 2017). Typical break-

ing techniques for text-based CAPTCHA consist of

three parts: preprocessing, segmentation, and recog-

nition. Preprocessing is performed to facilitate the

subsequent segmentation or recognition process; typ-

ical methods include binarization, noise removal, and

thinning. Segmentation aims to detect the bound-

aries of the presented characters by measuring the

size of the connected regions, performing vertical pro-

jection, and so on. Representative techniques of the

recognition process include template matching, clus-

tering, and machine learning such as support vector

machines (Starostenko et al., 2015) and deep learning

(Stark et al., 2015).

With regard to machine learning, recent developments

in the techniques have been remarkable.

George et al. proposed the recursive cortical network

(RCN), inspired by the human brain having the ability

to learn and generalize from a few examples (George

et al., 2017). RCN is a hierarchical model that rep-

resents the contours and surfaces of characters, and

allows character recognition without requiring much

learning data. Their results showed that the recognition

rate for the reCAPTCHA database is 0.943 for letters

and 0.666 for words. Ye et al. proposed a method to

break CAPTCHAs using a generative adversarial network

(GAN), which is a method of generating the data

necessary to train deep neural networks (Ye et al.,

2018). Their method first trains two competing networks

corresponding to a generator and a discriminator.

The generator tries to create a CAPTCHA, which

is visually similar to the target CAPTCHA. The dis-

criminator then tries to determine if the synthesized

CAPTCHAs are genuine. After the learning of the

two networks has progressed sufﬁciently, the gener-

ator can synthesize a CAPTCHA indistinguishable

from real ones and the deep neural networks can be

trained using these images. Chen et al. proposed a

method to further improve the recognition accuracy

of deep convolutional neural networks for confusion

classes (Chen et al., 2019).

The development of such techniques to break

CAPTCHAs defeats most methods when character

and background segmentation is potentially possible.

3D CAPTCHAs can also be broken with high prob-

ability along with traditional text-based CAPTCHAs

(Ye et al., 2014; Bursztein et al., 2014; Gao et al.,

2016; Chen et al., 2017). Therefore, it is important

to realize a new text-based CAPTCHA that

does not overly deform the characters and has

anti-segmentation properties.

In this study, we propose a text-based CAPTCHA

that uses binocular-vision-based stereoscopy. By us-

ing the same color pattern for the character and back-

ground, it becomes impossible to segment either the

left or right image into character and background re-

gions. In our study, HMDs are used to make it eas-

ier for the subject to view the image stereoscopically,

which may seem to narrow the applicability of the

proposed method. However, most conventional text-

based CAPTCHAs are intended for use on personal

computers with a large screen and a keyboard and are

not always useful in virtual reality (VR) or mixed reality (MR) environments where

users have an HMD and/or handheld controllers (Bhat

et al., 2015; Singh and Singh, 2017). It is worthwhile

to propose a secure authentication method that can be

used even when the user wears devices specialized

for VR or MR environments. As an example, Yang

et al. proposed a CAPTCHA method by letting the

user play a game on a handset that had only a small

screen without a keyboard (Yang et al., 2015). By in-

vestigating the behavior during the game, it was pos-

sible to determine whether the user is a human or a

bot. Similar to their proposed CAPTCHA that takes

advantage of the characteristics of a handset, we realize

a new CAPTCHA by effectively using the stereovision

provided by an HMD.

3 PROPOSED CAPTCHA BASED ON BINOCULAR VISION

To realize a new kind of text-based CAPTCHA, we

utilize stereo images in which the characters have the

same color pattern as that of the background. Furthermore,

noise is intentionally added to the stereo images

to enhance robustness against supposed attacks

by bots.

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

Figure 1: Example of background image. This image consists of only eight colors (each RGB component is either 0 or 255). In this research, this image is utilized as the background for both the left and right images, creating a wall at infinite distance.

The procedure of the proposed stereo image gen-

eration is as follows:

1. Generate background

2. Draw characters in front of the background

3. Add noise for robustness against stereo matching

4. Add noise for robustness against background subtraction.

The details of each process are described below.

3.1 Generating Background

We ﬁrst generate two identical images of size 285 ×

241. For each pixel, each RGB component is ran-

domly determined to be either 0 or 255. These im-

ages are composed of eight colors, as shown in ﬁg-

ure 1. These two images are arranged horizontally

as a stereo image pair, creating a wall as background

at inﬁnite distance. Even though the background has

zero disparity in this research, the following argument

holds even for general cases of non-zero disparity.
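The background generation described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names and the list-of-rows image representation are assumptions, and 285 × 241 is read here as width × height.

```python
import random

def make_background(width=285, height=241, rng=None):
    """Return an image as a list of rows; each pixel is an (R, G, B) tuple
    whose components are independently 0 or 255 (eight possible colors)."""
    rng = rng or random.Random()
    return [
        [tuple(rng.choice((0, 255)) for _ in range(3)) for _ in range(width)]
        for _ in range(height)
    ]

def make_stereo_background(width=285, height=241, rng=None):
    """Left and right backgrounds are identical copies (zero disparity)."""
    left = make_background(width, height, rng)
    right = [row[:] for row in left]  # copy rows so later edits stay independent
    return left, right
```

Copying the rows keeps the two images equal at this stage while allowing the later steps to modify each image separately.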

3.2 Drawing Characters in Front of Background

To place a character in front of the background, we

add a letter with disparity k to the stereo image pair.

RGB combinations of each pixel in the character re-

gion are randomly selected to have one of the eight

colors used for the background. Therefore, it is in

principle impossible to recognize the character by

viewing a left or right image alone. The character

can be recognized only as a perception of depth us-

ing binocular vision, as shown in ﬁgure 2.

Figure 2: Diagram of how humans can see embedded char-

acter. In stereo images, the character region has disparity k

and the background has zero disparity. By looking at the left

and right images with the left and right eyes, respectively,

the character appears to ﬂoat above the background to a hu-

man even though it cannot be recognized when viewing the

left or right image alone.
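One way to picture the embedding step of this subsection is the sketch below. It is an illustration under stated assumptions: the character is given as a set of pixel positions (`mask`, a hypothetical input), and the sign convention x_left - x_right = k is assumed because the text does not state which image is shifted.

```python
import random

COLORS = (0, 255)

def random_color(rng):
    """One of the eight colors used for the background."""
    return tuple(rng.choice(COLORS) for _ in range(3))

def embed_character(left, right, mask, k, rng=None):
    """Draw the pixels listed in `mask` (a set of (x, y) positions in the left
    image) with fresh random colors, placing each one k pixels further left in
    the right image. Assumed convention: x_left - x_right = k.
    Both images are modified in place."""
    rng = rng or random.Random()
    width = len(left[0])
    for (x, y) in mask:
        color = random_color(rng)
        left[y][x] = color
        if 0 <= x - k < width:  # skip pixels that would fall outside the image
            right[y][x - k] = color
```

Because the new colors are drawn from the same eight colors as the background, each image on its own remains a uniform random texture; only the correspondence between the two images carries the character.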

It is in principle possible to present multiple char-

acters for the CAPTCHA. However, in this research,

stereovision using an HMD is considered and only

one character is displayed because the size of the

monitor on an HMD is limited. By performing multiple

tests, the same identification accuracy as when presenting

multiple characters simultaneously can be achieved.
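The equivalence between repeated single-character tests and one multi-character test can be illustrated for the simplest attacker, one that guesses uniformly at random. The sketch below assumes independent tests and a ten-digit alphabet; the function name is hypothetical, and stronger attacks are not modeled.

```python
def random_guess_pass_probability(num_tests, alphabet_size=10):
    """Probability that uniform random guessing answers every one of
    `num_tests` independent single-character tests correctly."""
    return (1.0 / alphabet_size) ** num_tests
```

Under these assumptions, three consecutive single-digit tests leave a random guesser the same chance as a single three-digit string would.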

3.3 Adding Noise for Robustness against Stereo Matching

Although it is not possible to segment each image into

the character region and background, several possible

attacks on the proposed method are considered.

One possible attack on the above method is to ex-

tract the character region using disparity estimation

via stereo matching. Therefore, to make disparity

estimation difﬁcult, every pixel in each row of the

right image is shifted by +1 or −1 in the x direction

with probability a/2. With this operation, for a pixel

whose RGB values become indeﬁnite at the left or

right end in each row, we randomly deﬁne the color

again. However, this operation is ineffective against

m × 1 (m ∈ N )-sized block matching. Therefore, a

similar operation is applied to the y-axis; every pixel

in each column of the right image is shifted by +1

or −1 in the y direction with probability b/2. As the

parameters a and b increase, the robustness to stereo

matching increases but stereoscopic viewing becomes

more difﬁcult for humans. It is thus necessary to op-

timize these parameters.
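The shift operations of this subsection might be sketched as below. The reading adopted here is an assumption: each row (or column) is shifted as a whole by one pixel, which is consistent with the remark that a pixel at the row end becomes undefined and is recolored. Function names are hypothetical.

```python
import random

def random_color(rng):
    return tuple(rng.choice((0, 255)) for _ in range(3))

def shift_rows(image, a, rng=None):
    """Shift each row by +1 or -1 pixel in x, each with probability a/2
    (no shift with probability 1 - a). The vacated end pixel gets a new
    random color. Returns a new image."""
    rng = rng or random.Random()
    out = []
    for row in image:
        u = rng.random()
        if u < a / 2:      # shift right by one pixel
            out.append([random_color(rng)] + row[:-1])
        elif u < a:        # shift left by one pixel
            out.append(row[1:] + [random_color(rng)])
        else:              # leave the row unchanged
            out.append(row[:])
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def shift_columns(image, b, rng=None):
    """Same operation along y, implemented by transposing the image."""
    return transpose(shift_rows(transpose(image), b, rng))
```

In this sketch the right image would be passed through `shift_rows` with parameter a and then through `shift_columns` with parameter b.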

3.4 Adding Noise for Robustness against Background Subtraction

The other possible attack on the proposed method is

to extract the character region by shifting the left or

right image by the amount of the background dispar-

ity (zero in the present case, but a non-zero value in


general) and taking the difference of the two images.

An example of a difference image for a stereo image

pair in which the character “3” is embedded is shown

in ﬁgure 3(a). This subtraction operation can extract

the character part as the region with a non-zero differ-

ence (depicted as white pixels).
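The subtraction attack described above can be sketched in a few lines. This is an illustrative version only: the function name and the choice to leave pixels without a counterpart black are assumptions.

```python
def difference_mask(left, right, background_disparity=0):
    """Return a binary image (1 = white, 0 = black): a pixel is 1 when the
    left pixel and the right pixel shifted by `background_disparity` differ in
    any RGB component. Pixels without a counterpart are left at 0."""
    height, width = len(left), len(left[0])
    mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            xr = x - background_disparity
            if 0 <= xr < width and left[y][x] != right[y][xr]:
                mask[y][x] = 1
    return mask
```

On a noise-free stereo pair the background cancels exactly, so the white pixels of the result cluster in the character region.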

To prevent this, one of the RGB values of each

pixel in the stereo image pair is redetermined as 0 or

255 with probability c. The difference image for the

stereo image pair after this process is shown in figure

3(b). This process generally increases the number

of pixels with differences over the whole image, but on its own it does not

hide the character region, where differences remain denser; erosion

and dilation processing can still be used to extract the

character region from the background.

Therefore, we add further noise to the stereo im-

age pair to disturb the character region of the dif-

ference image. To achieve this, the RGB values of

each pixel in the character region of the left image

are copied to the identical coordinates in the right im-

age with probability d. The aim of this operation is

to intentionally set the difference value in the charac-

ter region of the left image to zero. An example of a

difference image after this process is shown in figure

3(c), in which the character region is disturbed by zero-difference pixels.
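The two noise operations of this subsection, governed by c and d, can be sketched as follows. The function names and the `mask` argument (a set of character-region coordinates in the left image) are assumptions introduced for illustration; whether the redrawn channel may keep its old value is not specified in the text, and this sketch allows it.

```python
import random

def redetermine_channel(image, c, rng=None):
    """With probability c per pixel, pick one RGB channel and redraw it as
    0 or 255. Returns a new image."""
    rng = rng or random.Random()
    out = []
    for row in image:
        new_row = []
        for px in row:
            if rng.random() < c:
                ch = rng.randrange(3)
                px = px[:ch] + (rng.choice((0, 255)),) + px[ch + 1:]
            new_row.append(px)
        out.append(new_row)
    return out

def copy_left_to_right(left, right, mask, d, rng=None):
    """For each (x, y) in the character region `mask` of the left image, copy
    the left pixel to the same coordinates of the right image with
    probability d. Returns a new right image."""
    rng = rng or random.Random()
    out = [row[:] for row in right]
    for (x, y) in sorted(mask):
        if rng.random() < d:
            out[y][x] = left[y][x]
    return out
```

In this sketch `redetermine_channel` would be applied to both images of the pair, and `copy_left_to_right` only to the right image.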

3.5 Parameter Adjustments

The four parameters a, b, c, and d need to be adjusted

so that stereoscopic viewing by a human is possible

and character recognition by a machine cannot be eas-

ily performed. Furthermore, the four parameters need

to be properly balanced because if even one of them

is too large, binocular vision by a human becomes

highly difﬁcult. In contrast, if one of the parame-

ters is close to 0 with the other parameters kept not

too large, extraction of the character regions by image

processing becomes easy. Although it is necessary to

moderately increase all parameters, it is difﬁcult to

theoretically obtain the optimal values. The optimal

combination of parameters was thus determined us-

ing experiments with subjects.

In this study, experiments were conducted in a

situation where the subject could view the image

stereoscopically by wearing an HMD. The proposed

method is also applicable to other devices that enable

stereoscopic vision, which include polarized or ac-

tive shutter glasses (used in cinemas, etc.), lenticular

lenses or parallax barriers (used in naked-eye 3D sys-

tems), and traditional anaglyph glasses. In addition,

with a little training, some people can see the right

image with their right eye and left image with their

left eye without using a special device. For such people,

the proposed method can be used as an alternative

to typical 2D CAPTCHAs.

Figure 3: Character extraction by background subtraction. White pixels represent non-zero difference and black pixels represent zero difference. (a) Difference image for a stereo image pair not subjected to any processing; the character region can be easily extracted. (b) Difference image for the stereo image pair processed with c = 0.5 (other parameters set to zero); although the background is contaminated by noise, simple erosion and dilation operations can be used to extract the character region. (c) Difference image for the stereo image pair processed with d = 0.7 (other parameters set to zero); noise is added to the character region. Adjusting both c and d is necessary for enhancing robustness against background subtraction.

4 EVALUATION EXPERIMENTS

4.1 Experimental Settings

Evaluation experiments were conducted with a total

of 20 participants. Stereo image pairs correspond-

ing to the seven sets of parameters shown in table

1 were presented and viewed through an HMD de-

signed for VR headsets. As the value of each param-

eter increases, the difﬁculty level of stereoscopic vi-

sion rises and robustness against stereo matching and

background subtraction is enhanced. Therefore, im-

ages with parameter set 1 were the easiest to perceive

and least resistant to attacks, and images with param-

eter set 7 were the most difﬁcult to perceive and most

resistant to attacks. Since parameter b more strongly

and negatively inﬂuences the stereoscopic view than

parameter a in most cases, the value of parameter b is

set no larger than that of a.

In each stereo image pair, a digit from 0 to 9

was embedded using the method described in the pre-

vious section with disparity k = 10. Examples of

stereo image pairs for some parameter sets are shown

in ﬁgure 4. Subjects viewed these images through

an HMD having a divider at the center (ﬁgure 5)

and were asked to state the number they perceived

by stereopsis. Subjects ﬁrst optimized their viewing

setup in terms of visibility of targets by adjusting the

focal length and the distance between lenses using

plain stereo images corresponding to the parameter

set (a, b, c, d) = (0, 0, 0, 0). With the optimized setup,

the subject sequentially viewed images with various

parameters.

Table 1: Parameter sets in our experiments.

No.  a    b     c     d
1    0    0     0     0
2    0.1  0.1   0.1   0.1
3    0.3  0.1   0.1   0.1
4    0.3  0.15  0.15  0.15
5    0.3  0.2   0.15  0.15
6    0.3  0.2   0.3   0.2
7    0.3  0.3   0.3   0.3

Some people could not stereoscopically view the

numbers even under parameter set 1; they were ex-

cluded from the evaluation because the proposed

CAPTCHA is expected to be utilized in VR or MR

applications and it is reasonable to assume that such

users have at least plain stereoscopic vision.

For each subject, two tests each for parameter sets

1-5 and four tests each for parameter sets 6 and 7 were

conducted. For each test, correct/incorrect, response

time, and conﬁdence of answer were recorded. The

degree of conﬁdence was self-assessed by subjects ac-

cording to the following criteria:

3: Subjects can clearly perceive the whole character

as if it were ﬂoating on the background.

2: Subjects can recognize the character without re-

lying on estimation, but the clarity of the stereo-

scopic vision is inferior to that at level 3.

1: Subjects can estimate the character from some per-

ceivable information such as piecewise contours.

The key point of levels 2 and 3 is that the subject does

not rely on estimation. Subjects learned the degree of

level 3 by stereoscopically viewing the image generated

with parameter set 1 (all parameters zero); this degree was used as a

reference for level 2. When the subject gave an in-

correct digit or could not even estimate the digit, the

degree of conﬁdence was not recorded.

4.2 Experimental Results for Human Recognition

The accuracy rates for the seven parameter sets are

shown in ﬁgure 6. For parameter sets 1-4, which

make it relatively easy to recognize the character, the

correct answer rate was 1. In practical applications,

considering that another image can be presented if an

incorrect answer is given, an accuracy rate of 0.8 or

more, as obtained for parameter sets 5 and 6, seems

sufficient. For parameter set 7, with (a, b, c, d) =

(0.3, 0.3, 0.3, 0.3), the correct answer rate sharply decreased

to 0.45. Each parameter value should thus be

less than 0.3.

Figure 4: Examples of stereo image pairs: (a) parameter set 1 with the digit “9” embedded; (b) parameter set 6 with the digit “8” embedded. The left and right images were concatenated for subjects to view through an HMD designed for VR headsets. By adjusting their viewing setup so that the vertical red lines drawn at the center of the left and right images overlap, the subjects could easily perform binocular vision with the parallel-eyed stereo method.

Figure 5: Head-mounted displays utilized in our experiments. (Right) This HMD device has a divider between two lenses, so subjects could easily perform the parallel-eyed stereo method.
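The remark that a per-image accuracy of about 0.8 suffices when a new image can be presented after a failure can be made concrete with a small calculation. The sketch assumes that attempts are independent and share the same success probability, which the experiments do not establish; the function name is hypothetical.

```python
def pass_within_attempts(p, attempts):
    """Probability of at least one correct answer in `attempts` independent
    tries, each succeeding with probability p."""
    return 1.0 - (1.0 - p) ** attempts
```

Under this assumption, a user with p = 0.8 would pass within two attempts with probability 0.96.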

Figure 6: Correct answer rate (recognition rate) for each parameter set (vertical axis: correct answer rate; horizontal axis: parameter set). For parameter sets 1-4, every subject gave the correct answer for all tests. The recognition rate slightly decreased for parameter sets 5 and 6, and dropped sharply (under 0.5) for parameter set 7.

Figure 7 shows the average response time for subjects

who responded correctly. To eliminate the influence

of extremely rapidly or slowly responding subjects,

the top and bottom 20% of response times were

omitted from the average calculation. The experimental

results show that as the values of parameters a, b, c,

and d increase, the response time tends to increase.

In future versions of the proposed method, present-

ing multiple characters is conceivable, and thus the

recognition speed per character should be high. From

this viewpoint, the stereo image pairs generated with

parameter set 7 are too stressful for users in practical

situations.
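The averaging of response times with the top and bottom 20% removed corresponds to a trimmed mean, sketched below. The rounding rule (dropping floor(n × 0.2) values at each end) is an assumption, since the text does not say how fractional counts were handled.

```python
import math

def trimmed_mean(values, fraction=0.2):
    """Average after discarding the lowest and highest `fraction` of the
    values. The number dropped at each end is floor(n * fraction), an
    assumed rounding rule."""
    ordered = sorted(values)
    drop = math.floor(len(ordered) * fraction + 1e-9)  # guard against float error
    kept = ordered[drop:len(ordered) - drop]
    if not kept:
        raise ValueError("fraction too large for the number of values")
    return sum(kept) / len(kept)
```

For example, with the five response times 1, 2, 3, 4, and 100 seconds, one value is dropped at each end and the remaining three are averaged.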

Figure 8 shows the average conﬁdence level for

subjects who responded correctly. The conﬁdence de-

gree for parameter sets 1-6 exceeds level 2 (characters

are perceptible without estimation). For parameter set

7, the conﬁdence level is lower than 2, conﬁrming that

this set is unsuitable for practical applications.

The above results indicate that parameter sets 1-6

are appropriate from the viewpoint of human recogni-

tion performance.

Figure 7: Average response time for each parameter set (vertical axis: average response time in seconds; horizontal axis: parameter set). Some subjects could immediately give an answer and some took more than one minute to answer. To mitigate the influence of these extreme cases, the top and bottom 20% of response times were excluded from the average calculation of each parameter set. As the difficulty of recognition increases, response time increases.

Figure 8: Average confidence degree for answers for each parameter set (vertical axis: average confidence level; horizontal axis: parameter set). For parameter sets 1-5, a high degree of confidence was observed. The confidence level dropped sharply, but remained above 2, for parameter set 6, indicating that recognition was made without estimation.

4.3 Robustness against Bot Attacks

Robustness against possible attacks using stereo matching and background subtraction was investigated.

Disparity maps for parameter sets 5-7 obtained using stereo matching with 5 × 5 blocks are shown in figure 9. To generate a binary image representing the foreground and background, pixels with an estimated disparity of 0 or ±1 are set to black and pixels whose estimated disparity exceeds 1 in absolute value are set to white, considering that the correct disparity of the background is zero and that a shift of one pixel may occur in relation to parameter a. To extract the character region from this binary image, we performed erosion and dilation operations, in which the order and number of erosion and dilation processes were manually tuned based on the grid search method. The following two types of results are shown in figure 10:

• (Left) As many pixels as possible belonging to the

character region are retained as white pixels.

• (Right) As many pixels as possible belonging to

the background region are retained as black pix-

els.

These results qualitatively show that parameter sets 6 and 7 have sufficient robustness against character region extraction: when we try to retain the character region as the foreground, approximately the same number of or more mispredicted white pixels appear in the background, and when we attempt to eliminate this background noise, the character part cannot be recognized. Robustness can be increased in future work by adaptively varying the disparity of the background.

Figure 9: Disparity maps estimated using stereo matching with 5 × 5 blocks for (a) parameter set 5, (b) parameter set 6, and (c) parameter set 7. The true disparity values are 10 in the character region and 0 in the background region. Estimated disparity values are amplified by a factor of 10, and their absolute values are expressed as brightness in these maps. As the values of parameters increase, more noisy disparity maps are obtained.
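The stereo matching attack considered in this subsection can be sketched with a basic block matcher. Several details below are assumptions, because the text specifies only the 5 × 5 block size and the binarization rule: the matching cost (sum of absolute differences), the one-sided search range, and the sign convention x_left - x_right = disparity. The helper `synthetic_pair` is hypothetical test data, not the authors' image generator.

```python
import random

def block_sad(left, right, x, y, d, half):
    """Sum of absolute RGB differences between the block centered at (x, y)
    in the left image and the block centered at (x - d, y) in the right image."""
    total = 0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            p = left[y + dy][x + dx]
            q = right[y + dy][x - d + dx]
            total += abs(p[0] - q[0]) + abs(p[1] - q[1]) + abs(p[2] - q[2])
    return total

def disparity_map(left, right, max_disp, block=5):
    """Estimate a disparity in [0, max_disp] per pixel; pixels whose block
    does not fit in both images for every candidate are left as None."""
    half = block // 2
    height, width = len(left), len(left[0])
    disp = [[None] * width for _ in range(height)]
    for y in range(half, height - half):
        for x in range(half + max_disp, width - half):
            costs = [block_sad(left, right, x, y, d, half)
                     for d in range(max_disp + 1)]
            disp[y][x] = costs.index(min(costs))  # first minimum wins ties
    return disp

def binarize(disp, threshold=1):
    """White (1) where |disparity| > threshold, black (0) otherwise."""
    return [[1 if d is not None and abs(d) > threshold else 0 for d in row]
            for row in disp]

def synthetic_pair(width, height, box, k, rng):
    """Hypothetical test data: random eight-color background with a
    rectangular 'character' box = (x0, y0, x1, y1) at disparity k (x0 >= k)."""
    def color():
        return tuple(rng.choice((0, 255)) for _ in range(3))
    left = [[color() for _ in range(width)] for _ in range(height)]
    right = [row[:] for row in left]
    x0, y0, x1, y1 = box
    for y in range(y0, y1):
        for x in range(x0, x1):
            c = color()
            left[y][x] = c
            right[y][x - k] = c
    return left, right
```

On a noise-free pair such as the one produced by `synthetic_pair`, this matcher recovers the embedded region; the noise parameters a, b, c, and d are intended to degrade exactly this kind of estimate.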

Figure 11 shows the binarized subtraction images

for parameter sets 5-7, where pixels with identical

RGB values for the left and right images are black,

and pixels that are different in any one RGB compo-

nent are white. We attempted to extract the charac-

ter region by performing erosion and dilation opera-

tions, as done above. The results are shown in ﬁg-

ure 12. The images generated with parameter sets 6

and 7 have signiﬁcant tolerance against attacks, which

may be further enhanced by making the character and

background structures more complicated. For param-

eter set 5, it is speculated that recognizing the charac-

ter by machines is possible.
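The erosion and dilation operations used in both attacks can be sketched as 3 × 3 binary morphology. The structuring element, the zero padding at the image border, and the operation-string interface are assumptions; the text states only that the order and number of operations were tuned.

```python
def erode(mask):
    """3x3 binary erosion: a pixel stays 1 only if it and all eight neighbors
    are 1 (pixels outside the image count as 0)."""
    h, w = len(mask), len(mask[0])
    def at(x, y):
        return mask[y][x] if 0 <= x < w and 0 <= y < h else 0
    return [[int(all(at(x + dx, y + dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
             for x in range(w)] for y in range(h)]

def dilate(mask):
    """3x3 binary dilation: a pixel becomes 1 if it or any neighbor is 1."""
    h, w = len(mask), len(mask[0])
    def at(x, y):
        return mask[y][x] if 0 <= x < w and 0 <= y < h else 0
    return [[int(any(at(x + dx, y + dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
             for x in range(w)] for y in range(h)]

def apply_sequence(mask, ops):
    """Apply a string of operations, e.g. "ed" = erode then dilate (opening)."""
    for op in ops:
        mask = erode(mask) if op == "e" else dilate(mask)
    return mask
```

A grid search over the order and number of operations, as mentioned in the text, would amount to trying strings such as "ed", "eedd", or "de" and inspecting the results.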

The above results indicate that parameter sets

6 and 7 are appropriate from the viewpoint of robustness

against bot attacks. Furthermore, together

with the results of human recognition performance

described in the previous subsection, we conclude

that parameter values of approximately (a,b,c,d) =

(0.3,0.2,0.3,0.2) are suitable for practical use.

5 CONCLUSIONS

This paper proposed a text-based CAPTCHA that

uses stereopsis based on binocular vision. To enhance

robustness against supposed bot attacks using stereo

matching and image subtraction, we manipulated the

pixel values, which correspond to artiﬁcial noise addi-

tion. In evaluation experiments with subjects wearing

an HMD, users achieved a recognition rate of more

than 0.8 without resorting to speculation for images

that were resistant to attacks. It is speculated that by

making the scene depth structure more complicated,

the proposed stereo images can become even more ro-

bust against bot attacks.

Future work will include verification of our method with incorporation of traditional text-based CAPTCHA techniques, such as multiple letters, multiple strings, overlap, distortion, and adhesion. As another direction, robustness against breaking techniques based on sophisticated machine learning such as support vector machines and convolutional neural networks should be investigated.

Figure 10: Extraction of the character region from binarized disparity maps using erosion and dilation operations: (a) parameter set 5; (b) parameter set 6. (Left) The order and number of erosion and dilation processes were adjusted so that many pixels in the actual character region were maintained as the foreground. (Right) These parameters were adjusted so that as many mispredicted white pixels as possible in the actual background region were removed. It seems possible for machines to recognize the character with parameter set 5 because most of the predicted foreground (white) pixels are concentrated in the true character region. With parameter sets 6 and 7, it seems difficult for machines to recognize the character because collecting white pixels only in the character region seems difficult.

Figure 11: Binarized subtraction images for (a) parameter set 5, (b) parameter set 6, and (c) parameter set 7. Pixels with matching RGB values are shown in black, and pixels that are different in any one RGB component are shown in white. As in the disparity maps, the degree of noise is increased in order, which means that extracting the character region using image subtraction becomes more difficult as the values of parameters increase.

Figure 12: Extraction of the character region from subtraction images using erosion and dilation operations: (a) parameter set 5; (b) parameter set 6. (Left) The order and number of erosion and dilation processes were adjusted so that many pixels in the actual character region were maintained as the foreground. (Right) These parameters were adjusted so that as many mispredicted white pixels as possible in the actual background region were removed. As in the disparity maps, with parameter set 5, it seems possible for machines to extract the character region by adjusting the erosion and dilation parameters. This seems difficult with parameter sets 6 and 7, for which the digits in the results are highly difficult to recognize, even for humans.

REFERENCES

Bhat, A., Bhagwat, G., and Chavan, J. (2015). A survey

on virtual reality platform and its applications. Inter-

national Journal of Advanced Research in Computer

Engineering & Technology, 4(10):3775–3778.

Bursztein, E., Aigrain, J., Moscicki, A., and Mitchell, J. C.

(2014). The end is nigh: Generic solving of text-based

captchas. In Proceedings of 8th USENIX Workshop on

Offensive Technologies (WOOT 14).

Chellapilla, K., Larson, K., Simard, P. Y., and Czerwinski,

M. (2005). Designing human friendly human interac-

tion proofs (hips). In Proceedings of the 2005 Confer-

ence on Human Factors in Computing Systems, pages

711–720.

Chen, J., Luo, X., Guo, Y., Zhang, Y., and Gong, D.

(2017). A survey on breaking technique of text-based

captcha. Security and Communication Networks, page

6898617.

Chen, J., Luo, X., Liu, Y., Wang, J., and Ma, Y. (2019). Se-

lective learning confusion class for text-based captcha

recognition. IEEE Access, 7:22246–22259.

Gao, H., Yan, J., Cao, F., Zhang, Z., Lei, L., Tang, M.,

Zhang, P., Zhou, X., Wang, X., and Li, J. (2016). A

simple generic attack on text captchas. In Proceed-

ings of Network and Distributed System Security Sym-

posium (NDSS).

George, D., Lehrach, W., Kansky, K., Lázaro-Gredilla,

M., Laan, C., Marthi, B., Lou, X., Meng, Z., Liu,

Y., Wang, H., Lavin, A., and Phoenix, D. S. (2017).

A generative vision model that trains with high data

efﬁciency and breaks text-based captchas. Science,

358(6368):eaag2612.

Hasan, W. (2016). A survey of current research on captcha.

International Journal of Computer Science Engineer-

ing Survey, 7:1–21.

Imsamai, M. and Phimoltares, S. (2010). 3d captcha: A

next generation of the captcha. In Proceedings of 2010

International Conference on Information Science and

Applications, pages 1–8.

Kluever, K. A. and Zanibbi, R. (2009). Balancing usability

and security in a video captcha. In Proceedings of the

5th Symposium On Usable Privacy and Security.

Macias, C. R. and Izquierdo, E. (2009). Visual word-based

captcha using 3d characters. In Proceedings of the

3rd International Conference on Crime Detection and

Prevention, pages 1–5.

Roshanbin, N. and Miller, J. (2013). A survey and analysis

of current captcha approaches. Journal of Web Engi-

neering, 12:1–40.

Singh, N. and Singh, S. (2017). Virtual reality: A brief

survey. In Proceedings of 2017 International Confer-

ence on Information Communication and Embedded

Systems (ICICES), pages 1–6.

Stark, F., Hazırbaş, C., Triebel, R., and Cremers, D. (2015).

Captcha recognition with active deep learning. In Ger-

man Conference on Pattern Recognition Workshop.

Starostenko, O., Cruz-Perez, C., Uceda-Ponga, F., and

Alarcon-Aquino, V. (2015). Breaking text-based

captchas with variable word and character orientation.

Pattern Recogn., 48(4):1101–1112.

Yang, T.-I., Koong, C.-S., and Tseng, C.-C. (2015). Game-

based image semantic captcha on handset devices.

Multimedia Tools and Applications, 74(14):5141–

5156.

Ye, G., Tang, Z., Fang, D., Zhu, Z., Feng, Y., Xu, P., Chen,

X., and Wang, Z. (2018). Yet another text captcha

solver: A generative adversarial network based ap-

proach. In Proceedings of the 2018 ACM SIGSAC

Conference on Computer and Communications Secu-

rity, pages 332–348.

Ye, Q., Chen, Y., and Zhu, B. (2014). The robustness of

a new 3d captcha. In Proceedings of 11th IAPR In-

ternational Workshop on Document Analysis Systems,

pages 319–323.
