TRANSCRIPTION SUPPORT SYSTEM USING SUBVERSION

Takehiko Murakawa, Hitoshi Fukuoka

Daichi Noda and Masaru Nakagawa

Faculty of Systems Engineering, Wakayama University, Japan

Keywords:

Digital library, Ancient document, Web service, Version control.

Abstract:

We report the data management system and the interface for reading the shot image and the text of a Buddhist

sutra written in Chinese and modifying the text so that it may be the same as the image in terms of content.

By using Subversion we maintain the text ﬁles efﬁciently and obtain the difference of the contents between

any two points of time easily. To make sure that the system can be employed as a multiuser transcription

support tool, we present the working model and deliver the experiment where the workers used the system and

we found revision markings on which two workers or more made. Furthermore we propose the method for

piecing together the workers’ respective outcomes to produce the integrated text ﬁle.

1 INTRODUCTION

Tens of thousands of Buddhist sutras which were tran-

scribed by hand around 1,000 years ago are existent

in Japan now. Through the literature assessment, it

became evident that those sutras reﬂect the earlier

Buddhist canons in Chinese, in comparison to the set

of sutras by woodblock printing after the 10th cen-

tury. To prevent such monumental documents while

the further research is permitted, a project was orga-

nized and has received the subsidy from the govern-

ment of Japan to enable the researchers to comput-

erize the materials and manage them with databases.

The authors have worked on the project with a view to

providing the database system together with service-

able interfaces (Tanaka et al., 2006; Fukuoka et al.,

2009).

When bending our eyes on the activities outside

Japan, the conservation of ancient documents includ-

ing transcribed Buddhist sutras has a close connec-

tion to the cultural heritage. European countries seem

to be more active than the United States and Asian

nations. For example, the report of EPOCH (Euro-

pean Research Network of Excellence in Processing

Open Cultural Heritage) (EPOCH, 2005) insists on

the importance of the prevention of historical mate-

rials in various languages, although the report mainly

describes the maintenance of building structures and

remains. Various documentation and reading support-

ing systems were actually constructed (Chen et al.,

2007; Ribeiro et al., 2007; Alshuhri, 2008).

There exist temples in Japan say Kongoji Temple

and Nanatsudera Temple which possess lots of Bud-

dhist sutras in good condition. The experts of digi-

tization have had the approval of the head priests to

use the rooms in the temples, brought photograph-

ing equipments including high-resolution digital cam-

eras, photography platforms and lighting apparatus,

and taken the pictures. Most sutras of Kongoji Tem-

ple have been taken, and the amount of the ﬁle size

rose above 1 terabyte.

During the photo shoot, the lighting and the avoid-

ance of distortion were sufﬁciently taken into consid-

eration. In addition, we developed tools for combin-

ing the shot images of the sutra, as well as the Web

application which allows one to browse the sutra im-

ages and move the viewpoint freely by the drag op-

eration just as Google Maps. The trouble is that he

or she cannot make use of full-text search on those

sutras. The authors attempted to apply several char-

acter recognition applications to the images, but did

not lead to a positive outcome. There is, fortunately,

released a data set of the text on Buddhist sutra in

Chinese. The CD-ROM image ﬁle of the text data is

available free of charge, but we had to pay attention

to the fact it was derivedfrom the woodblock-printing

sutras. When looking at the shot image and the corre-

sponding text ﬁle for comparison, we detected plenty

of mismatches. If the editing interface is supplied say

by extending the sutra browser mentioned above, then

the user will be able to ﬁnd the desired sequence of

Chinese characters using the full-text search. More-

150

Murakawa T., Fukuoka H., Noda D. and Nakagawa M.

TRANSCRIPTION SUPPORT SYSTEM USING SUBVERSION.

DOI: 10.5220/0002846801500155

In Proceedings of the 6th International Conference on Web Information Systems and Technology (WEBIST 2010), page

ISBN: 978-989-674-025-2

over the information about the mismatches has a po-

tential for the contribution to humanities; it will be

a powerful tool for a comparative analysis between

the transcribed Buddhist sutras and the woodblock-

printing ones.

In this paper, we give an introduction to the devel-

oping support system for making the documents using

the shot images and text ﬁles of the sutras, with the aid

of the version control software Subversion. Through

the experiment, we made sure that the system can be

used as a multiuser transcription support tool, but the

problem becomes how to integrate the workers’ re-

sults which might conﬂicts each other. We then pro-

pose the consolidation method as well as the working

model for resolving this problem.

2 THE SYSTEM

Kongoji Issaikyo is a collection of thousands of Bud-

dhist sutras which were transcribed nearly 1,000 years

ago and nowpossessed by Kongoji Temple in Japan in

good preservation. It is true that the sutra were written

in Chinese and currently most Japanese people (and

maybe Chinese) who are good command of Chinese

characters cannot understand the passages of the su-

tras, but Buddhism researchers regularly read these

historical materials to make clear the religious cir-

cumstances of those days. Such researchers consider

that the sutras will make a contribution to Buddhist

study or historical science of Japan. Although they

began to survey the documents decade ago, the re-

searchers has been energetically doing the work about

the checkup of the scriptures and the photographing

of all the descriptive content. The authors received a

number of shot images taken with a digital camera,

and since the we have been investigating the auto-

mated method for combining the images and existing

text ﬁles of Buddhist sutras.

In order to search the shot images of Buddhist su-

tra for a passage, we need the text document which is

in strict correspondence with the images. The most

established text ﬁles of Buddhist sutras are main-

tained by Chinese Buddhist Electronic Text Associ-

ation (CBETA, ) now. We refer to the data set as

CBETA texts. The ﬁles are based on Taisho Tripi-

taka which is derived from Buddhist sutras by wood-

block printing. The images of Buddhist scriptures and

the CBETA texts in our own hands look largely same,

though, we can easily ﬁnd the difference by character

between the image and the text. Some of the differ-

ences are due to the transcription error such as a literal

error or an omitted error, but some may be the very

distinctions that the Buddhism researchers for throw-

ing light on, that is, the clue to the propagation of

Buddhist sutras and Buddhism. A goal of our study is

to supply a practical supportsystem for comparingthe

shot images and the relevant texts or for contrasting

the edited texts compatible with the Kongoji Issaikyo.

Based on this problem consciousness together

with the contents in hand, we are developing the data

management system and the interface for enabling a

user to read the shot image and the text all together

and modify the text so that the text may be compat-

ible with the image. By using the system, we will

be able not only to search images through a full-text

search of the text ﬁles but to get a foothold for learn-

ing the difference between transcribed Buddhist su-

tras and those of woodblock printing. Note that the

system are not for correcting the shot image, since

we have to show our respect for the historical mate-

rial. In addition we do not intend to send the modiﬁed

text to CBETA; we actually attempt to make digital

transcriptions of Kongoji Issaikyo or ancient Japanese

Buddhist sutras efﬁciently using CBETA texts.

A typical Buddhist sutra consists of a few thou-

sand Chinese letters. Particular two-letter idioms

such as the word meaning bodhisattva appear fre-

quently, and repetitive sequence of letters are often

seen. Whether it is transcribed or wood-block printed,

the number of letters in line is basically ﬁxed and the

characters are arranged in a methodical fashion ver-

tically and horizontally. These properties of the doc-

uments imply that it is impossible only for a single

worker to ﬁnish the absolutely perfect text ﬁle. To en-

sure a quality of the text, we should pay attention to

the multiple users’ operations.

For supporting the multiple-user modiﬁcation and

management of the documents, we put in use the ver-

sion control software Subversion (SVN, ) in our sys-

tem. Subversion was originally intended for the ver-

sion management of source ﬁles of software. The tool

tells us the difference of the contents between any two

points of time, which is suitable for managing and

displaying the distinction of the modiﬁed text ﬁles of

Buddhist sutras.

We are introducing several terms around Subver-

sion. A group of ﬁles maintained in a server for ver-

sion control is a repository, while a working copy is a

unit of ﬁle manipulation associated with a repository.

Note that the repository and the working copy have a

one-to-many relationship. An update means the oper-

ation where the latest version is sent from a repository

to a working copy. Conversely the action to transmit

the modiﬁcation within a working copy to the repos-

itory is called a commit (used as a noun as well as a

verb). In doing a commit, the user, committer, can

leave a message which is commonly called a com-

TRANSCRIPTION SUPPORT SYSTEM USING SUBVERSION

151

mit log, which will inform the co-workers of why and

how the committer went over the ﬁles. The change

of repository is controlled by a serial number named

a revision. After a commit, the revision increases by

just one.

Subversion is often abbreviated as SVN, while the

lower-case word

svn

is a command name for various

operations. When executing

svn diff

together with

the option about two revisions, we will obtain the line-

at-a-time difference of the ﬁle between the speciﬁed

points of time. The ﬁle of this output is called a diff

ﬁle. Figure 1 is an example of a diff ﬁle, as the result

of comparison between revisions 249 and 270. The

line with a single minus sign preceding Chinese let-

ters is the one which the text ﬁle of revision 249 in-

cludes but that of revision 270 does not, and the plus

sign means the reverse. The three lines before or after

them show the common context. It is easy to know the

difference by character, not only with our eyes, but in

an automatic way by writing a program.

Figure 1: Example of diff ﬁle.

The typical ﬁle management using Subversion

separates the disk storage for the repository from the

working copies. Various tools for the servers and for

the clients are available while the hosting services of

the repositories (for example (Unfuddle, )) are offered

on the Web. In contrast, our system puts both data in

a Web server and invokes the

svn

command within

it. Therefore the user can enjoy the services using his

or her favorite browser without regard to the working

copy and other notions of Subversion.

We brieﬂy explain how to use our system. A user

gains access to the top page with a favorite browser,

and logs in by entering the user name and the pass-

word. If it proves successful, then there is displayed a

page for selecting the sutra to read and edit. After that,

he or she sees a shot image and a text side by side. On

either one of the components, the user can carry out

the drag operation so that the both portions of the im-

age and the text may move according to the operation

by pixel. Note that the selection using the drag on

the text part is prohibited but the smooth scrolling is

available instead. There are found several button on

the page for jump the display position quickly. Af-

ter ﬁnding a mismatch, he or she can push the button

for switching the review mode and the edit mode; in

the edit mode, a text form for the horizontal writing

is provided to modify the line (Fig. 2). The modi-

ﬁed line is highlighted on the text portion (Fig. 3; The

second vertical line to the right of the text part grows

down a bit.) but the commit is not done yet. When

the button for commit is pressed, the modiﬁcations

are transmitted to the server, and he or she looks at

the review mode where there is no highlighted line.

Figure 2: Screenshot before modiﬁcation.

Figure 3: Screenshot after modiﬁcation.

The following two ways of making use of the sys-

tem are inside the scope of the assumption.

Auxiliary Function of Review Tool. A user ordinar-

ily enjoys our service as a digital book of Buddhist

sutras, but if the divergent characters are found,

then he or she can correct the text.

Transcription Support Tool. A worker observes the

shot image and the text of a sutra to resolve the

mismatch by modifying the text. His or her goal is

to make a document where the text is completely

coincident with the sutra images.

WEBIST 2010 - 6th International Conference on Web Information Systems and Technologies

152

3 EVALUATION EXPERIMENT

We performed laboratory experiments to verify the

usefulness of our system. Five college students put

the system in use to make a text ﬁle which coincides

with the shot image of Kongoji Issaikyo. We got two

sutras ready for edit. Any of the sutras has about

450 lines while each line mostly holds 17 Chinese

letters. With consideration for the enlisters’ health,

each person read and edited a sutra for three quarter

hour. Prior to the performance, we prepared the text

ﬁle where the region which does not appear in the cor-

responding shot image is deleted and the positions of

newline character were ﬁtted in with the image.

We would like to discuss how to judge whether

each modiﬁcation is valid or not, under the assump-

tion that there does not exist a correct text which is

completely the same as the shot images with regard to

the content. A simple way is that users edit a text ﬁle

simultaneously or during separate intervals. However

this naive method has at least two defects. First, when

a collision arises, i.e. an attempt of commit overlaps

the previously approved commit, it is harder to re-

solve the collision than a standard ﬁle management

using Subversion, since the user cannot directly con-

tact with the working copy in our system. The other

reason is as to the inequality. For example, after a

pioneer ﬁnds lots of mismatches and changes them,

other co-workers recognize much less discrepancies.

In this case, the ﬁrst worker apparently operate harder

according to the commit record. Another possibility

attaches importance to later changes; earlier changes

might be rewritten again although the last commit will

not be examined by the others. Some of the troubles

described adobe may be resolved through some sort of

control out of our system, but such a constraint might

provide an awkward execution environment.

In our experiments, each worker began with the

text ﬁle which is common if the referred sutra is the

same. Actually after a worker ended the reading and

the edit, the revision was wound back to that of ini-

tial state. (But it is obvious that the rewind would

not blot out the previous commits.) It is true that

the time-multiplexed operation like this seems inef-

fective with regard to the experiment and practical

use, but we are improving the system so as to permit

the concurrent use for a common sutra by introducing

branches which are broadly accepted in Subversion-

based ﬁle managements. After all the operation were

ﬁnished, we invoked the

svn diff

command to ob-

tain the diff ﬁles. As the result of reading the mod-

iﬁed contents, we decided that the modiﬁed position

where at least two persons rewrote is adopted. Fur-

thermore we delivered the adoption judgment for all

the modiﬁcations by comparing each diff ﬁle with the

consolidated one.

After the workers had rewritten the mismatched

characters using our system for three quarter hour per

sutra, we processed their achievements by means of

the above procedure. As seen in Table 1 (Fukuoka

et al., 2009), the number of examined lines ranged 33

to 109, the number of modiﬁcation 8 to 26, and the

number of contribution 3 to 11. There does not exist

any collision having need of choice where a character

is replaced by different ones. When we cast a spot-

light on the consolidated text ﬁles, there were found

41 and 43 modiﬁcations in the respective sutras which

someone had made, and 28 and 14 modiﬁcations were

done by two workers or more.

Table 1: Workers’ activities.

Sutra Worker Lines Mods Adopts

1 A 70 8 4

1 B 50 18 11

1 C 69 13 4

1 D 36 14 5

1 E 38 9 4

2 A 109 26 10

2 B 64 10 5

2 C 75 8 6

2 D 33 16 3

2 E 96 9 8

Lines: Number of examined lines.

Mods: Number of modiﬁcations.

Adopts: Number of adoptions.

4 INTEGRATION OF THE

MODIFICATION WORKS

In this section, we formulate the process where the

consolidated version of text is constructed out of the

diff ﬁles of the workers.

In advance, an unrevised text ﬁle is split into char-

acters to which identiﬁcation codes are assigned re-

spectively. The straightforwardway would be to num-

ber the characters serially, and the rest of this paper

adopts it. We can instead use a pair of the line num-

ber and the position of the line, as long as the values

are ordered along with the order of characters. For a

valid integer p, the character which correspondsto the

number p according to the above numbering is called

“the character at p”. By using this coding, we formu-

late the modiﬁcation of a worker.

If a worker insert a character α just after the char-

acter at p, then the action is denoted as p+ α. In the

case of prepending a character α to the text, 0 + α is

TRANSCRIPTION SUPPORT SYSTEM USING SUBVERSION

153

the code. Putting a string say α

···α

in right after

the character at p is associated with p + α

·· ·α

The deletion of the character at p is written as p−.

We present the form of the substitution as well,

instead of combining the insertion and the dele-

tion. If the character at p is replaced by a sequence

·· ·β

, then the code is p!β

·· ·β

. Typically

the substituted character is only one say β, and then

p!β is the desired word. The instruction where the

string of n characters is replaced by that of m charac-

ters is represented as a replacement word and m − 1

deletion words. Moreover we deﬁne a correction

word in terms of the word formed by a number, a

symbol which is one of the plus sign, the minus sign

or the exclamation mark, and a string which is empty

following the minus sign.

A given diff ﬁle is transformed into the sequence

of the correction words. To cope with complicated

differences, we may deploy the algorithm for calcu-

lating the Levenshtein distance (Gusﬁeld, 1997). All

the correction words are sorted by the position, the

operator and the trailing string. Once extracting the

list which is called the correction array, we can make

the resulting text ﬁle of a worker from the original

text ﬁle and the correction array; all we have to do is

to apply each correction word of the array in reverse.

Finally the correction array is the essence of the dif-

ferential with regard to the character, whereas the diff

ﬁle indicates the information of lines.

Now we explain how to consolidate the workers’

fruits through their correction arrays (see Fig. 4). Let

N be the number of the workers, and W

stand for the

correction array for the i-th worker. The next step is

to conﬁgure the arrayW by sorting the union ofW

for

all 1 ≤ i ≤ N. Assume that W

and W can be regarded

both as the sets and as the arrays, i.e. the judgment

of belongingness and the random access with an in-

dex are available to the discrete or consolidated cor-

rection array. However W should not be the set that

we would like to obtain, since W may include inad-

equate instructions. In addition, there might exist a

pair of correction words (called collision words) with

a common position and different instructions of in-

sertion, deletion or displacement, in W, although W

does not hold such a couple. We denote #W to rep-

resent the number of correction words in W and W[ j]

(1 ≤ j ≤ #W) to mean the j-th word of the array W.

The matrix A is composed which has N rows and

#W columns. The component A

is determined by

comparing W with W

, that is:

= 1 if W[ j] ∈ W

. (1)

= 0 if W[ j] 6∈ W

. (2)

By deﬁnition, each column has at least one 1 compo-

nent. If the number of ones is larger than a thresh-

Figure 4: Workﬂow of integrating workers’ modiﬁcation.

old which may be a constant or a half of the workers,

then the modiﬁcation corresponding to the column is

accepted. When a small threshold is employed, col-

lision words may be accepted, and then either one

should be selected in some fashion. Finally we can

obtain the matrix A

′

whose columns are the accepted

ones and the valid correction array W

′

composed of

the accepted correction words.

When applying the correction word of W

′

re-

versely to the text ﬁle before modiﬁcation, we will

get hold of the text ﬁle where the workers’ results are

consolidated.

We can quantify the i-th worker’s contribution by

comparing W

with W

′

. In concrete terms, #(W

∩W

′

)

means the number of worker i’s modiﬁcations which

are instrumental in making the resulting text ﬁle, or

the adoption number. The rejection number is derived

by #(W

− W

′

) where the minus sign denotes the set

subtraction. These numberscan convert into the ratios

using the division by #W

, which may be more suitable

for evaluating the worker’s performance.

To make sure that this formulation works well, we

apply it to a simpliﬁed example of a string manipu-

lation. Let T

BEFORE

be “japanese”, and assume that

three workers have changed the sequence of charac-

ters into “jopan´es”, “japon´es” and “japonese” respec-

tively. The values of the strings, arrays and matrices

according to the above formulation are as follows:

BEFORE

= “japanese”. (3)

= “jopan´es”. (4)

= “japon´es”. (5)

= “japonese”. (6)

–D

are omitted since they are no use in this case.)

= {2!o, 6!´e, 8−}. (7)

WEBIST 2010 - 6th International Conference on Web Information Systems and Technologies

154

= {4!o, 6!´e, 8−}. (8)

= {4!o}. (9)

W = {2!o, 4!o, 6!´e, 8−}. (10)

A =





1 0 1 1

0 1 1 1

0 1 0 0





. (11)

′





0 1 1

1 1 1

1 0 0





. (12)

′

= {4!o, 6!´e, 8−}. (13)

By applying the correction words in W

′

to T

BEFORE

reverse, via “japanes” and “japan´es”, we ﬁnally ob-

tain the string “japon´es” which is assigned to T

AFTER

Let the contribution be the number of ones of the

row of A

′

which corresponds to the worker (equal to

the adoption number), then:

= 2. (14)

= 3. (15)

= 1. (16)

In the case of evaluation experiment in the previ-

ous section, the column of Adopts in Table 1 is equal

to the contribution mentioned just above. When di-

viding the values by that in the column of Mods re-

spectively, we can obtain the ratios of adoption. The

ratios varied from 3/16 to 8/9.

5 CONCLUSIONS

In the anterior half of this paper, we have introduced

the documentation support system which utilizes Sub-

version and enables one to read the shot image and the

text at a time. The second half was devoted to provide

the workﬂow of consolidating text ﬁle from the sep-

arated achievements of the workers. The more work-

ers are engaged in the revision, the more mismatches

will be detected, but we have no idea of the quantita-

tive relationship among the numbers of workers, mis-

matches, and what should be corrected. We will have

to conduct experiments from this viewpoint, as well

as continue to improve the interface for practical use

by the researchers of Buddhism or historical science.

This application is able to manage text ﬁles by

comparing the ﬁles with the image ﬁles of copytext,

no matter what language the copytext was written in.

We request that the text should consist of Unicode

characters, since the system is obliged to output diff

ﬁles where UTF-8 is used as the character code. If

you would like to express the character outside this

encoding, then it should be speciﬁed as a string which

is regarded as a character in going for the correction

word, using some sort of escape sequence.

We developed the documentation support system

under the assumption that the text ﬁles which resem-

ble the shot images of sutra in content are available.

For applying to other sorts of materials, it would be

one way to make T

BEFORE

with a character sensing

equipment which may not be accurate enough.

REFERENCES

Alshuhri, S. S. (2008). Arabic manuscripts in a digital li-

brary context. In 11th International Conference on

Asian Digital Libraries (ICADL 2008), LNCS 5362,

pages 387–393.

CBETA. Chinese Buddhist Electronic Text Association

(CBETA). http://www.cbeta.org/. (in Chinese).

Chen, S.-P., Hsiang, J., Tu, H.-C., and Wu, M. (2007). On

building a full-text digital library of historical docu-

ments. In 10th International Conference on Asian

Digital Libraries (ICADL 2007), LNCS 4822, pages

49–60.

EPOCH (2005). Report on common research agenda.

Fukuoka, H., Murakawa, T., Noda, D., and Nakagawa, M.

(2009). Evaluation of support system for revising

Buddhist canons using subversion. In IPSJ Sympo-

sium, Vol.2009, No.16, pages 61–66. (in Japanese).

Gusﬁeld, D. (1997). Algorithms on strings, trees, and se-

quences: Computer science and computational biol-

ogy. Cambridge University Press.

Ribeiro, C., David, G., and Calistru, C. (2007). Multimedia

in cultural heritage collections: A model and appli-

cations. In 10th International Conference on Asian

Digital Libraries (ICADL 2007), LNCS 4822, pages

186–195.

SVN. Subversion. http://subversion.tigris.org/.

Tanaka, T., Nino, Y., Zhang, R., Rolland, M., Nakagawa,

M., Aoki, S., Utsunomiya, K., and Ochiai, T. (2006).

A database system of Buddhist canons. In Seventh

Joint Conference on Knowledge-Based Software En-

gineering, pages 327–336.

Unfuddle. http://unfuddle.com/.

TRANSCRIPTION SUPPORT SYSTEM USING SUBVERSION

155