WRITING SUPPORT SYSTEM

DEALING WITH NOTATIONAL VARIANT SELECTION

Aya Nishikawa, Ryo Nishimura, Yasuhiko Watanabe and Yoshihiro Okada

Ryukoku University, Dep. of Media Informatics, Seta, Otsu, Shiga, Japan

Keywords:

Writing support system, Dominant notational variant, κ values.

Abstract:

In Japanese, there are a large number of notational variants of words. This is because Japanese words are written in three kinds of characters: Kanji (Chinese) characters, Hiragana letters, and Katakana letters. Japanese students study the basic rules of Japanese writing in school for many years. However, it is difficult to learn which notational variant is suitable for official, business, and technical documents because the rules have many exceptions. From the viewpoint of information retrieval, a considerable number of studies have been made on notational variants; however, previous Japanese writing support systems did not deal with them sufficiently, because their main purpose was misspelling detection. Nondominant notational variants are not misspellings, but they are often unsuitable for official, business, or technical documents. To solve this problem, we developed a writing support system which detects nondominant notational variants in students' reports and shows the dominant ones to the students. This system is based on the idea that suitable notational variants are used dominantly in official, business, and technical documents. In this study, we first show the diversity of notational variants of Japanese words and how to develop the notational variant dictionaries by which our system determines which notational variant is dominant in official, business, and technical documents. Finally, we report a control experiment that shows the effectiveness of our system.

1 INTRODUCTION

In English, there are few words which are spelled in

several different ways, such as, color and colour. In

contrast, in Japanese, there are a large number of no-

tational variants of words. This is because Japanese

words are written in three kinds of characters:

• Kanji (Chinese) characters,

• Hiragana letters, and

• Katakana letters.

For example, sakura [cherry blossom], one of the symbols of Japan, is written in three ways, as shown in Figure 1. The basic rules of Japanese writing are announced by the Cabinet, and Japanese students study them in school for many years. However, it is difficult to learn the rules because they have many exceptions. In fact, we often find confusion of notational variants in Japanese university students' reports, including notational variants unsuitable for official, business, and technical documents. It is therefore important for students to learn which notational variant is suitable for official, business, and technical documents. To solve this problem, we developed a writing support system which detects unsuitable notational variants in students' reports and shows suitable ones to the students.

Figure 1: Notational variants of sakura.

In this study, we assumed that suitable notational variants are used dominantly in official, business, and technical documents, whereas unsuitable ones are infrequent or not found in these documents. If this assumption holds, unsuitable notational variants can be detected by checking whether they are used dominantly in official, business, and technical documents. In this study, we use the term dominant notational variant of a word to refer to the most frequent notational variant of the word. Furthermore, our system shows the frequencies of notational variants to the students because they are objective and concrete measures. As a result, the system gives the students a chance to consider why they used nondominant notational variants.

Nishikawa A., Nishimura R., Watanabe Y. and Okada Y. (2009).
WRITING SUPPORT SYSTEM DEALING WITH NOTATIONAL VARIANT SELECTION.
In Proceedings of the First International Conference on Computer Supported Education, pages 73-80
DOI: 10.5220/0001973800730080
SciTePress

names of plants            Hiragana   Katakana   Kanji+
sakura [cherry blossom]         184         39      736
bara [rose]                       0        217        0
himawari [sunflower]             42          8        0
tsubaki [camellia]                9         25       83
tsutsuji [azalea]                 5         15        0
ringo [apple]                     8         71       10
mikan [orange]                   66         37        2

Figure 2: The frequencies of notational variants of nouns (plant names) in the newspaper articles [Mainichi Newspaper (Jan. 2006 – June 2006)].

There are two reasons why our system does not replace nondominant notational variants with dominant ones automatically:

• It is not appropriate to restrict the use of nondominant notational variants, because the variety of notational variants is one of the sources of the richness of Japanese expression.

• It is important, especially in educational institutions, for students to consider why they used nondominant notational variants and to choose suitable ones themselves.

From the viewpoint of information retrieval, a considerable number of studies have been made on notational variants (Kubomura 03) (Kouda 06) (Bamba 08). However, spell checkers in Japanese word processors, such as Microsoft Word 2007, and previous Japanese writing support systems did not deal with notational variants sufficiently (Shimomura 92) (Araki 93) (Murata 01), because their main purpose was misspelling detection. Nondominant notational variants are not misspellings, but they are often unsuitable for official, business, or technical documents. Yokoyama dealt with variants of Kanji characters (Yokoyama 06), but not with variants of words. Furthermore, he did not consider the variant problem from the viewpoint of document domains. Dominant notational variants may vary with document domains. For example, in newspaper articles, sakura is dominantly written in a Kanji character, whereas in biology documents it is dominantly written in Katakana letters. Our system can deal with this problem flexibly by switching dictionaries of notational variants, which were developed by using official, business, and technical documents in several domains.

connection words             Hiragana   Kanji+
tatoeba [for example]             273      570
shitagatte [consequently]          21       26
tadashi [however]                 343        0
ippou [on the contrary]             1     2879
mata [also, in addition]         4895        8
sarani [furthermore]             2677       24

Figure 3: The frequencies of notational variants of connection words in the newspaper articles [Mainichi Newspaper (Jan. 2006 – June 2006)].

2 NOTATIONAL VARIANTS OF

JAPANESE WORDS

In this section, in order to show the diversity of no-

tational variants of Japanese words, we will show no-

tational variants of nouns, connection words, and de-

clinable words.

2.1 Notational Variants of Japanese

Nouns

In case of Japanese nouns, notational variants can be

classiﬁed into three types:

• words consisting of Hiragana letters,

• words consisting of Katakana letters, and

• words consisting of Kanji characters and, occasionally, Hiragana and Katakana letters.

Figure 2 shows the frequencies of notational variants

of plant names in the Mainichi newspaper articles

(Jan. 2006 – June 2006). As shown in Figure 2, dom-

inant ways of writing plant names are inconsistent.

2.2 Notational Variants of Japanese

Connection Words

Connection words are important words in students’

reports because they make the relationships between

sentences and ideas smoother and clearer. In case of

Japanese connection words, notational variants can be

classiﬁed into two types:

• words consisting of Hiragana letters, and

• words consisting of Kanji characters and, occasionally, Hiragana letters.

Figure 3 shows the frequencies of notational variants of connection words in the Mainichi newspaper articles (Jan. 2006 – June 2006). As shown in Figure 3, dominant ways of writing connection words are inconsistent.

CSEDU 2009 - International Conference on Computer Supported Education

declinable words      Hiragana   Katakana   Kanji+
yasashii [easy]            188          0        9
muzukashii [hard]           21          0     1524

(a) The frequencies of antonymous words: yasashii [easy] and muzukashii [hard].

declinable words      Hiragana       Kanji+ (1)       Kanji+ (2)
mijikai [short]       mijikai: 0     mijika-i: 362    miji-kai: 0
okonau [conduct]      okonau: 15     okona-u: 9       oko-nau: 2152
kawaru [change]       kawaru: 15     kawa-ru: 9       ka-waru: 2152
arawasu [show]        arawasu: 7     arawa-su: 283    ara-wasu: 1

(b) The frequencies of declinable words with declensional Kana ending. Declensional Kana endings of Kanji+(1) are shorter than those of Kanji+(2). Bold letters represent Kanji characters.

Figure 4: The frequencies of notational variants of declinable words in the newspaper articles [Mainichi Newspaper (Jan. 2006 – June 2006)].

2.3 Notational Variants of Japanese

Declinable Words

In case of Japanese declinable words, notational vari-

ants can be classiﬁed into three types:

• words consisting of Hiragana letters,

• words consisting of Katakana letters followed by the Hiragana letters “suru”, and

• words consisting of Kanji characters with a declensional Kana (Hiragana) ending.

Figure 4 (a) shows the frequencies of notational variants of the antonymous words yasashii [easy] and muzukashii [hard] in the Mainichi newspaper articles (Jan. 2006 – June 2006). Yasashii [easy] is dominantly written in Hiragana letters, whereas muzukashii [hard] is dominantly written in Kanji characters with a declensional Kana (Hiragana) ending. In other words, the contrast between yasashii [easy] and muzukashii [hard] is broken from the viewpoint of the dominant way of writing. Both yasashii [easy] and muzukashii [hard] have one type of declensional Kana ending: -shii. As a result, each has one variant with a declensional Kana ending, yasa-shii and muzuka-shii, respectively. However, a considerable number of declinable words have two types of declensional Kana ending and, as a result, two variants with a declensional Kana ending. For example, kawaru [change] has two types of declensional Kana ending, -ru and -waru. As a result, kawaru [change] has two variants with a declensional Kana ending, kawa-ru and ka-waru. Figure 4 (b) shows the frequencies of notational variants of declinable words with declensional Kana endings in the Mainichi newspaper articles (Jan. 2006 – June 2006). It also shows that dominant ways of writing declensional Kana endings are inconsistent. Declensional Kana endings are one of the most troubling aspects of notational variants. Japanese students are often confused about declensional Kana endings. As a result, we often encounter confused declensional Kana endings in their reports.

One of the authors dislikes this violation of the contrast and always writes muzukashii [hard] in Hiragana letters in his works.

Bold letters represent Kanji characters.

Figure 5: System overview.

3 WRITING SUPPORT SYSTEM

BASED ON NOTATIONAL

VARIANT DICTIONARIES

3.1 System Overview

Figure 5 shows the overview of our system. Our system is based on the idea that suitable notational variants are used dominantly in official, business, and technical documents. Figure 6 shows an example of how to use our writing support system. As shown in Figure 6, users access the system and send input sentences to it from web browsers through CGI-based HTML forms. Input sentences are segmented into words by using a Japanese morphological analyzer, JUMAN (Kurohashi 05). Then, by using notational variant dictionaries, the system checks whether the notational variants of the words are used dominantly in official, business, and technical documents. When the system detects a nondominant notational variant of a word in an input sentence, the variant is underlined and turns red, and the system shows the frequency information of the notational variants of the word, giving users a chance to consider why they used the nondominant variant. In Figure 6 (a), a user gives an input sentence, tabako wo yameru no ha muzukashii [it is hard to stop smoking], to the system. Then, as shown in Figure 6 (b), the system detects a nondominant notational variant, muzukashii [hard], in the input sentence. Muzukashii [hard] is underlined and turns red, and the frequency information is shown. Thus, the key to detecting nondominant notational variants is the notational variant dictionaries. In Section 3.2, we show how to develop them.

(a) An input sentence, tabako wo yameru no ha muzukashii [it is hard to stop smoking], is given to the system.

(b) The system detects a nondominant notational variant, muzukashii [hard], in the input sentence and shows the frequency information of the word in the newspaper articles and technical documents.

Figure 6: An example of how to use our writing support system. English system messages are inserted ad hoc for the convenience of non-Japanese readers of this paper.
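The lookup step described in this section can be sketched as follows. This is only an illustration, not the authors' implementation: the `Token` type, the `detect_nondominant` function, and the in-memory dictionary layout are assumptions, the segmentation that JUMAN performs is replaced by a hand-written token list, and romanized strings stand in for the actual Japanese spellings. The frequencies for muzukashii are the newspaper counts reported in Figure 4 (a).

```python
from typing import Dict, List, NamedTuple, Tuple


class Token(NamedTuple):
    surface: str        # the notational variant as written in the input
    variant_label: str  # label shared by all variants of the same word


# Hypothetical dictionary layout: variant label -> {variant: frequency}.
NEWSPAPER_DICT: Dict[str, Dict[str, int]] = {
    "muzukashii": {"muzukashii": 21, "muzuka-shii": 1524},
}


def detect_nondominant(
    tokens: List[Token], dictionary: Dict[str, Dict[str, int]]
) -> List[Tuple[Token, str, Dict[str, int]]]:
    """Return (token, dominant variant, frequency table) for each flagged token."""
    flagged = []
    for token in tokens:
        freqs = dictionary.get(token.variant_label)
        if not freqs:
            continue  # the word is not covered by this dictionary
        dominant = max(freqs, key=freqs.get)
        if token.surface != dominant:
            flagged.append((token, dominant, freqs))
    return flagged


# Hand-segmented stand-in for "tabako wo yameru no ha muzukashii".
sentence = [
    Token("tabako", "tabako"),
    Token("wo", "wo"),
    Token("yameru", "yameru"),
    Token("no", "no"),
    Token("ha", "ha"),
    Token("muzukashii", "muzukashii"),  # written in Hiragana
]

for token, dominant, freqs in detect_nondominant(sentence, NEWSPAPER_DICT):
    print(f"{token.surface}: nondominant; dominant variant is {dominant} {freqs}")
```

Under this layout, switching to another document domain amounts to passing a different dictionary as the second argument.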

3.2 Development of Notational Variant

Dictionaries

In this study, we assumed that suitable notational vari-

ants are used dominantly in ofﬁcial, business, or tech-

nical documents, on the other hand, unsuitable ones

are inferior or not found in these documents. If the as-

sumption is proper, unsuitable notational variants can

be detected by conﬁrming whether they are used dom-

inantly in ofﬁcial, business, or technical documents.

In order to conﬁrm whether notational variants are

used dominantly, we extracted examples of notational

variants from

• 296364 newspaper articles published in the

Mainichi Newspaper from January 2006 to June

2006 (Mainichi 07).

• 319 technical reports published in the 12th Annual

Meeting of the Association for Natural Language

Processing (2006).

and developed notational variant dictionaries. In this

study, we used newspaper articles because we aimed to acquire notational variants of words which are used in various domains. On the other hand, we used technical reports because we aimed to acquire notational

variants of words in speciﬁc domains and develop do-

main speciﬁc dictionaries of notational variants. The

reason why we developed domain speciﬁc dictionar-

ies of notational variants was that dominant nota-

tional variants may vary with document domains. By

switching domain speciﬁc dictionaries of notational

variants, our system can conﬁrm whether notational

variants are suitable to compose documents in the spe-

ciﬁc domains. In this study, we acquired notational

variants in a speciﬁc domain from technical reports

published in the Annual Meeting of the Association

for Natural Language Processing (2006). Some of the

technical reports were given to the students, who took

part in the experiment described in Section 4, as ref-

erence works. This is one reason why we extracted

examples of notational variants from the technical re-

ports. Sentences in these documents were segmented

into words by using a Japanese morphological ana-

lyzer, JUMAN (Kurohashi 05). When JUMAN ﬁnds

a notational variant, it gives a variant label to the vari-

ant. The same variant label is given to notational vari-

ants of a word. By using these variant labels, we ex-

tracted notational variants and developed two dictio-

naries of

• notational variants in newspaper articles, and

• notational variants in technical reports of natural

language processing.

Table 1 shows the results of the notational variant ex-

traction from newspaper articles and technical docu-

ments. The most frequent notational variant of each

word was considered as the dominant notational vari-

ant.
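The extraction step can be illustrated with a small sketch. It assumes that the analyzer output has already been reduced to (variant label, variant) pairs; the function names and the toy input are hypothetical and are not taken from JUMAN or from the authors' code.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_variant_dictionary(
    labeled_tokens: Iterable[Tuple[str, str]]
) -> Dict[str, Counter]:
    """Count occurrences of each variant, grouped by variant label."""
    dictionary: Dict[str, Counter] = defaultdict(Counter)
    for variant_label, variant in labeled_tokens:
        dictionary[variant_label][variant] += 1
    return dictionary


def dominant_variant(freqs: Counter) -> str:
    """Most frequent variant; ties go to the variant seen first."""
    return max(freqs, key=freqs.get)


# Toy input invented for illustration: (variant label, variant) pairs.
toy_tokens = [
    ("sakura", "sakura/Kanji"),
    ("sakura", "sakura/Hiragana"),
    ("sakura", "sakura/Kanji"),
    ("mata", "mata/Hiragana"),
]

toy_dictionary = build_variant_dictionary(toy_tokens)
for label, freqs in toy_dictionary.items():
    print(label, dict(freqs), "->", dominant_variant(freqs))
```

Running the same function once per corpus would give one dictionary per domain, which matches the two dictionaries listed above.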

As shown in Table 1, notational variants of 27988

and 9211 words were extracted from the newspaper

articles and technical documents, respectively. These

words can be classiﬁed into two types:

TYPE I a word of this type has actually two or more

notational variants, however, only one of them

was found in the newspaper articles or technical

documents.

TYPE II a word of this type has two or more nota-

tional variants which were found in the newspaper

articles or technical documents.
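As a rough illustration of the two types, the sketch below classifies a word by the number of its variants that were actually observed. The function name `word_type` is hypothetical, and the sketch assumes that each frequency table lists every known variant of the word, including those with zero occurrences. The counts are the plant-name frequencies from Figure 2.

```python
from typing import Dict, Optional


def word_type(freqs: Dict[str, int]) -> Optional[str]:
    """Classify a word by how many of its known variants occur in the corpus."""
    observed = sum(1 for count in freqs.values() if count > 0)
    if observed == 0:
        return None  # the word does not occur in this corpus at all
    return "TYPE I" if observed == 1 else "TYPE II"


# Frequencies taken from Figure 2 (newspaper articles).
bara = {"Hiragana": 0, "Katakana": 217, "Kanji+": 0}
sakura = {"Hiragana": 184, "Katakana": 39, "Kanji+": 736}

print("bara:", word_type(bara))
print("sakura:", word_type(sakura))
```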

Table 2 shows the unique and total number of no-

tational variants of TYPE II words in the newspa-

per articles and technical documents. In order to

show how much the dominant notational variant of

a word is used dominantly, we introduced dominant

degree. Suppose that a word has notational variant i

(i = 1, ··· ,N). The dominant degree of the word is

calculated as follows:

\[
d = \frac{f_{\mathrm{dom}}}{\sum_{i=1}^{N} f_i}
\]

where $d$ is the dominant degree of the word, and $f_i$ and $f_{\mathrm{dom}}$ are the frequencies of notational variant $i$ and the dominant notational variant of the word, respectively.
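A minimal sketch of this calculation follows, assuming that the dominant degree is the share of the dominant (most frequent) variant among all occurrences of the word. The function name `dominant_degree` is hypothetical; the counts come from Figure 2 and Figure 4 (a).

```python
from typing import Dict


def dominant_degree(freqs: Dict[str, int]) -> float:
    """Frequency of the most frequent variant divided by the total frequency."""
    total = sum(freqs.values())
    if total == 0:
        raise ValueError("the word does not occur in the corpus")
    return max(freqs.values()) / total


# Newspaper frequencies from Figure 4 (a) and Figure 2.
muzukashii = {"Hiragana": 21, "Katakana": 0, "Kanji+": 1524}
sakura = {"Hiragana": 184, "Katakana": 39, "Kanji+": 736}

print(f"muzukashii: {dominant_degree(muzukashii):.3f}")  # 1524 / 1545
print(f"sakura:     {dominant_degree(sakura):.3f}")      # 736 / 959
```

With these counts, muzukashii has a dominant degree of about 0.986, while sakura reaches only about 0.767, which shows how strongly the degree can differ between words.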

Figure 7 shows the histograms of the dominant degrees of TYPE II words in the newspaper articles and technical documents. In Figure 7, the broken lines show the histograms of the dominant degrees of all the TYPE II words in the newspaper articles and technical documents. On the other hand, the thick lines show


Table 1: The results of the notational variant extraction from the newspaper articles and technical documents.

part of         unique # of words    unique # of            total # of
speech          (variant labels)     notational variants    notational variants
noun                       20603                  26747                3656574
verb                        3897                   6403                1283024
adjective                   2120                   2830                 280787
adverb                      1125                   1607                 115609
conjunction                   87                    100                  30850
interjection                  80                     97                   2643
attributive                   75                     98                  10946
prefix                         1                      3                  10891
Total                      27988                  37885                5391324

(a) The results of the notational variant extraction from the newspaper articles [Mainichi Newspaper (Jan. 2006 – June 2006)].

part of         unique # of words    unique # of            total # of
speech          (variant labels)     notational variants    notational variants
noun                        6458                   7154                 310980
verb                        1548                   2093                 101398
adjective                    706                    825                  22952
adverb                       376                    459                  13037
conjunction                   60                     71                   4465
interjection                  30                     33                    148
attributive                   32                     39                   1192
prefix                         1                      3                    302
Total                       9211                  10677                 454474

(b) The results of the notational variant extraction from the technical documents [the Annual Meeting of the Association for Natural Language Processing (2006)].

Table 2: The unique and total number of notational variants of TYPE II words in the newspaper articles and technical documents. A TYPE II word has two or more notational variants which were found in the newspaper articles / technical documents.

part of         unique # of words    unique # of            total # of
speech          (variant labels)     notational variants    notational variants
noun                        5328                  11472                1817055
verb                        2135                   4641                 916302
adjective                    628                   1338                 176374
adverb                       440                    922                  72251
conjunction                   13                     26                  12980
interjection                  15                     32                    593
attributive                   22                     45                   8853
prefix                         1                      3                  10891
Total                       8582                  18479                3015299

(a) The unique and total number of notational variants of TYPE II words in the newspaper articles [Mainichi Newspaper (Jan. 2006 – June 2006)].

part of         unique # of words    unique # of            total # of
speech          (variant labels)     notational variants    notational variants
noun                         644                   1340                  62848
verb                         508                   1053                  56058
adjective                    110                    229                   6253
adverb                        78                    161                   5617
conjunction                   11                     22                   1330
interjection                   3                      6                     13
attributive                    7                     14                    941
prefix                         1                      3                    302
Total                       1362                   2828                 133362

(b) The unique and total number of notational variants of TYPE II words in the technical documents [the Annual Meeting of the Association for Natural Language Processing (2006)].

[Figure 7, panels (a) and (b): histograms with x-axis "dominant degree" (0.2 to 1) and y-axis "word frequency"; two series per panel: "TYPE II words (used 10 times or more)" and "TYPE II words (all)".]

(a) The histograms of the dominant degrees of TYPE II words in the newspaper articles [Mainichi Newspaper (Jan. 2006 – June 2006)].

(b) The histograms of the dominant degrees of TYPE II words in the technical documents [the Annual Meeting of the Association for Natural Language Processing (2006)].

Figure 7: The histograms of the dominant degrees of TYPE II words in the newspaper articles and technical documents.


the histograms of the dominant degrees of TYPE II words whose notational variants were used 10 times or more in the newspaper articles and technical documents. We eliminated words whose notational variants were used fewer than 10 times in the newspaper articles and technical documents because, with so few samples, it is difficult to confirm which notational variant is used dominantly. As a result, we regarded dominant notational variants as credible when they satisfied the following conditions, and gave credibility labels to them.

• in case of a TYPE I word, the notational variant

of the word was used 10 times or more in the

newspaper articles or technical documents. 11825

and 2285 TYPE I words in the newspaper arti-

cles and technical documents, respectively, satis-

ﬁed this condition.

• in case of a TYPE II word, the sum of frequencies

of all the variants of the word was 10 or more,

and the dominant degree was 0.8 or more. 5270

and 590 TYPE II words in the newspaper arti-

cles and technical documents, respectively, satis-

ﬁed the above conditions.
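The two conditions above can be expressed as a short check. The sketch below is an illustration under stated assumptions: the names `is_credible`, `MIN_COUNT`, and `MIN_DEGREE` are hypothetical, the dominant degree is taken to be the share of the most frequent variant, and each frequency table is assumed to list all variants of one word in one corpus. The counts are newspaper frequencies from Figures 2, 3, and 4 (a).

```python
from typing import Dict

MIN_COUNT = 10    # minimum number of occurrences, as stated in the conditions
MIN_DEGREE = 0.8  # minimum dominant degree for TYPE II words


def is_credible(freqs: Dict[str, int]) -> bool:
    """Decide whether the dominant variant of a word gets a credibility label."""
    observed = [count for count in freqs.values() if count > 0]
    if not observed:
        return False  # the word does not occur in this corpus
    total = sum(observed)
    if len(observed) == 1:  # TYPE I: only one variant was found
        return total >= MIN_COUNT
    # TYPE II: enough samples and a sufficiently dominant variant
    return total >= MIN_COUNT and max(observed) / total >= MIN_DEGREE


examples = {
    "bara": {"Hiragana": 0, "Katakana": 217, "Kanji+": 0},
    "muzukashii": {"Hiragana": 21, "Katakana": 0, "Kanji+": 1524},
    "sakura": {"Hiragana": 184, "Katakana": 39, "Kanji+": 736},
    "shitagatte": {"Hiragana": 21, "Kanji+": 26},
}

for word, freqs in examples.items():
    print(word, is_credible(freqs))
```

Under these thresholds, sakura and shitagatte would not receive a credibility label in the newspaper dictionary, because no single variant reaches a dominant degree of 0.8.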

4 EXPERIMENTAL RESULTS

To evaluate our method, we conducted a control ex-

periment. We gave 10 problems of notational variant

selection to 20 subjects, university students in com-

puter science. Each problem consisted of two sen-

tences. The differences between the two sentences

were only notational variants. For example, the fol-

lowing sentences mean that it is hard to stop smoking:

• tabako wo yameru noha muzukashii

• tabako wo yameru noha muzuka-shii

The only difference between the two sentences above is muzukashii versus muzuka-shii. The former is written in Hiragana letters and the latter is written in Kanji characters (in bold letters) and Hiragana letters. The

subjects were requested to choose one of the sen-

tences, which seemed to be suitable for them to use

in ofﬁcial, business, and technical documents. Sub-

jects were classiﬁed into two groups, group A and B.

• subjects in group A were given only 10 problems

and no more information.

• subjects in group B were given the same 10 prob-

lems and frequency information of the notational

variants in the test materials.

Table 3: Experimental results.

group      κ value   rate of choosing dominant notational variants
group A    0.261     74%
group B    0.623     87%

Table 4: Interpretation of κ values.

κ              Interpretation
< 0            no agreement
0.0 - 0.20     slight agreement
0.21 - 0.40    fair agreement
0.41 - 0.60    moderate agreement
0.61 - 0.80    substantial agreement
0.81 - 1.00    almost perfect agreement

The frequency information of the notational variants was retrieved by our experimental writing support system. As shown in Figure 6 (b), when our system detects a nondominant notational variant of a word in an input sentence, it shows the frequency information of the notational variants of the word. For example, the frequency information of muzukashii and muzuka-shii was shown as follows:

                      muzukashii   muzuka-shii
newspaper articles            21          1524
technical reports              0           155

To evaluate the experimental results, we introduced two measurements: κ values and the rate of

choosing dominant notational variants (Table 3). κ

values are statistical measures for assessing the re-

liability of agreement between subjects. κ values

are generally thought to be more robust than simple

percent agreement calculation, in this case, the rate

of choosing dominant notational variants, because κ

values take into account the agreement occurring by

chance. Table 4 shows the interpretation of κ values

(Landis 77). As shown in Table 3 and 4, in this exper-

iment, there was fair agreement of notational variant

selection in group A. In other words, we were con-

fronted with the confusion of notational variants in

their answers. In each problem, some students chose

a nondominant (unsuitable) notational variant for no

reason and they were totally unaware of doing it. It

shows that the notational variant selection is a seri-

ous problem. On the other hand, there was substantial

agreement in group B. In addition, the rate of choosing dominant notational variants increased by 13 percentage points (from 74% to 87%) when the frequency information was given to

subjects. It shows that the frequency information of

notational variants is promising. It also implies that

students do not have conﬁdence in their notational

variant selection and ﬂexibly change their decisions

when the reasons are given to them. Actually, three

subjects in group B changed their decisions, and three

other subjects did not change but felt sure of their decisions. Some of them said that they could follow the system's advice more readily than a teacher's instructions given without concrete evidence. The other four subjects

in group B reported that the frequency information is

not necessary. Actually, one of them could choose

dominant variants correctly in all the problems, on the

other hand, the others could not. This is because they

obeyed a peculiar writing rule: they must use as many

Kanji characters as possible in their ofﬁcial, business,

and technical reports. This is the limitation of our

writing support system, and where a human instruc-

tor comes in.
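The paper does not state which κ statistic was computed. As one plausible reading for an experiment in which several subjects answer the same two-choice problems, the sketch below computes Fleiss' kappa and maps a κ value to the interpretation bands of Table 4. The choice of Fleiss' kappa, the function names, and the toy rating matrix are all assumptions made for illustration; the toy data are not the experimental data.

```python
from typing import List


def fleiss_kappa(ratings: List[List[int]]) -> float:
    """ratings[i][j] = number of subjects who chose option j in problem i.

    Every problem must be answered by the same number of subjects.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_options = len(ratings[0])
    # Observed agreement, averaged over problems.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(per_item) / n_items
    # Agreement expected by chance, from the overall share of each option.
    shares = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_options)
    ]
    p_chance = sum(s * s for s in shares)
    if p_chance == 1.0:
        return 1.0  # every subject chose the same option everywhere
    return (p_observed - p_chance) / (1.0 - p_chance)


def interpret_kappa(kappa: float) -> str:
    """Interpretation bands from Table 4 (Landis 77)."""
    if kappa < 0:
        return "no agreement"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return f"{label} agreement"
    return "almost perfect agreement"


# Toy matrix: 4 problems, 10 subjects, columns = (dominant, nondominant).
toy = [[9, 1], [7, 3], [2, 8], [10, 0]]
kappa = fleiss_kappa(toy)
print(f"kappa = {kappa:.3f} ({interpret_kappa(kappa)})")
print(interpret_kappa(0.261), "/", interpret_kappa(0.623))  # values in Table 3
```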

ACKNOWLEDGEMENTS

This research has been supported partly by the

Grant-in-Aid for Scientiﬁc Research (C) under Grant

No.20500106.

REFERENCES

Kubomura and Kameda: Information Retrieval System with

Abilities of Processing Katakana-Allographs, Trans.

of IEICE, Vol.J86-D-II, No.3, (2003).

Kouda: Search method of variant notations on a science and

technology document retrieval system, IPSJ SIG NL,

Vol.2006, No.118, (2006).

Bamba, Shinzato, and Kurohashi: Development of a Large-

scale Web Page Clustering System using an Open

Search Engine Infrastructure TSUBAKI, IPSJ SIG

NL, Vol.2008, No.4, (2008).

Shimomura, Namiki, Nakagawa, and Takahashi: A method

for detecting errors in Japanese sentences based

on morphological analysis using minimal cost path

search, Trans. of IPSJ, Vol.33, No.4, (1992).

Araki, Ikehara, and Tukahara: A method for detect-

ing and correcting of characters wrongly substituted,

deleted or inserted in Japanese strings using 2nd-order

Markov model, IPSJ SIG NL, Vol.93, No.79, (1993).

Murata and Isahara: Extraction of negative examples based

on positive examples: automatic detection of mis-

spelled Japanese expressions and relative clauses that

do not have case relations with their heads, IPSJ SIG

NL, Vol.2001, No.69, (2001).

Yokoyama: Can we predict preference for kanji form from

newspaper data on character frequency?, IPSJ SIG

CH, Vol.2006, No.10, (2006).

Kurohashi and Kawahara: JUMAN Manual version 5.1 (in

Japanese), Kyoto University, (2005).

Mainichi Shinbun CD-Rom data set 2006, Nichigai Asso-

ciates Co., (2007).

Landis and Koch: The measurement of observer agreement

for categorical data, Biometrics, Vol. 33, (1977).
