EFFICIENT LITERATURE RESEARCH

BASED ON SEMANTIC TAGNETS

Implemented and Evaluated for a German Text-corpus

Uta Christoph, Daniel G

otten and Karl-Heinz Krempels

Informatik 4, Intelligent Distributed Systems Group, RWTH Aachen University, Aachen, Germany

Keywords:

Semantic networks, Literature research, Text corpus, Text analysis.

Abstract:

In this paper we present an approach that is capable to automatically generate semantic tagnets for given

sets of german tags (keywords) and an arbitrary text corpus using three different analysis methods. The

resulting tagnets are used to estimate similarities between texts that are manually tagged with the keywords

from the given tagset. Basically, this approach can be used in digital libraries to provide an efﬁcient and

intuitive interface for literature research. Although it is mainly optimized for the german language the proposed

methods can easily be enhanced to generate tagnets for a given set of english keywords.

1 INTRODUCTION

Due to the large amount of data available in the world

wide web data structuring is an important topic to-

day. One famous approach is the semantic web as pro-

posed by Tim Berners-Lee (Berners-Lee et al., 2001).

The idea is to structure the available data by its se-

mantic meaning to provide much better access meth-

ods and possibilities for automatic analysis. Since no

suitable semantic wordnets are available for the ger-

man language the idea is to build a system that allows

to estimate semantic relations for a given set of ger-

man words. In the following an approach will be pre-

sented that structures keywords (tags) and documents

by their semantic information and similarities.

Normally, no semantic information is available for

information stored in digital libraries and solely sim-

ple search interfaces are provided that allow to search

for documents by its title or author names. Appar-

ently, research of similar documents is hard and time-

consuming in such cases. One approach to solve this

problem is to avail ss that are used to describe the

content of single documents and enable the search for

documents by their tags.

Although tags allow to categorize documents they

do not completely solve the problem described above

as illustrated by the following example. Given two

similar documents d

and d

annotated by tags from

two disjunct sets S

and S

. It might not be possi-

ble to recognize the similarities between d

and d

, if

no information about semantic relations between the

tags of both sets is given. Especially for a large set of

available similar tags to describe each document this

problem may occur quite frequently.

The presented approach is based on similarity

analysis of documents annotated manually with tags.

These document similarities are estimated by auto-

matically extracting semantic relations between the

tags of the given set. Both the given tags and the ex-

tracted relations form a network called tagnet. The

derived network of documents is called similarity net.

Apparently, both networks can be used to make liter-

ature research more efﬁcient, by visualizing them in

a suitable way. The resulting system provides a quite

intuitive interface that enables non-technical users to

research literature by surﬁng through the similarity

net.

At ﬁrst we discuss in Section 2 relevant problems

that may occur during automatic text analysis. We

show that these problems mainly result from morpho-

logical special cases of the german language. As pos-

sible solutions the approaches of lemmatization and

stemming will be discussed. Afterwards, the idea

of the implemented system (the tagnet builder) that

builds the described networks is introduced in Sec-

tion 3. Finally, the results are evaluated in Section 4.

Christoph U., GÃ˝utten D. and Krempels K.

EFFICIENT LITERATURE RESEARCH BASED ON SEMANTIC TAGNETS - Implemented and Evaluated for a German Text-corpus.

DOI: 10.5220/0002805400480054

In Proceedings of the 6th International Conference on Web Information Systems and Technology (WEBIST 2010), page

ISBN: 978-989-674-025-2

2 PROBLEMS

Automatic approaches of text analysis are quite error-

prone without adaptation to a given scenario. Espe-

cially in the case of automatic text analysis of ger-

man

texts problems occur due to the morphological

special cases in the german language. Concerning

statistical approaches in text analysis three problems

have to be taken into account:

1. multilingualism

2. morphologicial special cases

3. understanding of texts

Since the considered corpus solely consists of german

texts we do not consider multilingualism for the mo-

ment. For this reason only the problems (2) and (3)

have to be discussed.

2.1 Morphological Special Cases

The main objective of the tagnet builder is the extrac-

tion of semantic information from keywords based on

statistical frequency analysis on the considered text

corpus. These frequencies have to be as accurate as

possible, so a reliable mapping of all given keywords

to unique stems is required. Furthermore, numbers

and stop words can be deleted, since they are not rel-

evant for the analysis.

2.1.1 Stemming

Stemming is an algorithmical approach for the esti-

mation of unambiguous stems for words of a given

language. A famous approach is the porter stem-

mer (Porter, 1980). An implementation for german

words based on this approach is available in terms of

Snowball

Unfortunately, stemming algorithms do not gen-

erate real stems, but unique pseudo stems by sufﬁx

stripping based on predeﬁned rules without consider-

ation of grammatical characteristics. Especially for

the case of german words two different words might

be mapped to one stem although there exists no se-

mantic similarity. Thus, this simple stemming ap-

proach may deteriorate the quality of the semantic

tagnet generated by the tagnet builder.

2.1.2 Lemmatization

To provide mappings to real stems lemmatization has

to be used. This lemmatization is often based on full-

form lexicons that have to be deﬁned in prior. Un-

Or other inﬂectional complex languages.

http://snowball.tartarus.org

fortunately, no complete fullform lexicons are avail-

able for the german language due to the possibility of

compound words. This problem can be solved with

the help of a learning lemmatizer as it has been pro-

posed by P. Perera and R. Witt

e (Perera and Witte,

2005). Basically, this lemmatizer processes words in

three phases.

At ﬁrst it checks if the current word already is part

of the existing fullform lexicon to lemmatize it imme-

diately by its corresponding stem. If the current word

is not part of the existing lexicon, it will be classiﬁed

with respect to its POS-Tags

. Then suitable rules for

sufﬁx stripping will be applied to get candidate stems

that are returned and inserted into the lexicon

2.2 Understanding of Texts

Automatic understanding of texts is a complex prob-

lem and still a research topic today. Especially if fre-

quency analysis is used, problems may occur that are

referable to words with different meanings in subject

to the context.

The approach proposed in this paper will assume

that the tagnet builder is solely used for texts of one

topic. Thus, the described problem normally will not

occur in this scenario, but has to be addressed if texts

of different topics should be analysed.

3 APPROACH

Given an unstructured list of tags, the described sys-

tem is able to ﬁnd semantic relations between these

tags and store them in a suitable database. Based on

this semantic tagnet text similarities can be calculated

for a corpus of tagged texts.

3.1 Tagnet Builder

The tagnet builder uses three different approaches to

gain as much information about semantic relations

between the given tags as possible. These three ap-

proaches are:

1. rule-based extraction of semantic relations

2. lexicon-based extraction of semantic relations

3. statistical estimation of semantic relations based

on occurrences in a given set of documents

In the following the approaches will be described in

detail.

Part-of-speech tagging basically denotes the tagging of

words with its grammatical characteristics.

The lexicon may contain intermediately wrong stems,

that will be corrected over time.

EFFICIENT LITERATURE RESEARCH BASED ON SEMANTIC TAGNETS - Implemented and Evaluated for a

German Text-corpus

3.1.1 Rule-based Approach

The problem of compound words has been mentioned

in the sections above. Nevertheless, analysis of such

compounds can be utilized as a ﬁrst approach to rec-

ognize semantic relations between two words. As

an example, Fachsprache (terminology) and Sprache

(language) can be considered. Obviously, Sprache is

a sufﬁx and hypernym

of Fachsprache.

A sufﬁx matching based on regular expressions is

used to recognize such hypernym relations between

different tags. Analogously, semantic speciﬁcation

relations, i.e. Fach (subject) speciﬁes Fachtext

(specialized text) can be recognized by such an

approach. In this case preﬁx matching instead of

sufﬁx matching has to be used. Nevertheless, this

simple matching approach results in so many wrong

matches, that it is not applicable.

Example. Given the two tags Text and Kontext

(context). The simple approach will recognize a

hypernym relation between these two tags although

this is an incorrect matching since no hypernym

relation exists in this case.

As a ﬁrst solution a minimal string length δ of the

string before the sufﬁx is demanded. Recalling the

example no semantic relation between Text, Kontext

or other similar tags will be detected for δ > 3. Since,

δ > 4 proved not to be useful, δ = 4 is assumed in the

following. Nevertheless, some correct matchings can-

not be recognized due to this restriction, but estima-

tions show that the number of not recognized existing

relations is rather small.

As a result the regular expression for the sufﬁx

matching for a given alphabet Σ and a sufﬁx s

can

be denoted as follows:

= ((((Σ+)-)+Σ

Σ*)+(Σ+ Σ+))s

(ε|en|e)

If w

∈ L(r

) for a given tag s

then a hypernym rela-

tion between s

and w

exists. Note that this regular

expression has already been adapted to several special

cases that occured in the given text-corpus. Further-

more, possible inﬂective sufﬁxes have been removed

from s

in prior to match as much inﬂective forms

of the word w

as possible. Since this is an exclu-

sively rule-based approach, semantic relations can-

not be recognized for morphological special cases. If

such words should also be recognized a reliable lem-

matizer like the learning lemmatizer described above

is needed.

In contrast to this sufﬁx matching, the identiﬁca-

tion of speciﬁcation relations is much more complex.

or generic term

This mainly results from two facts:

1. inﬂectional sufﬁxes

2. erroneous splitting of compounds

In german verbs, nouns and adjectives can be used to

create compounds or specify other words. Often in-

ﬂectional sufﬁxes are removed in those cases. As an

example lesen (read) and Lesestrategie (reading strat-

egy) can be considered. Here, the sufﬁx n is removed

before composing the two words lesen and Strategie

(strategy). Obviously, most of those semantic rela-

tions should be recognized by the tagnet builder.

Additionally, speciﬁcation relations may be rec-

ognized wrongly because of equivocal compounds

that are mainly determined by the context. One ex-

ample is Texterkennung that can be decomposed to

Text-erkennung (text recognition) or Texter-kennung

(writer identiﬁcation). In this case only the speciﬁca-

tion relation between Text and Erkennung is correct,

but also the relation between Texter and Kennung will

be recognized. Unfortunately, this problem can solely

be solved by manually deﬁned blacklists. Neverthe-

less, a ﬁlter can be deﬁned that ﬁlters wrongly de-

tected relations, i.e. between Text and textualisieren

(textualize). This ﬁlter checks if for a candidate match

the remaining sufﬁx is a known tag or word from a

given lexicons. Here, ualisieren is not contained in

the given lexicon and the relation will be discarded

correctly.

Finally, the regular expression can be denoted as

follows, where again inﬂective sufﬁxes are removed

from s

and the expression is adapted to some special

cases:

= (Σ

Σ*-)*s

(ung|en|e|n|ε)(s|-|ε)Σ

Σ*

If w

∈ L(r

) for a given tag s

, than a candidate spec-

iﬁcation relation between s

and w

exists that has to

be ﬁltered using the described ﬁlter afterwards.

3.1.2 Lexicon-based Approach

The rule-base approach is solely capable to recognize

hypernym and speciﬁcation relations. Nevertheless,

semantic relations of other types should be identiﬁed

as well by the tagnet builder, i.e. schreiben (write) and

lesen (read). For this reason a lexicon-based approach

is presented, where Wikipedia

is used as the test lex-

icon. It is assumed that the given lexicon entries have

been normalized by the methods of stop word elimi-

nation and stemming described above.

The idea of the lexicon-based approach is to ex-

tract lexicon entries for each tag of the given set and

perform a frequency analysis for all other tags on

http://de.wikipedia.org/

WEBIST 2010 - 6th International Conference on Web Information Systems and Technologies

these entries. Again, sufﬁxes and preﬁxes have to be

considered during this analysis, but the general ap-

proach will be described ﬁrst.

The main aspect of the lexicon-based approach is

to determine the strength of a semantic relation be-

tween tags. This signiﬁcance is estimated based on

relative frequencies of tags in the given lexicon en-

tries. The relative frequency γ

of a tag s

in the

lexicon entry l

for a given tag s

is deﬁned as,

(1)

where n

denotes the absolute frequency of the cur-

rent tag s

in the lexicon entry. The relative frequency

is then derived by dividing n

by the total number of

words N in l

. In this simple approach n

is calcu-

lated as the frequency of all direct occurrences of s

in l

A further improvement can be achieved by split-

ting the lexicon entries in headlines and text blocks

and giving occurrences in the headlines a higher sig-

niﬁcance than the ones in the text blocks. This is done

by a weight factor δ

∈ [0, 1]. The resulting enhanced

relative frequency γ

is denoted as,

= δ

∗

t,s

h,s

(2)

where n

t,s

is the absolute frequency of s

in the text

blocks of l

. Analogously, n

h,s

is the absolute fre-

quency of s

in the headlines, N

denotes the num-

ber of words in the text blocks, and N

the number of

words in the headlines.

Much more information can be gained by consid-

ering occurrences of s

in l

as preﬁxe or sufﬁx as

well. But occurrences as preﬁx generally not as rel-

evant as direct or sufﬁx occurrences, since preﬁxes

normally just specify other words in the german lan-

guage. For this reason the enhanced approach is ex-

tended by a weight function which rates the different

occurrences in the text blocks by additional parame-

ters δ

and δ

. This distinction is only applied to text

blocks in the tagnet builder.

Thus, the enhanced relative frequency within the

text blocks n

t,s

is calculated by,

t,s

= n

direct

t,s

+ δ

∗ n

pre f ix

t,s

+ δ

∗ n

su f fix

t,s

(3)

where δ

and δ

are weight parameters for the ab-

solute frequencies as preﬁx and sufﬁx, respectively.

This allows a much better estimation of semantic re-

lations between tags than the simple approach, since

language speciﬁc characteristics are taken into ac-

count.

3.1.3 Statistical Estimation

In contrast to the approaches described above it is also

possible to estimate semantic relations between tags

by frequency analysis on texts of a given corpus. Un-

fortunately, a large text corpus is needed for such sta-

tistical approaches in general. Hence, in Section 4.3

it was evaluated how the approach described below

works on a corpus of only 200 texts.

Our approach is composed of two estimation

steps.

1. estimation of the absolute tag frequencies in the

texts

2. estimation of possible semantic relations based on

these absolute frequencies

For step (1) basically the approaches described above

can be used, so that this step will be skipped here. In

the following it is assumed that for the given set of

tags S and all texts t

(i ∈ N)

)∀s

∈ S

denotes the absolute tag frequency for tag s

in the

text t

with j ∈ {1 . . . |S|}. Based on these absolute

values it is possible to calculate relative tag frequen-

cies for all tags s

in the text t

) =

)

(4)

where N

= max(n

), . . . , n

|S|

)). These are used

to statistically estimate the signiﬁcance of possibly

existing semantic relations between single tags.

Estimation of Possible Semantic Relations. For

all texts t

and a pair of tags (s

, s

) with s

, s

∈ S it is

checked, if n

) > 0 and n

) > 0 holds . The set of

all such texts is denoted as T. The strength of the sta-

tistical relations between these two tags δ

is cal-

culated recursively for all texts t

∈ T (i ∈ {1 . . . |T |})

(k)

(k−1)

∗ k + n

) ∗ n

)

k + 1

(5)

where δ

(0)

= 0, and 0 ≤ k ≤ |T |. The strength of a

possible semantic relation between to tags s

and s

than given by δ

= δ

(|T |)

This results in an histogram of all calculated pair-

wise statistical relations.

3.1.4 Result Combination

Finally, the tagnet builder combines the three result-

ing relation sets into one set containing the most rele-

vant relations of all three approaches. Since the qual-

ity of this ﬁnal set mainly depends on the quality of

EFFICIENT LITERATURE RESEARCH BASED ON SEMANTIC TAGNETS - Implemented and Evaluated for a

German Text-corpus

all three sets, it is once again necessary to introduce

weight factors that allow to control the inﬂuence of a

base set on the ﬁnal set. In case of relation duplicates

occuring in two different base sets the average of both

weighted qualities is calculated and stored as the new

relation quality in the ﬁnal set.

3.2 Similarity Analysis

In contrast to other approaches which estimate text

similarities (Lee et al., 2005) our approach is based

on an uncertainty relation on tagsets S

and S

of two

texts t

and t

This solves the problem of disjunct

tagsets of similar texts and allows to estimate simi-

larity values in such cases. To realize the main ob-

jective of our approach the simple estimation is dis-

cussed ﬁrst.

∩ S

∪ S

(6)

Obviously, no similarity can be calculated for disjunct

tagsets.

As a solution the tagsets S

and S

can be extended

by information gained from the generated semantic

tagnet. These new sets S

and S

are generated recur-

sively for a ﬁxed number of recursion steps n

max

∈ N.

In each recursion step the neighbours of each tag in S

(i ∈ {1, 2}) are added to the sets S

. In our ﬁndings it

proved that n

max

> 4 is not useful. Thus, n

max

< 4 is

assumed in the following.

Although this idea already allows to calculate sim-

ilarities between two texts t

and t

with S

∩S

0 it

not yet satisfying, since no distinctions are made be-

tween the tags in the two sets S

and S

. For this rea-

son tag signiﬁcances p

i, j

were introduced which de-

pend on the current recursion step n and manually de-

ﬁned signiﬁcances γ

for all considered relation types

r. The similarity of two texts t

and t

then can be

estimated as

∗

∑

i=1

1,i

+ p

2,i

) (7)

where N = |S

∩ S

| and M = |S

∪ S

|. For a given

threshold τ and a text t

all texts with δt

>= τ (i ∈

{1, ··· , |T |},t

6= t

) can be queried.

4 EVALUATION

The described system has been implemented in Java

and evaluated on four different tagsets.

ipTS: This set contains 3087 tags that mainly are

part of the topic of text production and writing

research which are part of the ipTS

research

project. Note that this tagset has been prepro-

cessed in terms of removing wrong tags and ad-

justing ﬂectional sufﬁxes. Thus, the results were

somewhat better than the results of the three other

tagsets.

II: This set contains 2748 tags from the topic of in-

formatics

IDS: This set contains the 1769 most basic german

words

AMS

O: This set contains more than 10 000 occupa-

tional qualiﬁcations of different topics and thus

is the most comprehensive tagset in this evalua-

tion

The evaluation was made by applying the imple-

mented system to these four tagsets. Subsequently the

generated semantic tagnets were checked for wrongly

recognized semantic relations. Since manual prepro-

cessing is unwanted, the ipTS tagset was the only pre-

processed set in this evaluation.

Fig. 1 depicts a part of a semantic tagnet visualiza-

tion (G

otten, 2009) which was generated by the tagnet

builder from the ipTS tagset. The semantic tagnet was

embedded into the ipTS

project website to simplify

the literature research task for domain experts. The

arrows represent hyperonym relations between key-

words, while the undirected edges express generic se-

mantic relations.

4.1 Rule-based Approach

In the ﬁrst evaluation step exclusively the rule-based

approach was considered. Table 1 contains the eval-

uation results of the rule-based approach on all four

tagsets. Obviously, the rule-based approach works

quite well for all given tagsets, since the accuracy is

smaller than 1% for most of the sets.

Table 1: Semantic relations extracted by rule-based ap-

proach.

Tagset Entries Relations Error Rate

ipTS 3807 1287 0,8 %

II 2748 1656 0,6 %

IDS 1769 502 0,4 %

AMS

O 10 245 1760 1,5 %

http://www.ipts.rwth-aachen.de/

http://is.uni-sb.de/vibi/

http://www.ids-mannheim.de/oea/

http://www.ams.or.at/bis/

http://www.ipts.rwth-aachen.de/

WEBIST 2010 - 6th International Conference on Web Information Systems and Technologies

Figure 1: Part of the extracted semantic net.

Only for the AMS

O tagset 1,5% semantic rela-

tions were recognized wrongly, due to the tags of dif-

ferent topics. Note that this accuracy does not con-

sider existing semantic relations between tags that are

not recognized. However, on average semantic rela-

tions between 33% of the given tags were recognized.

This is shown in Table 2.

In the ﬁrst evaluation step exclusively the rule-

based approach is considered. Table 1 contains the

evaluation results of the rule-based approach on all

four tagsets. Obviously, the rule-based approach

works quite well for all given tagsets, since the ac-

curacy is smaller than 1% for most of the sets.

Table 2: Number of linked tags.

Tagset Linked tags

ipTS 30,4 % (1160)

II 51,6 % (1419)

IDS 29,5 % (522)

AMS

O 20,7 % (2122)

The results illustrate the beneﬁt of such a pattern-

based approach which estimates the possible seman-

tic relations between the tags in different sets in short

time. As an example the semantic relations for the

ipTS tagset were estimated in less than 20 seconds

on the testing machine, while the estimation on the

AMS

O set only took 88 seconds. For a synthetic gen-

erated tagset the rule-based approach shows an almost

linear growth in runtime for large sets with more than

20 000 entries

4.2 Lexicon-based Approach

In contrast to this fast rule-based approach the

lexicon-based approach needs much more calculation

time (more than four hours). For all smaller sets the

calculation takes less than one hour. Although the cal-

culation time is much longer than in the rule-based

approach, the lexicon-based approach allows to rec-

ognize semantic relations between tags that cannot be

recognized by the plain rule-based approach as de-

scribed in Section 3.1.1.

Table 3 shows the numbers of extracted relations

for the given tagsets. In comparison to the rule-

based approach many more relations are extracted es-

pecially for the AMS

O tagset.

Table 3: Semantic relations extracted by lexicon-based ap-

proach.

Tagset Entries Relations Error rate

ipTS 3807 2853 2,8 %

II 2748 3337 5,1 %

IDS 1769 1600 4,7 %

AMS

O 10 245 5494 3,4 %

This mainly results from the type of the contained

tags in this set since there is a wide range of words or

concepts that describe similar processes or are some-

how related to each other. Consequently, the total

number of alike tags is higher for this lexicon-based

approach compared to the results of the rule-based ap-

proach as shown in Table 4.

Table 4: Number of linked tags.

Tagset Linked tags

ipTS 31,5 % (1198)

II 50,0 % (1376)

IDS 44,7 % (791)

AMS

O 32,1 % (3286)

4.3 Statistical Approach

The statistical approach has solely been evaluated

for the ipTS set due to the lack of a suitable text

corpus for the other three tagsets. As expected a

very large number of candidate semantic relations

were extracted, but mostly with very small semantic

strengths. For a proper threshold only 3455 of more

than 400 000 candidate relations remained. Unfortu-

nately the number of wrongly recognized semantic re-

lations still were quite high (12%). However it was

possible to reduce the error rate to 4,7% by blacklist-

ing problematic tags. It is assumed that the result of

EFFICIENT LITERATURE RESEARCH BASED ON SEMANTIC TAGNETS - Implemented and Evaluated for a

German Text-corpus

the described statistical approach can be improved by

a larger text corpus.

5 CONCLUSIONS

An approach has been described that allows to efﬁ-

ciently generate semantic tagnets for given unstruc-

tured lists of tags and an arbitrary text corpus. This

semantic tagnet can be used to estimate text similar-

ities for tagged texts in digital libraries to provide a

more intuitive way of literature research adapted to

the user’s cognitive model. In addition to this the gen-

erated semantic tagnet could be used to deﬁne ontolo-

gies or allow users to enhance the network using a

suitable interface similar to the idea of user feedback

as proposed in (Doan and McCann, 2003).

In a future version the tagnet builder may be en-

hanced by some components that allow to extract

synonym relations, too. One idea of a rule-based

approach has been proposed in (Ananthanarayanan

et al., 2008) for english words. This enhancement

would allow to store relations between tags in dif-

ferent languages to create a multilingual semantic net

that can be used for a digital library that stores texts

in different languages.

ACKNOWLEDGEMENTS

This research was funded in part by the DFG Cluster

of Excellence on Ultra-high Speed Information and

Communication (UMIC), German Research Founda-

tion grant DFG EXC 89, and by the German Research

Foundation grant for the Project Interdisciplinary Text

Production and Writing (ipTS

REFERENCES

Ananthanarayanan, R., Chenthamarakshan, V., Deshpande,

P. M., and Krishnapuram, R. (2008). Rule based syn-

onyms for entity extraction from noisy text. In AND

’08: Proceedings of the second workshop on Analyt-

ics for noisy unstructured text data, pages 31–38, New

York, NY, USA. ACM.

Barnett, B. (2009). Regular expressions,

http://www.grymoire.com/unix/regular.html.

Berners-Lee, T., Hendler, J., and Lassila, O. (2001). The se-

mantic web - a new form of web content that is mean-

ingful to computers will unleash a revolution of new

possibilities.

http://www.ipts.rwth-aachen.de/

Chafﬁn, R. (1992). The concept of a semantic relation. In

A. Lehrer, E. K., editor, Frames, Fields and Contrasts,

pages 253–288. Lawrence Erlbaum, Hillsdale, N.J.

Collins, A. and Quillian, M. (1969). Retrieval time from

semantic memory. Journal of Verbal Learning and

Verbal Behavior, 8(2):240–247.

Doan, A. and McCann, R. (2003). Building data integration

systems: A mass collaboration approach. In IIWeb,

pages 183–188.

Fellbaum, C., editor (1998). WordNet: An Electronic Lex-

ical Database (Language, Speech, and Communica-

tion). The MIT Press.

Gaizauskas, R. and Humphreys, K. (1997). Using a se-

mantic network for information extraction. Nat. Lang.

Eng., 3(2):147–169.

otten, D. (2009). Semantische Schlagwortnetze zur ef-

ﬁzienten Literaturrecherche. Master’s thesis, RWTH

Aachen University.

Harris, Z. (1985). In Katz, J. J., editor, The Philosophy of

linguistics, pages 26–47. Oxford University Press.

Harrison, M. A. (1978). Introduction to Formal Language

Theory. Addison-Wesley Longman Publishing Co.,

Inc., Boston, MA, USA.

Lee, M. D., Pincombe, B., and Welsh, M. (2005). An em-

pirical evaluation of models of text document simi-

larity. In Proceedings of the 27th Annual Conference

of the Cognitive Science Society, pages 1254–1259,

Mahwah, NJ. Erlbaum.

obner, S. (2003). Semantik. Eine Einf

uhrung.

Lovins, J. B. (1968). Development of a stemming algo-

rithm. Mechanical Translation and Computational

Linguistics, 11:22–31.

Perera, P. and Witte, R. (2005). A self-learning context-

aware lemmatizer for german. In Proceedings

of Human Language Technology Conference and

Conference on Empirical Methods in Natural Lan-

guage Processing, pages 636–643, Vancouver, British

Columbia, Canada. Association for Computational

Linguistics.

Porter, M. F. (1980). An algorithm for sufﬁx stripping. Pro-

gram, 14(3):130–137.

Porter, M. F. (2009). German stemming algorithm.

Quillian, M. R. (1967). Word concepts: A theory and simu-

lation of some basic semantic capabilities. Behavioral

Science, 12:410–430.

Schmid, H. (1994). Probabilistic part-of-speech tagging us-

ing decision trees. In Proceedings of International

Conference on New Methods in Language Processing.

Schmid, H. (1995). Improvements in part-of-speech tagging

with an application to german. In In Proceedings of

the ACL SIGDAT-Workshop, pages 47–50.

Sowa, J., editor (1991). Principles of Semantic Net-

works: Explorations in the Representation of Knowl-

edge (Morgan Kaufmann Series in Representation and

Reasoning). Morgan Kaufmann Pub.

Sowa, J. (2009). Semantic networks,

http://www.jfsowa.com/pubs/semnet.htm.

WEBIST 2010 - 6th International Conference on Web Information Systems and Technologies