Information Extraction from Web Services: A
Comparison of Tokenisation Algorithms
Alejandro Metke-Jimenez, Kerry Raymond and Ian MacColl
Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia
Abstract. Most web service discovery systems use keyword-based search algorithms and, although partially successful, sometimes fail to satisfy some users' information needs. This has given rise to several semantics-based approaches that aim to go beyond simple attribute matching and try to capture the semantics of services. However, the results reported in the literature vary and in many cases are worse than the results obtained by keyword-based systems. We believe the accuracy of the mechanisms used to extract tokens from the non-natural language sections of WSDL files directly affects the performance of these techniques, because some of them can be more sensitive to noise. In this paper three existing tokenisation algorithms are evaluated and a new algorithm that outperforms all the algorithms found in the literature is introduced.
1 Introduction
Web services have become the de facto technology for enabling distributed computing on modern platforms. Service directories, such as Programmable Web, list thousands of services that can be used to build complex software applications. Also, most modern backend software applications expose their functionality as web services in order to facilitate B2B interactions as well as integration with other software products deployed internally. Some applications, such as mashups, are built entirely by combining existing services.
With the ever increasing number of web services available both on the public Internet and on private intranets, web service discovery has become an important problem for software developers wanting to use services to build applications. The UDDI standard was initially proposed to address this problem by providing a centralized repository of service descriptions. However, some authors consider that the standard is too complex for end users who just want to publish their services [1], and several problems with the centralized architecture have also been identified. Perhaps the most relevant of these is the fact that registration in a centralized repository is not mandatory, so unregistered services are not discoverable by potential clients [2]. These shortcomings of the UDDI standard have given rise to proposals of decentralized mechanisms that crawl the web in search of WSDL files and use only the information found in them.
In [2] the authors propose an extensive classification of different types of web service discovery approaches. We are interested in the approaches found in the syntactic
matching and semantic matching categories, since these techniques are decentralized and have proven to be effective in other search domains. Within these categories the approaches can be differentiated by the degree of user involvement required. Although several standards have been proposed to add semantic annotations to web services [3], all of these methods require both choosing an ontology and adding semantic annotations to the services. On the other hand, approaches based on information retrieval methods use the information already available in the service description files. We believe these latter approaches are likely to be more successful because currently there is little incentive for developers to manually annotate web services.
Several researchers have implemented service discovery tools that rely solely on the information contained in the web service description files. It is difficult to measure the relative performance of these implementations against each other, and the results found in the literature vary. In [3] the authors compared several approaches to web service discovery, including information-retrieval-based methods and semantics-based methods, and found that the semantics-based approach using Latent Semantic Analysis outperformed the information retrieval approach. However, in [1] the authors showed that when comparing the results of a system based on Latent Semantic Analysis with the results of a system based on the Vector Space Model, recall improved but precision fell.
In our own experiments we implemented a web service search engine based on the Vector Space Model [4] and a semantics-based engine using Explicit Semantic Analysis. In the classic Vector Space Model approach, text documents (in this case, WSDL files) are represented as vectors in a term space. If a term occurs in a document then its value in that dimension of the vector is non-zero. The value is typically derived from the frequency of the term in the document and the commonness of the term in the whole document collection. One of the most common weighting formulas used is Term Frequency Inverse Document Frequency (TFIDF), although several variations of this formula exist. In our implementation a variation of TFIDF that uses sublinear scaling was used. The relevant documents for a query are found by representing the query as a vector in the same term space and then calculating the similarity between the query vector and the vector representations of the documents in the collection. In this implementation cosine similarity, which compares the angle between the vectors, was used.
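To make the weighting and ranking concrete, the following Python sketch ranks a toy document collection against a query using sublinear-scaled TFIDF weights and cosine similarity. This is a minimal illustration of the general technique rather than our actual implementation; the toy documents and the exact weighting variant are assumptions.

import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    # Sublinear TF scaling: weight = (1 + log tf) * log(N / df).
    tf = Counter(tokens)
    return {t: (1 + math.log(f)) * math.log(num_docs / doc_freq[t])
            for t, f in tf.items() if doc_freq.get(t)}

def cosine(u, v):
    # Cosine similarity between two sparse vectors represented as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy collection standing in for tokenised WSDL files.
docs = [["download", "mp3", "music"],
        ["get", "recent", "changes"],
        ["music", "search", "service"]]
doc_freq = Counter(t for d in docs for t in set(d))
vectors = [tfidf_vector(d, doc_freq, len(docs)) for d in docs]

query = tfidf_vector(["music", "download"], doc_freq, len(docs))
ranking = sorted(range(len(docs)), key=lambda i: cosine(query, vectors[i]),
                 reverse=True)
print(ranking)  # indices of the documents, most similar to the query first

Note that the idf factor log(N/df) zeroes out terms that occur in every document, which is the desired behaviour: such terms carry no discriminative information.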
Explicit Semantic Analysis (ESA) is a method introduced in [5-7] in which text in natural language form is represented in a concept space derived from articles found in Wikipedia. In this method each article in Wikipedia is treated as a concept in a general-purpose ontology, and the text in the articles is used to determine the degree of relatedness between the concept and a text snippet. An inverted index is used to build a vector for each individual term. The inverted index keeps track of the articles in Wikipedia that contain each term and their weight (which is typically calculated using some variant of TFIDF). The resulting vector for a text snippet is the centroid of the vectors derived for each of its terms. The similarity between two text snippets can then be calculated based on these "concept vectors" by using standard vector-based similarity metrics such as cosine similarity. The method is said to be explicit because the concepts are explicitly defined in Wikipedia in the form of articles.
Compared to the Vector Space Model, Explicit Semantic Analysis produces vectors in a concept space rather than in a term space. Also, depending on the document set, the vectors in ESA tend to be shorter (the length of a vector is the number of articles in Wikipedia) but less sparse than the term vectors used in VSM. ESA is considered a semantics-based method because the documents are represented in a concept space rather than just a term space.
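As an illustration of how the concept vectors are assembled, the sketch below builds an ESA interpretation vector as the centroid of per-term concept vectors drawn from a toy inverted index. The index contents and weights are invented for the example; a real index would be derived from the full Wikipedia corpus.

from collections import defaultdict

# Hypothetical inverted index: term -> {Wikipedia article: TFIDF weight}.
inverted_index = {
    "download": {"Download": 2.1, "MP3": 0.7},
    "music":    {"Music": 3.0, "MP3": 1.2},
    "binna":    {"Binna Burra": 4.5},
}

def esa_vector(tokens):
    # The interpretation vector is the centroid of the per-term concept vectors.
    centroid = defaultdict(float)
    for t in tokens:
        for article, weight in inverted_index.get(t, {}).items():
            centroid[article] += weight / len(tokens)
    return dict(centroid)

print(esa_vector(["download", "music"]))
# roughly {'Download': 1.05, 'MP3': 0.95, 'Music': 1.5}

Two interpretation vectors built this way can be compared with the same cosine function used for term vectors above.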
We evaluated the performance of both engines using the same dataset kindly provided to us by the authors in [3]. Our initial results showed that the ESA implementation performed worse than the VSM approach (these results are part of an ongoing project and are out of the scope of this paper). Upon further investigation we found that one of the contributing factors for the poor results of the ESA implementation was the noise introduced by the incorrect tokenisation of some of the terms extracted from the non-natural language sections of the WSDL files.
Other researchers have identified the need to find better ways of dealing with the complexities of extracting usable information from the non-natural language text found in WSDL files [8]. However, to the best of our knowledge, there is no detailed information published regarding the accuracy of the tokenisation algorithms found in the literature. Therefore, we created a data set designed specifically to test the accuracy of different algorithms when tokenising strings typically found in WSDL files. This paper addresses the research question of which tokenisation algorithm produces the best results when dealing with non-natural language strings found in WSDL files.
The rest of this paper is structured as follows. Section 2 provides a review of related work. Section 3 discusses the impact of introducing noise in the tokenisation process. Section 4 describes the data set that is used to evaluate the tokenisation algorithms. Section 5 shows the evaluation results for three tokenisation algorithms. Section 6 introduces a new algorithm and Section 7 shows its evaluation results. Section 8 discusses future work. Finally, Section 9 summarises the paper's research contributions.
2 Background
Few of the WSDL tokenisation algorithms in the literature are explained in detail. One of the few exceptions can be found in [9]. The authors use a tokenisation algorithm that is similar to the Maximum Matching algorithms used in Chinese word segmentation. The algorithm starts with the first character and tries to find the longest matching word in the dictionary that is completely contained in the string. If no word is found then the algorithm assumes that the single character is a token and moves on to the next character. If a word is found then the algorithm marks that substring as a token and moves on to the next character, continuing until there are no more characters in the string.
For example, assume we want to tokenise the string "downloadMP3Music" and that the words "download", "MP3", and "music" are in the dictionary. The algorithm would start looking for the longest word starting with "d". It would first find the word "down" but then it would also find the longer word "download". Since no words longer than "download" are found in the dictionary, that substring is marked as a token and the algorithm starts over from the letter "M".
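A minimal Python sketch of this greedy longest-match procedure, as we understand it from [9], is shown below; the tiny dictionary is illustrative only.

def mma_tokenise(s, dictionary):
    # Greedy longest-match tokenisation (Maximum Matching).
    s = s.lower()
    tokens, i = [], 0
    while i < len(s):
        match = None
        for j in range(len(s), i, -1):   # try the longest substring first
            if s[i:j] in dictionary:
                match = s[i:j]
                break
        if match is None:                # no dictionary word starts here:
            match = s[i]                 # treat the single character as a token
        tokens.append(match)
        i += len(match)
    return tokens

words = {"down", "download", "mp3", "music"}
print(mma_tokenise("downloadMP3Music", words))  # ['download', 'mp3', 'music']

Removing "mp3" from the dictionary (and adding "mp") reproduces the failure mode discussed below: the string is then split into "download", "mp", "3", "music".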
The authors argue that relying on naming conventions such as Camel Case is problematic because many strings do not follow the convention correctly and because certain words, such as eBay, make it difficult to comply with the conventions. Also, some strings found in WSDL files do not follow a naming convention at all and are just a run of lower case or upper case characters. The authors provide a table with a few examples of the results obtained using the tokeniser, but there is no information about the accuracy of the algorithm when applied to a large set of strings. Also, there is no reference to the dictionary that was used. This is important because the performance of this algorithm is directly related to the dictionary being used. In the example mentioned in the previous paragraph, if the term "MP3" is not in the dictionary, the result of the tokenisation is going to be incorrect ("download", "MP", "3", "Music", assuming the abbreviation "MP" is in the dictionary). We will refer to this technique as MMA.
In [3] the authors do not explicitly explain the algorithm used to tokenise WSDL names, but we were able to reverse engineer it based on their data sets. The tokeniser does the following:
1. Dashes and underscores are removed from the string.
2. The characters in the string are iterated over from left to right. The first character is
flagged as the start of the current token.
3. When a dot or a space is found a new token is created using the characters from the
start-of-token flag to the previous character. The dot or space character is discarded.
4. When an upper case letter is found a new token is created using the characters from
the start-of-token flag to the previous character. The upper case letter is flagged as
the start of the next token.
5. The process continues until there are no more characters in the string.
Using the same example we used to describe the previous tokenisation algorithm, this algorithm produces the following tokens: "download", "M", "P3", "Music". We will refer to this technique as Simple.
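A Python sketch of our reconstruction of this tokeniser follows; it reproduces the behaviour described above, including the "download", "M", "P3", "Music" output.

def simple_tokenise(s):
    # Split on dots/spaces (discarded) and before every upper-case letter.
    s = s.replace("-", "").replace("_", "")
    tokens, current = [], ""
    for ch in s:
        if ch in ". ":          # separator: close the current token, discard ch
            if current:
                tokens.append(current)
            current = ""
        elif ch.isupper():      # upper-case letter starts a new token
            if current:
                tokens.append(current)
            current = ch
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens

print(simple_tokenise("downloadMP3Music"))  # ['download', 'M', 'P3', 'Music']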
3 The Importance of Tokenisation
Tokenising strings in programming-language-type format is both challenging and important. Generating incorrect tokens has a negative impact on the overall performance of a search engine, but the impact differs depending on the approach being used. For example, let's consider the string "BINNAME". This string was found in a WSDL file that describes a fraud detection service. The correct way of tokenising it is "BIN" and "NAME". However, some algorithms incorrectly split this string into the tokens "BINNA" and "ME". The token "BINNA" can be associated with the area next to the Lamington National Park in Australia. The term is uncommon and very discriminative. If an approach based on the Vector Space Model is used, this error will likely generate a false positive if someone is looking for a service related to Binna Burra (an online reservation system for the Binna Burra Lodge, perhaps). However, the error is likely to go unnoticed when the system is evaluated against a standard test collection, since the information needs included in these benchmarks tend to be generic and will almost certainly not include anything related to Binna Burra.
In the Explicit Semantic Analysis approach, documents (in this case web services) are represented in a concept space derived from Wikipedia articles. This is achieved by using standard information retrieval techniques to calculate the degree of relatedness between the document and every article in Wikipedia. If an incorrect token such as "BINNA" is introduced, the service is likely to be mapped closely to the concepts related to that token in Wikipedia (which are likely to be the articles related to Binna Burra). Depending on the technique being used to calculate the similarity between this concept vector and other concept vectors, such as the ones that represent queries, the negative impact on the overall results can be much more significant. For example, if only the top k concepts in the vectors are used in the comparison, a term such as "BINNA" can not only cause false positives (hurting precision) but can also prevent the service from being ranked high in the results for a relevant query (hurting recall). This happens because the irrelevant concepts related to the term "BINNA" will displace the concepts related to the service's real purpose. In this case the term "BINNA" is particularly problematic because it is very discriminative; in general, the impact will depend on the incorrect tokens being generated.
4 Evaluation Data Set and Metrics
In order to evaluate the different tokenisation algorithms, a collection of strings was assembled from a collection of 576 WSDL files (the original dataset included 785 WSDL files, but some of them were duplicates and were removed). The values of the name attribute of the following tags in the WSDL namespace were used: service, port, binding, operation, message, and portType. The values of the name attribute of the tags in the type declarations were also used. The tags used were: element, simpleType, complexType, and enumeration.
This extraction process produced a collection of 31126 strings. These strings were then manually tokenised. When more than one tokenisation alternative was found for a string, the one deemed most common was used. For example, the string "wikiwebservice.GetRecentChanges" was tokenised as "wiki", "web", "service", "get", "recent", "changes". Although another valid tokenisation could have been "wiki", "webservice", "get", "recent", "changes", because "web service" is sometimes written as a single word (Wikipedia includes "webservice" as an alternative spelling of "web service"), we considered the first possibility to be the most common.
Three metrics were used to evaluate the tokenisation algorithms. The first one calcu-
lates the percentage of tokenisation results that are completely correct. For example, if
the correct tokenisation of the string “BINNAME” is “bin” and “name”, then the result
is considered correct only when the tokeniser produces these exact tokens. This metric
is referred to as % Perfect Tokenisations.
The second metric is precision. In this context precision is defined as

precision = ct / tp,    (1)
where ct is the number of correct tokens produced by the tokeniser and tp is the total
number of tokens produced by the tokeniser.
The third metric is recall. In this context recall is defined as

recall = ct / rt,    (2)

where ct is the number of correct tokens produced by the tokeniser and rt is the total number of tokens in the ground truth (the total number of tokens that should have been returned).
When tokenising long multi-word strings, the results of the three metrics can be significantly different. For example, suppose the correct tokenisation of the string "wikiwebservice.GetRecentChanges" is "wiki", "web", "service", "get", "recent", and "changes". If an algorithm tokenises the string as "wiki", "webservice", "get", "recent", and "changes", the first metric simply counts the result as being wrong. The second metric counts four correct tokens (the tokens "wiki", "get", "recent", and "changes") and five tokens produced by the tokeniser, resulting in a precision value of 80%. The third metric counts the same four tokens as correct and six tokens in the ground truth, resulting in a recall value of 66.66%. If this were the only string in the collection, the percentage of perfect tokenisations would be 0%, the average precision would be 80%, and the average recall would be 66.66%.
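The three metrics can be computed per string as in the following sketch, which reproduces the numbers from this worked example. Counting correct tokens with multiplicity is our assumption; the paper's examples do not exercise repeated tokens.

from collections import Counter

def evaluate(produced, expected):
    # Correct tokens are those appearing in both lists (with multiplicity).
    correct = sum((Counter(produced) & Counter(expected)).values())
    perfect = produced == expected          # metric 1: exact match
    precision = correct / len(produced)     # metric 2: equation (1)
    recall = correct / len(expected)        # metric 3: equation (2)
    return perfect, precision, recall

produced = ["wiki", "webservice", "get", "recent", "changes"]
expected = ["wiki", "web", "service", "get", "recent", "changes"]
print(evaluate(produced, expected))  # (False, 0.8, 0.666...)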
5 Evaluation of Existing Tokenisation Techniques
Along with the two tokenisation techniques discussed in Section 2, a third technique that uses the Camel Case naming convention to decide how to tokenise strings was implemented. The technique does the following:
1. Dashes and underscores are removed from the string.
2. The characters in the string are iterated over from left to right. The first character is
flagged as the start of the current token.
3. The second character is analysed to determine if the current token is in lower case
(all characters are lower case), upper case (all characters are upper case), or camel
case (the first character is upper case and the rest are lower case).
4. The token boundary is determined based on the type of word detected in the previ-
ous step. If the word is lower case or camel case then the word boundary is flagged
when an upper case character is found. The detected token does not include the
upper case character, which is flagged as the start of a new token.
5. If the word is upper case then the word boundary is flagged when a lower case
character is found. In this case the upper case character right before the lower case
character is not included in the token because it is assumed to be the first character
of the next token.
6. Numbers are considered to be part of the current token.
7. When a dot or a space is found a new token is created using the characters from the start-of-token flag to the previous character. The dot or space character is discarded.
8. The process continues until there are no more characters in the string.
Using the same example once more, applying this technique produces the following tokens: "download", "MP3", "Music". We will refer to this technique as Naming Convention.
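The following Python sketch captures our reading of the boundary rules above and reproduces the expected output for the running example:

def nc_tokenise(s):
    # Naming Convention tokenisation: dots/spaces split (and are discarded);
    # case changes mark token boundaries; digits stay with the current token.
    s = s.replace("-", "").replace("_", "")
    tokens, i, n = [], 0, len(s)
    while i < n:
        if s[i] in ". ":                 # separator: discard and move on
            i += 1
            continue
        j = i + 1
        if s[i].isupper() and j < n and s[j].isupper():
            # Upper-case word: runs until a lower-case letter; the upper-case
            # character just before that letter starts the next token.
            while j < n and not s[j].islower() and s[j] not in ". ":
                j += 1
            if j < n and s[j].islower():
                j -= 1
        else:
            # Lower-case or camel-case word: runs until an upper-case letter.
            while j < n and not s[j].isupper() and s[j] not in ". ":
                j += 1
        tokens.append(s[i:j])
        i = j
    return tokens

print(nc_tokenise("downloadMP3Music"))  # ['download', 'MP3', 'Music']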
The three techniques, Simple [3], Maximum Matching [9], and Naming Convention, were evaluated using the tokenisation data set. For the Maximum Matching technique, four dictionaries were used. The dictionaries were derived from WordNet, the entries in the English language Wiktionary (technically the titles of the wiki pages; the content in the body of the pages was not used), the titles of the English language Wikipedia, and the entire corpus of the English language Wikipedia. Table 1 shows the number of words in each of these dictionaries. The results of the evaluation are shown in Table 2.
Table 1. Sizes of the dictionaries.
Dictionary Number of Words
WordNet 87539
Wikipedia Titles 1039508
Wiktionary Titles 1805285
Wikipedia Corpus 5549346
Table 2. Results of the evaluation of three tokenisation algorithms.
Tokeniser Dictionary % Perfect Tokenisations % Precision % Recall
Simple NA 71.11% 81.59% 83.9%
Maximum Matching WordNet 52.53% 67.11% 72.34%
Maximum Matching Wiktionary 45.56% 61.62% 62.92%
Maximum Matching Wikipedia Titles 41.55% 56.72% 57.11%
Maximum Matching Wikipedia Corpus 24.53% 41.37% 39.73%
Naming Convention NA 88.15% 94.49% 93.49%
The results show that the performance of the Maximum Matching tokenisation algorithm is heavily dependent on the dictionary being used. Also, using a dictionary containing more words does not necessarily increase the algorithm's performance. In fact, the worst performance was obtained when using the largest dictionary, in this case the one derived from the entire Wikipedia corpus. The performance degrades because Wikipedia includes many non-words that are incorrectly identified by the algorithm as the right tokens. For example, when given the string "APRSoapIn", the MMA algorithm using the WordNet-derived dictionary correctly tokenises it into "apr", "soap", and "in". However, the same algorithm produces the incorrect tokens "aprs", "oapi", and "n" when using the Wikipedia-corpus-derived dictionary, because the terms "aprs" and "oapi" are contained somewhere in Wikipedia (APRS is the acronym for Automatic Packet Reporting System and OAPI is the acronym for the African Intellectual Property Organization).
The best performing algorithm is the one based on naming conventions. This tells us that most non-natural language text within WSDL files is written following some naming convention, and correctly so. However, around 16% of the text in the WSDL files either does not follow a naming convention or does so incorrectly. These results were used as motivation for the design of the WEASEL tokeniser.
6 The WEASEL Tokeniser
Finding the right tokenisation for a given string requires knowledge that can be obtained from the string itself or from some external source. The dictionary used in the Maximum Matching algorithm is an example of external information that can be used in the tokenisation process. The letter case information used in both the Simple and Naming Convention algorithms is an example of information that can be found in the string itself.
The results in Section 5 show that the best performing algorithms were the ones that used the letter case information to determine how to tokenise the strings. This suggests that this information should not be ignored. It would also seem that using a general purpose dictionary as a source of external information is not very useful for achieving a perfect tokenisation, considering that regardless of the dictionary used, the performance of the Maximum Matching algorithm was the worst. However, this can be attributed to the fact that the algorithm only explores one word combination possibility out of many. Also, a dictionary specifically tuned for a particular domain might improve the algorithm's performance.
Table 3 shows some examples of the strings that the Naming Convention tokeniser had trouble tokenising correctly. The incorrectly tokenised strings can be classified into the following categories:
1. Strings that contain numbers, where the tokens around the numbers are split incorrectly.
2. Strings that do not use any naming convention.
3. Strings that use some naming convention but use it incorrectly.
Table 3. Examples where the Naming Convention tokeniser had trouble tokenising correctly.
WSDL Name | Produced Tokens | Expected Tokens
BINNAME | binname | bin, name
historicoptiondatawsdl | historicoptiondatawsdl | historic, option, data, wsdl
Census1850 GetSurname | census1850, get, surname | census, 1850, get, surname
AxesgraphType | axesgraph, type | axes, graph, type
Strings in the first category are difficult to tokenise correctly without using external information because the numbers are sometimes part of the tokens, in strings such as "findMP3Service", and are sometimes independent tokens, in strings such as "InterestRateSwaps10Month". Strings in the second category are also difficult to tokenise correctly because it is hard to differentiate between single-word strings written in lower camel case and multi-word strings written all in the same letter case. Finally, the strings in the third category are the most difficult to tokenise correctly because it is hard to identify the cases where a naming convention has been used incorrectly, especially when this information is being used as the primary mechanism to identify word boundaries.
The first version of the WEASEL tokeniser (referred to as WEASEL v1 in the tables) attempts to improve the performance of the Naming Convention algorithm by dealing with the first problem. By using a dictionary, the algorithm tries to identify whether a number found within a string belongs to the current token, belongs to the next token, or should be treated as a separate token. It also uses the commonness of the words, which can be easily calculated by counting word occurrences in a large general-purpose text (in this case the whole English language Wikipedia corpus). The algorithm does the following (a sketch follows the list):
1. The string is tokenised using the Naming Convention algorithm.
2. When a number is detected in any token the number is separated from the token. For example, if the initial tokenisation of the string "InterestRateSwaps10Month" produces the tokens "Interest", "Rate", "Swaps10", and "Month", the token "Swaps10" is split into "Swaps" and "10".
3. The different combinations between the number and its surrounding tokens are looked up in the dictionary. In the example, these combinations would be "Swaps10" and "10Month".
4. If none of the combinations is found in the dictionary then the number is treated as a separate token. This is what happens in this example, and the incorrect tokenisation is now fixed.
5. If only one of the combinations is found in the dictionary then that combination is
considered the correct token.
6. If both combinations are found in the dictionary then the combination that is most
common is chosen as the correct token.
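The number-handling step can be sketched as follows, reusing nc_tokenise from the Naming Convention sketch above. The commonness table is illustrative, and the tie-breaking behaviour (attaching to the previous token when both combinations are equally common) is our assumption where the description leaves it open.

import re

def weasel_v1(s, commonness):
    # Split digits out of every Naming Convention token, then decide whether
    # each number joins the previous token, the next token, or stands alone.
    # `commonness` maps dictionary words to their occurrence counts.
    tokens = []
    for tok in nc_tokenise(s):              # from the Naming Convention sketch
        tokens.extend(t for t in re.split(r"(\d+)", tok) if t)
    out, i = [], 0
    while i < len(tokens):
        t = tokens[i]
        if t.isdigit():
            prev_c = (out[-1] + t).lower() if out else ""
            next_c = (t + tokens[i + 1]).lower() if i + 1 < len(tokens) else ""
            cp, cn = commonness.get(prev_c, 0), commonness.get(next_c, 0)
            if cp == 0 and cn == 0:
                out.append(t)               # neither combination is a word
            elif cp >= cn:
                out[-1] += t                # e.g. "MP" + "3" -> "MP3"
            else:
                out.append(t + tokens[i + 1])
                i += 1
        else:
            out.append(t)
        i += 1
    return out

commonness = {"mp3": 120}                   # illustrative occurrence counts
print(weasel_v1("InterestRateSwaps10Month", commonness))
# ['Interest', 'Rate', 'Swaps', '10', 'Month']
print(weasel_v1("findMP3Service", commonness))
# ['find', 'MP3', 'Service']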
The second version of the WEASEL tokeniser (referred to as WEASEL v2 in the tables) attempts to solve the second problem by using an algorithm similar to the Maximum Matching algorithm to tokenise strings that do not use any naming convention. The algorithm is only used when a string is detected to be entirely upper case or entirely lower case, and does the following (a sketch follows the list):
1. Starting from the first character, the algorithm looks for all the words in the dictionary that start with that character and match the next characters in the string. For example, for the string "BINNAME", the algorithm would find the words "b", "bi", "bin", and "binna" (assuming these words are in the dictionary).
2. A score is assigned to each word based on its length and commonness. The simple formula Score(t) = commonness(t) × length(t)^k is used, with k = 2. The rationale behind this formula is that a token is more likely to be the correct choice if it is longer and more common. The constant k is used to increase the relevance of the token's length against its commonness. The value of the constant was derived by experimenting with a subset of the ground truth that included only lower case strings.
3. For each word the algorithm is run recursively starting from the next character in the string, until no more characters are left. For example, for the token "b", the algorithm would run recursively using the rest of the string, in this case "inname". This produces a set with several possible tokenisations.
4. Each tokenisation possibility is given a score, which is simply the average of the scores of its tokens. The one with the highest score is returned.
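A sketch of this search in its exhaustive form is shown below. The single-character fallback at positions where no dictionary word starts, and the commonness counts, are our assumptions; the enumeration is deliberately unoptimised, mirroring the inefficiency discussed next.

def segmentations(s, dictionary, i=0):
    # Enumerate candidate segmentations of a caseless string (exponential in
    # the worst case).
    if i == len(s):
        yield []
        return
    found = False
    for j in range(i + 1, len(s) + 1):
        if s[i:j] in dictionary:
            found = True
            for rest in segmentations(s, dictionary, j):
                yield [s[i:j]] + rest
    if not found:                           # no dictionary word starts here:
        for rest in segmentations(s, dictionary, i + 1):
            yield [s[i]] + rest             # fall back to a single character

def weasel_v2_split(s, commonness, k=2):
    # Pick the segmentation with the highest average Score(t) = commonness * len^k.
    def avg_score(tokens):
        return sum(commonness.get(t, 0) * len(t) ** k for t in tokens) / len(tokens)
    return max(segmentations(s.lower(), commonness), key=avg_score)

commonness = {"bin": 900, "binna": 4, "na": 50, "name": 2000, "me": 800}
print(weasel_v2_split("BINNAME", commonness))  # ['bin', 'name']

With the counts shown, "bin"/"name" scores (900·9 + 2000·16)/2 = 20050 on average, well above "binna"/"me" at 1650, so the correct segmentation wins despite "binna" being a dictionary entry.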
The difference from the Maximum Matching algorithm is that this algorithm assembles several possible tokenisation alternatives, based on the words found in the dictionary, and also uses the commonness information to select the best possibility. The main drawback is its complexity, both in time and space. The algorithm is very inefficient, but in practice it works well for short strings. A simple optimisation for strings longer than a certain length K is to limit the recursive runs to the top two scoring words only. This reduces the amount of computation and memory required significantly.
Although this algorithm performs well for multi-word strings that use no naming convention, it can introduce noise when used on collections that include single-word strings written in lower camel case, if the words are not part of the dictionary (words might be abbreviated or might even be in another language). In this case the algorithm will try to tokenise the words further and is likely to end up with several short words that are unrelated to the service.
Table 4. Results of the WEASEL tokeniser evaluation.
Tokeniser Dictionary % Perfect Tokenisations % Precision % Recall
Naming Convention NA 88.15% 94.49% 93.49%
WEASEL v1 WordNet 94.17% 96.65% 96.71%
WEASEL v1 Wiktionary 94.31% 96.72% 96.75%
WEASEL v1 Wikipedia Titles 93.72% 96.55% 96.37%
WEASEL v1 Wikipedia Corpus 91.84% 95.89% 95.43%
WEASEL v2 WordNet 92.65% 95.23% 95.33%
WEASEL v2 Wiktionary 94.34% 96.8% 96.84%
WEASEL v2 Wikipedia Titles 93.93% 96.79% 96.61%
WEASEL v2 Wikipedia Corpus 92.02% 96.11% 95.66%
7 WEASEL Evaluation
Table 4 shows the results of evaluating the different versions of the WEASEL tokeniser using the same four dictionaries that were used with the Maximum Matching algorithm. The results of the Naming Convention algorithm, the best performing of the algorithms evaluated earlier, are also included.
The results show that version two using the dictionary derived from Wiktionary produced the best results. The numbers show a significant improvement over the performance achieved by the Naming Convention algorithm. Version one using the same dictionary comes in a close second. The performance of these two algorithms is very similar, but version one performs slightly faster when dealing with long tokens that use no naming convention. The results also show that the performance of these algorithms is dependent on the external dictionary being used. The dictionary that produced the best results was the one derived from Wiktionary and the one that produced the worst results was the one derived from the whole Wikipedia corpus. These results are consistent with the ones observed in the evaluation of the MMA algorithm, where these two dictionaries were also the best and worst respectively.
8 Future Work
Although the new tokenisation algorithm performs better than the other algorithms available in the literature, there is still room for improvement. Another source of external information that was not used, and might be useful in certain scenarios, is context. Analysing tokens extracted from other sections of a WSDL file can help find common patterns that may be useful when dealing with a complex string. In fact, humans rely heavily on context when manually tokenising difficult strings.
Although the levels of noise generated by the new algorithms are much lower than those of the existing algorithms, it is important to inspect the types of tokens being generated incorrectly. Incorrect tokens such as "binna" are likely to have a significant impact on the performance of some semantics-based systems, but incorrect tokens such as "whoisgoingtobepresident" are not as problematic.
Finally, the new tokenisation algorithm will be used in the implementation of the web service search engine based on Explicit Semantic Analysis. WEASEL v2 using Wiktionary produced the best results and will therefore be used instead of the original MMA implementation. We expect the search engine's performance to improve given the superior performance of the new tokeniser.
9 Conclusions
In this paper, the performance of tokenisation techniques used to extract information from programming-language-type text is identified as one of the key aspects that directly affect the performance of certain semantics-based service discovery approaches. A data set containing a large set of strings extracted from WSDL files is used to evaluate three existing tokenisation algorithms. Finally, a new algorithm that outperforms the existing algorithms is introduced.
Acknowledgements
This research is supported in part by the CRC Smart Services, established and sup-
ported under the Australian Government Cooperative Research Centres Programme,
and a Queensland University of Technology scholarship.
References
1. Wu, C., Chang, E.: Aligning with the web: an atom-based architecture for web services discovery. Service Oriented Computing and Applications 1 (2007) 97–116. DOI: 10.1007/s11761-007-0008-x
2. D’Mello, D., Ananthanarayana, V.: A review of dynamic web service description and discovery techniques. In: 2010 First International Conference on Integrated Intelligent Computing, IEEE (2010) 246–251
3. Bose, A.: Effective web service discovery using a combination of a semantic model and a data
mining technique. Master’s thesis, Queensland University of Technology (2008)
4. Salton, G., Wong, A., Yang, C.: A vector space model for automatic indexing. Communications of the ACM 18 (1975) 613–620
5. Gabrilovich, E.: Feature generation for textual information retrieval using world knowledge.
PhD thesis, Israel Institute of Technology (2006)
6. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers Inc. (2007) 1606–1611
7. Gabrilovich, E., Markovitch, S.: Wikipedia-based semantic interpretation for natural language
processing. Journal of Artificial Intelligence Research 34 (2009) 443–498
8. Hou, J., Zhang, J., Nayak, R., Bose, A.: Semantics-based web service discovery using information retrieval techniques. In: Pre-Proceedings of the Initiative for the Evaluation of XML Retrieval 2010, IR Publications (2010) 274–285
9. Wu, C., Chang, E., Aitken, A.: An empirical approach for semantic web services discovery.
In: Software Engineering, 2008. ASWEC 2008. 19th Australian Conference on, IEEE (2008)
412–421