m360 
 
The data in table 3 show that Soundex works quite well
for this example, and also for many others. However,
the start of the string is important: Soundex keeps the
initial letter unchanged, and the algorithm, which was
developed with English in mind, fails for instance to
find any significant similarity between words
beginning with “i” and “j”. A single symbol plus a
three-digit representation might also be too coarse to
catch more subtle similarities or differences. Soundex
might be more useful than some string metrics, but
unless a version is developed that takes the
pronunciation of Old Swedish into account, it may be
unsatisfactory as a working tool.
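
The sensitivity to the initial letter can be illustrated with a minimal sketch of the standard (American) Soundex rules. This is an assumption about the variant in use; the exact implementation behind table 3 is not specified here, and the function name soundex is illustrative only.

```python
# Minimal sketch of standard American Soundex (assumed variant).
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word):
    """Return the initial letter plus three digits, e.g. 'R163'."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")      # code of the initial letter also counts
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # h and w do not separate equal codes
        code = CODES.get(ch, "")     # vowels (and unknown letters) get no code
        if code and code != prev:
            digits.append(code)
        prev = code                  # a vowel resets prev, allowing repeats
    return (first.upper() + "".join(digits) + "000")[:4]


# The two spellings share every digit but differ in the retained initial.
print(soundex("jomfru"), soundex("iomfru"))
```

Because the initial letter is copied rather than encoded, spellings that alternate between “i” and “j” can never receive the same code under these rules, whatever the rest of the word looks like.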
3.3.2  The Winkler-Jaro Distance 
Another string measure is the Winkler-Jaro distance
metric. Despite its name, it is not a true metric, and
is more of a similarity than a distance (Winkler,
1990). It is computed pairwise for words, with the
resulting value 1 for identical strings and 0 for
strings that have no matching characters. The input
parameters for the measure are the lengths of the two
strings, the number of matching characters and the
number of transpositions.
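
How these parameters combine can be sketched as follows. This is a minimal illustration of the commonly published formulation (matching window of half the longer string minus one, prefix scale 0.1, prefix capped at four characters); these constants and the function names are assumptions, and since implementations differ in such details the values it produces need not agree with those reported in table 4.

```python
def jaro(s1, s2):
    """Jaro similarity from string lengths, matches and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and no further apart than the window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (Winkler's adjustment)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)


print(round(jaro_winkler("jomfru", "iomfru"), 4))
```

The prefix adjustment rewards agreement at the start of the word, so spelling variants that differ in their first letter receive no boost beyond the plain Jaro score.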
Table 4: Pairwise similarity values for “jomfru” (“virgin”). 
Word forms  Winkler-Jaro 
umfru - iomffrv  0.6429 
The computation of this measure can yield any real
value between zero and one, so its discriminating
power should perhaps be better than that of both
Levenshtein and Soundex. As an example, table 4 gives
the pairwise Winkler-Jaro values for the word “jomfru”
(Eng. “virgin” or “maiden”) in its different spelling
variants in the corpora.
As can be seen from the table, the Winkler-Jaro
measure gives high scores for these related word
pairs, and the pairs also occur close together in the
full listing. This measure therefore seems to be a
good choice for finding and grouping related word
forms.
The relation may then be either a matter of spelling
variation or a closeness due to inflection. In both
cases, this information is helpful in taking inventory
of the text and in giving clues for lexicon look-ups,
either manual or automated.
3.4  Stop Word List 
Stop word lists are used for removing non-specific or
uninteresting words from a given text. Such a list
typically consists of some of the most frequent words
in a language, belonging to closed word classes such
as determiners, pronouns, prepositions and
conjunctions. Auxiliary verbs might also be included
in such lists.
For use here, a stop word list was constructed by 
examining  the  frequency  list  of  the  corpora.  The 
principles  used  for  choosing  words  were  in 
accordance with the general ideas behind stop word 
lists and resulted in a list of 74 specific words:  
 
"honom",  "hans",  "a",  "ok",  "oc",  "han",  "hon", 
"at",  "mz",  "the", "them",  "ther", "swa",  "af",  "aff", 
"ey",  "foer",  "i",  "j",  "ii",  "jak",  "jac",  "thz",  "til", 
"vm",  "vtan",  "som",  "sit",  "sin",  "sina",  "sinom", 
"aar", "aeftir", "aen", "aer", "alle", "alt", "aat", "een", 
"enkte",  "for",  "haenna",  "haenne",  "hanom", 
"hanum",  "henna",  "henne",  "hona",  "hulkin", 
"hwar", "hwat", "iak", "mik", "sidhan", "sidhe", "sik", 
"tel", "tha", "thaen", "thaer", "thaes", "then", "thera", 
"thik",  "thin",  "tho",  "thu",  "thy",  "tik",  "war", 
"wara", "wardth", "wilde" and "hafdhe".
 
Further, variants of the name Maria were brought 
together into the most frequent form. 
The result of applying this procedure to the already
normalised text of the corpora, as described
previously, can be seen in figure 3 below. Here, words
with lexical meaning now appear that seem to be
characteristic of the texts in the corpora. This is
probably the best we can accomplish in terms of a text
analysis for finding key words in the absence of a
reference corpus for what would constitute a “normal”
text in Old Swedish.
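
The filtering step described in this section can be sketched as follows. The stop words are the 74 forms listed above; the function name content_word_frequencies, the variant_map parameter and the example variant "mariam" are hypothetical and serve only to illustrate how name variants might be merged before counting.

```python
from collections import Counter

# The 74 stop words listed in this section.
STOP_WORDS = frozenset("""
honom hans a ok oc han hon at mz the them ther swa af aff
ey foer i j ii jak jac thz til vm vtan som sit sin sina sinom
aar aeftir aen aer alle alt aat een enkte for haenna haenne hanom
hanum henna henne hona hulkin hwar hwat iak mik sidhan sidhe sik
tel tha thaen thaer thaes then thera thik thin tho thu thy tik war
wara wardth wilde hafdhe
""".split())


def content_word_frequencies(tokens, stop_words=STOP_WORDS, variant_map=None):
    """Count tokens after merging variants and dropping stop words."""
    variant_map = variant_map or {}
    counts = Counter()
    for token in tokens:
        word = token.lower()
        word = variant_map.get(word, word)   # merge spelling variants first
        if word not in stop_words:
            counts[word] += 1
    return counts


# Hypothetical input: "mariam" is an invented variant, not taken from the corpora.
tokens = ["jomfru", "ok", "maria", "oc", "mariam", "han"]
freqs = content_word_frequencies(tokens, variant_map={"mariam": "maria"})
print(freqs.most_common())
```

Sorting the remaining counts in descending order gives a frequency list from which the function words have been removed, which is the kind of listing that makes words with lexical meaning visible.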
 
 
NLPinAI 2020 - Special Session on Natural Language Processing in Artificial Intelligence