
proximity between term vectors: the smaller the 
angle, the higher the cosine of the angle (the cosine 
measure). Consequently, the maximum proximity 
equals 1 and the minimum equals 0. 
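As a minimal sketch, the cosine measure between two term vectors can be computed as below; since the coordinates are non-negative frequencies, the value falls between 0 and 1 (the vectors and values are illustrative only):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Term vectors: frequencies of each term across four documents.
t1 = [2, 0, 1, 3]
t2 = [1, 0, 1, 2]
print(round(cosine_similarity(t1, t2), 3))  # → 0.982
```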
The obtained term-term matrix measures the 
proximity between terms on the basis of their co-
occurrence in documents (since the coordinates of the 
term vectors are the frequencies of their use in 
documents). This means that the sparser the initial 
term-document matrix, the worse the quality of the 
term-term proximity matrix. It is therefore 
expedient to clear the initial matrix of information 
noise and sparsity with the help of latent 
semantic analysis (Deerwester et al., 1990). The 
noise arises because, apart from the knowledge 
about the subject domain, the initial documents 
contain so-called commonplaces which, nevertheless, 
contribute to the distribution statistics. 
We use the method of latent semantic analysis 
to clear the matrix of information noise. 
The essence of the method is the approximation 
of the initial sparse and noisy matrix 
by a matrix of lower rank with the help of singular 
value decomposition. The singular value decomposition 
of a matrix A of dimension M×N, M>N, is its 
representation as the product of three matrices: an 
orthogonal matrix U of dimension M×M, a diagonal 
matrix S of dimension M×N, and a transposed 
orthogonal matrix V of dimension N×N: 
 
A = U S V^T. 
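On a small hypothetical term-document matrix, this decomposition can be verified with NumPy (the matrix values below are illustrative, not real frequency data):

```python
import numpy as np

# Hypothetical term-document matrix: M=4 terms, N=3 documents.
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [1., 0., 2.]])

U, s, Vt = np.linalg.svd(A)        # U: 4x4 orthogonal, Vt: 3x3 orthogonal
S = np.zeros_like(A)               # rectangular diagonal matrix S (4x3)
np.fill_diagonal(S, s)

assert np.allclose(A, U @ S @ Vt)  # A = U S V^T holds exactly
```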
 
Such a decomposition has the following 
remarkable property. Let a matrix A be given whose 
singular decomposition A = U S V^T is known, 
and let it be required to approximate A by a matrix 
of a pre-determined rank k. If only the k greatest 
singular values are kept in matrix S and the rest 
are replaced by zeros, and only k columns of matrix U 
and k rows of matrix V^T are kept, then the 
decomposition 
 
A_k = U_k S_k V_k^T 
 
will give the best approximation of the initial matrix 
A by a matrix of rank k. Thus, the initial matrix A of 
dimension M×N is replaced by matrices of 
smaller sizes M×k and k×N and a diagonal matrix of k 
elements. When k is much less than M and N, 
a significant compression of information takes place: 
part of the information is lost, and only 
the most important (dominant) part is retained. The 
loss of information occurs because small singular 
values are neglected, i.e. the more singular values 
are discarded, the greater the loss. In this way 
the initial matrix is cleared of the information noise 
introduced by random elements. 
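A minimal sketch of this rank-k approximation with NumPy's SVD follows; the matrix and the chosen k are assumptions for illustration. The final assertion checks the property described above: the Frobenius-norm error of the approximation equals the contribution of the discarded singular value.

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation of A, obtained by keeping
    only the k greatest singular values (truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Illustrative term-document matrix (M=4, N=3, rank 3).
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [1., 0., 2.]])
A2 = rank_k_approx(A, 2)

# The error equals the single discarded singular value.
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A - A2), s[2])
```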
3.3 Summary 
The extracted concepts and relations must be plotted 
on a concept map. Recall that as concepts, i.e. 
nodes of the graph, we use all terms for which 
Pearson’s criterion is higher than a certain threshold 
value determined experimentally. In the literature the 
value 6.6 is indicated as a threshold (it corresponds 
to a significance level of 0.01 for one degree of 
freedom), but by varying this value it is possible to 
reduce or extend the list of concepts. For example, a 
very high threshold leaves only the most important 
terms, those with the highest values of 
Pearson’s criterion. 
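As an illustration (the document counts below are invented), Pearson’s chi-square for a term can be computed from a 2×2 contingency table over the positive and negative collections; values above 6.6 pass the threshold:

```python
def chi_square(a, b, c, d):
    """Pearson chi-square for a 2x2 contingency table:
    a = positive documents containing the term,
    b = positive documents without it,
    c = negative documents containing the term,
    d = negative documents without it."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# A term frequent in the positive collection and rare in the
# negative one scores far above the 6.6 threshold.
print(chi_square(40, 10, 5, 45) > 6.6)  # True
```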
The number of extracted relations can be varied 
in the same way. If, among all pairwise 
proximities in the term-term matrix, the values lower 
than a certain threshold are set to zero, the edges 
(links) will connect only those concepts whose 
proximity is higher than the indicated 
threshold. 
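This thresholding step can be sketched as follows, on a hypothetical 3×3 proximity matrix (the values are illustrative):

```python
import numpy as np

def edges_above(proximity, threshold):
    """Keep only links whose proximity exceeds the threshold."""
    n = proximity.shape[0]
    return [(i, j, float(proximity[i, j]))
            for i in range(n) for j in range(i + 1, n)
            if proximity[i, j] > threshold]

# Hypothetical symmetric term-term cosine proximity matrix.
P = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
print(edges_above(P, 0.4))  # [(0, 1, 0.8), (1, 2, 0.5)]
```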
4 EXPERIMENTS 
To carry out the experiments, we chose the subject 
domain “Ontology engineering”. Documents 
representing chapters of the textbook (Allemang 
et al., 2011) formed the positive set of the training 
collection. Some articles on other themes 
formed the negative set. Tokenization and 
lemmatization of the collection yielded a thesaurus 
of unique terms. Applying Pearson’s criterion with 
the threshold value of 6.6 allowed us to select 500 
key concepts of the subject domain. Table 1 presents 
the first 12 concepts with the greatest values of the 
criterion. 
Table 1: The first 12 concepts of the subject domain. 

No  Concept       Chi-square test value 
1   semantic      63.69 
2   Web           59.95 
3   property      59.87 
4   manner        57.08 
5   model         53.74 
6   class         52.40 
7   major         51.71 
8   side          50.78 
9   word          50.59 
10  query         44.09 
11  rdftype       37.41 
12  relationship  35.71 
DATA 2015 - 4th International Conference on Data Management Technologies and Applications 