Topic-OPA: A Topic Ontology for Modeling Topics of Old Press Articles
Mirna El Ghosh, Cecilia Zanni-Merk, Nicolas Delestre, Jean-Philippe Kotowicz and Habib Abdulrab
Normandie Universit
´
e, INSA Rouen, LITIS, 76000 Rouen, France
Keywords:
Topic Ontologies, Topic Modeling, Open Knowledge Graphs, SPARQL.
Abstract:
Topic ontologies are recently gaining much importance in several domains. Their purpose is to identify the
themes necessary to describe the knowledge structure of an application domain. Meanwhile, their development
from scratch is hard and time consuming task. This paper discusses the development a topic-specific ontology,
named Topic-OPA, for modeling topics of old press articles. Topic-OPA is extracted from the open knowledge
graph Wikidata by the application of a SPARQL-based fully automatic approach. The development process
of Topic-OPA depends mainly on a set of disambiguated named entities representing the articles. Each named
entity is unambiguously identified by a Wikidata URI. In contrast to existent topic ontologies, which are
limited to taxonomies, the structure of Topic-OPA is composed of hierarchical and non-hierarchical schemes.
The domain application of this work is the old french newspaper Le Matin. Finally, an evaluation process is
performed to assess the structure quality of Topic-OPA.
1 INTRODUCTION
Topic ontologies are recently gaining significant at-
tention in the ontology engineering community. They
are being increasingly used in various domains such
as semantic matching (Tang et al., 2009), topic la-
beling (Allahyari and Kochut, 2017), topic model-
ing (Sleeman et al., 2018) and evaluating topical
search (Maguitman et al., 2010). The purpose of
topic ontologies is to represent the main “themes”
of a given application domain. The most commonly
known approaches for building topic models are the
keyword-based construction approaches which are
based mainly on text mining and information retrieval
techniques (Maguitman et al., 2010). Examples of
these approaches are the statistical approaches such
as probabilistic latent semantic analysis (pLSA) (Hof-
mann, 1999) and latent Dirichlet allocation (LDA)
(Blei et al., 2003). These approaches depend on the
textual content of the articles and consider it as a mix-
ture of topics. Their main drawback is that they risk
to retrieve specific topics. Although, it is hard and
time consuming to construct an ontology from a large
corpus of documents (Maguitman et al., 2010).
The works we are presenting in this article are part
of the ASTURIAS
1
project. The main goal of this
1
Analyse STructURelle et Indexation s
´
emantique
d’ArticleS de presse - Structural Analysis and Semantic In-
dexing of Newspaper Articles
project is to thematically organize a collection of old
press articles with a set of topics (e.g. Politics, Art,
Sport, Science, etc.).
In fact, one of the specific features of old press
is that it does not offer thematic entries: articles ap-
pear and follow one another without a thematic logic.
Under these conditions, it remains a tedious task to
query sources that report the same events from dif-
ferent points of view in different areas of the news-
paper. The scientific challenge is to propose robust
approaches for the analysis of texts that are noisy
due to the imperfect process of automatic transcrip-
tion of images into electronic texts. These approaches
need also to be multi-thematic, and robust to linguis-
tic evolution over the centuries. The ambition the
ASTURIAS project (whose workflow appears in Fig-
ure 1) is to study the digitization process from end
to end of the processing chain: WP1- from newspa-
per images, automatically analyze sections, articles
and texts; WP2- extract named entities from these el-
ements WP3- Topic labeling and hyperlinking the ar-
ticles based on the analysis made in 1 and the named
entities extracted in WP2.
This article will present our results on building a
topic model for WP3 of the ASTURIAS project. In
this context, a fundamental hypothesis is that articles
are represented by a set of “not ambiguous” named
entities (e.g. person, organization, product and loca-
tion) extracted from open data sources (coming from
El Ghosh, M., Zanni-Merk, C., Delestre, N., Kotowicz, J. and Abdulrab, H.
Topic-OPA: A Topic Ontology for Modeling Topics of Old Press Articles.
DOI: 10.5220/0010147202750282
In Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - Volume 2: KEOD, pages 275-282
ISBN: 978-989-758-474-9
Copyright
c
2020 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
275
Figure 1: The pipeline of the project ASTURIAS.
WP2 of the project). Therefore, our research prob-
lem can be defined as follows: Given a corpus of old
press articles A represented by a set of named entities
N, that are collected from A and identified by a set
of URIs, a topic model is required for modeling the
topics that represent A. The topic model will be used
for topic labeling the old press articles. From this per-
spective, we propose a SPARQL-based approach, re-
lying mainly on the set of the disambiguated named
entities, for building the topic model. In this regard,
open knowledge graphs, such as Wikidata, are con-
sidered. The main goal of this paper is to discuss
the development process of a topic-specific ontology,
named Topic-OPA, by the application of a SPARQL-
based fully automatic approach. Topic-OPA is de-
rived from the open knowledge graph Wikidata based
on a set of “not ambiguous” named entities represent-
ing the articles. A case study is demonstrated for
building Topic-OPA from the articles of Le Matin
2
,
an old french newspaper first published in 1884 and
discontinued in 1944. Finally, Topic-OPA is evalu-
ated by the application of a structure-based evalua-
tion approach. The rest of the paper is organized as
follows: Section 2 presents the main related works of
this study. In section 3, we discuss the development
process of Topic-OPA. Section 4 presents a case study
in the context of Le Matin. In section 5, we evaluate
Topic-OPA. Finally section 6 concludes the paper.
2 RELATED WORKS
In this section, topic ontologies and ontology engi-
neering approaches are introduced as the main related
foundations for our study.
2.1 Topic Ontologies
Topic ontologies are considered as special type of on-
tologies. Their purpose is to identify the “themes”
necessary to describe the knowledge structure of an
application domain (Zhao and Meersman, 2005). A
topic ontology is represented as a set of topics that are
interconnected using semantic relations. Two main
2
https://gallica.bnf.fr/ark:/12148/cb328123058/date,
last visited on April 8 2020
types of topic ontologies are defined: simple, and gen-
eral (Maguitman et al., 2010). The simple topic on-
tologies are composed of topics linked by hierarchi-
cal relations. Meanwhile, in general topic ontologies,
transverse relations are included to link different top-
ics in a non-hierarchical scheme. For representing
general topic ontologies, the following components
are commonly defined:
Topics: concepts of the topic ontology (e.g. Sport,
Art, Politics).
Predicates: types of relationships defining the
semantic relations which can be established be-
tween ontology concepts. Multiple predicates are
defined in general topic ontologies: hierarchical
(e.g. subClassOf ) and non-hierarchical (e.g. stud-
ied by, part of, etc.)
Relationships: concrete links among ontology
concepts which will be used to characterize paths
in graphs. They are distinguished according to
their predicate and the couple of elements they
link. They can be represented as a triplet (s,p,o)
where s the subject, o the object and p the predi-
cate that links s and o (e.g. Literature subClassOf
Art, Art part of Culture).
2.1.1 KB-LDA Topic Model
For topic labeling purposes, the topic model KB-LDA
(Allahyari and Kochut, 2017) is developed based on
combining topic models with ontological concepts
in a single framework. KB-LDA used the seman-
tic knowledge graph of concepts in an ontology (e.g.
DBpedia) and their diverse relationships with unsu-
pervised probabilistic topic models for generating au-
tomatic topic labels. The topic labeling process is
performed based on the semantic similarity between
the entities included in text documents and a suitable
portion of the ontology. For this purpose a seman-
tic graph is constructed from the concepts of the on-
tology and their classification hierarchy as labels for
topics.
2.1.2 IPCC Topic Model
For topic modeling purposes, IPCC (Sleeman et al.,
2018) is a domain-specific topic ontology used for
grounding a topic model in the domain of climate re-
search. The topic ontology is “seeded” with prede-
fined key word phrase concepts which are obtained
from domain-specific sources such as domain experts,
and by data mining semi-structured sources. Natu-
ral Language Processing techniques have been used
to extract the meaningful key word phrase concepts
KEOD 2020 - 12th International Conference on Knowledge Engineering and Ontology Development
276
from these sources. While, the topic modeling pro-
cess is applied on textual resources such as, reports
and research papers, the ontology concepts are used
for weighting concepts founded in these resources.
Furthermore, the topic ontology is enriched with the
concepts associated with the textual resources and the
generated topics.
2.2 Ontology Engineering Approaches
In the ontology engineering domain, several ap-
proaches have been proposed for building ontolo-
gies from scratch or by reusing other existing ontolo-
gies. The most known approaches are Uschold and
Gruninger (Uschold and Gruninger, 1996), Methon-
tology (Fern
´
andez-L
´
opez et al., 1997) and ON-TO-
KNOWLEDGE (Sure et al., 2004). These approaches
focus on an iterative process of ontology building
and are composed of common phases such as spec-
ification, conceptualization, formalization, applica-
tion and evaluation. In addition, approaches such
as Text2Onto (Cimiano and V
¨
olker, 2005) and On-
toGen (Fortuna et al., 2007) aim to generate ontolo-
gies semi-automatically with the help of user inter-
ference. These approaches exploit textual resources
and rely on natural language processing techniques.
However, few works have been found in the literature
about building ontologies from knowledge graphs.
In (B
¨
ohm and Ortiz, 2018), the authors discusses
the building of topic-specific ontologies from open
knowledge graphs such as ConceptNet (Speer et al.,
2017). A query-based interactive approach is applied
for extracting entities and relations from the knowl-
edge graph. Based on the extraction process as well
as the interaction of the user, the central taxonomy
of the topic ontology is constructed. Furthermore,
adding complex concepts is processed to enrich the
ontology. Finally, a clean-up phase is performed in
order to modify or to add new concepts to the taxon-
omy.
3 SPARQL-BASED AUTOMATIC
APPROACH FOR BUILDING
TOPIC-OPA
For building topic ontologies, the most commonly
known approaches are the keyword-based construc-
tion approaches which are based mainly on text min-
ing and information retrieval techniques (Maguitman
et al., 2010). However, these approaches are not ef-
ficient, hard and time consuming to construct an on-
tology from a large corpus of documents (Maguitman
et al., 2010). From this perspective and for simpli-
fying the construction process of Topic-OPA, open
knowledge graphs are considered. Generally, knowl-
edge graphs are very large and contain many entities
that are too general or specific to be successfully used
as topics for topic labeling (B
¨
ohm and Ortiz, 2018).
Meanwhile, they can be leveraged to build with mod-
erate efforts small to medium-sized meaningful topic
ontologies. As a knowledge graph, we selected Wiki-
data. It is a free and open knowledge graph and acts as
central storage for the structured data of its Wikime-
dia sister projects including Wikipedia, Wiktionary,
and others (Erxleben et al., 2014). Wikidata stores
more than 402 million statements about over 45 mil-
lion entities (Malyshev et al., 2018). Today, more than
60 million of items are described. The data model of
Wikidata is based on a directed, labelled graph where
entities are connected by edges that are labelled by
“properties” (Bielefeldt et al., 2018). Thus, the sys-
tem distinguishes two main types of entities: items
and properties. Items are uniquely identified by a “Q”
followed by a number, such as Paris (Q90). Proper-
ties describe detailed characteristics of an item and
represented by a “P” followed by a number, such as
instance of (P31). Entities are represented by URIs
(e.g. http://www.wikidata.org/entity/Q90 for Paris
and http://www.wikidata.org/entity/P31 for instance
of ).
3.1 Ontology Specification
The ontology specification clarifies the scope and the
purpose of the targeted topic ontology Topic-OPA.
Topic-OPA is a topic-specific ontology intended for
modeling the topics of old press articles. Thus, the
scope is limited to old newspapers and journals which
are not organized thematically as the recent ones.
Therefore, given a corpus of articles in 1920, Topic-
OPA is constructed from the disambiguated named
entities representing these articles (see Figure 2 for
an example of named entities representing the arti-
cles depicted in Figure 7). Thereby, Topic-OPA will
not be useful for labeling articles in 2020. Concern-
ing the purpose, Topic-OPA is intended to build au-
tomated applications such as topic labeling systems.
Although, it can be used to develop larger ontologies
for more specialized purposes reducing the time and
effort needed to develop ontologies from scratch.
3.2 Ontology Requirements
In the ontology engineering domain, the set of re-
quirements that the ontology should satisfy is di-
vided into functional and non-functional requirements
Topic-OPA: A Topic Ontology for Modeling Topics of Old Press Articles
277
(Fern
´
andez et al., 2009). The functional requirements
define what needs to be expressed by the ontology
model. Meanwhile, the non-functional requirements
specify how an ontology needs to be designed in order
to be applicable. For Topic-OPA, the main functional
requirement is that it needs to be composed of two
different schemes:
Hierarchical Scheme: consists of hierarchical re-
lations such as subClassOf that permit the infer-
ence of knowledge in the ontology graph.
Non-hierarchical Scheme: involves non-
hierarchical relations such as related, part
of, used by, etc. that have an important implica-
tion in the semantic relationships between the
concepts.
Concerning the non-functional requirements, we con-
sider data traceability and scalability by mapping the
concepts and the relations of the topic ontology to en-
tities in open knowledge graphs such as Wikidata.
3.3 Ontology Definition
In our work, we are interested in general topic on-
tologies which are composed of hierarchical and non
hierarchical schemes. In the following, we define
these ontologies by considering mapping to knowl-
edge graphs.
Definition 8. We define a general topic ontology, in
which mapping to knowledge graphs is considered, by
O =
T, R,E,φ
, with
T the set of topic concepts,
R the set of predicates: {subClassOf, instance of,
part of, use, related by, etc.},
E the set of relationships: E T × R × T
φ the mapping of T and R to entities in a knowl-
edge graph K.
3.4 Ontology Building
For building Topic-OPA, a SPARQL-based fully auto-
matic approach is applied. This approach, which aims
to harvest Topic-OPA from the open knowledge graph
Wikidata, is composed of three main phases: (1) con-
struction of the hierarchical scheme, (2) construction
of the non-hierarchical scheme and (3) ontology en-
richment.
3.4.1 Building the Hierarchical Scheme:
Bottom-up Strategy
The hierarchical scheme of Topic-OPA, which rep-
resents the taxonomy of topic concepts, can be for-
mally defined by H =
T, R,E
v
,φ
, where T is the
set of topic concepts, R is the unique predicate
{subClassOf } used for ordering the topic concepts,
E
v
is the set of ordering relations and φ is the map-
ping function to Wikidata. In the hierarchy, a root
element denoted > is defined as a general subsumer
for all the topic concepts, i.e., t
i
T,t
i
v >. For
building the hierarchy, a SPARQL-based bottom-up
approach is applied. The development process starts
with a definition of the most specific topic concepts
of the hierarchy and continues by extracting the more
general concepts. The approach started from a set of
named entities N represented by a set of URIs (see
Figure 2).
Figure 2: Example of named entities representing articles
A
1
and A
2
depicted in Figure 7.
Definition of the Most Specific Topic Concepts.
At this phase, a SELECT SPARQL query, relying
mainly on N and the Knowledge graph K, is applied to
define S
T
T the most specific topic concepts of the
hierarchy, t
i
S
T
,@t
j
/t
j
v t
i
. The SELECT query
q(n,r) takes as inputs a named entity n N and a
property r K and returns set of topic concepts. For
the application of q, we defined two main relation
types {P31, P106}. The property instance of (P31)
is used for all the named entities to retrieve their su-
perclasses. Meanwhile, for the named entities that
are instances of Human (Q5), which is a very gen-
eral topic, applying the property occupation (P106) is
required to fetch more specific topic concepts. In the
following, the syntax of q is presented. We denote by
entityId, the Wikidata ID of the named entity which is
extracted from the URI.
SELECT ?specificTopic WHERE {
wd:entityId ?property ?specificTopic.
VALUES ?property {wdt:P31 wdt:P106}}
As an example, let us consider a named entity n =
{John Simon(Q333091)} (see Figure 3). In Wikidata,
John Simon is instance of (P31) Human (Q5) and
linked to judge, lawyer and politician by the property
occupation (P106). Thus, S
T
(n)={Judge, Lawyer,
Politician}.
KEOD 2020 - 12th International Conference on Knowledge Engineering and Ontology Development
278
Figure 3: Definition of the most specific concepts based on
the named entities of A
1
.
Extraction of Hierarchies. The aim of this phase
is to build the taxonomy of topic concepts H. The
building process starts from the most specific to the
most general concepts. For this purpose, a CON-
STRUCT SPARQL query q
H
(t
i
)/t
i
S
T
and associ-
ated to φ(t
i
), is applied to fetch the parent classes of
t
i
aiming to build a RDF graph of the hierarchy. In
this context, each query returns three different types
of triples: (1) to define the ontology classes, (2) to
create the taxonomic relations (inspired by usage in
RDF rdfs:subClassOf ) and (3) to label the ontology
classes. All triples are denoted by (s, p,o), where s
the subject, p the predicate and o the object. In the
following, the syntax of q
H
is presented. We denote
by topicId the Wikidata ID of t
i
S
T
.
CONSTRUCT { ?class a owl:Class.
?class rdfs:subclassOf ?superclass.
?class rdfs:label ?classLabel.
?property rdfs:domain ?class.
?property rdfs:label ?classLabel.}
WHERE { wd:topicId wdt:P279* ?class.
?class wdt:P279 ?superclass.
?class rdfs:label ?classLabel.}
In Figure 4, an example of triples extracted based on
S
T
(John Simon).
Figure 4: Example of triples for building the hierarchical
scheme of Topic-OPA.
3.4.2 Building the Non-hierarchical Scheme
The non-hierarchical scheme of Topic-OPA can be
formally defined by NH =
T, R,E,φ
, where T is
the set of topic concepts, R is the finite set of pred-
icates, E T ×R ×T is the set of transverse relation-
ships among the topics and φ the mapping function. In
this phase, the non-hierarchical relations are extracted
from Wikidata for building NH. These relations are
represented by the definition of the domain/range of
the properties that will be added to the graph as edges
between domains and ranges. For this purpose, a
CONSTRUCT query q
NH
(t
i
)/t
i
T and associated to
φ(t
i
), is applied to fetch all the triples where t
i
are
domains or ranges. In this context, the selection of
properties is restricted to a predefined list based on
their relevance in different domains (e.g. field of work
(P101), has part (P527), has quality (P1552), part of
(P361), practiced by (P3095), etc.). In the following,
the syntax of q
NH
is presented. We denote by topicId
the Wikidata ID of t
i
T .
CONSTRUCT { ?domain ?property ?range.
?range rdfs:label ?rangeLabel.
?property rdfs:label ?propertyLabel.}
WHERE { VALUES ?property {
wdt:P1269 wdt:P425 wdt:P101
wdt:P136 wdt:P527 wdt:P1552 wdt:P1557 wdt:P106
wdt:P2388 wdt:P2389 wdt:P361 wdt:P710 wdt:P3095
wdt:P4646 wdt:P641 wdt:P2578 wdt:P366 wdt:P1535
wdt:P2283 wdt:P1889}
{wd:topicId ?property ?range.
?range rdfs:label ?rangeLabel.}}
The execution of q
NH
produced a list of triples de-
noted by (d, p,r), where d the domain, p the predi-
cate and r the range. Furthermore, these triples are
parsed and added to the structure of Topic-OPA for
building the non-hierarchical scheme. In Figure 5, an
example of non-hierarchical relations extracted based
on the previously added concepts (see Figure 4).
Figure 5: Example of triples for building the non-
hierarchical scheme of Topic-OPA.
3.4.3 Ontology Enrichment
After building H and NH, we apply in this phase an
enrichment process based on NH. The application of
q
NH
has imported new concepts to the ontology such
as Government, Judiciary and Politics, among many
others. Therefore, these concepts will be added to the
hierarchy as well as their parent classes by applying
the query q
H
(see Figure 6).
Figure 6: Example of the enrichment of the hierarchical
scheme of Topic-OPA.
Topic-OPA: A Topic Ontology for Modeling Topics of Old Press Articles
279
4 CASE STUDY: LE MATIN
In this section, we introduce the application of the
SPARQL-based approach for developing Topic-OPA
in the context of the old newspaper Le Matin. For
this purpose, we have chosen A a corpus of 48 articles
published between 1910 and 1937 (see Figure 7 for an
example). For Building Topic-OPA, a set of N = 392
named entities representing A is considered (see Fig-
ure 2). As a result, we obtained a topic ontology, as
a subset of Wikidata, which is accessible and man-
ageable in ontology editors such as Prot
´
eg
´
e
3
. Note
that the topic ontology is not curated. We maintained
the concepts and relations which are obtained by the
application of the fully automatic approach. Thus,
Topic-OPA contains 2073 concepts, 3261 SubClas-
sOf relations and 1135 non-hierarchical relations. In
Figure 8, we depict an excerpt of Topic-OPA around
the Politics topic. The solid lines represent the Sub-
ClassOf relations and the dashed lines represent the
non-hierarchical relations.
Figure 7: Example of articles from Le Matin.
5 ONTOLOGY EVALUATION
Generally, the ontology evaluation approaches are di-
vided into four main categories (Fern
´
andez et al.,
2009): (1) gold standard-based that aims to com-
pare the developed ontology with a previously cre-
ated reference ontology known as the gold standard;
(2) corpus-based that tends to compare the developed
3
https://protege.stanford.edu/, last visited 23 July 2020
ontology with the content of a text corpus that covers
a given domain significantly ; (3) application-based
that considers the evaluation of ontologies according
to their performance in applications; (4) structure-
based that quantifies structure-based properties such
as the size and the complexity of ontologies.
In order to choose the “best” evaluation approach,
there is a need to define the motivation behind evalu-
ating a developed ontology (Fern
´
andez et al., 2009).
In our study, as evoked earlier, Topic-OPA is intended
to be used as a knowledge base in a topic labeling
system. Thus, it is considered as an application-based
ontology. In this context, Topic-OPA can be evalu-
ated using application-based and structure-based ap-
proaches for the following reasons:
the gold standard-based approach is not applica-
ble: Topic-OPA is developed as a subset of Wiki-
data. Thus, the best reference ontology for Topic-
OPA is Wikidata itself. However, it is impossible
to use Wikidata as a gold standard ontology be-
cause of its size. In addition, since Topic-OPA is
built for and from a given corpus of press articles,
it cannot be compared with other ontologies that
should be created under similar conditions with
similar goals.
the corpus-based approach is eliminated: the tex-
tual resources are out of scope of our study. As
evoked earlier, our hypothesis is based on a set
of disambiguated named entities extracted from
open knowledge bases such as Wikidata.
the application-based approach is the best evalua-
tion approach: it implies to evaluate the usability
of Topic-OPA being an application-based ontol-
ogy. This evaluation will be performed in further
works after embedding Topic-OPA in the topic la-
beling system.
the structure-based approach is a useful evalu-
ation approach for assessing the structure-based
properties of Topic-OPA. This approach is recom-
mended as an efficient approach for evaluating the
learned ontologies (Dellschaft and Staab, 2008).
Several measures have been recognized for the
structure-based evaluation such as Knowledge cover-
age and popularity measures (i.e. number of classes
and number of properties) and structural measures
(i.e. maximum depth, average depth, depth variance,
etc.) (Fern
´
andez et al., 2009). The application of
these measures relies on an assumption that is a richly
populated ontology, with higher depth and breadth
variance is more likely to provide reliable semantic
content. The structural measures are positively cor-
related with the semantic accuracy of the knowledge
modeled in the ontology (Sanchez et al., 2015). In the
KEOD 2020 - 12th International Conference on Knowledge Engineering and Ontology Development
280
Figure 8: Excerpt of Topic-OPA around the concept Politics.
context of Topic-OPA, we quantified some structural
measures, by considering its taxonomic structure, as
follows:
maximum depth=28: represents the length of the
longest taxonomic branch in the ontology..
average depth=6: is the average length of all tax-
onomic branches.
depth variance=6.38: is the dispersion with re-
spect to the average depth, computed as the stan-
dard mathematical variance.
We conclude that Topic-OPA is a richly learnt on-
tology. However, the majority of the topic concepts
are dispersed homogeneously within the core level of
Topic-OPA. This implies two main issues: (1) the hi-
erarchical structure of Topic-OPA is a balanced taxon-
omy, in which the majority of taxonomic edges have
almost the same depth and (2) it will be a challenging
task to expose the topic concepts which are relevant
for topic labeling the articles.
6 CONCLUSIONS AND FUTURE
WORK
This paper discussed the building process of a topic-
specific ontology, named Topic-OPA, for representing
the contents of a set of old press articles that needs
to be labeled with a set of topics. Topic-OPA will
be used for associating topics to each old press ar-
ticle. The development of Topic-OPA relies mainly
on a set of disambiguated named entities represent-
ing the articles. In this regard, a SPARQL-based fully
automatic approach is applied, based on the disam-
biguated named entities, for harvesting Topic-OPA
from the open knowledge graph Wikidata. The pro-
posed approach is composed of four main phases: (1)
collect the most specific topic concepts which are lo-
cated at the lowest level of Topic-OPA, (2) build the
hierarchical scheme based on these concepts, (3) con-
struct the non-hierarchical scheme based on the hi-
erarchical scheme and (4) enrich the ontology with
the concepts imported by the non-hierarchical rela-
tions. A case study is presented in the context of the
old french newspaper Le Matin for building Topic-
OPA from a corpus of 48 articles. By the applica-
tion of the SPARQL-based approach, a richly learnt
topic-specific ontology is obtained. Furthermore, a
structure-based evaluation approach is applied to as-
sess the quality of the structure of Topic-OPA. We
found that the majority of the topic concepts are lo-
cated at the core level of Topic-OPA. This implies
that a challenging task will take place for defining the
topic concepts which will be used for labeling the ar-
ticles. In this study, we do not consider the curation
of the topic ontology after the automatic building pro-
cess. We maintained the ontology structure and con-
tent, including the abstract and specific concepts, as
derived from Wikidata. In future works, we will ap-
ply a curation process aiming to clean and leverage
Topic-OPA. Furthermore, Topic-OPA will be embed-
ded in a topic labeling system for automatic topic la-
beling of old press articles. A specific semantic relat-
edness measure, named Rel
Topic
, has been proposed
in order to associate the good topic of Topic-OPA to a
specific press article taking into consideration the set
Topic-OPA: A Topic Ontology for Modeling Topics of Old Press Articles
281
of named entities of it. Unfortunately, we have not
been able to present it in this paper, because of lack
of space. However, it is worth highlighting that the
preliminary results on the use of Rel
Topic
associated
with Topic-OPA are encouraging, as its use presents
a precision higher than 80% on a corpus of old press
articles that were labeled by human experts.
ACKNOWLEDGMENTS
This work is funded by the Normandy Region
(France) and the European Union with the European
Regional Development Fund (FEDER).
REFERENCES
Allahyari, M. and Kochut, K. (2017). A knowledge-based
topic modeling approach for automatic topic labeling.
International Journal of Advanced Computer Science
and Applications, 8(9):335–349.
B
¨
ohm, K. and Ortiz, M. (2018). A tool for building topic-
specific ontologies using a knowledge graph. In Pro-
ceedings of the 31st International Workshop on De-
scription Logics co-located with 16th International
Conference on Principles of Knowledge Representa-
tion and Reasoning (KR 2018).
Bielefeldt, A., Gonsior, J., and Krotzsch, M. (2018). Practi-
cal linked data access via sparql: The case of wikidata.
In Proceedings of the WWW2018 Workshop on Linked
Data on the Web (LDOW-18), CEUR Workshop Pro-
ceedings.
Blei, D., Ng, A., and Jordan, M. (2003). Latent dirichlet al-
location. The Journal of Machine Learning Research,
3:993–1022.
Cimiano, P. and V
¨
olker, J. (2005). Text2onto. In Nat-
ural Language Processing and Information Systems,
NLDB 2005, pages 227–238. Springer, Berlin, Hei-
delberg.
Dellschaft, K. and Staab, S. (2008). Strategies for the eval-
uation of ontology learning. In Proceedings of the
2008 Conference on Ontology Learning and Popula-
tion: Bridging the Gap between Text and Knowledge,
Frontiers in Artificial Intelligence and Applications,
pages 253–272.
Erxleben, F., G
¨
unther, M., Kr
¨
otzsch, Mendez, J., and
Vrande
ˇ
ci
´
c, D. (2014). Introducing wikidata to the
linked data web. In Proceedings 13th Int. Semantic
Web Conf. (ISWC’14), LNCS, pages 50–65.
Fern
´
andez, M., Overbeeke, C., Sabou, M., and Motta, E.
(2009). What makes a good ontology? a case-study in
fine-grained knowledge reuse. In The semantic Web,
pages 61–75. Springer, Berlin, Heidelberg.
Fern
´
andez-L
´
opez, M., G
´
omez-P
´
erez, A., and Juristo, N.
(1997). Methontology: From ontological art towards
ontological engineering. In AAAI.
Fortuna, B., Grobelnik, M., and Mladenic, D. (2007). Onto-
gen: Semi-automatic ontology editor. In Human Inter-
face and the Management of Information, Interacting
in Information Environments, Human Interface 2007,
pages 309–318. Springer, Berlin, Heidelberg.
Hofmann, T. (1999). Probabilistic latent semantic indexing.
In Proceedings of the 22nd annual international ACM
SIGIR conference on Research and development in in-
formation retrieval, pages 50–57. ACM New York.
Maguitman, A., Cecchini, R., Lorenzetti, C., and Menczer,
F. (2010). Using topic ontologies and semantic simi-
larity data to evaluate topical search. In Proceedings
of Conferencia Latino-americana de Inform
´
atica.
Malyshev, S., Krotzsch, M., Gonzalez, L., Gonsior, J.,
and Bielefeldt, A. (2018). Getting the most out of
wikidata: Semantic technology usage in wikipedia’s
knowledge graph. In Proceedings of the 17th Inter-
national Semantic Web Conference (ISWC’18), pages
376–394. Springer.
Sanchez, D., Batet, M., Martinez, S., and Ferrer, J. (2015).
Semantic variance: An intuitive measure for ontology
accuracy evaluation. Engineering Applications of Ar-
tificial Intelligence, 39:89–99.
Sleeman, J., Finin, T., and Halem, M. (2018). Ontology-
grounded topic modeling for climate science research.
In Proceedings of Semantic Web for Social Good
Workshop, ISWC.
Speer, R., Chin, J., and Havasi, C. (2017). Conceptnet 5.5:
An open multilingual graph of general knowledge. In
AAAI, pages 4444–4451.
Sure, Y., Staab, S., and Studer, R. (2004). On-to-
knowledge methodology (otkm). In Handbook on
Ontologies, International Handbooks on Information
Systems. Springer, Berlin, Heidelberg.
Tang, Y., Baer, P., Zhao, G., and Meersman, R. (2009).
On constructing, grouping and using topical ontology
for semantic matching. In Proceedings of OTM 2009
Workshops (On the Move to Meaningful Internet Sys-
tems). Springer Berlin, Heidelberg.
Uschold, M. and Gruninger, M. (1996). Ontologies: princi-
ples, methods and applications. The Knowledge Engi-
neering Review, 11:93–136.
Zhao, G. and Meersman, R. (2005). Architecting ontol-
ogy for scalability and versatility. In On the Move
to Meaningful Internet Systems 2005: CoopIS, DOA,
and ODBASE. OTM 2005.
KEOD 2020 - 12th International Conference on Knowledge Engineering and Ontology Development
282