Topic-OPA: A Topic Ontology for Modeling Topics of Old Press Articles

Mirna El Ghosh, Cecilia Zanni-Merk, Nicolas Delestre, Jean-Philippe Kotowicz and Habib Abdulrab

Normandie Universit

e, INSA Rouen, LITIS, 76000 Rouen, France

Keywords:

Topic Ontologies, Topic Modeling, Open Knowledge Graphs, SPARQL.

Abstract:

Topic ontologies are recently gaining much importance in several domains. Their purpose is to identify the

themes necessary to describe the knowledge structure of an application domain. Meanwhile, their development

from scratch is hard and time consuming task. This paper discusses the development a topic-speciﬁc ontology,

named Topic-OPA, for modeling topics of old press articles. Topic-OPA is extracted from the open knowledge

graph Wikidata by the application of a SPARQL-based fully automatic approach. The development process

of Topic-OPA depends mainly on a set of disambiguated named entities representing the articles. Each named

entity is unambiguously identiﬁed by a Wikidata URI. In contrast to existent topic ontologies, which are

limited to taxonomies, the structure of Topic-OPA is composed of hierarchical and non-hierarchical schemes.

The domain application of this work is the old french newspaper Le Matin. Finally, an evaluation process is

performed to assess the structure quality of Topic-OPA.

1 INTRODUCTION

Topic ontologies are recently gaining signiﬁcant at-

tention in the ontology engineering community. They

are being increasingly used in various domains such

as semantic matching (Tang et al., 2009), topic la-

beling (Allahyari and Kochut, 2017), topic model-

ing (Sleeman et al., 2018) and evaluating topical

search (Maguitman et al., 2010). The purpose of

topic ontologies is to represent the main “themes”

of a given application domain. The most commonly

known approaches for building topic models are the

keyword-based construction approaches which are

based mainly on text mining and information retrieval

techniques (Maguitman et al., 2010). Examples of

these approaches are the statistical approaches such

as probabilistic latent semantic analysis (pLSA) (Hof-

mann, 1999) and latent Dirichlet allocation (LDA)

(Blei et al., 2003). These approaches depend on the

textual content of the articles and consider it as a mix-

ture of topics. Their main drawback is that they risk

to retrieve speciﬁc topics. Although, it is hard and

time consuming to construct an ontology from a large

corpus of documents (Maguitman et al., 2010).

The works we are presenting in this article are part

of the ASTURIAS

project. The main goal of this

Analyse STructURelle et Indexation s

emantique

d’ArticleS de presse - Structural Analysis and Semantic In-

dexing of Newspaper Articles

project is to thematically organize a collection of old

press articles with a set of topics (e.g. Politics, Art,

Sport, Science, etc.).

In fact, one of the speciﬁc features of old press

is that it does not offer thematic entries: articles ap-

pear and follow one another without a thematic logic.

Under these conditions, it remains a tedious task to

query sources that report the same events from dif-

ferent points of view in different areas of the news-

paper. The scientiﬁc challenge is to propose robust

approaches for the analysis of texts that are noisy

due to the imperfect process of automatic transcrip-

tion of images into electronic texts. These approaches

need also to be multi-thematic, and robust to linguis-

tic evolution over the centuries. The ambition the

ASTURIAS project (whose workﬂow appears in Fig-

ure 1) is to study the digitization process from end

to end of the processing chain: WP1- from newspa-

per images, automatically analyze sections, articles

and texts; WP2- extract named entities from these el-

ements WP3- Topic labeling and hyperlinking the ar-

ticles based on the analysis made in 1 and the named

entities extracted in WP2.

This article will present our results on building a

topic model for WP3 of the ASTURIAS project. In

this context, a fundamental hypothesis is that articles

are represented by a set of “not ambiguous” named

entities (e.g. person, organization, product and loca-

tion) extracted from open data sources (coming from

El Ghosh, M., Zanni-Merk, C., Delestre, N., Kotowicz, J. and Abdulrab, H.

Topic-OPA: A Topic Ontology for Modeling Topics of Old Press Articles.

DOI: 10.5220/0010147202750282

In Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - Volume 2: KEOD, pages 275-282

ISBN: 978-989-758-474-9

275

Figure 1: The pipeline of the project ASTURIAS.

WP2 of the project). Therefore, our research prob-

lem can be deﬁned as follows: Given a corpus of old

press articles A represented by a set of named entities

N, that are collected from A and identiﬁed by a set

of URIs, a topic model is required for modeling the

topics that represent A. The topic model will be used

for topic labeling the old press articles. From this per-

spective, we propose a SPARQL-based approach, re-

lying mainly on the set of the disambiguated named

entities, for building the topic model. In this regard,

open knowledge graphs, such as Wikidata, are con-

sidered. The main goal of this paper is to discuss

the development process of a topic-speciﬁc ontology,

named Topic-OPA, by the application of a SPARQL-

based fully automatic approach. Topic-OPA is de-

rived from the open knowledge graph Wikidata based

on a set of “not ambiguous” named entities represent-

ing the articles. A case study is demonstrated for

building Topic-OPA from the articles of Le Matin

an old french newspaper ﬁrst published in 1884 and

discontinued in 1944. Finally, Topic-OPA is evalu-

ated by the application of a structure-based evalua-

tion approach. The rest of the paper is organized as

follows: Section 2 presents the main related works of

this study. In section 3, we discuss the development

process of Topic-OPA. Section 4 presents a case study

in the context of Le Matin. In section 5, we evaluate

Topic-OPA. Finally section 6 concludes the paper.

2 RELATED WORKS

In this section, topic ontologies and ontology engi-

neering approaches are introduced as the main related

foundations for our study.

2.1 Topic Ontologies

Topic ontologies are considered as special type of on-

tologies. Their purpose is to identify the “themes”

necessary to describe the knowledge structure of an

application domain (Zhao and Meersman, 2005). A

topic ontology is represented as a set of topics that are

interconnected using semantic relations. Two main

https://gallica.bnf.fr/ark:/12148/cb328123058/date,

last visited on April 8 2020

types of topic ontologies are deﬁned: simple, and gen-

eral (Maguitman et al., 2010). The simple topic on-

tologies are composed of topics linked by hierarchi-

cal relations. Meanwhile, in general topic ontologies,

transverse relations are included to link different top-

ics in a non-hierarchical scheme. For representing

general topic ontologies, the following components

are commonly deﬁned:

• Topics: concepts of the topic ontology (e.g. Sport,

Art, Politics).

• Predicates: types of relationships deﬁning the

semantic relations which can be established be-

tween ontology concepts. Multiple predicates are

deﬁned in general topic ontologies: hierarchical

(e.g. subClassOf ) and non-hierarchical (e.g. stud-

ied by, part of, etc.)

• Relationships: concrete links among ontology

concepts which will be used to characterize paths

in graphs. They are distinguished according to

their predicate and the couple of elements they

link. They can be represented as a triplet (s,p,o)

where s the subject, o the object and p the predi-

cate that links s and o (e.g. Literature subClassOf

Art, Art part of Culture).

2.1.1 KB-LDA Topic Model

For topic labeling purposes, the topic model KB-LDA

(Allahyari and Kochut, 2017) is developed based on

combining topic models with ontological concepts

in a single framework. KB-LDA used the seman-

tic knowledge graph of concepts in an ontology (e.g.

DBpedia) and their diverse relationships with unsu-

pervised probabilistic topic models for generating au-

tomatic topic labels. The topic labeling process is

performed based on the semantic similarity between

the entities included in text documents and a suitable

portion of the ontology. For this purpose a seman-

tic graph is constructed from the concepts of the on-

tology and their classiﬁcation hierarchy as labels for

topics.

2.1.2 IPCC Topic Model

For topic modeling purposes, IPCC (Sleeman et al.,

2018) is a domain-speciﬁc topic ontology used for

grounding a topic model in the domain of climate re-

search. The topic ontology is “seeded” with prede-

ﬁned key word phrase concepts which are obtained

from domain-speciﬁc sources such as domain experts,

and by data mining semi-structured sources. Natu-

ral Language Processing techniques have been used

to extract the meaningful key word phrase concepts

KEOD 2020 - 12th International Conference on Knowledge Engineering and Ontology Development

276

from these sources. While, the topic modeling pro-

cess is applied on textual resources such as, reports

and research papers, the ontology concepts are used

for weighting concepts founded in these resources.

Furthermore, the topic ontology is enriched with the

concepts associated with the textual resources and the

generated topics.

2.2 Ontology Engineering Approaches

In the ontology engineering domain, several ap-

proaches have been proposed for building ontolo-

gies from scratch or by reusing other existing ontolo-

gies. The most known approaches are Uschold and

Gruninger (Uschold and Gruninger, 1996), Methon-

tology (Fern

andez-L

opez et al., 1997) and ON-TO-

KNOWLEDGE (Sure et al., 2004). These approaches

focus on an iterative process of ontology building

and are composed of common phases such as spec-

iﬁcation, conceptualization, formalization, applica-

tion and evaluation. In addition, approaches such

as Text2Onto (Cimiano and V

olker, 2005) and On-

toGen (Fortuna et al., 2007) aim to generate ontolo-

gies semi-automatically with the help of user inter-

ference. These approaches exploit textual resources

and rely on natural language processing techniques.

However, few works have been found in the literature

about building ontologies from knowledge graphs.

In (B

ohm and Ortiz, 2018), the authors discusses

the building of topic-speciﬁc ontologies from open

knowledge graphs such as ConceptNet (Speer et al.,

2017). A query-based interactive approach is applied

for extracting entities and relations from the knowl-

edge graph. Based on the extraction process as well

as the interaction of the user, the central taxonomy

of the topic ontology is constructed. Furthermore,

adding complex concepts is processed to enrich the

ontology. Finally, a clean-up phase is performed in

order to modify or to add new concepts to the taxon-

omy.

3 SPARQL-BASED AUTOMATIC

APPROACH FOR BUILDING

TOPIC-OPA

For building topic ontologies, the most commonly

known approaches are the keyword-based construc-

tion approaches which are based mainly on text min-

ing and information retrieval techniques (Maguitman

et al., 2010). However, these approaches are not ef-

ﬁcient, hard and time consuming to construct an on-

tology from a large corpus of documents (Maguitman

et al., 2010). From this perspective and for simpli-

fying the construction process of Topic-OPA, open

knowledge graphs are considered. Generally, knowl-

edge graphs are very large and contain many entities

that are too general or speciﬁc to be successfully used

as topics for topic labeling (B

ohm and Ortiz, 2018).

Meanwhile, they can be leveraged to build with mod-

erate efforts small to medium-sized meaningful topic

ontologies. As a knowledge graph, we selected Wiki-

data. It is a free and open knowledge graph and acts as

central storage for the structured data of its Wikime-

dia sister projects including Wikipedia, Wiktionary,

and others (Erxleben et al., 2014). Wikidata stores

more than 402 million statements about over 45 mil-

lion entities (Malyshev et al., 2018). Today, more than

60 million of items are described. The data model of

Wikidata is based on a directed, labelled graph where

entities are connected by edges that are labelled by

“properties” (Bielefeldt et al., 2018). Thus, the sys-

tem distinguishes two main types of entities: items

and properties. Items are uniquely identiﬁed by a “Q”

followed by a number, such as Paris (Q90). Proper-

ties describe detailed characteristics of an item and

represented by a “P” followed by a number, such as

instance of (P31). Entities are represented by URIs

(e.g. http://www.wikidata.org/entity/Q90 for Paris

and http://www.wikidata.org/entity/P31 for instance

of ).

3.1 Ontology Speciﬁcation

The ontology speciﬁcation clariﬁes the scope and the

purpose of the targeted topic ontology Topic-OPA.

Topic-OPA is a topic-speciﬁc ontology intended for

modeling the topics of old press articles. Thus, the

scope is limited to old newspapers and journals which

are not organized thematically as the recent ones.

Therefore, given a corpus of articles in 1920, Topic-

OPA is constructed from the disambiguated named

entities representing these articles (see Figure 2 for

an example of named entities representing the arti-

cles depicted in Figure 7). Thereby, Topic-OPA will

not be useful for labeling articles in 2020. Concern-

ing the purpose, Topic-OPA is intended to build au-

tomated applications such as topic labeling systems.

Although, it can be used to develop larger ontologies

for more specialized purposes reducing the time and

effort needed to develop ontologies from scratch.

3.2 Ontology Requirements

In the ontology engineering domain, the set of re-

quirements that the ontology should satisfy is di-

vided into functional and non-functional requirements

Topic-OPA: A Topic Ontology for Modeling Topics of Old Press Articles

277

(Fern

andez et al., 2009). The functional requirements

deﬁne what needs to be expressed by the ontology

model. Meanwhile, the non-functional requirements

specify how an ontology needs to be designed in order

to be applicable. For Topic-OPA, the main functional

requirement is that it needs to be composed of two

different schemes:

• Hierarchical Scheme: consists of hierarchical re-

lations such as subClassOf that permit the infer-

ence of knowledge in the ontology graph.

• Non-hierarchical Scheme: involves non-

hierarchical relations such as related, part

of, used by, etc. that have an important implica-

tion in the semantic relationships between the

concepts.

Concerning the non-functional requirements, we con-

sider data traceability and scalability by mapping the

concepts and the relations of the topic ontology to en-

tities in open knowledge graphs such as Wikidata.

3.3 Ontology Deﬁnition

In our work, we are interested in general topic on-

tologies which are composed of hierarchical and non

hierarchical schemes. In the following, we deﬁne

these ontologies by considering mapping to knowl-

edge graphs.

Deﬁnition 8. We deﬁne a general topic ontology, in

which mapping to knowledge graphs is considered, by

O =



T, R,E,φ



, with

• T the set of topic concepts,

• R the set of predicates: {subClassOf, instance of,

part of, use, related by, etc.},

• E the set of relationships: E ⊆ T × R × T

• φ the mapping of T and R to entities in a knowl-

edge graph K.

3.4 Ontology Building

For building Topic-OPA, a SPARQL-based fully auto-

matic approach is applied. This approach, which aims

to harvest Topic-OPA from the open knowledge graph

Wikidata, is composed of three main phases: (1) con-

struction of the hierarchical scheme, (2) construction

of the non-hierarchical scheme and (3) ontology en-

richment.

3.4.1 Building the Hierarchical Scheme:

Bottom-up Strategy

The hierarchical scheme of Topic-OPA, which rep-

resents the taxonomy of topic concepts, can be for-

mally deﬁned by H =



T, R,E

,φ



, where T is the

set of topic concepts, R is the unique predicate

{subClassOf } used for ordering the topic concepts,

is the set of ordering relations and φ is the map-

ping function to Wikidata. In the hierarchy, a root

element denoted > is deﬁned as a general subsumer

for all the topic concepts, i.e., ∀t

∈ T,t

v >. For

building the hierarchy, a SPARQL-based bottom-up

approach is applied. The development process starts

with a deﬁnition of the most speciﬁc topic concepts

of the hierarchy and continues by extracting the more

general concepts. The approach started from a set of

named entities N represented by a set of URIs (see

Figure 2).

Figure 2: Example of named entities representing articles

and A

depicted in Figure 7.

Deﬁnition of the Most Speciﬁc Topic Concepts.

At this phase, a SELECT SPARQL query, relying

mainly on N and the Knowledge graph K, is applied to

deﬁne S

⊂ T the most speciﬁc topic concepts of the

hierarchy, ∀t

∈ S

,@t

v t

. The SELECT query

q(n,r) takes as inputs a named entity n ∈ N and a

property r ∈ K and returns set of topic concepts. For

the application of q, we deﬁned two main relation

types {P31, P106}. The property instance of (P31)

is used for all the named entities to retrieve their su-

perclasses. Meanwhile, for the named entities that

are instances of Human (Q5), which is a very gen-

eral topic, applying the property occupation (P106) is

required to fetch more speciﬁc topic concepts. In the

following, the syntax of q is presented. We denote by

entityId, the Wikidata ID of the named entity which is

extracted from the URI.

SELECT ?specificTopic WHERE {

wd:entityId ?property ?specificTopic.

VALUES ?property {wdt:P31 wdt:P106}}

As an example, let us consider a named entity n =

{John Simon(Q333091)} (see Figure 3). In Wikidata,

John Simon is instance of (P31) Human (Q5) and

linked to judge, lawyer and politician by the property

occupation (P106). Thus, S

(n)={Judge, Lawyer,

Politician}.

KEOD 2020 - 12th International Conference on Knowledge Engineering and Ontology Development

278

Figure 3: Deﬁnition of the most speciﬁc concepts based on

the named entities of A

Extraction of Hierarchies. The aim of this phase

is to build the taxonomy of topic concepts H. The

building process starts from the most speciﬁc to the

most general concepts. For this purpose, a CON-

STRUCT SPARQL query q

)/t

∈ S

and associ-

ated to φ(t

), is applied to fetch the parent classes of

aiming to build a RDF graph of the hierarchy. In

this context, each query returns three different types

of triples: (1) to deﬁne the ontology classes, (2) to

create the taxonomic relations (inspired by usage in

RDF rdfs:subClassOf ) and (3) to label the ontology

classes. All triples are denoted by (s, p,o), where s

the subject, p the predicate and o the object. In the

following, the syntax of q

is presented. We denote

by topicId the Wikidata ID of t

∈ S

CONSTRUCT { ?class a owl:Class.

?class rdfs:subclassOf ?superclass.

?class rdfs:label ?classLabel.

?property rdfs:domain ?class.

?property rdfs:label ?classLabel.}

WHERE { wd:topicId wdt:P279* ?class.

?class wdt:P279 ?superclass.

?class rdfs:label ?classLabel.}

In Figure 4, an example of triples extracted based on

(John Simon).

Figure 4: Example of triples for building the hierarchical

scheme of Topic-OPA.

3.4.2 Building the Non-hierarchical Scheme

The non-hierarchical scheme of Topic-OPA can be

formally deﬁned by NH =



T, R,E,φ



, where T is

the set of topic concepts, R is the ﬁnite set of pred-

icates, E ⊆ T ×R ×T is the set of transverse relation-

ships among the topics and φ the mapping function. In

this phase, the non-hierarchical relations are extracted

from Wikidata for building NH. These relations are

represented by the deﬁnition of the domain/range of

the properties that will be added to the graph as edges

between domains and ranges. For this purpose, a

CONSTRUCT query q

)/t

∈ T and associated to

φ(t

), is applied to fetch all the triples where t

are

domains or ranges. In this context, the selection of

properties is restricted to a predeﬁned list based on

their relevance in different domains (e.g. ﬁeld of work

(P101), has part (P527), has quality (P1552), part of

(P361), practiced by (P3095), etc.). In the following,

the syntax of q

is presented. We denote by topicId

the Wikidata ID of t

∈ T .

CONSTRUCT { ?domain ?property ?range.

?range rdfs:label ?rangeLabel.

?property rdfs:label ?propertyLabel.}

WHERE { VALUES ?property {

wdt:P1269 wdt:P425 wdt:P101

wdt:P136 wdt:P527 wdt:P1552 wdt:P1557 wdt:P106

wdt:P2388 wdt:P2389 wdt:P361 wdt:P710 wdt:P3095

wdt:P4646 wdt:P641 wdt:P2578 wdt:P366 wdt:P1535

wdt:P2283 wdt:P1889}

{wd:topicId ?property ?range.

?range rdfs:label ?rangeLabel.}}

The execution of q

produced a list of triples de-

noted by (d, p,r), where d the domain, p the predi-

cate and r the range. Furthermore, these triples are

parsed and added to the structure of Topic-OPA for

building the non-hierarchical scheme. In Figure 5, an

example of non-hierarchical relations extracted based

on the previously added concepts (see Figure 4).

Figure 5: Example of triples for building the non-

hierarchical scheme of Topic-OPA.

3.4.3 Ontology Enrichment

After building H and NH, we apply in this phase an

enrichment process based on NH. The application of

has imported new concepts to the ontology such

as Government, Judiciary and Politics, among many

others. Therefore, these concepts will be added to the

hierarchy as well as their parent classes by applying

the query q

(see Figure 6).

Figure 6: Example of the enrichment of the hierarchical

scheme of Topic-OPA.

Topic-OPA: A Topic Ontology for Modeling Topics of Old Press Articles

279

4 CASE STUDY: LE MATIN

In this section, we introduce the application of the

SPARQL-based approach for developing Topic-OPA

in the context of the old newspaper Le Matin. For

this purpose, we have chosen A a corpus of 48 articles

published between 1910 and 1937 (see Figure 7 for an

example). For Building Topic-OPA, a set of N = 392

named entities representing A is considered (see Fig-

ure 2). As a result, we obtained a topic ontology, as

a subset of Wikidata, which is accessible and man-

ageable in ontology editors such as Prot

. Note

that the topic ontology is not curated. We maintained

the concepts and relations which are obtained by the

application of the fully automatic approach. Thus,

Topic-OPA contains 2073 concepts, 3261 SubClas-

sOf relations and 1135 non-hierarchical relations. In

Figure 8, we depict an excerpt of Topic-OPA around

the Politics topic. The solid lines represent the Sub-

ClassOf relations and the dashed lines represent the

non-hierarchical relations.

Figure 7: Example of articles from Le Matin.

5 ONTOLOGY EVALUATION

Generally, the ontology evaluation approaches are di-

vided into four main categories (Fern

andez et al.,

2009): (1) gold standard-based that aims to com-

pare the developed ontology with a previously cre-

ated reference ontology known as the gold standard;

(2) corpus-based that tends to compare the developed

https://protege.stanford.edu/, last visited 23 July 2020

ontology with the content of a text corpus that covers

a given domain signiﬁcantly ; (3) application-based

that considers the evaluation of ontologies according

to their performance in applications; (4) structure-

based that quantiﬁes structure-based properties such

as the size and the complexity of ontologies.

In order to choose the “best” evaluation approach,

there is a need to deﬁne the motivation behind evalu-

ating a developed ontology (Fern

andez et al., 2009).

In our study, as evoked earlier, Topic-OPA is intended

to be used as a knowledge base in a topic labeling

system. Thus, it is considered as an application-based

ontology. In this context, Topic-OPA can be evalu-

ated using application-based and structure-based ap-

proaches for the following reasons:

• the gold standard-based approach is not applica-

ble: Topic-OPA is developed as a subset of Wiki-

data. Thus, the best reference ontology for Topic-

OPA is Wikidata itself. However, it is impossible

to use Wikidata as a gold standard ontology be-

cause of its size. In addition, since Topic-OPA is

built for and from a given corpus of press articles,

it cannot be compared with other ontologies that

should be created under similar conditions with

similar goals.

• the corpus-based approach is eliminated: the tex-

tual resources are out of scope of our study. As

evoked earlier, our hypothesis is based on a set

of disambiguated named entities extracted from

open knowledge bases such as Wikidata.

• the application-based approach is the best evalua-

tion approach: it implies to evaluate the usability

of Topic-OPA being an application-based ontol-

ogy. This evaluation will be performed in further

works after embedding Topic-OPA in the topic la-

beling system.

• the structure-based approach is a useful evalu-

ation approach for assessing the structure-based

properties of Topic-OPA. This approach is recom-

mended as an efﬁcient approach for evaluating the

learned ontologies (Dellschaft and Staab, 2008).

Several measures have been recognized for the

structure-based evaluation such as Knowledge cover-

age and popularity measures (i.e. number of classes

and number of properties) and structural measures

(i.e. maximum depth, average depth, depth variance,

etc.) (Fern

andez et al., 2009). The application of

these measures relies on an assumption that is a richly

populated ontology, with higher depth and breadth

variance is more likely to provide reliable semantic

content. The structural measures are positively cor-

related with the semantic accuracy of the knowledge

modeled in the ontology (Sanchez et al., 2015). In the

KEOD 2020 - 12th International Conference on Knowledge Engineering and Ontology Development

280

Figure 8: Excerpt of Topic-OPA around the concept Politics.

context of Topic-OPA, we quantiﬁed some structural

measures, by considering its taxonomic structure, as

follows:

• maximum depth=28: represents the length of the

longest taxonomic branch in the ontology..

• average depth=6: is the average length of all tax-

onomic branches.

• depth variance=6.38: is the dispersion with re-

spect to the average depth, computed as the stan-

dard mathematical variance.

We conclude that Topic-OPA is a richly learnt on-

tology. However, the majority of the topic concepts

are dispersed homogeneously within the core level of

Topic-OPA. This implies two main issues: (1) the hi-

erarchical structure of Topic-OPA is a balanced taxon-

omy, in which the majority of taxonomic edges have

almost the same depth and (2) it will be a challenging

task to expose the topic concepts which are relevant

for topic labeling the articles.

6 CONCLUSIONS AND FUTURE

WORK

This paper discussed the building process of a topic-

speciﬁc ontology, named Topic-OPA, for representing

the contents of a set of old press articles that needs

to be labeled with a set of topics. Topic-OPA will

be used for associating topics to each old press ar-

ticle. The development of Topic-OPA relies mainly

on a set of disambiguated named entities represent-

ing the articles. In this regard, a SPARQL-based fully

automatic approach is applied, based on the disam-

biguated named entities, for harvesting Topic-OPA

from the open knowledge graph Wikidata. The pro-

posed approach is composed of four main phases: (1)

collect the most speciﬁc topic concepts which are lo-

cated at the lowest level of Topic-OPA, (2) build the

hierarchical scheme based on these concepts, (3) con-

struct the non-hierarchical scheme based on the hi-

erarchical scheme and (4) enrich the ontology with

the concepts imported by the non-hierarchical rela-

tions. A case study is presented in the context of the

old french newspaper Le Matin for building Topic-

OPA from a corpus of 48 articles. By the applica-

tion of the SPARQL-based approach, a richly learnt

topic-speciﬁc ontology is obtained. Furthermore, a

structure-based evaluation approach is applied to as-

sess the quality of the structure of Topic-OPA. We

found that the majority of the topic concepts are lo-

cated at the core level of Topic-OPA. This implies

that a challenging task will take place for deﬁning the

topic concepts which will be used for labeling the ar-

ticles. In this study, we do not consider the curation

of the topic ontology after the automatic building pro-

cess. We maintained the ontology structure and con-

tent, including the abstract and speciﬁc concepts, as

derived from Wikidata. In future works, we will ap-

ply a curation process aiming to clean and leverage

Topic-OPA. Furthermore, Topic-OPA will be embed-

ded in a topic labeling system for automatic topic la-

beling of old press articles. A speciﬁc semantic relat-

edness measure, named Rel

Topic

, has been proposed

in order to associate the good topic of Topic-OPA to a

speciﬁc press article taking into consideration the set

Topic-OPA: A Topic Ontology for Modeling Topics of Old Press Articles

281

of named entities of it. Unfortunately, we have not

been able to present it in this paper, because of lack

of space. However, it is worth highlighting that the

preliminary results on the use of Rel

Topic

associated

with Topic-OPA are encouraging, as its use presents

a precision higher than 80% on a corpus of old press

articles that were labeled by human experts.

ACKNOWLEDGMENTS

This work is funded by the Normandy Region

(France) and the European Union with the European

Regional Development Fund (FEDER).

REFERENCES

Allahyari, M. and Kochut, K. (2017). A knowledge-based

topic modeling approach for automatic topic labeling.

International Journal of Advanced Computer Science

and Applications, 8(9):335–349.

ohm, K. and Ortiz, M. (2018). A tool for building topic-

speciﬁc ontologies using a knowledge graph. In Pro-

ceedings of the 31st International Workshop on De-

scription Logics co-located with 16th International

Conference on Principles of Knowledge Representa-

tion and Reasoning (KR 2018).

Bielefeldt, A., Gonsior, J., and Krotzsch, M. (2018). Practi-

cal linked data access via sparql: The case of wikidata.

In Proceedings of the WWW2018 Workshop on Linked

Data on the Web (LDOW-18), CEUR Workshop Pro-

ceedings.

Blei, D., Ng, A., and Jordan, M. (2003). Latent dirichlet al-

location. The Journal of Machine Learning Research,

3:993–1022.

Cimiano, P. and V

olker, J. (2005). Text2onto. In Nat-

ural Language Processing and Information Systems,

NLDB 2005, pages 227–238. Springer, Berlin, Hei-

delberg.

Dellschaft, K. and Staab, S. (2008). Strategies for the eval-

uation of ontology learning. In Proceedings of the

2008 Conference on Ontology Learning and Popula-

tion: Bridging the Gap between Text and Knowledge,

Frontiers in Artiﬁcial Intelligence and Applications,

pages 253–272.

Erxleben, F., G

unther, M., Kr

otzsch, Mendez, J., and

Vrande

c, D. (2014). Introducing wikidata to the

linked data web. In Proceedings 13th Int. Semantic

Web Conf. (ISWC’14), LNCS, pages 50–65.

Fern

andez, M., Overbeeke, C., Sabou, M., and Motta, E.

(2009). What makes a good ontology? a case-study in

ﬁne-grained knowledge reuse. In The semantic Web,

pages 61–75. Springer, Berlin, Heidelberg.

Fern

andez-L

opez, M., G

omez-P

erez, A., and Juristo, N.

(1997). Methontology: From ontological art towards

ontological engineering. In AAAI.

Fortuna, B., Grobelnik, M., and Mladenic, D. (2007). Onto-

gen: Semi-automatic ontology editor. In Human Inter-

face and the Management of Information, Interacting

in Information Environments, Human Interface 2007,

pages 309–318. Springer, Berlin, Heidelberg.

Hofmann, T. (1999). Probabilistic latent semantic indexing.

In Proceedings of the 22nd annual international ACM

SIGIR conference on Research and development in in-

formation retrieval, pages 50–57. ACM New York.

Maguitman, A., Cecchini, R., Lorenzetti, C., and Menczer,

F. (2010). Using topic ontologies and semantic simi-

larity data to evaluate topical search. In Proceedings

of Conferencia Latino-americana de Inform

atica.

Malyshev, S., Krotzsch, M., Gonzalez, L., Gonsior, J.,

and Bielefeldt, A. (2018). Getting the most out of

wikidata: Semantic technology usage in wikipedia’s

knowledge graph. In Proceedings of the 17th Inter-

national Semantic Web Conference (ISWC’18), pages

376–394. Springer.

Sanchez, D., Batet, M., Martinez, S., and Ferrer, J. (2015).

Semantic variance: An intuitive measure for ontology

accuracy evaluation. Engineering Applications of Ar-

tiﬁcial Intelligence, 39:89–99.

Sleeman, J., Finin, T., and Halem, M. (2018). Ontology-

grounded topic modeling for climate science research.

In Proceedings of Semantic Web for Social Good

Workshop, ISWC.

Speer, R., Chin, J., and Havasi, C. (2017). Conceptnet 5.5:

An open multilingual graph of general knowledge. In

AAAI, pages 4444–4451.

Sure, Y., Staab, S., and Studer, R. (2004). On-to-

knowledge methodology (otkm). In Handbook on

Ontologies, International Handbooks on Information

Systems. Springer, Berlin, Heidelberg.

Tang, Y., Baer, P., Zhao, G., and Meersman, R. (2009).

On constructing, grouping and using topical ontology

for semantic matching. In Proceedings of OTM 2009

Workshops (On the Move to Meaningful Internet Sys-

tems). Springer Berlin, Heidelberg.

Uschold, M. and Gruninger, M. (1996). Ontologies: princi-

ples, methods and applications. The Knowledge Engi-

neering Review, 11:93–136.

Zhao, G. and Meersman, R. (2005). Architecting ontol-

ogy for scalability and versatility. In On the Move

to Meaningful Internet Systems 2005: CoopIS, DOA,

and ODBASE. OTM 2005.

KEOD 2020 - 12th International Conference on Knowledge Engineering and Ontology Development

282