Smart Access to Historical Archives based on Rich Semantic
Metadata
Matteo Caserio, Anna Goy and Diego Magro
Dipartimento di Informatica, Università di Torino, C. Svizzera 185, Torino, Italy
Keywords: Digital Archives, User Interface, Semantic Metadata, Ontologies, Web Application.
Abstract: Documentary heritage about the social and political history of the 20th century could be an important support
for citizens' awareness, provided that it is not endangered by the lack of effective access and management
tools. In this paper we present work that is part of the Harlock'900 project and aims at showing how a rich
semantic representation, based on ontologies and Semantic Web standards, can enable innovative and
user-friendly access to resources stored in historical archives. In particular, we present a web application
enabling users to explore events, places and people mentioned in archival resources. The application relies
on a semantic layer including a computational ontology and an RDF triplestore, and provides a User
Interface that supports navigation through highly interconnected data, offering different possible
exploration paths. We also report the encouraging feedback obtained from a preliminary evaluation based on a
domain-expert walkthrough of the app.
1 INTRODUCTION
Documentary heritage about the social and political
history of the 20th century has a high potential for
creating and supporting the awareness of citizens
about the changes that occurred in society in the last
century. This awareness, in turn, is the key to a more
self-aware social and economic development of
countries, with positive effects on national and
transnational integration.
Highly relevant for scholars, social and political
scientists or historians, this heritage is valuable also
for the general audience, due to its capability of
narrating social changes through documents and
testimonies where places, people and events are
brought to life by pictures, newspaper articles,
audiovisual clips, interviews, etc.
However, this potential is endangered by the lack
of effective, user-friendly access and management
tools that can orient the audience in the extremely rich
and varied universe of documents and resources.
The main objective of the work presented in this
paper is to show how a rich semantic representation,
based on ontologies and Semantic Web standards,
can support metadata enrichment, which in turn
enables innovative and effective access to
resources from historical archives, offering users
new paths through the history of the 20th century.
Such innovative and effective access is
demonstrated by the User Interface of the web-based
application resulting from our approach, and it can
benefit a large and heterogeneous audience,
including humanities researchers, journalists, policy
makers, designers, movie/multimedia makers, and
simply interested people, as well as travel agencies
and tourist promotion organizations, educational
institutions, creative writing schools, trade unions
and other civil society actors.
One of the most innovative aspects of the
approach is that relationships between events,
people, organizations, places, and documents need
not be hard-coded in the system: they are dynamically
computed/discovered by automatically exploring
and querying the system's semantic knowledge about
the content of resources.
This work is part of the on-going Harlock'900
project (started in 2016) involving the University of
Torino (Computer Science Department) and the
Fondazione Istituto Piemontese A. Gramsci, a non-
profit institution for research on contemporary
history, that is part of the Polo del '900 cultural
initiative (www.polodel900.it). The project's main
goal is to provide user-friendly online
access to a set of documents from the institute's
archives, based on a rich semantic representation of
their content. The overall approach is summarized in
(Goy et al., 2015). The project is partially funded by
Compagnia di San Paolo and University of Torino,
within the PRiSMHA project (started in May 2017).
Caserio M., Goy A. and Magro D.
Smart Access to Historical Archives based on Rich Semantic Metadata.
DOI: 10.5220/0006487000930100
In Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KMIS 2017), pages 93-100
ISBN: 978-989-758-273-8
Copyright © 2017 by SCITEPRESS - Science and Technology Publications, Lda. All rights reserved
In the following we will summarize the relevant
related work (Section 2), then we will briefly present
the semantic layer, including the HERO ontology
and the RDF representation of resources content
(Section 3). In Section 4 we will describe the User
Interface enabling users to explore events, places
and people mentioned in archival resources, together
with some implementation details. We will conclude
the paper by reporting the result of a preliminary
evaluation (Section 5) and by sketching the future
directions (Section 6).
2 RELATED WORK
In recent years there has been a remarkable interest
in the application of Semantic Web technologies to
history research, as documented by the survey
in (Meroño-Peñuela et al., 2015), which
covers 21 scientific papers, 21 research projects
and 36 tools. Many results considered in this survey
show the effectiveness of semantic approaches
(including ontologies and Linked Data best
practices) in publishing and connecting historical
datasets and in enhancing search and retrieval. This
holds, in particular, also for archives and cultural
heritage resources, a domain where semantic
technologies and the LOD (Linked Open Data)
principles are receiving more and more attention
(Oomen and Belice, 2012) and have already proved
their maturity.
A lot of projects could be mentioned, but a full
survey is out of the scope of this paper. Europeana
(www.europeana.eu), the European Union digital
library providing access to digitized cultural heritage
content from hundreds of European galleries, libraries and
archives, is one of the most prominent examples of
the use of Semantic Web technologies in the cultural
heritage domain. A considerable portion of
Europeana metadata is already available on the
LOD cloud; see, for instance, the Europeana LOD
Pilot (Haslhofer and Isaac, 2011), based on EDM, the
Europeana Data Model (Isaac, 2013), which is the metadata
model adopted within Europeana. Another important
semantic model in the cultural heritage domain is the
CIDOC Conceptual Reference Model
(www.cidoc-crm.org), which is an ISO standard and has been
successfully adopted in several EU-funded projects,
like PAPYRUS (www.ict-papyrus.eu), an EU
project aimed at enabling users to query digital
libraries in order to discover cross-domain
relationships between concepts.
In the historical domain, the notion of event is a
key concept; as a consequence, as described in
Section 3, it plays a major role in our system, in line
with other projects with a similar focus on recent
history; see, for example, Agora (van den Akker et
al., 2010) and DIVE (de Boer et al., 2015), a project
that builds on the results of Agora and aims at
supporting researchers, professional users, and the
general public in event-centric browsing of cultural
heritage objects from multiple heterogeneous
collections.
EDM, CIDOC-CRM, and other models, e.g. the
Event Ontology (purl.org/NET/c4dm/event.owl),
LODE (Shaw et al., 2009) and SEM (van Hage et al., 2011),
provide a basic notion of event and enable the
representation of "who does what when and where",
but fail to offer tools for a richer semantic
representation, which is the main goal of our project
(see Section 3).
Partially within the Europeana ecosystem, some
projects can provide valuable hints for the promotion
of innovative and creative usage of digital (data on)
cultural resources; see, for instance: Europeana
Space (www.europeana-space.eu); Europeana
Creative (www.europeanacreative.eu); AXES
(www.axes-project.eu), which provides advanced
multimodal search and access to audiovisual digital
resources (search for spoken words in audio, for
images within video sequences, for images depicting
specific kinds of objects or similar to other images);
the recently started I-Media-Cities (imediacities.eu);
and HOPE (www.peoplesheritage.eu), which aims at
providing an access point to digital collections (in
particular, smaller, independent ones, usually not
maintained by official state archives) relevant to
social history and to the history of the labor
movement.
An important aspect of the application presented
in this paper is the experience supported by the User
Interface. As demonstrated by projects like the Atlas
of Nazi and Fascist massacres
(www.straginazifasciste.it/?lang=en), or Memorie di
Guerra (War Memories: www.memoriediguerra.it),
a key aspect in projects aimed at supporting an
enhanced access to historical documents is the
availability of advanced and user-friendly interfaces,
supporting multi-dimensional access to archival
resources (Boschetti et al., 2014). In the project
described by Boschetti and colleagues, advanced
search functionalities are offered to explore World
Wars textual documents. The same focus on
advanced search tools based on content
automatically extracted from historical documents
characterizes the ALCIDE project (Moretti et al.,
2016). The platform User Interface enables users to
select the time span to search for and display
keywords and entities mentioned in the retrieved
documents. In the work about the Betrothed Lovers
book, described in (Bolioli et al., 2013), interactive
graphs are used to display the social network of the
story characters, while in the RAMBLE ON project
(Menini et al., 2017) the focus is on the movements
of famous historical figures and thus the User
Interface is based on interactive maps.
In other works, narrative formats based on an
underlying semantic representation of the contents
of a cultural heritage collection have been
exploited to present meaningful patterns. For
example, in the Storyspace (Wolff et al., 2012) and
Storyscope (Mulholland et al., 2015) systems the
User Interface displays an underlying narrative, i.e.
a composition of events and related objects, possibly
tailored to different audiences, while the Labyrinth
project (Lieto and Damiano, 2013) offers an
application for exploring digital media repositories
with the guidance of a set of cultural archetypes,
enabling user interaction through maps, timelines
and 3D navigation.
3 SEMANTIC LAYER
We think that one of the main reasons why
"European digital content [...] used to be
inaccessible, buried among huge amounts of data
and not sufficiently tagged with adequate metadata"
(EU Programmes 2014-2020, call H2020-CULT-
COOP-09-2017: ec.europa.eu/research/participants/
portal/desktop/en/opportunities/h2020/topics/cult-
coop-09-2017.html) is the very low expressive
power of currently available semantic
representations. In fact, although simple semantic
models have the clear advantage of facilitating
processing, interoperability and sharing, they often
fail to provide actually useful data. In other words,
some semantic complexity is needed, in order to
express interesting information (characterizations
and relations).
To this purpose, we designed and implemented a
semantic layer that takes archive metadata as input
and enriches them with a semantic representation of
the resources' content, linked to an ontology that
represents a shared conceptualization.
Figure 1 depicts the overall architecture
underlying our application. The Metadata
Management Platform, relying on basic metadata
from archive catalogs, is endowed with a layer
hosting the semantic model (the ontology and the
semantic knowledge base, described in the
following).
Figure 1: Overall architecture of the application.
The semantic layer can be seen as a "semantic
lens" on archive resources, enabling users to "see"
their content. In particular, the semantic
representation should enable the system to discover
relations between events, people, places,
organizations, and archival resources themselves.
For instance, the system could identify relevant
relationships between E. Valabrega, his niece M.
Diena, the city of Torino, the Valabrega company,
the historical period called 'Resistenza', the
deportation of Jews, linking them to the pictures,
texts, letters and historical documents talking about
them.
The semantic model is represented by a
computational ontology of historical events (HERO:
Historical Event Representation Ontology), which
relies on the well-known and cognitively grounded
foundational ontology DOLCE (Masolo et al.,
2003). An exhaustive description of the HERO
ontology and its dependencies on existing models
is out of the scope of this paper. In the following,
we will sketch its main structure, in order to provide
the reader with an overall picture of the kind of
knowledge available in the semantic layer.
HERO is written in OWL and is composed of
different modules; the most important are the
following:
HERO-TOP: it includes the top layer, i.e. the
most general concepts, directly linked to
DOLCE classes (e.g. the hero:Object class,
defined as a subclass of dolce:Endurant; the
hero:Perdurant class, defined as a subclass of
dolce:Perdurant).
HERO-EVENT: it includes classes and
properties related to the representation of
events; for example, it includes a class
hierarchy, a small part of which is depicted in
Figure 2. Moreover, it contains the relations:
hero:hasParticipant, useful to connect events
to their participants; hero:hasLocation, to
connect events to the places where they occurred;
hero:hasTimeSpan, to connect events to the
time intervals within which they took place.
HERO-PLACE: it includes classes and
properties modeling geographic features, i.e.,
entities that can be georeferenced on a map
(e.g., cities, rivers, countries, but also
buildings and streets).
HERO-TIME: it includes classes and
properties modeling time intervals, following
Allen's Interval Algebra (Allen, 1983).
HERO-ROCS: it includes classes and
properties for the representation of
"containers", i.e. entities that contain other
entities as their members; in particular, it
defines the semantics of sets and collections,
taking into account the analysis in (Bottazzi et
al., 2006). This module also models the notion
of organization, partially based on the
discussion in (Bottazzi et al., 2009).
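To give a concrete idea of the temporal reasoning that an interval-based model such as HERO-TIME makes possible, the following minimal sketch computes which of the thirteen relations of Allen's Interval Algebra holds between two intervals. It is not taken from HERO: the Interval type and the allen_relation function are hypothetical names introduced only for illustration, and intervals are assumed to be proper (the start strictly precedes the end).

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Interval:
    """A proper time interval (hypothetical helper type, not part of HERO)."""
    start: date
    end: date

    def __post_init__(self):
        if not self.start < self.end:
            raise ValueError("an interval must start strictly before it ends")


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the single Allen relation that holds from interval a to interval b."""
    # Disjoint or touching intervals.
    if a.end < b.start:
        return "before"
    if a.end == b.start:
        return "meets"
    if b.end < a.start:
        return "after"
    if b.end == a.start:
        return "met-by"
    # From here on the two intervals share more than one point.
    if a.start == b.start and a.end == b.end:
        return "equals"
    if a.start == b.start:
        return "starts" if a.end < b.end else "started-by"
    if a.end == b.end:
        return "finishes" if a.start > b.start else "finished-by"
    if b.start < a.start and a.end < b.end:
        return "during"
    if a.start < b.start and b.end < a.end:
        return "contains"
    # Only partial overlap is left.
    return "overlaps" if a.start < b.start else "overlapped-by"
```

With such a function, an application could, for instance, check whether the time span of one event falls during the time span of another before presenting the two as related.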
Figure 2: A fragment of the HERO taxonomy of classes
representing event types.
On the basis of HERO, the content of archival
resources can be formally represented in a machine-
readable format. In particular, within the
Harlock'900 project, we annotated 200 text
fragments, extracted from biographies and
testimonies talking about events that occurred in
Piemonte (North-West of Italy) in the period 1943-
1945, and often related to the "Resistenza" (the local
partisans' struggle against the Fascist regime and the
Nazi occupation). In each text fragment, a small
team of annotators identified the mentioned events
and, for each event, they identified a type (as
defined by classes in HERO-EVENT), as well as,
when available, the time period, the place where it
occurred and its participants.
The result of the annotation process is a set of
RDF triples, stored in a triplestore (see below for
implementation details). A small text fragment,
together with its partial semantic representation, is
shown in Figure 3.
Figure 3: A text fragment with its (partial) semantic
representation.
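As a rough illustration of how such annotations can be held and queried as triples, the sketch below stores a few subject-predicate-object statements in plain Python and retrieves them by pattern matching, with None acting as a wildcard. The property names hero:hasParticipant, hero:hasLocation and hero:hasTimeSpan are those of HERO-EVENT; everything else (the ex: identifiers, the ex:SomeEventType class, the helper functions) is a placeholder introduced for the example, and an actual deployment would rely on an RDF library and on the triplestore rather than on a Python list.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Illustrative statements only: all ex: identifiers are placeholders.
TRIPLES: List[Triple] = [
    ("ex:event1", "rdf:type", "ex:SomeEventType"),
    ("ex:event1", "hero:hasParticipant", "ex:personA"),
    ("ex:event1", "hero:hasParticipant", "ex:personB"),
    ("ex:event1", "hero:hasLocation", "ex:placeX"),
    ("ex:event1", "hero:hasTimeSpan", "ex:interval1"),
    ("ex:event2", "hero:hasParticipant", "ex:personA"),
    ("ex:event2", "hero:hasLocation", "ex:placeY"),
]


def match(triples: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield the triples that fit the pattern; None matches anything."""
    for triple in triples:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


def events_at(triples: Iterable[Triple], place: str) -> List[str]:
    """Events whose hero:hasLocation is the given place."""
    return sorted(s for s, _, _ in match(triples, p="hero:hasLocation", o=place))


def participants_of(triples: Iterable[Triple], event: str) -> List[str]:
    """Participants attached to an event through hero:hasParticipant."""
    return sorted(o for _, _, o in match(triples, s=event, p="hero:hasParticipant"))
```

Chaining the two helpers (events at a place, then the participants of each event) mirrors, in miniature, the way relations can be discovered by querying rather than being hard-coded.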
4 WEB APPLICATION AND
USER INTERFACE
The data in the semantic layer can be accessed in
two ways, namely through:
RESTful APIs, which enable third-party
applications to use them.
A User Interface, which allows users to access
archival resources by navigating their content.
In the following, we will describe the web
application offering the User Interface (UI); the full
definition and implementation of the RESTful APIs to
access data in the semantic layer is a work in
progress.
The major role of the UI is to support
navigation through highly interconnected data, and
therefore it offers many different possible
exploration paths: for example, a user may be
interested in retrieving all events that occurred in a given
place, then discovering the people involved in them,
and thus finding further interesting places where other
events took place, and so on. Figure 4 shows the
web app home page.
Figure 4: Home page.
On the left-hand side, users can select an area
(provincia) within the Piemonte region by clicking
on it: the app provides an interactive map, populated
with all places in that area where events occurred
(see Figure 5).
Figure 5: The map showing places in Provincia di Cuneo
(left-hand side) and access to events that occurred in Barge
(right-hand side).
By clicking on the pop-up referring to a single
place (the village of Barge in the right-hand side of
Figure 5), the events that occurred there can be seen
and explored. The page listing the events that occurred in
a specific place is analogous to the page listing all
events in the knowledge base, accessible from the
home page (Figure 4, first button on the right-hand
side), and depicted in Figure 6. From the right-hand
side of the home page, the user can also search for a
specific person, group or organization, select a
single place, or access text fragments (see Figure 4).
Figure 6: Events (list).
On the right-hand side of the page listing events
(Figure 6) there are filters: depending on the
activated ones, the listed events can refer to a
specific (set of) place(s) or participants, or time
(date). Moreover, the default visualization is a list,
which can be ordered by combining different criteria
(in Figure 6, for example, it is ordered by
participants, and further orderings can be added,
e.g., by time), but the user can switch to a table
view, providing more details (Figure 7, upper-side)
or to a timeline view (Figure 7, lower-side).
Figure 7: Events: table view with details (upper-side) and
timeline view (lower-side).
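The combination of filters and multi-criteria ordering described above can be sketched as follows. This is only an illustration under simplifying assumptions, not the code of the web app: the Event record, its fields (with time reduced to a year) and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Event:
    """Hypothetical flat record for one event shown in the list view."""
    label: str
    place: str
    participants: Tuple[str, ...]
    year: int


def filter_events(events: Iterable[Event],
                  place: Optional[str] = None,
                  participant: Optional[str] = None,
                  year: Optional[int] = None) -> List[Event]:
    """Keep the events satisfying every activated filter (None = inactive)."""
    kept = []
    for ev in events:
        if place is not None and ev.place != place:
            continue
        if participant is not None and participant not in ev.participants:
            continue
        if year is not None and ev.year != year:
            continue
        kept.append(ev)
    return kept


# One key function per ordering criterion offered to the user.
SORT_KEYS = {
    "participants": lambda ev: ev.participants,
    "place": lambda ev: ev.place,
    "time": lambda ev: ev.year,
}


def order_events(events: Iterable[Event], criteria: Sequence[str]) -> List[Event]:
    """Order events by several criteria; earlier criteria take precedence."""
    return sorted(events, key=lambda ev: tuple(SORT_KEYS[c](ev) for c in criteria))


# Placeholder data, not taken from the knowledge base.
EVENTS = [
    Event("event 1", "place X", ("person B",), 1944),
    Event("event 2", "place Y", ("person A",), 1944),
    Event("event 3", "place X", ("person A",), 1943),
]
```

For example, order_events(EVENTS, ["participants", "time"]) orders the list by participants first and breaks ties by time, which corresponds to adding a second ordering criterion in the list view.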
By clicking on a single entity (person, group,
organization, place, event), anywhere in the User
Interface, its profile with detailed information is
provided (Figure 8). For instance, if a person is
selected, the corresponding profile shows: the events
that person was involved in; other people
linked to that person (because they participated in
the same event); an interactive map with markers on the
locations where the events took place; the text
fragments where that person is mentioned.
Moreover, on the right-hand side, references to
original archival resources related to that person
(e.g., original documents, letters, pictures) are
shown.
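The content of a person profile can be derived entirely from the events stored in the knowledge base. The following minimal sketch shows one way of doing so, assuming a simplified in-memory view in which each event carries a set of participants and a place; the data structure, the ex: identifiers and the build_profile function are hypothetical and serve only to illustrate the idea of linking people through shared events.

```python
from typing import Dict, List

# Placeholder data: event identifier -> participants and place.
EVENTS: Dict[str, dict] = {
    "ex:event1": {"participants": {"ex:personA", "ex:personB"}, "place": "ex:placeX"},
    "ex:event2": {"participants": {"ex:personA", "ex:personC"}, "place": "ex:placeY"},
    "ex:event3": {"participants": {"ex:personC"}, "place": "ex:placeX"},
}


def build_profile(person: str, events: Dict[str, dict]) -> Dict[str, List[str]]:
    """Collect the events, linked people and places for one person."""
    # Events in which the person took part.
    involved = sorted(e for e, data in events.items()
                      if person in data["participants"])
    # People are linked when they participated in the same event.
    linked = set()
    for e in involved:
        linked |= events[e]["participants"]
    linked.discard(person)
    # Places to be shown as markers on the map.
    places = sorted({events[e]["place"] for e in involved})
    return {"events": involved, "linked_people": sorted(linked), "places": places}
```

No link between two people is stored explicitly here: it emerges from the fact that both appear as participants of the same event.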
The web app has been implemented using
standard web languages and technologies (such as
Java, JSTL, JavaScript, jQuery, HTML5, and CSS3);
it exploits several libraries, namely: Leaflet
(leafletjs.com), Exhibit (www.simile-widgets.org/
exhibit), D3 (Data Driven Documents: d3js.org),
TopoJSON (github.com/topojson/topojson), and
Bootstrap (getbootstrap.com).
The triplestore for metadata is implemented with
Apache Jena Fuseki (http://jena.apache.org/
documentation/fuseki2/index.html).
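Since the triplestore is exposed through a SPARQL endpoint, a client can retrieve, for instance, the events located in a given place with a query over the HERO properties. The sketch below only builds such a query and the corresponding HTTP GET request, without sending it. The namespace IRI, the endpoint URL and the dataset name are placeholders, since the actual values used in the project are not stated here; the function names are hypothetical as well.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholders: the real HERO namespace and endpoint are not given in the paper.
HERO_NS = "http://example.org/hero#"
ENDPOINT = "http://localhost:3030/dataset/sparql"


def events_at_place_query(place_iri: str) -> str:
    """Build a SPARQL query for the events located in a place."""
    # Reject characters that are not allowed inside an IRI reference.
    if any(ch in place_iri for ch in '<>"{}|^`\\ '):
        raise ValueError("not a valid IRI for a SPARQL query")
    return (
        f"PREFIX hero: <{HERO_NS}>\n"
        "SELECT ?event ?participant WHERE {\n"
        f"  ?event hero:hasLocation <{place_iri}> .\n"
        "  OPTIONAL { ?event hero:hasParticipant ?participant . }\n"
        "}"
    )


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Wrap the query in a SPARQL-protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})
```

Passing the resulting request to urllib.request.urlopen would execute the query, provided that a server is actually running at the assumed address.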
Figure 8: Profile of Franco Diena.
5 EVALUATION OF THE
APPROACH
The UI described in Section 4 represents an
important proof-of-concept for our project, since it
demonstrates that a rich semantic representation of
resource content enables users to navigate a very
rich graph of relations between events, people,
places, and resources themselves. However, being a
proof-of-concept, it is currently based on a relatively
small set of data, which is not enough for an evaluation
with real users.
We are working in two directions, in order to set
up tools for feeding large amounts of new data
describing resources content to the knowledge base
(RDF triplestore). In particular, we are investigating:
(a) strategies for automatic event extraction from
texts (Rovera, 2016), and (b) crowdsourcing systems
(within the already mentioned PRiSMHA project).
With a larger knowledge base, covering a significant
number of archival resources and historical events,
we will be able to test the web application in
real-world scenarios, with end users. For the moment, in
order to assess the current User Interface, we
performed a preliminary evaluation by asking two
domain experts (historians, in our case) to perform a
walkthrough, acting as users of the web application.
Instead of focusing on usability issues (as in
standard cognitive walkthrough evaluations (Polson
et al., 1992; Gena and Weibelzahl, 2007)), we asked
experts to assess the potential usefulness of our app.
To this purpose, we provided them with the
following usage scenario:
You are a researcher in history of the 20th
Century, and you have just been assigned the
task of creating a multimedia narrative to be
linked to a "pietra d'inciampo" (small brass
plates fixed on sidewalk tiles; each plate shows
the name of a victim of the Nazi-Fascist
deportation and it is placed close to the victim's
home), namely the one mentioning the Valabrega
family. In order to build the narrative, you have
to discover the story of the most important
members of the Valabrega family, the (historical)
events they took part in, the relevant places where they
lived. Moreover, we suggest including, within
the narrative, references to archival resources
available in Torino, such as pictures, videos,
manuscripts, books, etc.
We encouraged the experts, playing the role of
users of our prototype, to express free comments
about their experience, and recorded them.
We obtained some specific suggestions about
possible improvements of the UI (e.g., they asked
for a "pictorial" representation of the relations
among people, organizations and groups), but the
overall feedback was definitely positive: in summary,
they said that, compared with the traditional access
tools for online archives they are used to, the
application provides a much richer, more flexible,
and more interesting way to discover: (a)
relationships between people, events and places; (b)
relevant historical resources, otherwise hidden in the
archive shelves.
6 CONCLUSIONS
In this paper we presented a web-based prototype
enabling users to navigate through historical events,
people and places linked to archival resources. The
richness of the navigation experience is based on the
semantic representation of the resources' content,
grounded in the HERO ontology and stored in an
RDF triplestore.
As already mentioned (Section 5), we are
working towards two enhancements supporting input
to the semantic knowledge base, i.e., automatic
event extraction from texts (Rovera, 2016), and
crowdsourcing approaches. Moreover, as mentioned
in Section 4, we are implementing a set of RESTful
APIs that will enable third-party applications (e.g., in
the educational and touristic fields) to reuse our
semantically enriched metadata.
A further enhancement of the semantic
knowledge base is the possibility of linking our
metadata to open datasets such as GeoNames
(www.geonames.org) and DBpedia
(wiki.dbpedia.org). However, in this respect, it is
worth noting that, given the very specific domain
and the relatively fine-grained representation of
events, a lot of entities in our triplestore (events
themselves, but also places and people) are not
present, even in huge datasets like GeoNames and
DBpedia; so, the linking is interesting but limited.
Finally, the UI itself could be enhanced by
exploiting more advanced Information
Visualization tools and techniques. In particular, we
are exploring the possibility of using interactive
graphs to graphically show the relations among
people, providing users with a picture of the social
network of the players.
REFERENCES
van den Akker, C., Aroyo, L., Cybulska, A., van Erp, M.,
Gorgels, P., Hollink, L., Jager, C., Legêne, S., van der
Meij, L., Oomen, J., van Ossenbruggen, J., Schreiber,
G., Segers, R., Vossen, P., Wielinga, B., 2010.
Historical Event-based Access to Museum
Collections. Applied Artificial Intelligence, 25.
Allen, J.F., 1983. Maintaining Knowledge about Temporal
Intervals. Communications of the ACM, 26(11), 832-
843.
Bolioli, A., Casu, M., Lana, M., Roda, R., 2013. Exploring
the Betrothed Lovers. In M. Finlayson, B. Fisseni, B.
Lowe, J. C. Meister (Eds.), Workshop on
Computational Models of Narrative, vol. 32, OASIcs.
Boschetti, F., Cimino, A., Dell'Orletta, F., Lebani, G. E.,
Passaro, L., Picchi, P., Venturi, G., Montemagni, S.,
Lenci, A., 2014. Computational Analysis of Historical
Documents: An Application to Italian War Bulletins in
World War I and II. In LREC 2014 Workshop on
Language resources and technologies for processing
and linking historical documents and archives –
Deploying Linked Open Data in Cultural Heritage.
Bottazzi, E., Catenacci, C., Gangemi, A., Castelfranchi,
C., 2006. From Collective Intentionality to Intentional
Collectives: An Ontological Perspective. Cognitive
Systems Research, 7(2-3), 192-208.
Bottazzi, E., Ferrario, R., 2009. Preliminaries to a DOLCE
Ontology of Organisations. Int. J. Business Process
Integration and Management, 4(4), 225-238.
de Boer, V., Oomen, J., Inel, O., Aroyo, L., van Staveren,
E., Helmich, W., de Beurs, D., 2015. DIVE into the
Event-Based Browsing of Linked Historical Media.
Journal of Web Semantics, 35(3), 152-158.
Isaac A. (Ed.), 2013. Europeana Data Model Primer.
http://pro.europeana.eu/files/Europeana_Professional/
Share_your_data/Technical_requirements/EDM_Docu
mentation/EDM_Primer_130714.pdf.
Gena, C., Weibelzahl, S., 2007. Usability engineering for
the adaptive web. In P. Brusilovsky, A. Kobsa, W.
Nejdl (Eds.), The Adaptive Web. Methods and
Strategies of Web Personalization, Springer, Berlin,
720-762.
Goy, A., Magro, D., Rovera, M., 2015. Ontologies and
historical archives: A way to tell new stories. Applied
Ontology, 10(3-4), 331-338.
van Hage, W.R., Malaisé, V., Segers, R., Hollink, L.,
Schreiber, G., 2011. Design and use of the Simple
Event Model (SEM). Journal of Web Semantics, 9(2),
128-136.
Haslhofer, B., Isaac, A., 2011. data.europeana.eu: The
Europeana Linked Open Data Pilot. In International
Conference on Dublin Core and Metadata
Applications.
Lieto, A., Damiano, R., 2013. Building Narrative
Connections among Media Objects in Cultural
Heritage Repositories. In International Conference on
Interactive Storytelling, 257-260.
Masolo, C., Borgo, S., Gangemi, A., Guarino, N.,
Oltramari, A., 2003. WonderWeb Deliverable D18.
Technical Report, CNR.
Menini, S., Sprugnoli, R., Moretti, G., Bignotti, E.,
Tonelli, S., Lepri, B., 2017. RAMBLE ON: Tracing
Movements of Popular Historical Figures. Conference
of the European Chapter of the Association for
Computational Linguistics, to appear.
Meroño-Peñuela, A., Ashkpour, A., van Erp, M.,
Mandemakers, K., Breure, L., Scharnhorst, A.,
Schlobach, S., van Harmelen, F., 2015. Semantic
Technologies for Historical Research: A Survey.
Semantic Web Journal, 6(6), 539-564.
Moretti, G., Sprugnoli, R., Menini, S., Tonelli, S., 2016.
ALCIDE: Extracting and visualising content from
large document collections to support humanities
studies. Knowledge-Based Systems, 111, 100-112.
Mulholland, P., Wolff, A., Kilfeather, E., 2015. Storyscope:
Supporting the authoring and reading of museum
stories using online data sources. In WebSci 2015,
ACM Press.
Oomen, J., Belice, L., 2012. Sharing cultural heritage the
linked open data way: why you should sign up. In
Museums and the Web Conference.
Polson, P.G., Lewis, C., Rieman, J., Wharton, C., 1992.
Cognitive Walkthroughs: A Method for Theory-Based
Evaluation of User Interfaces. International
Journal of Man-Machine Studies, 36, 741-773.
Rovera, M., 2016. A Knowledge-Based Framework for
Events Representation and Reuse from Historical
Archives. In H. Sack, E. Blomqvist, M. d'Aquin, C.
Ghidini, S. P. Ponzetto, C. Lange (Eds.), The Semantic
Web, Proc. ESWC 2016. LNCS 9678, Springer,
Heidelberg, 845-852.
Shaw, R., Troncy, R., Hardman, L., 2009. LODE: Linking
Open Descriptions of Events. 4th Asian Conference on
The Semantic Web, 153-167.
Wolff, A., Mulholland, P., Collins, T., 2012. Storyspace: a
story-driven approach for creating museum narratives. In
23rd ACM conference on Hypertext and social media,
ACM Press, 89-98.