Machine Learning to Geographically Enrich Understudied Sources: A Conceptual Approach

Lorella Viola¹ (https://orcid.org/0000-0001-9994-0841) and Jaap Verheul² (https://orcid.org/0000-0001-6978-7698)

¹Luxembourg Centre for Contemporary and Digital History (C²DH), University of Luxembourg, Belval Campus, Maison des Sciences Humaines, 11, Porte des Sciences, L-4366 Esch-sur-Alzette, Luxembourg
²Department of History and Art History, Utrecht University, Drift 6, 3512 BS, Utrecht, The Netherlands

Keywords: Machine Learning, Sequence Tagging, Spatial Humanities, Geographical Enrichment, Immigrant Newspapers.
Abstract: This paper discusses the added value of applying machine learning (ML) to contextually enrich digital collections. In this study, we employed ML as a method to geographically enrich historical datasets. Specifically, we used a sequence tagging tool (Riedl and Padó, 2018) which implements TensorFlow to perform named entity recognition (NER) on a corpus of historical immigrant newspapers. Afterwards, the entities were extracted and geocoded. The aim was to prepare large quantities of unstructured data for a conceptual historical analysis of geographical references. The intention was to develop a method that would assist researchers working in spatial humanities, a recently emerged interdisciplinary field focused on geographic and conceptual space. Here we describe the ML methodology and the geocoding phase of the project, focussing on the advantages and challenges of this approach, particularly for humanities scholars. We also argue that, by choosing to use largely neglected sources such as immigrant newspapers (also known as ethnic newspapers), this study contributes to the debate about diversity representation and archival biases in digital practices.
1 INTRODUCTION
Advances in machine learning (ML) are allowing researchers both in computer science and the humanities to develop new tools and methods for exploring digital collections. At the same time, libraries are resorting more and more to ML methods to maximise the potential of their collections, to improve the user experience, and to discover the technical requirements necessary to facilitate the discovery and use of digital collections. One effective application of ML is enriching digital data with geographical information. Because ML uses contextual information to disambiguate entities, the method goes beyond the state of the art of place name extraction and disambiguation based on gazetteers or ensembles of NER tools (Canale, Lisena, & Troncy, 2018; Won, Murrieta-Flores, & Martins, 2018; Coll Ardanuy & Sporleder, 2017; Coll Ardanuy, 2017).
This article discusses the added value as well as the challenges of using an ML method to subsequently perform conceptual Named Entity Recognition (cNER), a way to identify subjective and situational geographical markers and connect them to explicit geo-references to space. In doing so, cNER aligns with what has been referred to as the 'spatial turn', a revision of past approaches to the study of space and place (Murrieta-Flores & Martins, 2019) which acknowledges that place and space are distinct entities, as place is created through social experiences and can be both real and imagined.
As its evidence, the study uses a corpus of Italian American historical newspapers (ChroniclItaly, Viola, 2018) published between 1898 and 1920. We argue that cNER opens up a way to explore the cultural context of geographical place and that, by choosing to use a largely neglected data source such as immigrant newspapers, this study as a whole contributes to the debate about language diversity representation and archival biases in digital practices.
2 CONTEXT OF THE STUDY
The study stands at the crossroads of migration studies, linguistics, history, and digital humanities. It developed within the context of a larger project, the GeoNewsMiner (GNM; Viola et al., 2019), which aims to create a visualisation tool for geo-references. The tool was intended to assist researchers working in spatial humanities, a recently emerged interdisciplinary field focussing on geographic and conceptual space, especially from a historical perspective (Bodenhamer, Corrigan, & Harris, 2010, 2015a). Essentially, spatial humanities are based on Geographic Information Systems (GIS). GIS are used to geo-reference a dataset, map it, display it, and analyse it by cross-referencing different layers of information. This process of cross-referencing has provided researchers working in fields such as environmental history, historical demography, and economic, urban and medieval history with new perspectives, leading them to challenge traditional positions and even explore new questions.
One of the conceptual challenges that digital humanities is engaging with as it evolves out of GIS mapping is to identify the layers of meaning humans have historically attached to geographical spaces. Only when humans invest meaning in such landscapes, locales and dwellings is objective space turned into a “place” that they can be emotionally attached to, and that can be part of collective narratives of pride, longing or loss. Whereas geographical space is “a realm without meaning […] a ‘fact of life’, which, like time, produces the basic coordinates for human life,” the concept of place is space in which humans have invested meaning (Cresswell, 2010, p. 10; Gregory, 2014; Withers, 2009). One of the central aims of spatial humanities has become to connect these cultural constructs to a Euclidean grid of space, for instance by performing “deep mapping” (Bodenhamer, Corrigan, & Harris, 2015b, pp. 26–28, passim).
Importantly, the subjective attachment to place is expressed in the way such places are given proper names. Whereas GIS codes represent digital definitions of geographical space, toponyms or place names are often much more ambiguous cultural markers that represent attachment, fear, longing or other sentiments. Spatial humanities has begun to connect these subjective connotations to the toponyms that can be extracted as named entities from digitized texts, such as travel narratives, novels and newspapers (Donaldson, Gregory, & Taylor, 2017; Tally, 2011; Taylor, Donaldson, Gregory, & Butler, 2018).
Although much work has been done in the field of geographic text analysis (GTA), one of the next challenges within spatial humanities that this article addresses is how such geographical markers change over time as a result of human movement and migration (White, 2010, pp. 17, 18, passim). Here we attempt to tackle questions concerning the way places are mentioned by migrants, individuals who are displaced and have to make sense of their lives across competing cultural values and traditions. The challenge is to trace how toponyms serve as diasporic idiotopes that migrants use to negotiate the attachment to their lost homeland and the new host society. We argue that spatial humanities can help to extract such indicators of a sense of place in large heritage collections and map the “persistence of place in a hectic world” (Pascual-de-Sans, 2004).
Drawing from the application of ML, GIS, data mining, and visualisation, the paper discusses how performing cNER on ethnic newspapers can provide researchers with new perspectives on the geographies of the past, so as to open up new ways to study the socio-cognitive dimension of migration history.
3 THE TASK OF NER
In order to conceptualise the historicization of place name disambiguation, a method is needed to identify toponymic identifiers (such as place names, borders, and nations) in big data collections within their proper historical context. Current Named Entity Recognizers (such as the much-used Stanford Named Entity Tagger) assume rigid designators and are historically static, as they are often based on contemporary word lists. This means that they cannot interpret historical changes in place names (e.g., St. Petersburg – Petrograd – Leningrad) within their proper historical context and cannot deal with culturally ambiguous geographical references (e.g., America, United States, the New World, Washington) or homonymic identifiers (e.g., Limerick – limerick) (Marrero et al., 2013; Neudecker, 2014).
The most common way to overcome these limitations is to train classifiers, employing semi-supervised ML, sometimes using crowdsourcing, or to use hand-crafted grammar-based systems that require intensive supervision by computational linguists. The required annotation labour prohibits application to the big datasets of historical periodicals and other serial texts that are currently available (Ju et al., 2016).
An alternative route is to disambiguate entities on
the basis of contextual information sources such as
Wikipedia (Coll Ardanuy, 2017; Coll Ardanuy &
Sporleder, 2017; Zhang & Iria, 2009) or to combine
place name taggers with a scoring system within an
ensemble (Canale, Lisena, & Troncy, 2018; Won, Murrieta-Flores, & Martins, 2018). As recent surveys confirm, however, ML algorithms based on neural networks now outperform all methods that are based on gazetteers or static databases (Yadav & Bethard, 2019). These neural methods present two major advantages for text enrichment: 1) they may be based on the historical context of a historical corpus (McDonough, Moncla, & van de Camp, 2019); 2) they are able to recognize toponyms dynamically, as geographical concepts (Eijnatten, 2019). We propose the term conceptual NER (cNER) for this level of analysis, which enriches place name analysis.
4 METHODOLOGY AND DATASET
To establish and maintain internal cohesion whilst distinguishing themselves from others, diasporic groups adopt a collective concept of identity. Such a shared identity is constructed through a variety of markers which may be religious, linguistic, performative or of another nature, for example geographical. These diasporic identity markers are conveyed through language to express both the bond with a ‘remembered homeland’ and the connection with the host society. cNER applies a pragmatic perspective to the task of NER by assessing the socio-linguistic information and historical context in which places are mentioned and discussed. This is done by combining the visualisation of place name references with linguistic, social, and historical data, both automatically (for instance by means of sentiment analysis) and non-automatically, through the qualitative analysis of sample excerpts. The overarching aim is to visualise the narratives behind distant and local places and obtain deeper insights into how such links were maintained and renegotiated over time for contemporary purposes and future generations. Here we focus on the ML (Section 4.1) and geocoding (Section 4.2) parts of the study.
4.1 Machine Learning
The collection was first tagged for entities using an advanced ML sequence tagging tool that implements TensorFlow (Riedl and Padó, 2018). The novelty of the tool lies in the fact that it combines a BiLSTM with a CRF and character embeddings. The two methods
were tested separately against four datasets to compare both their individual performance and their combination. Riedl and Padó found that combining a BiLSTM with a CRF as the top layer consistently outperforms CRFs with hand-coded features when enough data is available. They concluded that modern RNNs have much to recommend them to researchers working in NER, as they consistently yield the best performance.
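As an illustration only (not the authors' implementation), the following minimal TensorFlow sketch shows the BiLSTM half of such a tagger; the CRF output layer used by Riedl and Padó is approximated here by a per-token softmax, and all dimensions are toy values:

```python
import tensorflow as tf

VOCAB_SIZE, EMB_DIM, HIDDEN, NUM_TAGS = 50_000, 300, 100, 9  # toy values

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),               # token ids, variable length
    tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(HIDDEN, return_sequences=True)),   # BiLSTM encoder
    tf.keras.layers.Dense(NUM_TAGS, activation="softmax"),      # per-token tag scores
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```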
Methodologically, they trained the character embeddings together with pre-trained word embeddings while training the model itself. They also used character- and subword-based word embeddings computed with FastText (Bojanowski et al., 2017) which, by retrieving embeddings for unknown words through the incorporation of subword information, were found to significantly alleviate issues with out-of-vocabulary words.
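The handling of out-of-vocabulary words can be illustrated with a short sketch (our own illustration, not the authors' code), assuming the fasttext Python package and the pre-trained Italian vectors (cc.it.300.bin) are available; note that the model download is several gigabytes:

```python
import fasttext
import fasttext.util

# Download and load the pre-trained Italian vectors (300 dimensions).
fasttext.util.download_model('it', if_exists='ignore')
model = fasttext.load_model('cc.it.300.bin')

# Subword (character n-gram) information yields a vector even for tokens the
# model has never seen, e.g. a hypothetical OCR-damaged spelling of "Italia".
vec_known = model.get_word_vector('Italia')
vec_oov = model.get_word_vector('Itaila')
print(vec_known.shape, vec_oov.shape)  # (300,) (300,)
```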
4.2 Geocoding
Once the tagging task was completed, locations were geocoded using the Google Geocoding API (https://developers.google.com/maps/documentation/geocoding/start). Geocoding with Google is a two-stage process. First, the Geocoding API provides users with a Place ID for each location. The Place ID uniquely identifies a place as it is stored in the Google Places database and on Google Maps. Because the language of the dataset was Italian, the language of the API was set to Italian. It was found that setting the API language to the language of the dataset improved the accuracy of the geocoding results. At the same time, however, this meant that the results were also returned in Italian. Therefore, in order to have the results returned in English, only the Place ID was extracted in this first stage. Once the Place ID was received, it was possible to use the Google Geocoding API to perform reverse geocoding, that is, to obtain all the details for the location in English (e.g., geo-coordinates, administrative level).
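A minimal sketch of this two-stage process is shown below (not the project's exact pipeline: YOUR_API_KEY is a placeholder, requests is used for the HTTP calls, and the field names follow the public Geocoding API documentation):

```python
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
API_KEY = "YOUR_API_KEY"  # placeholder; a valid Google Cloud key is required

def get_place_id(name, language="it"):
    """Stage 1: resolve a place name (queried in Italian) to a Place ID."""
    resp = requests.get(GEOCODE_URL,
                        params={"address": name, "language": language, "key": API_KEY})
    results = resp.json().get("results", [])
    return results[0]["place_id"] if results else None

def details_by_place_id(place_id, language="en"):
    """Stage 2: look up the Place ID and return details in English."""
    resp = requests.get(GEOCODE_URL,
                        params={"place_id": place_id, "language": language, "key": API_KEY})
    results = resp.json().get("results", [])
    if not results:
        return None
    top = results[0]
    return {
        "address": top["formatted_address"],
        "lat": top["geometry"]["location"]["lat"],
        "lng": top["geometry"]["location"]["lng"],
        "types": top["types"],
    }

pid = get_place_id("Stati Uniti")   # Italian toponym from the tagged corpus
print(details_by_place_id(pid))     # coordinates, address and types in English
```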
4.3 ChroniclItaly
To demonstrate the potential of cNER, we used a corpus of Italian ethnic newspapers (i.e., ChroniclItaly, Viola 2018) as an example of diasporic media published in the United States between 1898 and 1920. ChroniclItaly is an open access collection that includes all front pages of seven Italian-language newspapers published in California, Massachusetts, Pennsylvania, Indiana, Vermont, and West Virginia
between 1898 and 1920. The corpus, which was extracted from the Chronicling America newspaper collection of the Library of Congress, includes 4,810 issues for a total of 16,624,571 words. Featuring mainstream (prominenti), radical (sovversivi), and politically independent newspapers, ChroniclItaly is a well-balanced resource for the study of the Italian immigrant press of the time. Moreover, because it is entirely digital, this corpus is a powerful tool for conducting text-based searches and analysis, both quantitative and qualitative. The newspapers’ titles are: L’Italia, Cronaca sovversiva, La libera parola, The patriot, La ragione, La rassegna, and La sentinella del West Virginia.
Although immigrant newspapers have often been used by migration historians, including to study questions of belonging in relation to space, this socio-cognitive dimension of migration remains largely unexplored digitally. This includes the lack of use of immigrant newspapers not only as a source for data analysis but also as a starting point for creating research and analysis tools. Thus, by using ChroniclItaly, we also aimed to contribute to the debate about the lack of diversity, archival biases, and silences in the archives in digital scholarship. Thanks to larger amounts of data, the digital analysis of place name references in immigrant storytelling today allows researchers to understand how individuals made sense of their diasporic identities within the host community and perhaps to reconsider previous interpretations.
Finally, it is worth mentioning that digital scholars wishing to carry out research in languages other than English often find themselves confronted with a relative lack of appropriate computational resources, including, for instance, access to already available trained models in the desired language. Thus, by both using and creating resources in Italian, the study also addresses the issue of underrepresentation of languages other than English in digital scholarship.
5 TAGGING THE CORPUS
The sequence tagging model for the Italian language was trained on I-CAB (Italian Content Annotation Bank), an open access corpus annotated for entities (i.e., persons-PER, organizations-ORG, locations-LOC, and geopolitical entities-GPE), temporal expressions, and relations between entities. I-CAB contains 525 news articles taken from the Italian newspaper
L'Adige and totals around 180,000 words. Embeddings were computed on Italian Wikipedia and trained using FastText with 300 dimensions (https://fasttext.cc/docs/en/crawl-vectors.html). Once the training was complete, the output had the following format (Figure 1):
Figure 1: Output of the sequence tagger for ChroniclItaly.
The first column is the input word, the second column shows the pre-processed, lowercased word, and the third column contains a flag indicating whether the word was seen during training (KNOWN) or not (UNKNOWN). If gold labels are supplied with the input file, these appear in an additional column. The last column contains the predicted tags. The no-entity tag is O. Because some entities (e.g., Stati Uniti "United States") span multiple words, the tagging scheme distinguishes between the beginning (tag B-...) and the inside of an entity (tag I-...). Figure 2 shows the tags:
Figure 2: Tags of the sequence tagger.
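Because Figures 1 and 2 are reproduced as images, the following sketch (with invented sample rows) illustrates how BIO-tagged output of this kind can be collapsed into entity spans:

```python
# Illustrative (token, predicted tag) pairs; not actual corpus output.
rows = [
    ("gli", "O"), ("Stati", "B-GPE"), ("Uniti", "I-GPE"),
    ("e", "O"), ("Roma", "B-GPE"),
]

def bio_to_spans(tagged):
    """Collapse BIO-tagged tokens into (entity text, entity type) spans."""
    spans, current, label = [], [], None
    for token, tag in tagged:
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

print(bio_to_spans(rows))  # [('Stati Uniti', 'GPE'), ('Roma', 'GPE')]
```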
6 ML RESULTS
The sequence tagger retrieved 1,369 unique locations (both LOC and GPE) which occurred 214,110 times throughout the whole corpus. Because each individual document was time-stamped, the number of references to each location could be quantified at any given time within the timeframe of ChroniclItaly (i.e., 1898-1920). The F1 scores for the Italian models are shown in Table 1:
Table 1: F1 score for Italian models.

  Type        Score
  accuracy    98.15%
  precision   83.64%
  recall      82.14%
  FB1         82.88
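For reference, the FB1 value is the harmonic mean of precision and recall; a quick check against the values in Table 1:

```python
# FB1 is the harmonic mean of precision and recall (values from Table 1).
p, r = 83.64, 82.14
fb1 = 2 * p * r / (p + r)
print(round(fb1, 2))  # 82.88
```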
Table 2 shows the scores for each entity type:
Table 2: F1 score for Italian models per entity.

  Entity   Precision   Recall    FB1
  GPE      83.90%      86.18%    85.02
  LOC      69.70%      44.23%    54.12
  ORG      73.36%      73.08%    73.22
  PER      89.78%      87.59%    88.68
The tagged version of ChroniclItaly is ChroniclItaly 2.0 (Viola, 2019), which is available as an open access resource (https://doi.org/10.24416/UU01-4MECRO).
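As an illustration of this aggregation step (the data structures are hypothetical, not the project's code), time-stamped documents make it straightforward to count location mentions per year:

```python
from collections import Counter

# (year, tagged_locations) pairs extracted from the corpus; values are illustrative.
documents = [
    (1898, ["Roma", "New York"]),
    (1915, ["Trieste", "Roma", "Roma"]),
]

counts_per_year = {}
for year, locations in documents:
    counts_per_year.setdefault(year, Counter()).update(locations)

print(counts_per_year[1915]["Roma"])  # 2
```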
7 VISUALISATION
To visualise the results, we chose to use a Shiny app (https://shiny.rstudio.com/), the GeoNewsMiner (GNM; Viola et al., 2019). This allowed us to present and analyse the data in an intuitive, interactive, and reproducible way. Within GNM, references to place names in ChroniclItaly can be explored according to five different levels of aggregation:
- Time: from 1898 to 1920;
- Newspaper title;
- Type of frequency visualisation: absolute or percentage;
- Geographical information: include/exclude references to cities/regions;
- Percentile: select the least/most mentioned places.
In order to reflect the changing geo-political borders within the analysed period (1898-1920), users can additionally choose between three historical world maps displaying the different borders at three crucial points in history which intersect with the timeframe of ChroniclItaly: 1880, 1914, and 1920. By default, GNM displays a contemporary (1994) world map. Finally, users can also share their results, download the raw data reflecting their filter selection, and download the visualised map as a .png file. GNM is available as an open access resource (https://utrecht-university.shinyapps.io/GeoNewsMiner/); full documentation of the project is also available on GitHub (https://github.com/lorellav/GeoNewsMiner). Figure 3 shows a static image of the GNM app.
Figure 3: Static image of the GNM app.
8 DISCUSSION
The method presented a number of technical challenges. The first concerns the performance of the sequence tagger. Although the overall F1 score was satisfactory, the performance for the entity LOC was rather poor. However, in ChroniclItaly GPE entities significantly outnumbered LOC entities (77.2% GPE vs 22.8% LOC), thus at least partially compensating for this limitation.
This first issue can be partly attributed to the second challenge of the study, namely OCR quality. The newspapers in ChroniclItaly were digitized primarily from microfilm holdings. In addition to well-known limitations for OCR processes such as unusual text styles or very small fonts, other limitations occur when dealing with old material, including markings on the pages or the generally poor condition of the original text. These limitations also apply to the OCR-generated searchable texts in ChroniclItaly, which therefore contain errors. However, OCR quality was found to be better in the most recent texts, perhaps due to a better conservation status or a better initial condition of the originals, which overall improved over time. The quality of the OCR data can therefore vary greatly even within the same newspaper. The OCR error limitation could, however, be at least partially overcome in two ways. First, it was reasonable to assume that important concept words would be repeated several times within an article, thus increasing the likelihood that OCR read them correctly in at least some of the passages. Second, the geocoding was restricted to place names that were referred to at least eight times across the whole collection, as the highest number of false positives was found among names occurring fewer than eight times.
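A sketch of this frequency filter (the list of tagged place names is invented; the threshold follows the description above):

```python
from collections import Counter

tagged_places = ["Roma", "New York", "Roma", "Venezia"]  # illustrative tagger output
MIN_FREQ = 8

freq = Counter(tagged_places)
to_geocode = [place for place, count in freq.items() if count >= MIN_FREQ]
print(to_geocode)  # [] for this toy list; the real corpus yields far more
```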
Other minor issues mainly concerned multi-word location names (e.g., Costa del Pacifico ‘Pacific Coast’)
which were tagged as individual parts, as shown in
Figure 4:
Figure 4: Example of tags of the sequence tagger for multi-name locations.
The third challenge concerned the geocoding phase. In addition to the geo-coordinates, Google provides further details, such as the type[] tag, indicating why those specific geo-coordinates have been attributed to a location. Understanding the type[] of a location is therefore very important, especially when working with historical data. The Google Places database stores places based on a contemporary world map, whereas the locations in a historical dataset may have changed name or may no longer exist. Moreover, categories such as country, city, region, municipality, etc., which Google uses to determine the location type[], are highly dependent on the location itself; consequently, certain categories may not apply or may change from country to country. In geocoding the places in ChroniclItaly 2.0, we encountered cases in which the type and level given by Google required a manual edit. The full list of remarks and manual edits is available in the GNM GitHub repository.
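One way to operationalise such checks, sketched under the assumption that results look like those returned by the details_by_place_id() example in Section 4.2 (the set of expected categories is our own choice, not the project's):

```python
# Flag geocoding results whose Google `types` field falls outside the
# categories expected for the corpus, so they can be reviewed manually.
EXPECTED_TYPES = {"country", "locality", "administrative_area_level_1",
                  "administrative_area_level_2"}

def needs_manual_review(result):
    """`result` is a dict with a `types` list, as in the Section 4.2 sketch."""
    return not (set(result["types"]) & EXPECTED_TYPES)

example = {"address": "Pacific Coast, USA", "types": ["natural_feature"]}
print(needs_manual_review(example))  # True -> check by hand
```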
Despite the challenges described, we argue in favour of ML methods to enrich digital collections. One of the biggest advantages of using ML for tagging, for instance, is that it is not based on gazetteers, which makes place name extraction and disambiguation more reliable. Another advantage is that models can be continuously improved by increasing the amount and quality of the data the algorithms learn from, so that they can make faster and more accurate predictions. Finally, once trained, ML methods of this kind require little manual intervention, which makes them potentially of great assistance to researchers without extensive computational training, such as humanities scholars. Indeed, because a solid background in ML was not necessary for using the sequence tagger, we found that the most challenging part of the project was the geocoding phase. ML therefore bears great potential for the humanities: with a modest investment of time, it may be possible to automatically enrich large amounts of data, saving scholars precious time and resources that can be dedicated to investigating new avenues of data analysis.
9 CONCLUSIONS
This paper discussed the added value of applying ML to contextually enrich digital collections. In this study, we employed ML as a method to geographically enrich a historical dataset. Specifically, we used a deep learning architecture for NER tasks (Riedl and Padó, 2018) which implements TensorFlow to perform NER on a corpus of historical immigrant newspapers (ChroniclItaly, Viola 2018). The aim was to prepare large quantities of unstructured data for a conceptual historical analysis of geographical references, which we called conceptual Named Entity Recognition (cNER). Triangulating the quantitative information provided by ML with data visualisation and a qualitative (i.e., sentiment), socio-historical and linguistic analysis, cNER enables us to trace and understand the changing cultural constructions that are attached to place names, as they are derived from the historical context. This further pragmatic level of analysis will help us to establish how imagined “place” is defined over time in relation to changing realities of space. We argue that, overall, the cNER method does better justice to the historical ambiguities embedded in the texts themselves than extraction on the basis of gazetteers or static external information allows.
Despite a number of limitations, which were mainly encountered during the geocoding phase, we found that the method has much to recommend it, particularly to humanities scholars who are increasingly confronted with the challenge of exploring collections that are larger than before and in digital format. Finally, we also argued that, by choosing to use largely neglected sources such as Italian immigrant newspapers, this study contributed to the debate about language diversity representation and archival biases in digital practices.
REFERENCES
Ardanuy, Maria Coll. (2017). Entity-Centric Text Mining for Historical Documents. Georg-August-Universität Göttingen, Göttingen.
Ardanuy, Mariona Coll, & Sporleder, C. (2017). Toponym disambiguation in historical documents using semantic and geographic features. Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage - DATeCH2017, 175–180. https://doi.org/10.1145/3078081.3078099
Bodenhamer, D. J., Corrigan, J., & Harris, T. M. (Eds.). (2010). The spatial humanities: GIS and the future of humanities scholarship. Bloomington, Ind.: Indiana Univ. Press.
Bodenhamer, D. J., Corrigan, J., & Harris, T. M. (Eds.). (2015a). Deep maps and spatial narratives. Bloomington: Indiana University Press.
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Canale, L., Lisena, P., & Troncy, R. (2018). A Novel Ensemble Method for Named Entity Recognition and Disambiguation Based on Neural Network. In D. Vrandečić, K. Bontcheva, M. C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, … E. Simperl (Eds.), The Semantic Web – ISWC 2018 (Vol. 11136, pp. 91–107). https://doi.org/10.1007/978-3-030-00671-6_6
Cresswell, T. (2010). Place: A short introduction (Repr.). Malden, Mass.: Blackwell.
Donaldson, C., Gregory, I. N., & Taylor, J. E. (2017). Locating the beautiful, picturesque, sublime and majestic: Spatially analysing the application of aesthetic terminology in descriptions of the English Lake District. Journal of Historical Geography, 56, 43–60. https://doi.org/10.1016/j.jhg.2017.01.006
Eijnatten, J. V. (2019). Something about the Weather. Using Digital Methods to Mine Geographical Conceptions of Europe in Twentieth-Century Dutch Newspapers. BMGN - Low Countries Historical Review, 134(1), 28–61. https://doi.org/10.18352/bmgn-lchr.10655
Gregory, I. N. (2014). Further Reading: From Historical GIS to Spatial Humanities: An Evolving Literature. In I. N. Gregory & A. Geddes (Eds.), Toward spatial humanities: Historical GIS and spatial history (pp. 186–202). Bloomington, Ind.: Indiana Univ. Press.
Ju, Y., Adams, B., Janowicz, K., Hu, Y., Yan, B., & McKenzie, G. (2016). Things and Strings: Improving Place Name Disambiguation from Short Texts by Combining Entity Co-Occurrence with Topic Modeling. In E. Blomqvist, P. Ciancarini, F. Poggi, & F. Vitali (Eds.), Knowledge Engineering and Knowledge Management (Vol. 10024, pp. 353–367). https://doi.org/10.1007/978-3-319-49004-5_23
Marrero, M., Urbano, J., Sánchez-Cuadrado, S., Morato, J., & Gómez-Berbís, J. M. (2013). Named Entity Recognition: Fallacies, challenges and opportunities. Computer Standards & Interfaces, 35(5), 482–489. https://doi.org/10.1016/j.csi.2012.09.004
McDonough, K., Moncla, L., & van de Camp, M. (2019). Named entity recognition goes to old regime France: Geographic text analysis for early modern French corpora. International Journal of Geographical Information Science, 33(12), 2498–2522. https://doi.org/10.1080/13658816.2019.1620235
Murrieta-Flores, P., & Martins, B. (2019). The geospatial humanities: Past, present and future. International Journal of Geographical Information Science, 33(12), 2424–2429.
Neudecker, C. (2014, March 3). Named Entity Recognition for digitised newspapers – Europeana Newspapers. Retrieved 10 November 2019, from http://www.europeana-newspapers.eu/named-entity-recognition-for-digitised-newspapers/
Pascual-de-Sans, A. (2004). Sense of place and migration histories: Idiotopy and idiotope. Area, 36(4), 348–357. https://doi.org/10.1111/j.0004-0894.2004.00236.
Riedl, M., & Padó, S. (2018). A Named Entity Recognition Shootout for German. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers) (pp. 120–125). Melbourne, Australia.
Tally, R. T. (Ed.). (2011). Geocritical explorations: Space, place, and mapping in literary and cultural studies. New York: Palgrave Macmillan.
Taylor, J., Donaldson, C. E., Gregory, I. N., & Butler, J. O. (2018). Mapping Digitally, Mapping Deep: Exploring Digital Literary Geographies. Literary Geographies, 4(1), 10–19.
Viola, L. (2018). ChroniclItaly: A corpus of Italian American newspapers from 1898 to 1920. Utrecht University. Retrieved from https://public.yoda.uu.nl/i-lab/UU01/T4YMOW.html
Viola, L. (2019). ChroniclItaly 2.0. A corpus of Italian American newspapers annotated for entities, 1898-1920 (Version 2.0). Retrieved from https://doi.org/10.24416/UU01-4MECRO
Viola, L., De Bruin, J., van Eijden, K., & Verheul, J. (2019). The GeoNewsMiner (GNM): An interactive spatial humanities tool to visualize geographical references in historical newspapers (v1.0.0). Retrieved from https://github.com/lorellav/GeoNewsMiner
White, R. (2010). Spatial History Project. Retrieved 8 November 2019, from https://web.stanford.edu/group/spatialhistory/cgi-bin/site/pub.php?id=29
Withers, C. W. J. (2009). Place and the ‘Spatial Turn’ in Geography and in History. Journal of the History of Ideas, 70(4), 637–658. https://doi.org/10.1353/jhi.0.0054
Won, M., Murrieta-Flores, P., & Martins, B. (2018). Ensemble Named Entity Recognition (NER): Evaluating NER Tools in the Identification of Place Names in Historical Corpora. Frontiers in Digital Humanities, 5. https://doi.org/10.3389/fdigh.2018.00002
Yadav, V., & Bethard, S. (2019). A Survey on Recent Advances in Named Entity Recognition from Deep Learning models. ArXiv:1910.11470 [Cs]. Retrieved from http://arxiv.org/abs/1910.11470
Zhang, Z., & Iria, J. (2009). A novel approach to automatic gazetteer generation using Wikipedia. Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources, 1–9. Retrieved from http://dl.acm.org/citation.cfm?id=1699765.1699766