Extracting Event-related Information from a Corpus Regarding Soil Industrial Pollution

Chuanming Dong (1,3,a), Philippe Gambette (2,b) and Catherine Dominguès (1,c)

(1) LASTIG, Univ. Gustave Eiffel, ENSG, IGN, F-77420 Champs-sur-Marne, France
(2) LIGM, Univ. Gustave Eiffel, CNRS, ESIEE Paris, F-77454 Marne-la-Vallée, France
(3) ADEME, Agence de l’Environnement et de la Maîtrise de l’Énergie, F-49004, Angers, France

Keywords:

Information Extraction, Deep Learning, Word Embedding, Semantic Annotation, Industrial Pollution.

Abstract:

We study the extraction and reorganization of event-related information in texts about industrial pollution. The objective is to build a memory of polluted sites that gathers information about industrial events from various databases and corpora. An industrial event is described through several features, such as the event trigger, the industrial activity, the institution and the pollutant. To collect information efficiently from a large corpus, the information extraction process must be automated. To this end, we manually annotated part of a corpus about industrial soil pollution, then used it to train information extraction models with deep learning methods. The models we trained achieve a 0.76 F-score on event feature extraction. We intend to improve the models and then apply them to other text resources to enrich the memory of polluted sites with extracted information about industrial events.

1 INTRODUCTION

Pollution is becoming one of the major concerns of people living in France. The French Ministry of the Ecological Transition (MTES) is responsible for collecting and updating pollution data from industrial sites. These data are gathered in several databases, including BASOL, the database of (potentially) polluted sites; BASIAS, a historical inventory of old industrial sites; and S3IC, the database of classified facilities.

With abundant information about industrial sites, these databases have proven necessary for assessing the situation of a polluted site and for calculating the cost of rehabilitating a wasteland. Nevertheless, the information they contain can become inconsistent across databases because of their specific objectives and different update rates. The BASIAS database was created to record the activities of old industrial sites. Compared with the other databases, it specializes in classifying the productive activities of a site, but some of its information, for example the address of a site, may not be up to date. The S3IC database was built through inspections of industrial facilities, so it contains information about the operations of the facilities on a site, the authorization status of those operations and the danger level of the facilities. It classifies the industrial activities conducted through those facilities from the point of view of an MTES inspector, which distinguishes S3IC from the other databases about polluted sites. Lastly, BASOL focuses on the pollution of industrial sites. In this database, each site is described in detail through the potential pollution processes and/or the remediation processes, as well as a list of the pollutants detected on the site, all of which are missing from the other databases.

Author ORCID identifiers:
a https://orcid.org/0000-0003-3232-8177
b https://orcid.org/0000-0001-7062-0262
c https://orcid.org/0000-0002-0362-6805

The multiplication of databases and the variations in their content make it difficult to obtain a synthetic view of the situation of the sites. In addition, historical information such as industrial events also plays an important role in the assessment of sites, but this information is either missing or disorganized in these databases.

Therefore, we plan to create a memory of sites that reorganizes the information from these databases in a more consistent and efficient way. A memory of sites is a database built on the events that happened on those sites. Eventually, users will be able to query this database for information about polluted sites, such as location, pollutants and industrial activities. Since the existing databases share neither the same objectives regarding pollution treatment nor the same definition of an industrial event, they do not record the same events. Besides, those events are usually embedded in the narrative texts attached to the databases, and many more events are described in texts such as regulatory reports than in the databases themselves. In this paper, we therefore introduce an information extraction model that extracts event-related information from plain text. In the future, the chronological assembly of these events will make it possible to build the memory of polluted sites.

Dong, C., Gambette, P. and Dominguès, C.
Extracting Event-related Information from a Corpus Regarding Soil Industrial Pollution.
DOI: 10.5220/0010656700003064
In Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2021) - Volume 1: KDIR, pages 217-224
ISBN: 978-989-758-533-3; ISSN: 2184-3228

The information extraction model is designed for the BASOL narrative texts, from which events must be extracted. After a brief review of related work in Section 2, Section 3 describes the narrative text corpus, the notion of event and the features that describe industrial events and are looked for in the corpus. The automatic annotation process is based on deep learning; it combines a neural network and word embeddings, which are explained in Section 4. The automatic annotation of the event features is evaluated in Section 5. The results are discussed in Section 6, based on precision, recall and F-score measures. The paper concludes with perspectives in Section 7.

2 RELATED WORK

In natural language processing (NLP), an information extraction task can be regarded as a sequence labeling task or a classification task. Information extraction tasks focused on event features are relatively new to the NLP community, and several approaches have been proposed over the last decade. In (Arnulphy, 2012), a machine learning model classifies words by their predefined syntactic, morphological, semantic and lexical features in order to recognize events. This classifier is based on a decision tree algorithm and reaches a 0.74 F-score on linguistic feature classification. In (Battistelli et al., 2013), a data mining approach is proposed, which extracts the semantic patterns of sentences that describe an event, so that sentences with similar patterns can be extracted as events. Although these approaches differ in the models and algorithms they use, they all require abundant linguistic resources. For example, in (Arnulphy, 2012), French lexicons including action verbs and event nouns are used to define the lexical features of words. In recent years, the development of artificial neural networks and language models has made deep learning approaches much more viable for NLP tasks, including sequence labeling. In (Panchendrarajan and Amaresan, 2018), a model based on a Bi-LSTM neural network achieves a 0.90 F-score on named entity annotation. (Shin et al., 2020) present a spatial information extraction model based on BERT (Bidirectional Encoder Representations from Transformers); using this language model, the authors extract different types of spatial entities with an overall F-score of 0.90. These works show that artificial neural networks and language models have improved results in sequence labeling tasks, especially semantic annotation, without relying on extra linguistic resources.

Our project to build a memory of polluted sites focuses on extracting information about industrial events. Like named entity extraction, event extraction is a semantic annotation task. Unlike previous work, we seek to extract events with a specific theme: pollution. This requires an approach with a strong ability to process the semantic features of text. Our proposed approach is inspired by recent work and is based on a deep learning method and a language model.

3 THE BASOL CORPUS AND THE INDUSTRIAL EVENTS

BASOL describes polluted or potentially polluted sites and soils that require preventive or remedial action by public authorities. It does so through a structured database of industrial events, complemented by narrative texts. The description of an industrial event includes specific features that are relevant in the context of pollution. This section first presents the corpus extracted from the BASOL database. The concept of industrial event and its characteristics are derived from this corpus; the design of the labels for these characteristics and their use are then introduced.

3.1 Description of the Corpus

BASOL contains structured information, collected since the 1990s, about more than 7 000 polluted sites, including their geographic location, owners’ identity and detected pollutants. In addition, narrative texts are attached to the database records and provide detailed information about the facilities and the industrial sites. These texts, collected as a corpus, are the source in which industrial events are looked for. The corpus contains 155 587 sentences, with a vocabulary of 48 032 words. The descriptive texts are meant to clarify the industrial incidents that affected the site, so they include mentions of industrial events. The vocabulary is focused on the topic of industrial pollution. Since this is an official database, the texts are written in standard French, which is a significant quality. As an example, the following sentence is taken from the corpus: La société BRODARD GRAPHIQUE était installée depuis 1959 sur la zone industrielle de Coulommiers (BRODARD GRAPHIQUE had been established in the Coulommiers industrial area since 1959).

KDIR 2021 - 13th International Conference on Knowledge Discovery and Information Retrieval

3.2 The Concept of Event

The corpus details industrial events. But what exactly is an event? According to a dictionary definition, an event is “a thing that happens, especially something important”. Various definitions of an event have been proposed in previous work. In her doctoral thesis, (Arnulphy, 2012) defines an event as something that happens and changes a state. In (Lecolle, 2009), an event is regarded as a singular, unexpected and unrepeatable occurrence. In (Battistelli et al., 2013), although there is no explicit definition of an event, the importance of dates in event extraction is emphasized, which implies that an event is a notion with significant temporal properties. These definitions show that an event is a relatively subjective notion that can be adapted to the needs of the research. They are nonetheless consistent: the notions of “important” and “happen” are crucial. These notions represent two major aspects of an event: occurrence and importance. From a semantic perspective, occurrence can be interpreted as having a distinctive and closed time range, and importance implies an impact on reality. An event can therefore be defined as something that impacts reality, with a distinctive and terminated time marker.

In this project, we specifically study industrial events. Based on the definition of an event, an industrial event can be defined as something that impacts the industrial situation, with a distinctive and terminated time marker. According to this definition, several elements must be identified to specify an industrial event. First, to describe the occurrence of an event, a time marker, an action and an actor are required. Since the events will eventually be linked to industrial sites in the database, a place marker is also crucial. With these elements extracted, we can describe the occurrence of an event as “Who did What, When and Where”. Second, the importance of the event needs to be described. Although the impact on industry cannot be extracted directly from a text, information may be found about the influence of an industrial event on the environment. To gather this information, elements such as pollutants, chemical components and products should also be extracted.

Dictionary definition of “event”: https://www.oxfordlearnersdictionaries.com/definition/american_english/event

3.3 Label Design and Application

Therefore, we propose the following set of labels to

designate the features of an industrial event:

• O: an object, a nominal phrase that serves as an

argument of an action. It can be either the actor,

the receiver or the complement of an action;

• N: an action trigger of an event, usually a momen-

tary verb or its nominal derivation;

• A: an industrial activity, i.e. a repeated action that a company carries out daily;

• T: an indicator of time, typically a date;

• L: an indicator of location, only geographic and

administrative locations;

• R: a relation, usually a prepositional phrase indi-

cating the logical relation between other labels;

• I: an institution’s name;

• S: a chemical element;

• U: a pollutant other than chemical elements;

• D: a pollutant in the form of a container for other pollutants, for example a wasteyard.

These labels cover the need for annotating basic information, but they may overlap. For example, in the segment aspersion de Xylophène sur les poutres de bois (in English: Xylophene sprinkling on the wooden beams), which describes an industrial activity, label U should be assigned to the chemical product Xylophène (Xylophene), while label A (industrial activity) is assigned to the whole segment. To reduce the risk of overlap, the labels have been separated into two groups. The first group contains the labels O, N, T, A, L and R, which are used to describe an event or an activity. The second group contains the labels I, D, S and U, which provide complementary information about pollution and institutions. From a linguistic perspective, the labels of the first group are strongly linked to the syntactic features of words. Assigning them requires information about part of speech and the dependency relations between words, such as whether a word is a noun or a verb, and whether it is the subject or the predicate of the sentence. The second group is more closely related to semantic features: these labels can only be assigned by knowing the meaning of the words. For example, Hydrocarbure (Hydrocarbon) is identified as a chemical substance (label S) not because it is the subject of a sentence, but because it denotes an organic compound consisting entirely of hydrogen and carbon. In addition, a priority rule has been defined so that each word receives only one label within a group. For example, a place name annotated as a location (label L, first group) may also qualify as an object (label O, also in the first group) when it is the object of an event trigger verb. The implemented rule prioritizes the location label, which specifies the event much more than the fact that the place name is also an object.
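As an illustration only, the priority rule can be written as a precedence relation over labels. The sketch below encodes the single precedence stated above (L wins over O); the names `GROUP_1`, `GROUP_2`, `PRECEDENCE` and `resolve` are hypothetical, and any further precedence pairs would have to come from the annotation guidelines.

```python
# Label groups as listed in the text.
GROUP_1 = {"O", "N", "T", "A", "L", "R"}  # describe an event or an activity
GROUP_2 = {"I", "D", "S", "U"}            # pollution and institution information

# Only one precedence is stated in the text: location (L) wins over object (O).
PRECEDENCE = {("L", "O")}  # (winner, loser)


def resolve(candidates):
    """Return the single label to keep among competing labels of one group."""
    remaining = set(candidates)
    for winner, loser in PRECEDENCE:
        if winner in remaining and loser in remaining:
            remaining.discard(loser)
    if len(remaining) != 1:
        # No stated rule covers this combination.
        raise ValueError(f"no rule to choose among {sorted(remaining)}")
    return remaining.pop()


print(resolve({"L", "O"}))  # L
```

Raising an error for combinations the text does not cover keeps the sketch from silently inventing an ordering.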

On the other hand, event features are often designated by several words, for example: La société BRODARD GRAPHIQUE, sur la zone industrielle de Coulommiers. Therefore, the “B-I-E-O” (begin, inside, end, outside) annotation format has been adopted for the annotation work. Since this format uses different labels for the beginning and the end of an extracted expression, it makes it possible to detect multiword units. In this way, both a category label and a boundary label can be assigned at the same time to each word in a group, so it is easy to distinguish between groups of words, even if they belong to the same category. Consequently, the label assigned to each word is in fact a combination of a boundary label and a category label. Here is an example:

Les  installations  de  l’usine  ont  été  démolies  entre  1970  et  1980  .
BO   IO             IO  EO       BN   IN   EN        BT     IT    IT  ET

The two-character labels make it possible to delimit three phrases: Les installations de l’usine (label O), ont été démolies (label N), and entre 1970 et 1980 (label T).
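To make the format concrete, here is a minimal sketch of how two-character labels could be decoded back into phrases. It is not the authors' code: the function name `decode_bieo` is hypothetical, the final period is assumed to carry no label (represented as `None`), and the handling of single-word phrases, which the text does not describe, is left out.

```python
def decode_bieo(tokens, tags):
    """Group tokens into (category, phrase) pairs from B/I/E + category tags.

    Each tag is a boundary letter (B, I or E) followed by a category letter,
    e.g. "BO", "IN", "ET". A tag of None means the token is outside any phrase.
    """
    spans, current, category = [], [], None
    for token, tag in zip(tokens, tags):
        if tag is None:
            current, category = [], None
            continue
        boundary, cat = tag[0], tag[1:]
        if boundary == "B":
            current, category = [token], cat
        elif current and cat == category:
            current.append(token)
        else:
            # I or E tag without a matching B: drop the ill-formed fragment.
            current, category = [], None
            continue
        if boundary == "E":
            spans.append((category, " ".join(current)))
            current, category = [], None
    return spans


tokens = ["Les", "installations", "de", "l’usine", "ont", "été",
          "démolies", "entre", "1970", "et", "1980", "."]
tags = ["BO", "IO", "IO", "EO", "BN", "IN", "EN", "BT", "IT", "IT", "ET", None]
for category, phrase in decode_bieo(tokens, tags):
    print(category, phrase)
```

On the example sentence this yields the three phrases listed above, labelled O, N and T.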

As this example shows, labels are assigned within a sentence. In general, the boundary of a sentence does not necessarily match the boundary of an event; some features of an event may appear in a different sentence from the one that contains the event trigger. However, the BASOL corpus is a combination of brief texts that summarize the activities and events that occurred at a site, so an event is unlikely to be spread over two sentences in this corpus. Consequently, the narrative texts have been segmented into sentences and annotated sentence by sentence. This has several advantages. The sentence is a convenient input unit for a deep learning algorithm (see the next section): a paragraph may be too voluminous for the algorithm to run efficiently, whereas a single word risks losing its context features. Segmentation into sentences also gives better control over the manual annotation workload.

Definition of hydrocarbon: https://en.wikipedia.org/wiki/Hydrocarbon

4 AUTOMATIC ANNOTATION OF EVENT FEATURES

The targeted memory of polluted sites is based on a chronological assembly of pollution events. Each event is described through its features; the goal of the information extraction model is to identify and annotate these features automatically. The proposed model combines a neural network, which identifies the phrases, with word embeddings, which distinguish between the contexts of use of each word occurrence. The two components are independent, and the criteria that guided the choice of each one are explained below. The training of the model combines both components and relies on the manually annotated training corpus.

4.1 Choice of the Information Extraction Model

Several kinds of models are suitable for automatic information extraction. The most widely adopted are models based on linguistic rules and models trained with supervised deep learning methods.

Rule-based models can perform very precise information extraction. However, they rely on built-in vocabularies, and their performance may deteriorate when processing a corpus with new terminology, which is known as the Out-of-Vocabulary (OOV) problem. This could be a major drawback in our case because the corpus could be extended to other documents that deal with the same theme but use another vocabulary (more technical or more regulatory) or mention new institution names and other chemical product names. We therefore chose to build a neural model based on a deep learning method, in order to cope with OOV words and to obtain a more flexible tool. The supervised deep learning method on which the model is built is Bi-LSTM (Bidirectional Long Short-Term Memory) (Basaldella et al., 2018). LSTM is a type of recurrent neural network (RNN). Compared with other neural network structures, an RNN is better suited to sequential learning tasks, especially when the output for a given input can be influenced by the previous inputs. This property suits feature annotation, since the label assigned to a word is strongly influenced by the words in its context. Derived from the traditional RNN, the Bi-LSTM network is more flexible in sequence tagging tasks because it can retain the influence of a word’s remote context during training. And since it is a bidirectional model, it can learn from both the preceding and the following context, which makes it better suited to detecting the beginning and end boundaries of an expression.

4.2 Choice of Word Embeddings

For text data to be processed by the neural network, one step is indispensable: word embedding. Every word of the input text is replaced with a vector that the algorithm can process and that encodes the context of the word in the text. Several word embedding models exist, and the choice among them influences the performance of the information extraction models. At the beginning of the implementation, in order to quickly test the performance of the Bi-LSTM neural network, we trained it with one of the simplest word embedding methods: Word2vec (Mikolov et al., 2013). This method efficiently provides word vectors generated from the context of each word, but it has some fundamental flaws that affected the performance of the models. First of all, the vectors generated by Word2vec are static: each word form has one and only one vector for the whole text, regardless of its different contexts. Consequently, Word2vec vectors cannot represent polysemy, the case where a word has different meanings in different contexts. Furthermore, unlike multi-layer deep learning word embedding models, Word2vec cannot generate vectors that embed complex linguistic information at different levels, such as a word’s syntactic and semantic features. Other word embedding models were therefore considered, especially some state-of-the-art language models. We finally decided to use the French language model CamemBERT (Martin et al., 2020), a Transformer-based model trained on a large French corpus. This model is known for its state-of-the-art performance on natural language processing tasks in French, including part-of-speech tagging, dependency parsing and named entity recognition. What makes this model special is that it assigns different vectors to different occurrences of the same word, according to their contexts. For words it does not recognize, it breaks them down into smaller units and assigns the corresponding vectors to these units. This model is therefore much less affected by polysemy and OOV problems. Since it can efficiently integrate the semantic features of the context, it should be helpful for recognizing the labels closely related to word sense, such as the pollutants.

4.3 Training and Validation Corpora

As explained above, the proposed model is based on a neural Bi-LSTM model. It must be trained on an annotated corpus in order to learn the labels that annotate the event features. The annotated corpus must be reliable (annotations must be manually checked), consistent, suitable for the task and of sufficient size. In addition, part of the manually annotated corpus must be reserved for evaluation. To reduce the manual annotation work, a “bootstrapping” annotation-training process has been implemented. First, the event-related information is manually annotated in a small sample of the corpus. Then, the model is trained on this annotated sample to obtain a rough annotation model. With this model, another corpus sample is automatically annotated and then manually corrected, which feeds a new training cycle that improves the model. Repeating this process accumulates annotated and checked samples, which are gathered to form the final training corpus. The model can thus be trained, as much as necessary, on an abundant and reliable corpus and become an efficient tool.
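The annotation-training cycle described above can be summarized as a short loop. This is a schematic sketch, not the authors' implementation: `bootstrap` and the `toy_*` stand-ins are hypothetical names, and the manual correction step is represented by an ordinary function so that the loop can be run.

```python
def bootstrap(seed, batches, train, annotate, correct):
    """Grow an annotated corpus by alternating training and manual correction.

    seed     -- list of manually annotated sentences
    batches  -- iterable of lists of raw (unannotated) sentences
    train    -- callable: annotated corpus -> model
    annotate -- callable: (model, raw sentences) -> draft annotations
    correct  -- callable standing in for the human correction step
    """
    corpus = list(seed)
    model = train(corpus)
    for raw in batches:
        draft = annotate(model, raw)      # automatic annotation
        corpus.extend(correct(draft))     # manual correction, then accumulation
        model = train(corpus)             # new training cycle
    return model, corpus


# Toy stand-ins so the loop can be executed end to end.
def toy_train(corpus):
    return {"seen": len(corpus)}


def toy_annotate(model, raw):
    return [(sentence, "draft") for sentence in raw]


def toy_correct(draft):
    return [(sentence, "checked") for sentence, _ in draft]


model, corpus = bootstrap([("s0", "checked")], [["s1", "s2"], ["s3"]],
                          toy_train, toy_annotate, toy_correct)
print(model["seen"], len(corpus))  # 4 4
```

Passing the three steps as callables keeps the loop independent of any particular network or annotation tool.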

As seen before, the narrative texts have been seg-

mented into sentences and annotated. Thus, each in-

put data unit of the model is a sentence which is in

the form of a tensor that contains the vector of every

sentence word.

The passage from a sentence to its words is based on a tokenization process. To ensure that the different components of the final model combine coherently, the tokenization method of the word embedding provider, i.e. CamemBERT, has been adopted. However, the way CamemBERT splits certain words into smaller pieces can be inconvenient for manual annotation or correction. Therefore, a script has been prepared that converts CamemBERT tokens, along with their labels, into TreeTagger (Schmid, 1994) tokens. The TreeTagger tokenization is the one chosen for manual annotation, but the script can also convert CamemBERT tokens into any other type of tokens. It also works in the opposite direction and converts other tokenized sentences into CamemBERT tokens.
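The following sketch gives an idea of what such a conversion involves; it is not the authors' script. It relies on the SentencePiece convention, used by CamemBERT's tokenizer, of prefixing word-initial pieces with "▁". The piece segmentation in the example is made up for illustration, the name `merge_pieces` is hypothetical, and keeping the label of the first piece is an assumption of this sketch. It only rebuilds whitespace-delimited words, which is simpler than a real TreeTagger tokenization.

```python
WORD_START = "\u2581"  # SentencePiece marker "▁" that prefixes word-initial pieces


def merge_pieces(pieces, labels):
    """Merge subword pieces into whitespace-delimited words.

    Each merged word keeps the label of its first piece (an assumption of
    this sketch, not necessarily the rule used in the authors' script).
    """
    words, word_labels = [], []
    for piece, label in zip(pieces, labels):
        if piece.startswith(WORD_START) or not words:
            words.append(piece.lstrip(WORD_START))
            word_labels.append(label)
        else:
            words[-1] += piece
    return words, word_labels


# Hypothetical segmentation of "La société BRODARD GRAPHIQUE".
pieces = [WORD_START + "La", WORD_START + "société",
          WORD_START + "BRO", "DAR", "D", WORD_START + "GRAPH", "IQUE"]
labels = ["BO", "IO", "IO", "IO", "IO", "EO", "EO"]
print(merge_pieces(pieces, labels))
```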

This is a bootstrapping experiment that augments the annotated text through successive model training sessions. For the first session, only 120 annotated sentences were prepared for training the model, and 100 sentences for testing and evaluating it. After applying the model, we manually corrected the annotation result and thus obtained 100 more correctly annotated sentences.

The second session consisted of several steps. First, a transitory model was trained on the 220 annotated sentences; this model was then used to annotate 301 new sentences automatically. Correcting the annotation result was an efficient way to obtain 301 more annotated sentences. These sentences were then split into three groups: 130 sentences joined the training data, giving 350 sentences for model training; 120 sentences were used for development, more precisely for choosing the number of epochs; and the evaluation set was composed of those 120 sentences complemented with the last 51 annotated sentences.

Tokenization conversion script: https://github.com/DongChuanming/KDIR_2021_shared/blob/main/KDIR_tokenization_transformer.py
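The sentence counts of the two sessions can be cross-checked with a few lines of arithmetic. The variable names below are ours; the numbers are those given in the text.

```python
# Sentence counts reported for the two sessions.
session1_train, session1_test = 120, 100
new_sentences = 301
to_train, to_dev, held_out = 130, 120, 51

transitory_train = session1_train + session1_test  # corpus of the transitory model
session2_train = transitory_train + to_train       # final training corpus
session2_eval = to_dev + held_out                   # evaluation set of session 2

# The three groups must account for all newly annotated sentences.
assert to_train + to_dev + held_out == new_sentences
print(transitory_train, session2_train, session2_eval)  # 220 350 171
```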

Figure 1: Illustration of the bootstrap method used to aug-

ment the training and evaluation corpora.

5 EVALUATION OF ANNOTATION MODELS

Since the labels have been separated into two groups, two models (named Model 1 and Model 2) have been implemented to annotate the event features automatically. Both are based on the Bi-LSTM neural network and use the same word embeddings provided by CamemBERT. They have been trained and evaluated on the same training and evaluation corpora. They share the same training process (number of epochs and learning rate), referred to as a session below.

Evaluation already begins during model training. In order to find the parameters that optimize the training, we tested the models on the 120 development sentences with different network configurations. To illustrate, Figure 2 shows how the F-score of each label of Model 1 evolves with the number of epochs, with the learning rate set to 0.01.

Figure 2: Evolution of the F-score computed on the development set of the first session, by epoch number for each label - Model 1.

According to Figure 2, most labels reach their highest F-score at epoch 400, so 400 is the best number of epochs for training Model 1 if the other parameters do not change. Aside from the number of epochs, we also tested other parameters, such as learning rate and batch size, for both Model 1 and Model 2, to find their best values. The evaluation results presented below are for models trained with the best parameters found so far. Since not all parameters have been tested yet, the models may be further improved. The evaluation results of the two models trained during both sessions are shown in the following tables. The evaluation is carried out on each label separately. Since event-related information is extracted through the category labels, the boundary labels have not been evaluated at this stage. Tables 1 and 2 give the evaluation of Model 1 and Model 2 trained during the first session.

Table 1: Number of true positives (TP) and evaluation of the precision (p), recall (r) and F-score of Model 1 on the test set of the first training session (100 sentences, 400 epochs).

Label          TP    p     r     F-score
trigger (N)    143   0.66  0.57  0.61
activity (A)   64    0.40  0.58  0.47
object (O)     209   0.94  0.85  0.89
time (T)       184   0.93  0.88  0.90
location (L)   61    0.63  0.59  0.61
relation (R)   23    0.45  0.45  0.45
Total          684   0.72  0.70  0.71

Table 2: Result of Model 2 on the test set of the first training session (100 sentences, 400 epochs).

Label            TP    p     r     F-score
institution (I)  28    0.93  0.46  0.62
chemicals (S)    29    0.88  0.58  0.70
pollutant (P)    2     0.10  0.40  0.16
container (D)    0     -     -     -
Total            59    0.71  0.50  0.59

Tables 3 and 4 show the evaluation of Model 1 and Model 2 trained during the second session.


Table 3: Result of Model 1 on the evaluation set of the second training session (171 sentences, 400 epochs).

Label          TP    p     r     F-score
trigger (N)    308   0.77  0.69  0.73
activity (A)   120   0.62  0.64  0.63
object (O)     763   0.89  0.82  0.85
time (T)       194   0.89  0.93  0.91
location (L)   86    0.55  0.63  0.59
relation (R)   157   0.61  0.54  0.57
Total          1628  0.78  0.74  0.76

Table 4: Result of Model 2 on the evaluation set of the second training session (171 sentences, 400 epochs).

Label            TP    p     r     F-score
institution (I)  146   0.95  0.77  0.85
chemicals (S)    95    0.90  0.82  0.86
pollutant (P)    54    0.75  0.47  0.57
container (D)    2     0.25  0.15  0.19
Total            297   0.87  0.68  0.77
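Assuming the F-score reported in these tables is the balanced F1 (the harmonic mean of precision and recall), the "Total" rows can be cross-checked from the rounded p and r values. The function name `f_score` is ours.

```python
def f_score(p, r):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# "Total" rows as (precision, recall, reported F-score).
totals = {
    "Table 1": (0.72, 0.70, 0.71),
    "Table 2": (0.71, 0.50, 0.59),
    "Table 3": (0.78, 0.74, 0.76),
}
for name, (p, r, reported) in totals.items():
    print(name, round(f_score(p, r), 2), reported)
```

The same computation on the rounded totals of Table 4 gives 0.76 rather than the reported 0.77; a difference of this size is compatible with p and r having been rounded before publication.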

6 RESULT ANALYSIS

Although we only used a small manually annotated corpus, the models already give promising results. For a simple comparison, we also tested two other NLP tools on date annotation: a popular Python library called dateparser, and NOOJ, an annotation software for linguists. Both are rule-based. Given how much an event depends on its time marker, this comparison should also reflect the performance on event extraction. On our texts, dateparser detects date expressions with a precision of 0.77 and a recall of 0.48; NOOJ obtains a precision of 0.98, but a recall of only 0.44. This indicates that our models compare favourably with these tools for detecting certain entities, in particular in terms of recall. The scores of the different labels and the comparison between the manual and automatic annotations give clues for improving the automatic annotation of the event features. The comments on the results and the improvement clues are organized around three themes: the confusion between labels, the improvement due to the increase of the corpus, and the relevance of the CamemBERT word embeddings.

dateparser: https://dateparser.readthedocs.io/en/latest/
NOOJ: http://explorationdecorpus.corpusecrits.huma-num.fr/nooj/
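To put the three date detectors on a single scale, one can derive an F-score for the two rule-based tools from the precision and recall quoted above. These derived values are not reported in the paper; they assume the balanced F1, and the names used below are ours. The figures for our model are those of the time (T) label in the second session.

```python
def f_score(p, r):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# (precision, recall) for date detection, as quoted in the text and tables.
date_detection = {
    "dateparser": (0.77, 0.48),
    "NOOJ": (0.98, 0.44),
    "Model 1, label T (session 2)": (0.89, 0.93),
}
derived = {name: round(f_score(p, r), 2)
           for name, (p, r) in date_detection.items()}
ranking = sorted(derived, key=derived.get, reverse=True)
print(derived)
print(ranking)
```

Under these assumptions, the low recall of the two rule-based tools outweighs their precision, including the very high precision of NOOJ.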

6.1 Comparison between Labels

The models do not work well on some labels. Compared with the time (T) and object (O) labels, the event trigger (N), industrial activity (A) and location (L) labels do not reach an impressive F-score. After examining the automatic annotation results for these labels, we see that, in certain sentences, expressions that should have been annotated as event triggers are annotated as industrial activities. According to our definition of an event trigger, the action that triggers an event should be expressed by a momentary verb or its nominal derivation. In contrast, an industrial activity is an action conducted by enterprises frequently during a period of time, and should be designated by a durative verb or its nominal derivation, or by a repeated action. However, it is difficult to distinguish an event trigger expression from an industrial activity expression on the basis of their syntactic features, especially when both are nominal derivations of verbs. Unlike a verb, a noun is not marked as "momentary" or "durative". Therefore, once nominalized, event trigger expressions are confused with activities, which usually take the form of a nominal phrase.

A similar problem can be found with the location label. Since a location expression often has a prepositional structure, its nominal part can easily be recognized as an object if it is close to an event trigger or an activity. Moreover, the definition of the label itself makes location expressions trickier to recognize: the location expressions we want to extract include only geographic and administrative locations. For example, although the prepositional phrase dans les nappes des calcaires grossiers (in the coarse limestone sheets) indicates a position, and is hence annotated by our model as a location, it belongs to neither of these two types and therefore should not be recognized as a location.
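One way to quantify such confusions is to count, token by token, how often each manually assigned label is replaced by another label in the automatic annotation. The Python sketch below illustrates this; it is not the procedure used in this work. The label letters (N, A, L, O, T) follow the text, while the function names and the toy sequences are invented for illustration.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (gold_label, predicted_label) pairs over aligned token labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned token by token")
    return Counter(zip(gold, predicted))


def most_confused(counts, top=3):
    """Return the most frequent disagreements, most frequent first."""
    errors = [(pair, n) for pair, n in counts.items() if pair[0] != pair[1]]
    return sorted(errors, key=lambda item: (-item[1], item[0]))[:top]


# Toy sequences (invented): one trigger tagged as an activity,
# one location tagged as an object.
gold = ["N", "N", "O", "L", "L", "T", "A"]
pred = ["A", "N", "O", "O", "L", "T", "A"]

for (gold_label, pred_label), n in most_confused(confusion_counts(gold, pred)):
    print(f"gold {gold_label} predicted as {pred_label}: {n}")
```

Applied to a real test corpus, the pairs (N, A) and (L, O) would be the ones to inspect first, given the confusions described above.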

6.2 Improvement Due to the Corpus Increase

An improvement can be observed between the two sessions. Compared with the first session, the models trained in the second session perform better on most labels, thanks to the larger training text. It is also noticeable that Model 2 has benefited more from this increase of the training corpus. Indeed, the labels of the second group are less frequent than those of the first group. Consequently, the first session did not contain enough annotation examples for the second group; the container category (D) is even absent from the test corpus of the first session. With the additional training text of the second session, the models can learn the second-group annotations from more label instances, which improves Model 2.

6.3 Relevance of the Word Embeddings

The use of the CamemBERT word embeddings also improved the results. The pollutant category (P) is the one that benefits the most from these vectors. For comparison, with our preliminary model based on the Word2vec method, the precision of pollutant annotation was only 0.05, whereas with the current model it has increased to 0.56 without lowering the recall. A pollutant expression is usually a nominal phrase, which is very difficult to distinguish from any other nominal component at the syntactic level. Moreover, unlike institution names or chemicals, pollutant expressions do not involve changes of letter case or the use of nomenclatures. The most promising ways to recognize them are therefore to analyse the polarity (positive or negative) of the context and to model the meaning of the word itself, both of which require complex semantic features. Unlike syntactic features, semantic features are hard to extract and hard for the algorithm to capture. The CamemBERT model, which encodes semantic features in the form of word vectors, enables the neural network to learn annotation patterns at a semantic level. As a result, our model can recognize some typical pollutant expressions, such as tensio actif (surfactant) and other chemical products, which is exactly the information that must be extracted in order to build the memory of polluted sites.

7 CONCLUSION

In this paper, we have described an approach for extracting event-related information from a corpus focused on industrial pollution. With a supervised deep learning method, we trained two models that can reproduce our manual annotation of industrial event features. So far, the models trained with Bi-LSTM neural networks have given promising results, but they still need to become better at detecting event triggers and industrial activities before we can use them on other text resources. Given that the models were trained on only a small portion of the corpus, and that the neural network configurations have not been fully explored, there is room to improve the models. Aside from increasing the training text data and adjusting the neural network settings, it would also be interesting to see whether the models perform better when paragraphs instead of sentences are used as the input of the neural networks, since the narration of an event is not limited to a single sentence.
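As an illustration of the paragraph-level input mentioned above, the following Python sketch merges the annotated sentences of one paragraph into longer training sequences. It is only a possible preprocessing step, not something implemented in this work: the function name merge_sentences, the max_tokens limit and the toy tokens are assumptions, and "-" is used here as a hypothetical marker for unlabeled tokens.

```python
def merge_sentences(sentences, max_tokens=256):
    """Greedily merge consecutive sentences of one paragraph.

    Each sentence is a list of (token, label) pairs. Sentences are
    concatenated until adding the next one would exceed max_tokens;
    a sentence longer than max_tokens is kept as its own sequence.
    """
    sequences, current = [], []
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_tokens:
            sequences.append(current)
            current = []
        current.extend(sentence)
    if current:
        sequences.append(current)
    return sequences


# Toy paragraph (invented): the time marker and the event trigger
# appear in two different sentences.
paragraph = [
    [("En", "-"), ("1995", "T"), (".", "-")],
    [("Une", "-"), ("fuite", "N"), ("est", "-"), ("constatée", "-"), (".", "-")],
]
for sequence in merge_sentences(paragraph, max_tokens=16):
    print([label for _, label in sequence])
```

With a sufficiently large max_tokens, the time marker and the trigger end up in the same input sequence, so the network can see both when labeling either of them.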

This work contributes to the construction of the polluted sites memory, based on a single consistent and complete database. Eventually, the event-related information extracted by the models will be inserted into the database. In future work, we will apply a syntactic parser to link the extracted event features through dependency relations, and train a classifier to categorize the events, so that they can be integrated into the database with an appropriate structure. The models will also be tested and used on other corpora in the domain of industrial pollution, to connect other sources of data and enrich the polluted sites memory.

REFERENCES

Arnulphy, B. (2012). Désignations nominales des événements : étude et extraction automatique dans les textes. PhD thesis, Université Paris 11.

Basaldella, M., Antolli, E., Serra, G., and Tasso, C. (2018). Bidirectional LSTM Recurrent Neural Network for Keyphrase Extraction, pages 180–187. Springer.

Battistelli, D., Charnois, T., Minel, J.-L., and Teissèdre, C. (2013). Detecting salient events in large corpora by a combination of NLP and data mining techniques. In Conference on Intelligent Text Processing and Computational Linguistics, volume 17(2), pages 229–237, Samos, Greece.

Lecolle, M. (2009). Éléments pour la caractérisation des toponymes en emploi événementiel. In Evrard, I., Pierrard, M., Rosier, L., and Raemdonck, D. V., editors, Les sens en marge : Représentations linguistiques et observables discursifs, pages 29–43. L’Harmattan.

Martin, L., Muller, B., Ortiz Suárez, P. J., Dupont, Y., Romary, L., de la Clergerie, É., Seddah, D., and Sagot, B. (2020). CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Linguistics.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space.

Panchendrarajan, R. and Amaresan, A. (2018). Bidirectional LSTM-CRF for named entity recognition. In Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation, Hong Kong. Association for Computational Linguistics.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees.

Shin, H. J., Park, J. Y., Yuk, D. B., and Lee, J. S. (2020). BERT-based spatial information extraction. In Proceedings of the Third International Workshop on Spatial Language Understanding, pages 10–17, Online. Association for Computational Linguistics.

KDIR 2021 - 13th International Conference on Knowledge Discovery and Information Retrieval
