TEMPORAL INFORMATION INDEXING MODEL

Witold Abramowicz and Andrzej Bassara

Department of Information Systems, Poznan University of Economics, ul. Niepodleglosci 10, Poznan, Poland

Keywords:

Information Retrieval, Temporal Information Retrieval, Temporal Expressions, Indexing, Temporal Indexing.

Abstract:

Modern information retrieval models cannot resolve queries that contain temporal criteria. One cannot search for documents whose content relates to a certain time (for instance, "find all documents related to the third quarter of the last year"). This limitation is mainly due to the syntactic nature of modern information retrieval models, which perform query-document matching based on syntactic or simplified semantic similarity measures. In this article, we focus on the problem of creating document indexes that represent the time to which document contents relate and that, in turn, allow documents to be searched using temporal criteria.

1 INTRODUCTION

Modern information retrieval (IR) systems are not capable of searching for documents which contain information related to a specified time. It is relatively easy to find documents based on their publication date. Nevertheless, the publication date may be significantly different from the time to which the article relates. As in bi-temporal databases (Jensen and Snodgrass, 2006), two orthogonal dimensions of time exist: the transaction time and the valid time. The transaction time is specific to the publication process and may include creation, approval, publication or modification dates. The valid time is the time to which the information presented in the article relates.

This limitation is mainly caused by the simplification of indexing. Documents are usually indexed automatically with an uncontrolled vocabulary. In such a case, indexing terms are usually words extracted from the document content. The computation of relevance is then based purely on syntactic features. Sometimes words are stemmed or lemmatized. The comparison of query and document terms may also be supported by thesauri or performed at the ontological level. Both approaches bring the process closer to the semantic level.

Table 1: Sample query with temporal criteria.

Document:          The board of the Globe Trade informs that during 16 August 2006
Information need:  all documents that relate to the third quarter of the last year
Query:             the third quarter of the last year

This approach is, however, not appropriate for queries with temporal criteria. Table 1 presents a sample scenario. The query and the document are not syntactically similar, and a semantic comparison based on concepts will also yield no similarity. The document nevertheless seems to be partially relevant. Limiting our consideration to calendar expressions only, the computation of relevance requires:

1. extraction of temporal features from the document and the query: "16 August 2006" and "the third quarter of the last year (2006)",

2. encoding their values using a formal time model: Y2006M08D16 and Y2006Q3,

3. comparing the values by means of arithmetic specific to the selected time model: Y2006M08D16 within Y2006Q3,

4. computing the relevance: the references are expressed at different granularity levels (days and quarters), and although one reference contains the other, it is not clear how to compute relevance, as it may depend on the information need.
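Steps 2 and 3 above can be sketched as follows. This is only a minimal illustration, assuming that codes such as Y2006M08D16 and Y2006Q3 denote closed intervals of Gregorian calendar days; the function names `granule_to_interval` and `within` are hypothetical and do not come from any existing system.

```python
import re
from datetime import date, timedelta

def granule_to_interval(code):
    """Map a code such as 'Y2006M08D16' or 'Y2006Q3' to (first_day, last_day)."""
    m = re.fullmatch(r"Y(\d{4})(?:M(\d{2})(?:D(\d{2}))?|Q([1-4]))?", code)
    if m is None:
        raise ValueError(f"unsupported code: {code}")
    year = int(m.group(1))
    if m.group(4):                        # quarter of the year
        first_month = 3 * (int(m.group(4)) - 1) + 1
        last_month = first_month + 2
    elif m.group(2):                      # month, optionally followed by a day
        first_month = last_month = int(m.group(2))
    else:                                 # whole year
        first_month, last_month = 1, 12
    if m.group(3):                        # a single day
        day = date(year, first_month, int(m.group(3)))
        return day, day
    start = date(year, first_month, 1)
    if last_month == 12:
        end = date(year, 12, 31)
    else:                                 # day before the first day of the next month
        end = date(year, last_month + 1, 1) - timedelta(days=1)
    return start, end

def within(inner, outer):
    """True if every day of `inner` also belongs to `outer`."""
    inner_start, inner_end = granule_to_interval(inner)
    outer_start, outer_end = granule_to_interval(outer)
    return outer_start <= inner_start and inner_end <= outer_end

print(within("Y2006M08D16", "Y2006Q3"))
```

The sketch only answers the containment question of step 3; how containment should translate into a relevance score (step 4) is left open, as in the text.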

Successful application of this approach requires, however, addressing the following issues:

Time Models Multiplicity. Temporal expressions may be formalized in various time models (point-based, interval-based, point-interval-based). These models are also often extended to support multiple time units and imprecise expressions. Moreover, it is not possible to compare two temporal expressions unless they are expressed in comparable and known time models.


Abramowicz W. and Bassara A. (2008).

TEMPORAL INFORMATION INDEXING MODEL.

In Proceedings of the Tenth International Conference on Enterprise Information Systems - AIDSS, pages 387-390

DOI: 10.5220/0001670403870390

 SciTePress

Multiplicity of Temporal Features. The most straightforward way of expressing temporal information is to relate it to calendar expressions. There are, however, other approaches that may be followed. For instance, some events may be related to other events by means of temporal relations ("A happens during B").

Polymorphism of Time Expressions. Semantically equivalent temporal expressions may be expressed in many different ways: "18 January 2007", "18 I 2007", "18-01-2007" and "yesterday" (if the reference date is 2007-01-19) all relate to the same date.

Ambiguity/Imprecision. In many cases, there is no way to precisely qualify the value of a temporal expression, for instance in fuzzy expressions like "the beginning of May".
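The polymorphism issue can be illustrated with a minimal sketch that maps several surface forms to one calendar date. It covers only the four forms used as examples, assumes day-month-year ordering, and passes 2007-01-19 as the reference date so that "yesterday" resolves to 2007-01-18. The function `normalize_expression` and both lookup tables are hypothetical.

```python
import re
from datetime import date, timedelta

MONTH_NAMES = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}
ROMAN_MONTHS = {numeral: number for number, numeral in enumerate(
    ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X", "XI", "XII"],
    start=1)}

def normalize_expression(expression, reference):
    """Resolve a calendar expression to a date, relative to `reference`."""
    text = expression.strip()
    if text.lower() == "yesterday":
        return reference - timedelta(days=1)
    # Numeric form, e.g. 18-01-2007 (assumed to be day-month-year)
    m = re.fullmatch(r"(\d{1,2})-(\d{1,2})-(\d{4})", text)
    if m:
        day, month, year = map(int, m.groups())
        return date(year, month, day)
    # Day, month name or Roman numeral, year, e.g. 18 January 2007 or 18 I 2007
    m = re.fullmatch(r"(\d{1,2})\s+(\S+)\s+(\d{4})", text)
    if m:
        token = m.group(2)
        month = MONTH_NAMES.get(token.lower()) or ROMAN_MONTHS.get(token.upper())
        if month:
            return date(int(m.group(3)), month, int(m.group(1)))
    raise ValueError(f"unsupported expression: {expression!r}")

reference = date(2007, 1, 19)
forms = ["18 January 2007", "18 I 2007", "18-01-2007", "yesterday"]
print({form: normalize_expression(form, reference).isoformat() for form in forms})
```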

2 RELATED WORK

The issues highlighted above are not fully addressed in the literature. The best-known system that deals with temporal IR is TOODOR (Temporal Object-Oriented Document Organization and Retrieval) (Llavori et al., 1998). Each article stored in the TOODOR system is qualified by two attributes: publication date and temporal horizon (valid time), which makes it a de facto bi-temporal database. Unfortunately, the temporal horizon is not defined. The authors state that its semantics is specific to the particular application and that its value should be set manually. In later publications (Llidó et al., 2001), it is suggested that its value should be based on calendar expressions extracted from the document content. The indexing process consists of the following steps: extraction and normalization of all calendar expressions, determination of the most important date, and definition of the temporal horizon as an interval which covers all expressions that are within a certain range from the most important date.
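The last of these steps can be sketched as below. How the most important date is chosen and how wide the range is are not specified here, so both are plain inputs in this illustration; `temporal_horizon` is a hypothetical name and the sketch is not the TOODOR implementation.

```python
from datetime import date, timedelta

def temporal_horizon(dates, most_important, max_distance_days):
    """Smallest interval covering all dates within the given range of the anchor."""
    window = timedelta(days=max_distance_days)
    nearby = [d for d in dates if abs(d - most_important) <= window]
    nearby.append(most_important)      # the anchor always belongs to the horizon
    return min(nearby), max(nearby)

extracted = [date(2006, 8, 16), date(2006, 8, 20), date(2006, 3, 1), date(2006, 8, 10)]
print(temporal_horizon(extracted, date(2006, 8, 16), max_distance_days=14))
```

In this example the reference to 1 March 2006 lies outside the 14-day range and therefore does not widen the horizon.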

There is also TDRM (Temporal Document Retrieval Model) (Kalczynski and Chou, 2005), which focuses mainly on fuzzy expressions, i.e. expressions whose value may not be determined precisely (e.g. "at the beginning of May"). The authors suggest using fuzzy set theory to encode their values. TDRM also adapts the Vector Space Model for weighting indexing terms. In this case each temporal reference is decomposed into a set of days. Each day is regarded as a single indexing term, whose weight depends on its frequency within the document and within the whole collection.
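The idea of treating days as indexing terms can be sketched with a plain TF-IDF weighting. This is an assumption made for illustration only: the exact weighting formula of TDRM and its fuzzy membership values are not reproduced, and the names `days_in` and `day_weights` are hypothetical.

```python
import math
from collections import Counter
from datetime import date, timedelta

def days_in(start, end):
    """Decompose a closed interval into its individual days."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

def day_weights(documents):
    """documents: {doc_id: [(start, end), ...]} -> {doc_id: {day: weight}}."""
    term_freq = {
        doc_id: Counter(day for start, end in refs for day in days_in(start, end))
        for doc_id, refs in documents.items()
    }
    # Number of documents in which each day occurs at least once
    doc_freq = Counter(day for counts in term_freq.values() for day in counts)
    n_docs = len(documents)
    return {
        doc_id: {day: tf * math.log(n_docs / doc_freq[day])
                 for day, tf in counts.items()}
        for doc_id, counts in term_freq.items()
    }

collection = {
    "a": [(date(2006, 8, 16), date(2006, 8, 17))],
    "b": [(date(2006, 8, 17), date(2006, 8, 17))],
}
print(day_weights(collection))
```

A day mentioned in every document receives weight zero under this formula, which mirrors the usual behaviour of inverse document frequency for ordinary terms.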

The major problems with the presented approaches are related mainly to the lack of: a precise definition of temporal features, documentation of the feature extraction and normalization process, and an explanation of the rationale for certain design decisions (especially those related to granularity conversion or term weighting). Moreover, both approaches are constrained to a very limited set of temporal features.

Figure 1: Metamodel of Temporal Information Indexing.

3 META-MODEL OF TEMPORAL INDEXING

A document index serves as a surrogate which represents the document's important features in a compact and machine-processable form. The content of an index depends on the potential information needs. In most cases indexes cover topics, words, or named entities important for an indexed document. A temporal index, on the other hand, should reflect the time to which the facts presented in the document relate.

Many potential and usable temporal indexing models exist. These models differ mainly in terms of: the time model, the definition of temporal features, their normalization and extraction procedures, the definition of indexing terms, and the definition of the index structure. All these approaches may be described by one meta-model, which defines: the necessary components, the data they process, their interrelationships, and recommendations for certain design decisions (see Figure 1). The presented meta-model is a result of generalizing existing approaches to temporal indexing and the models that were created during our experiments.

3.1 Static Model

The static model defines the resources required during the indexing process, which include: a time model, a definition of temporal features along with extraction rules, and temporal feature normalization rules.

A time model is the most fundamental component. It provides a definition of indexing terms, which may include time points, intervals or granules. We suggest using a calendar-based time model. The decision is motivated by its:

• Popularity: calendar-based temporal expressions occur relatively frequently, especially in news stories,

• Simplicity: one of the most common ways for users to express temporal constraints is to use calendar expressions; using the same time model for queries and for the index simplifies the model,

• Expressiveness: the model should allow the semantics of temporal expressions to be expressed as precisely as possible; each expression should be encoded at the granularity level at which it was expressed in a document.

A document may then be indexed with pairs (I, G), where I is a granule index within granularity G (see (Bettini et al., 1998) for calendar arithmetic). We suggest using the following granularities: a day of the week, a day of the month, a week of the year, a month of the year, a quarter of the year, a half of the year, a season of the year, a year, a decade, and a century, i.e. G ∈ {DOW, ..., MTH, YER, ..., CTR}. The choice is dictated by the relative frequency of expressions at these granularity levels. The list obviously does not cover all potential granularities; for example, a day of the year and a fiscal year are missing, but they appeared relatively rarely in the analyzed documents. The index I of a granule within granularity G is computed as the number of granules between the analyzed granule and a reference granule. The reference granule for the day granularity is the first day of this era. For other granularities, it is the granule that contains the day with index 1 (DAY(1)).
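The computation of granule indices can be sketched for a few granularities. The sketch rests on assumptions that are not stated in the text: indices are taken to be 1-based, so the granule containing DAY(1) has index 1; the proleptic Gregorian calendar of Python's `datetime` is used, where `date.toordinal()` returns 1 for 0001-01-01; and QTR is a hypothetical label for the quarter granularity.

```python
from datetime import date

def granule_index(day, granularity):
    """Index I of the granule of `granularity` that contains `day` (1-based)."""
    if granularity == "DAY":
        return day.toordinal()                       # 0001-01-01 has ordinal 1
    if granularity == "MTH":
        return (day.year - 1) * 12 + day.month
    if granularity == "QTR":                         # hypothetical label
        return (day.year - 1) * 4 + (day.month - 1) // 3 + 1
    if granularity == "YER":
        return day.year
    if granularity == "CTR":
        return (day.year - 1) // 100 + 1
    raise ValueError(f"unsupported granularity: {granularity}")

sample = date(2006, 8, 16)
print([(granule_index(sample, g), g) for g in ("DAY", "MTH", "QTR", "YER", "CTR")])
```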

This construction has two advantages. Firstly, we do not lose semantics when automatically shifting granularity levels (during "a week" is not the same as during the seven consecutive days that constitute this week). Secondly, it is easy to compare expressions at different granularity levels. For instance, testing whether MTH(i) ∩ YER(j) ≠ ∅ is trivial, because according to the definition of the calendar (Bettini et al., 1998) both MTH and YER are defined as derivative granularities of the granularity DAY.
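Because both granularities are derived from DAY, a month granule and a year granule can be compared by mapping each to its range of day indices. The sketch below assumes 1-based month indices counted from the month containing DAY(1) and uses Python's proleptic Gregorian ordinals as day indices; the function names are hypothetical.

```python
from datetime import date

def month_days(i):
    """Range of DAY indices (first, last) covered by the month granule MTH(i)."""
    year, month = divmod(i - 1, 12)
    year, month = year + 1, month + 1
    first = date(year, month, 1)
    following = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return first.toordinal(), following.toordinal() - 1

def year_days(j):
    """Range of DAY indices (first, last) covered by the year granule YER(j)."""
    return date(j, 1, 1).toordinal(), date(j, 12, 31).toordinal()

def intersects(a, b):
    """True if two closed ranges of day indices share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]

august_2006 = (2006 - 1) * 12 + 8
print(intersects(month_days(august_2006), year_days(2006)))
print(intersects(month_days(august_2006), year_days(2007)))
```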

The calendar is used to encode the values of a document's temporal features. The following features have been defined:

Temporal Expressions. Temporal expressions relate directly to a model of time. All information required to qualify their values is embodied in the expression itself, the surrounding context, and the time model. No external knowledge is required. For instance, "2007-01-02", "tomorrow" or "before" are temporal expressions, but "during the Great Depression" is not. Although the last expression points to some time period, it requires knowledge of the beginning and ending dates of this event in order to set the time period precisely.

Objects and Events. Objects and events possess temporal features. They themselves do not have a value specified by a time model, but they exist in time. For instance, an event may have an occurrence date and an object exists during some time period.

Concepts. Concepts themselves usually do not have a meaning that allows them to be related to certain time periods. We may assume, however, that the conceptualization layer is dynamic. New concepts are created and some concepts lose popularity. Moreover, the popularity of the concepts appearing in documents changes over time.

The last component used to characterize the indexing model is the normalization process. The normalization process sets the values of temporal features in the selected time model. In the case of the calendar model, the granule indices and the granularity level need to be specified for each temporal expression. The normalization procedure is partially independent of the other components. Different temporal features often share a common normalization approach, and a single temporal feature may be normalized using different approaches. We can distinguish the following normalization approaches:

Rules. For some categories of temporal features, it is possible to define the normalization mechanism in terms of conditional statements (IF ... THEN ... rules). This approach is especially useful in the case of calendar expressions. For example, if the reference date is "2000-01-01", the date to be normalized is "February", and it appears from the narrative context that we speak about the future, then the year of the normalized date should be set to the year of the reference date, i.e. 2000.

DB of States/Events. Above, we used the example of the "Great Depression". The normalization of such an expression requires information about the beginning and ending dates of this event. It is possible to create a database of events/states, which may in turn be used for indexing purposes. The indexing model is, of course, limited to the events/states it has knowledge about.

Distribution of Concepts in Time. We have assumed that the concepts used in text, or at least a subset of them, including the concepts used to describe events and states, are related to time. It is possible to build a probabilistic model which defines the probability of occurrence of a particular concept in documents related to different time points. One may use a joint probability to assess the probability that a document containing certain concepts relates to a certain period.
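The first two normalization approaches can be sketched as follows. The rule shown is one possible IF ... THEN ... rule for a bare month name; its handling of months other than the "February" example is an assumption. The event table is hypothetical: it holds a single entry with the commonly cited years of the Great Depression at year granularity. All names are illustrative.

```python
from datetime import date

MONTH_NAMES = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

def resolve_bare_month(name, reference, direction):
    """IF a bare month name occurs in a 'future' or 'past' context
    THEN pick the nearest matching month on that side of the reference date."""
    month = MONTH_NAMES[name.lower()]
    year = reference.year
    if direction == "future" and month < reference.month:
        year += 1      # the month has already passed in the reference year
    elif direction == "past" and month > reference.month:
        year -= 1      # the month has not yet occurred in the reference year
    return year, month

# Hypothetical event table (assumption): commonly cited first and last year.
EVENTS = {"great depression": (1929, 1939)}

def resolve_event(name):
    """Return (first_year, last_year), or None for an event the table lacks."""
    return EVENTS.get(name.lower())

print(resolve_bare_month("February", date(2000, 1, 1), "future"))
print(resolve_event("Great Depression"))
```

An unknown event yields `None`, which reflects the limitation that such a model can only index events it has knowledge about.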

3.2 Behavioral Model

The behavioral model defines which resources of the static model are to be used at each stage of the indexing process, and in which order.

Generalizing the investigated approaches, the indexing process consists of the following steps:

1. The processing unit is a single document. For each document a list of temporal features is extracted (according to the definition of temporal features). At this stage partial transformation or normalization of temporal features is possible. For instance, temporal expressions may be encoded using some formal notation. Therefore, the extraction process may be indirectly dependent on the time model.

2. Each of the extracted features is normalized according to the defined normalization process. The result of normalization is a list of temporal feature values formalized with respect to the chosen time model.

3. Based on the normalized features, the temporal index is created. At this stage terms may be weighted or filtered.
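The three steps can be sketched as a small pipeline. To keep it short, the sketch recognizes only ISO-formatted dates as temporal features, encodes them at DAY granularity using Python's date ordinals, and weights terms by relative frequency; all of these choices and all names are assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import date

ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def extract(text):
    """Step 1: extract temporal features, partially encoded as (y, m, d) tuples."""
    return [tuple(map(int, match.groups())) for match in ISO_DATE.finditer(text)]

def normalize(features):
    """Step 2: express each feature as an (I, G) pair in the time model."""
    return [(date(y, m, d).toordinal(), "DAY") for y, m, d in features]

def build_index(terms, min_count=1):
    """Step 3: weight terms by relative frequency and filter out rare ones."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total
            for term, count in counts.items() if count >= min_count}

def index_document(text, min_count=1):
    return build_index(normalize(extract(text)), min_count)

text = "Meeting on 2007-01-02, moved from 2007-01-05 back to 2007-01-02."
print(index_document(text))
```

Raising `min_count` drops terms that occur fewer times than the threshold, which is one simple way of realizing the filtering mentioned in step 3.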

4 SUMMARY

The metamodel defines the indexing process and the resources that are necessary to accomplish it. The characteristics of a particular model depend on: a time model, a definition of temporal features and a normalization process. Having decided on a calendar-based time model, we may look for promising models by modifying the definition of temporal features and the normalization procedure. The following models were implemented with satisfactory results:

• Temporal references: We assume that the appearance of a temporal expression in the text means that the article is related to that date. We do not, however, analyze this relationship.

• Events: We assume that if an event occurs in a document, then the document itself is related to the date or dates specific to that event. Again, the semantics of this relationship is not analyzed. In this case a database of events and their specific dates is needed.

• Concepts: The probability of a concept occurring in a document depends on the period to which the document relates. In other words, documents that relate to different time periods may use different concept sets. For instance, the concept "collective farming" may occur with relatively higher frequency in documents related to the first half of the last century than, for example, in documents related to this century. Of course, one concept does not allow any conclusions to be drawn, but combining the probabilities of occurrence of each concept contained in the document may give some clue about the document's valid time.

• Semantic Similarity: In traditional IR systems indexing is sometimes based on the similarity of documents. In this approach, it is assumed that syntactically similar documents are also semantically similar and that semantically similar documents should have similar indexes. Therefore, syntactically similar documents should also have similar indexes.
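The concept-based model can be sketched by combining per-concept probabilities into a score for each candidate period. The sketch assumes that concepts are conditionally independent given the period and that all periods are equally likely a priori (a naive Bayes simplification that is not claimed by the text). The probability values and period labels below are invented purely for illustration, and `period_scores` is a hypothetical name.

```python
import math

def period_scores(concepts, likelihood, floor=1e-6):
    """likelihood: {period: {concept: P(concept | period)}} -> {period: score}.

    Unseen concepts receive the small probability `floor` instead of zero.
    """
    log_scores = {
        period: sum(math.log(table.get(concept, floor)) for concept in concepts)
        for period, table in likelihood.items()
    }
    # Normalize in log space so that the scores sum to one
    top = max(log_scores.values())
    unnormalized = {p: math.exp(s - top) for p, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {p: value / total for p, value in unnormalized.items()}

# Invented probabilities, for illustration only
likelihood = {
    "first half of the last century": {"collective farming": 0.02, "tractor": 0.01},
    "this century": {"collective farming": 0.001, "tractor": 0.005},
}
print(period_scores(["collective farming", "tractor"], likelihood))
```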

REFERENCES

Bettini, C., Dyreson, C. E., Evans, W. S., and Snodgrass, R. T. (1998). A glossary of time granularity concepts. Lecture Notes in Computer Science, 1399.

Jensen, C. and Snodgrass, R. (2006). Temporal Databases.

Kalczynski, P. J. and Chou, A. (2005). Temporal document retrieval model for business news archives. Inf. Process. Manage., 41(3):635–650.

Llavori, R. B., Cabo, M. J. A., and Barber, F. (1998). Discovering temporal relationships in databases of newspapers. In IEA/AIE '98: Proceedings of the 11th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, pages 36–45, London, UK. Springer-Verlag.

Llidó, D., Llavori, R. B., and Cabo, M. J. A. (2001). Extracting temporal references to assign document event-time periods. In DEXA '01: Proceedings of the 12th International Conference on Database and Expert Systems Applications, pages 62–71, London, UK. Springer-Verlag.

ICEIS 2008 - International Conference on Enterprise Information Systems
