SMART DOCUMENT TECHNOLOGIES FOR EXTRACTING

AND STRUCTURING DATA FROM PATIENT RECORDS

Opportunities for New Knowledge based Services

Denys Proux, Eric Cheminot, Caroline Hagege and Frederique Segond

Xerox Research Centre Europe, 6 Chemin de Maupertuis, 38240 Meylan, France

Keywords: Smart document technologies, Textual content processing, Structured information, Document conversion,

Risk assessment, Healthcare, Hospital acquired infections.

Abstract: Costs related to healthcare are exploding. In Europe studies show that they will reach in 2050 up to 10% to

15% of the Gross Domestic Product. Several causes drive up these costs such as ageing population or the

rise of chronic diseases. Adverse events jeopardizing patient safety in complex hospital workflows such as

Hospital Acquired Infections are also a major contributor to these costs. But recent studies suggest that it

might be possible to reduce this impact by 20 to 30% with a combination of appropriate processes and smart

monitoring tools. Information Technologies are therefore seen as a key enabler to help improving

information workflow and process efficiency inside hospitals.

1 THE EUROPEAN

HEALTHCARE MARKET

The healthcare system in Europe is currently

submitted to an extreme pressure. One explanation is

the fact the population is getting older (the amount

of people over 50 will increase by 35% between

2005 and 2050 and the amount over 85 will triple in

the same period) and require more attention and

cares. As a consequence, latest estimates foresee a

rising of the global cost up to 16% of the European

Gross National Product (GNP) until 2020 (currently

it represents only 9%). Information and

Communication Technology (ICT) which has

already allowed significant process and cost

improvements in other domains could be and will be

a key element to help keeping the healthcare budget

under control.

However to achieve this goal there is a clear

need for new, more efficient and Smarter tools for

proactively predicting diseases, for personalizing

healthcare and more generally to empower the

patient in his own safety and health.

Therefore challenges for ICT are many.

Solutions should address various needs on a stepped

way to electronic and structured information which

will eventually open the door for analytics,

simulation forecasting, prevention, etc. They range

from:

i. moving from paper or isolated data to electronic

information systems,

ii. moving from unstructured to structured

information,

iii. moving from isolated database to interconnected

information systems,

iv. finally building advanced services, such as

forecasting and risk assessment systems, on top of

this.

In this paper we focus on two aspects which are:

moving from paper (or unstructured information)

into structured electronic records, and leveraging

recorded information through smart and value added

services such as for example risk assessment tools.

In that perspective two projects will be detailed

where we have customized existing tools or created

entirely new technologies and functionalities to

address both parts of the workflow and value chain.

633

Proux D., Cheminot E., Hagege C. and Segond F..

SMART DOCUMENT TECHNOLOGIES FOR EXTRACTING AND STRUCTURING DATA FROM PATIENT RECORDS - Opportunities for New

Knowledge based Services.

DOI: 10.5220/0003184906330636

In Proceedings of the 3rd International Conference on Agents and Artiﬁcial Intelligence (ICAART-2011), pages 633-636

ISBN: 978-989-8425-40-9

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

2 SERVICES FOR MEDICAL

RECORDS MANAGEMENT

2.1 The Legacy Documents Conversion

Project: Context

Our team in our Research Group started a

collaboration with a business company developing a

document management solution for group of

hospitals in the middle-west of the UK. The deal is

to outsource the management of legacy patient

records. These documents are paper based, so the

project in its original form aims to convert this

unstructured information into an electronic format.

The considered corpus represents almost 1,000,000

patient records (that are generally composed by a

several documents coming from several departments

such as radiology, surgery, intensive care, etc). The

conversion is foreseen to be done on a 7 year period.

Several scenarios are considered ranging from

simple paper to digital conversion to more advanced

content analysis and indexing features built on top of

it.

2.2 Workflow and Implementation

The model that has been put in place is the “scan

when requested” scenario. Each time a new request

to access a given patient record is submitted all

patient related documents are retrieved and scanned.

Our contribution was to customize for that purpose a

set of Smart Document Technologies that automate

part of the document content indexing process in

order to extracted meta-data that could be attached

to these scanned documents in order to facilitate

future queries and document retrievals. This set of

technologies and their context of application is

detailed below.

Glyphstamps for Document separation:

Glyphstamps is a technology that consist in special

tags printed on documents that encode rich

information. It is used to identify information pieces

in the set of document related to a patient. They are

used to detect document boundaries but also to route

documents for appropriate post processing according

to their type.

FuzzyMatch: sometimes, a document category can

be defined (at least partly) by a set of keywords.

Unfortunately, when dealing with paper documents,

we first have to perform an OCR to extract the

content in a useful format. This generally leads to

OCR errors so the idea of the FuzzyMatch

technology is to be able to efficiently match

searched keywords in OCRed documents while

taking into account possible OCR errors (or

orthographic errors in the original document).

Distance with queried keywords is computed and

frequent errors are learned and taken into account.

Categorization: this technology is based on a model

previously trained on collections of already

annotated documents. The aim is to detect

automatically from document contents those that

match with predefined and pertinent categories. For

this project we use our own made categorizer that

learns from human annotations to build probabilistic

models It relies on a probabilistic generative model

(Gaussier 2002) which can be seen as a hierarchical

extension to PLSA (Probabilistic Latent Semantic

Analysis (Hofmann 1999)) where both documents

and words may belong to different categories.

Named Entity Detection: The goal of this step is to

identify in the text some specific information that

can be used to further enrich the indexing step. This

information can be for example people names, social

security numbers, locations, treatments, drug names,

bacteria,, etc. This step can be done by pattern

matching, but we found it more efficient to use a

deeper analysis with the Xerox Incremental Parser

(XIP) (Aït-Mokhtar 2002) which allows to design

more complex and efficient detection and

disambiguation rules as well a co-reference

management.

All these technologies work on electronic documents

which means that paper documents should be

scanned and OCRed first. However some documents

may also contain some handwritten parts which is

problem as up-to-now, automating document content

processing in this case is very difficult. Nevertheless

we developed and tested some new technologies and

algorithm such as Handwritten Keyword

Recognition to try to address this issue. This

technology works as a machine-learning image

categorization process: sample documents are pre-

processed to extract the bounding boxes of these

words (graphical boxes around words) and searched

– positive - words are annotated by an operator

(Perronnin 2009). The system is then trained to

recognize those words. We have reached very

promising results, but due to the extreme variability

of hand-writing (especially in the medical field), it

will be reserved for a limited number of words,

crucial for the process.

ICAART 2011 - 3rd International Conference on Agents and Artificial Intelligence

634

3 LEVERAGING INFORMATION

AND KNOWLEDGE

MANAGEMENT

3.1 Improvement Opportunities

Optimizing the document workflow in hospitals by

converting legacy documents into a digital version,

or unstructured into structured information to feed

databases is the very first step to unleash the power

of advanced services building on this information. It

opens the door to a new set of applications that can

provide the assistance of knowledge services thanks

to automated data mining, and information analysis

tools. Risk assessment is one example.

A project has been started in 2009 to develop an

Hospital Acquired Infection (HAI) detector

monitoring patient related reports produced inside

hospitals (ALADIN-DTH). This project is funded in

part by the French Research Agency (ANR) for 3

years. It is a unique collaboration between key

partners with unique competencies in all aspect

required to build such a system: an University

Laboratory specialized into building multi-

terminologies resources for the medical domain, a

content French provider for Pharmaceutical

information, a research Center specialized into

designing Natural Language Processing and

Semantic Management systems and 4 university

hospitals providing real data and HAI expertise.

3.2 Key Technology behind this Project

The heart of this project is the Xerox Incremental

Parser (XIP), which performs text mining. This

parser is robust that is to say it has already been used

in various projects to process large collections of

unrestricted documents. It has been designed to

follow strict incremental strategies when applying

parsing rules. The system never backtracks on rules

to avoid falling into combinational explosion traps

which makes it very appropriate to parse real long

sentences from scientific texts for example (Aït-

Mokhtar 1997).

We have decided to use such a tool as HAI is a

complex issue that implies for instant pieces of

evidences appearing according to a strict chronology

(e.g. a patient has a surgery, then 2 days latter some

symptoms occur such as fever, then some specific

antibiotics are prescribed, etc. ). To establish these

connections we need to have a certain level of

understanding of the content of the document,

simple keyword detection is not enough.

3.3 Addressing the Terminology Issue

To be able to detect pertinent information from text

it is crucial to be able to address all terminologies in

use in targeted hospitals. This is required to build

appropriate lexicons that encompass pertinent terms

characterizing an HAI or serving as pieces of

evidence to conclude to an HAI suspicious case. One

of the partners (CISMeF) provides this information:

the Multi Terminologies Indexer is a generic

automatic indexing tool able to tag an entire

document with all terminologies necessary for the

project. This server offer term identification

covering the following terminologies: SNOMED 3.5

(International Systemized Nomenclature of human

and veterinary MEDicine), MeSH (Medical Subject

Heading), ICD10 (Classification of Diseases) and

CCAM (French CPT), TUV (Unified Thesaurus of

Vidal), ATC (Anatomical Therapeutic and Chemical

Classification), drug names with international non-

proprietary names (INN) and brand names, Orphanet

terms (rare diseases), CIF (International Functional

Terminology), CISP2 (International Classification

for Primary care), DRC (Consultation results),

MedlinePlus.

3.4 Temporal Sequence Detections

In this project the chronology of events is crucial to

characterize a problem. Relying on an existing

temporal processing system (Hagege 2008) we

perform an adaptation of this system for French in

order to be able to detect and normalize temporal

expressions appearing in text. Three kinds of

temporal expressions are considered:

1) Absolute dates (for example 10/03/2010)

2) Referential dates with reference to the utterance

time. (for example “yesterday”)

3) Referential dates whose reference is another

textual expression (for example, “two days after

admission”).

Discovering and normalizing temporal

expressions enable us to associate a time stamp to

the event described in text. As a result, we can

associate to the description of a potential risk factor

for HAI a time stamp and check if it occurs after or

before another specific event.

3.5 Temporal Sequence Detections

Risk indicators consist in the description of infection

events such as mentioning a specific bacteria,

antibiotic, symptoms such as fever etc. All the

SMART DOCUMENT TECHNOLOGIES FOR EXTRACTING AND STRUCTURING DATA FROM PATIENT

RECORDS - Opportunities for New Knowledge based Services

635

elements must also occur according to a specific and

strict time frame. For instance, the simple presence

of a temperature rising is not enough to decide that

we have an HAI case. But if this temperature rising

is associated with the presence of a certain bacteria,

and with the presence of a catheter, and if this

temperature raising occurs at least 2 days after the

admission of the patient in the hospital, then we

have clues enough to suspect an HAI occurrence.

The relationship between risk indicators is carried by

the syntactic dependencies calculated by XIP.

The following figure illustrates this temporal

aspect of information extraction.

Figure 1: Analysis results for a given sentence.

3.6 Next Steps

The current version of the system, or to be precise,

current detections rules work at the sentence level.

This means that link between risk factors are search

inside the boundary of a sentence. But it can also

happen that elements characterizing an HAI are

distributed in different and distant section of the

medical report. It is therefore necessary to build long

distance dependency detection and a reasoning

algorithm able to work on a full patient report. These

risk scenarios are now refined with the help of

medical experts.

4 CONCLUSIONS AND BEYOND

In this paper we described the needs for ICT

solutions to help improving the management of

medical information to the benefits of both patient

safety and healthcare economy. In this context the

improvement of internal hospital processes and the

way information is managed can be a key

contributor to drive down these costs.

We have described in this paper two on-going

projects illustrating the benefit brought by Smart

Document Technologies to two major challenges

which are: the move from unstructured information

(e.g. paper) to structured EMR, and the development

of smart monitoring systems for risk assessment.

These are just examples.

They are many others fields of applications for

ICT such as remote monitoring of patients or

improvement of an hospital workflow through

assisted activity tracking and resources management.

We are just at the beginning of the e-revolution of

the traditional healthcare landscape.

ACKNOWLEDGEMENTS

ALADIN is a 3 year project funded by the French

Agence Nationale de la Recherche (National

Research Agency - ANR) in the context of the

TecSan (Technologies pour la Santé et l’Autonomie)

program.

REFERENCES

Aït-Mokhtar S., Chanod J. P., Roux C., 2002. Robustness

beyond shallowness: incremental deep parsing.

Cambridge University Press. Volume 8, Issue 3,

Pages: 121 - 144. June 2002.

Gaussier E., Goutte C., Popat K., Chen F., 2002. A

hierarchical model for clustering and categorising

documents. Proc. ECIR-02. 229-247. Springer.

Haas J. P., Mendonca E. A., Ross B., Friedman C., Larson

E., 2005. Use of Computerized Surveillance to Detect

Nosocomial Pneumonia in Neonatal Intensive Care

Unit Patients. Am J Infect Control, 2005; 33(8):439-

43.

Hagège C., Tannier X. XTM a robust Temporal Text

Processor. Proceedings of CICLing 2008, Haifa,

Israel.

Hofmann T., 1999. Probabilistic Latent Semantic

Analysis. Proc 15th Conf. on Uncertainty in Artificial

Intelligence. 289-296. Morgan Kaufmann.

Perronnin F., Rodriguez J. A., 2009. Handwritten word-

image retrieval with synthesized typed queries.

ICDAR 2009 (International Conference on Document

Analysis and Recognition). Barcelona, Sapin, July 26-

29, 20.

Output of temporal sequence processing (addition of a time

stamp):

La patiente a présenté , une semaine après son arrivée dans

le Service , une infection urinaire à Klebsielle ,

accompagnée d' une pneumopathie nosocomiale ( sans

germe retrouvé au combicath ) , pour lequel elle a bénéficié

d' un traitement pendant une semaine par Rocéphine et

Oflocet.

</TEMPS>

Output of potential candidates for HAI risk detection

ANTIBIOTIQUE(Rocéphine)

ANTIBIOTIQUE(Oflocet)

DISPOSITIF(combicath)

BACTERIE(Klebsielle)

INFECTION(infection urinaire)

INFECTION(pneumopathie)

GERME_INFEC(Klebsielle)

BACTERIA(Klebsiella)

ICAART 2011 - 3rd International Conference on Agents and Artificial Intelligence

636