OPTIMISING CLASSIFIERS FOR THE DETECTION OF
PHYSIOLOGICAL DETERIORATION IN PATIENT
VITAL-SIGN DATA
Sara Khalid, David A. Clifton, Lei Clifton and Lionel Tarassenko
Institute of Biomedical Engineering, Dept. of Engineering Science, University of Oxford
Old Road Campus, Roosevelt Drive, Oxford, OX3 7DQ, U.K.
Keywords: Novelty detection, Multi-class classification, SVM, MLP.
Abstract: Hospital patient outcomes can be improved by the early identification of physiological deterioration.
Automatic methods of detecting patient deterioration in vital-sign data typically attempt to identify devia-
tions from assumed “normal” physiological condition. This paper investigates the use of a multi-class ap-
proach, in which “abnormal” physiology is modelled explicitly. The success of such a method relies on the
accuracy of data annotations provided by clinical experts. We propose an approach to estimate class labels
provided by clinicians, and refine those labels such that they may be used to optimise a multi-class classifier for
identifying patient deterioration. We demonstrate the effectiveness of the proposed methods using a large
data-set acquired in a 24-bed step-down unit.
1 INTRODUCTION
Adverse events in patient condition are often pre-
ceded by physiological deterioration evident in vital-
sign data (Buist et al., 1999), and it is well-
understood that patient outcomes can be improved
by detecting this deterioration sufficiently early
(NPSA, 2007). Machine learning techniques, such as Parzen window estimators, have been shown to detect such physiological deterioration by analysing vital-sign data acquired from patient monitors connected to acutely ill hospital patients (Tarassenko, 2005).
A large number of manual methods (Smith, 2008)
have been developed to allow clinicians to identify
patient deterioration on the general ward, based on
periodically-collected vital sign data, typically ac-
quired every two to four hours.
Both manual and automatic methods typically
perform novelty detection (or one-class classifica-
tion), in which deviations from some assumed
“normal” behaviour are identified. This is a com-
mon approach to the condition monitoring of critical
systems (Tarassenko, 2009), for which large num-
bers of examples of “normality” exist, but where
there are too few examples of system
failure to construct a multi-class classifier, in which
known failure conditions are explicitly modelled.
However, should sufficient examples of patient
deterioration be available, a multi-class approach
may be taken. It is assumed a priori that, given
sufficient examples of system failure, a multi-class
classifier will outperform a one-class classifier due
to the inclusion of more information in the classifier
(Bishop, 2006). This paper describes an investiga-
tion in which a large data-set of patient vital-sign
data was acquired during a clinical study such that a
multi-class approach to identifying patient deteriora-
tion may be taken. For this to be successful, accu-
rate class labels for the data are required. We inves-
tigate the reliability of class labels provided by clini-
cians, and propose methods to (i) estimate and refine
those labels automatically, and (ii) use the resultant
labels to optimise a multi-class classifier for detect-
ing patient deterioration.
2 ESTIMATING CLINICAL LABELS
Obtaining accurate class labels for large, multivari-
ate data-sets of vital signs is particularly difficult.
The data-set considered in the investigation described by this paper, for example, is
four-dimensional, comprising heart rate (HR), breathing rate (BR), peripheral arterial oxygen saturation (SpO₂), and the arithmetic mean of systolic and diastolic blood pressures (the systolic-diastolic average, or SDA). It was acquired from 332 patients in a step-down unit and contains over 18,000 hours of patient data (Tarassenko, 2005). It is impractical for clinical experts to annotate such a large data-set in its entirety.
The approach taken with the data-set was to de-
termine retrospectively which periods of patient data
exceeded standard “medical emergency team”
(MET) criteria (Smith, 2008). The latter are stan-
dard thresholds on each vital sign that, if exceeded,
should result in the clinical review of the patient.
Periods of patient data that exceeded the MET crite-
ria for at least four minutes were shown to a panel of
clinicians, who then determined which periods were
due to artefact (such as a sensor becoming detached
from the patient), and which were sufficiently ab-
normal to require patient review. We give the latter periods the class label C₂ (“abnormal” patient condition), and refer to examples of “normal” patient condition as having class label C₁.
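To make this retrospective screening step concrete, the sketch below flags periods in which any vital sign lies outside a MET-style threshold continuously for at least four minutes. It is a minimal illustration only: the threshold values and the one-sample-per-minute sampling rate are assumptions for the example, not the criteria of (Smith, 2008) or the settings used in the study.

```python
import numpy as np

# Hypothetical MET-style limits (lower, upper) per vital sign; placeholder
# values for illustration only, not the criteria used in the study.
MET_LIMITS = {
    "HR":   (40.0, 140.0),    # beats per minute
    "BR":   (8.0, 30.0),      # breaths per minute
    "SpO2": (90.0, np.inf),   # percent (only a lower limit)
    "SDA":  (60.0, 130.0),    # mmHg (systolic-diastolic average)
}

def met_exceedance_periods(vitals, sample_period_s=60.0, min_duration_s=240.0):
    """Return (start, end) sample indices of periods in which any vital sign
    lies outside its MET limits continuously for at least `min_duration_s`.

    `vitals` is a dict of equal-length 1-D arrays, one per vital sign,
    sampled every `sample_period_s` seconds."""
    exceeds = np.zeros(len(next(iter(vitals.values()))), dtype=bool)
    for name, (lo, hi) in MET_LIMITS.items():
        x = vitals[name]
        exceeds |= (x < lo) | (x > hi)

    min_samples = int(np.ceil(min_duration_s / sample_period_s))
    periods, start = [], None
    for i, flag in enumerate(exceeds):
        if flag and start is None:
            start = i                              # run of exceedance begins
        elif not flag and start is not None:
            if i - start >= min_samples:
                periods.append((start, i))         # run long enough to keep
            start = None
    if start is not None and len(exceeds) - start >= min_samples:
        periods.append((start, len(exceeds)))
    return periods

# Example: a synthetic 30-minute record with a 6-minute tachycardic episode.
if __name__ == "__main__":
    n = 30
    rec = {
        "HR":   np.full(n, 80.0),
        "BR":   np.full(n, 16.0),
        "SpO2": np.full(n, 97.0),
        "SDA":  np.full(n, 95.0),
    }
    rec["HR"][10:16] = 150.0                       # exceeds the upper HR limit
    print(met_exceedance_periods(rec))             # -> [(10, 16)]
```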
As vital signs can take more extreme values
during periods of abnormal physiology, the distribu-
tions of data from periods labelled C₂ have heavier tails than the distributions of data from “normal”
patients. However, there is significant overlap be-
tween the distributions of data from the two classes.
Additionally, some types of physiological abnormal-
ity are not represented in the data-set as frequently
as other types; e.g., apnoea and bradycardia (low BR
and HR, respectively) are under-represented in com-
parison with tachypnoea and tachycardia (high BR
and HR, respectively). We found that this imbal-
ance leads to a linear classifier trained using such
data successfully classifying the majority of the
more well-represented tachypnoea and tachycardia
data, while misclassifying the under-represented
apnoea and bradycardia data.
Similarly, a non-linear classifier trained using the
original labels incorrectly includes the distribution
of bradycardia and apnoea data within its decision
boundary, and hence misclassifies test data from this
region of data space as belonging to class C₁.
The remainder of this paper investigates methods
for refining class labels C₁ and C₂, such that they
may be used to construct a multi-class classifier that
successfully classifies under-represented types of
“abnormal” data. We will illustrate the procedure
using bivariate analysis, such that the decision
boundary of a classifier may be examined. The
application to the full multivariate data-set (e.g., 4-
dimensional in this example) is considered in
Section 4.
3 REFINING CLINICAL LABELS TO OPTIMISE THE CLASSIFIER
The left-hand plot of Figure 1 shows all of the data
from the two classes in the bivariate space of HR
and BR. Clusters of C₂ data corresponding to apnoea, tachypnoea, bradycardia, and tachycardia may be seen in the figure, although data from class C₁ often overlap with those clusters. Intuitively, we
wish to increase the separation between the two
classes such that a classifier trained using those
labels results in a decision boundary that correctly
classifies data from all modes of class C₂. We pro-
pose a method for doing so using an estimate of the
probability density function (pdf) of the entire data-
set.
3.1 Defining a Multivariate Distribution to Estimate Labels
We approximated the pdf of the whole data-set using
a Parzen window estimator (Bishop, 2006), after
reducing the size of the data-set to 400 prototype
patterns using k-means clustering with k = 400 clus-
ter centres. The covariance σ² of the 400 kernels in the pdf was set using the heuristic proposed in (Bishop, 2006). Given some data-point x′, its density κ_x′ = p(x′) defines a contour on the pdf. We then define a probability P[κ_x′] as follows:

    P[\kappa_{x'}] = \int_{\{ x \,:\, \kappa_{x'} \le p(x) \le \kappa_m \}} p(x) \, dx    (1)
where κ_m = max[p], the density at the mode of the pdf p. Thus P[κ_x′] is the probability mass contained by integrating the pdf from its highest point down to the probability density contour κ_x′. This represents the probability that a random data-point x distributed according to p will take a density value higher than κ_x′; i.e., P[κ_x′] = P[p(x) ≥ κ_x′]. Thus, as x′ varies throughout the data space, its probability density will vary over the range [0, κ_m], and P[κ_x′] will correspondingly vary over the range [0, 1].
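A minimal sketch of this construction is given below. It reduces the data-set to 400 prototype patterns with k-means, places an isotropic Gaussian Parzen window on each prototype, and approximates P[κ_x′] by Monte Carlo sampling from the estimated pdf (the fraction of samples whose density is at least κ_x′). The helper names, the nearest-neighbour bandwidth rule, and the Monte Carlo approximation of the integral in (1) are assumptions for illustration; the study sets the kernel width with the heuristic in (Bishop, 2006).

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_parzen(X, n_prototypes=400, bandwidth=None, random_state=0):
    """Reduce X to prototype patterns with k-means and return (prototypes, sigma)
    for an isotropic Gaussian Parzen window estimator."""
    km = KMeans(n_clusters=n_prototypes, n_init=10, random_state=random_state).fit(X)
    prototypes = km.cluster_centers_
    if bandwidth is None:
        # Simple nearest-neighbour heuristic for the kernel width (an assumption;
        # the paper uses the heuristic described in Bishop, 2006).
        d2 = ((prototypes[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
        np.fill_diagonal(d2, np.inf)
        bandwidth = np.sqrt(np.median(d2.min(axis=1)))
    return prototypes, bandwidth

def parzen_pdf(x, prototypes, sigma):
    """Density of the Parzen estimate p(x) at each row of x (shape [n, d])."""
    m, d = prototypes.shape
    d2 = ((x[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1) / (m * norm)

def prob_mass_above(x, prototypes, sigma, n_samples=20000, rng=None):
    """Monte Carlo estimate of P[kappa_x] = P[ p(z) >= p(x) ] for z ~ p, i.e. the
    mass between the mode and the density contour passing through x."""
    rng = np.random.default_rng(rng)
    # Sample from the Parzen mixture: pick a prototype, then add Gaussian noise.
    idx = rng.integers(0, len(prototypes), size=n_samples)
    z = prototypes[idx] + rng.normal(scale=sigma, size=(n_samples, prototypes.shape[1]))
    dens_z = parzen_pdf(z, prototypes, sigma)
    dens_x = parzen_pdf(np.atleast_2d(x), prototypes, sigma)
    return (dens_z[None, :] >= dens_x[:, None]).mean(axis=1)
```

Data near the mode of the pdf have high density and therefore P close to 0, while data in the tails have low density and P close to 1, consistent with the behaviour described above.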
We define a threshold T on P[κ_x′], and consider which data have P[κ_x′] ≥ T for varying values of T.
Figure 1: Data from classes C₁ and C₂ shown in grey and black, respectively. The left-hand plot shows all data from classes C₁ and C₂. Data x′ from C₁ are shown where P[κ_x′] ≥ T, for T = 0, 0.5, and 0.9 from left to right, respectively, as described in Section 3.1.
As described above, we expect data that lie furthest from the mode of the distribution of the whole data-set to take the largest values of P[κ_x′], and hence as T is increased, the proportion of data that have P[κ_x′] ≥ T will decrease. This effect is shown in Figure 1, in which increasing the value of T causes fewer data to lie above the threshold.
This suggests that we could use such a threshold
T to estimate the clinical labels for “abnormal” data;
i.e., those from class C₂.
3.2 Optimising a Classifier using a Threshold on the Multivariate Distribution
In order to increase the separation between “normal”
and “abnormal” data used for training a classifier,
we can refine the C₁ and C₂ class labels provided by clinicians using the following rules:
i. Define the training set of “normal” data to be

    \{ x' \in C_1 \mid P[\kappa_{x'}] < T \}    (2)

ii. Define the training set of “abnormal” data to be

    \{ x' \in C_2 \mid P[\kappa_{x'}] \ge T \}    (3)
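A sketch of rules (2) and (3) is given below. It assumes that P[κ_x′] has already been computed for every training pattern (e.g., with the prob_mass_above helper sketched in Section 3.1) and that classes C₁ and C₂ are encoded as labels 0 and 1; the function name and the encoding are illustrative.

```python
import numpy as np

def refine_training_labels(X, y, P, T):
    """Apply rules (2) and (3): keep "normal" (C1, y == 0) training data with
    P < T and "abnormal" (C2, y == 1) training data with P >= T.

    X : [n, d] vital-sign patterns; y : original clinical labels (0 for C1,
    1 for C2); P : P[kappa_x'] for each pattern; T : threshold in [0, 1]."""
    X, y, P = np.asarray(X), np.asarray(y), np.asarray(P)
    keep_normal = (y == 0) & (P < T)     # rule (2)
    keep_abnormal = (y == 1) & (P >= T)  # rule (3)
    keep = keep_normal | keep_abnormal
    return X[keep], y[keep]
```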
In order to examine the effect of varying values of
the threshold T, 75% of the data from C₂ that obeyed the above selection criterion were drawn at random for training, together with an equal number of data drawn at random from class C₁. All remaining data from class C₂ were used as test data, together with an equal number of data randomly selected from the unused data from class C₁.
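The balanced split described above might be realised as in the following sketch, which reuses the 0/1 label encoding and refinement rules of the previous listing; drawing the “normal” training data from the refined C₁ set is an assumption consistent with rule (2). Note that only the training portion depends on T.

```python
import numpy as np

def balanced_split(X, y, P, T, train_frac=0.75, rng=None):
    """Balanced train/test split following Section 3.2: train on refined data,
    test on all remaining C2 data plus an equal number of unused C1 data."""
    rng = np.random.default_rng(rng)
    X, y, P = np.asarray(X), np.asarray(y), np.asarray(P)

    abn = np.flatnonzero((y == 1) & (P >= T))   # refined C2, rule (3)
    nrm = np.flatnonzero((y == 0) & (P < T))    # refined C1, rule (2)
    rng.shuffle(abn)
    rng.shuffle(nrm)

    n_train = int(train_frac * len(abn))
    train_idx = np.concatenate([abn[:n_train], nrm[:n_train]])

    # All remaining C2 data are test data, plus an equal number of unused C1 data.
    used = np.zeros(len(y), dtype=bool)
    used[train_idx] = True
    test_abn = np.flatnonzero((y == 1) & ~used)
    test_nrm = np.flatnonzero((y == 0) & ~used)
    rng.shuffle(test_nrm)
    test_idx = np.concatenate([test_abn, test_nrm[:len(test_abn)]])

    return (X[train_idx], y[train_idx]), (X[test_idx], y[test_idx])
```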
The procedure involving threshold T was used to
process the training data only, and thus results ob-
tained using the test data are independent of T, giv-
ing an accurate representation of the system’s per-
formance when classifying previously-unseen data.
This is required in order to allow a fair comparison
with classification performance obtained without
using the proposed method.
Figure 2 shows the misclassification rates
obtained when non-linear classifiers (multi-layer
perceptron with a single layer of hidden units, or
MLP, in this example) were compared over N = 50
experiments. We note that the results shown are based on the test data, and that the classifier architecture was selected using 10-fold cross-validation.
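A sketch of this comparison is shown below, with scikit-learn's MLPClassifier standing in for the single-hidden-layer MLP. The hidden-layer size and training settings are placeholders (the study selects the architecture by 10-fold cross-validation), and balanced_split is the helper sketched above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fp_fn_rates(y_true, y_pred):
    """False-positive rate (normal classified as abnormal) and false-negative
    rate (abnormal classified as normal), with C1 = 0 and C2 = 1."""
    fp = np.mean(y_pred[y_true == 0] == 1)
    fn = np.mean(y_pred[y_true == 1] == 0)
    return fp, fn

def evaluate_threshold(X, y, P, T, n_runs=50, hidden_units=10, seed=0):
    """Mean FP/FN rates over n_runs random balanced splits for one value of T."""
    rates = []
    for run in range(n_runs):
        (X_tr, y_tr), (X_te, y_te) = balanced_split(X, y, P, T, rng=seed + run)
        clf = MLPClassifier(hidden_layer_sizes=(hidden_units,),
                            max_iter=1000, random_state=seed + run)
        clf.fit(X_tr, y_tr)
        rates.append(fp_fn_rates(y_te, clf.predict(X_te)))
    return np.mean(rates, axis=0)   # (mean FP rate, mean FN rate)

# Sweep T from 0 to 1 and look for the value that balances the two error types:
# for T in np.linspace(0.0, 1.0, 11):
#     print(T, evaluate_threshold(X, y, P, T))
```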
It may be seen from the figure that, as the value
of the threshold T is increased from 0 to 1, the num-
ber of false-positive misclassifications decreases
while the number of false-negative misclassifica-
tions increases. This is because the “normal” training data (the refined version of class C₁) cover a larger locus as T increases. Conversely, the “abnormal” training data (the refined version of class C₂) cover a smaller locus as T increases. Thus, the resulting decision boundary of the classifier becomes less sensitive to abnormality with increasing T, because the classifier is trained with an increasingly large cluster of “normal” data and increasingly small clusters of “abnormal” data.
The figure shows that for T ≈ 0.4 these misclassifications are minimised, which represents the “optimal” value of the threshold T for processing the training data obtained from this data-set.
Similar results were obtained when applying the
proposed technique with a support vector machine
(SVM) classifier. Figure 3 shows the decision boundary obtained using an SVM with both the original C₁ and C₂ labels and the “refined” labels created using the procedure described previously.
It may be seen that the SVM decision boundary obtained without application of the proposed method covers large areas of data space that correspond to physiological deterioration; e.g., the region that corresponds to tachypnoea (at BR > 30 rpm) for HR values around 75 bpm. In comparison, the SVM decision boundary obtained after application of the proposed method more closely describes the locus of “normal” data, in the centre of the data space, and accurately separates the four modes of data space that correspond to apnoea, tachypnoea, bradycardia, and tachycardia.
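The SVM comparison could be set up as in the sketch below. An RBF-kernel SVC and its default parameters are assumptions (the text does not specify the kernel or its settings), and refine_training_labels is the helper sketched in Section 3.2.

```python
from sklearn.svm import SVC

def fit_svms(X_train, y_train, P_train, T=0.4, **svm_params):
    """Train one SVM on the original clinical labels and one on labels refined
    with threshold T, so the two decision boundaries can be compared (Figure 3)."""
    params = dict(kernel="rbf", C=1.0, gamma="scale")
    params.update(svm_params)

    svm_original = SVC(**params).fit(X_train, y_train)

    X_ref, y_ref = refine_training_labels(X_train, y_train, P_train, T)
    svm_refined = SVC(**params).fit(X_ref, y_ref)
    return svm_original, svm_refined

# The support vectors of each fitted model are available as `svm.support_vectors_`,
# and the boundary can be drawn by evaluating `svm.decision_function` on a grid
# over the (HR, BR) plane, as in Figure 3.
```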
Figure 2: Misclassification performance evaluated using
the test set, as threshold T is varied. Mean false-positive
(FP, shown by a solid line) and false-negative (FN, shown
by a dotted line) errors over a range of N = 50 experiments
for each value of T, with a confidence interval on each
value shown at ± 1 standard deviation from the mean.
Figure 3: Using original (left-hand plot) and refined (right-hand plot) C₁ and C₂ labels to construct an SVM classifier.
The SVM produces a decision boundary shown by the
black line, with support vectors indicated by circles.
4 DISCUSSION AND FUTURE WORK
We have presented a method of (i) estimating clini-
cal labels, and (ii) using the technique to refine ex-
isting labels such that a classifier may be trained that
better separates “normal” data from “abnormal”
data, when compared with classifiers that do not use
the proposed technique.
While this paper has presented the results of a
bivariate analysis, such that the decision boundaries
of classifiers may be examined and compared, the
procedure should be applied in the dimensionality of
the original data space; in the example considered by
this paper, the data-set is 4-dimensional (HR, BR,
SpO₂, and SDA). The thresholding procedure is
performed using the high-dimensional pdf, and so
the proposed method of estimating and refining
clinical labels is equally applicable to optimising
classifiers in the original high-dimensional data-
space.
A further advantage of the proposed method is
that, as described in Section 2, as the value of the
threshold T is increased, the distribution of data with
probability P exceeding that threshold tends towards
the distribution of data labelled as being class C₂ by clinicians. While there is no equal substitute for the
annotations of clinical experts, it is impractical for a
panel of experts to review 18,000 hours of continu-
ous data. As described in Section 2, the C
2
labels
were obtained as being a subset of those periods that
exceeded univariate MET criteria for periods of at
least four minutes, and so even these are not “gold
standard” labels of the entire data-set. However,
being able to estimate such labels is useful: the pro-
cedure can be applied to further, unlabelled data-sets
in order to estimate their class C₂ labels. Thus, it
may be possible to obtain automatically labelled
data-sets from large unlabelled data-sets, which
would previously have required the use of an unsupervised
classification approach, such as the one-class
method described in Section 1, and in (Tarassenko,
2005) and (Tarassenko, 2009).
REFERENCES
Buist, M. D., Jarmolowski, E., Burton, P. R., Bernard, S.
A., Waxman, B. P., Anderson, J., 1999: Recognising
Clinical Instability in Hospital Patients Before Cardiac
Arrest or Unplanned Admission to Intensive Care.
Med. J. Australia, Vol. 171, 8–9.
NPSA, 2007: National Patient Safety Agency: Safer Care for Acutely Ill Patients: Learning from Serious Incidents. Tech. Rep.
Tarassenko, L., Hann, A., Patterson, A., Braithwaite, E.,
Davidson, K., Barber, V., Young, D., 2005: Multi-
parameter Monitoring for Early Warning of Patient
Deterioration. Proc. 3rd IEE Int. Seminar on Medical
Applications of Signal Processing, London, 71-6.
Smith, G. B., Prytherch, D. R., Schmidt, P. E., Featherstone, P. I., 2008: Review and Performance Evaluation of Aggregate Weighted “Track and Trigger” Systems. Resuscitation, Vol. 77, 170-179.
Tarassenko, L., Clifton, D. A., Bannister, P. R., King, S.,
King, D., 2009: Novelty Detection. In: Staszewski,
W. J., Worden, K. (eds), Encyclopaedia of Structural
Health Monitoring, Wiley.
Bishop, C. M., 2006: Pattern Recognition and Machine
Learning. Springer-Verlag.