Event Clustering of Lifelog Image Sequence using Emotional and Image Similarity Features

Photchara Ratsamee, Yasushi Mae, Masaru Kojima, Mitsuhiro Horade, Kazuto Kamiyama and Tatsuo Arai

Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama-cho, Toyonaka, Osaka, Japan

Keywords:

Lifelog Image Clustering, Rank-order Distance based Clustering and High Variance Event.

Abstract:

Lifelog image clustering is the process of grouping images into events based on image similarities. Groups of images with low variance can already be clustered easily, but clustering images with high variance remains a problem. In this paper, we address the problem of high variance and present a methodology to accurately cluster images into their corresponding events. We introduce a new approach based on rank-order distance techniques that uses a combination of image similarity features and an emotional feature measured by a biosensor. We demonstrate that emotional features together with rank-order distance based clustering can be used to cluster groups of images with low, medium, and high variance. Experimental evidence suggests that, compared with the average clustering precision (65.2%) of approaches that consider only visual image features, our technique achieves a higher precision (85.5%) when emotional features are integrated.

1 INTRODUCTION

Lifelogging (Bush, 2001) is the concept of recording human daily activity using wearable sensors. For example, a head-mounted camera can be used to capture images, along with GPS for localization of the wearer. The concept of lifelogging, especially visual lifelogging, has been applied in many applications such as memory recall for Alzheimer's patients (Hodges et al., 2006), summaries of daily activities as a visual diary, and human attention analysis (Noris et al., 2011).

Since lifelog image sequences contain many images (2,000 images per day on average), they need to be organized into proper groups (events) so that users can retrieve them in the future. Images with the same content should be grouped together. Many practical techniques already address the image clustering problem in the video and general photo clustering domains. The conventional technique in video processing is based on scene change detection (Meng et al., 1995), which detects the boundary frames between two consecutive video shots. As described in many publications, e.g. (Yang and Cheng, 2012), shot boundary detection algorithms basically detect discontinuities in motion activity and changes in the pixel value histogram distribution between two consecutive frames. However, conventional scene change detection is not applicable to lifelog image sequences because images from a visual lifelog device are passively captured at much larger discrete time intervals, e.g. 1 frame every 30 s (about 0.03 fps), which is very low compared with video processing (video frame rates are usually 24-30 fps). Two consecutive images might therefore have totally different image content (a high variance event, as seen in Fig. 1).

Figure 1: (upper) Conceptual idea of the lifelog image clustering process, in which event segmentation (image clustering) is applied to a lifelog image sequence. (lower) Example of a lifelog image sequence obtained from a lifelog device. This kind of high variance image sequence is difficult to cluster using related features between consecutive frames.
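As a rough illustration of the histogram-based shot boundary detection described above, the following Python sketch flags a boundary whenever the intensity histograms of two consecutive frames differ strongly. It is only a minimal sketch and not the method of Meng et al. (1995); the function names, the number of bins, and the threshold are assumptions made for this example.

```python
def histogram(pixels, bins=8, max_value=256):
    """Normalized intensity histogram of a flat list of pixel values."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    total = float(len(pixels))
    return [c / total for c in counts]


def scene_changes(frames, threshold=0.5):
    """Return indices k where frame k differs strongly from frame k-1.

    The difference is the L1 distance between consecutive histograms,
    halved so that it lies in [0, 1]. The threshold is an assumed value.
    """
    hists = [histogram(f) for f in frames]
    boundaries = []
    for k in range(1, len(hists)):
        diff = 0.5 * sum(abs(a - b) for a, b in zip(hists[k - 1], hists[k]))
        if diff > threshold:
            boundaries.append(k)
    return boundaries


# Toy frames: two dark frames followed by two bright frames.
dark = [10, 20, 30, 25]
bright = [200, 220, 240, 250]
frames = [dark, dark, bright, bright]
print(scene_changes(frames))
```

With frames captured every 30 s, nearly every consecutive pair in a high variance event would exceed such a threshold, which is why this kind of detector tends to over-segment lifelog sequences.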


Ratsamee P., Mae Y., Kojima M., Horade M., Kamiyama K. and Arai T..

Event Clustering of Lifelog Image Sequence using Emotional and Image Similarity Features.

DOI: 10.5220/0004741206180624

In Proceedings of the 9th International Conference on Computer Vision Theory and Applications (VISAPP-2014), pages 618-624

ISBN: 978-989-758-003-1

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

Figure 2: Possible types of events in a lifelog image sequence: (a) low variance event, (b) medium variance event, and (c) high variance event.

Commercial visual lifelogging devices are available, such as Sensecam (Hodges et al., 2006), which has been used in many scenarios such as visual diaries and memory assistance. Hierarchical clustering (Kulić et al., 2011) is one approach that applies particularly well to the image clustering problem. However, the number of final clusters, i.e. the number of actual events, needs to be known before the clustering technique is applied. In the K-means or X-means clustering method (Blighe et al., 2008), the number of clusters is predefined. Later, Wang et al. (Wang and Smeaton, 2012) improved K-means clustering by utilizing PCA to identify the most appropriate number of clusters.

Image clustering in the lifelog domain also benefits from additional sensors that help the segmentation process. For example, Doherty et al. (Doherty et al., 2008; Doherty et al., 2007) investigated Sensecam and proposed an image clustering framework that combines multiple image features with temperature readings, light measurements, and accelerometer values to help the event segmentation process. In (Blighe et al., 2008), cellular data such as GSM data from a mobile phone was fused with image features. Conaire et al. (Conaire et al., 2007) use three features: block-based cross-correlation, accelerometer values, and the distribution of histogram and spatial information.

So far, images have been clustered based on the absolute distance computed from similarity features between consecutive images and data from additional sensors. However, many problems remain unsolved in the lifelog processing domain. Lifelog image sequences are not captured intentionally and are taken in uncontrolled environments. Three types of events, presented in Fig. 2, are expected to exist in a lifelog image sequence:

1. Low Variance Event. A group of images that usually have similar content. Examples are sitting in a room or using a computer.

2. Medium Variance Event. A group in which almost all images have similar content. Some images have different content but are not considered noise. An example of this type of event is cooking: the person usually focuses on the cooking area but sometimes needs to move to the refrigerator to fetch an ingredient.

3. High Variance Event. A group in which almost all images have different content but are considered to belong to the same event, for example, shopping or sightseeing.

We define similarity in terms of image features (i.e. texture and color), as detailed in Section 2.1. Many sub-clusters are formed when images with different content appear between similar images in medium to high variance events. Furthermore, image clustering is a highly subjective issue. When users cluster images by themselves, they consider the semantic meaning, feeling, or characteristics of the event.

The objective of this study is to automatically and accurately cluster lifelog images into events that closely match the groups of images clustered by the user. We propose a lifelog image clustering method based not only on image similarity features but also on emotional features. To detect human emotion, we utilize a wearable biosensor (Wagner et al., 2005) that the user wears alongside the camera. The wearable biosensor quantifies excitement by measuring physiological responses in skin conductance. All similarity features are integrated into an absolute distance. Finally, the rank-order distance based clustering algorithm (Zhu et al., 2011), which uses the degree of similarity, is applied for lifelog image clustering.

This paper is structured as follows: Section 2 presents our proposed lifelog image clustering using image similarity and emotional features. We describe the evaluation of our proposed method in Section 3. Section 4 describes the experiments and results of lifelog image segmentation. Finally, conclusions and future work are discussed in Section 5.

2 METHODOLOGY

We observed users while they manually clustered events from the enormous number of images to produce the ground truth. We noticed that users rely on their memory and feelings to group images into events. Based on this observation, we designed a method for lifelog image clustering based on image similarity and emotional features. The structure of the proposed lifelog image clustering is shown in Fig. 3.

EventClusteringofLifelogImageSequenceusingEmotionalandImageSimilarityFeatures

619

Figure 3: The structure of the proposed lifelog image clustering. Input: camera (image sequence) and biosensor (EDA signal). Feature extraction: SURF, color histogram, frame index, and emotion state. Rank-order distance based clustering then produces the event set.

Figure 4: The EDA data (µS) from the biosensor over time (s), corresponding to the image sequence, with segments (a) to (d) marked. Even when the images are visually different, the pattern of the EDA data provides a significant clue for image clustering.

2.1 Image Similarity Features

We extract three image similarity criteria: the number of Speeded Up Robust Features (SURF) (Bay et al., 2006) matching points, the color histogram intersection, and the frame number distance.

1. SURF Matching Points ($m_1$): A high number of SURF matching points directly implies similarity between images.

2. Color Histogram Intersection ($m_2$): Apart from the matching points, a high overlap of the color histograms also implies a high degree of similarity.

3. Frame Distance ($m_3$): Closer frames have a higher chance of containing similar content or similar scenes than frames that are further apart.
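The second and third criteria can be illustrated with a short Python sketch. The histogram intersection below is the standard sum of bin-wise minima; the mapping from frame distance to a similarity value (`frame_proximity` and its `scale` parameter) is a hypothetical choice for this example, since the exact mapping is not specified here. SURF matching is omitted because it requires a computer vision library.

```python
def histogram_intersection(h1, h2):
    """Overlap of two normalized histograms; 1.0 means identical distributions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def frame_proximity(i, j, scale=10.0):
    """Hypothetical similarity in (0, 1] that decays with frame-index distance."""
    return 1.0 / (1.0 + abs(i - j) / scale)


# Two toy 3-bin color histograms, each summing to 1.
h_a = [0.5, 0.3, 0.2]
h_b = [0.4, 0.4, 0.2]
print(histogram_intersection(h_a, h_b))  # close to 0.9
print(frame_proximity(0, 10))            # frames 10 indices apart
```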

2.2 Emotional Feature ($m_4$)

Image similarity features alone are not sufficient to cluster medium and high variance events. We therefore add a clue about human emotion, such as an exciting or relaxing pattern, for lifelog image clustering.

Different persons have different EDA characteristics and patterns. We use a Support Vector Machine (SVM) to learn the emotional state patterns presented in Fig. 5. For this work, we predefine four patterns of emotional states:

1. Relaxing: when the EDA data decreases from the high level to the low level.

2. Inactive: when the EDA data remains at the low level.

3. Exciting: when the EDA data increases from the low level to the high level.

4. Active: when the EDA data remains at the high level.

Figure 5: The classes of emotional state corresponding to the EDA signal in Fig. 4: (a) Relaxing, (b) Inactive, (c) Exciting, and (d) Active.

Figure 6: The SVM cascading classification of the emotional state (Inactive, Exciting, Relaxing, or Active) from input EDA data.
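To make the four patterns concrete, the following Python sketch labels a window of EDA samples from its start and end levels. It is a rule-based stand-in for illustration only, not the SVM classifier used in this work. The level bounds follow the values given in Section 4.1 (low: 0-2.5 µS, high: 4.5-10 µS); the function names and the "Undefined" fallback are assumptions.

```python
LOW_MAX = 2.5   # upper bound of the low EDA level in microsiemens (Section 4.1)
HIGH_MIN = 4.5  # lower bound of the high EDA level in microsiemens (Section 4.1)


def level(value):
    """Map one EDA sample to 'low', 'high', or 'mid' (between the two levels)."""
    if value <= LOW_MAX:
        return "low"
    if value >= HIGH_MIN:
        return "high"
    return "mid"


def emotional_state(window):
    """Label a window of EDA samples by the levels of its first and last sample."""
    start, end = level(window[0]), level(window[-1])
    if start == "high" and end == "low":
        return "Relaxing"
    if start == "low" and end == "low":
        return "Inactive"
    if start == "low" and end == "high":
        return "Exciting"
    if start == "high" and end == "high":
        return "Active"
    return "Undefined"  # assumed fallback for windows starting or ending mid-level


print(emotional_state([1.0, 3.0, 5.5]))  # low -> high
print(emotional_state([6.0, 6.2, 5.9]))  # stays high
```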

A pattern window is applied over the EDA signal for learning and classification. The window configuration is discussed in Section 4.1. Because the SVM classifier was originally designed for binary linear classification, multiple SVM classifiers are used to classify the emotional patterns. The decision tree of SVMs is shown in Fig. 6. Each classifier in the decision tree classifies only its target pattern. The first classifier separates the resting state from the others. Afterward, the excitement state and the getting-excited state are classified by the subsequent classifiers in turn. For accuracy, each person has their own calibration and training data.

Figure 7: (a) The problem of absolute distance. (b) Image sequence sorted by absolute distance.

Figure 8: Rank-order distance calculation.

To integrate the emotional feature with the image similarity features, we compare the emotional states of two images. If the emotional states are the same, we assign a probability of 1; otherwise, 0.5 is assigned. With this multi-classifier design, a new emotion pattern can easily be added to the decision tree.
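The cascade structure and the pairwise emotional score can be sketched in Python as follows. Each stage is a binary decision for one target pattern, so adding a new pattern amounts to appending one more stage. The stage predicates below are hypothetical threshold rules standing in for trained binary SVMs, and the stage order, helper names, and numeric thresholds are assumptions; only the 1 versus 0.5 scoring rule is taken from the text.

```python
def cascade_classify(window, stages, default):
    """Run binary classifiers in order; the first one that fires decides the label."""
    for label, is_match in stages:
        if is_match(window):
            return label
    return default


def emotion_similarity(state_a, state_b):
    """Score 1 when two images share an emotional state, 0.5 otherwise."""
    return 1.0 if state_a == state_b else 0.5


def mean(window):
    return sum(window) / len(window)


def slope(window):
    """Average change per sample; assumes at least two samples."""
    return (window[-1] - window[0]) / (len(window) - 1)


# Hypothetical stand-ins for trained binary SVMs.
stages = [
    ("Inactive", lambda w: mean(w) < 2.5 and abs(slope(w)) < 0.1),
    ("Active", lambda w: mean(w) > 4.5 and abs(slope(w)) < 0.1),
    ("Exciting", lambda w: slope(w) >= 0.1),
]

state_a = cascade_classify([1.0, 3.0, 5.0], stages, default="Relaxing")
state_b = cascade_classify([5.0, 3.0, 1.0], stages, default="Relaxing")
print(state_a, state_b, emotion_similarity(state_a, state_b))
```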

2.3 Rank-order Distance Clustering

In our work, we complement the image similarity features and the emotional feature with rank-order distance clustering to address the non-uniform distribution problem in dynamic events. Rank-order distance clustering was introduced in (Zhu et al., 2011) for a face tagging application. In this paper, we briefly explain how to apply it to lifelog image sequences. First, we use the three image similarity features and the one emotional feature to determine the absolute distance. The superposition principle is used to integrate all the features. A weighting factor $w_i$ is introduced for each measurement and feature. Hence, the absolute distance $d(k)$ is computed from the features ($m_1$ to $m_4$) at each sampling time by

$$d(k) = \sum_{i=1}^{4} w_i \frac{m_i(k)}{\eta_i}, \qquad (1)$$

where $m_i(k)$ is feature $i$ of frame $k$, and $\eta_i$ is the variance of feature $i$ in each event, used for normalization. Therefore, $d(k)$ ranges from 0 to 1. This technique can easily be extended when the number of features increases. After calculating the scores of all images in the event, we sort the images by absolute distance (Fig. 7(b)). As presented in Fig. 7(a), using only the absolute distance, as in scene change detection, might lead to failed clustering. This is why the rank-order distance, which considers the similarity between two images relative to the others, is applicable to this study.
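The feature integration can be sketched in Python by reading Eq. (1) as a weighted sum in which each feature is divided by its normalizer. This reading, the function name, and all numeric values below are assumptions for illustration; in particular, the weights are simply chosen to sum to 1 so that the result stays within the range from 0 to 1.

```python
def absolute_distance(features, weights, norms):
    """Weighted sum of normalized features for one frame."""
    if not (len(features) == len(weights) == len(norms)):
        raise ValueError("features, weights and norms must have equal length")
    return sum(w * m / n for m, w, n in zip(features, weights, norms))


# Hypothetical values for one frame: SURF matches, histogram intersection,
# frame proximity, and emotional score.
features = [40.0, 0.8, 0.5, 1.0]
norms = [100.0, 1.0, 1.0, 1.0]      # assumed normalizers bringing each feature into [0, 1]
weights = [0.25, 0.25, 0.25, 0.25]  # assumed equal weights summing to 1
d = absolute_distance(features, weights, norms)
print(d)  # close to 0.675
```

Adding a fifth feature only requires appending one entry to each of the three lists, which reflects the extensibility noted above.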

Rank-order distance based clustering is an iterative clustering algorithm that merges sub-clusters using a combination of a rank-order distance and a normalized distance. The clustering algorithm runs as follows:

1. Initially, we assign each image to its own cluster.

2. As shown in Fig. 8, two clusters ($C_i$ and $C_j$) are considered at a time. We therefore have two sequences ($S_i$ and $S_j$), which start from clusters $C_i$ and $C_j$ respectively.

3. Calculate the asymmetric rank-order distance $D_r(C_i, C_j)$, which is computed as

$$D_r(C_i, C_j) = \sum_{n=0}^{S_i(C_j)} S_j(f_i(n)), \qquad (2)$$

where $f_i(n)$ is the $n$-th cluster in $S_i$. Hence, $S_j(f_i(n))$ is the order of the cluster $f_i(n)$ in $S_j$. The symmetric rank-order distance $D_R(C_i, C_j)$ is computed as

$$D_R(C_i, C_j) = \frac{D_r(C_i, C_j) + D_r(C_j, C_i)}{\min(S_i(C_j), S_j(C_i))}. \qquad (3)$$

4. Calculate the normalized distance $D_N(C_i, C_j)$. First, we find $\phi(C_i, C_j)$, which is the average distance of the images in the two clusters to their top $K$ neighbors. $\phi(C_i, C_j)$ is defined as

$$\phi(C_i, C_j) = \frac{1}{|C_i| + |C_j|} \sum_{im \in C_i \cup C_j} \frac{1}{K} \sum_{k=1}^{K} d(im, f_{im}(k)), \qquad (4)$$

where $im$ refers to an image in the set $C_i \cup C_j$, $f_{im}(k)$ is the $k$-th nearest neighbor of $im$, $K$ is the number of nearest neighbors, and $|C_i|$ and $|C_j|$ are the numbers of images in $C_i$ and $C_j$. Finally, the normalized distance $D_N(C_i, C_j)$ is

$$D_N(C_i, C_j) = \frac{1}{\phi(C_i, C_j)} \, d(C_i, C_j). \qquad (5)$$

5. Merge any cluster pair for which $D_R(C_i, C_j) < T_R$ and $D_N(C_i, C_j) < T_N$, where $T_R$ and $T_N$ are constant thresholds.

6. Update the clusters and cluster distances, and repeat from step 2. The algorithm stops when no clusters can be merged.
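The six steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the cluster-level distance is taken to be the closest-pair distance, only one pair is merged per pass before all orders are recomputed, the per-image average in the normalized distance uses however many neighbors exist when fewer than K are available, and the input is assumed to contain at least two images that are not all identical. The function names and the toy data are hypothetical.

```python
def cluster_distance(dist, ci, cj):
    """Assumed cluster-level distance: closest pair of images."""
    return min(dist[a][b] for a in ci for b in cj)


def order_list(dist, clusters, i):
    """Sequence S_i: all cluster indices sorted by distance from cluster i,
    with cluster i itself at order 0."""
    return sorted(
        range(len(clusters)),
        key=lambda j: (j != i, cluster_distance(dist, clusters[i], clusters[j])),
    )


def rank_order(orders, i, j):
    """Asymmetric rank-order distance D_r(C_i, C_j)."""
    position_in_j = {c: n for n, c in enumerate(orders[j])}
    last = orders[i].index(j)  # order of C_j in S_i
    return sum(position_in_j[orders[i][n]] for n in range(last + 1))


def symmetric_rank_order(orders, i, j):
    """Symmetric rank-order distance D_R(C_i, C_j)."""
    numerator = rank_order(orders, i, j) + rank_order(orders, j, i)
    return numerator / min(orders[i].index(j), orders[j].index(i))


def normalized_distance(dist, ci, cj, k):
    """Normalized distance D_N(C_i, C_j)."""
    members = ci + cj
    total = 0.0
    for a in members:
        nearest = sorted(dist[a][b] for b in range(len(dist)) if b != a)[:k]
        total += sum(nearest) / len(nearest)
    phi = total / len(members)  # average distance to the top-k neighbors
    return cluster_distance(dist, ci, cj) / phi


def rank_order_clustering(dist, t_r, t_n, k):
    """Merge clusters until no pair satisfies both thresholds."""
    clusters = [[i] for i in range(len(dist))]  # step 1: one image per cluster
    merged = True
    while merged:
        merged = False
        orders = [order_list(dist, clusters, i) for i in range(len(clusters))]
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if (symmetric_rank_order(orders, i, j) < t_r
                        and normalized_distance(dist, clusters[i], clusters[j], k) < t_n):
                    clusters[i] = clusters[i] + clusters[j]  # step 5: merge the pair
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break  # step 6: recompute orders and start over
    return clusters


# Toy data: six images on a line, forming two well separated groups.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in positions] for p in positions]
events = rank_order_clustering(dist, t_r=10, t_n=1, k=2)
print(events)
```

On this toy input the sketch groups images 0 to 2 and images 3 to 5 into two events. The two groups have a small symmetric rank-order distance once they are the only clusters left, and it is the normalized distance test that keeps them apart, which illustrates why both thresholds are used together.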

EventClusteringofLifelogImageSequenceusingEmotionalandImageSimilarityFeatures

621

3 EVALUATION METHODS

We use the following labels to refer to the evaluated methods. Our proposed image clustering method (rank-order distance based clustering using a combination of image similarity and emotional features) is labeled ‘RODE’, and the same clustering using only image similarity features is labeled ‘ROD’. The scene change detection technique (Meng et al., 1995) is labeled ‘SC’, and ‘SCE’ when the emotional feature is added. As a comparison, we also implemented K-means clustering (Blighe et al., 2008), which is labeled ‘KM’, and ‘KME’ for K-means clustering with the emotional feature. Finally, the ground truth in this study is the image clustering produced by the user, labeled ‘USER’.

To evaluate the accuracy of each method, the image clustering results (‘ROD’, ‘KM’, ‘SC’) are compared with the image clustering from the user (‘USER’). Under direct comparison, each cluster should contain as many correct images as possible. In this work, we follow the evaluation framework based on similarity criteria described in (Zhu et al., 2011). The precision ($P_s$) of a clustering algorithm is measured by

$$P_s = \frac{C_{correct}}{C_{total}}, \qquad (6)$$

where $C_{correct}$ is the number of images clustered into the correct group and $C_{total}$ is the number of images in the USER clusters.
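The precision measure can be sketched in Python as follows. How an image is judged to be in the "correct group" is not spelled out here, so this sketch assumes a majority-vote rule: within each predicted cluster, the images carrying the most frequent USER label count as correct. The function name, the rule, and the toy labels are assumptions for illustration.

```python
from collections import Counter


def clustering_precision(predicted, user_labels):
    """Ps = C_correct / C_total, with 'correct' decided by majority vote per cluster."""
    correct = 0
    total = 0
    for cluster in predicted:
        votes = Counter(user_labels[image] for image in cluster)
        correct += votes.most_common(1)[0][1]  # size of the dominant USER event
        total += len(cluster)
    return correct / total


# Toy example: six images, two predicted clusters, USER labels per image.
predicted = [[0, 1, 2], [3, 4, 5]]
user_labels = {0: "cooking", 1: "cooking", 2: "shopping",
               3: "shopping", 4: "shopping", 5: "shopping"}
ps = clustering_precision(predicted, user_labels)
print(ps)  # 5 of the 6 images fall in the dominant USER event of their cluster
```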

4 EXPERIMENTS AND RESULTS

4.1 Experiment Setup

In our experiment, 6 participants wore a smartphone (an Android phone with an automatic capturing application) and a biosensor for a certain amount of time (3 hours on average). Each participant was recorded over a period of 3.5 weeks. There are 25,451 images in 253 logged events. The datasets range from daily life activities, for example using a laptop, watching TV, or shopping, to more special ones such as traveling and sightseeing. A sample lifelog event set is shown in Fig. 1. The proportions of low, medium, and high variance events are 41.2%, 23.2%, and 35.6% respectively. The lifelog image quality varies from high to low since the images are captured unintentionally. The implemented image clustering and evaluation methods run in MATLAB on a PC (2.50 GHz Xeon E5420 CPU, 4096 MB RAM, NVIDIA Quadro FX 1700 graphics card). The processing time for each frame and the evaluation process varies between 10 ms and 25 ms, depending on the number of SURF keypoints in the image stream and the number of pairs between each cluster and the top neighbors considered in the clustering process. The biosensor is time-synchronized with the images from the automatic capturing software.

In both the emotion recognition and the image clustering methods, several parameters have to be considered. To classify an event as a low, medium, or high variance event, we consider the average normalized absolute distance $d$ between consecutive frames in that event. The criteria are as follows:

1. Low Variance Event: when $d > 0.7$

2. Medium Variance Event: when $0.4 < d < 0.7$

3. High Variance Event: when $d < 0.4$
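These criteria translate directly into a small Python helper. The handling of the exact boundary values 0.4 and 0.7 is not specified by the criteria, so assigning them to the medium class is an assumption of this sketch, as are the function names.

```python
def average_distance(distances):
    """Mean of the normalized absolute distances between consecutive frames."""
    return sum(distances) / len(distances)


def variance_class(d_avg):
    """Classify an event by its average normalized absolute distance."""
    if d_avg > 0.7:
        return "low"
    if d_avg >= 0.4:
        return "medium"  # assumption: boundary values 0.4 and 0.7 count as medium
    return "high"


# Hypothetical per-frame distances of one event.
event_distances = [0.9, 0.8, 0.3, 0.85, 0.2]
d_avg = average_distance(event_distances)
print(d_avg, variance_class(d_avg))
```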

The window applied for SVM learning and classification is set to the interval 0-1.5 s. The low and high levels are set to 0-2.5 µS and 4.5-10 µS respectively. The dominant features in the SVM are the slope, the peak value, and the distance between peaks. For the ROD clustering algorithm, the rank-order distance threshold $T_R$, the normalized distance threshold $T_N$, and the number of top neighbors $K$ are set to 10, 1, and 8 in all experiments.

4.2 Experiment Results

In this section, we present the results and advantages of our proposed image clustering method. SC and KM image clustering are implemented to compare their clustering performance with that of our proposed ROD image clustering. The comparison of image clustering using only image similarity features is presented in Fig. 9(a), and the comparison using the combination of image similarity features and the emotional feature is presented in Fig. 9(b). As the number of images increases, the precision of each algorithm decreases when high variance events are processed and increases when low variance events are processed.

By integrating the emotional feature from the biosensor, all clustering techniques outperform their results obtained using only the visual features. Examples of images clustered by the RODE technique are presented in Fig. 10. Our proposed RODE clustering method achieves precision rates of 93.33%, 86.8%, and 76.5% for low, medium, and high variance events respectively. Other methods such as SC and KM clustering also achieve better results, with precision rates of 55.7% and 26.7% respectively in high variance event segmentation.

The quality of the proposed image clustering result can also be measured by the number of perfect clusters. As presented in Table 1, our proposed RODE method achieves the highest rate of perfect clusters (52.1%) compared with the others (44.7% for KME and 32.6% for SCE). The results confirm that the emotional relationship between the images in a cluster is a meaningful clue for the clustering process.

Figure 9: The precision ($P_s$, in %) versus the number of images (1,000 to 10,000) for scene change detection (SC), K-means (KM), and rank-order distance (ROD) clustering using (a) only image similarity features and (b) the emotional feature together with image similarity features.

Figure 10: Examples of image clustering using our proposed method (RODE image clustering) for a low variance event, a medium variance event, and a high variance event.

Table 1: The percentage of average precision (Ps) and perfect clusters (Pc) for each image clustering method.

| Image clustering method | SC | KM | ROD | SCE | KME | RODE |
|---|---|---|---|---|---|---|
| Ps (%) | 38.7 | 42.5 | 65.2 | 47.7 | 69.1 | 85.5 |
| Pc (%) | 19.1 | 22.7 | 33.0 | 32.6 | 44.7 | 52.1 |

5 CONCLUSIONS

We proposed a lifelog image clustering method based on image similarity and emotional features. The visual variance in a lifelog image sequence varies depending on how much the surrounding context changes. Image similarity features alone can cluster only low variance events. To handle the high variance of visual information, we introduced an emotional feature from a biosensor into the clustering process. The human emotional state (e.g. exciting or relaxing) together with the image similarity features is processed by rank-order distance based clustering to handle medium and high variance events. ROD image clustering using only image similarity features achieved an average precision of 65.2%. Our proposed RODE method achieved an average precision of 85.5%. In particular, the precision of medium and high variance event clustering improved from 59.1% to 86.8% and from 51.8% to 76.5% respectively.

ACKNOWLEDGEMENTS

We would like to thank the 6 student volunteers for data collection, ground truth labeling, and scoring, and the Computer Vision Lab, Vienna University of Technology (TU Wien), Austria, for the Ubiqlog software. Special thanks to Amornched Jinda-apiraksa, who provided valuable comments on this work.

REFERENCES

Bay, H., Tuytelaars, T., and Van Gool, L. (2006). SURF: Speeded up robust features. pages 404–417. Springer.

Blighe, M., O’Connor, N. E., Rehatschek, H., and Kienast, G. (2008). Identifying different settings in a visual diary. In Image Analysis for Multimedia Interactive Services, 2008. WIAMIS’08. Ninth International Workshop on, pages 24–27. IEEE.

EventClusteringofLifelogImageSequenceusingEmotionalandImageSimilarityFeatures

623

Bush, V. (2001). As we may think. pages 141–59.

Conaire, C. O., O’Connor, N. E., Smeaton, A. F., and Jones, G. J. (2007). Organising a daily visual diary using multifeature clustering. In Electronic Imaging 2007, volume 6506, pages 65060C–65060C–11. International Society for Optics and Photonics.

Doherty, A. R., Byrne, D., Smeaton, A. F., Jones, G., and Hughes, M. (2008). Investigating keyframe selection methods in the novel domain of passively captured visual lifelogs. In Proceedings of the international conference on Content-based image and video retrieval, pages 259–268. ACM.

Doherty, A. R., Smeaton, A. F., Lee, K., and Ellis, D. P. (2007). Multimodal segmentation of lifelog data. In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), pages 21–38. LE CENTRE DE HAUTES ETUDES INTERNATIONALES D’INFORMATIQUE DOCUMENTAIRE.

Hodges, S., Williams, L., Berry, E., Izadi, S., Srinivasan, J., Butler, A., Smyth, G., Kapur, N., and Wood, K. (2006). Sensecam: A retrospective memory aid. UbiComp: Ubiquitous Computing, pages 177–193.

Kulić, D., Takano, W., and Nakamura, Y. (2011). Towards lifelong learning and organization of whole body motion patterns. In Robotics Research, pages 87–97. Springer.

Meng, J., Juan, Y., and Chang, S.-F. (1995). Scene change detection in an MPEG-compressed video sequence. In IS&T/SPIE’s Symposium on Electronic Imaging: Science & Technology, pages 14–25. International Society for Optics and Photonics.

Noris, B., Barker, M., Nadel, J., Hentsch, F., Ansermet, F., and Billard, A. (2011). Measuring gaze of children with autism spectrum disorders in naturalistic interactions. In International Conference of Engineering in Medicine and Biology Society (EMBC), pages 5356–5359. IEEE.

Wagner, J., Kim, J., and André, E. (2005). From physiological signals to emotions: Implementing and comparing selected methods for feature extraction and classification. In International Conference on Multimedia and Expo, pages 940–943. IEEE.

Wang, P. and Smeaton, A. F. (2012). Semantics-based selection of everyday concepts in visual lifelogging. International Journal of Multimedia Information Retrieval, 1(2):87–101.

Yang, C. K. and Cheng, S. C. (2012). A novel algorithm for key frames selection. Applied Mechanics and Materials, 182:2025–2029.

Zhu, C., Wen, F., and Sun, J. (2011). A rank-order distance based clustering algorithm for face tagging. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 481–488. IEEE.

VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications

624