
 
without any consideration of the temporal context. 
In this work, we consider the use of the 
trajectories of local space-time interest points 
(STIPs) that correspond to points with significant 
local variation in both space and time, thus 
extending the approaches above which are limited to 
2D interest points. In fact, STIPs have proven to be strong features that give impressive results in real-world human action recognition tasks. Our motivation is that STIPs'
trajectories can provide rich spatio-temporal 
information about human activity at the local level. 
For sequence modeling at the global level, a suitable 
statistical sequence model is required.  
Hidden Markov Models (HMMs) (Rabiner, 
1989) have been widely used for temporal sequence 
recognition. However, HMMs make strong independence assumptions on the observations that are rarely met in human activity tasks.
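In standard notation, a first-order HMM factorizes the joint probability of an observation sequence $x_{1:T}$ and a state sequence $y_{1:T}$ as
\[
P(x_{1:T}, y_{1:T}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1})\, P(x_t \mid y_t),
\]
so each observation $x_t$ is assumed conditionally independent of all the others given the current hidden state $y_t$, an assumption that correlated, overlapping spatio-temporal features easily violate.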
Furthermore, generative models like HMMs often use a joint model to solve a conditional problem, thus spending modeling effort on the observations, which are fixed at inference time anyway. To overcome these
problems, Lafferty et al. (Lafferty et al., 2001) proposed a powerful discriminative model, the Conditional Random Field (CRF), for text sequence labeling. A CRF is a sequence labeling model that can incorporate long-range dependencies among observations. It assigns a label to each observation in a sequence, but it cannot capture the intrinsic sub-structure of the observations. To deal with this,
the CRF is augmented with hidden states that model the latent structure of the input domain, yielding the so-called Hidden CRF (HCRF) (Quattoni, 2004). This makes it better suited to modeling temporal and spatial variation in an observation sequence. Such a capability is particularly important as human activities usually consist of a sequence of elementary actions. However, HCRF training is time-consuming. To overcome this problem, we propose to combine the HCRF with a discriminative local classifier (e.g., an SVM). The local classifier predicts confidence scores for the activity labels from the input feature vectors, and we use these per-class confidence measurements as the input observations of the HCRF model. Assuming, as is usually the case, that the number of classes is significantly lower than the feature dimensionality, this strongly reduces the dimensionality of the observation space during HCRF training and inference while exploiting the high discriminative power of the SVM.
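The construction of these confidence observations can be sketched as follows (a minimal illustration with synthetic data and scikit-learn, not the implementation used in this work; the HCRF trainer itself is not shown since no standard library call is assumed):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, d = 4, 162                      # d = raw feature dimensionality
X = rng.normal(size=(500, d))              # synthetic per-frame feature vectors
y = rng.integers(0, n_classes, size=500)   # synthetic frame-level activity labels

# Local discriminative classifier with probabilistic outputs.
svm = SVC(kernel="rbf", probability=True).fit(X, y)

# Replace each d-dimensional observation of a sequence by the K-dimensional
# vector of SVM class confidences, with K = n_classes << d.
video = rng.normal(size=(30, d))           # one synthetic 30-frame sequence
confidences = svm.predict_proba(video)     # shape (30, n_classes)
# `confidences` is the low-dimensional observation sequence fed to the HCRF.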
To summarize, the first objective of this paper is 
to investigate the use of STIPs’ trajectories as 
activity descriptors. To the best of our knowledge, 
such a descriptor has not been addressed before in the literature. The second objective is to assess the discriminative power of the HCRF-SVM combination on a daily living activity recognition task. This
constitutes the second contribution of our work.  
The organization of the paper is as follows. 
Section 2 gives a brief description of local space-time features. The HCRF and its combination with the SVM are reviewed in Section 3. In Section 4, the databases used for the experiments are described, and the results are detailed and compared with the state of the art. Section 5 draws some conclusions and sketches future directions for this work.
2 LOCAL SPACE-TIME 
TRAJECTORIES 
Local space-time features capture structural and 
temporal information from a local region in a video 
sequence. A variety of approaches exist to detect 
these features (Wang et al., 2009). One of the most popular is the Space-Time Interest Point (STIP) detector proposed by Laptev et al. (Laptev et al., 2001), which extends the Harris corner detector to the space-time domain. The main idea is to find points that exhibit a significant change in both space and time.
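Concretely, in Laptev's formulation a spatio-temporal second-moment matrix is built from the scale-space gradient $\nabla L = (L_x, L_y, L_t)^{T}$ of the video smoothed at spatial scale $\sigma^{2}$ and temporal scale $\tau^{2}$:
\[
\mu = g(\cdot;\, s\sigma^{2}, s\tau^{2}) * \big(\nabla L\, (\nabla L)^{T}\big),
\]
and interest points are detected at positive local maxima of the extended corner function $H = \det(\mu) - k\, \operatorname{trace}^{3}(\mu)$, with $k \approx 0.005$.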
To characterize the detected points, histograms 
of oriented gradients (HOG) and histograms of optical flow (HOF) are usually computed inside a space-time volume
surrounding the interest point and used as 
descriptors.  
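For instance, in Laptev's reference implementation the surrounding volume is subdivided into a 3×3×2 grid of space-time cells, with a 4-bin HOG and a 5-bin HOF computed per cell, yielding 72- and 90-dimensional descriptors respectively (162 dimensions when concatenated).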
To provide a description at the video action 
level, one of the most popular methods is to represent each video sequence by a bag of words (BOW) over the HOG/HOF STIP descriptors. However, this
representation does not capture the spatio-temporal 
layout of detected STIPs. To overcome this 
limitation, a number of recent methods encode the 
spatio-temporal distribution of interest points. 
Nevertheless, these methods typically ignore the 
spatio-temporal evolution of each STIP in the video 
sequence. As mentioned above, some approaches have attained good results using the trajectories of 2D interest points, which are mainly adapted to the 2D spatial domain. In this section, we present our approach to activity representation based on the trajectories of STIPs (Figure 1), which are adapted to video data.
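For reference, the baseline BOW encoding discussed above can be sketched as follows (a minimal illustration with synthetic data; the vocabulary size of 100 is an arbitrary choice, and the HOG/HOF descriptors are assumed to be already extracted):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_desc = rng.normal(size=(5000, 162))   # pooled training descriptors, synthetic
vocab = KMeans(n_clusters=100, n_init=10).fit(train_desc)

def bow_signature(video_desc):
    # Quantize each descriptor to its nearest visual word, then return the
    # normalized word-count histogram: the orderless BOW video signature.
    words = vocab.predict(video_desc)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

signature = bow_signature(rng.normal(size=(300, 162)))  # one video's descriptors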
To construct our basic feature, we first extract 
STIPs from the video sequences. Then we track 
them with the Kanade-Lucas-Tomasi (KLT) tracker.
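A minimal sketch of this detect-then-track step using OpenCV's pyramidal KLT implementation is given below (our illustration, not the authors' code; "video.avi" is a hypothetical input, and cv2.goodFeaturesToTrack stands in for the STIP detector for brevity):

import cv2

cap = cv2.VideoCapture("video.avi")            # hypothetical input file
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Stand-in for STIP detection: any (N, 1, 2) float32 array of points works.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                              qualityLevel=0.01, minDistance=7)
tracks = [[p.ravel()] for p in pts]            # one trajectory per point

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: propagate every point to the current frame.
    new_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    for track, p, s in zip(tracks, new_pts, status.ravel()):
        if s:                                  # extend only successfully tracked points
            track.append(p.ravel())
    pts, prev_gray = new_pts, gray
# `tracks` now holds one (x, y) trajectory per detected interest point.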