The Influence of Labeling Techniques in Classifying Human Manipulation Movement of Different Speed

Sadique Adnan Siddiqui¹, Lisa Gutzeit¹ and Frank Kirchner¹,²

¹Robotics Research Group, University of Bremen, Bremen, Germany
²Robotics Innovation Center, DFKI, Bremen, Germany
Keywords:
Movement Recognition, Human Movement Analysis, k-Nearest Neighbor, Convolutional Neural Networks,
Extreme Gradient Boosting, Random Forest, Long Short-term Memory Networks, CNN-LSTM Network.
Abstract:
Human action recognition aims to understand and identify different human behaviors and to assign appropriate labels to each movement. In this work, we investigate the influence of labeling methods on the classification of human movements in data recorded with a marker-based motion capture system. The dataset is labeled using two different approaches: one based on video data of the movements, the other based on the movement trajectories recorded with the motion capture system. The data was recorded from one participant performing a stacking scenario comprising simple arm movements at three different speeds (slow, normal, fast). Machine learning algorithms, including k-Nearest Neighbor, Random Forest, the Extreme Gradient Boosting classifier, Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), and a combined CNN-LSTM network, are compared on their performance in recognizing these arm movements. The models were trained on movement segments recorded at slow and normal speed and evaluated on their generalization to fast-paced movements. All models trained on normal-paced data labeled using trajectories show an improvement of almost 20% in test accuracy compared to models trained on data labeled using videos of the performed experiments.
1 INTRODUCTION
Recognition of human actions is an active research
area utilizing both vision and non-vision based
modalities. Machine Learning and Deep Learning al-
gorithms have shown promising results in the identifi-
cation and understanding of human behaviors, which
is important to improve the collaboration between hu-
mans and robots in several applications. A major concern with these supervised learning methods is that their effectiveness depends on an ample amount of accurately labeled training data. Despite the need for large amounts of training data and for human supervision to label them, finding a robust alternative to supervised learning algorithms remains difficult in human action recognition.
Action classification tasks generally comprise four major phases: data acquisition, segment labeling, feature engineering, and finally training the classifier. The data can be acquired using different sensing modalities (for example, video streams, IMUs, or point clouds). If a sequence of several
movements is recorded, the data needs to be prepro-
cessed and segmented into smaller movement entities
and action labels need to be assigned to each segment.
Then, features have to be extracted from raw motion
data and normalized. Lastly, a classifier to recognize
and infer the actions needs to be trained. The pipeline
for the creation of the dataset and labeling strategies
used in this work is depicted in Figure 1. The data la-
beling phase is a tedious and time-consuming process
and poses a major constraint in the creation of a robust
action recognition dataset. Data recorded with RGB and RGB-D cameras is easy to obtain and provides rich appearance information, which makes the labeling task less complicated. Sensor data recorded with IMUs or a marker-based motion capture system, on the other hand, requires careful analysis of the time series to extract the stream of motions and assign a set of actions to it. There are
previous works that propose to facilitate the data annotation process for time-series data: Schröder et al. developed a tool that uses a database schema for annotating sensor data (Schröder et al., 2016), and Cruciani et al. proposed a heuristic function based on step count and GPS information for online supervised training (Cruciani et al., 2018).
Figure 1: Pipeline for the creation of a human action recognition dataset and classification approach. Inspired by (Zhang et al., 2019).

Another technique is to utilize few-shot learning methods, as presented in (Gutzeit, 2021), where small entities of human manipulation movements are detected with high accuracy using only 10 examples per class in the training data. Models trained on such a small dataset can then be generalized to new unlabeled data.
In this paper, the influence of different labeling
methods on the classification of human movements
is investigated. Human movement is recorded based
on a simple stacking scenario using a marker-based
motion tracking system that measures the 3D po-
sitions of the human arm. Additionally, videos of
the movements are recorded. After that, the record-
ings are labeled using two different methods. In the
first labeling approach, the stacking movements are
manually segmented using the video data. In the
second method, recorded movement trajectories are
automatically segmented into manipulation building
blocks characterized by a bell-shaped velocity pro-
file of the hand (Gutzeit et al., 2014), followed by
manual correction of wrongly segmented data. The
stacking movements are labeled carefully by exam-
ining the trajectory of the arm while performing the
experiments using a labeling tool developed in-house,
which visualizes the movement trajectories in 2D and
3D. Six algorithms widely used for human action recognition are compared with respect to their suitability for labeling the movements automatically: the k-Nearest Neighbor (KNN) classifier, the decision-tree-based Random Forest (RF) classifier, Extreme Gradient Boosting (XGBoost), Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), and a combined CNN-LSTM network. The models are trained and evaluated on movements recorded at different speeds in order to study the influence of labeling techniques on the feasibility of speed transfer in simple arm movements. This paper
is organized as follows: In section 2, an overview of
related work is given. In section 3, the feature ex-
traction and algorithms used for classification are de-
scribed, along with the evaluation approaches. The
data recording and the labeling procedure are described in section 4, and results are discussed in section 5. The paper concludes with the future scope of this work in section 6.
2 RELATED WORK
A large body of work exists on action recognition, involving vision-based and sensor-based modalities. Video streams provide rich spatial information, and when combined with temporal information, i.e., image frames over time, they can be beneficial for the identification of human actions. Starting
from the analysis of video streams, earlier works in-
volved the usage of handcrafted feature-based meth-
ods such as the position of skeleton joints for action
recognition tasks (Wang and Schmid, 2013). In re-
cent times, CNN-based approaches are quite popular
because of their benchmark results in computer vi-
sion problems and their ability to extract high-level
representation in deep layers. Donahue et al. intro-
duced the Long-term Recurrent Convolutional Net-
work (LRCN) consisting of a 2D CNN and LSTM
for extracting RGB features and predicting action la-
bels from each image (Donahue et al., 2015). Ji et al. captured motion information from several adjacent frames by extracting spatial- and temporal-dimension features through 3D convolutions on the videos (Ji et al., 2013). Vision-based modalities require an appropriately placed camera, which poses mobility issues and privacy risks. With the availability of low-cost sensors and activity trackers in smartphones, a shift away from cameras has been observed in action recognition tasks. O'Halloran et al. presented a comparison of Deep Learning models in human activity recognition
on the MHEALTH dataset recorded using a smart-
phone (O’Halloran and Curry, 2019). They com-
pared machine learning models on sensor-based data
utilizing supervised learning methods. Some recent works provide an alternative to using a fully labeled dataset and propose to train models with semi-supervised methods, such as label propagation, which require less labeled data. Cruciani et al. proposed a heuristic
function-based method for automatic labeling in an
online supervised training approach (Cruciani et al.,
2018). The algorithm generated weak labels by com-
bining step count and GPS information. Shamsipour
et al. addressed the issue of labeling videos by con-
sidering only a few frames depicting the information
of humans performing a particular activity (Sham-
sipour et al., 2017). They randomly selected three
video frames instead of employing all the frames, and
used a CNN for extracting features and an SVM for clas-
sifying actions from the conceptual features. How-
ever, the majority of the approaches in the literature
are applied to whole-body human movements, such
as walking, running, or sitting. In this work, we
compare the classifiers’ performance on movement
building blocks that can be found in natural and in-
tuitively performed movements and that can poten-
tially be transferred to a robotic system using learn-
ing from demonstration (Gutzeit et al., 2019a). To our
knowledge, no previous work analyzes the influence of labeling techniques on movement classification and examines the possibility of speed transfer in action movements.
3 METHODS
In this section, the features that are extracted from the
raw movement trajectories captured using a Qualisys
motion capture system are described, along with the
classification algorithms and the hyperparameter opti-
mization methods used for training those algorithms.
3.1 Feature Extraction
Feature extraction is an important procedure in train-
ing machine learning algorithms. A meaningful rep-
resentation of raw data can have a huge influence on
the performance of predictive models. In this work,
features are extracted from raw motion data in the
same manner as mentioned in (Gutzeit, 2021). The
data is recorded using a marker-based motion capture
system with several markers placed on the arm of the
subject, as shown in Figure 2. The marker positions are
transformed into a global coordinate frame with re-
spect to the markers placed on the back of the sub-
ject. The features extracted directly from the raw data are the markers' 3D positions, velocities, orientations, and the joint angles between them. The feature trajectories were interpolated to the same length and normalized to the range [0, 1].
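A minimal sketch of this interpolation and normalization step is given below; the function name, the target length of 100 samples, and the choice of linear interpolation are our assumptions for illustration, not taken from the original implementation.

    import numpy as np

    def preprocess_trajectory(traj, target_len=100):
        """Resample a (T, D) feature trajectory to a fixed length and
        scale each feature dimension to the range [0, 1]."""
        traj = np.asarray(traj, dtype=float)
        t_old = np.linspace(0.0, 1.0, num=traj.shape[0])
        t_new = np.linspace(0.0, 1.0, num=target_len)
        # Linear interpolation, applied per feature dimension
        resampled = np.column_stack(
            [np.interp(t_new, t_old, traj[:, d]) for d in range(traj.shape[1])]
        )
        # Min-max normalization per feature dimension; guard constant columns
        mins, maxs = resampled.min(axis=0), resampled.max(axis=0)
        return (resampled - mins) / np.where(maxs > mins, maxs - mins, 1.0)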
3.2 Classification Models
3.2.1 K-Nearest Neighbor
KNN classifies a sample based on its proximity to the already available training data. This proximity can be determined using distance metrics such as the Euclidean, Manhattan, or Minkowski distance. We use KNN
for comparison in this work because it performed ex-
ceptionally well on classifying human movement in
a small-sized training dataset (Gutzeit et al., 2019b).
The feature trajectories for each movement recording
are transformed into a 1-D feature vector. The clos-
est neighbor of each data sample is determined using
Euclidean distance, and the number of neighbors K
required to classify the test data is tuned using grid
search.
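A minimal scikit-learn sketch of this setup follows; the candidate values for K are illustrative, and X_train and y_train are assumed to hold the flattened feature vectors and class labels.

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # X_train: (n_samples, n_flattened_features), y_train: class labels
    knn = GridSearchCV(
        KNeighborsClassifier(metric="euclidean"),
        param_grid={"n_neighbors": [1, 3, 5, 7, 9]},  # illustrative K candidates
        cv=5,
    )
    # knn.fit(X_train, y_train)
    # y_pred = knn.predict(X_test)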
3.2.2 Random Forest
Random Forest is a bagging algorithm where random
bootstrap samples are drawn from the training data
and multiple decision trees are constructed. Each in-
dividual tree in the random forest outputs a class pre-
diction, and the overall prediction results of the model
are obtained by a voting approach on the individual
decision tree outputs (Ho, 1995). Due to the ran-
dom selection of training data and features, the con-
structed decision trees are independent of each other,
which makes the model resistant to overfitting prob-
lems. This improves its predictive performance and
generalization abilities on unseen data. Since the decision trees can be generated in parallel during training, the training process is sped up. In
this work, we have used a Random Forest classifier
with grid search to tune the hyperparameters, such as
the number of trees and the maximum depth of each
decision tree in the algorithm.
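A sketch of this grid search over the two named hyperparameters is shown below; the candidate ranges are illustrative assumptions, not the values used in the experiments.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Grid over the number of trees and maximum tree depth
    rf = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300, 500],
                    "max_depth": [None, 10, 20]},
        cv=5,
    )
    # rf.fit(X_train, y_train)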
3.2.3 Extreme Gradient Boosting
XGBoost (Chen and Guestrin, 2016) is one of the
most popular machine learning algorithms in recent
times, widely used on structured and tabular data.
XGBoost is a decision-tree-based ensemble algorithm
that utilizes a gradient boosting framework and effi-
ciently makes use of parallel processing and cache op-
timization for better speed and performance. Boost-
ing algorithms attempt to accurately predict the target variable by aggregating weak classifiers, which individually do not perform well but, when combined, can outperform the strongest individual learner. XGBoost sequentially adds classifiers, fitting each new model on the residual errors of the previous predictions and combining it with the previous trees to form the final prediction. It employs gradient descent to minimize the loss when introducing new models.
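The sketch below illustrates this setup with the xgboost library; the hyperparameter values shown are illustrative assumptions rather than the tuned values from the experiments.

    from xgboost import XGBClassifier

    # Each new tree is fit on the residual errors of the current ensemble;
    # learning_rate shrinks each tree's contribution to the final prediction.
    xgb = XGBClassifier(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=6,
        objective="multi:softprob",
    )
    # xgb.fit(X_train, y_train)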
3.2.4 Convolutional Neural Network
In the last half-decade, CNNs have been among the most popular variants of Neural Networks because of their substantial contributions and benchmark results in computer vision, natural language processing (Kim, 2014), and time-series-based problems (Ismail Fawaz et al., 2019). In contrast to images, CNNs applied to time series can be seen as a kernel sliding along only one dimension instead of two. In this
work, a very simple Convolutional Neural Network
is proposed, consisting of two 1D convolution layers followed by a pooling layer, a dense layer, and an output layer with a softmax activation. The number of filters in the convolution layers, the number of neurons in the dense layer, the learning rate, and the type of optimizer are tuned using Keras Tuner, as described in section 3.2.7.
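A Keras sketch of this architecture is given below; the filter count, kernel size, and dense-layer width are illustrative defaults, since in the experiments these are selected by Keras Tuner.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_cnn(timesteps, n_features, n_classes, filters=64, dense_units=100):
        """Two 1D convolution layers, pooling, a dense layer, softmax output."""
        model = keras.Sequential([
            layers.Input(shape=(timesteps, n_features)),
            layers.Conv1D(filters, kernel_size=3, activation="relu"),
            layers.Conv1D(filters, kernel_size=3, activation="relu"),
            layers.MaxPooling1D(pool_size=2),
            layers.Flatten(),
            layers.Dense(dense_units, activation="relu"),
            layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model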
3.2.5 Long Short-term Memory
LSTMs were introduced by (Hochreiter and Schmid-
huber, 1997) and are explicitly designed to avoid the
long-term dependency problem exhibited in Recur-
rent Neural Networks (RNNs). RNNs contain directed cycles between their units, i.e., they propagate data forward and also backward, and they can process arbitrarily long input sequences using their internal memory. However, RNNs suffer from exploding and vanishing gradients; LSTMs avoid these problems through architectural modifications, namely the cell state and its various gates, which are responsible for keeping essential information or forgetting it during training. In this work, we use a simple structure with two LSTM layers followed by a dropout layer and an output layer with a softmax activation.
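A sketch of this two-layer LSTM structure in Keras follows; the unit count and dropout rate are illustrative assumptions, as these hyperparameters are tuned in the experiments.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_lstm(timesteps, n_features, n_classes, units=64, dropout=0.5):
        """Two stacked LSTM layers, a dropout layer, and a softmax output."""
        model = keras.Sequential([
            layers.Input(shape=(timesteps, n_features)),
            layers.LSTM(units, return_sequences=True),  # pass sequence onward
            layers.LSTM(units),
            layers.Dropout(dropout),
            layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model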
3.2.6 CNN-LSTM Network
CNNs are known to be robust feature extractors, capable of creating informative representations of time series data. LSTM networks, on the other hand, perform well at extracting patterns from long input sequences. The combination of both networks has shown good results on challenging sequential datasets (Mutegeki and Han, 2020). The CNN-LSTM model used here comprises two 1D convolution layers with ReLU activation functions, followed by a 1D max-pooling layer and a flattening layer that formats the feature maps so that they can be consumed by the LSTM layer. The flattened feature maps are then fed into an LSTM layer, followed by a dense layer and an output layer with a softmax activation. We also use a dropout layer, in which randomly selected neurons are ignored during training; this helps the network generalize better and makes it less likely to overfit the training data. The number of layers, the learning rate, and the type of optimizer are tuned using Keras Tuner, as described in section 3.2.7. All the mentioned Deep Learning models are trained with the sparse categorical cross-entropy loss function for 20 epochs using the Adam optimizer. Early stopping halts training if the accuracy on a validation dataset has not increased within the last n epochs, where n is called the patience value; this prevents the model from overfitting the training data. The experiments were automatically tracked using the Weights & Biases tool (Biewald, 2020).
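Below is a sketch of one common way to realize this flatten-then-LSTM architecture in Keras, splitting each trajectory into subsequences so the CNN output can be flattened per subsequence before the LSTM. The subsequence reshaping, layer sizes, and the patience value n=3 are illustrative assumptions, not the tuned configuration.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_cnn_lstm(n_steps, substeps, n_features, n_classes):
        """CNN applied per subsequence, flattened and fed to an LSTM."""
        model = keras.Sequential([
            layers.Input(shape=(n_steps, substeps, n_features)),
            layers.TimeDistributed(
                layers.Conv1D(64, kernel_size=3, activation="relu")),
            layers.TimeDistributed(
                layers.Conv1D(64, kernel_size=3, activation="relu")),
            layers.TimeDistributed(layers.MaxPooling1D(pool_size=2)),
            layers.TimeDistributed(layers.Flatten()),  # format maps for LSTM
            layers.LSTM(64),
            layers.Dropout(0.5),
            layers.Dense(100, activation="relu"),
            layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # Early stopping with patience n (here n=3, an illustrative value)
    early_stop = keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=3, restore_best_weights=True)
    # model.fit(X_train, y_train, epochs=20,
    #           validation_split=0.2, callbacks=[early_stop])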
3.2.7 Hyperparameter Optimization
Hyperparameter tuning plays a vital role in the per-
formance of machine learning algorithms. Hyper-
parameters can have a significant impact on model
training in terms of model accuracy, training time,
and computational requirements. In this work, Keras Tuner (O'Malley et al., 2019), which integrates seamlessly with TensorFlow 2, is used for tuning the hyperparameters required to train the deep learning models. Tuning methods require defining a search space consisting of the hyperparameters to be optimized and their ranges. The hyperparameters optimized in this work include the number of filters in the convolution layers, the number of units in the LSTM layers, the number of neurons in the dense layer, and the choice of optimizer.
For each of the above-mentioned parameters, a range of possible values is provided. Keras Tuner supports different search heuristics (such as random search, Hyperband, and Bayesian optimization) that help find the best values of the defined hyperparameters to enhance model performance. In this work, a Bayesian optimization search strategy is utilized for the optimization of hyperparameters. It eliminates the problem of choosing combinations of
different hyperparameters randomly, which can sometimes result in poor combinations that fail to improve model accuracy. Instead of choosing random combinations,
the Bayesian optimization strategy chooses the best
possible hyperparameters based on the model perfor-
mance of previous combinations. It constructs a prob-
abilistic representation of the performance of a given
Machine Learning algorithm, which is modeled using
Bayesian inference and a Gaussian process.
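The following sketch illustrates defining such a search space and running Bayesian optimization with Keras Tuner; the input shape, search ranges, and trial budget are illustrative assumptions (only the 8 output classes come from the dataset described in section 4.2).

    import keras_tuner as kt
    from tensorflow import keras
    from tensorflow.keras import layers

    def model_builder(hp):
        """Search space over filters, dense units, and optimizer choice."""
        model = keras.Sequential([
            layers.Input(shape=(100, 10)),  # (timesteps, features): assumed
            layers.Conv1D(hp.Int("filters", 32, 128, step=32),
                          kernel_size=3, activation="relu"),
            layers.MaxPooling1D(pool_size=2),
            layers.Flatten(),
            layers.Dense(hp.Int("dense_units", 50, 200, step=50),
                         activation="relu"),
            layers.Dense(8, activation="softmax"),  # 8 movement classes
        ])
        model.compile(optimizer=hp.Choice("optimizer", ["adam", "rmsprop"]),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    tuner = kt.BayesianOptimization(model_builder,
                                    objective="val_accuracy",
                                    max_trials=20)
    # tuner.search(X_train, y_train, epochs=20, validation_split=0.2)
    # best_model = tuner.get_best_models(num_models=1)[0]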
3.3 Evaluation Approach
The six algorithms described in section 3.2 are com-
pared on the movement recordings labeled using
two different methods as mentioned in section 4.1.
The classifiers are trained on the data comprising
arm movements performed by the subject at a slow
and normal speed and evaluated on the
movements performed by the subject at a fast speed.
The main focus of the experiments was to investigate
the impact of labeling techniques on the generaliza-
tion ability of the model for speed transfer in the fast-
paced movement. These movements were appropri-
ate for evaluating the trained model because segmen-
tation of the fast-paced movements into basic action
movements is more challenging compared to slower-
paced movements and requires precise tracking of the
subject’s arm position. The training data was divided
into 5 folds using stratified cross-validation. In each
split, the evaluation is performed on the unseen test
data. Finally, the mean generalization accuracy with its standard deviation and the F1 score are reported. The dataset was completely balanced, with an equal number of samples for each class.
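A sketch of this evaluation protocol is given below; it reflects our reading of the text (train on stratified folds of the slow/normal data, evaluate each fold's model on the fast-paced set), and the macro-averaged F1 is an assumption.

    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score
    from sklearn.model_selection import StratifiedKFold

    def evaluate_speed_transfer(make_model, X_train, y_train, X_fast, y_fast):
        """Train on stratified folds of slow/normal-speed data and report
        mean/std accuracy plus mean F1 on the fast-paced test set."""
        accs, f1s = [], []
        for fold_idx, _ in StratifiedKFold(n_splits=5).split(X_train, y_train):
            model = make_model()
            model.fit(X_train[fold_idx], y_train[fold_idx])
            pred = model.predict(X_fast)
            accs.append(accuracy_score(y_fast, pred))
            f1s.append(f1_score(y_fast, pred, average="macro"))  # assumed macro
        return np.mean(accs), np.std(accs), np.mean(f1s)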
4 EXPERIMENTAL DATA
4.1 Stacking Scenario Data
The experiment was conducted on a single subject
and movements were recorded with a Qualisys mo-
tion tracking system that uses infrared light reflect-
ing markers. Additionally, the performed movements
were recorded using a video camera. Markers were
attached to the right hand, elbow, shoulder and back
of the subject, as shown in Figure 2. The marker
positions were tracked with 7 Qualisys cameras and
data was recorded at 500Hz. The subject was asked
to perform a basic stacking movement as shown in
Figure 3 where bricks of different colors were placed
on fixed positions on the table and the participant was
asked to place the bricks in the middle of the table
Figure 2: Stacking-scenario setup. Positions of markers at-
tached on the arm and the back of the subject are recorded
using a camera based motion tracking system.
by stacking it one by one. The experiment was per-
formed by arranging the bricks in different stacking
order: the green brick was always kept at the bot-
tom while the other bricks (red, blue, yellow) were ar-
ranged in different permutations. Thus, overall, 6 dif-
ferent stacking orders were recorded with three repeti-
tions of each stacking order. The movement for stack-
ing the bricks was recorded at three different speeds
(slow, normal, fast) for all 6 stacking orders. The nor-
mal and slow speed were intended to provide a com-
fortable speed for placing bricks from their respective
position to the middle of the table one over another,
while the fast speed challenged the participant. There
were many instances where the bricks were not suc-
cessfully placed over one another due to the fast arm
movement, resulting in bricks tripping over the table.
4.2 Labeling Techniques
The movement data was decomposed in two different ways into 8 classes (middle2front,
front2middle, middle2left, left2middle, middle2right,
right2middle, middle2down, down2middle) based on
the position from where the bricks were supposed
to be picked and placed. In the first labeling tech-
nique, the data was segmented using the video recorded during the experiment, which was synced precisely with the data recorded by the Qualisys motion tracker.
This labeling method was quite tedious and time-consuming, as one has to carefully identify the video frames in which the subject picks and places the bricks. Although tracking the movements in the data recorded at slow and normal pace was quite precise, labeling the data recorded at a fast pace was very challenging. In the second labeling
technique, the arm movement data was automatically
segmented using a velocity-based probabilistic
segmentation presented in (Gutzeit et al., 2014) into
basic movement units with a bell-shaped velocity profile.
The unnecessary trajectories were removed, and
essential segments required for training the model
were annotated using a labeling tool developed at our
institute, which visualizes the movement trajectories
in 2D and 3D. Using this tool, inaccurate segment
boundaries of the automatic segmentation approach
were corrected. After labeling the data using both
methods, the dataset consisted of 144 arm movements
each for slow and normal paced recordings and 192
arm movements for fast-paced recordings; in total, 480 labeled movement sequences were available.
Figure 3: Stacking-scenario setup. The left image shows the
different positions of the bricks on the table, and the right
image shows one of the stacking examples. The bricks were
stacked at the position of the cross.
4.3 Complexity of the Dataset
In this section, we compare the structure and diver-
sity of the movement recordings labeled using the above-mentioned techniques. To understand the overlap between the different classes in the dataset, t-SNE (t-distributed Stochastic Neighbor Embedding), introduced by (van der Maaten and Hinton, 2008), is used to explore a set of points in a high-dimensional space by transforming it into a lower-dimensional space. In Figure 4, we can see that data from the stacking scenarios labeled using the movement trajectories is less complex and separated more clearly, whereas at least 4 classes overlap in the segments labeled using videos.
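A minimal sketch of producing such a plot with scikit-learn follows; the perplexity value and color map are illustrative assumptions, and X and y are assumed to hold the flattened movement features and integer class labels.

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_tsne(X, y):
        """Project flattened movement features to 2D; color by class label."""
        embedded = TSNE(n_components=2, perplexity=30,
                        random_state=0).fit_transform(X)
        plt.scatter(embedded[:, 0], embedded[:, 1], c=y, cmap="tab10", s=10)
        plt.xlabel("t-SNE dimension 1")
        plt.ylabel("t-SNE dimension 2")
        plt.show()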
5 RESULTS AND DISCUSSIONS
In this work, the generalization ability of different models (KNN, Random Forest, XGBoost, CNN, LSTM, and the CNN-LSTM model) to data recorded at different speeds is compared using two different labeling strategies. The main aim was to demon-
strate the influence of labeling methods in movement
speed transfer within human movements.

Figure 4: t-SNE plots of the stacking scenarios recorded at a fast pace. (a) Data labeled using videos. (b) Data labeled using movement trajectories. Each action movement label is identified by a different color.

The classifiers are trained on data consisting of movements
recorded at a slow and normal pace and tested on
data of fast-paced movements. The results
of the generalization capabilities of the classifiers are
shown in Figure 5. As we can see from the plots, all
the classifiers generalize much better to the fast-paced movements when the training data is labeled using movement trajectories. The results illus-
trate the significance of labeling strategies and their
impact on classification accuracy irrespective of the
choice of classifier. For the models trained on normal movements labeled using trajectories, there is an improvement of almost 20% in accuracy for all models except the LSTM, which shows an accuracy difference of approximately 35%. The CNN is the best perform-
ing model with a mean accuracy of 98% and F1-score
of 97.7% for normal movements labeled using tra-
jectories. For the model trained on slow movements
and labeled using trajectories, KNN was the best per-
forming model with an accuracy of 99% and an F1-
score of 98.5%, and all models except XGBoost have
an accuracy difference of approximately 16% or more. Precise labeling of the recordings
from videos requires meticulous tracking of the arm
by the labeling person, and human errors in identifying the exact frames from the videos are likely. This could explain the lower accuracy on data labeled using videos. As the dataset con-
sists of movements derived from simple stacking sce-
narios, distance-based algorithms performed consid-
erably better and were almost equivalent to CNNs in
performance. The LSTM classifiers fail to generalize
well, but providing more data by employing augmen-
tation techniques and training for more epochs could
further enhance the results.
Figure 5: (a) and (c): Accuracy and F1 score comparison for models trained on slow movements. (b) and (d): Accuracy and F1 score comparison for models trained on normal movements.
6 CONCLUSION AND FUTURE
WORK
In this paper, we studied the impact of annotation
quality on the classification accuracy for data consist-
ing of basic human movements. Two different label-
ing strategies have been proposed and six different
Machine Learning and Deep Learning models were
compared. The possibility of speed transfer using models trained on data labeled with these two strategies was examined. We found that fast-paced movements are better recognized with data labeled using the trajectories of the recorded movements. The best results were achieved with k-Nearest Neighbor and CNNs, reaching accuracies of 99% and 98% for models trained on slow and normal-paced movements, respectively.
For future work regarding the movement recog-
nition models, self-supervised methods that have shown promising results in computer vision and NLP could be explored, with the possibility of leveraging such networks for sensor data in human action recognition. Although self-supervision does not completely eliminate the need for labeled data, it learns useful representations from unlabeled data, which can then be fine-tuned on a small number of labeled samples. Further, more focus
could be given to the explainability of the traditional machine learning methods. This would help prevent model bias and could lead to a better understanding of how a model works. Shapley values (Lundberg and Lee, 2017) and LIME (Ribeiro et al., 2016) can give a rough idea of the features that most strongly influenced the performance of the classifiers. Model interpretability has huge prospects in the AI community; one could compare the models used in this work based on their interpretability to get a better understanding of these black-box models for human action recognition tasks.
Regarding the influence of the labeling tech-
niques, the experiments conducted in this paper were
performed with just one subject on simple movement
data. For a deeper investigation of this influence,
the movements of more subjects and more complex
movements should be analyzed. Furthermore, not
only the labeling technique but also the experience of
the person labeling the data should be taken into ac-
count. However, the first small study presented in this
paper already shows that accurately segmented data
could significantly improve the movement classifica-
tion accuracy.
ACKNOWLEDGEMENTS
This work was supported through a grant of the Ger-
man Federal Ministry for Economic Affairs and En-
ergy (BMWi, FKZ 50 RA 2023).
REFERENCES
Biewald, L. (2020). Experiment tracking with weights and
biases. Software available from wandb.com.
Chen, T. and Guestrin, C. (2016). XGBoost: A scalable
tree boosting system. In Proceedings of the 22nd
ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, KDD ’16, pages
785–794, New York, NY, USA. ACM.
Cruciani, F., Cleland, I., Nugent, C., McCullagh, P., Synnes,
K., and Hallberg, J. (2018). Automatic annotation
for human activity recognition in free living using a
smartphone. Sensors, 18(7).
Donahue, J., Hendricks, L., Guadarrama, S., Rohrbach, M.,
Venugopalan, S., Darrell, T., and Saenko, K. (2015).
Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634.
Gutzeit, L. (2021). A comparison of few-shot classification
of human movement trajectories. In Proceedings of
the 10th International Conference on Pattern Recog-
nition Applications and Methods (ICPRAM-2021),
February 4-6, Austria, pages 243–250. SciTePress.
Gutzeit, L., Fabisch, A., Petzold, C., Wiese, H., and Kirch-
ner, F. (2019a). Automated robot skill learning from
demonstration for various robot systems. In KI 2019:
Advances in Artificial Intelligence. German Confer-
ence on Artificial Intelligence (KI-2019), Septem-
ber 23-26, Kassel, Germany, LNAI, pages 168–181.
Springer.
Gutzeit, L., Otto, M., and Kirchner, E. A. (2019b). Simple
and robust automatic detection and recognition of hu-
man movement patterns in tasks of different complex-
ity. In Physiological Computing Systems. Springer.
Gutzeit, L., Schröer, M., Metzen, J. H., and Kirchner, E. A.
(2014). Velocity-based multiple change-point infer-
ence for unsupervised segmentation of human move-
ment behavior. In Proceedings of the 22nd Inter-
national Conference on Pattern Recognition. Inter-
national Conference on Pattern Recognition (ICPR-
2014), 22nd, August 24-28, Stockholm, Sweden, pages
4564–4569. IEEE.
Ho, T. K. (1995). Random decision forests. In Proceedings
of 3rd international conference on document analysis
and recognition, volume 1, pages 278–282. IEEE.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural computation, 9(8):1735–1780.
Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L.,
and Muller, P.-A. (2019). Deep learning for time series
classification: a review. Data Mining and Knowledge
Discovery, 33(4):917–963.
Ji, S., Xu, W., Yang, M., and Yu, K. (2013). 3d convolu-
tional neural networks for human action recognition.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 35(1):221–231.
Kim, Y. (2014). Convolutional neural networks for sentence
classification. In Proceedings of the 2014 Confer-
ence on Empirical Methods in Natural Language Pro-
cessing, EMNLP 2014, October 25-29, 2014, Doha,
Qatar, A meeting of SIGDAT, a Special Interest Group
of the ACL, pages 1746–1751.
Lundberg, S. and Lee, S. (2017). A unified approach to in-
terpreting model predictions. CoRR, abs/1705.07874.
Mutegeki, R. and Han, D. S. (2020). A cnn-lstm approach
to human activity recognition. In 2020 International
Conference on Artificial Intelligence in Information
and Communication (ICAIIC), pages 362–366.
O’Halloran, J. and Curry, E. W. J. (2019). A comparison
of deep learning models in human activity recognition
and behavioural prediction on the mhealth dataset. In
AICS.
O’Malley, T., Bursztein, E., Long, J., Chollet, F., Jin,
H., Invernizzi, L., et al. (2019). Kerastuner.
https://github.com/keras-team/keras-tuner.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). ”why
should I trust you?”: Explaining the predictions of any
classifier. CoRR, abs/1602.04938.
Schröder, M., Yordanova, K., Bader, S., and Kirste, T.
(2016). Tool support for the online annotation of sen-
sor data.
Shamsipour, G., Shanbehzadeh, J., and Sarrafzadeh, H.
(2017). Human action recognition by conceptual fea-
tures.
van der Maaten, L. and Hinton, G. (2008). Visualizing data
using t-SNE. Journal of Machine Learning Research,
9:2579–2605.
Wang, H. and Schmid, C. (2013). Action recognition with
improved trajectories. In 2013 IEEE International
Conference on Computer Vision, pages 3551–3558.
Zhang, W., Zhao, X., and Li, Z. (2019). A comprehensive
study of smartphone-based indoor activity recognition
via xgboost. IEEE Access, 7:80027–80042.