Authors: Alexis Gabadinho; Gilbert Ritschard; Matthias Studer and Nicolas S. Müller
Affiliation: University of Geneva, Switzerland
Keyword(s): Categorical sequence data, Representativeness, Dissimilarity, Discrepancy of sequences, Summarizing sets of sequences, Visualization.
Related Ontology Subjects/Areas/Topics: Artificial Intelligence; BioInformatics & Pattern Discovery; Clustering and Classification Methods; Data Reduction and Quality Assessment; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Mining High-Dimensional Data; Symbolic Systems; Visual Data Mining and Data Visualization
Abstract:
This paper is concerned with the summarization of a set of categorical sequence data. More specifically, the problem studied is the determination of the smallest possible number of representative sequences that ensure a given coverage of the whole set, i.e. that together have a given percentage of sequences in their neighborhood. The goal is to yield a representative set that exhibits the key features of the whole sequence data set and permits easy, sound interpretation. We propose a heuristic for determining the representative set that first builds a list of candidates using a representativeness score and then eliminates redundancy. We also propose a visualization tool for rendering the results and quality measures for evaluating them. The proposed tools have been implemented in TraMineR, our R package for mining and visualizing sequence data, and we demonstrate their efficiency on a real-world example from the social sciences. The methods are nonetheless by no means limited to social science data and should prove useful in many other domains.
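
The two-step heuristic outlined in the abstract (rank candidates by a representativeness score, then drop redundant ones until the requested coverage is reached) can be illustrated with a minimal Python sketch. This is not the TraMineR implementation. The function names, the Hamming dissimilarity, the neighborhood-density score, and the `radius` and `coverage` parameters are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ (assumed dissimilarity)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def representative_set(
    seqs: Sequence[str],
    radius: int,
    coverage: float,
    dist: Callable[[str, str], int] = hamming,
) -> Tuple[List[int], float]:
    """Greedy sketch: return indices of chosen representatives and the coverage reached."""
    n = len(seqs)
    if n == 0:
        return [], 0.0
    # Pairwise dissimilarities between all sequences.
    d = [[dist(a, b) for b in seqs] for a in seqs]
    # Assumed representativeness score: number of sequences within `radius`.
    density = [sum(1 for j in range(n) if d[i][j] <= radius) for i in range(n)]
    # Step 1: candidate list sorted by decreasing score (ties keep input order).
    candidates = sorted(range(n), key=lambda i: -density[i])
    chosen: List[int] = []
    covered = set()
    for i in candidates:
        if len(covered) >= coverage * n:
            break  # requested coverage reached
        # Step 2: skip a candidate lying in the neighborhood of a chosen representative.
        if any(d[i][r] <= radius for r in chosen):
            continue
        chosen.append(i)
        covered.update(j for j in range(n) if d[i][j] <= radius)
    return chosen, len(covered) / n


seqs = ["AAAA", "AAAB", "AABA", "BBBB", "BBBA", "CCCC"]
reps, reached = representative_set(seqs, radius=1, coverage=0.8)
print([seqs[i] for i in reps], round(reached, 3))
```

In this toy input, two representatives cover five of the six sequences; the isolated sequence `CCCC` is left uncovered because the 80% target has already been met.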