Ensemble Clustering based Semi-supervised Learning for Revenue Accounting Workflow Management

Tianshu Yang (1,2), Nicolas Pasquier and Frederic Precioso

Université Côte d’Azur, CNRS, I3S, France

Amadeus, Sophia-Antipolis, France

Tianshu.Yang@amadeus.com, Nicolas.Pasquier@univ-cotedazur.fr, Frederic.Precioso@univ-cotedazur.fr

Keywords: Ensemble Clustering, Consensus Clustering, Closed Sets, Multi-level Clustering, Semi-supervised Learning,

Amadeus Revenue Management, Revenue Accounting, Anomaly Corrections.

Abstract: We present a semi-supervised ensemble clustering framework for identifying relevant multi-level clusters,

regarding application objectives, in large datasets and mapping them to application classes for predicting the

class of new instances. This framework extends the MultiCons closed sets based multiple consensus clustering

approach but can easily be adapted to other ensemble clustering approaches. It was developed to optimize the

Amadeus Revenue Management application. Revenue Accounting in the travel industry is a complex task when

travels include several transportations, with associated services, performed by distinct operators and on

geographical areas with different taxes and currencies for example. Preliminary results show the relevance of

the proposed approach for the automation of Amadeus Revenue Management workflow anomaly corrections.

1 INTRODUCTION

Amadeus is the leading provider of IT solutions to the

global travel and tourism industry. Amadeus creates

solutions that enable airlines, airports, hotels,

railways, search engines, travel agencies, tour

operators and other stakeholders to operate and

improve travel management worldwide.

Revenue Accounting refers to the process of

managing and dispatching to the different suppliers

involved the amount collected from customers'

payments for their travel. This process involves multiple

successive treatments of the input data, represented

as a ticket calculation code sequence for each travel.

The Amadeus Revenue Management application

helps customers perform revenue accounting. It

consists of a sequence of modules, each one

performing a computation from its input and sending

its output to the next module, that generates the

different amounts related to a journey and the

different travels it involves: Calculation of fees,

commissions and taxes, proration between

transportation operators, etc. This sequence of

modules, referred to as the Revenue Management

Workflow, is illustrated in Figure 1. The first stage of
the process validates the input data. Next, amounts
are prorated to travel coupon level. Then, taxes, fees,
charges and other values are calculated based on these
prorated coupon amounts and local government laws.
Finally, the accounting module checks whether the amounts
are balanced, i.e., whether credit equals debit, to avoid
calculation errors.

Figure 1: Example Revenue Management Workflow and Error Tasks Raised by Anomalies.
[Diagram: the input Ticket Calculation Code Sequence passes through the Data Validation, Proration, Tax Calculation, Accounting and Output Validation modules to produce the Output Status; modules can raise an Error Task described by an ID, a Type and the BOM before and after the module.]

This process involves complex management

constraints and is automated unless an error occurs.

Errors are defined by domain experts and refer to

situations where the input and/or the output of a

module is abnormal. Such anomalies are identified by

comparing the Business Object Model (BOM) values

before and after each module to generate Error Tasks

described by their associated Error Ticket. During

each module computation, one or several anomalies,

such as an incorrect amount computed due to

erroneous values in input, can occur.

The main limitation of the current Error Tasks

handling system is that each task is treated as

independent, even if similar errors have already been

corrected. The analysis of 2 000 sample tasks from

the Task Handling Module has shown up to 40%

similar tasks. This results in a significant waste of

effort, and machine learning techniques are

considered to help decrease the costs and time spent

on similar error tickets that currently require

individual correction.

The application of machine learning techniques

thus aims to improve the automation of the error

correction process with the automatic identification

of anomaly patterns in the Amadeus Revenue

Management workflow, and the automatic or

semi-automatic correction of the error, depending on

the type of the anomaly pattern. This application

involves the two main steps described hereafter.

The first step is the identification of relevant

anomaly patterns, i.e., error distinctive features,

through the unsupervised classification, or clustering,

of error tickets to form clusters of tickets

Figure 2: Clustering Anomaly Pattern Correction Tasks.

corresponding to similar anomalies and requiring

similar correction processes. Error tickets containing

information about the transportation coupons of a

travel are then grouped into clusters, each

corresponding to a type of anomaly, e.g., a tax calculation or a

proration anomaly, as illustrated in Figure 2.

The second step is the learning of the correction

processes associated with each cluster of tickets, by the

analysis of logs of correction actions taken by the

correctors, for the automation of the error correction

processes. By this analysis, automatic processes of

anomaly correction can be defined for each type of

error pattern corresponding to a cluster of error

tickets. As illustrated in Figure 3, these correction

processes can require the intervention of the end-user.

The article is organized as follows. Section 2

reviews the central issues of classical clustering

approaches for semi-supervised learning and the most

recent algorithmic developments to address these

Figure 3: Learning of Correction Processes for Anomaly Pattern Classes.

issues. In Section 3, we present the proposed semi-

supervised framework and the technical and scientific

challenges addressed during its development. Section

4 concludes the article.

2 CLUSTERING ALGORITHMS

Clustering, or unsupervised classification, is the

computational process that aims to discover clusters

(groups) of instances in a dataset. A cluster is a set of

instances, e.g., individuals, that are as similar as

possible to one another and as different as possible

from the instances of other clusters with respect to

their features, represented as variable values. See

(Fahad et al., 2014), (Kriegel et al., 2009) and (Xu

and Tian, 2015) for surveys of clustering algorithms.

Algorithmic Configuration Choice Issue. Different

algorithmic configurations, i.e., a specific algorithm

with a specific parameterization, can provide

different clustering solutions. Indeed, each algorithm

relies on a particular assumption about the distribution

model of instances in the data space, and each

parameterization defines a way to put this model into

practice. The quality of the resulting clustering

depends on the extent to which both suit the properties

of the analysed data space, as studied in (Xu and

Wunsch, 2005) and (Hennig, 2016).

Cluster Internal Validation Issue. A distinctive

characteristic of clustering applications regarding

machine learning issues is the absence of initial prior

knowledge about the data space properties and of

labelled, i.e., annotated, data to help choose an

algorithmic configuration that is appropriate for the

analysed dataset.

Moreover, the problem of choosing an adequate

algorithmic configuration and obtaining a meaningful

clustering is exacerbated by the current difficulty of

objectively assessing the quality of the resulting

clusters. Although several internal validation measures have

been proposed, each measure also relies on a specific

assumption on the distribution model of instances in

the data space and can thus overrate clustering results

of algorithms based on the same model, e.g., centroid

or density based. See (Dalton et al., 2009), (Halkidi

et al., 2001), (Rendón et al., 2011) and (Tomasini et

al., 2017) for studies on clustering validation

measures.

Cluster Characterization to Application Class

Issue. The objective of the characterization of

clusters to application classes is to connect consensus

clusters and application classes, e.g., accounting

anomaly correction classes, so that each cluster is as

representative as possible, i.e., distinctive in the

data space, of an application class. This procedure

implies the development of semi-supervised

algorithmic solutions combining unsupervised

internal validation of consensus clusters and

supervised external validation of consensus clusters

based on Amadeus business metrics. See (Färber et

al., 2010), (Halkidi et al., 2001) and (Xiong and Li,

2014) for theoretical and experimental comparisons

and studies on internal and external validation

measures.

2.1 Multi-level Clustering

The use of clustering techniques in this context aims

to discriminate the application classes according to

their properties in the data space, and potentially

refine them by distinguishing different sub-classes of

a class according to the different modeling properties

of each cluster in the data space. In the context of the

Amadeus Revenue Management workflow, clusters

can distinguish sub-classes of predefined anomaly

correction processes and overlapping clusters can

also distinguish correction action sequences that are

common to several classes of anomaly correction

processes. Indeed, one class of correction process can

correspond to several error ticket clusters, and each

cluster can correspond to several correction process

classes.

Multi-level Clustering generates a hierarchical

decomposition of clusters, where a cluster at a level

in the hierarchy can be decomposed into several

smaller clusters in the sub-levels of the hierarchy.

Such a hierarchical clustering can provide a relevant

framework for the identification of correction process

classes and sub-classes as illustrated in Figure 4 in

which the proration correction process is divided into

two sub-classes corresponding to two sub-clusters in

the data space (Färber et al., 2010).

Correction Process Class Prediction Model. Once

the most relevant multi-level clusters have been

identified, regarding internal and external validations,

their evaluation by the user is based on the statistical

and analytical exploration of cluster structures,

properties and relationships in the data space and their

adequacy for the application through business-related

criteria.

The validated clusters are then characterized by

the analysis of discriminative features regarding

internal and external validation results to identify

features that distinguish each of them in the data

space and to rank them from a business application

perspective.

Figure 4: Detection of Ticket Anomaly Correction Classes and Sub-classes Based on Clusters.

A comparative analysis of the characterizations of

clusters, to identify the features that distinguish each

cluster from the others in the data space, is then

performed to learn a class prediction model of

instances. In the Amadeus Revenue Management

workflow context, the learned classifier aims to

predict error ticket classes and sub-classes for

automating the learned correction processes and

provide the corrector with indicators and potential

external references to support and optimize the

correction process of the ticket anomaly.

2.1.1 Ensemble Clustering

Ensemble Clustering, or Consensus Clustering,

approaches combine multiple clustering results,

called base clusterings, each generated by a different

algorithmic configuration, for generating more robust

consensus clusters corresponding to agreements

between base clusters.

Existing ensemble clustering approaches can be

classified into the four following categories:

• Approaches considering the clustering ensemble problem as a clustering of categorical data.
• Approaches based on the generation of an instance co-association matrix depicting the number of assignments of each pair of objects to the same cluster in a clustering solution.
• Approaches that rely on the generation of a cluster association matrix based on the number of objects that were commonly assigned to the clusters in a clustering solution.
• Approaches that consider the problem as a graph, or hypergraph, partitioning problem.

However, these approaches have some limitations
in this context. Indeed, most of them require the user
to define the number of clusters to generate prior to
the execution, and approaches based on instance-to-instance
relationships require the generation of large
association matrices (of size N² for N instances), which
is infeasible for very large datasets (e.g., millions of
objects) due to the space and time complexities of the
matrix computation and manipulation.
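To make the instance co-association matrix concrete, the following minimal Python sketch counts, for each pair of instances, the number of base clusterings that assign both instances to the same cluster. The function name `co_association` and the toy label lists are illustrative assumptions and are not taken from any of the cited approaches.

```python
def co_association(base_clusterings):
    """Return an N x N matrix whose entry (i, j) is the number of base
    clusterings that put instances i and j in the same cluster."""
    n = len(base_clusterings[0])
    matrix = [[0] * n for _ in range(n)]
    for labels in base_clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1
    return matrix


# Toy ensemble (assumed data): three base clusterings of four instances,
# each given as the list of cluster labels assigned to the instances.
base_clusterings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
matrix = co_association(base_clusterings)
for row in matrix:
    print(row)
```

The nested list holds N × N counters, which illustrates why this representation becomes impractical for millions of instances.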

Once a consensus clustering is generated, its

quality is evaluated using an internal validation

measure based on the analysis of the properties of

clusters in the clustering solution relative to the

clusters in all the base clusterings. This evaluation is

usually based either on the Adjusted Rand Index

(ARI) measure or on the Normalized Mutual

Information (NMI) measure, both of which assess the quality of

the resulting clustering by its average similarity with

all base clusterings.
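As an illustration of this evaluation, the sketch below computes the Adjusted Rand Index from pair counts and averages it over the base clusterings. The helper names `adjusted_rand_index` and `average_ari` and the toy data are assumptions made for this example; it is a minimal sketch and not the evaluation code of any cited approach.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same instances."""
    n = len(labels_a)
    # Pairs of instances grouped together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together in each labeling taken separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. two single-cluster labelings
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def average_ari(consensus, base_clusterings):
    """Average similarity of a consensus clustering with all base clusterings."""
    scores = [adjusted_rand_index(consensus, base) for base in base_clusterings]
    return sum(scores) / len(scores)


# Toy data (assumed): one candidate consensus and three base clusterings.
consensus = [0, 0, 1, 1]
base_clusterings = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
print(average_ari(consensus, base_clusterings))
```

The index is invariant to label renaming, so the second base clustering counts as a perfect agreement with the consensus.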

See (Boongoen and Iam-On, 2018), (Ghosh and

Acharya, 2016) and (Vega-Pons and Ruiz-

Shulcloper, 2011) for extensive reviews and studies

on ensemble clustering algorithmic approaches.

2.2 Multiple Consensus Clustering

The proposed framework is an extension of the

MultiCons multiple consensus clustering approach

described in (Al-Najdi et al., 2016) with five

algorithmic variants of the approach, each based on a

different consensus creation process (merge/divide

based, graph based, etc.), and comparative studies of

their properties in different application contexts and

for datasets with distinct data space properties.

The MultiCons approach makes use of closed set

mining to discover clustering patterns among the

different base clustering solutions, each defining an

agreement between a set of clusters to group a set of

instances. These patterns are then processed by a

split/merge strategy to generate multiple consensus

clusterings represented in the ConsTree tree-like

Figure 5: Semi-MultiCons Approach for Cluster Learning and Characterization for Class Prediction.

structure, which helps in understanding the clustering

process and the underlying intrinsic structures of the data space.
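The notion of a closed pattern among base clusterings can be sketched as follows. Each instance is treated as a transaction containing the base clusters it belongs to, and closed itemsets are enumerated naively as intersections of transactions. This naive enumeration is an assumption chosen for brevity; it is not the closed set mining algorithm used by MultiCons, and the function name `closed_patterns` is hypothetical.

```python
def closed_patterns(base_clusterings):
    """Map each closed set of base clusters to the instances it groups.

    A base cluster is identified by (clustering index, cluster label).
    """
    n = len(base_clusterings[0])
    # One transaction per instance: the base clusters it was assigned to.
    transactions = [
        frozenset((k, labels[i]) for k, labels in enumerate(base_clusterings))
        for i in range(n)
    ]
    # Closed itemsets with non-empty support are exactly the intersections
    # of transactions, so intersect until no new itemset appears.
    closed = set(transactions)
    frontier = set(transactions)
    while frontier:
        new = set()
        for itemset in frontier:
            for transaction in transactions:
                common = itemset & transaction
                if common and common not in closed:
                    new.add(common)
        closed |= new
        frontier = new
    # Attach to each closed itemset the instances that support it.
    return {
        itemset: frozenset(i for i, t in enumerate(transactions) if itemset <= t)
        for itemset in closed
    }


# Toy ensemble (assumed data): two base clusterings of four instances.
patterns = closed_patterns([[0, 0, 1, 1], [0, 0, 0, 1]])
for itemset, instances in sorted(patterns.items(), key=lambda p: -len(p[1])):
    print(sorted(itemset), "->", sorted(instances))
```

Each resulting pair reads as "these base clusters agree on grouping these instances", and patterns supported by fewer base clusters cover larger instance sets.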

3 PROPOSED FRAMEWORK

The Semi-MultiCons approach proposed here is a

novel semi-supervised consensus clustering

algorithmic framework. It extends the MultiCons

approach to semi-supervised clustering, with a new

constraint-based iterative consensus creation process

and a new strategy for selecting the most relevant

consensus clusters in the ConsTree tree-like structure.

The Semi-MultiCons process is presented in Figure 5.

Starting from the dataset, the base clusterings are

generated and then combined in a membership matrix

representing the assigned cluster for each dataset

instance. Closed patterns, each depicting an

agreement between a set of base clusterings to group

a set of instances into the same cluster, are extracted

from the membership matrix and combined to

generate relevant clustering patterns of different

sizes. These patterns are processed by the consensus

function to generate the ConsTree hierarchical

graphical representation of the multi-level clusters

and identify the most relevant ones using internal and

external validations. These clusters are then

characterized and mapped to application classes to

predict the class of new instances using a neighbour-

based or model-based approach. Validated instance

class predictions can then be integrated in the process

as new constraints for the semi-supervised aspect.

This interactive process requires the end-user to

configure the data pre-processing step and the base

clustering algorithmic configurations, and to explore,

interpret and validate the results, i.e., the selected

multi-level clusters and their associated application

classes and sub-classes. Application domain expertise

is indeed required to optimize these tasks.
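One possible reading of the mapping and prediction steps is sketched below: each cluster is mapped to the majority class of its annotated instances, and a new instance receives the class of the cluster with the nearest centroid. Both choices (majority vote, nearest centroid with Euclidean distance) and all names are assumptions made for illustration; the framework does not prescribe them.

```python
from collections import Counter, defaultdict
from math import dist


def map_clusters_to_classes(cluster_labels, annotations):
    """Map each cluster to the majority class of its annotated instances."""
    votes = defaultdict(Counter)
    for instance, cls in annotations.items():
        votes[cluster_labels[instance]][cls] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


def cluster_centroids(points, cluster_labels):
    """Compute the mean point of each cluster."""
    groups = defaultdict(list)
    for point, cluster in zip(points, cluster_labels):
        groups[cluster].append(point)
    return {
        cluster: tuple(sum(coords) / len(members) for coords in zip(*members))
        for cluster, members in groups.items()
    }


def predict(point, centroids, cluster_to_class):
    """Return the class mapped to the cluster whose centroid is nearest."""
    nearest = min(centroids, key=lambda cluster: dist(point, centroids[cluster]))
    return cluster_to_class.get(nearest)


# Toy data (assumed): four instances, two clusters, two annotated instances.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
cluster_labels = [0, 0, 1, 1]
annotations = {0: "normal", 3: "error"}

cluster_to_class = map_clusters_to_classes(cluster_labels, annotations)
centroids = cluster_centroids(points, cluster_labels)
print(predict((1.0, 1.0), centroids, cluster_to_class))
print(predict((5.0, 4.0), centroids, cluster_to_class))
```

A prediction validated by the end-user could then be added to `annotations`, which is how validated predictions would feed back as new constraints.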

3.1 Semi-supervised Multiple

Consensus Clustering

Semi-supervised learning approaches combine

unsupervised classification, i.e., clustering, and

supervised classification, that is the subsequent

learning of classes from clusters, when partial prior

knowledge about the data is available, i.e., when

some dataset instances are annotated with class

labels. Short surveys on semi-supervised clustering-

based learning can be found in (Agovic and Banerjee,

2013), (Grira et al., 2005) and (Jain et al., 2016).

Studies of the Amadeus Revenue Management

workflow data and semi-supervised learning concepts

led to the development of three new closed pattern

consensus-based semi-supervised algorithmic

approaches. These approaches extend the MultiCons

approach by integrating supervised information

represented as cannot-link and must-link constraints

between annotated dataset instances, i.e., pairs of

instances with different and identical class labels

respectively. Each approach integrates these

constraints in a different phase of the consensus

clustering process.
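The conversion of partial class annotations into pairwise constraints can be sketched as follows. The function name `pairwise_constraints` and the toy annotations are assumptions made for this example.

```python
from itertools import combinations


def pairwise_constraints(annotations):
    """Derive must-link and cannot-link pairs from partial class labels.

    `annotations` maps the index of each annotated instance to its class
    label; instances without a label are simply absent from the mapping.
    """
    must_link, cannot_link = set(), set()
    for (i, label_i), (j, label_j) in combinations(sorted(annotations.items()), 2):
        if label_i == label_j:
            must_link.add((i, j))    # identical class labels
        else:
            cannot_link.add((i, j))  # different class labels
    return must_link, cannot_link


# Toy annotations (assumed): instances 2 and 4 are not annotated.
annotations = {0: "normal", 1: "error", 3: "error", 5: "normal"}
must_link, cannot_link = pairwise_constraints(annotations)
print(sorted(must_link))
print(sorted(cannot_link))
```

With a annotated instances, this produces a(a-1)/2 constraints in total, so the ratio of annotated instances directly controls the number of constraints.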

In the first proposed approach, depicted in Figure

6, cannot-link and must-link constraints are integrated

during the creation of the base clusterings by using

semi-supervised clustering algorithms.

In the second approach, depicted in Figure 7, cannot-link and must-link constraints are integrated

Figure 6: Integrating Constraints by using Semi-Supervised Clustering for Base Clusterings Creation.

during the processing of base clusterings to generate

the clustering ensemble, which can be represented as a

co-association matrix or a membership matrix

depending on the consensus function that will be used

to generate consensus clusters.

In the third approach, depicted in Figure 8,

cannot-link and must-link constraints are integrated

during the processing of the clustering ensemble by

the consensus function to generate consensus clusters,

so that the resulting consensus clusterings comply as

far as possible with the integrated constraints.

Different experimental protocols were defined

using reference benchmark datasets to study and

compare the three proposed approaches and other

classical single unsupervised and semi-supervised

clustering approaches. Datasets corresponding to

different ratios of annotated dataset instances and

different ratios of cannot-link and must-link

constraints among annotated dataset instances were

generated to assess the effect of these ratios on the

efficiency of the process and the relevance of the

clustering results. Results of this theoretical and

experimental study show the relevance of the three

proposed approaches for semi-supervised learning.

They also show that the integration step can be

adapted to the available prior knowledge and the

potential integration restrictions, for example

regarding technical constraints on the use of semi-

supervised algorithms for generating base clusterings

or the use of constraints in the internal and external

validation measures applied for generating the

clustering ensemble and/or the consensus clusters.

Error ticket annotations by the end-users will be

converted to cannot-link and must-link constraints to

conduct experiments comparing classical and

ensemble semi-supervised approaches proposed in

the literature with the three developed approaches

from the viewpoints of the efficiency and the

scalability of the approaches, and of the quality of the

resulting clustering solutions.

3.2 Technical Challenges

3.2.1 Source Data Pre-processing

This challenge encompasses the representation,

storage, specialization and/or generalization and

manipulation of source data. Data collected from the

Figure 7: Integrating Constraints in the Ensemble Creation Process.

Figure 8: Integrating Constraints in the Clustering Function for Generating Consensus Clusters from the Ensemble.

Amadeus Revenue Management workflow contain

all accounting information required for processing a

travel, which is coded internally as a ticket at the input of

the workflow. Each ticket is a hierarchical data

structure representing the complete travel and its

associated coupons, each coupon corresponding to a

flight connection and related commercial treatments

in the travel. For each ticket, general data on the travel

(distance of travel, total duration, number of

connections, etc.) are included as well as data on each

coupon (departure and arrival airports, air operator,

duration, price, taxes, etc.). This study of the

Amadeus Revenue Management workflow data

shows that both the heterogeneity and number of

features associated with each ticket present great

variability, depending on the corresponding travel

and commercial treatments.

Different pre-processing steps were tested in

order to represent in a relevant format the information

on tickets regarding the applicability of unsupervised

and supervised algorithms versus the heterogeneity,

the number of objects and the number of variables in

the processed datasets.

3.2.2 Data Space Understanding

This challenge covers the analytical exploration and

identification of structural properties of the data space

regarding the issue of the parameterizations of base

clustering algorithms, to generate relevant base

clusterings in the ensemble. After the data exploration

and visualization phase, the initial datasets

constructed represent each ticket at the input of the

Amadeus Revenue Management workflow as an

instance, i.e., a row, in the dataset. For this, the

hierarchical data structure representing tickets and

associated coupons was flattened: each dataset
instance contains both the data on the ticket and the
data on its associated coupons. This flattened
representation of tickets makes all clustering
algorithms applicable, whereas the heterogeneity of the
initial data encoding prevents the application of certain
categories or implementations of clustering algorithms.
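The flattening of a hierarchical ticket into a single dataset row can be sketched as follows. The field names (`distance_km`, `connections`, `coupons`, `departure`, `arrival`, `tax`) and the column naming scheme are hypothetical and chosen for illustration only; they do not reflect the actual Amadeus data model.

```python
def flatten_ticket(ticket):
    """Turn a ticket with nested coupons into one flat row (a dict)."""
    # Ticket-level features are copied as they are.
    row = {key: value for key, value in ticket.items() if key != "coupons"}
    # Coupon-level features get one column per coupon position.
    for position, coupon in enumerate(ticket.get("coupons", []), start=1):
        for key, value in coupon.items():
            row[f"coupon{position}_{key}"] = value
    return row


# Hypothetical ticket with two coupons.
ticket = {
    "distance_km": 6500,
    "connections": 1,
    "coupons": [
        {"departure": "NCE", "arrival": "CDG", "tax": 12.5},
        {"departure": "CDG", "arrival": "JFK", "tax": 40.0},
    ],
}
row = flatten_ticket(ticket)
print(row)
```

A ticket with a single coupon would have no `coupon2_*` columns, which illustrates how a flattened dataset covering all tickets can have many columns that are empty for most rows.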

Tickets in the dataset correspond to both tickets

with normal output, i.e., no anomaly detected, and

with error output, i.e., anomaly detected. These

datasets were sampled so that tickets of both

classes, i.e., normal or error ticket, are sufficiently

balanced to ensure that the different classes can be

identified and segregated in the data space.

The first dataset contains 2 000 tickets, with 1 000

normal tickets and 1 000 error tickets of the anomaly

class ‘FOP Reconciliation, Unsettled Payment’.

Among these 2 000 tickets, 1 785 tickets are correctly

annotated (true class labels) while the 215 other

tickets represent noise in the data space (incorrect

class labels that were automatically generated by the

workflow, representing false positives). The second

dataset combines the data processed by the

Amadeus Revenue Management workflow with the data

generated by the successive modules of the workflow

for the management of error tickets. This dataset

contains 20 000 tickets, with 10 000 normal tickets

and 10 000 error tickets of the anomaly class ‘FOP

Reconciliation, Unsettled Payment’.

These pre-processing operations show that a high

number of attributes are manipulated during the

Amadeus Revenue Management workflow, with up

to 39 889 features (variable values) per ticket in the

first dataset and up to 83 698 features per ticket in the

second dataset. However, this high dimensional data

space is sparse, meaning that only a small proportion

of the corresponding variables are filled in for most

tickets. Although this flattened representation of ticket

features makes all clustering algorithms applicable,

high-dimensional data spaces impose

restrictions on the applied algorithmic configurations

regarding space and time complexities of the

computation as shown in the baseline experiments.

3.3 Scientific Challenges

3.3.1 Data Representation and Encoding

This challenge concerns the representation,

formatting and encoding of the heterogeneous data in

input of the workflow considering the applicability of

base clustering algorithms and their time and space

complexity classes relatively to the dataset size.

Although the maximal number of features manipulated

for each ticket during the Amadeus Revenue

Management workflow is large, on the order of

tens of thousands, the analytical exploration of these

data and the application of supervised classification

and regression approaches show that only a small

proportion is relevant for the detection and the

prediction of classes of anomalies.

The use of feature selection techniques makes it possible to

reduce the maximal number of features for each ticket

to the order of hundreds by removing irrelevant data

regarding the distinction of ticket classes in the data

space. This pre-processing both extends the list of

clustering algorithms that are applicable considering

their time and space complexities and enhances the

quality of the result by reducing the negative impact

of the high-dimensionality of the data space on the

capabilities of distance measures to precisely assess

the similarity between objects in the data space

(Curse of Dimensionality issue).

3.3.2 Definition of Base Clustering

Algorithmic Configurations

The development of the semi-supervised clustering

approach integrating prior knowledge in the

generation of the base clusterings from which the

clustering ensemble is created is based on an

extensive study of semi-supervised clustering

algorithms. This study encompasses the different

algorithmic approaches and their variants that can be

divided into the following categories regarding the

underlying model they are based on: Semi-supervised

K-means, semi-supervised metric learning, semi-

supervised spectral clustering, semi-supervised

ensemble clustering, collaborative clustering,

declarative clustering, semi-supervised evolutionary

clustering and constrained expectation-maximization.

Diverse criteria were considered for determining

the best semi-supervised algorithmic approaches to

integrate for the generation of the base clusterings.

These criteria consider first and foremost the quality of the

clustering results, the efficiency and scalability of the

approach regarding dataset size, the applicability of

the approach to datasets containing heterogeneous

and missing data, and the robustness of the approach

to noise and outliers in the data. Considering reported

theoretical and experimental results in the literature,

and both the availability and the results of tests of

implementations, the COP K-means (Constraint-

Partitioning K-means), the MPCK-means (Metric

Pairwise Constrained K-means) and the LCVQE

(Linear Constrained Vector Quantization Error) algorithmic

approaches were integrated. Their algorithmic

configurations are defined using an interval of values

for the K parameter (number of clusters) to comply

with the diversity required for the search space of the

consensus clustering function. This interval is

centered on the number of classes defining the

cannot-link and must-link constraints to improve the

robustness of consensus solutions.
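The definition of the algorithmic configurations over an interval of K values can be sketched as follows. The interval radius of 2, the lower bound of 2 clusters and the function name `k_interval` are assumptions made for this example; the algorithm names are plain labels and no implementation of these algorithms is invoked.

```python
from itertools import product


def k_interval(n_classes, radius=2, k_min=2):
    """Values of K centered on the number of classes, clipped at k_min."""
    low = max(k_min, n_classes - radius)
    return list(range(low, n_classes + radius + 1))


# Labels of the integrated semi-supervised algorithms.
ALGORITHMS = ("COP K-means", "MPCK-means", "LCVQE")

# One algorithmic configuration per (algorithm, K) pair, here assuming
# that the annotations define three classes.
configurations = list(product(ALGORITHMS, k_interval(n_classes=3)))
for algorithm, k in configurations:
    print(algorithm, k)
```

Each configuration would yield one base clustering, so a wider interval increases the diversity of the ensemble at the cost of more base clustering runs.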

3.3.3 Ensemble Definition and Formatting

This challenge addresses the problem of the

representation of base clustering results in the

ensemble, that is, how the resulting instance cluster

assignments are represented for partitioning,

overlapping and fuzzy based clustering algorithms.

The design of a semi-supervised clustering

approach integrating prior knowledge in the

generation of the clustering ensemble required the

development of new algorithmic approaches. This prior

knowledge consists of partial class label annotations

in the dataset, that is, some dataset instances are of

known classes while others are not. These annotations

are used to generate cannot-link constraints between

instances of different classes and must-link

constraints between instances of identical classes.

The generated cannot-link and must-link constraints

are used to evaluate the quality of base clusters and

base clusterings by considering the number of

constraints that are violated and met in each cluster.

The results of this evaluation are used either to delete

from the clustering ensemble the base clusterings

with a low score, or to assign a reduced weight to base

clusterings with a low score and an increased weight

to base clusterings with a high score.
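A minimal sketch of this evaluation is given below. It scores each base clustering by the fraction of constraints it satisfies, then either filters the ensemble with a threshold or derives normalized weights. The scoring formula, the threshold of 0.5 and the function names are assumptions made for illustration; the exact measures of the approach are not reproduced here.

```python
def constraint_score(labels, must_link, cannot_link):
    """Fraction of pairwise constraints satisfied by one base clustering."""
    satisfied = sum(labels[i] == labels[j] for i, j in must_link)
    satisfied += sum(labels[i] != labels[j] for i, j in cannot_link)
    total = len(must_link) + len(cannot_link)
    return satisfied / total if total else 1.0


def filter_and_weight(base_clusterings, must_link, cannot_link, threshold=0.5):
    """Return the scores, the clusterings kept and the normalized weights."""
    scores = [constraint_score(b, must_link, cannot_link) for b in base_clusterings]
    kept = [b for b, s in zip(base_clusterings, scores) if s >= threshold]
    total = sum(scores)
    weights = [s / total if total else 0.0 for s in scores]
    return scores, kept, weights


# Toy data (assumed): one must-link pair, two cannot-link pairs.
must_link = {(0, 1)}
cannot_link = {(0, 3), (1, 2)}
base_clusterings = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 0]]

scores, kept, weights = filter_and_weight(base_clusterings, must_link, cannot_link)
print(scores, kept, weights)
```

Filtering discards the third base clustering, which violates every constraint, while weighting keeps it in the ensemble with a weight of zero.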

Depending on the consensus function used and

the data representation it requires as input, different

processes are defined to generate the clustering

ensemble. For co-association matrix-based consensus

functions, the co-association matrix can be generated

from the base clusterings with a sufficiently high

score only, or a weighted co-association matrix can

be generated using evaluation scores to weight co-

association values. For membership matrix-based consensus functions, a

binary membership matrix can be generated from the

base clusterings with a sufficiently high score only, or

a weighted membership matrix can be generated

using constraint-based evaluation scores of clusters to

weight cluster assignments with confidence degrees.
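The binary and weighted membership matrices can be sketched as follows, with one row per instance and one column per base cluster. The function name `membership_matrix` and the cluster scores passed as `cluster_weights` are hypothetical; how these scores are computed from the constraints is left outside this sketch.

```python
def membership_matrix(base_clusterings, cluster_weights=None):
    """Build an instance-by-base-cluster membership matrix.

    Columns are base clusters identified by (clustering index, label).
    Entries are 1.0 for a binary matrix, or the confidence degree of the
    cluster when `cluster_weights` is provided.
    """
    columns = sorted(
        {(k, label) for k, labels in enumerate(base_clusterings) for label in labels}
    )
    n = len(base_clusterings[0])
    matrix = []
    for i in range(n):
        row = []
        for k, label in columns:
            if base_clusterings[k][i] != label:
                row.append(0.0)
            elif cluster_weights is None:
                row.append(1.0)
            else:
                row.append(cluster_weights[(k, label)])
        matrix.append(row)
    return columns, matrix


# Toy ensemble (assumed data): two base clusterings of three instances.
base_clusterings = [[0, 0, 1], [0, 1, 1]]
columns, binary = membership_matrix(base_clusterings)

# Hypothetical constraint-based scores of the four base clusters.
cluster_weights = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 0.25}
_, weighted = membership_matrix(base_clusterings, cluster_weights)
print(columns)
print(binary)
print(weighted)
```

The matrix has N rows and as many columns as base clusters, so its size grows linearly with the number of instances, unlike the co-association matrix.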

3.3.4 Definition of Clustering Patterns

This challenge concerns the definition of the criteria

used during the analysis of agreements between base

clusterings by the consensus function to identify

clustering patterns. A clustering pattern is a group of

instances that verifies some properties, e.g., based on

the number of base clustering agreements or

constraints it satisfies and violates, to form a cluster.

New algorithmic techniques were developed

during the design of the Semi-MultiCons approach to

integrate prior knowledge in the generation of

consensus clusters by the consensus function. The

prior knowledge, represented as partial annotations,

is used to generate cannot-link and must-link

constraints that are integrated during the processing

of the clustering ensemble by the consensus function

to obtain consensus clusters and consensus

clusterings that comply as far as possible with the

constraints.

The proposed approach first extracts closed

patterns from the membership matrix representing the

clustering ensemble and iteratively combines these

closed patterns to define clustering patterns, each one

representing a relevant agreement between base

clusterings on grouping a set of instances. These

clustering patterns, which can overlap, are evaluated

and compared to create the multi-level consensus

clusters using a constraints-based merging/splitting

method. The key step of this phase is to compute a

normalized score that evaluates whether two

overlapping patterns should be merged or split. We

introduced three new constraint-based normalized

measures, which consider the symmetrical property of the

cannot-link constraint type and the symmetrical,

reflexive and transitive properties of the must-link

constraint type, and which are used to decide how to split or

merge patterns. Each measure corresponds to a

different situation from the viewpoint of the prior

knowledge available for the considered patterns.

When no prior knowledge is available, the classical

unsupervised measure of the approach, based on the

relative and absolute sizes of the overlapping and

distinct subsets of objects for the two patterns, is used.

Once the hierarchical structure of consensus

clusterings is created, the candidate consensus

clustering that satisfies the highest number of

constraints is selected as the recommended solution.
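
The first step described above, extracting closed patterns from the membership matrix, can be illustrated on a toy ensemble. The Python sketch below is a brute-force enumeration written for this example only: the cluster naming scheme (P1c0 for cluster 0 of base clustering 1), the function names and the data are assumptions, and the exponential enumeration is usable only on very small inputs. A closed pattern is taken here to be a maximal set of base clusters shared by a set of instances, together with exactly those instances.

```python
from itertools import combinations


def membership_rows(base_clusterings):
    """Return, for each instance, the set of base clusters it belongs to.

    Each returned set is one row of the binary membership matrix
    (instances x base clusters), stored as a set of column names.
    """
    n_instances = len(base_clusterings[0])
    return [
        frozenset(
            f"P{p + 1}c{labels[i]}" for p, labels in enumerate(base_clusterings)
        )
        for i in range(n_instances)
    ]


def closed_patterns(rows):
    """Map each closed set of base clusters to the instances sharing it.

    Brute force: every non-empty subset of clusters is tried, its
    supporting instances are collected, and the subset is extended to
    all clusters common to these instances (its closure).
    """
    items = sorted(set().union(*rows))
    patterns = {}
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            wanted = frozenset(subset)
            extent = frozenset(i for i, row in enumerate(rows) if wanted <= row)
            if not extent:
                continue  # no instance belongs to all these clusters
            closure = frozenset.intersection(*(rows[i] for i in extent))
            patterns[closure] = extent
    return patterns


# Toy ensemble: three base clusterings of five instances.
base_clusterings = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1],
]
rows = membership_rows(base_clusterings)
patterns = closed_patterns(rows)

# Print the patterns from the largest agreement on clusters to the smallest.
for clusters, instances in sorted(
    patterns.items(), key=lambda kv: (-len(kv[0]), sorted(kv[1]))
):
    print(sorted(clusters), sorted(instances))
```

In this toy ensemble, instances 0 and 1 are grouped together by all three base clusterings, while instances 0, 1 and 2 are grouped together by only two of them. Each closed pattern therefore records one level of agreement between base clusterings.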

4 CONCLUSIONS

Figure 9: Principal Component Analysis Results for Task Type Classes.

Figure 10: Principal Component Analysis Results for Semi-MultiCons Clusters.

Theoretical and experimental results obtained

during the initial phase of the Semi-MultiCons

development have demonstrated both the feasibility

and the relevance of semi-supervised learning

approaches relying on closed patterns-based multi-

level consensus clustering for improving processes

such as the Amadeus Revenue Management

workflow. Moreover, multi-level consensus clusters,

such as those generated by the Semi-MultiCons

approach, can support the refinement of

application classes into more adequate sub-classes

regarding the application objectives, for example by

decomposing anomaly correction processes into

sub-processes that can be common to several processes.

Initial experiments were conducted on a sample

dataset of 474 error tasks raised by the Accounting

module of the workflow. These tasks are annotated

with three different types: 94 tasks of type A, 210

tasks of type B and 170 tasks of type C. Figure 9

shows the result of the application of the Principal

Component Analysis approach for transforming raw

data into two-dimensional points, where horizontal

and vertical axes represent principal components

calculated by the approach. Each point in this scatter

plot represents a task whose true label, i.e., type of

task, is represented in color. The scatter plot obtained

by the application of the same Principal Component

Analysis approach to the output of Semi-MultiCons

for this dataset is shown in Figure 10, where the

assigned cluster for each task is represented in color.

Comparing true classes and assigned clusters for the

474 tasks with the Jaccard index gave an accuracy of

82%. It should be noted that these

initial results were obtained without tuning the

parameters of each step of the Semi-MultiCons

approach. In a second step, the Semi-MultiCons

approach was applied to a dataset of 303 064 error

tasks containing all error tasks raised by the Proration

module between January 2019 and September 2019

for a medium-sized airline customer. Due to the size

of the dataset, only partial information was available for

supervised validation of the results. However,

assuming the clustering result is correct, the assessed rate

of tasks that are similar is 39.5%. With an estimated

average manual correction duration for tasks of more

than one minute, identifying similar tasks for their

simultaneous anomaly correction may save up to

2 000 hours of manual correction activity for these

303 064 tasks.
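
The two figures quoted above can be made concrete with a short Python sketch. The text does not specify which form of the Jaccard index was used, so the pair-counting form below is an assumption, as are the function names and the toy labels. The second function reproduces the order of magnitude of the time saving under the simplifying assumption that each similar task saves exactly one minute, which gives an upper bound rather than a measured value.

```python
from itertools import combinations


def pair_jaccard(classes, clusters):
    """Pair-counting Jaccard index between class labels and cluster labels.

    both        : pairs in the same class and in the same cluster
    class_only  : pairs in the same class but in different clusters
    cluster_only: pairs in different classes but in the same cluster
    The index is both / (both + class_only + cluster_only).
    """
    both = class_only = cluster_only = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            both += 1
        elif same_class:
            class_only += 1
        elif same_cluster:
            cluster_only += 1
    total = both + class_only + cluster_only
    return both / total if total else 1.0


def hours_saved(n_tasks, similar_rate, minutes_per_task=1.0):
    """Upper bound on saved hours if every similar task saves one correction."""
    return n_tasks * similar_rate * minutes_per_task / 60.0


# Toy labels (not the 474-task dataset): four tasks, two classes.
print(pair_jaccard(["A", "A", "B", "B"], [0, 0, 0, 1]))  # 0.25

# Figures from the text: 303 064 tasks, 39.5% similar, one minute each.
print(round(hours_saved(303064, 0.395)))  # 1995
```

With one minute per task, the estimate is about 1 995 hours, consistent with the figure of up to 2 000 hours given above.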

These achievements have also shown the

necessity for a specialization of semi-supervised

approaches to take into account the heterogeneous

internal and external available information, i.e., data

and prior knowledge, in input and the application

objectives from the perspective of the classes that are

to be distinguished: the potential overlapping

properties of classes in the data space, a hierarchical

structure of application classes, the availability of

prior knowledge such as data partially annotated with

application classes, the complex processing of logs of

sequential correction actions requiring deep learning

techniques, etc. Examples of recent applications with

similar considerations in the domains of ontology

matching and document classification can be found in

(Boeva et al., 2018) and (Ippolito and Júnior, 2016).

ACKNOWLEDGMENTS

This project was carried out as part of the IDEX

UCA JEDI MC2 joint project between Amadeus and the

Université Côte d'Azur. This work has been

supported by the French government, through the

UCA JEDI Investments in the Future project managed

by the National Research Agency (ANR) with the

reference number ANR-15-IDEX-01.

REFERENCES

Agovic A., Banerjee A. Semi-supervised Clustering. In

Data Clustering: Algorithms and Applications, Chapter

20, pp. 505-534, 2013, Chapman & Hall.

Al-Najdi A., Pasquier N., Precioso F. Frequent Closed

Patterns Based Multiple Consensus Clustering. In

ICAISC'2016 International Conference on Artificial

Intelligence and Soft Computing, pp. 14-26, June 2016,

LNCS 9693, Springer.

Al-Najdi A., Pasquier N., Precioso F. Using Frequent

Closed Pattern Mining to Solve a Consensus Clustering

Problem. In SEKE'2016 International Conference on

Software Engineering & Knowledge Engineering, pp.

454-461, July 2016, KSI Research Inc. SEKE'2016

Third Place Award.

Al-Najdi A., Pasquier N., Precioso F. Multiple Consensuses

Clustering by Iterative Merging/Splitting of Clustering

Patterns. In MLDM'2016 International Conference on

Machine Learning and Data Mining, pp. 790-804, July

2016, LNAI 9729, Springer.

Al-Najdi A., Pasquier N., Precioso F. Using Frequent

Closed Itemsets to Solve the Consensus Clustering

Problem. In International Journal of Software

Engineering and Knowledge Engineering,

26(10):1379-1397, December 2016, World Scientific.

Boeva V., Angelova M., Lavesson N., Rosander O.,

Tsiporkova E. Evolutionary Clustering Techniques for

Expertise Mining Scenarios. In ICAART'2018

International Conference on Agents and Artificial

Intelligence, pp. 523-530, January 2018, SciTePress.

Boongoen T., Iam-On N. Cluster Ensembles: A Survey of

Approaches with Recent Extensions and Applications.

In Computer Science Review, vol. 28, pp. 1-25, 2018.

Dalton L., Ballarin V., Brun M. Clustering Algorithms: On

Learning, Validation, Performance, and Applications to

Genomics. In Current Genomics, 10(6):430-445, 2009,

Bentham Science Publisher.

Fahad A., Alshatri N., Tari Z., Alamri A., Khalil I., Zomaya

A., Foufou S., Bouras A. A Survey of Clustering

Algorithms for Big Data: Taxonomy and Empirical

Analysis. In IEEE Transactions on Emerging Topics in

Computing, 2(3):267-279, September 2014, IEEE

Computer Society.

Färber I., Günnemann S., Kriegel H.-P., Kröger P., Müller

E., Schubert E., Zimek A. On Using Class-Labels in

Evaluation of Clusterings. In KDD MultiClust

International Workshop on Discovering, Summarizing

and Using Multiple Clusterings, 2010.

Ghosh J., Acharya A. A Survey of Consensus Clustering.

In Handbook of Cluster Analysis, Chapter 22, pp. 497-

518, 2016, Chapman and Hall/CRC.

Grira I., Crucianu M., Boujemaa N. Unsupervised and

Semi-supervised Clustering. A Brief Survey. In A

Review of Machine Learning Techniques for

Processing Multimedia Content, vol. 1, pp. 9-16, 2005.

Halkidi M., Batistakis Y., Vazirgiannis M. On Clustering

Validation Techniques. In Journal of Intelligent

Information Systems, vol. 17, pp. 107-145, 2001,

Springer.

Hennig C. Clustering Strategy and Method Selection. In

Handbook of Cluster Analysis, Chapter 31, pp. 703-

730, 2016, Chapman and Hall/CRC.

Ippolito A., Júnior J.R. Ontology Matching based on Multi-

Aspect Consensus Clustering of Communities.

In ICEIS'2016 International Conference on Enterprise

Information Systems, Volume 2, pp. 321-326, April

2016, SciTePress.

Jain A., Jin R., Chitta R. Semi-supervised Clustering. In

Handbook of Cluster Analysis, Chapter 20, pp. 443-

468, 2016, Chapman and Hall/CRC.

Kriegel H.-P., Kröger P., Zimek A. Clustering High-

dimensional Data: A Survey on Subspace Clustering,

Pattern-based Clustering, and Correlation Clustering. In

ACM Trans. Knowl. Discov. Data, vol. 3, num. 1,

article 1, March 2009, ACM Publisher.

Vega-Pons S., Ruiz-Shulcloper J. A Survey of Clustering

Ensemble Algorithms. In International Journal of

Pattern Recognition and Artificial Intelligence,

25(3):337-372, 2011.

Rendón E., Abundez I., Arizmendi A., Quiroz E.M. Internal

versus External Cluster Validation Indexes. In

International Journal of Computers and

Communication, 5(1):27-34, 2011.

Tomasini C., Borges E.N., Machado K., Emmendorfer L.R.

A Study on the Relationship between Internal and

External Validity Indices Applied to Partitioning and

Density-based Clustering Algorithms. In ICEIS'2017

International Conference on Enterprise Information

Systems, Volume 3, pp. 89-98, April 2017, SciTePress.

Xiong H., Li Z. Clustering Validation Measures. In Data

Clustering: Algorithms and Applications, Chapter 23,

pp. 571-605, 2014, CRC Press.

Xu D., Tian A. A Comprehensive Survey of Clustering

Algorithms. In Annals of Data Science, 2(2):165-193,

2015, Springer.

Xu R., Wunsch D. Survey of Clustering Algorithms. In

IEEE Transactions on Neural Networks, 16(3):645-

678, 2005, IEEE Computer Society.