Incorporating Privileged Information to Improve Manifold Ordinal Regression

M. Pérez-Ortiz, P. A. Gutiérrez and C. Hervás-Martínez
University of Córdoba, Dept. of Computer Science and Numerical Analysis,
Rabanales Campus, Albert Einstein Building, 14071 Córdoba, Spain
Keywords:
Manifold Learning, Ordinal Regression, Privileged Information, Kernel Learning.
Abstract:
Manifold learning covers those learning algorithms where high-dimensional data is assumed to lie on a low-
dimensional manifold (usually nonlinear). Specific classification algorithms are able to preserve this manifold
structure. On the other hand, ordinal regression covers those learning problems where the objective is to
classify patterns into labels from a set of ordered categories. There have been very few works combining
both ordinal regression and manifold learning. Additionally, privileged information refers to some special
features which are available during classifier training, but not in the test phase. This paper contributes a
new algorithm for combining ordinal regression and manifold learning, based on the idea of constructing a
neighbourhood graph and obtaining the shortest path between all pairs of patterns. Moreover, we propose
to exploit privileged information during graph construction, in order to obtain a better representation of the
underlying manifold. The approach is tested with one synthetic experiment and 5 real ordinal datasets, showing promising potential.
1 INTRODUCTION
Ordinal regression is a learning task where the ob-
jective is to classify patterns into a set of prede-
fined labels, but the labels include an order (Car-
doso and da Costa, 2007; Chu and Keerthi, 2007; Li
and Lin, 2007). For example, for an age estimation
problem, people images could be classified into the
classes {newborn, baby, young, adult, senior}. These categories reflect intervals of an actual latent variable (the real age of the person) but, contrary to standard regression, the latent variable is unobservable. On the other hand, the order between the categories makes this problem different from standard classification, and specific ordinal regression algorithms try to improve the quality of the classifier by introducing the order in the model and/or penalising the different classification errors (the magnitude of the error should be higher when the predicted class is further from the actual class) (Lin and Li, 2012).
Different methods have been proposed to deal with ordinal regression problems. Threshold models are one of the most popular approaches (McCullagh, 1980; Verwaeren et al., 2012), where the ordinal regression problem is formulated as the problem of estimating a real-valued function and a set of Q − 1 thresholds (Q is the number of classes), in such a way that one interval is assigned to each class ([−∞, b_1), [b_1, b_2), . . . , [b_{Q−1}, +∞)). This is the structure of the first specific model for ordinal regression, the proportional odds model (McCullagh, 1980), which is an ordinal version of binary logistic regression. Later on, nonlinear threshold models have appeared in the machine learning community, including different adaptations of other methods to the ordinal setting, such as support vector machines (Herbrich et al., 2000; Shashua and Levin, 2003; Chu and Keerthi, 2007), discriminant analysis (Sun et al., 2010) or Gaussian processes (Chu and Ghahramani, 2005). Other works decompose the original ordinal regression problem into several binary classification ones, by sequentially dividing the ordinal scale into binary labels (Frank and Hall, 2001; Cheng et al., 2008; Deng et al., 2010). Finally, a reduction framework can be found in (Cardoso and da Costa, 2007; Lin and Li, 2012), where ordinal regression is reduced to binary classification, but learning one single model for the binary problem, where the input patterns are replicated, extended and weighted according to the ordinal label.
In this paper, we consider a manifold learning ap-
proach for ordinal regression. The idea of manifold
learning is to uncover the nonlinear structure embed-
ded in a dataset, assuming that the high-dimensional
observations lie on or close to an intrinsically low-
dimensional manifold. There are different algorithms to learn this kind of structure, including the isometric feature mapping (Isomap) (Tenenbaum et al., 2000) or Laplacian eigenmaps (Belkin and Niyogi, 2001). Based on them, other manifold learning algorithms have also been proposed for classification, such as locality preserving projections (He and Niyogi, 2003) or the discriminant Laplacian embedding (DLE) (Wang et al., 2010).
In the context of ordinal regression, manifold
learning has been considered in (Liu et al., 2011a;
Liu et al., 2011b) based on the idea of preserving
the intrinsic geometry of the data via the definition
of a neighbourhood graph which also preserves the
ordinal nature of the dataset. This graph is used to
construct an adjacency matrix by using a generalised
radial basis function. The Laplacian matrix is then
derived and used for the learning process. A related
method is proposed in (Liu et al., 2012), where several
projections are iteratively computed. Finally, rank-
ing on data manifolds is investigated in (Zhou et al.,
2004), although the problem is defined as ranking,
which is different from ordinal regression.
On the other hand, Vapnik and Vashist recently
proposed a framework to apply support vector ma-
chines (SVM) to those cases where privileged infor-
mation is available during the training phase, but not
during test (Vapnik and Vashist, 2009). This kind of
information can be found in many learning problems,
where training samples present some special features
which are not available during test because of their
cost or simply because it is not possible. For exam-
ple, suppose our goal is to find a rule that can pre-
dict outcome y of a treatment in a year given the cur-
rent symptoms x of a patient. At the training stage,
a doctor can also provide additional information x*
about the development of symptoms in three months,
six months, and nine months (Vapnik and Vashist,
2009). The algorithm in (Vapnik and Vashist, 2009)
was based on considering a slack model for this priv-
ileged information. Given that slacks are only con-
sidered during SVM optimisation and not included
in the final model, their approach was able to bene-
fit from this privileged information, mainly improving
the convergence of the learning algorithm.
In this paper, we extend the ordinal regression
manifold approach in (Liu et al., 2011b; Liu et al.,
2011a) by considering privileged information during
the neighbourhood graph construction. Under the as-
sumption that privileged features are useful for the
classification task, this approach would modify the
neighbourhood structure to better represent the learn-
ing task. Moreover, we also consider a different ap-
proach for constructing the final distance matrix (by
making use of the Dijkstra algorithm) and include this
information into a kernel function, in order to apply
support vector ordinal regression (Chu and Keerthi,
2007), as opposed to the ordinal discriminant-based
projection method in the original proposal. There-
fore, two main objectives can be found in this paper:
Firstly, to analyse whether it is feasible to reformu-
late the notion of similarity for kernel functions when
considering an ordinal manifold of the data and sec-
ondly, to study if the inclusion of privileged informa-
tion helps to improve the constructed model. The ap-
proach is tested in one synthetic dataset and 5 real
ones, showing a competitive performance.
The rest of the paper is organised as follows: Sec-
tion 2 presents the methodology proposed, while Sec-
tion 3 presents and discusses the experimental results.
The last section summarises the main contributions of
the paper.
2 METHODOLOGY
When dealing with multiclass classification, the goal is to assign an input vector x to one of Q discrete classes C_q, q ∈ {1, . . . , Q}. To obtain the prediction rule C : X → Y, we use an i.i.d. training sample X = {x_i, y_i}_{i=1}^N, where N is the number of training patterns, x_i ∈ X, y_i ∈ Y, X ⊆ R^d is the d-dimensional input space and Y = {C_1, C_2, . . . , C_Q} is the label space. We are also provided with a test set to obtain a reliable estimation of the classification error, X_t = {x_{ti}, y_{ti}}_{i=1}^{N_t}, where N_t is the number of test patterns and x_{ti} ∈ X, y_{ti} ∈ Y. Finally, many learning problems present some features which are available during training but not in the test phase. This privileged information complements the training data in such a way that the training sample is X = {x_i, x*_i, y_i}_{i=1}^N, where x_i ∈ X, x*_i ∈ X*, y_i ∈ Y and X* ⊆ R^{d*} is the d*-dimensional privileged input space. The test set is the same, given that privileged information is not available when applying the classifier.
Ordinal regression or ordinal classification covers those problems where patterns have to be classified into naturally ordered labels. Consequently, the definition of this kind of problem is similar to the one introduced in the previous paragraph, but incorporating the following constraint: C_1 ≺ C_2 ≺ · · · ≺ C_Q, where ≺ denotes this order information.
Considering this ordering scale, one of the main hypotheses in ordinal regression is that the distance to adjacent classes is lower than the distance to non-
NCTA2014-InternationalConferenceonNeuralComputationTheoryandApplications
188
adjacent classes. Therefore, it can be said that ideally there exists a latent distance-based manifold of the output variable that results in C_q lying in the space between C_{q−1} and C_{q+1}. In this paper, we test two different hypotheses. On the one hand, motivated by the large number of ordinal kernel methods in the literature (Chu and Ghahramani, 2005; Chu and Keerthi, 2007; Sun et al., 2010; Liu et al., 2012), we test whether it is possible to include the manifold structure in the kernel matrix of kernel methods. Kernel matrices can be seen as data structures that contain information about similarities among the patterns in a dataset. This notion of similarity is usually based on a distance relation between the patterns. Therefore, this distance can be modified to consider the manifold structure of the data. On the other hand, we test whether the inclusion of privileged information in the construction of the neighbourhood graph helps to improve the robustness and efficiency of the classification model. The following two subsections are related to the first hypothesis, while the last subsection covers the second one.
2.1 Constructing a Representative
Graph for the Ordinal Manifold
This subsection comprises some elementary notions for constructing a representative graph for the ordinal manifold, which are used both in this paper and in the previous work (Liu et al., 2011a; Liu et al., 2011b). Consider an undirected graph of N vertices, G = (V, E), where V corresponds to the vertices of the graph and E ⊆ [V]^2 to the edges. In this case, the set of training patterns forms the set of vertices, V = {v_1, v_2, . . . , v_N} = {x_1, x_2, . . . , x_N}, and the different edges connect pairs of patterns:

E = \{e_{i,j}\} = \{(v_i, v_j)\} = \{(x_i, x_j)\},   (1)
where 1 ≤ i ≤ N and 1 ≤ j ≤ N. The set of edges is obtained via a k-neighbourhood analysis of the data, i.e. v_i is connected to v_j if x_i is one of the k nearest neighbours of x_j or vice versa. Instead of this or, we could have considered the logical operator and, but we introduce this relaxed version of the neighbouring structure to prevent unconnected regions in the dataset. Note that if v_i is connected to v_j, there exists e_{i,j} such that e_{i,j} ∈ E. For the purpose of constructing the neighbourhood graph, the Euclidean distance is used as the weight function (i.e. the one used for the neighbourhood analysis):

f(e_{i,j}) = d(x_i, x_j) = \|x_i - x_j\|_2,   (2)

where \|\cdot\|_2 denotes the L_2-norm operator.
As we aim to preserve the ordinal structure of the manifold, we could try to enlarge the locality between different ranks, as done in (Liu et al., 2011b). To do so, we can include a weight parameter w for the distances, in such a way that these weights reflect the rank differences between data points:

w_{i,j} = |y_i - y_j| + 1.   (3)

This weight information is applied to the distance function as follows: d(x_i, x_j) = w_{i,j} · ||x_i − x_j||_2. The possibility of considering these weights is explored in the experiments of this paper (i.e. we consider both the weighted and unweighted versions of the proposal). Recall that this transformation of the distances is done before constructing the neighbourhood graph.
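To make the construction concrete, the following sketch (our own minimal Python code, not the authors' implementation; the function name and the dense adjacency representation are choices made here) builds the relaxed k-nearest-neighbour edge set and applies the ordinal weights of Eq. (3) to the Euclidean edge costs before the neighbourhood analysis, as described above.

```python
import numpy as np

def ordinal_neighbourhood_graph(X, y, k=3, use_ordinal_weights=True):
    """Weighted adjacency matrix for the ordinal manifold (Section 2.1).

    An edge (i, j) exists if x_i is among the k nearest neighbours of x_j
    or vice versa (the relaxed 'or' rule). Edge costs are Euclidean
    distances, optionally scaled by w_ij = |y_i - y_j| + 1 (Eq. 3), with the
    scaling applied before the neighbourhood analysis. Non-edges are
    marked with infinity.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))                 # pairwise Euclidean distances
    if use_ordinal_weights:
        D = (np.abs(y[:, None] - y[None, :]) + 1.0) * D   # rank-difference weights (Eq. 3)
    neighbours = np.argsort(D, axis=1)[:, 1:k + 1]        # k nearest neighbours (self excluded)
    A = np.full((N, N), np.inf)
    np.fill_diagonal(A, 0.0)
    for i in range(N):
        for j in neighbours[i]:
            A[i, j] = A[j, i] = D[i, j]                   # edge if neighbour in either direction
    return A
```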
2.2 Including Graph Shortest Paths in
the Kernel Matrix
Usually, for manifold learning algorithms, an adjacency matrix is used for the learning process (which is the underlying idea in (Liu et al., 2011a; Liu et al., 2011b)). In this paper, however, we try to analyse whether it is feasible to reformulate the notion of similarity for kernel functions when considering an ordinal manifold of the data. The main idea is to use the graph information obtained in the previous step to locate the different patterns in the underlying ordinal manifold of the data. To do so, we use the shortest paths of the graph in order to provide a smoother treatment of the distances (as opposed to other manifold-based techniques where non-connected points are assumed to present an infinite distance).
In graph theory, the shortest path problem is the problem of finding a path between two vertices in a graph such that the sum of the weights of its constituent edges is minimised. As said, the constructed graph is undirected, so the notion of path is defined as a sequence of z vertices from v_1 to v_z, p_{1,z} = (v_1, v_2, . . . , v_z) ∈ V^z, such that v_i is adjacent to v_{i+1} for 1 ≤ i < z (and therefore e_{i,i+1} exists). Moreover, given a real-valued weight function f : E → R (as said, the weighted or unweighted Euclidean distance) that assigns a cost to each edge and an undirected graph G, the shortest path from v to v' is the path p_{1,z} = (v_1, . . . , v_z) (where v_1 = v and v_z = v') that, over all possible paths, minimises the sum \sum_{i=1}^{z-1} f(e_{i,i+1}), where e_{i,i+1} ∈ E.
To compute the distance from one data pattern x_i to the rest, but taking into account the manifold structure, we can compute the shortest paths from the vertex v_i to all the rest of the vertices considering the well-known Dijkstra's algorithm (Dijkstra, 1959).
IncorporatingPrivilegedInformationtoImproveManifoldOrdinalRegression
189
Figure 1: Representation of the spiral synthetic ordinal dataset. Left plot: Original dataset without privileged information. Right plot: Dataset including the privileged information as an additional feature. It can be seen that this privileged information improves the potential separability of the data.
Denote by P the set of paths obtained from this process, where p_{i,j} ∈ P is the shortest path between v_i and v_j. Therefore, the distance between any two points x_i and x_j in the training set is:

d(x_i, x_j) = \sum_{h=1}^{z-1} f(e_{h,h+1}), \quad v_1 = x_i, \; v_z = x_j,   (4)

where z is the length of the path between x_i and x_j.
Note that d(x_i, x_j) = w_{i,j} · ||x_i − x_j||_2 if x_i is one of the nearest neighbours of x_j. Therefore, to introduce the information about the location of each data point in the manifold into the kernel matrix, we modify the kernel function as follows:

k(x_i, x_j) = \exp\left(-\frac{d(x_i, x_j)^2}{2\sigma^2}\right),   (5)

where d(x_i, x_j) is defined as in Eq. (4) and σ is the kernel parameter, as opposed to using the standard Gaussian kernel with the L_2-norm, k(x_i, x_j) = exp(−||x_i − x_j||_2^2 / (2σ^2)). Note that this kernel matrix will still be positive semidefinite given that the only information changed is the distance function.
The kernel matrix obtained by this process is the one used for the training step. For the test phase, we first compute the distance from each test pattern x_{ti} to its nearest neighbour in training, x_j, and then sum this distance to the shortest paths from x_j to the rest of the training patterns. Consequently, the distance between a test point x_{ti} and all training points is:

d(x_{ti}, x_z) = d(x_{ti}, x_j) + d(x_j, x_z),   (6)

for 1 ≤ i ≤ N_t and 1 ≤ z ≤ N.

This idea for the test phase corresponds to locating the test pattern in the graph and using the shortest path information to compute the distance to the whole set of training patterns.
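A compact sketch of Eqs. (4)-(6) follows (again our own illustrative code, not the authors' implementation; `dijkstra` from SciPy is used for the all-pairs shortest paths, and the helper names reuse the graph sketch from Section 2.1). Disconnected components would yield infinite geodesic distances, which the paper avoids by increasing k.

```python
import numpy as np
from scipy.sparse.csgraph import dijkstra

def manifold_kernel(A, sigma):
    """Gaussian kernel over graph shortest-path distances (Eqs. 4 and 5).

    A is the adjacency matrix from ordinal_neighbourhood_graph, with np.inf
    marking non-edges; dijkstra returns the all-pairs geodesic distances.
    """
    G = dijkstra(A, directed=False)                  # shortest-path distance matrix
    K = np.exp(-G ** 2 / (2.0 * sigma ** 2))
    return K, G

def test_distances(X_test, X_train, G):
    """Distances from test to training patterns through the graph (Eq. 6).

    Each test pattern is attached to its nearest training neighbour and the
    precomputed geodesic distances are reused from that vertex.
    """
    diff = np.asarray(X_test, float)[:, None, :] - np.asarray(X_train, float)[None, :, :]
    D_euc = np.sqrt((diff ** 2).sum(axis=-1))        # test-to-train Euclidean distances
    j = D_euc.argmin(axis=1)                         # nearest training neighbour of each test point
    return D_euc[np.arange(len(D_euc)), j, None] + G[j, :]

# Usage sketch: K_train feeds the kernel classifier (e.g. SVORIM), and the test
# kernel rows are obtained with the same Gaussian transformation.
# A = ordinal_neighbourhood_graph(X_train, y_train, k=3)
# K_train, G = manifold_kernel(A, sigma=1.0)
# K_test = np.exp(-test_distances(X_test, X_train, G) ** 2 / (2.0 * 1.0 ** 2))
```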
2.3 Including Privileged Information in
the Graph
In order to motivate the inclusion of privileged information during manifold learning, Figure 1 represents a synthetic dataset presenting an ordinal manifold-based structure, where the label of each point is assigned according to the z coordinate. Data points lie on a leaning 3-dimensional spiral and labels are ordinal, with four classes C_1, C_2, C_3 and C_4. Figure 1 also includes the projection over the x and y coordinates. As can be seen, the z coordinate is crucial to obtain a neighbourhood graph able to help in the ordinal classification task. Considering this z value as privileged information during graph construction would allow the classification of patterns, even when only the x and y features are available during the test phase.
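Since the exact generating parameters of the spiral in Figure 1 are not reported, the following sketch only illustrates a dataset of this kind: (x, y) trace a noisy spiral, z grows monotonically along it and acts as the privileged feature, and the four ordinal classes are obtained by binning z. All parameter values here are our own illustrative choices, not the authors' settings.

```python
import numpy as np

def make_ordinal_spiral(n=400, turns=3.0, seed=0):
    """Illustrative 3D spiral with ordinal labels driven by the z coordinate."""
    rng = np.random.default_rng(seed)
    t = np.sort(rng.uniform(0.0, 1.0, n))
    theta = 2.0 * np.pi * turns * t
    radius = 0.5 + 1.5 * t
    x = radius * np.cos(theta) + rng.normal(scale=0.05, size=n)
    y = radius * np.sin(theta) + rng.normal(scale=0.05, size=n)
    z = 40.0 * t                                                    # privileged feature
    labels = np.digitize(z, np.quantile(z, [0.25, 0.5, 0.75])) + 1  # four ordinal classes
    return np.column_stack([x, y]), z[:, None], labels              # (ordinary X, privileged X*, y)
```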
The privileged information can be easily included during distance calculation to construct a neighbourhood graph which takes into account this additional information. We can make use of the privileged features in the real-valued weight function f* that assigns a value to the edges of the graph:

f^*(e_{i,j}) = \|(x_i, x^*_i) - (x_j, x^*_j)\|_2 = \sqrt{\sum_{s=1}^{d}(x_{is} - x_{js})^2 + \sum_{s=1}^{d^*}(x^*_{is} - x^*_{js})^2}.   (7)
The whole process of neighbourhood analysis and shortest path computation is reformulated to work with this real-valued weight function. When considering this weight function, f*(e_{i,j}), the distance function of Eq. (4) becomes d*((x_i, x*_i), (x_j, x*_j)) and is the one applied in the kernel function of Eq. (5). For the test phase, the privileged information is only considered for the graph that has been previously learnt, i.e.:

d^*(x_{ti}, (x_z, x^*_z)) = \|x_{ti} - x_j\|_2 + d^*((x_j, x^*_j), (x_z, x^*_z)),

where 1 ≤ z ≤ N and x_j is the training point closest to the evaluated test point x_{ti}.
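Under the same assumptions as the earlier sketches (reusing our hypothetical ordinal_neighbourhood_graph and test_distances helpers), including the privileged features amounts to building the graph on the concatenation (x, x*) of Eq. (7); the test-phase treatment is unchanged, since test patterns are attached to the graph through their ordinary features only.

```python
import numpy as np

def privileged_graph(X, X_priv, y, k=3, use_ordinal_weights=True):
    """Neighbourhood graph built in the joint space (x, x*) of Eq. (7).

    The privileged features participate only here, when the graph (and hence
    the geodesic distances) is learnt; test patterns are later attached to the
    graph through their ordinary features alone, as in test_distances above.
    """
    X_aug = np.hstack([np.asarray(X, float), np.asarray(X_priv, float)])  # (x, x*) concatenation
    return ordinal_neighbourhood_graph(X_aug, y, k=k,
                                       use_ordinal_weights=use_ordinal_weights)
```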
3 EXPERIMENTS
The proposed methodologies are based on generating a modified version of the kernel matrix (by exploiting the neighbourhood graph of the data), so they can be applied to any kernel classifier. In this way, we have considered the Support Vector Ordinal Regression with Implicit Constraints (SVORIM) method (Chu and Keerthi, 2007), as it is one of the best performing threshold models for ordinal regression (Gutiérrez et al., 2012). 5 benchmark ordinal regression datasets have been used for the analysis, which are taken from publicly available repositories¹ (Asuncion and Newman, 2007; PASCAL, 2011). Additionally, a more controlled environment is provided by the spiral dataset, introduced in Section 2.3. Table 1 shows the characteristics of the evaluated datasets, where it can be checked that the number of classes varies between 3 and 5.

¹Note that many of these datasets are frequently treated as nominal ones, without taking into account the order scale.
In the experiments, we evaluate two different factors:

- The introduction of ordinal costs for penalising distances during the construction of the graph. Ordinal costs are based on the absolute cost. This factor will be used to confirm whether these costs are really useful for ordinal regression, as discussed in previous works (Liu et al., 2011b; Liu et al., 2011a).

- The improvement obtained by the privileged information. The graph will be constructed with and without privileged information to evaluate if the additional variables improve the quality of the model.
The most common evaluation measures for ordinal regression are the Mean Absolute Error (MAE) and the accuracy ratio (Acc) (Gutiérrez et al., 2012; Baccianella et al., 2009; Cruz-Ramírez et al., 2014). The MAE measure is used when the cost of the different misclassification errors is not constant:

MAE = \frac{1}{N_t} \sum_{i=1}^{N_t} |y_{ti} - \hat{y}_{ti}|,   (8)

where \hat{y}_{ti} is the label predicted for x_{ti}. MAE values range from 0 to Q − 1 (Baccianella et al., 2009).
Regarding the experimental setup, the datasets were divided 30 times using a stratified holdout technique with 75% of the patterns for training and the remaining 25% for testing. The splits of each holdout are the same for all the algorithms, and one model is obtained for each training set and evaluated on the corresponding test set. The average test evaluation measures and the corresponding standard deviations are finally reported as the summary of the algorithm performance.
We use the standard Gaussian kernel for all the methods. Model selection is accomplished by cross-validating the hyperparameters of the algorithms considering only the training data (with a 5-fold cross-validation). The measure used to select the best parameter combination is MAE. The two parameters to be optimised are the kernel width (σ) and the cost parameter (C), both being selected within the values σ, C ∈ {10^{-3}, 10^{-2}, . . . , 10^{3}}. The number of nearest neighbours considered during graph construction is k = 3. In those cases where a pattern is not connected to any other one for the current value of k, we increase k until all patterns are connected to at least one other pattern.
For the spiral dataset, the privileged information is the z coordinate. For the rest of the datasets, we apply the Relief feature selection algorithm (Kira and Rendell, 1992) over the training set to sort the features by their relevance. We select half of the features (the most relevant ones) as privileged information (x*) and the rest as the original information (x).
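As a sketch of the evaluation protocol (our own code; the Relief scoring itself is not shown and the helper names are hypothetical), the privileged/ordinary split and the MAE of Eq. (8) can be written as follows.

```python
import numpy as np
from itertools import product

def split_privileged(X, relevance):
    """Most relevant half of the features becomes privileged information.

    `relevance` holds per-feature scores, e.g. produced by the Relief
    algorithm (Kira and Rendell, 1992); computing them is outside this sketch.
    Returns (ordinary features x, privileged features x*).
    """
    order = np.argsort(relevance)[::-1]              # features sorted by decreasing relevance
    half = X.shape[1] // 2
    return X[:, order[half:]], X[:, order[:half]]

def mae(y_true, y_pred):
    """Mean absolute error on the ordinal scale, Eq. (8)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Hyperparameter grid for model selection: sigma and C in {10^-3, ..., 10^3},
# chosen by 5-fold cross-validation on the training data using MAE.
param_grid = list(product(10.0 ** np.arange(-3, 4), repeat=2))
```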
3.1 Results
Table 2 shows the test results for the 6 ordinal datasets considered in terms of Acc and MAE. The best result for each dataset is in bold face and the second best in italics. From this table, we can outline several conclusions:
- When considering the ordinal weights, the Acc and MAE results are always improved by the privileged information. However, if the costs are not included, there are some datasets where the privileged information does not improve the results (bondrate, contact-lenses and squash-unstored). Given that the cross-validation criterion is the MAE (which is based on an absolute cost loss), we conclude that using these weights is necessary to properly obtain a benefit from the privileged information.
- From all the combinations, considering privileged information and ordinal weights is the best one, obtaining the best results in four datasets and the second best in another.
Table 1: Characteristics of the six datasets used for the experiments: number of instances (Size), inputs (#In.), classes (#Out.)
and patterns per-class (#PPC)
Dataset Size #In. #Out. #PPC
bondrate 57 37 5 (6, 33, 12, 5, 1)
contact-lenses 24 6 3 (15, 5, 4)
pasture 36 25 3 (12, 12, 12)
spiral 400 3 4 (50, 50, 50, 50)
squash-unstored 52 52 3 (24, 24, 4)
tae 151 54 3 (49, 50, 52)
Table 2: Test results obtained for the different datasets (Mean ± Standard Deviation over the 30 splits) by considering all the different manifold classification algorithms based on SVORIM. "Ord. weights" indicates whether the ordinal weights of Eq. (3) are used; "PI" indicates whether privileged information is used during graph construction.

Dataset          Ord. weights   Acc (no PI)     Acc (PI)        MAE (no PI)        MAE (PI)
bondrate         No             57.28 ± 3.82    56.54 ± 6.50    0.6272 ± 0.0647    0.6296 ± 0.0893
bondrate         Yes            56.54 ± 5.02    58.52 ± 5.34    0.6346 ± 0.0676    0.6123 ± 0.0996
contact-lenses   No             61.11 ± 10.11   61.11 ± 10.11   0.5500 ± 0.0892    0.5500 ± 0.0892
contact-lenses   Yes            58.89 ± 12.17   62.22 ± 8.68    0.5722 ± 0.1132    0.5389 ± 0.0717
pasture          No             48.89 ± 14.76   51.85 ± 14.69   0.5370 ± 0.1600    0.5074 ± 0.1450
pasture          Yes            42.96 ± 15.91   43.70 ± 16.49   0.6037 ± 0.1668    0.6000 ± 0.1716
spiral           No             82.37 ± 4.16    87.80 ± 2.70    0.2260 ± 0.0589    0.1867 ± 0.0505
spiral           Yes            85.03 ± 3.62    87.90 ± 2.76    0.2120 ± 0.0567    0.1857 ± 0.0520
squash-unstored  No             52.56 ± 13.71   50.77 ± 10.42   0.4795 ± 0.1452    0.4949 ± 0.1082
squash-unstored  Yes            49.74 ± 9.84    51.54 ± 11.45   0.5077 ± 0.0960    0.4897 ± 0.1115
tae              No             35.35 ± 8.62    35.53 ± 8.40    0.6570 ± 0.0867    0.6526 ± 0.0783
tae              Yes            34.91 ± 6.66    35.53 ± 8.40    0.6754 ± 0.0783    0.6500 ± 0.0770
- The clearest contribution of the privileged information is obtained for the spiral dataset. This is due to the fact that, in this more controlled environment, the data clearly belong to a low-dimensional manifold and the class label is assigned according to the privileged information (the z value). For the rest of the datasets, the privileged information has been selected according to the Relief algorithm, which has known limitations. Nevertheless, there are some datasets where the contribution of privileged information is still quite noticeable (e.g. bondrate and contact-lenses).
- The original SVORIM algorithm (without using a manifold assumption) was run on the spiral dataset with the same configuration, leading to a performance of Acc = 78.80 ± 3.53 and MAE = 0.2617 ± 0.0467. It is noticeable that these values are worse than the ones obtained by the manifold proposals in this paper.
4 CONCLUSIONS
This paper considers a new approach to address ordinal regression problems based on manifold learning. This approach is based on constructing a neighbourhood graph with the purpose of obtaining the intrinsic structure of the data. The main contribution of the paper is that this neighbourhood graph can be improved by the use of privileged information, i.e. information that is available during training but not in the test phase.
The algorithm is applied to 5 real ordinal classification problems and one synthetic dataset. When combined with SVORIM, the results of this paper confirm that privileged information is able to improve generalisation results in almost all the cases considered. The distances used in the kernel matrices are obtained using the privileged features, which (under the assumption that privileged information is really informative) better reflect the data structure.
Several future research directions are still open
from the work in this paper. First of all, more datasets
should be considered, including datasets with a higher
number of patterns and with a more clear manifold
NCTA2014-InternationalConferenceonNeuralComputationTheoryandApplications
192
structure. For example, the experiments considered in
(Liu et al., 2011b) cover the UMIST face, MovieLens
and the USPS datasets, which are known to contain
an underlying manifold structure. The problem is that
meaningful privileged information has to be found
for these problems. Secondly, the methods should
be compared against standard manifold classifiers to
check their performance. Finally, alternative kernel
methods apart from SVORIM could be considered to-
gether with the proposals in this paper.
ACKNOWLEDGEMENTS
This work has been subsidized by the TIN2011-22794 project of the Spanish Ministerial Commission of Science and Technology (MICYT), FEDER funds and the P11-TIC-7508 project of the "Junta de Andalucía" (Spain).
REFERENCES
Asuncion, A. and Newman, D. (2007). UCI machine learn-
ing repository.
Baccianella, S., Esuli, A., and Sebastiani, F. (2009). Evalu-
ation measures for ordinal regression. In Proceedings
of the Ninth International Conference on Intelligent
Systems Design and Applications (ISDA 09), pages
283–287, Pisa, Italy.
Belkin, M. and Niyogi, P. (2001). Laplacian eigenmaps and
spectral techniques for embedding and clustering. In
NIPS, volume 14, pages 585–591.
Cardoso, J. S. and da Costa, J. F. P. (2007). Learning to clas-
sify ordinal data: The data replication method. Jour-
nal of Machine Learning Research, 8:1393–1429.
Cheng, J., Wang, Z., and Pollastri, G. (2008). A neural net-
work approach to ordinal regression. In Proceedings
of the IEEE International Joint Conference on Neu-
ral Networks (IJCNN2008, IEEE World Congress on
Computational Intelligence), pages 1279–1284. IEEE
Press.
Chu, W. and Ghahramani, Z. (2005). Gaussian processes
for ordinal regression. Journal of Machine Learning
Research, 6:1019–1041.
Chu, W. and Keerthi, S. S. (2007). Support vector ordinal
regression. Neural Computation, 19(3):792–815.
Cruz-Ramírez, M., Hervás-Martínez, C., Sánchez-Monedero, J., and Gutiérrez, P. A. (2014). Metrics to guide a multi-objective evolutionary algorithm for ordinal classification. Neurocomputing, 135:21–31.
Deng, W.-Y., Zheng, Q.-H., Lian, S., Chen, L., and Wang, X. (2010). Ordinal extreme learning machine. Neurocomputing, 74(1-3):447–456.
Dijkstra, E. W. (1959). A note on two problems in connex-
ion with graphs. Numerische mathematik, 1(1):269–
271.
Frank, E. and Hall, M. (2001). A simple approach to ordi-
nal classification. In Proc. of the 12th Eur. Conf. on
Machine Learning, pages 145–156.
Gutiérrez, P. A., Pérez-Ortiz, M., Fernandez-Navarro, F., Sánchez-Monedero, J., and Hervás-Martínez, C. (2012). An Experimental Study of Different Ordinal Regression Methods and Measures. In 7th International Conference on Hybrid Artificial Intelligence Systems (HAIS), volume 7209 of Lecture Notes in Computer Science, pages 296–307.
He, X. and Niyogi, P. (2003). Locality preserving projec-
tions. In NIPS, volume 16, pages 234–241.
Kira, K. and Rendell, L. A. (1992). The feature selection
problem: Traditional methods and a new algorithm.
In AAAI, pages 129–134.
Li, L. and Lin, H.-T. (2007). Ordinal Regression by Ex-
tended Binary Classification. In Advances in Neural
Inform. Processing Syst. 19.
Lin, H.-T. and Li, L. (2012). Reduction from cost-sensitive
ordinal ranking to weighted binary classification. Neu-
ral Computation, 24(5):1329–1367.
Liu, Y., Liu, Y., and Chan, K. C. C. (2011a). Ordinal regres-
sion via manifold learning. In Burgard, W. and Roth,
D., editors, Proceedings of the 25th AAAI Conference
on Artificial Intelligence (AAAI’11), pages 398–403.
AAAI Press.
Liu, Y., Liu, Y., Chan, K. C. C., and Zhang, J. (2012).
Neighborhood preserving ordinal regression. In Pro-
ceedings of the 4th International Conference on Inter-
net Multimedia Computing and Service (ICIMCS12),
pages 119–122, New York, NY, USA. ACM.
Liu, Y., Liu, Y., Zhong, S., and Chan, K. C. (2011b).
Semi-supervised manifold ordinal regression for im-
age ranking. In Proceedings of the 19th ACM inter-
national conference on Multimedia (ACM MM2011),
pages 1393–1396, New York, NY, USA. ACM.
McCullagh, P. (1980). Regression models for ordinal data.
Journal of the Royal Statistical Society, 42(2):109–
142.
PASCAL (2011). Pascal (pattern analysis, statistical mod-
elling and computational learning) machine learning
benchmarks repository.
Herbrich, R., Graepel, T., and Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In Smola, A., Bartlett, P., Schölkopf, B., and Schuurmans, D., editors, Advances in Large Margin Classifiers, pages 115–132. MIT Press.
Shashua, A. and Levin, A. (2003). Ranking with large mar-
gin principle: Two approaches. In Advances in Neural
Information Processing Systems (NIPS), pages 937–
944. MIT Press, Cambridge.
Sun, B.-Y., Li, J., Wu, D. D., Zhang, X.-M., and Li, W.-B.
(2010). Kernel discriminant learning for ordinal re-
gression. IEEE Transactions on Knowledge and Data
Engineering, 22:906–910.
Tenenbaum, J. B., De Silva, V., and Langford, J. C. (2000).
A global geometric framework for nonlinear dimen-
sionality reduction. Science, 290(5500):2319–2323.
Vapnik, V. and Vashist, A. (2009). A new learning
IncorporatingPrivilegedInformationtoImproveManifoldOrdinalRegression
193
paradigm: Learning using privileged information.
Neural Networks, 22(5–6):544–557.
Verwaeren, J., Waegeman, W., and De Baets, B. (2012).
Learning partial ordinal class memberships with
kernel-based proportional odds models. Computa-
tional Statistics & Data Analysis, 56(4):928–942.
Wang, H., Huang, H., and Ding, C. H. (2010). Discriminant
laplacian embedding. In AAAI.
Zhou, D., Weston, J., Gretton, A., Bousquet, O., and Schölkopf, B. (2004). Ranking on data manifolds. In Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems (NIPS 2003), pages 169–176.
NCTA2014-InternationalConferenceonNeuralComputationTheoryandApplications
194