Minimum Modal Regression

Koichiro Yamauchi (1) and Vanamala Narasimha Bhargav (2)

(1) Department of Computer Science, Chubu University, Matsumoto-cho 1200, Kasugai, Japan
(2) Indian Institute of Technology, Guwahati, Assam, India

Keywords: Modal Regression, Kernel Distribution Estimator, Incremental Learning on a Budget, Kernel Machines, Projection Method.
Abstract: The recent development of microcomputers enables the execution of complex software on small embedded systems. Artificial intelligence is one form of software to be embedded into such devices. However, almost all embedded systems still have restricted storage space. One of the authors has previously proposed an incremental learning method for regression that works within a fixed storage space; however, this method cannot support the multivalued functions that frequently appear in real-world problems. One way to support multivalued functions is to use modal regression with a kernel density estimator. However, this method assumes that all sample points are recorded as kernel centroids, which is not suitable for small embedded systems. In this paper, we propose a minimum modal regression method that reduces the number of kernels using a projection method. The conditions required to maintain accuracy are derived through theoretical analysis. The experimental results show that our method reduces the number of kernels while maintaining a specified level of accuracy.
1 INTRODUCTION
The recent development of microcomputers enables
the embedding of complex software into small
devices. Machine learning algorithms are one
example of such software. One of the authors has
previously proposed a learning algorithm for kernel
regression in embedded systems (Yamauchi, 2014),
but this general regression method estimates the
conditional expectation of the dependent variable (Y)
given the independent variables (X=x). In contrast,
modal regression (Einbeck and Tutz, 2006) estimates the
conditional modes of Y given X=x. This strategy
enables the learning machine to predict a portion of
the missing variables from the other known variables
according to the given sample distribution. This
property is quite different from that of other typical
regression methods.
To estimate the conditional modes, partial mean shift (PMS) is a reliable method. The PMS method first obtains the joint kernel density estimate and then climbs it by gradient ascent. However, the standard estimator allocates one kernel per sample, so its memory footprint grows without bound as samples keep arriving. We therefore propose minimum modal regression, which estimates the joint kernel density under a restricted number of kernels by projecting a new sample onto the existing kernels, replacing an old kernel, or adding a new kernel centered at the sample. The equation for PMS is then modified accordingly.
2 MODAL REGRESSION
Modal regression approximates a multivalued function by searching for the local peaks of a given sample distribution. It consists of a kernel density estimator combined with a PMS method.
2.1 Kernel Density Estimator
The kernel density estimator (KDE) is a variation of
the Parzen window (Parzen, 1962).
Let $\{\mathbf{x}_p\}_{p=1}^{N}$ be the set of learning samples. The estimator approximates the probability density function by using a number of kernels, namely, the support set $S_t$. The kernels used are Gaussian kernels, and

$$p(\mathbf{x}) \propto \sum_{i \in S_t} K\!\left(\frac{\|\mathbf{x} - \mathbf{x}_i\|}{h_x}\right), \qquad (1)$$
where

$$K(z) \propto \exp\!\left(-\frac{z^2}{2}\right). \qquad (2)$$
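As a concrete illustration, the following is a minimal Python/NumPy sketch of an estimator of the form (1)-(2). The function names and the default bandwidth value are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def gaussian_kernel(z):
    """Gaussian kernel K(z), up to a normalizing constant (Eq. 2)."""
    return np.exp(-0.5 * z ** 2)

def kde(x, centers, h_x=0.25):
    """Unnormalized kernel distribution estimate of Eq. (1).

    x       : query point, shape (d,)
    centers : kernel centroids {x_i | i in S_t}, shape (m, d)
    h_x     : bandwidth
    """
    dists = np.linalg.norm(centers - x, axis=1)   # ||x - x_i||
    return gaussian_kernel(dists / h_x).sum()

# Example: evaluate the estimate of a 1-D sample set at a query point
centers = np.random.randn(100, 1)
print(kde(np.array([0.0]), centers))
```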
Normally, as many kernels as data points are required. However, if the storage capacity of a target device is small, the number of kernels must be restricted. There are several ways to realize density estimation with a limited number of kernels. Traditionally, self-organizing feature maps or learning vector quantization methods approximate the distribution by using a fixed number of templates.
As mentioned in (Sasaki et al., 2016), the KDE
used in modal regression should approximate the
peak points of the distribution, rather than the
distribution itself. Let $\hat{p}(\mathbf{x})$ be

$$\hat{p}(\mathbf{x}) \propto \sum_{i \in S_t} K\!\left(\frac{\|\mathbf{x} - \mathbf{x}_i\|}{h_x}\right), \qquad (3)$$

then $\hat{p}(\mathbf{x})$ should satisfy the following condition:

$$\nabla_{\mathbf{x}} \hat{p}(\mathbf{x}^*) = \nabla_{\mathbf{x}} p(\mathbf{x}^*) = \mathbf{0}, \qquad \nabla_{\mathbf{x}}^2 \hat{p}(\mathbf{x}^*) < 0,\ \ \nabla_{\mathbf{x}}^2 p(\mathbf{x}^*) < 0, \qquad (4)$$

where $\mathbf{x}^*$ denotes a local peak point of the distribution.
2.2 Partial Mean Shift
Modal regression searches the peaks of the
distribution model represented by the KDE. The PMS
method realizes quick convergence to the nearest
peak from the initial point. Let us denote the initial point as $\mathbf{x}_0$, representing the starting point for the search of the peak points. Thus, modal regression repeats the modification of the current $y$ as follows:

$$y^{new} = \frac{\displaystyle\sum_{i} y_i\, K\!\left(\frac{y^{old} - y_i}{h_y}\right) K\!\left(\frac{\|\mathbf{x} - \mathbf{x}_i\|}{h_x}\right)}{\displaystyle\sum_{j} K\!\left(\frac{y^{old} - y_j}{h_y}\right) K\!\left(\frac{\|\mathbf{x} - \mathbf{x}_j\|}{h_x}\right)}, \qquad (5)$$

where $\mathbf{X}$ denotes $[\mathbf{x}^T, y]^T = [x_1, \ldots, x_N, y]^T$. Note that $\mathbf{X}$ includes $y$.
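As an illustration of (5), the following sketch repeatedly replaces y with a kernel-weighted average of the stored y_i for a fixed query x. The helper names, the small constant added to the denominator, and the fixed iteration count are our own assumptions.

```python
import numpy as np

def gaussian_kernel(z):
    return np.exp(-0.5 * z ** 2)

def partial_mean_shift(x, y0, X_centers, y_centers, h_x=0.25, h_y=0.25, n_iter=10):
    """Climb to the conditional mode nearest to (x, y0) via the update of Eq. (5)."""
    y = y0
    for _ in range(n_iter):
        # joint kernel weights for each stored sample (x_i, y_i)
        w = (gaussian_kernel((y - y_centers) / h_y)
             * gaussian_kernel(np.linalg.norm(X_centers - x, axis=1) / h_x))
        y = np.dot(w, y_centers) / (w.sum() + 1e-12)   # weighted average of y_i
    return y
```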
3 MINIMUM MODAL
REGRESSION
To realize the minimum modal regression, a
minimum KDE, which realizes the KDE with a
minimum support set, is proposed. Moreover, the
KDE should support incremental learning during its
service. To this end, we modify an online learning
method for kernel perceptrons on a budget and apply
the modified method for online learning of the KDE.
The existing kernel perceptrons on a budget maintain a minimized or constant-size support set by applying projection and pruning with replacement. In this study, we derive the conditions under which such an online learning algorithm for the KDE can be used in modal regression.

In the following sections, we use the relationship below to represent the pruning with replacement and the projection of kernels. We therefore choose a Gaussian kernel for $K(\cdot)$, which is a reproducing kernel, so that the following relationship, referred to as the kernel trick, holds:

$$K\!\left(\frac{\|\mathbf{X} - \mathbf{X}_j\|}{h}\right) = \left\langle k(\mathbf{X}, \cdot),\, k(\mathbf{X}_j, \cdot)\right\rangle, \qquad (6)$$

where $\langle \cdot, \cdot \rangle$ denotes the dot product in the reproducing kernel Hilbert space.
3.1 Minimum KDE
The KDE for modal regression should represent the
peak points of the distribution within a certain
number of kernels. Therefore, the modal regression
finds the point $\mathbf{X}_{MP} = [\mathbf{x}_{MP}^T, y_{MP}]^T$ which satisfies the following two conditions:

$$\nabla_{\mathbf{x}} \hat{p}(\mathbf{X}_{MP}) = \mathbf{0}, \qquad \nabla_{\mathbf{x}}^2 \hat{p}(\mathbf{X}_{MP}) < 0, \qquad (7)$$

where $\hat{p}(\mathbf{X})$ is defined in (3). Next, we describe $\hat{p}(\mathbf{X})$ as a dot product of the corresponding vector in Hilbert space and the input: $\hat{p}(\mathbf{X}) = \langle \hat{p}, k(\mathbf{X}, \cdot)\rangle$.

As $\hat{p}(\mathbf{X})$ is described by a linear combination of several Gaussian kernels, which are reproducing kernels, we can apply the kernel trick to calculate it. Thus, the KDE is also described by using the kernel method. Therefore, the learning method of the KDE is described as follows. Let us assume that
the KDE used in this study tends to realize a sparse allocation of kernels. Therefore, the KDE normally adds a new kernel when a new sample $(\mathbf{x}_t, y_t)$ is presented:

$$\hat{p}_t = \hat{p}_{t-1} + w_t\, k(\mathbf{X}_t, \cdot), \qquad S_t = S_{t-1} \cup \{t\}, \qquad (8)$$

where $S_t$ denotes the support set at the $t$-th round, $w_t = 1$, and $\hat{p}_t$ is

$$\hat{p}_t = \sum_{j} w_j\, k(\mathbf{X}_j, \cdot), \qquad (9)$$

where $w_j$ is the expansion coefficient of each kernel, whose default value is 1 and which satisfies $w_j \geq 0$. The KDE is not a regression model, so (9) does not contain $y_t$ as an output; instead, $y_t$ is one element of the centroid of a kernel. Equation (8) represents the same procedure as that of the original kernel distribution estimator. This strategy, however, continues to increase the size of the support set $S_t$ forever if the number of samples is unbounded. This is not suitable for an environment in which storage space is limited. $S_t$ should only contain the essential kernels needed to represent the distribution of inputs.
To maintain a small value of $|S_t|$, we apply an improved version of the kernel perceptron on a budget (Orabona et al., 2008) (He et al., 2012) (Yamauchi, 2013). If we apply their method to the KDE, the KDE attempts to apply the projection or replacement operation instead of appending a new kernel. Therefore, if a condition explained in a later section is satisfied, the KDE applies the replacement or projection operation. The replacement operation is

$$\hat{p}_t = \hat{p}_{t-1} - w_{i^*}\, k(\mathbf{X}_{i^*}, \cdot) + w_{i^*}\, P_{t-1}\!\left(k(\mathbf{X}_{i^*}, \cdot)\right) + k(\mathbf{X}_t, \cdot). \qquad (10)$$

On the other hand, the projection operation is

$$\hat{p}_t = \hat{p}_{t-1} + P_{t-1}\!\left(k(\mathbf{X}_t, \cdot)\right), \qquad (11)$$

where $P_{t-1}\!\left(k(\mathbf{X}_{i^*}, \cdot)\right)$ denotes the projected vector of the $i^*$-th kernel onto the space spanned by the remaining kernels. The projected vector $P_{t-1}\!\left(k(\mathbf{X}_{i^*}, \cdot)\right)$ is

$$P_{t-1}\!\left(k(\mathbf{X}_{i^*}, \cdot)\right) = \sum_{j \in S_t \setminus \{i^*\}} a_{i^* j}\, k(\mathbf{X}_j, \cdot). \qquad (12)$$

This means that the KDE removes the most ineffective ($i^*$-th) kernel after projecting it onto the space spanned by the remaining kernels. The most ineffective kernel is detected by estimating the approximated linear dependency:

$$i^* = \arg\min_i \delta_i^2, \qquad (13)$$

where

$$\delta_i^2 = \min_{\{a_{ij}\}} \left\| k(\mathbf{X}_i, \cdot) - \sum_{j \in S_t \setminus \{i\}} a_{ij}\, k(\mathbf{X}_j, \cdot) \right\|^2. \qquad (14)$$
The following two theorems derive the conditions under which the $\mathbf{X}_{MP}$'s of the peak points are maintained, even after the replacement or projection operations.

Theorem 1. Let $i^*$ be the most ineffective kernel in $S_{t-1}$, which is determined by (13). Let $\hat{p}'_t$ be

$$\hat{p}'_t = \hat{p}_{t-1} + w_{i^*}\left( P_{t-1}\!\left(k(\mathbf{X}_{i^*}, \cdot)\right) - k(\mathbf{X}_{i^*}, \cdot) \right).$$

Let $\mathbf{x}_{MP}$ be the point that satisfies

$$\left\langle \hat{p}_{t-1},\, \nabla_{\mathbf{x}} k(\mathbf{X}_{MP}, \cdot)\right\rangle = \mathbf{0}, \qquad \left\langle \hat{p}_{t-1},\, \nabla_{\mathbf{x}}^2 k(\mathbf{X}_{MP}, \cdot)\right\rangle < 0.$$

When $\delta_{i^*}^2 \approx 0$, we have

$$\left\langle \hat{p}'_t,\, \nabla_{\mathbf{x}} k(\mathbf{X}_{MP}, \cdot)\right\rangle \approx \mathbf{0}, \qquad \left\langle \hat{p}'_t,\, \nabla_{\mathbf{x}}^2 k(\mathbf{X}_{MP}, \cdot)\right\rangle < 0.$$
Theorem 2. Let $\mathbf{X}_t$ be a new input at the $t$-th round, and let $P_{t-1}\!\left(k(\mathbf{X}_t, \cdot)\right)$ be the projected vector of $k(\mathbf{X}_t, \cdot)$ onto the space spanned by the kernels at round $t-1$, with $\delta_t^2$ the corresponding residual, defined analogously to (14). Let $\hat{p}'_t$ be

$$\hat{p}'_t = \hat{p}_{t-1} + P_{t-1}\!\left(k(\mathbf{X}_t, \cdot)\right).$$

Let $\mathbf{x}_{MP}$ be a point that satisfies the following condition:

$$\left\langle \hat{p}_{t-1},\, \nabla_{\mathbf{x}} k(\mathbf{X}_{MP}, \cdot)\right\rangle = \mathbf{0}, \qquad \left\langle \hat{p}_{t-1},\, \nabla_{\mathbf{x}}^2 k(\mathbf{X}_{MP}, \cdot)\right\rangle < 0.$$

When $\delta_t^2 \approx 0$, we have

$$\left\langle \hat{p}'_t,\, \nabla_{\mathbf{x}} k(\mathbf{X}_{MP}, \cdot)\right\rangle \approx \nabla_{\mathbf{x}} K\!\left(\frac{\|\mathbf{X}_{MP} - \mathbf{X}_t\|}{h}\right),$$
$$\left\langle \hat{p}'_t,\, \nabla_{\mathbf{x}}^2 k(\mathbf{X}_{MP}, \cdot)\right\rangle \approx \nabla_{\mathbf{x}}^2 \hat{p}_{t-1}(\mathbf{X}_{MP}) + \nabla_{\mathbf{x}}^2 K\!\left(\frac{\|\mathbf{X}_{MP} - \mathbf{X}_t\|}{h}\right).$$
The proofs of Theorems 1 and 2 are described
in the appendix.
Theorem 2 demonstrates that if $\mathbf{X}_{MP}$ is far from $\mathbf{X}_t$,

$$\left\langle \hat{p}'_t,\, \nabla_{\mathbf{x}} k(\mathbf{X}_{MP}, \cdot)\right\rangle \approx \nabla_{\mathbf{x}} K\!\left(\frac{\|\mathbf{X}_{MP} - \mathbf{X}_t\|}{h}\right) \approx \mathbf{0}, \qquad (15)$$

$$\left\langle \hat{p}'_t,\, \nabla_{\mathbf{x}}^2 k(\mathbf{X}_{MP}, \cdot)\right\rangle \approx \nabla_{\mathbf{x}}^2 \hat{p}_{t-1}(\mathbf{X}_{MP}) + \nabla_{\mathbf{x}}^2 K\!\left(\frac{\|\mathbf{X}_{MP} - \mathbf{X}_t\|}{h}\right) \approx \nabla_{\mathbf{x}}^2 \hat{p}_{t-1}(\mathbf{X}_{MP}) < 0. \qquad (16)$$

That is, the existing peak points are preserved.
From these theorems, the minimum KDE can be
described in Algorithm 1.
Algorithm 1: Learning algorithm for the Minimum KDE.
Receive $(\mathbf{X}_t, y_t)$.
Detect the most ineffective kernel $i^*$ by using (13) (the lightweight version is (19)).
If $\delta_t^2 < \theta$:
    $\hat{p}_t = \hat{p}_{t-1} + P_{t-1}\!\left(k(\mathbf{X}_t, \cdot)\right)$, $\quad S_t = S_{t-1}$   (projection)
else if $\delta_{i^*}^2 < \theta$:
    $\hat{p}_t = \hat{p}_{t-1} - w_{i^*}\, k(\mathbf{X}_{i^*}, \cdot) + w_{i^*}\, P_{t-1}\!\left(k(\mathbf{X}_{i^*}, \cdot)\right) + k(\mathbf{X}_t, \cdot)$
    $S_t = \left(S_{t-1} \setminus \{i^*\}\right) \cup \{t\}$   (replacement)
else:
    $\hat{p}_t = \hat{p}_{t-1} + k(\mathbf{X}_t, \cdot)$, $\quad S_t = S_{t-1} \cup \{t\}$   (append)
endif
For all $i$:
    if $w_i < 0$ then set $w_i = 0$   (to maintain $w_i \geq 0$)
endfor
$t \leftarrow t + 1$
Return $\hat{p}_t$

Here, $\theta$ denotes the threshold on the approximated linear dependency.
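A compact sketch of one possible implementation of Algorithm 1 is given below. The class layout, the threshold name theta, and the coefficient bookkeeping are our own assumptions, not the authors' reference code.

```python
import numpy as np

class MinimumKDE:
    """Sketch of Algorithm 1 (minimum KDE) under our own naming assumptions."""

    def __init__(self, h=1.0, theta=0.5):
        self.h, self.theta = h, theta
        self.C = []          # kernel centroids X_i = [x_i..., y_i]
        self.w = []          # expansion coefficients w_i

    def _k(self, a, b):
        return float(np.exp(-0.5 * (np.linalg.norm(np.asarray(a) - np.asarray(b)) / self.h) ** 2))

    def _project(self, target, others):
        """Coefficients a and residual delta^2 of projecting k(target,.) onto span{k(c,.)}."""
        G = np.array([[self._k(ci, cj) for cj in others] for ci in others])
        g = np.array([self._k(ci, target) for ci in others])
        a = np.linalg.solve(G + 1e-8 * np.eye(len(others)), g)
        return a, self._k(target, target) - g @ a

    def learn(self, X_t):
        if len(self.C) < 2:                              # too few kernels: just append
            self.C.append(np.asarray(X_t)); self.w.append(1.0); return
        a_t, d_t = self._project(X_t, self.C)            # ALD of the new sample
        deltas = [self._project(self.C[i], self.C[:i] + self.C[i + 1:])
                  for i in range(len(self.C))]
        i_star = int(np.argmin([d for _, d in deltas]))  # most ineffective kernel (13)
        a_i, d_i = deltas[i_star]
        w = np.array(self.w)
        if d_t < self.theta:                             # projection (11)
            w += a_t
        elif d_i < self.theta:                           # replacement (10)
            w_i = w[i_star]
            w = np.delete(w, i_star) + w_i * a_i
            del self.C[i_star]
            self.C.append(np.asarray(X_t)); w = np.append(w, 1.0)
        else:                                            # append (8)
            self.C.append(np.asarray(X_t)); w = np.append(w, 1.0)
        self.w = list(np.maximum(w, 0.0))                # keep w_i >= 0
```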
3.2 Modified Partial Mean Shift
The minimum KDE described in the previous section maintains the minimum size of the support set by applying projection or pruning with replacement. Through these processes, the expansion coefficient $w_i$ of each kernel takes a value that represents the target distribution. For example, if $w_i = 2$, the $i$-th kernel shares the duty of two kernels. Therefore, we have also improved the PMS method to adjust the solution according to the expansion coefficients, as follows:

$$y^{new} = \frac{\displaystyle\sum_{i} w_i\, y_i\, K\!\left(\frac{y^{old} - y_i}{h_y}\right) K\!\left(\frac{\|\mathbf{x} - \mathbf{x}_i\|}{h_x}\right)}{\displaystyle\sum_{j} w_j\, K\!\left(\frac{y^{old} - y_j}{h_y}\right) K\!\left(\frac{\|\mathbf{x} - \mathbf{x}_j\|}{h_x}\right)}. \qquad (17)$$
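Compared with the earlier PMS sketch, the weighted update (17) only scales each kernel's contribution by its expansion coefficient w_i, as the following sketch illustrates (names are our own assumptions).

```python
import numpy as np

def weighted_pms_step(x, y, C, w, h_x=0.25, h_y=0.25):
    """One update of Eq. (17). C holds centroids [x_i..., y_i]; w holds w_i."""
    C = np.asarray(C); w = np.asarray(w)
    x_c, y_c = C[:, :-1], C[:, -1]                       # split centroids into x and y parts
    g = (w
         * np.exp(-0.5 * ((y - y_c) / h_y) ** 2)
         * np.exp(-0.5 * (np.linalg.norm(x_c - x, axis=1) / h_x) ** 2))
    return float(np.dot(g, y_c) / (g.sum() + 1e-12))
```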
3.3 Lightweight Learning Algorithm
In Section 3.1, we presented the minimum KDE. The algorithm includes the calculation of the approximated linear dependency (ALD) to detect the most ineffective kernel, which has a computational cost of $O(|S_t|^3)$. This cost is too large for executing the minimum KDE on small devices. To overcome this difficulty, we need a lightweight version of the minimum KDE.
The lightweight KDE does not use (13) to detect the most ineffective kernel. Instead, the proposed algorithm uses a slightly improved version of a lightweight algorithm from our previous study (Yamauchi, 2014). The proposed method chooses as most ineffective the kernel with the largest value of

$$V_j = \sum_{i \in S_t \setminus \{j\}} K\!\left(\frac{\|\mathbf{X}_j - \mathbf{X}_i\|}{h}\right). \qquad (18)$$

Note that if a kernel is located in the neighborhood of other kernels, $V_j$ becomes large, and there is a high possibility that such a kernel can be represented by a linear combination of the other kernels. Therefore, instead of applying (13),

$$i^* = \arg\max_j V_j \qquad (19)$$

is used.
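The lightweight criterion (18)-(19) requires only one pass over the pairwise kernel values, as the following sketch illustrates (the function name is assumed).

```python
import numpy as np

def most_ineffective_kernel(C, h=1.0):
    """Return i* = argmax_j V_j with V_j as in Eq. (18)."""
    C = np.asarray(C)
    d = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    G = np.exp(-0.5 * (d / h) ** 2)
    V = G.sum(axis=1) - np.diag(G)       # exclude the i = j term
    return int(np.argmax(V))
```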
Algorithm 2: Minimum modal regression.
If a new learning sample $\mathbf{X}_t = [\mathbf{x}_t^T, y_t]^T$ is given:
    Learn the minimum KDE by Algorithm 1.
endif
If a new query $\mathbf{x}_p$ is given:
    For ($i = 0$; $i < M$; $i$++)
        Select one kernel index $k \in \mathcal{N}(\mathbf{x}_p)$ (see (20)) randomly.
        Set the initial $y$ as the $y$-element of $\mathbf{X}_k$.
        Set the initial $\mathbf{X}$ as $\mathbf{X} = [\mathbf{x}_p^T, y]^T$.
        For ($r = 0$; $r < R$; $r$++)
            Update $y$ by using (17).
            Reset $\mathbf{X}$ as $\mathbf{X} = [\mathbf{x}_p^T, y]^T$.
        endfor
        $Ans \leftarrow Ans \cup \{y\}$
    endfor
endif
return $Ans$,

where $\mathcal{N}(\mathbf{x}_p)$ denotes the set of kernels defined by the following equation:

$$\mathcal{N}(\mathbf{x}_p) = \left\{ j \;\middle|\; K\!\left(\frac{\|\mathbf{x}_p - \mathbf{x}_j\|}{h_x}\right) > s \right\}, \qquad (20)$$

where $s$ denotes a threshold and we set $s = 0.1$.
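The query branch of Algorithm 2 can be sketched as follows. M, R, and s follow the description above, while the function names and random-number handling are our own assumptions; the weighted step repeats Eq. (17) so that the sketch is self-contained.

```python
import numpy as np

def gk(z):
    return np.exp(-0.5 * np.asarray(z) ** 2)

def pms_step(x_p, y, C, w, h_x, h_y):
    """One weighted PMS update, Eq. (17)."""
    C = np.asarray(C); w = np.asarray(w)
    x_c, y_c = C[:, :-1], C[:, -1]
    g = w * gk((y - y_c) / h_y) * gk(np.linalg.norm(x_c - x_p, axis=1) / h_x)
    return float(np.dot(g, y_c) / (g.sum() + 1e-12))

def query_modes(x_p, C, w, h_x=0.25, h_y=0.25, M=10, R=10, s=0.1, rng=None):
    """Query branch of Algorithm 2: return M candidate modes for x_p."""
    rng = rng or np.random.default_rng()
    C = np.asarray(C)
    near = np.where(gk(np.linalg.norm(C[:, :-1] - x_p, axis=1) / h_x) > s)[0]  # Eq. (20)
    answers = []
    for _ in range(M):
        k = int(rng.choice(near))      # pick one nearby kernel at random
        y = float(C[k, -1])            # initialize y from that kernel's centroid
        for _ in range(R):
            y = pms_step(x_p, y, C, w, h_x, h_y)
        answers.append(y)
    return answers
```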
4 EXPERIMENT
In this section, some preliminary results of the
proposed method are shown.
4.1 Performance for Synthetic Dataset
We tested the proposed method with two synthetic
datasets and evaluated its performance.
4.1.1 Third-Order Function
The first dataset is generated by $x = y^3 + 4y + n$, where $n$ is a uniform random value in the interval $[-1, 1]$. By varying $y$ over the interval $[-3, 3]$, 8000 data points were generated. The dataset was presented to the minimum KDE, and the minimum modal regression predicted the values of $y$ from the value of each $x$. The number of repeats for the prediction (the parameter $R$ in Algorithm 2) was 10. The hyperparameters used were $h_x = 0.25$ and $h_y = 0.25$. Ideally, the evaluation would use the mean square error between the desired and predicted values of $y$.
However, the evaluation of multivalued outputs is complex, so we evaluated the proposed method as follows. Instead of making a direct comparison between the actual and predicted values of $y$, we calculated the corresponding $\hat{x} = \hat{y}^3 + 4\hat{y}$ and compared the actual $x$ with $\hat{x}$. The difference was evaluated by the averaged square error $E[(x - \hat{x})^2]$.
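The first synthetic dataset and the error measure $E[(x - \hat{x})^2]$ can be reproduced roughly as in the following sketch; the random seed and variable names are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(-3.0, 3.0, size=8000)
x = y ** 3 + 4.0 * y + rng.uniform(-1.0, 1.0, size=8000)    # x = y^3 + 4y + n

def inverse_error(x_true, y_pred):
    """Averaged square error E[(x - x_hat)^2] with x_hat = y_hat^3 + 4*y_hat."""
    x_hat = y_pred ** 3 + 4.0 * y_pred
    return float(np.mean((x_true - x_hat) ** 2))
```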
Figures 1, 2, and 3 show the values of y predicted by the proposed method with $\theta = 0.1$, $0.5$, and $0.9$, respectively. From these figures, we can see that when the threshold value $\theta$ is small, the predicted values form a smooth curve. The estimated errors and the numbers of kernels are listed in Table 1. From this table, the estimated error of the modal regression is reduced when the threshold value is small. However, the number of kernels increases when the threshold value is small. Therefore, there is a tradeoff between the error and the number of kernels.
Table 1: Number of kernels and the averaged error for the corresponding x for each threshold value.

    Threshold ($\theta$)      0.9       0.5       0.1
    No. of kernels            124       188       292
    $E[(x - \hat{x})^2]$      0.018     0.010     0.0063
Figure 1: The predicted values from the proposed method with $\theta = 0.1$. The x-axis denotes x and the y-axis denotes the predicted value.

Figure 2: The predicted values from the proposed method with $\theta = 0.5$. The x-axis denotes x and the y-axis denotes the predicted value.

Figure 3: The predicted values from the proposed method with $\theta = 0.9$. The x-axis denotes x and the y-axis denotes the predicted value.
4.1.2 Helix Function
The second dataset is a helix dataset. Using this dataset, we checked whether our method can approximate more complex outputs. The dataset is described as follows:

$$x_t = a_t \cos\varphi_t, \qquad y_t = a_t \sin\varphi_t, \qquad z_t = b_t \varphi_t,$$

where $\varphi_t = 2\pi t$. We set $a_t = 2 + n_t$, where $n_t$ denotes a uniform random value in the interval $[-0.1, 0.1]$, and $b_t = 3 + n_t$. By increasing $t$ gradually from 0 to 9, 3000 instances were generated. The dataset has a spiral shape. The hyperparameters used were $h_x = 2.0$ and $h_y = 2.0$. Figures 4 and 5 show the results for thresholds of $\theta = 0.1$ and $0.95$. In the case of threshold $\theta = 0.1$, 157 kernels were generated. On the other hand, in the case of threshold $\theta = 0.95$, 45 kernels were generated. In both cases, the proposed system regenerated almost the same correct multivalued outputs.
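A sketch of the helix data generation, under the parameterization $\varphi_t = 2\pi t$ given above (the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 9.0, 3000)
phi = 2.0 * np.pi * t
n = rng.uniform(-0.1, 0.1, size=t.size)
a, b = 2.0 + n, 3.0 + n
x, y, z = a * np.cos(phi), a * np.sin(phi), b * phi    # spiral-shaped dataset
```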
Figure 4: The output of the proposed method for the helix data with threshold $\theta = 0.1$.

Figure 5: The output of the proposed method for the helix data with threshold $\theta = 0.95$.
4.2 Performance for Real Dataset
We also tested the proposed method with a real dataset: data from the network journey time and traffic flow on highways in England (http://tris.highwaysengland.co.uk/detail/trafficflowdata). We used the traffic flow data from January 2006 at MIDIAS Site 1030 (LM205) and made the proposed system learn the pairwise data between the total carriageway flow and the total flow of vehicles above 11.6 m. The dataset records data every 15 minutes. The four total carriageway flows and the corresponding speed flows within every 45 minutes are almost the same; therefore, we picked the first of the four records for each corresponding 45-minute period. By this procedure, we reduced the dataset size to one quarter (8580 instances). Moreover, the speed data and flow data were normalized by dividing them by 140 and 1400, respectively. The hyperparameters used were $h_x = 0.15$ and $h_y = 0.2$.
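The preprocessing described above (keeping one record out of every four and normalizing the two series) can be sketched as follows; the function and argument names are assumptions, since the raw file format is not specified here.

```python
import numpy as np

def preprocess(speed, flow):
    """Keep every 4th 15-minute record and normalize by 140 and 1400."""
    speed = np.asarray(speed)[::4] / 140.0
    flow = np.asarray(flow)[::4] / 1400.0
    return speed, flow
```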
Figure 6: The predicted outputs from the proposed method with $\theta = 0.1$. The generated kernel size was 57.

Figure 7: The predicted outputs from the proposed method with $\theta = 0.5$. The generated kernel size was 29.
The predicted outputs from the proposed method with $\theta = 0.1$ and $0.5$ are shown in Figures 6 and 7. The kernel sizes were 57 and 29, respectively. We can see that the proposed method predicted multiple modes in the speed-flow data.
5 CONCLUSION
In this paper, we proposed a new method for modal regression. When a new sample is given while forming the KDE, it may be projected onto the existing kernel space, it may replace an existing kernel, or a new kernel may be generated with the given sample as its center, depending on the threshold and on the linear dependencies of each kernel with respect to the existing kernel space. The equation of the PMS method is also modified accordingly by adding weights to the kernels. The experimental results show that the proposed method approximates multivalued functions properly and greatly reduces the complexity compared with the case where a kernel is allocated to each sample.
REFERENCES
Einbeck, J. & Tutz, G. (2006), 'Modelling beyond regression functions: an application of multimodal regression to speed-flow data', Applied Statistics 55(4), 461–475.

He, W. & Wu, S. (2012), 'A kernel-based perceptron with dynamic memory', Neural Networks 25, 105–113.

Orabona, F., Keshet, J. & Caputo, B. (2008), The projectron: a bounded kernel-based perceptron, in 'ICML 2008', pp. 720–727.

Parzen, E. (1962), 'On estimation of a probability density function and mode', Annals of Mathematical Statistics 33(3), 1065–1076.

Sasaki, H., Ono, Y. & Sugiyama, M. (2016), Modal regression via direct log-density derivative estimation, in A. Hirose, S. Ozawa, K. Doya, K. Ikeda, M. Lee & D. Liu, eds, 'Neural Information Processing: 23rd International Conference, ICONIP 2016', Vol. Part II, Springer-Verlag.

Yamauchi, K. (2013), An importance weighted projection method for incremental learning under unstationary environments, in 'IJCNN 2013: The International Joint Conference on Neural Networks 2013', The Institute of Electrical and Electronics Engineers, Inc., New York, New York, pp. 1–9.

Yamauchi, K. (2014), 'Incremental learning on a budget and its application to quick maximum power point tracking of photovoltaic systems', Journal of Advanced Computational Intelligence and Intelligent Informatics 18(4), 682–696.
APPENDIX
The proof of Theorem 1 is as follows.

Proof 1. Let $\epsilon_{i^*}(\mathbf{X}) = \left\langle P_{t-1}\!\left(k(\mathbf{X}_{i^*}, \cdot)\right) - k(\mathbf{X}_{i^*}, \cdot),\, k(\mathbf{X}, \cdot)\right\rangle$ denote the residual of the projection. From $\delta_{i^*}^2 \approx 0$, we obtain $\epsilon_{i^*}(\mathbf{X}) \approx 0$ and $\nabla_{\mathbf{x}} \epsilon_{i^*}(\mathbf{X}) \approx \mathbf{0}$. Therefore, we also have $\nabla_{\mathbf{x}}^2 \epsilon_{i^*}(\mathbf{X}) \approx 0$.

From the pruning-and-replacement operation,

$$\left\langle \hat{p}'_t,\, \nabla_{\mathbf{x}} k(\mathbf{X}_{MP}, \cdot)\right\rangle = \left\langle \hat{p}_{t-1},\, \nabla_{\mathbf{x}} k(\mathbf{X}_{MP}, \cdot)\right\rangle + w_{i^*}\, \nabla_{\mathbf{x}} \epsilon_{i^*}(\mathbf{X}_{MP}) \approx \mathbf{0},$$

$$\left\langle \hat{p}'_t,\, \nabla_{\mathbf{x}}^2 k(\mathbf{X}_{MP}, \cdot)\right\rangle = \left\langle \hat{p}_{t-1},\, \nabla_{\mathbf{x}}^2 k(\mathbf{X}_{MP}, \cdot)\right\rangle + w_{i^*}\, \nabla_{\mathbf{x}}^2 \epsilon_{i^*}(\mathbf{X}_{MP}) \approx \left\langle \hat{p}_{t-1},\, \nabla_{\mathbf{x}}^2 k(\mathbf{X}_{MP}, \cdot)\right\rangle < 0.$$

This concludes the proof.

The proof of Theorem 2 is as follows.

Proof 2. From $\delta_t^2 \approx 0$, we obtain $P_{t-1}\!\left(k(\mathbf{X}_t, \cdot)\right) \approx k(\mathbf{X}_t, \cdot)$, so the corresponding gradient and second-derivative residuals also vanish. From the projection operation, we therefore have

$$\hat{p}'_t \approx \hat{p}_{t-1} + k(\mathbf{X}_t, \cdot).$$

From this relation, we obtain the following two equations:

$$\left\langle \hat{p}'_t,\, \nabla_{\mathbf{x}} k(\mathbf{X}_{MP}, \cdot)\right\rangle \approx \left\langle \hat{p}_{t-1},\, \nabla_{\mathbf{x}} k(\mathbf{X}_{MP}, \cdot)\right\rangle + \nabla_{\mathbf{x}} K\!\left(\frac{\|\mathbf{X}_{MP} - \mathbf{X}_t\|}{h}\right) = \nabla_{\mathbf{x}} K\!\left(\frac{\|\mathbf{X}_{MP} - \mathbf{X}_t\|}{h}\right),$$

$$\left\langle \hat{p}'_t,\, \nabla_{\mathbf{x}}^2 k(\mathbf{X}_{MP}, \cdot)\right\rangle \approx \left\langle \hat{p}_{t-1},\, \nabla_{\mathbf{x}}^2 k(\mathbf{X}_{MP}, \cdot)\right\rangle + \nabla_{\mathbf{x}}^2 K\!\left(\frac{\|\mathbf{X}_{MP} - \mathbf{X}_t\|}{h}\right) = \nabla_{\mathbf{x}}^2 \hat{p}_{t-1}(\mathbf{X}_{MP}) + \nabla_{\mathbf{x}}^2 K\!\left(\frac{\|\mathbf{X}_{MP} - \mathbf{X}_t\|}{h}\right).$$

This concludes the proof.