Domain Adaptation Transfer Learning by SVM Subject to a
Maximum-Mean-Discrepancy-like Constraint
Xiaoyi Chen and Régis Lengellé
LM2S, Institut Charles Delaunay, UMR CNRS 6281, University of Technology of Troyes,
12 rue Marie Curie, CS 42060 - 10004, Troyes Cedex, France
{xiaoyi.chen, regis.lengelle}@utt.fr
Keywords:
Transfer Learning, Kernel, SVM, Maximum Mean Discrepancy.
Abstract:
This paper is a contribution to solving the domain adaptation problem where no labeled target data is available.
A new SVM approach is proposed by imposing a zero-valued Maximum Mean Discrepancy-like constraint.
This heuristic allows us to expect a good similarity between source and target data, after projection onto an
efficient subspace of a Reproducing Kernel Hilbert Space. Accordingly, the classifier will perform well on
source and target data. We show that this constraint does not modify the quadratic nature of the optimization
problem encountered in classic SVM, so standard quadratic optimization tools can be used. Experimental
results demonstrate the competitiveness and efficiency of our method.
1 INTRODUCTION
Recently, Transfer Learning has received much attention in the machine learning community. First formally defined in (Pan and Yang, 2010), the aim of Transfer Learning is to learn a well-performing classifier or regressor in a new domain with the help of previous knowledge issued from different but related domains; the new domain is designated as the target while the domains of previous knowledge are designated as sources. In this paper, we propose to solve the transfer learning problem where no labeled target data is available. According to the taxonomy given in (Pan and Yang, 2010), our proposed method belongs to transductive transfer learning, where the source and the target share the same label space but differ in their feature spaces. Marginal distributions, conditional distributions and priors might differ. This problem is also known as domain adaptation.
There is a variety of methods for transfer learning. In this paper, we propose the use of a Support Vector Machine (SVM) subject to a zero-valued Maximum Mean Discrepancy (MMD)-like constraint. We choose a zero-valued MMD as the constraint because MMD is a non-parametric measure of the distance between two distributions (Dudley, 2002) and it can easily be kernelized (Gretton et al., 2012). The combination of MMD and SVM is therefore promising. SVM is a widely used binary classification method, well known for its high generalization ability and for the simplicity with which it handles non-linearly separable data sets through the kernel trick. Our method keeps these advantages while performing well in the transfer learning context. As shown in section 3, the optimization problem remains convex and can be directly implemented using standard quadratic optimization tools. Adding a MMD-like constraint is a heuristic that allows us to expect that source and target data will become similar in some selected subspace of the feature space. Therefore, the separating hyperplane found by SVM for source data can perform well for target data. The experimental results prove the effectiveness of our idea.
This paper is organized as follows: in section 2,
we give a short summary of related work; then we
present our method in section 3 together with the op-
timization solution to the problem (in section 4); we
prove the effectiveness of the proposed method on
synthetic and real data sets in section 5. Finally, we
conclude this paper and suggest perspectives.
2 RELATED WORK
Because the aim of our work is to perform MMD-like
SVM based transductive transfer learning, we first re-
view the general transductive transfer learning prob-
lem, followed by a presentation of SVM based trans-
fer learning and MMD based transfer learning. Inter-
ested readers are referred to (Pan and Yang, 2010) and
(Jiang, 2008) for more general transfer learning and
domain adaptation surveys. For a more recent survey
on domain adaptation, readers are referred to (Patel
et al., 2015).
Transductive transfer learning refers to a shared
label space but different source and target feature
spaces with different marginal and/or conditional dis-
tributions (Pan and Yang, 2010). Taking full advantage of the source information is the key to improving the learning of the target task. When target labels are not available, typical methods include instance weighting (Huang et al., 2006), under the necessary assumption of identical conditional distributions. Other authors propose structural correspondence learning for sentiment classification (Blitzer et al., 2007).
SVM based transfer learning adapts the traditional
SVM to the transfer learning context. To the best of
our knowledge, there are five principal kinds of SVM
based transfer learning methods:
- transferring a common parameter ($w_{common} = w_{target} - w_{specific}$) (Zhang et al., 2009)
- iteratively using SVM to label target domain data (Bruzzone and Marconcini, 2010)
- reweighting the penalty term of SVM (Liang et al., 2014)
- adding an extra regularization term to the standard SVM (Huang et al., 2012), (Tan et al., 2012)
- SVM integrating a transformed alignment constraint that combines knowledge of different natures (Li et al., 2011)
MMD based transfer learning combines MMD, which will be presented later in this paper, with standard learning methods to perform transfer. To the best of our knowledge, MMD is used as a regularization term of the objective function. The principal idea is to deal with the trade-off between the classification performance on source data and the similarity between source and target. The interested reader can refer to SVM-based transfer learning classification in (Quanz and Huan, 2009), multiple kernel learning in (Ren et al., 2010), multi-task clustering in (Zhang and Zhou, 2012), maximum margin classification in (Yang et al., 2012), feature extraction in (Pan et al., 2011), (Uguroglu and Carbonell, 2011), etc.
3 PRESENTATION OF THE MMD
CONSTRAINED SVM METHOD
In this section, we present our MMD constrained
SVM transfer learning method. We first briefly
review the basic theoretical foundations of MMD and
its kernelized version.
3.1 Review of Basic Theoretical
Foundations
3.1.1 Maximum Mean Discrepancy
Maximum Mean Discrepancy (MMD) is a non-
parametric distance measure which can be used to
evaluate the difference between two distributions.
The definition of MMD is:
Definition 1 (Maximum Mean Discrepancy (Fortet
and Mourier, 1953)).
Let $\mathcal{F}$ be a class of functions $f : \mathcal{X} \to \mathbb{R}$ and $p$, $q$ two Borel probability measures defined on $\mathcal{X}$. The Maximum Mean Discrepancy (MMD) between $p$ and $q$ is defined as:
$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_p[f(x)] - \mathbb{E}_q[f(y)] \right)$$
As a "distance measure" between two distributions, MMD has the following property:
Theorem 1 (Dudley, 1984).
Let $(\mathcal{X}, d)$ be a metric space and $p$, $q$ two Borel probability measures defined on $\mathcal{X}$. Then $p = q$ iff $\mathbb{E}_p[f(x)] = \mathbb{E}_q[f(y)]$ for any function $f \in C(\mathcal{X})$, where $C(\mathcal{X})$ is the space of continuous bounded functions and $x$, $y$ are random variables drawn from distributions $p$ and $q$ respectively.
Thanks to the works of Smola (Smola, 2006) and Gretton et al. (Gretton et al., 2012), distributions can be embedded in a Reproducing Kernel Hilbert Space (RKHS), where a distribution can be considered as some mean element of this RKHS ($\mathcal{H}$): $\mu[P_x] = \mathbb{E}_x[k(x, \cdot)]$ (Smola et al., 2007). Accordingly, MMD can be evaluated as $\mathrm{MMD}[\mathcal{F}, p, q] = \|\mu_p - \mu_q\|_{\mathcal{H}}$, where $\mu_r$ stands for $\mathbb{E}_r[k(x, \cdot)]$ and $k(x, \cdot)$ is the representation of $x$ in the RKHS.
As a simple deduction, the squared MMD is:
$$\mathrm{MMD}^2[\mathcal{F}, p, q] = \|\mu_p - \mu_q\|^2_{\mathcal{H}} = \mathbb{E}_{p,p}[k(x, x')] - 2\,\mathbb{E}_{p,q}[k(x, y)] + \mathbb{E}_{q,q}[k(y, y')]$$
Here, $x$ and $x'$ are independent observations drawn from distribution $p$, $y$ and $y'$ are independent observations from distribution $q$, and $k$ designates a universal kernel function (which means that $k(x, \cdot)$ is continuous for all $x$ and the RKHS induced by $k$ is dense in $C(\mathcal{X})$).
Theorem 2 (Steinwart (Steinwart, 2002) and Smola (Smola, 2006)).
$\mathrm{MMD}[\mathcal{F}, p, q] = 0$ iff $p = q$ when $\mathcal{F} = \{ f : \|f\|_{\mathcal{H}} \le 1 \}$, provided that $\mathcal{H}$ is universal.
An unbiased estimate of the kernelized squared MMD is proposed in (Serfling, 2009):
$$\widehat{\mathrm{MMD}}^2_u[\mathcal{F}, X, Y] = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \neq i}^{m} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i}^{n} k(y_i, y_j) - \frac{2}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j)$$
where $x_i$, $i = 1, \dots, m$ and $y_i$, $i = 1, \dots, n$ are iid examples drawn from $p$ and $q$ respectively.
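To make this estimator concrete, here is a minimal numerical sketch (our own illustration; the Gaussian kernel and its bandwidth are assumptions, not prescribed above) of the unbiased squared MMD estimate:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), computed for all pairs of rows
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    # Unbiased estimate of MMD^2 between samples X ~ p and Y ~ q
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # within-sample sums exclude the diagonal terms (i == j)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    term_xy = 2 * Kxy.sum() / (m * n)
    return term_x + term_y - term_xy

# Example: two samples from the same distribution give an estimate close to 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = rng.normal(size=(300, 2))
print(mmd2_unbiased(X, Y, sigma=1.0))
```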
SVM aims to find the hyperplane that maximally separates two classes. The commonly used formulation is:
$$\begin{aligned}
\min \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \varepsilon_i \\
\text{s.t.} \quad & \varepsilon_i \ge 0 \\
& y_i (w^T \phi(x_i) + b) \ge 1 - \varepsilon_i, \quad i = 1, \dots, n
\end{aligned}$$
where, as usual, $w$ is the hyperplane parameter, $\varepsilon_i$ is the error term associated with observation $i$, $C$ is the trade-off parameter between the margin term and the classification error, $\phi(x_i)$ is the kernel representation of $x_i$, $y_i$ is the label of $x_i$ and $b$ is the bias.
3.2 MMD Constrained SVM Transfer
Learning
We now propose a heuristic that constrains the hyperplane that maximizes the margin between the source classes (and minimizes the corresponding classification error) to lie in a subspace where source and target distributions are as similar as possible. Another assumption is that the conditional probability distributions of the labels are also similar (a hypothesis that cannot be verified because the target labels are supposed unknown). Accordingly, we can expect the classifier to perform well, both on source and target data. The heuristic used to maximize the similarity between source and target is to satisfy the proposed constraint:
$$\langle \mu_{X_s} - \mu_{X_t}, w \rangle_{\mathcal{H}} = 0$$
where $\mu_{X_s}$ ($\mu_{X_t}$) is the sample mean of the source (target) data in $\mathcal{H}$ and can be estimated by $\mu_{X_s} = \frac{1}{n_s} \sum \phi(X_s)$ ($\mu_{X_t} = \frac{1}{n_t} \sum \phi(X_t)$).
By imposing $\langle \mu_{X_s} - \mu_{X_t}, w \rangle_{\mathcal{H}} = 0$, we expect that source and target data will be similar in $\mathcal{H}$.
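As a sanity check of this heuristic, the following sketch (our own illustration, with a linear kernel so that the feature map is the identity) shows that the constraint then simply forces $w$ to be orthogonal to the difference of the empirical means:

```python
import numpy as np

rng = np.random.default_rng(1)
Xs = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # source sample
Xt = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(80, 2))    # target sample

# With a linear kernel, mu_Xs and mu_Xt are just the empirical means,
# and <mu_Xs - mu_Xt, w>_H = 0 reads (mean(Xs) - mean(Xt)) . w = 0.
d = Xs.mean(axis=0) - Xt.mean(axis=0)
w = np.array([-d[1], d[0]])   # any vector orthogonal to d satisfies the constraint
print(np.dot(d, w))           # ~0: w lies in the admissible subspace
```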
The SVM problem can now be formulated as follows:
$$\begin{aligned}
\min \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \varepsilon_i \\
\text{s.t.} \quad & \langle \mu_{X_s} - \mu_{X_t}, w \rangle_{\mathcal{H}} = 0 \\
& \varepsilon_i \ge 0 \\
& y_i (w^T \phi(x_i) + b) \ge 1 - \varepsilon_i, \quad i = 1, \dots, n
\end{aligned} \qquad (1)$$
We use a MMD-like constraint instead of a MMD regularization term in order to guarantee the transfer ability. In (Quanz and Huan, 2009), Quanz and Huan suggest to solve the problem:
$$\min \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \varepsilon_i + \lambda \, \big\| \langle \mu_{X_s} - \mu_{X_t}, w \rangle_{\mathcal{H}} \big\|^2$$
In that case, depending on the finite value of the regularization parameter $\lambda$, we may sometimes sacrifice this similarity to achieve a high classification accuracy on the source only. Furthermore, during the optimization process, their method requires the calculation of the inverse of a matrix, which slows down the algorithm and causes inaccuracy, while this is avoided in our work.
4 DUAL FORM OF THE
OPTIMIZATION PROBLEM
In order to solve the above primal problem, we use the representer theorem (Schölkopf et al., 2001). The optimal solution $w$ of Equation 1 can be expressed as:
$$w = \sum_{k=1}^{n_s} \beta^s_k \phi(x^s_k) + \sum_{l=1}^{n_t} \beta^t_l \phi(x^t_l) \qquad (2)$$
where $\beta^s_k$ and $\beta^t_l$ are the unknowns. Incorporating this expression into the constraint, we obtain:
$$\begin{aligned}
\langle \mu_{X_s} - \mu_{X_t}, w \rangle_{\mathcal{H}}
&= \left\langle \frac{1}{n_s} \sum_{i=1}^{n_s} \phi(x_i) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(x_j), \; \sum_{k=1}^{n_s} \beta^s_k \phi(x^s_k) + \sum_{l=1}^{n_t} \beta^t_l \phi(x^t_l) \right\rangle_{\mathcal{H}} \\
&= \frac{1}{n_s} \sum_{k=1}^{n_s} \beta^s_k \sum_{i=1}^{n_s} \langle \phi(x_i), \phi(x_k) \rangle_{\mathcal{H}} - \frac{1}{n_t} \sum_{k=1}^{n_s} \beta^s_k \sum_{j=1}^{n_t} \langle \phi(x_j), \phi(x_k) \rangle_{\mathcal{H}} \\
&\quad + \frac{1}{n_s} \sum_{l=1}^{n_t} \beta^t_l \sum_{i=1}^{n_s} \langle \phi(x_i), \phi(x_l) \rangle_{\mathcal{H}} - \frac{1}{n_t} \sum_{l=1}^{n_t} \beta^t_l \sum_{j=1}^{n_t} \langle \phi(x_j), \phi(x_l) \rangle_{\mathcal{H}} \\
&= (K \tilde{1})^T \beta
\end{aligned}$$
where $K = \begin{pmatrix} K_{SS} & K_{ST} \\ K_{TS} & K_{TT} \end{pmatrix}$, $K_{SS} = \langle \phi(x_i), \phi(x_k) \rangle_{\mathcal{H}}$, $K_{TS} = \langle \phi(x_j), \phi(x_k) \rangle_{\mathcal{H}}$, $K_{ST} = \langle \phi(x_i), \phi(x_l) \rangle_{\mathcal{H}}$, $K_{TT} = \langle \phi(x_j), \phi(x_l) \rangle_{\mathcal{H}}$. Here $x_i, x_k \in X_s$ and $x_j, x_l \in X_t$; $\beta = [\beta^s, \beta^t]^T$ and $\tilde{1} = [\underbrace{\tfrac{1}{n_s}, \dots, \tfrac{1}{n_s}}_{n_s \text{ times}}, \underbrace{-\tfrac{1}{n_t}, \dots, -\tfrac{1}{n_t}}_{n_t \text{ times}}]^T$.
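A minimal sketch (our own; the Gaussian kernel, its parameter and the values of $\beta$ are chosen only for illustration) of how the constraint value $(K\tilde{1})^T\beta$ can be evaluated from these blocks:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
Xs, Xt = rng.normal(size=(50, 3)), rng.normal(size=(40, 3)) + 1.0
ns, nt = len(Xs), len(Xt)

X = np.vstack([Xs, Xt])                 # source points first, then target points
K = rbf_kernel(X, X, gamma=0.5)         # full Gram matrix [[K_SS, K_ST], [K_TS, K_TT]]
t1 = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])  # the vector 1~

beta = rng.normal(size=ns + nt)         # expansion coefficients of w (illustrative values)
constraint_value = (K @ t1) @ beta      # <mu_Xs - mu_Xt, w>_H = (K 1~)^T beta
print(constraint_value)
```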
Incorporating $w$ (2) into $\|w\|^2$, we have: $\|w\|^2 = \beta^T K \beta$.
We now introduce the Lagrange parameters to solve this constrained problem:
$$L = \max_{\alpha, \mu, \eta} \; \min_{\beta, \varepsilon, b} \; \frac{1}{2} \beta^T K \beta + C \sum_{i=1}^{n_s} \varepsilon_i - \sum_{i=1}^{n_s} \alpha_i \varepsilon_i - \sum_{i=1}^{n_s} \mu_i \left[ y_i \left( \beta^T \phi(X) \phi(x_i) + b \right) - 1 + \varepsilon_i \right] - \eta \, (K \tilde{1})^T \beta$$
After some manipulations, we obtain the dual form:
$$\max_{\mu, \eta} \; \sum_{i=1}^{n_s} \mu_i - \frac{1}{2} \left( \sum_{i=1}^{n_s} \mu_i y_i K_{.i} \right)^T K^{-1} \left( \sum_{j=1}^{n_s} \mu_j y_j K_{.j} \right) - \frac{1}{2} \eta^2 \, \tilde{1}^T K^T \tilde{1} - \eta \left( \sum_{i=1}^{n_s} \mu_i y_i K_{.i} \right)^T \tilde{1}$$
$$\text{s.t.} \quad 0 \le \mu_i \le C \quad \text{and} \quad \sum_{i=1}^{n_s} \mu_i y_i = 0$$
where $K_{.i} = \langle \phi(X), \phi(x_i) \rangle_{\mathcal{H}}$ and $X$ represents the ensemble of $X_s$ and $X_t$; $x_i$ is a single point either from $X_s$ or $X_t$.
As there are two different kinds of Lagrange parameters, $\mu$ and $\eta$, we eliminate one of them by first fixing the value of $\mu$ and maximizing only the last two terms (related to $\eta$) of the Lagrange function. Setting the derivative with respect to $\eta$ to zero, the optimal value of $\eta$ can be expressed as a function of $\mu$:
$$\eta = - \frac{\left( \sum_{i=1}^{n_s} \mu_i y_i K_{.i} \right)^T \tilde{1}}{\tilde{1}^T K^T \tilde{1}}$$
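For readability, we add the intermediate step (not spelled out in the text): substituting this value of $\eta$ back into the two $\eta$-dependent terms yields the rank-one correction that appears in the final dual below,
$$-\frac{1}{2}\eta^2\,\tilde{1}^T K^T \tilde{1} - \eta \left(\sum_{i=1}^{n_s}\mu_i y_i K_{.i}\right)^T \tilde{1} = \frac{1}{2} \left(\sum_{i=1}^{n_s}\mu_i y_i K_{.i}\right)^T \frac{\tilde{1}\tilde{1}^T}{\tilde{1}^T K^T \tilde{1}} \left(\sum_{j=1}^{n_s}\mu_j y_j K_{.j}\right).$$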
We now obtain the final dual form of the optimization problem:
$$\max_{\mu} \; \sum_{i=1}^{n_s} \mu_i - \frac{1}{2} \left( \sum_{i=1}^{n_s} \mu_i y_i K_{.i} \right)^T \left( K^{-1} - \frac{\tilde{1} \tilde{1}^T}{\tilde{1}^T K^T \tilde{1}} \right) \left( \sum_{j=1}^{n_s} \mu_j y_j K_{.j} \right)$$
$$\text{s.t.} \quad 0 \le \mu_i \le C \quad \text{and} \quad \sum_{i=1}^{n_s} \mu_i y_i = 0.$$
Let $\gamma_i$ denote $\mu_i y_i$; the previous problem becomes:
$$\max_{\gamma} \; \gamma^T Y - \frac{1}{2} \gamma^T \left( K_{SS} - \frac{K_{S.} \tilde{1} \tilde{1}^T K_{S.}^T}{\tilde{1}^T K^T \tilde{1}} \right) \gamma$$
$$\text{s.t.} \quad \sum_{i=1}^{n_s} \gamma_i = 0 \quad \text{and} \quad \min(0, C y_i) \le \gamma_i \le \max(0, C y_i).$$
where $K_{S.}$ denotes the $n_s$ rows $K_{i.}$, $i = 1, \dots, n_s$, of $K$ associated with the source data. The matrix $K_{SS} - \frac{K_{S.} \tilde{1} \tilde{1}^T K_{S.}^T}{\tilde{1}^T K^T \tilde{1}}$ is the matrix of inner products of the source data in the subspace orthogonal to $\mu_{X_s} - \mu_{X_t}$ (the subspace in which $w$ is constrained to lie). As stated in (Paulsen, 2009), if $\mathcal{H}$ is a RKHS on $X$ and $\mathcal{H}_0 \subset \mathcal{H}$ is a closed subspace, then $\mathcal{H}_0$ is also a RKHS on $X$. Therefore, the matrix $K_{new} = K_{SS} - \frac{K_{S.} \tilde{1} \tilde{1}^T K_{S.}^T}{\tilde{1}^T K^T \tilde{1}}$ is the new Gram matrix corresponding to the projected kernel, and $K_{new}$ is positive semi-definite.
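The projected Gram matrix can be formed directly from $K$ and $\tilde{1}$. The sketch below is our own reading of the formula above as a rank-one deflation of $K$; the source-source block of its output is $K_{new}$:

```python
import numpy as np

def projected_gram(K, n_s, n_t):
    """Gram matrix of all points after projection onto the subspace
    orthogonal to mu_Xs - mu_Xt (source points are assumed to come first)."""
    t1 = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    v = K @ t1                                   # inner products with mu_Xs - mu_Xt
    K_proj = K - np.outer(v, v) / (t1 @ K @ t1)  # rank-one deflation of the Gram matrix
    return K_proj

# K_new is the source-source block of the projected Gram matrix:
# K_new = projected_gram(K, n_s, n_t)[:n_s, :n_s]
```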
Considering the dual form of the optimization problem, we can solve it using standard quadratic programming tools. However, in order to shorten calculations, we used here an adaptation of the F-SVC decomposition algorithm proposed in (Tohmé and Lengellé, 2008). Adaptation and implementation are straightforward.
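With $\gamma_i = \mu_i y_i$, the final dual above has the form of the standard SVM dual written with the projected Gram matrix $K_{new}$ (this equivalence is our reading, not stated explicitly in the text). A convenient way to reproduce the method with off-the-shelf tools is therefore to feed that matrix to a precomputed-kernel SVM. The sketch below follows this reading; toy data, kernel and parameter values are our own illustrative assumptions, and it is not the authors' F-SVC-based implementation:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Toy source data (labeled) and target data (unlabeled, shifted version of the source)
Xs = np.vstack([rng.normal([0, 0], 1, (60, 2)), rng.normal([4, 4], 1, (60, 2))])
ys = np.hstack([-np.ones(60), np.ones(60)])
Xt = np.vstack([rng.normal([1, 2], 1, (50, 2)), rng.normal([5, 6], 1, (50, 2))])
yt = np.hstack([-np.ones(50), np.ones(50)])      # target labels used for evaluation only

n_s, n_t = len(Xs), len(Xt)
X = np.vstack([Xs, Xt])
K = rbf_kernel(X, X, gamma=0.1)                  # Gram matrix, source points first
t1 = np.concatenate([np.full(n_s, 1 / n_s), np.full(n_t, -1 / n_t)])
v = K @ t1
K_proj = K - np.outer(v, v) / (t1 @ K @ t1)      # Gram matrix in the admissible subspace

clf = SVC(C=1.0, kernel='precomputed')
clf.fit(K_proj[:n_s, :n_s], ys)                  # train on source data only (K_new block)
# Predict target points through the target-vs-source block of the projected kernel
print('target accuracy:', clf.score(K_proj[n_s:, :n_s], yt))
```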
5 EXPERIMENTS
5.1 Data Sets
Our goal is to improve the classification performance
on target data with the help of related but different
source data.
To illustrate our method on a simple data set,
we first consider some linearly separable data and
we select the linear kernel (which is not univer-
sal so the heuristic should not lead to satisfactory
results). We generate two almost linearly separa-
ble gaussian groups denoted as source-positive and
source-negative. Then we do the same to generate the
target data (there is no label provided for the target
data). An example of this data set is shown in fig. 1.
A second, more complicated synthetic data set is
the well-known banana-orange data set. We desig-
nate the banana as the source positive and the orange
as the source negative. We also generate a target data
set which is drawn from a translated and distorted ver-
sion of the distribution of the source data. Here again,
no label information is available for the target (see an
example in fig. 3).
We now use the USPS data set, a famous handwritten digit data set. The version used is composed of training and testing parts, both containing the images (16 x 16 pixels) of the 10 digits. As proposed in (Uguroglu and Carbonell, 2011), we choose the separation of digits 4 and 7 as the source classification problem. All the source data is extracted from the training subset of USPS and is perfectly labeled. The target classification problem aims at separating digits 4 and 9 (without the use of the corresponding labels). All target data is extracted from the testing subset of the USPS database.
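A short sketch of this task construction (our own; we assume the USPS training and testing splits have already been loaded as numpy arrays, which is not detailed above):

```python
import numpy as np

def binary_task(X, y, pos_digit, neg_digit):
    # Keep only the two requested digits and relabel them as +1 / -1
    mask = np.isin(y, [pos_digit, neg_digit])
    labels = np.where(y[mask] == pos_digit, 1, -1)
    return X[mask], labels

# Source problem: 4 vs 7, taken from the training split (labels are kept).
# Xs, ys = binary_task(X_train, y_train, 4, 7)
# Target problem: 4 vs 9, taken from the testing split (labels used for evaluation only).
# Xt, yt = binary_task(X_test, y_test, 4, 9)
```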
We compare the results we obtained with the LM method proposed in (Quanz and Huan, 2009) and also with a standard SVM trained only on source data (no transfer learning in this case).
Figure 3: Results obtained on the banana-orange data set. (a) Example of a classifier obtained with our method (for the optimal value of σ); (b) decision surface obtained; (c) example of a classifier obtained with LM (for the optimal value of σ); (d) decision surface (LM). In 3(a) and 3(c), circles and stars represent the labeled source data while "plus" symbols are the unlabeled target data. In 3(b) and 3(d), the decision surfaces are plotted as functions of the input space coordinates. Thresholding these surfaces at the 0 level gives the decision curves corresponding to the classifiers in 3(a) and 3(c), respectively.
In (Quanz and Huan, 2009), LM has been shown to be superior to other transfer learning methods, so we omit here the comparison with other transfer learning methods.
5.2 Experimental Results and Analysis
For a visual comprehension of our SVM-MMD method, we show in fig. 1 the results obtained on the first synthetic data set. Stars represent source-positive data, triangles are source-negative data, crosses are target data; the two circles are the means of the source and target data, respectively.
Figure 1: Linearly separable data set using the linear kernel (triangles and stars represent the labeled source data, while "plus" symbols represent the unlabeled target data).
Figure 2: Average performance (good classification rate) ±1 s.d. as a function of the Gaussian kernel parameter. Red line: our method. Black line: LM.
As can be seen, the normal to the obtained discriminant function is orthogonal to $\vec{m}_s - \vec{m}_t$, as expected (for this kernel, the mean of the original source (target) data coincides with $\mu_s$ ($\mu_t$)).
For the second synthetic data set (fig. 2), we show the classification results we obtained, compared to those of LM. We do not compare with a standard SVM trained on source data only, because standard SVM obviously fails here (see fig. 3(a)). Examples of classification results (data sets, discriminant functions obtained on source and target, decision surfaces) are shown in fig. 3.
Figure 4: Results (good classification rates) obtained on the USPS data set as a function of the Gaussian kernel parameter, for LM, SVM and SVM-MMD.
We independently generate 50 different banana-orange data sets and show the average performance (±1 standard deviation) in fig. 2. We conclude that, most of the time, our method achieves better results than LM, and does so over a wider range of kernel parameter values.
We now show the results obtained on the USPS
data set. As shown in fig. 4, our method provides
higher performance for almost all the kernel parame-
ter values considered.
6 CONCLUSION AND FUTURE
DIRECTIONS
In this paper, we propose a new approach to solve
the domain adaptation problem when no labeled tar-
get data is available. The idea is to perform a pro-
jection of source and target data onto a subspace of a
RKHS where source and target data distributions are
expected to be similar. To do so, we select the sub-
space which ensures nullity of a Maximum Mean Dis-
crepancy based criterion. As source and target data
become similar, the SVM classifier trained on source
data performs well on target data. We have shown that
this additional constraint on the primal optimization
problem does not modify the nature of the dual prob-
lem so that standard quadratic programming tools can
be used. We have applied our method on synthetic
and real data sets and we have shown that our results
compare favorably with Large Margin Transductive
Transfer Learning.
As an important short-term development, we must propose a method to automatically determine an adequate value of the Gaussian kernel parameter used in our method. We also have to consider multiple kernel
learning. Finally, more complex real data sets are to
be used to benchmark our transfer learning method.
REFERENCES
Blitzer, J., Dredze, M., Pereira, F., et al. (2007). Biogra-
phies, bollywood, boom-boxes and blenders: Domain
adaptation for sentiment classification. In ACL, vol-
ume 7, pages 440–447.
Bruzzone, L. and Marconcini, M. (2010). Domain adapta-
tion problems: A dasvm classification technique and a
circular validation strategy. Pattern Analysis and Ma-
chine Intelligence, IEEE Transactions on, 32(5):770–
787.
Dudley, R. M. (1984). A course on empirical processes. In
École d'été de Probabilités de Saint-Flour XII-1982,
pages 1–142. Springer.
Dudley, R. M. (2002). Real analysis and probability, vol-
ume 74. Cambridge University Press.
Fortet, R. and Mourier, E. (1953). Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., pages 266–285.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B.,
and Smola, A. (2012). A kernel two-sample test. J.
Mach. Learn. Res., 13:723–773.
Huang, C.-H., Yeh, Y.-R., and Wang, Y.-C. F. (2012).
Recognizing actions across cameras by exploring the
correlated subspace. In Computer Vision–ECCV
2012. Workshops and Demonstrations, pages 342–
351. Springer.
Huang, J., Gretton, A., Borgwardt, K. M., Schölkopf, B.,
and Smola, A. J. (2006). Correcting sample selection
bias by unlabeled data. In Advances in neural infor-
mation processing systems, pages 601–608.
Jiang, J. (2008). A literature survey on domain adaptation
of statistical classifiers. URL: http://sifaka.cs.uiuc.edu/jiang4/domainadaptation/survey.
Li, L., Zhou, K., Xue, G.-R., Zha, H., and Yu, Y.
(2011). Video summarization via transferrable struc-
tured learning. In Proceedings of the 20th interna-
tional conference on World wide web, pages 287–296.
ACM.
Liang, F., Tang, S., Zhang, Y., Xu, Z., and Li, J. (2014).
Pedestrian detection based on sparse coding and
transfer learning. Machine Vision and Applications,
25(7):1697–1709.
Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. (2011).
Domain adaptation via transfer component analysis.
Neural Networks, IEEE Transactions on, 22(2):199–210.
Pan, S. J. and Yang, Q. (2010). A survey on transfer learn-
ing. Knowledge and Data Engineering, IEEE Trans-
actions on, 22(10):1345–1359.
Patel, V. M., Gopalan, R., Li, R., and Chellappa, R. (2015).
Visual domain adaptation: A survey of recent ad-
vances. IEEE signal processing magazine, 32(3):53–
69.
Paulsen, V. I. (2009). An introduction to the theory of re-
producing kernel hilbert spaces. Lecture Notes.
Quanz, B. and Huan, J. (2009). Large margin transductive
transfer learning. In Proceedings of the 18th ACM
conference on Information and knowledge manage-
ment, pages 1327–1336. ACM.
Ren, J., Liang, Z., and Hu, S. (2010). Multiple kernel learn-
ing improved by mmd. In Advanced Data Mining and
Applications, pages 63–74. Springer.
Schölkopf, B., Herbrich, R., and Smola, A. J. (2001). A
generalized representer theorem. In Computational
learning theory, pages 416–426. Springer.
Serfling, R. J. (2009). Approximation theorems of mathe-
matical statistics, volume 162. John Wiley & Sons.
Smola, A. (2006). Maximum mean discrepancy. In 13th
International Conference, ICONIP 2006, Hong Kong,
China, October 3-6, 2006: Proceedings.
Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007).
A hilbert space embedding for distributions. In Algo-
rithmic Learning Theory, pages 13–31. Springer.
Steinwart, I. (2002). On the influence of the kernel on the
consistency of support vector machines. The Journal
of Machine Learning Research, 2:67–93.
Tan, Q., Deng, H., and Yang, P. (2012). Kernel mean match-
ing with a large margin. In Advanced Data Mining and
Applications, pages 223–234. Springer.
Tohmé, M. and Lengellé, R. (2008). F-SVC: A simple and fast training algorithm for soft margin support vector clas-
sification. In Machine Learning for Signal Processing,
2008. MLSP 2008. IEEE Workshop on, pages 339–
344. IEEE.
Uguroglu, S. and Carbonell, J. (2011). Feature selec-
tion for transfer learning. In Machine Learning and
Knowledge Discovery in Databases, pages 430–442.
Springer.
Yang, S., Lin, M., Hou, C., Zhang, C., and Wu, Y. (2012). A
general framework for transfer sparse subspace learn-
ing. Neural Computing and Applications, 21(7):1801–
1817.
Zhang, P., Zhu, X., and Guo, L. (2009). Mining data streams
with labeled and unlabeled training examples. In Data
Mining, 2009. ICDM’09. Ninth IEEE International
Conference on, pages 627–636. IEEE.
Zhang, Z. and Zhou, J. (2012). Multi-task clustering via
domain adaptation. Pattern Recognition, 45(1):465–
473.