TOWARDS ONLINE COMPOSITION OF PMML PREDICTION MODELS

Diana Gorea

Faculty of Computer Science, University "Al. I. Cuza", 16 G-ral Berthelot, 700483 Iasi, Romania

Keywords:

Prediction model composition, PMML, online scoring, web services, XML.

Abstract:

The paper presents a general context in which composition of prediction models can be achieved
within the boundaries of an online scoring system called DeVisa. The system provides its
functionality via web services and stores the prediction models, represented in PMML, in a native
XML database. A language called PMQL is defined, whose purpose is to process the PMML models and to
express consumers' goals and the answers to those goals. The composition of prediction models can
occur either implicitly, within the process of online scoring, or explicitly, when the consumer
builds or trains a new model based on the existing ones in the DeVisa repository. The main
scenarios that involve composition are adapted to the types of composition allowed in the PMML
specification, i.e., sequencing and selection.

1 INTRODUCTION

In general, the composition of prediction models can be realized in various ways, all of which
share the goal of making predictions more reliable. Composing several prediction models means
merging their various outputs into a single prediction. Several machine learning techniques do
this by learning an ensemble of models and using them in combination. The most popular are
bagging, boosting and stacking (Witten and Frank, 2005).

Bagging applies an unweighted voting scheme to the outcomes of different classifiers built on
possibly different data sets (which can be obtained by resampling the original data set). In the
case of numeric prediction, instead of voting on the outcome, the individual predictions, being
real numbers, are averaged. The component models are usually built separately.

Boosting also uses voting or averaging schemes to combine the outcomes but, unlike bagging, it
uses weighting to give more influence to the more successful models. Furthermore, the process is
iterative: each component model is built upon the previous models and is therefore influenced by
their performance.
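The difference between the two combination steps can be illustrated with a minimal sketch. The function names and the weights below are assumptions made for illustration only; in an actual boosting algorithm the weights would be derived from each model's training performance rather than supplied by hand.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """Unweighted vote (bagging style): the most frequent label wins."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Weighted vote (boosting style): each model adds its weight to its label."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

def average(predictions):
    """Numeric prediction: plain average of the individual outputs."""
    return sum(predictions) / len(predictions)

labels = ["yes", "no", "no"]                    # outputs of three hypothetical models
print(majority_vote(labels))                    # no
print(weighted_vote(labels, [0.7, 0.2, 0.3]))   # yes
```

With equal weights the weighted vote reduces to the unweighted one, which is why the two schemes can share the same scoring infrastructure.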

Stacking applies to heterogeneous classifiers and trains a new meta-learner on the predictions of
the component classifiers using a validation data set. The meta-learner can be of various types
depending on the set of attributes used for meta-learning. Some meta-learners use only the class
predictions of the component models for training, while others use both the class predictions and
all the original input attributes. Stacking applies to both categorical and numeric predictions.

Another approach, which is mostly useful in distributed data mining, is combining models with
various levels of granularity. For instance, one model might classify an instance at a coarse
level, while another model does so with a finer granularity. One can use the first model to tag
the instances with a more general class and then use a specialized classifier for each of the
resulting groups. This technique is also useful in classification when an algorithm, such as a
standard SVM, cannot predict multi-class attributes. Another direct use of this technique is
model selection in PMML, which is described in Section 2.
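The coarse-to-fine combination described above can be sketched as follows. The models are stand-ins written as plain functions, and all names and attribute values are hypothetical.

```python
def cascade_classify(instance, coarse_model, fine_models):
    """Tag the instance with a general class, then refine it with the
    specialized classifier registered for that class (if any)."""
    group = coarse_model(instance)
    specialist = fine_models.get(group)
    return (group, specialist(instance)) if specialist else (group, None)

# Hypothetical toy models, written as plain functions.
coarse = lambda x: "vehicle" if x["wheels"] > 0 else "animal"
fine = {
    "vehicle": lambda x: "bicycle" if x["wheels"] == 2 else "car",
    "animal": lambda x: "bird" if x["legs"] == 2 else "dog",
}

print(cascade_classify({"wheels": 2, "legs": 0}, coarse, fine))
# ('vehicle', 'bicycle')
```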

In distributed data mining, different models might be built on vertically fragmented data that
usually reside at different sites. Each of the individual models is built on a projection of the
same relation, but is unable to detect cross-site correlations. A meta-learning approach that uses
classifiers trained at different sites to develop a global classifier has been proposed in
(Prodromidis et al., 2000).

Another possibility for combining data mining models is to adjust a given model to a consumer's
specific needs (model customization). A model consumer repeatedly uses the model in its knowledge
integration processes and may collect information on its performance. A new model can then be
constructed based on the specific needs of the consumer. The same applies to refreshing a model to
reflect new trends in the data. This is related to the concept of incremental data mining, which
was introduced in (Harries et al., 1998). A related approach can be found in (Kuncheva, 2004),
which presents strategies for building ensembles of classifiers in non-stationary environments in
which even the classification task may change.

Gorea D. (2008). TOWARDS ONLINE COMPOSITION OF PMML PREDICTION MODELS.
In Proceedings of the Third International Conference on Software and Data Technologies - ISDM/ABF, pages 300-303.
DOI: 10.5220/0001890503000303. SciTePress.

The current work aims to provide a general context in which composition of data mining models can
be achieved within the boundaries of an online scoring system called DeVisa. While (Gorea, 2008)
and (DeVisa, 2007) give a general description of the system, the current paper focuses on the
model composition aspect. DeVisa stores only the prediction models expressed in PMML (PMML, 2007),
not the original training data. Therefore we consider the training data as not being available.
Consequently, model composition in DeVisa is limited to certain techniques, which are described in
Section 2. However, the consumer application can build new models based on the existing DeVisa
models and its own validation set.

2 DEVISA COMPOSITION OF PREDICTIVE MODELS

DeVisa supports two types of composition methods described in the PMML specification: model
sequencing and model selection. In its current version (3.2), PMML supports the combination of
decision trees or rules and simple regression models.

Model sequencing is the case in which two or more models are combined into a sequence where the
results of one model are used as input to another model. Model sequencing is very often an
intrinsic part of a model, namely a transformation function. For instance, a supervised
discretization algorithm is applied to a certain attribute and described within a transformation
dictionary, or missing values for an attribute are filled in using a transformation function made
of decision rules. Model selection is the case in which one of many models is selected based on
decision rules. A common model selection method for optimizing prediction models is the
combination of segmentation and regression.
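The two composition types can be illustrated with a short sketch that treats models as plain functions rather than PMML documents. The helper names `sequence` and `select`, the fill-in value and the regression coefficients are all assumptions chosen for illustration; they are not taken from PMML or DeVisa.

```python
def sequence(*stages):
    """Model sequencing: the output of each stage is the input of the next."""
    def composed(record):
        for stage in stages:
            record = stage(record)
        return record
    return composed

def select(rules, default=None):
    """Model selection: the first rule whose predicate holds picks the model."""
    def composed(record):
        for predicate, model in rules:
            if predicate(record):
                return model(record)
        if default is None:
            raise ValueError("no rule matched the record")
        return default(record)
    return composed

# Hypothetical example: fill a missing value, then pick a regression by segment.
fill_age = lambda r: {**r, "age": r["age"] if r.get("age") is not None else 40}
young = lambda r: 100.0 + 2.0 * r["age"]
senior = lambda r: 300.0 - 1.0 * r["age"]
by_segment = select([(lambda r: r["age"] < 50, young)], default=senior)
model = sequence(fill_age, by_segment)

print(model({"age": None}))   # 180.0
print(model({"age": 60}))     # 240.0
```

Here the missing-value transformation plays the role of the sequenced model, and the age rule plays the role of the segmentation that selects a regression.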

Although producer applications can upload composite models to DeVisa, in this section we focus
instead on the situations in which DeVisa is responsible for composing the models. Depending on
the moment when the composition process occurs, we can further classify composition in DeVisa as
implicit or explicit.

Implicit composition is the situation in which the models are composed within the orchestration of
the scoring or search service (see Section 3).

A scoring goal (query) is a tuple (MSpec, R),

where

1. MSpec is the model speciﬁcation, deﬁned as

MSpec ::= {MRef} | ({Filter}, SRef | S[, DRef])

The model specification has several instances:

Exact model case, in which exact references to one or more DeVisa models that the consumer wishes
to score on are given via MRef.

Exact schema case, in which the consumer gives an exact reference to a mining schema SRef and
wishes to score on the models complying with that schema. An additional set of filters,
corresponding to the properties that the models need to conform to, can be specified via the
Filter element.

Match schema case, in which S describes a mining schema that needs to be matched against one or
more schemas in the DeVisa Catalog. To restrict the search, an existing DeVisa data dictionary can
optionally be referenced (DRef). A reference to an ontology that explains the terminology can also
be included, as can an optional Filter element.

2. R is the dataset to score.
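The structure of a scoring goal can be made concrete with a small data model. This is only a sketch of the grammar given above: the class and field names are assumptions, references are modelled as plain strings, and the mining schema S is reduced to a list of attribute names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

@dataclass
class ExactModel:
    model_refs: List[str]                                   # {MRef}

@dataclass
class ExactSchema:
    schema_ref: str                                         # SRef
    filters: Dict[str, str] = field(default_factory=dict)   # {Filter}

@dataclass
class MatchSchema:
    schema: List[str]                                       # S, attribute names only
    dictionary_ref: Optional[str] = None                    # optional DRef
    filters: Dict[str, str] = field(default_factory=dict)   # {Filter}

MSpec = Union[ExactModel, ExactSchema, MatchSchema]

@dataclass
class ScoringGoal:
    mspec: MSpec
    rows: List[dict]                                        # R, the dataset to score

# Hypothetical goal: match a schema with attributes A and B in dictionary "D1".
goal = ScoringGoal(MatchSchema(["A", "B"], dictionary_ref="D1"), [{"A": 1, "B": 2}])
print(type(goal.mspec).__name__)   # MatchSchema
```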

Implicit model composition is applicable in two situations, given that the consumer allows scoring
on composite models:

1. Several models complying with MSpec can be found;

2. No model complying with MSpec can be found.

The first situation can occur in all the model specification instances (exact model, exact schema
or match schema).

In the first two instances, given the model or schema references, an existing data dictionary is
implicit. In the match schema case, a mining schema S and a reference to a DeVisa data dictionary
D should be provided.

In the exact model case, all the referenced models are retrieved. In the exact schema case, the
engine finds all the models complying with the specified schema. In the match schema case, the
engine tries to find one or more models that match S. The composition described below applies to
the situation in which several models satisfy the requirements. They are combined to give the best
prediction as follows. The composer component of the engine (see Section 3.1) scores on all the
models, applies a voting procedure (similar to the bagging approach) and returns either the
outcome that has the highest vote (in the case of categorical predictions) or the average (in the
case of numeric predictions). Note that if model composition is not allowed, the engine builds a
query plan and executes it against the retrieved model (or models) in the repository, which is the
classical scoring scenario.
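The decision between classical scoring and implicit composition can be sketched as a small dispatch function. This is an illustration under stated assumptions, not the DeVisa implementation: models are plain functions, the name `score` and the flag `allow_composition` are hypothetical, and a prediction is treated as numeric when every output is a number.

```python
from collections import Counter
from numbers import Number

def score(models, row, allow_composition=True):
    """Score one row against the matching models.

    Returns the list of individual outputs in the classical scenario, or a
    single combined prediction when several models are composed."""
    if not models:
        raise ValueError("no model matches the specification")
    outputs = [model(row) for model in models]
    if not allow_composition or len(outputs) == 1:
        return outputs                                   # classical scoring
    if all(isinstance(o, Number) and not isinstance(o, bool) for o in outputs):
        return sum(outputs) / len(outputs)               # numeric: average
    return Counter(outputs).most_common(1)[0][0]         # categorical: vote

classifiers = [lambda r: "a", lambda r: "b", lambda r: "a"]   # hypothetical models
regressors = [lambda r: 1.0, lambda r: 3.0]

print(score(classifiers, {}))                            # a
print(score(regressors, {}))                             # 2.0
print(score(classifiers, {}, allow_composition=False))   # ['a', 'b', 'a']
```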

In the match schema case, the names of the attributes in S should either be among the attribute
names in D, or they should refer to the same terms in an ontology/taxonomy, so that a clear
mapping can be made. We now turn to the second situation, in which no model applicable to S can be
found. The engine then invokes the composer module, which attempts to build a new model via
composition from the models complying with the data dictionary D. An example can be seen in
Figure 1, which depicts a sequencing composition. The two DeVisa classification models $\phi_1$,
$\phi_2$ are defined on the schemas $S_1 = (\{A, B, D, F\}, P_1)$ and $S_2 = (\{E, F, G\}, P_2)$
in the data dictionary D. The scoring goal specifies a mining schema
$S = (\{A, B, C, E, G\}, P)$. The composer attempts to build a new model by sequencing $\phi_1$
and $\phi_2$.

Figure 1: An example of implicit sequencing of models.

In the implicit composition scenario the PMQL

engine builds the sequenced model in order to score

on the dataset provided in the scoring request. It then

stores the model back in the repository for future use.
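One simple way to search for such a sequence is greedy forward chaining over the attributes that are already available. The sketch below is an assumption about how a composer could proceed, not a description of the DeVisa composer; the catalog is hypothetical and deliberately unrelated to the attribute sets of Figure 1.

```python
def find_sequence(models, available, target):
    """Greedy forward chaining: keep applying any model whose inputs are all
    available until the target attribute can be produced.

    models maps a name to (set of input attributes, output attribute).
    Returns the ordered list of model names, or None if no chain exists."""
    known, plan = set(available), []
    progress = True
    while target not in known and progress:
        progress = False
        for name, (inputs, output) in models.items():
            if name not in plan and inputs <= known and output not in known:
                plan.append(name)
                known.add(output)
                progress = True
    return plan if target in known else None

# Hypothetical catalog: m1 derives F, which m2 needs in order to predict P.
catalog = {
    "m1": ({"A", "B"}, "F"),
    "m2": ({"E", "F", "G"}, "P"),
}

print(find_sequence(catalog, {"A", "B", "E", "G"}, "P"))   # ['m1', 'm2']
print(find_sequence(catalog, {"A"}, "P"))                  # None
```

Because the search is greedy, the returned plan may contain models that are applicable but not strictly needed; pruning them would be a natural refinement.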

Explicit composition occurs when a DM consumer explicitly specifies in its goal that a composition
is desired. The composition query usually includes the models to be composed, the composition
method and a validation data set. DeVisa identifies the specified models and checks them for
compatibility. If all the prerequisites for the composition are fulfilled, a new valid model is
returned to the user or stored in the repository.

An explicit composition scenario occurs when a consumer wants to make the best of several
heterogeneous models in DeVisa complying with the same mining schema. In this case it provides a
composition goal (MSpec, R), where MSpec is defined as above and R is a validation set with
classified instances defined on the same schema as the models satisfying MSpec.

DeVisa uses a stacking approach (see Section 1) to train a meta-learner $\varphi$ based on the
outcomes of the existing DeVisa base models, i.e., the models that satisfy MSpec, and on the
relation R provided by the consumer. Let us assume that the base schema is $S = (U, P)$, where
$U = \{A_1, A_2, \ldots, A_n\}$, and that the base classifiers are
$\phi_1, \phi_2, \ldots, \phi_k : \mathrm{dom}(A_1) \times \mathrm{dom}(A_2) \times \cdots \times \mathrm{dom}(A_n) \to \mathrm{dom}(C)$,
where $C \in U$. The meta-learner is a simple decision tree
$\varphi : \mathrm{dom}(C)^k \to \mathrm{dom}(C)$ (by default DeVisa uses ID3), since the main
work is done by the base classifiers. The outcome is another classifier
$\phi : \mathrm{dom}(A_1) \times \mathrm{dom}(A_2) \times \cdots \times \mathrm{dom}(A_n) \to \mathrm{dom}(C)$,
as depicted in Figure 2.

Figure 2: Explicit DeVisa composition - stacking approach.

This approach has the advantage of ﬁtting the

model to the consumer needs and the particularities

of its own data (model customization instance).
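The stacking step can be sketched in a few lines. To keep the example self-contained, the meta-learner below is a lookup table that maps each tuple of base predictions to the class most often observed for it in the validation set; this is a deliberately simple stand-in and not the ID3 tree that DeVisa uses by default. The base models and the validation data are hypothetical.

```python
from collections import Counter, defaultdict

def train_meta_learner(base_models, validation_rows, labels):
    """Learn a mapping from the tuple of base predictions to the class that
    was most frequent for that tuple on the validation set."""
    counts = defaultdict(Counter)
    for row, label in zip(validation_rows, labels):
        key = tuple(model(row) for model in base_models)
        counts[key][label] += 1
    table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    fallback = Counter(labels).most_common(1)[0][0]   # for unseen tuples

    def stacked(row):
        key = tuple(model(row) for model in base_models)
        return table.get(key, fallback)
    return stacked

# Hypothetical base classifiers, each looking at a single attribute.
base = [
    lambda r: "pos" if r["x"] > 0 else "neg",
    lambda r: "pos" if r["y"] > 0 else "neg",
]
rows = [{"x": 1, "y": 1}, {"x": 1, "y": -1}, {"x": -1, "y": 1}, {"x": -1, "y": -1}]
truth = ["pos", "neg", "neg", "neg"]                   # consumer's classified instances

stacked = train_meta_learner(base, rows, truth)
print(stacked({"x": 2, "y": 3}))    # pos
print(stacked({"x": 2, "y": -3}))   # neg
```

Note that only the predictions of the base models and the consumer's labels are needed, which is consistent with the original training data being unavailable.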

Model composition is allowed only within a common data dictionary. Intuitively, a data dictionary
refers to a strict domain. In the absence of a reference to a data dictionary in the consumer's
goal, the same results can be achieved if a common domain ontology is available, composed of
concepts describing the domain at different abstraction levels, into which URLs are mapped.

3 ARCHITECTURE AND IMPLEMENTATION

3.1 PMQL

DeVisa defines an XML-based language called PMQL (Predictive Model Query Language) (PMQL, 2008)
that is used both for communication with the consumer application and for internal communication
between DeVisa components. The consumer application expresses its goal in PMQL and wraps it in a
SOAP (SOAP, 2007) message that is sent to the PMQL Web Service. The PMQL Web Service forwards the
PMQL goal to a DeVisa component called the PMQL engine, which is responsible for processing PMQL.
It transforms the goal (query) so that it matches the existing DeVisa resources and, after
successful matching, transfers the PMQL answer back to the consumer (Figure 3). To resolve a
scoring or composition goal, the PMQL engine performs a sequence of steps: annotation, rewriting
and plan building (Gorea, 2008). If a composition is necessary, an additional composing phase,
which assembles the base models, is performed. The execution invokes certain XQuery (XQuery, 2007)
functions against the repository of PMML models.
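Wrapping a goal in a SOAP message can be sketched with the Python standard library. The envelope namespace is the one defined by SOAP 1.2; the payload, however, is purely hypothetical, since the element and attribute names used below (`goal`, `type`, `modelRef`) and the model identifier are placeholders and are not taken from the PMQL schema.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"   # SOAP 1.2 envelope namespace

def wrap_in_soap(payload: ET.Element) -> bytes:
    """Place an XML payload inside a SOAP 1.2 Envelope/Body."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    body.append(payload)
    return ET.tostring(envelope, encoding="utf-8")

# Hypothetical goal: these element and attribute names are NOT the PMQL schema.
goal = ET.Element("goal", {"type": "score"})
ET.SubElement(goal, "modelRef").text = "model-42"

message = wrap_in_soap(goal)
print(message.decode("utf-8"))
```

On the receiving side, a message-style service would extract the first child of the Body element and hand it to the engine as raw XML.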

ICSOFT 2008 - International Conference on Software and Data Technologies


Figure 3: Resolving consumers' goals in DeVisa. (Diagram labels: Consumer, Goal, PMQL Web Service,
PMQL Goal, PMQL Answer, PMQL Engine with Rewriter, Plan Builder, Composer and Executer, Query
plan/answer, Model, PMML model repository, Concrete service.)

3.2 The PMQL Web Service

DeVisa's approach uses several layers of abstraction in defining its services. The PMQL Web
Service is an abstract computational entity meant to provide access to the concrete DeVisa
services. The materialization of those services is achieved after interpretation of the PMQL goal.
The services are the effective values that are provided to the user as a response to its goal.
They are tightly coupled with the business context (metadata) or with the circumstances of the
request. More exactly, they depend on a data dictionary, on the existence of an applicable DeVisa
model, etc. The PMQL Web Service deals with the arbitrary XML of the incoming and outgoing SOAP
envelopes without any type mapping or data binding, i.e., it is a message service (Axis, 2007).
The raw XML in a received SOAP envelope is passed to the PMQL engine, which attempts to interpret
the XML as a PMQL query. This type of service has the advantage of separating the expression of
the consumer's goal from the choreography of the concrete services.

DeVisa borrows some principles used in the WSMO framework (Fensel et al., 2007): web compliance,
ontology as data models, strict decoupling of resource definitions, separation of the description
from the implementation, ontological role separation, and execution semantics. A clear distinction
between a service and a web service is made. The web service is a computational entity able to
achieve the user's goal by invocation, whereas the service is the actual value provided by the
invocation of the web service. Model composition in DeVisa follows a similar pattern. The consumer
provides a goal that specifies the available input and the expected output. The composer attempts
to produce an orchestration that at least produces all the expected outputs and at most expects
all the possible input messages.

4 CONCLUSIONS

The paper presented a theoretical foundation for the prediction model composition problem within
an online scoring system called DeVisa. The prediction models are stored in PMML format in a
native XML database. Model composition in DeVisa is limited by the unavailability of the original
data set and to the model sequencing and selection supported by PMML. Nevertheless, the consumer
can provide a validation data set to train a new customized model. The paper identifies the
contexts in which model composition can occur (implicit or explicit) and analyzes the possible
approaches. To achieve a clear separation between the consumer's goal and the effective DeVisa
services, a specialized language (PMQL) is introduced. Because the main DeVisa functionality is
available through web services, model composition can follow some of the principles used in the
web services composition process.

REFERENCES

Axis (2007). Apache Axis. http://ws.apache.org/axis/.

DeVisa (2007). DeVisa. http://devisa.sourceforge.net.

Fensel, D., Lausen, H., Polleres, A., de Bruijn, J., Stollberg, M., Roman, D., and Domingue, J.
(2007). Enabling Semantic Web Services. Springer, Berlin Heidelberg.

Gorea, D. (2008). DeVisa - concepts and architecture of a data mining models scoring and
management web system. In Proceedings of the 10th International Conference on Enterprise
Information Systems. INSTICC.

Harries, M., Sammut, C., and Horn, K. (1998). Extracting hidden context. Machine Learning,
36(2):101–126.

Kuncheva, L. I. (2004). Classifier ensembles for changing environments. In Proc. 5th Int. Workshop
on Multiple Classifier Systems, volume 3077 of LNCS, pages 1–15. Springer-Verlag.

PMML (2007). PMML version 3.2. http://www.dmg.org/pmml-v3-2.html.

PMQL (2008). Predictive modelling query language. http://devisa.sourceforge.net/pmql.shtml.

Prodromidis, A., Chan, P., and Stolfo, S. (2000). Meta-learning in distributed data mining
systems: Issues and approaches, chapter 3. AAAI/MIT Press.

SOAP (2007). SOAP version 1.2 part 1: Messaging framework (second edition).
http://www.w3.org/TR/soap12-part1/.

Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques.
Morgan Kaufmann series in data management systems. Elsevier.

XQuery (2007). XQuery 1.0: An XML query language. http://www.w3.org/TR/xquery/.
