Best Weighted Selection in Handling Error Heterogeneity Problem on Spatial Regression Model

Sri Sulistijowati Handajani, Cornelia Ardiana Savita, Hasih Pratiwi, and Yuliana Susanti

Statistics Study Program, FMIPA, Universitas Sebelas Maret, Jl. Ir. Sutami 36 A Kentingan, Surakarta, Indonesia

Mathematics Study Program, FMIPA, Universitas Sebelas Maret, Jl. Ir. Sutami 36 A Kentingan, Surakarta, Indonesia

Keywords: Spatial Regression Model, Heterogeneity in Error, Ensemble Technique, R², RMSE.

Abstract: A spatial regression model describes the relationship between independent variables and a dependent variable while accounting for spatial effects, which arise when observations at one location are strongly related to those at adjacent locations. One assumption of the spatial regression model is homogeneity of the error variance, but data often vary considerably across locations, so the assumption is not met. One such case is the poverty data of Central Java Province. The objective of this research is to obtain the best model for these data in the presence of error heterogeneity. The ensemble technique simulates noise (m) from a normal distribution with mean zero and standard deviation σ, taken from the error of the spatial model, and adds the noise to the dependent variable. The technique is applied with two spatial weights, the Queen contiguity weight and the cross-correlation normalization weight, and the resulting models are compared using R² and RMSE on the poverty data of Central Java Province. Both weights are used to determine the factors that significantly influence poverty and to choose the best model. The results of the case study show that the ensemble SEM spatial regression model no longer has heterogeneous error variance, and that the model using the cross-correlation normalization weight is better than the ensemble SEM spatial regression model using the Queen contiguity weight.

1 INTRODUCTION

Regression modeling is a classical form of modeling used to describe the relationship between independent variables and a dependent variable. The assumptions of the model must be fulfilled to ensure that the model is sound and can be used for prediction. Problems often arise when the model does not meet the necessary assumptions.

Increasingly diverse data patterns in the field have driven the development of the existing classical models. One assumption that is often not met is the absence of autocorrelation in the errors. If the dependent variable is observed in several adjacent areas, the errors of the resulting model are often correlated. The phenomena studied often show significant associations or interactions of variables in adjacent areas, as expressed by Anselin (1988). Extending the regression model with spatial effects can eliminate the dependence between errors. This is in accordance with LeSage (1997), who states that conclusions drawn from a model that ignores the spatial effect will be invalid.

Models with spatial effects include the Spatial Autoregressive (SAR) model discussed by Qu (2013), which captures the dependence of the dependent variable between locations (autoregression), and the Spatial Error Model (SEM) discussed by McMillen (1992), which captures the correlation of errors between locations.

The spatial models raise another problem, namely heterogeneity of the errors, which makes the parameter estimates unstable. Unstable parameter estimates in turn produce less valid results. This is also stated by Dimopoulos, Tsiros, Serelis and Chronopoulou (2004) regarding the application of neural network models to various nonlinear problems. This instability is a weakness of the resulting model.

Several approaches to the classical regression model have been developed to deal with these violations of assumptions. One is the robust regression method of Chen (2002), which emphasizes the detection of extreme observations, called outliers, and the resistance of the model to them. Regression with M estimation has also been discussed by Montgomery and Peck (2006) and Yuliana and Susanti (2008). S estimation was introduced by Rousseeuw and Yohai (1984), and MM estimation was discussed by Yohai (1987).

Handajani, S., Savita, C., Pratiwi, H. and Susanti, Y.
Best Weighted Selection in Handling Error Heterogeneity Problem on Spatial Regression Model.
DOI: 10.5220/0008521002930299
In Proceedings of the International Conference on Mathematics and Islam (ICMIs 2018), pages 293-299
ISBN: 978-989-758-407-7

The problem that arises is how to model data whose errors, in addition to being autocorrelated, are also heterogeneous under the classical regression model.

Autocorrelated errors are handled by spatial regression modeling. This research discusses the problem of error heterogeneity in the spatial regression model, and the solution applied is the ensemble method. According to Mevik, Segtnan and Naes (2005) and Canuto, Oliveira, Junior, Santos and Abreu (2005), ensemble techniques can reduce the variability of predictive models and can improve prediction accuracy. The method combines k spatial regression models formed by adding noise. There are two approaches, the non-hybrid ensemble and the hybrid ensemble. The non-hybrid ensemble combines the estimation results from simulations of a single model into one final estimate, while the hybrid ensemble involves several suitable models and combines the simulated predictions of each model into one final prediction. This study uses the non-hybrid ensemble approach, compares the Queen weight with the cross-correlation normalization spatial weight, and uses R² and RMSE as indicators to select the best approach.

2 SPATIAL PANEL REGRESSION MODEL

For data collected over time and across locations, the analysis is carried out using panel data analysis. Because of the spatial effect in the panel data, the appropriate model is a spatial panel regression model. One such model is the panel spatial error model (SEM) (LeSage, 2009), which was developed from the SEM model proposed by Anselin (2003). The SEM panel regression model is

$$ y_{it} = \alpha + x_{it}\beta + u_{it}, \qquad u_{it} = \lambda \sum_{j=1}^{N} w_{ij} u_{jt} + \varepsilon_{it} \qquad (1) $$

where $y_{it}$ is the dependent variable in the i-th observation unit at time t, $x_{it}$ is the independent variable in the i-th observation unit at time t, $W = (w_{ij})$ is the row-standardized spatial weight matrix over the N locations, α is the intercept, β is the parameter of the independent variable, λ is the spatial error coefficient, $u_{it}$ is the spatial error in the i-th region at time t, and $\varepsilon_{it}$ is the model error for the i-th observation at time t.
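As an illustration of what a row-standardized Queen contiguity matrix looks like, the following minimal sketch builds one for a toy regular grid. The grid layout and the function name `queen_weights` are assumptions made for this example only; the study itself uses the boundaries of the districts and cities of Central Java Province, which are not reproduced here.

```python
def queen_weights(n_rows, n_cols):
    """Row-standardized Queen contiguity matrix for the cells of a toy grid."""
    n = n_rows * n_cols
    w = [[0.0] * n for _ in range(n)]
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            # Queen contiguity: cells that share an edge or a corner are neighbours.
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < n_rows and 0 <= cc < n_cols
                    if (dr, dc) != (0, 0) and inside:
                        w[i][rr * n_cols + cc] = 1.0
    # Row standardization: divide every row by its number of neighbours,
    # so that each row of W sums to one.
    for row in w:
        total = sum(row)
        if total > 0:
            for j in range(n):
                row[j] /= total
    return w


W = queen_weights(3, 3)  # hypothetical 3 x 3 grid of regions
```

In this toy grid the centre cell has eight neighbours, each with weight 1/8, while a corner cell has three neighbours, each with weight 1/3.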

Stacking the observations, model (1) is further simplified to

$$ y = Z\delta + u, \qquad u = \lambda W u + \varepsilon, $$

so that

$$ y = Z\delta + (I - \lambda W)^{-1}\varepsilon, $$

where $y$, $u$ and $\varepsilon$ are the stacked vectors of the dependent variable, the spatial errors and the model errors, $Z = [\mathbf{1} : X]$ (a column of ones followed by the independent variables), $\delta = (\alpha, \beta')'$, and $W$ is the spatial weight matrix arranged to match the stacked observations.

To estimate the parameters by the maximum likelihood method, the likelihood function must first be formed. Writing $\varepsilon = (I - \lambda W)(y - Z\delta)$ with $\varepsilon \sim N(0, \sigma^2 I)$, the Jacobian of the transformation is

$$ J = \left| \frac{\partial \varepsilon}{\partial y} \right| = |I - \lambda W| $$

and the likelihood function for the n stacked observations is

$$ L(\delta, \sigma^2, \lambda) = (2\pi\sigma^2)^{-n/2}\, |I - \lambda W|\, \exp\left\{ -\frac{1}{2\sigma^2} (y - Z\delta)'(I - \lambda W)'(I - \lambda W)(y - Z\delta) \right\} \qquad (2) $$

To facilitate the estimation of the parameters, the logarithm of both sides of (2) is taken, which gives

$$ \ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 + \ln|I - \lambda W| - \frac{1}{2\sigma^2} (y - Z\delta)'(I - \lambda W)'(I - \lambda W)(y - Z\delta) \qquad (3) $$

Taking the partial derivatives of (3) with respect to δ, σ² and λ and setting them equal to zero gives the estimators of δ and σ² and the estimating equation for λ:

$$ \hat{\delta} = \left[ Z'(I - \lambda W)'(I - \lambda W) Z \right]^{-1} Z'(I - \lambda W)'(I - \lambda W)\, y $$

$$ \hat{\sigma}^2 = \frac{1}{n} (y - Z\hat{\delta})'(I - \lambda W)'(I - \lambda W)(y - Z\hat{\delta}) $$

$$ -\operatorname{tr}\left[ (I - \lambda W)^{-1} W \right] + \frac{1}{\hat{\sigma}^2} (y - Z\hat{\delta})'(I - \lambda W)' W (y - Z\hat{\delta}) = 0 $$

The last equation has no closed-form solution in λ, so $\hat{\lambda}$ is obtained numerically.

3 EXPERIMENTAL DETAILS

1. Derive the parameter estimators of the SEM spatial panel regression model by maximum likelihood.
2. Find an example case that fits one of the spatial regression models with an area approach and whose errors are heterogeneous. The case taken in this study is poverty in the 35 districts and cities of Central Java Province. The data are taken from the Central Bureau of Statistics (2008-2015). The variables are the percentage of poor people (Y), the percentage of poor people aged more than 14 years who did not finish primary school (X1), the percentage of the population aged 15-55 years that is not illiterate (X2), the percentage of poor people aged more than 14 years who are unemployed (X3), the percentage of poor people aged more than 14 years who work in agriculture (X4), the percentage of women using contraceptives (X5), the percentage of poor women aged 15-49 years whose first delivery was assisted by health personnel (X6), the percentage of households whose houses have a per capita floor area of less than 8 m² (X7), the percentage of households using their own or a shared toilet (X8), the percentage of households that have ever bought Raskin rice (X9), and the population growth rate in percent (X10).
3. Determine the independent variables that significantly influence the percentage of poor people using the stepwise method, and form the linear regression model.
4. Determine the weight matrices: the Queen contiguity spatial matrix and the cross-correlation normalization matrix.
5. Detect the spatial effect using the Moran Index test.
6. Apply the Lagrange Multiplier (LM) test to determine the type of spatial dependence.
7. Fit the corresponding spatial regression model and test its assumptions.
8. Generate noise k times from the normal distribution with mean zero and standard deviation σ of the model error, add it to the dependent variable, and set any resulting negative values to zero, to produce k new data sets.
9. Fit a spatial regression model to each of the k new data sets as follows.
a. Apply the Lagrange Multiplier test for spatial dependence.
b. Apply the Breusch-Pagan test for spatial heterogeneity.
c. Estimate the model parameters and test their significance.
d. Measure the goodness of fit of the spatial regression model with R².
10. Form the ensemble model, a composite of the k spatial regression models, by averaging the model coefficients.
11. Compare the ensemble spatial regression models for the two weights by looking for the largest R² and the smallest RMSE.
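Steps 8 to 10 can be sketched in a few lines of code. The sketch below is only an illustration under stated assumptions: an ordinary least squares line with one regressor stands in for the SEM fit, the data are made up, and the names `fit_line` and `noise_ensemble` are hypothetical. It shows the mechanics of adding noise, truncating negative values at zero, refitting, and averaging the coefficients.

```python
import random
import statistics


def fit_line(x, y):
    """Least squares fit of y = a + b*x (a stand-in for the SEM fit)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def noise_ensemble(x, y, k=100, seed=0):
    """Average the coefficients of k models fitted to noise-perturbed data."""
    rng = random.Random(seed)
    a0, b0 = fit_line(x, y)
    residuals = [yi - (a0 + b0 * xi) for xi, yi in zip(x, y)]
    sigma = statistics.pstdev(residuals)  # standard deviation of the model error
    fits = []
    for _ in range(k):
        # Step 8: add N(0, sigma) noise and set negative values to zero.
        y_new = [max(0.0, yi + rng.gauss(0.0, sigma)) for yi in y]
        # Step 9: refit the model on the perturbed data.
        fits.append(fit_line(x, y_new))
    # Step 10: the ensemble model is the average of the k coefficient sets.
    a_ens = statistics.fmean([a for a, _ in fits])
    b_ens = statistics.fmean([b for _, b in fits])
    return (a0, b0), (a_ens, b_ens), sigma


# Made-up illustrative data, not the poverty data of the study.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [12.1, 13.9, 16.2, 17.8, 20.1, 22.3, 23.8, 26.0]
base, ensemble, sigma = noise_ensemble(x, y)
```

Fixing the seed makes the simulation reproducible. In the study the refit in step 9 is the SEM estimated with the chosen spatial weight, not the simple line used here.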

4 RESULTS AND DISCUSSION

For the poverty case in Central Java Province, linear regression modeling begins with the selection of variables that significantly affect the percentage of poor people in Central Java Province in 2015. Variables are selected by the stepwise method, and the resulting linear regression model is





 



 



 



To check the adequacy of the model, normality is tested with the Kolmogorov-Smirnov statistic, which gives a test statistic of 0.0728571. At the 5 percent significance level, it is concluded that the assumption of normally distributed errors is met. For the multicollinearity test, the VIF values of the three independent variables are all less than 10, so it is concluded that there is no multicollinearity in the model. Homogeneity of variance is tested with the Breusch-Pagan (BP) statistic, which gives BP = 7.8908 > $\chi^2_{(0.05;3)}$ = 7.815, so it is concluded that the model is heteroscedastic. In addition, spatial correlation between the errors is tested using the Moran Index. The test shows positive spatial autocorrelation with a Moran Index of 0.24, which means that errors at adjacent locations are similar and tend to cluster.
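The sign of the Moran Index can be illustrated with a small computation. The sketch below uses the standard definition of Moran's I on a made-up chain of four locations with binary contiguity weights; the function name `morans_i` and the data are assumptions for illustration and are not taken from the study.

```python
def morans_i(values, w):
    """Moran's I for a list of values and a spatial weight matrix w."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in w)  # sum of all weights
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


# Four locations in a row; each location neighbours the next one.
w_chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1.0, 2.0, 3.0, 4.0], w_chain)    # similar neighbours
alternating = morans_i([1.0, 4.0, 1.0, 4.0], w_chain)  # dissimilar neighbours
```

Here `clustered` is positive (1/3) because neighbouring values are similar, while `alternating` is negative (-1) because neighbouring values differ, matching the interpretation of positive and negative spatial autocorrelation in the text.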

4.1 SEM Model

The Moran Index indicates a spatial effect, so a test for that effect is necessary. The test uses the Lagrange Multiplier (LM) lag and LM error statistics. The LM lag statistic of 2.1087 and the LM error statistic of 4.0997 are compared with the chi-square table value of 3.841. The conclusion is that there is no spatial lag effect but there is a spatial error effect in the model, so the appropriate model is the SEM model.
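The comparison with the chi-square table value can be written as a simple decision rule. The sketch below is a simplified illustration, assuming each LM statistic is referred to a chi-square distribution with one degree of freedom at the 5 percent level; the function names and the returned labels are hypothetical.

```python
from statistics import NormalDist


def chi2_critical_df1(alpha=0.05):
    """Upper critical value of the chi-square distribution with 1 df."""
    # A chi-square variable with 1 df is the square of a standard normal one.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z


def choose_spatial_model(lm_lag, lm_error, alpha=0.05):
    """Pick a model from the LM lag and LM error statistics (simplified rule)."""
    crit = chi2_critical_df1(alpha)
    lag_significant = lm_lag > crit
    error_significant = lm_error > crit
    if error_significant and not lag_significant:
        return "SEM"
    if lag_significant and not error_significant:
        return "SAR"
    if lag_significant and error_significant:
        return "both significant"
    return "no spatial dependence detected"


critical = chi2_critical_df1()                   # approximately 3.841
decision = choose_spatial_model(2.1087, 4.0997)  # statistics reported above
```

With the reported statistics, only the LM error statistic exceeds the critical value, so the rule returns the SEM model.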

Estimating the parameters of the SEM model with the Queen weight gives the following model, in which every parameter is significant.





 



 









 .







To overcome the non-homogeneous variance, noise is added to the dependent variable. The noise is generated from a normal distribution with mean zero and standard deviation equal to the standard deviation of the error, 2.53. The noise is simulated 100 times. An SEM model is then fitted to each noise simulation, and an ensemble spatial regression model is obtained. The ensemble model is the average of the parameter estimates of the p regression models, where p is the number of spatial regression models. The ensemble spatial regression model is expressed as

$$ \hat{\beta}_{ens} = \frac{1}{p} \sum_{j=1}^{p} \hat{\beta}_{j} $$

where $\hat{\beta}_{j}$ denotes the parameter estimates of the j-th spatial regression model.

From the data analysis, the ensemble spatial error regression model obtained from the mean of the parameter estimates of the hundred models is





  



 









 

with an R² value of 0.7271, which means that 72.71% of the variation in the percentage of poor people is explained by the percentage of households using their own or a shared toilet, the percentage of households that have bought Raskin rice, and the population growth rate. To check the adequacy of the model, normality is tested with the Kolmogorov-Smirnov statistic, which gives a test statistic of 0.147. At the 5 percent significance level, it is concluded that the assumption of normally distributed errors is met. Furthermore, homogeneity of variance is tested with the Breusch-Pagan (BP) statistic, which gives BP = 6.99240 < $\chi^2_{(0.05;3)}$ = 7.815, so it is concluded that the model is no longer heteroscedastic.
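The two goodness-of-fit indicators used throughout this section can be computed as in the following sketch. It assumes the usual definitions, with RMSE as the square root of the mean squared error over all observations and R² as one minus the ratio of the residual to the total sum of squares; the paper does not spell out its exact formulas, and the numbers below are made up for illustration.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up percentages, not values from the study.
observed = [10.0, 12.0, 15.0, 19.0]
predicted = [11.0, 12.0, 14.0, 19.0]

example_rmse = rmse(observed, predicted)
example_r2 = r_squared(observed, predicted)
```

A larger R² and a smaller RMSE both indicate predictions that lie closer to the observed values.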

For the panel data, the analysis likewise begins with an ordinary regression model. The linear regression model is built by selecting the variables that are significant to the model with the stepwise method. For the poverty case in Central Java Province from 2008 to 2015, the resulting linear regression model is







  



 



 



 



RMSE = 3.151822

To check the adequacy of the model, normality is tested with the Kolmogorov-Smirnov statistic, which gives a test statistic of 0.2071, with critical value D(0.05;280) = 0.0807. At the 5 percent significance level, it is concluded that the assumption of normally distributed errors is met. For the multicollinearity test, the VIF values of the four independent variables in the model above are all less than 10, which means that there is no multicollinearity in the model. Homogeneity of variance is tested with the Breusch-Pagan (BP) statistic, which gives BP = 17.0267, larger than the chi-square critical value, so it is concluded that the model is heteroscedastic. In addition, spatial correlation between the errors is tested using the Moran Index. The test shows negative spatial autocorrelation with a Moran Index of -0.0807, which means that errors at adjacent locations differ and tend to be dispersed. The Moran Index indicates a spatial effect, so a test for that effect is necessary. The test uses the Lagrange Multiplier (LM) lag and LM error statistics, each of which is compared with the chi-square table value of 3.841. The conclusion is that there is no spatial lag effect but there is a spatial error effect in the model, so the appropriate model is the SEM model.

4.2 SEM Panel Model with Queen Weight

Estimating the parameters of the SEM model with the Queen weight gives the following model, in which every parameter is significant.



















 



 













with an R² value of 0.725807, which means that 72.58% of the variation in the percentage of poor people is explained by the percentage of households using their own or a shared toilet, the percentage of households that have bought Raskin rice, the percentage of poor people aged more than 14 years working in the agricultural sector, and the percentage of poor women aged 15-49 years whose first delivery was assisted by health personnel, and with RMSE = 3.130008.

To overcome the heteroscedasticity of the error variance, noise is added to the dependent variable. The noise is generated from a normal distribution with mean zero and standard deviation equal to the standard deviation of the error, 0.23. The noise is simulated 100 times. An SEM model is then fitted to each noisy data set, and the average of the hundred models is computed. The ensemble spatial error regression model is







 



 









 



 













with an R² value of 0.750338, which means that 75.03% of the variation in the percentage of poor people is explained by the percentage of households using their own or a shared toilet, the percentage of households that have bought Raskin rice, the percentage of poor people aged over 14 years who work in agriculture, and the percentage of poor women aged 15-49 years whose first delivery was assisted by health personnel, and with RMSE = 3.117259. To check the adequacy of the model, normality is tested with the Kolmogorov-Smirnov statistic, which gives a test statistic of 0.02438 < D(0.05;280) = 0.0807. At the 5 percent significance level, it is concluded that the assumption of normally distributed errors is met. Furthermore, homogeneity of variance is tested with the Breusch-Pagan (BP) statistic, which gives BP = 6.217639 < $\chi^2_{(0.05;4)}$ = 9.48773, so it is concluded that the model is no longer heteroscedastic.

4.3 SEM Panel Model with Cross-Correlation Normalization Weight

For the cross-correlation normalization weight, the panel data analysis likewise begins with an ordinary regression model. The linear regression model is built by selecting the variables that are significant to the model with the stepwise method. For the poverty case in Central Java Province from 2008 to 2015, the resulting linear regression model is







 



 



 



 



RMSE = 3.151822.

To check the adequacy of the model, normality is tested with the Kolmogorov-Smirnov statistic, which gives a test statistic of 0.2493. At the 5 percent significance level, it is concluded that the assumption of normally distributed errors is met. For the multicollinearity test, the VIF values of the four independent variables in the model above are all less than 10, which means that there is no multicollinearity in the model. Homogeneity of variance is tested with the Breusch-Pagan (BP) statistic, which gives BP = 12.9205, larger than the chi-square critical value, so it is concluded that the model is heteroscedastic. In addition, spatial correlation between the errors is tested using the Moran Index. The test shows negative spatial autocorrelation with a Moran Index of -0.03062855, which means that errors at adjacent locations differ and tend to be dispersed. The Moran Index indicates a spatial effect, so a test for that effect is necessary. The test uses the Lagrange Multiplier (LM) lag and LM error statistics. The LM lag statistic of 2.635515 and the LM error statistic of 11.721478 are each compared with the chi-square table value of 3.841. The conclusion is that there is no spatial lag effect but there is a spatial error effect in the model, so the appropriate model is the SEM model.

Estimating the parameters of the SEM model with the cross-correlation normalization weight gives the following model, in which every parameter is significant.







 



 



 



 



 













with an R² value of 0.7433, which means that 74.33% of the variation in the percentage of poor people is explained by the percentage of households using their own or a shared toilet, the percentage of households that have bought Raskin rice, the percentage of poor people aged more than 14 years working in the agricultural sector, and the percentage of poor women aged 15-49 years whose first delivery was assisted by health personnel, and with RMSE = 3.107759.

To overcome the heteroscedasticity of the error variance, noise is added to the dependent variable. The noise is generated from a normal distribution with mean zero and standard deviation equal to the standard deviation of the error, 0.21. The noise is simulated 100 times. An SEM model is then fitted to each noisy data set, and the average of the hundred models is obtained. The ensemble spatial error regression model is







 



 









 



 













with an R² value of 0.7775, which means that 77.75% of the variation in the percentage of poor people is explained by the percentage of households using their own or a shared toilet, the percentage of households that have bought Raskin rice, the percentage of poor people aged over 14 years who work in agriculture, and the percentage of poor women aged 15-49 years whose first delivery was assisted by health personnel, and with RMSE = 3.082789. To check the adequacy of the model, normality is tested with the Kolmogorov-Smirnov statistic, which gives a test statistic of 0.048286. At the 5 percent significance level, it is concluded that the assumption of normally distributed errors is met. Furthermore, homogeneity of variance is tested with the Breusch-Pagan (BP) statistic, which gives BP = 5.73146 < $\chi^2_{(0.05;4)}$ = 9.48773, so it is concluded that the model is no longer heteroscedastic.
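The selection rule of largest R² and smallest RMSE can be applied directly to the values reported for the two ensemble panel models. The sketch below is a minimal illustration; the dictionary layout and the function name `best_model` are assumptions, while the numbers are those reported in Sections 4.2 and 4.3.

```python
# R² and RMSE of the two ensemble SEM panel models, as reported in the text.
reported = {
    "queen": {"r2": 0.750338, "rmse": 3.117259},
    "cross-correlation normalization": {"r2": 0.7775, "rmse": 3.082789},
}


def best_model(models):
    """Return the model that has both the largest R² and the smallest RMSE."""
    by_r2 = max(models, key=lambda name: models[name]["r2"])
    by_rmse = min(models, key=lambda name: models[name]["rmse"])
    # Declare a winner only when the two criteria agree.
    return by_r2 if by_r2 == by_rmse else None


selected = best_model(reported)
```

Both criteria point to the same weight here, so no tie-breaking rule is needed.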

Of the two weights used with the panel data, the analysis concludes that the model with the cross-correlation normalization weight is better than the model with the Queen weight. The best model is used to predict the percentage of poor people in Central Java Province in 2015. The predictions are then grouped into six priority zones, as follows.

Table 1: Priority zones by percentage of the poor population.

Zone    Percentage of the poor population
1       > 35 %
2       30 % - 34.99 %
3       25 % - 29.99 %
4       20 % - 24.99 %
5       15 % - 19.99 %
6       < 15 %

Of the six priority zones in Table 1, the predicted poverty percentages of the districts and cities in Central Java Province fall in only three zones, namely the second, third and fourth priority zones. The results are shown in Table 2.

Table 2: Districts and cities of Central Java Province by priority zone.

Zone    Districts and cities
2       Wonosobo regency, Banjarnegara regency, Wonogiri regency
3       Cilacap regency, Banyumas regency, Purbalingga regency, Kebumen regency, Purworejo regency, Magelang regency, Boyolali regency, Klaten regency, Sukoharjo regency, Karanganyar regency, Sragen regency, Grobogan regency, Blora regency, Rembang regency, Pati regency, Kudus regency, Jepara regency, Demak regency, Semarang regency, Temanggung regency, Batang regency, Pekalongan regency, Pemalang regency, Tegal regency, Brebes regency
4       Surakarta city, Salatiga city, Pekalongan city, Magelang city, Semarang city, Tegal city, Kendal regency
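The grouping of predictions into priority zones can be sketched as a simple lookup. The zone numbers 1 to 6 (1 being the highest priority) and the treatment of boundary values, such as assigning exactly 35 % to zone 1, are assumptions of this sketch, since Table 1 lists ranges rounded to two decimals; the function name `priority_zone` is hypothetical.

```python
def priority_zone(percent_poor):
    """Map a predicted percentage of poor people to a priority zone (1 to 6)."""
    # Lower bound of each zone, from the highest priority downwards.
    lower_bounds = [(35.0, 1), (30.0, 2), (25.0, 3), (20.0, 4), (15.0, 5)]
    for bound, zone in lower_bounds:
        if percent_poor >= bound:
            return zone
    return 6  # below 15 %


zones = [priority_zone(p) for p in (36.0, 32.5, 27.0, 21.0, 15.0, 14.99)]
```

Applied to a list of predicted percentages, the function returns the zone of each district or city in the same order.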

5 CONCLUSIONS

The poverty case study shows that the ensemble SEM spatial regression model no longer has heterogeneous error variance, and that the model using the cross-correlation normalization weight is better than the ensemble SEM spatial regression model using the Queen contiguity weight.

ACKNOWLEDGMENT

The authors would like to thank the Research Group of Applied and Inference Statistics for the discussions. We also thank Universitas Sebelas Maret for providing financial support through the Grant of Fundamental Research 2018.

REFERENCES

Anselin, L., 1988. Spatial Econometrics: Methods and Models. Kluwer Academic Publishers, Dordrecht.

Anselin, L., 2003. Spatial Externalities, Spatial Multipliers, and Spatial Econometrics, International Regional Science Review, University of Illinois.

Chen, C., 2002. Robust Regression and Outlier Detection with the ROBUSTREG Procedure, Paper 265-27, Statistics and Data Analysis, SUGI 27, SAS Institute Inc., North Carolina.

Canuto, A.M.P., Oliveira, L., Junior, J.C.X., Santos, A., Abreu, M., 2005. Performance and diversity evaluation in hybrid and non-hybrid structures of ensembles, Proceedings of the Fifth International Conference on Hybrid Intelligent Systems.

Central Bureau of Statistics, 2015. Data dan informasi kemiskinan kabupaten/kota, BPS, Indonesia.

Dimopoulos, L.F., Tsiros, L.X., Serelis, K. and Chronopoulou, A., 2004. Combining Neural Network Models to Predict Spatial Patterns of Airborne Pollutant Accumulation in Soils around an Industrial Point Emission Source, Journal of the Air & Waste Management Association, 54(12), 1506-1515.

LeSage, J.P., 1997. Bayesian estimation of spatial autoregressive models, International Regional Science Review, 20(1-2), pp. 113-129.

LeSage, J.P., 2009. Introduction to Spatial Econometrics, CRC Press, Taylor and Francis Group, Florida.

Mevik, H.B., Segtnan, V.H., Naes, T., 2005. Ensemble methods and partial least squares regression, Journal of Chemometrics, 18(11), 498-507.

McMillen, D.P., 1992. Probit with spatial autocorrelation, Journal of Regional Science, 32(3), pp. 335-348.

Montgomery, D.C. and Peck, E.A., 2006. Introduction to Linear Regression Analysis, John Wiley & Sons Inc., New York.

Qu, X., 2013. Three Essays on the Spatial Autoregressive Model in Spatial Econometrics, Dissertation, The Ohio State University.

Rousseeuw, P.J. and Yohai, V.J., 1984. Robust Regression by Means of S-Estimators, Springer-Verlag, Berlin.

Yohai, V.J., 1987. High Breakdown Point and High Efficiency Robust Estimates for Regression, The Annals of Statistics, 15, 642-656.

Yuliana and Susanti, Y., 2008. Estimasi-M dan Sifat-sifatnya pada Regresi Linear Robust, Jurnal Math-Info, Vol. 1 No. 10.
