Sales Forecasting for Firms based on Multiple Regression Model

Guanyi Wang

Ansai Senior High School, Yan’an, China

guanghua.ren@gecacademy.cn

Keywords: Sales Prediction, Linear Model, Multiple Linear Regression.

Abstract: This paper describes the construction and use of a multiple linear regression (MLR) model for predicting a firm's sales. The data come from a semiconductor manufacturing company, ABCtronics, and cover its historical sales from 2004 to 2013 together with three factors that may affect sales (i.e., overall market demand, price per chip, and economic condition). An MLR model is fitted to these data in Excel to find the coefficient of each independent variable, and the fitted model can then be used to estimate future sales. The resulting model offers a guideline for forecasting a firm's sales, or for predicting other trends and relationships, in situations that involve multiple variables.

1 INTRODUCTION

Understanding the sales landscape and forecasting sales help corporations set realistic goals and understand the scope of their business. Obtaining accurate sales forecasts is almost as important as achieving the revenue targets themselves. However, with so many different sales forecasting methods available, it is unclear which one provides the most accurate view. According to CSO Insights, 60% of forecast transactions are not actually completed, and 25% of sales managers are not satisfied with the accuracy of their forecasts. Prediction relies on applying demand data and other data to estimate future sales, so sales forecasts can only be as good as the data they are based on. Forecasting experts use three types of sales forecasting techniques, which are distinguished by the type of input data they use. Choosing the right forecasting technique can greatly improve the ability to predict future revenue accurately (Michael, 2021; Box, Jenkins, 1970; McKenzie, 1984; Hyndman, Rob, 2015; Bao, Yue, Rao, 2017).

As for the prediction model, many factors affect the quantity being predicted, which makes time series data such as stock prices difficult to forecast. In addition, the impact of different factors on the stock price may be linear or nonlinear, so building a good model of stock prices remains a challenge for researchers. Long short-term memory (LSTM), a variant of the recurrent neural network, can capture temporal dependencies and has achieved great success in time series prediction. In addition, convolutional neural networks have been compared with other methods on real stock data sets, with results that confirm the efficiency and scalability of such models (Tomas, Martin, Lukas, Jan, Sanjeev, 2010; Ronald Williams, Geoffrey Hinton, Rumelhart, David, 1986; Yoshua, Patrice, Paolo, 1994; Hochreiter, Schmidhuber, 1997; Cho, Merrienboer, Bahdanau, Yoshua, 2014; Chen, Zhou, Dai, 2015; Nelson, Pereira, De Oliveira, 2017; Zhang, Li, Morimoto, 2019; Chen, Chen, Huang, Huang, Chen, 2016; Bahdanau, Cho, Bengio, 2014).

https://orcid.org/0000-0003-0085-1270

The rest of the paper is organized as follows. Sec. 2 introduces the data source and the analysis method. Sec. 3 presents the analysis results and offers a corresponding explanation. Finally, a brief summary is given in Sec. 4.

2 DATA AND METHOD

Table I provides ABCtronics' total sales (in millions) from 2004 to 2013, as well as data on three factors that may affect its sales, namely, overall market demand, chip unit price, and economic conditions.


Wang, G.

Sales Forecasting for Firms based on Multiple Regression Model.

DOI: 10.5220/0011198600003440

In Proceedings of the International Conference on Big Data Economy and Digital Management (BDEDM 2022), pages 628-633

ISBN: 978-989-758-593-7

Table 1: Historical Sales Figure of ABCtronics.

| Year | ABCtronics' sales volume (in millions) | Overall market demand (in millions) | Price per chip (in $) | Economic condition* |
|------|------|------|------|------|
| 2004 | 2.39 | 297 | 0.832 | 0 |
| 2005 | 3.82 | 332 | 0.844 | 1 |
| 2006 | 3.33 | 195 | 0.854 | 0 |
| 2007 | 2.49 | 182 | 1.155 | 1 |
| 2008 | 1.56 | 93 | 1.303 | 0 |
| 2009 | 0.97 | 98 | 1.265 | 0 |
| 2010 | 1.32 | 198 | 1.368 | 1 |
| 2011 | 1.42 | 188 | 1.208 | 0 |
| 2012 | 1.48 | 285 | 1.234 | 1 |
| 2013 | 1.85 | 264 | 1.282 | 1 |

Note. *Economic condition: 1 signifies favourable market condition and 0 signifies otherwise.

The economic condition is included in the data as a dummy variable.

In this analysis, a firm's sales are predicted with a multiple linear regression (MLR) model. An MLR model describes the relationship between multiple independent variables and one outcome (dependent) variable through a mathematical function. The general formula of MLR can be written as follows:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \epsilon$$

where $Y$ is the dependent variable, $\beta_0$ is the y-intercept of the linear model, $\beta_1, \ldots, \beta_k$ are the coefficients of the independent variables, $X_1, X_2, \ldots, X_k$ are the independent variables, and $\epsilon$ is the error term, which is the difference between the actual value and the predicted value ($\epsilon = Y - E(Y \mid X)$).

In the case of ABCtronics, there are three independent variables, namely overall market demand, chip unit price, and economic conditions, which are denoted $X_1$, $X_2$, and $X_3$ respectively. Therefore, the model used for predicting sales in this case is:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon$$





Before constructing the model, the first step is to analyze the relationships and correlations between all the variables. This can be done by viewing a matrix plot or by computing a correlation table.

Figure 1: Matrix Plot of overall market demand, chip unit price, economic conditions, and sales volume.


Table 2: Correlations between each variable.

| | Sales volume | Overall market demand | Price per chip | Economic condition |
|------|------|------|------|------|
| Sales volume | 1 | | | |
| Overall market demand | 0.533 | 1 | | |
| Price per chip | -0.855 | -0.536 | 1 | |
| Economic condition | 0.146 | 0.508 | 0.212 | 1 |
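The correlations in Table II can be cross-checked directly from the data in Table I. The following is a minimal Python sketch that uses only the standard library; the helper name `pearson` and the short variable names are illustrative assumptions and are not part of the Excel workflow used in this paper. Small differences in the third decimal place are to be expected from rounding.

```python
from math import sqrt

# Data from Table I (2004-2013).
sales  = [2.39, 3.82, 3.33, 2.49, 1.56, 0.97, 1.32, 1.42, 1.48, 1.85]
demand = [297, 332, 195, 182, 93, 98, 198, 188, 285, 264]
price  = [0.832, 0.844, 0.854, 1.155, 1.303, 1.265, 1.368, 1.208, 1.234, 1.282]
econ   = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

variables = {"sales": sales, "demand": demand, "price": price, "econ": econ}
names = list(variables)

# Lower triangle of the correlation matrix, in the same layout as Table II.
corr = {(a, b): pearson(variables[a], variables[b])
        for i, a in enumerate(names) for b in names[: i + 1]}

for (a, b), r in corr.items():
    print(f"{a:>6} vs {b:<6} {r:6.3f}")
```

Each key of `corr` is a pair (row variable, column variable), so `corr[("price", "sales")]` corresponds to the entry in the "Price per chip" row and "Sales volume" column.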

From Fig. 1 and Table II, there is a weak positive relation between sales volume and economic condition, as well as a weak positive relation between economic condition and price per chip. The other correlations are moderate or strong, with absolute values between about 0.5 and 0.86. A regression model works best when the independent variables are only weakly correlated with one another, but overall market demand has a moderate correlation (absolute value around 0.5) with both of the other independent variables. This may affect the model by making it difficult to estimate the relationship between each independent variable and the dependent variable individually. The same concern motivates the later test for multicollinearity.

Nevertheless, the model is never perfect, as each predicted value usually differs slightly from the actual value. Thus, evaluating the multiple linear regression model plays a crucial part in predicting sales accurately and validly. The model can be evaluated by residual analysis and by the coefficient of determination, R-squared:

$$R^2 = 1 - \frac{RSS}{TSS}$$

where RSS is the residual sum of squares, which is the unexplained variation, and TSS is the total sum of squares, which is the total variation.

The value of R-squared is always between 0 and 1 and indicates the proportion of the variation in the dependent variable that the model explains. A larger R-squared value indicates a model that fits the data better, and vice versa.

In fact, R-squared is not always a reliable guide, because adding more variables to the model never decreases the R-squared value, even when the added variables are irrelevant. Adjusted R-squared can be introduced to deal with this problem, since it penalizes the number of independent variables and thereby increases the validity and reliability of the MLR evaluation. The equation of the adjusted R-squared is:

$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where $n$ is the number of observations and $k$ is the number of independent variables.
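As a worked illustration of the two formulas, the short Python sketch below evaluates them with the sums of squares that Excel reports for the three-variable model (Table IV). The function names are illustrative assumptions, and small differences from the tabulated statistics are expected because the inputs are rounded to three decimals.

```python
def r_squared(rss, tss):
    """Coefficient of determination from the residual and total sums of squares."""
    return 1.0 - rss / tss

def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Rounded sums of squares from the ANOVA output of the three-variable model.
r2 = r_squared(rss=0.719, tss=7.780)
adj_r2 = adjusted_r_squared(r2, n=10, k=3)

print(f"R-squared          = {r2:.3f}")
print(f"Adjusted R-squared = {adj_r2:.3f}")
```

With ten observations and three predictors, the factor (n - 1)/(n - k - 1) equals 9/6, so the unexplained share of the variation is scaled up by half before it is subtracted from 1.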

After running the model in Excel, the null hypothesis that a coefficient is equal to zero is tested using the p-value of each variable. A low p-value (below 0.05 in this case) suggests that the null hypothesis may be rejected. In other words, an independent variable with a low p-value is likely to be a useful addition to the model, because changes in its value are associated with changes in the dependent variable. A larger, insignificant p-value, on the other hand, provides no evidence that changes in the independent variable are related to changes in the response. If a p-value is higher than the typical threshold $\alpha = 0.05$, the variable is not statistically significant, and it is therefore removed from the model.
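The p-values that Excel reports come from Student's t distribution with n - k - 1 degrees of freedom. The sketch below is one way to reproduce them approximately in plain Python, assuming numerical integration of the t density with Simpson's rule; the function names and the number of integration steps are illustrative choices, and the t statistics are the rounded values from Table V.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value, computed as 1 - 2 * integral of the density over
    [0, |t|] with composite Simpson's rule (steps must be even)."""
    t = abs(t_stat)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * (h / 3.0) * total

ALPHA = 0.05
df = 10 - 3 - 1  # n - k - 1 for ten observations and three predictors

# Rounded t statistics from the coefficient table of the three-variable model.
t_stats = {"demand": -2.027, "price": -6.246, "econ": 3.303}
p_values = {name: two_sided_p(t, df) for name, t in t_stats.items()}

for name, p in p_values.items():
    print(f"{name:>6}: p = {p:.4f}, significant at 5% = {p < ALPHA}")
```

Only the overall market demand fails the 5% test under this calculation, which agrees with the decision described in Sec. 3.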

The next step is to test the overall significance of the regression relationship. The overall significance test uses the F statistic to show whether there is a linear relationship between the dependent variable and the independent variables taken together. The null hypothesis and alternative hypothesis are set up as $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ and $H_1$: at least one $\beta_j \neq 0$. The formula of the F test is:

$$F = \frac{MSR}{MSE}$$

Here,

$$MSR = \frac{SSR}{k}, \quad MSE = \frac{SSE}{n - k - 1}$$





and the significance level is $\alpha = 0.05$. Multicollinearity also needs to be tested, because it inflates the standard errors of the coefficients. The variance inflation factor (VIF) of each independent variable can be used to see whether it is correlated with the other independent variables. The equation of the VIF is:

$$VIF_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the R-squared obtained by regressing the $j$-th independent variable on the other independent variables. If VIF is greater than 5, multicollinearity exists.

The final step of evaluating the model is residual analysis. The closer the residuals are to 0, the better the fitted model matches the data points. The residual is defined as:

$$e = Y - \hat{Y}$$

where $Y$ is the actual data value and $\hat{Y}$ is the predicted value. A residual plot can be constructed with the residuals on the vertical axis against an independent variable on the horizontal axis. The shape and trend of the plot indicate how well the model fits the data.

3 RESULTS AND DISCUSSIONS

Based on the data in Table I, the regression is run in Excel. The resulting output for the model is shown in Tables III-V.

Table 3: Regression Statistics.

| Statistic | Value |
|------|------|
| Multiple R | 0.952 |
| R Square | 0.907 |
| Adjusted R Square | 0.861 |
| Standard Error | 0.346 |
| Observations | 10 |

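Table 4: ANOVA Analysis.

| ANOVA | df | SS | MS | F | Significance F |
|------|------|------|------|------|------|
| Regression | 3 | 7.060 | 2.353 | 19.628 | 0.001 |
| Residual | 6 | 0.719 | 0.119 | | |
| Total | 9 | 7.780 | | | |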

Table 5: Regression of Coefficients & statistic Report.

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|------|------|------|------|------|------|------|------|------|
| Intercept | 8.860 | 1.348 | 6.571 | 0.0005 | 5.561 | 12.159 | 5.561 | 12.159 |
| Overall market demand | -0.005 | 0.002 | -2.027 | 0.088 | -0.011 | 0.001 | -0.011 | 0.001 |
| Price per chip | -5.505 | 0.881 | -6.246 | 0.0007 | -7.662 | -3.348 | -7.662 | -3.348 |
| Economic condition | 1.130 | 0.342 | 3.303 | 0.016 | 0.293 | 1.967 | 0.293 | 1.967 |

According to the results, the values of $\beta_0$, $\beta_1$, $\beta_2$ and $\beta_3$ can be found in the Coefficients column of Table V. Hence, the equation of this model can be written as:

$$\hat{Y} = 8.8607 - 0.0052 X_1 - 5.5054 X_2 + 1.1302 X_3$$
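The coefficients above were obtained in Excel, but they can be cross-checked from the data in Table I. The sketch below assumes ordinary least squares solved through the normal equations with a small Gaussian elimination routine, written in plain Python so that it has no dependencies; the helper names `solve` and `fit_ols` are illustrative and do not come from the original analysis.

```python
# Data from Table I (2004-2013).
sales  = [2.39, 3.82, 3.33, 2.49, 1.56, 0.97, 1.32, 1.42, 1.48, 1.85]
demand = [297, 332, 195, 182, 93, 98, 198, 188, 285, 264]
price  = [0.832, 0.844, 0.854, 1.155, 1.303, 1.265, 1.368, 1.208, 1.234, 1.282]
econ   = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(columns, y):
    """Least-squares coefficients [b0, b1, ...] from the normal equations X'X b = X'y."""
    rows = [[1.0] + [float(v) for v in vals] for vals in zip(*columns)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Full model: sales on overall market demand, price per chip, economic condition.
beta = fit_ols([demand, price, econ], sales)
print([round(b, 4) for b in beta])
```

The same `fit_ols` call with only `[price, econ]` as columns gives the coefficients of a two-variable model. For larger problems a numerical library would normally be preferred over hand-written elimination.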







However, the p-value of each independent variable in the regression model shows whether that variable is significant. At the 5% level of significance, the p-value for the overall market demand ($X_1$) is around 0.088, which is more than 0.05 and therefore insignificant. Hence, the overall market demand is dropped, and a new multiple linear regression model is fitted with the remaining independent variables, price per chip and economic condition ($X_2$ and $X_3$).

Table 6: Regression Statistics.

| Statistic | Value |
|------|------|
| Multiple R | 0.918 |
| R Square | 0.844 |
| Adjusted R Square | 0.799 |
| Standard Error | 0.416 |
| Observations | 10 |

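Table 7: ANOVA Analysis.

| ANOVA | df | SS | MS | F | Significance F |
|------|------|------|------|------|------|
| Regression | 2 | 6.567 | 3.283 | 18.959 | 0.001 |
| Residual | 7 | 1.212 | 0.173 | | |
| Total | 9 | 7.780 | | | |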

Table 8: Regression of Coefficients & statistic Report.

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% |
|------|------|------|------|------|------|------|------|------|
| Intercept | 6.451 | 0.766 | 8.421 | 6.556E-05 | 4.640 | 8.263 | 4.640 | 8.263 |
| Price per chip | -4.135 | 0.680 | -6.079 | 0.0005 | -5.744 | -2.526 | -5.744 | -2.526 |
| Economic condition | 0.606 | 0.269 | 2.250 | 0.059 | -0.030 | 1.243 | -0.030 | 1.243 |

Therefore, from Tables VI-VIII, the equation of the new model is:

$$\hat{Y} = 6.4517 - 4.1356 X_2 + 0.6062 X_3$$

where $X_2$ is the price per chip and $X_3$ is the economic condition.







After constructing the equation of the model, the model should be evaluated through the coefficient of determination, R-squared. As seen in Table VI, the R-squared value is around 0.84, which means that the model explains 84% of the total variation. However, because the model has multiple independent variables, the evaluation should also consider the adjusted R-squared value, which is around 0.80. Hence, around 80% of the variation is explained by this model after adjusting for the number of variables, which still indicates a considerably good fit. The regression output also provides the F value and the significance F value, as shown in Table VII. The null hypothesis and alternative hypothesis in this case are $H_0: \beta_2 = \beta_3 = 0$ and $H_1$: at least one $\beta_j \neq 0$. The value of F in Table VII is around 18.96, and the corresponding significance F, 0.0015, is far below $\alpha = 0.05$. Thus, the null hypothesis is rejected by the F test, which means that at least one independent variable is significant.
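The F value can be recomputed from the sums of squares in Table VII. The following Python sketch is a minimal illustration under that assumption; the function name is hypothetical, and the significance F is taken as reported by Excel instead of being recomputed from the F distribution.

```python
def f_statistic(ssr, sse, n, k):
    """Overall F statistic: mean square regression divided by mean square error."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse

# Rounded sums of squares from the ANOVA output of the two-variable model.
f_value = f_statistic(ssr=6.567, sse=1.212, n=10, k=2)

ALPHA = 0.05
significance_f = 0.0015  # value reported by Excel, not recomputed here
reject_h0 = significance_f < ALPHA

print(f"F = {f_value:.2f}, reject H0 at the 5% level: {reject_h0}")
```

The decision rule compares the significance F with the chosen level of 0.05. The F value itself is compared with a critical value from the F distribution, not with the significance F.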

The model is also tested for multicollinearity, since multicollinearity inflates the standard errors. This is done by computing the variance inflation factor (VIF) between the independent variables price per chip and economic condition ($X_2$ and $X_3$) to see whether any correlation exists. The equation of the VIF is:

$$VIF = \frac{1}{1 - R^2}, \quad R^2 = 0.04521448$$

Substituting this value gives a VIF of about 1.047 for these two variables, which means that the model is not affected by multicollinearity, as the value is less than 5.
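The auxiliary R-squared and the resulting VIF can be checked from the data in Table I. With only two independent variables, regressing one on the other gives an R-squared equal to their squared Pearson correlation, which is what the sketch below assumes; the function names are illustrative.

```python
# Data from Table I (2004-2013).
price = [0.832, 0.844, 0.854, 1.155, 1.303, 1.265, 1.368, 1.208, 1.234, 1.282]
econ  = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]

def r_squared_simple(x, y):
    """R-squared of the simple regression of y on x (squared Pearson correlation)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def vif(r2):
    """Variance inflation factor for a given auxiliary R-squared."""
    return 1.0 / (1.0 - r2)

r2_aux = r_squared_simple(econ, price)
vif_value = vif(r2_aux)

print(f"auxiliary R-squared = {r2_aux:.6f}")
print(f"VIF = {vif_value:.4f}, multicollinearity flagged: {vif_value > 5}")
```

Because the auxiliary R-squared is symmetric in the two variables here, price per chip and economic condition share the same VIF.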

Figure 2: Price per chip and economic condition residual plot.

The residual plots for the two independent variables can be generated in Excel, as shown in Fig. 2. Based on the results, the residuals plotted against price per chip and economic condition are all less than 1 in absolute value. No U-shaped or inverted U-shaped pattern appears in the plots. The residuals are scattered essentially at random for both price per chip and economic condition, which indicates that this regression model provides a decent fit to the given data. Finally, because the model fits the given data well, it can be used to predict ABCtronics' future sales.
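The residuals and a forecast can both be obtained from the fitted two-variable equation. The Python sketch below applies that equation to the data in Table I; the function name `predict` and the future scenario (a price of $1.25 per chip in a favourable economy) are hypothetical values chosen only for illustration.

```python
# Data from Table I (2004-2013).
sales = [2.39, 3.82, 3.33, 2.49, 1.56, 0.97, 1.32, 1.42, 1.48, 1.85]
price = [0.832, 0.844, 0.854, 1.155, 1.303, 1.265, 1.368, 1.208, 1.234, 1.282]
econ  = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]

def predict(price_per_chip, economic_condition):
    """Predicted sales (in millions) from the fitted two-variable model."""
    return 6.4517 - 4.1356 * price_per_chip + 0.6062 * economic_condition

# Residual e = Y - Y_hat for every year in the sample.
residuals = [y - predict(p, e) for y, p, e in zip(sales, price, econ)]
max_abs_residual = max(abs(r) for r in residuals)
rss = sum(r * r for r in residuals)

for year, r in zip(range(2004, 2014), residuals):
    print(f"{year}: residual = {r:+.3f}")
print(f"largest |residual| = {max_abs_residual:.3f}, RSS = {rss:.3f}")

# Hypothetical future scenario, for illustration only.
forecast = predict(price_per_chip=1.25, economic_condition=1)
print(f"forecast sales = {forecast:.3f} million")
```

The residual sum of squares computed this way should agree with the residual SS reported in Table VII up to rounding of the coefficients.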

The multiple linear regression model developed for predicting sales still has some limitations. One limitation is that there are only ten observations for each variable, which reduces the reliability of the model. Another limitation is that the p-value of the economic condition in the new model (0.059) remains above 0.05, even though the overall market demand variable has been removed.

4 CONCLUSIONS

In summary, we investigate a sales prediction model based on multiple linear regression. Multiple linear regression continues to develop and plays an important role in sales prediction. According to the fitted models, the chip unit price strongly and negatively affects sales and the economic condition has a weaker positive effect, whereas overall market demand is not statistically significant at the 5% level; economic condition and price per chip are only weakly positively correlated. However, the model is not perfect, because each predicted value usually differs slightly from the actual value. Multicollinearity was checked using the variance inflation factor between the chip price and economic condition ($X_2$ and $X_3$). Therefore, there is still room for further development and exploration. Overall, these results offer a guideline for sales prediction based on multifactor linear models.

REFERENCES

B. Yoshua, S. Patrice and F. Paolo, "Learning long-term dependencies with gradient descent is difficult", IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157-166, 1994.

D. Bahdanau, K. Cho and Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014.

D. M. Q. Nelson, A. C. M. Pereira and R. A. De Oliveira, "Stock market's price movement prediction with LSTM neural networks", Proceedings of the International Joint Conference on Neural Networks, pp. 1419-26, 2017.

E. D. McKenzie, "General exponential smoothing and the equivalent ARMA process", J. Forecasting, pp. 333-344, 1984.

G. E. P. Box and G. Jenkins, Time series analysis forecasting and control, San Francisco, CA: Holden-Day, 1970.

Hyndman and J. Rob, "Athanasopoulos George. 8.9 seasonal ARIMA models", Forecasting: principles and practice, OTexts, 2015.

J. F. Chen, W. L. Chen, C. P. Huang, S. H. Huang and A. P. Chen, "Financial time-series data analysis using deep convolutional neural networks", 7th International Conference on Cloud Computing and Big Data (CCBD), pp. 87-92, 2016.


J. Ronald Williams, E. Geoffrey Hinton, Rumelhart and E. David, "Learning representations by back-propagating errors", Nature, vol. 323, no. 6088, pp. 533-536, 1986.

K. Chen, Y. Zhou and F. Dai, "A LSTM-based method for stock returns prediction: A case study of China stock market", Proceedings: 2015 IEEE International Conference on Big Data, pp. 2823-4, 2015.

K. Cho, B. V. Merrienboer, D. Bahdanau and B. Yoshua, On the properties of neural machine translation: encoder-decoder approaches, 2014.

M. Tomas, K. Martin, B. Lukas, C. Jan and K. Sanjeev, "Recurrent neural network based language model", Interspeech, vol. 2, pp. 3, 2010.

Michael Pici, "Proven Sales Forecasting Methods for Greater Accuracy", Sales Hacker, Jul 02, 2021. https://www.saleshacker.com/sales-forecasting-methods/

S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory", Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

W. Bao, J. Yue and Y. Rao, "A deep learning framework for financial time series using stacked autoencoders and long-short-term memory", PLoS One, vol. 12, no. 7, 2017.

X. Zhang, C. Li and Y. Morimoto, "A multi-factor approach for stock price prediction by using recurrent neural networks", Bulletin of Networking Computing Systems and Software, vol. 8, no. 1, pp. 9-13, 2019.
