On the Prediction of a Nonstationary Bernoulli Distribution
based on Bayes Decision Theory
Daiki Koizumi (https://orcid.org/0000-0002-5302-5346)
Otaru University of Commerce, 3–5–21, Midori, Otaru-city, Hokkaido, 045–8501, Japan
Keywords: Probability Model, Bayes Decision Theory, Nonstationary Bernoulli Distribution, Hierarchical Bayesian Model.
Abstract: A class of nonstationary Bernoulli distributions is considered in terms of Bayes decision theory. In this nonstationary class, the Bernoulli distribution parameter follows a random walk rule. Even under this general class, it is proved that the posterior distribution of the parameter can be obtained analytically with a known hyper parameter. With this theorem, a Bayes optimal prediction algorithm is proposed assuming the 0-1 loss function. Using real binary data, the predictive performance of the proposed model is evaluated against that of a stationary Bernoulli model.
1 INTRODUCTION
Binary data is a popular subject for data analysis and a topic of frequent research (Cox, 1970). From the perspective of Bayesian statistics, the stationary Bernoulli distribution and the stationary binomial distribution are frequently used to deal with binary data (Press, 2003) (Bernardo and Smith, 2000) (Berger, 1985). For Bayesian posterior parameter estimation under the stationary Bernoulli and binomial distributions, one of the most reasonable approaches is to assume the beta distribution as the prior of the parameter. This assumption drastically reduces the computational cost of obtaining the posterior distribution of the parameter using the Bayes theorem, and such a prior is called the natural conjugate (Bernardo and Smith, 2000) (Berger, 1985).
In contrast, there have been many approaches to generalizing the stationarity of the parameter by considering certain aspects of its nonstationarity. In general, assuming nonstationarity of the parameter requires additional parameters compared to the stationary model. Furthermore, if the Bayesian approach is used, it is often difficult to keep the computational cost low when obtaining the posterior of the parameter. This point depends on the assumed class of nonstationarity, and one important result, to the best of the author's knowledge, is the Simple Power Steady Model (SPSM) (Smith, 1979). Under the SPSM, it is guaranteed that the posterior of the parameter can be obtained analytically. Similar aspects were discussed from the generalized perspective of the Kalman filter (Harvey, 1989). Some researchers have tried to apply this result to discrete probability distributions and proposed predictive algorithms (Koizumi et al., 2009) (Koizumi, 2020) (Koizumi et al., 2012) (Yasuda et al., 2001). Koizumi et al. assumed a nonstationary Poisson distribution and proposed a Bayes optimal prediction algorithm under a known nonstationary hyper parameter (Koizumi et al., 2009). Koizumi recently generalized this prediction algorithm to credible interval prediction (Koizumi, 2020). These studies obtained better predictive performance compared to a stationary Poisson distribution with real web traffic data. A nonstationary Bernoulli distribution was also assumed to detect cross site scripting (XSS) attacks in the field of network security (Koizumi et al., 2012). However, that work defined an incorrect class of nonstationary parameters. Furthermore, it did not show any proof that the posterior parameter distribution was analytically obtained under its nonstationary model. Yasuda et al. assumed a similar nonstationary Bernoulli distribution and proposed a Bayes optimal prediction algorithm under a known nonstationary hyper parameter (Yasuda et al., 2001). However, they again did not present any proof that the posterior parameter distribution can be obtained analytically under the nonstationary model.
In this paper, a class of nonstationary Bernoulli distributions is proposed. This class has only one additional hyper parameter to express the nonstationarity of the Bernoulli parameter. Moreover, the prediction problem is considered under the proposed nonstationary Bernoulli distribution. Bayes decision theory (Weiss and Blackwell, 1961) (Berger, 1985) (Bernardo and Smith, 2000) is a powerful theoretical framework with which to define the prediction error. In terms of Bayes decision theory, the predictive estimator that minimizes the average prediction error is called the Bayes optimal prediction. Considering this point, this paper proposes a Bayes optimal prediction algorithm under a certain class of nonstationary Bernoulli distribution, assuming the nonstationary hyper parameter is known. The predictive performance of the proposed algorithm was evaluated with real binary data. When considering real data, the abovementioned hyper parameter should be estimated. For this purpose, this study takes the empirical Bayesian approach, and the objective parameter is estimated by approximate maximum likelihood estimation with numerical calculation.

The remainder of this paper is organized as follows. Section 2 provides the basic definitions of the nonstationary Bernoulli distribution, and some lemmas and corollaries in terms of the hierarchical Bayesian modeling approach. Section 3 begins with the basic definitions in terms of Bayes decision theory, then proves the main theorems for the proposed nonstationary Bernoulli distribution, discusses the hyper (nonstationary) parameter estimation, and proposes the Bayes optimal prediction algorithm. Section 4 gives some numerical examples with real binary data. Section 5 discusses the results. Section 6 concludes this paper.
2 HIERARCHICAL BAYESIAN MODELING WITH NONSTATIONARY BERNOULLI DISTRIBUTION
2.1 Preliminaries
Let $t = 1, 2, \ldots$ be a discrete time index and $X_t = x_t$ be a discrete random variable at $t$. Assume that $x_t \in \{0, 1\}$ and $X_t \sim \mathrm{Bernoulli}(\theta_t)$, where $0 \le \theta_t \le 1$ is a nonstationary parameter. Then the probability function of the nonstationary Bernoulli distribution $p(x_t \mid \theta_t)$ is defined as follows:
Definition 2.1. Nonstationary Bernoulli Distribution
$$p(x_t \mid \theta_t) = \theta_t^{x_t} (1 - \theta_t)^{1 - x_t}, \qquad (1)$$
where $0 \le \theta_t \le 1$. $\Box$
Definition 2.2. Function for $\Theta_t$, $A_t$, and $B_t$
Let $\Theta_t = \theta_t$, $A_t = a_t$, and $B_t = b_t$ be random variables, where $A_t$ and $B_t$ are mutually independent; then a function for $\Theta_t$ is defined as
$$\Theta_t = \frac{A_t}{A_t + B_t}, \qquad (2)$$
where $0 < a_t$, $0 < b_t$. $\Box$
Definition 2.3. Nonstationarity of $A_t$, $B_t$
Let $C_t = c_t$, $D_t = d_t$ be random variables; then the nonstationary functions for $A_t$ and $B_t$ are defined as
$$A_{t+1} = C_t A_t, \qquad (3)$$
$$B_{t+1} = D_t B_t, \qquad (4)$$
where $0 < c_t < 1$, $0 < d_t < 1$ and they are sampled from the following two types of Beta distributions:
$$C_t \sim \mathrm{Beta}[k\alpha_t, (1 - k)\alpha_t], \qquad (5)$$
$$D_t \sim \mathrm{Beta}[k\beta_t, (1 - k)\beta_t], \qquad (6)$$
where $k$ is a real-valued constant and $0 < k \le 1$. $\Box$
Definition 2.4. Conditional Independence for $A_t, C_t$ (or $B_t, D_t$) under $\alpha_t$ (or $\beta_t$)
$$p(a_t, c_t \mid \alpha_t) = p(a_t \mid \alpha_t)\, p(c_t \mid \alpha_t), \qquad (7)$$
$$p(b_t, d_t \mid \beta_t) = p(b_t \mid \beta_t)\, p(d_t \mid \beta_t). \qquad (8)$$
$\Box$
Definition 2.5. Initial Distributions for $A_1$, $B_1$
$$A_1 \sim \mathrm{Gamma}(\alpha_1, 1), \qquad (9)$$
$$B_1 \sim \mathrm{Gamma}(\beta_1, 1), \qquad (10)$$
where $0 < \alpha_1$ and $0 < \beta_1$. $\Box$
Definition 2.6. Initial Distributions for $C_1$, $D_1$
$$C_1 \sim \mathrm{Beta}[k\alpha_1, (1 - k)\alpha_1], \qquad (11)$$
$$D_1 \sim \mathrm{Beta}[k\beta_1, (1 - k)\beta_1]. \qquad (12)$$
$\Box$
Definition 2.7. Gamma Distribution for $q$
The Gamma distribution $\mathrm{Gamma}(r, s)$ is defined as
$$p(q \mid r, s) = \frac{s^r}{\Gamma(r)}\, q^{r-1} \exp(-sq), \qquad (13)$$
where $0 < q$, $0 < r$, $0 < s$, and $\Gamma(r)$ is the gamma function defined in Definition 2.9. $\Box$
Definition 2.8. Beta Distribution for $q$
The Beta distribution $\mathrm{Beta}(r, s)$ is defined as
$$p(q \mid r, s) = \frac{\Gamma(r + s)}{\Gamma(r)\,\Gamma(s)}\, q^{r-1} (1 - q)^{s-1}, \qquad (14)$$
where $0 < q < 1$, $0 < r$, $0 < s$. $\Box$
Definition 2.9. Gamma Function for $q$
$$\Gamma(q) = \int_0^{+\infty} y^{q-1} \exp(-y)\, dy, \qquad (15)$$
where $0 < q$. $\Box$
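To make the generative model concrete, the following is a minimal simulation sketch of Definitions 2.1–2.6, assuming NumPy; the function name and seed are illustrative, and the shape update $\alpha_{t+1} = k\alpha_t$, $\beta_{t+1} = k\beta_t$ anticipates Lemmas 2.1 and 2.2 below.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(t_max, alpha1, beta1, k):
    """Sample x_1, ..., x_{t_max} from the nonstationary Bernoulli model.

    Assumes 0 < k < 1; k = 1 degenerates C_t, D_t to 1 (stationary case).
    """
    a = rng.gamma(shape=alpha1, scale=1.0)   # A_1 ~ Gamma(alpha_1, 1), Eq. (9)
    b = rng.gamma(shape=beta1, scale=1.0)    # B_1 ~ Gamma(beta_1, 1), Eq. (10)
    alpha, beta = alpha1, beta1
    xs, thetas = [], []
    for _ in range(t_max):
        theta = a / (a + b)                  # Theta_t = A_t / (A_t + B_t), Eq. (2)
        xs.append(rng.binomial(1, theta))    # X_t ~ Bernoulli(theta_t), Eq. (1)
        thetas.append(theta)
        c = rng.beta(k * alpha, (1 - k) * alpha)  # C_t, Eq. (5)
        d = rng.beta(k * beta, (1 - k) * beta)    # D_t, Eq. (6)
        a, b = c * a, d * b                  # random walk, Eqs. (3)-(4)
        alpha, beta = k * alpha, k * beta    # Gamma shapes per Lemmas 2.1-2.2
    return np.array(xs), np.array(thetas)
```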
2.2 Lemmas
Lemma 2.1. Transformed Distribution for $A_t$
For any $t \ge 1$, the transformed random variable $A_{t+1} = C_t A_t$ in Definition 2.3 follows the following Gamma distribution:
$$A_{t+1} \sim \mathrm{Gamma}(k\alpha_t, 1). \qquad (16)$$
$\Box$
Proof of Lemma 2.1. See APPENDIX A. $\Box$
Lemma 2.2. Transformed Distribution for $B_t$
For any $t \ge 1$, the transformed random variable $B_{t+1} = D_t B_t$ in Definition 2.3 follows the following Gamma distribution:
$$B_{t+1} \sim \mathrm{Gamma}(k\beta_t, 1). \qquad (17)$$
$\Box$
Proof of Lemma 2.2. The proof is exactly the same as that of Lemma 2.1, replacing $A_{t+1}$ by $B_{t+1}$, $C_t$ by $D_t$, and $\alpha_t$ by $\beta_t$. This completes the proof of Lemma 2.2. $\Box$
Lemma 2.3. Transformed Distribution for $\Theta_t$
For any $t \ge 2$, the transformed random variable $\Theta_t = \frac{A_t}{A_t + B_t}$ in Definition 2.2 follows the following Beta distribution:
$$\Theta_t \sim \mathrm{Beta}(k\alpha_{t-1}, k\beta_{t-1}). \qquad (18)$$
$\Box$
Proof of Lemma 2.3. See APPENDIX B. $\Box$
Corollary 2.1. Transformed Initial Distribution for $\Theta_1$
The transformed random variable $\Theta_1 = \frac{A_1}{A_1 + B_1}$ in Definition 2.2 follows the following Beta distribution:
$$\Theta_1 \sim \mathrm{Beta}(\alpha_1, \beta_1). \qquad (19)$$
$\Box$
Proof of Corollary 2.1. From Definition 2.5,
$$A_1 \sim \mathrm{Gamma}(\alpha_1, 1), \quad B_1 \sim \mathrm{Gamma}(\beta_1, 1).$$
If the transformation argument of Lemma 2.3 is applied to the above $A_1$ and $B_1$, then the following holds:
$$\Theta_1 \sim \mathrm{Beta}(\alpha_1, \beta_1). \qquad (20)$$
This completes the proof of Corollary 2.1. $\Box$
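The Gamma-to-Beta relation underlying Lemma 2.3 and Corollary 2.1 can also be checked empirically; the following is a minimal Monte Carlo sketch, assuming NumPy and SciPy (all names and constants are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# If A ~ Gamma(alpha, 1) and B ~ Gamma(beta, 1) are independent,
# then Theta = A / (A + B) ~ Beta(alpha, beta).
alpha, beta, n = 2.0, 5.0, 100_000
a = rng.gamma(alpha, 1.0, size=n)
b = rng.gamma(beta, 1.0, size=n)
theta = a / (a + b)

# Kolmogorov-Smirnov test against Beta(alpha, beta); the distance
# should be small and the p-value large if the relation holds.
d, p = stats.kstest(theta, stats.beta(alpha, beta).cdf)
print(f"KS distance = {d:.4f}, p-value = {p:.3f}")
```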
3 PREDICTION ALGORITHM BASED ON BAYES DECISION THEORY
3.1 Preliminaries
Definition 3.1. Loss Function
$$L(\hat{x}_{t+1}, x_{t+1}) = \begin{cases} 0 & \text{if } \hat{x}_{t+1} = x_{t+1}; \\ 1 & \text{if } \hat{x}_{t+1} \neq x_{t+1}. \end{cases} \qquad (21)$$
$\Box$
Definition 3.2. Risk Function
$$R(\hat{x}_{t+1}, \theta_{t+1}) = \sum_{x_{t+1}=0}^{1} L(\hat{x}_{t+1}, x_{t+1})\, p(x_{t+1} \mid \theta_{t+1}), \qquad (22)$$
where $p(x_{t+1} \mid \theta_{t+1})$ is from Definition 2.1. $\Box$
Definition 3.3. Bayes Risk Function
$$BR(\hat{x}_{t+1}) = \int_0^1 R(\hat{x}_{t+1}, \theta_{t+1})\, p(\theta_{t+1} \mid \boldsymbol{x}^t)\, d\theta_{t+1}, \qquad (23)$$
where $\boldsymbol{x}^t = (x_1, x_2, \ldots, x_t)$ denotes the observed data sequence. $\Box$
Definition 3.4. Bayes Optimal Prediction
The Bayes optimal prediction $\hat{x}^*_{t+1}$ is obtained by
$$\hat{x}^*_{t+1} = \arg\min_{\hat{x}_{t+1}} BR(\hat{x}_{t+1}). \qquad (24)$$
$\Box$
3.2 Main Theorems
Theorem 3.1. Posterior Distribution for $\theta_t$
Let the prior distribution of the parameter $\theta_1$ of the nonstationary Bernoulli distribution in Definition 2.1 be $\Theta_1 \sim \mathrm{Beta}(\alpha_1, \beta_1)$. For any $t \ge 2$, let $\boldsymbol{x}^{t-1} = (x_1, x_2, \ldots, x_{t-1})$ be the observed data sequence. Then, the posterior distribution of $\Theta_t \mid \boldsymbol{x}^{t-1}$ can be obtained in the following closed form:
$$\Theta_t \mid \boldsymbol{x}^{t-1} \sim \mathrm{Beta}(\alpha_t, \beta_t), \qquad (25)$$
where the parameters $\alpha_t, \beta_t$ are given as
$$\alpha_t = k^{t-1}\alpha_1 + \sum_{i=1}^{t-1} k^{t-i} x_i; \qquad \beta_t = k^{t-1}\beta_1 + \sum_{i=1}^{t-1} k^{t-i} (1 - x_i). \qquad (26)$$
$\Box$
Proof of Theorem 3.1. For any $t \ge 2$, the posterior parameter distribution $p(\theta_t \mid \boldsymbol{x}^{t-1})$ remains in the closed form $\Theta_t \sim \mathrm{Beta}(\alpha_t, \beta_t)$ if $X_t \sim \mathrm{Bernoulli}(\theta_t)$ in Definition 2.1 and $\Theta_1 \sim \mathrm{Beta}(\alpha_1, \beta_1)$ in Corollary 2.1, according to the nature of conjugate families (Bernardo and Smith, 2000, 5.2, p. 265) (Berger, 1985, 4.2.2, p. 130). Furthermore, assuming that $x_{t-1}$ is the observed data,
$$\alpha_t = \alpha_{t-1} + x_{t-1}; \qquad \beta_t = \beta_{t-1} + 1 - x_{t-1} \qquad (27)$$
holds by conjugate analysis (Bernardo and Smith, 2000, Example 5.4, p. 271). This proves Eq. (25).
In this paper, a nonstationary parameter model is assumed. Therefore, if both Lemma 2.1 and Lemma 2.2 are recursively applied to Eq. (27), then
$$\alpha_t = k(\alpha_{t-1} + x_{t-1}); \qquad \beta_t = k(\beta_{t-1} + 1 - x_{t-1}) \qquad (28)$$
holds.
Finally, Eq. (26) is obtained if Eq. (28) is recursively applied until the initial conditions $\alpha_1, \beta_1$ from both Definition 2.5 and Corollary 2.1 appear.
This completes the proof of Theorem 3.1. $\Box$
Remark 3.1. In the second terms on the right-hand sides of Eq. (26), each observed data point $x_i$ is exponentially weighted by $k^{t-i}$, where $i = 1, 2, \ldots, t-1$. This structure is called the Exponentially Weighted Moving Average (EWMA) (Harvey, 1989, 6.6, p. 350).
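As an illustration, the one-step recursion of Eq. (28) can be written as a minimal sketch (assuming Python; the function name is illustrative). With $k = 1$ it reduces to the standard stationary conjugate update of Eq. (27), and unrolling it from $(\alpha_1, \beta_1)$ reproduces the closed form of Eq. (26).

```python
def update_posterior(alpha, beta, x, k):
    """One-step posterior update of Eq. (28): discount the Beta counts
    (alpha_t, beta_t) by k and absorb the new observation x in {0, 1}."""
    return k * (alpha + x), k * (beta + 1 - x)
```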
Theorem 3.2. Predictive Distribution
$$p(x_{t+1} \mid \boldsymbol{x}^t) = \begin{cases} \dfrac{\beta_{t+1}}{\alpha_{t+1} + \beta_{t+1}} & \text{if } x_{t+1} = 0; \\[2mm] \dfrac{\alpha_{t+1}}{\alpha_{t+1} + \beta_{t+1}} & \text{if } x_{t+1} = 1, \end{cases} \qquad (29)$$
where $\alpha_{t+1}$ and $\beta_{t+1}$ are as in Eq. (26). $\Box$
Proof of Theorem 3.2. See APPENDIX C. $\Box$
Theorem 3.3. Bayes Optimal Prediction
$$\hat{x}^*_{t+1} = \begin{cases} 0 & \text{if } \alpha_{t+1} < \beta_{t+1}; \\ 1 & \text{if } \alpha_{t+1} > \beta_{t+1}. \end{cases} \qquad (30)$$
$\Box$
Proof of Theorem 3.3. In terms of Bayes decision theory (Weiss and Blackwell, 1961) (Berger, 1985) (Bernardo and Smith, 2000), the Bayes optimal prediction $\hat{x}^*_{t+1}$ maximizes the predictive distribution $p(x_{t+1} \mid \boldsymbol{x}^t)$ if the 0-1 loss function in Definition 3.1 is assumed. Since $\hat{x}_{t+1} \in \{0, 1\}$ and Theorem 3.2 holds, this maximization can be done by comparing just two cases. Therefore,
$$\hat{x}^*_{t+1} = \arg\max_{x_{t+1}} p(x_{t+1} \mid \boldsymbol{x}^t) = \begin{cases} 0 & \text{if } \alpha_{t+1} < \beta_{t+1}; \\ 1 & \text{if } \alpha_{t+1} > \beta_{t+1}. \end{cases}$$
This completes the proof of Theorem 3.3. $\Box$
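Theorems 3.2 and 3.3 together amount to a few lines of code; a minimal sketch follows (illustrative name; note that Theorem 3.3 leaves the tie $\alpha_{t+1} = \beta_{t+1}$ unspecified, and this sketch breaks it toward 0):

```python
def predict(alpha, beta):
    """Predictive probability of x_{t+1} = 1 (Eq. (29)) and the Bayes
    optimal prediction under 0-1 loss (Eq. (30))."""
    p_one = alpha / (alpha + beta)     # p(x_{t+1} = 1 | x^t)
    x_hat = 1 if alpha > beta else 0   # tie alpha == beta resolved as 0
    return p_one, x_hat
```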
3.3 Hyper Parameter Estimation with Empirical Bayes Method
Since the hyper parameter $0 < k \le 1$ in Definition 2.3 has so far been assumed to be known, it should be estimated in practice. One estimation method is maximum likelihood estimation with numerical approximation, in terms of the empirical Bayes approach. That is,
$$\hat{k} = \arg\max_k L(k), \qquad (31)$$
where $0 < k \le 1$ and
$$L(k) = p(x_1 \mid \theta_1, k)\, p(\theta_1) \prod_{i=2}^{t} p(x_i \mid \boldsymbol{x}^{i-1}, k) = \prod_{i=1}^{t} \left(\frac{\beta_i}{\alpha_i + \beta_i}\right)^{1 - x_i} \left(\frac{\alpha_i}{\alpha_i + \beta_i}\right)^{x_i}. \qquad (32)$$
Note that Eq. (32) is obtained by applying Theorem 3.2.
Therefore, its log-likelihood function $\log L(k)$ is
$$\log L(k) = \sum_{i=1}^{t} \Big\{ (1 - x_i)\left[\log \beta_i - \log(\alpha_i + \beta_i)\right] + x_i\left[\log \alpha_i - \log(\alpha_i + \beta_i)\right] \Big\}. \qquad (33)$$
Eqs. (31) and (33) cannot be solved analytically, and therefore an approximate numerical method should be applied.
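A minimal sketch of this numerical maximization, assuming SciPy's bounded scalar minimizer is acceptable as the approximate method (the paper does not specify one); function names and the lower bound are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_likelihood(k, x, alpha1, beta1):
    """log L(k) of Eq. (33), accumulated along the recursion of Eq. (28)."""
    alpha, beta = alpha1, beta1
    ll = 0.0
    for xi in x:
        ll += xi * np.log(alpha / (alpha + beta)) \
            + (1 - xi) * np.log(beta / (alpha + beta))
        alpha, beta = k * (alpha + xi), k * (beta + 1 - xi)  # Eq. (28)
    return ll

def estimate_k(x, alpha1=1.0, beta1=1.0):
    """Approximate MLE of k over (0, 1], Eq. (31), by bounded search."""
    res = minimize_scalar(lambda k: -log_likelihood(k, x, alpha1, beta1),
                          bounds=(1e-6, 1.0), method="bounded")
    return res.x
```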
3.4 Proposed Bayes Optimal Prediction Algorithm
Based on the main theorems in Subsection 3.2, the following Bayes optimal prediction algorithm is proposed (an end-to-end sketch in code follows the listing).
Algorithm 3.1. Proposed Bayes Optimal Algorithm
1. Estimate the hyper parameter $k$ from the training data by approximate maximum likelihood estimation with Eqs. (31) and (33).
2. Set $t = 1$ and define $\alpha_1 > 0$, $\beta_1 > 0$ in Definition 2.5 in order to set the initial prior parameter distribution $\Theta_1 \sim \mathrm{Beta}(\alpha_1, \beta_1)$ in Corollary 2.1.
3. With the test data sequence $\boldsymbol{x}^t$, update the posterior of the nonstationary parameter distribution $p(\theta_{t+1} \mid \alpha_{t+1}, \beta_{t+1}, \boldsymbol{x}^t)$ with Eq. (26) in Theorem 3.1.
4. Calculate the predictive distribution $p(x_{t+1} \mid \boldsymbol{x}^t)$ in Theorem 3.2.
5. Obtain the Bayes optimal prediction $\hat{x}^*_{t+1}$ in Theorem 3.3.
6. If $t < t_{\max}$, then update $t \leftarrow t + 1$ and go back to step 3.
7. If $t = t_{\max}$, then terminate the algorithm.
$\Box$
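Putting the pieces together, a minimal end-to-end sketch of Algorithm 3.1, reusing the illustrative functions estimate_k, predict, and update_posterior from the earlier sketches:

```python
def run_algorithm(train, test, alpha1=1.0, beta1=1.0):
    """Run Algorithm 3.1 and return the cumulative 0-1 loss on test data."""
    k = estimate_k(train, alpha1, beta1)      # step 1, Eqs. (31) and (33)
    alpha, beta = alpha1, beta1               # step 2, prior Beta(alpha_1, beta_1)
    total_loss = 0
    for x in test:                            # steps 3-7
        _, x_hat = predict(alpha, beta)       # Theorems 3.2 and 3.3
        total_loss += int(x_hat != x)         # 0-1 loss, Eq. (21)
        alpha, beta = update_posterior(alpha, beta, x, k)  # Eq. (28)
    return total_loss
```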
4 NUMERICAL EXAMPLES
This section shows numerical examples to evaluate the performance of Algorithm 3.1. Subsection 4.1 explains both the training and test data specifications. The training data are used to estimate the hyper parameters: $k$ in Definition 2.3 and $\alpha_1, \beta_1$ in Definition 2.5, where the latter are used for the prior of the parameter $\Theta_1$ to predict the test data. The test data are used to evaluate the predictive performance of the proposed algorithm.
4.1 Binary Data Specifications
Tables 1 and 2 show the training and test data specifications, respectively. These binary data were obtained from the daily rainfall data in Tokyo from January 1, 2018 to December 31, 2019 (Japan Meteorological Agency, 2020). Note that the binarization threshold is defined by the following rule: for the $i$th daily rainfall, $x_i = 1$ if its amount is greater than 0.5 mm, and otherwise $x_i = 0$ (see the snippet after Table 2).
Table 1: Training Data Specifications.
Items Values
From: January 1, 2018
To: December 31, 2018
Total Days: 365
Table 2: Test Data Specifications.
Items Values
From: January 1, 2019
To: December 31, 2019
Total Days: 365
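A minimal sketch of this binarization rule, assuming the daily rainfall amounts in millimetres are already loaded into a list (the values shown are illustrative, not the actual data):

```python
# Illustrative daily rainfall amounts in mm.
rainfall_mm = [0.0, 3.5, 0.5, 12.0, 0.2]

# x_i = 1 if the ith daily rainfall exceeds 0.5 mm, otherwise x_i = 0.
x = [1 if r > 0.5 else 0 for r in rainfall_mm]  # -> [0, 1, 0, 1, 0]
```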
4.2 Evaluations for Bayes Optimal Predictions
This subsection evaluates the Bayes optimal predictions from both the proposed nonstationary and the conventional stationary Bernoulli distribution models in two settings: first, the predictive performance of the two models with non-informative priors; second, that with informative priors.
4.2.1 Prediction Results with Non-informative Priors
Before evaluating the predictive performance, the hyper parameter $\hat{k}$ is estimated using Eq. (31) from the training data. This is the approximate maximum likelihood estimation with numerical calculation. The result is shown in Table 3.
Table 3: Estimated Hyper Parameter from Training Data.
Item Value
$\hat{k}$ 0.971
In this evaluation, the hyper parameters $\alpha_1$ and $\beta_1$ of the prior distribution $p(\theta_1 \mid \alpha_1, \beta_1)$ are assumed to be non-informative; that is, the initial prior should be a uniform distribution. The defined values of the hyper parameters are shown in Table 4.
Table 4: Defined Hyper Parameters for Non-informative Priors of Test Data.
$\alpha_1$ $\beta_1$
1.000 1.000
Using $\hat{k}$, $\alpha_1$, and $\beta_1$ from Tables 3 and 4, the predictive errors $\sum_{i=1}^{365} L(\hat{x}_i, x_i)$ for the proposed and stationary Bernoulli models are calculated with the test data. The results are shown in Table 5.
Table 5: Predictive Errors with Test Data for Proposed and Stationary Models with Non-informative Priors.
Items Proposed Stationary
$\sum_{i=1}^{365} L(\hat{x}_i, x_i)$ 173 187
4.2.2 Prediction Results with Informative Priors
In this evaluation, the hyper parameters $\alpha_1$ and $\beta_1$ of the prior distribution $p(\theta_1 \mid \alpha_1, \beta_1)$ are assumed to be informative. In this case, the empirical Bayesian approach is adopted: both $\alpha_1$ and $\beta_1$ are obtained from the posterior distribution $p(\theta_t \mid \boldsymbol{x}^t, \alpha_t, \beta_t)$ of the training data, and these are used as the initial prior $p(\theta_1 \mid \alpha_1, \beta_1)$ to predict the test data. These values are listed in Table 6.
Table 6: Defined Hyper Parameters for Informative Priors of Test Data.
$\alpha_1$ $\beta_1$
16.429 34.612
Using $\hat{k}$, $\alpha_1$, and $\beta_1$ from Tables 3 and 6, the predictive errors are calculated for both models with the test data. The results are shown in Table 7.
Table 7: Predictive Errors with Test Data for Proposed and Stationary Models with Informative Priors.
Items Proposed Stationary
$\sum_{i=1}^{365} L(\hat{x}_i, x_i)$ 178 179
5 DISCUSSIONS
Table 5 shows that the total loss of the proposed nonstationary Bernoulli model is smaller than that of the stationary model, with accuracies of 52.6% and 48.8%, respectively (e.g., $1 - 173/365 \approx 0.526$ for the proposed model). Moreover, the time series of the posterior probability $p(\theta_t = 1 \mid \boldsymbol{x}^t)$, where each value is the daily probability of rainfall, is calculated and plotted in Figure 1. In Figure 1, the vertical axis is the posterior probability, the horizontal axis shows the indices of days, the red line is the time series of the posterior probabilities from the proposed model, and the blue line is that from the stationary model. From Figure 1, it can be observed that the posterior from the proposed model drifts more drastically than that of the stationary model. Thus, the extra hyper parameter $k$ in the proposed model must work relatively well with a non-informative prior.
However, if the AIC (Akaike Information Criterion) (Akaike, 1973) values for both models are calculated with the test data, the values in Table 8 are obtained. From the perspective of model selection theory, the smaller the AIC value, the more appropriate the model is under the observed data. Table 8 indicates that the stationary model is more appropriate than the proposed model with the test data. However, as mentioned above, the proposed model is superior to the stationary model in terms of predictive performance. Thus, the result of the first evaluation with a non-informative prior cannot be explained by AIC with the specific test data in this paper.
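For reference, the standard statement of the criterion is as follows, where $\hat{L}$ is the maximized likelihood of a model and $m$ is the number of its free parameters; the per-model parameter counts behind Tables 8 and 9 are not specified here, though the proposed model carries the extra hyper parameter $k$:
$$\mathrm{AIC} = -2 \log \hat{L} + 2m.$$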
Figure 1: Posterior Probability Plot of $p(\theta_t = 1 \mid \boldsymbol{x}^t)$ with Non-informative Priors. (Red: Proposed; blue: Stationary.)
Table 8: AIC values for Proposed and Stationary Models with Non-informative Priors.
Items Proposed Stationary
AIC -500.476 -505.316
In contrast, according to Table 7, the difference in the predictive performance of the two models becomes smaller than in the first evaluation. In fact, the result is almost a draw, with accuracies of 51.2% and 51.0%, respectively. Moreover, Figure 2 shows the time series of the posterior rainfall probability for both models; note that an informative prior is assumed in this evaluation. In Figure 2, the first 50 points of the time series of the proposed model (red line) are more stable than those of the proposed model in Figure 1. This difference can be interpreted as the effect of the informative prior. However, the predictive performance of the proposed model becomes worse. In this case, it can be considered that the setting of the informative prior weakens the effect of the estimated nonstationary hyper parameter $\hat{k}$. Also in Figure 2, the entire blue plot of the stationary model becomes more stable than that of the stationary model in Figure 1. In this case, the posterior of the stationary model almost converges, and its predictive performance is improved effectively, as shown by comparing the results in Tables 5 and 7.

Table 9 shows the AIC values for both models. From the perspective of AIC, the value of the proposed model with the informative prior becomes slightly smaller than that of the proposed model with the non-informative prior, while for the stationary model this difference is larger. Thus, the theory of AIC explains the predictive performance of the stationary model well. However, the same does not hold true for the proposed model.
Figure 2: Posterior Probability Plot of $p(\theta_t = 1 \mid \boldsymbol{x}^t)$ with Informative Priors. (Red: Proposed; blue: Stationary.)
Table 9: AIC values for Proposed and Stationary Models with Informative Priors.
Items Proposed Stationary
AIC -505.776 -522.896
6 CONCLUSIONS
This paper has proposed a class of nonstationary Bernoulli distributions and a Bayes optimal prediction algorithm under a known nonstationary hyper parameter. The proposed class has only one extra hyper parameter compared to the stationary Bernoulli distribution, and it is proved that the posterior distribution of the Bernoulli parameter can be obtained analytically. Furthermore, the predictive performance of the proposed algorithm is evaluated using real binary data. As a result, a certain advantage in predictive performance is found by comparing the results to those of the stationary Bernoulli model; however, this point cannot be explained in terms of model selection theory.

An important factor in the abovementioned advantage is the additional nonstationary hyper parameter in the proposed model. Because the empirical Bayesian approach is used in this study and the additional hyper parameter is estimated by approximate maximum likelihood estimation, the objective likelihood function should be analyzed in detail. This point is left for future work.
REFERENCES
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, pages 267–281.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
Bernardo, J. M. and Smith, A. F. (2000). Bayesian Theory. John Wiley & Sons, Chichester.
Cox, D. R. (1970). The Analysis of Binary Data. Chapman and Hall, London.
Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, Cambridge.
Japan Meteorological Agency (2020). ClimatView (in Japanese). https://www.data.jma.go.jp/gmd/cpd/monitor/dailyview/. Browsing date: Nov. 27, 2020.
Koizumi, D. (2020). Credible interval prediction of a nonstationary Poisson distribution based on Bayes decision theory. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence (ICAART) - Volume 2, pages 995–1002, Valletta, Malta. INSTICC, SciTePress.
Koizumi, D., Matsuda, T., and Sonoda, M. (2012). On the automatic detection algorithm of cross site scripting (XSS) with the non-stationary Bernoulli distribution. In The 5th International Conference on Communications, Computers and Applications (MIC-CCA2012), pages 131–135, Istanbul, Turkey. IEEE.
Koizumi, D., Matsushima, T., and Hirasawa, S. (2009). Bayesian forecasting of WWW traffic on the time varying Poisson model. In Proceedings of The 2009 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'09), volume II, pages 683–689, Las Vegas, NV, USA. CSREA Press.
Press, S. J. (2003). Subjective and Objective Bayesian Statistics: Principles, Models, and Applications. John Wiley & Sons, Hoboken.
Smith, J. Q. (1979). A generalization of the Bayesian steady forecasting model. Journal of the Royal Statistical Society, Series B, 41:375–387.
Weiss, L. and Blackwell, D. (1961). Statistical Decision Theory. McGraw-Hill, New York.
Yasuda, G., Nomura, R., and Matsushima, T. (2001). A study of coding for sources with nonstationary parameter (in Japanese). Technical Report of IEICE (IT2001-15), 101(177):25–30.
APPENDIX
A: Proof of Lemma 2.1
Suppose $t = 1$; then $A_1 = a_1$ and $C_1 = c_1$ are distributed as
$$A_1 \sim \mathrm{Gamma}(\alpha_1, 1), \qquad (34)$$
$$C_1 \sim \mathrm{Beta}[k\alpha_1, (1 - k)\alpha_1], \qquad (35)$$
according to Definition 2.5 and Definition 2.6, respectively.
Since $A_2 = C_1 A_1$ from Definition 2.3, and $A_t$ and $C_t$ are conditionally independent from Definition 2.4, the joint distribution $p(c_1, a_1)$ becomes
$$\begin{aligned}
p(c_1, a_1) &= p(c_1 \mid k\alpha_1, (1 - k)\alpha_1)\, p(a_1 \mid \alpha_1, 1) \\
&= \frac{\Gamma(\alpha_1)}{\Gamma(k\alpha_1)\,\Gamma[(1 - k)\alpha_1]}\, c_1^{k\alpha_1 - 1} (1 - c_1)^{(1 - k)\alpha_1 - 1} \cdot \frac{a_1^{\alpha_1 - 1}}{\Gamma(\alpha_1)} \exp(-a_1) \\
&= \frac{c_1^{k\alpha_1 - 1} (1 - c_1)^{(1 - k)\alpha_1 - 1}}{\Gamma(k\alpha_1)\,\Gamma[(1 - k)\alpha_1]}\, a_1^{\alpha_1 - 1} \exp(-a_1).
\end{aligned}$$
Now denote the two transformations as
$$v = a_1 c_1; \qquad w = a_1 (1 - c_1), \qquad (36)$$
where $0 < v$, $0 < w$. Then the inverse transformation of Eq. (36) becomes
$$a_1 = v + w; \qquad c_1 = \frac{v}{v + w}. \qquad (37)$$
The Jacobian $J_1$ of Eq. (37) is
$$J_1 = \begin{vmatrix} \dfrac{\partial a_1}{\partial v} & \dfrac{\partial a_1}{\partial w} \\[2mm] \dfrac{\partial c_1}{\partial v} & \dfrac{\partial c_1}{\partial w} \end{vmatrix} = \begin{vmatrix} 1 & 1 \\[1mm] \dfrac{w}{(v + w)^2} & \dfrac{-v}{(v + w)^2} \end{vmatrix} = -\frac{1}{v + w} = -\frac{1}{a_1} \neq 0.$$
Then the transformed joint distribution $p(v, w)$ is obtained as the product of $p(c_1, a_1)$ and the absolute value of $J_1$:
$$\begin{aligned}
p(v, w) &= p(c_1, a_1) \left| -\frac{1}{a_1} \right| \\
&= \frac{\left(\frac{v}{v + w}\right)^{k\alpha_1 - 1} \left(\frac{w}{v + w}\right)^{(1 - k)\alpha_1 - 1}}{\Gamma(k\alpha_1)\,\Gamma[(1 - k)\alpha_1]}\, (v + w)^{\alpha_1 - 1} \exp[-(v + w)] \cdot \frac{1}{v + w} \\
&= \frac{v^{k\alpha_1 - 1}\, w^{(1 - k)\alpha_1 - 1}}{\Gamma(k\alpha_1)\,\Gamma[(1 - k)\alpha_1]} \exp[-(v + w)]. \qquad (38)
\end{aligned}$$
Then $p(v)$ is obtained by marginalizing Eq. (38) with respect to $w$:
$$\begin{aligned}
p(v) &= \int_0^{\infty} p(v, w)\, dw \\
&= \frac{v^{k\alpha_1 - 1} \exp(-v)}{\Gamma(k\alpha_1)\,\Gamma[(1 - k)\alpha_1]} \int_0^{\infty} w^{(1 - k)\alpha_1 - 1} \exp(-w)\, dw \\
&= \frac{v^{k\alpha_1 - 1} \exp(-v)}{\Gamma(k\alpha_1)\,\Gamma[(1 - k)\alpha_1]} \cdot \Gamma[(1 - k)\alpha_1] \\
&= \frac{1}{\Gamma(k\alpha_1)}\, v^{k\alpha_1 - 1} \exp(-v). \qquad (39)
\end{aligned}$$
Eq. (39) exactly corresponds to $\mathrm{Gamma}(k\alpha_1, 1)$ according to Definition 2.7. Recalling $v = a_1 c_1$ from Eq. (36) and $A_2 = C_1 A_1$ from Definition 2.3,
$$A_2 \sim \mathrm{Gamma}(k\alpha_1, 1).$$
Thus if $t = 1$, then $A_{t+1} \sim \mathrm{Gamma}(k\alpha_t, 1)$ holds.
For $t \ge 2$, by substituting $\alpha_t = k\alpha_{t-1}$, $A_t = a_t$ and $C_t = c_t$ are distributed as
$$A_t \sim \mathrm{Gamma}(\alpha_t, 1), \qquad (40)$$
$$C_t \sim \mathrm{Beta}[k\alpha_t, (1 - k)\alpha_t]. \qquad (41)$$
Eqs. (40) and (41) correspond to Eqs. (34) and (35), respectively. Therefore the same proof can be applied for the case of $t \ge 2$, and it can be proved that
$$\forall t, \quad A_{t+1} \sim \mathrm{Gamma}(k\alpha_t, 1).$$
This completes the proof of Lemma 2.1. $\Box$
B: Proof of Lemma 2.3
From Lemmas 2.1 and 2.2,
$$\forall t \ge 2, \quad A_t \sim \mathrm{Gamma}(k\alpha_{t-1}, 1), \quad B_t \sim \mathrm{Gamma}(k\beta_{t-1}, 1).$$
According to Definition 2.2, the two random variables $A_t$ and $B_t$ are independent. Therefore, the joint distribution $p(a_t, b_t)$ becomes
$$\begin{aligned}
p(a_t, b_t) &= p(a_t \mid k\alpha_{t-1}, 1)\, p(b_t \mid k\beta_{t-1}, 1) \\
&= \left[\frac{a_t^{k\alpha_{t-1} - 1} \exp(-a_t)}{\Gamma(k\alpha_{t-1})}\right] \cdot \left[\frac{b_t^{k\beta_{t-1} - 1} \exp(-b_t)}{\Gamma(k\beta_{t-1})}\right] \\
&= \frac{a_t^{k\alpha_{t-1} - 1}\, b_t^{k\beta_{t-1} - 1}}{\Gamma(k\alpha_{t-1})\,\Gamma(k\beta_{t-1})} \exp[-(a_t + b_t)].
\end{aligned}$$
Denote the two transformations as
$$\lambda = a_t + b_t; \qquad \mu = \frac{a_t}{a_t + b_t}, \qquad (42)$$
where $0 < \lambda$, $0 < \mu < 1$. The inverse transformation of Eq. (42) becomes
$$a_t = \lambda\mu; \qquad b_t = \lambda(1 - \mu). \qquad (43)$$
Then the Jacobian $J_2$ of Eq. (43) is
$$J_2 = \begin{vmatrix} \dfrac{\partial a_t}{\partial \lambda} & \dfrac{\partial a_t}{\partial \mu} \\[2mm] \dfrac{\partial b_t}{\partial \lambda} & \dfrac{\partial b_t}{\partial \mu} \end{vmatrix} = \begin{vmatrix} \mu & \lambda \\ 1 - \mu & -\lambda \end{vmatrix} = -\lambda = -(a_t + b_t).$$
Then the transformed joint distribution $p(\lambda, \mu)$ is obtained as the product of $p(a_t, b_t)$ and the absolute value of $J_2$ as follows:
$$\begin{aligned}
p(\lambda, \mu) &= p(a_t, b_t) \cdot |{-(a_t + b_t)}| \\
&= \frac{(\lambda\mu)^{k\alpha_{t-1} - 1}\, [\lambda(1 - \mu)]^{k\beta_{t-1} - 1}}{\Gamma(k\alpha_{t-1})\,\Gamma(k\beta_{t-1})} \exp(-\lambda) \cdot \lambda \\
&= \frac{\mu^{k\alpha_{t-1} - 1} (1 - \mu)^{k\beta_{t-1} - 1}}{\Gamma(k\alpha_{t-1})\,\Gamma(k\beta_{t-1})}\, \lambda^{k\alpha_{t-1} + k\beta_{t-1} - 1} \exp(-\lambda). \qquad (44)
\end{aligned}$$
Then $p(\mu)$ is obtained by marginalizing Eq. (44) with respect to $\lambda$:
$$\begin{aligned}
p(\mu) &= \int_0^{\infty} p(\lambda, \mu)\, d\lambda \\
&= \frac{\mu^{k\alpha_{t-1} - 1} (1 - \mu)^{k\beta_{t-1} - 1}}{\Gamma(k\alpha_{t-1})\,\Gamma(k\beta_{t-1})} \int_0^{\infty} \lambda^{k\alpha_{t-1} + k\beta_{t-1} - 1} \exp(-\lambda)\, d\lambda \\
&= \frac{\mu^{k\alpha_{t-1} - 1} (1 - \mu)^{k\beta_{t-1} - 1}}{\Gamma(k\alpha_{t-1})\,\Gamma(k\beta_{t-1})} \cdot \Gamma(k\alpha_{t-1} + k\beta_{t-1}) \\
&= \frac{\Gamma(k\alpha_{t-1} + k\beta_{t-1})}{\Gamma(k\alpha_{t-1})\,\Gamma(k\beta_{t-1})}\, \mu^{k\alpha_{t-1} - 1} (1 - \mu)^{k\beta_{t-1} - 1}. \qquad (45)
\end{aligned}$$
Eq. (45) exactly corresponds to $\mathrm{Beta}(k\alpha_{t-1}, k\beta_{t-1})$ according to Definition 2.8.
Recalling $\mu = \frac{a_t}{a_t + b_t}$ from Eq. (42) and $\Theta_t = \frac{A_t}{A_t + B_t}$ from Definition 2.2,
$$\forall t \ge 2, \quad \Theta_t \sim \mathrm{Beta}(k\alpha_{t-1}, k\beta_{t-1})$$
holds. This completes the proof of Lemma 2.3. $\Box$
C: Proof of Theorem 3.2
Since the predictive distribution is a Binomial-Beta distribution (Bernardo and Smith, 2000, p. 117), $p(x_{t+1} \mid \boldsymbol{x}^t)$ becomes
$$\begin{aligned}
p(x_{t+1} \mid \boldsymbol{x}^t) &= \int_0^1 p(x_{t+1} \mid \theta_{t+1})\, p(\theta_{t+1} \mid \boldsymbol{x}^t)\, d\theta_{t+1} \\
&= c \cdot \Gamma(\alpha_{t+1} + x_{t+1})\, \Gamma(\beta_{t+1} + 1 - x_{t+1}),
\end{aligned}$$
where $c = \dfrac{\Gamma(\alpha_{t+1} + \beta_{t+1})}{\Gamma(\alpha_{t+1})\,\Gamma(\beta_{t+1})\,\Gamma(\alpha_{t+1} + \beta_{t+1} + 1)}$.
If $x_{t+1} = 0$, then
$$\begin{aligned}
p(x_{t+1} \mid \boldsymbol{x}^t) &= \frac{\Gamma(\alpha_{t+1} + \beta_{t+1})}{\Gamma(\alpha_{t+1})\,\Gamma(\beta_{t+1})\,\Gamma(\alpha_{t+1} + \beta_{t+1} + 1)} \cdot \Gamma(\alpha_{t+1})\,\Gamma(\beta_{t+1} + 1) \\
&= \frac{\Gamma(\alpha_{t+1} + \beta_{t+1})}{\Gamma(\alpha_{t+1})\,\Gamma(\beta_{t+1})\,(\alpha_{t+1} + \beta_{t+1})\,\Gamma(\alpha_{t+1} + \beta_{t+1})} \cdot \Gamma(\alpha_{t+1})\,\beta_{t+1}\,\Gamma(\beta_{t+1}) \qquad (46) \\
&= \frac{\beta_{t+1}}{\alpha_{t+1} + \beta_{t+1}}.
\end{aligned}$$
Note that Eq. (46) is obtained by applying the following property of the Gamma function: $\Gamma(q + 1) = q\,\Gamma(q)$.
If $x_{t+1} = 1$, then
$$\begin{aligned}
p(x_{t+1} \mid \boldsymbol{x}^t) &= \frac{\Gamma(\alpha_{t+1} + \beta_{t+1})}{\Gamma(\alpha_{t+1})\,\Gamma(\beta_{t+1})\,\Gamma(\alpha_{t+1} + \beta_{t+1} + 1)} \cdot \Gamma(\alpha_{t+1} + 1)\,\Gamma(\beta_{t+1}) \\
&= \frac{\Gamma(\alpha_{t+1} + \beta_{t+1})}{\Gamma(\alpha_{t+1})\,\Gamma(\beta_{t+1})\,(\alpha_{t+1} + \beta_{t+1})\,\Gamma(\alpha_{t+1} + \beta_{t+1})} \cdot \alpha_{t+1}\,\Gamma(\alpha_{t+1})\,\Gamma(\beta_{t+1}) \\
&= \frac{\alpha_{t+1}}{\alpha_{t+1} + \beta_{t+1}}.
\end{aligned}$$
This completes the proof of Theorem 3.2. $\Box$