Product Embedding for Large-Scale Disaggregated Sales Data
Yinxing Li
a
and Nobuhiko Terui
b
Graduate School of Economics and Management, Tohoku University, Japan
Keywords: LDA2Vec, Item2Vec, Demographics, Hierarchical Model, Customer Heterogeneity, Topic Model.
Abstract: This paper recommends a system that incorporates the marketing environment and customer heterogeneity.
We employ and extend Item2Vec and Item2Vec approaches to high-dimensional store data. Our study not
only aims to propose a model with better forecasting precision but also to reveal how customer demographics
affect customer behaviour. Our empirical results show that marketing environment and customer
heterogeneity increase forecasting precision and those demographics have a significant influence on customer
behaviour through the hierarchical model.
1 INTRODUCTION
Marketing data are expanding in several modes
nowadays, as the number of variables explaining
customer behavior has greatly increased, and
automated data collection in the store has also led to
the recording of customer choice decisions from large
sample sizes. Thus, high-dimensional models have
recently gained considerable importance in several
areas, including marketing. Despite the rapid
expansion of available data, Naik et al. (2008)
mentioned that many algorithms do not scale linearly
but scale exponentially as the dimension of variable
expends. This highlights the urgent need for faster
numerical methods and efficient statistical estimators.
While some previous researches focused on the
dimension reduction approaches for the products (e.g.,
Salakhutdinov and Mnih, 2008, Koren et al., 2009,
Paquet and Koenigstein, 2013), learning the product
similarities is the final goal rather than the forecasting.
After Word2Vec was proposed (Mikolov et al.,
2013) regarding natural language processing, which
is designed to deal with high-dimensional sparse
vocabulary data, many studies applied and extended
the model to other fields, such as item
recommendation, including Prod2Vec (Grbovic et al.
2015), Item2Vec (Barkan and Koenigstein, 2016),
and Meta-Prod2Vec (Vasile et al., 2016). These
approaches indicate that the Word2Vec framework
outperforms existing econometric models in sales
a
https://orcid.org/0000-0001-9335-9802
b
https://orcid.org/0000-0003-4868-0140
prediction. Besides, Pennington et al. (2014)
proposed a model which factorize a large-scale word
matrix to improve the performance of paring the
similar words. This approach is further employed for
parsing tasks by Levy and Goldberg (2014).
However, the main limitation of the existing
approaches is the lack of interpretability of the model.
Similar to the most nonlinear machine learning
approach, the Word2Vec framework cannot evaluate
the effect of variables, which may limit its
implications in the marketing field, such as the
effective personalization and targeting (
Essex, 2009
).
Although extension models, such as Prod2Vec,
involve various marketing variables such as price and
customer demographic data, the role of the variables
in forecasting is still not discussed.
In light of the limitations mentioned above, we
propose a Word2Vec based framework that
incorporates marketing variables. The main research
purposes are to (i) improve the precision of
forecasting by involving the hierarchical structure of
the Word2Vec framework with marketing mix
variables, and (ii) investigate and interpret the role of
the marketing mix variables.
In order to fulfil these aims, we analyze the large-
scale sales data of a retail store for our empirical
application. In addition to daily sales data for each
unique customer, our data also include daily price
information, several promotional information, and
demographic data for each customer. Our approach is
Li, Y. and Terui, N.
Product Embedding for Large-Scale Disaggregated Sales Data.
DOI: 10.5220/0010677500003064
In Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2021) - Volume 1: KDIR, pages 69-75
ISBN: 978-989-758-533-3; ISSN: 2184-3228
Copyright
c
๎€ 2021 by SCITEPRESS โ€“ Science and Technology Publications, Lda. All rights reserved
69
based on LDA2Vec (Moody, 2016), which is an
extension of Word2Vec involving the topic model
proposed by Blei et al. (2003). All the models from
previous studies are not only structurally unable to
represent customer heterogeneity in the marketing
environment but also lack interpretability for the
results. Compared to previous studies mentioned
above, our proposed model contributes both to higher
precision for forecasting by incorporating the
marketing environment and customer heterogeneity
into the model, and better interpretability with a
hierarchical model.
We explain our model in Section 2. In Section 3 we
present the empirical results for sales forecasting and
demonstrate the performance and interpretability of
our proposed model. We conclude and summarize the
future implications in Section 4.
2 MODEL
We extend the LDA2Vec model mainly in two ways:
(i) considering the marketing environment and (ii) a
hierarchical structure that considers customer
heterogeneity based on customer demographic data.
Fig 1 shows the framework of our model.
Figure 1: Framework of the model.
2.1 Skip-gram Negative Sampling
We employ skip-gram negative-sampling (SGNS)
from Item2Vec (Barkan and Koenigstein, 2016) to
propose a model that utilizes item-wide, customer-
wide, and marketing-wide feature vectors.
Given a sequence of items ๎ตซ๐‘ฅ
๎ฏ
๎ตฏ
๎ฏ๎ญ€๎ฌต
๎ฏ†
for a finite item
basket ๐ผ๎ตŒ๎ตซ๐‘ฅ
๎ฏ
๎ตฏ
๎ฏ๎ญ€๎ฌต
๎ฏ†
, the skip-gram model maximizes
the following term
1
๐‘€
๎ท๎ทlog๐‘๎ตซ๐‘ฅ
๎ฏ
|๐‘ฅ
๎ฏ”
๎ตฏ
๎ฏ†
๎ฏ๎ฎท๎ฏ”
๎ฏ†
๎ฏ”๎ญ€๎ฌต
,
(1)
where ๐‘€ is the number of items in the same market
basket as the receipt, and ๐‘๎ตซ๐‘ฅ
๎ฏ
|๐‘ฅ
๎ฏ”
๎ตฏ is defined as
๐‘๎ตซ๐‘ฅ
๎ฏ
|๐‘ฅ
๎ฏ”
๎ตฏ๎ตŒ
exp๏ˆบ๐šคโƒ—
๎ฏ”
๎ฏ
๐šคโƒ—
๎ฏ
๏ˆป
โˆ‘
exp๏ˆบ๐šคโƒ—
๎ฏ”
๎ฏ
๐šคโƒ—
๎ฏ 
๏ˆป
๎ฏ ๎ญ€๎ฌต
.
(2)
๐šคโƒ—
๎ฏ
represents the V-dimensional latent vector that
corresponds to item j, where V is a parameter that
represents the dimension of the vector. This formula
means, by transforming the items into latent vectors,
the conditional probability of purchase for product j
when item a is in the market basket is represented by
the inner product of their latent vectors. The latent
vector is transformed from
๐šค
โƒ—
๎ฏ
๎ตŒ๐‘Š
๏ˆบ๎ฏœ๏ˆป
๐‘ฅ
๎ฏ
,
(3)
where ๐šคโƒ—
๎ฏ
is the item vector for product j, ๐‘พ
๏ˆบ๐’Š๏ˆป
reflects
the coefficient vector, and ๐‘ฅ
๎ฏ
is the dummy variable
for product j. As Eq (1) is inefficient when the
dimension of an item is large, negative sampling is
employed for solving the computational problem. We
replace Eq. (1) with
๐‘๎ตซ๐‘ฅ
๎ฏ
|๐‘ฅ
๎ฏ”
๎ตฏ๎ตŒฯƒ๏ˆบ๐šค
โƒ—
๎ฏ”
๎ฏ
๐šค
โƒ—
๎ฏ
๏ˆป๎ท‘ฯƒ๏ˆบ๎ต†๐šคโƒ—
๎ฏ”
๎ฏ
๐šคโƒ—
๎ฏก
๏ˆป
๎ฏ‡
๎ฏก๎ญ€๎ฌต
,
(4)
where ฯƒ
๏ˆบ
๐‘ฅ
๏ˆป
๎ตŒ1/๏ˆบ1๎ต…exp
๏ˆบ
๎ต†๐‘ฅ
๏ˆป
๏ˆป , and N is a
hyperparameter that represents the number of
negative examples to be selected per positive example.
We sample a negative item ๐‘ฅ
๎ฏก
from the unigram
distribution raised to the 3rd/4th power, as the
empirical study shows that it outperforms the unigram
distribution (Mikolov et al., 2013).
Eq. (4) represents the skip-gram model from
Item2Vec, which only considers the item. The only
data in this model is the item, which means Item2Vec
ignores customer information. We further modified
Eq. (4) by employing the approach from LDA2Vec
(Moody, 2016) as follows:
๐‘๎ตซ๐‘ฅ
๎ฏ
|๐‘ฅ
๎ฏ”
๎ตฏ3 ๎ตŒ ฯƒ๏ˆบ๐‘
โƒ—
๎ฏ”๎ฏ›
๎ฏ
๐šค
โƒ—
๎ฏ
๏ˆป๎ท‘ฯƒ๏ˆบ๎ต†๐‘โƒ—
๎ฏ”๎ฏ›
๎ฏ
๐šคโƒ—
๎ฏก
๏ˆป
๎ฏ‡
๎ฏก๎ญ€๎ฌต
,
(5)
where
๐‘
โƒ—
๎ฏ”๎ฏ›
๎ตŒ๐šค
โƒ—
๎ฏ”
๎ต…๐‘‘
โƒ—
๎ฏ›
.
(6)
๐‘โƒ—
๎ฏ”๎ฏ›
is defined as the context vector, which is
explicitly designed to be the sum of an item vector ๐šคโƒ—
๎ฏ”
for product a and a customer vector ๐‘‘
โƒ—
๎ฏ›
for customer
h. This structure is designed to represent customer-
wide relationships by incorporating a new term ๐‘‘
โƒ—
๎ฏ›
for
each customer while preserving the ๐šคโƒ—
๎ฏ”
for item-wide
KDIR 2021 - 13th International Conference on Knowledge Discovery and Information Retrieval
70
relationships.
Next, for better interpretable representations,
rather than producing a dense vector for every
customer independently, ๐‘‘
โƒ—
๎ฏ›
is designed to be a mixed
membership of common topic vectors. When defining
the ๐‘ก
๎ญฉ
๏ˆฌ
๏ˆฌ
๏ˆฌ
โƒ—
as a V-dimensional topic vector for topic k, the
๐‘‘
๎ฏ›
๏ˆฌ
๏ˆฌ
๏ˆฌ
๏ˆฌ
can be represented by
๐‘‘
๎ฏ›
๏ˆฌ
๏ˆฌ
๏ˆฌ
๏ˆฌ
โƒ—
๎ตŒ๐‘
๎ฏ›๎ฌต
โˆ™๐‘ก
๎ฌต
๏ˆฌ
๏ˆฌ
๏ˆฌ
โƒ—
๎ต…๐‘
๎ฏ›๎ฌต
โˆ™๐‘ก
๎ฌถ
๏ˆฌ
๏ˆฌ
๏ˆฌ
โƒ—
๎ต…โ‹ฏ๎ต…๐‘
๎ฏ›๎ฏ„
โˆ™๐‘ก
๎ฏ„
๏ˆฌ
๏ˆฌ
๏ˆฌ
๏ˆฌ
โƒ—
,
(7)
where the weight 0๎ต‘๐‘
๎ฏ›๎ฏž
๎ต‘1 is a fraction that
denotes the membership of customer h in topic k,
enforcing the constraint that
โˆ‘
๐‘
๎ฏ›๎ฏž๎ฏž
๎ตŒ1. Note that
topic vector ๐‘ก
๎ฏž
๏ˆฌ
๏ˆฌ
๏ˆฌ
can be interpreted as a common
feature as it is shared among all customers, and the
influence of each topic vector is modulated by the
weight ๐‘
๎ฏ›๎ฏž
that is unique to each customer.
2.2 Vectorizing the Marketing
Environment
In addition to the product and customer, we consider
the marketing environment, such as promotional
information, which also plays an important role when
customers choose items, as shown in previous studies.
First, we categorize all the combinations of marketing
environments. That is, assuming we use two binary
variables, ๐‘ง
๎ฏฃ๎ฏฅ๎ฏข๎ฏ ๎ฏข๎ฏง๎ฏœ๎ฏข๎ฏก
, ๐‘ง
๎ฏ—๎ฏœ๎ฏฆ๎ฏ–๎ฏข๎ฏจ๎ฏก๎ฏง
, as marketing
environment (whether there are promotions and
discounts for the product), we categorize the
marketing environment by considering all the
patterns of variable combinations, ๐‘ง
๎ฏฃ๎ญ€๎ฌต
๎ตŒ
๐‘“๎ตซ๐‘ง
๎ฏฃ๎ฏฅ๎ฏข๎ฏ ๎ฏข๎ฏง๎ฏœ๎ฏข๎ฏก
๎ตŒ0,๐‘ง
๎ฏ—๎ฏœ๎ฏฆ๎ฏ–๎ฏข๎ฏจ๎ฏก๎ฏง
๎ตŒ0๎ตฏ๎ตŒ
๏ˆบ
1,0,0,0
๏ˆป
,
๐‘ง
๎ฏฃ๎ญ€๎ฌถ
๎ตŒ๐‘“๎ตซ๐‘ง
๎ฏฃ๎ฏฅ๎ฏข๎ฏ ๎ฏข๎ฏง๎ฏœ๎ฏข๎ฏก
๎ตŒ0,๐‘ง
๎ฏ—๎ฏœ๎ฏฆ๎ฏ–๎ฏข๎ฏจ๎ฏก๎ฏง
๎ตŒ1๎ตฏ๎ตŒ
๏ˆบ0,1,0,0๏ˆป , ๐’›
๐’‘๎ญ€๐Ÿ‘
๎ตŒ๐‘“๎ตซ๐‘ง
๎ฏฃ๎ฏฅ๎ฏข๎ฏ ๎ฏข๎ฏง๎ฏœ๎ฏข๎ฏก
๎ตŒ1,๐‘ง
๎ฏ—๎ฏœ๎ฏฆ๎ฏ–๎ฏข๎ฏจ๎ฏก๎ฏง
๎ตŒ
0๎ตฏ ๎ตŒ ๏ˆบ0,0,1,0๏ˆป , ๐‘ง
๎ฏฃ๎ญ€๎ฌธ
๎ตŒ๐‘“๎ตซ๐‘ง
๎ฏฃ๎ฏฅ๎ฏข๎ฏ ๎ฏข๎ฏง๎ฏœ๎ฏข๎ฏก
๎ตŒ
1,๐‘ง
๎ฏ—๎ฏœ๎ฏฆ๎ฏ–๎ฏข๎ฏจ๎ฏก๎ฏง
๎ตŒ1๎ตฏ๎ตŒ
๏ˆบ
0,0,0,1
๏ˆป
. ๐‘ง
๎ฏฃ
is a categorical
dummy variable, which means the pattern of the
marketing environment, similar to Eq (3). We
vectorize the categorical variable ๐‘ง
๎ฏฃ
as
๐‘š๏ˆฌ
๏ˆฌ
โƒ—
๎ฏฃ
๎ตŒ๐‘พ
๏ˆบ๐’Ž๏ˆป
๐‘ง
๎ฏฃ
,
(8)
where ๐‘š๏ˆฌ
๏ˆฌ
๎ฏฃ
is the marketing vector for marketing
pattern p and ๐‘พ
๏ˆบ๐’Ž๏ˆป
stands for the coefficient vector.
We then incorporate the extracted marketing vector
into the context vector in (7) as
๐‘โƒ—
๎ฏ”๎ฏ›๎ฏฃ
๎ตŒ๐šคโƒ—
๎ฏ”
๎ต…๐‘‘
โƒ—
๎ฏ›
๎ต…๐‘š๏ˆฌ
๏ˆฌ
โƒ—
๎ฏฃ
.
(9)
The context vector is now defined as the sum of the
item, customer, and marketing vectors. Compared to
Eq. (7), the choice of target item of a customer is
designed to stem partially from the marketing
environment.
2.3 Hierarchical Model for the
Customer Heterogeneity
We impose the marketing vector into the skip-gram
model to reflect the influence of the marketing
environment. However, Eq. (9) may still have
limitations in its structure, as we did not consider
customer heterogeneity: simply adding three vectors
implicitly assumes that the influence of each vector is
the same among customers. That is, discounts or
promotions would have the same influence on all
customers. Thus, we extended Eq. (9) by
incorporating customer heterogeneity as follows:
๐‘
โƒ—
๎ฏ”๎ฏ›๎ฏฃ
๎ตŒ๐›ฝ
๎ฌต๎ฏ›
๐šค
โƒ—
๎ฏ”
๎ต…๐›ฝ
๎ฌถ๎ฏ›
๐‘‘
โƒ—
๎ฏ›
๎ต…๐›ฝ
๎ฌท๎ฏ›
๐‘š๏ˆฌ
๏ˆฌ
โƒ—
๎ฏฃ
.
(10)
where ๐›ฝ
๎ฌต๎ฏ›
, ๐›ฝ
๎ฌถ๎ฏ›
, ๐›ฝ
๎ฌท๎ฏ›
represent the weights of the three
vectors for customer h, respectively. Note that
๐›ฝ
๎ฏฉ๎ฏ›
,๐‘ฃ ๎ตŒ 1,2,3 are always non-negative and sum to
unity, as this change allows us to interpret the weight
of vectors as percentages rather than unbounded
weights. For a better understanding, we define the
๐›ฝ
๎ฏฉ๎ฏ›
as vector probability of the v-th vector for
customer h in this study. Next, as interpreting the
factor of customer behavior is one of our research
interests, we also propose a hierarchical structure for
vector probability using customer demographic data,
defining that
๐›ฝ
๎ฏฉ๎ฏ›
๎ตŒ
๐›ฝ
๎ฏฉโ„Ž
โ€ฒ
โˆ‘
๐›ฝ
๎ฏโ„Ž
โ€ฒ
3
๎ฏ๎ญ€1
.
(11)
Then, we propose the hierarchical model as
๐›ฝ
๎ฏฉ๎ฏ›
๏‡ฑ
๎ตŒ๐œถ
๎ฏž
๐’
๎ฏ›
๎ต…๐œ€
๎ฏฉ
, ๐œ€
๎ฏž
~๐‘
๏ˆบ
0,๐œŽ
๎ฏฉ
๏ˆป
(12)
where ๐’
๎ฏ›
is the demographic data for customer h and
๐œถ
๎ฏฉ
is coefficients vector.
2.4 Model Optimization
We optimize the model by minimizing the total loss
given by
๐ฟ๎ตŒ๎ท๎ท๐ฟ
๎ฏ”๎ฏ
๎ฏ‡๎ฏ˜๎ฏš
๎ฏ†
๎ฏ”๎ญ€๎ฌต
๎ฏ†
๎ฏ๎ญ€๎ฌต
๎ต…๐ฟ
๎ฏ—
๎ต…๎ท๐ฟ
๎ฏฉ
๎ฏ๎ฏœ๎ฏ˜๎ฏฅ
๎ฌท
๎ฏฉ๎ญ€๎ฌต
.
(13)
The first two terms on the right-hand side of the
equation follow Moody (2016). ๐ฟ
๎ฏ”๎ฏ
๎ฏ‡๎ฏ˜๎ฏš
is the loss from
negative sampling, derived from Eq. (5), defined as
๐ฟ
๎ฏ”๎ฏ
๎ฏ‡๎ฏ˜๎ฏš
๎ตŒlogฯƒ๎ตซ๐‘
โƒ—
๎ฏ”๎ฏ›๎ฏฃ
๎ฏ
๐šค
โƒ—
๎ฏ
๎ตฏ๎ต…๎ทlogฯƒ๎ตซ๎ต†๐‘โƒ—
๎ฏ”๎ฏ›๎ฏฃ
๎ฏ
๐šคโƒ—
๎ฏก
๎ตฏ
๎ฏ‡
๎ฏก๎ญ€๎ฌต
.
(14)
The second term, ๐ฟ
๎ฏ—
, is defined as the loss of
sparsity of the customer weight ๐‘
๎ฏ›๎ฏž
in Eq. (7),
represented by the Dirichlet likelihood with a low
concentration parameter ๐›ผ,
โ„’
๎ฏ—
๎ตŒ๐œ†๎ท
๏ˆบ
๐›ผ๎ต†1
๏ˆป
log๎ตซ๐‘
๎ฏ๎ฏž
๎ตฏ
๎ฏ๎ฏž
,
(15)
Product Embedding for Large-Scale Disaggregated Sales Data
71
where ๐œ† is a tuning parameter that controls the
strength of the loss and ๐›ผ is a tuning parameter that
controls the sparseness of the customer weight. As
this term measures the likelihood of customer h in
topic k summed over all available customers, โ„’
๎ฏ—
encourages customer weight vectors to become more
concentrated, which improves interpretability as
customers are more likely to belong to fewer topics.
The last term, ๐ฟ
๎ฏ›
is defined as the loss of the
hierarchical structure proposed in Eq. (12), which is
represented by the mean squared error (MSE) as
๐ฟ
๎ฏฉ
๎ฏ๎ฏœ๎ฏ˜๎ฏฅ
๎ตŒ๐›ฟ
1
๐‘
๎ท
๏ˆบ
๐›ฝ
๎ฏ›๎ฏฉ
๏‡ฑ
๎ต†๐œถ
๎ฏฉ
๐’
๎ฏ›
๏ˆป
๎ฌถ
๎ฏ‡
๎ฏ›๎ญ€๎ฌต
,
(16)
where ๐›ฟ is a tuning parameter that controls the
loss strength. When ๐›ฟ is smaller, the vector
probability will be less interpreted by the
demographic data, and the interpretability will
increase if ๐›ฟ increases.
3 RESULTS AND DISCUSSION
3.1 Data
We applied our model to daily scanner sales data from
a store in Japan. The data were recorded between
January 2, 2000, and December 5, 2001. There were
56,630 receipts generated by 1,476 unique customers
in total; the dataset included 11,983 unique items, and
the mean number of items in each receipt was 8.83.
We used binary factors โ€œdiscount,โ€ โ€œpromotion,โ€ and
โ€œweekdayโ€ as marketing environment variables in
this empirical study, and 12 variables including age,
family members, job, etc., as customer demographic
data. The details of the demographic variables are
presented in Table 1.
Table 1: Demographic variables.
3.2 Model Comparison
We compare the three models according to the
context vector composition as listed in Table 2. We
use the grid search method for tuning the parameters,
including vector dimension V (10, 20, โ€ฆ, 100), topic
dimension K (5, 10, โ€ฆ 50), and tuning parameters ๐œ†,
๐›ผ, ๐›ฟ in the loss function for each model.
Table 2: Model Comparison.
Model Context Vector
Model 1
LDA2Vec
๐‘โƒ—
๎ฏ”๎ฏ›๎ฏฃ
๎ตŒ๐šคโƒ—
๎ฏ”
๎ต…๐‘‘
โƒ—
๎ฏ›
Model 2
LDA2Vec+Marketing
๐‘โƒ—
๎ฏ”๎ฏ›๎ฏฃ
๎ตŒ๐šคโƒ—
๎ฏ”
๎ต…๐‘‘
โƒ—
๎ฏ›
๎ต…๐‘š๏ˆฌ
๏ˆฌ
โƒ—
๎ฏฃ
Model3
LDA2Vec+Marketing+H
๐‘
โƒ—
๎ฏ”๎ฏ›๎ฏฃ
๎ตŒ๐›ฝ
๎ฌต๎ฏ›
๐šคโƒ—
๎ฏ”
๎ต…๐›ฝ
๎ฌถ๎ฏ›
๐‘‘
โƒ—
๎ฏ›
๎ต…๐›ฝ
๎ฌท๎ฏ›
๐‘š
๏ˆฌ
๏ˆฌ
โƒ—
๎ฏฃ
We retained the last receipt of 1000 customers as
test data. Following the related empirical studies (e.g.,
Le et al., 2007, Caselles-Duprรฉ et al., 2018), we
evaluated the models with a Hit ratio at K (HR@K).,
which is equal to 1 if the test item appears in the list
of K, predicted items, otherwise 0. The result is
shown in Fig 2.
The results show that both the marketing
environment and hierarchical structure improve
forecasting accuracy. Specifically, when K is smaller
than 5, our proposed models (Models 2 and 3)
significantly outperformed the benchmark models.
This shows that our proposed model enhances
practicality in real business scenarios.
Figure 2: Hit rate @ K.
3.3 Parameter Estimates
We explain the parameter estimates from Model 3, as
it performs best among all models.
(i) Topic vector
Considering the mixed structure in Eq. (7) and the
context vector in (10), the formula ensures that the
topic vector ๐‘ก
๎ฏž
๏ˆฌ
๏ˆฌ
๏ˆฌ
โƒ—
and item vector ๐šคโƒ—
๎ฏ”
operate in the same
space. This allows us to list the most similar words
given a topic vector by simply calculating the
similarity between the word and topic vectors. The
top words for each topic are shown in Fig. 3. The item
Variables Type Description
age numeric The age of the customer.
family numeric The number of family members of the customer.
time numeric The time cost for arriving at the store.
walk dummy If the customer walks to the store.
bike/bicycle dummy If the customer uses a bike or bicycle.
Car (no drive) dummy If
the
customer
reaches
the
store
by
car,
but
not
as
a
driver.
car(drive) dummy If the customer drive to the store.
parttime dummy If the customer has a part-time job.
fulltime dummy If the customer has a full-time job.
unknown dummy If the job of the customer is unknown.
housework dummy If the customer is a homemaker.
Work home dummy If the customer works at home.
KDIR 2021 - 13th International Conference on Knowledge Discovery and Information Retrieval
72
Figure 3: Topic Interpretation.
Figure 4: Topic Distribution.
ID and its category are displayed, that is, โ€œ[213] Tofuโ€
means the item id is 213 and it belongs to the โ€œTofuโ€
category.
Fig 4 shows that most customers are concentrated
on a few topics, which means the interpretation for
each customer is relatively easy. Considered together
with Fig 3, we can understand the preference of an
individual customer. For example, if a customerโ€™s
topics are distributed in Topic 19 and Topic 4, we can
interpret that this customer mainly has two shopping
patterns โ€“ one is the combination of milk, bread, and
coffee (Topic 19), maybe for breakfast, and another
is the combination of Natto and tofu (Topic 4), maybe
for lunch or dinner.
(ii) Vector probability ๐›ฝ
๎ฏฉ๎ฏ›
Fig 5 shows the estimated ๐›ฝ
๎ฏฉ๎ฏ›
. The figure in the upper
panel shows the vector probability for all customers,
while the bottom panel shows the customers sorted by
vector probability of item, customer, and marketing
vector. As we mentioned, vector probability ๐›ฝ
๎ฏฉ๎ฏ›
can
be interpreted as the influence of vector v for
customer h.
Figure 5: (a) Overview of vector probability.
We can conclude that the customer vector has the
biggest influence overall, followed by the marketing
vector; the item vector has the smallest influence.
This can be explained by the fact that most customers
(i.e., customer #19, #435) behave according to their
own interests, and are hardly affected by the
marketing environment including discounts or
promotions. In contrast, some customers (i.e.,
customer #1275, #1260) are highly influenced by the
marketing environment. This implies that marketing
promotion for these customers can be effective.
Customers influenced by the item vector (i.e.,
customer #1453, #1250) may prefer common
combinations.
(iii) Hierarchical model
Table 3 provides the coefficient estimates for the
hierarchical structure, Eq. (12). We interpret the
coefficients for each vector as follows. First, for the
(a) item vector, we find that the job variables are not
Product Embedding for Large-Scale Disaggregated Sales Data
73
Figure 5: (b) Top Customers for three vectors.
Table 3: Coefficients of the Hierarchical Model.
quite significant, except โ€œhousework.โ€ In addition,
the negative values of โ€œageโ€ and โ€œfamilyโ€ indicate
that younger customers and those with fewer family
members may be less influenced by the product itself.
Second, we find that all the variables are significant,
and most variables from (b) customer vector and (c)
marketing vector have opposite influences, except for
means of transportation. On one hand, for the
customer vector, -0.013 for the covariate โ€œfamilyโ€
means that customers with fewer family members are
more likely to be influenced by their own interests
when they are shopping. In addition, customers who
spend more time and have jobs other than part-time
jobs tend to be highly influenced by their own
interests rather than the item or marketing
environment. On the other hand, older customers who
live with more family members, or have part-time
jobs, tend to be more influenced by marketing
environments such as discounts and promotions
rather than their own preferences.
By interpreting the estimates of the hierarchical
model, we can further understand the role of
demographic data and customer behavior, which will
provide useful insights into real business scenarios
and marketing decisions such as personal marketing.
4 CONCLUSION
This study proposes a new framework that extends
the LDA2Vec approach by including the marketing
environment and hierarchical structure with customer
demographic data. The empirical results show that
our approach not only improves the precision of
forecasting but also enhances the interpretability of
the model.
Our study highlights the significance of the
marketing environment as well as the demographics
in large-scale marketing implications. Furthermore,
considering the topic distribution and vector
probability together, we can further understand the
customer behavior pattern and the latent factor by
evaluating the estimates for the hierarchical structure.
Further issues remain. We fixed several
hyperparameters in our empirical study, such as the
dimension of the topic and vector. Another
challenging problem is that some estimates are still
difficult to interpret, such as the interpretation of each
topic and the role of the means of transportation. We
leave these issues for future research.
REFERENCES
Barkan, O. and Koenigstein, N. (2016). Item2vec: Neural
Item Embedding for Collaborative Filtering. 2016 IEEE
KDIR 2021 - 13th International Conference on Knowledge Discovery and Information Retrieval
74
26th International Workshop on Machine Learning for
Signal Processing. 1-6.
Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003). Latent
dirichlet allocation. The Journal of Machine Learning
Research, 3, 993โ€“1022.
Caselles-Duprรฉ, H., Lesaint, F. and Royo-Letelier J. (2018).
Word2Vec applied to Recommendation:
Hyperparameters Matter. Proceedings of the 12th ACM
Conference on Recommender Systems, 352-356.
Levy, O. and Goldberg, Y. (2014). Dependency-Based
Word Embeddings. In Proceedings of the 52nd Annual
Meeting of the Association for Computational
Linguistics, (2), 302โ€“308.
Grbovic, M., Radosavljevic, V., Djuric, N., Bhamidipati, N.,
Savla, J., Bhagwan, V. and Sharp, D. (2015). E-
commerce in your inbox: Product recommendations at
scale. In Proceedings of the 21st ACM SIGKDD
International Conference on Knowledge Discovery and
Data Mining. New York: ACM Press, 1809โ€“1818.
Koren Y, Bell R, Volinsky C. (2009). Matrix
factorization techniques for recommender systems.
Computer, 1(8), 30-7.
Essex, D. (2009). Matchmaker, matchmaker.
Communications of the ACM, 52(5),16โ€“17.
Mikolov, T., Chen, K., Corrado, G. and Dean, J. (2013).
Efficient Estimation of Word Representations in Vector
Space. Proceedings of the International Conference on
Learning Representations,2013, 1-13.
Moody, C. (2016). Mixing Dirichlet Topic Models and
Word Embeddings to Make lda2vec.
Naik, P., Wedel M., Bacon, L. and Bodapati, A. (2008).
Challenges and opportunities in high-dimensional
choice data analyses. Marketing Letters, 19(3), 201-213.
Paquet, U. and Koenigstein, N. (2013). One-class
collaborative filtering with random graphs. In
Proceedings of the 22nd international conference on
World Wide Web, 999-1008.
Pennington, J., Socher, R. and Manning, C. (2014). Glove:
Global Vectors for Word Representation. In
Proceedings of the 2014 Conference on Empirical
Methods in Natural Language, 1532โ€“1543.
Salakhutdinov R. and Mnih A. (2008). Bayesian
probabilistic matrix factorization using Markov chain
Monte Carlo. In Proceedings ICML, 880-887.
Vasile, F., Smirnova, E. and Conneau, A. (2016). Meta-
prod2vec: Product embeddings using side-information
for recommendation. In Proceedings of the 11th ACM
Conference on Recommender Systems. New York:
ACM Press, 225โ€“232.
Product Embedding for Large-Scale Disaggregated Sales Data
75