Product Embedding for Large-Scale Disaggregated Sales Data 
Yinxing Liᵃ and Nobuhiko Teruiᵇ
Graduate School of Economics and Management, Tohoku University, Japan 
Keywords:  LDA2Vec, Item2Vec, Demographics, Hierarchical Model, Customer Heterogeneity, Topic Model. 
Abstract:  This paper proposes a recommendation system that incorporates the marketing environment and customer
heterogeneity. We employ and extend the LDA2Vec and Item2Vec approaches for high-dimensional store data. Our study
aims not only to propose a model with better forecasting precision but also to reveal how customer demographics
affect customer behaviour. Our empirical results show that incorporating the marketing environment and customer
heterogeneity increases forecasting precision, and that demographics have a significant influence on customer
behaviour through the hierarchical model.
1  INTRODUCTION 
Marketing data are expanding in several modes: the
number of variables explaining customer behavior has
greatly increased, and automated data collection in
stores has led to the recording of customer choice
decisions from large samples. Thus, high-dimensional
models have recently gained considerable importance in
several areas, including marketing. Despite the rapid
expansion of available data, Naik et al. (2008) noted
that many algorithms scale not linearly but
exponentially as the dimension of the variables
expands. This highlights the urgent need for faster
numerical methods and efficient statistical estimators.
While some previous studies focused on dimension
reduction approaches for products (e.g., Salakhutdinov
and Mnih, 2008; Koren et al., 2009; Paquet and
Koenigstein, 2013), their final goal was learning
product similarities rather than forecasting.
ᵃ https://orcid.org/0000-0001-9335-9802
ᵇ https://orcid.org/0000-0003-4868-0140
Word2Vec (Mikolov et al., 2013) was proposed in
natural language processing to deal with
high-dimensional, sparse vocabulary data. Since then,
many studies have applied and extended the model to
other fields, such as item recommendation, including
Prod2Vec (Grbovic et al., 2015), Item2Vec (Barkan and
Koenigstein, 2016), and Meta-Prod2Vec (Vasile et al.,
2016). These studies indicate that the Word2Vec
framework outperforms existing econometric models in
sales prediction. In addition, Pennington et al. (2014)
proposed a model that factorizes a large-scale word
co-occurrence matrix to improve performance in pairing
similar words. This approach was further employed for
parsing tasks by Levy and Goldberg (2014).
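To make the idea behind the Item2Vec-style approaches cited above concrete, the following is a minimal sketch of skip-gram with negative sampling applied to shopping baskets, where every other item in the same basket is treated as context. It is not the model proposed in this paper. The toy baskets, the function names `train_item2vec` and `cosine`, the uniform negative sampler, and all hyperparameter values are assumptions made purely for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_item2vec(baskets, n_items, dim=8, n_neg=2, lr=0.05, epochs=100, seed=0):
    """Skip-gram with negative sampling; context = other items in the same basket.

    Hypothetical helper for illustration only (not the authors' implementation).
    Returns the item embeddings and the mean loss per (center, context) pair
    for each epoch.
    """
    rng = np.random.default_rng(seed)
    w_in = rng.normal(scale=0.1, size=(n_items, dim))   # item embeddings
    w_out = rng.normal(scale=0.1, size=(n_items, dim))  # context embeddings
    losses = []
    for _ in range(epochs):
        total, n_pairs = 0.0, 0
        for basket in baskets:
            for center in basket:
                for context in basket:
                    if context == center:
                        continue
                    # One observed (positive) item plus n_neg negatives drawn
                    # uniformly; a simplification, collisions are tolerated.
                    negatives = rng.integers(0, n_items, size=n_neg)
                    targets = np.concatenate(([context], negatives))
                    labels = np.zeros(n_neg + 1)
                    labels[0] = 1.0
                    v = w_in[center].copy()
                    u = w_out[targets]              # copy, shape (n_neg + 1, dim)
                    p = sigmoid(u @ v)
                    total -= np.sum(labels * np.log(p + 1e-12)
                                    + (1.0 - labels) * np.log(1.0 - p + 1e-12))
                    n_pairs += 1
                    grad = p - labels               # gradient of loss w.r.t. scores
                    w_in[center] -= lr * (grad @ u)
                    # ufunc.at accumulates correctly if an index repeats in targets.
                    np.subtract.at(w_out, targets, lr * np.outer(grad, v))
        losses.append(total / n_pairs)
    return w_in, losses


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy data: items 0-2 are bought together, as are items 3-5.
baskets = [[0, 1, 2], [0, 1], [1, 2], [0, 2],
           [3, 4, 5], [3, 4], [4, 5], [3, 5]]
emb, losses = train_item2vec(baskets, n_items=6)
print("mean loss, first vs last epoch:", losses[0], losses[-1])
print("cosine(item 0, item 1):", cosine(emb[0], emb[1]))
print("cosine(item 0, item 4):", cosine(emb[0], emb[4]))
```

On data like this, one would expect items that are frequently co-purchased to end up with more similar embeddings than items that never share a basket. The sketch uses only purchase co-occurrence, with no price, promotion, or demographic information.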
However, the main limitation of the existing
approaches is the lack of model interpretability.
Like most nonlinear machine learning approaches, the
Word2Vec framework cannot evaluate the effects of
individual variables, which may limit its implications
in the marketing field, such as effective
personalization and targeting (Essex, 2009).
Although extension models, such as Prod2Vec, involve
various marketing variables such as price and customer
demographic data, the role of these variables in
forecasting has not yet been discussed.
In light of the limitations mentioned above, we
propose a Word2Vec-based framework that incorporates
marketing variables. The main research purposes are to
(i) improve the precision of forecasting by introducing
a hierarchical structure into the Word2Vec framework
with marketing mix variables, and (ii) investigate and
interpret the role of the marketing mix variables.
To fulfil these aims, we analyze large-scale sales
data from a retail store in our empirical application.
In addition to daily sales data for each unique
customer, our data also include daily price
information, several types of promotional information, and
demographic data for each customer. Our approach is