On the Limits to Multi-modal Popularity Prediction on Instagram:
A New Robust, Efficient and Explainable Baseline
Christoffer Riis (1), Damian Konrad Kowalczyk (1,2) and Lars Kai Hansen (1)
(1) Technical University of Denmark, DTU Compute, Matematiktorvet 303B, Denmark
(2) Microsoft Corporation, Business Applications Group, Kanalvej 7, 2800 Kongens Lyngby, Denmark
ORCIDs: Christoffer Riis https://orcid.org/0000-0002-4540-6691, Damian Konrad Kowalczyk https://orcid.org/0000-0002-5612-0859, Lars Kai Hansen https://orcid.org/0000-0003-0442-5877
These authors contributed equally.
Keywords:
Visual, Popularity, Explainable, Instagram, Social.
Abstract:
People across the globe contribute visual content to platforms like Instagram, attempting to express themselves
and engage their audiences, at an unprecedented and increasing rate. In this paper, we revisit popularity
prediction on Instagram. We present a robust, efficient, and explainable baseline for population-based pop-
ularity prediction, achieving strong ranking performance. We employ the latest methods in computer vision
to maximise the information extracted from the visual modality. We use transfer learning to extract visual
semantics such as concepts, scenes, and objects, allowing a new level of scrutiny in an extensive, explainable
ablation study. We inform feature selection towards a robust and scalable model, but also illustrate feature
interactions, offering new directions for further inquiry in computational social science. Our strongest models
inform a lower limit to population-based predictability of popularity on Instagram. The models are immedi-
ately applicable to social media monitoring and influencer identification.
1 INTRODUCTION
Social media platforms are full of societal metrics.
The reach of social media postings and the mech-
anisms determining popularity are of increasing in-
terest for scholars of diverse disciplines. In sociol-
ogy, it can be used to understand the connection be-
tween popularity and self-esteem (Wang et al., 2017);
in marketing and branding, it can clarify how to best
engage and communicate with customers (Overgoor
et al., 2017); in journalism, it can be used to de-
cide which posts to share on social media (Chopra
et al., 2019); and in political science, it can be used
to understand how personalised content affects popularity (Larsson, 2019). From a data science point of
view, giving a lower bound on the limits to the pre-
dictability of human behaviour is a challenging task.
In Song et al.'s seminal work on limits to mobility prediction, they argue that there is a large gap between population and individual prediction: while individual predictability is high, population-based predictability is much lower (Song et al., 2010). Well-aligned with Song et al. (2010), very high popular-
ity predictability of individuals' postings on Instagram is found by combining individualised models (Gayberi and Oguducu, 2019). In contrast, this pa-
per focuses on Instagram popularity prediction as the
hard problem of predicting popularity using popula-
tion models. In keeping with the generality of population models, we will not restrict the analysis to any specific segment. Instead, we will use a general segment, which is in sharp contrast to previous
studies on Instagram predicting popularity (Mazloom
et al., 2016, 2018; Overgoor et al., 2017). To the best
of our knowledge, we are the first to use population
models to predict popularity on Instagram as a regres-
sion and ranking problem with a general segment. In
this paper, we further investigate and explain the vi-
sual modality and its potential for popularity ranking.
Our contributions can be summarized as follows:
1. we advance user-generated visual modality repre-
sentation with a novel and rich set of features, and
provide detailed explanations of their impact,
2. we provide two new popularity models for Insta-
gram, which achieve strong ranking performance
in a robust and explainable way, and finally
3. we offer a new lower bound on the predictability of
Instagram popularity with the above general pop-
ulation models.
Additionally, our modelling contributions bridge previous studies of the visual modality on Instagram (Mazloom et al., 2016, 2018; Gayberi and Oguducu, 2019; Overgoor et al., 2017; Rietveld et al., 2020) by clarifying the influence of different visual aspects on popularity and by investigating the role of four different feature sets in a comprehensive ablation study.
2 RELATED WORK
With the ever-increasing volume of multi-modal uploads to social media platforms, the challenge
of predicting the popularity of user-generated con-
tent inspires multi-modal approaches including con-
tent (metadata), author, textual, and visual informa-
tion. Content and user information are used with a
Gradient Boosting Machine (GBM) to achieve excel-
lent results (Kang et al., 2019). In multiple ablation
studies, it is reported that the content and user infor-
mation indeed are the strongest predictors among the
four modalities (Ding et al., 2019b; He et al., 2019;
Wang et al., 2018). These studies also show that modelling the textual content improves performance but report mixed results for the visual content, suggesting that care has to be exercised when com-
bining the modalities. In the following, we pay extra
attention to the visual modality and how it is modelled
in earlier work.
Khosla et al. (2014) find performance gains from
combining low-level features and semantic features
such as objects. Moreover, they conclude that scenes, objects, and faces are good predictors of image popularity. Similarly, other studies consider colour features, analysis of the scenery, and the number of faces in the images (McParlane et al., 2014), as well as visual information extracted from a pre-trained neural network (Cappallo et al., 2015). Both studies
show promising results for the visual modality as a
descriptor for popularity prediction.
Much recent work considers high-level visual in-
formation such as concepts, scenes, and objects de-
rived by transfer learning in the form of neural net-
works trained for classification or object detection
tasks (Gayberi and Oguducu, 2019; Gelli et al., 2015;
Mazloom et al., 2018; Ortis et al., 2019). An overview
is shown in Table 1. Gayberi and Oguducu (2019)
suggest that objects and categories are important fea-
tures in order to utilise the visual modality in the
best way possible and therefore propose to use the
MS COCO Model (Caesar et al., 2018) for object
detection. Gelli et al. (2015) use a pre-trained net-
work for object detection to extract high-level features
and objects. Their quantitative analysis shows how
the visual features complement the strong informa-
tion from the content and author features. Mazloom
et al. (2018) focus on popularity prediction within
different categories such as action, animal, people,
and scene. They show how human faces and animals
are important for popularity prediction. Ortis et al.
(2019) hypothesise that semantic features of the im-
ages such as objects and scenes have an impact on
the performance and therefore, they extract predic-
tions from two different neural networks. Another ap-
proach is to use an image-captioning model to extract
the high level information (Hsu et al., 2019; Zhang
et al., 2018).
Table 1: Summary of the use of concepts, scenes, and ob-
jects extracted from the visual modality.
Concepts Scenes Objects
Gayberi and Oguducu (2019) X X
Gelli et al. (2015) X
Khosla et al. (2014) X
Mazloom et al. (2018) X
Mazloom et al. (2016) X X
McParlane et al. (2014) X X
Ortis et al. (2019) X X
Overgoor et al. (2017) X
Rietveld et al. (2020) X
This study X X X
Visual features include brightness, style, and colour. Quantifying the aesthetics of images in
popularity prediction is seen in several papers (Chen
et al., 2019b; Ding et al., 2019b; Hidayati et al., 2017;
Mazloom et al., 2016). Chen et al. (2019b) propose to
use moments to quantify the style and colour. Ding
et al. (2019b) use a network pre-trained directly to assess image aesthetics. Hidayati et al. (2017) hypothesise that visual aesthetics are important information and therefore extract several high-level semantic features such as brightness, clarity, colour, and background simplicity. Mazloom et al. (2016) directly extract image aesthetics as a 42-dimensional binary vector given by the content information from Instagram in the form of the filter feature. Another high-
level feature is visual sentiment, which can be directly
assessed with neural networks (Gelli et al., 2015; Ma-
zloom et al., 2016). However, we hypothesise that
these features are captured in the high-level features
from a deep neural network and consequently, we do
not apply this approach.
In multiple works, visual features are extracted
implicitly by neural network embeddings pre-trained
for general object recognition tasks. Many use a
deep neural network pre-trained on ImageNet (Rus-
sakovsky et al., 2015) for classification (e.g. (Ma-
zloom et al., 2018; Ortis et al., 2019; Wang et al.,
2018)). It is most common to use the embeddings
from the last pooling layer with either 1024 or 2048
individual real-valued features, depending on the net-
work structure (Ding et al., 2019b; Mazloom et al.,
2018, 2016; Overgoor et al., 2017). Ortis et al. (2019)
extract high-level features from three different net-
works by considering the last two activation layers.
The three networks are pre-trained predicting classes,
adjective-noun pairs, and objects and scenes. Wang
et al. (2018) use features from a network pre-trained
on ImageNet and afterwards fine-tune the network for
popularity prediction.
While several papers deploy transfer learning to
access semantic and high-level features, recent work
applies end-to-end models on the visual modality
(Ding et al., 2019a; Zhang and Jatowt, 2019). Zhang
and Jatowt (2019) investigate the effectiveness of us-
ing neural networks in the modelling of image pop-
ularity. They hypothesise that the text features have
a stronger predictive power than the visual features.
With a six-layer end-to-end network, they outperform their baseline composed of a pre-trained deep neural network together with Support Vector Regression, and show that their network is comparable with the text-based embedding methods. Ding et al. (2019a) investigate the contribution of the visual content in popularity prediction by training a deep neural network to predict the intrinsic image popularity. By dividing posts into pairs matched on user statistics, upload time, and captions, they train the network with a Siamese architecture. Through a qualita-
tive analysis and a psycho-physical experiment, they
show how their intrinsic image popularity assessment
model (IIPA) achieves human-level performance.
Our Design Space. We aim to construct a new im-
age feature extractor building upon recent work util-
ising deep learning (e.g. (Ding et al., 2019b; He et al., 2019; Ortis et al., 2019)). In recent years, the application of deep learning and neural networks has grown rapidly as the field of computer vision has advanced in classification (Tan and Le, 2019) and object detection (Redmon and Farhadi, 2018), among others. Accordingly, we propose to use transfer learning with the most recent networks of computer vision to represent visual information and measure its importance in predicting popularity on social media. In relation to previous use of transfer learning and embeddings (e.g. (Ding et al., 2019a; Mazloom et al., 2016; Ortis et al., 2019)), we improve the explainability of the embeddings by constructing them as the input to the classifier softmax, i.e. the last layer prior to the softmax, so that each feature has an associated class label.
Networks pre-trained for different tasks have dif-
ferent internal representations, which means that the
high-level features will be complementary in describ-
ing images (Zhou et al., 2014). Therefore, we will
use the deep neural network EfficientNet-B6 (Tan
and Le, 2019) pre-trained for classification, Places365
ResNet-18 (Zhou et al., 2018) pre-trained for scene
classification, and YOLOv3 (Redmon and Farhadi,
2018) pre-trained for object detection. We adopt the
model IIPA (Ding et al., 2019a) to assess the intrin-
sic image popularity directly. Besides introducing the state-of-the-art networks EfficientNet, Places365, and YOLOv3 to popularity prediction, these pre-trained models give a novel combination (also shown in Table 1) of the visual semantics: concepts, scenes, and objects. The combination of the four complementary models leads to a rich image representation, instrumental in advancing popularity prediction on Instagram. We maximise the semantic diversity of the representation to boost the final model's ranking performance and engagement explainability simultaneously. To test the final model, we gathered one million posts from Instagram (more details in the Methods section). Figure 1 shows that the size of our data set is among the largest data sets on both Instagram and other social media platforms.
Figure 1: Data set sizes used in popularity prediction studies on different platforms (Flickr, Instagram, Twitter, and Reddit), ranging roughly from 10^4 to 10^7 samples across studies published between 2014 and 2020. This study (orange point) with 1 million samples is among the largest popularity prediction studies on both Instagram and social media in general. Points are shifted left or right for visual clarity.
Finally, we define our scope of popularity prediction and measurement. There exist multiple ways to address popularity prediction on social media. Previous work predicts the number of mentions for a specific event (Chen et al., 2019a); looks at the popularity over time or as a cascade (Almgren et al., 2016; Mishra et al., 2016; Ortis et al., 2019; Wu et al., 2016, 2017); or defines it as a binary classification problem (Deza and Parikh, 2015; McParlane et al., 2014; Zhang et al., 2018); but the main focus in popularity prediction on social media is to predict the number of likes, shares, views, etc., as a regression and ranking problem (e.g. (Chen et al., 2019b; He et al., 2019; Kowalczyk and Larsen, 2019)). In this paper, we address popularity prediction as a regression and ranking problem. As popularity measurement, we follow the majority of the literature and use the number of likes as our response variable (e.g. (Ding et al., 2019a; Rietveld et al., 2020; Zhang et al., 2018)).
3 METHODS
In this section, we first describe the one-million-post data set and how it was gathered. Next, we outline the fea-
ture extraction by going through the social features
and the enhanced visual feature extractor. Then, we
describe the gradient boosting machine used for pre-
diction. Lastly, we briefly introduce our use of the
explainability tool SHAP (Lundberg and Lee, 2017).
As mentioned by several studies, there does not
exist a publicly available data set for Instagram (e.g.
(Gayberi and Oguducu, 2019; Overgoor et al., 2017;
Zhang et al., 2018)). Similar to previous studies (e.g.
(Gayberi and Oguducu, 2019; Mazloom et al., 2018;
Rietveld et al., 2020)), we scraped Instagram and cre-
ated a multi-modal data set for this study specifically.
The data set consists of one million image posts gath-
ered from 2018-10-31 to 2018-12-11. The data set is
neither categorical nor user-specific and can thus be
seen as a general subset of all image posts on Insta-
gram. However, we are aware of the inevitable bias introduced by discarding non-public posts. The image, engagement signal, and social information were collected 48 hours after upload.
Previous studies show that the performance of
popularity prediction benefits from a multi-modal ap-
proach (Ding et al., 2019b; Hsu et al., 2019; Wang
et al., 2018). Therefore, we extract features from sev-
eral information sources. Overall, the features col-
lected from each post can be divided into social fea-
tures and visual features. The social features are
branched into author, content, and temporal features.
Among the author features, we extract how many fol-
lowers the user has, how many other users the user
follows, and the number of posts the user has made. In
order to stabilise the variance, we log-normalise these
three variables (e.g. (Ding et al., 2019b; Gayberi and
Oguducu, 2019; Kowalczyk and Hansen, 2020)). The
transformation of a variable $x$ is given by first log-transforming the variable, $x_{\log} = \log(x + 1)$, and then subtracting the mean:

$x_{\text{transformed}} = x_{\log} - \mathrm{mean}(x_{\log})$.    (1)
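As a concrete illustration, the following is a minimal sketch of this log-normalisation; the DataFrame and its column names are illustrative assumptions, not the actual pipeline.

```python
# Minimal sketch of the log-normalisation in Eq. (1); column names are
# hypothetical and only serve as an example.
import numpy as np
import pandas as pd

def log_normalise(x: pd.Series) -> pd.Series:
    """log(x + 1), then centre by subtracting the mean of the transformed values."""
    x_log = np.log(x + 1.0)
    return x_log - x_log.mean()

# Hypothetical usage on the three author counts:
# for col in ["followers", "following", "posts"]:
#     df[col] = log_normalise(df[col])
```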
Furthermore, we augment the features by computing the ratios followers per post and followers per following (Kowalczyk and Larsen, 2019). Regarding the content features, we extract the image filter, the number of users tagged, whether the user liked the post, whether geolocation is available, the language, the number of tags, and the length of the caption measured in words and characters. From the language feature, we derive the additional binary feature is English. Regarding the temporal features, we extract the posting date and time and split it into posted date, posted week day, and posted hour (Kowalczyk and Hansen, 2020). User ID and activity ID are discarded as irrelevant for the population-based approach, effectively anonymising the training. In creating a comprehen-
anonymizing the training. In creating a comprehen-
sive visual feature extractor, we use transfer learning
and deploy four pre-trained neural networks in order
to describe concepts, scenes, objects, and intrinsic im-
age popularity.
Concept Features: To extract concept features, we
use the state-of-the-art model EfficientNet-B6 (Tan
and Le, 2019) pre-trained on ImageNet (Russakovsky
et al., 2015). We use the values in the last layer prior
to the softmax normalization layer. This provides a 1000-dimensional vector, with each entry corresponding to a high-level object class label.
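A minimal sketch of this extraction is shown below, using torchvision's pre-trained EfficientNet-B6 as a stand-in; the exact checkpoint and preprocessing used in this study may differ.

```python
# Sketch: 1000-dimensional pre-softmax concept vector from EfficientNet-B6.
# The torchvision weights and preprocessing are assumptions for illustration.
import torch
from PIL import Image
from torchvision import models

weights = models.EfficientNet_B6_Weights.IMAGENET1K_V1
model = models.efficientnet_b6(weights=weights).eval()
preprocess = weights.transforms()  # resize, crop, and normalise the image

def concept_features(path: str) -> torch.Tensor:
    """Return the 1000-dim logit vector; entry i corresponds to ImageNet class i."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)      # forward pass stops before the softmax
    return logits.squeeze(0)   # shape: (1000,)
```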
Scene Features: We extract a diverse set of scene
features by using Places365 ResNet-18 (Zhou et al.,
2018). We use the values of the last layer prior
to softmax normalization. This provides a 365-
dimensional interpretable vector of scene concepts,
a 102-dimensional feature vector of SUN scene at-
tributes (Patterson and Hays, 2012), and a single entry
indicating if the scene is indoors or outdoors.
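The sketch below illustrates the scene branch under the assumption that a Places365-trained ResNet-18 checkpoint is available; loading the checkpoint and the additional SUN-attribute and indoor/outdoor heads is omitted.

```python
# Sketch: 365-dimensional scene logits from a ResNet-18 with a Places365 head.
# The checkpoint path is a placeholder; the SUN-attribute and indoor/outdoor
# outputs come from extra heads not shown here.
import torch
from torchvision import models

scene_net = models.resnet18(num_classes=365)
# scene_net.load_state_dict(torch.load("resnet18_places365.pth"))  # assumed weights
scene_net.eval()

def scene_features(image_tensor: torch.Tensor) -> torch.Tensor:
    """image_tensor: (1, 3, 224, 224) normalised image; returns 365 scene logits."""
    with torch.no_grad():
        return scene_net(image_tensor).squeeze(0)
```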
Object Features: YOLOv3 (Redmon and Farhadi,
2018) pre-trained on COCO (Lin et al., 2014) is used
to detect multiple occurrences of 80 different ob-
jects. For each object class, we count the number of instances, providing an 80-dimensional ‘bag-of-objects’ histogram of object occurrences.
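Independent of the specific YOLOv3 implementation, the detections can be turned into this histogram as in the sketch below (class indices follow the usual 80-class COCO ordering).

```python
# Sketch: build the 80-dimensional 'bag-of-objects' histogram from a list of
# detected COCO class indices (produced by any YOLOv3 implementation).
import numpy as np

NUM_COCO_CLASSES = 80

def bag_of_objects(detected_class_ids):
    """Count the occurrences of each of the 80 COCO classes in one image."""
    hist = np.zeros(NUM_COCO_CLASSES, dtype=np.float32)
    for cls in detected_class_ids:
        hist[cls] += 1.0
    return hist

# e.g. two persons (class 0) and one dog (class 16): hist[0] == 2, hist[16] == 1
```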
Intrinsic Image Popularity: Here, we adopt the
model IIPA (Ding et al., 2019a) to directly assess the
intrinsic image popularity in a single variable.
In total, we have 1548 features representing con-
cepts, scenes, objects, and the intrinsic image pop-
ularity resulting in an expressive and comprehensive
visual feature representation. A feature extraction is
illustrated in Figure 2.
Figure 2: Example of the features extracted from an image. The associated concepts are extracted with EfficientNet, objects are detected using YOLOv3, and the associated scenes and scene attributes, as well as the environment (indoor/outdoor), are extracted with Places365. Additionally, the image scores a neutral IIPA value of 1.96 on a normalised scale from -4 to 8, with a mean of 2.

Gradient boosting algorithms are used in social media popularity prediction due to their speed, performance, and explainability (e.g. (Chen et al., 2019b; Gayberi and Oguducu, 2019; Kang et al., 2019)). We use the framework LightGBM (Ke et al., 2017) in line with other recent studies (He et al., 2019; Hsu et al., 2019; Kowalczyk and Hansen, 2020; Kowalczyk and Larsen, 2019). LightGBM grows trees leaf-wise and uses a histogram-based algorithm to approximately find the best split. The algorithm handles integer-encoded categorical features and uses Exclusive Feature Bundling (EFB). By combining Gradient-based One-Side Sampling (GOSS) and EFB, Ke et al. (2017) show how this algorithm can accelerate the training of previous GBMs by 20 times or more while achieving on-par accuracy across multiple public data sets. Similar to previous work (e.g. (Ding et al., 2019a; Mazloom et al., 2018; Gayberi and Oguducu, 2019)), we choose to predict the log-normalised number of likes.
Explainable ML. We use the SHAP (Lundberg and
Lee, 2017) library to compute feature level explana-
tions. A single Shapley value quantifies the effect on a prediction that is attributed to a feature. Two prop-
erties of these values make them ideal for explaining
our ablation study:
Consistency and Local Accuracy: If we change
the model such that a feature has a greater impact,
the attribution assigned to that feature will never de-
crease. Features missing in the original input (i.e. re-
moved in ablation) are attributed no importance. The
values can be used to explain single predictions and
to summarise the model.
Additivity of Explanations: Summing the effects
of all feature attributions approximates the output of
the original model. Additivity, therefore, enables ag-
gregating explanations, e.g., on a group level, towards
an accurate and consistent attribution for each of the
modalities in the study.
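As an illustration of how these properties are used in the ablation study, the sketch below computes per-prediction SHAP values for a trained LightGBM model and aggregates the mean absolute attribution per feature group; the column-to-group mapping is an assumed input for illustration.

```python
# Sketch: per-prediction SHAP values aggregated to feature-group level.
# The trained model and the column-to-group mapping are assumed inputs.
import numpy as np
import shap

def group_attributions(model, X, column_groups):
    """X: (n_samples, n_features); column_groups: list of group names per column."""
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)
    attributions = {}
    for group in set(column_groups):
        cols = [i for i, g in enumerate(column_groups) if g == group]
        attributions[group] = float(np.abs(shap_values[:, cols]).mean())
    return attributions
```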
Model Training. We train 111 models for the
ablation study (37 combinations in 3-fold cross-
validation) in a distributed environment of Apache
Spark. The cluster consists of 3 nodes, each pow-
ered by a 6-core Intel Xeon CPU and an NVidia Tesla
V100 GPU. We perform a basic hyper-parameter tun-
ing of LightGBM on the full combination of feature
groups (denoted as YIEPACT) and fix these param-
eters across ablation experiments to ensure fair com-
parison. We cap the number of leaves at 256, set the
feature sampling at every iteration to 0.5 (expecting
many noisy features to slow down the training oth-
erwise), limit the number of bins when building the
histograms to 255 and set the learning rate to 0.05.
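For concreteness, the following is a minimal sketch of a LightGBM regressor with the hyper-parameters stated above; data loading, the cross-validation splits, and the number of boosting rounds are assumptions made for illustration.

```python
# Sketch: LightGBM regression on the log-normalised like count with the fixed
# hyper-parameters used across the ablation experiments.
import lightgbm as lgb
from scipy.stats import spearmanr

params = {
    "objective": "regression",
    "num_leaves": 256,        # cap on the number of leaves
    "feature_fraction": 0.5,  # sample half the features at every iteration
    "max_bin": 255,           # histogram bins per feature
    "learning_rate": 0.05,
}

def train_and_rank(X_train, y_train, X_val, y_val, num_rounds=1000):
    """Train on log-normalised likes and return the validation SRC and the model."""
    model = lgb.train(params, lgb.Dataset(X_train, label=y_train),
                      num_boost_round=num_rounds)  # num_rounds is illustrative
    preds = model.predict(X_val)
    return spearmanr(preds, y_val).correlation, model
```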
4 RESULTS & MAIN FINDINGS
In Figure 3, the average absolute SHAP value for each feature, aggregated within each group of features, is displayed for each model together with the corresponding SRC.

Figure 3: Average absolute SHAP value for each feature, aggregated within each feature group and displayed for all models. The upper left plot shows the base model with Content (C) and Temporal (T) features. In the columns, Author (A) and IIPA (I) features are added; in the rows, EfficientNet (E), Places365 (P), and YOLOv3 (Y), corresponding to concepts, scenes, and objects respectively, are added. The Spearman's rank correlation is shown for each model.

The base model CT with content features (C) and temporal features (T), achieving an SRC of 0.417, is shown in the upper left corner. C affects the prediction more than T, since the content bar is higher than the temporal bar.
Author Features Are Essential. In the columns, we add author features (A), IIPA (I), and the combination of the two (IA). In the first row with the base model CT, we observe that adding I to the base model increases the performance to 0.435 SRC, whereas adding A gives a very large increase in performance, reaching an SRC of 0.501. In fact, all the rows in the second and fourth columns show that the models with the author features do indeed score an SRC above 0.5. Thus, the author features appear essential for reaching strong performance.
EfficientNet Has the Largest Effect on the Predictions. In the rows below the base model CT in Figure 3, the different visual semantics, concepts (E: EfficientNet), scenes (P: Places365), and objects (Y: YOLOv3), are added to the model. A comparison of the three models YCT, ECT, and PCT shows that E, on average, has the largest effect on the predictions. In the lower half of the column, we have the models combining these features, and again it appears that E has the largest effect. This observation can be validated across the other columns.
Visual Semantics Are Correlated. Adding combinations of the semantic groups decreases the contribution of each single group, e.g. in YEPCT the effects of E, P, and Y are all lower than for the other models in this column. At the same time, the SRC increases every time new features are added to the model, indicating that the different features are complementary. However, the decrease in the different bars together with the increase in the SRC also indicates that the groups are slightly correlated and that the model might learn a better representation in which some of the features within the different groups are disregarded. In other words, this illustrates the synergy between the groups and how some features are substituted by including other features. These observations can be validated across the other columns.
Object Detection Works Better with Author Features. In the second column of Figure 3, we add A to the base model CT and observe a sudden increase in performance, reaching an SRC of 0.501. In the first column, without A, the increase in performance is higher when adding E or P instead of Y, e.g. the model EPCT achieves a higher SRC than both YECT and YPCT. The same pattern is seen in the third column. However, in the second column, with A, the pattern is more cluttered, since YACT achieves a higher SRC than both EACT and PACT. Moreover, adding either E or P to YACT results in a performance decrease, but adding all of them in YEPACT gives the highest performance in this column. Still, the combination of E and P in EPACT achieves the same performance as YACT. Lastly, even though both YEACT and YPACT have lower performance than YACT, adding all three visual semantics in YEPACT gives a small increase in performance. These observations are validated by the fourth column. However, no performance gain is obtained by combining YIACT and IEPACT into YIEPACT; the three models achieve the highest observed SRC of 0.51. In sum, we see how objects together with author features are very powerful, but also how the combination of concepts and scenes is indeed powerful with and without author features.
Figure 4: Average absolute SHAP value for the top-30 features, chosen by the highest average absolute SHAP values across all models. The 30 features are: followers, following, posts, followers per post, followers per following, filter, users tagged, has geolocation, language, hashtag count, word count, body length, posted day, posted week day, posted hour, IIPA, indoor/outdoor, person, and the visual concepts jinrikisha, lakeside, monastery, packet, website, amusement arcade, balcony interior, bridge, child's room, matte, patio, and plastic.
In the following, we will investigate the features affecting the prediction the most by finding the top-30 most prominent features based on the average absolute SHAP value across all models. More precisely, we aggregate the average absolute SHAP value for each feature across all models and then divide by the number of times that feature is present in the models. In Figure 4, the top-30 features are coloured by feature group. The features hashtag count and posted day have the largest average absolute SHAP values and thereby affect a prediction the most. The author features followers and followers per post come right after; more interestingly, note how the two computed ratios followers per post and followers per following are both high and actually affect the prediction more than the two features following and posts. The three temporal features all have a high effect on the prediction, which shows that both the day of the week and the time of day are important information for predicting popularity. Among the visual features, IIPA and person have the largest effect and are both comparable to the social features. Yet, in general, all the visual features have a smaller effect than the social features.
The social features are explained using the SHAP values individually. We summarise the SHAP values of each feature in two numbers, computed as the mean of all positive and the mean of all negative SHAP values separately. In this way, we preserve both the sign and the magnitude of the SHAP values, whereas SHAP values of different signs would cancel each other out in a regular mean calculation. In Figure 5, the positive and negative mean SHAP values for the social features are visualised.

Figure 5: Average positive and negative SHAP values for the most prominent social features, displayed for each model.
Hashtag Count and Posted Day Are Good Discriminators. In Figure 5, the base model CT, consisting of content and temporal features, indicates that hashtag count and posted day are good discriminators. The reason is two-fold: 1) they have high positive and negative means (i.e. the bars are large), and 2) the magnitudes of the positive and negative means are similar, meaning that the features can affect a prediction equally in a positive and a negative direction. The feature users tagged also has a high impact on the prediction, but the effect is mainly in a positive direction, since the positive mean is of larger magnitude than the negative mean; consequently, it is not as good a discriminator as the two aforementioned features. Regarding the size of the bars, trends similar to the top features in Figure 4 are observed in the figure.
Language is Important with Visual Features. If we consider the first column in Figure 5, only small changes are observed down the rows. The size of the bars decreases slightly as we add visual features, e.g. word count is larger in CT than in YEPCT. Adding Y only seems to have very small effects on the bars and does not change the relative distribution, whereas adding E and P gives an increase in the positive mean of language. In fact, all the features are smaller in YEPCT than in CT except language, which is slightly higher. A similar trend is observed in the last two columns, where IIPA (I) is added to CT and ACT. I also affects the positive mean of language in a positive direction, e.g. when comparing CT with ICT. This is also seen for the other rows, though the increase is smaller due to the increase from E and P. This indicates that language is more important when visual semantics and I are added to the model. We hypothesise that the visual predictors of popularity vary across cultures.
The Caption is Less Important with Visual Fea-
tures. If we compare the models in the first row
with the models in the last row in Figure 5, the attri-
bution of the feature word count has decreased. This
indicates a connection between the visual features and
the word count, which suggests that the visual infor-
mation can partly substitute the information in the
word count. Word count is the number of words in
the caption, and thus, we observe how the caption is
less important when visual features are present.
Visual Features Have a Small Impact on Social Features. Overall, only small changes are observed across the models in Figure 5, indicating that the visual features only slightly affect the impact of the social features on a prediction. If we compare the models in the first row and the last row, the feature language has increased and word count has decreased. If we compare ACT with YIEPACT, we observe that the majority of the features have a smaller impact and word count is very small, but the author features followers and followers per post are unchanged, and the content feature language is actually larger. This suggests that author features are important regardless of the visual information, that language might capture some sort of user segment, and that word count and visual information are highly related.
Table 2: Ablation study with feature groups removed. Performance metrics are given by Spearman's rank correlation (SRC) and root mean square error (RMSE) together with the training and prediction time. All standard deviations with respect to RMSE and SRC are below 0.002.

Group removed    SRC    RMSE    Train (s)    Pred. (ms)
Author           0.463  1.202   1075         186
EfficientNet     0.509  1.158   421          1055
Places365        0.509  1.158   772          1111
YOLOv3           0.510  1.157   1170         1051
IIPA             0.509  1.159   1105         1104
Figure 6: Performance for the models achieving an SRC higher than 0.5. The boxes show ±2 standard deviations. (A) Spearman's rank correlation (SRC) and Root Mean Square Error (RMSE). (B) R² and training time.
The performance of the models is quantified using Spearman's rank correlation (SRC), Root Mean Square Error (RMSE), R², and training time. In the top panel of Figure 6, the performance of the 16 best models is shown with ±2 standard deviations. As expected, the SRC and RMSE are inversely related. The standard deviations of performance between cross-validation folds form a conservative (too large) estimate of the standard error of the mean. YIACT has the highest SRC, but also a high standard deviation, while the model IEPACT has similar performance but is more robust. If we also include the R² and the training time from the bottom panel of Figure 6, we note that the models ACT, YACT, IACT, and YIACT are fast, with training times below 200 seconds. All the other models have more than four times as many features, which is reflected in the increased training time. If R² is also taken into account, YIACT has the highest values, but IACT has similar performance with a much lower standard deviation. The model IACT has a low training time, a high R², and a high SRC with a small confidence interval. Hence, it is a good candidate for a strong, robust, and efficient baseline for Instagram popularity prediction. If we accept the somewhat larger training time (about 20 minutes), the model IEPACT is an excellent and robust candidate with strong, consistent SRC performance across cross-validation folds.
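A minimal sketch of how these metrics can be computed for one cross-validation fold is given below; the predictions and targets are assumed to be on the log-normalised like scale.

```python
# Sketch: the evaluation metrics reported in Figure 6 and Table 3 for one fold.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    return {
        "SRC": spearmanr(y_true, y_pred).correlation,
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "R2": r2_score(y_true, y_pred),
    }
```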
For a real-time application, the prediction time is a central metric. The prediction time includes the feature extraction, and we assume that if you want to predict the popularity of a new post, you have the image, content, and temporal information at hand. The author features are crawled from the web, and the visual features are obtained via a forward pass through the networks. In parallel, all LightGBM models run in less than one tenth of a millisecond. In Table 2 and Table 3, the prediction time for a single evaluation of a post is reported.
Table 3: Quantitative evaluation of all models given by Spearman's rank correlation (SRC), root mean square error (RMSE), R², and the prediction time in milliseconds. Abbr.: author (A), content (C), temporal (T), EfficientNet (E), Places365 (P), YOLO (Y), and IIPA (I).

Model    SRC µ  σ    RMSE µ  σ    R² µ  σ    Time (ms)
T .261 .001 1.306 .001 .086 .001 <1
C .305 .002 1.291 .001 .108 .001 <1
A .349 .002 1.266 .001 .141 .001 935
CT .417 .001 1.231 .001 .188 .000 <1
AT .425 .001 1.219 .002 .204 .001 936
AC .426 .000 1.216 .001 .207 .000 936
CT
YCT .433 .000 1.222 .001 .200 .000 71
ICT .435 .001 1.219 .001 .204 .000 18
YICT .444 .001 1.214 .001 .211 .001 88
PCT .452 .001 1.210 .001 .216 .001 33
ECT .455 .000 1.208 .001 .219 .001 89
YPCT .456 .000 1.207 .002 .220 .001 103
IPCT .456 .000 1.206 .001 .221 .001 50
YECT .457 .000 1.206 .002 .221 .001 159
IECT .458 .001 1.205 .001 .222 .000 106
YIPCT .459 .000 1.204 .001 .224 .001 120
EPCT .460 .001 1.205 .001 .223 .000 99
YIECT .461 .000 1.204 .001 .224 .001 176
YEPCT .461 .000 1.204 .002 .224 .001 169
IEPCT .462 .001 1.202 .001 .226 .001 116
YIEPCT .463 .000 1.202 .001 .227 .001 186
ACT
ACT .501 .000 1.163 .001 .276 .000 936
PACT .504 .001 1.162 .001 .277 .001 968
EACT .505 .001 1.162 .002 .277 .001 1024
IPACT .505 .000 1.160 .001 .279 .001 985
YEACT .506 .001 1.160 .002 .279 .001 1094
YPACT .506 .001 1.160 .002 .279 .002 1038
IEACT .507 .001 1.160 .002 .280 .001 1041
YACT .508 .001 1.158 .002 .282 .001 1006
EPACT .508 .000 1.159 .002 .280 .001 1034
YIPACT .508 .000 1.158 .002 .282 .001 1055
IACT .508 .001 1.156 .001 .284 .001 954
YEPACT .509 .001 1.159 .002 .281 .001 1104
YIEACT .509 .001 1.158 .001 .282 .001 1111
IEPACT .510 .000 1.157 .002 .283 .001 1051
YIEPACT .510 .001 1.157 .002 .283 .002 1121
YIACT .510 .003 1.155 .002 .285 .003 1023
5 CONCLUSIONS
In this paper, we revisit the problem of content popu-
larity ranking on Instagram with a general population-
based approach and no prior information about the
content’s authors. We use a multi-modal approach
to popularity prediction and focus on enhancing
the visual modality’s predictive power alongside the
model’s explainability, scalability, and robustness.
We design a comprehensive ablation study including
transfer learning to represent visual semantics with
the explainable features concepts, scenes, and objects.
The approach is strong, since we show robustness and
consistency across models that take advantage of the
synergy between the visual semantics. We show that
the approach is explainable at both a high level with feature groups and a low level with individual fea-
tures. We use Shapley analysis to quantify each fea-
ture’s impact on the predictions. We calculate Shap-
ley values for every prediction, before aggregating the
explanations to provide novel attributions for all the
visual semantics detected. In particular, we find that
object detection works better with author features,
and language is important with visual semantics.
Finally, we recommend two strong, explainable
and scalable baselines which also inform a new
lower limit in popularity ranking on Instagram, with
a population-based approach and without prior author information. We can lower-bound the predictability
as Spearman’s rank correlation (SRC) > 0.5. Based
on the many combinations of multi-modal models,
we make the following recommendations: If train-
ing time is of importance, we recommend the model
(IACT) that combines author, content, and temporal features with a one-dimensional measure of intrinsic image popularity. This model trains in less than three min-
utes. If the focus is on robust performance and less on
time to train, we recommend the model (IEPACT) that
combines the social features with intrinsic image pop-
ularity and visual embeddings from EfficientNet and
Places, which is about seven times slower in training.
However, the latter model shows both strong and con-
sistent SRC across cross-validation folds.
Immediate avenues of future inquiry include ex-
periments to explain how the impact of visual seman-
tics varies across languages or investigating why ob-
ject detection performs better with author informa-
tion. Separately, it would be of high interest to apply
the proposed visual feature extraction across popula-
tion segments and social media platforms. Eventually,
we hope to inspire further applications of explainable
transfer learning to computational social science at
scale.
ACKNOWLEDGMENTS
This project is supported by the Innovation Fund
Denmark, the Danish Center for Big Data Analytics
driven Innovation (DABAI) and the Business Appli-
cations Group within Microsoft.
REFERENCES
Almgren, K., Lee, J., and Kim, M. (2016). Prediction
of image popularity over time on social media net-
works. 2016 Annual Connecticut Conference on In-
dustrial Electronics, Technology and Automation, CT-
IETA 2016, pages 1–6.
Caesar, H., Uijlings, J., and Ferrari, V. (2018). COCO-Stuff:
Thing and Stuff Classes in Context. Proceedings of
the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, pages 1209–1218.
Cappallo, S., Mensink, T., and Snoek, C. G. (2015). Latent
factors of visual popularity prediction. ICMR 2015 -
Proceedings of the 2015 ACM International Confer-
ence on Multimedia Retrieval, pages 195–202.
Chen, G., Kong, Q., Xu, N., and Mao, W. (2019a). NPP:
A neural popularity prediction model for social media
content. Neurocomputing, 333:221–230.
Chen, J., Liang, D., Zhu, Z., Zhou, X., Ye, Z., and Mo,
X. (2019b). Social media popularity prediction based
on visual-textual features with XGboost. MM 2019
- Proceedings of the 27th ACM International Confer-
ence on Multimedia, pages 2692–2696.
Chopra, A., Dimri, A., and Rawat, S. (2019). Comparative
Analysis of Statistical Classifiers for Predicting News
Popularity on Social Web. 2019 International Confer-
ence on Computer Communication and Informatics,
ICCCI 2019, pages 1–8.
Deza, A. and Parikh, D. (2015). Understanding image viral-
ity. Proceedings of the IEEE Computer Society Con-
ference on Computer Vision and Pattern Recognition,
07-12-June:1818–1826.
Ding, K., Ma, K., and Wang, S. (2019a). Intrinsic image
popularity assessment. MM 2019 - Proceedings of the
27th ACM International Conference on Multimedia,
(October):1979–1987.
Ding, K., Wang, R., and Wang, S. (2019b). Social media
popularity prediction: A multiple feature fusion ap-
proach with deep neural networks. MM 2019 - Pro-
ceedings of the 27th ACM International Conference
on Multimedia, pages 2682–2686.
Gayberi, M. and Oguducu, S. G. (2019). Popularity predic-
tion of posts in social networks based on user, post
and image features. 11th International Conference
on Management of Digital EcoSystems, MEDES 2019,
pages 9–15.
Gelli, F., Uricchio, T., Bertini, M., Bimbo, A. D., and
Chang, S. F. (2015). Image popularity prediction
in social media using sentiment and context features.
MM 2015 - Proceedings of the 2015 ACM Multimedia
Conference, pages 907–910.
He, Z., He, Z., Wu, J., and Yang, Z. (2019). Feature con-
struction for posts and users combined with lightgBM
for social media popularity prediction. MM 2019 -
Proceedings of the 27th ACM International Confer-
ence on Multimedia, pages 2672–2676.
Hidayati, S. C., Chen, Y. L., Yang, C. L., and Hua,
K. L. (2017). Popularity meter: An influence- and
aesthetics-aware social media popularity predictor.
MM 2017 - Proceedings of the 2017 ACM Multime-
dia Conference, pages 1918–1923.
Hsu, C. C., Lee, J. Y., Kang, L. W., Zhang, Z. X., Lee,
C. Y., and Wu, S. M. (2019). Popularity prediction
of social media based on multi-modal feature mining.
MM 2019 - Proceedings of the 27th ACM Interna-
tional Conference on Multimedia, pages 2687–2691.
Kang, P., Lin, Z., Teng, S., Zhang, G., Guo, L., and Zhang,
W. (2019). Catboost-based framework with additional
user information for social media popularity predic-
tion. MM 2019 - Proceedings of the 27th ACM In-
ternational Conference on Multimedia, pages 2677–
2681.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma,
W., Ye, Q., and Liu, T. Y. (2017). LightGBM: A
highly efficient gradient boosting decision tree. Ad-
vances in Neural Information Processing Systems,
2017-Decem(Nips):3147–3155.
Khosla, A., Das Sarma, A., and Hamid, R. (2014). What
makes an image popular? In Proceedings of the 23rd
international conference on World wide web - WWW
’14, pages 867–876, New York, New York, USA.
ACM Press.
Kowalczyk, D. K. and Hansen, L. K. (2020). The complex-
ity of social media response: Statistical evidence for
one-dimensional engagement signal in twitter. In 12th
International Conference on Agents and Artificial In-
telligence, pages 918–925. SciTePress.
Kowalczyk, D. K. and Larsen, J. (2019). Scalable privacy-
compliant virality prediction on twitter. In AAAI-19
Workshop on Affective Content Analysis & CL-AFF
Happiness Shared Task, pages 12–27. CEUR-WS.
Larsson, A. O. (2019). Skiing all the way to the polls: Ex-
ploring the popularity of personalized posts on politi-
cal Instagram accounts. Convergence, 25(5-6):1096–
1110.
Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. CoRR, abs/1405.0312.
Lundberg, S. M. and Lee, S. I. (2017). A unified ap-
proach to interpreting model predictions. Advances
in Neural Information Processing Systems, 2017-
December(Section 2):4766–4775.
Mazloom, M., Pappi, I., and Worring, M. (2018). Cate-
gory Specific Post Popularity Prediction. In MultiMe-
dia Modeling, volume 10704 LNCS, pages 594–607.
Springer International Publishing.
Mazloom, M., Rietveld, R., Rudinac, S., Worring, M., and
Van Dolen, W. (2016). Multimodal popularity predic-
tion of brand-related social media posts. MM 2016
- Proceedings of the 2016 ACM Multimedia Confer-
ence, pages 197–201.
McParlane, P. J., Moshfeghi, Y., and Jose, J. M. (2014).
”Nobody comes here anymore, it’s too crowded”; pre-
dicting image popularity on Flickr. ICMR 2014 -
Proceedings of the ACM International Conference on
Multimedia Retrieval 2014, pages 385–391.
Mishra, S., Rizoiu, M. A., and Xie, L. (2016). Fea-
ture driven and point process approaches for popular-
ity prediction. International Conference on Informa-
tion and Knowledge Management, Proceedings, 24-
28-October-2016:1069–1078.
Ortis, A., Farinella, G. M., and Battiato, S. (2019). Predict-
ing Social Image Popularity Dynamics at Time Zero.
IEEE Access, 7:171691–171706.
Overgoor, G., Mazloom, M., Worring, M., Rietveld, R., and
Van Dolen, W. (2017). A spatio-temporal category
representation for brand popularity prediction. ICMR
2017 - Proceedings of the 2017 ACM International
Conference on Multimedia Retrieval, pages 233–241.
Patterson, G. and Hays, J. (2012). SUN attribute database:
Discovering, annotating, and recognizing scene at-
tributes. Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recogni-
tion, pages 2751–2758.
Redmon, J. and Farhadi, A. (2018). YOLOv3: An Incremental Improvement. CoRR, abs/1804.02767.
Rietveld, R., van Dolen, W., Mazloom, M., and Worring, M.
(2020). What You Feel, Is What You Like Influence
of Message Appeals on Customer Engagement on In-
stagram. Journal of Interactive Marketing, 49:20–53.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., Berg, A. C., and Fei-Fei, L. (2015). Ima-
geNet Large Scale Visual Recognition Challenge. In-
ternational Journal of Computer Vision, 115(3):211–
252.
Song, C., Qu, Z., Blumm, N., and Barabási, A.-L. (2010). Limits of predictability in human mobility. Science, 327(5968):1018–1021.
Tan, M. and Le, Q. V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. CoRR, abs/1905.11946.
Wang, K., Bansal, M., and Frahm, J. M. (2018). Retweet
wars: Tweet popularity prediction via dynamic mul-
timodal regression. Proceedings - 2018 IEEE Win-
ter Conference on Applications of Computer Vision,
WACV 2018, 2018-Janua:1842–1851.
Wang, R., Yang, F., and Haigh, M. M. (2017). Let me take
a selfie: Exploring the psychological effects of post-
ing and viewing selfies and groupies on social media.
Telematics and Informatics, 34(4):274–283.
Wu, B., Cheng, W. H., Zhang, Y., Huang, Q., Li, J., and
Mei, T. (2017). Sequential prediction of social me-
dia popularity with deep temporal context networks.
IJCAI International Joint Conference on Artificial In-
telligence, 0:3062–3068.
Wu, B., Cheng, W. H., Zhang, Y., and Mei, T. (2016). Time
matters: Multi-scale temporalization of social media
popularity. MM 2016 - Proceedings of the 2016 ACM
Multimedia Conference, pages 1336–1344.
Zhang, Y. and Jatowt, A. (2019). Image tweet popu-
larity prediction with convolutional neural network.
In Advances in Information Retrieval, volume 11437
LNCS, pages 803–809. Springer International Pub-
lishing.
Zhang, Z., Chen, T., Zhou, Z., Li, J., and Luo, J. (2018).
How to Become Instagram Famous: Post Popularity
Prediction with Dual-Attention. Proceedings - 2018
IEEE International Conference on Big Data, Big Data
2018, pages 2383–2392.
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Tor-
ralba, A. (2018). Places: A 10 Million Image
Database for Scene Recognition. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
40(6):1452–1464.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A. (2014). Learning deep features for scene recognition using Places database. In NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems, 1:487–495.