Client-driven Lightweight Method to Generate Artistic Media for
Feature-length Sports Videos
Ghulam Mujtaba 1,a, Jaehyuk Choi 2,b and Eun-Seok Ryu 3,c
1 C-JeS Gulliver Studios, Seoul, Republic of Korea
2 Department of Software, Gachon University, Seongnam, Republic of Korea
3 Department of Computer Science Education, Sungkyunkwan University (SKKU), Republic of Korea
a https://orcid.org/0000-0001-9244-5346
b https://orcid.org/0000-0002-4367-3913
c https://orcid.org/0000-0003-4894-6105
Keywords: Artistic Media, Animated GIFs, Thumbnail Containers, Client-driven.
Abstract:
This paper proposes a lightweight methodology to attract users and increase views of videos through per-
sonalized artistic media, i.e., static thumbnails and animated Graphics Interchange Format (GIF) images. The
proposed method analyzes lightweight thumbnail containers (LTC) using computational resources of the client
device to recognize personalized events from feature-length sports videos. In addition, instead of processing
the entire video, small video segments are used in order to generate artistic media. This makes our approach
more computationally efficient compared to existing methods that use the entire video data. Further, the
proposed method retrieves and uses thumbnail containers and video segments, which reduces the required
transmission bandwidth as well as the amount of locally stored data that are used during artistic media genera-
tion. After conducting experiments on the NVIDIA Jetson TX2, the computational complexity of our method
was 3.78 times lower than that of the state-of-the-art method. To the best of our knowledge, this is the first
technique that uses LTC to generate artistic media while providing lightweight and high-performance services
on resource-constrained devices.
1 INTRODUCTION
Over the past few years, various types of streaming
platforms in the form of video on demand (VoD) and
360-degree live streaming services have become very
popular. In comparison to traditional cable networks
that users can view on television, video streaming is
ubiquitous and provides the flexibility of watching
video content on various devices. In most cases, such
services have very large video catalogs for users to
browse and watch anytime. Yet, it is often challenging
for users to find relevant content given the sheer volume
of available videos and their limited time. This considerable growth
has increased the need for technologies that enable
users to browse the extensive and ever-growing con-
tent collections, and quickly retrieve the content of
interest. The development of new techniques for gen-
erating artistic media (i.e., static thumbnails and an-
imated Graphics Interchange Format (GIF) images)
is an aspect of this demand (Song et al., 2016; Yuan
et al., 2019; Xu et al., 2021).
Figure 1: Artistic media in the form of static thumbnails and animated GIF images are used in popular streaming platforms to highlight recommended videos. Animated GIFs are played whenever a user hovers over the static thumbnail (above). Generally, the most preferred events are selected as static thumbnails according to the video category to attract users and get more views (below).
Almost every streaming platform uses artistic me-
dia to provide a quick glimpse of video content. The
static thumbnail provides viewers with a quick video
preview. Meanwhile, the animated GIF delivers a
condensed preview of the video for 3–15 sec (Bakhshi
et al., 2016). Figure 1 illustrates artistic media for
sports videos: (1) an animated GIF played when the
user hovers the mouse on a static thumbnail (above),
and (2) the most preferred frames are selected as a
static thumbnail from the feature-length video. View-
ers often decide whether to watch or skip a video
based on its static thumbnail and animated GIF. Due
to their importance, there is a growing interest in auto-
matically creating compelling and expressive artistic
media.
Click-through rate (CTR) is a significant met-
ric for boosting the popularity of newly published
feature-length sports videos on streaming platforms.
However, many platforms (such as YouTube) provide
only one type of artistic media for a given video with-
out considering user preferences. In recent studies, it
has been shown that personalized artistic media could
play a significant role in video selection and improve
the CTR of videos (Mujtaba et al., 2021). However,
creating artistic media manually is time-consuming,
and their qualities are not guaranteed. Their extensive
adoption and prevalence have increased the demand
for methods that can automatically generate personal-
ized artistic media from feature-length sports videos.
Nowadays, some popular video streaming sites
are investigating server-side technology solutions for
automatically generating personalized artistic media
(Xu et al., 2021). There are four key concerns when it
comes to server-based solutions: (i) due to finite com-
puting capabilities, personalized artistic media may
not be simultaneously generated in a timely manner
for multiple users, (ii) consumer privacy is prone to
invasions in a personalized approach, (iii) user behavior
must be tracked on the server to drive recommendation
algorithms, and (iv) current solutions process the entire
video (all frames) to generate artistic media, which
increases the overall computation time and demands
significant computational resources. Since personalization is one of the key el-
ements for early media content adoption, we focused
on the personalization and lightweight processing as-
pects of artistic media generation.
Taking the aforementioned observations into ac-
count, we propose an innovative and computationally
efficient client-driven method that generates personal-
ized artistic media for multiple users simultaneously.
Considering that the computational resources are lim-
ited, we use lightweight thumbnail containers (LTC)
of the corresponding feature-length sports video in-
stead of processing the entire video (frames). Since
every sports video has key events (e.g., penalty shots
in soccer videos), we utilize LTC to detect events that
reduce the overall processing time. Therefore, we
aim to reduce the computational load and process-
ing time while generating personalized artistic media
from feature-length sports videos. Twelve publicly
broadcasted sports videos were analyzed to estimate
the effectiveness of the proposed method. The main
contributions of this research are summarized as fol-
lows:
• We propose a new lightweight client-driven technique to automatically create artistic media for feature-length sports videos. To the best of our knowledge, this is the first work in the literature to address this challenging problem.
• To support the study, we have collected twelve feature-length sports videos with approximately 1,467.2 minutes duration, in six different sports categories, namely, baseball, basketball, boxing, cricket, football, and tennis.
• We designed an effective 2D Convolutional Neural Network (CNN) model for LTC analysis that can classify personalized events.
• Extensive quantitative and qualitative analyses were conducted using feature-length sports videos. The results indicated that the computational complexity of the proposed method is 3.78 times lower than that of the state-of-the-art approach on the resource-constrained NVIDIA Jetson TX2 device (described in Section 4.2). Additionally, qualitative evaluations were conducted in collaboration with nine participants (described in Section 4.3).
The remainder of this paper is organized as fol-
lows. Section 2 provides an overview of existing liter-
ature. Section 3 describes the proposed client-driven
method. Section 4 discusses the qualitative and quan-
titative results. Finally, the concluding remarks of this
study are presented in Section 5.
2 RELATED WORK
2.1 Animated GIF Generation Methods
Animated GIF images were first created in 1987, and
have been widely used in recent years. Specifically,
in the study presented in (Bakhshi et al., 2016), ani-
mated GIFs were reported to be more attractive than
other forms of media, including photos and videos on
social media platforms such as Tumblr. They identi-
fied some of the GIF features that contribute to fas-
cinating users, such as animations, storytelling capa-
bilities, and emotional expression. In addition, sev-
eral studies (Chen et al., 2017; Jou et al., 2014) have
devised methods for predicting viewers’ sentiments
Figure 2: Semantic architecture of proposed client-driven LTC artistic media generation method. First, the thumbnail con-
tainers of the corresponding video are downloaded from the streaming server: (1) denotes the thumbnail container analyzer
module that extracts thumbnails from the acquired LTC. The extracted thumbnails are analyzed using the proposed thumbnail
container analyzer module to obtain artistic static thumbnails; (2) denotes the animated GIF generation module, which uses
timestamp information from artistic thumbnails that are used to request and download specific video segments of the corre-
sponding video from the streaming server. The artistic animated GIFs are generated from downloaded video segments.
towards animated GIFs. Despite the viewer engage-
ment, in (Jiang et al., 2018), it was concluded that
viewers may have diverse interpretations of animated
GIFs used in communication. They predicted fa-
cial expressions, histograms, and aesthetic features;
then, they compared them to the study in (Jou et al.,
2014) to find the most appropriate video features for
expressing useful emotions in GIFs. In another ap-
proach presented in (Liu et al., 2020), sentiment anal-
ysis was used to estimate annotated GIF text and vi-
sual emotion scores. From an aesthetic perspective,
in (Song et al., 2016), frames were picked by measur-
ing various subjective and objective metrics of video
frames (such as visual quality and aesthetics) to gen-
erate GIFs. In a recent study described in (Mujtaba
et al., 2021), the authors proposed a client-driven
method to mitigate privacy issues while designing a
lightweight method for streaming platforms to cre-
ate GIFs. Instead of adopting full-length video con-
tent in their method, they used an acoustic feature to
reduce the overall computational time for resource-
constrained devices.
2.2 Video Understanding Methods
Video understanding is a prominent field in com-
puter vision research. Action recognition (Carreira
and Zisserman, 2017) and temporal action localiza-
tion (Farha and Gall, 2019) are the two main issues
addressed in the literature pertaining to video under-
standing. Action recognition involves recognizing
events from a cropped video clip, which is accom-
plished through various methods such as two-stream
networks (Simonyan and Zisserman, 2014) and recur-
rent neural networks (RNNs) (Donahue et al., 2015).
Another popular action recognition method uses a
two-stream structure to extend a 3D CNN (Carreira
and Zisserman, 2017). It is obtained by pretraining
a 2D CNN model using the ImageNet dataset (Deng
et al., 2009) and extending the 2D CNN model to
a 3D CNN by repeated weighting in a depth-wise
manner. These features are local descriptors that are
obtained using the bag-of-words method or global
descriptors retrieved by CNNs. Most of the meth-
ods adopt temporal segments (Yang et al., 2019) to
prune and classify videos. Recent research studies
have focused on exploiting the context information
to further improve event recognition. Context rep-
resents and utilizes both spatio-temporal information
and attention, which helps in learning adaptive confi-
dence scores to utilize surrounding information (Heil-
bron et al., 2017). Other methods utilize time inte-
gration and motion-aware sequence learning such as
long short-term memory (LSTM) (Agethen and Hsu,
2019). Attention-based models have also been used
to improve the integrated spatio-temporal information
(Peng et al., 2018).
In comparison to the proposed method, HECATE
(Song et al., 2016) is the most similar approach as
it can generate artistic media, i.e., static thumb-
nails and animated GIFs. Lightweight client-driven
techniques for generating artistic media are still in
the early stages of development, and more effective
methods are needed to bridge the semantic gap be-
tween video understanding and personalization. Ad-
ditionally, most modern client devices have limited
computational capabilities. Moreover, inspecting a
full-length video to create artistic media is time-
consuming and not reasonable for real-time solutions
(Song et al., 2016).
Figure 3: Proposed architecture of the 2D convolutional neural network. Each extracted thumbnail image is analyzed as input, and the most preferred events detected in the thumbnails are produced as the classification output.
3 PROPOSED METHOD
This paper proposes new techniques for advancing
research on generating anticipated artistic media us-
ing a client-driven approach. The proposed method
uses LTC, which are widely used in streaming plat-
forms for timeline manipulation of videos (Mujtaba
and Ryu, 2020), instead of the entire video to analyze
personalized events. Subsequently, artistic media is
created within an acceptable processing time on
resource-constrained client devices such as NVIDIA
Jetson TX2 (i.e., an embedded AI computing device).
Figure 2 depicts the semantic architecture of the
proposed artistic media method. There are two phases
of generating artistic media. Each phase processes
and generates a different artistic media type. In the
first phase, the LTC is analyzed using the Thumbnail
Container Analyzer module, and artistic thumbnails
are obtained. The HTTP Live Streaming (HLS) server
is configured to obtain the LTC (Mujtaba and Ryu,
2020). The thumbnail containers are collected indi-
vidually from the source video using FFmpeg (FFm-
peg, 2020) in the streaming server. Every thumbnail
container has 25 thumbnails; typically, the first frame
of each one-second interval of the video is selected as a thumbnail. The se-
quence of 25 thumbnails produces a single thumbnail
container. Thumbnails are merged into 5 × 5 contain-
ers according to playtime. A single thumbnail and
thumbnail container depict the playback time of the
corresponding video for 1 and 25 sec, respectively.
The entire duration of the source video is covered
in sequences of thumbnail containers. The size of
each thumbnail and thumbnail container is fixed at
160 × 90 and 800 × 450 pixels, respectively (like in
the work presented in (Mujtaba and Ryu, 2020)).
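For illustration, such 5 × 5 thumbnail containers could be produced on the streaming server with FFmpeg roughly as follows; the exact filter chain and file naming are our assumptions rather than the command used in (Mujtaba and Ryu, 2020).

import subprocess

def generate_ltc(video_path: str, out_pattern: str = "ltc_%03d.jpg") -> None:
    # One 160x90 thumbnail per second of video, tiled into 5x5 (800x450) containers.
    subprocess.run(["ffmpeg", "-i", video_path,
                    "-vf", "fps=1,scale=160:90,tile=5x5",
                    "-q:v", "2",          # JPEG quality
                    out_pattern],
                   check=True)

generate_ltc("belgium_vs_japan.mp4")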
The timestamp information in the first phase is
used to generate the artistic animated GIF from the
given video segment in the second phase of the pro-
posed method. The second phase consists of the
Animated GIF Generation module. The proposed
method and its components are described in the fol-
lowing subsections.
3.1 Thumbnail Container Analyzer
Module
The thumbnail container analyzer module consists of
three segments, i.e., a backbone feature extractor, an
attention module, and a classifier as shown in Fig-
ure 3. The backbone feature extractor utilizes a pre-
trained CNN model (i.e., Xception model (Chollet,
2017) pre-trained on ImageNet dataset (Deng et al.,
2009)) to extract high-level semantic features from
the thumbnails. The features are then fed to an at-
tention module to extract contextual information from
the high-level semantic features. Specifically, vortex
pooling was used as an attention module to enhance
the efficiency of the proposed neural network (Xie
et al., 2018). The module uses multi-branch convo-
lution with dilation rates to aggregate contextual in-
formation, making it more effective. The aggregated
features are then fed to a dense block-based classifier
(Shen et al., 2017) to classify the events.
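For concreteness, a minimal Keras sketch of this module is given below: an ImageNet-pretrained Xception backbone, a multi-branch dilated-convolution block standing in for vortex pooling, and a dense classifier head. The layer widths, dilation rates, and helper name are illustrative assumptions rather than the exact configuration used in the experiments.

import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import Xception

NUM_EVENTS = 10  # ten UCF-101 events selected in Section 4.1.1

def build_thumbnail_analyzer(input_shape=(244, 244, 3)):
    # ImageNet-pretrained Xception backbone as the feature extractor.
    backbone = Xception(include_top=False, weights="imagenet", input_shape=input_shape)
    x = backbone.output

    # Context aggregation: parallel dilated convolutions (vortex-pooling style).
    branches = [
        layers.Conv2D(256, 3, padding="same", dilation_rate=r, activation="relu")(x)
        for r in (1, 3, 9)
    ]
    x = layers.Concatenate()(branches)

    # Dense classifier head.
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(NUM_EVENTS, activation="softmax")(x)
    return Model(backbone.input, outputs)

model = build_thumbnail_analyzer()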
The thumbnail container analyzer module ana-
lyzes each extracted thumbnail individually based
on the event(s) and according to user preferences.
The proposed method selects a personalized artistic
thumbnail from the analyzed LTC. The artistic thumb-
nails are selected based on a threshold that is set
to maintain generated media quality. A text-based
timestamp information file is generated for all se-
lected personalized artistic thumbnails obtained from
the LTC for the artistic GIF generation process. The
data inside the artistic thumbnail file are ranked in
chronological order.
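A minimal sketch of this selection step is given below. The 80% threshold matches the setting in Section 4.1.1; the JSON timestamp format, the mapping from thumbnail index to segment number, and the helper names are illustrative assumptions.

import json
import numpy as np

THRESHOLD = 0.80  # confidence required to keep a thumbnail as "artistic"

def select_artistic_thumbnails(model, thumbnails, preferred_events):
    # thumbnails: array of shape (N, 244, 244, 3); each thumbnail corresponds to
    # one second of playback and has been resized from its native 160x90 size.
    probs = model.predict(thumbnails, verbose=0)
    selected = []
    for second, p in enumerate(probs):
        event = int(np.argmax(p))
        if event in preferred_events and p[event] >= THRESHOLD:
            selected.append({"timestamp_sec": second,
                             "segment": second // 25,   # assumed 25-second segments
                             "event": event,
                             "score": float(p[event])})
    # Ranked in chronological order, as described above.
    return sorted(selected, key=lambda s: s["timestamp_sec"])

def write_timestamp_file(selected, path="artistic_thumbnails.json"):
    with open(path, "w") as f:
        json.dump(selected, f, indent=2)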
3.2 Animated GIF Generation Module
The animated GIF generation module examines the
segment numbers from the text-based timestamp file
generated from the detected thumbnails. This infor-
mation is utilized to obtain the corresponding seg-
ments from the HLS server to create an animated GIF
(Mujtaba and Ryu, 2020). The proposed method uses
the first 3 sec of the segment in the animated GIF
generation process. Even though the duration of all
generated GIFs is fixed, it should be noted that this
approach is also applicable in the case of generating
a GIF with variable length. Algorithm 1 depicts the
processing steps required to generate a GIF from a
video with the proposed method.
Data: input thumbnail containers LTC
    N: number of thumbnails T inside the thumbnail containers LTC
Initialization: personalized events P; segments S; threshold = 80
Main loop:
while i < N do
    Extract T from LTC
    determineEvents(T, P, threshold)
    Identify the segment numbers S from the text-based file
    Download segments S
    Generate animated GIFs from S
end
Function determineEvents(T, P, threshold):
    Analyze T according to P
    Select artistic T according to threshold
    Prepare a text-based file of the selected T
    return text-based list of selected T
Result: generated artistic media
Algorithm 1: Process to analyze personalized events from thumbnail containers and generate artistic media.
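The client-side loop of Algorithm 1 could look roughly as follows in Python; the HLS URL layout, segment naming, and FFmpeg options are assumptions rather than the exact implementation.

import subprocess
import urllib.request

def download_segment(server: str, video_id: str, segment: int) -> str:
    # Assumed URL layout for the HLS server configured in Section 3.
    url = f"{server}/{video_id}/segment_{segment:05d}.ts"
    local_path = f"segment_{segment:05d}.ts"
    urllib.request.urlretrieve(url, local_path)
    return local_path

def segment_to_gif(segment_path: str, gif_path: str, duration_sec: int = 3) -> None:
    # Only the first 3 seconds of the segment are used, as in the proposed method.
    subprocess.run(["ffmpeg", "-t", str(duration_sec), "-i", segment_path,
                    "-vf", "fps=10,scale=320:-1",   # modest rate/size keeps GIFs small
                    gif_path],
                   check=True)

for entry in selected:                  # `selected` comes from the Section 3.1 sketch
    seg_path = download_segment("http://localhost/hls", "brazil_vs_belgium",
                                entry["segment"])
    segment_to_gif(seg_path, f"artistic_{entry['segment']:05d}.gif")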
4 RESULTS AND DISCUSSION
4.1 Experimental Setup
4.1.1 Video Dataset
The performance evaluation was conducted using
twelve feature-length sports videos obtained from the
YouTube streaming platform. Table 1 provides the de-
tailed descriptions of the selected videos. The videos
are split into six categories based on their content,
namely, baseball, basketball, boxing, cricket, foot-
ball, and tennis. All videos used in the experiments
have a resolution of 640 × 480 pixels. All selected
videos were examined using ten different events se-
lected from the action list provided in the UCF-101
dataset. The ten selected events were basketball, bas-
ketball dunk, boxing punching bag, boxing speed bag,
cricket bowling, cricket shot, punch, soccer juggling,
soccer penalty, and tennis swing. These events were
selected based on the video content. All thumbnails
were selected with a confidence exceeding the 80.0%
threshold, which was set to maintain the artis-
tic media quality. It should be noted that the pro-
posed method is not bound by these events; additional
events can be included according to the video content.
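As an illustration, the ten events and a per-user preference set could be configured on the client as follows; the dictionary layout and variable names are our assumptions.

# Ten UCF-101 events used in the experiments, mapped to class indices.
EVENTS = ["basketball", "basketball dunk", "boxing punching bag", "boxing speed bag",
          "cricket bowling", "cricket shot", "punch", "soccer juggling",
          "soccer penalty", "tennis swing"]
EVENT_INDEX = {name: i for i, name in enumerate(EVENTS)}

# Example: a user who mainly wants penalty and juggling highlights from football videos.
preferred_events = {EVENT_INDEX["soccer penalty"], EVENT_INDEX["soccer juggling"]}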
4.1.2 Baseline Methods
This section describes the baseline methods that are
compared to the proposed artistic media generation
method. As explained in Section 2, some well-known
approaches use the entire video to generate animated
GIFs. The baseline approaches are listed as follows:
HECATE (Song et al., 2016) analyzes aesthetic
features obtained from video frames. The cor-
responding video is stored locally on the device.
During the process, the frames are extracted, tem-
porarily stored, and then analyzed. HECATE
(Song et al., 2016) only supports a fixed duration
and number of GIFs (ten artistic thumbnails and
GIFs for each video are generated for the experi-
mental analysis).
AV-GIF (Mujtaba et al., 2021) analyzes the entire
audio and video files to create animated GIFs. AV-
GIF generates one GIF for each video.
CL-GIF (Mujtaba et al., 2021) uses acoustic fea-
tures to analyze the audio climax portion and em-
ploys segments to generate GIFs. This is the state-
of-the-art client-driven animated GIF generation
method. Here, similar to (Mujtaba et al., 2021),
only one GIF was generated using default param-
eters.
FB-GIF: Instead of analyzing the LTC, this
method uses video frames of the corresponding
video to detect personalized events. Initially,
frames are extracted from the video; then, the
proposed thumbnail container analyzer module is
used to detect the corresponding events from the
extracted frames.
4.1.3 Hardware Configuration
The HLS server and client hardware devices were
configured locally for the experimental evaluations.
Table 1: The details of the feature-length sports videos used for performance analysis in the proposed method.
S/N  Category    Title                                  Playtime     # Frames   # LTC   # Thumb   YouTube ID
1    Football    Belgium vs Japan                       1h 52m 14s   202,036    270     6734      ervkVzoFJ5w
2    Football    Brazil vs Belgium                      1h 50m 50s   199,506    267     6650      5OJfbYQtKtk
3    Basketball  France vs USA                          2h 14m 39s   242,135    324     8079      8YSrNfcKvA0
4    Basketball  USA vs Spain                           2h 53m 54s   260,886    418     10434     l9wUr-CK1Y4
5    Boxing      Davis vs Gamboa                        1h 3m 2s     113,368    152     3782      KZtVQo8lpqY
6    Boxing      Dirrell vs Davis                       47m 29s      85,392     114     2849      sVtzzpvaEjc
7    Baseball    Giants vs Dodgers                      2h 11m 42s   236,827    317     7902      ScmHL8YVM5E
8    Baseball    Giants vs Royals                       2h 36m 50s   282,024    377     9410      YJmwofDYOeo
9    Cricket     India vs Pakistan                      1h 25m 2s    153,065    205     5102      uSGCAJS6qWg
10   Cricket     Peshawar Zalmi vs Islamabad United     2h 17m 15s   205,170    274     6845      uzErZgKuuSM
11   Tennis      Novak Djokovic vs Daniil Medvedev      2h 1m 6s     181,654    291     7266      MG-RjlqyaJI
12   Tennis      Roger Federer vs Rafael Nadal          3h 5m 37s    278,448    446     11137     wZnCcqm g-E
For HLS clients, two end user devices were config-
ured with different hardware configurations: a high
computational resource (HCR) end user device run-
ning on the open-source Ubuntu 18.04 LTS operat-
ing system, and a low computational resource (LCR)
end user machine utilizing an NVIDIA Jetson TX2 de-
vice. The proposed and baseline approaches were set
up separately on HCR and LCR machines. The HLS
server machine was set up with the Windows 10 oper-
ating system and was used in our experiments. Table 2
shows the specifications of the hardware devices used
in all experiments.
Table 2: HLS server and client hardware device specifications.
Device        CPU                       GPU                       RAM
HLS Server    Intel Core i7-8700K       GeForce GTX 1080          32 GB
HCR Client    Quad-core 2.10 GHz        GeForce RTX 2080 Ti       62 GB
LCR Client    Quad ARM A57 / 2 MB L2    256-core NVIDIA Pascal    8 GB
4.2 Objective Evaluation
4.2.1 Event Recognition
The first training/testing partition of the UCF-101
dataset was used as recommended in (Soomro et al.,
2012). Each video was subsampled up to 40 frames
to train the model using the UCF-101 dataset. All im-
ages were pre-processed through cropping their cen-
tral area and resizing them to 244 × 244 pixels. Data
augmentation was applied to reduce overfitting.
Table 3: Performance analysis of proposed thumbnail con-
tainer analyzer module.
CNN Methods Validation Acc (%)
(Sandler et al., 2018) 59.06%
(Huang et al., 2017) 65.31%
(Karpathy et al., 2014) 65.40%
(Chollet, 2017) 68.44%
(Howard et al., 2019) 71.88%
(Mujtaba and Ryu, 2020) 73.75%
(Shu et al., 2018) 76.07%
Proposed 76.25%
The stochastic gradient descent optimizer with decoupled
weight decay (SGDW) was used with a learning rate of 0.01,
a momentum of 0.9, and a weight decay of 0.001 to train the
model (Loshchilov and Hutter, 2017). In the exper-
iment, an early stop mechanism was applied during
the training process with patience of ten. The train-
ing data were provided in mini-batches with a size of
32, and 1,000 iterations were performed to train the
sequence patterns in the data. The Keras toolbox was
used for deep feature extraction, and a GeForce RTX
2080 Ti GPU was used for the implementation. The
method performed 51.32 million floating-point oper-
ations per second with a total number of 25.6 million
trainable parameters.
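A sketch of this training setup in Keras is shown below; the directory layout, the specific augmentation operations, and the use of tensorflow_addons to obtain the decoupled-weight-decay optimizer are assumptions that merely mirror the reported hyperparameters.

import tensorflow_addons as tfa
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Frames subsampled from UCF-101 are assumed to be exported into class-named folders;
# images are resized to 244x244 as described above.
train_gen = ImageDataGenerator(rescale=1.0 / 255,
                               horizontal_flip=True,        # assumed augmentation
                               width_shift_range=0.1,
                               height_shift_range=0.1)
train_data = train_gen.flow_from_directory("ucf101_frames/train",
                                           target_size=(244, 244), batch_size=32)
val_data = ImageDataGenerator(rescale=1.0 / 255).flow_from_directory(
    "ucf101_frames/val", target_size=(244, 244), batch_size=32)

# SGDW: stochastic gradient descent with decoupled weight decay.
# `model` is the thumbnail analyzer sketched in Section 3.1.
model.compile(optimizer=tfa.optimizers.SGDW(weight_decay=1e-3,
                                            learning_rate=0.01, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_accuracy", patience=10,
                           restore_best_weights=True)
model.fit(train_data, validation_data=val_data,
          epochs=1000,                   # early stopping cuts this short in practice
          callbacks=[early_stop])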
To the best of our knowledge, the method pro-
posed in (Mujtaba and Ryu, 2020) is the only one that
uses thumbnail containers to recognize events; it had
the best performance on the UCF-101 dataset when
using thumbnail containers. The proposed thumbnail
container analyzer module performs 2.5% better in
terms of validation accuracy compared to the results
in (Mujtaba and Ryu, 2020). The experimental results
of the proposed and baseline approaches on the UCF-
101 dataset are listed in Table 3. All models that are
used for comparison (Chollet, 2017; Sandler et al.,
2018; Howard et al., 2019; Huang et al., 2017) were
trained on the UCF-101 dataset with similar configu-
rations without adopting an attention module. It can
be seen from the table that the proposed method out-
performs most of the compared methods by a large
margin. This is because of the context aggregation
method utilized in the thumbnail container analyzer
module.
4.2.2 Performance Analysis
Static Thumbnail Generation. To evaluate the performance of the
proposed method, the HECATE (Song et al., 2016)
method was implemented in the HCR device and used
as the baseline method with the default configuration.
Table 4 shows the number of artistic thumbnails and
the computation time required (in minutes) to gener-
ate them using the proposed and baseline methods.
The proposed approach required considerably less
computation time than the HECATE method (Song
et al., 2016). It is important to note that all artistic
thumbnails obtained using the proposed method have
personalized events. Meanwhile, the artistic thumb-
nails are generated using HECATE (Song et al., 2016)
in a one-size-fits-all manner. Figure 4 illustrates artistic thumbnails generated using the proposed and baseline methods.
Table 4: Computation times required (in minutes) to generate artistic thumbnails using baseline and proposed methods on the HCR device.
        HECATE                  Proposed
#Thumb  Total (min)     #Thumb  Total (min)
10       50.19           1849    1.75
10       86.59           2819    1.64
10      130.16           2930    2.01
10      158.84           3376    2.64
10       19.63           1341    0.96
10       13.33           1477    0.72
10       14.05           2295    0.86
GIF Generation on HCR Device. We compared
the computation time required to generate artistic an-
imated GIFs using the proposed and baseline ap-
proaches on the HCR device. Table 5 shows the com-
parison of computation times required (in minutes) to
generate GIFs. The HECATE method (Song et al.,
2016) analyzes every frame in the video and deter-
mines aesthetic features that can be used for gener-
ating GIFs. The AV-GIF (Mujtaba et al., 2021) uses
the entire video and audio clips to generate animated
Figure 4: Artistic thumbnails generated using proposed and
baseline methods.
GIFs. Meanwhile, the CL-GIF (Mujtaba et al., 2021)
uses segments and audio climax portions to generate
animated GIFs. The proposed method uses consid-
erably smaller images (thumbnails) to analyze person-
alized events, which results in a significantly lower
computation time for generating animated GIFs.
GIF Generation on LCR Device. Table 5 shows
the computation times required (in minutes) to cre-
ate artistic GIFs when implementing the baseline and
proposed methods on the LCR device (i.e., Nvidia
NVIDIA Jetson TX2). HECATE (Song et al., 2016),
and AV-GIF (Mujtaba et al., 2021) cannot be used
in practice because they require significant computa-
tional resources owing to requiring the requirement
of lengthy videos. Only the CL-GIF (Mujtaba et al.,
2021) method can be used on the LCR device to gen-
erate a GIF. The overall processing time of the pro-
posed method is significantly shorter than that of CL-
GIF (Mujtaba et al., 2021).
Communication and Storage. The HECATE
(Song et al., 2016) approach requires a locally stored
video file to begin processing. Similarly, the corre-
sponding full-length audio file and video segment
must be downloaded when using the CL-GIF method
to generate a GIF (Mujtaba et al., 2021). However,
the proposed method requires only the LTC to be downloaded
for the same process. For example, the video and
audio sizes of the Brazil vs. Belgium match were 551
and 149 MB, respectively. However, the LTC size
was 22.2 MB for the same video. Thus, the proposed
method significantly reduced the download time
and storage requirements compared to the baseline
methods.
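As a rough back-of-the-envelope check based on these figures, retrieving the 22.2 MB LTC instead of the 551 MB video reduces the data to be transferred by a factor of about 551/22.2 ≈ 25 (roughly 31 if the 149 MB audio track would otherwise also be needed), before accounting for the few short video segments that are later downloaded for GIF generation.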
Overall Computation Analysis. Table 6 shows the
comparison of overall computation for generating
thumbnails and GIFs on HCR and LCR devices, re-
spectively.
Table 5: Computation times required (in minutes) to generate artistic animated GIFs using the baseline and proposed methods.
S/N   HECATE (HCR)   AV-GIF (HCR)   CL-GIF (HCR)   FB-GIF (HCR)   CL-GIF (LCR)   Proposed (LCR)   Proposed (HCR)
1        51.52          21.60           8.16           70.67          38.71          10.08            2.20
2        89.79          21.36           8.56           65.31          36.17           9.85            2.02
3       199.44          26.36           8.22          137.77          45.27          11.94            2.38
4       245.67          47.64          12.55           84.30          58.44          15.42            3.09
5        33.12          10.86           4.87           64.33          21.28           5.63            1.37
6        20.92           8.07           3.04           43.35          15.87           4.21            1.17
7        93.92          29.68           9.38          155.61          44.25          11.67            2.47
8       132.03         104.24          15.52           98.34          52.70          13.90            2.83
9        35.08          17.38           6.68           48.44          28.71           7.57            1.71
10       49.70          23.92           9.93           69.22          46.28          12.21            2.03
11      128.32          31.37          20.79          152.01         100.67          26.65            4.86
12       79.24          41.05          13.87          181.49          62.51          16.65            3.18
To create artistic thumbnails for the seven cor-
responding videos using the HCR end user device,
HECATE (Song et al., 2016) required 472.79 min
while the proposed method required 10.57 min; i.e.,
the proposed method is 44.72 times faster than the
baseline method HECATE (Song et al., 2016) when
generating the personalized artistic thumbnails. This
is because HECATE (Song et al., 2016) requires
the analysis of the whole video while the proposed
method utilizes LTC to create artistic thumbnails.
For generating the corresponding GIFs of the
twelve feature-length videos using the HCR end user
device, the proposed method takes 29.31 min com-
pared with 1,158.73 min required by HECATE (Song
et al., 2016). Again, for generating GIFs for the
twelve videos on the LCR devices, the proposed
method took 145.79 min while the CL-GIF (Mujtaba
et al., 2021) took 550.87 min.
Table 6: Overall computational analysis (in minutes) of proposed and baseline methods.
Artistic Data    Device    Method        Total
Thumbnail        HCR       HECATE         472.79
                           Proposed        10.57
Animated GIF     LCR       CL-GIF         550.87
                           Proposed       145.79
                 HCR       HECATE       1,158.73
                           AV-GIF         383.54
                           CL-GIF         121.58
                           FB-GIF       1,170.85
                           Proposed        29.31
Therefore, the analysis of these twelve videos
indicates that, on average, the proposed method is
39.54, 13.09, 4.15, and 39.95 times faster than the
HECATE (Song et al., 2016), AV-GIF (Mujtaba et al.,
2021), CL-GIF (Mujtaba et al., 2021), and FB-GIF
methods when using the HCR device, respectively.
Similarly, when using the LCR device, the proposed
method is 3.78 times faster than the CL-GIF (Mujtaba
et al., 2021) method. Additionally, the proposed ap-
proach generates more GIFs than the baseline meth-
ods. For example, while most methods are restricted
to one GIF (e.g., AV-GIF (Mujtaba et al., 2021),
CL-GIF (Mujtaba et al., 2021)) or a fixed number of
GIFs (e.g., 10 GIFs for HECATE (Song et al., 2016)),
the proposed method can generate 25 GIFs, showing
better computational efficacy than most of the meth-
ods in both HCR and LCR devices.
4.3 Subjective Evaluation
This section presents the subjective evaluation of GIFs
generated using the proposed approach compared to those
obtained from YouTube or created using the baseline
approaches. The subjective evalua-
tion was conducted using a survey with nine partic-
ipants. Demographically, the participants were from
three different countries namely Pakistan, Vietnam,
and South Korea. A group of students was selected
based on their interest in sports. The survey was based
on the first six videos (Table 1). The quality of the
created GIFs was assessed using a fixed rating
scale. The participants were asked to grade the GIFs
based on perceived joy. An anonymous questionnaire
was designed for the created GIFs to prevent users
from determining the method used to create a given
GIF. The participants were requested to view all GIFs
and rank them on a scale of 1 to 10 (1 being the low-
est and 10 being the highest). Table 7 lists the
ratings of the compared methods as they were given by
the participants. Across the six videos, the
average ratings for YouTube, HECATE (Song et al.,
2016), CL-GIF (Mujtaba et al., 2021), and the pro-
posed method were 5.0, 6.46, 5.65, and 7.48, respec-
tively. The sample frames obtained from the gener-
ated GIFs using the proposed and baseline methods
are presented in Figure 5.
Table 7: Average ratings (1–10) assigned by participants
for the proposed and baseline methods.
YouTube HECATE CL-GIF Proposed
4.67 6.78 5.67 8.11
4.67 6.22 7.00 8.56
4.78 7.56 5.33 8.44
5.56 5.44 5.22 5.78
4.22 6.33 5.00 7.44
6.11 6.44 5.67 6.56
Figure 5: Sample frames taken from GIFs generated using
the proposed and baseline methods.
4.4 Discussion
The proposed method achieved significantly higher
performance and less computation time on both HCR
and LCR devices, compared with existing methods.
This is because the proposed method uses LTC and
video segments to generate artistic media instead of
processing the entire video. The main advantage
of using LTC is that the number of thumbnails is
very small compared to the number of frames in the
video. For example, the Belgium vs. Japan football
video with 1h 52m duration has 202,036 frames and
6734 thumbnails. Additionally, the 160 × 90 (width ×
height) size of thumbnails remains lightweight for ev-
ery video resolution (HD, 2K, 4K, etc.) compared
to the frame size of the corresponding video. The
proposed method reduces the overall computational
power and time required to produce artistic media on
client devices.
In the qualitative experiment involving partici-
pants (described in Section 4.3), the proposed ap-
proach obtained a higher average rating than those of
other methods. This is mainly because the GIFs are
generated based on user interests with the proposed
approach. In addition, the proposed method can gen-
erate more than one GIF, which can then be used ran-
domly to obtain a greater CTR for the corresponding
video. In practical applications, the proposed method
can significantly improve the CTR of newly broad-
casted full-length sports videos on streaming plat-
forms. The client-driven approaches are in their in-
fancy. The proposed method can also be useful for
short videos (Mujtaba and Ryu, 2021).
5 CONCLUSIONS
This paper proposes a new lightweight method for
generating artistic media using limited computational
resources on end user devices. Instead of process-
ing the entire video, the proposed method analyzes
thumbnails to recognize personalized events and uses
the corresponding video segments to generate artistic
media. This improves computational efficiency and
reduces the demand for communication and storage
resources in resource-constrained devices. The ex-
perimental results that are based on a set of twelve
feature-length sports videos show that the proposed
approach is 4.15 and 3.78 times faster than the state-
of-the-art method during the animated GIF generation
process when using the HCR and LCR devices, re-
spectively. The qualitative evaluation indicated that
the proposed method outperformed the existing meth-
ods and received higher overall ratings. In the future,
the proposed method could be implemented for other
sports categories by considering various events using
resource-constrained devices.
ACKNOWLEDGEMENTS
This work was supported by Institute of Information
& communications Technology Planning & Evalua-
tion (IITP) grant funded by the Korean government
(MSIT) No.2020-0-00231-003, Development of Low
Latency VR-AR Streaming Technology based on 5G
edge cloud. This work was also supported in part
by the National Research Foundation of Korea (NRF)
funded by the Korean government (MSIT) with grant
No. NRF-2020R1A2C1013308.
REFERENCES
Agethen, S. and Hsu, W. H. (2019). Deep multi-kernel
convolutional lstm networks and an attention-based
mechanism for videos. IEEE Transactions on Mul-
timedia, 22(3):819–829.
Bakhshi, S., Shamma, D. A., Kennedy, L., Song, Y.,
De Juan, P., and Kaye, J. (2016). Fast, cheap, and
good: Why animated gifs engage us. In Proceedings
of the 2016 chi conference on human factors in com-
puting systems, pages 575–586, New York, NY, USA.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action
recognition? a new model and the kinetics dataset.
In proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 6299–6308.
Chen, W., Rudovic, O. O., and Picard, R. W. (2017).
Gifgif+: Collecting emotional animated gifs with
clustered multi-task learning. In 2017 Seventh Inter-
national Conference on Affective Computing and In-
telligent Interaction (ACII), pages 510–517.
Chollet, F. (2017). Xception: Deep learning with depthwise
separable convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 1251–1258.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. (2009). Imagenet: A large-scale hierarchical image
database. In 2009 IEEE conference on computer vi-
sion and pattern recognition, pages 248–255.
Donahue, J., Anne Hendricks, L., Guadarrama, S.,
Rohrbach, M., Venugopalan, S., Saenko, K., and Dar-
rell, T. (2015). Long-term recurrent convolutional net-
works for visual recognition and description. In Pro-
ceedings of the IEEE conference on computer vision
and pattern recognition, pages 2625–2634.
Farha, Y. A. and Gall, J. (2019). Ms-tcn: Multi-stage tem-
poral convolutional network for action segmentation.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 3575–3584.
FFmpeg (2020). Ffmpeg github page.
Heilbron, F. C., Barrios, W., Escorcia, V., and Ghanem, B.
(2017). Scc: Semantic context cascade for efficient
action detection. In 2017 IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
3175–3184.
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B.,
Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V.,
et al. (2019). Searching for mobilenetv3. In Proceed-
ings of the IEEE International Conference on Com-
puter Vision, pages 1314–1324.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger,
K. Q. (2017). Densely connected convolutional net-
works. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 4700–
4708.
Jiang, J. A., Fiesler, C., and Brubaker, J. R. (2018). ’the
perfect one’ understanding communication practices
and challenges with animated gifs. Proceedings of the
ACM on human-computer interaction, 2(CSCW):1–
20.
Jou, B., Bhattacharya, S., and Chang, S.-F. (2014). Predict-
ing viewer perceived emotions in animated gifs. In
Proceedings of the 22nd ACM international confer-
ence on Multimedia, pages 213–216, New York, NY,
USA.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Suk-
thankar, R., and Fei-Fei, L. (2014). Large-scale video
classification with convolutional neural networks. In
2014 IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 1725–1732.
Liu, T., Wan, J., Dai, X., Liu, F., You, Q., and Luo, J. (2020).
Sentiment recognition for short annotated gifs using
visual-textual fusion. IEEE Transactions on Multime-
dia, 22(4):1098–1110.
Loshchilov, I. and Hutter, F. (2017). Decoupled weight de-
cay regularization.
Mujtaba, G., Lee, S., Kim, J., and Ryu, E.-S. (2021). Client-
driven animated gif generation framework using an
acoustic feature. Multimedia Tools and Applications.
Mujtaba, G. and Ryu, E.-S. (2020). Client-driven person-
alized trailer framework using thumbnail containers.
IEEE Access, 8:60417–60427.
Mujtaba, G. and Ryu, E.-S. (2021). Human character-
oriented animated gif generation framework. In 2021
Mohammad Ali Jinnah University International Con-
ference on Computing (MAJICC), pages 1–6. IEEE.
Peng, Y., Zhao, Y., and Zhang, J. (2018). Two-stream col-
laborative learning with spatial-temporal attention for
video classification. IEEE Transactions on Circuits
and Systems for Video Technology, 29(3):773–786.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
Chen, L.-C. (2018). Mobilenetv2: Inverted residu-
als and linear bottlenecks. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 4510–4520.
Shen, T., Lin, G., Shen, C., and Reid, I. (2017). Learning
multi-level region consistency with dense multi-label
networks for semantic segmentation. arXiv preprint
arXiv:1701.07122.
Shu, Y., Shi, Y., Wang, Y., Zou, Y., Yuan, Q., and Tian, Y.
(2018). Odn: Opening the deep network for open-set
action recognition. In 2018 IEEE International Con-
ference on Multimedia and Expo (ICME), pages 1–6.
Simonyan, K. and Zisserman, A. (2014). Two-stream con-
volutional networks for action recognition in videos.
Advances in neural information processing systems,
27:568–576.
Song, Y., Redi, M., Vallmitjana, J., and Jaimes, A. (2016).
To click or not to click: Automatic selection of beauti-
ful thumbnails from videos. In Proceedings of the 25th
ACM International on Conference on Information and
Knowledge Management, page 659–668, New York,
NY, USA.
Soomro, K., Zamir, A. R., and Shah, M. (2012). Ucf101:
A dataset of 101 human actions classes from videos in
the wild.
Xie, C.-W., Zhou, H.-Y., and Wu, J. (2018). Vortex pooling:
Improving context representation in semantic segmen-
tation.
Xu, Y., Bai, F., Shi, Y., Chen, Q., Gao, L., Tian, K.,
Zhou, S., and Sun, H. (2021). Gif thumbnails: At-
tract more clicks to your videos. In Proceedings of
the AAAI Conference on Artificial Intelligence, pages
3074–3082.
Yang, K., Shen, X., Qiao, P., Li, S., Li, D., and Dou, Y.
(2019). Exploring frame segmentation networks for
temporal action localization. Journal of Visual Com-
munication and Image Representation, 61:296–302.
Yuan, Y., Ma, L., and Zhu, W. (2019). Sentence specified
dynamic video thumbnail generation. In Proceedings
of the 27th ACM International Conference on Multi-
media, pages 2332–2340.