Reinforcement Learning based Video Summarization with Combination
of ResNet and Gated Recurrent Unit
Muhammad Sohail Afzal and Muhammad Atif Tahir
National University of Computer and Emerging Sciences, Karachi, Pakistan
Keywords:
Video Summarization, Reinforcement Learning, Reward Function, ResNet, Gated Recurrent Unit.
Abstract:
Video cameras are becoming ubiquitous with the passage of time. A huge amount of video data is generated daily and needs to be handled efficiently within limited storage and processing power. Video summarization offers an effective way to quickly review lengthy videos while keeping storage and processing requirements under control. The deep reinforcement learning based deep summarization network (DR-DSN) is a popular method for video summarization, but its performance is limited and can be enhanced with a better representation of the video data. Recently, deep residual networks have proven quite successful in many computer vision applications, including video retrieval and captioning. In this paper, we investigate deep feature representation for video summarization using a deep residual network, where ResNet-152 is used to extract deep video features. To speed up the model, the long short-term memory is replaced with a gated recurrent unit, which gave us the flexibility to add another RNN layer and resulted in a significant improvement in performance. With this combination of ResNet-152 and a two-layer gated recurrent unit (GRU), we performed experiments on the SumMe video dataset and obtained results that are better not only than DR-DSN but also than several state-of-the-art video summarization methods.
1 INTRODUCTION
Camera deployments are proliferating with each passing day. As the number of cameras increases, we are gathering more and more video data: almost 10 million GB of video data is generated worldwide in a single week (Lai et al., 2016). This data needs to be stored and processed efficiently, but available storage and processing power always have an upper limit. It is therefore wise to reduce this data to a summary of important frames and discard the useless ones. Both problems are solved provided the generated summary is a good representation of the complete video and ensures that important frames are not lost. Summing up long videos in just a few frames is a challenging task and requires an approach that solves the problem efficiently with reasonable accuracy. Several supervised approaches have been proposed for video summarization, but it is an intimidating task to annotate a whole video with just one label, as there is no single ground truth for any video. DR-DSN (Zhou et al., 2018a) therefore introduced an unsupervised video summarization approach based on a reinforcement learning reward function. The basic idea of how reinforcement learning interacts with video summarization is shown in Fig. 1, where the agent generates a summary and gives it to the reward function for evaluation. The reward function provides feedback through which the agent learns to generate a better summary next time.
Figure 1: Basic idea of reinforcement learning in video
summarization.
Residual networks (ResNet) have always been of great importance in learning deep feature representations of data, as these networks have been used in many computer vision challenges, such as the ILSVRC and COCO competitions, to obtain better results (He et al.,
2016). Different ResNet models were employed for cancer prediction and malware detection on two different datasets (Khan et al., 2018), where ResNet-152 achieved the highest accuracy. Deep ResNets have been used for the classification of hyperspectral images, and it was observed that the deeper the network, the better the features it captures and hence the better the result (Zhong et al., 2017). GoogleNet and ResNet were compared on a malware detection problem (Khan et al., 2019), and it was concluded that ResNet gave more accurate results than GoogleNet but took more time to solve the problem at hand.
It was noticed in a study on recurrent neural networks (Elsayed et al., 2018) that replacing long short-term memory (LSTM) with a gated recurrent unit (GRU) increased the performance of a classification task. Furthermore, combining this GRU with a convolutional neural network on the same task yielded a further performance improvement, which is also what we adopt in our proposed video summarization method. To the best of our knowledge, this combination of ResNet and GRU has never been investigated before for video summarization. Our approach is designed to obtain better results in less time. ResNet-152 is a deep architecture with 152 layers, so it takes more time to extract all the important features; our recurrent neural network, a two-layer gated recurrent unit, compensates for this time because it is faster than the long short-term memory previously employed in DR-DSN. The proposed approach is evaluated on the SumMe dataset, a widely used benchmark for video summarization. The results indicate a significant increase in performance compared with state-of-the-art video summarization methods.
The paper is organized as follows. Section 2 gives the background on previously used approaches for video summarization. Section 3 elaborates our proposed methodology with an explanation of all its components. Section 4 describes the experimental settings and the results obtained from comparing our approach with DR-DSN and several other well-known approaches. Section 5 concludes the paper.
2 RELATED WORK
Various studies have addressed video summarization through reinforcement learning (Masumitsu and Echigo, 2000)(Zhou et al., 2018a)(Zhou et al., 2018b)(Zhang et al., 2019)(Lei et al., 2018). An importance score for each frame was calculated (Masumitsu and Echigo, 2000) by projecting features into an eigenspace, and the summary video was then generated through reinforcement learning; experiments were conducted on soccer videos and higher precision values were achieved. In (Zhou et al., 2018a), a deep summarization network and an end-to-end reinforcement learning approach are proposed to generate high-quality summaries; results obtained on the SumMe and TVSum datasets were better than other state-of-the-art methods. Another approach based on action parsing is presented in (Lei et al., 2018), in which the video is first cut by action parsing and then summarized through reinforcement learning; experiments conducted on SumMe and TVSum achieved better F-scores than several competitive methods. In (Zhou et al., 2018b), a summarization network trained through deep Q-learning is presented and tested on the CoSum and TVSum datasets, with results better than several advanced methods. MapNet is exploited in (Zhang et al., 2019) to first map frames to their respective queries, after which summaries are generated through SummNet; experiments were conducted on the UT Egocentric dataset, where state-of-the-art results were achieved.
Several significant studies on video summarization have also focused specifically on surveillance cameras. A video summarization technique based on key-frame extraction was proposed for surveillance cameras (Yang et al., 2011), in which the most informative scenes are selected until the required summarization rate is reached; experiments conducted on the CAVIAR dataset gave results better than the popular optical-flow and colour spatial-distribution techniques. A target-driven summarization of surveillance video for tracing suspects is proposed in (Chen et al., 2013) that works in two steps: first, targets are detected using the filtered summarized video of any camera, with filtering by target category and time information; after the targets are identified, appearance signals are activated in the other cameras. A perspective-dependent, grid-based model is then proposed that showed satisfactory results. An event-based summarization of surveillance video is proposed in (Yun et al., 2014) that focuses on appearance as well as movement patterns; the occlusion problem is solved by separating local motion from global motion, and hence greater accuracy was achieved. Another event-driven approach (Dimou et al., 2015) focuses on low-level visual features and assigns a score to each frame based on its importance; this work is more user-centric, allowing users to select a variable number of key frames for the video summary.
A diversity-based video summarization technique built on dictionary learning is introduced in (Chen et al., 2015), taking into account the relationships among different video samples, which had been a serious problem in dictionary learning methods; geometrical distributions were drawn using a graph-based strategy to address this problem, and impressive results were obtained. Another diversity-aware work (Panda et al., 2017) proposes an unsupervised framework that summarizes videos efficiently using an algorithm that minimizes an overall loss function; a new Tour20 dataset was introduced for the experiments and state-of-the-art results were achieved. In (Sharghi et al., 2016), a video summarization method is proposed that selects key frames from a video with the help of a user query; datasets containing annotated videos were used in the experiments and state-of-the-art results were achieved. This method can handle lengthy videos as well as online streaming videos. Subjectiveness in video summarization is carefully analyzed in (Sharghi et al., 2017) and some solutions are suggested; determinantal point processes and memory networks were used in the summarizer to obtain a more representative and diverse summary of the original video, and the method outperformed two popular video summarization methods. One video was used for testing, one for validation, and two for training, for a total of four experimental rounds.
A query-based video summarization method is introduced in (Ji et al., 2017) that searches for user-preferred content based on a query, built on a sparse coding framework; the authors also introduced a public dataset containing 1000 annotated videos, and extensive experiments on this dataset produced state-of-the-art results that prove the effectiveness of the proposed method. Visual and textual embeddings are employed in (Vasudevan et al., 2017) so that a better representative summary can be obtained from a user query; a new thumbnail-selection dataset of labeled videos was also used, and the approach showed that a more advanced text model with a better training objective and better modelling quality gives the highest performance gains. A multimodal approach for summarizing cricket matches (Bhalla et al., 2019) detects important events in the match and then generates highlights that reflect the original video well; techniques such as optical character recognition were used to detect important events on the field, such as wickets and boundaries, and the detected events are then joined to produce highlights of the entire match, achieving an accuracy of 89 percent. In (Ma et al., 2019), another video summarization method based on sparse dictionary selection is proposed that assumes a linear relationship between frames; a nonlinear model is constructed and the video is mapped into a high-dimensional feature space with the help of a kernel to convert the nonlinearity into linearity. Furthermore, two greedy algorithms with a backtracking strategy are suggested to solve the model. Experiments conducted on the SumMe and TVSum datasets showed that the summaries produced were better than those of other state-of-the-art video summarization algorithms.
3 PROPOSED APPROACH
The proposed approach is shown in Fig. 2. Videos are converted into frames, and these frames are fed to a residual network. The residual network in our approach is ResNet-152, which has 152 layers. Because it is a very deep architecture, it extracts the deep and important features from the videos that are crucial for generating representative summaries. These features are extracted in 2048 dimensions for each video so that no important features are lost. After extraction, the features are passed to a two-layer gated recurrent unit, which generates hidden states of two types: forward hidden states and backward hidden states. A sigmoid layer mounted at the end of the gated recurrent unit takes these hidden states as input and produces a probability score for each frame. Frames with higher probability scores are selected for further evaluation and are fed to the reinforcement learning reward function. This reward function is the sum of two other reward functions: a diversity reward function and a representativeness reward function, which evaluate the selected frames based on their diversity and representativeness. The diversity reward ensures that the selected frames are diverse enough to represent all the different parts of the video, while the representativeness reward checks whether the selected frames represent the original video well. The main goal is to maximize the sum of these two rewards, which automatically yields a better summary. To accomplish this, a policy is learned through Adam optimization until the reward function is maximized; maximizing the reward function produces better summaries.
Figure 2: Our proposed approach.
We now separately discuss the important components of this approach: ResNet, the gated recurrent unit, and the reinforcement learning reward function.
3.1 ResNet-152
The residual network (Nguyen et al., 2018), also known as ResNet, belongs to a family of deep neural networks that share the same structure but have different depths. It avoids degradation in deep networks through a unit known as the residual learning unit and is a well-known architecture for extracting deep features from images. The residual learning unit is a feed-forward structure that adds its input back to the output of the unit through a shortcut connection. The main advantage of ResNet is that it produces better results with higher accuracy without increasing complexity. Fig. 3 shows the basic architecture of ResNet-152.
Figure 3: The basic architecture of ResNet-152 (Nguyen
et al., 2018).
ResNet takes all the frames of each video as input and generates a feature matrix for each video in 2048 dimensions. With a network depth of 152 layers, it ensures that all the deep and important features are extracted from each video. Extracting the important features is essential: if they are missed, the gated recurrent unit cannot generate good hidden states from them, and all further work is useless. ResNet-152 serves this purpose well because it is a very deep architecture.
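As an illustration, the following sketch (not the authors' exact code) shows how 2048-dimensional frame features could be extracted with a pre-trained ResNet-152 in PyTorch by dropping the final classification layer; the frames are assumed to be PIL images and the preprocessing values are the standard ImageNet ones.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained ResNet-152 with its final classification layer removed:
# the global-average-pooling output gives one 2048-d feature per frame.
resnet = models.resnet152(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(frames):
    """frames: list of PIL images of one video -> (num_frames, 2048) tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    feats = feature_extractor(batch)       # (num_frames, 2048, 1, 1)
    return feats.flatten(start_dim=1)      # (num_frames, 2048)
```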
3.2 Gated Recurrent Unit (GRU)
The gated recurrent unit is a sequential encoder. Its advantage over long short-term memory is that it has fewer parameters, which results in less training time and a very low risk of overfitting (Chung et al., 2014). Its performance has also surpassed that of LSTM in several cases (Elsayed et al., 2018), including ours. In our approach, we first checked the performance of LSTM on all videos and then replaced it with GRU. We found that although LSTM gave better results on a few videos where GRU did not, overall GRU performed much better on most videos and with less training time.
The GRU receives the features previously extracted by ResNet-152 and generates forward and backward hidden states. A sigmoid layer at the end of the GRU transforms these hidden states into probability scores, so each frame is associated with a probability score. Actions are performed to select the frames with higher probability scores
and these selected frames are then passed to the reinforcement learning reward function.
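A minimal sketch of such a scoring network is given below, assuming a two-layer bidirectional GRU whose concatenated forward and backward hidden states feed a sigmoid layer; the 2048-dimensional input and 256 hidden units follow Section 4.4, while the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Sketch of the scoring network: a two-layer bidirectional GRU over
    ResNet features, followed by a sigmoid layer that turns each frame's
    forward/backward hidden states into a selection probability."""
    def __init__(self, in_dim=2048, hidden_dim=256, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden_dim, num_layers=num_layers,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(hidden_dim * 2, 1)   # forward + backward states

    def forward(self, features):
        # features: (1, num_frames, 2048)
        hidden, _ = self.gru(features)           # (1, num_frames, 2*hidden_dim)
        probs = torch.sigmoid(self.fc(hidden))   # (1, num_frames, 1)
        return probs.squeeze(-1)                 # (1, num_frames)
```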
3.3 Reward Function
The purpose of the reinforcement learning agent is to maximize its reward function by learning from the environment. The interaction of the reward function with the video summarization system is shown in Fig. 1. The reward function here is a combination of two other reward functions, a diversity reward function and a representativeness reward function, as shown in Fig. 2.
3.3.1 Diversity Reward Function
This reward is responsible for finding key frames among the selected ones. It compares every frame with every other frame to find the most dissimilar ones. Let $K$ be the set of selected frames given to the diversity reward function and let $b_t$ denote the feature vector of frame $t$; the reward is then given by:
R_{div} = \frac{1}{|K|(|K|-1)} \sum_{t \in K} \sum_{t' \in K,\, t' \neq t} \left( 1 - \frac{b_t^{T} b_{t'}}{\|b_t\|_2 \, \|b_{t'}\|_2} \right)   (1)
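A possible implementation of Eq. (1) is sketched below; it assumes the selected frames are given as indices into the ResNet feature matrix, and the function name is illustrative rather than taken from the authors' code.

```python
import torch

def diversity_reward(features, selected):
    """Sketch of Eq. (1): mean pairwise cosine dissimilarity among the
    selected frames. features: (num_frames, 2048); selected: indices K."""
    b = features[selected]                        # (|K|, 2048)
    if b.size(0) < 2:
        return torch.tensor(0.0)
    normed = b / (b.norm(dim=1, keepdim=True) + 1e-8)
    dissim = 1.0 - normed @ normed.t()            # 1 - cosine similarity
    k = b.size(0)
    # The diagonal (t == t') is zero, so summing and dividing by |K|(|K|-1)
    # averages only over distinct frame pairs.
    return dissim.sum() / (k * (k - 1))
```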
3.3.2 Representative Reward Function
This reward ensures that the selected frames are a good representation of the original video, which is done by favouring frames that lie close to the cluster centres of the whole feature space. The reward is given by:
R_{rep} = \exp\left( -\frac{1}{T} \sum_{t=1}^{T} \min_{t' \in K} \|b_t - b_{t'}\|_2 \right)   (2)
The total reward function is the sum of these two reward functions:

R_{total} = R_{div} + R_{rep}   (3)
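The sketch below illustrates Eq. (2) and Eq. (3), reusing the diversity_reward function sketched above; again, this is an interpretation of the formulas rather than the authors' code.

```python
import torch

def representativeness_reward(features, selected):
    """Sketch of Eq. (2): for every frame, take its L2 distance to the
    nearest selected frame, average over all frames, and map through exp(-x)."""
    dists = torch.cdist(features, features[selected])   # (T, |K|)
    return torch.exp(-dists.min(dim=1).values.mean())

def total_reward(features, selected):
    """Eq. (3): sum of the diversity and representativeness rewards."""
    return diversity_reward(features, selected) + representativeness_reward(features, selected)
```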
3.4 Learning via Policy Gradient
The agent should learn a policy that maximizes the rewards. If p is the probability score and θ denotes the policy parameters, the objective function is:
L(\theta) = \mathbb{E}_{p(a|\pi)}\left[ R_{div} + R_{rep} \right]   (4)
To perform the gradient update, we need the derivative of the objective function, which can be approximated as:

\nabla_{\theta} L(\theta) \approx \frac{1}{N} \sum_{k=1}^{N} \sum_{t=1}^{T} \left( R_{div}^{k} + R_{rep}^{k} - m \right) \nabla_{\theta} \log \pi(a_t \mid h_t; \theta)   (5)
where $\pi$ is the policy function, $m$ is the moving average of rewards, $h_t$ are the hidden states generated by the GRU, and $a_t$ are the actions taken by the model.
Optimization is then performed with the Adam algorithm (Kingma and Ba, 2014), which helps the agent increasingly take the steps that lead to higher rewards and avoid the steps that lead to low reward values.
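A minimal sketch of one such update is shown below, following Eq. (5) under the sketches given earlier; frame selections are sampled as Bernoulli actions from the predicted probabilities, the moving-average baseline m reduces variance, and the number of sampled episodes per video is an illustrative choice rather than a value from the paper.

```python
import torch

def reinforce_step(model, optimizer, features, reward_fn, baseline, episodes=5, beta=0.9):
    """Sketch of one policy-gradient update (Eq. 5): sample Bernoulli frame
    selections from the predicted probabilities and weight their summed
    log-probability by the reward minus a moving-average baseline m."""
    probs = model(features.unsqueeze(0)).squeeze(0)          # (num_frames,)
    dist = torch.distributions.Bernoulli(probs)
    losses = []
    for _ in range(episodes):                                # N sampled episodes
        actions = dist.sample()                              # 0/1 decision per frame
        selected = actions.nonzero(as_tuple=True)[0]
        reward = reward_fn(features, selected) if selected.numel() > 0 else torch.tensor(0.0)
        losses.append(-(reward - baseline) * dist.log_prob(actions).sum())
        baseline = beta * baseline + (1 - beta) * float(reward)   # moving average of rewards
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()
    return baseline
```

Here, model would be the FrameScorer sketched earlier, reward_fn the total_reward sketch, and optimizer an Adam instance (Kingma and Ba, 2014).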
4 EXPERIMENTS AND RESULTS
4.1 Dataset
The SumMe dataset, which contains 25 videos on various topics such as sports, holidays, and other events, is used for the experiments. Each video is approximately 1 to 6 minutes long and is annotated by 15 to 18 persons, so multiple ground-truth summaries are available for each video.
4.2 Evaluation Metric
To assess the automatic summaries generated by our summarization network, the F-measure is used as the evaluation metric to quantify the difference between the ground-truth and automatic summaries. The F-measure is the harmonic mean of precision and recall and is computed on the test videos.
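For illustration, the F-measure over binary per-frame summary indicators could be computed as follows; this is a sketch, and in practice the score is aggregated over the multiple user summaries mentioned in Section 4.1.

```python
def f_measure(pred, truth):
    """Sketch: harmonic mean of precision and recall over binary
    per-frame indicators (1 = frame kept in the summary)."""
    overlap = sum(p and t for p, t in zip(pred, truth))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred)
    recall = overlap / sum(truth)
    return 2 * precision * recall / (precision + recall)
```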
4.3 Evaluation Settings
Leave-one-out cross-validation is used to evaluate our system: the model is trained with N − 1 videos and evaluated on the remaining video, and this process is repeated N times with different training sets of size N − 1.
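A sketch of this protocol over the 25 SumMe videos is given below; train_model and evaluate are hypothetical helpers standing in for the training and F-measure steps described above.

```python
def leave_one_out(videos, train_model, evaluate):
    """Sketch of leave-one-out cross-validation: each of the N videos is
    held out once for testing while the remaining N - 1 are used for training."""
    scores = []
    for i, test_video in enumerate(videos):
        train_set = videos[:i] + videos[i + 1:]      # the other N - 1 videos
        model = train_model(train_set)               # hypothetical training helper
        scores.append(evaluate(model, test_video))   # hypothetical F-measure evaluation
    return sum(scores) / len(scores)                 # average over the N folds
```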
4.4 Implementation Details
We implemented our approach in PyTorch on a GTX 1050 Ti GPU. Features were extracted with ResNet in 2048 dimensions and fed to a model with 256 hidden units and two RNN layers. Training was run for 70 epochs, and the learning-rate decay step size was set to 30.
4.5 Comparison with DR-DSN
We replicated DR-DSN (Zhou et al., 2018a) and ran it on each video of the SumMe dataset, obtaining the same results for DR-DSN as reported in (Zhou et al., 2018a).
Table 1: Comparison with DR-DSN (Zhou et al., 2018a) on SumMe dataset.
Video  Video Name  Camera Type  Video Length  Frames  F-measure (DR-DSN)  F-measure (Our method)
Video 1 Air Force One static 2 min 60 sec 4494 60 60
Video 2 Base Jumping egocentric 2 min 39 sec 4729 22.3 28.7
Video 3 Bearpark Climbing moving 2 min 14 sec 3341 35.2 54.1
Video 4 Bike Polo egocentric 1 min 43 sec 3064 54.4 54.9
Video 5 Bus in Rock Tunnel moving 2 min 51 sec 5131 29.7 36.2
Video 6 Car RailCrossing moving 2 min 49 sec 5075 21 21
Video 7 Cockpit Landing moving 5 min 2 sec 9046 29.2 29.2
Video 8 Cooking moving 1 min 27 sec 1286 49.4 50.1
Video 9 Eiffel Tower moving 3 min 20 sec 4971 30.9 32.6
Video 10 Excavator River Cross moving 6 min 29 sec 9721 26.5 31.4
Video 11 Fire Domino static 0 min 55 sec 1612 60.2 60.2
Video 12 Jumps moving 0 min 39 sec 950 0 40.4
Video 13 Kids Playing in Leaves moving 1 min 46 sec 3187 50.2 22.9
Video 14 Notre Dam moving 3 min 12 sec 4608 44.6 37.4
Video 15 Paintball static 4 min 16 sec 6096 55.1 55.2
Video 16 Playing on Water Slide moving 1 min 42 sec 3065 34.9 34.9
Video 17 Saving Dolphines moving 3 min 43 sec 6683 32.7 43
Video 18 Scuba egocentric 1 min 14 sec 2221 67.5 67.5
Video 19 St. Marten Landing moving 1 min 10 sec 1751 60.8 41.7
Video 20 Statue of Liberty moving 2 min 36 sec 3863 61.4 43.4
Video 21 Uncut Evening Flight moving 5 min 23 sec 9672 17.4 27.8
Video 22 Valparaiso Downhill egocentric 2 min 53 sec 5178 39 44.1
Video 23 Car Over Camera static 2 min 26 sec 4382 66.7 66.7
Video 24 Paluma Jump moving 1 min 26 sec 2574 29.5 30.6
Video 25 Playing Ball moving 1 min 44 sec 3120 55.3 77.7
RESULT (Average F-measure) 41.4 43.7
We then ran our approach on each video and compared the results of the two approaches, shown in Table 1. Our approach achieved an average F-measure 2.3 percentage points higher than DR-DSN. There is even one video in the dataset, Video 12, on which DR-DSN failed completely (Table 1), while our approach achieved a reasonable F-measure of 40.4. Our approach performed better on most of the videos, so overall the proposed method is better than DR-DSN.
Table 2: Comparison with other approaches.
Method Dataset F-measure
CSUV SumMe 23
Uniform Sampling SumMe 29.3
Vsumm SumMe 33.7
Dictionary Selection SumMe 37.8
GAN dpp SumMe 39.1
DR-DSN SumMe 41.4
Ours SumMe 43.7
4.6 Comparison with Other State-of-the-Art Approaches
Several unsupervised video summarization approaches have been proposed, as listed in Table 2. We replicated the CSUV approach (Gygli et al., 2014), which works through superframe segmentation, on the SumMe dataset; the results achieved by our model were far better. Uniform sampling (Jadon and Jasim, 2019) is also a popular unsupervised technique that works through key-frame extraction; it tries to extract the important parts of a video, but the results show that it is not an effective summarization approach.
Clustering has always been common in video summarization, and the VSUMM approach applies it through k-means: clusters of similar frames are formed, and each frame belongs to the cluster whose centroid is nearest (a minimal sketch is given below). This is how k-means summarizes videos, but it is not as good as our proposed approach.
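For illustration only, a VSUMM-style selection could be sketched as follows (this is not the original VSUMM implementation): frame features are clustered with k-means and the frame nearest each centroid is kept as a keyframe.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_keyframes(features, num_keyframes):
    """Sketch of clustering-based selection: cluster frame features
    (e.g. colour histograms or deep features) with k-means and keep the
    frame closest to each cluster centroid as a keyframe."""
    km = KMeans(n_clusters=num_keyframes, n_init=10, random_state=0).fit(features)
    keyframes = []
    for c in range(num_keyframes):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)
```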
The dictionary selection approach (Ma et al., 2019) works by assuming a linear relationship between frames; a nonlinear model is constructed, and the video is mapped into a high-dimensional feature space with the help of a kernel to convert the nonlinearity into linearity. Two greedy algorithms with a backtracking strategy were also suggested to solve the dictionary selection model, but the final results are not better than our proposed approach. The GAN dpp summarization approach (Mahasseni et al., 2017) works with a summarizer and a discriminator, both realized with long short-term memory; as a discriminator, it distinguishes between the summary generated by the system and the original summary. This approach is based on a generative adversarial network, but it is not as effective as ours. The comparison of all these approaches with ours on the SumMe dataset is given in Table 2, where it can be clearly seen that our approach leads all the others.
5 CONCLUSIONS
In this paper, we proposed an improved video summarization method that outperforms several state-of-the-art methods. The residual network ResNet-152 was employed together with a gated recurrent unit having two RNN layers. We performed a detailed comparison of our approach with DR-DSN by providing results on every video in the SumMe dataset. Furthermore, we compared the overall average F-score of our approach with that of several other state-of-the-art video summarization methods and concluded that our method generates the best representative summaries of the original videos.
ACKNOWLEDGEMENT
We thank Kaiyang Zhou for a detailed discussion about his paper (Zhou et al., 2018a). This work is supported by the Video Surveillance Lab, Karachi, Pakistan, affiliated with the National Center of Big Data and Cloud Computing, Pakistan.
REFERENCES
Bhalla, A., Ahuja, A., Pant, P., and Mittal, A. (2019). A
multimodal approach for automatic cricket video sum-
marization. In 2019 6th International Conference on
Signal Processing and Integrated Networks (SPIN),
pages 146–150. IEEE.
Chen, S.-C., Lin, K., Lin, S.-Y., Chen, K.-W., Lin, C.-W.,
Chen, C.-S., and Hung, Y.-P. (2013). Target-driven
video summarization in a camera network. In 2013
IEEE International Conference on Image Processing,
pages 3577–3581. IEEE.
Chen, X., Li, X., and Lu, X. (2015). Representative and
diverse video summarization. In 2015 IEEE China
Summit and International Conference on Signal and
Information Processing (ChinaSIP), pages 142–146.
IEEE.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y.
(2014). Empirical evaluation of gated recurrent neu-
ral networks on sequence modeling. arXiv preprint
arXiv:1412.3555.
Dimou, A., Matsiki, D., Axenopoulos, A., and Daras, P.
(2015). A user-centric approach for event-driven sum-
marization of surveillance videos.
Elsayed, N., Maida, A. S., and Bayoumi, M. (2018).
Deep gated recurrent and convolutional network hy-
brid model for univariate time series classification.
arXiv preprint arXiv:1812.07683.
Gygli, M., Grabner, H., Riemenschneider, H., and
Van Gool, L. (2014). Creating summaries from user
videos. In European conference on computer vision,
pages 505–520. Springer.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Jadon, S. and Jasim, M. (2019). Video summarization us-
ing keyframe extraction and video skimming. arXiv
preprint arXiv:1910.04792.
Ji, Z., Ma, Y., Pang, Y., and Li, X. (2017). Query-aware
sparse coding for multi-video summarization. arXiv
preprint arXiv:1707.04021.
Khan, R. U., Zhang, X., and Kumar, R. (2019). Analy-
sis of resnet and googlenet models for malware de-
tection. Journal of Computer Virology and Hacking
Techniques, 15(1):29–37.
Khan, R. U., Zhang, X., Kumar, R., and Aboagye, E. O.
(2018). Evaluating the performance of resnet model
based on image recognition. In Proceedings of the
2018 International Conference on Computing and Ar-
tificial Intelligence, pages 86–90.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Lai, P. K., Décombas, M., Moutet, K., and Laganière, R.
(2016). Video summarization of surveillance cameras.
In 2016 13th IEEE International Conference on Ad-
vanced Video and Signal Based Surveillance (AVSS),
pages 286–294. IEEE.
Lei, J., Luan, Q., Song, X., Liu, X., Tao, D., and Song,
M. (2018). Action parsing-driven video summariza-
tion based on reinforcement learning. IEEE Transac-
tions on Circuits and Systems for Video Technology,
29(7):2126–2137.
Ma, M., Mei, S., Wan, S., Wang, Z., and Feng, D. (2019).
Video summarization via nonlinear sparse dictionary
selection. IEEE Access, 7:11763–11774.
Mahasseni, B., Lam, M., and Todorovic, S. (2017). Un-
supervised video summarization with adversarial lstm
networks. In Proceedings of the IEEE conference on
Computer Vision and Pattern Recognition, pages 202–
211.
Masumitsu, K. and Echigo, T. (2000). Video summariza-
tion using reinforcement learning in eigenspace. In
Proceedings 2000 International Conference on Image
Processing (Cat. No. 00CH37101), volume 2, pages
267–270. IEEE.
Nguyen, L. D., Lin, D., Lin, Z., and Cao, J. (2018). Deep
cnns for microscopic image classification by exploit-
ing transfer learning and feature concatenation. In
2018 IEEE International Symposium on Circuits and
Systems (ISCAS), pages 1–5. IEEE.
Panda, R., Mithun, N. C., and Roy-Chowdhury, A. K.
(2017). Diversity-aware multi-video summariza-
tion. IEEE Transactions on Image Processing,
26(10):4712–4724.
Sharghi, A., Gong, B., and Shah, M. (2016). Query-focused
extractive video summarization. In European Confer-
ence on Computer Vision, pages 3–19. Springer.
Sharghi, A., Laurel, J. S., and Gong, B. (2017). Query-
focused video summarization: Dataset, evaluation,
and a memory network based approach. In Proceed-
ings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 4788–4797.
Vasudevan, A. B., Gygli, M., Volokitin, A., and Van Gool,
L. (2017). Query-adaptive video summarization via
quality-aware relevance estimation. In Proceedings of
the 25th ACM international conference on Multime-
dia, pages 582–590.
Yang, Y., Dadgostar, F., Sanderson, C., and Lovell, B. C.
(2011). Summarisation of surveillance videos by
key-frame selection. In 2011 Fifth ACM/IEEE Inter-
national Conference on Distributed Smart Cameras,
pages 1–6. IEEE.
Yun, S., Yun, K., Kim, S. W., Yoo, Y., and Jeong, J.
(2014). Visual surveillance briefing system: Event-
based video retrieval and summarization. In 2014 11th
IEEE International Conference on Advanced Video
and Signal Based Surveillance (AVSS), pages 204–
209. IEEE.
Zhang, Y., Kampffmeyer, M., Zhao, X., and Tan, M. (2019).
Deep reinforcement learning for query-conditioned
video summarization. Applied Sciences, 9(4):750.
Zhong, Z., Li, J., Ma, L., Jiang, H., and Zhao, H. (2017).
Deep residual networks for hyperspectral image clas-
sification. In 2017 IEEE International Geoscience and
Remote Sensing Symposium (IGARSS), pages 1824–
1827. IEEE.
Zhou, K., Qiao, Y., and Xiang, T. (2018a). Deep reinforce-
ment learning for unsupervised video summarization
with diversity-representativeness reward. In Thirty-
Second AAAI Conference on Artificial Intelligence.
Zhou, K., Xiang, T., and Cavallaro, A. (2018b). Video sum-
marisation by classification with deep reinforcement
learning. arXiv preprint arXiv:1807.03089.