Invertible Neural Network-Based Video Compression
Zahra Montajabi, Vahid Khorasani Ghassab and Nizar Bouguila
Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Canada
Keywords:
Computer Vision, Deep Neural Networks, Invertible Neural Network (INN), Video Coding.
Abstract:
With the recent advent of high-resolution mobile and camera devices, better solutions than traditional compression methods are needed to store and transmit new video content. Video compression has therefore received enormous attention among computer vision problems in media technologies. With state-of-the-art video compression methods, videos can be transmitted at higher quality while requiring less bandwidth and storage. Neural network-based video compression methods have remarkably improved video coding performance. In this paper, an Invertible Neural Network (INN) is utilized to reduce information loss. Unlike classic auto-encoders, which lose some information during encoding, an INN preserves more information and therefore reconstructs videos with clearer details, without increasing the complexity of the network compared to traditional auto-encoders. The proposed method is evaluated on public datasets, and the experimental results show that it outperforms standard video coding schemes such as H.264 and H.265 in terms of peak signal-to-noise ratio (PSNR), video multimethod assessment fusion (VMAF), and structural similarity index measure (SSIM).
1 INTRODUCTION
Nowadays, many web activities and web-based applications, such as real-time communications and live streaming, involve a large amount of video content. Video contributes to about 80% of Internet traffic (Networking, 2016). As high-quality video content such as 4K video becomes more prevalent, more sophisticated video compression methods are required to save communication bandwidth and storage space while offering high-quality video encoding with less loss. Moreover, higher-quality encoded videos benefit computer vision tasks such as object tracking and action recognition.

Among the different methods recently developed for compression, deep learning-based methods have gained the most interest and attention. Traditional compression methods such as JPEG (Wallace, 1992) and JPEG 2000 (Skodras et al., 2001) are not jointly optimized, because each module is optimized independently, and they map the input to a latent feature representation linearly. In contrast, deep neural network-based methods can apply highly non-linear transformations and end-to-end training at large scale to optimize compression. Several recent deep learning-based models are reviewed comprehensively in (Liu et al., 2021).
Recently, INNs have become more popular than earlier auto-encoder frameworks. They concentrate on learning the forward process and use additional latent output variables to capture the information that would otherwise be lost, in contrast to classical neural networks that attempt to solve the ambiguous inverse problem directly (Kingma and Dhariwal, 2018), (Dinh et al., 2016), (Ardizzone et al., 2018). Although auto-encoders are very capable of selecting the significant information for reconstruction, some information is lost completely, whereas using an INN for encoding and decoding helps preserve it. The specific features of an INN are: it defines a bijective mapping between input and output, so its inverse exists; both the forward and inverse mappings can be computed efficiently; and the Jacobians of both mappings are tractable, which makes it possible to calculate posterior probabilities explicitly. The work in (Ardizzone et al., 2018) demonstrated theoretically and practically, using both synthetic data and real-world problems from astronomy and medicine, that INNs are an effective analysis tool. Another work on image generation (Ardizzone et al., 2019) used a conditional INN architecture that performs better than variational autoencoders (VAEs) and generative adversarial networks (GANs). In (Lugmayr et al., 2020), an INN is employed to address the ill-posed super-resolution problem more effectively
than GAN-based frameworks. Similarly, for image rescaling, (Xiao et al., 2020) uses an invertible bijective transformation to reduce the ill-posed nature of image upscaling.
In each image or video frame, some parts are less important; by compressing them more than other parts and spending fewer bits on them, the difference from the original is hard to notice without a pixel-by-pixel comparison. Therefore, each image or video frame can be divided into perceptually significant and perceptually insignificant areas. The component that sorts the areas by significance is called a texture analyzer, which is used in many compression methods, and the inverse process, which restores the pixels, is called a texture synthesizer (Ding et al., 2021). This type of analysis/synthesis approach is used in our video compression model. The proposed method, inspired by (Minnen et al., 2018), (Kingma and Dhariwal, 2018), (Xie et al., 2021), (Shi et al., 2016), and (Cheng et al., 2020), incorporates four modules: feature enhancement, which improves the non-linear representation; an INN, which helps reduce information loss; attention squeeze, which stabilizes the training process instead of relying on an unstable sampling technique (Wang et al., 2020); and a hyperprior, which performs analysis/synthesis and entropy coding.
To validate the performance of our method, we tested the model on the YouTube UGC video compression dataset (Wang et al., 2019) and report results using the PSNR, VMAF, and SSIM quality metrics under a similar setting. The results are reported and compared both numerically and visually. The visual results demonstrate that, at the same BPP, video frames reconstructed with our method have clearer details.
The paper is organized as follows. The video coding method is described in detail in Section 2. Experimental results are presented in Section 3, followed by concluding remarks in Section 4.
2 APPROACH
The proposed method consists of four modules: feature enhancement, INN, squeeze module, and main module. The architecture is shown in Fig. 1. Let
$V = \{f_1, f_2, \dots, f_T\}$ be the video sequence and $f_t$ be the video frame $f$ at time $t$. The input frame $f$ has dimensions $(3, H, W)$. The first step, feature enhancement, adds non-linearity to each input frame and turns it into $j$ with the same dimensions $(3, H, W)$. The second step, the INN, turns $j$ into $q$ with dimensions $(3 \times 4^4, H/2^4, W/2^4)$ in the forward pass. The third step, the attention-squeeze module, turns $q$ into $y$ with dimensions $(\frac{3 \times 4^4}{\alpha}, H/2^4, W/2^4)$, in which $\alpha$
is the compression ratio. For the rest of the model, the hyperprior of (Minnen et al., 2018) is used, in which an autoregressive context model with a mean-and-scale hyperprior is presented. In the hyperprior, the quantized latent features $\hat{y}$ are parameterized with a mean-and-scale Gaussian distribution using an analysis-transform hyper encoder and a synthesis-transform hyper decoder. The hyper encoder consists of three convolution layers with Leaky ReLU activations between them. The analysis hyper encoder takes $y$ to generate the side information $z$, and the synthesis hyper decoder takes the quantized side information $\hat{z}$. The asymmetric numeral system (ANS) (Duda, 2009) is used for entropy coding. The loss for each frame is calculated based on the following equation (Minnen et al., 2018):
$L = R(\hat{y}_t) + R(\hat{z}_t) + \lambda \cdot D(f_t, \hat{f}_t) = \mathbb{E}[-\log_2 p_{\hat{y}}(\hat{y}_t)] + \mathbb{E}[-\log_2 p_{\hat{z}}(\hat{z}_t)] + \lambda \cdot D(f_t, \hat{f}_t) \quad (1)$
in which the rate $R$ is the entropy of the quantized latent features, $\lambda$ is the Lagrange multiplier whose different values correspond to different bit rates, and $D$ is the distortion term, which can be the mean squared error (MSE) for MSE optimization or $1 - \text{MS-SSIM}$ for MS-SSIM optimization (Wang et al., 2003); here it represents the mean squared error.
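As an illustration, the following is a minimal PyTorch-style sketch of this rate-distortion objective; the function name, the likelihood tensors, and the default λ value are placeholders for illustration rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def rate_distortion_loss(frame, frame_hat, y_likelihoods, z_likelihoods, lam=0.01):
    """Sketch of Eq. (1): rate of the quantized latents plus a lambda-weighted distortion.

    `y_likelihoods` / `z_likelihoods` are assumed to hold the per-element probabilities
    p_y_hat(y_hat_t) and p_z_hat(z_hat_t) produced by the entropy model; `lam` is the
    Lagrange multiplier controlling the trade-off between bit rate and distortion.
    """
    num_pixels = frame.shape[0] * frame.shape[2] * frame.shape[3]  # N * H * W
    # Rate terms: expected code length (in bits per pixel) of y_hat and z_hat.
    rate_y = torch.sum(-torch.log2(y_likelihoods)) / num_pixels
    rate_z = torch.sum(-torch.log2(z_likelihoods)) / num_pixels
    # Distortion term: mean squared error between the original and reconstructed frame.
    distortion = F.mse_loss(frame_hat, frame)
    return rate_y + rate_z + lam * distortion
```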
The decoding process is the inverse: the squeeze module copies $\hat{y}$ $\alpha$ times and reshapes it into $\hat{q}$. Then $\hat{q}$ is passed through the INN in the inverse direction and turned into $\hat{j}$, and finally, after the inverse of feature enhancement, the reconstructed video with frames $\hat{f}$ is generated. The details of each module are elaborated in the following subsections; a high-level sketch of the forward and inverse passes is given first.
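The sketch below shows how the four modules could be composed, assuming each sub-module exposes the forward/inverse interface described above; the class name, constructor arguments, and sub-module interfaces are illustrative placeholders, not the exact implementation.

```python
import torch.nn as nn

class INNVideoCompressor(nn.Module):
    """High-level sketch of the pipeline in Fig. 1 (sub-modules are placeholders)."""
    def __init__(self, feature_enhancement, inn, attention_squeeze, hyperprior):
        super().__init__()
        self.feature_enhancement = feature_enhancement  # (3, H, W) -> (3, H, W)
        self.inn = inn                                   # (3, H, W) -> (3*4^4, H/2^4, W/2^4)
        self.attention_squeeze = attention_squeeze       # channel dimension reduced by alpha
        self.hyperprior = hyperprior                     # mean-scale hyperprior of (Minnen et al., 2018)

    def forward(self, f_t):
        j = self.feature_enhancement(f_t)
        q = self.inn(j)                                   # invertible forward pass
        y = self.attention_squeeze(q)
        y_hat, z_hat, likelihoods = self.hyperprior(y)    # quantization + entropy model; z_hat is the side information
        # Inverse path: squeeze -> INN -> feature enhancement, all run in reverse.
        q_hat = self.attention_squeeze.inverse(y_hat)
        j_hat = self.inn.inverse(q_hat)
        f_t_hat = self.feature_enhancement.inverse(j_hat)
        return f_t_hat, likelihoods
```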
2.1 Feature Enhancement
The feature enhancement module is used to improve the non-linear representation of the network, because INNs often have limited non-linear representation capability (Dinh et al., 2015). This module includes a dense block (Huang et al., 2016) with an input channel size of 3 and an output channel size of 64, three convolution layers, and another dense block with an input channel size of 64 and an output channel size of 3. The convolution layers have an input channel size of 64, an output channel size of 64, stride 1, and kernel sizes of 1, 3, and 1.
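A minimal PyTorch sketch of this module follows; the internal layout of the dense blocks (growth rate, number of layers) is an assumption for illustration, since only their input and output channel sizes are specified above, and the synthesis-side inverse of the module is omitted.

```python
import torch
import torch.nn as nn

class SimpleDenseBlock(nn.Module):
    """Simplified stand-in for the dense block of (Huang et al., 2016); the layer count
    and growth rate here are illustrative assumptions, not the paper's configuration."""
    def __init__(self, in_ch, out_ch, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, kernel_size=3, padding=1),
                nn.LeakyReLU(inplace=True)))
            ch += growth  # dense connectivity: each layer sees all previous feature maps
        self.fuse = nn.Conv2d(ch, out_ch, kernel_size=1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.fuse(torch.cat(feats, dim=1))

class FeatureEnhancement(nn.Module):
    """Sketch of Section 2.1: dense block (3->64), three 64->64 stride-1 convolutions
    with kernel sizes 1/3/1, then a dense block (64->3), so the output keeps shape (3, H, W)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            SimpleDenseBlock(3, 64),
            nn.Conv2d(64, 64, kernel_size=1),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.Conv2d(64, 64, kernel_size=1),
            SimpleDenseBlock(64, 3))

    def forward(self, x):
        return self.body(x)
```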
Figure 1: Structure of our video compressor.
2.2 INN
There are two types of invertible layers in the INN module: the down-sampling layer and the coupling layer. A down-sampling layer includes a pixel shuffling layer (Shi et al., 2016) and an invertible 1×1 convolution layer (Kingma and Dhariwal, 2018). Four invertible blocks are utilized for down-sampling and up-sampling, similar to the method proposed in (Minnen et al., 2018). Each block consists of one down-sampling layer and three coupling layers. Using these four blocks, the input is spatially down-sampled by a factor of 16 (the down-sampling factor in each pixel shuffling layer is 2). The affine coupling layer (Dinh et al., 2016) is defined by the following equations:
$j^{i+1}_{t,1:d} = j^{i}_{t,1:d} \odot \exp\big(\sigma_c(g_2(j^{i}_{t,d+1:D}))\big) + h_2(j^{i}_{t,d+1:D}) \quad (2)$

$j^{i+1}_{t,d+1:D} = j^{i}_{t,d+1:D} \odot \exp\big(\sigma_c(g_1(j^{i+1}_{t,1:d}))\big) + h_1(j^{i+1}_{t,1:d}) \quad (3)$
In the above equations, $j^{i}_{t,1:D}$ is the $D$-dimensional input at time frame $t$ to the $i$-th coupling layer, which is divided into two parts with dimensions $d$ and $D-d$. The functions $h_1$, $h_2$, $g_1$, $g_2$ are stride-one convolutions with activation functions; each consists of a convolution with kernel size 3, a Leaky ReLU, a convolution with kernel size 1, a Leaky ReLU, and another convolution layer with kernel size 3. Also, $\odot$, $\exp$, and $\sigma_c$ denote the Hadamard product, the exponential function, and the centered sigmoid function, respectively.
The inverse process is similar:

$\hat{j}^{i}_{t,d+1:D} = \big(\hat{j}^{i+1}_{t,d+1:D} - h_1(\hat{j}^{i+1}_{t,1:d})\big) \oslash \exp\big(\sigma_c(g_1(\hat{j}^{i+1}_{t,1:d}))\big) \quad (4)$

$\hat{j}^{i}_{t,1:d} = \big(\hat{j}^{i+1}_{t,1:d} - h_2(\hat{j}^{i}_{t,d+1:D})\big) \oslash \exp\big(\sigma_c(g_2(\hat{j}^{i}_{t,d+1:D}))\big) \quad (5)$

where $\oslash$ denotes element-wise division.
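For concreteness, a PyTorch sketch of one affine coupling layer with its exact inverse is shown below; the specific form of the centered sigmoid σ_c and the channel split point are assumptions, and the pixel-shuffling and invertible 1×1 convolution layers are omitted.

```python
import torch
import torch.nn as nn

def subnet(in_ch, out_ch):
    """Stride-1 subnet used for h and g: conv3 -> LeakyReLU -> conv1 -> LeakyReLU -> conv3."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.LeakyReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 1), nn.LeakyReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1))

def center_sigmoid(x):
    # Assumed form of the centered sigmoid sigma_c: maps to (-1, 1) to keep exp(.) well behaved.
    return 2.0 * torch.sigmoid(x) - 1.0

class AffineCoupling(nn.Module):
    """Sketch of Eqs. (2)-(5): an affine coupling layer with an analytically exact inverse."""
    def __init__(self, channels, split):
        super().__init__()
        self.d = split                                   # split point: channels 1..d and d+1..D
        self.g1 = subnet(split, channels - split)
        self.h1 = subnet(split, channels - split)
        self.g2 = subnet(channels - split, split)
        self.h2 = subnet(channels - split, split)

    def forward(self, j):
        j1, j2 = j[:, :self.d], j[:, self.d:]
        j1_out = j1 * torch.exp(center_sigmoid(self.g2(j2))) + self.h2(j2)          # Eq. (2)
        j2_out = j2 * torch.exp(center_sigmoid(self.g1(j1_out))) + self.h1(j1_out)  # Eq. (3)
        return torch.cat([j1_out, j2_out], dim=1)

    def inverse(self, j_hat):
        j1_out, j2_out = j_hat[:, :self.d], j_hat[:, self.d:]
        j2 = (j2_out - self.h1(j1_out)) / torch.exp(center_sigmoid(self.g1(j1_out)))  # Eq. (4)
        j1 = (j1_out - self.h2(j2)) / torch.exp(center_sigmoid(self.g2(j2)))          # Eq. (5)
        return torch.cat([j1, j2], dim=1)
```

Because Eqs. (2)-(5) are algebraic inverses of each other, the coupling layer itself discards no information; information is only reduced later by the squeeze and quantization steps.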
2.3 Attention-Squeeze Module
The INN does not change the size of its input, while many of the pixels are not needed for compression. Therefore, the channel dimension of the INN's output is reduced through the squeeze layer. If the output tensor of the INN has size $(D, H, W)$, it is reshaped into $(r, \frac{D}{r}, H, W)$, in which $r$ is the compression ratio. Then, the average is taken over the first dimension to turn the tensor into size $(\frac{D}{r}, H, W)$, and it is passed to the attention module. The attention module helps the model pay more attention to challenging parts and spend fewer bits on simple parts (Cheng et al., 2020). The inverse module in the decompression phase has an attention module
and then copies the quantized tensor $r$ times and reshapes it to the size of $(D, H, W)$.
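A sketch of the channel squeeze and its inverse is shown below; the attention module of (Cheng et al., 2020) is omitted for brevity, and the class name is illustrative.

```python
import torch.nn as nn

class SqueezeChannels(nn.Module):
    """Sketch of the squeeze step of Section 2.3: the INN output of size (D, H, W) is
    reshaped to (r, D/r, H, W) and averaged over the first dimension; the inverse
    simply copies the quantized tensor r times along the channel dimension."""
    def __init__(self, r):
        super().__init__()
        self.r = r  # compression ratio

    def forward(self, q):                       # q: (N, D, H, W), D divisible by r
        n, d, h, w = q.shape
        q = q.view(n, self.r, d // self.r, h, w)
        return q.mean(dim=1)                    # -> (N, D/r, H, W)

    def inverse(self, y_hat):                   # y_hat: (N, D/r, H, W)
        # Tile the quantized tensor r times so the channel count returns to D.
        return y_hat.repeat(1, self.r, 1, 1)    # -> (N, D, H, W)
```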
3 EXPERIMENTS
Experiments are performed on a Tesla V100-SXM2-16GB GPU. The network is trained for 550 epochs with a batch size of 16, using a learning rate of 0.0001 for the first 450 epochs and 0.00001 for the rest, with the Adam optimizer (Kingma and Ba, 2015) in the PyTorch framework.
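This schedule can be expressed, for example, with a standard PyTorch optimizer and a step scheduler; `model` and `train_loader` below are placeholders for the compression network and the Vimeo-90k data loader, and the loss is assumed to be returned by the model as in Eq. (1).

```python
import torch

# Sketch of the training schedule: Adam with lr 1e-4 for the first 450 epochs
# and 1e-5 for the remaining 100, batch size 16, 550 epochs in total.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[450], gamma=0.1)

for epoch in range(550):
    for frames in train_loader:
        optimizer.zero_grad()
        loss = model(frames)   # assumed to return the rate-distortion loss of Eq. (1)
        loss.backward()
        optimizer.step()
    scheduler.step()
```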
3.1 Dataset
To train our video compression model, the Vimeo-90k dataset (Xue et al., 2019) is used, which is a large dataset developed for various video processing tasks. The dataset includes 89,800 clips with different contents in different categories. 5,000 video clips from different categories are used to train our model, and the videos are cropped to a resolution of 256 × 256 pixels.

To test the method and report the results, the YouTube UGC dataset (Wang et al., 2019) is used. It contains 1,500 video clips of 20 seconds each in different categories such as gaming, sports, animation, and lecture. Each clip is available in 4:2:0 YUV format at resolutions of 360p, 480p, 720p, and 1080p. For testing, we selected 40 videos from different categories at 720p resolution. In addition, the public Ultra Video Group (UVG) dataset (Mercat et al., 2020) is used, which includes 16 versatile 4K (3840 × 2160) video sequences. Each video is captured at 50 or 120 frames per second and is available in 4:2:0 YUV format; the 1920 × 1080 versions of the sequences are used for testing.
3.2 Evaluation Method
To measure the performance of the proposed framework, the peak signal-to-noise ratio (PSNR), video multimethod assessment fusion (VMAF), and structural similarity index measure (SSIM) quality metrics are used. A higher PSNR value indicates higher image quality (Horé and Ziou, 2010), while SSIM (Dosselmann and Yang, 2005) measures the similarity between two images and correlates better with human perception of distortion. Compared to PSNR and SSIM, VMAF is designed to predict subjective human perception of quality, which makes it a pivotal measure in real-world applications (Rassool, 2017).
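As a reference for how the first of these metrics is computed, a minimal PSNR sketch is given below; SSIM and VMAF require more involved implementations and are usually computed with existing tools, so they are not reproduced here.

```python
import torch

def psnr(reference, reconstruction, max_val=255.0):
    """Peak signal-to-noise ratio (in dB) between a reference frame and its reconstruction.
    Inputs are assumed to be tensors with pixel values in [0, max_val]."""
    mse = torch.mean((reference.float() - reconstruction.float()) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```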
Table 1: Performance comparison of the methods applied to the YouTube UGC dataset. The proposed method outperforms the others in terms of PSNR, VMAF, and SSIM.
Method PSNR VMAF SSIM
Proposed 45.5 98.9 0.982
H.264 43.9 97.5 0.975
H.265 40.6 93.4 0.96
Table 2: Performance comparison of the methods applied to the UVG dataset. The proposed method outperforms the others in terms of PSNR, VMAF, and SSIM.
Method PSNR VMAF SSIM
Proposed 43.9 97.4 0.971
H.264 43.1 94.5 0.96
H.265 41.5 91.5 0.955
3.3 Results
The average PSNR, VMAF, and SSIM of the proposed model on the UGC dataset are 45.5, 98.9, and 0.982, respectively, which outperforms H.264 (Wiegand et al., 2003) and H.265 (Sullivan et al., 2012), as can be seen in Table 1. The average bits per pixel (BPP) over all test sets in the UGC dataset is 0.6.

Fig. 2 shows samples of the outputs of our framework, H.264, and H.265 on the UGC dataset. In the first example of the lecturer (Fig. 2a), the text on the laptop is sharper and clearer with our method. In the second example (Fig. 2b), the edges of the shapes show more quantization artifacts in H.264 and are more blurry in H.265. In the third example of a television clip (Fig. 2c), the face of the person shows fewer quantization artifacts and is sharper and clearer with our method.
The average PSNR, VMAF, and SSIM of the proposed model on the UVG dataset are 43.9, 97.4, and 0.971, respectively, which outperforms H.264 and H.265, as can be seen in Table 2. The average bits per pixel (BPP) over all test samples in the UVG dataset is 0.4.

Fig. 3 shows samples of the outputs of our framework, H.264, and H.265 on the UVG dataset. In the first example of the honey bee among flowers (Fig. 3a), the output of H.264 shows more quantization artifacts and the honey bee in the output of H.265 is more blurry, whereas with our method it is clearer and sharper. In the second example (Fig. 3b), the numbers show more quantization artifacts and are more blurry in H.264 and H.265 than with our method.
4 CONCLUSION
A video compression method based on an INN has been introduced. First, a feature enhancement module is used to enhance the non-linear representation.
Figure 2: Samples of the outputs of the proposed method, H.264, and H.265. The data used here are from the UGC dataset. (a) The text on the laptop is sharper and clearer with our method. (b) The edges of the shapes show more quantization artifacts in H.264 and are more blurry in H.265. (c) The face of the person shows fewer quantization artifacts and is sharper and clearer with our method.
Figure 3: Samples of the outputs of the proposed method, H.264, and H.265. The data used here are from the UVG dataset. (a) The output of our method is sharper and clearer than those of H.264 and H.265. (b) The numbers show more quantization artifacts and are more blurry in H.264 and H.265 than with our method.
Then, an INN is used to reduce information loss. Compared with traditional auto-encoders, which lose information in the encoding process, the INN preserves the information and leads to reconstructed videos with clearer details without making the network more complex. To address the problem of unstable training in INNs, an attention-squeeze module is used, which makes the feature dimension adjustment stable and tractable.
The results of the method are reported and compared both numerically and visually. Evaluations of the proposed method on two standard public datasets show better quality (under the same setting) in terms of PSNR, VMAF, and SSIM compared to recognized methods such as H.264 and H.265. The visual comparison shows that the output of the proposed method has clearer details than the outputs of H.264 and H.265.
In future work, the method may be improved by deploying additional layer types in the feature enhancement module or by replacing the hyperprior with newer published ones and using them with the INN. The idea behind the current model can also be applied to similar tasks such as video denoising.
REFERENCES
Ardizzone, L., Kruse, J., Wirkert, S. J., Rahner, D.,
Pellegrini, E. W., Klessen, R. S., Maier-Hein, L.,
Rother, C., and Köthe, U. (2018). Analyzing inverse
problems with invertible neural networks. CoRR,
abs/1808.04730.
Ardizzone, L., Lüth, C., Kruse, J., Rother, C., and Köthe,
U. (2019). Guided image generation with conditional
invertible neural networks. CoRR, abs/1907.02392.
Cheng, Z., Sun, H., Takeuchi, M., and Katto, J. (2020).
Learned image compression with discretized gaus-
sian mixture likelihoods and attention modules. 2020
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 7936–7945.
Ding, D., Ma, Z., Chen, D., Chen, Q., Liu, Z., and Zhu,
F. (2021). Advances in video compression system us-
ing deep neural network: A review and case studies.
Proceedings of the IEEE, 109(9):1494–1520.
Dinh, L., Krueger, D., and Bengio, Y. (2015). Nice: Non-
linear independent components estimation. CoRR,
abs/1410.8516.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2016). Density
estimation using real NVP. CoRR, abs/1605.08803.
Dosselmann, R. and Yang, X. D. (2005). Existing and
emerging image quality metrics. In Canadian Confer-
ence on Electrical and Computer Engineering, 2005.,
pages 1906–1913.
Duda, J. (2009). Asymmetric numeral systems. CoRR,
abs/0902.0271.
Horé, A. and Ziou, D. (2010). Image quality metrics: Psnr
vs. ssim. In 2010 20th International Conference on
Pattern Recognition, pages 2366–2369.
Huang, G., Liu, Z., and Weinberger, K. Q. (2016).
Densely connected convolutional networks. CoRR,
abs/1608.06993.
Kingma, D. P. and Ba, J. (2015). Adam: A method for
stochastic optimization. CoRR, abs/1412.6980.
Kingma, D. P. and Dhariwal, P. (2018). Glow: Gener-
ative flow with invertible 1x1 convolutions. ArXiv,
abs/1807.03039.
Liu, D., Li, Y., Lin, J., Li, H., and Wu, F. (2021). Deep
learning-based video coding. ACM Computing Sur-
veys, 53(1):1–35.
Lugmayr, A., Danelljan, M., Gool, L. V., and Timofte, R.
(2020). Srflow: Learning the super-resolution space
with normalizing flow. CoRR, abs/2006.14200.
Mercat, A., Viitanen, M., and Vanne, J. (2020). Uvg dataset:
50/120fps 4k sequences for video codec analysis and
development. Proceedings of the 11th ACM Multime-
dia Systems Conference.
Minnen, D. C., Ballé, J., and Toderici, G. (2018). Joint au-
toregressive and hierarchical priors for learned image
compression. ArXiv, abs/1809.02736.
Networking, C. V. (2016). Cisco global cloud index: Fore-
cast and methodology, 2015-2020. white paper. Cisco
Public, San Jose, page 2016.
Rassool, R. (2017). Vmaf reproducibility: Validating a per-
ceptual practical video quality metric. In 2017 IEEE
International Symposium on Broadband Multimedia
Systems and Broadcasting (BMSB), pages 1–2.
Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P.,
Bishop, R., Rueckert, D., and Wang, Z. (2016). Real-
time single image and video super-resolution using
an efficient sub-pixel convolutional neural network.
CoRR, abs/1609.05158.
Skodras, A., Christopoulos, C., and Ebrahimi, T. (2001).
The jpeg 2000 still image compression standard. IEEE
Signal Processing Magazine, 18(5):36–58.
Sullivan, G. J., Ohm, J.-R., Han, W.-J., and Wiegand, T.
(2012). Overview of the high efficiency video coding
(hevc) standard. IEEE Transactions on Circuits and
Systems for Video Technology, 22(12):1649–1668.
Wallace, G. (1992). The jpeg still picture compression stan-
dard. IEEE Transactions on Consumer Electronics,
38(1):xviii–xxxiv.
Wang, Y., Inguva, S., and Adsumilli, B. (2019). Youtube
ugc dataset for video compression research. 2019
IEEE 21st International Workshop on Multimedia Sig-
nal Processing (MMSP), pages 1–5.
Wang, Y., Xiao, M., Liu, C., Zheng, S., and Liu, T. (2020).
Modeling lost information in lossy image compres-
sion. CoRR, abs/2006.11999.
Wang, Z., Simoncelli, E., and Bovik, A. (2003). Multiscale
structural similarity for image quality assessment. Volume 2, pages 1398–1402.
Wiegand, T., Sullivan, G., Bjontegaard, G., and Luthra, A.
(2003). Overview of the h.264/avc video coding stan-
dard. IEEE Transactions on Circuits and Systems for
Video Technology, 13(7):560–576.
Xiao, M., Zheng, S., Liu, C., Wang, Y., He, D., Ke, G.,
Bian, J., Lin, Z., and Liu, T.-Y. (2020). Invertible im-
age rescaling. ArXiv, abs/2005.05650.
Xie, Y., Cheng, K. L., and Chen, Q. (2021). Enhanced in-
vertible encoding for learned image compression. In
Proceedings of the 29th ACM International Confer-
ence on Multimedia, MM ’21, page 162–170, New
York, NY, USA. Association for Computing Machin-
ery.
Xue, T., Chen, B., Wu, J., Wei, D., and Freeman,
W. T. (2019). Video enhancement with task-oriented
flow. International Journal of Computer Vision,
127(8):1106–1125.