DESIGN OF A COMPRESSIVE REMOTE IMAGING SYSTEM

COMPENSATING A HIGHLY LIGHTWEIGHT ENCODING

WITH A REFINED DECODING SCHEME

George Tzagkarakis

, Arnaud Woiselle

, Panagiotis Tsakalides

and Jean-Luc Starck

CEA/DSM, SEDI-SAp, Service d’Astrophysique, Centre de Saclay, F-91191 Gif-Sur-Yvette, France

Sagem D

efense S

ecurit

e, 95100 Argenteuil, France

Institute of Computer Science (ICS), Foundation for Research & Technology - Hellas (FORTH), Crete, Greece

Keywords:

Compressive Video Sensing, Lightweight Remote Imaging Systems.

Abstract:

Lightweight remote imaging systems have been increasingly used in surveillance and reconnaissance. Never-

theless, the limited power, processing and bandwidth resources is a major issue for the existing solutions, not

well addressed by the standard video compression techniques. On the one hand, the MPEGx family achieves

a balance between the reconstruction quality and the required bit-rate by exploiting potential intra- and inter-

frame redundancies at the encoder, but at the cost of increased memory and processing demands. On the other

hand, the M-JPEG approach consists of a computationally efﬁcient encoding process, with the drawback of

resulting in much higher bit-rates. In this paper, we cope with the growing compression ratios, required for all

remote imaging applications, by exploiting the inherent property of compressive sensing (CS), acting simul-

taneously as a sensing and compression framework. The proposed compressive video sensing (CVS) system

incorporates the advantages of a very simple CS-based encoding process, while putting the main computa-

tional burden at the decoder combining the efﬁciency of a motion compensation procedure for the extraction

of inter-frame correlations, along with an additional super-resolution step to enhance the quality of recon-

structed frames. The experimental results reveal a signiﬁcant improvement of the reconstruction quality when

compared with M-JPEG, at equal or even lower bit-rates.

1 INTRODUCTION

Modern lightweight digital sensing devices with high-

resolution signal acquisition, processing, and commu-

nication capabilities are largely based on the well-

established Shannon and Nyquist theories. Managing

ever increasing amounts of data remains a challenging

task, especially for practical applications, where de-

vices with limited processing, storage, and bandwidth

resources are involved. Moreover, the increasing de-

mand for higher acquisition rates and even improved

resolution is placing a signiﬁcant burden on the exist-

ing hardware architectures.

Video acquisition and processing posses a cen-

tral role in numerous emerging applications, such as

surveillance and reconnaissance, both at the civil-

ian and battlegroup levels, robot navigation, remote

surgery and entertainment. Most video surveillance

systems monitor actively and remotely an area of in-

terest through video streaming, or passively by storing

the captured video for future use. In addition, recent

technological advances enable the design of low-

cost devices that incorporate enhanced multimodal

sensing, processing, and communication capabilities.

At the same time, the limited resources of the

compression hardware is still a major bottleneck for

the design of lightweight remote imaging systems,

where very low data rates are strongly required to

maximize the lifetime of the system (e.g., in the case

of a terrestrial sensor network), while preserving an

increased performance. To cope with such growing

compression ratios existing video compression tech-

niques, such as the MPEGx proﬁles and the Motion

JPEG (M-JPEG) scheme, may result in poor image

quality.

The framework of compressive sensing

(CS) (Cand

es et al., 2006), acting simultane-

ously as a sensing and compression protocol, could

be exploited in the design of low-complexity on-

board remote imaging devices with reduced power

Tzagkarakis G., Woiselle A., Tsakalides P. and Starck J..

DESIGN OF A COMPRESSIVE REMOTE IMAGING SYSTEM COMPENSATING A HIGHLY LIGHTWEIGHT ENCODING WITH A REFINED

DECODING SCHEME.

DOI: 10.5220/0003842400460055

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 46-55

ISBN: 978-989-8565-03-7

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

and processing requirements. Its simplicity stems

from the linearity of the associated non-adaptive

incoherent projections, which are employed for the

representation and reconstruction of sparse signals.

Recently, the framework of compressive video

sensing (CVS) was introduced as a natural extension

proposing distinct approaches for video acquisition

using a reduced amount of data, while maintaining a

similar reconstruction performance when compared

to standard video compression techniques.

Existing CVS approaches perform separate

encoding of each frame, based on a non-overlapping

block splitting to reduce the storage and computa-

tional costs, by combining full sampling of reference

frames with CS applied on non-reference frames.

Then, at the decoder, the reconstruction is performed

separately (Stankovi

c et al., 2008), or jointly by either

considering a joint sparsity model as in (Kang and

Lu, 2009) or by designing an adaptive sparsifying

basis using neighboring blocks in previously recon-

structed frames (Do et al., 2009; Prades-Nebot et al.,

2009). The major drawbacks are that, since potential

spatio-temporal redundancies are not removed at the

encoder, the corresponding CVS methods usually

result in increased bit-rates, while also being sensitive

to the propagation of reconstruction errors along the

sequence in the case of joint decoding.

The efﬁciency of typical video compression

standards, such as the MPEGx, in achieving a good

tradeoff between the reconstruction quality and

the associated bit-rates, is primarily based on the

capability of removing potential spatio-temporal

redundancies by means of intra-frame transform

coding and inter-frame motion prediction. However,

an encoder with increased memory and processing

resources is required, which may be prohibitive in

a lightweight remote imaging system. On the other

hand, the use of M-JPEG, which is an intra-frame-

only video compression scheme, has the advantage of

imposing signiﬁcantly lower processing and memory

requirements on the hardware, but at the cost of

increasing signiﬁcantly the required bit-rate, which is

restrictive in the case of limited bandwidth.

In the present work, we address the drawbacks of

the previous CVS methods, as well as the limitations

of MPEGx and M-JPEG compression techniques, by

introducing a CVS scheme which could be integrated

in onboard video sensing devices with restricted

resources. In particular, the proposed CVS method

combines a simpliﬁed encoding process by embed-

ding a CS module in an M-JPEG-like encoder, along

with a reﬁnement phase based on inter-frame predic-

tion at the decoder (as opposed to MPEGx, where

the prediction errors are formed at the encoder). The

idea of transferring the computational burden of the

motion estimation and compensation processes at the

decoder was also appeared in (Jung and Ye, 2010)

in the framework of dynamic magnetic resonance

imaging, where an auxiliary sequence of residual

frames was generated at the decoder recursively

using a set of fully-sampled reference frames in

conjunction with the low-resolution dynamic frames.

Moreover, the required bit-rate of our proposed

encoder can be further decreased by downsampling

the non-reference frames, followed by an additional

super-resolution step at the decoder to restore the

reconstructed frames in their original resolution.

The use of super-resolution as a tool to resize the

frames in their original dimension is motivated by

recent works on sparse representation-based image

super-resolution via dictionary learning (Freeman

et al., 2002; Yang et al., 2008; Wang et al., 2011;

Zhang et al., 2011), where it has been shown that

a super-resolution method results in images with

superior quality when compared to the commonly

used 2-dimensional interpolation schemes (e.g.,

bilinear, bicubic, spline).

The paper is organized as follows: in Section 2,

the model for the compressed measurements acqui-

sition is introduced. Section 3 describes in detail

the proposed CVS architecture, while a performance

evaluation is carried out in Section 4. Finally,

conclusions and further extensions are outlined in

Section 5.

2 CS MEASUREMENTS MODEL

In the following, we consider for convenience the case

of square N × N frames, although the proposed ap-

proach is extended straightforwardly in the general

non-square case. The main disadvantage when we

deal with a remote imaging device with limited capa-

bilities, as mentioned in Section 1, is the high memory

and computational expense when we operate at high

resolutions. This drawback can be alleviated by pro-

ceeding in a block-wise fashion. More speciﬁcally, in

the proposed CVS system, each frame is divided into

equally sized B × B non-overlapping blocks. Then, a

measurement vector g

, j = 1,...,n

, is generated for

each one of the n

blocks via a simple linear model as

follows,

= Φ

, (1)

where Φ

∈ R

M×B

is a suitable measurement ma-

trix (M  B

) and x

∈ R

denotes the j-th block

of frame x, reshaped as a column vector. Although,

in general, a distinct measurement matrix can be as-

DESIGN OF A COMPRESSIVE REMOTE IMAGING SYSTEM COMPENSATING A HIGHLY LIGHTWEIGHT

ENCODING WITH A REFINED DECODING SCHEME

signed to each block, in the following we apply the

same measurement matrix on each block for simplic-

ity, that is, Φ

≡ Φ, j = 1,. .. ,n

. Thus, the measure-

ment vector g

provides directly a compressed repre-

sentation of the original space-domain block x

Common choices for the measurement matrix Φ

are random matrices with independent and identically

distributed (i.i.d.) Gaussian or Bernoulli entries. In

a remote imaging system some additional require-

ments should be posed on the choice of Φ, such as

the use of a minimal number of compressed measure-

ments, and the fast and memory efﬁcient computation

along with a “hardware-friendly” implementation. A

class of matrices satisfying these requirements, the

so-called structurally random matrices, was intro-

duced recently (Do et al., 2008). The block Walsh-

Hadamard (BWHT) operator is a typical member of

this class and will be used in the proposed design.

If the j-th block is compressible or has a K-sparse

representation in an appropriate sparsifying transform

domain, and if the measurement matrix along with

the sparsifying dictionary satisfy a sufﬁcient inco-

herence condition, then x

can be recovered from

M & O(2K logB) measurements by solving the fol-

lowing optimization problem,

min

s.t. g

= ΦΨ

−1

, (2)

where Ψ

is an appropriate sparsifying transforma-

tion, such as an orthonormal basis (e.g., discrete

cosine transform (DCT), discrete wavelet transform

(DWT)) or an overcomplete dictionary (e.g., undec-

imated DWT (UDWT)), and w

is the transform-

domain sparse representation of x

in Ψ

. If Ψ

different than the identity, then the reconstruction of

the j-th block is ﬁrst performed in the transform do-

main,

, followed by an inversion to obtain the ﬁ-

nal space-domain estimate,

= Ψ

−1

, otherwise

the space-domain solution is obtained directly, that is,

≡

. In the subsequent experimental evaluations,

and in order to be consistent with the M-JPEG com-

pression scheme to which we compare, the DCT is

used as the sparsifying dictionary.

The next section analyzes in detail the structural

components of the proposed CVS architecture, start-

ing with a brief overview of M-JPEG. The main rea-

son for choosing the M-JPEG as a benchmark method

for comparison, and not a member of the MPEGx

family, stems from the fact that both M-JPEG and

our approach are based on a very simpliﬁed encoder,

working directly in the original frame-domain, with-

out exploiting inter-frame redundancies via motion

compensation, as in MPEGx, which is proven to be

the step with the highest memory and power con-

sumption. We emphasize again that our goal is to

build a CVS system with a lightweight encoder, so as

to satisfy the constraints of a remote imaging system.

On the other hand, we consider that the reconstruction

takes place at a base station, where increased memory

and computational resources are available in practice.

3 PROPOSED CVS SYSTEM

In the following, the main building blocks of the pro-

posed CVS system are described in detail. Before

proceeding, we start with a brief overview of the basic

structure of M-JPEG, to which we compare in the rest

of the paper.

3.1 Overview of M-JPEG

M-JPEG is a lossy intra-frame compression scheme.

In particular, each frame of the video sequence is

transformed in the DCT domain, followed by the

quantization of the corresponding DCT coefﬁcients

using a perceptual model based loosely on the human

visual system. This model discards high-frequency

information since its changes are almost unperceiv-

able to the human eye. The quantized coefﬁcients are

then encoded losslessly and packed into the output

bit-stream. The building blocks of an M-JPEG sys-

tem are shown in Fig. 1

As a purely intra-frame-only compression

scheme, the performance of M-JPEG is related

directly to the spatial complexity of each video

frame. In particular, frames that contain large smooth

transitions or monotone surfaces are compressed

well, and are more likely to preserve their original

details with few visible compression artifacts. On

the contrary, frames containing structures such as

complex textures, ﬁne curves and lines are prone

to exhibit artifacts such as ringing, smudging, and

macroblocking. An advantage of M-JPEG is that it is

insensitive to the motion complexity.

Since the video frames are compressed separately,

M-JPEG imposes lower processing and memory re-

quirements on the hardware devices, when compared

with MPEGx. This justiﬁes its widespread use in digi-

tal and IP cameras, in HDTV media players and game

consoles. Although the bit-rate of M-JPEG is lower

than that of an uncompressed video, however, it is

much higher than the bit-rate of a video which is com-

pressed using inter-frame motion compensation, such

as the MPEGx does.

Motivated by the above, the proposed CVS en-

coder is designed with the goal of preserving the sim-

ple structure of an M-JPEG-based encoder, with the

VISAPP 2012 - International Conference on Computer Vision Theory and Applications

Figure 1: Structure of an M-JPEG system.

power of CS to represent the information content of a

given signal, characterized by a sparse representation

in an appropriate transform domain, using a highly

reduced set of compressed measurements.

3.2 CVS Architecture

3.2.1 Encoder

The need to satisfy the constraints of a lightweight

remote imaging system motivated the design of a

CVS encoder which combines the simplicity of the

M-JPEG approach, with the efﬁciency of CS to rep-

resent the salient information content of a given sig-

nal via a low-dimensional set of compressed measure-

ments. Fig. 2(a) shows the building components of

the proposed CVS encoder. More speciﬁcally, the in-

put video sequence is divided into successive groups

of pictures (GOPs) of the form I PP .. . P I, where I

corresponds to a reference frame (I-frame) and P to

a non-reference frame (P-frame). The I-frames are

fully sampled and compressed using JPEG, that is,

the DCT is applied on non-overlapping 8 × 8 blocks

of the original frame, followed by quantization and

encoding of the transform coefﬁcients.

Without loss of generality, in the following we

will focus on the case of grayscale videos. The quan-

tization of the DCT coefﬁcients corresponding to the

luminance component is performed as follows,

(m,n) = round



D(m,n)

S · Q(m,n)



, (3)

where D ∈ R

8×8

denotes the matrix of DCT coef-

ﬁcients for an 8 × 8 block of the original frame,

∈ R

8×8

is the corresponding matrix of the quan-

tized coefﬁcients, S is a scaling factor varying from 1

(high image quality) to 100 (low image quality), and

Q ∈ R

8×8

is the following quantization table,

Q =







8 16 19 22 26 27 29 34

16 16 22 24 27 29 34 37

19 22 26 27 29 34 34 38

22 22 26 27 29 34 37 40

22 26 27 29 32 35 40 48

26 27 29 32 35 40 48 58

26 27 29 34 38 46 56 69

27 29 35 38 46 56 69 83







(4)

Then, the quantized DCT coefﬁcients are encoded

losslessly using an improved Huffman coding scheme

using recursive splitting (Skretting et al., 1999).

In contrast to the full sampling of the I-frames,

a CS measurements acquisition process is applied on

the P-frames. In particular, a compressed measure-

ment vector is generated for each B × B block of a

P-frame. A satisfactory tradeoff between the compu-

tational complexity at the encoder and the reconstruc-

tion performance at the decoder is achieved for blocks

of size 16 × 16 and 32 × 32. A further reduction of

the transmission bit-rate can be attained by downsam-

pling a P-frame prior to the measurements acquisi-

tion, that is, the measurement model given by (1) is

generalized as follows,

= ΦD {x

} , (5)

where D{·} denotes the downsampling operator. In

the subsequent evaluations we will consider that each

P-frame is downsampled by a factor of 2. Then a uni-

form scalar quantization is applied on the CS mea-

surements, followed by Huffman coding.

From the above it can be seen that the parame-

ters which affect the performance of the encoder are:

i) the size of a GOP, ii) the scaling factor S (for the

I-frames) and the number of quantization levels (for

the P-frames), and iii) the sampling rate M/B

, where

M is the number of compressed measurements for a

block of size B × B. More speciﬁcally, the size of a

GOP is related to the motion complexity of a video se-

quence, and it should decrease (or equivalently, more

fully-sampled I-frames must be inserted) for videos

with highly varying content. Of course, as the number

of the fully-sampled I-frames increases, the required

bit-rate also increases. Regarding the quantization of

the I-frames, the image quality diminishes for increas-

ing values of S, while the quality of the P-frames im-

proves by increasing the number of quantization lev-

els of the uniform quantizer. Finally, the reconstruc-

tion quality is enhanced as the sampling rate grows,

but at the cost of higher bit-rates.

3.2.2 Decoder

In contrast to the MPEGx, whose encoding efﬁciency

is primarily based on the inter-frame prediction, the

DESIGN OF A COMPRESSIVE REMOTE IMAGING SYSTEM COMPENSATING A HIGHLY LIGHTWEIGHT

ENCODING WITH A REFINED DECODING SCHEME

lack of a motion compensation step at the proposed

encoder, as well as in M-JPEG, to account for the

temporal redundancies, affects the tradeoff between

the reconstruction quality and the associated bit-rate.

Motivated by this, and also under the assumption

that the base station, where the reconstruction takes

place, has the necessary memory and computational

resources, we transfer the tasks of motion estimation

and compensation at the decoder. Doing this, the mo-

tion compensation acts as a recursive reﬁnement pro-

cess which improves the quality of the reconstructed

P-frames, combined with a super-resolution module

to cope with their potential downsampling at the en-

coder. Fig. 2(b) shows the block-diagram of the pro-

posed CVS decoder.

Starting with the fully-sampled frames, the recon-

struction of a reference I-frame is performed using the

inverse JPEG procedure, that is, the received bits are

ﬁrst decoded, followed by inverse quantization and

inverse DCT (IDCT), resulting in an approximation

I (there will be always some loss of information due

to the quantization).

Regarding the non-reference frames, the received

bits for the current P-frame are ﬁrst decoded and de-

quantized. Then, a reconstruction CS algorithm is ap-

plied on the dequantized compressed measurements

yielding an estimate of the original frame

P, by solv-

ing (2). Speciﬁcally, the iterative hard thresholding

(IHT) algorithm (Blumensath and Davies, 2009) is

employed for the reconstruction of each block of the

current P-frame as follows,

n+1

+ Φ

− Φ

) (6)

n+1

= Ψ

−1



T {Ψ

(

n+1

)}



, (7)

where T {·} denotes a hard thresholding operator.

The algorithm terminates when a predetermined num-

ber of iterations L

max

has been reached, or if the er-

ror between successive iterations falls below a given

threshold, k

n+1

−

≤ ε.

In the subsequent analysis, the DCT is used as the

sparsifying transformation Ψ

, while the threshold for

the hard thresholding is given by (Donoho and John-

stone, 1994)

T h

= λσ

2log(B

) , (8)

where λ is a scaling factor varying usually between

3 and 5, and σ is the noise standard deviation, which

is estimated using the mean absolute deviation of the

transform coefﬁcients

n+1

= Ψ

(

n+1

) as follows,

σ =

median(|

n+1

0.6745

. (9)

Iterative Frame Reﬁnement. As it was mentioned

in Section 3.2.1, the sampling rate r = M/B

is one of

the key factors, which controls the tradeoff between

the bit-rate and the achieved reconstruction quality.

In order to satisfy the limitations of a lightweight re-

mote imaging system the CVS encoder must operate

at very low sampling rates. However, as the value of

r decreases the quality of the reconstructed frames di-

minishes rapidly, as the subsequent experimental re-

sults reveal. This deterioration can be alleviated using

an iterative reﬁnement process based on motion esti-

mation and compensation between the reconstructed

I- and P-frames at the decoder. An inter-frame com-

pensation method, such as in MPEGx, computes the

prediction errors in the original space domain as fol-

lows,

R = P − M {I} , (10)

where R is the residual frame (or equivalently, the pre-

diction error) and M {·} denotes the motion compen-

sation operator.

In our case, the decoder receives the encoded I-

frames at full resolution, and the encoded compressed

measurements, which are, nevertheless, related with

the original (possibly downsampled) P-frames. One

of the key properties of CS is that all the components

of a CS measurement vector g

are equally important.

This means that the reconstruction performance can

be practically unaffected even if some measurements

are lost or disturbed (e.g., due to channel errors). Be-

cause of this increased robustness when working di-

rectly with the CS measurements, the iterative reﬁne-

ment procedure is implemented in the CS measure-

ments domain. By assuming for convenience that the

frames involved in (10) are represented as a single

block (reshaped as a column vector), and by substitut-

ing M {I} with I

, the following expressions hold,

R =

P − D{

} ⇒

R = Φ(

P − D{

}) ⇒

R = Φ

P − ΦD{

}

(5)

⇒

error

= g − g

. (11)

Notice that in the general case where the P-frames are

downsampled at the encoder, then, the reconstructed

I-frames must be also downsampled at the decoder

before the CS measurements acquisition (g

), so

as to be consistent with the frame dimensions. This

explains the presence of the downsampling operator

D{·} in the above equations.

Given the CS measurements g and constructing

, we generate the CS measurements of the resid-

ual via (11). Then, by applying the IHT (ref. (6)-(7))

on g

error

we obtain an estimate of the residual frame

VISAPP 2012 - International Conference on Computer Vision Theory and Applications

(a) CVS encoder.

(b) CVS decoder.

Figure 2: Block-diagram of the proposed CVS architecture.

(

R), which can be used subsequently to reﬁne the es-

timate of the current P-frame,

P = D {

} +

R . (12)

The iterative reﬁnement of the P-frames is summa-

rized as follows:

= ΦD{

}

error

= g − g

error

IHT

−→

n+1

= D{

} +

The reﬁnement process terminates when a predeter-

mined number of iterations C

max

has been reached,

where a small number usually sufﬁces to achieve a

signiﬁcant improvement (in the subsequent experi-

ments it is set equal to 10). Notice also the presence

of the superscript n in the above expressions associ-

ated with the motion compensated frame. This is jus-

tiﬁed by the fact that the motion estimation and mo-

tion compensation are performed after each update

of the reconstructed P-frame (

P) and thus, resulting

in a different motion compensated frame (

). The

increased computational resources, which are now

available at the decoder, enable us to use a more ac-

curate sub-pixel motion estimation instead of using

integer steps. In the following, the motion vectors are

estimated with an accuracy of 1/4 pixel .

A further improvement of the reconstruction qual-

ity can be achieved by scanning the current GOP

Figure 3: Bi-directional GOP scanning.

(IP P ... P I) forward and backward, as shown in

Fig. 3, since for the P-frames which are placed on the

left of the GOP we expect the motion estimation to

be more accurate, and thus resulting in sparser resid-

uals, by employing the leftmost I-frame, while for the

P-frames which are placed on the right an increased

accuracy can be achieved using the rightmost I-frame.

In our implementation a single scan in both directions

is used.

Super-resolution for P-frame Resizing. As men-

tioned in Section 3.2.1, an optional downsampling of

the P-frames can be used at the proposed CVS en-

coder to further reduce the transmission workload,

when a limited bandwidth is available. In case of

downsampling, the reconstructed P-frames must be

restored in their original dimension. A straightfor-

ward approach would be to apply some of the well-

established 2-dimensional interpolation techniques,

such as a bilinear or a bicubic one. However, re-

DESIGN OF A COMPRESSIVE REMOTE IMAGING SYSTEM COMPENSATING A HIGHLY LIGHTWEIGHT

ENCODING WITH A REFINED DECODING SCHEME

cent works (Freeman et al., 2002; Yang et al., 2008;

Wang et al., 2011; Zhang et al., 2011) have shown

that a super-resolution method results in images with

superior quality. In the proposed CVS decoder we

employ a dictionary-based super-resolution approach,

as it is described in (Yang et al., 2008). In partic-

ular, the super-resolved image is generated from its

low-resolution input, whose patches are assumed to

have a sparse representation with respect to an over-

complete dictionary. Two coupled dictionaries are

trained, with the ﬁrst one (D

) corresponding to the

high-resolution patches and the second one (D

) to

the low-resolution patches. Then, the sparse represen-

tation of a low-resolution patch in the P-frame to be

resized, in terms of D

, will be used to construct the

corresponding high-resolution patch from D

In our proposed system, the initial training of D

and D

is performed using a set of arbitrary im-

ages. Notice also that the two dictionaries should be

re-trained if the downsampling factor changes. As it

was mentioned before, in the following evaluations a

factor of 2 will be used. In order to increase the adap-

tivity of the trained dictionaries on the speciﬁc content

of a given video sequence, we apply an updating pro-

cess by incorporating the patches of the reconstructed

I-frames, whose quality is superior in comparison to

the quality of the reconstructed P-frames, since they

are fully sampled at the encoder. However, we note

that the updating phase using the current implementa-

tion requires an increased amount of time, which can

be a drawback in a real-world scenario under poten-

tial time limitations. On the other hand, the design

of fast and efﬁcient dictionary updating methods is by

no means a signiﬁcant task, which can be the subject

of a separate thorough study.

4 EXPERIMENTAL EVALUATION

In this section, the performance of the proposed CVS

system is evaluated and compared with M-JPEG. For

this purpose, three videos with distinct content are

used

, namely, i) the “Akiyo”, which consists of a

static background and a slowly moving foreground,

ii) the “News”, with motion in both the background

and the foreground, and iii) the “Coastguard”, which

is characterized by a highly dynamic scene with com-

plex motion compared with the other two sequences.

For the encoding of P-frames, GOPs of size 6 and

blocks of size 16 × 16 are used, while the BWHT

is chosen as a measurement matrix. The number of

quantization levels varies from 2

to 2

(or equiva-

http://media.xiph.org/video/derf/

Table 1: Correspondence between the number of quantiza-

tion bits (CVS) and the scaling factor S (M-JPEG).

# Quantization bits

Video sequence 6 7 8

Akiyo 40 30 20

SNews 50 40 30

Coastguard 35 27 20

lently, the number of quantization bits varies from 6

to 8) and the sampling rate is ﬁxed at r = 0.10 (that

is, for each block M = 0.10 · 16

= 26 CS measure-

ments are acquired). Regarding the reﬁnement pro-

cess at the decoder, the parameters which control its

performance are set as follows: λ = 3, L

max

= 400,

ε = 10

−4

, and C

max

= 10. In order to achieve sim-

ilar bit-rates for the M-JPEG, for a fair comparison,

the value of the scaling factor S in (3) depends on the

input video sequence. More speciﬁcally, the corre-

spondence between the number of quantization bits,

controlling the bit-rate of CVS, and the value of S for

each video sequence are summarized in Table 1.

In the following, the reconstruction quality is mea-

sured in terms of the structural similarity index (SSI),

which resembles more closely the human visual per-

ception than the commonly used peak signal-to-noise

ratio (PSNR). For a given image I and its reconstruc-

tion

I the SSI is deﬁned by,

SSI =

(2µ

+ c

)(2σ

+ c

)

(µ

+ µ

+ c

)(σ

+ σ

+ c

)

, (13)

where µ

, σ

are the mean and standard deviation of

the luminance of image I (similarly for

I), σ

denotes

the correlation coefﬁcient of the two images, and c

stabilize the division with a weak denominator. In

particular, when SSI is equal to 0 the two images are

completely distinct, while when the two images are

matched perfectly SSI is equal to 1.

Fig. 4 shows the reconstruction performance aver-

aged over the ﬁrst 50 frames of each video sequence,

as a function of the required bit-rate (in Kbps), for

the proposed CVS method and an M-JPEG coding

scheme. As it can be seen, for the same bit-rates,

our CS-based video compression method achieves a

signiﬁcant improvement of the frame reconstruction

quality over the M-JPEG method. Most importantly,

this improvement is obtained using an encoder with

an even decreased computational complexity when

compared with the M-JPEG encoder, since the com-

pression of P-frames is performed via simple matrix-

vector products (ref. (5)) instead of using a transform

coding approach as the M-JPEG does.

For a visual inspection of the reconstruction qual-

ity for each one of the three video sequences, Fig. 5

shows the ﬁfth frame of the original sequence, along

VISAPP 2012 - International Conference on Computer Vision Theory and Applications

Table 2: SSI values between the original and the recon-

structed frames shown in Fig. 5.

M-JPEG CS w/o MC CS with MC

Akiyo 0.829 0.584 0.871

News 0.733 0.464 0.768

Coastguard 0.678 0.516 0.711

with the frame reconstructed using M-JPEG, as well

as with the proposed approach before and after apply-

ing the iterative reﬁnement step. First, by comparing

the third and the fourth columns, presenting the recon-

structed frames by applying separate decoding and

by employing motion estimation and compensation at

the decoder, we can see the signiﬁcant improvement

we are able to achieve via the iterative motion com-

pensation process. Moreover, the M-JPEG method is

sensitive to blocking artifacts as the number of quanti-

zation levels decreases, as it can be seen in the images

of the second column. The corresponding SSI values

between the original frames and the M-JPEG recon-

struction, along with the proposed approach without

and with the use of motion compensation are shown

in Table 2.

5 CONCLUSIONS

In the present work we introduced a compressive

video sensing method for a lightweight remote imag-

ing system. The limited memory, power, and band-

width resources of these systems necessitate the de-

sign of a very simple encoder, while putting the main

computational burden at the decoder. Motivated by

the simplicity of M-JPEG, the proposed CVS encoder

is able to achieve low bit-rates with a reduced com-

putational complexity, by combining a DCT-based

transform coding applied on the reference I-frames

with a CS-based compression applied on the (pos-

sibly downsampled) non-reference P-frames. At the

decoder, the inverse processes are followed to recon-

struct the I and P frames. However, the main draw-

back is the decreased reconstruction performance of

the P-frames when we work at very low sampling

rates. This problem is alleviated by applying an iter-

ative motion compensation at the decoder, acting as a

reﬁnement step. Finally, when the acquired P-frames

are downsampled at the encoder, a dictionary-based

super-resolution method is employed at the decoder

to restore the reconstructed frames in their original di-

mension. As a general conclusion, the proposed CVS

system yielded a superior reconstruction quality com-

pared with M-JPEG, especially at low bit-rates.

Further improvements can be achieved by imple-

Figure 4: Comparison of reconstruction performance as a

function of the bit-rate between CVS and M-JPEG.

menting a more efﬁcient CS reconstruction method,

resulting in an increased quality of the initially recon-

structed P-frames at very low sampling rates. More-

over, with the current implementation, the dictionary-

based super-resolution method cannot be applied ef-

ﬁciently in a real-time scenario, because of the delay

in updating the dictionary with previously decoded I-

frames. Thus, the design of new techniques for fast

dictionary update is necessary so as to enable the use

of the proposed method in a real-world application w-

DESIGN OF A COMPRESSIVE REMOTE IMAGING SYSTEM COMPENSATING A HIGHLY LIGHTWEIGHT

ENCODING WITH A REFINED DECODING SCHEME

Figure 5: Visual inspection of the reconstruction performance for M-JPEG and the proposed CVS scheme.

ith time limitations at the decoder. When working at

very low sampling rates, we do not expect to achieve

the superior performance of an MPEGx-like proﬁle,

which is based on the increased sparsity of the resid-

ual frames exploited at the encoder. However, as

the experimental results revealed, the reconstruction

quality of the proposed lightweight remote imaging

system is high enough so as to be effective in perform-

ing tasks such as detection and classiﬁcation, which

are also of importance in several surveillance applica-

tions, and require a more thorough study.

ACKNOWLEDGEMENTS

This work is supported by CS-ORION Marie

Curie Industry-Academia Partnerships and Pathways

(IAPP) project funded by the European Commission

in FP7 (PIAP-GA-2009-251605).

REFERENCES

Blumensath, T. and Davies, M. (2009). Iterative hard

thresholding for compressed sensing. Applied and

Computational Harmonic Analysis, 27(3):265–274.

Cand

es, E., Romberg, J., and Tao, T. (2006). Robust

uncertainty principles: exact signal reconstruction

from highly incomplete frequency information. IEEE

Trans. on Information Theory, 52:489–509.

Do, T., Chen, Y., Nguyen, D., Nguyen, N., Gan, L., and

Tran, T. (2009). Distributed compressed video sens-

ing. In 43rd Annual Conf. Inf. Sci. and Sys. (CISS’09),

Baltimore, MD.

Do, T., Tran, T., and Gan, L. (2008). Fast compressive sam-

pling with structurally random matrices. In IEEE Int.

Conf. on Acoust., Speech and Sig. Proc. (ICASSP’08),

Las Vegas, NV.

Donoho, D. and Johnstone, I. (1994). Ideal spatial adapta-

tion by wavelet shrinkage. Biometrika, 81:425–455.

Freeman, T., Jones, T., and Pasztor, E. (2002). Example-

based super-resolution. IEEE Computer Graphics and

Applications, 22(2):56–65.

Jung, H. and Ye, J.-C. (2010). Motion estimated and com-

pensated compressed sensing dynamic magnetic reso-

VISAPP 2012 - International Conference on Computer Vision Theory and Applications

nance imaging: What we can learn from video com-

pression techniques. Intl. J. of Imaging Systems and

Technology, 20(2):81–98.

Kang, L.-W. and Lu, C.-S. (2009). Distributed compressive

video sensing. In IEEE Int. Conf. on Acoust., Speech

and Sig. Proc. (ICASSP’09), Taipei.

Prades-Nebot, J., Ma, Y., and Huang, T. (2009). Distributed

video coding using compressive sampling. In Picture

Coding Symp. (PCS’09), Chicago, IL.

Skretting, K., Husøy, J. H., and Aase, S. O. (1999). Im-

proved Huffman coding using recursive splitting. In

Norwegian Signal Proc. Symp., Norway.

Stankovi

c, V., Stankovi

c, L., and Cheng, S. (2008). Com-

pressive video sampling. In European Sig. Proc. Conf.

(EUSIPCO’08), Lausanne.

Wang, P., Hu, X., Xuan, B., Mu, J., and Peng, S. (2011). Su-

per resolution reconstruction via multiple frames joint

learning. In Intl. Conf. Multimedia and Signal Proc.

(CMSP’11), Guilin, Guangxi.

Yang, J., Wright, J., Huang, T., and Ma, Y. (2008). Im-

age super-resolution as sparse representation of raw

image patches. In Computer Vision and Pattern Rec.

(CVPR’08), Anchorage, AK.

Zhang, H., Zhang, Y., and Huang, T. (2011). Efﬁcient

sparse representation based image super resolution via

dual dictionary learning. In Intl. Conf. Multimedia and

Expo (ICME’11), Barcelona.

DESIGN OF A COMPRESSIVE REMOTE IMAGING SYSTEM COMPENSATING A HIGHLY LIGHTWEIGHT

ENCODING WITH A REFINED DECODING SCHEME