LOW COMPLEXITY, LOW DELAY AND SCALABLE AUDIO

CODING SCHEME BASED ON A NOVEL STATISTICAL

PERCEPTUAL QUANTIZATION PROCEDURE

esar Alonso Abad

Music Technology Group, Pompeu Fabra University, Barcelona, Spain

Miguel

Angel Mart

ın Fern

andez, Carlos Alberola L

opez

Image Processing Laboratory, University of Valladolid, Valladolid, Spain

Keywords:

Fast perceptual quantization, perceptual audio coding, Huffman, histogram, scalability, low delay.

Abstract:

In this paper we present Fast Perceptual Quantization (FPQ), a novel procedure to quantize and code audio

signals. It employs the same psychoacoustics principles used in the popular MPEG/Audio coders, but substan-

tially simpliﬁes the complexity and computational needs of the encoding process. FPQ is based on deﬁning a

hierarchy of privileged quantization values so that the masking threshold calculated through a psychoacoustic

model is leveraged to quantize the real values to the privileged ones when possible. The computational cost of

this process is very low compared to MP3’s or AAC’s quantization/coding loops. Experimental results show

that it is possible to achieve nearly transparent coding using as few as approximately 100 quantization values.

This leads to very efﬁcient bit compaction using Huffman or arithmetic coding so that nearly state-of-the-art

performance can be achieved in terms of quality/bit-rate trade-off. Since quantization and codiﬁcation (bit

compaction) procedures are completely independent here, efﬁcient scalable decoding can be achieved either

by parsing and entropy re-encoding the original quantized values or by coding the bit-planes independently

and sorting them in order of perceptual signiﬁcance. Very low delay performance is also possible to achieve,

which makes the proposed coding scheme suitable for real-time applications.

1 INTRODUCTION

In this paper we introduce a perceptual quantiza-

tion and coding scheme based on the MDCT and

the Psychoacoustic Model 1 (ISO/MPEG, 1992) that

achieves transparent coding at bit-rates comparable

to those obtained using MP3 and also provides scal-

ability and an algorithmic delay of about 5-10 ms.

Our main contribution is the quantization procedure,

namely FPQ (Fast Perceptual Quantization), which is

rather simple but nevertheless leads to very interesting

results in practice.

2 FAST PERCEPTUAL

QUANTIZATION (FPQ)

FPQ is a novel quantization procedure that takes ad-

vantage of a tolerance margin below which quantiza-

tion noise is not perceived or detected in such a way

that the statistical properties of the quantized values

make them especially suitable for efﬁcient entropy

coding. The idea behind most perceptual compres-

sion schemes is the following: given the modulus of

an N-point spectral frame —from an FFT, MDCT,

etc.— and an N point global masking threshold for

that frame, we want to quantize the spectral coefﬁ-

cients with the largest quantization step —i.e. with

the smallest number of quantization bits— that al-

lows keeping quantization noise below the masking

threshold. MPEG/Audio coders use Huffman coding

iteratively to ﬁnd the shortest codewords that quan-

tize and encode the spectral values without exceed-

ing the masking threshold for each scalefactor-band

(ISO/MPEG, 1992; Bosi et al., 1997). Other recent

approaches like the ones proposed in (Derrien et al.,

2006; Kramer et al., 2004; Wabnik et al., 2006) take

into account the statistical behavior of the quantiza-

tion noise to simplify the quantization/coding iterative

procedure.

The approach of FPQ is very related to these last

ones, but instead of modeling the statistical properties

Alonso Abad C., Ángel Martín Fernández M. and Alberola López C. (2007).

LOW COMPLEXITY, LOW DELAY AND SCALABLE AUDIO CODING SCHEME BASED ON A NOVEL STATISTICAL PERCEPTUAL QUANTIZATION

PROCEDURE.

In Proceedings of the Second International Conference on Signal Processing and Multimedia Applications, pages 31-34

DOI: 10.5220/0002140100310034

 SciTePress

of the quantization noise, the idea is to take advantage

of the masking threshold to force some interesting sta-

tistical property to occur. For example, it is very de-

sirable for Huffman or arithmetic coding purposes to

concentrate the probability of occurrence of quanti-

zation values in a small subset of “privileged” ones.

In this way, we can use less bits to encode the most

probable quantization values at the expense of using

more bits to encode those quantization values that are

rarely taken. And what is also important: coding bits

are not spent in quantization values that are actually

never taken.

0 50 100 150 200 250

−20

−10

Spectral bin number

Power (dB)

Quantized signal (FPQ)

Masking threshold

Quantization error

Figure 1: Short spectral frame of a music signal quantized

using FPQ. Note that quantization error does not exceed the

masking threshold at any point.

One simple and efﬁcient way to achieve this con-

centration of probability in a small subset of quantiza-

tion values is to quantize each spectral bin according

to the following procedure:

1. First we try the largest possible quantization step,

i.e. we try to quantize the actual value either to

zero or the maximum of the dynamic range. Typ-

ically, the dynamic range would be 0-65535 (16

quantization bits).

2. We calculate the quantization error as the absolute

value of the difference between the actual value

and the quantized one using the actual quantiza-

tion step.

3. If this error level is below the value of the mask-

ing threshold for the particular spectral bin con-

sidered, the distortion is assumed to be inaudible

and the bin is quantized with the actual quantiza-

tion step.

4. If not, we divide the quantization step by 2 and

repeat the process from the second step.

These steps are repeated for each and every spectral

value until all of them are quantized in such a way that

quantization error never exceeds the masking thresh-

old, as it is shown in Figure 1. This simple algorithm

has several advantages:

- The quantization procedure has a very low computa-

tional cost and can be implemented using a few sim-

ple operations. Only one loop is necessary to quan-

tize each value, with 15 iterations at the most (using

16 quantization bits) but about 3-4 iterations in mean.

- Entropy coding is only made once, when all the val-

ues are quantized, as opposed to MPEG/Audio quan-

tization/coding nested loops, in which Huffman cod-

ing is performed in every iteration until any of the ter-

mination conditions of the loops are met.

- Spectral bins tend to quantize to powers of 2 or mul-

tiples of powers of 2. If we use natural binary coding,

most of the values are expressed with binary words

with few ones surrounded by zeros. This sparsity is

interesting for compressing each bit-plane separately

for scalability purposes.

- Experimental results show that it is possible to

achieve transparent coding of modern music audio

signals using as low as about 100 quantization values,

which is a dramatic dimensionality reduction from the

initial 65536 possible if we use 16 quantization bits.

- Spectral bins in every frame are quantized one by

one and not at sub-band level. In other words, they are

quantized to the maximum possible frequency resolu-

tion given a window length. This allows for high cod-

ing gains even when using short analysis windows,

what is desirable for very low delay applications.

- As quantization is independent from the later en-

tropy coding procedure used, we can achieve scalable

decoding easily. For example by compressing the bit-

planes separately, taking advantage of the aforemen-

tioned sparsity of the values quantized with FPQ ex-

pressed in natural binary form, and selecting them in

order of signiﬁcance.

- Fixed bit-rate streaming can be also achieved by se-

lecting the proper number of bit-planes for each tem-

poral frame.

3 ENCODING SCHEME

To test the properties of FPQ embedded in a com-

plete coding scheme, we developed a prototype of

a simple perceptual audio compression system using

MATLAB. The scheme is shown in Figure 2. An in-

put PCM signal sampled at 44100 Hz and quantized

with 16 bits is divided into N-point 50% overlapped

frames to calculate the MDCT. N should not be very

large to avoid pre-echoes. In our experiments each

SIGMAP 2007 - International Conference on Signal Processing and Multimedia Applications

psychoacoustic

model

Huffman

codewords

side

information

compressed

signal

histogram

buffer

entropy

coding

quantization

(FPQ)

MDCT

(or DCT)

PCM

signal

Figure 2: Proposed scheme for a simple encoder that makes use of FPQ and Huffman entropy coding.

spectral frame had 256 points (N = 512). For each

spectral frame, the masking threshold or JND level

is calculated, as shown in Figure 1. This is done

through Psychoacoustic Model 1 but using the mod-

ulus of the MDCT coefﬁcients instead of making a

separate FFT analysis. The modulus of every single

bin is quantized independently using FPQ, according

to the value of the masking threshold for this particu-

lar bin. The signs are compressed separately using a

lossless scheme.

Once every bin on every frame is quantized with

FPQ, the resulting data are very highly compress-

ible by almost any entropy coder. We tried Huffman

and arithmetic coding, PNG (Deﬂate), RAR and ZIP,

achieving in all cases compression ratios between 1:7

and 1:10 for nearly transparent coding of mono sig-

nals. In other words: this simple scheme is com-

parable to MP3 compression in terms of quality/bit-

rate trade off. The decoding procedure is also very

simple: the entropy-coded spectral coefﬁcients are

uncompressed and the inverse MDCT is performed.

Note that the Huffman codewords are the only side

information needed for decoding.

4 EXPERIMENTAL RESULTS

4.1 Non-real-time Performance

When low delay is not important, we can use the

buffer in Figure 2 to calculate the histogram and there-

fore the optimum codewords for entropy coding the

signal considered. As it has been said, FPQ quantizes

the modulus of spectral audio coefﬁcients in such a

way that a few quantization values —say about 100—

are enough to cover the whole dynamic range with-

out introducing audible distortion. It has been found

experimentally that the probability distribution of the

quantized spectral values for all the music signals we

used for testing is rather similar, as shown in Figure 3.

In all cases approximately the same few values (most

of them powers of two and multiples of powers of

two) are largely the most probable.

To measure the compressibility of the coefﬁcients

0 1 2 3 4 5 6

x 10

Possible quantization values (only 87 are actually taken)

Rock

0 1 2 3 4 5 6

x 10

Possible quantization values (only 111 are actually taken)

Pop

0 1 2 3 4 5 6

x 10

Possible quantization values (only 56 are actually taken)

Techno

0 1 2 3 4 5 6

x 10

Possible quantization values (only 79 are actually taken)

Classic

Figure 3: Histogram of the spectral quantization values for

nearly transparent coding using FPQ for four 8-second mu-

sical excerpts named Rock, Pop, Techno, and Classic. ac-

cording to their music style. Values are quantized using 16

bits (the dynamic range is therefore 0-65535). Vertical axis

is in logarithmic scale and normalized to 1 for better visu-

alization.

quantized using FPQ and the performance of the en-

tire prototype of the encoder here proposed, we used

several 8-second music excerpts, representative of

various music styles. First we compressed them us-

ing MP3 and AAC coders at the less possible bit-

rate which produced nearly transparent coding, that

is, distortion which is perhaps slightly perceptible,

but not annoying. Then we used the MATLAB pro-

totype based on FPQ described in section 3 to ob-

tain the corresponding quantized MDCT coefﬁcients

for each signal. The global masking threshold level

used for quantization was adjusted in order to obtain

decoded audio ﬁles with a similar quality as those

coded with MP3 and AAC in the previous step. Fi-

nally we compressed the resulting quantized spectral

LOW COMPLEXITY, LOW DELAY AND SCALABLE AUDIO CODING SCHEME BASED ON A NOVEL

STATISTICAL PERCEPTUAL QUANTIZATION PROCEDURE

coefﬁcients using Huffman, arithmetic, RAR and ZIP

entropy coders. The resulting ﬁle lengths were very

close to those obtained with MP3 coding in all cases.

AAC compression was signiﬁcantly better for similar

audio quality.

We used also another method to entropy code the

quantized values for scalability purposes. Taking ad-

vantage of the sparse structure that quantized values

using FPQ present, we compressed each bit-plane us-

ing PNG for binary images. The resulting ﬁle lengths

obtained were between 5% and 23% bigger than those

obtained with MP3. Nevertheless, the ﬁles thus com-

pressed have the advantage of being able to gener-

ate scalable bit-streams without any additional com-

putational cost. It is possible to achieve ﬁne-grain

scalability by further subdividing the bit-planes into

smaller binary images. Experiments show that grace-

ful degradation is obtained by successively removing

the less signiﬁcant bit-planes.

4.2 Real-time Performance

When low-delay is required each frame must be coded

and sent immediately. Using 512-point temporal win-

dows for the MDCT, the minimum algorithmic delay

is 11.6 ms for a sampling rate of 44100 Hz. Using

DCT with non-overlapped 256-point windows, delay

can be reduced to 5.8 ms. As long as FPQ and Huff-

man coding are very low costing operations, only the

psychoacoustic model would increase these delay val-

ues signiﬁcantly. However, as FPQ uses the masking

threshold to the maximum possible frequency resolu-

tion for a given a window length, it is expected that

even shorter windows could be used while maintain-

ing acceptable coding gains.

The price we have to pay for low-delay perfor-

mance is that the Huffman codewords must be actual-

ized and sent to the decoder each and every time new

quantization values appear in the incoming frames.

After an initial start-up stage in which Huffman code-

words must be sent for every frame, the histogram be-

comes approximately stable. In this steady-state sce-

nario, the same codewords are nearly optimal for the

frames transmitted thereafter. If new values appear,

a new Huffman table including these values must be

generated. Fortunately, this happens only occasion-

ally after the ﬁrst frames are coded and sent, and the

overload penalty is only an increase of 10-15% in

the amount of information that must be transmitted.

Another possibility is to predeﬁne a ﬁxed “standard”

Huffman table for all the cases, taking into account

that histograms of very different signals are actually

not so different, as shown in Figure 3: approximately

the same values (powers of 2 and multiples of powers

of 2) are the most probable in all cases.

5 CONCLUSIONS

In this paper, a novel perceptual quantization method

for audio coding called FPQ has been proposed. This

method is an alternative to those used in the most im-

portant perceptual audio coders and has several ad-

vantages with respect to them: it is much simpler and

computationally inexpensive; it can produce scalable

bit-streams with no additional computational cost and

it is suitable for real-time applications.

For the purpose of measuring the FPQ capabili-

ties, a simple audio compression prototype was built

in MATLAB. The results of the experiments showed

that the quantized spectral values given by FPQ are

very well suited for compression using almost any en-

tropy coding system. Compression rates comparable

to those obtained using MP3 coding for the same au-

dio quality were achieved. Very low delay and scal-

ability is also achievable at the expense of slightly

higher bit-rates. It is expected that the compression

capacity of the system will be signiﬁcantly improved

using more reﬁned psychoacoustic models and en-

tropy coders, while keeping the advantages of being

simple, computationally inexpensive and scalable.

REFERENCES

Bosi, M., Brandenburg, K., Quackenbush, S., Fielder, L.,

Akagiri, K., Fuchs, H., Dietz, M., Herre, J., David-

son, G., and Oikawa, Y. (1997). ISO/IEC MPEG-2 ad-

vanced audio coding. J. Audio Eng. Soc., 45(10):789–

814.

Derrien, O., Duhamel, P., Charbit, M., and Richard, G.

(2006). A New Quantization Optimization Algorithm

for the MPEG Advanced Audio Coder Using a Statis-

tical Subband Model of the Quantization Noise. IEEE

Transactions on Audio, Speech and Language Pro-

cessing, 14(4):1328–1339.

ISO/MPEG (1992). Information technology—Coding of

moving pictures and associated audio for digital stor-

age media at up to about 1.5 Mbit/s—Part 3: Audio.

IS11172-3 (MPEG-1).

Kramer, U., Schuller, G., Wabnik, S., Klier, J., and

Hirschfeld, J. (2004). Ultra Low Delay audio coding

with constant bit rate. 117th AES Convention.

Wabnik, S., Schuller, G., Hirschfeld, J., and Kraemer, U.

(2006). Different quantisation noise shaping methods

for predictive audio coding. IEEE International Con-

ference on Acoustics, Speech and Signal Processing

(ICASSP), Toulouse.

SIGMAP 2007 - International Conference on Signal Processing and Multimedia Applications