PITCH ESTIMATION OF DIFFICULT POLYPHONY SOUNDS OVERLAPPING SOME FREQUENCY COMPONENTS

Yoshiaki Tadokoro, Masanori Natsui, Yasuhiro Seto

Dept. of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi, 441-8580 Japan

Michiru Yamaguchi

Dept. of Computer Engineering, Toyama National College of Maritime Technology, Imizu, 933-0293 Japan

Keywords: pitch estimation, polyphony, unison, octave different tone, three-times tone, transcription.

Abstract: Some polyphonic sounds are difficult to handle when their pitches must be estimated for transcription. This paper proposes two new methods for estimating the pitches of such difficult polyphony. The first is based on the beat components of the polyphony, analyzed by the short-time Fourier transform (STFT). The second examines the period of the residual signal that remains after the polyphony components are eliminated by a comb filter ($H(z) = 1 - z^{-N}$). Both methods rely on the fact that there is a small frequency difference between a real sound and the ideal one.

1 INTRODUCTION

Musical transcription, the production of scores from musical sounds, is needed in the music field and in music retrieval, and it is also a significant problem in artificial intelligence (Roads, 1985), (Sterian and Wakefield, 2000), (Pollastri, 2002). Pitch estimation is the most important part of transcription, and many studies have addressed it (Roads, 1996), (Tadokoro et al, 2001, 2002, 2003). We have also proposed a distinctive pitch estimation method that eliminates the pitch and its harmonic components using cascade or parallel connections of comb filters (Tadokoro et al, 2001, 2002, 2003). One difficult problem in pitch estimation, however, has not been clearly solved: estimating the pitches of polyphony in which the frequency components of the tones overlap completely or partly. Methods based on musical rules (Kashino et al, 1996), on the assumption that power spectra add (Ueda and Hashimoto, 1997), and on a genetic algorithm (Ono et al, 1997) have been proposed for this problem, but they have drawbacks in practical use.

Figure 1 shows the spectra of a piano sound (C3: tone name C in octave 3) and a violin sound (G3). As shown in Fig.1, a musical sound is composed of a basic frequency $f_p$ (the pitch) and its harmonic components $n f_p$. The frequency ratio between adjacent tones in the 12-tone equal temperament is $2^{1/12}$. As a result, the frequency components of some polyphony overlap, as demonstrated in Fig.2. We have proposed a pitch estimation method using comb filters, as shown in Fig.3. The comb filter that eliminates the tone $p_1$ also eliminates the tones $p_2$ and $p_3$ in Fig.2, so we cannot estimate the pitches of tones $p_2$ and $p_3$. Pitch estimation for such polyphony is also an unsolved problem in other pitch estimation methods.

Figure 1: Examples of magnitude spectrum of musical sounds.

[Figure 2 diagram: tone p1 (C4) with components $f_{p1}$, $2f_{p1}$, ..., $7f_{p1}$; tone p2 (C5) with components $f_{p2}$, $2f_{p2}$, $3f_{p2}$; tone p3 (G5) with components $f_{p3}$, $2f_{p3}$.]

Figure 2: Overlap relation of frequency components of difficult polyphony.

[Figure 3 plot: magnitude of the comb filter $H_p(z) = 1 - z^{-N_p}$, with zeros at $f_p$, $2f_p$, $3f_p$.]

Figure 3: Magnitude characteristic of the comb filter.

Tadokoro Y., Natsui M., Yamaguchi M. and Seto Y. (2006).

PITCH ESTIMATION OF DIFFICULT POLYPHONY SOUNDS OVERLAPPING SOME FREQUENCY COMPONENTS.

In Proceedings of the Third International Conference on Informatics in Control, Automation and Robotics, pages 168-173

DOI: 10.5220/0001206101680173

Copyright © SciTePress

This paper presents two new methods for the pitch estimation of difficult polyphony. The first is based on the beat signals of the polyphony's frequency components, analyzed by the short-time Fourier transform (STFT), and on the difference between the starting points of the tones. Ideally the tones would start at the same time and their overlapping frequency components would coincide, but in actual polyphony there is a small time difference between the starting points and a small frequency difference between the components. From the starting-point difference and the beat signals caused by the frequency difference, we can estimate the pitches of the difficult polyphony. The second method is based on measuring the period of the residual signal, that is, the output of the comb filter that eliminates the polyphony's frequency components. The residual signal arises from the frequency difference between an ideal and a real musical sound. The periods of the residual signal give clues to the pitches of the difficult polyphony.

In this paper, we assume that the polyphony is composed of two tones in octaves 4 and 5, and that the lower pitch has already been estimated by the pitch estimation method described in section 2. The input polyphony is made from real sounds in the RWC music database (the Real World Computing Partnership in Japan). The tones have almost the same amplitude. The sampling frequency is $f_s = 44.1$ kHz.

Figure 4: Pitch estimation system using parallel connected

comb filters and minimum output.

2 PITCH ESTIMATION METHOD USING COMB FILTERS

Figure 3 shows the frequency characteristic of a comb filter ($H_p(z) = 1 - z^{-N_p}$). The comb filter has zeros at $(f_s / N_p)\,n$, where $f_s$ is the sampling frequency and $n$ is an integer. Using a comb filter, we can eliminate all components of a tone whose basic frequency $f_p$ (pitch frequency) is $f_s / N_p$. The distinguishing feature of our pitch estimation method is that it eliminates the frequency components of a musical sound using these comb filters. Conventional pitch estimation methods, by contrast, are based on extracting the pitch frequencies.
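As an illustration of the elimination property, the following Python sketch (assuming NumPy is available) applies the difference equation $y[n] = x[n] - x[n - N]$ to two synthetic signals. The helper names `comb_filter` and `rms`, the 1/k harmonic amplitudes and the test signals are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def comb_filter(x, N):
    """Apply H(z) = 1 - z^{-N}, i.e. y[n] = x[n] - x[n-N] (x[n] = 0 for n < 0)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[N:] -= x[:-N]
    return y

def rms(v):
    return float(np.sqrt(np.mean(np.square(v))))

fs = 44100.0   # sampling frequency used in the paper
N = 168        # delay quoted in the paper for the C4 comb filter
fp = fs / N    # spacing of the zeros: 262.5 Hz
n = np.arange(4096)

# Synthetic harmonic tone whose pitch sits exactly on the first zero.
tone = sum(np.cos(2 * np.pi * k * fp * n / fs) / k for k in range(1, 6))
# Single sinusoid halfway between the first and second zeros.
off_notch = np.cos(2 * np.pi * 1.5 * fp * n / fs)

# Skip the first N samples, where the delayed branch is still empty.
residual_tone = rms(comb_filter(tone, N)[N:])
residual_off = rms(comb_filter(off_notch, N)[N:])
print(residual_tone, residual_off)
```

The harmonic tone is removed down to rounding error, while the sinusoid between two zeros passes with roughly doubled amplitude, which is the behaviour sketched in Fig.3.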

Figure 4 shows one of the pitch estimation systems that we have proposed. It is composed of twelve comb filters connected in parallel, where the lowest notch frequency (zero) of each comb filter corresponds to the basic frequency of one of the tones in one octave. If the input sound is monophonic, we can estimate its pitch by finding which of the twelve parallel comb filters gives zero output. When the input sound is polyphonic, we detect the pitches by finding the comb filter with the minimum output, connecting that comb filter in cascade with the other eleven parallel comb filters, and then detecting the minimum output among those eleven, as shown in Fig.4. But if the polyphony is a difficult one of the kind shown in Fig.2, we cannot estimate any pitch other than the lowest, because the other tones are eliminated by the comb filter corresponding to the lowest pitch. A pitch estimation method for such difficult polyphony has not been developed in other pitch estimation approaches either.
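The monophonic case of the parallel structure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the note table, the rounding of $f_s / f_p$ to the nearest integer delay, the RMS output measure and the function names are all choices made for this example.

```python
import numpy as np

NOTE_NAMES = ["C4", "C#4", "D4", "D#4", "E4", "F4",
              "F#4", "G4", "G#4", "A4", "A#4", "B4"]
F_C4 = 261.62  # Hz, the C4 frequency used in the paper

def comb_filter(x, N):
    """y[n] = x[n] - x[n-N], the filter H(z) = 1 - z^{-N}."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[N:] -= x[:-N]
    return y

def note_frequencies():
    # Adjacent tones differ by the equal-temperament ratio 2^(1/12).
    return {name: F_C4 * 2.0 ** (k / 12.0) for k, name in enumerate(NOTE_NAMES)}

def estimate_pitch(x, fs):
    """Return the note whose comb filter gives the smallest RMS output."""
    outputs = {}
    for name, f in note_frequencies().items():
        N = int(round(fs / f))        # nearest integer delay (assumption)
        y = comb_filter(x, N)[N:]     # drop the start-up samples
        outputs[name] = float(np.sqrt(np.mean(y ** 2)))
    return min(outputs, key=outputs.get), outputs

fs = 44100.0
n = np.arange(8192)
f_e4 = note_frequencies()["E4"]
# Synthetic E4 tone with five harmonics of amplitude 1/k.
x = sum(np.cos(2 * np.pi * k * f_e4 * n / fs) / k for k in range(1, 6))
best, outputs = estimate_pitch(x, fs)
print(best)
```

Because the delay is an integer, the E4 filter does not reach exactly zero output here; it is only much smaller than the other eleven, which is why the sketch looks for the minimum rather than for zero.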

Figure 5: Calculation of STFTs: (a) four periods of data, (b) Hamming-windowed data, (c) STFTs every 2 ms.

3 SHORT-TIME FOURIER TRANSFORM (STFT) METHOD

3.1 Calculations of STFTs and Their Results

We calculate the STFT of four periods of data of the basic component of a musical sound, windowed by a Hamming window, every 2 ms, as shown in Fig.5. Figure 6 shows how the magnitude of each frequency component of several musical sounds changes over time, where +50 ms or -50 ms means that the tone starts 50 ms after or before the clarinet C4 tone starts, respectively.
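The framing described above can be sketched in Python with NumPy as follows. The function name `harmonic_tracks`, the use of `numpy.fft.rfft`, and the choice to read harmonic k from bin 4k are assumptions of this example; the last one follows from the frame length, since a frame of four periods gives a bin spacing of $f_p / 4$.

```python
import numpy as np

def harmonic_tracks(x, fs, fp, n_harm=7, periods=4, hop_s=0.002):
    """Magnitude of harmonics 1..n_harm over time.

    Each frame spans `periods` periods of the basic frequency fp, is
    Hamming-windowed, and starts hop_s seconds after the previous one.
    """
    x = np.asarray(x, dtype=float)
    frame_len = int(round(periods * fs / fp))
    hop = int(round(hop_s * fs))
    window = np.hamming(frame_len)
    starts = np.arange(0, len(x) - frame_len + 1, hop)
    # Bin spacing is fs / frame_len, about fp / periods, so harmonic k
    # falls near bin periods * k.
    bins = periods * np.arange(1, n_harm + 1)
    tracks = np.empty((len(starts), n_harm))
    for i, s in enumerate(starts):
        spectrum = np.abs(np.fft.rfft(x[s:s + frame_len] * window))
        tracks[i] = spectrum[bins]
    return starts / fs, tracks

fs = 44100.0
fp = 261.62                      # C4
n = np.arange(4410)              # 0.1 s of synthetic signal
x = np.cos(2 * np.pi * fp * n / fs) + 0.5 * np.cos(2 * np.pi * 3 * fp * n / fs)
times, tracks = harmonic_tracks(x, fs, fp)
print(tracks.shape)
```

Each column of `tracks` then plays the role of one numbered curve in Fig.6; for this synthetic input only the first and third columns carry significant magnitude.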

3.2 Consideration of Pitch Estimation

As mentioned in section 1, we assume that the polyphony is composed of two tones and that we know the pitch of the lower tone. From the results in Fig.6, we want to discriminate the following four cases: (1) monophony (C4) or polyphony (C4 + α), (2) polyphony (C4+C4: unison), (3) polyphony (C4+C5: octave), (4) polyphony (C4+G5: three-times tone), where we assume the lower tone is C4. We refer to these cases as (2) unison, (3) octave and (4) three-times tone.

[Figure 6 plots: magnitude versus time [s] of frequency components 1 to 7 for each panel. (a) clarinet C4; (b) clarinet C5; (c) trumpet C4; (d) trumpet C5; (e) trumpet G5; (f) clarinet C4 + trumpet C4 (-50 ms); (g) clarinet C4 + clarinet C5 (+50 ms); (h) trumpet C4 + trumpet C5 (+50 ms); (i) clarinet C4 + trumpet G5 (+50 ms).]

Figure 6: Magnitude spectrum components by STFT.

ICINCO 2006 - SIGNAL PROCESSING, SYSTEMS MODELING AND CONTROL

First, we must decide whether the input sound is monophonic or polyphonic. If we detect beat components, the sound may be polyphonic. However, some monophonic sounds also have beat components, such as trumpet G5 in Fig.6 (e). Another way to decide between monophony and polyphony is to look at the time difference between the starting points. For example, we can detect the difference of the starting points in the case of Fig.6 (f).

Next, we must determine whether the sound is a unison, an octave or a three-times tone. If almost all components of the sound are beating, the sound may be a unison, as in Fig.6 (f). If the even components are beating, as in Fig.6 (g) and (h), the sound may be an octave. If the third component and its harmonics (the 3rd, 6th, ... components) are beating, as in Fig.6 (i), the sound may be a three-times tone.
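The decision rules above can be written down as a small Python sketch. The paper gives no numerical criterion for "beating", so the modulation-depth measure, both thresholds and the function names below are hypothetical choices made for illustration. The sketch also assumes that the magnitude tracks come from the sustained part of the sound, and it ignores the starting-point information.

```python
import numpy as np

def modulation_depth(track):
    """(max - min) / (max + min) of one magnitude track; 0 means steady."""
    track = np.asarray(track, dtype=float)
    hi, lo = track.max(), track.min()
    return (hi - lo) / (hi + lo) if hi + lo > 0 else 0.0

def classify_by_beats(tracks, depth_threshold=0.2, unison_fraction=0.8):
    """tracks: array of shape (frames, components of the lower tone).

    depth_threshold and unison_fraction are hypothetical values.
    """
    tracks = np.asarray(tracks, dtype=float)
    order = np.arange(1, tracks.shape[1] + 1)
    beating = np.array([modulation_depth(col) > depth_threshold
                        for col in tracks.T])
    if not beating.any():
        return "no beats"
    if beating.mean() >= unison_fraction:           # almost all components
        return "unison"
    if beating[order % 2 == 0].all() and not beating[order % 2 == 1].any():
        return "octave"                             # only even components
    if beating[order % 3 == 0].all() and not beating[order % 3 != 0].any():
        return "three-times tone"                   # only 3rd, 6th, ...
    return "undetermined"

# Synthetic magnitude tracks: steady components and slowly beating ones.
t = np.arange(100)
steady = np.ones(len(t))
beat = 1.0 + 0.8 * np.cos(2 * np.pi * t / 50)

def make_tracks(beating_orders):
    return np.column_stack([beat if k in beating_orders else steady
                            for k in range(1, 8)])

print(classify_by_beats(make_tracks({2, 4, 6})))
print(classify_by_beats(make_tracks({3, 6})))
```

As the text notes, a result of this kind only says what the sound may be, since some monophonic tones also show beat components.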

The pitch estimation method based on beat signals needs a measurement time of about 100 to 200 ms. This is a problem when estimating the pitches of shorter sounds.
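The link between the frequency difference and the measurement time can be seen from a standard trigonometric identity (added here for illustration; it is not taken from the paper). For two overlapping components of equal amplitude at nearby frequencies $f_1$ and $f_2$:

```latex
\cos(2\pi f_1 t) + \cos(2\pi f_2 t)
  = 2\cos\bigl(\pi (f_1 - f_2)\, t\bigr)\,\cos\bigl(\pi (f_1 + f_2)\, t\bigr),
\qquad
T_{\mathrm{beat}} = \frac{1}{|f_1 - f_2|}.
```

The magnitude of the slowly varying factor repeats every $T_{\mathrm{beat}}$. For instance, a frequency difference of 5 Hz gives a beat period of 200 ms, so observing one complete beat cycle already takes that long. If two harmonic tones differ by $\Delta f$ in their basic frequencies, their $k$-th components differ by $k\,\Delta f$, so the higher components beat faster.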

4 COMB FILTER METHOD

In this method, we process an input sound with a comb filter, using the samples from 2000 (45 ms) to 3000 (68 ms) counted from the starting point of the input sound. For simplicity, we assume that the lower tone of the polyphony is C4. In this case, we must discriminate the following four cases: (1) monophony C4 or polyphony, (2) polyphony (C4+C4: unison), (3) polyphony (C4+C5: octave), (4) polyphony (C4+G5: three-times tone).

First, the input sound is passed through the comb filter C4, that is, the filter $H_p(z) = 1 - z^{-N_p}$ with $p$ = C4 and $N_p = f_s / f_p = 44.1\,[\mathrm{kHz}] / 261.62\,[\mathrm{Hz}] = 168$. Ideally, the comb filter C4 eliminates all four of the above sounds, i.e., monophony, unison, octave and three-times tone. In practice we obtain a small output signal caused by deviations from the ideal frequencies. Next, we measure the periods of the output signal of the comb filter C4. These periods give clues for discriminating the four cases.
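The numbers in this section can be checked with a few lines of Python. This is only a worked calculation; in particular, the use of truncation to arrive at the quoted value 168 is an assumption, since the paper does not say how the integer was obtained.

```python
import math

fs = 44100.0     # sampling frequency [Hz]
f_c4 = 261.62    # C4 basic frequency used in the paper [Hz]

samples_per_period = fs / f_c4    # not an integer
N_p = int(samples_per_period)     # truncation gives the quoted 168 (assumption)
notch = fs / N_p                  # lowest zero of the N_p = 168 comb filter [Hz]
offset_cents = 1200.0 * math.log2(notch / f_c4)

start_ms = 1000.0 * 2000 / fs     # time of sample 2000
end_ms = 1000.0 * 3000 / fs       # time of sample 3000

print(N_p, notch, offset_cents, start_ms, end_ms)
```

Because the delay must be an integer, the lowest zero of the filter lies a few cents away from the nominal C4 frequency, so even an ideally tuned tone would not be cancelled perfectly. The sample window 2000 to 3000 corresponds to the 45 ms and 68 ms stated in the text.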

Figure 7 shows the input waveforms ((a)-(d)) and output waveforms ((e)-(i)) of the comb filter C4 for sample numbers n = 2000-3000.

First we use the comb filter C4 with $N_p^0 = 168$, the number of samples determined from $f_s / f_p$ = 44100 [Hz] / 261.62 [Hz] (the superscript counts the filtering passes). When the monophony C4 in Fig.7(a) is filtered by the comb filter C4 with $N_p^0 = 168$, we obtain the output signal in Fig.7(e), whose amplitude is reduced to 0.04 times that of the input. Next we measure the period of the comb filter output signal (Fig.7(e)) and obtain $N_p^1 = 166$. We then filter the input sound again with the comb filter C4 with $N_p^1 = 166$, and this time we measure the period to be $N_p^2 = 167$. When we pass the input sound through the comb filter C4 with $N_p^2 = 167$, the filter output signal has the period $N_p^3 = 166$. The output waveforms of the comb filters with $N_p^2 = 167$ and $N_p^3 = 166$ are almost the same, so we determine that the input sound is monophony C4.

When the input sound of Fig.7(b) is filtered by the comb filter C4 with $N_p^0 = 168$, we obtain the filter output signal in Fig.7(f), whose amplitude is reduced to 0.16 times that of the input. From the output signal in Fig.7(f), we measure the period to be $N_p^1 = 170$. We then filter the input signal in Fig.7(b) with the comb filter C4 with $N_p^1 = 170$ and obtain the output signal in Fig.7(g), with a period of $N_p^2 = 168$ (or 167). The waveforms of Fig.7(f) and (g) are different, so we determine that the input sound of Fig.7 (b) is polyphony (C4+C4: unison).

The above is an example of discriminating monophony from polyphony with the comb filter method. However, we think that a more effective way to use a comb filter is to take two input segments from two different points in the sound and measure the output/input amplitude ratio of the comb filter C5 or G5 for each. If the input sound is monophonic, the output/input ratio is the same for the two segments. If the input sound is polyphonic, the waveforms of the two segments differ because the phase relation between the two tones changes, so the output/input ratio also changes. We can therefore determine whether an input sound is monophonic or polyphonic from the change of the output/input ratio between the two segments.
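The ratio test can be sketched in Python on synthetic signals. Everything specific here is an assumption made for illustration: the tones are sums of four cosine harmonics with 1/k amplitudes, the lower tone is placed exactly at 262.5 Hz so that its period is 168 samples, the second tone of the unison is detuned by a hypothetical 1.9 Hz, the C5 comb filter uses a delay of 84 samples, and each segment spans six periods of the lower tone.

```python
import numpy as np

def comb_filter(x, N):
    """y[n] = x[n] - x[n-N], the filter H(z) = 1 - z^{-N}."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[N:] -= x[:-N]
    return y

def rms(v):
    return float(np.sqrt(np.mean(np.square(v))))

def harmonic_tone(f0, n, fs, n_harm=4):
    return sum(np.cos(2 * np.pi * k * f0 * n / fs) / k
               for k in range(1, n_harm + 1))

def ratio_at(x, N, start, length):
    """Output/input RMS ratio of the comb filter over one segment."""
    y = comb_filter(x, N)
    seg = slice(start, start + length)
    return rms(y[seg]) / rms(x[seg])

fs = 44100.0
n = np.arange(16000)
f_low = fs / 168          # 262.5 Hz, period of exactly 168 samples
N_c5 = 84                 # delay of the C5 comb filter in this sketch
length = 6 * 168          # six periods of the lower tone

mono = harmonic_tone(f_low, n, fs)
poly = mono + harmonic_tone(f_low + 1.9, n, fs)   # detuned unison

mono_change = abs(ratio_at(mono, N_c5, 2000, length)
                  - ratio_at(mono, N_c5, 12000, length))
poly_change = abs(ratio_at(poly, N_c5, 2000, length)
                  - ratio_at(poly, N_c5, 12000, length))
print(mono_change, poly_change)
```

For the monophonic signal the ratio is the same in both segments, while for the detuned unison it changes markedly, because the components that the C5 filter passes and the components it removes beat at different rates.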

Next we consider the octave and three-times-tone polyphony. When the input sound of Fig.7(c) is passed through the comb filter C4 with $N_p^0 = 168$, we obtain the output signal of Fig.7 (h), whose amplitude is reduced to 0.5 times that of the input, and we measure a period of $N_p^1 = 86$ (87). The period $N_p^1 = 86$ corresponds to the basic period of the C5 tone, so we determine that the input sound is polyphony of the octave (C4+C5). In this case, we know that the basic frequency of the C5 tone differs from $n f_p$ to some extent, because the output amplitude of the comb filter is not greatly reduced.

[Figure 7.1 plots: input waveforms for sample numbers n = 2000-3000. (a) input: clarinet C4; (b) input: clarinet C4 + alt-sax C4; (c) input: clarinet C4 + alt-sax C5; (d) input: clarinet C4 + alt-sax G5.]

Figure 7.1: Input waveforms for comb filter C4.

[Figure 7.2 plots: comb filter output waveforms for n = 2000-3000, annotated with measured periods in samples. (e) comb filter (N=168) output, input clarinet C4: periods 166, 166; (f) comb filter (N=168) output, input clarinet C4 + alt-sax C4: periods 170, 170; (g) comb filter (N=170) output, input clarinet C4 + alt-sax C4: periods 168, 167; (h) comb filter (N=168) output, input clarinet C4 + alt-sax C5: periods 86, 87; (i) comb filter (N=168) output, input clarinet C4 + alt-sax G5: 279/5 = 56.]

Figure 7.2: Output waveforms of comb filter C4.

When we pass the input sound of Fig.7 (d) through the comb filter with $N_p^0 = 168$, we get the output signal of Fig.7 (i), whose amplitude is reduced to 0.1 times that of the input. From the output signal of Fig.7 (i), we measure a period of $N_p^1 = 56$, which corresponds to the G5 tone. The period of the output signal can be measured clearly using the autocorrelation function, as shown in Fig.8. We can then determine that the input sound is polyphony of the three-times tone (C4+G5).
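The period measurement by autocorrelation can be sketched in Python as follows. The signal is synthetic and all of its parameters are assumptions for illustration: the C4 tone has a period of exactly 168 samples, the upper tone is a single sinusoid with a hypothetical period of 56.2 samples (close to, but not exactly, one third of 168), and the lag search range of 20 to 200 samples is an arbitrary choice. The function names are not taken from the paper.

```python
import numpy as np

def comb_filter(x, N):
    """y[n] = x[n] - x[n-N], the filter H(z) = 1 - z^{-N}."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[N:] -= x[:-N]
    return y

def residual_period(y, min_lag, max_lag):
    """Lag in samples with the largest autocorrelation in [min_lag, max_lag]."""
    y = np.asarray(y, dtype=float)
    y = y - np.mean(y)
    lags = np.arange(min_lag, max_lag + 1)
    r = np.array([np.dot(y[:-lag], y[lag:]) for lag in lags])
    return int(lags[np.argmax(r)])

n = np.arange(4000)
# Lower tone: six harmonics with a period of exactly 168 samples.
c4 = sum(np.cos(2 * np.pi * k * n / 168.0) / k for k in range(1, 7))
# Upper tone: slightly detuned from three times the lower pitch.
g5 = np.cos(2 * np.pi * n / 56.2)

# Filter with N = 168 and analyse samples 2000 to 3000, as in Fig.7.
residual = comb_filter(c4 + g5, 168)[2000:3000]
period = residual_period(residual, 20, 200)
print(period)
```

In this sketch the lower tone is cancelled and only a weak copy of the upper tone remains, so the measured period points to the upper tone. The same measurement could be repeated after re-filtering with the measured period, in the spirit of the iteration described for Fig.7(e) to (g).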

By the method described above, we can discriminate the four difficult cases (monophony or polyphony, unison, octave and three-times tone).

5 CONCLUSIONS

We proposed two new methods for estimating the pitches of difficult polyphony in which all or some of the frequency components of the tones overlap, i.e., unison, octave and three-times tones. The first method uses the beat signals of the spectrum components analyzed by the STFT, which arise from a small frequency difference between the two tones. This method needs a measurement time of about 100 to 200 ms to detect the beat signals. The second method uses a comb filter. A small frequency difference between the ideal and real tones produces a small output signal from the comb filter, and we can discriminate the four difficult cases by measuring the periods of the filter output signals. The measurement time of the comb filter method is about 50 ms.

As future research, we want to test the two proposed methods on a large amount of musical sound data.

[Figure 8 plot: autocorrelation function for clarinet C4 + alt-sax G5, horizontal axis from 1 to 161, with the value 56 marked.]

Figure 8: Autocorrelation function of Fig.7-2 (i).

REFERENCES

Kashino, K., Kinoshita, T., Nakadai, K. and Tanaka, H., 1996. "Chord recognition mechanisms in the OPTIMA processing architecture for music scene analysis," Trans. IEICE of Japan, vol.J79-D-II, no.11, pp.1771-1781.

Ono, T., Saito, H. and Ozawa, S., 1997. "Mixed tones estimation for transcription using GA," Trans. SICE of Japan, vol.33, no.5, pp.417-423.

Pollastri, E., 2002. "A pitch tracking system dedicated to process singing voice for musical retrieval," Proc. of IEEE Int. Conf. on Multimedia and Expo, ICME2002.

Roads, C., 1985. "Research in music and artificial intelligence," ACM Computing Surveys, vol.17, no.2, pp.163-190.

Roads, C., 1996. "The Computer Music Tutorial," MIT Press.

Sterian, A. and Wakefield, G.H., 2000. "Musical transcription system: From sound to symbol," Proc. AAAI-2000 Workshop.

Tadokoro, Y. and Yamaguchi, M., 2001. "Pitch detection of duet song using double comb filters," Proc. of ECCTD'01, I, pp.57-60.

Tadokoro, Y., Matsumoto, W. and Yamaguchi, M., 2002. "Pitch detection of musical sounds using adaptive comb filters controlled by time delay," ICME2002, P03.

Tadokoro, Y., Morita, T. and Yamaguchi, M., 2003. "Pitch detection of musical sounds noticing minimum output of parallel connected comb filters," IEEE TENCON2003, tencon-072.

Ueda, M. and Hashimoto, S., 1997. "Blind decomposition algorithm for the sound separation," Trans. IPS of Japan, vol.38, no.1, pp.146-157.
