Maker Pro

Pitch distortion in VoIP?

Robert Scott
I have a question that is directed at those involved in the design and
implementation of VoIP (Voice Over Internet Protocol). I want to know if the
frequency of an audio tone can be faithfully transmitted through the system. In
particular, if I call the National Institute of Standards and Technology
standard time and frequency service using VoIP and listen to their precise 500
Hz and 600 Hz tones, will the frequency of those tones as received be any more
precise than the audio sample rate of the sound card in my computer? It seems
hard to believe that the loose arrival timing of TCP/IP packets can be used to
synchronize the playback rate, unless a very long averaging period is involved.
I have measured the free-running sample rate of typical sound cards and found
them to vary from their nominal rate by as much as 0.5%, although most of them
are under 0.1%.

I can see how if a VoIP connection is maintained for more than a minute, then
perhaps some clever software could perform the averaging to determine the
difference in sample rates between the transmitting and the receiving ends, and
perhaps start to compensate by stretching or shrinking the raw data stream. But
for the first 15 seconds or so, the playback rate must be essentially
free-running, right? Anybody know for sure?
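(To make the sound-card side of this concrete, here is a small Python sketch of the kind of measurement described above; the 0.1% clock error, the 500 Hz reference, and the rates are illustrative numbers, not data from any actual card. It estimates a received tone's frequency by counting zero crossings while assuming the nominal sample rate, which is exactly how a free-running card's error shows up.)

```python
import math

def estimate_tone_hz(samples, nominal_rate_hz):
    """Estimate a tone's frequency by counting zero crossings, assuming
    the card runs at its nominal rate. Any clock error in the card shows
    up as an apparent frequency shift of a known reference tone."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    duration_s = len(samples) / nominal_rate_hz
    return crossings / (2.0 * duration_s)

# Simulate a card whose clock is 0.1% (1000 ppm) fast: it really samples
# at 44144.1 Hz, but the software assumes the nominal 44100 Hz.
true_rate = 44100.0 * 1.001
samples = [math.sin(2 * math.pi * 500.0 * n / true_rate)
           for n in range(int(true_rate * 10))]   # ~10 s of a 500 Hz tone
measured = estimate_tone_hz(samples, 44100.0)     # reads about 0.1% low
```

A precise reference like the NIST tone, if it survived the VoIP path unshifted, would calibrate the card by exactly this apparent-shift mechanism.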


Robert Scott
Ypsilanti, Michigan
 
Tim Wescott
Robert said:
I have a question that is directed at those involved in the design and
implementation of VoIP (Voice Over Internet Protocol). I want to know if the
frequency of an audio tone can be faithfully transmitted through the system. In
particular, if I call the National Institute of Standards and Technology
standard time and frequency service using VoIP and listen to their precise 500
Hz and 600 Hz tones, will the frequency of those tones as received be any more
precise than the audio sample rate of the sound card in my computer? It seems
hard to believe that the loose arrival timing of TCP/IP packets can be used to
synchronize the playback rate, unless a very long averaging period is involved.
I have measured the free-running sample rate of typical sound cards and found
them to vary from their nominal rate by as much as 0.5%, although most of them
are under 0.1%.

I can see how if a VoIP connection is maintained for more than a minute, then
perhaps some clever software could perform the averaging to determine the
difference in sample rates between the transmitting and the receiving ends, and
perhaps start to compensate by stretching or shrinking the raw data stream. But
for the first 15 seconds or so, the playback rate must be essentially
free-running, right? Anybody know for sure?


Robert Scott
Ypsilanti, Michigan

I wouldn't even trust VoIP to reproduce a tone that accurately
_assuming_ good sampling -- they compress pretty heavily, and I wouldn't
be able to tell you if the message that comes across isn't 'tone at
around 500Hz' rather than the actual samples.

I _do_ know that one of the oft-used algorithms is SPEEX. Look at the
acronym and guess what it's optimized for. You can web search for the
spec and look to see what it does to tones.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Posting from Google? See http://cfaj.freeshell.org/google/

"Applied Control Theory for Embedded Systems" came out in April.
See details at http://www.wescottdesign.com/actfes/actfes.html
 
Didi
standard time and frequency service using VoIP and listen to their precise 500
Hz and 600 Hz tones, will the frequency of those tones as received be any more
precise than the audio sample rate of the sound card in my computer?

The frequency you get should only depend on the precision of the oscillator
timing the conversion at your side. The encoded signal frequency has nothing
to do with transmission speed. Whether you can transfer the signal in real
time or not depends on the transmission speed vs. data size, of course.
It seems
hard to believe that the loose arrival timing of TCP/IP packets can be used to
synchronize the playback rate, unless a very long averaging period is involved.

Transmitting voice over TCP would not be very wise (retransmissions).
Just IP (they probably do it over UDP somehow) should be enough.
"Best delivery effort", which is how IP works, means that a lost packet
here and there will result in some clicks/noise etc., but given that
"most" packets make it, the voice will still be recognizable.

Dimiter
 
Robert Scott
I wouldn't even trust VoIP to reproduce a tone that accurately
_assuming_ good sampling -- they compress pretty heavily, and I wouldn't
be able to tell you if the message that comes across isn't 'tone at
around 500Hz' rather than the actual samples...

I tend to agree with your point and it is the point I have been promoting in
another forum. However it is refuted by direct experiementation by others in
that forum where they call up the NIST tones at 303-499-7111 on the VoIP and
simulataneously on their regular phone. They report a noticeable phase delay
between the two feeds, however they report no beat whatsoever between the two
tones. And these people are piano tuners, so they would recognize a beat if it
exists, even down to one beat in 10 seconds. So are you sure the VoIP hasn't
done something with a software phase-locked loop to generate a local correction
to the sample rate difference?
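(As a back-of-envelope check on what that listening test can actually resolve; this is my own arithmetic, not from the experimenters: hearing no beat slower than one per 10 seconds bounds the frequency offset, and the bound is already tighter than typical free-running sound-card error.)

```python
def max_offset_ppm(tone_hz, slowest_audible_beat_hz):
    # Hearing no beat slower than this rate bounds the frequency offset
    # between the two feeds: |delta_f| < beat_hz, expressed in ppm of the tone.
    return slowest_audible_beat_hz / tone_hz * 1e6

# One beat in 10 s is a 0.1 Hz beat; against a 500 Hz tone that is 200 ppm,
# well under the 0.1%-0.5% (1000-5000 ppm) free-running card error above.
bound_ppm = max_offset_ppm(500.0, 0.1)
```

So the tuners' null result, if reliable, does rule out an uncorrected free-running playback clock.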


Robert Scott
Ypsilanti, Michigan
 
Tim Wescott
Robert said:
I tend to agree with your point and it is the point I have been promoting in
another forum. However it is refuted by direct experimentation by others in
that forum where they call up the NIST tones at 303-499-7111 on the VoIP and
simultaneously on their regular phone. They report a noticeable phase delay
between the two feeds, however they report no beat whatsoever between the two
tones. And these people are piano tuners, so they would recognize a beat if it
exists, even down to one beat in 10 seconds. So are you sure the VoIP hasn't
done something with a software phase-locked loop to generate a local correction
to the sample rate difference?


Robert Scott
Ypsilanti, Michigan

Absolutely not. It sounds like my concern is quite effectively refuted,
at least for the VoIP services that the experiment was conducted with.

I really _don't_ know how VoIP works -- this means I can't say "nay",
but I can think of a lot of reasons not to say "aye" without things
like direct experimentation.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Posting from Google? See http://cfaj.freeshell.org/google/

"Applied Control Theory for Embedded Systems" came out in April.
See details at http://www.wescottdesign.com/actfes/actfes.html
 
Rich Grise
I tend to agree with your point and it is the point I have been promoting in
another forum. However it is refuted by direct experimentation by others in
that forum where they call up the NIST tones at 303-499-7111 on the VoIP and
simultaneously on their regular phone. They report a noticeable phase delay
between the two feeds, however they report no beat whatsoever between the two
tones. And these people are piano tuners, so they would recognize a beat if it
exists, even down to one beat in 10 seconds. So are you sure the VoIP hasn't
done something with a software phase-locked loop to generate a local correction
to the sample rate difference?

There is no beat, so they're on freq; but that doesn't say anything about
the phase. I could see syncing up packets to some local clock, like the
computer's RTC, just for fidelity's sake; but I wouldn't place any bets on
how many cycles of 600 Hz delay there are.

On WWV on the air, there's a voice announcement every minute or hour or
something; if that's on VOIP, it'd be interesting to compare the two voice
segments. :)

Cheers!
Rich
 
Nico Coesel
---@--- (Robert Scott) said:
I tend to agree with your point and it is the point I have been promoting in
another forum. However it is refuted by direct experimentation by others in
that forum where they call up the NIST tones at 303-499-7111 on the VoIP and
simultaneously on their regular phone. They report a noticeable phase delay
between the two feeds, however they report no beat whatsoever between the two
tones. And these people are piano tuners, so they would recognize a beat if it
exists, even down to one beat in 10 seconds. So are you sure the VoIP hasn't
done something with a software phase-locked loop to generate a local correction
to the sample rate difference?

IIRC VOIP has some means to do bitrate control between both sides so
the transmit and receive rates are sort of synchronized (there is
probably a lot of phase jitter). This is required in order not to be
forced to drop or insert audio frames (which will lead to noticeable
distortions). So the NIST tone through VOIP should -in theory- be just
as precise as the analog version.
 
Voip must be able to pass DTMF tonesets unaltered in frequency and
amplitude (not in phase), so there are some tone encoding features in
the algorithm. This is not really for dialing, but for accessing
automated answering systems and such.

I unfortunately do not know a lot of details, though. I would think
that tones other than the standard DTMF frequencies will likely be
passed with unaltered amplitude and frequency as well.

Glenn
 
Robert Scott
There is no beat, so they're on freq; but that doesn't say anything about
the phase. I could see syncing up packets to some local clock, like the
computer's RTC, just for fidelity's sake; but I wouldn't place any bets on
how many cycles of 600Hz' delay there is.

For my purposes, I don't care about a fixed phase delay. I only want to use the
NIST tones for audio frequency calibration.

I am still looking for definitive evidence that VoIP contains a mechanism to
ensure frequency accuracy. Negative evidence of deviation is not proof,
although positive evidence of deviation would be proof that there is no such
mechanism.

Robert Scott
Ypsilanti, Michigan
 
Robert Scott
Voip must be able to pass DTMF tonesets unaltered in frequency and
amplitude (not in phase), so there are some tone encoding features in
the algorithm. This is not really for dialing, but for accessing
automated answering systems and such.

I unfortunately do not know a lot of details, though. I would think
that tones other than the standard DTMF frequencies will likely be
passed with unaltered amplitude and frequency as well.

The required frequency accuracy of DTMF tones is not very strict. They would
work even if VoIP just used an uncalibrated sound card sample rate.


Robert Scott
Ypsilanti, Michigan
 
Robert Scott
IIRC VOIP has some means to do bitrate control between both sides so
the transmit and receive rates are sort of synchronized (there is
probably a lot of phase jitter). This is required in order not to be
forced to drop or insert audio frames (which will lead to noticable
distortions). So the NIST tone through VOIP should -in theory- be just
as precise as the analog version.

I'm not sure that an occasional dropped sample would be all that noticeable,
especially in speech. I can see how information on the fullness of the FIFO
could be used in a low-bandwidth feedback loop to calibrate the rate at which
data is extracted from the FIFO, but what could the bandwidth of such a loop be?
10 seconds? Surely the resulting phase jitter would be audible if that were the
case, right?
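(One way such a loop could behave; this is a toy model I made up to test the idea, not how any particular VoIP stack implements it. Drain the FIFO at a rate nudged in proportion to how far its depth sits from a target: the playback rate settles at the sender's true rate, at the cost of a small fixed depth offset, and the correction can be slow enough to be inaudible.)

```python
def simulate_fill_loop(sender_rate, nominal_rate, target_depth, gain, seconds):
    """Toy receiver: each second, `sender_rate` samples arrive and `rate`
    samples are played out; the playback rate is then corrected in
    proportion to the FIFO depth error (simple proportional control)."""
    depth = target_depth
    rate = nominal_rate
    for _ in range(seconds):
        depth += sender_rate - rate                      # net FIFO change
        rate = nominal_rate + gain * (depth - target_depth)
    return depth, rate

# Sender clock 500 ppm fast relative to an 8 kHz receiver; 200 ms target depth.
depth, rate = simulate_fill_loop(8004.0, 8000.0, 1600.0, 0.5, 60)
# rate converges to the sender's 8004 Hz; depth parks 8 samples above target.
```

In this model the steady state is reached smoothly, without ever dropping or inserting samples, which is consistent with the piano tuners hearing no beat.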


Robert Scott
Ypsilanti, Michigan
 
Iwo Mergler
Robert said:
I have a question that is directed at those involved in the design and
implementation of VoIP (Voice Over Internet Protocol). I want to know if the
frequency of an audio tone can be faithfully transmitted through the system. In
particular, if I call the National Institute of Standards and Technology
standard time and frequency service using VoIP and listen to their precise 500
Hz and 600 Hz tones, will the frequency of those tones as received be any more
precise than the audio sample rate of the sound card in my computer? It seems
hard to believe that the loose arrival timing of TCP/IP packets can be used to
synchronize the playback rate, unless a very long averaging period is involved.
I have measured the free-running sample rate of typical sound cards and found
them to vary from their nominal rate by as much as 0.5%, although most of them
are under 0.1%.

I can see how if a VoIP connection is maintained for more than a minute, then
perhaps some clever software could perform the averaging to determine the
difference in sample rates between the transmitting and the receiving ends, and
perhaps start to compensate by stretching or shrinking the raw data stream. But
for the first 15 seconds or so, the playback rate must be essentially
free-running, right? Anybody know for sure?


Robert Scott
Ypsilanti, Michigan

Whether you can reproduce a particular frequency accurately depends
on the codec. You can force a specific codec from either end, and
you could select A-law or u-law PCM, so there is no frequency encoding
involved.

The sample rates are negotiated over the network, so both ends
use the same rate, as referenced to the two local quartz crystals.
The VoIP system compensates against slow drift by inserting or
deleting a few ms of signal at a time.

In other words, if you want to calibrate against a tone over VoIP,
don't. Spare yourself the effort and just divide down your local
quartz timebase to generate the desired tone. Same accuracy.
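(Iwo's point can be put in numbers; the 12 MHz crystal and the ±50 ppm tolerance below are my own illustrative assumptions, typical of consumer crystals. Dividing a timebase by an integer leaves the relative error untouched, so a divided-down tone is exactly as accurate as the crystal itself.)

```python
def divided_tone_hz(crystal_hz, divider):
    # An integer divider scales the frequency but not the relative (ppm) error.
    return crystal_hz / divider

nominal = divided_tone_hz(12_000_000, 24_000)              # 500.0 Hz
worst = divided_tone_hz(12_000_000 * (1 + 50e-6), 24_000)  # crystal 50 ppm fast
err_ppm = (worst / nominal - 1.0) * 1e6                    # still 50 ppm
```

Since the sound card's playback is clocked from the same class of crystal, a locally generated tone and a VoIP-delivered one end up in the same accuracy class either way.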

Kind regards,

Iwo
 
Rich Grise
For my purposes, I don't care about a fixed phase delay. I only want to use the
NIST tones for audio frequency calibration.

I am still looking for definitive evidence that VoIP contains a mechanism to
ensure frequency accuracy. Negative evidence of deviation is not proof,
although positive evidence of deviaiton would be proof that there is no such
mechanism.

Well, this:
http://www.google.com/search?q="voip+specification"
turns up a lot of hits - there might be something in there. The first hit,
http://www.voipsurvival.com/VoIPSpecification.html
looks fairly comprehensive.

Good Luck!
Rich
 
Nico Coesel
---@--- (Robert Scott) said:
I'm not sure that an occasional dropped sample would be all that noticed,

In a single tone, dropping or adding a sample is very noticeable. In
speech it is not.
especially in speech. I can see how information on the fullness of the FIFO
could be used in a low-bandwidth feedback loop to calibrate the rate at which
data is extracted from the FIFO, but what could the bandwidth of such a loop be?
10 seconds? Surely the resulting phase jitter would be audible if that were the
case, right?

I don't think so. The difference is 50 to 100 ppm at most. I don't
think the human ear can hear such a small frequency variation.
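(That claim checks out arithmetically; this is a quick back-of-envelope, and the ~5 cent just-noticeable-difference figure is a rough assumption on my part, not a measured value.)

```python
import math

def beat_period_s(tone_hz, offset_ppm):
    # Two tones offset by `offset_ppm` beat at their frequency difference.
    return 1.0 / (tone_hz * offset_ppm * 1e-6)

def offset_cents(offset_ppm):
    # Musical size of the offset: 1200 * log2 of the frequency ratio.
    return 1200.0 * math.log2(1.0 + offset_ppm * 1e-6)

period = beat_period_s(600.0, 100.0)  # one beat every ~16.7 s at 100 ppm
cents = offset_cents(100.0)           # ~0.17 cents, far below a ~5 cent JND
```

So even the worst case of 100 ppm is inaudible as a pitch error, though a patient listener comparing two feeds could still hear it as a very slow beat.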
 
Don Bowey
Robert Scott
Google this: voip synchronization

Thanks for the tip. That did indeed lead to a very informative whitepaper on
synchronization, and I can see that the issue has been seriously addressed by
the designers of VoIP.

It appears that the main concern of synchronization in VoIP is voice latency.
Anything more than 150 msec. becomes a noticeable annoyance to the users. The
150 msec. is mostly the result of a "Jitter Buffer" that takes in packets at a
variable rate and puts them out at a fixed rate. Occasional anomalous delays in
the Internet can result in clicks, pops, or dropouts. But as long as these
occurrences are fairly rare, users will put up with them. Having a larger Jitter
Buffer would reduce the incidence of these dropouts even more, but doing so
would make the latency unacceptably long.

All that I understand. However I can still see where one particular concern of
mine is not addressed in these descriptions. That is the tie-in between time
synchronization and audio frequency accuracy. If the raw time-series data were
transferred uncompressed, then time-synchronization would automatically imply
audio frequency precision. But, as Tim Wescott pointed out, they use some
pretty aggressive speech compression in VoIP. If speech is compressed using
frequency-domain techniques, then it is entirely possible to have perfect time
synchronization, but still have small pitch errors in the playback. The
possible disconnect between time and pitch is most easily seen in the systems
that create time-compressed audio books without resulting in the "chipmunks"
type of pitch raising. That is an extreme example, but it illustrates the
potential disconnect. I hope to find that this type of disconnect does not
happen in VoIP, but I will have to look a little more to find out for sure.


Robert Scott
Ypsilanti, Michigan
 
Don Bowey
Thanks for the tip. That did indeed lead to a very informative whitepaper on
synchronization, and I can see that the issue has been seriously addressed by
the designers of VoIP.

It appears that the main concern of synchronization in VoIP is voice latency.
Anything more than 150 msec. becomes a noticeable annoyance to the users. The
150 msec. is mostly the result of a "Jitter Buffer" that takes in packets at a
variable rate and puts them out at a fixed rate. Occasional anomalous delays in
the Internet can result in clicks, pops, or dropouts. But as long as these
occurrences are fairly rare, users will put up with them. Having a larger Jitter
Buffer would reduce the incidence of these dropouts even more, but doing so
would make the latency unacceptably long.

All that I understand. However I can still see where one particular concern of
mine is not addressed in these descriptions. That is the tie-in between time
synchronization and audio frequency accuracy. If the raw time-series data were
transferred uncompressed, then time-synchronization would automatically imply
audio frequency precision. But, as Tim Wescott pointed out, they use some
pretty aggressive speech compression in VoIP. If speech is compressed using
frequency-domain techniques, then it is entirely possible to have perfect time
synchronization, but still have small pitch errors in the playback. The
possible disconnect between time and pitch is most easily seen in the systems
that create time-compressed audio books without resulting in the "chipmunks"
type of pitch raising. That is an extreme example, but it illustrates the
potential disconnect. I hope to find that this type of disconnect does not
happen in VoIP, but I will have to look a little more to find out for sure.


Robert Scott
Ypsilanti, Michigan

I haven't had time to study voip in great detail, but it's not unreasonable
to make some assumptions about it....

I believe VoIP uses a form of ADPCM to code to a low bit rate. There is
nothing in this process that would cause a frequency shift in the digitally
encoded signal; it only limits the analog bandwidth of the decoded signal
while causing uniform delay. However, and this is most unlikely, if every
other packet were thrown away, then the voice signal reconstructed from the
PAM samples would shift down in frequency by half.
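(A toy sketch of the general idea; this is a deliberately simplified differential PCM coder of my own, not G.726 ADPCM or any real VoIP codec. It shows why sample-domain coding cannot shift a tone's frequency: the round trip only adds bounded quantization noise, so the zero-crossing count, and hence the pitch, is preserved.)

```python
import math

def dpcm_encode(samples, step=0.05):
    """Toy differential PCM: quantize each sample's difference from the
    decoder's running reconstruction, so coding error stays bounded."""
    pred, codes = 0.0, []
    for s in samples:
        q = round((s - pred) / step)
        codes.append(q)
        pred += q * step        # track the decoder's reconstruction
    return codes

def dpcm_decode(codes, step=0.05):
    pred, out = 0.0, []
    for q in codes:
        pred += q * step
        out.append(pred)
    return out

def crossings(x):
    # Zero-crossing count: a proxy for the tone's frequency.
    return sum(1 for a, b in zip(x, x[1:]) if (a < 0) != (b < 0))

rate = 8000
tone = [math.sin(2 * math.pi * 500 * n / rate) for n in range(rate)]  # 1 s
decoded = dpcm_decode(dpcm_encode(tone))
# crossings(decoded) matches crossings(tone): time-domain coding keeps pitch.
```

Frequency-domain or parametric codecs are a different story, which is why the question about what the deployed codec actually does remains the crux of the thread.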

I see in today's financial news that Vonage has created a serious
non-technical problem by their initial stock offering methods.

Don
 