I didn't think he was trying to use a human in this instance; I assumed
that the IVR is playing the exact same speech each time. So would you
not be able to do a cross-correlation?
Could you put a test mode in your IVR?
Perhaps have it respond with something easy to detect like DTMF?
...or perhaps figure out a way to subtract the one recording from the
other. Once you compensate for some gain adjustment and phase offset, the
result should be close to silence. Calculate the amplitude of the result
and check that it is low.
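
For concreteness, the cross-correlate, subtract and measure idea can be sketched roughly as follows. This is only an illustration under strong assumptions: the IVR replays a sample-identical prompt, the only differences are a pure delay, a single gain factor and a little line noise, and white noise stands in for the prompt. The names best_lag and residual_db are made up for this sketch.

```python
import math
import random

def best_lag(ref, rec, max_lag):
    """Lag (in samples) at which rec best matches ref, by brute-force cross-correlation."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(r * rec[n + lag]
                    for n, r in enumerate(ref)
                    if 0 <= n + lag < len(rec))
        if score > best_score:
            best, best_score = lag, score
    return best

def residual_db(ref, rec, max_lag):
    """Align rec to ref, fit one gain, subtract; return (lag, residual in dB relative to ref)."""
    lag = best_lag(ref, rec, max_lag)
    pairs = [(r, rec[n + lag]) for n, r in enumerate(ref)
             if 0 <= n + lag < len(rec)]
    # Least-squares gain that minimises sum((r - gain * s) ** 2).
    num = sum(r * s for r, s in pairs)
    den = sum(s * s for _, s in pairs)
    gain = num / den if den else 0.0
    err = sum((r - gain * s) ** 2 for r, s in pairs)
    energy = sum(r * r for r, _ in pairs)
    return lag, 10 * math.log10((err + 1e-12) / (energy + 1e-12))

rng = random.Random(0)
prompt = [rng.uniform(-1, 1) for _ in range(400)]        # synthetic stand-in for the prompt
line_noise = [rng.gauss(0, 0.001) for _ in range(460)]   # assumed small channel noise
same = [0.0] * 37 + [0.5 * x for x in prompt] + [0.0] * 23
same = [s + e for s, e in zip(same, line_noise)]         # delayed, attenuated, noisy copy
other = [rng.uniform(-1, 1) for _ in range(460)]         # an unrelated recording

same_lag, same_db = residual_db(prompt, same, max_lag=50)
other_lag, other_db = residual_db(prompt, other, max_lag=50)
print(same_lag, round(same_db, 1), round(other_db, 1))
```

With the delayed copy the residual comes out far below the reference level, while the unrelated recording leaves a residual about as loud as the reference. Note that this only works because the copy really is the same samples, shifted and scaled.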
There is a simple problem with this: there is no way to adjust the phase,
because phase only makes sense in the context of periodic signals. A
time-domain signal like the one above is not periodic, but one can pluck
components out of the frequency domain of each signal and look at their
phases.
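
To illustrate what "looking at the phases of frequency-domain components" means, here is a small sketch. It assumes a frame of N samples that is exactly periodic in N and a circular delay of DELAY samples, which is an idealization chosen so the arithmetic is exact; dft_bin and phase_shift are hypothetical helper names.

```python
import cmath
import math

def dft_bin(x, k):
    """Single DFT coefficient X[k] of the frame x, computed by the direct sum."""
    n_total = len(x)
    return sum(v * cmath.exp(-2j * math.pi * k * n / n_total)
               for n, v in enumerate(x))

N = 256      # frame length (assumed)
DELAY = 5    # circular delay in samples (assumed)

# Two sinusoidal components, at bins 8 and 20, both periodic in the frame.
frame1 = [math.sin(2 * math.pi * 8 * n / N)
          + 0.5 * math.sin(2 * math.pi * 20 * n / N + 0.3)
          for n in range(N)]
frame2 = [frame1[(n - DELAY) % N] for n in range(N)]  # delayed copy of frame1

def phase_shift(k):
    """Phase of component k in frame2 relative to the same component in frame1."""
    return cmath.phase(dft_bin(frame2, k) / dft_bin(frame1, k))

shift8 = phase_shift(8)
shift20 = phase_shift(20)
# For a pure delay, each component rotates by -2*pi*k*DELAY/N radians.
print(shift8, shift20)
```

Each component has its own phase shift, proportional to its frequency, so there is no single "phase offset" for the whole time-domain signal.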
In other words, if a speaker is offered $100 US to create the same
sampled digital signal, more or less, by speaking into the IVR, such
that merely shifting signal2 a bit relative to signal1 gets the signals
properly aligned for comparison, he or she will fail. The reason is
that, even at the relatively low sample rate of 8 kHz, no human is able
to begin speaking at just the right instant, let alone control the
physiology of the speech path well enough to generate more or less the
exact same signal. Any attempt to find out when a signal begins is
hopeless in the time domain. Is it the first non-zero sample? The
second? Third? Is that noise or voice? Is it when the "hump" is really
high? Almost really high? One cannot know.
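
A toy experiment makes the onset ambiguity visible. The sketch below is built on assumptions of my own choosing: a synthetic 200 Hz tone at 8 kHz whose amplitude ramps up over 50 ms, a uniform noise floor of +/-0.01, and three arbitrary thresholds. make_utterance and first_above are hypothetical names, not anything from the OP's system.

```python
import math
import random

FS = 8000          # sample rate in Hz
TRUE_START = 800   # sample index where the synthetic "voice" starts ramping up

def make_utterance(seed=1):
    """Noise floor plus a 200 Hz tone whose envelope ramps from 0 to 1 over 400 samples."""
    rng = random.Random(seed)
    x = []
    for n in range(1600):
        noise = rng.uniform(-0.01, 0.01)
        env = min(max((n - TRUE_START) / 400.0, 0.0), 1.0)
        x.append(env * math.sin(math.pi * n / 20.0) + noise)
    return x

def first_above(x, threshold):
    """Index of the first sample whose magnitude exceeds threshold, or None."""
    for n, v in enumerate(x):
        if abs(v) > threshold:
            return n
    return None

signal = make_utterance()
onsets = [first_above(signal, t) for t in (0.02, 0.1, 0.5)]
spread_ms = 1000.0 * (onsets[-1] - onsets[0]) / FS
print(onsets, spread_ms)
```

Every threshold reports a different "start", none of them the true one, and the estimates are spread over many milliseconds, which is a lot of samples at 8 kHz.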
This is a classical problem in speech recognition and related areas. I
responded to the OP in comp.dsp with an outline of what he needs to do:
http://tinyurl.com/4568b3
-Le Chaud Lapin-