Andrew said:
This envelope was basically made by feeding the AC signal through a
diode, with a resistor and cap in parallel from the diode to ground.
The values of the cap and resistor have to be chosen to smooth out
the waveform sufficiently, while still responding fast enough. I had
extra buffering and amplification around this to get a solid signal.
I then sampled this signal with an A/D converter and ran tests on it in
the PIC to recognize different words. Initially it only recognized stop
and go, and did so quite well.
Congratulations (Andrew) on also making an AM demodulator, which is
what that is, but you probably knew that. It works, but there is
another way that you might want to try in software before you implement
it in hardware, and it is also a great way to explore the basics of
signal processing cheaply:
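First, for reference: Andrew's diode-plus-RC detector has a direct software analog, a rectifier followed by a "leaky peak" whose decay constant plays the role of the RC time constant. Here's a minimal Python sketch; the 1 kHz test tone, 8 kHz sample rate, and decay constant are my own illustrative choices, not numbers from the thread:

```python
import math

def envelope(samples, alpha=0.99):
    """Software analog of the diode + RC envelope detector:
    rectify, then hold a peak that decays by 'alpha' per sample
    (the digital stand-in for the cap discharging through the resistor)."""
    env = 0.0
    out = []
    for x in samples:
        rect = abs(x)          # the "diode" (full-wave here, for simplicity)
        if rect > env:
            env = rect         # cap charges quickly through the diode
        else:
            env *= alpha       # cap discharges through the resistor
        out.append(env)
    return out

# 1 kHz tone sampled at 8 kHz, amplitude 0.5
tone = [0.5 * math.sin(2 * math.pi * 1000 * n / 8000) for n in range(800)]
env = envelope(tone)
print(env[-1])  # settles near the 0.5 tone amplitude
```

Picking alpha is exactly the same trade-off Andrew describes for the cap and resistor: too close to 1 and the envelope responds sluggishly, too small and the carrier ripple comes through.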
Get a standard PC running Windows or Linux. Use a sound application to
digitize a vocabulary of your words.
Compute the discrete Fourier transform (DFT) of these time-domain
samples using the Fast Fourier Transform (FFT). You can find how to do
this by searching on Google. The math can be confusing at first, but
this is one of the most beautiful processes in all of engineering, and
it's definitely worth learning if you haven't already.
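If you want to see the machinery rather than call a library, a radix-2 Cooley-Tukey FFT fits in a dozen lines of Python. This is just a sketch for exploration (a real application would use an optimized FFT library):

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0.0] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

# Sanity check: a pure tone lands in a single frequency bin.
n = 64
tone = [cmath.exp(2j * cmath.pi * 5 * i / n) for i in range(n)]
spectrum = fft(tone)
peak = max(range(n), key=lambda k: abs(spectrum[k]))
print(peak)  # 5
```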
Once you have done that, you will have the frequency bins that Rich
Grise mentioned in his parallel post, each bin essentially representing
the energy at a particular frequency. Then, intuitively, a particular
word will have a "signature", or frequency pattern, depending on what
word it is. If your vocabulary has 16 words, you should have 16
representative signatures (frequency-domain signals).
Then normalize the signatures by regarding the height (modulus of the
corresponding component) of each frequency bin as the component of a
vector. The FFT of the word then yields a vector in N-space, where N
is the number of samples in the frequency-domain signal. You should
normalize this vector to unity (length 1) by dividing each of its
components by the length of the vector. Naturally, you compute the
length of the vector by taking the square root of its scalar product
with itself: sqrt(A*A) = sqrt(a0*a0 + a1*a1 + ... + a(n-1)*a(n-1)).
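The normalization step in code, with a deliberately tiny example vector so the arithmetic is easy to check by hand:

```python
import math

def normalize(sig):
    """Scale a signature so that, viewed as a vector in N-space,
    it has length 1: divide each component by sqrt(a0*a0 + a1*a1 + ...)."""
    length = math.sqrt(sum(a * a for a in sig))
    return [a / length for a in sig]

v = normalize([3.0, 4.0])            # length is sqrt(9 + 16) = 5
print(v)                             # [0.6, 0.8]
print(math.sqrt(sum(a * a for a in v)))  # back to length 1.0
```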
After you have normalized your vocabulary, you can normalize the
utterances as they come in using the exact same procedure: take sample,
compute DFT with N samples, regard as vector, normalize vector.
After you have this input vector, you want to guess which word was
uttered. The simplest thing you can use is a minimum-distance
algorithm. Since each of your vocabulary utterances is a vector in
N-space, your input word is also a vector in N-space, and all of these
vectors have length one due to normalization, the vector of the
uttered word will most likely have its tip closest to the tip of the
vector of the corresponding vocabulary word. You compute the distance
between each vocabulary vector and the input vector using the standard
formula for the distance between two vectors. Whichever vocabulary
word yields the smallest distance, that's the one you choose.
Naturally, if someone utters a word not in the vocabulary, you will
necessarily have a mismatch, so it might make sense to have a
threshold or thresholds.
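The whole classifier is a few lines. The threshold value and the toy three-component "signatures" below are made up for illustration; you would tune the threshold against real recordings:

```python
import math

def classify(input_vec, vocabulary, threshold=0.5):
    """Minimum-distance match: pick the vocabulary word whose unit
    vector lies closest to the (unit) input vector.  'threshold' is
    an assumed reject level for out-of-vocabulary utterances."""
    best_word, best_dist = None, float("inf")
    for word, vec in vocabulary.items():
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(input_vec, vec)))
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word if best_dist <= threshold else None

vocab = {"stop": [1.0, 0.0, 0.0], "go": [0.0, 1.0, 0.0]}
print(classify([0.9, 0.1, 0.0], vocab))   # closest to "stop"
print(classify([0.0, 0.0, 1.0], vocab))   # nothing close -> None
```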
Note that some words will be longer than others. Resist the
temptation to contract longer words in the time domain so that they
all "have the same length", for if you did that, your utterances would
sound like they were uttered by chipmunks.
To test your energy normalization, try yelling the word, then murmuring
it, and see if the matching still works. But keep in mind that, when a
person yells a word, the increase in intensity is not distributed along
the entire length of the utterance. Often, in poly-syllabic words, the
accented syllable takes the bulk of the emphasis. This illustrates
another important point: You want very clear demarcations between the
engineering aspect of what you're doing and the art aspect. Many
speech recognition companies suffered during the 1980's and 1990's
because they thought that engineering aspect would carry the day, but
there is only so much that the math can do. The rest is art.
After you get all of this working, you can do all kinds of things,
like calculating the probability of a miss, a hit, or the conditional
probabilities of a miss or hit given that a particular word was
uttered. You can also take your vocabulary vectors and determine the
degree of "orthogonality" in the utterances - the degree to which each
word is likely to be distinguishable from the other words. You do this
by taking the scalar product between every pair of vectors in your M
utterances. If the scalar product is close to unity, that's not good;
that means the words are spectrally close, like "lighting" and
"lightning". On the other hand, if it's close to zero, great!
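The pairwise check might look like this. The three-component "signatures" are invented stand-ins; real ones would come from your normalized FFT magnitudes:

```python
def confusability(vocab):
    """Scalar product of every pair of (unit) signature vectors.
    Near 1.0: the two words are spectrally close (bad).
    Near 0.0: nearly orthogonal, easy to tell apart (good)."""
    words = list(vocab)
    report = {}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            report[(w1, w2)] = sum(a * b
                                   for a, b in zip(vocab[w1], vocab[w2]))
    return report

# Hypothetical unit-length signatures:
vocab = {
    "lighting":  [0.8, 0.6, 0.0],
    "lightning": [0.6, 0.8, 0.0],
    "go":        [0.0, 0.0, 1.0],
}
for pair, dot in confusability(vocab).items():
    print(pair, round(dot, 2))  # lighting/lightning land near 0.96
```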
It should be fun to find a vocabulary with maximum spectral
orthogonality which has the semantics you want: instead of the word
"eat", say "ingest", if your vocabulary absolutely positively must
contain "meat".
Once you have written the software, you can put it in hardware. You'll
have to create data types and functions for complex numbers, but hey,
let's not kid ourselves - there was never such a thing as a non-complex
number anyway.
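For what it's worth, the complex type is small: a struct with two fields and the textbook add and multiply rules. Sketched here in Python, where complex is built in, purely to show what the hardware version needs to carry:

```python
# A minimal complex type of the sort you'd port to the PIC -- two
# fields plus the textbook arithmetic rules.
class Cplx:
    def __init__(self, re, im):
        self.re, self.im = re, im
    def __add__(self, o):
        return Cplx(self.re + o.re, self.im + o.im)
    def __mul__(self, o):
        # (a+bi)(c+di) = (ac - bd) + (ad + bc)i
        return Cplx(self.re * o.re - self.im * o.im,
                    self.re * o.im + self.im * o.re)

z = Cplx(1.0, 2.0) * Cplx(3.0, -1.0)
print(z.re, z.im)  # 5.0 5.0
```

On a PIC you would likely use fixed-point fields instead of floats, but the structure is the same.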
-Chaud Lapin-