Andrew said:
This envelope was basically made by feeding the AC signal through a
diode, with a resistor and cap in parallel from the diode to ground.
The values of the cap and resistor have to be chosen to smooth out
the waveform sufficiently, while still responding fast enough. I had
extra buffering and amplification around this to get a solid signal.
I then sampled this signal with an A/D converter and ran tests on it in
the PIC to recognize different words. Initially it only recognized stop
and go, and did so quite well.
Congratulations (Andrew) on also making an AM demodulator, which is
what that is, but you probably knew that. It works, but there is
another way that you might want to try in software before you implement
it in hardware, and it is also a great way to explore the basics of
signal processing cheaply:
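First, for reference: Andrew's diode-plus-RC detector has a direct software analog, a rectifier followed by a "leaky peak" whose decay constant plays the role of the RC time constant. Here's a minimal Python sketch; the 1 kHz test tone, 8 kHz sample rate, and decay constant are my own illustrative choices, not numbers from the thread:

```python
import math

def envelope(samples, alpha=0.99):
    """Software analog of the diode + RC envelope detector:
    rectify, then hold a peak that decays by 'alpha' per sample
    (the digital stand-in for the cap discharging through the resistor)."""
    env = 0.0
    out = []
    for x in samples:
        rect = abs(x)          # the "diode" (full-wave here, for simplicity)
        if rect > env:
            env = rect         # cap charges quickly through the diode
        else:
            env *= alpha       # cap discharges through the resistor
        out.append(env)
    return out

# 1 kHz tone sampled at 8 kHz, amplitude 0.5
tone = [0.5 * math.sin(2 * math.pi * 1000 * n / 8000) for n in range(800)]
env = envelope(tone)
print(env[-1])  # settles near the 0.5 tone amplitude
```

Picking alpha is exactly the same trade-off Andrew describes for the cap and resistor: too close to 1 and the envelope responds sluggishly, too small and the carrier ripple comes through.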
Get a standard PC running Windows or Linux. Use a sound application to
digitize a vocabulary of your words.
Compute the discrete Fourier transform (DFT) of these time-domain
samples using the Fast Fourier Transform (FFT). You can find how to do
this by searching on Google. The math can be confusing at first, but
this is one of the most beautiful processes in all of engineering, and
it's definitely worth learning if you haven't already.
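If you want to see the machinery rather than call a library, a radix-2 Cooley-Tukey FFT fits in a dozen lines of Python. This is just a sketch for exploration (a real application would use an optimized FFT library):

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0.0] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

# Sanity check: a pure tone lands in a single frequency bin.
n = 64
tone = [cmath.exp(2j * cmath.pi * 5 * i / n) for i in range(n)]
spectrum = fft(tone)
peak = max(range(n), key=lambda k: abs(spectrum[k]))
print(peak)  # 5
```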
Once you have done that, you will have the frequency bins that Rich
Grise mentioned in his parallel post, each bin essentially representing
the energy at a particular frequency. Then, intuitively, a particular
word will have a "signature", or frequency pattern, depending on what
word it is. If your vocabulary has 16 words, you should have 16
representative signatures (frequency-domain signals).
Then normalize the signatures by regarding the height (modulus of the
corresponding component) of each frequency bin as the component of a
vector. The FFT of the word then yields a vector in N-space, where N
is the number of samples in the frequency-domain signal. You should
normalize this vector to unity (length 1) by dividing each of its
components by the length of the vector. Naturally, you compute the
length of the vector by taking the square root of its scalar product
with itself: sqrt(A*A) = sqrt(a0*a0 + a1*a1 + ... + a(n-1)*a(n-1)).
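The normalization step in code, with a deliberately tiny example vector so the arithmetic is easy to check by hand:

```python
import math

def normalize(sig):
    """Scale a signature so that, viewed as a vector in N-space,
    it has length 1: divide each component by sqrt(a0*a0 + a1*a1 + ...)."""
    length = math.sqrt(sum(a * a for a in sig))
    return [a / length for a in sig]

v = normalize([3.0, 4.0])            # length is sqrt(9 + 16) = 5
print(v)                             # [0.6, 0.8]
print(math.sqrt(sum(a * a for a in v)))  # back to length 1.0
```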
After you have normalized your vocabulary, you can normalize the
utterances as they come in using the exact same procedure: take sample,
compute DFT with N samples, regard as vector, normalize vector.
After you have this input vector, you want to guess which word was
uttered. The simplest thing you can use is a minimum-distance
algorithm. Since each of your vocabulary utterances is a vector in
N-space, your input word is also a vector in N-space, and all of these
vectors have length one due to normalization, the vector of the
uttered word will most likely have its tip closest to the tip of the
vector of the corresponding vocabulary word. You compute the distance
between each vocabulary vector and the input vector using the standard
formula for the distance between two vectors. Whichever vocabulary
word yields the smallest distance, that's the one you choose.
Naturally, if someone utters a word not in the vocabulary, you will
necessarily have a mismatch, so it might make sense to have a
threshold or thresholds.
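The whole classifier is a few lines. The threshold value and the toy three-component "signatures" below are made up for illustration; you would tune the threshold against real recordings:

```python
import math

def classify(input_vec, vocabulary, threshold=0.5):
    """Minimum-distance match: pick the vocabulary word whose unit
    vector lies closest to the (unit) input vector.  'threshold' is
    an assumed reject level for out-of-vocabulary utterances."""
    best_word, best_dist = None, float("inf")
    for word, vec in vocabulary.items():
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(input_vec, vec)))
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word if best_dist <= threshold else None

vocab = {"stop": [1.0, 0.0, 0.0], "go": [0.0, 1.0, 0.0]}
print(classify([0.9, 0.1, 0.0], vocab))   # closest to "stop"
print(classify([0.0, 0.0, 1.0], vocab))   # nothing close -> None
```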
Note that some words will be longer than others. Resist the
temptation to contract longer words in the time domain so that they
all "have the same length", for if you did that, your utterances would
sound like they were uttered by chipmunks.
To test your energy normalization, try yelling the word, then murmuring
it, and see if the matching still works. But keep in mind that, when a
person yells a word, the increase in intensity is not distributed along
the entire length of the utterance. Often, in poly-syllabic words, the
accented syllable takes the bulk of the emphasis. This illustrates
another important point: You want very clear demarcations between the
engineering aspect of what you're doing and the art aspect. Many
speech recognition companies suffered during the 1980's and 1990's
because they thought that engineering aspect would carry the day, but
there is only so much that the math can do. The rest is art.
After you get all of this working, you can do all kinds of things,
like calculating the probability of a miss, a hit, or the conditional
probabilities of a miss or hit given that a particular word was
uttered. You can also take your vocabulary vectors and determine the
degree of "orthogonality" in the utterances - the degree to which each
word is likely to be distinguishable from the other words. You do this
by taking the scalar product between every pair of vectors in your M
utterances. If the scalar product is close to unity, that's not good;
that means the words are spectrally close, like "lighting" and
"lightning". On the other hand, if it's close to zero, great!
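The pairwise check might look like this. The three-component "signatures" are invented stand-ins; real ones would come from your normalized FFT magnitudes:

```python
def confusability(vocab):
    """Scalar product of every pair of (unit) signature vectors.
    Near 1.0: the two words are spectrally close (bad).
    Near 0.0: nearly orthogonal, easy to tell apart (good)."""
    words = list(vocab)
    report = {}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            report[(w1, w2)] = sum(a * b
                                   for a, b in zip(vocab[w1], vocab[w2]))
    return report

# Hypothetical unit-length signatures:
vocab = {
    "lighting":  [0.8, 0.6, 0.0],
    "lightning": [0.6, 0.8, 0.0],
    "go":        [0.0, 0.0, 1.0],
}
for pair, dot in confusability(vocab).items():
    print(pair, round(dot, 2))  # lighting/lightning land near 0.96
```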
It should be fun to find a vocabulary with maximum spectral
orthogonality which has the semantics you want: instead of the word
"eat", say "ingest", if your vocabulary absolutely positively must
contain "meat".
Once you have written the software, you can put it in hardware. You'll
have to create data types and functions for complex numbers, but hey,
let's not kid ourselves - there was never such a thing as a non-complex
number anyway.
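For what it's worth, the complex type is small: a struct with two fields and the textbook add and multiply rules. Sketched here in Python, where complex is built in, purely to show what the hardware version needs to carry:

```python
# A minimal complex type of the sort you'd port to the PIC -- two
# fields plus the textbook arithmetic rules.
class Cplx:
    def __init__(self, re, im):
        self.re, self.im = re, im
    def __add__(self, o):
        return Cplx(self.re + o.re, self.im + o.im)
    def __mul__(self, o):
        # (a+bi)(c+di) = (ac - bd) + (ad + bc)i
        return Cplx(self.re * o.re - self.im * o.im,
                    self.re * o.im + self.im * o.re)

z = Cplx(1.0, 2.0) * Cplx(3.0, -1.0)
print(z.re, z.im)  # 5.0 5.0
```

On a PIC you would likely use fixed-point fields instead of floats, but the structure is the same.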
-Chaud Lapin-