AVR interrupt response time

Pygmi · Jan 12, 2005

I just started my first time-critical project with AVRs,
where time-critical means interrupt response time.
So far I have been using avr-gcc (3.3.x) and I have been
pretty happy with it. And I have written ALL code in C.

I'm hoping to get some code executed within 2 us or so
after an external interrupt (INT0/INT1 on an ATmega32).
I wrote the code to be executed today and ended up with
approx. 20 instructions/cycles. With a 16 MHz clock that
means something like 1.25 us. Nothing much to optimize
there.
From the datasheet I found out that it takes 4 cycles
minimum (?) to jump to the interrupt handler. Adding some
register saving and stuff, I was expecting less than 0.5 us
before my own code starts executing => resulting in <2 us.

Ok, that was what I was hoping...

When I compiled the code and ran it, I noticed that it
took about 2.5 us to start executing my code?!?!
(I used an ATmega8 as I don't have any M32 at the
moment, but I guess that isn't relevant??)
I checked the list file and one reason is the
LENGTHY prologue added by gcc to the interrupt
handler (17 instructions!!!), saving a LOT of registers...

Two questions:
1. Even with 4 cycles + 17 instructions there is 1 us
missing?? What else happens before my own handler
code starts executing?
2. Is there any way to tell gcc NOT to 'push' all those
registers in the prologue??

And finally:
If I expect my own code to start executing
within 0.5 us, is assembler the only way to go??

Thanks in advance for any info....
Any links to good resources are appreciated as well.
I would REALLY like to know exactly what happens there.

Pygmi

Jeroen · Jan 12, 2005

The processor first synchronizes the external input to its own clock,
which takes 2 clocks. The processor also has to finish the currently
executing instruction. It takes 3 cycles to go to the interrupt vector, from
where it executes a jump to your ISR, another 3 cycles. This is 8 to
11 cycles in total, depending on the executing instruction, or 0.5 to
0.6875 us at 16 MHz. Only then has it entered your ISR; you need to save at
least the status register and a few registers before useful work can be done.
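
As a rough check of the figures above, the cycle counts can be turned into time as follows. This is only a sketch in plain C meant to run on a PC, not on the AVR. The helper names `cycles_to_us` and `entry_cycles` are made up for illustration, and the 0 to 3 extra cycles for finishing the current instruction are simply what the 8 to 11 cycle range above implies.

```c
/* Convert a cycle count to microseconds for a clock given in MHz.
   (Hypothetical helper, not part of any AVR library.) */
static double cycles_to_us(unsigned cycles, double f_mhz)
{
    return (double)cycles / f_mhz;
}

/* Cycles until the first ISR instruction, using the figures from the post:
   2 (input synchronisation) + finish_cycles (0..3, finish the current
   instruction) + 3 (go to the vector) + 3 (jump from the vector to the ISR). */
static unsigned entry_cycles(unsigned finish_cycles)
{
    return 2u + finish_cycles + 3u + 3u;
}
```

With `entry_cycles(0)` and `entry_cycles(3)` at 16 MHz this gives 0.5 us and 0.6875 us; any register saving in the ISR prologue comes on top of that.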

How did you check the response time? With a scope?

Assembly will be necessary if you want to squeeze out every last bit of
performance. What's the application that makes this so critical?

Jeroen

Pygmi · Jan 12, 2005

Thanks for the response.

Yes, I checked the response time with scope. From external
signal to first executed instruction of my "own" code in interrupt
handler.

I have a need to service ISA bus logic (I/O read/writes), and I have
been told that R/W requests should be serviced within 2.5 us
(so not actually 2 us). I'm not quite sure about the 2.5 us requirement,
but if it is valid, it seems to be too much for AVR with 16 MHz...
Maybe if this could be the only interrupt in the system or having
nested interrupts.

...or I should forget all about interrupts and do the things I need
by polling. Not very tempting.
...or just faster processor (which would mean also jump from
AVR to another architecture)
...or the solution is a dual ported RAM??
...or some other option...there are of course options...but for
additional HW cost of course

Pygmi

Jeroen · Jan 12, 2005

Latency on bigger processors is usually even worse... Interrupt latency on,
for example, an 80386 can be hundreds of cycles.

It's better to have some hardware to interface to the ISA bus. A small cheap
CPLD is best; all you really need is an address decoder and a few registers.
The ISA bus runs at 8 MHz, the AVR runs at 16 MHz; this is just 2 single-cycle
instructions for each ISA bus clock. A jump alone is 3 cycles. So it's not
possible, the AVR just can't do anything useful in that time. Only a much
faster CPU could do it, and even then the load would be very high.

The ISA interface can be done in plain HCT logic, and will only be a few
chips. A possible solution is a GAL20V8 as address decoder. This decoder will
generate two strobes: one to enable a '574 that stores the data from the
data bus, and another to pass data from the AVR to the ISA bus via a '244. INT0/1
on the AVR can be used to let the AVR know something has been written. An
external interrupt pulse needs to last at least 2 AVR clock cycles before it's
recognized, but to be on the safe side, it's better to use a flip-flop that's
set by the address decoder and reset by the AVR. The output of the FF goes
to the INT0/1 input. This takes only 4 chips that cost next to nothing. If
board space is at a premium, a 44-pin CPLD like a MAX7000S could be used.

Jeroen

CBFalconer · Jan 12, 2005

Jeroen said:
The processor first synchronizes the external input to its own
clock, which takes 2 clocks. The processor also has to finish
the currently executing instruction. It takes 3 cycles to go to the
interrupt vector, from where it executes a jump to your ISR,
another 3 cycles. This is 8 to 11 cycles in total,
depending on the executing instruction, or 0.5 to 0.6875 us at
16 MHz. Only then has it entered your ISR; you need to save at
least the status register and a few registers before useful work
can be done.

How did you check the response time? With a scope?

Assembly will be necessary if you want to squeeze out every last
bit of performance. What's the application that makes this so
critical?

And all that assumes that the executing code has no critical
sections implemented by disabling interrupts. Does no ARM
instruction take over 3 cycles? What about a return? What about
other interrupts and returns from them, if any. Hairy.

Ulf Samuelsson · Jan 13, 2005

When I compiled the code and ran it, I noticed that it took about 2.5 us
to start executing my code?!?!

(I used ATMega8 as I don't have any M32 at the moment, but I guess it isn't relevant??)
I checked the list file and one reason is the LENGTHY prologue added by gcc to the interrupt
handler (17 instructions!!!), saving a LOT of registers...

You can try a better compiler than WinAVR!

// IAR C interrupt handler
#pragma vector=12
__interrupt void handler()
{
BYTE i = PORTB;
PORTB = 0xF0;
PORTB = 0x0F;
PORTB = i;
}

Generated code

51 __interrupt void handler()
\ handler:
52 {
\ 00000000 931A ST -Y,R17
\ 00000002 930A ST -Y,R16
53 BYTE i = PORTB;
\ 00000004 B318 IN R17,0x18
54 PORTB = 0xF0;
\ 00000006 EF00 LDI R16,240
\ 00000008 BB08 OUT 0x18,R16
55 PORTB = 0x0F;
\ 0000000A E00F LDI R16,15
\ 0000000C BB08 OUT 0x18,R16
56 PORTB = i;
\ 0000000E BB18 OUT 0x18,R17
57 }
\ 00000010 9109 LD R16,Y+
\ 00000012 9119 LD R17,Y+
\ 00000014 9518 RETI
58

Two registers used, two registers pushed.
As you see, there is no need to even save the status register (SREG) in this
case, since none of the instructions used update the flags.

If you need fast interrupt response and have a lot to do,
then consider dividing the handler into two parts.

The first part does only the minimal, fast processing, and at the end it sets
an external interrupt pending. That second interrupt continues the processing
after the fast interrupt has exited.

__no_init __register BYTE SavePortB @4; // Keep the saved value in register r4
#pragma vector=TIMER
__interrupt void fast_handler(void)
{
    SavePortB = PORTB;
    set_ext_interrupt_pending();
}

#pragma vector=EXT_INT_HANDLER
__interrupt void slow_handler(void)
{
    // Continue slow processing after fast handler has exited.
    PORTB = 0xF0;
    PORTB = 0x0F;
    PORTB = SavePortB; // restore the value saved by fast_handler
}

Since the processing is minimal in the fast handler, very few registers
should be pushed by a good compiler.

There is a 4 kB restricted C compiler for tests.
You have to contact IAR personally to get it. It is not on their web page.
It does not generate assembly code, only object code.

Ulf Samuelsson · Jan 13, 2005

And all that assumes that the executing code has no critical
sections implemented by disabling interrupts. Does no ARM
instruction take over 3 cycles? What about a return? What about
other interrupts and returns from them, if any. Hairy.

Don't forget that the main reason for long worst-case interrupt latencies is
probably another interrupt that does not re-enable the global interrupt flag
(which would allow nested interrupts).
According to Murphy's law, this conflict will only appear AFTER customer
shipment.
You have to add together ALL interrupts in the system that have higher
priority to find your worst-case latency.
This is not something that can be tested. You have to do the calculations.
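
The calculation described above could be sketched like this in plain C (run on a PC, not on the AVR). Everything here is an assumption for illustration: the struct, the function name `worst_case_wait`, and the idea that each higher-priority interrupt fires at most once while you are waiting. The extra term for one lower-priority handler that is already running with interrupts disabled is not spelled out in the post, but it follows from the same reasoning.

```c
#include <stddef.h>

/* One interrupt source; all numbers put in here are illustrative. */
struct isr_info {
    unsigned priority; /* lower number = higher priority (AVR vector order) */
    unsigned cycles;   /* worst-case cycles from entry to RETI, not nestable */
};

/* Worst-case cycles before the ISR with priority `prio` can start:
   every higher-priority ISR may run first (assumed once each), and one
   lower-priority ISR may already be running with interrupts disabled,
   so the longest of those is added as well. */
static unsigned worst_case_wait(const struct isr_info *isrs, size_t n,
                                unsigned prio)
{
    unsigned sum_higher = 0;
    unsigned max_lower = 0;
    for (size_t i = 0; i < n; i++) {
        if (isrs[i].priority < prio)
            sum_higher += isrs[i].cycles;
        else if (isrs[i].priority > prio && isrs[i].cycles > max_lower)
            max_lower = isrs[i].cycles;
    }
    return sum_higher + max_lower;
}
```

The result still has to be added to the hardware response time and the handler's own prologue to get the total latency.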

CBFalconer · Jan 13, 2005

Ulf said:
Don't forget that the main reason for long worst-case interrupt
latencies is probably another interrupt that does not re-enable the
global interrupt flag (which would allow nested interrupts).
According to Murphy's law, this conflict will only appear AFTER
customer shipment. You have to add together ALL interrupts in the
system that have higher priority to find your worst-case latency.
This is not something that can be tested. You have to do the
calculations.

Of course it is not impossible that the OP has something that runs
a basic loop and has only one interrupt in the system, in which
case there will be no critical sections and the latency is
controlled by the longest instruction. However the return
instruction in many systems implies interrupt disable for the
following instruction, as a measure to avoid stack overflow in some
worst cases. There are also special cases, such as the x86 string
instructions when using a repeat prefix. Don't know about the ARM.

Pygmi · Jan 13, 2005

Ulf Samuelsson said:
You can try a better compiler than WinAVR!

Actually I made a quick test today. Empty handler, and
then I wrote the code using inline assembly. The prologue of
17 instructions dropped down to 4, and the time from external
trigger went from >2.5 us down to 1.2 us or so (at 14.75 MHz).

I was also able to snip a few instructions out of my own code.
So, I think there is hope. It is possible to get it all done within
2.5 us. But not with handlers written using the C/gcc combination.
Maybe with C/IAR or some other compiler. And of course
by writing those critical parts in assembly.
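
To see how tight that is, the measured times can be converted back to cycles. A small sketch in plain C for the PC, using the numbers from this thread (2.5 us limit, about 1.2 us to the first handler instruction, 14.75 MHz clock); the function names are made up for illustration, and the result is only as good as the scope readings.

```c
/* Whole cycles that fit into a time window (rounded down). */
static unsigned cycles_floor(double us, double f_mhz)
{
    return (unsigned)(us * f_mhz);
}

/* Cycles needed to cover a measured time (rounded up). */
static unsigned cycles_ceil(double us, double f_mhz)
{
    double x = us * f_mhz;
    unsigned c = (unsigned)x;
    if ((double)c < x)
        c++;
    return c;
}

/* Cycles left for the handler body after the entry latency is spent. */
static unsigned remaining_cycles(double limit_us, double entry_us, double f_mhz)
{
    unsigned total = cycles_floor(limit_us, f_mhz);
    unsigned entry = cycles_ceil(entry_us, f_mhz);
    return (entry >= total) ? 0u : total - entry;
}
```

With 2.5 us, 1.2 us and 14.75 MHz this leaves roughly 18 cycles for the handler body, which is close to the 20 or so instructions of the original code, so every instruction snipped counts.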

I do understand also that this is only halfway there. I need to
design & write the other code in such a way that this one
critical handler is serviced with minimum delay...
But it is possible....maybe.

If only there were an AVR running at a 25-30 MHz clock.

// IAR C interrupt handler
#pragma vector=12
__interrupt void handler()
{
BYTE i = PORTB;
PORTB = 0xF0;
PORTB = 0x0F;
PORTB = i;
}

Generated code

51 __interrupt void handler()
\ handler:
52 {
\ 00000000 931A ST -Y,R17
\ 00000002 930A ST -Y,R16
53 BYTE i = PORTB;
\ 00000004 B318 IN R17,0x18
54 PORTB = 0xF0;
\ 00000006 EF00 LDI R16,240
\ 00000008 BB08 OUT 0x18,R16
55 PORTB = 0x0F;
\ 0000000A E00F LDI R16,15
\ 0000000C BB08 OUT 0x18,R16
56 PORTB = i;
\ 0000000E BB18 OUT 0x18,R17
57 }
\ 00000010 9109 LD R16,Y+
\ 00000012 9119 LD R17,Y+
\ 00000014 9518 RETI
58

Two registers used, two registers pushed.
As you see, there is no reason to even push the PSR in this case since the
flags do not get updated.

I must admit that this seems very reasonable.

If you need fast interrupt response, and need to do a lot,
then consider to divide the handler into two parts.

Not a bad idea. In general. But in this case I want fast service
(i.e. end result) after external interrupt. So dividing would make
it even worse.

First part (minimal) does minimal fast processing and at the end, it sets an
external interrupt
which continues the processing after the fast interrupt has exited.

__no_init __register BYTE SavePortB @4; // Keep the saved value in register r4
#pragma vector=TIMER
__interrupt void fast_handler(void)
{
    SavePortB = PORTB;
    set_ext_interrupt_pending();
}

#pragma vector=EXT_INT_HANDLER
__interrupt void slow_handler(void)
{
    // Continue slow processing after fast handler has exited.
    PORTB = 0xF0;
    PORTB = 0x0F;
    PORTB = SavePortB; // restore the value saved by fast_handler
}

Since the processing is minimal in the fast handler, very few registers
should be pushed by a good compiler.

There is a 4kB restricted C compiler for tests.
You have to personally contact IAR to get it. It is not on their web page.
This does not generate assembly code , only object code.

I have seen you recommending IAR earlier. Unfortunately I have some
bad memories of working with the IAR compiler.
....ok, several years back, with the Hitachi H8
....so maybe not very relevant today
....but at that time we were considering going to gcc to get rid of all
the bugs in the IAR libs. But all this happened in the last century.

So, maybe I should give it a try one of these days.

--
Best Regards
Ulf at atmel dot com
These comments are intended to be my own opinion and they
may, or may not be shared by my employer, Atmel Sweden.

Thanks
Pygmi

Pygmi · Jan 13, 2005

Jeroen said:
Latency on bigger processors is usually even worse... Interrupt latency on
for example a 80386 can take hunderds of cycles.

It's better to have some hardware to interface the ISA bus. A small cheap
CPLD is best, all you really need is an address decoder and a few registers.
The ISA bus runs at 8Mhz, the AVR runs at 16Mhz; this is just 2 instructions
for each ISA bus cycle. A jump alone is 3 cycles. So it's not possible, the
AVR just can't do anything useful. Only a much faster CPU could do it, but
still then the load is still very high.

The ISA interface can be done in plain HCT logic, and will only be a few
chips. A possible solution is a GAL20V8 as adress decoder. This decoder will
generate two strobes. One to enable a '574 that stores the data from the
databus and another to pass data from the AVR to the ISA via an '244. INT0/1
on the AVR can be used to let the AVR know something has been written. An
external interrupts needs to be at least 2 AVR clock cycles before it's
recognized, but to be on the safe side, it's better to use a flipflop that's
set by the address decoder, and reset by the AVR. The output of the FF goes
the INT0/1 input. This costs only 4 chips that cost next to nothing. If
board space is at premium, a 44 pin CPLD like a MAX7000S could be used.

Jeroen

Thanks again...
What I want to do...is to enable 16 bit I/O register
read/write access using inw/outw functions.
And the registers are in AVR SRAM.

One way to reduce load (a little bit) would be to go
to 8-bit registers. (inb/outb)

I already wrote an answer to Ulf's response telling that:
I made a quick test today. Empty handler, and then I wrote
the code using inline assembly. The prologue of 17 instructions
dropped down to 4, and the time from external trigger went from
>2.5 us down to 1.2 us or so (at 14.75 MHz).

So I think that using assembly to write handler I might just
be able to do the thing.

Pygmi

Pygmi · Jan 13, 2005

CBFalconer said:
Of course it is not impossible that the OP has something that runs
a basic loop and has only one interrupt in the system, in which
case there will be no critical sections and the latency is
controlled by the longest instruction.

Actually this is something that I have considered. One interrupt
for one time critical action. Rest of the stuff will probably get most
of the processor time...but in second priority.

However the return
instruction in many systems implies interrupt disable for the
following instruction, as a measure to avoid stack overflow in some
worst cases. There are also special cases, such as the x86 string
instructions when using a repeat prefix. Don't know about the ARM.

Pygmi

Ulf Samuelsson · Jan 14, 2005

Actually I made a quick test today. Empty handler, and
then I wrote the code using inline assembly. The prologue of
17 instructions dropped down to 4, and the time from external
trigger went from >2.5 us down to 1.2 us or so (at 14.75 MHz).

I was also able to snip few instruction out of my own code.
So, I think there is hope. It is possible to get all done within
2.5 us. But not with handlers written using C/gcc combination.
Maybe with C/IAR or some other compiler. And of course
by writing those critical parts in assembly.

I do understand also that this only half way there. I need to
design & write the other code in such away, that this one
critical handler is serviced with minimum delay...
But it is possible....maybe.

If there just was a AVR running with 25-30 MHz clock.

The AT76C713 runs at 48 MHz, which should be enough!
It has 24 kB of internal SRAM.
The AT94K05AL FPSLIC runs at 35 MHz and has 20 kB of SRAM and a small FPGA.
You could conceivably do something in the FPGA.
Both load their code into SRAM from an external serial EEPROM.

Pygmi · Jan 14, 2005

Ulf Samuelsson said:
The AT76C713 runs at 48 MHz, which should be enough!
It has 24 kB of internal SRAM.
The AT94K05AL FPSLIC runs at 35 MHz and has 20 kB of SRAM and a small FPGA.
You could conceivably do something in the FPGA.
Both load their code into SRAM from an external serial EEPROM.


AT76C713?? Never heard. I'll check the info.
Need to check datasheet, availability and pricing and tools and... as well.

But thanks anyway
Pygmi
