# ugly math

Discussion in 'Electronic Design' started by John Larkin, Nov 12, 2006.

1. ### John LarkinGuest

OK, I've got a 64-bit unsigned binary number, in a pair of 32-bit
registers, in an MC68332 cpu. Its maximum value is 1e13, which is 10
seconds measured in picoseconds, so the MS longword can only get up to
about 2328 or something like that. I'm looking for an efficient way to
turn this into a decimal ASCII string, so basicly need to convert it
into bcd. The bcd output will need to have 14 digits. "Numerical
Recipes" doesn't address mundane problems like this.

One way is to create a BCD buffer, zero it, and map each nibble of the
input through a 16-entry bcd lookup table, and sum all the bcd terms.
The 68332 has a fairly efficient BCD add instruction, so adding a bcd
table entry to the running sum isn't ghastly. The lookup table needs
12 entries (one per active nibble) each holding 16 bcd numbers, each
being a 14 digit bcd number, namely 8 bytes, grand total 1536 bytes.
Creating this table would be a minor nuisance; I'd need a
jillion-digit calculator. Well, PowerBasic does have a 64-bit integer
type, so maybe I can write a little Basic program to gen the tables.
There's even a BCD data type, up to 18 digits.

One nice algorithm for binary-to-bcd is to just keep dividing by 10
and posting the remainders as the bcd digits, in reverse order of
course. The 68332 can divide a 64-bit value by 10, with a remainder,
but the quotient is limited to 32 bits, not enough. I could use this
mode to convert the low 32-bit part to bcd, then use four nibble table
lookups on the high half to finish things off. Since I also need a
32-bit-to-bcd thing somewhere else, maybe that isn't such a gross
idea.

If the raw input could be scaled to 64-bit fractional format,
successive multiplies by 10 will pump out digits, ms digit first.
Gotta think about that one... it may have embarassing rounding
problems.

Or maybe I could program our Xilinx FPGA to do a hardware divide,
divide a 64 bit integer by 10 and feed me the remainders. Hell, maybe
it could do the entire ascii conversion in hardware.

Oh, I'm programming in bare-metal assembly.

Anybody have any tricks here?

(You comp.arch guys take it easy on me, please. I'm just a simple
engineer.)

John

2. ### richard mullensGuest

Something like this:-

Divide the 64 bit value by 1 million.
Then
Convert the quotient and remainder to ascii strings.

(Divide by 10, push remainder, divide quotient by 10, push remainder, ..., store quotient, pop remainder, store, pop remainder,
store ...)

3. ### Andrew HolmeGuest

I think you can use the same trick which the FORTH word M/MOD uses. This
divides a 32-bit value by a 16-bit value, giving a 32-bit quotient and
16-bit remainder; but it does it using the U/ primitive, which is 32/16 =
16q+16r.

Say you want to work out (a + 2^32*b) / n using your 64/32 --> 32q, 32r
instruction

Divide your high word first (set the top 32-bits to zero and move b into the
lower 32)
b/n = q1 + r1/n

Now do the low word, adding in the remainder from the above:
(a + 2^32*r1) / n = q2 + r2/n

You can now form a 64-bit quotient and 32-bit remainder from the two
results:
(a + 2^32*b) / n = 2^32q1 + q2 + r2/n

Here's the code in FORTH:

\$Colon MDivMod, 'M/MOD'
; Unsigned 32/16 -> 32q 16r
; >R 0 R U/ ROT SWAP R> U/ ROT SWAP ( l h d -- l h r )
;
;>R l h d
;0 l h 0 d
;R l h 0 d d
;U/ l q r d
;ROT q r l d
;SWAP q l r d
;R> q l r d
;U/ q1H q2L r2
;ROT q2L r2 q1H
;SWAP q2L q1H r2

4. ### John LarkinGuest

Gadzooks, my 68332 can do that beautifully. That splits the number
into two chunks I can process with the available 64/32 divide opcodes.

Don't need no stinkin' pushes and pops... I've got 16 registers!

Thanks!

John

5. ### richard mullensGuest

Thanks,
Interesting. Maybe I'll write the (32 bit) code in Macro-11 for reference later on.
Should be useful for microcontrollers with h/w integer divide
Richard

6. ### ArletGuest

Given the 64/32 bit divide, this is the best solution. Instead of
storing all the BCD nibbles on the stack, or in registers, just write
them to memory as you produce them. The trick is to write them
'backwards'. Set up an address register with the end of your ASCII
buffer, write a zero byte with autodecrement, and write each nibble
with autodecrement. When you're done, the register holds the start of
the string in the correct order, terminated by zero.

7. ### Robert BaerGuest

Attached is a FORTRAN program that converts internal binary back to
ASCII decimal, for a FHT (Hartley) multi-million digit multiply routine.
As such, the input (large) number was "sampled" into 7-digit or less
length sequences, converted to binary, then FHT, convolve, inverse FHT,
and this used to convert back to ASCII in the same-sized sequences.
The secret to this speedy code (attached) is a 579Kbyte look-up table
(actually two tables ASK4 and ASK5) placed in (labelled) common by an
ASM "program" which i could send to you if you wish.

CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
C C
C G: input numeric array from multiply C
C An: output char array, each element ISMP long C
C IS10: output, =1 if result overflows, is == 10^ISMP C
C *Converts internal format to ASCII, 4-7 characters. C
C 10/29/98 C
C C
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

c\$noreference
C * Common to all programs *
IMPLICIT NONE
COMMON IP,N,N2,N4,N7,IP7,ISMP,JFLAG,TCA,TCB,IRecSz1,IRecSz2,
1 IRecSz3
C Table lookup numeric to ASCII, 4 & 5 digits via ASCII arrays
C Note: these arrays are filled in DGROUP.asm data declarations
INTEGER*4 IP,N,N2,N4,N7,IP7,ISMP,JFLAG,IRecSz1,IRecSz2,IRecSz3
c\$reference

INTEGER*4 I,IS10,T,U
DOUBLE PRECISION G(0:N7)

C Table lookup numeric to ASCII, 4 & 5 digits via ASCII array ASK
! get integer overflow somewhere..
N2=N4+N4
IF (INT(G(N7)/10**ISMP) .EQ. 1) IS10=1 ! yup; fix exponent

SELECT CASE (ISMP)
CASE (3)
An(N7-I)=TMP(2:4)
END DO
CASE (4,5)
END DO
CASE (6) ! uses ASK5 for top 5 digits
DO I=0, N7 ! 6 digit table would take 6 megs
T=INT(G(I)*1.0D-1) ! remove divide
U=G(I)-T*10 ! remove modulo; 10 percent improvement
END DO
DO I=0, N7
T=INT(G(I)*1.0D-4) ! remove divide
U=G(I)-T*10000 ! remove modulo; 10 percent improvement
END DO
END SELECT

RETURN
END

8. ### Frank BemelmanGuest

Why does it need to be efficient? Code-space efficient or execution
time efficient? For displaying purposes, a refresh rate of 10 times
per second is good enough. A lookup table with 1,10,100,1000 upto
1000000000000 converted to 64 bit values, and then subtract those
values until you hit 0. Or a larger table, with 1,2,3,4,5,6,7,8,9,10,
20,30,40,50 and so on, with compares and subs. Such table would
hold 9x16x8 bytes (1152) of data.

Or use a different hardware counter, to count the picoseconds,
lsb part of it in binary, msb part of it in decimal.

9. ### Terje MathisenGuest

You can of course use a 64/32->32,32 (result/remainder) in a loop to
divide an arbitrarily long bitstring by a 32-bit number.

If this needs to be efficient, then it is much faster to do the long
divisions by 1e9 (which fits in 32 bits, then use regular ascii
conversion code to convert each 9-digit part into ascii/bcd digits.

If time is _really_ critical, then you should search for my _fast_
10-digit 32-bit binary to ascii conversion algorithm, which I posted on
usenet (in asn asm newsgroup) many years ago. It is fast enough that AMD
'borrowed' it without attribution for their code optimization manual! :-(

Terje

10. ### daveGuest

You don't provide much context for this problem, so I'll just drop this
into the thread. If you have a C compiler available, why not just port
the Calc program (supports arbitrary precision arithmetic) to your
computer? (http://isthe.com/chongo/tech/comp/calc/)

Dave Feustel

11. ### ArletGuest

Seems overkill to use arbitrary precision when only 64 bits are needed.
Most modern C compilers have a 64 bit type, usually implemented as
'long long'.

In that case, a simple one line call:

sprintf( asc, "%lld", bin )

will do the requested conversion.

Of course, OP said he wants to use assembler, presumably because the
available memory does not allow a few kB for the library calls. For
code efficiency, it's hard to beat the solution proposed by other
poster where you do an initial 64/32 bit division by 1 million,
followed by repeated 32-bit divisions by 10 to get the BCD digits.

12. ### daveGuest

I agree. I have found calc to be quite useful on OpenBSD, although it's
probably not useful if the OP is developing code for an embedded system.

13. ### colinGuest

ha ive just implemented such a thing in c to display speed of my motor
controller on a dspic30,
but it uses modulus and divide per digit wich is probaly too slow for you if
you are doing it in assembler as you probably need speed, else why do it in
assembler.

but heres an idea i just made up to do it quickly with no divides or
multiplications but instead uses lots of code
but its very fast i would think, only executing 4 compares 1 and + 1
subtract per decimal digit (78 ops).

its aranged like a binary decision tree, each decision spliting the problem
into two aprox halves,

psuedo code :-
(well ok its c actually and totaly untested !)

unsigned64 result=0;
x=input;
if (x>=5e12)
if(x>=7e12)
if(x>=9e12)
{
result&=9<<62;
input-=9e12;
}
else
if(x>=8e12)
{
result&=8<<62;
input-=8e12;
}
else
{
result&=7<<62;
input-=7e12;
}
else
if(x>=6e12)
{
result&=6<<62;
input-=6e12;
}
else
{
result&=5<<62;
input-=5e12;
}
else if(x>2e12 etc ......

if (x>=5e11)
if(x>=7e11)
if(x>=9e11)
{
result&=9<<48;
input-=9e11;
}
etc ....

Colin =^.^=

14. ### colinGuest

I spoted a few mistake already as I made some changes just before posting,
x is an uneeded temp and shld be replaced by input,
I assumed input is actualy <1e13
you could make a macro for each digit and call it once for each digit
so digital exponent would be text replaced by macro parameter, binary shift
<- parameter*4

Colin =^.^=

15. ### John LarkinGuest

My thoughts exactly.

Here's my code, based on Richard's suggestion, but dividing by 1e9
instead of 1e6. I haven't read it through enough, so there's likely a
bug or so. I usually code bugs here and there, but catch then on
read-throughs, before I actually run the code.

ACORN is separately callable to convert 32-bit integers.

John

.SBTTL . ASCII CONVERSIONS

; GIVEN A 64-BIT TIME IN PICOSECONDS, CONVERT THAT INTO AN ASCII
; STRING IN OUR "BCD" STRING BUFFER.

; INPUT IS IN D0 1, MAX VALUE 99,999,999,999,999, 14 DIGITS!

; WE'LL BREAK D0 1 INTO TWO CHUNKS BY FIRST DIVIDING BY 1E9

ATIME: MOVE.L # 1000000000, D3 ; ONE WHOLE BILLION! WOW.
DIVU.Q D3, D0 1 ; DO THAT DIVIDE

; D0 (REMAINDER) HOLDS THE LOW 9 DECIMAL DIGITS, 999,999,999 MAX
; D1 (QUOTIENT) NOW HOLDS THE NUMBER OF BILLIONS, 10,999 MAX

MOVEA.W # BCX, A0 ; AIM AT END OF 'BCD' BUFFER, +1
MOVE.W # 9-1, D7 ; REQUEST 9 DIGITS
MOVE.L D0, D6 ; COPY CORRESPONDING CHUNK
BSR.S ACORN ; CONVERT THEM, POKE INTO BUFFER

MOVE.W # 5-1, D7 ; MAKE READY FOR BIG 5 DIGITS
MOVE.L D1, D6 ; COPY THE GIGAS
; AND FALL INTO FINAL CONVERSION

; THIS LITTLE SUB CONVERTS THE CONTENTS OF D5 INTO
; AN ASCII STRING.

; D7 IS NUMBER OF DIGITS TO OUTPUT, DBF STYLE
; D6 HOLDS UNSIGNED LONG TO CONVERT
; A0 AIMS AT START, THE *LS* DIGIT OF THE STRING+1,
; MEANING WE'LL WORK BACKWARDS, -(A0) MODE

ACORN: MOVE.L # 10, D3 ; RADIX CONSTANT FOR HUMANOIDS

ACRID: CLR.L D5 ; MAKE INPUT INTO 64-BIT INT
DIVU.Q D3, D5 6 ; DIVIDE, REMAINDER IN D5
ADDI.W # "0", D5 ; CONVERT TO ASCII 0..9
MOVE.B D5, -(A0) ; POKE A NEW DIGIT

DBF D7, ACRID ; AND LOOP
RTS

; ON EXIT, A0 AIMS AT THE MS CHARACTER

16. ### John LarkinGuest

I'd love to see that; could you possibly repost here?

Is there a competition for the *slowest* binary to decimal-ascii
conversion? I think I may have written a candidate, for the 6802 uP, a
while ago.

John

17. ### John LarkinGuest

Partly for elegance. The box has a 38.4 kbaud serial interface, and
users can interrogate it for current settings, so I would like to not
be speed-limiting the communications. I guess it would be OK to do the
64-bit binary to 14-char ASCII in a couple of milliseconds, which
ain't hard on a 68332 no matter how you do it. The other efficiency is
my coding time.

I posted what looks like a nice routine, based on Richard's divide
breakup, followed by using the 68332's very nice 64/32 divide with
remainder opcode.

I've done that, on machines without the divide opcode, but the divide
thing just spits out ASCII digits in a 5-opcode loop, a few
microseconds per character.

I considered having my FPGA guy do the whole thing for me in VHDL,
sort of as a math coprocessor, but the split divide thing looks fine.

John

18. ### John LarkinGuest

Yeah, I expect the entire application (managing an embedded digital
delay generator) to come in around 6 kilobytes or so, including serial
command parser, 'help' text, FPGA loader, flash manager, and of course
the hardware management.

John

19. ### Ancient_HackerGuest

one simple way is to have a powers of ten table:

dq 1000000000000,10000000000,1000000000
dq 10000000,1000000,1000000,100000,10000,1000,100,1

Start by seeing how many times you can subtract the first entry--
continue doing the same for each succeeding table entry. Quite fast,
and no divides needed.

20. ### David BrownGuest

I know you're using assembly, but I can't be bothered writing it out in
assembly (certainly not tight and fast code) at this stage. But here's
an idea in rough C, that you should be able to translate well enough.

static const uint64 bcds = {
{ 0, 0x0001, 0x0002, ..., 0x0255 },
{ 0, 0x00000256, 0x00000512, 0x00000768, ..., 0x00065535 },
{ 0, 0x00065536, 0x00131072, ..., 0x16711425 },
...
{ 0, 0x02814749 76710656, ... } // Overflows after 35th entry }
{ ... } // Row 8 is all overflows, and can be skipped altogether
};

uint64 intToBcd(uint64 x) {
uint64 sum = 0;
int i;
uint8 *p = (uint8* &x) + 7; // Point to LSB

for (i = 0; i < ; i++) {
};
return sum;
}

The idea is to have 8 lookup tables, one for each byte placing in the
64-bit number (excluding the msb, as that will certainly overflow - you
could skip some of the other bytes too as you don't need the full
64-bit). Each table gives the bcd value of each of the 256 values for a
byte in that place. You decompose the original 64-bit int into bytes,
and build up the bcd sum. If you want to save space, you could use 16
4-bit lookup tables instead of 8 8-bit tables.  